face-sets/README.md
Peter 62dba3ddb3 Add Immich outage circuit breaker; document nic run + Tailscale quirk
work/immich_stage.py:
- Startup probe of /server/version (exit 2 if unreachable).
- Outage circuit breaker: after OUTAGE_FAIL_STREAK=12 consecutive
  faces_error/download_error results, run a quick probe; if the probe
  also fails, persist state and exit with code 2 so a long unattended
  run can pause rather than silently churning through tens of thousands
  of retries during an upstream outage. Resume by re-running the same
  command -- state.json + queue.json are intact.

README:
- Document the nic run (per-user API key necessary; second pipeline
  invocation confirmed expected behavior; cleaner library than peter's
  with 0 internal byte-dupes vs 2,976).
- Mention the circuit breaker as the mechanism that keeps long
  unattended runs safe under the known Tailscale flicker pattern at
  this site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:36:11 +02:00

# face-sets
Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).
## Pipeline
`sort_faces.py` is a single-file CLI with seven subcommands:
| step | what it does |
|-------------|-------------------------------------------------------------------------------------------------------------|
| embed | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup. |
| cluster | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest. |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`. |
| dedup | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`. |
| extend | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich | Re-detect each cached face to persist landmark_2d_106, landmark_3d_68, pose (pitch/yaw/roll) into cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |
### Design principles
- **embed is resumable and incremental.** It loads any existing cache at the target path and only hashes and embeds files it has not seen. The cache is flushed atomically every 50 new files, so a mid-run crash loses at most ~50 embeddings.
- **Byte-identical duplicates are sha256-grouped at listing time.** The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster` / `refine` / `export-swap`, so each on-disk location is represented.
- **`safe_dst_name` always flattens the absolute path.** This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- **Caches and outputs stay out of git** via `.gitignore`; defaults live under `work/`.
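The sha256 grouping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `sort_faces.py`: the helper names `sha256_of` and `group_by_content` are hypothetical, and picking the first path in sorted order as the canonical file is an assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_content(paths):
    """Return {canonical_path: [alias_path, ...]} for byte-identical files.

    Assumption: the first path in sorted order is the canonical one.
    Only the canonical file would be embedded; the rest become aliases.
    """
    by_hash = defaultdict(list)
    for p in sorted(Path(p) for p in paths):
        by_hash[sha256_of(p)].append(p)
    return {group[0]: group[1:] for group in by_hash.values()}
```

With this shape, the alias lists play the role of the cache's `path_aliases`: downstream steps can materialize every on-disk location without embedding the same bytes twice.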
## Typical end-to-end run
```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted
# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"
# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"
# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"
# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"
# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"
# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
### Merging a new source into an existing result
```bash
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"
# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"
# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
### Importing hand-sorted folders as identities
When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool — the identity is asserted, not inferred. The
orchestration script `work/build_folders.py` covers this case:
- For each trusted folder, it filters cache records that fall under it, builds an
identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so
bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) into every
identity centroid within a tight cosine cutoff (default 0.45). A multi-identity
photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures
each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the
emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and
merges the new entries into the canonical `facesets_swap_ready/manifest.json`
(existing facesets are left untouched).
```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
    python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done
# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"
# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```
The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.)
is the only thing to edit when adding more hand-sorted folders later.
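The two-pass outlier rejection used to build each identity centroid can be sketched like this. It is a minimal pure-Python illustration, assuming embeddings are plain lists of floats; the function names are hypothetical and this is not the code in `work/build_folders.py`.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def centroid(vectors):
    """Unit-length mean of a list of equally sized vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)


def cos_dist(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def two_pass_centroid(embeddings, thresholds=(0.55, 0.45)):
    """Recompute the centroid after each pass, dropping faces beyond the limit."""
    kept = [normalize(e) for e in embeddings]
    for limit in thresholds:
        c = centroid(kept)
        survivors = [e for e in kept if cos_dist(e, c) <= limit]
        if not survivors:  # never reject everything; keep the previous set
            break
        kept = survivors
    return centroid(kept), kept
```

The loose first pass removes obvious bystanders so they stop pulling the centroid; the tighter second pass then measures against a cleaner centroid.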
### Splitting an identity by era (age sub-clustering)
Long-running source corpora produce identities that span 10+ years. The 2009 face
and the 2024 face of the same person sit in the same cluster (correctly — same
identity), but a single averaged embedding pulled from that cluster blurs across
ages. For face-swap output that should target a specific period, the identity
needs to be split by era *after* the identity is established.
`work/age_split_001.py` is a worked example for `faceset_001` and a template for
any other identity. The pipeline is:
- **Probe first** with `work/check_faceset001_age.py` — report intra-cluster
pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and
EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with
distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/`
(manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original
source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one
re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance
agglomerative, average linkage).
- **Anchor-based fragment assignment** (not transitive merge — that caused
year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments
attach to the single nearest anchor only if both the centroid distance ≤ 0.40
AND the dominant EXIF year is within ±5 years. Fragments with no qualifying
anchor remain standalone (and end up THIN-tagged downstream).
- **EXIF year per source path** with on-disk caching at
`work/cache/age_split_exif.json` — the Windows-mount EXIF read is the
slowest step, so re-runs after a parameter tweak are nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face
square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`,
human-readable `<era>.txt` marker. Eras with < 20 face records also drop a
`THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to
`facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be
moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`),
leaving only the substantive era buckets at the top level.
```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py
# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```
For the `faceset_001` run on 5260-face `nl_full.npz`, this produced 6 substantive
era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under `_thin/`.
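The anchor-based fragment assignment can be sketched as follows. This is an illustration, not the code in `work/age_split_001.py`: the `SubCluster` type and `assign_fragments` name are hypothetical, and it assumes "nearest" means the smallest centroid cosine distance, with both the distance and the year condition then checked against that single anchor.

```python
from dataclasses import dataclass


@dataclass
class SubCluster:
    name: str
    size: int
    centroid: list      # unit-length embedding
    dominant_year: int  # most common EXIF year in the sub-cluster


def cos_dist(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def assign_fragments(clusters, anchor_min=20, max_dist=0.40, max_year_gap=5):
    """Map each fragment name to an anchor name, or None if it stays standalone."""
    anchors = [c for c in clusters if c.size >= anchor_min]
    assignment = {}
    for frag in clusters:
        if frag.size >= anchor_min:
            continue  # anchors are never reassigned
        if not anchors:
            assignment[frag.name] = None
            continue
        # Only the single nearest anchor is considered; there is no transitive merge.
        nearest = min(anchors, key=lambda a: cos_dist(frag.centroid, a.centroid))
        close = cos_dist(frag.centroid, nearest.centroid) <= max_dist
        same_era = abs(frag.dominant_year - nearest.dominant_year) <= max_year_gap
        assignment[frag.name] = nearest.name if (close and same_era) else None
    return assignment
```

Because a fragment can only attach to an anchor and never to another fragment, a chain of small clusters cannot drag an era bucket across many years.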
### Discovering new identities in a mixed bucket
A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the
hand-sorted case: identities have to be discovered, not asserted, but should
not collide with already-known identities or scramble their numbering.
`work/cluster_osrc.py` is the worked example. The pipeline:
- **Filter cache to the source root**, including any byte-aliased path that
resolves under it.
- **Drop already-covered faces** by comparing each candidate to the centroids
of the existing canonical facesets at the `EXISTING_MATCH_THRESHOLD`
(default 0.45 — same cutoff as `build_folders.py`'s osrc routing). These
faces are already routed by `extend` / `build_folders.py` and shouldn't
seed new facesets.
- **Cluster the unmatched** at cos-dist 0.55 (matches the `extend` default
for the new-cluster phase).
- **Apply `refine`-equivalent gates** per cluster: `face_short`, `blur`,
`det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for
clusters of size ≥ 4. Keep clusters whose surviving unique-source-path
count is ≥ `MIN_FACES`.
- **Number new facesets past the existing maximum** (`START_NNN`), so
`faceset_001..NNN` are never disturbed.
- **Synthesize a refine manifest** and run `cmd_export_swap` against it,
then move the resulting dirs into `facesets_swap_ready/` and append to the
top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance
marker.
Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new
source — the `cluster_osrc.py` step then operates against the canonical
cache and doesn't need `raw_full/` for input:
```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
# person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
    --refine-out "$OUT/facesets_full"
# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
# without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run
# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```
For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by
existing identities), this produced 6 new facesets (`faceset_020..025`,
sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to
export-swap's tighter `min_face_short=100` gate).
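Two of the steps above, dropping already-covered faces and numbering new facesets past the existing maximum, can be sketched like this. The function names and data shapes are hypothetical; only the 0.45 cutoff and the `faceset_NNN` naming come from the description above.

```python
def cos_dist(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def split_covered(faces, existing_centroids, threshold=0.45):
    """Separate faces that match a known identity from those that do not.

    faces: {face_key: unit vector}
    existing_centroids: {faceset_name: unit vector}
    Returns ({face_key: matched_faceset}, [unmatched_face_key, ...]).
    """
    covered, unmatched = {}, []
    for key, emb in faces.items():
        name, dist = min(
            ((n, cos_dist(emb, c)) for n, c in existing_centroids.items()),
            key=lambda pair: pair[1],
            default=(None, float("inf")),
        )
        if dist <= threshold:
            covered[key] = name
        else:
            unmatched.append(key)
    return covered, unmatched


def next_faceset_names(existing_names, count):
    """Number new facesets after the highest existing faceset_NNN."""
    start = max(int(n.rsplit("_", 1)[1]) for n in existing_names) + 1
    return [f"faceset_{i:03d}" for i in range(start, start + count)]
```

Only the `unmatched` keys would go on to clustering, which is what keeps `faceset_001..NNN` undisturbed.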
### Importing identities from a self-hosted Immich library
`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py`
together import an Immich library at scale, with the embed step running on
a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:
1. **`work/immich_stage.py` (WSL)** — pages every IMAGE asset via
`/search/metadata`, fetches each asset's `/faces?id=` to read Immich's
own ML-driven bboxes, scales each bbox to original-image coordinates,
and prefilters by `face_short ≥ 90`. For survivors it downloads the
original, sha256-deduplicates against the canonical `nl_full.npz` and
against same-run staged files, and saves to
`/mnt/x/src/immich/<user>/<rel>`. Writes a `queue.json` that the embed
worker consumes. 8 concurrent worker threads run the full per-asset
I/O chain (`/faces` → filter → `/original`) so 8 workers ≈ 8× the
serial throughput.
2. **`work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`)** —
loads `insightface.FaceAnalysis(buffalo_l)` with the
`DmlExecutionProvider` and runs detection + landmarks + recognition
over the queue. Produces a `.npz` cache that's bit-identical in
schema to what `sort_faces.py:cmd_embed` writes, so the result is
directly loadable by `load_cache()`. The cache already includes the
post-`enrich` fields (`landmark_2d_106`, `landmark_3d_68`, `pose`)
because FaceAnalysis returns them for free. AMD Vega gives ~7.5×
real-pipeline speedup over CPU.
3. **`work/cluster_immich.py` (WSL)** — mirrors `cluster_osrc.py`'s
shape but reads from `immich_<user>.npz`. Builds existing-identity
centroids from every canonical `faceset_NNN/` in
`facesets_swap_ready/` (skipping era splits and `_thin/`), drops
immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55,
applies refine gates, numbers new facesets past the existing maximum,
and feeds `cmd_export_swap` via a synthetic manifest.
`work/finalize_immich.sh <user>` chains queue → Windows embed → cache
copy back → cluster_immich, with logging.
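The bbox scaling and `face_short ≥ 90` prefilter from the staging step can be sketched as follows. This assumes each face record carries `boundingBoxX1/Y1/X2/Y2` plus the `imageWidth`/`imageHeight` of the image the detection ran on; treat those field names as an assumption to verify against your Immich version. The function names are hypothetical.

```python
def scale_bbox(face, orig_w, orig_h):
    """Map a bbox from the detection image size to original-image pixels."""
    sx = orig_w / face["imageWidth"]
    sy = orig_h / face["imageHeight"]
    return (
        face["boundingBoxX1"] * sx,
        face["boundingBoxY1"] * sy,
        face["boundingBoxX2"] * sx,
        face["boundingBoxY2"] * sy,
    )


def face_short(bbox):
    """Short edge of a bbox in pixels."""
    x1, y1, x2, y2 = bbox
    return min(x2 - x1, y2 - y1)


def has_big_face(faces, orig_w, orig_h, min_short=90):
    """True if at least one face is large enough to justify downloading the original."""
    return any(face_short(scale_bbox(f, orig_w, orig_h)) >= min_short for f in faces)
```

Scaling before the size check matters: a face that measures 30 px on a downscaled preview can be 120 px on the original and should not be filtered out.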
The Immich API key (admin or per-user) and base URL come from environment variables:
```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=... # admin or per-user key
python work/immich_stage.py --user peter --workers 8
bash work/finalize_immich.sh peter
```
For the 2026-04-26 run against `https://fotos.computerliebe.org` (Immich
v2.7.2), with the admin API key:
| step | result |
|------|------|
| stage | 53,842 assets seen, **10,261 staged** (~10 GB), 978 byte-deduped against `nl_full.npz`, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
| Windows DML embed | 19,462 face records + 1 noface in **64.6 min** (2.6 img/s end-to-end) |
| matched existing identities | **8,103 of 19,480 (42%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → **185 emitted** as `faceset_026..264` (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |
A second 2026-04-26 run with **nic's per-user API key** confirmed the
expected behavior: 25,777 of nic's IMAGE assets were enumerated (matching
her `/server/statistics` count of 25,786, off by 9 ≈ the transient errors
that didn't get marked seen), **7,834 staged** (30% face-bearing-with-big-face,
denser than peter's 19%), 519 byte-deduped vs `nl_full.npz`, **0 internal
byte-duplicates** (cleaner library than peter's 2,976), 54 transient errors.
`work/immich_stage.py` carries a built-in **outage circuit breaker**:
after 12 consecutive HTTP errors it probes Immich; if that probe also
fails, the script exits cleanly with code 2, state preserved. This made
the nic run survive a mid-stage Immich outage — the script paused, the
operator confirmed connectivity was back, and the same command resumed
from the saved `state.json` without re-fetching what was already done.
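The circuit-breaker behavior can be sketched as follows. This is a minimal illustration rather than the code in `work/immich_stage.py`: the `OutageBreaker` class is hypothetical, the probe and state-saving functions are injected as callables, and resetting the streak when the probe succeeds is an assumption.

```python
import sys

OUTAGE_FAIL_STREAK = 12  # consecutive failures before probing
EXIT_OUTAGE = 2          # exit code signalling "paused, safe to resume"


class OutageBreaker:
    def __init__(self, probe, save_state, limit=OUTAGE_FAIL_STREAK):
        self.probe = probe            # callable returning True if the server answers
        self.save_state = save_state  # callable that persists progress to disk
        self.limit = limit
        self.streak = 0

    def record(self, ok: bool) -> None:
        """Feed in the outcome of one asset; exit if an outage is confirmed."""
        if ok:
            self.streak = 0
            return
        self.streak += 1
        if self.streak < self.limit:
            return
        if self.probe():
            # Server is reachable, so the failures were asset-specific (assumption).
            self.streak = 0
            return
        self.save_state()
        sys.exit(EXIT_OUTAGE)
```

A caller would invoke `breaker.record(...)` once per processed asset; a wrapper script can then treat exit code 2 as "retry later with the same command" rather than as a hard failure.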
**Important caveats for Immich v2.7.2**:
- The `userIds` filter on `/search/metadata` is **silently ignored** when
the API key is bound to a different user. The "import everything the
API key can see" semantics are what you actually get; cross-user
isolation is enforced server-side.
- `/server/statistics` reports counts that under-count what
`/search/metadata` actually returns (e.g. external library
thumbnail-dirs that got indexed because the import path included them).
Don't trust the statistics number as a denominator.
- A meaningful fraction of `originalPath`-based assets are *Immich's own
thumbnails* (`<library_root>/thumbs/.../-preview.jpeg`) — included if
the external library's import path covers the thumbs directory and the
exclusion patterns don't list `**/thumbs/**`. For our run, 5,563 of
10,261 staged were thumbnails. They embed and cluster fine but the
resulting faces are lower-resolution.
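If the thumbnails are unwanted, one option is to skip them during staging based on `originalPath`. The check below is a hedged sketch: the function name is hypothetical, and it assumes the layout described above (a `thumbs` directory somewhere in the path and a filename ending in `-preview.jpeg`).

```python
from pathlib import PurePosixPath


def looks_like_immich_thumbnail(original_path: str) -> bool:
    """Heuristic: a '-preview.jpeg' file that lives under a 'thumbs' directory."""
    p = PurePosixPath(original_path)
    return "thumbs" in p.parts[:-1] and p.name.endswith("-preview.jpeg")
```

Fixing the external library's exclusion patterns on the Immich side is the cleaner long-term remedy; a path check like this only keeps the already-indexed thumbnails out of a staging run.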
## Key defaults
`refine`:
| flag | default | meaning |
|-------------------------|--------:|---------|
| `--initial-threshold` | 0.55 | cosine distance for stage-1 clustering |
| `--merge-threshold` | 0.40 | centroid-level merge of over-split clusters |
| `--outlier-threshold` | 0.55 | drop face if cosine dist from centroid exceeds (only if cluster ≥ 4) |
| `--min-faces` | 15 | minimum unique images per faceset |
| `--min-short` | 90 | minimum short-edge pixels of face bbox |
| `--min-blur` | 40.0 | Laplacian-variance blur gate |
| `--min-det-score` | 0.6 | InsightFace detector score gate |
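The three per-face gates in the `refine` table can be read as a single predicate. This is a sketch only: the function name is hypothetical, the face record is assumed to be a dict with `face_short`, `blur`, and `det_score` keys, and treating each default as an inclusive minimum is an assumption.

```python
def passes_refine_gates(face, min_short=90, min_blur=40.0, min_det_score=0.6):
    """True if a face record clears the size, sharpness, and detector-score gates."""
    return (
        face["face_short"] >= min_short
        and face["blur"] >= min_blur
        and face["det_score"] >= min_det_score
    )
```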
`export-swap`:
| flag | default | meaning |
|-------------------------------|--------:|---------|
| `--top-n` | 30 | size of the `<faceset>_topN.fsz` bundle |
| `--outlier-threshold` | 0.45 | tighter than refine; trims cluster boundary for averaging |
| `--pad-ratio` | 0.5 | padding around face bbox for PNG crop |
| `--out-size` | 512 | PNG output is square `out_size × out_size` |
| `--min-face-short` | 100 | export gate; stricter than refine's 90 |
| `--candidates` | off | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` | 0.55 | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score` | 0.40 | composite-quality floor for candidates |
The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each normalized to `[0, 1]`.
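As a worked form of that formula, a minimal sketch follows. The weights are the ones stated above; the function name is hypothetical, and clamping each component to `[0, 1]` stands in for whatever normalization `export-swap` actually applies.

```python
WEIGHTS = {
    "frontality": 0.30,
    "det_score": 0.20,
    "landmark_symmetry": 0.20,
    "face_size": 0.15,
    "sharpness": 0.15,
}


def composite_quality(components):
    """Weighted sum of the five quality components, each clamped to [0, 1]."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components[name]))
        score += weight * value
    return score
```

The weights sum to 1, so the score also stays in `[0, 1]`, which makes it directly comparable to the `--candidate-min-score` floor of 0.40.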
## Downstream: roop-unleashed
The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.
Highly recommended at swap time: enable **Select post-processing = GFPGAN** with the **Original/Enhanced image blend ratio = 0.85** (the default of 0.65 is conservative). See `docs/analysis/facesets-downstream-refinement-evaluation.md` for the full evaluation.
## Layout
```
/opt/face-sets/
├─ README.md                      (this file)
├─ sort_faces.py                  (the tool)
├─ docs/
│  └─ analysis/
│     └─ facesets-downstream-refinement-evaluation.md
└─ work/                          (gitignored except force-tracked .py / .sh)
   ├─ build_folders.py            (hand-sorted-folder orchestration)
   ├─ check_faceset001_age.py     (age-split readiness probe)
   ├─ age_split_001.py            (age-split orchestration; faceset_001)
   ├─ cluster_osrc.py             (mixed-bucket identity discovery)
   ├─ immich_stage.py             (Immich library staging, parallel)
   ├─ embed_worker.py             (Windows DML embed worker, runs from C:\face_embed_venv\)
   ├─ cluster_immich.py           (Immich identity discovery + export)
   ├─ finalize_immich.sh          (chains queue → embed → cluster)
   ├─ synthetic_*_manifest.json   (per-run synthetic refine manifests)
   ├─ immich/
   │  ├─ users.json               (label -> userId map; gitignored)
   │  └─ <user>/{queue,state,aliases}.json   (per-user staging artifacts)
   ├─ cache/
   │  ├─ nl_full.npz              (canonical cache + duplicates.json)
   │  ├─ immich_<user>.npz        (per-user immich embeddings)
   │  └─ age_split_exif.json      (path → EXIF-year cache)
   └─ logs/
      └─ *.log                    (every long step writes here)
```