# face-sets

Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).

## Pipeline

sort_faces.py is a single-file CLI with seven subcommands:

| step | what it does |
| --- | --- |
| embed | Recursively scan a source tree, detect + embed every face, write .npz cache. Resumable; sha256-dedup. |
| cluster | Raw agglomerative clustering of the cache into person_NNN/, _singletons/, and _noface/ with manifest. |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → faceset_NNN/. |
| dedup | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`. |
| extend | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich | Re-detect each cached face to persist landmark_2d_106, landmark_3d_68, pose (pitch/yaw/roll) into cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + .fsz bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into _candidates/. |
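The cluster and refine stages both rest on average-linkage agglomerative clustering over cosine distances. A minimal sketch of that core grouping step (illustrative only; scipy is assumed here, and the tool's actual implementation may differ):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def agglomerative_cos(embeddings, threshold=0.55):
    """Average-linkage agglomerative clustering on pairwise cosine distance.
    Returns one integer label per embedding (labels start at 1); clusters are
    cut where the average linkage distance exceeds `threshold`."""
    X = np.asarray(embeddings, dtype=np.float64)
    D = pdist(X, metric="cosine")        # condensed pairwise cos-dist matrix
    Z = linkage(D, method="average")
    return fcluster(Z, t=threshold, criterion="distance")
```

Two near-parallel embeddings land in the same cluster; an orthogonal one stays separate.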

## Design principles

  • embed is resumable and incremental. It loads any existing cache at the target path and only hashes / embeds files it has not seen. Atomic flush every 50 new files so a mid-run crash loses at most ~50 embeddings.
  • Byte-identical duplicates are sha256-grouped at listing time. The canonical file is embedded once; other paths with the same hash become path_aliases in the cache. Every alias is materialized by cluster / refine / export-swap, so each on-disk location is represented.
  • safe_dst_name always flattens the absolute path. This keeps output filenames stable across runs even as src_root changes between embed / extend / export invocations.
  • Caches and outputs stay out of git via .gitignore; defaults live under work/.
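The sha256 alias behavior can be sketched as follows (a toy helper with hypothetical names, not the tool's code): byte-identical files collapse onto one canonical path, and every other path with the same digest is kept as an alias.

```python
import hashlib
from collections import defaultdict

def group_by_sha256(paths, read_bytes):
    """Group paths by content digest. The first path per digest is canonical
    (embedded once); the rest become its aliases, mirroring the cache's
    path_aliases idea so every on-disk location stays represented."""
    groups = defaultdict(list)
    for p in paths:
        groups[hashlib.sha256(read_bytes(p)).hexdigest()].append(p)
    return {ps[0]: ps[1:] for ps in groups.values()}

# Toy in-memory "filesystem": a.jpg and b.jpg are byte-identical.
fs = {"a.jpg": b"\x01\x02", "b.jpg": b"\x01\x02", "c.jpg": b"\x03"}
aliases = group_by_sha256(sorted(fs), fs.__getitem__)
```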

## Typical end-to-end run

```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted

# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"

# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"

# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine  "$CACHE" "$OUT/facesets_full"

# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup   "$CACHE"

# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich  "$CACHE"

# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
  "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
  --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

## Merging a new source into an existing result

```bash
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"

# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"

# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
  "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
  --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

## Importing hand-sorted folders as identities

When source folders are already hand-sorted by person (one folder per identity), the clustering path is the wrong tool — the identity is asserted, not inferred. The orchestration script work/build_folders.py covers this case:

  • For each trusted folder, it filters cache records that fall under it, builds an identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so bystanders in group photos drop out, and writes a synthetic refine_manifest.json.
  • It then routes each face record from a mixed folder (e.g. osrc/) to every identity whose centroid lies within a tight cosine cutoff (default 0.45). A multi-identity photo lands in multiple facesets; export-swap's per-bbox outlier filter ensures each faceset crops only its matching face.
  • Finally it invokes cmd_export_swap against the synthetic manifest, renames the emitted .fsz bundles after the source folder, drops a <label>.txt marker, and merges the new entries into the canonical facesets_swap_ready/manifest.json (existing facesets are left untouched).

```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
  python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done

# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup  "$CACHE"

# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```

The script's config block (TRUSTED, START_NNN, OSRC_THRESHOLD, TOP_N, etc.) is the only thing to edit when adding more hand-sorted folders later.
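The two-pass centroid construction above can be sketched like this (a hypothetical helper assuming embeddings in a NumPy array; the real logic lives in work/build_folders.py and may differ in detail):

```python
import numpy as np

def cos_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def robust_centroid(embs, pass_thresholds=(0.55, 0.45)):
    """Two-pass outlier-rejected identity centroid: compute a centroid, drop
    members beyond the cos-dist threshold, then recompute with a tighter
    threshold. Bystander faces from group photos fall out in the passes."""
    kept = np.asarray(embs, dtype=np.float64)
    for thr in pass_thresholds:
        c = kept.mean(axis=0)
        dists = np.array([cos_dist(e, c) for e in kept])
        survivors = kept[dists <= thr]
        if len(survivors) == 0:   # degenerate cluster: stop tightening
            break
        kept = survivors
    return kept.mean(axis=0), len(kept)
```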

## Splitting an identity by era (age sub-clustering)

Long-running source corpora produce identities that span 10+ years. The 2009 face and the 2024 face of the same person sit in the same cluster (correctly — same identity), but a single averaged embedding pulled from that cluster blurs across ages. For face-swap output that should target a specific period, the identity needs to be split by era after the identity is established.

work/age_split_001.py is a worked example for faceset_001 and a template for any other identity. The pipeline is:

  • Probe first with work/check_faceset001_age.py — report intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with distinct year ranges, the identity is age-sortable.
  • Seed centroid from the curated facesets_swap_ready/faceset_001/ (manifest provides face keys → cache rows).
  • Wide recovery at cos-dist ≤ 0.55 against the seed under the original source roots, then quality-gate (face_short, blur, det_score) and one re-centroid + tighten pass at 0.50 to absorb new faces without drift.
  • Sub-cluster the survivors at cos-dist 0.35 (precomputed-distance agglomerative, average linkage).
  • Anchor-based fragment assignment (not transitive merge — that caused year-drift): sub-clusters with size ≥ 20 are anchors; smaller fragments attach to the single nearest anchor only if both the centroid distance ≤ 0.40 AND the dominant EXIF year is within ±5 years. Fragments with no qualifying anchor remain standalone (and end up THIN-tagged downstream).
  • EXIF year per source path with on-disk caching at work/cache/age_split_exif.json — the Windows-mount EXIF read is the slowest step, so re-runs after a parameter tweak are nearly instant.
  • Per-era export mirrors export-swap: composite-quality rank, single-face square PNG crops, top-N + _all .fsz bundles, per-era manifest.json, human-readable <era>.txt marker. Eras with < 20 face records also drop a THIN.txt marker so they can be quarantined.
  • Top-level manifest merge: era buckets are appended to facesets_swap_ready/manifest.json. Operationally the THIN buckets should be moved into _thin/ (and the manifest split into facesets + thin_eras), leaving only the substantive era buckets at the top level.

```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py

# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```

For the faceset_001 run on the 5,260-face nl_full.npz, this produced 6 substantive era buckets (200510, 201013, 2011, 201417, 201819, 201820; sizes 43–282) plus 68 thin/fragment buckets quarantined under _thin/.
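The anchor-based fragment rule can be sketched as follows (hypothetical data shapes; the actual logic lives in work/age_split_001.py): sub-clusters of size ≥ 20 become anchors, and a smaller fragment attaches only when both the centroid-distance gate and the EXIF-year gate pass.

```python
import numpy as np

def cos_dist(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def assign_fragments(clusters, min_anchor=20, max_dist=0.40, max_year_gap=5):
    """clusters: list of {"centroid": vec, "year": int, "size": int}.
    Returns owner[i] = anchor index for each cluster (itself when it is an
    anchor, or when no anchor qualifies and it stays standalone)."""
    anchors = [i for i, c in enumerate(clusters) if c["size"] >= min_anchor]
    owner = list(range(len(clusters)))
    for i, c in enumerate(clusters):
        if i in anchors:
            continue
        best, best_d = None, max_dist
        for a in anchors:
            d = cos_dist(c["centroid"], clusters[a]["centroid"])
            # Attach only if BOTH gates pass; keep the single nearest anchor.
            if d <= best_d and abs(c["year"] - clusters[a]["year"]) <= max_year_gap:
                best, best_d = a, d
        if best is not None:
            owner[i] = best
    return owner
```

Standalone fragments keep their own index and would be THIN-tagged downstream.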

## Discovering new identities in a mixed bucket

A flat folder of mixed-identity photos (e.g. osrc/) is the opposite of the hand-sorted case: identities have to be discovered, not asserted, but should not collide with already-known identities or scramble their numbering.

work/cluster_osrc.py is the worked example. The pipeline:

  • Filter cache to the source root, including any byte-aliased path that resolves under it.
  • Drop already-covered faces by comparing each candidate to the centroids of the existing canonical facesets at the EXISTING_MATCH_THRESHOLD (default 0.45 — same cutoff as build_folders.py's osrc routing). These faces are already routed by extend / build_folders.py and shouldn't seed new facesets.
  • Cluster the unmatched at cos-dist 0.55 (matches the extend default for the new-cluster phase).
  • Apply refine-equivalent gates per cluster: face_short, blur, det_score, plus outlier rejection (cluster-centroid cos-dist > 0.55) for clusters of size ≥ 4. Keep clusters whose surviving unique-source-path count is ≥ MIN_FACES.
  • Number new facesets past the existing maximum (START_NNN), so faceset_001..NNN are never disturbed.
  • Synthesize a refine manifest and run cmd_export_swap against it, then move the resulting dirs into facesets_swap_ready/ and append to the top-level manifest.json. Each new dir gets an osrc.txt provenance marker.

Always run extend first so raw_full/ and facesets_full/ reflect the new source — the cluster_osrc.py step then operates against the canonical cache and doesn't need raw_full/ for input:

```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
#    person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
  --refine-out "$OUT/facesets_full"

# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
#    without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run

# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```

For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by existing identities), this produced 6 new facesets (faceset_020..025, sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to export-swap's tighter min_face_short=100 gate).
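The already-covered filter at the heart of this flow can be sketched as (hypothetical helper; EXISTING_MATCH_THRESHOLD corresponds to the `threshold` argument):

```python
import numpy as np

def cos_dist_matrix(X, C):
    """Pairwise cosine distance between rows of X (candidates) and C (centroids)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    return 1.0 - Xn @ Cn.T

def drop_covered(candidates, centroids, threshold=0.45):
    """Return indices of candidate embeddings whose nearest existing-identity
    centroid is farther than `threshold`, i.e. the faces eligible to seed
    NEW facesets rather than join existing ones."""
    D = cos_dist_matrix(np.asarray(candidates, float), np.asarray(centroids, float))
    return np.flatnonzero(D.min(axis=1) > threshold)
```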

## Importing identities from a self-hosted Immich library

work/immich_stage.py + work/embed_worker.py + work/cluster_immich.py together import an Immich library at scale, with the embed step running on a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:

  1. work/immich_stage.py (WSL) — pages every IMAGE asset via /search/metadata, fetches each asset's /faces?id= to read Immich's own ML-driven bboxes, scales each bbox to original-image coordinates, and prefilters by face_short ≥ 90. For survivors it downloads the original, sha256-deduplicates against the canonical nl_full.npz and against same-run staged files, and saves to /mnt/x/src/immich/<user>/<rel>. Writes a queue.json that the embed worker consumes. 8 concurrent worker threads run the full per-asset I/O chain (/faces → filter → /original) so 8 workers ≈ 8× the serial throughput.
  2. work/embed_worker.py (Windows venv at C:\face_embed_venv\) — loads insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider and runs detection + landmarks + recognition over the queue. Produces a .npz cache that's bit-identical in schema to what sort_faces.py:cmd_embed writes, so the result is directly loadable by load_cache(). The cache already includes the post-enrich fields (landmark_2d_106, landmark_3d_68, pose) because FaceAnalysis returns them for free. AMD Vega gives ~7.5× real-pipeline speedup over CPU.
  3. work/cluster_immich.py (WSL) — mirrors cluster_osrc.py's shape but reads from immich_<user>.npz. Builds existing-identity centroids from every canonical faceset_NNN/ in facesets_swap_ready/ (skipping era splits and _thin/), drops immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55, applies refine gates, numbers new facesets past the existing maximum, and feeds cmd_export_swap via a synthetic manifest.

work/finalize_immich.sh <user> chains queue → Windows embed → cache copy back → cluster_immich, with logging.

The Immich admin API key + base URL come from environment variables:

```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=...                # admin or per-user key
python work/immich_stage.py --user peter --workers 8
bash   work/finalize_immich.sh peter
```

For the 2026-04-26 run against https://fotos.computerliebe.org (Immich v2.7.2), with the admin API key:

| step | result |
| --- | --- |
| stage | 53,842 assets seen; 10,261 staged (~10 GB); 978 byte-deduped against nl_full.npz; 2,976 internal byte-duplicates; 39K skipped no-face / no-big-face |
| Windows DML embed | 19,462 face records + 1 noface in 64.6 min (2.6 img/s end-to-end) |
| matched existing identities | 8,103 of 19,480 (42%) at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → 185 emitted as faceset_026..264 (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |

A second 2026-04-26 run with nic's per-user API key confirmed the expected behavior: 25,777 of nic's IMAGE assets were enumerated (matching her /server/statistics count of 25,786, off by 9 ≈ the transient errors that didn't get marked seen), 7,834 staged (30% face-bearing-with-big-face, denser than peter's 19%), 519 byte-deduped vs nl_full.npz, 0 internal byte-duplicates (cleaner library than peter's 2,976), 54 transient errors.

Embed + cluster on the nic queue:

| step | result |
| --- | --- |
| Windows DML embed | 15,627 face records + 1 noface in 59 min (2.2 img/s end-to-end), 7 load errors |
| matched existing identities | 6,770 of 15,627 (43%) at cos-dist ≤ 0.45; biggest hits faceset_002 (+3,261), faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408) |
| new clusters | 3,787 at threshold 0.55 → 129 surviving refine gates → 95 emitted as faceset_265..NNN (gaps where export-swap's 0.45 outlier gate dropped clusters below the export bar) |

Top-level facesets_swap_ready/manifest.json after both Immich runs: 311 substantive facesets (12 auto-cluster nl/lzbkp + 7 hand-sorted + 6 era splits + 6 osrc-discovered + 185 peter-Immich + 95 nic-Immich) + 68 thin_eras under _thin/.

work/immich_stage.py carries a built-in outage circuit breaker: after 12 consecutive HTTP errors it probes Immich; if that probe also fails, the script exits cleanly with code 2, state preserved. This made the nic run survive a mid-stage Immich outage — the script paused, the operator confirmed connectivity was back, and the same command resumed from the saved state.json without re-fetching what was already done.
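The breaker reduces to a small state machine, sketched here with hypothetical names (the actual script wires this into its HTTP loop and state.json persistence):

```python
class OutageBreaker:
    """Sketch of an outage circuit breaker: after `limit` consecutive errors,
    run a one-shot connectivity probe; if the probe also fails, persist state
    and tell the caller to exit with code 2 so the run can resume later."""

    def __init__(self, probe, save_state, limit=12):
        self.probe = probe            # callable: True if the server answers
        self.save_state = save_state  # callable: persist resumable state
        self.limit = limit
        self.consecutive = 0

    def record_success(self):
        self.consecutive = 0

    def record_error(self):
        self.consecutive += 1
        if self.consecutive >= self.limit:
            if self.probe():
                self.consecutive = 0  # connectivity is back: keep going
            else:
                self.save_state()     # saved state lets the same command resume
                return 2              # caller: sys.exit(2)
        return None
```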

Important caveats for Immich v2.7.2:

  • The userIds filter on /search/metadata is silently ignored when the API key is bound to a different user. The "import everything the API key can see" semantics are what you actually get; cross-user isolation is enforced server-side.
  • /server/statistics reports counts that under-count what /search/metadata actually returns (e.g. external library thumbnail-dirs that got indexed because the import path included them). Don't trust the statistics number as a denominator.
  • A meaningful fraction of originalPath-based assets are Immich's own thumbnails (<library_root>/thumbs/.../-preview.jpeg) — included if the external library's import path covers the thumbs directory and the exclusion patterns don't list **/thumbs/**. For our run, 5,563 of 10,261 staged were thumbnails. They embed and cluster fine but the resulting faces are lower-resolution.

## Key defaults

refine:

| flag | default | meaning |
| --- | --- | --- |
| --initial-threshold | 0.55 | cosine distance for stage-1 clustering |
| --merge-threshold | 0.40 | centroid-level merge of over-split clusters |
| --outlier-threshold | 0.55 | drop face if cosine dist from centroid exceeds (only if cluster ≥ 4) |
| --min-faces | 15 | minimum unique images per faceset |
| --min-short | 90 | minimum short-edge pixels of face bbox |
| --min-blur | 40.0 | Laplacian-variance blur gate |
| --min-det-score | 0.6 | InsightFace detector score gate |
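Combined, the three per-face refine gates amount to a simple predicate (hypothetical field names for the cached metrics):

```python
def passes_refine_gates(face, min_short=90, min_blur=40.0, min_det_score=0.6):
    """Per-face quality gate: bbox short edge, Laplacian-variance sharpness,
    and detector confidence must all clear their thresholds."""
    return (face["face_short"] >= min_short
            and face["blur"] >= min_blur
            and face["det_score"] >= min_det_score)
```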

export-swap:

| flag | default | meaning |
| --- | --- | --- |
| --top-n | 30 | size of the `<faceset>_topN.fsz` bundle |
| --outlier-threshold | 0.45 | tighter than refine; trims cluster boundary for averaging |
| --pad-ratio | 0.5 | padding around face bbox for PNG crop |
| --out-size | 512 | PNG output is square out_size × out_size |
| --min-face-short | 100 | export gate; stricter than refine's 90 |
| --candidates | off | rescue _singletons/ into _candidates/ for manual review |
| --candidate-match-threshold | 0.55 | cos-dist cutoff for singleton → existing faceset |
| --candidate-min-score | 0.40 | composite-quality floor for candidates |

The composite quality score in export-swap is 0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness, each normalized to [0, 1].
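The same formula in executable form (component names are descriptive; all five are assumed pre-normalized to [0, 1]):

```python
def composite_quality(frontality, det_score, landmark_symmetry, face_size, sharpness):
    """Composite quality rank used to order crops for the top-N bundle.
    Weights sum to 1.0, so a perfect face scores exactly 1.0."""
    return (0.30 * frontality
            + 0.20 * det_score
            + 0.20 * landmark_symmetry
            + 0.15 * face_size
            + 0.15 * sharpness)
```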

## Downstream: roop-unleashed

The .fsz bundles emitted by export-swap drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.

Highly recommended at swap time: enable Select post-processing = GFPGAN with the Original/Enhanced image blend ratio = 0.85 (default is 0.65 which is conservative). See docs/analysis/facesets-downstream-refinement-evaluation.md for the full evaluation.

## Layout

```
/opt/face-sets/
├─ README.md                                     (this file)
├─ sort_faces.py                                 (the tool)
├─ docs/
│  └─ analysis/
│     └─ facesets-downstream-refinement-evaluation.md
└─ work/                                         (gitignored except force-tracked .py / .sh)
   ├─ build_folders.py                           (hand-sorted-folder orchestration)
   ├─ check_faceset001_age.py                    (age-split readiness probe)
   ├─ age_split_001.py                           (age-split orchestration; faceset_001)
   ├─ cluster_osrc.py                            (mixed-bucket identity discovery)
   ├─ immich_stage.py                            (Immich library staging, parallel)
   ├─ embed_worker.py                            (Windows DML embed worker, runs from C:\face_embed_venv\)
   ├─ cluster_immich.py                          (Immich identity discovery + export)
   ├─ finalize_immich.sh                         (chains queue → embed → cluster)
   ├─ synthetic_*_manifest.json                  (per-run synthetic refine manifests)
   ├─ immich/
   │  ├─ users.json                              (label -> userId map; gitignored)
   │  └─ <user>/{queue,state,aliases}.json       (per-user staging artifacts)
   ├─ cache/
   │  ├─ nl_full.npz                             (canonical cache + duplicates.json)
   │  ├─ immich_<user>.npz                       (per-user immich embeddings)
   │  └─ age_split_exif.json                     (path → EXIF-year cache)
   └─ logs/
      └─ *.log                                   (every long step writes here)
```