Overnight 2026-04-27 nic finalize completed. Per-user API key worked as
expected. The pipeline survived one mid-stage Immich outage via the
circuit breaker added in 62dba3d -- script paused, operator confirmed
connectivity, same command resumed from saved state.json.
Embed (Windows DML): 7,834 images -> 15,627 face records + 1 noface in
59 minutes (2.2 img/s end-to-end).
Cluster: 6,770 of 15,627 faces (43%) matched existing canonical
identities at cos-dist <= 0.45; biggest hits faceset_002 (+3,261),
faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408). The
faceset_008 and faceset_007 hits are noteworthy cross-matches: those
are hand-sorted "sab" and "s" identities, recurring frequently in nic's
library.
Of the 8,857 unmatched faces, 3,787 raw clusters at threshold 0.55,
129 surviving refine gates, 95 emitted as new facesets at faceset_265+.
Top-level facesets_swap_ready/manifest.json: 216 -> 311 substantive
facesets + 68 thin_eras unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
face-sets
Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).
Pipeline
sort_faces.py is a single-file CLI with seven subcommands:
| step | what it does |
|---|---|
| embed | Recursively scan a source tree, detect + embed every face, write .npz cache. Resumable; sha256-dedup. |
| cluster | Raw agglomerative clustering of the cache into person_NNN/ / _singletons/ / _noface/ with manifest. |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → faceset_NNN/. |
| dedup | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → <cache>.duplicates.json. |
| extend | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich | Re-detect each cached face to persist landmark_2d_106, landmark_3d_68, pose (pitch/yaw/roll) into cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + .fsz bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into _candidates/. |
Design principles
- embed is resumable and incremental. It loads any existing cache at the target path and only hashes / embeds files it has not seen. Atomic flush every 50 new files so a mid-run crash loses at most ~50 embeddings.
- Byte-identical duplicates are sha256-grouped at listing time. The canonical file is embedded once; other paths with the same hash become path_aliases in the cache. Every alias is materialized by cluster/refine/export-swap, so each on-disk location is represented.
- safe_dst_name always flattens the absolute path. This keeps output filenames stable across runs even as src_root changes between embed / extend / export invocations.
- Caches and outputs stay out of git via .gitignore; defaults live under work/.
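The sha256 grouping described above can be sketched in a few lines. This is a minimal illustration, not the code in sort_faces.py: the helper names (sha256_of, group_by_content) are hypothetical, and choosing the first path in sorted order as the canonical file is an assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_content(paths):
    """Map each canonical path to the list of byte-identical alias paths.

    Hypothetical helper: the first path in sorted order is treated as the
    canonical one; the real tool may pick the canonical file differently.
    """
    by_hash = defaultdict(list)
    for path in sorted(Path(p) for p in paths):
        by_hash[sha256_of(path)].append(path)
    return {group[0]: group[1:] for group in by_hash.values()}
```

Only the dictionary keys would need to be embedded; the values play the role of path_aliases, so every on-disk location can still be materialized later.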
Typical end-to-end run
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted
# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"
# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"
# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"
# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"
# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"
# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
"$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
--raw-manifest "$OUT/raw_full/manifest.json" --candidates
Merging a new source into an existing result
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"
# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"
# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
"$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
--raw-manifest "$OUT/raw_full/manifest.json" --candidates
Importing hand-sorted folders as identities
When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool — the identity is asserted, not inferred. The
orchestration script work/build_folders.py covers this case:
- For each trusted folder, it filters cache records that fall under it, builds an identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so bystanders in group photos drop out, and writes a synthetic refine_manifest.json.
- It then routes each face record from a mixed folder (e.g. osrc/) into every identity centroid within a tight cosine cutoff (default 0.45). A multi-identity photo lands in multiple facesets; export-swap's per-bbox outlier filter ensures each faceset crops only its matching face.
- Finally it invokes cmd_export_swap against the synthetic manifest, renames the emitted .fsz bundles after the source folder, drops a <label>.txt marker, and merges the new entries into the canonical facesets_swap_ready/manifest.json (existing facesets are left untouched).
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done
# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"
# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
The script's config block (TRUSTED, START_NNN, OSRC_THRESHOLD, TOP_N, etc.)
is the only thing to edit when adding more hand-sorted folders later.
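The two-pass outlier rejection (cos-dist 0.55 → 0.45) used to build each folder's identity centroid can be sketched as follows. This is an assumption-laden illustration in pure Python, not the code in work/build_folders.py; the function names are hypothetical and embeddings are plain lists of floats.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cos_dist(a, b):
    """Cosine distance between two unit-norm vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    """Unit-norm mean of a list of unit-norm vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)


def two_pass_centroid(embeddings, loose=0.55, tight=0.45):
    """Reject outliers at a loose cutoff, re-centroid, then reject at a tight one."""
    kept = [normalize(e) for e in embeddings]
    for threshold in (loose, tight):
        center = centroid(kept)
        survivors = [e for e in kept if cos_dist(e, center) <= threshold]
        if not survivors:  # never empty the set; keep the previous pass
            break
        kept = survivors
    return centroid(kept), kept
```

The loose first pass removes faces that are clearly someone else (bystanders in group photos), so the second centroid is less contaminated before the tighter cutoff is applied.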
Splitting an identity by era (age sub-clustering)
Long-running source corpora produce identities that span 10+ years. The 2009 face and the 2024 face of the same person sit in the same cluster (correctly — same identity), but a single averaged embedding pulled from that cluster blurs across ages. For face-swap output that should target a specific period, the identity needs to be split by era after the identity is established.
work/age_split_001.py is a worked example for faceset_001 and a template for
any other identity. The pipeline is:
- Probe first with work/check_faceset001_age.py: report intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with distinct year ranges, the identity is age-sortable.
- Seed centroid from the curated facesets_swap_ready/faceset_001/ (manifest provides face keys → cache rows).
- Wide recovery at cos-dist ≤ 0.55 against the seed under the original source roots, then quality-gate (face_short, blur, det_score) and one re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- Sub-cluster the survivors at cos-dist 0.35 (precomputed-distance agglomerative, average linkage).
- Anchor-based fragment assignment (not transitive merge, which caused year-drift): sub-clusters with size ≥ 20 are anchors; smaller fragments attach to the single nearest anchor only if both the centroid distance ≤ 0.40 AND the dominant EXIF year is within ±5 years. Fragments with no qualifying anchor remain standalone (and end up THIN-tagged downstream).
- EXIF year per source path with on-disk caching at work/cache/age_split_exif.json: the Windows-mount EXIF read is the slowest step, so re-runs after a parameter tweak are nearly instant.
- Per-era export mirrors export-swap: composite-quality rank, single-face square PNG crops, top-N + _all.fsz bundles, per-era manifest.json, human-readable <era>.txt marker. Eras with < 20 face records also drop a THIN.txt marker so they can be quarantined.
- Top-level manifest merge: era buckets are appended to facesets_swap_ready/manifest.json. Operationally the THIN buckets should be moved into _thin/ (and the manifest split into facesets + thin_eras), leaving only the substantive era buckets at the top level.
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py
# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
For the faceset_001 run on 5260-face nl_full.npz, this produced 6 substantive
era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under _thin/.
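The anchor-based fragment assignment is the step most easily gotten wrong, so a small sketch may help. It assumes each sub-cluster has already been reduced to a centroid, a size, and a dominant EXIF year; the SubCluster type and assign_fragments name are hypothetical, not taken from work/age_split_001.py.

```python
import math
from dataclasses import dataclass


@dataclass
class SubCluster:
    name: str
    centroid: list       # embedding centroid (any scale)
    size: int            # number of face records
    dominant_year: int   # most common EXIF year in the sub-cluster


def cos_dist(a, b):
    """Cosine distance between two vectors of arbitrary length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def assign_fragments(clusters, anchor_min=20, max_dist=0.40, max_year_gap=5):
    """Map each fragment name to an anchor name, or None if it stays standalone."""
    anchors = [c for c in clusters if c.size >= anchor_min]
    assignment = {}
    for frag in clusters:
        if frag.size >= anchor_min:
            continue  # anchors are never reassigned
        if not anchors:
            assignment[frag.name] = None
            continue
        # Only the single nearest anchor is considered; there is no fallback
        # to the second-nearest and no fragment-to-fragment (transitive) merge.
        nearest = min(anchors, key=lambda a: cos_dist(frag.centroid, a.centroid))
        close = cos_dist(frag.centroid, nearest.centroid) <= max_dist
        same_era = abs(frag.dominant_year - nearest.dominant_year) <= max_year_gap
        assignment[frag.name] = nearest.name if (close and same_era) else None
    return assignment
```

Because a fragment can only attach to an anchor, never to another fragment, a chain of small clusters cannot pull an era bucket across years.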
Discovering new identities in a mixed bucket
A flat folder of mixed-identity photos (e.g. osrc/) is the opposite of the
hand-sorted case: identities have to be discovered, not asserted, but should
not collide with already-known identities or scramble their numbering.
work/cluster_osrc.py is the worked example. The pipeline:
- Filter cache to the source root, including any byte-aliased path that resolves under it.
- Drop already-covered faces by comparing each candidate to the centroids of the existing canonical facesets at the EXISTING_MATCH_THRESHOLD (default 0.45, the same cutoff as build_folders.py's osrc routing). These faces are already routed by extend / build_folders.py and shouldn't seed new facesets.
- Cluster the unmatched at cos-dist 0.55 (matches the extend default for the new-cluster phase).
- Apply refine-equivalent gates per cluster: face_short, blur, det_score, plus outlier rejection (cluster-centroid cos-dist > 0.55) for clusters of size ≥ 4. Keep clusters whose surviving unique-source-path count is ≥ MIN_FACES.
- Number new facesets past the existing maximum (START_NNN), so faceset_001..NNN are never disturbed.
- Synthesize a refine manifest and run cmd_export_swap against it, then move the resulting dirs into facesets_swap_ready/ and append to the top-level manifest.json. Each new dir gets an osrc.txt provenance marker.
Always run extend first so raw_full/ and facesets_full/ reflect the new
source — the cluster_osrc.py step then operates against the canonical
cache and doesn't need raw_full/ for input:
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
# person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
--refine-out "$OUT/facesets_full"
# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
# without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run
# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by
existing identities), this produced 6 new facesets (faceset_020..025,
sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to
export-swap's tighter min_face_short=100 gate).
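The "drop already-covered faces" step amounts to a nearest-centroid test against every existing faceset. The sketch below is illustrative only: split_by_existing is a hypothetical name, embeddings are plain lists, and the real work/cluster_osrc.py presumably works on NumPy arrays from the cache.

```python
import math


def cos_dist(a, b):
    """Cosine distance between two vectors of arbitrary length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_by_existing(faces, centroids, threshold=0.45):
    """Separate faces covered by a known identity from those still unmatched.

    faces:     {face_key: embedding}
    centroids: {faceset_name: centroid embedding}
    Returns ({face_key: faceset_name}, [unmatched face_key, ...]).
    """
    matched, unmatched = {}, []
    for key, emb in faces.items():
        best_name, best_dist = None, float("inf")
        for name, center in centroids.items():
            dist = cos_dist(emb, center)
            if dist < best_dist:
                best_name, best_dist = name, dist
        if best_name is not None and best_dist <= threshold:
            matched[key] = best_name
        else:
            unmatched.append(key)  # candidates for new-identity clustering
    return matched, unmatched
```

Only the unmatched list would go on to the 0.55 clustering pass, which is what keeps new facesets from duplicating identities that already exist.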
Importing identities from a self-hosted Immich library
work/immich_stage.py + work/embed_worker.py + work/cluster_immich.py
together import an Immich library at scale, with the embed step running on
a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:
- work/immich_stage.py (WSL): pages every IMAGE asset via /search/metadata, fetches each asset's /faces?id= to read Immich's own ML-driven bboxes, scales each bbox to original-image coordinates, and prefilters by face_short ≥ 90. For survivors it downloads the original, sha256-deduplicates against the canonical nl_full.npz and against same-run staged files, and saves to /mnt/x/src/immich/<user>/<rel>. Writes a queue.json that the embed worker consumes. 8 concurrent worker threads run the full per-asset I/O chain (/faces → filter → /original) so 8 workers ≈ 8× the serial throughput.
- work/embed_worker.py (Windows venv at C:\face_embed_venv\): loads insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider and runs detection + landmarks + recognition over the queue. Produces a .npz cache whose schema is identical to what sort_faces.py:cmd_embed writes, so the result is directly loadable by load_cache(). The cache already includes the post-enrich fields (landmark_2d_106, landmark_3d_68, pose) because FaceAnalysis returns them for free. AMD Vega gives ~7.5× real-pipeline speedup over CPU.
- work/cluster_immich.py (WSL): mirrors cluster_osrc.py's shape but reads from immich_<user>.npz. Builds existing-identity centroids from every canonical faceset_NNN/ in facesets_swap_ready/ (skipping era splits and _thin/), drops immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55, applies refine gates, numbers new facesets past the existing maximum, and feeds cmd_export_swap via a synthetic manifest.
work/finalize_immich.sh <user> chains queue → Windows embed → cache
copy back → cluster_immich, with logging.
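The staging prefilter (scale Immich's bbox to original-image coordinates, then require face_short ≥ 90) can be sketched as below. This assumes the face bbox is reported relative to the image size the ML job ran on, and that an asset is staged when at least one face passes; both are assumptions, and the function names are hypothetical rather than taken from work/immich_stage.py.

```python
def scale_bbox(bbox, ml_size, orig_size):
    """Rescale (x1, y1, x2, y2) from ML-image pixels to original-image pixels.

    ml_size and orig_size are (width, height) tuples.
    """
    x1, y1, x2, y2 = bbox
    sx = orig_size[0] / ml_size[0]
    sy = orig_size[1] / ml_size[1]
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


def face_short(bbox):
    """Short edge of a bbox in pixels."""
    x1, y1, x2, y2 = bbox
    return min(x2 - x1, y2 - y1)


def has_big_face(bboxes, ml_size, orig_size, min_short=90):
    """True if any face is at least min_short pixels on its short edge."""
    return any(
        face_short(scale_bbox(b, ml_size, orig_size)) >= min_short
        for b in bboxes
    )
```

Filtering on Immich's own detections before downloading is what lets the stage skip the originals of assets with no usable face.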
The Immich API key (admin or per-user) + base URL come from environment variables:
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=... # admin or per-user key
python work/immich_stage.py --user peter --workers 8
bash work/finalize_immich.sh peter
For the 2026-04-26 run against https://fotos.computerliebe.org (Immich
v2.7.2), with the admin API key:
| step | result |
|---|---|
| stage | 53,842 assets seen, 10,261 staged (~10 GB), 978 byte-deduped against nl_full.npz, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
| Windows DML embed | 19,462 face records + 1 noface in 64.6 min (2.6 img/s end-to-end) |
| matched existing identities | 8,103 of 19,480 (42%) at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → 185 emitted as faceset_026..264 (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |
A second 2026-04-26 run with nic's per-user API key confirmed the
expected behavior: 25,777 of nic's IMAGE assets were enumerated (matching
her /server/statistics count of 25,786, off by 9 ≈ the transient errors
that didn't get marked seen), 7,834 staged (30% face-bearing-with-big-face,
denser than peter's 19%), 519 byte-deduped vs nl_full.npz, 0 internal
byte-duplicates (cleaner library than peter's 2,976), 54 transient errors.
Embed + cluster on the nic queue:
| step | result |
|---|---|
| Windows DML embed | 15,627 face records + 1 noface in 59 min (2.2 img/s end-to-end), 7 load errors |
| matched existing identities | 6,770 of 15,627 (43%) at cos-dist ≤ 0.45; biggest hits faceset_002 (+3,261), faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408) |
| new clusters | 3,787 at threshold 0.55 → 129 surviving refine gates → 95 emitted as faceset_265..NNN (gaps where export-swap's 0.45 outlier threshold dropped clusters below the export bar) |
Top-level facesets_swap_ready/manifest.json after both Immich runs:
311 substantive facesets (12 auto-cluster nl/lzbkp + 7 hand-sorted +
6 era splits + 6 osrc-discovered + 185 peter-Immich + 95 nic-Immich) +
68 thin_eras under _thin/.
work/immich_stage.py carries a built-in outage circuit breaker:
after 12 consecutive HTTP errors it probes Immich; if that probe also
fails, the script exits cleanly with code 2, state preserved. This made
the nic run survive a mid-stage Immich outage — the script paused, the
operator confirmed connectivity was back, and the same command resumed
from the saved state.json without re-fetching what was already done.
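A circuit breaker of this kind can be sketched as follows. The threshold of 12 consecutive errors, the probe, exit code 2, and the saved state.json come from the description above; everything else (class and function names, the state layout, treating OSError as the failure signal) is a hypothetical illustration, not the code in work/immich_stage.py.

```python
import json
from pathlib import Path


class OutageBreaker:
    """Trips after `limit` consecutive errors if a health probe also fails."""

    def __init__(self, probe, limit=12):
        self.probe = probe          # callable returning True if the server is up
        self.limit = limit
        self.consecutive = 0

    def record_success(self):
        self.consecutive = 0

    def record_error(self):
        """Return True when the caller should stop and exit."""
        self.consecutive += 1
        if self.consecutive < self.limit:
            return False
        if self.probe():            # server is reachable: errors were per-asset
            self.consecutive = 0
            return False
        return True


def save_state(path, state):
    """Write state atomically: temp file first, then rename over the target."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def run(asset_ids, fetch, probe, state_path):
    """Process assets, skipping finished ones; return 2 on a confirmed outage."""
    state_path = Path(state_path)
    done = set()
    if state_path.exists():
        done = set(json.loads(state_path.read_text())["done"])
    breaker = OutageBreaker(probe)
    for asset_id in asset_ids:
        if asset_id in done:
            continue
        try:
            fetch(asset_id)
        except OSError:
            if breaker.record_error():
                save_state(state_path, {"done": sorted(done)})
                return 2
            continue
        breaker.record_success()
        done.add(asset_id)
    save_state(state_path, {"done": sorted(done)})
    return 0
```

Re-running the same command after an exit with code 2 picks up from the saved state, so nothing that was already fetched is requested again.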
Important caveats for Immich v2.7.2:
- The userIds filter on /search/metadata is silently ignored when the API key is bound to a different user. The "import everything the API key can see" semantics are what you actually get; cross-user isolation is enforced server-side.
- /server/statistics reports counts that under-count what /search/metadata actually returns (e.g. external library thumbnail-dirs that got indexed because the import path included them). Don't trust the statistics number as a denominator.
- A meaningful fraction of originalPath-based assets are Immich's own thumbnails (<library_root>/thumbs/.../-preview.jpeg), included if the external library's import path covers the thumbs directory and the exclusion patterns don't list **/thumbs/**. For our run, 5,563 of 10,261 staged were thumbnails. They embed and cluster fine but the resulting faces are lower-resolution.
Key defaults
refine:
| flag | default | meaning |
|---|---|---|
| --initial-threshold | 0.55 | cosine distance for stage-1 clustering |
| --merge-threshold | 0.40 | centroid-level merge of over-split clusters |
| --outlier-threshold | 0.55 | drop face if cosine dist from centroid exceeds (only if cluster ≥ 4) |
| --min-faces | 15 | minimum unique images per faceset |
| --min-short | 90 | minimum short-edge pixels of face bbox |
| --min-blur | 40.0 | Laplacian-variance blur gate |
| --min-det-score | 0.6 | InsightFace detector score gate |
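To make the refine gates concrete, here is a small sketch of the per-face quality gate and the per-cluster size filter using the defaults above. The FaceRecord type and function names are hypothetical, and treating each gate as an inclusive minimum (≥) is an assumption based on the "minimum" wording.

```python
from dataclasses import dataclass


@dataclass
class FaceRecord:
    path: str          # source image path
    face_short: float  # short edge of the face bbox, in pixels
    blur: float        # Laplacian variance (higher is sharper)
    det_score: float   # detector confidence


def passes_quality(face, min_short=90, min_blur=40.0, min_det_score=0.6):
    """Per-face gate: all three minimums must hold."""
    return (
        face.face_short >= min_short
        and face.blur >= min_blur
        and face.det_score >= min_det_score
    )


def gate_cluster(faces, min_faces=15, **gates):
    """Keep a cluster only if enough unique images survive the quality gate."""
    survivors = [f for f in faces if passes_quality(f, **gates)]
    unique_images = {f.path for f in survivors}
    return survivors if len(unique_images) >= min_faces else []
```

Counting unique image paths rather than face records means several faces from one photo cannot carry a cluster past the --min-faces bar on their own.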
export-swap:
| flag | default | meaning |
|---|---|---|
| --top-n | 30 | size of the <faceset>_topN.fsz bundle |
| --outlier-threshold | 0.45 | tighter than refine; trims cluster boundary for averaging |
| --pad-ratio | 0.5 | padding around face bbox for PNG crop |
| --out-size | 512 | PNG output is square out_size × out_size |
| --min-face-short | 100 | export gate; stricter than refine's 90 |
| --candidates | off | rescue _singletons/ into _candidates/ for manual review |
| --candidate-match-threshold | 0.55 | cos-dist cutoff for singleton → existing faceset |
| --candidate-min-score | 0.40 | composite-quality floor for candidates |
The composite quality score in export-swap is 0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness, each normalized to [0, 1].
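As a worked example of the weighted sum, the sketch below applies the stated weights to components that are assumed to be already normalized to [0, 1]. How each component is derived and normalized is not shown here and is left to sort_faces.py; the function name and the clamping of out-of-range inputs are assumptions.

```python
# Weights as stated in the formula above; they sum to 1.0.
WEIGHTS = {
    "frontality": 0.30,
    "det_score": 0.20,
    "landmark_symmetry": 0.20,
    "face_size": 0.15,
    "sharpness": 0.15,
}


def composite_quality(components):
    """Weighted sum of the five components, each clamped to [0, 1]."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components[name]))
        score += weight * value
    return score


example = {
    "frontality": 0.8,
    "det_score": 0.9,
    "landmark_symmetry": 0.7,
    "face_size": 0.5,
    "sharpness": 0.6,
}
# 0.24 + 0.18 + 0.14 + 0.075 + 0.09 = 0.725
print(round(composite_quality(example), 3))
```

With these weights a face at score 0.725 would clear the --candidate-min-score floor of 0.40 comfortably; frontality alone contributes almost a third of the total.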
Downstream: roop-unleashed
The .fsz bundles emitted by export-swap drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.
Highly recommended at swap time: enable Select post-processing = GFPGAN with the Original/Enhanced image blend ratio = 0.85 (default is 0.65 which is conservative). See docs/analysis/facesets-downstream-refinement-evaluation.md for the full evaluation.
Layout
/opt/face-sets/
├─ README.md (this file)
├─ sort_faces.py (the tool)
├─ docs/
│ └─ analysis/
│ └─ facesets-downstream-refinement-evaluation.md
└─ work/ (gitignored except force-tracked .py / .sh)
├─ build_folders.py (hand-sorted-folder orchestration)
├─ check_faceset001_age.py (age-split readiness probe)
├─ age_split_001.py (age-split orchestration; faceset_001)
├─ cluster_osrc.py (mixed-bucket identity discovery)
├─ immich_stage.py (Immich library staging, parallel)
├─ embed_worker.py (Windows DML embed worker, runs from C:\face_embed_venv\)
├─ cluster_immich.py (Immich identity discovery + export)
├─ finalize_immich.sh (chains queue → embed → cluster)
├─ synthetic_*_manifest.json (per-run synthetic refine manifests)
├─ immich/
│ ├─ users.json (label -> userId map; gitignored)
│ └─ <user>/{queue,state,aliases}.json (per-user staging artifacts)
├─ cache/
│ ├─ nl_full.npz (canonical cache + duplicates.json)
│ ├─ immich_<user>.npz (per-user immich embeddings)
│ └─ age_split_exif.json (path → EXIF-year cache)
└─ logs/
└─ *.log (every long step writes here)