# face-sets
Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).
## Pipeline
`sort_faces.py` is a single-file CLI with seven subcommands:
| step | what it does |
|---|---|
| `embed` | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup. |
| `cluster` | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest. |
| `refine` | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`. |
| `dedup` | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`. |
| `extend` | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| `enrich` | Re-detect each cached face to persist `landmark_2d_106`, `landmark_3d_68`, pose (pitch/yaw/roll) into the cache. |
| `export-swap` | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |
## Design principles
- `embed` is resumable and incremental. It loads any existing cache at the target path and only hashes / embeds files it has not seen. Atomic flush every 50 new files, so a mid-run crash loses at most ~50 embeddings.
- Byte-identical duplicates are sha256-grouped at listing time. The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster`/`refine`/`export-swap`, so each on-disk location is represented.
- `safe_dst_name` always flattens the absolute path. This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- Caches and outputs stay out of git via `.gitignore`; defaults live under `work/`.
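The listing-time dedup and crash-safe flush can be sketched as follows (a minimal sketch with hypothetical helper names; the real cache layout is whatever `sort_faces.py` writes):

```python
import hashlib
import os

def sha256_file(path, chunk=1 << 20):
    """Stream-hash a file so large originals don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def group_by_hash(paths):
    """Group candidate files by content hash at listing time.

    The first path per hash (in sorted order) is the canonical file that
    gets embedded; the remaining paths become path_aliases in the cache.
    """
    groups = {}
    for p in sorted(paths):
        groups.setdefault(sha256_file(p), []).append(p)
    return {h: {"canonical": ps[0], "path_aliases": ps[1:]}
            for h, ps in groups.items()}

def atomic_flush(cache_path, payload):
    # Write-to-temp + rename: a crash mid-flush never corrupts the cache,
    # so an interrupted run loses at most the unflushed tail of new files.
    tmp = cache_path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
    os.replace(tmp, cache_path)
```

The same grouping is why re-embedding a moved or copied tree is nearly free: unchanged bytes hash to an already-known key and only gain an alias.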
## Typical end-to-end run
```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted

# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"

# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"

# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"

# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"

# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"

# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
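As a rough sketch of what the `cluster` step does under the hood, assuming the cache yields an `(n, d)` embedding matrix (the real command also writes folders, singleton handling, and a manifest), average-linkage agglomerative clustering on cosine distance looks like:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_embeddings(embs, threshold=0.55):
    """Average-linkage agglomerative clustering on cosine distance.

    embs: (n, d) embedding matrix, n >= 2. Returns an integer cluster
    label per row; singleton faces simply get their own label.
    """
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    dists = pdist(embs, metric="cosine")      # condensed cos-dist matrix
    z = linkage(dists, method="average")      # precomputed-distance tree
    return fcluster(z, t=threshold, criterion="distance")
```

Average linkage at a 0.55 cos-dist cut is also the default the `refine` and `extend` stages reuse, per the Key defaults table below.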
## Merging a new source into an existing result
```bash
# Embed the new source into the same cache (resumes from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"

# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"

# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
## Importing hand-sorted folders as identities
When source folders are already hand-sorted by person (one folder per identity), the clustering path is the wrong tool — the identity is asserted, not inferred. The orchestration script `work/build_folders.py` covers this case:

- For each trusted folder, it filters cache records that fall under it, builds an identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a mixed folder (e.g. `osrc/`) into every identity centroid within a tight cosine cutoff (default 0.45). A multi-identity photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and merges the new entries into the canonical `facesets_swap_ready/manifest.json` (existing facesets are left untouched).
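The two-pass centroid build can be sketched like this (a simplified stand-in for the script's actual logic; `identity_centroid` is a hypothetical name):

```python
import numpy as np

def identity_centroid(embs, pass_thresholds=(0.55, 0.45)):
    """Two-pass outlier rejection for a trusted folder.

    Each pass recomputes the centroid from the surviving members, then
    drops members whose cosine distance exceeds the (tightening)
    threshold. Bystander faces from group photos fall out here.
    Returns (centroid, boolean keep-mask over the input rows).
    """
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    keep = np.ones(len(embs), dtype=bool)
    for t in pass_thresholds:
        c = embs[keep].mean(axis=0)
        c /= np.linalg.norm(c)
        dist = 1.0 - embs @ c        # cosine distance of every face
        keep = dist <= t             # tighten membership for next pass
    return embs[keep].mean(axis=0), keep
```

Tightening 0.55 → 0.45 matters because a bystander can sit close enough to a contaminated first-pass centroid to survive a single cut.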
```bash
# Embed each hand-sorted folder + the mixed bucket; the cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
    python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done

# Bring landmarks/pose + the visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"

# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```
The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.) is the only thing to edit when adding more hand-sorted folders later.
## Splitting an identity by era (age sub-clustering)
Long-running source corpora produce identities that span 10+ years. The 2009 face and the 2024 face of the same person sit in the same cluster (correctly — same identity), but a single averaged embedding pulled from that cluster blurs across ages. For face-swap output that should target a specific period, the identity needs to be split by era after the identity is established.
`work/age_split_001.py` is a worked example for faceset_001 and a template for any other identity. The pipeline is:

- Probe first with `work/check_faceset001_age.py` — report intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with distinct year ranges, the identity is age-sortable.
- Seed centroid from the curated `facesets_swap_ready/faceset_001/` (the manifest provides face keys → cache rows).
- Wide recovery at cos-dist ≤ 0.55 against the seed under the original source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- Sub-cluster the survivors at cos-dist 0.35 (precomputed-distance agglomerative, average linkage).
- Anchor-based fragment assignment (not transitive merge — that caused year-drift): sub-clusters with size ≥ 20 are anchors; smaller fragments attach to the single nearest anchor only if both the centroid distance ≤ 0.40 AND the dominant EXIF year is within ±5 years. Fragments with no qualifying anchor remain standalone (and end up THIN-tagged downstream).
- EXIF year per source path with on-disk caching at `work/cache/age_split_exif.json` — the Windows-mount EXIF read is the slowest step, so re-runs after a parameter tweak are nearly instant.
- Per-era export mirrors `export-swap`: composite-quality rank, single-face square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`, human-readable `<era>.txt` marker. Eras with < 20 face records also drop a `THIN.txt` marker so they can be quarantined.
- Top-level manifest merge: era buckets are appended to `facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`), leaving only the substantive era buckets at the top level.
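The anchor-based fragment assignment rule reduces to a few lines (a sketch with hypothetical field names, not the script's actual data model):

```python
import numpy as np

def assign_fragments(anchors, fragments, dist_cut=0.40, year_cut=5):
    """anchors/fragments: dicts with 'centroid' (unit vector) and 'year'
    (dominant EXIF year). A fragment attaches to its single nearest
    anchor only if BOTH gates pass; otherwise it stays standalone and
    gets THIN-tagged downstream. Returns anchor index or None per fragment.
    """
    out = []
    for f in fragments:
        dists = [1.0 - float(f["centroid"] @ a["centroid"]) for a in anchors]
        i = int(np.argmin(dists))
        if dists[i] <= dist_cut and abs(f["year"] - anchors[i]["year"]) <= year_cut:
            out.append(i)        # attach to the one nearest anchor
        else:
            out.append(None)     # standalone fragment
    return out
```

Restricting each fragment to a single nearest anchor (rather than merging transitively) is what prevents chains of borderline fragments from dragging an era bucket across years.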
```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py

# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```
For the faceset_001 run on the 5,260-face `nl_full.npz`, this produced 6 substantive era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282) plus 68 thin/fragment buckets quarantined under `_thin/`.
## Discovering new identities in a mixed bucket
A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the hand-sorted case: identities have to be discovered, not asserted, but they should not collide with already-known identities or scramble their numbering.
`work/cluster_osrc.py` is the worked example. The pipeline:

- Filter the cache to the source root, including any byte-aliased path that resolves under it.
- Drop already-covered faces by comparing each candidate to the centroids of the existing canonical facesets at `EXISTING_MATCH_THRESHOLD` (default 0.45 — the same cutoff as `build_folders.py`'s osrc routing). These faces are already routed by `extend`/`build_folders.py` and shouldn't seed new facesets.
- Cluster the unmatched at cos-dist 0.55 (matches the `extend` default for the new-cluster phase).
- Apply `refine`-equivalent gates per cluster: `face_short`, `blur`, `det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for clusters of size ≥ 4. Keep clusters whose surviving unique-source-path count is ≥ `MIN_FACES`.
- Number new facesets past the existing maximum (`START_NNN`), so `faceset_001..NNN` are never disturbed.
- Synthesize a refine manifest and run `cmd_export_swap` against it, then move the resulting dirs into `facesets_swap_ready/` and append to the top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance marker.
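The already-covered drop is a single matrix comparison (a sketch; `drop_covered` is a hypothetical name, and the real script works on cache records rather than bare arrays):

```python
import numpy as np

def drop_covered(cands, centroids, threshold=0.45):
    """cands: (n, d) unit embeddings; centroids: (k, d) unit identity
    centroids. A candidate is 'covered' if its nearest centroid is
    within the cosine cutoff; covered faces are already routed and must
    not seed new facesets. Returns (surviving rows, covered mask).
    """
    dist = 1.0 - cands @ centroids.T        # (n, k) cosine distances
    covered = dist.min(axis=1) <= threshold
    return cands[~covered], covered
```

Keeping this cutoff equal to the routing cutoff (0.45) is what guarantees the two passes partition the faces instead of double-counting them.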
Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new source — the `cluster_osrc.py` step then operates against the canonical cache and doesn't need `raw_full/` for input:
```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
#    person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
    --refine-out "$OUT/facesets_full"

# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
#    without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run

# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```
For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by existing identities), this produced 6 new facesets (faceset_020..025, sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to export-swap's tighter `min_face_short=100` gate).
## Importing identities from a self-hosted Immich library
`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py` together import an Immich library at scale, with the embed step running on a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:

- `work/immich_stage.py` (WSL) — pages every IMAGE asset via `/search/metadata`, fetches each asset's `/faces?id=` to read Immich's own ML-driven bboxes, scales each bbox to original-image coordinates, and prefilters by `face_short ≥ 90`. For survivors it downloads the original, sha256-deduplicates against the canonical `nl_full.npz` and against same-run staged files, and saves to `/mnt/x/src/immich/<user>/<rel>`. Writes a `queue.json` that the embed worker consumes. 8 concurrent worker threads run the full per-asset I/O chain (`/faces` → filter → `/original`), so 8 workers ≈ 8× the serial throughput.
- `work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`) — loads `insightface.FaceAnalysis(buffalo_l)` with the `DmlExecutionProvider` and runs detection + landmarks + recognition over the queue. Produces a `.npz` cache that's bit-identical in schema to what `sort_faces.py:cmd_embed` writes, so the result is directly loadable by `load_cache()`. The cache already includes the post-`enrich` fields (`landmark_2d_106`, `landmark_3d_68`, `pose`) because FaceAnalysis returns them for free. AMD Vega gives ~7.5× real-pipeline speedup over CPU.
- `work/cluster_immich.py` (WSL) — mirrors `cluster_osrc.py`'s shape but reads from `immich_<user>.npz`. Builds existing-identity centroids from every canonical `faceset_NNN/` in `facesets_swap_ready/` (skipping era splits and `_thin/`), drops immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55, applies refine gates, numbers new facesets past the existing maximum, and feeds `cmd_export_swap` via a synthetic manifest.
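The staged fan-out can be sketched with stdlib threads (hypothetical `fetch_faces`/`fetch_original` callables standing in for the Immich HTTP calls; the real script also sha256-dedups and writes `queue.json`):

```python
from concurrent.futures import ThreadPoolExecutor

def stage_assets(assets, fetch_faces, fetch_original, min_short=90, workers=8):
    """Run the full per-asset I/O chain inside each worker thread, so a
    network-bound stage scales roughly linearly with worker count.

    fetch_faces(asset) -> list of {'w': ..., 'h': ...} bbox dicts
    fetch_original(asset) -> staged file path (only called for survivors)
    """
    def one(asset):
        faces = fetch_faces(asset)
        # Prefilter: skip the (expensive) original download unless at
        # least one bbox short edge clears the min_short gate.
        if not any(min(f["w"], f["h"]) >= min_short for f in faces):
            return None
        return fetch_original(asset)

    with ThreadPoolExecutor(max_workers=workers) as ex:
        return [r for r in ex.map(one, assets) if r is not None]
```

Threads (not processes) are the right fit here because each worker spends its time waiting on HTTP and disk, not on Python bytecode.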
`work/finalize_immich.sh <user>` chains queue → Windows embed → cache copy-back → `cluster_immich`, with logging.
The Immich admin API key + base URL come from environment variables:
```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=...   # admin or per-user key

python work/immich_stage.py --user peter --workers 8
bash work/finalize_immich.sh peter
```
For the 2026-04-26 run against https://fotos.computerliebe.org (Immich
v2.7.2), with the admin API key:
| step | result |
|---|---|
| stage | 53,842 assets seen, 10,261 staged (~10 GB), 978 byte-deduped against nl_full.npz, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
| Windows DML embed | 19,462 face records + 1 noface in 64.6 min (2.6 img/s end-to-end) |
| matched existing identities | 8,103 of 19,480 (42%) at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → 185 emitted as faceset_026..264 (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |
A second 2026-04-26 run with nic's per-user API key confirmed the
expected behavior: 25,777 of nic's IMAGE assets were enumerated (matching
her /server/statistics count of 25,786, off by 9 ≈ the transient errors
that didn't get marked seen), 7,834 staged (30% face-bearing-with-big-face,
denser than peter's 19%), 519 byte-deduped vs nl_full.npz, 0 internal
byte-duplicates (cleaner library than peter's 2,976), 54 transient errors.
Embed + cluster on the nic queue:
| step | result |
|---|---|
| Windows DML embed | 15,627 face records + 1 noface in 59 min (2.2 img/s end-to-end), 7 load errors |
| matched existing identities | 6,770 of 15,627 (43%) at cos-dist ≤ 0.45; biggest hits faceset_002 (+3,261), faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408) |
| new clusters | 3,787 at threshold 0.55 → 129 surviving refine gates → 95 emitted as faceset_265..NNN (gaps where export-swap's 0.45 outlier dropped clusters below the export bar) |
Top-level facesets_swap_ready/manifest.json after both Immich runs:
311 substantive facesets (12 auto-cluster nl/lzbkp + 7 hand-sorted +
6 era splits + 6 osrc-discovered + 185 peter-Immich + 95 nic-Immich) +
68 thin_eras under _thin/.
work/immich_stage.py carries a built-in outage circuit breaker:
after 12 consecutive HTTP errors it probes Immich; if that probe also
fails, the script exits cleanly with code 2, state preserved. This made
the nic run survive a mid-stage Immich outage — the script paused, the
operator confirmed connectivity was back, and the same command resumed
from the saved state.json without re-fetching what was already done.
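The breaker logic is small enough to sketch in full (a simplified stand-in, not the script's actual class):

```python
class CircuitBreaker:
    """After `limit` consecutive failures, run a probe request; if the
    probe also fails, signal a clean shutdown with state preserved
    instead of churning through the rest of the queue as errors."""

    def __init__(self, probe, limit=12):
        self.probe = probe          # callable: is the server reachable?
        self.limit = limit
        self.errors = 0             # consecutive failures seen so far

    def record(self, ok):
        self.errors = 0 if ok else self.errors + 1
        if self.errors >= self.limit and not self.probe():
            raise SystemExit(2)     # exit code 2: outage, resume later
```

Resetting the counter on any success keeps scattered transient errors (like the 54 in the nic run) from ever tripping the breaker.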
Important caveats for Immich v2.7.2:

- The `userIds` filter on `/search/metadata` is silently ignored when the API key is bound to a different user. The "import everything the API key can see" semantics are what you actually get; cross-user isolation is enforced server-side.
- `/server/statistics` reports counts that under-count what `/search/metadata` actually returns (e.g. external library thumbnail-dirs that got indexed because the import path included them). Don't trust the statistics number as a denominator.
- A meaningful fraction of `originalPath`-based assets are Immich's own thumbnails (`<library_root>/thumbs/.../-preview.jpeg`) — included if the external library's import path covers the thumbs directory and the exclusion patterns don't list `**/thumbs/**`. For our run, 5,563 of 10,261 staged were thumbnails. They embed and cluster fine but the resulting faces are lower-resolution.
## Key defaults
`refine`:

| flag | default | meaning |
|---|---|---|
| `--initial-threshold` | 0.55 | cosine distance for stage-1 clustering |
| `--merge-threshold` | 0.40 | centroid-level merge of over-split clusters |
| `--outlier-threshold` | 0.55 | drop a face if its cosine distance from the centroid exceeds this (only if cluster ≥ 4) |
| `--min-faces` | 15 | minimum unique images per faceset |
| `--min-short` | 90 | minimum short-edge pixels of the face bbox |
| `--min-blur` | 40.0 | Laplacian-variance blur gate |
| `--min-det-score` | 0.6 | InsightFace detector score gate |
`export-swap`:

| flag | default | meaning |
|---|---|---|
| `--top-n` | 30 | size of the `<faceset>_topN.fsz` bundle |
| `--outlier-threshold` | 0.45 | tighter than refine; trims the cluster boundary for averaging |
| `--pad-ratio` | 0.5 | padding around the face bbox for the PNG crop |
| `--out-size` | 512 | PNG output is square, out_size × out_size |
| `--min-face-short` | 100 | export gate; stricter than refine's 90 |
| `--candidates` | off | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` | 0.55 | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score` | 0.40 | composite-quality floor for candidates |
The composite quality score in export-swap is 0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness, each normalized to [0, 1].
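Restated as code (assuming the five component scores are already normalized to [0, 1]):

```python
def composite_quality(frontality, det_score, landmark_symmetry,
                      face_size, sharpness):
    """Weighted composite quality score used to rank export crops.

    All five inputs are pre-normalized to [0, 1]; the weights sum to 1.0,
    so the composite also lands in [0, 1].
    """
    return (0.30 * frontality
            + 0.20 * det_score
            + 0.20 * landmark_symmetry
            + 0.15 * face_size
            + 0.15 * sharpness)
```

Frontality carrying the largest weight matches the export goal: a near-frontal, sharp crop averages into a cleaner identity embedding than a large but profiled face.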
## Post-export corpus maintenance
The `sort_faces.py` pipeline above produces `facesets_swap_ready/`. Five orchestration scripts under `work/` operate on that already-built corpus to clean it up over time:
| script | purpose |
|---|---|
| `work/filter_occlusions.py` (+ Windows `work/clip_worker.py`) | Drop PNGs of masked / sunglassed faces using open_clip ViT-L-14/dfn2b_s39b zero-shot scoring. Image-level threshold 0.7; faceset-level quarantine at 40% domain dominance. WSL stages a queue, Windows DML scores, WSL applies. See docs/analysis/clip-occlusion-filter.md. |
| `work/consolidate_facesets.py` | Merge duplicate identities (centroid cosine sim ≥ 0.55, confident at ≥ 0.65, complete-linkage to defeat single-link chaining). Pulls embeddings from the cache; no GPU. See docs/analysis/identity-consolidation-and-age-extend.md. |
| `work/age_extend_001.py` | Slot newly-added PNGs into the existing era buckets of faceset_001 (anchor cosine distance ≤ 0.40 AND an EXIF-year gate, mirroring the age-split anchor rule). |
| `work/dedup_optimize.py` (+ Windows `work/multiface_worker.py`) | (a) cross-family SHA256 byte-dedup, (b) within-faceset near-dup at cosine sim ≥ 0.95, (c) multi-face audit (re-detect via insightface, drop PNGs with face_count ≠ 1). The single-face invariant is load-bearing for roop. See docs/analysis/dedup-and-roop-optimization.md. |
| `work/video_target_pipeline.py` (+ Windows `work/video_face_worker.py` + `work/run_video_pipeline.sh` chain) | Target-side preprocessing: scan a folder of videos → PySceneDetect shot-cuts → 2 fps frame sampling → DML face detection + embedding → IoU + embedding tracking → quality-gated segments (yaw ≤ 75°, face ≥ 80 px, det ≥ 0.5, ≥ 70% pass-rate, 1–120 s duration, 2 s cross-track merge gap) → ffmpeg stream-copy into UUID-named clips. Output is organized into per-source subfolders. Provenance sidecars are opt-in (`cut --write-sidecar` or the `SIDECAR=yes` env var); the full plan is always retained in the per-batch plan.json. See docs/analysis/video-target-preprocessing.md. |
All five operate idempotently and reversibly: dropped PNGs go to `<faceset>/faces/_dropped/`, and quarantined whole facesets go to `facesets_swap_ready/_masked/` or `_merged/` (parallel to the existing `_thin/`). The master `manifest.json` partitions entries across `facesets[]`, `masked[]`, `thin_eras[]`, and `merged[]` arrays, plus per-run provenance blocks (`occlusion_filter_run`, `merge_run`, `age_extend_runs`, `dedup_runs`, `multiface_runs`).
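The quarantine move can be sketched as (hypothetical helper; the real scripts also record the per-run provenance blocks):

```python
import json
import os
import shutil

def quarantine(manifest_path, name, bucket="masked"):
    """Move a faceset entry from facesets[] into masked[]/merged[]/... and
    relocate its directory into the matching _<bucket>/ sibling.
    A re-run with the same name is a no-op (returns False)."""
    with open(manifest_path) as f:
        m = json.load(f)
    entry = next((e for e in m["facesets"] if e["name"] == name), None)
    if entry is None:
        return False                      # already quarantined: idempotent
    m["facesets"].remove(entry)
    m.setdefault(bucket, []).append(entry)

    root = os.path.dirname(manifest_path)
    src = os.path.join(root, name)
    dst_dir = os.path.join(root, "_" + bucket)
    os.makedirs(dst_dir, exist_ok=True)
    if os.path.isdir(src):
        shutil.move(src, os.path.join(dst_dir, name))

    tmp = manifest_path + ".tmp"          # atomic manifest rewrite
    with open(tmp, "w") as f:
        json.dump(m, f, indent=2)
    os.replace(tmp, manifest_path)
    return True
```

Reversal is the mirror image: move the directory back and shift the entry from the bucket array back into `facesets[]`.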
## Downstream: roop-unleashed
The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.

Highly recommended at swap time: enable Select post-processing = GFPGAN with the Original/Enhanced image blend ratio = 0.85 (the 0.65 default is conservative). See docs/analysis/facesets-downstream-refinement-evaluation.md for the full evaluation.
## Layout
```
/opt/face-sets/
├─ README.md                    (this file)
├─ sort_faces.py                (the tool)
├─ docs/
│  └─ analysis/
│     └─ facesets-downstream-refinement-evaluation.md
└─ work/                        (gitignored except force-tracked .py / .sh)
   ├─ build_folders.py          (hand-sorted-folder orchestration)
   ├─ check_faceset001_age.py   (age-split readiness probe)
   ├─ age_split_001.py          (age-split orchestration; faceset_001)
   ├─ age_extend_001.py         (extends existing era buckets with new PNGs)
   ├─ cluster_osrc.py           (mixed-bucket identity discovery)
   ├─ immich_stage.py           (Immich library staging, parallel)
   ├─ embed_worker.py           (Windows DML embed worker; C:\face_embed_venv\)
   ├─ cluster_immich.py         (Immich identity discovery + export)
   ├─ finalize_immich.sh        (chains queue → embed → cluster)
   ├─ filter_occlusions.py      (CLIP zero-shot mask + sunglasses filter)
   ├─ clip_worker.py            (Windows DML CLIP worker; C:\clip_dml_venv\)
   ├─ consolidate_facesets.py   (duplicate-identity merger; complete-linkage)
   ├─ dedup_optimize.py         (byte + near-dup + multi-face audit driver)
   ├─ multiface_worker.py       (Windows DML multi-face audit worker)
   ├─ video_target_pipeline.py  (video → swappable segment cuts orchestration)
   ├─ video_face_worker.py      (Windows DML per-frame face worker; JSONL append-only)
   ├─ run_video_pipeline.sh     (generic chain driver: scenes → stage → worker → cut)
   ├─ status_video_pipeline.sh  (status helper for any video_pipeline log)
   ├─ synthetic_*_manifest.json (per-run synthetic refine manifests)
   ├─ immich/
   │  ├─ users.json             (label -> userId map; gitignored)
   │  └─ <user>/{queue,state,aliases}.json (per-user staging artifacts)
   ├─ cache/
   │  ├─ nl_full.npz            (canonical cache + duplicates.json)
   │  ├─ immich_<user>.npz      (per-user immich embeddings)
   │  └─ age_split_exif.json    (path → EXIF-year cache)
   └─ logs/
      └─ *.log                  (every long step writes here)
```