face-sets
Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).
Pipeline
sort_faces.py is a single-file CLI with seven subcommands:
| step | what it does |
|---|---|
| embed | Recursively scan a source tree, detect + embed every face, write .npz cache. Resumable; sha256-dedup. |
| cluster | Raw agglomerative clustering of the cache into person_NNN/ / _singletons/ / _noface/ with manifest. |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → faceset_NNN/. |
| dedup | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → <cache>.duplicates.json. |
| extend | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich | Re-detect each cached face to persist landmark_2d_106, landmark_3d_68, pose (pitch/yaw/roll) into cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + .fsz bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into _candidates/. |
Design principles
- embed is resumable and incremental. It loads any existing cache at the target path and only hashes / embeds files it has not seen. Atomic flush every 50 new files so a mid-run crash loses at most ~50 embeddings.
- Byte-identical duplicates are sha256-grouped at listing time. The canonical file is embedded once; other paths with the same hash become path_aliases in the cache. Every alias is materialized by cluster / refine / export-swap, so each on-disk location is represented.
- safe_dst_name always flattens the absolute path. This keeps output filenames stable across runs even as src_root changes between embed / extend / export invocations.
- Caches and outputs stay out of git via .gitignore; defaults live under work/.
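The sha256 grouping described above can be sketched as follows. This is a minimal illustration, not the code in sort_faces.py: the function names are hypothetical, and treating the first path in sorted order as the canonical file is an assumption of the sketch.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are never read into memory whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(paths):
    """Return {canonical_path: [alias_path, ...]} for byte-identical files.

    Assumption: the first path in sorted order is the canonical one.
    """
    by_hash = defaultdict(list)
    for p in sorted(Path(p) for p in paths):
        by_hash[sha256_of(p)].append(p)
    return {group[0]: group[1:] for group in by_hash.values()}
```

Only the canonical keys would need embedding; the alias lists correspond to what the README calls path_aliases.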
Typical end-to-end run
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted
# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"
# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"
# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"
# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"
# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"
# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
"$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
--raw-manifest "$OUT/raw_full/manifest.json" --candidates
Merging a new source into an existing result
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"
# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"
# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
"$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
--raw-manifest "$OUT/raw_full/manifest.json" --candidates
Importing hand-sorted folders as identities
When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool — the identity is asserted, not inferred. The
orchestration script work/build_folders.py covers this case:
- For each trusted folder, it filters cache records that fall under it, builds an identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so bystanders in group photos drop out, and writes a synthetic refine_manifest.json.
- It then routes each face record from a mixed folder (e.g. osrc/) into every identity centroid within a tight cosine cutoff (default 0.45). A multi-identity photo lands in multiple facesets; export-swap's per-bbox outlier filter ensures each faceset crops only its matching face.
- Finally it invokes cmd_export_swap against the synthetic manifest, renames the emitted .fsz bundles after the source folder, drops a <label>.txt marker, and merges the new entries into the canonical facesets_swap_ready/manifest.json (existing facesets are left untouched).
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done
# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"
# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
The script's config block (TRUSTED, START_NNN, OSRC_THRESHOLD, TOP_N, etc.)
is the only thing to edit when adding more hand-sorted folders later.
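The two-pass centroid build (cos-dist 0.55, then 0.45) can be sketched as below. This is an illustration under stated assumptions, not the code in work/build_folders.py: embeddings are held as plain Python lists rather than arrays from the cache, and all function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def centroid(vectors):
    """L2-normalized mean of a list of unit vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)


def cos_dist(a, b):
    """Cosine distance between two unit vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def two_pass_centroid(embeddings, thresholds=(0.55, 0.45)):
    """Average, drop faces beyond each threshold in turn, then re-average."""
    kept = [normalize(e) for e in embeddings]
    for t in thresholds:
        c = centroid(kept)
        survivors = [e for e in kept if cos_dist(e, c) <= t]
        if not survivors:  # degenerate folder: keep the previous set
            break
        kept = survivors
    return centroid(kept), kept
```

The loose first pass removes obvious bystanders so that they no longer pull the mean; the second pass then tightens around a cleaner centroid.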
Splitting an identity by era (age sub-clustering)
Long-running source corpora produce identities that span 10+ years. The 2009 face and the 2024 face of the same person sit in the same cluster (correctly — same identity), but a single averaged embedding pulled from that cluster blurs across ages. For face-swap output that should target a specific period, the identity needs to be split by era after the identity is established.
work/age_split_001.py is a worked example for faceset_001 and a template for
any other identity. The pipeline is:
- Probe first with work/check_faceset001_age.py: report intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with distinct year ranges, the identity is age-sortable.
- Seed centroid from the curated facesets_swap_ready/faceset_001/ (manifest provides face keys → cache rows).
- Wide recovery at cos-dist ≤ 0.55 against the seed under the original source roots, then quality-gate (face_short, blur, det_score) and one re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- Sub-cluster the survivors at cos-dist 0.35 (precomputed-distance agglomerative, average linkage).
- Anchor-based fragment assignment (not transitive merge, which caused year-drift): sub-clusters with size ≥ 20 are anchors; smaller fragments attach to the single nearest anchor only if both the centroid distance ≤ 0.40 AND the dominant EXIF year is within ±5 years. Fragments with no qualifying anchor remain standalone (and end up THIN-tagged downstream).
- EXIF year per source path with on-disk caching at work/cache/age_split_exif.json; the Windows-mount EXIF read is the slowest step, so re-runs after a parameter tweak are nearly instant.
- Per-era export mirrors export-swap: composite-quality rank, single-face square PNG crops, top-N + _all.fsz bundles, per-era manifest.json, human-readable <era>.txt marker. Eras with < 20 face records also drop a THIN.txt marker so they can be quarantined.
- Top-level manifest merge: era buckets are appended to facesets_swap_ready/manifest.json. Operationally the THIN buckets should be moved into _thin/ (and the manifest split into facesets + thin_eras), leaving only the substantive era buckets at the top level.
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py
# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
For the faceset_001 run on 5260-face nl_full.npz, this produced 6 substantive
era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under _thin/.
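The anchor-based fragment assignment is the least obvious step, so here is a small sketch of the rule as described above. The SubCluster type and assign_fragments function are hypothetical names for illustration; only the thresholds (size ≥ 20, distance ≤ 0.40, ±5 years) come from this README.

```python
from dataclasses import dataclass


@dataclass
class SubCluster:
    name: str
    size: int
    centroid: list  # unit-length embedding
    year: int       # dominant EXIF year


def cos_dist(a, b):
    """Cosine distance between two unit vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def assign_fragments(clusters, anchor_min=20, max_dist=0.40, max_year_gap=5):
    """Map each fragment name to an anchor name, or None if it stays standalone."""
    anchors = [c for c in clusters if c.size >= anchor_min]
    assignment = {}
    for frag in clusters:
        if frag.size >= anchor_min:
            continue  # anchors are never reassigned
        if not anchors:
            assignment[frag.name] = None
            continue
        # Only the single nearest anchor is considered; there is no fallback
        # to the second-nearest, which is what prevents transitive drift.
        nearest = min(anchors, key=lambda a: cos_dist(frag.centroid, a.centroid))
        ok = (cos_dist(frag.centroid, nearest.centroid) <= max_dist
              and abs(frag.year - nearest.year) <= max_year_gap)
        assignment[frag.name] = nearest.name if ok else None
    return assignment
```

Because a fragment can only join an anchor and never another fragment, a chain of small clusters cannot bridge two eras.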
Discovering new identities in a mixed bucket
A flat folder of mixed-identity photos (e.g. osrc/) is the opposite of the
hand-sorted case: identities have to be discovered, not asserted, but should
not collide with already-known identities or scramble their numbering.
work/cluster_osrc.py is the worked example. The pipeline:
- Filter cache to the source root, including any byte-aliased path that resolves under it.
- Drop already-covered faces by comparing each candidate to the centroids of the existing canonical facesets at the EXISTING_MATCH_THRESHOLD (default 0.45, the same cutoff as build_folders.py's osrc routing). These faces are already routed by extend / build_folders.py and shouldn't seed new facesets.
- Cluster the unmatched at cos-dist 0.55 (matches the extend default for the new-cluster phase).
- Apply refine-equivalent gates per cluster: face_short, blur, det_score, plus outlier rejection (cluster-centroid cos-dist > 0.55) for clusters of size ≥ 4. Keep clusters whose surviving unique-source-path count is ≥ MIN_FACES.
- Number new facesets past the existing maximum (START_NNN), so faceset_001..NNN are never disturbed.
- Synthesize a refine manifest and run cmd_export_swap against it, then move the resulting dirs into facesets_swap_ready/ and append to the top-level manifest.json. Each new dir gets an osrc.txt provenance marker.
Always run extend first so raw_full/ and facesets_full/ reflect the new
source — the cluster_osrc.py step then operates against the canonical
cache and doesn't need raw_full/ for input:
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
# person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
--refine-out "$OUT/facesets_full"
# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
# without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run
# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by
existing identities), this produced 6 new facesets (faceset_020..025,
sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to
export-swap's tighter min_face_short=100 gate).
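Two of the steps above, dropping already-covered faces and numbering past the existing maximum, can be sketched as follows. Both helpers are hypothetical: in cluster_osrc.py the starting number is the START_NNN config value, and deriving it from directory names is an assumption made here for illustration. Embeddings are assumed to be unit-length Python lists.

```python
import re

FACESET_RE = re.compile(r"^faceset_(\d{3,})$")


def next_faceset_number(existing_names):
    """Return one past the highest faceset_NNN found (1 if there are none)."""
    nums = [int(m.group(1)) for n in existing_names if (m := FACESET_RE.match(n))]
    return max(nums, default=0) + 1


def cos_dist(a, b):
    """Cosine distance between two unit vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def split_covered(faces, centroids, threshold=0.45):
    """Partition {key: unit_embedding} into (covered, unmatched) key lists."""
    covered, unmatched = [], []
    for key, emb in faces.items():
        nearest = min((cos_dist(emb, c) for c in centroids), default=float("inf"))
        (covered if nearest <= threshold else unmatched).append(key)
    return covered, unmatched


start = next_faceset_number(["faceset_001", "faceset_019", "_thin"])
print(f"faceset_{start:03d}")  # faceset_020
```

Only the unmatched keys would go on to clustering, so an existing identity cannot reappear under a new number.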
Key defaults
refine:
| flag | default | meaning |
|---|---|---|
| --initial-threshold | 0.55 | cosine distance for stage-1 clustering |
| --merge-threshold | 0.40 | centroid-level merge of over-split clusters |
| --outlier-threshold | 0.55 | drop face if cosine dist from centroid exceeds (only if cluster ≥ 4) |
| --min-faces | 15 | minimum unique images per faceset |
| --min-short | 90 | minimum short-edge pixels of face bbox |
| --min-blur | 40.0 | Laplacian-variance blur gate |
| --min-det-score | 0.6 | InsightFace detector score gate |
export-swap:
| flag | default | meaning |
|---|---|---|
| --top-n | 30 | size of the <faceset>_topN.fsz bundle |
| --outlier-threshold | 0.45 | tighter than refine; trims cluster boundary for averaging |
| --pad-ratio | 0.5 | padding around face bbox for PNG crop |
| --out-size | 512 | PNG output is square out_size × out_size |
| --min-face-short | 100 | export gate; stricter than refine's 90 |
| --candidates | off | rescue _singletons/ into _candidates/ for manual review |
| --candidate-match-threshold | 0.55 | cos-dist cutoff for singleton → existing faceset |
| --candidate-min-score | 0.40 | composite-quality floor for candidates |
The composite quality score in export-swap is 0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness, each normalized to [0, 1].
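As a worked example of the weighting, the sketch below computes the score from components that are already normalized to [0, 1]. How each component is derived and normalized is not described here, so the function only applies the weights; the names WEIGHTS and composite_quality are hypothetical.

```python
WEIGHTS = {
    "frontality": 0.30,
    "det_score": 0.20,
    "landmark_symmetry": 0.20,
    "face_size": 0.15,
    "sharpness": 0.15,
}


def composite_quality(components):
    """Weighted sum of the five components, each clamped to [0, 1]."""
    return sum(
        w * min(1.0, max(0.0, components[name])) for name, w in WEIGHTS.items()
    )


score = composite_quality({
    "frontality": 0.8,
    "det_score": 0.9,
    "landmark_symmetry": 0.7,
    "face_size": 0.5,
    "sharpness": 0.6,
})
print(round(score, 3))  # 0.725
```

The weights sum to 1.0, so the score also lies in [0, 1] and can be compared directly against --candidate-min-score.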
Downstream: roop-unleashed
The .fsz bundles emitted by export-swap drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.
Highly recommended at swap time: enable Select post-processing = GFPGAN with the Original/Enhanced image blend ratio = 0.85 (default is 0.65 which is conservative). See docs/analysis/facesets-downstream-refinement-evaluation.md for the full evaluation.
Layout
/opt/face-sets/
├─ README.md (this file)
├─ sort_faces.py (the tool)
├─ docs/
│ └─ analysis/
│ └─ facesets-downstream-refinement-evaluation.md
└─ work/ (gitignored except force-tracked .py)
├─ build_folders.py (hand-sorted-folder orchestration)
├─ check_faceset001_age.py (age-split readiness probe)
├─ age_split_001.py (age-split orchestration; faceset_001)
├─ cluster_osrc.py (mixed-bucket identity discovery)
├─ synthetic_refine_manifest.json (last build_folders.py output)
├─ synthetic_osrc_manifest.json (last cluster_osrc.py output)
├─ cache/
│ ├─ nl_full.npz (canonical cache + duplicates.json)
│ └─ age_split_exif.json (path → EXIF-year cache)
└─ logs/
└─ *.log (every long step writes here)