# face-sets

Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).

## Pipeline

`sort_faces.py` is a single-file CLI with seven subcommands:

| step        | what it does |
|-------------|--------------|
| embed       | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup. |
| cluster     | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest. |
| refine      | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`. |
| dedup       | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `.duplicates.json`. |
| extend      | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich      | Re-detect each cached face to persist `landmark_2d_106`, `landmark_3d_68`, and pose (pitch/yaw/roll) into the cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |

### Design principles

- **embed is resumable and incremental.** It loads any existing cache at the target path and only hashes / embeds files it has not seen. The cache is flushed atomically every 50 new files, so a mid-run crash loses at most ~50 embeddings.
- **Byte-identical duplicates are sha256-grouped at listing time.** The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster` / `refine` / `export-swap`, so each on-disk location is represented.
- **`safe_dst_name` always flattens the absolute path.** This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- **Caches and outputs stay out of git** via `.gitignore`; defaults live under `work/`.

## Typical end-to-end run

```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted

# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"

# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"

# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"

# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"

# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"

# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Merging a new source into an existing result

```bash
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"

# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"

# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

## Key defaults

`refine`:

| flag                  | default | meaning |
|-----------------------|--------:|---------|
| `--initial-threshold` |    0.55 | cosine distance for stage-1 clustering |
| `--merge-threshold`   |    0.40 | centroid-level merge of over-split clusters |
| `--outlier-threshold` |    0.55 | drop a face if its cosine distance from the centroid exceeds this (only if cluster ≥ 4) |
| `--min-faces`         |      15 | minimum unique images per faceset |
| `--min-short`         |      90 | minimum short-edge pixels of face bbox |
| `--min-blur`          |    40.0 | Laplacian-variance blur gate |
| `--min-det-score`     |     0.6 | InsightFace detector score gate |

`export-swap`:

| flag                          | default | meaning |
|-------------------------------|--------:|---------|
| `--top-n`                     |      30 | size of the `_topN.fsz` bundle |
| `--outlier-threshold`         |    0.45 | tighter than refine; trims cluster boundary for averaging |
| `--pad-ratio`                 |     0.5 | padding around face bbox for PNG crop |
| `--out-size`                  |     512 | PNG output is square `out_size × out_size` |
| `--min-face-short`            |     100 | export gate; stricter than refine's 90 |
| `--candidates`                |     off | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` |    0.55 | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score`       |    0.40 | composite-quality floor for candidates |

The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each normalized to `[0, 1]`.

## Downstream: roop-unleashed

The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop. This is critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.
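
To illustrate why a stray second face in a crop matters, here is a minimal sketch with synthetic data. It is not roop-unleashed's loader code: the helpers `unit` and `mean_identity`, the noise level, and the random "stray" vectors are all assumptions made for illustration. The only grounded detail is that buffalo_l recognition embeddings are 512-dimensional.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit L2 norm."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v)


def mean_identity(embeddings):
    """Average L2-normalized embeddings and re-normalize the result (hypothetical helper)."""
    stacked = np.stack([unit(e) for e in embeddings])
    return unit(stacked.mean(axis=0))


rng = np.random.default_rng(0)

# Synthetic "true" identity direction in a 512-d embedding space.
target = unit(rng.normal(size=512))

# Ten crops of the same person: the target direction plus small noise (assumed noise level).
same_person = [unit(target + 0.01 * rng.normal(size=512)) for _ in range(10)]

# Three unrelated faces, e.g. bystanders that a loader re-detected in loose crops.
strays = [unit(rng.normal(size=512)) for _ in range(3)]

# Cosine similarity of each averaged identity to the true direction.
clean_sim = float(mean_identity(same_person) @ target)
dirty_sim = float(mean_identity(same_person + strays) @ target)

print(f"clean average vs target:        {clean_sim:.4f}")
print(f"contaminated average vs target: {dirty_sim:.4f}")
```

In this toy setup the contaminated average sits measurably further from the target direction than the clean one, which is the effect the single-face crops are meant to avoid.
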

Highly recommended at swap time: enable **Select post-processing = GFPGAN** with the **Original/Enhanced image blend ratio = 0.85** (the default, 0.65, is conservative). See `docs/analysis/facesets-downstream-refinement-evaluation.md` for the full evaluation.

## Layout

```
/opt/face-sets/
├─ README.md                     (this file)
├─ sort_faces.py                 (the tool)
├─ docs/
│  └─ analysis/
│     └─ facesets-downstream-refinement-evaluation.md
└─ work/                         (gitignored)
   ├─ cache/
   │  └─ nl_full.npz             (canonical cache + duplicates.json)
   └─ logs/
      └─ *.log                   (every long step writes here)
```
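
For reference, the composite quality score listed under Key defaults can be sketched as a plain weighted sum. The weights are the ones stated above; the function name `composite_quality`, the dictionary keys, and the defensive clamping are assumptions for illustration, and how `sort_faces.py` normalizes each component is not shown here.

```python
# Weights copied from the composite quality formula in this README.
WEIGHTS = {
    "frontality": 0.30,
    "det_score": 0.20,
    "landmark_symmetry": 0.20,
    "face_size": 0.15,
    "sharpness": 0.15,
}


def composite_quality(components):
    """Weighted sum of per-face components, each expected to lie in [0, 1].

    Hypothetical helper: the component names mirror the README formula,
    but the real tool may compute and store them differently.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Clamp defensively so an out-of-range input cannot dominate the score.
        value = min(1.0, max(0.0, float(components[name])))
        total += weight * value
    return total


# Example: a frontal, sharp face that is small in the frame (made-up values).
example = {
    "frontality": 0.9,
    "det_score": 0.8,
    "landmark_symmetry": 0.85,
    "face_size": 0.3,
    "sharpness": 0.7,
}
print(f"composite score: {composite_quality(example):.3f}")
```

Because the weights sum to 1, the score stays in `[0, 1]`, which makes it directly comparable with the `--candidate-min-score` floor.
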