# face-sets

Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).

## Pipeline

`sort_faces.py` is a single-file CLI with seven subcommands:

| step | what it does |
|-------------|--------------|
| embed | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup. |
| cluster | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest. |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`. |
| dedup | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `.duplicates.json`. |
| extend | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich | Re-detect each cached face to persist `landmark_2d_106`, `landmark_3d_68`, and pose (pitch/yaw/roll) into the cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |

### Design principles

- **embed is resumable and incremental.** It loads any existing cache at the target path and only hashes / embeds files it has not seen. The cache is flushed atomically every 50 new files, so a mid-run crash loses at most ~50 embeddings.
- **Byte-identical duplicates are sha256-grouped at listing time.** The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster` / `refine` / `export-swap`, so each on-disk location is represented.
- **`safe_dst_name` always flattens the absolute path.** This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- **Caches and outputs stay out of git** via `.gitignore`; defaults live under `work/`.

## Typical end-to-end run

```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted

# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"

# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"

# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"

# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"

# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"

# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Merging a new source into an existing result

```bash
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"

# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"

# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Importing hand-sorted folders as identities

When source folders are already hand-sorted by person (one folder per identity), the clustering path is the wrong tool: the identity is asserted, not inferred.
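
To make "asserted, not inferred" concrete: with hand-sorted folders, the identity label can be read straight from each record's path rather than produced by a clustering step. The sketch below is illustrative only. The helper `label_by_folder`, the example paths, and the folder names are hypothetical and are not taken from `work/build_folders.py` or `sort_faces.py`.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def label_by_folder(paths, trusted_root):
    """Group paths by the first folder below trusted_root.

    Hypothetical helper: the folder name is treated as the asserted identity.
    """
    root = PurePosixPath(trusted_root)
    groups = defaultdict(list)
    for p in paths:
        try:
            rel = PurePosixPath(p).relative_to(root)
        except ValueError:
            continue  # path is not under the trusted root
        if len(rel.parts) < 2:
            continue  # file sits directly in the root, so no identity folder
        groups[rel.parts[0]].append(p)
    return dict(groups)


# Hypothetical example paths, not from the project.
paths = [
    "/photos/sorted/alice/img_001.jpg",
    "/photos/sorted/alice/trip/img_002.jpg",
    "/photos/sorted/bob/img_003.jpg",
    "/photos/unsorted/img_004.jpg",
    "/photos/sorted/readme.txt",
]
labels = label_by_folder(paths, "/photos/sorted")
print(labels)
```

Files in nested subfolders still inherit the top-level folder's label here, which is one possible reading of "one folder per identity"; a real importer may choose differently.
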
The orchestration script `work/build_folders.py` covers this case:

- For each trusted folder, it filters cache records that fall under it, builds an identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) into every identity centroid within a tight cosine cutoff (default 0.45). A multi-identity photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the emitted `.fsz` bundles after the source folder, drops a `