# face-sets

Sort photos by similar face using InsightFace embeddings + agglomerative clustering, then refine into faceset-ready folders for downstream face-swap tooling (roop-unleashed, etc.).

## Pipeline

`sort_faces.py` is a single-file CLI with four subcommands:

| step | what it does |
|---|---|
| embed | Recursively scan a source tree, detect + embed every face, write `.npz` cache |
| cluster | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/` |
| dedup | Post-hoc near-duplicate report: byte-identical groups + visual near-dupes (same face + same size within a tight cosine threshold) |

`embed` is resumable and incremental: it loads any existing cache at the target path and only hashes/embeds files it hasn't processed before. A periodic flush (default every 50 new files) writes the cache atomically, so a mid-run crash loses at most a few dozen embeddings.

Byte-identical duplicates are detected via sha256 during the listing phase. The canonical file is embedded once; other paths with the same hash are carried as `aliases` on the cache's top-level `path_aliases` dict. Every alias is materialized by `cluster`/`refine`, so each on-disk location ends up represented in the output.

Cache and outputs are kept out of the repo via `.gitignore`; defaults live under `work/`.

## Typical run

```bash
# 1. Embed (CPU; InsightFace buffalo_l). Caches faces + metadata. Resumable.
python sort_faces.py embed /mnt/x/src/nl work/cache/nl_full.npz

# 2. Raw clusters (every multi-face cluster -> a person_NNN/ folder).
python sort_faces.py cluster work/cache/nl_full.npz /mnt/e/temp_things/fcswp/nl_sorted/raw_full

# 3. Refined facesets (filters for faceset-ready quality).
python sort_faces.py refine work/cache/nl_full.npz /mnt/e/temp_things/fcswp/nl_sorted/facesets_full

# 4. (Optional) report on byte-identical + visual near-duplicates.
python sort_faces.py dedup work/cache/nl_full.npz
```

## Refine defaults

| flag | default | meaning |
|---|---|---|
| `--initial-threshold` | 0.55 | cosine distance for stage-1 clustering |
| `--merge-threshold` | 0.40 | centroid-level merge of over-split clusters |
| `--outlier-threshold` | 0.55 | drop face if cosine dist from cluster centroid exceeds this (only if cluster ≥ 4) |
| `--min-faces` | 15 | minimum unique images per faceset |
| `--min-short` | 90 | minimum short-edge pixels of face bbox |
| `--min-blur` | 40.0 | Laplacian-variance blur gate |
| `--min-det-score` | 0.6 | InsightFace detector score gate |
| `--mode` | copy | copy / move / symlink |

## Prior runs (as of 2026-04-22)

- `work/cache/kos11.npz`: 181 images, 333 faces from `Kos '11/` → `kos11_sorted/`
- `work/cache/nl_all.npz`: 916 images, 1396 faces from `Neuer Ordner (2)/New Folder/` → `nl_sorted/raw/`, refined to 6 facesets (197, 120, 91, 47, 23, 18 images)

Output lives outside the repo at `/mnt/e/temp_things/fcswp/`.
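
For readers curious how the byte-identical alias handling described under Pipeline might work, here is a minimal sketch. It is not the code in `sort_faces.py`: the helper names `sha256_of` and `group_aliases` are hypothetical, and picking the lexicographically first path as the canonical one is an assumption. Only the `path_aliases` name comes from this README.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large images are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_aliases(paths):
    """Split paths into canonical files and their byte-identical aliases.

    Returns (canonical, path_aliases): `canonical` is the sorted list of
    paths that would be embedded, and `path_aliases` maps a canonical path
    to the other paths sharing its sha256. The tie-break (first path in
    sorted order wins) is an assumption made for this sketch.
    """
    by_hash = defaultdict(list)
    for p in paths:
        by_hash[sha256_of(Path(p))].append(str(p))

    canonical = []
    path_aliases = {}
    for same in by_hash.values():
        same.sort()
        first, rest = same[0], same[1:]
        canonical.append(first)
        if rest:  # only record entries that actually have duplicates
            path_aliases[first] = rest
    return sorted(canonical), path_aliases
```

Under this scheme only the canonical paths are passed to the detector, while `path_aliases` lets later steps copy or link every on-disk location into the output folders.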