Peter 484278e70e Rewrite pipeline: resumable embed, byte-dedup, extend, dedup report

- embed: sha256-based dedup at listing (embed each unique hash once, carry
  other paths as aliases via a top-level path_aliases dict); resumable from
  any existing cache; atomic incremental flush every 50 files; explicit
  skip-ext filtering; schema bumped with processed_paths + path_aliases.
- extend: new subcommand that merges new embeddings into an existing raw +
  facesets output without renumbering. Nearest person-centroid match above
  threshold, unmatched faces re-clustered into new person_NNN / _singletons.
  Optional --refine-out also extends facesets by centroid + quality gate.
- dedup: new subcommand producing byte-identical + visual near-duplicate
  groups as a JSON report.
- cluster/refine: fan every placement across canonical + aliases so each
  on-disk location gets represented.
- safe_dst_name now always flattens the absolute path so filenames stay
  stable across runs when src_root shifts (fixes duplicate-copy bug that
  surfaced during the lzbkp_red extend).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-23 19:21:50 +02:00


face-sets

Sorts photos by face similarity using InsightFace embeddings and agglomerative clustering, then refines the clusters into faceset-ready folders for downstream face-swap tooling (roop-unleashed, etc.).

Pipeline

sort_faces.py is a single-file CLI with four subcommands:

| step | what it does |
| --- | --- |
| embed | Recursively scan a source tree, detect + embed every face, write `.npz` cache |
| cluster | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/` |
| dedup | Post-hoc near-duplicate report: byte-identical groups + visual near-dupes (same face + same size within a tight cosine threshold) |
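
The visual near-duplicate criterion in the dedup row can be sketched roughly as below. This is an illustration only, not the code in sort_faces.py: the `Face` record, the `near_duplicate_pairs` helper, and the 0.05 threshold are hypothetical, and "same size" is read here as identical image dimensions, which is an assumption.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Face:
    path: str
    size: tuple        # (width, height); assumed meaning of "same size"
    embedding: list    # face embedding as a list of floats


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def near_duplicate_pairs(faces, threshold=0.05):
    """Return path pairs whose sizes match and whose embeddings nearly coincide."""
    pairs = []
    for a, b in combinations(faces, 2):
        if a.size == b.size and cosine_distance(a.embedding, b.embedding) <= threshold:
            pairs.append((a.path, b.path))
    return pairs


faces = [
    Face("a.jpg", (800, 600), [1.0, 0.0, 0.0]),
    Face("a_copy.jpg", (800, 600), [0.999, 0.01, 0.0]),   # re-encoded copy
    Face("a_small.jpg", (400, 300), [1.0, 0.0, 0.0]),     # same face, resized
    Face("b.jpg", (800, 600), [0.0, 1.0, 0.0]),           # different person
]
print(near_duplicate_pairs(faces))
```

Only the `a.jpg` / `a_copy.jpg` pair is reported here: the resized copy fails the size check and the other person fails the distance check.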

embed is resumable and incremental: it loads any existing cache at the target path and hashes/embeds only files it has not processed before. A periodic flush (default: every 50 new files) writes the cache atomically, so a mid-run crash loses at most the files processed since the last flush, i.e. fewer than 50 at the default setting.
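
The resume-and-flush behaviour can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual implementation: it stores the cache as JSON instead of `.npz` to stay standard-library only, and `load_cache`, `flush_atomic`, `embed_incremental`, and `embed_fn` are hypothetical names. Only the `processed_paths` key and the flush interval of 50 come from the description above.

```python
import json
import os
import tempfile


def load_cache(cache_path):
    """Return the existing cache, or an empty one if none exists yet."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    return {"processed_paths": [], "embeddings": {}}


def flush_atomic(cache, cache_path):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(cache, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # The rename is atomic on POSIX, so readers see the old or the new file.
        os.replace(tmp_path, cache_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def embed_incremental(paths, cache_path, embed_fn, flush_every=50):
    """Embed only paths missing from the cache, flushing every few new files."""
    cache = load_cache(cache_path)
    done = set(cache["processed_paths"])
    new_since_flush = 0
    for path in paths:
        if path in done:
            continue  # already handled in an earlier run
        cache["embeddings"][path] = embed_fn(path)
        cache["processed_paths"].append(path)
        done.add(path)
        new_since_flush += 1
        if new_since_flush >= flush_every:
            flush_atomic(cache, cache_path)
            new_since_flush = 0
    flush_atomic(cache, cache_path)  # final flush for the remainder
    return cache
```

Writing the temp file into the same directory as the target matters: a rename across filesystems is not atomic, so the temp file should not live in the system temp directory.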

Byte-identical duplicates are detected via sha256 during the listing phase. The canonical file is embedded once; other paths with the same hash are recorded as aliases in the cache's top-level path_aliases dict. cluster and refine fan every placement out across the canonical path and its aliases, so each on-disk location ends up represented in the output.
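
A rough sketch of the hash-based grouping and the later fan-out is shown below. The helper names are hypothetical, and two details are assumptions of the sketch rather than facts about the cache: that `path_aliases` maps each canonical path to a list of alias paths, and that the first path in sorted order becomes the canonical one.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_canonical_and_aliases(paths):
    """Return (canonical_paths, path_aliases) for byte-identical files."""
    canonical_by_hash = {}
    path_aliases = {}
    for path in sorted(paths):
        digest = sha256_of(path)
        if digest in canonical_by_hash:
            # Same bytes as an earlier file: record as alias, do not embed again.
            path_aliases.setdefault(canonical_by_hash[digest], []).append(path)
        else:
            canonical_by_hash[digest] = path
    return list(canonical_by_hash.values()), path_aliases


def fan_out(canonical, path_aliases):
    """All on-disk locations that share a canonical file's placement."""
    return [canonical] + path_aliases.get(canonical, [])


with tempfile.TemporaryDirectory() as root:
    for name, data in [("a.jpg", b"same"), ("b.jpg", b"same"), ("c.jpg", b"other")]:
        with open(os.path.join(root, name), "wb") as fh:
            fh.write(data)
    paths = [os.path.join(root, n) for n in ("a.jpg", "b.jpg", "c.jpg")]
    canon, aliases = split_canonical_and_aliases(paths)
    print([os.path.basename(p) for p in canon])                       # files to embed
    print([os.path.basename(p) for p in fan_out(canon[0], aliases)])  # placements
```

In this toy tree only `a.jpg` and `c.jpg` would be embedded, while `b.jpg` rides along as an alias of `a.jpg` and still receives a placement.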

Cache and outputs are kept out of the repo via .gitignore; defaults live under work/.

Typical run

# 1. Embed (CPU; InsightFace buffalo_l). Caches faces + metadata. Resumable.
python sort_faces.py embed /mnt/x/src/nl work/cache/nl_full.npz

# 2. Raw clusters (every multi-face cluster -> a person_NNN/ folder).
python sort_faces.py cluster work/cache/nl_full.npz /mnt/e/temp_things/fcswp/nl_sorted/raw_full

# 3. Refined facesets (filters for faceset-ready quality).
python sort_faces.py refine  work/cache/nl_full.npz /mnt/e/temp_things/fcswp/nl_sorted/facesets_full

# 4. (Optional) report on byte-identical + visual near-duplicates.
python sort_faces.py dedup   work/cache/nl_full.npz
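
For repeated runs, the four steps above could be wrapped in a small driver. The sketch below is hypothetical (`pipeline_commands` and `run_pipeline` are not part of the repo); it only assembles the same positional arguments shown above and, by default, prints the commands without running them.

```python
import subprocess
import sys


def pipeline_commands(src, cache, raw_out, refined_out, script="sort_faces.py"):
    """Build the argv lists for the four steps, in the order shown above."""
    return [
        [sys.executable, script, "embed", src, cache],
        [sys.executable, script, "cluster", cache, raw_out],
        [sys.executable, script, "refine", cache, refined_out],
        [sys.executable, script, "dedup", cache],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step


commands = pipeline_commands(
    "/mnt/x/src/nl",
    "work/cache/nl_full.npz",
    "/mnt/e/temp_things/fcswp/nl_sorted/raw_full",
    "/mnt/e/temp_things/fcswp/nl_sorted/facesets_full",
)
run_pipeline(commands)  # dry run: prints the four commands only
```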

Refine defaults

| flag | default | meaning |
| --- | --- | --- |
| `--initial-threshold` | 0.55 | cosine-distance threshold for stage-1 clustering |
| `--merge-threshold` | 0.40 | centroid-level merge of over-split clusters |
| `--outlier-threshold` | 0.55 | drop a face if its cosine distance from the cluster centroid exceeds this (applied only when the cluster size is ≥ 4) |
| `--min-faces` | 15 | minimum unique images per faceset |
| `--min-short` | 90 | minimum short-edge length of the face bbox, in pixels |
| `--min-blur` | 40.0 | Laplacian-variance blur gate |
| `--min-det-score` | 0.6 | InsightFace detector score gate |
| `--mode` | copy | how files are placed: copy / move / symlink |
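
To make the gates concrete, here is a minimal sketch of how the quality gate and the outlier rejection might combine, using the defaults from the table. The `FaceRecord` type and helper names are hypothetical, and several details are assumptions: that the gates are inclusive (`>=`), that the centroid is a plain mean of the embeddings, and that the cluster-size rule counts faces.

```python
import math
from dataclasses import dataclass


@dataclass
class FaceRecord:
    short_edge: int     # short edge of the face bbox, in pixels
    blur: float         # Laplacian variance of the face crop
    det_score: float    # detector confidence
    embedding: list     # face embedding as a list of floats


def passes_quality(face, min_short=90, min_blur=40.0, min_det_score=0.6):
    """True if the face clears all three quality gates."""
    return (face.short_edge >= min_short
            and face.blur >= min_blur
            and face.det_score >= min_det_score)


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def centroid(embeddings):
    """Component-wise mean of the embeddings."""
    dims = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dims)]


def reject_outliers(faces, outlier_threshold=0.55, min_cluster=4):
    """Drop faces far from the centroid; small clusters are left untouched."""
    if len(faces) < min_cluster:
        return list(faces)
    center = centroid([f.embedding for f in faces])
    return [f for f in faces
            if cosine_distance(f.embedding, center) <= outlier_threshold]


cluster = [
    FaceRecord(120, 85.0, 0.91, [1.0, 0.0]),
    FaceRecord(140, 60.0, 0.88, [1.0, 0.0]),
    FaceRecord(100, 55.0, 0.75, [1.0, 0.0]),
    FaceRecord(130, 70.0, 0.80, [0.0, 1.0]),   # points away from the others
]
kept = reject_outliers([f for f in cluster if passes_quality(f)])
print(len(kept))
```

The fourth face sits at a cosine distance of roughly 0.68 from the centroid, above the 0.55 default, so it is dropped and three faces remain.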

Prior runs (as of 2026-04-22)

work/cache/kos11.npz — 181 images, 333 faces from Kos '11/ → kos11_sorted/
work/cache/nl_all.npz — 916 images, 1396 faces from Neuer Ordner (2)/New Folder/ → nl_sorted/raw/, refined to 6 facesets (197, 120, 91, 47, 23, 18 images)

Output lives outside the repo at /mnt/e/temp_things/fcswp/.