# face-sets

Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).

## Pipeline

`sort_faces.py` is a single-file CLI with seven subcommands:

| step        | what it does |
|-------------|--------------|
| embed       | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup. |
| cluster     | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest. |
| refine      | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`. |
| dedup       | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`. |
| extend      | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich      | Re-detect each cached face to persist landmark_2d_106, landmark_3d_68, pose (pitch/yaw/roll) into cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |

### Design principles

- **embed is resumable and incremental.** It loads any existing cache at the target path and only hashes / embeds files it has not seen. Atomic flush every 50 new files so a mid-run crash loses at most ~50 embeddings.
- **Byte-identical duplicates are sha256-grouped at listing time.** The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster` / `refine` / `export-swap`, so each on-disk location is represented.
- **`safe_dst_name` always flattens the absolute path.** This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- **Caches and outputs stay out of git** via `.gitignore`; defaults live under `work/`.

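
The sha256 grouping described above can be illustrated with a minimal sketch. The helper names `sha256_of` and `group_by_content` are hypothetical and are not taken from `sort_faces.py`, and picking the first path in sorted order as the canonical file is an assumption made only for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_content(paths):
    """Return {canonical_path: [alias_path, ...]} for byte-identical files.

    Assumption: the canonical path is the first one in sorted order;
    the real tool may choose differently.
    """
    by_hash = defaultdict(list)
    for path in sorted(paths):
        by_hash[sha256_of(path)].append(path)
    # Only the canonical file would be embedded; the rest are recorded as aliases.
    return {group[0]: group[1:] for group in by_hash.values()}
```

Under this scheme a file that appears in three folders costs one detection + embedding pass, and the two extra locations are carried along as aliases.
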
## Typical end-to-end run

```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted

# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"

# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"

# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"

# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"

# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"

# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Merging a new source into an existing result

```bash
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"

# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"

# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Importing hand-sorted folders as identities

When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool: the identity is asserted, not inferred. The
orchestration script `work/build_folders.py` covers this case:

- For each trusted folder, it filters cache records that fall under it, builds an
  identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so
  bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) into every
  identity centroid within a tight cosine cutoff (default 0.45). A multi-identity
  photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures
  each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the
  emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and
  merges the new entries into the canonical `facesets_swap_ready/manifest.json`
  (existing facesets are left untouched).

```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
    python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done

# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"

# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```

The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.)
is the only thing to edit when adding more hand-sorted folders later.

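
The two-pass centroid build and the mixed-folder routing described in this section can be sketched as follows. This is an illustration under stated assumptions, not the code in `work/build_folders.py`: the function names are hypothetical, embeddings are plain Python lists to keep the sketch dependency-free, and only the 0.55 → 0.45 schedule and the 0.45 routing cutoff come from the text above.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cos_dist(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    a, b = normalize(a), normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    """Unit-length mean of unit-length vectors."""
    unit = [normalize(v) for v in vectors]
    dim = len(unit[0])
    return normalize([sum(v[i] for v in unit) / len(unit) for i in range(dim)])


def two_pass_centroid(embeddings, thresholds=(0.55, 0.45)):
    """Drop faces far from the centroid, loosely first, then tightly.

    Returns (centroid, kept_embeddings). A pass that would reject every
    face is skipped (an assumption of this sketch).
    """
    kept = list(embeddings)
    for limit in thresholds:
        center = centroid(kept)
        survivors = [e for e in kept if cos_dist(e, center) <= limit]
        if not survivors:
            break
        kept = survivors
    return centroid(kept), kept


def route(face, centroids, cutoff=0.45):
    """Return every identity label whose centroid lies within the cutoff."""
    return [label for label, c in centroids.items() if cos_dist(face, c) <= cutoff]
```

The loose first pass removes obvious bystanders so that the second, tighter pass measures distances against a centroid that is already close to the true identity.
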
### Splitting an identity by era (age sub-clustering)

Long-running source corpora produce identities that span 10+ years. The 2009 face
and the 2024 face of the same person sit in the same cluster (correctly, since it
is the same identity), but a single averaged embedding pulled from that cluster
blurs across ages. For face-swap output that should target a specific period, the
identity needs to be split by era *after* the identity is established.

`work/age_split_001.py` is a worked example for `faceset_001` and a template for
any other identity. The pipeline is:

- **Probe first** with `work/check_faceset001_age.py`, which reports the
  intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds
  0.30..0.50, and the EXIF-year distribution per sub-cluster. If sub-clusters at
  0.35 align with distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/`
  (manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original
  source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one
  re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance
  agglomerative, average linkage).
- **Anchor-based fragment assignment** (not transitive merge, which caused
  year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments
  attach to the single nearest anchor only if both the centroid distance ≤ 0.40
  AND the dominant EXIF year is within ±5 years. Fragments with no qualifying
  anchor remain standalone (and end up THIN-tagged downstream).
- **EXIF year per source path** with on-disk caching at
  `work/cache/age_split_exif.json`. The Windows-mount EXIF read is the
  slowest step, so re-runs after a parameter tweak are nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face
  square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`,
  human-readable `<era>.txt` marker. Eras with < 20 face records also drop a
  `THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to
  `facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be
  moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`),
  leaving only the substantive era buckets at the top level.

```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py

# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```

For the `faceset_001` run on 5260-face `nl_full.npz`, this produced 6 substantive
era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under `_thin/`.

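
The anchor-based fragment assignment is the least obvious step in this section, so here is a minimal sketch of the rule as stated above. The `SubCluster` type and `assign_fragments` function are hypothetical names, not taken from `work/age_split_001.py`. The sketch also assumes one reading of "nearest": the nearest anchor is chosen by centroid distance first, and the distance and year conditions are then checked against that anchor only.

```python
import math
from dataclasses import dataclass


@dataclass
class SubCluster:
    name: str
    size: int        # number of face records in the sub-cluster
    centroid: list   # mean embedding (any length; normalized on use)
    year: int        # dominant EXIF year


def cos_dist(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def assign_fragments(clusters, anchor_min=20, max_dist=0.40, max_years=5):
    """Map each fragment name to an anchor name, or None if it stays standalone.

    Fragments never merge with each other, so there is no transitive
    chaining that could let an era drift across years.
    """
    anchors = [c for c in clusters if c.size >= anchor_min]
    assignment = {}
    for frag in clusters:
        if frag.size >= anchor_min:
            continue  # anchors are never reassigned
        target = None
        if anchors:
            nearest = min(anchors, key=lambda a: cos_dist(frag.centroid, a.centroid))
            close_enough = cos_dist(frag.centroid, nearest.centroid) <= max_dist
            same_era = abs(frag.year - nearest.year) <= max_years
            if close_enough and same_era:
                target = nearest.name
        assignment[frag.name] = target
    return assignment
```

Requiring both conditions means a fragment that looks similar but was shot a decade later stays standalone instead of widening an era's year range.
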
### Discovering new identities in a mixed bucket

A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the
hand-sorted case: identities have to be discovered, not asserted, but should
not collide with already-known identities or scramble their numbering.

`work/cluster_osrc.py` is the worked example. The pipeline:

- **Filter cache to the source root**, including any byte-aliased path that
  resolves under it.
- **Drop already-covered faces** by comparing each candidate to the centroids
  of the existing canonical facesets at the `EXISTING_MATCH_THRESHOLD`
  (default 0.45, the same cutoff as `build_folders.py`'s osrc routing). These
  faces are already routed by `extend` / `build_folders.py` and shouldn't
  seed new facesets.
- **Cluster the unmatched** at cos-dist 0.55 (matches the `extend` default
  for the new-cluster phase).
- **Apply `refine`-equivalent gates** per cluster: `face_short`, `blur`,
  `det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for
  clusters of size ≥ 4. Keep clusters whose surviving unique-source-path
  count is ≥ `MIN_FACES`.
- **Number new facesets past the existing maximum** (`START_NNN`), so
  `faceset_001..NNN` are never disturbed.
- **Synthesize a refine manifest** and run `cmd_export_swap` against it,
  then move the resulting dirs into `facesets_swap_ready/` and append to the
  top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance
  marker.

Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new
source. The `cluster_osrc.py` step then operates against the canonical
cache and doesn't need `raw_full/` for input:

```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
#    person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
    --refine-out "$OUT/facesets_full"

# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
#    without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run

# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```

For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by
existing identities), this produced 6 new facesets (`faceset_020..025`,
sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to
export-swap's tighter `min_face_short=100` gate).

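
Two of the steps above, dropping already-covered faces and numbering past the existing maximum, can be sketched in a few lines. The function names are hypothetical and not taken from `work/cluster_osrc.py`. Deriving the start number from existing directory names is also an assumption of this sketch; the text above describes `START_NNN` as a value in the script, which may be set by hand.

```python
import math
import re

FACESET_RE = re.compile(r"^faceset_(\d{3,})$")


def cos_dist(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def split_covered(faces, known_centroids, threshold=0.45):
    """Split {face_key: embedding} into (covered_keys, unmatched_keys).

    A face is covered when its nearest existing centroid is within the
    threshold; only unmatched faces go on to seed new clusters.
    """
    covered, unmatched = [], []
    for key, emb in faces.items():
        nearest = min(
            (cos_dist(emb, c) for c in known_centroids.values()),
            default=float("inf"),  # no known identities: nothing is covered
        )
        (covered if nearest <= threshold else unmatched).append(key)
    return covered, unmatched


def next_faceset_names(existing_names, count):
    """Return `count` new names numbered after the highest existing faceset."""
    used = [int(m.group(1)) for n in existing_names if (m := FACESET_RE.match(n))]
    start = max(used, default=0) + 1
    return [f"faceset_{start + i:03d}" for i in range(count)]
```

Because new names always start after the current maximum, manifests and downstream bundles that already reference `faceset_001..NNN` stay valid.
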
## Key defaults

`refine`:

| flag                    | default | meaning |
|-------------------------|--------:|---------|
| `--initial-threshold`   | 0.55    | cosine distance for stage-1 clustering |
| `--merge-threshold`     | 0.40    | centroid-level merge of over-split clusters |
| `--outlier-threshold`   | 0.55    | drop face if cosine dist from centroid exceeds (only if cluster ≥ 4) |
| `--min-faces`           | 15      | minimum unique images per faceset |
| `--min-short`           | 90      | minimum short-edge pixels of face bbox |
| `--min-blur`            | 40.0    | Laplacian-variance blur gate |
| `--min-det-score`       | 0.6     | InsightFace detector score gate |

`export-swap`:

| flag                          | default | meaning |
|-------------------------------|--------:|---------|
| `--top-n`                     | 30      | size of the `<faceset>_topN.fsz` bundle |
| `--outlier-threshold`         | 0.45    | tighter than refine; trims cluster boundary for averaging |
| `--pad-ratio`                 | 0.5     | padding around face bbox for PNG crop |
| `--out-size`                  | 512     | PNG output is square `out_size × out_size` |
| `--min-face-short`            | 100     | export gate; stricter than refine's 90 |
| `--candidates`                | off     | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` | 0.55    | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score`       | 0.40    | composite-quality floor for candidates |

The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each normalized to `[0, 1]`.

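
As a worked instance of the composite score formula, the weighted sum can be written out directly. This is a sketch rather than the code in `sort_faces.py`: the `composite_quality` name is hypothetical, the components are assumed to arrive already normalized, and the clamp to `[0, 1]` is a defensive assumption added for the example.

```python
# Weights as listed in the formula above; they sum to 1.0.
WEIGHTS = {
    "frontality": 0.30,
    "det_score": 0.20,
    "landmark_symmetry": 0.20,
    "face_size": 0.15,
    "sharpness": 0.15,
}


def composite_quality(components):
    """Weighted sum of per-face quality components, each in [0, 1]."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components[name]))  # clamp (assumption)
        score += weight * value
    return score


face = {
    "frontality": 0.9,
    "det_score": 0.8,
    "landmark_symmetry": 0.7,
    "face_size": 0.5,
    "sharpness": 0.6,
}
# 0.27 + 0.16 + 0.14 + 0.075 + 0.09 = 0.735 (up to float rounding)
print(round(composite_quality(face), 3))
```

A face scoring about 0.735 would clear the `--candidate-min-score` floor of 0.40 with room to spare; frontality carries the largest single weight, so a strongly turned head is the quickest way to fall down the ranking.
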
## Downstream: roop-unleashed

The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop. This is critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.

Highly recommended at swap time: enable **Select post-processing = GFPGAN** with the **Original/Enhanced image blend ratio = 0.85** (the default of 0.65 is conservative). See `docs/analysis/facesets-downstream-refinement-evaluation.md` for the full evaluation.

## Layout

```
/opt/face-sets/
├─ README.md                          (this file)
├─ sort_faces.py                      (the tool)
├─ docs/
│  └─ analysis/
│     └─ facesets-downstream-refinement-evaluation.md
└─ work/                              (gitignored except force-tracked .py)
   ├─ build_folders.py                (hand-sorted-folder orchestration)
   ├─ check_faceset001_age.py         (age-split readiness probe)
   ├─ age_split_001.py                (age-split orchestration; faceset_001)
   ├─ cluster_osrc.py                 (mixed-bucket identity discovery)
   ├─ synthetic_refine_manifest.json  (last build_folders.py output)
   ├─ synthetic_osrc_manifest.json    (last cluster_osrc.py output)
   ├─ cache/
   │  ├─ nl_full.npz                  (canonical cache + duplicates.json)
   │  └─ age_split_exif.json          (path → EXIF-year cache)
   └─ logs/
      └─ *.log                        (every long step writes here)
```