Peter 03a0c75531 Document hand-sorted-folder import + age-split workflow
- README: document work/build_folders.py (hand-sorted folder identities)
  and the new age-split workflow for splitting a long-running identity
  into era-specific facesets after clustering.
- Force-track work/age_split_001.py and work/check_faceset001_age.py;
  these are the worked example + readiness probe for faceset_001 and
  the template for splitting any other identity by EXIF era.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:08:25 +02:00

# face-sets
Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).
## Pipeline
`sort_faces.py` is a single-file CLI with seven subcommands:
| step | what it does |
|-------------|-------------------------------------------------------------------------------------------------------------|
| embed | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup. |
| cluster | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest. |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`. |
| dedup | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`. |
| extend | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich | Re-detect each cached face to persist landmark_2d_106, landmark_3d_68, pose (pitch/yaw/roll) into cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |
### Design principles
- **embed is resumable and incremental.** It loads any existing cache at the target path and only hashes / embeds files it has not seen. Atomic flush every 50 new files so a mid-run crash loses at most ~50 embeddings.
- **Byte-identical duplicates are sha256-grouped at listing time.** The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster` / `refine` / `export-swap`, so each on-disk location is represented.
- **`safe_dst_name` always flattens the absolute path.** This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- **Caches and outputs stay out of git** via `.gitignore`; defaults live under `work/`.
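The sha256 grouping described above can be sketched as follows. This is a minimal illustration, not the code in `sort_faces.py`: the helper names `sha256_of` and `group_by_content` are hypothetical, and picking the first path in sorted order as the canonical file is an assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_content(paths):
    """Return {canonical_path: [alias_paths]} for byte-identical files.

    Assumption: the first path in sorted order is the canonical one;
    the real tool may choose differently.
    """
    by_hash = defaultdict(list)
    for p in sorted(Path(x) for x in paths):
        by_hash[sha256_of(p)].append(p)
    # Only the canonical file would be embedded; the rest become aliases.
    return {group[0]: group[1:] for group in by_hash.values()}
```

Each value in the returned mapping plays the role of the `path_aliases` entry for its canonical file.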
## Typical end-to-end run
```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted
# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"
# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"
# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"
# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"
# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"
# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
"$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
--raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
### Merging a new source into an existing result
```bash
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"
# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"
# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
"$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
--raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
### Importing hand-sorted folders as identities
When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool — the identity is asserted, not inferred. The
orchestration script `work/build_folders.py` covers this case:
- For each trusted folder, it filters cache records that fall under it, builds an
identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so
bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) into every
identity centroid within a tight cosine cutoff (default 0.45). A multi-identity
photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures
each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the
emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and
merges the new entries into the canonical `facesets_swap_ready/manifest.json`
(existing facesets are left untouched).
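The two-pass centroid step can be illustrated with a small dependency-free sketch. It is not the code in `work/build_folders.py`: the function names are hypothetical, plain lists stand in for the cached embedding arrays, and re-centering on the first-pass survivors before the tighter second pass is an assumption about how the two thresholds are applied.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cos_dist(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    """Unit-length mean of a list of vectors."""
    dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    return normalize(mean)


def two_pass_centroid(embeddings, thresholds=(0.55, 0.45)):
    """Build an identity centroid, dropping outliers in two tightening passes."""
    kept = [normalize(e) for e in embeddings]
    for t in thresholds:
        c = centroid(kept)
        survivors = [e for e in kept if cos_dist(e, c) <= t]
        if not survivors:  # never let a pass empty the identity
            break
        kept = survivors
    return centroid(kept), kept
```

With four similar faces and one bystander, for example `[[1, 0], [0.99, 0.1], [0.98, -0.1], [1, 0.05], [0, 1]]`, the first pass drops the bystander and the second pass re-centers on the remaining four.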
```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done
# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"
# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```
The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.)
is the only thing to edit when adding more hand-sorted folders later.
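The mixed-folder routing rule (one face may join every identity whose centroid lies within the cutoff) can be sketched as below. The function name `route_face` and the dictionary of centroids are hypothetical; only the 0.45 default comes from the description above.

```python
import math


def cos_dist(a, b):
    """Cosine distance between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def route_face(embedding, centroids, cutoff=0.45):
    """Return the labels of every identity within `cutoff`, nearest first.

    `centroids` maps an identity label to its centroid vector. An empty
    result means the face belongs to none of the trusted identities.
    """
    hits = sorted((cos_dist(embedding, c), label) for label, c in centroids.items())
    return [label for dist, label in hits if dist <= cutoff]
```

Routing is per face, so a group photo with two known people contributes one face record to each of their facesets.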
### Splitting an identity by era (age sub-clustering)
Long-running source corpora produce identities that span 10+ years. The 2009 face
and the 2024 face of the same person sit in the same cluster (correctly — same
identity), but a single averaged embedding pulled from that cluster blurs across
ages. For face-swap output that should target a specific period, the identity
needs to be split by era *after* the identity is established.
`work/age_split_001.py` is a worked example for `faceset_001` and a template for
any other identity. The pipeline is:
- **Probe first** with `work/check_faceset001_age.py` — report intra-cluster
pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and
EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with
distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/`
(manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original
source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one
re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance
agglomerative, average linkage).
- **Anchor-based fragment assignment** (not transitive merge — that caused
year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments
attach to the single nearest anchor only if both the centroid distance ≤ 0.40
AND the dominant EXIF year is within ±5 years. Fragments with no qualifying
anchor remain standalone (and end up THIN-tagged downstream).
- **EXIF year per source path** with on-disk caching at
`work/cache/age_split_exif.json` — the Windows-mount EXIF read is the
slowest step, so re-runs after a parameter tweak are nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face
square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`,
human-readable `<era>.txt` marker. Eras with < 20 face records also drop a
`THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to
`facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be
moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`),
leaving only the substantive era buckets at the top level.
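The anchor-based fragment assignment is the least obvious step, so here is a hedged sketch of the rule as described. The `SubCluster` record and `assign_fragments` function are hypothetical names, not the ones used in `work/age_split_001.py`; the thresholds (20, 0.40, ±5 years) are the ones quoted above.

```python
import math
from dataclasses import dataclass


@dataclass
class SubCluster:
    name: str
    centroid: list
    size: int
    dominant_year: int


def cos_dist(a, b):
    """Cosine distance between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def assign_fragments(clusters, anchor_min=20, max_dist=0.40, max_year_gap=5):
    """Map each fragment name to an anchor name, or None if it stays standalone."""
    anchors = [c for c in clusters if c.size >= anchor_min]
    assignment = {}
    for frag in clusters:
        if frag.size >= anchor_min:
            continue  # anchors are never reassigned
        # Only the single nearest anchor is considered; there is no
        # fallback to the second-nearest and no transitive merging.
        nearest = min(
            anchors,
            key=lambda a: cos_dist(frag.centroid, a.centroid),
            default=None,
        )
        ok = (
            nearest is not None
            and cos_dist(frag.centroid, nearest.centroid) <= max_dist
            and abs(frag.dominant_year - nearest.dominant_year) <= max_year_gap
        )
        assignment[frag.name] = nearest.name if ok else None
    return assignment
```

Because each fragment is tested against one anchor only, a chain of small clusters cannot drag an era bucket across years the way a transitive merge can.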
```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py
# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```
For the `faceset_001` run on the 5260-face `nl_full.npz`, this produced 6 substantive
era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under `_thin/`.
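The on-disk EXIF-year cache that makes re-runs cheap can be sketched like this. It is an illustration only: `years_for` is a hypothetical name, the EXIF reader is passed in as a callable (`read_year`) because the actual reading code is not shown here, and the write-then-rename step is an assumption borrowed from how `embed` flushes its cache.

```python
import json
import os
from pathlib import Path


def load_cache(cache_path):
    """Load the path -> year mapping, or start empty if no cache exists yet."""
    p = Path(cache_path)
    return json.loads(p.read_text()) if p.exists() else {}


def years_for(paths, cache_path, read_year):
    """Return {path: year}, calling `read_year` only for uncached paths.

    `read_year` is any callable taking a path string and returning an
    int year or None; None is cached too, so unreadable files are not
    retried on every run.
    """
    cache = load_cache(cache_path)
    missing = [p for p in paths if p not in cache]
    for p in missing:
        cache[p] = read_year(p)
    if missing:
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written cache behind.
        tmp = f"{cache_path}.tmp"
        with open(tmp, "w") as fh:
            json.dump(cache, fh)
        os.replace(tmp, cache_path)
    return {p: cache[p] for p in paths}
```

After the first full pass, changing a clustering threshold and re-running touches only the JSON file, not the slow mounted source tree.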
## Key defaults
`refine`:
| flag | default | meaning |
|-------------------------|--------:|---------|
| `--initial-threshold` | 0.55 | cosine distance for stage-1 clustering |
| `--merge-threshold` | 0.40 | centroid-level merge of over-split clusters |
| `--outlier-threshold` | 0.55 | drop a face whose cosine distance from the centroid exceeds this value (only if the cluster has ≥ 4 faces) |
| `--min-faces` | 15 | minimum unique images per faceset |
| `--min-short` | 90 | minimum short-edge pixels of face bbox |
| `--min-blur` | 40.0 | Laplacian-variance blur gate |
| `--min-det-score` | 0.6 | InsightFace detector score gate |
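As a rough illustration of how the three quality flags combine, the gate can be read as a simple conjunction. The function below is a sketch, not the tool's code: the record keys `face_short`, `blur` and `det_score` follow the names used elsewhere in this README, and treating each threshold as inclusive (`>=`) is an assumption.

```python
def passes_quality_gate(face, min_short=90, min_blur=40.0, min_det_score=0.6):
    """Return True if a face record clears all three refine quality gates.

    `face` is a dict with `face_short` (short edge of the bbox in pixels),
    `blur` (Laplacian variance, higher is sharper) and `det_score`
    (detector confidence).
    """
    return (
        face["face_short"] >= min_short
        and face["blur"] >= min_blur
        and face["det_score"] >= min_det_score
    )
```

A face that is large and confidently detected but motion-blurred (say `blur` of 12.0) is rejected, since every gate must pass.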
`export-swap`:
| flag | default | meaning |
|-------------------------------|--------:|---------|
| `--top-n` | 30 | size of the `<faceset>_topN.fsz` bundle |
| `--outlier-threshold` | 0.45 | tighter than refine; trims cluster boundary for averaging |
| `--pad-ratio` | 0.5 | padding around face bbox for PNG crop |
| `--out-size` | 512 | PNG output is square `out_size × out_size` |
| `--min-face-short` | 100 | export gate; stricter than refine's 90 |
| `--candidates` | off | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` | 0.55 | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score` | 0.40 | composite-quality floor for candidates |
The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each normalized to `[0, 1]`.
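Written out as code, the weighted sum looks like the sketch below. The weights are the ones quoted above; the function name, the dictionary keys, and the clamping of each component to `[0, 1]` are illustrative assumptions, since how each component is normalized is not described here.

```python
# Weights from the formula above; they sum to 1.0.
WEIGHTS = {
    "frontality": 0.30,
    "det_score": 0.20,
    "landmark_symmetry": 0.20,
    "face_size": 0.15,
    "sharpness": 0.15,
}


def composite_quality(components):
    """Weighted sum of the five components, each clamped to [0, 1]."""
    return sum(
        weight * min(1.0, max(0.0, components[name]))
        for name, weight in WEIGHTS.items()
    )
```

Frontality carries the largest weight, so a sharp but strongly turned face ranks below a slightly softer frontal one.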
## Downstream: roop-unleashed
The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.
Highly recommended at swap time: enable **Select post-processing = GFPGAN** with the **Original/Enhanced image blend ratio = 0.85** (the default, 0.65, is conservative). See `docs/analysis/facesets-downstream-refinement-evaluation.md` for the full evaluation.
## Layout
```
/opt/face-sets/
├─ README.md (this file)
├─ sort_faces.py (the tool)
├─ docs/
│ └─ analysis/
│ └─ facesets-downstream-refinement-evaluation.md
└─ work/ (gitignored except force-tracked .py)
├─ build_folders.py (hand-sorted-folder orchestration)
├─ check_faceset001_age.py (age-split readiness probe)
├─ age_split_001.py (age-split orchestration; faceset_001)
├─ synthetic_refine_manifest.json (last build_folders.py output)
├─ cache/
│ ├─ nl_full.npz (canonical cache + duplicates.json)
│ └─ age_split_exif.json (path → EXIF-year cache)
└─ logs/
└─ *.log (every long step writes here)
```