Document hand-sorted-folder import + age-split workflow

- README: document work/build_folders.py (hand-sorted folder identities)
  and the new age-split workflow for splitting a long-running identity
  into era-specific facesets after clustering.
- Force-track work/age_split_001.py and work/check_faceset001_age.py;
  these are the worked example + readiness probe for faceset_001 and
  the template for splitting any other identity by EXIF era.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:08:25 +02:00
parent 4d7a8780de
commit 03a0c75531
3 changed files with 729 additions and 2 deletions


@@ -67,6 +67,92 @@ python sort_faces.py export-swap "$CACHE" \
--raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
### Importing hand-sorted folders as identities
When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool — the identity is asserted, not inferred. The
orchestration script `work/build_folders.py` covers this case:
- For each trusted folder, it filters cache records that fall under it, builds an
identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so
bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) into every
identity whose centroid lies within a tight cosine cutoff (default 0.45). A
multi-identity photo lands in multiple facesets; `export-swap`'s per-bbox outlier
filter ensures each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the
emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and
merges the new entries into the canonical `facesets_swap_ready/manifest.json`
(existing facesets are left untouched).
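The two-pass outlier rejection in the first bullet can be sketched roughly as follows. This is a minimal illustration rather than the code in `work/build_folders.py`: the function names `unit` and `robust_centroid` are hypothetical, and it assumes embeddings arrive as a NumPy array with one row per face.

```python
import numpy as np


def unit(x: np.ndarray) -> np.ndarray:
    """Scale a vector (or each row of a matrix) to unit L2 length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def robust_centroid(embeddings, thresholds=(0.55, 0.45)):
    """Return (centroid, keep_mask) after successive rejection passes.

    Hypothetical helper: each pass recomputes the centroid from the faces
    still kept, then drops faces whose cosine distance exceeds the cutoff.
    """
    emb = unit(np.asarray(embeddings, dtype=np.float64))
    keep = np.ones(len(emb), dtype=bool)
    for cutoff in thresholds:
        centroid = unit(emb[keep].mean(axis=0))
        # Once rejected, a face stays rejected in later passes.
        keep &= (1.0 - emb @ centroid) <= cutoff
        if not keep.any():
            raise ValueError("every face was rejected; loosen the thresholds")
    return unit(emb[keep].mean(axis=0)), keep
```

The first, looser pass removes the obvious bystanders so that the second, tighter pass measures distance against a centroid they no longer skew.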
```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done
# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"
# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```
The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.)
is the only thing to edit when adding more hand-sorted folders later.
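For orientation, the mixed-folder routing described above amounts to a thresholded nearest-centroid lookup in which a face may match more than one identity. The sketch below assumes that reading; `route_faces` and its argument shapes are hypothetical, and only the 0.45 default comes from the text.

```python
import numpy as np


def route_faces(face_embs, centroids, cutoff=0.45):
    """Map face index -> list of identity labels within `cutoff` cosine distance.

    Hypothetical helper: `centroids` is a dict of label -> centroid vector,
    `face_embs` is an (n_faces, dim) array. Faces with no match are omitted.
    """
    labels = list(centroids)
    C = np.stack([np.asarray(centroids[k], dtype=np.float64) for k in labels])
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    F = np.asarray(face_embs, dtype=np.float64)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    dist = 1.0 - F @ C.T  # shape (n_faces, n_identities)
    routes = {}
    for i, row in enumerate(dist):
        hits = [labels[j] for j in np.flatnonzero(row <= cutoff)]
        if hits:
            routes[i] = hits
    return routes
```

Raising `cutoff` pulls in more borderline faces per identity, so any change to it is worth checking against a known mixed folder.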
### Splitting an identity by era (age sub-clustering)
Long-running source corpora produce identities that span 10+ years. The 2009 face
and the 2024 face of the same person sit in the same cluster (correctly — same
identity), but a single averaged embedding pulled from that cluster blurs across
ages. For face-swap output that should target a specific period, the identity
needs to be split by era *after* the identity is established.
`work/age_split_001.py` is a worked example for `faceset_001` and a template for
any other identity. The pipeline is:
- **Probe first** with `work/check_faceset001_age.py`, which reports the
intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds
0.30..0.50, and the EXIF-year distribution per sub-cluster. If sub-clusters at
0.35 align with distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/`
(manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original
source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one
re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance
agglomerative, average linkage).
- **Anchor-based fragment assignment** (not transitive merge, which caused
year drift): sub-clusters with size ≥ 20 are *anchors*; a smaller fragment
attaches to its single nearest anchor only if the centroid distance is ≤ 0.40
AND the two dominant EXIF years are within ±5 years of each other. Fragments
with no qualifying anchor remain standalone (and end up THIN-tagged downstream).
- **EXIF year per source path** with on-disk caching at
`work/cache/age_split_exif.json` — the Windows-mount EXIF read is the
slowest step, so re-runs after a parameter tweak are nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face
square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`,
human-readable `<era>.txt` marker. Eras with < 20 face records also drop a
`THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to
`facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be
moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`),
leaving only the substantive era buckets at the top level.
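The sub-clustering and anchor-assignment steps above can be sketched with SciPy's hierarchical clustering. This is an illustrative reconstruction under stated assumptions, not the code in `work/age_split_001.py`: the helper names are hypothetical, `years` is assumed to hold one EXIF year (or `None`) per face, and "dominant year" is taken to mean the most frequent one.

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def sub_cluster(emb, threshold=0.35):
    """Average-linkage clustering on a precomputed cosine-distance matrix.

    Needs at least two faces; returns one integer label per face.
    """
    emb = unit(np.asarray(emb, dtype=np.float64))
    dist = np.clip(1.0 - emb @ emb.T, 0.0, None)
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=threshold, criterion="distance")


def assign_fragments(emb, labels, years, min_anchor=20,
                     max_dist=0.40, max_year_gap=5):
    """Map each fragment label to its nearest anchor label, or None."""
    emb = unit(np.asarray(emb, dtype=np.float64))
    labels = np.asarray(labels)
    info = {}
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        known = [years[i] for i in idx if years[i] is not None]
        dominant = Counter(known).most_common(1)[0][0] if known else None
        info[int(lab)] = (len(idx), unit(emb[idx].mean(axis=0)), dominant)
    anchors = {l: v for l, v in info.items() if v[0] >= min_anchor}
    result = {}
    for lab, (_, centroid, year) in info.items():
        if lab in anchors:
            continue
        result[lab] = None  # standalone unless the nearest anchor qualifies
        if not anchors or year is None:
            continue
        # Only the single nearest anchor is considered, never a chain.
        best = min(anchors, key=lambda a: 1.0 - centroid @ anchors[a][1])
        d = 1.0 - centroid @ anchors[best][1]
        a_year = anchors[best][2]
        if d <= max_dist and a_year is not None and abs(a_year - year) <= max_year_gap:
            result[lab] = best
    return result
```

Checking only the nearest anchor, instead of merging fragments into each other, is what keeps a chain of small clusters from bridging two distant eras.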
```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py
# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```
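The reason re-runs are cheap is the path-to-year cache. A minimal sketch of such a cache is shown below, assuming a JSON file that maps source path to year; `cached_exif_years` and the injected `read_year` callable are hypothetical, and the actual EXIF read (for example via Pillow) is deliberately left to the caller.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def cached_exif_years(
    paths: Iterable[str],
    read_year: Callable[[str], Optional[int]],
    cache_file: str,
) -> Dict[str, Optional[int]]:
    """Return path -> EXIF year, calling `read_year` only for uncached paths."""
    paths = [str(p) for p in paths]
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    missing = [p for p in paths if p not in cache]
    for p in missing:
        # A None result is cached too, so unreadable files are not retried.
        cache[p] = read_year(p)
    if missing:
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(cache, sort_keys=True))
    return {p: cache[p] for p in paths}
```

Because only missing paths trigger a read, tweaking a clustering threshold and re-running touches the slow mount only for files added since the last run.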
For the `faceset_001` run on the 5260-face `nl_full.npz`, this produced 6 substantive
era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under `_thin/`.
## Key defaults
`refine`:
@@ -111,9 +197,14 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with
├─ docs/
│ └─ analysis/
│ └─ facesets-downstream-refinement-evaluation.md
└─ work/ (gitignored)
└─ work/ (gitignored except force-tracked .py)
├─ build_folders.py (hand-sorted-folder orchestration)
├─ check_faceset001_age.py (age-split readiness probe)
├─ age_split_001.py (age-split orchestration; faceset_001)
├─ synthetic_refine_manifest.json (last build_folders.py output)
├─ cache/
│ └─ nl_full.npz (canonical cache + duplicates.json)
│ ├─ nl_full.npz (canonical cache + duplicates.json)
│ └─ age_split_exif.json (path → EXIF-year cache)
└─ logs/
└─ *.log (every long step writes here)
```