Document hand-sorted-folder import + age-split workflow

- README: document work/build_folders.py (hand-sorted folder identities) and the
  new age-split workflow for splitting a long-running identity into era-specific
  facesets after clustering.
- Force-track work/age_split_001.py and work/check_faceset001_age.py; these are
  the worked example + readiness probe for faceset_001 and the template for
  splitting any other identity by EXIF era.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:08:25 +02:00
parent 4d7a8780de
commit 03a0c75531
3 changed files with 729 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -67,6 +67,92 @@ python sort_faces.py export-swap "$CACHE" \
  --raw-manifest "$OUT/raw_full/manifest.json" --candidates
 ```

+### Importing hand-sorted folders as identities
+
+When source folders are already hand-sorted by person (one folder per identity), the
+clustering path is the wrong tool — the identity is asserted, not inferred. The
+orchestration script `work/build_folders.py` covers this case:
+
+- For each trusted folder, it filters cache records that fall under it, builds an
+  identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so
+  bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
+- It then routes each face record from a *mixed* folder (e.g. `osrc/`) into every
+  identity centroid within a tight cosine cutoff (default 0.45). A multi-identity
+  photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures
+  each faceset crops only its matching face.
+- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the
+  emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and
+  merges the new entries into the canonical `facesets_swap_ready/manifest.json`
+  (existing facesets are left untouched).
+
+```bash
+# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
+for d in k m mi mir s sab t osrc; do
+  python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
+done
+
+# Bring landmarks/pose + visual-dupe report in sync with the new records.
+python sort_faces.py enrich "$CACHE"
+python sort_faces.py dedup  "$CACHE"
+
+# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
+python work/build_folders.py
+```
+
+The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.)
+is the only thing to edit when adding more hand-sorted folders later.
+
+### Splitting an identity by era (age sub-clustering)
+
+Long-running source corpora produce identities that span 10+ years. The 2009 face
+and the 2024 face of the same person sit in the same cluster (correctly — same
+identity), but a single averaged embedding pulled from that cluster blurs across
+ages. For face-swap output that should target a specific period, the identity
+needs to be split by era *after* the identity is established.
+
+`work/age_split_001.py` is a worked example for `faceset_001` and a template for
+any other identity. The pipeline is:
+
+- **Probe first** with `work/check_faceset001_age.py`, which reports the
+  intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds
+  0.30..0.50, and the EXIF-year distribution per sub-cluster. If sub-clusters at
+  0.35 align with distinct year ranges, the identity is age-sortable.
+- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/`
+  (manifest provides face keys → cache rows).
+- **Wide recovery** at cos-dist ≤ 0.55 against the seed, restricted to records
+  under the original source roots, followed by a quality gate (`face_short`,
+  `blur`, `det_score`) and one re-centroid + tighten pass at 0.50 to absorb new faces without drift.
+- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance
+  agglomerative, average linkage).
+- **Anchor-based fragment assignment** (not transitive merge — that caused
+  year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments
+  attach to the single nearest anchor only if both the centroid distance ≤ 0.40
+  AND the dominant EXIF year is within ±5 years. Fragments with no qualifying
+  anchor remain standalone (and end up THIN-tagged downstream).
+- **EXIF year per source path** with on-disk caching at
+  `work/cache/age_split_exif.json` — the Windows-mount EXIF read is the
+  slowest step, so re-runs after a parameter tweak are nearly instant.
+- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face
+  square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`,
+  human-readable `<era>.txt` marker. Eras with < 20 face records also drop a
+  `THIN.txt` marker so they can be quarantined.
+- **Top-level manifest merge**: era buckets are appended to
+  `facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be
+  moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`),
+  leaving only the substantive era buckets at the top level.
+
+```bash
+# 1. Confirm the identity is age-sortable.
+python work/check_faceset001_age.py
+
+# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
+python work/age_split_001.py
+```
+
+For the `faceset_001` run on the 5260-face `nl_full.npz`, this produced 6 substantive
+era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
+plus 68 thin/fragment buckets quarantined under `_thin/`.
+
 ## Key defaults

 `refine`:
@@ -111,9 +197,14 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with
 ├─ docs/
 │  └─ analysis/
 │     └─ facesets-downstream-refinement-evaluation.md
-└─ work/                                         (gitignored)
+└─ work/                                         (gitignored except force-tracked .py)
+   ├─ build_folders.py                           (hand-sorted-folder orchestration)
+   ├─ check_faceset001_age.py                    (age-split readiness probe)
+   ├─ age_split_001.py                           (age-split orchestration; faceset_001)
+   ├─ synthetic_refine_manifest.json             (last build_folders.py output)
   ├─ cache/
-   │  └─ nl_full.npz                             (canonical cache + duplicates.json)
+   │  ├─ nl_full.npz                             (canonical cache + duplicates.json)
+   │  └─ age_split_exif.json                     (path → EXIF-year cache)
   └─ logs/
      └─ *.log                                   (every long step writes here)
 ```
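
Reviewer note: the README text added above describes two numeric rules that are easier to check against a sketch. These are the two-pass outlier-rejected centroid (cos-dist 0.55 → 0.45) and the anchor-based fragment assignment (anchor size ≥ 20, centroid distance ≤ 0.40, dominant EXIF year within ±5). The following is a minimal illustration under stated assumptions, not the code in `work/build_folders.py` or `work/age_split_001.py`. The function names (`cos_dist`, `two_pass_centroid`, `assign_fragments`) and their signatures are hypothetical. Re-testing every embedding against the updated centroid on the second pass is also an assumption about how the rejection is applied.

```python
import numpy as np


def cos_dist(a, b):
    """Cosine distance between one vector `a` and each row of a 2-D array `b`."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - b @ a


def two_pass_centroid(embs, cutoffs=(0.55, 0.45)):
    """Mean embedding after rejecting outliers at each cutoff in turn.

    Hypothetical helper: each pass recomputes the centroid from the current
    survivors, then re-tests every embedding against it.
    """
    keep = np.ones(len(embs), dtype=bool)
    for cutoff in cutoffs:
        centroid = embs[keep].mean(axis=0)
        keep = cos_dist(centroid, embs) <= cutoff
        if not keep.any():
            raise ValueError("all embeddings rejected; cutoff too tight")
    return embs[keep].mean(axis=0), keep


def assign_fragments(centroids, sizes, years,
                     min_anchor=20, max_dist=0.40, max_year_gap=5):
    """Map each sub-cluster index to an anchor index, or to itself.

    A fragment attaches to its single nearest anchor only when both the
    centroid distance and the dominant-year gap are within bounds.
    Fragments are never merged with each other (no transitive merge).
    """
    anchors = [i for i, n in enumerate(sizes) if n >= min_anchor]
    assignment = {}
    for i in range(len(sizes)):
        assignment[i] = i  # default: anchors and unattached fragments stand alone
        if i in anchors or not anchors:
            continue
        dists = cos_dist(centroids[i], centroids[anchors])
        j = int(np.argmin(dists))
        nearest = anchors[j]
        if dists[j] <= max_dist and abs(years[i] - years[nearest]) <= max_year_gap:
            assignment[i] = nearest
    return assignment


# Toy data: three near-identical faces plus one bystander embedding.
embs = np.array([[1.0, 0.0], [1.0, 0.1], [0.99, -0.1], [0.0, 1.0]])
centroid, keep = two_pass_centroid(embs)

# Toy sub-clusters: two anchors (sizes 50 and 40) and three small fragments.
cents = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.2, 1.0], [1.0, 0.1]])
result = assign_fragments(cents, sizes=[50, 40, 5, 3, 4],
                          years=[2008, 2019, 2009, 2005, 2020])
```

In the toy data, fragment 2 is close to anchor 0 in both embedding space and year, so it attaches. Fragments 3 and 4 are each close to an anchor but more than 5 years away from it, so they stay standalone, which corresponds to the buckets the README says end up THIN-tagged.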