Add age-split run analysis for faceset_001

Documents the 2026-04-26 split of faceset_001 (707 curated faces) into 6 substantive era buckets + 68 thin fragments, including the readiness probe evidence, the anchor-based assignment rationale (replaces transitive union-find that caused year-drift), and the re-run / apply-to-other-identity workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:10:37 +02:00
parent 03a0c75531
commit e48dd8aec7
1 changed files with 119 additions and 0 deletions
@@ -0,0 +1,119 @@
+# Age-splitting faceset_001 into era-specific facesets
+
+_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records). Source: `work/age_split_001.py` and `work/check_faceset001_age.py`._
+
+## 1. Why split
+
+`faceset_001` aggregates a single identity across roughly 20 years of source
+material. The averaged embedding consumed by roop-unleashed therefore mixes
+features from very different ages. For face-swap output that should target a
+specific period (e.g. "this person around 2011" or "this person around
+2018–19"), the identity needs to be split *after* clustering — the cluster is
+correctly one identity, but the averaged embedding is the problem.
+
+## 2. Evidence the identity is age-sortable
+
+`work/check_faceset001_age.py` probes `faceset_001` (707 curated faces).
+
+**Pairwise cos-distance histogram** (249,571 pairs):
+
+| range       | pairs |
+|-------------|------:|
+| [0.0, 0.2)  | 1,250 |
+| [0.2, 0.3)  | 11,277 |
+| [0.3, 0.4)  | 63,920 |
+| [0.4, 0.5)  | 92,555 |
+| [0.5, 0.6)  | 63,288 |
+| [0.6, 0.7)  | 16,048 |
+| [0.7, 0.8)  | 1,217 |
+| [0.8, 1.0)  | 16 |
+
+Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse — wide
+enough to admit non-trivial sub-structure without crossing the
+inter-identity boundary (which sits well above 0.6 in this dataset).
+
+**Sub-clusters at threshold 0.35** (precomputed cos-dist, average linkage):
+156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24].
+The top sub-clusters align with distinct EXIF year medians (2011, 2019,
+2018, 2011, 2010), so the split is meaningful.
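+
The histogram above can be reproduced in outline as follows. This is a minimal sketch, not the code in `work/check_faceset001_age.py`: the function names are hypothetical, embeddings are assumed to be plain lists of floats, and only the bin edges and the face count come from the table and text above.

```python
import math
from itertools import combinations

# Bin edges copied from the histogram table above.
EDGES = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cos_dist(a, b):
    """Cosine distance between two unit-length vectors, clamped at 0."""
    return max(0.0, 1.0 - sum(x * y for x, y in zip(a, b)))


def pairwise_histogram(embeddings, edges):
    """Count unordered pairs per half-open bin [edges[i], edges[i + 1])."""
    unit = [normalize(v) for v in embeddings]
    counts = [0] * (len(edges) - 1)
    for a, b in combinations(unit, 2):
        d = cos_dist(a, b)
        for i in range(len(counts)):
            if edges[i] <= d < edges[i + 1]:
                counts[i] += 1
                break
    return counts


# 707 faces give 707 * 706 / 2 unordered pairs, matching the table total.
N_FACES = 707
N_PAIRS = N_FACES * (N_FACES - 1) // 2
print(N_PAIRS)  # 249571
```

Distances of 1.0 or more fall outside the last bin in this sketch; that is harmless here because the reported maximum is 0.842.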
+
+## 3. Pipeline
+
+`work/age_split_001.py`:
+
+1. **Seed centroid.** Load the 707 face keys from
+   `facesets_swap_ready/faceset_001/manifest.json`; resolve to cache rows;
+   normalize the mean embedding.
+2. **Wide recovery.** Pull every face record under `/mnt/x/src/{nl,
+   lzbkp_red}` from the cache with cos-dist ≤ 0.55 from the seed. The seed
+   is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501
+   faces from 4,756 candidates.
+3. **Quality gate** (mirrors export-swap defaults): `face_short ≥ 100`,
+   `blur ≥ 40.0`, `det_score ≥ 0.6`. Result: 892 faces pass the gate; one
+   re-centroid + tighten pass at cos-dist 0.50 then leaves 856, absorbing
+   the recovered faces without centroid drift.
+4. **Sub-cluster** the survivors at cos-dist 0.35 (precomputed agglomerative,
+   average linkage). 223 raw sub-clusters; sizes top-10 = [127, 97, 55, 42,
+   40, 25, 17, 14, 13, 11].
+5. **EXIF year per source path.** Read `DateTimeOriginal` once per unique
+   path; cache on disk at `work/cache/age_split_exif.json` so re-runs after
+   parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths
+   were dated.
+6. **Anchor-based fragment assignment** (replaces transitive union-find merge
+   that caused observable year drift):
+   - sub-clusters with ≥ 20 faces are *anchors* (6 found: dom-years 2011,
+     2019, 2018, 2011, 2016, 2010);
+   - smaller fragments attach to the single nearest anchor *only if* both
+     `cent_dist ≤ 0.40` AND `|dom_year_anchor − dom_year_fragment| ≤ 5`;
+   - anchors do not merge with each other (transitive merging produced
+     anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier
+     runs);
+   - fragments with no qualifying anchor remain standalone.
+7. **Per-era export.** Composite-quality rank, single-face square PNG crops
+   (`pad_ratio=0.5`, `out_size=512`), top-N + `_all` `.fsz` bundles, per-era
+   `manifest.json`, `<label>.txt` marker, `THIN.txt` for buckets < 20 faces.
+8. **Top-level manifest merge.** New entries are appended to
+   `facesets_swap_ready/manifest.json`. In a separate cleanup step, the THIN
+   buckets are then moved into `_thin/` and their entries are moved into a
+   `thin_eras` array (with `relpath: _thin/<name>`), so consumers reading
+   `facesets` see only the substantive entries.
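+
Steps 1 to 3 (seed centroid, wide recovery, quality gate, tighten pass) might look roughly like the sketch below. It is illustrative only: the record layout (a dict with an `embedding` key) and all function names are assumptions, and the constant names for the gate are hypothetical. Only the threshold values and the `face_short`, `blur`, and `det_score` field names are taken from the list above.

```python
import math

RECOVERY_THRESHOLD = 0.55  # outer envelope around the curated seed (step 2)
TIGHTEN_THRESHOLD = 0.50   # re-centroid + tighten pass (step 3)
MIN_FACE_SHORT = 100       # quality-gate values from step 3
MIN_BLUR = 40.0
MIN_DET_SCORE = 0.6


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cos_dist(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def seed_centroid(embeddings):
    """Average the embeddings, then normalize the mean (step 1)."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return unit(mean)


def recover(records, centroid, threshold=RECOVERY_THRESHOLD):
    """Keep records whose embedding lies within `threshold` of the centroid."""
    return [r for r in records
            if cos_dist(unit(r["embedding"]), centroid) <= threshold]


def passes_quality_gate(record):
    """Apply the three quality-gate conditions from step 3."""
    return (record["face_short"] >= MIN_FACE_SHORT
            and record["blur"] >= MIN_BLUR
            and record["det_score"] >= MIN_DET_SCORE)


def tighten(records, threshold=TIGHTEN_THRESHOLD):
    """Recompute the centroid from the survivors and re-filter once."""
    centroid = seed_centroid([r["embedding"] for r in records])
    return recover(records, centroid, threshold)
```

A call chain such as `tighten([r for r in recover(records, seed) if passes_quality_gate(r)])` then mirrors the order of the steps; whether the real script averages raw or pre-normalized embeddings is not stated here, so the sketch averages them as given.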
+
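The anchor rule in step 6 can be sketched as below. This is a hedged illustration, not the implementation in `work/age_split_001.py`: the `SubCluster` class and `assign_fragments` function are hypothetical, centroids are assumed to be unit-length, and treating an undated fragment as non-qualifying is an assumption the text does not confirm. The three constants use the names and values given in this document.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional

ANCHOR_MIN_SIZE = 20          # sub-clusters this large become anchors
FRAGMENT_CENTROID_MAX = 0.40  # max centroid cos-dist for attachment
FRAGMENT_YEAR_MAX = 5         # max dominant-year gap for attachment


@dataclass
class SubCluster:
    cid: int
    size: int
    centroid: List[float]     # assumed unit-length
    dom_year: Optional[int]   # None when no EXIF date was found


def cos_dist(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def assign_fragments(clusters: List[SubCluster]) -> Dict[int, int]:
    """Map each sub-cluster id to the id of the bucket it ends up in."""
    anchors = [c for c in clusters if c.size >= ANCHOR_MIN_SIZE]
    # Anchors always map to themselves: they never merge with each other.
    assignment = {a.cid: a.cid for a in anchors}
    for frag in clusters:
        if frag.size >= ANCHOR_MIN_SIZE:
            continue
        assignment[frag.cid] = frag.cid  # default: remain standalone
        if not anchors:
            continue
        # Only the single nearest anchor is considered; there is no
        # fallback to the second-nearest if it fails the checks.
        nearest = min(anchors, key=lambda a: cos_dist(a.centroid, frag.centroid))
        close = cos_dist(nearest.centroid, frag.centroid) <= FRAGMENT_CENTROID_MAX
        same_era = (frag.dom_year is not None
                    and nearest.dom_year is not None
                    and abs(nearest.dom_year - frag.dom_year) <= FRAGMENT_YEAR_MAX)
        if close and same_era:
            assignment[frag.cid] = nearest.cid
    return assignment
```

Because every fragment is compared only against anchors, never against other fragments, there is no chain of merges through which the dominant year could drift, which is the failure mode attributed to the transitive union-find merge.
+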
+## 4. Result
+
+74 era buckets emitted; 6 substantive + 68 thin/standalone fragments.
+
+| era               | faces | dom year(s) |
+|-------------------|------:|-------------|
+| `faceset_001_2010-13` | 282 | 2011 |
+| `faceset_001_2018-20` | 129 | 2019 |
+| `faceset_001_2014-17` | 125 | 2018 (anchor sub 15 dom_year=2018) |
+| `faceset_001_2018-19` | 107 | 2018 |
+| `faceset_001_2005-10` | 88  | 2010 |
+| `faceset_001_2011`    | 43  | 2011 |
+
+Two distinct 2011 anchors and two 2018-area anchors persist by design:
+embedding-space distance separated them despite the year overlap. Era-label
+collisions are disambiguated with a `_v2` suffix, but only when two anchors
+land on the *same* literal label string (none of the substantive six did).
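+
One way such label disambiguation could work is sketched below. The function name is hypothetical, and continuing the pattern to `_v3` and beyond is an assumption; the text above only mentions `_v2`.

```python
def disambiguate_labels(labels):
    """Suffix repeated labels with _v2, _v3, ... in order of appearance."""
    seen = {}
    result = []
    for label in labels:
        seen[label] = seen.get(label, 0) + 1
        if seen[label] == 1:
            result.append(label)  # first occurrence keeps the bare label
        else:
            result.append(f"{label}_v{seen[label]}")
    return result


print(disambiguate_labels(["2011", "2018-19", "2011"]))
# ['2011', '2018-19', '2011_v2']
```

Labels that differ as strings, such as `2010-13` and `2011`, pass through unchanged even when their year ranges overlap.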
+
+The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic
+embeddings; they are quarantined into `_thin/` rather than deleted because
+some are legitimate edge poses / lighting / age extremes that may be useful
+for narrow targeted swaps.
+
+## 5. Re-running and applying to other identities
+
+- **Re-run with different parameters**: just re-execute `age_split_001.py`.
+  Embeddings are loaded from cache, EXIF is loaded from
+  `age_split_exif.json`, and only the sub-cluster + export steps re-run.
+  Total runtime ~2 min.
+- **Apply to a different identity**: copy `age_split_001.py` to
+  `age_split_NNN.py` and change `FS001`. The `SCAN_ROOTS`,
+  `RECOVERY_THRESHOLD`, `TIGHTEN_THRESHOLD`, `SUBCLUSTER_THRESHOLD`,
+  `ANCHOR_MIN_SIZE`, `FRAGMENT_CENTROID_MAX`, and `FRAGMENT_YEAR_MAX`
+  defaults are tuned for `faceset_001`'s 707-face curated cluster; smaller
+  identities likely need a lower `ANCHOR_MIN_SIZE`.
+- **Always quarantine THIN buckets** afterwards using the same partition
+  pattern (move to `_thin/`, split top-level manifest into
+  `facesets` + `thin_eras`). The script appends THIN entries to the top-level
+  manifest as if they were full facesets, so the cleanup is a separate step.
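+
The manifest half of the quarantine step could be sketched as follows. The real manifest schema is not shown in this document, so the per-entry `name` and `face_count` keys, the `prefix` argument, and the function name are all assumptions; only the `facesets` and `thin_eras` arrays, the `relpath: _thin/<name>` convention, and the 20-face THIN cutoff come from the text.

```python
import copy

THIN_MIN_FACES = 20  # buckets below this size are THIN (see step 7)


def partition_manifest(manifest, prefix, min_faces=THIN_MIN_FACES):
    """Return a copy with thin era entries moved from `facesets` to `thin_eras`.

    Only entries whose name starts with `prefix` are candidates, so small
    facesets that belong to other identities stay where they are.
    """
    out = copy.deepcopy(manifest)
    keep = []
    thin = list(out.get("thin_eras", []))
    for entry in out.get("facesets", []):
        is_era = entry["name"].startswith(prefix)
        if is_era and entry["face_count"] < min_faces:
            entry["relpath"] = f"_thin/{entry['name']}"
            thin.append(entry)
        else:
            keep.append(entry)
    out["facesets"] = keep
    out["thin_eras"] = thin
    return out
```

The function leaves its input untouched and does not move any directories; relocating each THIN bucket into `_thin/` (for example with `shutil.move`) would happen alongside it, before the partitioned manifest is written back.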