Add age-split run analysis for faceset_001

Documents the 2026-04-26 split of faceset_001 (707 curated faces) into
6 substantive era buckets + 68 thin fragments, including the readiness
probe evidence, the anchor-based assignment rationale (replaces
transitive union-find that caused year-drift), and the re-run / apply-
to-other-identity workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:10:37 +02:00
parent 03a0c75531
commit e48dd8aec7


@@ -0,0 +1,119 @@
# Age-splitting faceset_001 into era-specific facesets
_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records). Source: `work/age_split_001.py` and `work/check_faceset001_age.py`._
## 1. Why split
`faceset_001` aggregates a single identity across roughly 20 years of source
material. The averaged embedding consumed by roop-unleashed therefore mixes
features from very different ages. For face-swap output that should target a
specific period (e.g. "this person around 2011" or "this person around
2018-19"), the identity needs to be split *after* clustering: the cluster is
correctly one identity, but the averaged embedding is the problem.
## 2. Evidence the identity is age-sortable
`work/check_faceset001_age.py` probes `faceset_001` (707 curated faces).
**Pairwise cos-distance histogram** (249,571 pairs):

| range       |  pairs |
|-------------|-------:|
| [0.0, 0.2)  |  1,250 |
| [0.2, 0.3)  | 11,277 |
| [0.3, 0.4)  | 63,920 |
| [0.4, 0.5)  | 92,555 |
| [0.5, 0.6)  | 63,288 |
| [0.6, 0.7)  | 16,048 |
| [0.7, 0.8)  |  1,217 |
| [0.8, 1.0)  |     16 |

Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse: wide
enough to admit non-trivial sub-structure without crossing the
inter-identity boundary (which sits well above 0.6 in this dataset).
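A histogram of this shape can be computed with a few lines of NumPy. The following is a minimal sketch, not the contents of `check_faceset001_age.py`: the function name `pairwise_cos_dist_hist` is hypothetical, and the embeddings are synthetic stand-ins for the 707 cached vectors.

```python
import numpy as np

def pairwise_cos_dist_hist(emb, edges):
    """Cosine distances over all unordered row pairs, plus bin counts."""
    # L2-normalize so that the dot product equals cosine similarity.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Upper triangle without the diagonal: each unordered pair exactly once.
    rows, cols = np.triu_indices(len(unit), k=1)
    dist = 1.0 - sim[rows, cols]
    # Note: np.histogram closes the last bin on the right, i.e. [0.8, 1.0].
    counts, _ = np.histogram(dist, bins=edges)
    return dist, counts

# Synthetic stand-in (assumption): one shared direction plus per-face noise.
rng = np.random.default_rng(0)
base = np.zeros(512)
base[0] = 1.0
emb = base + 0.04 * rng.normal(size=(707, 512))

edges = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
dist, counts = pairwise_cos_dist_hist(emb, edges)
print(len(dist), float(dist.mean()), float(np.median(dist)), float(dist.max()))
```

For 707 faces the pair count is 707 × 706 / 2 = 249,571, which matches the total of the table.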
**Sub-clusters at threshold 0.35** (precomputed cos-dist, average linkage):
156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24].
The top sub-clusters align with distinct EXIF year medians (2011, 2019,
2018, 2011, 2010), so the split is meaningful.
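The sub-clustering cut can be sketched with SciPy's hierarchical clustering. This is an illustration under assumptions: the probe script may use a different library, the helper name `subcluster` is hypothetical, and the data is synthetic. SciPy's `fcluster` with `criterion="distance"` cuts the average-linkage tree at roughly the same place as a distance-threshold agglomerative run.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def subcluster(emb, threshold=0.35):
    """Average-linkage clustering on cosine distance, cut at `threshold`."""
    condensed = pdist(emb, metric="cosine")      # condensed pairwise distances
    tree = linkage(condensed, method="average")  # average linkage
    # Flat cluster labels start at 1.
    return fcluster(tree, t=threshold, criterion="distance")

# Synthetic stand-in (assumption): three tight groups around orthogonal axes.
rng = np.random.default_rng(0)
centers = np.eye(3, 64)
emb = np.vstack([
    center + 0.02 * rng.normal(size=(n, 64))
    for center, n in zip(centers, (30, 12, 5))
])

labels = subcluster(emb)
sizes = sorted(np.bincount(labels)[1:].tolist(), reverse=True)
print(len(sizes), sizes)
```

Sorting the cluster sizes in descending order gives the kind of "top-5 sizes" summary quoted above.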
## 3. Pipeline
`work/age_split_001.py`:
1. **Seed centroid.** Load the 707 face keys from
`facesets_swap_ready/faceset_001/manifest.json`; resolve to cache rows;
normalize the mean embedding.
2. **Wide recovery.** Pull every face record under `/mnt/x/src/{nl,
lzbkp_red}` from the cache with cos-dist ≤ 0.55 from the seed. The seed
is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501
faces from 4,756 candidates.
3. **Quality gate** (mirrors export-swap defaults): `face_short ≥ 100`,
`blur ≥ 40.0`, `det_score ≥ 0.6`. Result: 892 faces pass the gate; 856 remain
after one re-centroid + tighten pass at 0.50, which absorbs the recovered
faces without drift.
4. **Sub-cluster** the survivors at cos-dist 0.35 (precomputed agglomerative,
average linkage). 223 raw sub-clusters; sizes top-10 = [127, 97, 55, 42,
40, 25, 17, 14, 13, 11].
5. **EXIF year per source path.** Read `DateTimeOriginal` once per unique
path; cache on disk at `work/cache/age_split_exif.json` so re-runs after
parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths
were dated.
6. **Anchor-based fragment assignment** (replaces transitive union-find merge
that caused observable year drift):
- sub-clusters with ≥ 20 faces are *anchors* (6 found, with dominant years
2011, 2019, 2018, 2011, 2016, 2010);
- smaller fragments attach to the single nearest anchor *only if* both
`cent_dist ≤ 0.40` AND `|dom_year_anchor - dom_year_fragment| ≤ 5`;
- anchors do not merge with each other (transitive merging produced
anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier
runs);
- fragments with no qualifying anchor remain standalone.
7. **Per-era export.** Composite-quality rank, single-face square PNG crops
(`pad_ratio=0.5`, `out_size=512`), top-N + `_all` `.fsz` bundles, per-era
`manifest.json`, `<label>.txt` marker, `THIN.txt` for buckets < 20 faces.
8. **Top-level manifest merge.** New entries are appended to
`facesets_swap_ready/manifest.json`. Operationally, the THIN bucket
directories are then moved into `_thin/` and their manifest entries are
partitioned into a `thin_eras` array (with `relpath: _thin/<name>`), so
consumers reading `facesets` see only the substantive entries.
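The anchor rule in step 6 can be sketched as follows. This is a hedged illustration, not the code in `age_split_001.py`: the function name `assign_fragments` and its input layout are hypothetical, the constant names and values are taken from this document, and leaving undated fragments standalone is an assumption.

```python
import numpy as np

ANCHOR_MIN_SIZE = 20          # sub-clusters at least this large are anchors
FRAGMENT_CENTROID_MAX = 0.40  # max cos-dist from fragment to anchor centroid
FRAGMENT_YEAR_MAX = 5         # max gap between dominant years

def assign_fragments(centroids, sizes, dom_years):
    """Map each sub-cluster index to the anchor it joins, or to itself.

    centroids: (k, d) array of unit-normalized sub-cluster centroids.
    sizes:     face count per sub-cluster.
    dom_years: dominant year per sub-cluster, or None if undated.
    """
    anchors = [i for i, n in enumerate(sizes) if n >= ANCHOR_MIN_SIZE]
    target = {i: i for i in range(len(sizes))}  # default: standalone
    for i in range(len(sizes)):
        # Anchors never merge with each other, so no transitive chains form.
        if i in anchors or not anchors or dom_years[i] is None:
            continue
        dists = [1.0 - float(centroids[i] @ centroids[a]) for a in anchors]
        best = int(np.argmin(dists))
        nearest = anchors[best]
        if dom_years[nearest] is None:
            continue
        close = dists[best] <= FRAGMENT_CENTROID_MAX
        same_era = abs(dom_years[nearest] - dom_years[i]) <= FRAGMENT_YEAR_MAX
        # Only the single nearest anchor is considered; there is no fallback.
        if close and same_era:
            target[i] = nearest
    return target

# Toy data (assumption): 2-D unit vectors at the given angles.
angles = np.deg2rad([0, 90, 20, 80, 225])
centroids = np.column_stack([np.cos(angles), np.sin(angles)])
sizes = [120, 105, 4, 3, 1]
dom_years = [2011, 2019, 2012, 2008, 2011]
mapping = assign_fragments(centroids, sizes, dom_years)
print(mapping)
```

In the toy data, fragment 2 joins anchor 0 (close and within one year), fragment 3 stays standalone because its nearest anchor is 11 years away, and fragment 4 stays standalone because no anchor is within 0.40.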
## 4. Result
74 era buckets emitted: 6 substantive + 68 thin/standalone fragments.

| era                   | faces | dom year(s) |
|-----------------------|------:|-------------|
| `faceset_001_2010-13` |   282 | 2011 |
| `faceset_001_2018-20` |   129 | 2019 |
| `faceset_001_2014-17` |   125 | 2018 (anchor sub 15 dom_year=2018) |
| `faceset_001_2018-19` |   107 | 2018 |
| `faceset_001_2005-10` |    88 | 2010 |
| `faceset_001_2011`    |    43 | 2011 |

Two distinct 2011 anchors and two 2018-area anchors persist by design:
embedding-space distance separated them despite the year overlap. Era-label
collisions are disambiguated with `_v2` suffixes, but only when two anchors
land on the *same* literal label string, which none of the substantive six did.
The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic
embeddings; they are quarantined into `_thin/` rather than deleted because
some are legitimate edge poses / lighting / age extremes that may be useful
for narrow targeted swaps.
## 5. Re-running and applying to other identities
- **Re-run with different parameters**: just re-execute `age_split_001.py`.
Embeddings are loaded from cache, EXIF is loaded from
`age_split_exif.json`, and only the sub-cluster + export steps re-run.
Total runtime ~2 min.
- **Apply to a different identity**: copy `age_split_001.py` to
`age_split_NNN.py` and change `FS001`. The `SCAN_ROOTS`,
`RECOVERY_THRESHOLD`, `TIGHTEN_THRESHOLD`, `SUBCLUSTER_THRESHOLD`,
`ANCHOR_MIN_SIZE`, `FRAGMENT_CENTROID_MAX`, and `FRAGMENT_YEAR_MAX`
defaults are tuned for `faceset_001`'s ~707-face curated cluster; smaller
identities likely need a lower `ANCHOR_MIN_SIZE`.
- **Always quarantine THIN buckets** afterwards using the same partition
pattern (move to `_thin/`, split top-level manifest into
`facesets` + `thin_eras`). The script appends THIN entries to the top-level
manifest as if they were full facesets, so the cleanup is a separate step.
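The quarantine step could be scripted along these lines. This is a sketch under assumptions: the manifest layout beyond the `facesets` and `thin_eras` arrays and the `relpath` key is not documented here, so the `name` field, the helper `quarantine_thin`, and the bucket name `faceset_001_frag_07` are hypothetical. The demo runs against a temporary directory rather than `facesets_swap_ready/`.

```python
import json
import shutil
import tempfile
from pathlib import Path

def quarantine_thin(root):
    """Move buckets holding a THIN.txt marker into _thin/ and split the manifest."""
    root = Path(root)
    manifest_path = root / "manifest.json"
    manifest = json.loads(manifest_path.read_text())
    keep = []
    thin = list(manifest.get("thin_eras", []))
    thin_dir = root / "_thin"
    for entry in manifest["facesets"]:
        bucket = root / entry["relpath"]
        if (bucket / "THIN.txt").exists():
            thin_dir.mkdir(exist_ok=True)
            shutil.move(str(bucket), str(thin_dir / bucket.name))
            # Keep the entry, but point it at its new location.
            thin.append({**entry, "relpath": f"_thin/{bucket.name}"})
        else:
            keep.append(entry)
    manifest["facesets"] = keep
    manifest["thin_eras"] = thin
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest

# Demo on a throwaway directory with one substantive and one thin bucket.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, is_thin in [("faceset_001_2011", False), ("faceset_001_frag_07", True)]:
        (root / name).mkdir()
        if is_thin:
            (root / name / "THIN.txt").write_text("thin bucket\n")
    entries = [
        {"name": "faceset_001_2011", "relpath": "faceset_001_2011"},
        {"name": "faceset_001_frag_07", "relpath": "faceset_001_frag_07"},
    ]
    (root / "manifest.json").write_text(json.dumps({"facesets": entries}))
    result = quarantine_thin(root)
    moved_ok = (root / "_thin" / "faceset_001_frag_07" / "THIN.txt").exists()

print([e["relpath"] for e in result["facesets"]])
print([e["relpath"] for e in result["thin_eras"]])
```

Existing `thin_eras` entries are preserved, so under these assumptions the helper can be run again after a later split without losing earlier quarantined buckets.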