# Age-splitting faceset_001 into era-specific facesets _Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records). Source: `work/age_split_001.py` and `work/check_faceset001_age.py`._ ## 1. Why split `faceset_001` aggregates a single identity across roughly 20 years of source material. The averaged embedding consumed by roop-unleashed therefore mixes features from very different ages. For face-swap output that should target a specific period (e.g. "this person around 2011" or "this person around 2018–19"), the identity needs to be split *after* clustering — the cluster is correctly one identity, but the averaged embedding is the problem. ## 2. Evidence the identity is age-sortable `work/check_faceset001_age.py` probes `faceset_001` (707 curated faces). **Pairwise cos-distance histogram** (249,571 pairs): | range | pairs | |-------------|------:| | [0.0, 0.2) | 1,250 | | [0.2, 0.3) | 11,277 | | [0.3, 0.4) | 63,920 | | [0.4, 0.5) | 92,555 | | [0.5, 0.6) | 63,288 | | [0.6, 0.7) | 16,048 | | [0.7, 0.8) | 1,217 | | [0.8, 1.0) | 16 | Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse — wide enough to admit non-trivial sub-structure without crossing the inter-identity boundary (which sits well above 0.6 in this dataset). **Sub-clusters at threshold 0.35** (precomputed cos-dist, average linkage): 156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24]. The top sub-clusters align with distinct EXIF year medians (2011, 2019, 2018, 2011, 2010), so the split is meaningful. ## 3. Pipeline `work/age_split_001.py`: 1. **Seed centroid.** Load the 707 face keys from `facesets_swap_ready/faceset_001/manifest.json`; resolve to cache rows; normalize the mean embedding. 2. **Wide recovery.** Pull every face record under `/mnt/x/src/{nl, lzbkp_red}` from the cache with cos-dist ≤ 0.55 from the seed. The seed is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501 faces from 4,756 candidates. 3. **Quality gate** (mirrors export-swap defaults): `face_short ≥ 100`, `blur ≥ 40.0`, `det_score ≥ 0.6`. Result: 892 → 856 after one re-centroid + tighten pass at 0.50 to absorb the recovery without drift. 4. **Sub-cluster** the survivors at cos-dist 0.35 (precomputed agglomerative, average linkage). 223 raw sub-clusters; sizes top-10 = [127, 97, 55, 42, 40, 25, 17, 14, 13, 11]. 5. **EXIF year per source path.** Read `DateTimeOriginal` once per unique path; cache on disk at `work/cache/age_split_exif.json` so re-runs after parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths were dated. 6. **Anchor-based fragment assignment** (replaces transitive union-find merge that caused observable year drift): - sub-clusters with ≥ 20 faces are *anchors* (6 found: dom-years 2011, 2019, 2018, 2011, 2016, 2010); - smaller fragments attach to the single nearest anchor *only if* both `cent_dist ≤ 0.40` AND `|dom_year_anchor − dom_year_fragment| ≤ 5`; - anchors do not merge with each other (transitive merging produced anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier runs); - fragments with no qualifying anchor remain standalone. 7. **Per-era export.** Composite-quality rank, single-face square PNG crops (`pad_ratio=0.5`, `out_size=512`), top-N + `_all` `.fsz` bundles, per-era `manifest.json`, `