Documents the 2026-04-26 split of faceset_001 (707 curated faces) into 6 substantive era buckets + 68 thin fragments, including the readiness probe evidence, the anchor-based assignment rationale (replaces transitive union-find that caused year-drift), and the re-run / apply-to-other-identity workflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Age-splitting faceset_001 into era-specific facesets
Run date: 2026-04-26. Cache: work/cache/nl_full.npz (5260 face records). Source: work/age_split_001.py and work/check_faceset001_age.py.
1. Why split
faceset_001 aggregates a single identity across roughly 20 years of source
material. The averaged embedding consumed by roop-unleashed therefore mixes
features from very different ages. For face-swap output that should target a
specific period (e.g. "this person around 2011" or "this person around
2018–19"), the identity needs to be split after clustering — the cluster is
correctly one identity, but the averaged embedding is the problem.
2. Evidence the identity is age-sortable
work/check_faceset001_age.py probes faceset_001 (707 curated faces).
Pairwise cos-distance histogram (249,571 pairs):
| range | pairs |
|---|---|
| [0.0, 0.2) | 1,250 |
| [0.2, 0.3) | 11,277 |
| [0.3, 0.4) | 63,920 |
| [0.4, 0.5) | 92,555 |
| [0.5, 0.6) | 63,288 |
| [0.6, 0.7) | 16,048 |
| [0.7, 0.8) | 1,217 |
| [0.8, 1.0) | 16 |
Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse — wide enough to admit non-trivial sub-structure without crossing the inter-identity boundary (which sits well above 0.6 in this dataset).
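A histogram of this kind can be reproduced with a few lines of NumPy. The sketch below is illustrative only: it uses random 512-d vectors as a stand-in for the cached embeddings (the embedding dimension and the helper name `pairwise_cos_dist` are assumptions, not taken from `check_faceset001_age.py`), and reuses the bin edges from the table above.

```python
import numpy as np

def pairwise_cos_dist(emb: np.ndarray) -> np.ndarray:
    """Cosine distance for every unordered pair of rows, as a flat vector."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    iu = np.triu_indices(len(unit), k=1)  # upper triangle, diagonal excluded
    return 1.0 - sim[iu]

# Synthetic stand-in for the 707 curated embeddings (assumed 512-d).
rng = np.random.default_rng(0)
emb = rng.normal(size=(707, 512))

dist = pairwise_cos_dist(emb)
edges = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
# np.histogram treats every bin as half-open except the last, which is closed.
counts, _ = np.histogram(dist, bins=edges)

print(len(dist))  # 707 * 706 / 2 = 249571 pairs
print(float(dist.mean()), float(np.median(dist)), float(dist.max()))
```

With real embeddings the `counts` array would correspond row for row to the table; on the random stand-in most distances fall near 1.0, so only the pair count is comparable.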
Sub-clusters at threshold 0.35 (precomputed cos-dist, average linkage): 156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24]. The top sub-clusters align with distinct EXIF year medians (2011, 2019, 2018, 2011, 2010), so the split is meaningful.
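For readers unfamiliar with thresholded average linkage, here is a deliberately naive sketch of the technique on a precomputed cosine-distance matrix. It is not the code from the probe script; `average_linkage` is a hypothetical helper written for transparency, and the demo data is synthetic. In practice, `scipy.cluster.hierarchy.linkage(..., method="average")` followed by `fcluster(..., criterion="distance")` implements the same idea far more efficiently.

```python
import numpy as np

def average_linkage(dist: np.ndarray, threshold: float) -> list[list[int]]:
    """Naive average-linkage agglomeration over a precomputed distance matrix.

    Repeatedly merges the two clusters with the smallest mean pairwise
    distance and stops once that smallest distance exceeds `threshold`.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best_d, best_pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best_d:
                    best_d, best_pair = d, (a, b)
        if best_d > threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters.pop(b)
    return clusters

# Synthetic demo: three tight groups of five points around orthogonal directions.
rng = np.random.default_rng(0)
centers = np.eye(8)[:3]
emb = np.repeat(centers, 5, axis=0) + 0.05 * rng.normal(size=(15, 8))
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
dist = 1.0 - unit @ unit.T

clusters = average_linkage(dist, threshold=0.35)
sizes = sorted((len(c) for c in clusters), reverse=True)
print(sizes)  # [5, 5, 5]
```

The loop is cubic or worse in the number of faces, so it is only suitable for illustrating the stopping rule, not for a 707-face cluster.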
3. Pipeline
work/age_split_001.py:
- Seed centroid. Load the 707 face keys from `facesets_swap_ready/faceset_001/manifest.json`; resolve to cache rows; normalize the mean embedding.
- Wide recovery. Pull every face record under `/mnt/x/src/{nl, lzbkp_red}` from the cache with cos-dist ≤ 0.55 from the seed. The seed is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501 faces from 4,756 candidates.
- Quality gate (mirrors export-swap defaults): `face_short ≥ 100`, `blur ≥ 40.0`, `det_score ≥ 0.6`. Result: 892 → 856 after one re-centroid + tighten pass at 0.50 to absorb the recovery without drift.
- Sub-cluster the survivors at cos-dist 0.35 (precomputed agglomerative, average linkage). 223 raw sub-clusters; sizes top-10 = [127, 97, 55, 42, 40, 25, 17, 14, 13, 11].
- EXIF year per source path. Read `DateTimeOriginal` once per unique path; cache on disk at `work/cache/age_split_exif.json` so re-runs after parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths were dated.
- Anchor-based fragment assignment (replaces transitive union-find merge that caused observable year drift):
  - sub-clusters with ≥ 20 faces are anchors (6 found: dom-years 2011, 2019, 2018, 2011, 2016, 2010);
  - smaller fragments attach to the single nearest anchor only if both `cent_dist ≤ 0.40` AND `|dom_year_anchor − dom_year_fragment| ≤ 5`;
  - anchors do not merge with each other (transitive merging produced anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier runs);
  - fragments with no qualifying anchor remain standalone.
- Per-era export. Composite-quality rank, single-face square PNG crops (`pad_ratio=0.5`, `out_size=512`), top-N + `_all.fsz` bundles, per-era `manifest.json`, `<label>.txt` marker, `THIN.txt` for buckets < 20 faces.
- Top-level manifest merge. New entries are appended to `facesets_swap_ready/manifest.json`. Operationally the THIN buckets are then moved into `_thin/` and partitioned into a `thin_eras` array (with `relpath: _thin/<name>`) so consumers reading `facesets` see only the substantive entries.
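The anchor-based assignment step is the part of the pipeline most easily misread, so here is a minimal sketch of the rule as described above. The constant names and thresholds come from this document; the `SubCluster` type, `cos_dist`, and `assign_fragments` are hypothetical and not taken from `age_split_001.py`. Treating a fragment without any EXIF year as standalone is an assumption, since the text does not say how undated fragments are handled.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional

ANCHOR_MIN_SIZE = 20          # sub-clusters with at least this many faces are anchors
FRAGMENT_CENTROID_MAX = 0.40  # max centroid cos-dist between fragment and anchor
FRAGMENT_YEAR_MAX = 5         # max gap between dominant EXIF years

@dataclass
class SubCluster:  # hypothetical container, for illustration only
    name: str
    size: int
    centroid: tuple[float, ...]
    dom_year: Optional[int]  # None if no face in the sub-cluster is dated

def cos_dist(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def assign_fragments(subs: list[SubCluster]) -> dict[str, Optional[str]]:
    """Map each fragment name to its anchor name, or None if it stays standalone."""
    anchors = [s for s in subs if s.size >= ANCHOR_MIN_SIZE]
    out: dict[str, Optional[str]] = {}
    for frag in subs:
        if frag.size >= ANCHOR_MIN_SIZE:
            continue  # anchors are never merged into each other
        out[frag.name] = None
        if not anchors or frag.dom_year is None:
            continue  # assumption: undated fragments stay standalone
        # Only the single nearest anchor is considered; there is no fallback
        # to the second-nearest, which is what prevents transitive chaining.
        nearest = min(anchors, key=lambda a: cos_dist(frag.centroid, a.centroid))
        if (nearest.dom_year is not None
                and cos_dist(frag.centroid, nearest.centroid) <= FRAGMENT_CENTROID_MAX
                and abs(nearest.dom_year - frag.dom_year) <= FRAGMENT_YEAR_MAX):
            out[frag.name] = nearest.name
    return out

subs = [
    SubCluster("A", 30, (1.0, 0.0), 2011),
    SubCluster("B", 25, (0.0, 1.0), 2019),
    SubCluster("f1", 3, (0.9, 0.1), 2012),   # close to A, 1 year apart
    SubCluster("f2", 2, (0.1, 0.9), 2010),   # close to B, but 9 years apart
    SubCluster("f3", 1, (-1.0, 0.2), 2019),  # far from both anchors
]
print(assign_fragments(subs))  # {'f1': 'A', 'f2': None, 'f3': None}
```

Because each fragment is tested against one anchor only and anchors are never compared with each other, a fragment cannot bridge two anchors the way a union-find merge can.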
4. Result
74 era buckets emitted; 6 substantive + 68 thin/standalone fragments.
| era | faces | dom year(s) |
|---|---|---|
| `faceset_001_2010-13` | 282 | 2011 |
| `faceset_001_2018-20` | 129 | 2019 |
| `faceset_001_2014-17` | 125 | 2018 (anchor sub 15 dom_year=2018) |
| `faceset_001_2018-19` | 107 | 2018 |
| `faceset_001_2005-10` | 88 | 2010 |
| `faceset_001_2011` | 43 | 2011 |
Two distinct 2011 anchors and two 2018-area anchors persist by design:
embedding-space distance separated them despite the year overlap. Era-label
collisions are disambiguated with a _v2 suffix, but only when two anchors
land on the same literal label string; none of the substantive six did.
The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic
embeddings; they are quarantined into _thin/ rather than deleted because
some are legitimate edge poses / lighting / age extremes that may be useful
for narrow targeted swaps.
5. Re-running and applying to other identities
- Re-run with different parameters: just re-execute `age_split_001.py`. Embeddings are loaded from cache, EXIF is loaded from `age_split_exif.json`, and only the sub-cluster + export steps re-run. Total runtime ~2 min.
- Apply to a different identity: copy `age_split_001.py` to `age_split_NNN.py` and change `FS001`. The `SCAN_ROOTS`, `RECOVERY_THRESHOLD`, `TIGHTEN_THRESHOLD`, `SUBCLUSTER_THRESHOLD`, `ANCHOR_MIN_SIZE`, `FRAGMENT_CENTROID_MAX`, and `FRAGMENT_YEAR_MAX` defaults are tuned for `faceset_001`'s ~707-face curated cluster; smaller identities likely need lower `ANCHOR_MIN_SIZE`.
- Always quarantine THIN buckets afterwards using the same partition pattern (move to `_thin/`, split top-level manifest into `facesets` + `thin_eras`). The script appends THIN entries to the top-level manifest as if they were full facesets, so the cleanup is a separate step.
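The quarantine step could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure actually used: it assumes each entry in the top-level `facesets` array carries a `name` key matching its directory, and it detects thin buckets by the `THIN.txt` marker. The function name `quarantine_thin` and the demo bucket `faceset_001_frag` are hypothetical.

```python
import json
import shutil
import tempfile
from pathlib import Path

def quarantine_thin(root: Path) -> dict:
    """Move THIN-marked buckets under _thin/ and split the top-level manifest."""
    manifest_path = root / "manifest.json"
    manifest = json.loads(manifest_path.read_text())
    keep = []
    thin = list(manifest.get("thin_eras", []))  # preserve earlier quarantines
    for entry in manifest["facesets"]:
        name = entry["name"]  # assumed key; the real schema may differ
        bucket = root / name
        if (bucket / "THIN.txt").exists():
            (root / "_thin").mkdir(exist_ok=True)
            shutil.move(str(bucket), str(root / "_thin" / name))
            thin.append({**entry, "relpath": f"_thin/{name}"})
        else:
            keep.append(entry)
    manifest["facesets"], manifest["thin_eras"] = keep, thin
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest

# Demo on a throwaway directory: one substantive bucket, one THIN bucket.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, is_thin in [("faceset_001_2010-13", False), ("faceset_001_frag", True)]:
        (root / name).mkdir()
        if is_thin:
            (root / name / "THIN.txt").touch()
    (root / "manifest.json").write_text(json.dumps(
        {"facesets": [{"name": "faceset_001_2010-13"}, {"name": "faceset_001_frag"}]}
    ))
    result = quarantine_thin(root)
    moved = (root / "_thin" / "faceset_001_frag").is_dir()
    stayed = (root / "faceset_001_2010-13").is_dir()
```

Running the function a second time is harmless under these assumptions, because already-quarantined entries live in `thin_eras` and are no longer listed in `facesets`.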