
Peter e48dd8aec7 Add age-split run analysis for faceset_001

Documents the 2026-04-26 split of faceset_001 (707 curated faces) into
6 substantive era buckets + 68 thin fragments, including the readiness
probe evidence, the anchor-based assignment rationale (replaces
transitive union-find that caused year-drift), and the re-run / apply-
to-other-identity workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-26 12:10:37 +02:00

5.6 KiB


Age-splitting faceset_001 into era-specific facesets

Run date: 2026-04-26. Cache: work/cache/nl_full.npz (5260 face records). Source: work/age_split_001.py and work/check_faceset001_age.py.

1. Why split

faceset_001 aggregates a single identity across roughly 20 years of source material. The averaged embedding consumed by roop-unleashed therefore mixes features from very different ages. For face-swap output that should target a specific period (e.g. "this person around 2011" or "this person around 2018–19"), the identity needs to be split after clustering: the cluster is correctly one identity, and the problem is only that a single averaged embedding spans all ages.

2. Evidence the identity is age-sortable

work/check_faceset001_age.py probes faceset_001 (707 curated faces).

Pairwise cos-distance histogram (249,571 pairs):

range	pairs
[0.0, 0.2)	1,250
[0.2, 0.3)	11,277
[0.3, 0.4)	63,920
[0.4, 0.5)	92,555
[0.5, 0.6)	63,288
[0.6, 0.7)	16,048
[0.7, 0.8)	1,217
[0.8, 1.0)	16

Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse — wide enough to admit non-trivial sub-structure without crossing the inter-identity boundary (which sits well above 0.6 in this dataset).
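The probe statistics above can be reproduced in outline as follows. This is a minimal pure-Python sketch, not the code in check_faceset001_age.py: the function names are hypothetical, embeddings are assumed to be plain lists of floats, and the real script presumably vectorises this over the cached arrays. The bin edges are copied from the table.

```python
import math
from itertools import combinations
from statistics import mean, median

# Bin edges copied from the histogram table; distances >= 1.0 are not binned.
EDGES = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]


def cos_dist(a, b):
    """Cosine distance 1 - cos(a, b), clamped at 0 against rounding error."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return max(0.0, 1.0 - dot / norm)


def n_pairs(n):
    """Number of unordered pairs among n faces."""
    return n * (n - 1) // 2


def pairwise_histogram(embeddings, edges=EDGES):
    """Return (bin counts, mean, median, max) over all unordered pairs."""
    dists = [cos_dist(a, b) for a, b in combinations(embeddings, 2)]
    counts = [0] * (len(edges) - 1)
    for d in dists:
        for i in range(len(counts)):
            if edges[i] <= d < edges[i + 1]:  # half-open bins, as in the table
                counts[i] += 1
                break
    return counts, mean(dists), median(dists), max(dists)
```

As a consistency check, n_pairs(707) gives 249,571, which matches both the pair count quoted above and the sum of the table rows.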

Sub-clusters at threshold 0.35 (agglomerative clustering on the precomputed cos-dist matrix, average linkage): 156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24]. The top sub-clusters align with distinct EXIF year medians (2011, 2019, 2018, 2011, 2010), so the embedding sub-structure tracks capture period and the split is meaningful.
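For readers unfamiliar with the technique, average-linkage agglomerative clustering cut at a distance threshold can be sketched as below. This is a naive pure-Python illustration under stated assumptions, not the probe's implementation: the function name is hypothetical, the input is assumed to be a full symmetric distance matrix as nested lists, and treating the threshold as inclusive is a guess. A library routine would be far faster at 707 faces.

```python
def average_linkage_clusters(dist, threshold):
    """Greedy average-linkage clustering on a precomputed distance matrix.

    Repeatedly merges the two clusters with the smallest mean pairwise
    distance, and stops once that smallest distance exceeds `threshold`.
    Returns clusters as lists of row indices, largest first.
    """
    clusters = [[i] for i in range(len(dist))]

    def avg(a, b):
        # Mean of all cross-cluster pairwise distances (average linkage).
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = avg(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:  # inclusive threshold is an assumption
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted(clusters, key=len, reverse=True)
```

Called with threshold 0.35 on the faceset's cos-dist matrix, the lengths of the returned clusters would correspond to the sub-cluster sizes reported above.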

3. Pipeline

work/age_split_001.py:

Seed centroid. Load the 707 face keys from facesets_swap_ready/faceset_001/manifest.json; resolve to cache rows; normalize the mean embedding.
Wide recovery. Pull every face record under /mnt/x/src/{nl, lzbkp_red} from the cache with cos-dist ≤ 0.55 from the seed. The seed is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501 faces from 4,756 candidates.
Quality gate (mirrors export-swap defaults): face_short ≥ 100, blur ≥ 40.0, det_score ≥ 0.6. Result: 892 faces pass the gate; one re-centroid + tighten pass at cos-dist 0.50, which absorbs the recovered faces without drift, leaves 856.
Sub-cluster the survivors at cos-dist 0.35 (precomputed agglomerative, average linkage). 223 raw sub-clusters; sizes top-10 = [127, 97, 55, 42, 40, 25, 17, 14, 13, 11].
EXIF year per source path. Read DateTimeOriginal once per unique path; cache on disk at work/cache/age_split_exif.json so re-runs after parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths were dated.
Anchor-based fragment assignment (replaces transitive union-find merge that caused observable year drift):
- sub-clusters with ≥ 20 faces are anchors (6 found: dom-years 2011, 2019, 2018, 2011, 2016, 2010);
- smaller fragments attach to the single nearest anchor only if both cent_dist ≤ 0.40 AND |dom_year_anchor − dom_year_fragment| ≤ 5;
- anchors do not merge with each other (transitive merging produced anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier runs);
- fragments with no qualifying anchor remain standalone.
Per-era export. Composite-quality rank, single-face square PNG crops (pad_ratio=0.5, out_size=512), top-N + _all .fsz bundles, per-era manifest.json, <label>.txt marker, THIN.txt for buckets < 20 faces.
Top-level manifest merge. New entries are appended to facesets_swap_ready/manifest.json. As a separate cleanup step after the script runs, the THIN buckets are moved into _thin/ and their manifest entries are moved into a thin_eras array (with relpath: _thin/<name>), so consumers reading facesets see only the substantive entries.
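The anchor-based fragment assignment described in the pipeline can be sketched as follows. This is a hedged illustration, not the code in age_split_001.py: the SubCluster type and assign_fragments function are hypothetical, the constant names are borrowed from the script's tunables with values taken from the rules above, and two behaviours are assumptions: only the single nearest anchor is tested (no fallback to the next-nearest), and fragments without a dominant EXIF year stay standalone.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional

ANCHOR_MIN_SIZE = 20          # sub-clusters at least this large are anchors
FRAGMENT_CENTROID_MAX = 0.40  # max cos-dist between fragment and anchor centroids
FRAGMENT_YEAR_MAX = 5         # max gap between dominant EXIF years


@dataclass
class SubCluster:  # hypothetical container for one sub-cluster
    name: str
    size: int
    centroid: List[float]
    dom_year: Optional[int]  # None when no source path was dated


def cos_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def assign_fragments(subclusters: List[SubCluster]) -> Dict[str, str]:
    """Map each sub-cluster name to the bucket it ends up in."""
    anchors = [s for s in subclusters if s.size >= ANCHOR_MIN_SIZE]
    bucket_of = {}
    for s in subclusters:
        # Default: anchors keep their own bucket (they never merge with
        # each other) and fragments stay standalone.
        bucket_of[s.name] = s.name
        if s.size >= ANCHOR_MIN_SIZE or not anchors or s.dom_year is None:
            continue
        # Only the single nearest anchor is considered (assumption).
        nearest = min(anchors, key=lambda a: cos_dist(a.centroid, s.centroid))
        close = cos_dist(nearest.centroid, s.centroid) <= FRAGMENT_CENTROID_MAX
        same_era = (nearest.dom_year is not None
                    and abs(nearest.dom_year - s.dom_year) <= FRAGMENT_YEAR_MAX)
        if close and same_era:
            bucket_of[s.name] = nearest.name
    return bucket_of
```

Because every fragment is compared against fixed anchors rather than against other fragments, there is no chain of pairwise merges along which a bucket's dominant year could drift, which is the failure mode attributed to the union-find variant.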

4. Result

74 era buckets emitted; 6 substantive + 68 thin/standalone fragments.

era	faces	dom year(s)
`faceset_001_2010-13`	282	2011
`faceset_001_2018-20`	129	2019
`faceset_001_2014-17`	125	2018 (anchor sub 15 dom_year=2018)
`faceset_001_2018-19`	107	2018
`faceset_001_2005-10`	88	2010
`faceset_001_2011`	43	2011

Two distinct 2011 anchors and two 2018-area anchors persist by design: embedding-space distance separated them despite the year overlap. Era-label collisions are disambiguated with a _v2 suffix, but only when two anchors land on the same literal label string; none of the substantive six did.
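The collision rule can be illustrated with a small sketch. The function name is hypothetical, and extending the scheme to _v3 and beyond for a third or later collision is an assumption; the source only mentions _v2. The sketch also does not guard against an input label that already ends in such a suffix.

```python
def disambiguate(labels):
    """Suffix repeated labels so every emitted bucket name is unique.

    The first occurrence keeps its label; the n-th occurrence (n >= 2)
    becomes '<label>_v<n>'. Labels are processed in the given order.
    """
    seen = {}
    out = []
    for label in labels:
        seen[label] = seen.get(label, 0) + 1
        n = seen[label]
        out.append(label if n == 1 else f"{label}_v{n}")
    return out
```

With the six substantive labels in the table, every label is already distinct, so the function returns them unchanged.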

The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic embeddings; they are quarantined into _thin/ rather than deleted because some are legitimate edge poses / lighting / age extremes that may be useful for narrow targeted swaps.

5. Re-running and applying to other identities

Re-run with different parameters: just re-execute age_split_001.py. Embeddings are loaded from cache, EXIF is loaded from age_split_exif.json, and only the sub-cluster + export steps re-run. Total runtime ~2 min.
Apply to a different identity: copy age_split_001.py to age_split_NNN.py and change FS001. The SCAN_ROOTS, RECOVERY_THRESHOLD, TIGHTEN_THRESHOLD, SUBCLUSTER_THRESHOLD, ANCHOR_MIN_SIZE, FRAGMENT_CENTROID_MAX, and FRAGMENT_YEAR_MAX defaults are tuned for faceset_001's ~707-face curated cluster; smaller identities will likely need a lower ANCHOR_MIN_SIZE.
Always quarantine THIN buckets afterwards using the same partition pattern (move to _thin/, split top-level manifest into facesets + thin_eras). The script appends THIN entries to the top-level manifest as if they were full facesets, so the cleanup is a separate step.
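The quarantine step could be scripted along these lines. This is a sketch under assumptions, not an existing tool in the repository: the quarantine_thin function is hypothetical, the top-level manifest is assumed to be a JSON object whose facesets array holds entries with a name key matching the bucket directory, and a bucket is treated as thin when it contains the THIN.txt marker. The facesets, thin_eras, and relpath names come from the description above.

```python
import json
import shutil
from pathlib import Path


def quarantine_thin(root):
    """Move THIN buckets under _thin/ and split the top-level manifest.

    Entries whose bucket directory holds a THIN.txt marker are moved from
    the `facesets` array to `thin_eras`, with relpath '_thin/<name>'.
    """
    root = Path(root)
    manifest_path = root / "manifest.json"
    manifest = json.loads(manifest_path.read_text())
    thin_dir = root / "_thin"

    keep = []
    thin = list(manifest.get("thin_eras", []))  # preserve earlier quarantines
    for entry in manifest["facesets"]:
        bucket = root / entry["name"]  # 'name' key is an assumption
        if (bucket / "THIN.txt").exists():
            thin_dir.mkdir(exist_ok=True)
            shutil.move(str(bucket), str(thin_dir / entry["name"]))
            thin.append({**entry, "relpath": f"_thin/{entry['name']}"})
        else:
            keep.append(entry)

    manifest["facesets"] = keep
    manifest["thin_eras"] = thin
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

Running it with the facesets_swap_ready directory as root would be the intended use; it is worth trying on a copy first, since it moves directories and rewrites manifest.json in place.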
