face-sets/docs/analysis/age-split-faceset-001.md
Peter e48dd8aec7 Add age-split run analysis for faceset_001
Documents the 2026-04-26 split of faceset_001 (707 curated faces) into
6 substantive era buckets + 68 thin fragments, including the readiness
probe evidence, the anchor-based assignment rationale (replaces
transitive union-find that caused year-drift), and the re-run / apply-
to-other-identity workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:10:37 +02:00


Age-splitting faceset_001 into era-specific facesets

Run date: 2026-04-26. Cache: work/cache/nl_full.npz (5260 face records). Source: work/age_split_001.py and work/check_faceset001_age.py.

1. Why split

faceset_001 aggregates a single identity across roughly 20 years of source material. The averaged embedding consumed by roop-unleashed therefore mixes features from very different ages. For face-swap output that should target a specific period (e.g. "this person around 2011" or "this person around 2018-19"), the identity needs to be split after clustering: the cluster is correctly one identity, and it is the averaged embedding that causes the problem.

2. Evidence the identity is age-sortable

work/check_faceset001_age.py probes faceset_001 (707 curated faces).

Pairwise cos-distance histogram (249,571 pairs):

| range (cos-dist) | pairs |
|---|---|
| [0.0, 0.2) | 1,250 |
| [0.2, 0.3) | 11,277 |
| [0.3, 0.4) | 63,920 |
| [0.4, 0.5) | 92,555 |
| [0.5, 0.6) | 63,288 |
| [0.6, 0.7) | 16,048 |
| [0.7, 0.8) | 1,217 |
| [0.8, 1.0) | 16 |

Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse — wide enough to admit non-trivial sub-structure without crossing the inter-identity boundary (which sits well above 0.6 in this dataset).
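
A histogram like the one above can be reproduced with a few lines of Python. This is a minimal sketch and not the code in work/check_faceset001_age.py: the function names are made up for illustration, it uses only the standard library for portability, and it assumes the embeddings arrive as plain lists of floats. A vectorized NumPy version would be the more natural choice for 707 faces.

```python
import math
from itertools import combinations

# Bin edges copied from the table above; the last bin is wider than the others.
BIN_EDGES = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cos_dist(a, b):
    """Cosine distance between two unit vectors, clamped at 0 against rounding."""
    return max(0.0, 1.0 - sum(x * y for x, y in zip(a, b)))


def pairwise_histogram(embeddings, edges=BIN_EDGES):
    """Count unordered pairs per half-open bin [lo, hi).

    Pairs at or beyond the last edge are not counted (hypothetical choice;
    the table above has no such pairs because the maximum is 0.842).
    """
    unit = [l2_normalize(e) for e in embeddings]
    counts = [0] * (len(edges) - 1)
    for a, b in combinations(unit, 2):
        d = cos_dist(a, b)
        for i in range(len(counts)):
            if edges[i] <= d < edges[i + 1]:
                counts[i] += 1
                break
    return counts
```

As a sanity check on the table, n faces yield n(n-1)/2 unordered pairs, so 707 faces give 707 × 706 / 2 = 249,571 pairs, which matches both the stated total and the sum of the bins.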

Sub-clusters at threshold 0.35 (precomputed cos-dist, average linkage): 156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24]. The top sub-clusters align with distinct EXIF year medians (2011, 2019, 2018, 2011, 2010), so the split is meaningful.
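
To make "precomputed cos-dist, average linkage" concrete, the sketch below implements threshold-cut average-linkage clustering on a distance matrix in plain Python. It illustrates the technique only and is not the probe's actual code. The function name is hypothetical, and the stop rule (no merge once the closest pair is at or above the threshold) is an assumption about how the cut is applied. In practice a library routine such as scikit-learn's AgglomerativeClustering or SciPy's hierarchical clustering would be used instead, since this naive loop is roughly cubic in the number of faces.

```python
def average_linkage_clusters(dist, threshold):
    """Greedy average-linkage clustering on a precomputed distance matrix.

    dist is a symmetric list of lists. Merging stops once the closest pair of
    clusters is at or above threshold. Returns index lists, largest first.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average linkage: mean distance over all cross-cluster member pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] >= threshold:
            break  # every remaining pair is too far apart to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted(clusters, key=len, reverse=True)
```

Called with the 707 × 707 cos-dist matrix and threshold 0.35, the list of cluster sizes would correspond to the "top-5 sizes" figure quoted above.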

3. Pipeline

work/age_split_001.py:

  1. Seed centroid. Load the 707 face keys from facesets_swap_ready/faceset_001/manifest.json; resolve to cache rows; normalize the mean embedding.
  2. Wide recovery. Pull every face record under /mnt/x/src/{nl, lzbkp_red} from the cache with cos-dist ≤ 0.55 from the seed. The seed is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501 faces from 4,756 candidates.
  3. Quality gate (mirrors export-swap defaults): face_short ≥ 100, blur ≥ 40.0, det_score ≥ 0.6. Result: 892 faces pass the gate, reduced to 856 after one re-centroid + tighten pass at cos-dist 0.50, which absorbs the recovered faces without drift.
  4. Sub-cluster the survivors at cos-dist 0.35 (precomputed agglomerative, average linkage). 223 raw sub-clusters; sizes top-10 = [127, 97, 55, 42, 40, 25, 17, 14, 13, 11].
  5. EXIF year per source path. Read DateTimeOriginal once per unique path; cache on disk at work/cache/age_split_exif.json so re-runs after parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths were dated.
  6. Anchor-based fragment assignment (replaces transitive union-find merge that caused observable year drift):
    • sub-clusters with ≥ 20 faces are anchors (6 found: dom-years 2011, 2019, 2018, 2011, 2016, 2010);
    • smaller fragments attach to the single nearest anchor only if both cent_dist ≤ 0.40 AND |dom_year_anchor - dom_year_fragment| ≤ 5;
    • anchors do not merge with each other (transitive merging produced anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier runs);
    • fragments with no qualifying anchor remain standalone.
  7. Per-era export. Composite-quality rank, single-face square PNG crops (pad_ratio=0.5, out_size=512), top-N + _all .fsz bundles, per-era manifest.json, <label>.txt marker, THIN.txt for buckets < 20 faces.
  8. Top-level manifest merge. New entries are appended to facesets_swap_ready/manifest.json. Operationally the THIN buckets are then moved into _thin/ and partitioned into a thin_eras array (with relpath: _thin/<name>) so consumers reading facesets see only the substantive entries.
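
The anchor rule in step 6 is the part of the pipeline most worth pinning down, so here is a minimal sketch of it. The constant names and values come from this document. The SubCluster type, the function names, and the handling of undated fragments (they stay standalone) are assumptions made for illustration. It also reads "single nearest anchor" literally: if the nearest anchor fails either test, the fragment is not offered to the next-nearest one.

```python
from dataclasses import dataclass
from typing import Optional

ANCHOR_MIN_SIZE = 20          # sub-clusters at least this large become anchors
FRAGMENT_CENTROID_MAX = 0.40  # max centroid cos-dist from fragment to anchor
FRAGMENT_YEAR_MAX = 5         # max gap between dominant EXIF years


@dataclass
class SubCluster:             # hypothetical container, not from the script
    name: str
    size: int
    centroid: list            # assumed to be a unit-length embedding centroid
    dom_year: Optional[int]   # dominant EXIF year, None if undated


def cos_dist(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def assign_fragments(subclusters):
    """Return (groups, standalone): anchor name -> member names, plus leftovers."""
    anchors = [s for s in subclusters if s.size >= ANCHOR_MIN_SIZE]
    groups = {a.name: [a.name] for a in anchors}
    standalone = []
    for frag in subclusters:
        if frag.size >= ANCHOR_MIN_SIZE:
            continue  # anchors are never merged into each other
        # Only the single nearest anchor is considered.
        nearest = min(
            anchors,
            key=lambda a: cos_dist(frag.centroid, a.centroid),
            default=None,
        )
        if (
            nearest is not None
            and frag.dom_year is not None      # assumption: undated stays standalone
            and nearest.dom_year is not None
            and cos_dist(frag.centroid, nearest.centroid) <= FRAGMENT_CENTROID_MAX
            and abs(nearest.dom_year - frag.dom_year) <= FRAGMENT_YEAR_MAX
        ):
            groups[nearest.name].append(frag.name)
        else:
            standalone.append(frag.name)
    return groups, standalone
```

Because every fragment is compared against a fixed anchor and never against another fragment, a chain such as 2010 → 2014 → 2018 cannot form: each member of a group is within 5 years of its anchor's dominant year, which is the property the transitive union-find merge lacked.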

4. Result

74 era buckets emitted; 6 substantive + 68 thin/standalone fragments.

| era | faces | dom year(s) |
|---|---|---|
| faceset_001_2010-13 | 282 | 2011 |
| faceset_001_2018-20 | 129 | 2019 |
| faceset_001_2014-17 | 125 | 2018 (anchor sub 15 dom_year=2018) |
| faceset_001_2018-19 | 107 | 2018 |
| faceset_001_2005-10 | 88 | 2010 |
| faceset_001_2011 | 43 | 2011 |

Two distinct 2011 anchors and two 2018-area anchors persist by design: embedding-space distance separated them despite the year overlap. Era-label collisions are disambiguated with _v2 suffixes, but only when two anchors land on the same literal label string; none of the substantive six did.

The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic embeddings; they are quarantined into _thin/ rather than deleted because some are legitimate edge poses / lighting / age extremes that may be useful for narrow targeted swaps.

5. Re-running and applying to other identities

  • Re-run with different parameters: just re-execute age_split_001.py. Embeddings are loaded from cache, EXIF is loaded from age_split_exif.json, and only the sub-cluster + export steps re-run. Total runtime ~2 min.
  • Apply to a different identity: copy age_split_001.py to age_split_NNN.py and change FS001. The SCAN_ROOTS, RECOVERY_THRESHOLD, TIGHTEN_THRESHOLD, SUBCLUSTER_THRESHOLD, ANCHOR_MIN_SIZE, FRAGMENT_CENTROID_MAX, and FRAGMENT_YEAR_MAX defaults are tuned for faceset_001's ~707-face curated cluster; smaller identities will likely need a lower ANCHOR_MIN_SIZE.
  • Always quarantine THIN buckets afterwards using the same partition pattern (move to _thin/, split top-level manifest into facesets + thin_eras). The script appends THIN entries to the top-level manifest as if they were full facesets, so the cleanup is a separate step.
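
The quarantine step could be scripted along the following lines. This is a sketch under stated assumptions, not an existing tool in the repository. The function name is hypothetical. It assumes each top-level manifest entry is an object with a "name" field and an optional "relpath" field, and it treats the presence of a THIN.txt marker in the bucket directory as the signal that the bucket is thin. Only the facesets and thin_eras keys, the _thin/ directory, and the relpath: _thin/<name> form are taken from this document.

```python
import json
import shutil
from pathlib import Path


def quarantine_thin(root):
    """Move THIN buckets under _thin/ and split the top-level manifest.

    root is the directory holding manifest.json (facesets_swap_ready in this
    document). Returns the rewritten manifest as a dict.
    """
    root = Path(root)
    manifest_path = root / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    thin_dir = root / "_thin"

    keep = []
    thin = list(manifest.get("thin_eras", []))  # preserve earlier quarantines
    for entry in manifest.get("facesets", []):
        # "name" and "relpath" are assumed field names.
        src = root / entry.get("relpath", entry["name"])
        if (src / "THIN.txt").is_file():
            thin_dir.mkdir(exist_ok=True)
            shutil.move(str(src), str(thin_dir / entry["name"]))
            thin.append({**entry, "relpath": f"_thin/{entry['name']}"})
        else:
            keep.append(entry)

    manifest["facesets"] = keep
    manifest["thin_eras"] = thin
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Running it a second time is harmless, since already-quarantined entries live in thin_eras and are no longer scanned. Backing up manifest.json before the first run is still advisable, because the function rewrites it in place.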