# Age-splitting faceset_001 into era-specific facesets
_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records). Source: `work/age_split_001.py` and `work/check_faceset001_age.py`._
## 1. Why split
`faceset_001` aggregates a single identity across roughly 20 years of source
material. The averaged embedding consumed by roop-unleashed therefore mixes
features from very different ages. For face-swap output that should target a
specific period (e.g. "this person around 2011" or "this person around
201819"), the identity needs to be split *after* clustering — the cluster is
correctly one identity, but the averaged embedding is the problem.
## 2. Evidence the identity is age-sortable
`work/check_faceset001_age.py` probes `faceset_001` (707 curated faces).

**Pairwise cos-distance histogram** (249,571 pairs):
| range | pairs |
|-------------|------:|
| [0.0, 0.2) | 1,250 |
| [0.2, 0.3) | 11,277 |
| [0.3, 0.4) | 63,920 |
| [0.4, 0.5) | 92,555 |
| [0.5, 0.6) | 63,288 |
| [0.6, 0.7) | 16,048 |
| [0.7, 0.8) | 1,217 |
| [0.8, 1.0) | 16 |

Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse: wide
enough to admit non-trivial sub-structure without crossing the
inter-identity boundary (which sits well above 0.6 in this dataset).
**Sub-clusters at threshold 0.35** (precomputed cos-dist, average linkage):
156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24].
The top sub-clusters align with distinct EXIF year medians (2011, 2019,
2018, 2011, 2010), so the split is meaningful.
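
The sub-cluster probe itself is standard agglomerative clustering on the
precomputed cosine-distance matrix; a sketch along those lines (reusing `emb`
from the snippet above, cut at 0.35) could look like:

```python
# Average-linkage agglomerative sub-clustering at cos-dist 0.35 (sketch).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

D = squareform(pdist(emb, metric="cosine"))        # square pairwise distance matrix
labels = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.35,
    metric="precomputed",                          # `affinity=` on scikit-learn < 1.2
    linkage="average",
).fit_predict(D)

sizes = np.bincount(labels)
order = np.argsort(sizes)[::-1]
print(f"{len(sizes)} sub-clusters, {(sizes >= 10).sum()} with >= 10 faces")
print("top-5 sizes:", sizes[order[:5]].tolist())
```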
## 3. Pipeline
`work/age_split_001.py`:
1. **Seed centroid.** Load the 707 face keys from
`facesets_swap_ready/faceset_001/manifest.json`; resolve to cache rows;
normalize the mean embedding.
2. **Wide recovery.** Pull every face record under `/mnt/x/src/{nl,
lzbkp_red}` from the cache with cos-dist ≤ 0.55 from the seed. The seed
is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501
faces from 4,756 candidates.
3. **Quality gate** (mirrors export-swap defaults): `face_short ≥ 100`,
`blur ≥ 40.0`, `det_score ≥ 0.6`. Result: 892 faces; a single
re-centroid + tighten pass at 0.50 (to absorb the recovery without
drift) leaves 856.
4. **Sub-cluster** the survivors at cos-dist 0.35 (precomputed agglomerative,
average linkage). 223 raw sub-clusters; sizes top-10 = [127, 97, 55, 42,
40, 25, 17, 14, 13, 11].
5. **EXIF year per source path.** Read `DateTimeOriginal` once per unique
path; cache on disk at `work/cache/age_split_exif.json` so re-runs after
parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths
were dated.
6. **Anchor-based fragment assignment** (replaces transitive union-find merge
that caused observable year drift; see the sketch after this list):
- sub-clusters with ≥ 20 faces are *anchors* (6 found: dom-years 2011,
2019, 2018, 2011, 2016, 2010);
- smaller fragments attach to the single nearest anchor *only if* both
`cent_dist ≤ 0.40` AND `|dom_year_anchor - dom_year_fragment| ≤ 5`;
- anchors do not merge with each other (transitive merging produced
anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier
runs);
- fragments with no qualifying anchor remain standalone.
7. **Per-era export.** Composite-quality rank, single-face square PNG crops
(`pad_ratio=0.5`, `out_size=512`), top-N + `_all` `.fsz` bundles, per-era
`manifest.json`, `<label>.txt` marker, `THIN.txt` for buckets < 20 faces.
8. **Top-level manifest merge.** New entries are appended to
`facesets_swap_ready/manifest.json`. Operationally the THIN buckets are
then moved into `_thin/` and partitioned into a `thin_eras` array (with
`relpath: _thin/<name>`) so consumers reading `facesets` see only the
substantive entries.
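
Because step 6 is where this run departs from the earlier union-find merge,
here is a hedged sketch of the assignment rule. The constant names mirror
those listed in section 5; `centroids`, `sizes` and `dom_years` are assumed
per-sub-cluster summaries, not the literal internals of `age_split_001.py`:

```python
# Anchor-based fragment assignment (sketch). Anchors never merge with each
# other; a fragment attaches to at most one anchor, and only when both the
# centroid-distance and the dominant-year constraints hold.
import numpy as np

ANCHOR_MIN_SIZE = 20          # sub-clusters at least this big become anchors
FRAGMENT_CENTROID_MAX = 0.40  # max cos-dist from fragment centroid to anchor centroid
FRAGMENT_YEAR_MAX = 5         # max |dom_year_anchor - dom_year_fragment|

def assign_fragments(centroids, sizes, dom_years):
    """centroids: (K, D) L2-normalised sub-cluster means; sizes, dom_years: length K."""
    anchors = [k for k, n in enumerate(sizes) if n >= ANCHOR_MIN_SIZE]
    assignment = {a: a for a in anchors}           # anchors stay their own bucket
    for k in range(len(sizes)):
        if k in assignment:
            continue
        if not anchors:
            assignment[k] = k
            continue
        dists = 1.0 - centroids[anchors] @ centroids[k]   # cos-dist to every anchor
        best = anchors[int(np.argmin(dists))]
        year_ok = (dom_years[k] is not None and dom_years[best] is not None
                   and abs(dom_years[best] - dom_years[k]) <= FRAGMENT_YEAR_MAX)
        if dists.min() <= FRAGMENT_CENTROID_MAX and year_ok:
            assignment[k] = best                   # attach to the single nearest anchor
        else:
            assignment[k] = k                      # no qualifying anchor: standalone
    return assignment
```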
## 4. Result
74 era buckets emitted: 6 substantive + 68 thin/standalone fragments.
| era | faces | dom year(s) |
|-------------------|------:|-------------|
| `faceset_001_2010-13` | 282 | 2011 |
| `faceset_001_2018-20` | 129 | 2019 |
| `faceset_001_2014-17` | 125 | 2018 (anchor sub 15 dom_year=2018) |
| `faceset_001_2018-19` | 107 | 2018 |
| `faceset_001_2005-10` | 88 | 2010 |
| `faceset_001_2011` | 43 | 2011 |

Two distinct 2011 anchors and two 2018-area anchors persist by design:
embedding-space distance separated them despite year overlap. Era-label
collisions are disambiguated with `_v2` suffixes, but only when two anchors
land on the *same* literal label string; none of the substantive six did.
The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic
embeddings; they are quarantined into `_thin/` rather than deleted because
some are legitimate edge poses / lighting / age extremes that may be useful
for narrow targeted swaps.
## 5. Re-running and applying to other identities
- **Re-run with different parameters**: just re-execute `age_split_001.py`.
Embeddings are loaded from cache, EXIF is loaded from
`age_split_exif.json`, and only the sub-cluster + export steps re-run.
Total runtime ~2 min.
- **Apply to a different identity**: copy `age_split_001.py` to
`age_split_NNN.py` and change `FS001`. The `SCAN_ROOTS`,
`RECOVERY_THRESHOLD`, `TIGHTEN_THRESHOLD`, `SUBCLUSTER_THRESHOLD`,
`ANCHOR_MIN_SIZE`, `FRAGMENT_CENTROID_MAX`, and `FRAGMENT_YEAR_MAX`
defaults are tuned for `faceset_001`'s ~707-face curated cluster; smaller
identities likely need lower `ANCHOR_MIN_SIZE`.
- **Always quarantine THIN buckets** afterwards using the same partition
pattern (move to `_thin/`, split top-level manifest into
`facesets` + `thin_eras`). The script appends THIN entries to the top-level
manifest as if they were full facesets, so the cleanup is a separate step (sketched below).
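
For the last point, a minimal sketch of the quarantine step. The `facesets`,
`name` and `relpath` keys are assumptions about the top-level manifest
layout; only the move into `_thin/` and the `facesets` + `thin_eras` split
are documented above:

```python
# Quarantine THIN buckets: move their directories under _thin/ and partition
# the top-level manifest into `facesets` (substantive) and `thin_eras`.
import json
import shutil
from pathlib import Path

root = Path("facesets_swap_ready")
manifest_path = root / "manifest.json"
manifest = json.loads(manifest_path.read_text())

kept, thin = [], []
for entry in manifest.get("facesets", []):
    name = entry["name"]                            # assumed key
    src = root / name
    if (src / "THIN.txt").exists():                 # marker written by the export step
        dst = root / "_thin" / name
        dst.parent.mkdir(exist_ok=True)
        shutil.move(str(src), str(dst))
        entry["relpath"] = f"_thin/{name}"
        thin.append(entry)
    else:
        kept.append(entry)

manifest["facesets"] = kept                         # consumers see only substantive eras
manifest["thin_eras"] = thin
manifest_path.write_text(json.dumps(manifest, indent=2))
```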