Documents the 2026-04-26 split of faceset_001 (707 curated faces) into 6 substantive era buckets + 68 thin fragments, including the readiness probe evidence, the anchor-based assignment rationale (replaces transitive union-find that caused year-drift), and the re-run / apply- to-other-identity workflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
120 lines
5.6 KiB
Markdown
120 lines
5.6 KiB
Markdown
# Age-splitting faceset_001 into era-specific facesets
|
||
|
||
_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records). Source: `work/age_split_001.py` and `work/check_faceset001_age.py`._
|
||
|
||
## 1. Why split
|
||
|
||
`faceset_001` aggregates a single identity across roughly 20 years of source
|
||
material. The averaged embedding consumed by roop-unleashed therefore mixes
|
||
features from very different ages. For face-swap output that should target a
|
||
specific period (e.g. "this person around 2011" or "this person around
|
||
2018–19"), the identity needs to be split *after* clustering — the cluster is
|
||
correctly one identity, but the averaged embedding is the problem.
|
||
|
||
## 2. Evidence the identity is age-sortable
|
||
|
||
`work/check_faceset001_age.py` probes `faceset_001` (707 curated faces).
|
||
|
||
**Pairwise cos-distance histogram** (249,571 pairs):
|
||
|
||
| range | pairs |
|
||
|-------------|------:|
|
||
| [0.0, 0.2) | 1,250 |
|
||
| [0.2, 0.3) | 11,277 |
|
||
| [0.3, 0.4) | 63,920 |
|
||
| [0.4, 0.5) | 92,555 |
|
||
| [0.5, 0.6) | 63,288 |
|
||
| [0.6, 0.7) | 16,048 |
|
||
| [0.7, 0.8) | 1,217 |
|
||
| [0.8, 1.0) | 16 |
|
||
|
||
Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse — wide
|
||
enough to admit non-trivial sub-structure without crossing the
|
||
inter-identity boundary (which sits well above 0.6 in this dataset).
|
||
|
||
**Sub-clusters at threshold 0.35** (precomputed cos-dist, average linkage):
|
||
156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24].
|
||
The top sub-clusters align with distinct EXIF year medians (2011, 2019,
|
||
2018, 2011, 2010), so the split is meaningful.
|
||
|
||
## 3. Pipeline
|
||
|
||
`work/age_split_001.py`:
|
||
|
||
1. **Seed centroid.** Load the 707 face keys from
|
||
`facesets_swap_ready/faceset_001/manifest.json`; resolve to cache rows;
|
||
normalize the mean embedding.
|
||
2. **Wide recovery.** Pull every face record under `/mnt/x/src/{nl,
|
||
lzbkp_red}` from the cache with cos-dist ≤ 0.55 from the seed. The seed
|
||
is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501
|
||
faces from 4,756 candidates.
|
||
3. **Quality gate** (mirrors export-swap defaults): `face_short ≥ 100`,
|
||
`blur ≥ 40.0`, `det_score ≥ 0.6`. Result: 892 → 856 after one
|
||
re-centroid + tighten pass at 0.50 to absorb the recovery without
|
||
drift.
|
||
4. **Sub-cluster** the survivors at cos-dist 0.35 (precomputed agglomerative,
|
||
average linkage). 223 raw sub-clusters; sizes top-10 = [127, 97, 55, 42,
|
||
40, 25, 17, 14, 13, 11].
|
||
5. **EXIF year per source path.** Read `DateTimeOriginal` once per unique
|
||
path; cache on disk at `work/cache/age_split_exif.json` so re-runs after
|
||
parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths
|
||
were dated.
|
||
6. **Anchor-based fragment assignment** (replaces transitive union-find merge
|
||
that caused observable year drift):
|
||
- sub-clusters with ≥ 20 faces are *anchors* (6 found: dom-years 2011,
|
||
2019, 2018, 2011, 2016, 2010);
|
||
- smaller fragments attach to the single nearest anchor *only if* both
|
||
`cent_dist ≤ 0.40` AND `|dom_year_anchor − dom_year_fragment| ≤ 5`;
|
||
- anchors do not merge with each other (transitive merging produced
|
||
anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier
|
||
runs);
|
||
- fragments with no qualifying anchor remain standalone.
|
||
7. **Per-era export.** Composite-quality rank, single-face square PNG crops
|
||
(`pad_ratio=0.5`, `out_size=512`), top-N + `_all` `.fsz` bundles, per-era
|
||
`manifest.json`, `<label>.txt` marker, `THIN.txt` for buckets < 20 faces.
|
||
8. **Top-level manifest merge.** New entries are appended to
|
||
`facesets_swap_ready/manifest.json`. Operationally the THIN buckets are
|
||
then moved into `_thin/` and partitioned into a `thin_eras` array (with
|
||
`relpath: _thin/<name>`) so consumers reading `facesets` see only the
|
||
substantive entries.
|
||
|
||
## 4. Result
|
||
|
||
74 era buckets emitted; 6 substantive + 68 thin/standalone fragments.
|
||
|
||
| era | faces | dom year(s) |
|
||
|-------------------|------:|-------------|
|
||
| `faceset_001_2010-13` | 282 | 2011 |
|
||
| `faceset_001_2018-20` | 129 | 2019 |
|
||
| `faceset_001_2014-17` | 125 | 2018 (anchor sub 15 dom_year=2018) |
|
||
| `faceset_001_2018-19` | 107 | 2018 |
|
||
| `faceset_001_2005-10` | 88 | 2010 |
|
||
| `faceset_001_2011` | 43 | 2011 |
|
||
|
||
Two distinct 2011 anchors and two 2018-area anchors persist by design —
|
||
embedding-space distance separated them despite year overlap. The era-label
|
||
collisions are disambiguated with `_v2` suffixes, but only when both anchors
|
||
landed on the *same* literal label string (none of the substantive six did).
|
||
|
||
The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic
|
||
embeddings; they are quarantined into `_thin/` rather than deleted because
|
||
some are legitimate edge poses / lighting / age extremes that may be useful
|
||
for narrow targeted swaps.
|
||
|
||
## 5. Re-running and applying to other identities
|
||
|
||
- **Re-run with different parameters**: just re-execute `age_split_001.py`.
|
||
Embeddings are loaded from cache, EXIF is loaded from
|
||
`age_split_exif.json`, and only the sub-cluster + export steps re-run.
|
||
Total runtime ~2 min.
|
||
- **Apply to a different identity**: copy `age_split_001.py` to
|
||
`age_split_NNN.py` and change `FS001`. The `SCAN_ROOTS`,
|
||
`RECOVERY_THRESHOLD`, `TIGHTEN_THRESHOLD`, `SUBCLUSTER_THRESHOLD`,
|
||
`ANCHOR_MIN_SIZE`, `FRAGMENT_CENTROID_MAX`, and `FRAGMENT_YEAR_MAX`
|
||
defaults are tuned for `faceset_001`'s ~707-face curated cluster; smaller
|
||
identities likely need lower `ANCHOR_MIN_SIZE`.
|
||
- **Always quarantine THIN buckets** afterwards using the same partition
|
||
pattern (move to `_thin/`, split top-level manifest into
|
||
`facesets` + `thin_eras`). The script appends THIN entries to the top-level
|
||
manifest as if they were full facesets, so the cleanup is a separate step.
|