# Age-splitting faceset_001 into era-specific facesets
_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records). Source: `work/age_split_001.py` and `work/check_faceset001_age.py`._
## 1. Why split
`faceset_001` aggregates a single identity across roughly 20 years of source
material. The averaged embedding consumed by roop-unleashed therefore mixes
features from very different ages. For face-swap output that should target a
specific period (e.g. "this person around 2011" or "this person around
201819"), the identity needs to be split *after* clustering — the cluster is
correctly one identity, but the averaged embedding is the problem.
## 2. Evidence the identity is age-sortable
`work/check_faceset001_age.py` probes `faceset_001` (707 curated faces).

**Pairwise cos-distance histogram** (249,571 pairs):
| range | pairs |
|-------------|------:|
| [0.0, 0.2) | 1,250 |
| [0.2, 0.3) | 11,277 |
| [0.3, 0.4) | 63,920 |
| [0.4, 0.5) | 92,555 |
| [0.5, 0.6) | 63,288 |
| [0.6, 0.7) | 16,048 |
| [0.7, 0.8) | 1,217 |
| [0.8, 1.0) | 16 |

Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse: wide
enough to admit non-trivial sub-structure without crossing the
inter-identity boundary (which sits well above 0.6 in this dataset).
**Sub-clusters at threshold 0.35** (precomputed cos-dist, average linkage):
156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24].
The top sub-clusters align with distinct EXIF year medians (2011, 2019,
2018, 2011, 2010), so the split is meaningful.
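
The sub-cluster probe itself is standard agglomerative clustering on the
precomputed cosine-distance matrix; a sketch along those lines (reusing `emb`
from the snippet above, cut at 0.35) could look like:

```python
# Average-linkage agglomerative sub-clustering at cos-dist 0.35 (sketch).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

D = squareform(pdist(emb, metric="cosine"))        # square pairwise distance matrix
labels = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.35,
    metric="precomputed",                          # `affinity=` on scikit-learn < 1.2
    linkage="average",
).fit_predict(D)

sizes = np.bincount(labels)
order = np.argsort(sizes)[::-1]
print(f"{len(sizes)} sub-clusters, {(sizes >= 10).sum()} with >= 10 faces")
print("top-5 sizes:", sizes[order[:5]].tolist())
```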
## 3. Pipeline
`work/age_split_001.py`:
1. **Seed centroid.** Load the 707 face keys from
`facesets_swap_ready/faceset_001/manifest.json`; resolve to cache rows;
normalize the mean embedding.
2. **Wide recovery.** Pull every face record under `/mnt/x/src/{nl,
lzbkp_red}` from the cache with cos-dist ≤ 0.55 from the seed. The seed
is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501
faces from 4,756 candidates.
3. **Quality gate** (mirrors export-swap defaults): `face_short ≥ 100`,
`blur ≥ 40.0`, `det_score ≥ 0.6`. Result: 892 faces; a single
re-centroid + tighten pass at 0.50 (to absorb the recovery without
drift) leaves 856.
4. **Sub-cluster** the survivors at cos-dist 0.35 (precomputed agglomerative,
average linkage). 223 raw sub-clusters; sizes top-10 = [127, 97, 55, 42,
40, 25, 17, 14, 13, 11].
5. **EXIF year per source path.** Read `DateTimeOriginal` once per unique
path; cache on disk at `work/cache/age_split_exif.json` so re-runs after
parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths
were dated.
6. **Anchor-based fragment assignment** (replaces transitive union-find merge
that caused observable year drift; see the sketch after this list):
- sub-clusters with ≥ 20 faces are *anchors* (6 found: dom-years 2011,
2019, 2018, 2011, 2016, 2010);
- smaller fragments attach to the single nearest anchor *only if* both
`cent_dist ≤ 0.40` AND `|dom_year_anchor - dom_year_fragment| ≤ 5`;
- anchors do not merge with each other (transitive merging produced
anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier
runs);
- fragments with no qualifying anchor remain standalone.
7. **Per-era export.** Composite-quality rank, single-face square PNG crops
(`pad_ratio=0.5`, `out_size=512`), top-N + `_all` `.fsz` bundles, per-era
`manifest.json`, `<label>.txt` marker, `THIN.txt` for buckets < 20 faces.
8. **Top-level manifest merge.** New entries are appended to
`facesets_swap_ready/manifest.json`. Operationally the THIN buckets are
then moved into `_thin/` and partitioned into a `thin_eras` array (with
`relpath: _thin/<name>`) so consumers reading `facesets` see only the
substantive entries.
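
Because step 6 is where this run departs from the earlier union-find merge,
here is a hedged sketch of the assignment rule. The constant names mirror
those listed in section 5; `centroids`, `sizes` and `dom_years` are assumed
per-sub-cluster summaries, not the literal internals of `age_split_001.py`:

```python
# Anchor-based fragment assignment (sketch). Anchors never merge with each
# other; a fragment attaches to at most one anchor, and only when both the
# centroid-distance and the dominant-year constraints hold.
import numpy as np

ANCHOR_MIN_SIZE = 20          # sub-clusters at least this big become anchors
FRAGMENT_CENTROID_MAX = 0.40  # max cos-dist from fragment centroid to anchor centroid
FRAGMENT_YEAR_MAX = 5         # max |dom_year_anchor - dom_year_fragment|

def assign_fragments(centroids, sizes, dom_years):
    """centroids: (K, D) L2-normalised sub-cluster means; sizes, dom_years: length K."""
    anchors = [k for k, n in enumerate(sizes) if n >= ANCHOR_MIN_SIZE]
    assignment = {a: a for a in anchors}           # anchors stay their own bucket
    for k in range(len(sizes)):
        if k in assignment:
            continue
        if not anchors:
            assignment[k] = k
            continue
        dists = 1.0 - centroids[anchors] @ centroids[k]   # cos-dist to every anchor
        best = anchors[int(np.argmin(dists))]
        year_ok = (dom_years[k] is not None and dom_years[best] is not None
                   and abs(dom_years[best] - dom_years[k]) <= FRAGMENT_YEAR_MAX)
        if dists.min() <= FRAGMENT_CENTROID_MAX and year_ok:
            assignment[k] = best                   # attach to the single nearest anchor
        else:
            assignment[k] = k                      # no qualifying anchor: standalone
    return assignment
```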
## 4. Result
74 era buckets emitted: 6 substantive + 68 thin/standalone fragments.
| era | faces | dom year(s) |
|-------------------|------:|-------------|
| `faceset_001_2010-13` | 282 | 2011 |
| `faceset_001_2018-20` | 129 | 2019 |
| `faceset_001_2014-17` | 125 | 2018 (anchor sub 15 dom_year=2018) |
| `faceset_001_2018-19` | 107 | 2018 |
| `faceset_001_2005-10` | 88 | 2010 |
| `faceset_001_2011` | 43 | 2011 |

Two distinct 2011 anchors and two 2018-area anchors persist by design:
embedding-space distance separated them despite year overlap. Era-label
collisions are disambiguated with `_v2` suffixes, but only when two anchors
land on the *same* literal label string; none of the substantive six did.
The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic
embeddings; they are quarantined into `_thin/` rather than deleted because
some are legitimate edge poses / lighting / age extremes that may be useful
for narrow targeted swaps.
## 5. Re-running and applying to other identities
- **Re-run with different parameters**: just re-execute `age_split_001.py`.
Embeddings are loaded from cache, EXIF is loaded from
`age_split_exif.json`, and only the sub-cluster + export steps re-run.
Total runtime ~2 min.
- **Apply to a different identity**: copy `age_split_001.py` to
`age_split_NNN.py` and change `FS001`. The `SCAN_ROOTS`,
`RECOVERY_THRESHOLD`, `TIGHTEN_THRESHOLD`, `SUBCLUSTER_THRESHOLD`,
`ANCHOR_MIN_SIZE`, `FRAGMENT_CENTROID_MAX`, and `FRAGMENT_YEAR_MAX`
defaults are tuned for `faceset_001`'s ~707-face curated cluster; smaller
identities likely need lower `ANCHOR_MIN_SIZE`.
- **Always quarantine THIN buckets** afterwards using the same partition
pattern (move to `_thin/`, split top-level manifest into
`facesets` + `thin_eras`). The script appends THIN entries to the top-level
manifest as if they were full facesets, so the cleanup is a separate step (sketched below).
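
For the last point, a minimal sketch of the quarantine step. The `facesets`,
`name` and `relpath` keys are assumptions about the top-level manifest
layout; only the move into `_thin/` and the `facesets` + `thin_eras` split
are documented above:

```python
# Quarantine THIN buckets: move their directories under _thin/ and partition
# the top-level manifest into `facesets` (substantive) and `thin_eras`.
import json
import shutil
from pathlib import Path

root = Path("facesets_swap_ready")
manifest_path = root / "manifest.json"
manifest = json.loads(manifest_path.read_text())

kept, thin = [], []
for entry in manifest.get("facesets", []):
    name = entry["name"]                            # assumed key
    src = root / name
    if (src / "THIN.txt").exists():                 # marker written by the export step
        dst = root / "_thin" / name
        dst.parent.mkdir(exist_ok=True)
        shutil.move(str(src), str(dst))
        entry["relpath"] = f"_thin/{name}"
        thin.append(entry)
    else:
        kept.append(entry)

manifest["facesets"] = kept                         # consumers see only substantive eras
manifest["thin_eras"] = thin
manifest_path.write_text(json.dumps(manifest, indent=2))
```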