# Identity consolidation + age-bucket extension

_Run date: 2026-04-27. Driver scripts: `work/consolidate_facesets.py`, `work/age_extend_001.py`._

After the Immich peter + nic imports added 280 new facesets to a corpus that had ~25 canonical identities, many "new" identities turned out to be duplicates of existing household members, clustered at lower confidence. Two cooperating passes clean this up: identity consolidation merges the duplicates, then age-extend slots the newly merged PNGs into the existing era buckets of `faceset_001`.

## 1. Identity consolidation

### 1.1 Approach

For each active faceset, pull cached arcface embeddings from `work/cache/{nl_full,immich_peter,immich_nic}.npz`, keyed by `(source, bbox)` from the per-faceset manifest's `faces[]`. Compute the L2-normalized centroid of each faceset, then the pairwise cosine similarity matrix over those centroids.

**Tier-based primary selection** (lowest tier number wins, size breaks ties):

| tier | sources | rationale |
|-----:|---------|-----------|
| 0 | `faceset_013..019` (hand-sorted) | user's curated labels |
| 1 | `faceset_001..012` (auto-clustered) | well-established household |
| 2 | `faceset_020..025` (osrc) | mixed-bucket discovery |
| 3 | `faceset_026..264` (immich peter) | speculative |
| 4 | `faceset_265+` (immich nic) | speculative |

**Era splits and quarantines excluded:** `faceset_NNN_`, `_masked/`, `_thin/` are skipped during analysis.

### 1.2 Single-linkage chains catastrophically; complete-linkage required

The first attempt used connected components on edges ≥ 0.45. It produced a **60-faceset cluster** around `faceset_001` with a minimum within-group similarity of **−0.16**: definitely-different people bridged via chains `A↔B↔C` where `A` and `C` are not similar. Bumping the edge threshold to 0.55 still chained (a group of 17 with minimum 0.20).

The real fix: `scipy.cluster.hierarchy.linkage(method='complete')`, then `fcluster(Z, t=1-edge_threshold, criterion='distance')`. Complete-linkage **guarantees** that every within-group pair has similarity ≥ the edge threshold.
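
The chaining failure and the complete-linkage fix can be reproduced on a toy example. The sketch below is a minimal illustration, not the code in `consolidate_facesets.py`: the helper names `group_centroids` and `min_within_group_sim` and the 2-D "centroids" are assumptions made for the demo, and only the `linkage` / `fcluster` calls mirror the ones named above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def group_centroids(centroids: np.ndarray, edge_threshold: float, method: str) -> np.ndarray:
    """Return one flat-cluster label per centroid row (hypothetical helper)."""
    # pdist returns condensed cosine distances, i.e. 1 - cosine similarity.
    dists = pdist(centroids, metric="cosine")
    Z = linkage(dists, method=method)
    # Cut the dendrogram where the linkage distance exceeds 1 - edge_threshold.
    return fcluster(Z, t=1.0 - edge_threshold, criterion="distance")


def min_within_group_sim(centroids: np.ndarray, labels: np.ndarray) -> float:
    """Smallest cosine similarity between any two members of the same group."""
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = unit @ unit.T
    same_group = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    pairs = sims[same_group & off_diagonal]
    return float(pairs.min()) if pairs.size else 1.0


# Toy chain A-B-C: neighbours are similar (cos 50° ≈ 0.64), endpoints are not
# (cos 100° ≈ -0.17). These vectors are made up for the demo.
angles = np.deg2rad([0.0, 50.0, 100.0])
chain = np.stack([np.cos(angles), np.sin(angles)], axis=1)

labels_single = group_centroids(chain, 0.55, "single")      # behaves like connected components
labels_complete = group_centroids(chain, 0.55, "complete")

print("single:  ", labels_single, min_within_group_sim(chain, labels_single))
print("complete:", labels_complete, min_within_group_sim(chain, labels_complete))
```

On this chain, single-linkage returns one group that contains the dissimilar endpoints, while complete-linkage splits it so that every pair left inside a group clears the 0.55 threshold.
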
Without this guarantee the report is unusable and the apply step would produce identity-poisoned merges.

### 1.3 Thresholds + run results

`edge=0.55`, `confident=0.65` → 48 multi-faceset groups (29 confident, 19 uncertain). Max group size is 7; most groups are pairs or small triplets after complete-linkage.

After applying all 48 (with `--include-uncertain` after visual approval):

- **74 facesets consumed** (some groups had multiple secondaries: `[10, 45, 135] → faceset_002`; `[113, 96, 178, 109, 110, 286] → faceset_095`; etc.)
- Active count 255 → 181
- Notable absorptions: `faceset_001` (peter) 707 → 753 PNGs (+ 7, 132, 151); `faceset_002` 209 → 247; `faceset_026` 60 → 262 (+ 168, 146, 325); `faceset_028` → 207
- Master manifest gained a `merged[]` array (parallel to `thin_eras[]`); each entry has a `merged_into` field pointing at the primary

### 1.4 Apply mechanics

Combine all PNGs from primary + secondaries, re-rank by existing `quality.composite` descending (no re-enrich), renumber `0001..NNNN`, copy into a fresh staging dir, then atomic swap. Move secondary directories to `_merged//` (preserved in full for reversibility). Re-zip `_topN.fsz` and `_all.fsz`.

The primary's existing per-PNG quality scores are reused: re-ranking does not require re-running `enrich`-equivalent landmarks/pose on the cropped PNGs. The primary's `_dropped/` (from the prior occlusion filter) is preserved through the merge.

## 2. Age extension of faceset_001 era buckets

### 2.1 Why a follow-on pass

Consolidation absorbed faceset_007/132/151 into faceset_001 (+46 PNGs). The original `age_split_001.py` had bucketed peter into 6 era anchors (`_2005-10`, `_2010-13`, `_2011`, `_2014-17`, `_2018-19`, `_2018-20`), but those new PNGs had never been seen by age_split. They sat in faceset_001's parent-only set, missing from every era .fsz.

### 2.2 Era-label pitfall

The 6 anchor era labels are NOT strict year ranges.
They are `Counter(years).most_common(1)`-derived dom-years from the original sub-cluster:

| label | dom_year | actual span of members |
|-------|---------:|-----------------------:|
| `_2005-10` | 2010 | 2005–2010 |
| `_2010-13` | 2011 | **2007–2024** |
| `_2011` | 2011 | 2011 only |
| `_2014-17` | 2016 | 2005–2018 |
| `_2018-19` | 2018 | 2012–2020 |
| `_2018-20` | 2019 | 2014–2022 |

The clusters are *appearance-anchored*, not year-bounded. Year is a descriptive label. Assignment rule must use dom-year, not member span.

### 2.3 Algorithm

For each unbucketed face entry in `faceset_001`'s manifest (50 of 753):

1. Look up embedding in cache by `(source, bbox)`.
2. Look up EXIF year via `work/cache/age_split_exif.json`; fetch on cache miss.
3. Find single nearest era anchor by cosine distance to its centroid.
4. Accept iff `dist ≤ 0.40` AND `|year − anchor.dom_year| ≤ 5`. These thresholds match `age_split_001.py`'s anchor-fragment rule.
5. Anchors are NOT re-centered after absorption (preserves age_split's drift-prevention guarantee).

### 2.4 Run results

50 unbucketed → 21 with EXIF year → **14 accepted**:

| anchor | dom_year | added |
|--------|---------:|------:|
| `_2005-10` | 2010 | +2 |
| `_2010-13` | 2011 | +1 |
| `_2014-17` | 2016 | **+9** |
| `_2018-20` | 2019 | +2 |

29 PNGs skipped for missing EXIF year (mostly immich-stripped photos). 7 dist/year-rejected (e.g. two PNGs from 2025 want `_2018-19` but year-delta 7 > 5).

### 2.5 Reconciliation side effect

The apply rebuilds each affected era bucket's `faces/` from staging. This incidentally reconciled the per-bucket manifests with disk after the prior occlusion filter run had left era manifests stale at 282/126/132 entries vs ~248/125/129 actual files (occlusion filter only updates the master manifest, never per-faceset manifests; see `docs/analysis/clip-occlusion-filter.md` §7). 42 occlusion-dropped era PNGs inside the old `faces/_dropped/` were removed during rebuild.
The parent `faceset_001/faces/_dropped/` still has the corpus-level audit; all source images are intact at `/mnt/x/src/`, so the era-level dropped PNGs are regeneratable via `cmd_export_swap`.

## 3. Re-running

Always run both passes after any new identity import (Immich, osrc, hand-sorted folder):

```bash
# 1. Find duplicate identities
python work/consolidate_facesets.py analyze \
    --out work/merge_review/candidates.json [--edge 0.55 --confident 0.65]
python work/consolidate_facesets.py report \
    --candidates work/merge_review/candidates.json --out work/merge_review
# inspect work/merge_review/index.html
python work/consolidate_facesets.py apply \
    --candidates work/merge_review/candidates.json [--include-uncertain]

# 2. Slot new faceset_001 PNGs into existing era buckets
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
python work/age_extend_001.py report \
    --candidates work/age_extend/candidates.json --out work/age_extend
python work/age_extend_001.py apply --candidates work/age_extend/candidates.json
```

Both are idempotent. `consolidate_facesets` skips secondaries already in `_merged/`; `age_extend_001` recomputes anchor centroids + dom-year fresh on every run.
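
To make the "recomputed fresh on every run" point concrete, here is a minimal sketch of how an anchor could be rebuilt from its members and how the §2.3 acceptance rule could then be applied. It is an illustration under assumptions, not the contents of `age_extend_001.py`: the `Anchor` class, `build_anchor`, `assign_era`, the L2-normalized-mean form of the anchor centroid, and the toy 2-D embeddings are all hypothetical. Only the `Counter(years).most_common(1)` dom-year, the `0.40` distance cap, and the 5-year delta come from the text above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence

import numpy as np


@dataclass(frozen=True)
class Anchor:
    label: str
    centroid: np.ndarray  # assumed form: L2-normalized mean of member embeddings
    dom_year: int         # most common EXIF year among members


def build_anchor(label: str, embeddings: np.ndarray, years: Sequence[int]) -> Anchor:
    """Recompute an anchor from its current members (hypothetical helper)."""
    mean = embeddings.mean(axis=0)
    centroid = mean / np.linalg.norm(mean)
    dom_year = Counter(years).most_common(1)[0][0]
    return Anchor(label, centroid, dom_year)


def assign_era(
    embedding: np.ndarray,
    year: Optional[int],
    anchors: Sequence[Anchor],
    max_dist: float = 0.40,
    max_year_delta: int = 5,
) -> Optional[str]:
    """Return the accepted era label, or None if the face is skipped or rejected."""
    if year is None:
        return None  # no EXIF year: skipped
    unit = embedding / np.linalg.norm(embedding)
    # Cosine distance to each anchor centroid; only the single nearest is considered.
    dists = [1.0 - float(unit @ a.centroid) for a in anchors]
    best = int(np.argmin(dists))
    nearest = anchors[best]
    # Compare against dom_year, not the span of years the members cover.
    if dists[best] > max_dist or abs(year - nearest.dom_year) > max_year_delta:
        return None
    return nearest.label


# Toy data (made up): two anchors with 2-D stand-ins for arcface embeddings.
anchors = [
    build_anchor("_2014-17", np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]), [2016, 2016, 2005]),
    build_anchor("_2018-19", np.array([[0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]), [2018, 2018, 2012]),
]
face = np.array([0.05, 1.0])

print(assign_era(face, 2020, anchors))  # close to `_2018-19`, year delta 2
print(assign_era(face, 2025, anchors))  # same anchor, but year delta 7 > 5
```

Because the anchors are rebuilt from the bucket contents each time and never updated inside `assign_era`, running the sketch twice on the same inputs gives the same assignments.
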