Add post-export corpus maintenance pipeline

Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores via
  new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level quarantine
  at 40% domain dominance.
- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.
- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.
- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes: cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega, ~7x embed_worker, because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs.

All quarantines and prunes preserved on disk (faces/_dropped/, _masked/,
_merged/, _thin/) for full reversibility. Master manifest gains masked[],
merged[], plus per-run provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:41:18 +02:00
parent e66c97fd58
commit 49a43c7685
10 changed files with 3250 additions and 1 deletions
--- a/docs/analysis/identity-consolidation-and-age-extend.md
+++ b/docs/analysis/identity-consolidation-and-age-extend.md
@@ -0,0 +1,170 @@
+# Identity consolidation + age-bucket extension
+
+_Run date: 2026-04-27. Driver scripts: `work/consolidate_facesets.py`, `work/age_extend_001.py`._
+
+After the Immich peter + nic imports added 280 new facesets to a corpus that
+had ~25 canonical identities, many "new" identities were duplicates of
+existing household members at lower clustering confidence. Two cooperating
+passes clean this up: identity consolidation merges duplicates, then
+age-extend slots newly-merged PNGs into the existing era buckets of
+`faceset_001`.
+
+## 1. Identity consolidation
+
+### 1.1 Approach
+
+For each active faceset, pull cached arcface embeddings from
+`work/cache/{nl_full,immich_peter,immich_nic}.npz` keyed by
+`(source, bbox)` from the per-faceset manifest's `faces[]`. Compute each
+faceset's L2-normalized centroid, then the pairwise cosine similarity matrix
+between centroids.
+
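As an illustration of the centroid step, here is a minimal sketch with toy 3-d vectors standing in for the cached arcface embeddings. The function names are hypothetical, and normalizing each face embedding before averaging is an assumption; the text above only says the centroid itself is L2-normalized.

```python
import numpy as np


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def faceset_centroid(embeddings: np.ndarray) -> np.ndarray:
    """Mean of unit-normalized face embeddings, re-normalized to unit length.

    Normalizing each face first is an assumption, not confirmed by the doc.
    """
    return l2_normalize(l2_normalize(embeddings).mean(axis=0))


def centroid_similarity(centroids: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity; rows must already be unit length."""
    return centroids @ centroids.T


# Toy stand-ins for two facesets (real embeddings are much higher-dimensional).
faceset_a = np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]])
faceset_b = np.array([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]])

centroids = np.stack([faceset_centroid(faceset_a), faceset_centroid(faceset_b)])
sim = centroid_similarity(centroids)
```

Because the centroids are unit length, the matrix product is already the cosine similarity, and `1 - sim` can serve directly as the distance used for clustering.
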
+**Tier-based primary selection** (lowest tier number wins, size breaks ties):
+
+| tier | sources | rationale |
+|-----:|---------|-----------|
+| 0 | `faceset_013..019` (hand-sorted) | user's curated labels |
+| 1 | `faceset_001..012` (auto-clustered) | well-established household |
+| 2 | `faceset_020..025` (osrc) | mixed-bucket discovery |
+| 3 | `faceset_026..264` (immich peter) | speculative |
+| 4 | `faceset_265+` (immich nic) | speculative |
+
+**Era splits and quarantines excluded** — `faceset_NNN_<era>`, `_masked/`,
+`_thin/` are skipped during analysis.
+
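The tier table can be read as a small lookup plus a sort key. The sketch below is one possible encoding, not the script's actual code: the function names are hypothetical, "size breaks ties" is assumed to mean the larger faceset wins, and the final name-based tie-break is added only to make the result deterministic.

```python
import re
from typing import Dict


def faceset_tier(name: str) -> int:
    """Map a plain faceset name (e.g. 'faceset_013') to its tier number."""
    match = re.fullmatch(r"faceset_(\d+)", name)
    if match is None:
        # Era splits such as 'faceset_001_2011' are excluded from analysis.
        raise ValueError(f"not a plain faceset name: {name!r}")
    n = int(match.group(1))
    if 13 <= n <= 19:
        return 0  # hand-sorted
    if 1 <= n <= 12:
        return 1  # auto-clustered
    if 20 <= n <= 25:
        return 2  # osrc
    if 26 <= n <= 264:
        return 3  # immich peter
    if n >= 265:
        return 4  # immich nic
    raise ValueError(f"faceset number out of range: {name!r}")


def choose_primary(group_sizes: Dict[str, int]) -> str:
    """Pick the primary: lowest tier first, then largest PNG count (assumed)."""
    return min(
        group_sizes,
        key=lambda name: (faceset_tier(name), -group_sizes[name], name),
    )


# Sizes other than faceset_002's 209 are made up for the example.
group = {"faceset_010": 50, "faceset_045": 300, "faceset_135": 20, "faceset_002": 209}
primary = choose_primary(group)
```

With these sizes `faceset_002` is selected, which matches the `[10, 45, 135] → faceset_002` merge reported in §1.3.
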
+### 1.2 Single-linkage chains catastrophically — complete-linkage required
+
+First attempt used connected-components on edge ≥ 0.45 → produced a
+**60-faceset cluster** around `faceset_001` with min within-group sim of
+**−0.16** (definitely-different people bridged via chains
+`A↔B↔C` where `A`, `C` are not similar). Bumping to edge ≥ 0.55 still
+chained (group of 17 with min 0.20).
+
+Real fix: `scipy.cluster.hierarchy.linkage(method='complete')` then
+`fcluster(Z, t=1-edge_threshold, criterion='distance')`. Complete-linkage
+**guarantees** that every within-group pair of centroids has sim ≥ the edge
+threshold. Without this guarantee the report is unusable and the apply step
+would produce identity-poisoned merges.
+
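The chaining failure can be reproduced on a three-item toy matrix. This sketch uses the same SciPy calls named above on made-up similarities where `A↔B` and `B↔C` are similar but `A` and `C` are not; the helper names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up centroid similarities: A~B and B~C pass the edge, A~C does not.
sim = np.array([
    [1.00, 0.60, 0.10],
    [0.60, 1.00, 0.60],
    [0.10, 0.60, 1.00],
])
edge_threshold = 0.55

# linkage() expects a condensed distance vector; cosine distance = 1 - sim.
dist = squareform(1.0 - sim, checks=False)


def cluster_labels(method: str) -> np.ndarray:
    """Flat cluster labels cut at distance 1 - edge_threshold."""
    Z = linkage(dist, method=method)
    return fcluster(Z, t=1.0 - edge_threshold, criterion="distance")


def min_within_group_sim(labels: np.ndarray) -> float:
    """Lowest pairwise similarity inside any multi-member group."""
    worst = 1.0
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        if len(members) > 1:
            worst = min(worst, float(sim[np.ix_(members, members)].min()))
    return worst


single = cluster_labels("single")      # chains A, B, C into one group
complete = cluster_labels("complete")  # keeps A and C apart
```

Single-linkage puts all three items in one group whose worst pair has sim 0.10, while complete-linkage yields two groups and never places `A` with `C`.
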
+### 1.3 Thresholds + run results
+
+`edge=0.55`, `confident=0.65` → 48 multi-faceset groups (29 confident, 19
+uncertain). Max group size is 7; most groups are pairs or small triplets
+after complete-linkage.
+
+After applying all 48 (with `--include-uncertain` after visual approval):
+
+- **74 facesets consumed** (some groups had multiple secondaries:
+  `[10, 45, 135] → faceset_002`; `[113, 96, 178, 109, 110, 286] → faceset_095`;
+  etc.)
+- Active count 255 → 181
+- Notable absorptions: `faceset_001` (peter) 707 → 753 PNGs (+ 7, 132, 151);
+  `faceset_002` 209 → 247; `faceset_026` 60 → 262 (+ 168, 146, 325);
+  `faceset_028` → 207
+- Master manifest gained `merged[]` array (parallel to `thin_eras[]`); each
+  entry has `merged_into` field pointing at the primary
+
+### 1.4 Apply mechanics
+
+Combine all PNGs from primary + secondaries, re-rank by existing
+`quality.composite` desc (no re-enrich), renumber `0001..NNNN`, copy into a
+fresh staging dir, atomic swap. Move secondary directories to
+`_merged/<original_name>/` (preserved in full for reversibility). Re-zip
+`_topN.fsz` and `_all.fsz`.
+
+The primary's existing per-PNG quality scores are reused — re-ranking does
+not require re-running `enrich`-equivalent landmarks/pose on the cropped
+PNGs. The primary's `_dropped/` (from prior occlusion filter) is preserved
+through the merge.
+
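The re-rank and renumber step might look like the following sketch. Only `quality.composite` comes from the text above; the `file` and `source` keys and the `NNNN.png` filename pattern are assumptions made for illustration, and the staging copy and atomic swap are left out.

```python
from typing import Any, Dict, List

Face = Dict[str, Any]


def rerank_and_renumber(faces: List[Face]) -> List[Face]:
    """Sort by existing quality.composite (descending) and assign 0001..NNNN.

    sorted() is stable, so entries with equal scores keep their input order.
    Returns new dicts; the input entries are not modified.
    """
    ranked = sorted(faces, key=lambda f: f["quality"]["composite"], reverse=True)
    return [
        {**face, "file": f"{index:04d}.png"}
        for index, face in enumerate(ranked, start=1)
    ]


# Hypothetical manifest entries: primary faces first, then secondaries.
combined = [
    {"source": "a.jpg", "file": "0001.png", "quality": {"composite": 0.4}},
    {"source": "b.jpg", "file": "0002.png", "quality": {"composite": 0.9}},
    {"source": "c.jpg", "file": "0001.png", "quality": {"composite": 0.7}},
]
merged = rerank_and_renumber(combined)
```

Since the scores are reused as-is, this step needs no image access at all; only the subsequent copy into staging touches the PNGs.
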
+## 2. Age extension of faceset_001 era buckets
+
+### 2.1 Why a follow-on pass
+
+Consolidation absorbed faceset_007/132/151 into faceset_001 (+46 PNGs).
+The original `age_split_001.py` had bucketed peter into 6 era anchors
+(`_2005-10`, `_2010-13`, `_2011`, `_2014-17`, `_2018-19`, `_2018-20`), but
+those new PNGs had never been seen by age_split. They sat in faceset_001's
+parent-only set, missing from every era .fsz.
+
+### 2.2 Era-label pitfall
+
+The 6 anchor era labels are NOT strict year ranges. They are
+`Counter(years).most_common(1)`-derived dom-years from the original sub-cluster:
+
+| label | dom_year | actual span of members |
+|-------|---------:|-----------------------:|
+| `_2005-10` | 2010 | 2005–2010 |
+| `_2010-13` | 2011 | **2007–2024** |
+| `_2011` | 2011 | 2011 only |
+| `_2014-17` | 2016 | 2005–2018 |
+| `_2018-19` | 2018 | 2012–2020 |
+| `_2018-20` | 2019 | 2014–2022 |
+
+The clusters are *appearance-anchored*, not year-bounded. The year is only a
+descriptive label, so the assignment rule must use dom-year, not member span.
+
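To make the label-versus-span distinction concrete, here is a small sketch of how a dom-year is derived with `Counter(years).most_common(1)`. The year list is invented; it only mimics the shape of the `_2010-13` row, where the mode is 2011 but members span a much wider range.

```python
from collections import Counter
from typing import List, Tuple


def dom_year(years: List[int]) -> int:
    """Most frequent EXIF year among a sub-cluster's members."""
    return Counter(years).most_common(1)[0][0]


def year_span(years: List[int]) -> Tuple[int, int]:
    """Earliest and latest year among the members."""
    return min(years), max(years)


# Invented member years for illustration only.
years = [2007, 2010, 2011, 2011, 2011, 2013, 2024]
label_year = dom_year(years)
span = year_span(years)
```

Here the dom-year is 2011 while the span runs from 2007 to 2024, so a year filter based on the span would accept almost anything.
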
+### 2.3 Algorithm
+
+For each unbucketed face entry in `faceset_001`'s manifest (50 of 753):
+
+1. Look up embedding in cache by `(source, bbox)`.
+2. Look up EXIF year via `work/cache/age_split_exif.json`; fetch on cache miss.
+3. Find single nearest era anchor by cosine distance to its centroid.
+4. Accept iff `dist ≤ 0.40` AND `|year − anchor.dom_year| ≤ 5`.
+   These thresholds match `age_split_001.py`'s anchor-fragment rule.
+5. Anchors are NOT re-centered after absorption (preserves age_split's
+   drift-prevention guarantee).
+
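Steps 3 and 4 can be sketched as a single function. The `Anchor` class, the function names, and the toy 2-d centroids are hypothetical; the thresholds and the "nearest anchor only, no re-centering" behaviour follow the list above.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Anchor:
    label: str
    centroid: Tuple[float, ...]
    dom_year: int


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def assign_to_era(
    embedding: Sequence[float],
    year: Optional[int],
    anchors: Sequence[Anchor],
    max_dist: float = 0.40,
    max_year_delta: int = 5,
) -> Optional[str]:
    """Return the accepted era label, or None if the face stays unbucketed."""
    if year is None:
        return None  # no EXIF year: skipped
    # Only the single nearest anchor is considered; there is no fallback.
    nearest = min(anchors, key=lambda a: cosine_distance(embedding, a.centroid))
    close_enough = cosine_distance(embedding, nearest.centroid) <= max_dist
    year_ok = abs(year - nearest.dom_year) <= max_year_delta
    return nearest.label if close_enough and year_ok else None


# Toy anchors; centroids are fixed and never updated after an assignment.
anchors = [
    Anchor("_2005-10", (0.0, 1.0), 2010),
    Anchor("_2018-19", (1.0, 0.0), 2018),
]
accepted = assign_to_era((0.95, 0.10), 2020, anchors)
too_late = assign_to_era((0.95, 0.10), 2025, anchors)
too_far = assign_to_era((-1.0, 0.0), 2010, anchors)
no_exif = assign_to_era((0.95, 0.10), None, anchors)
```

The 2025 case mirrors the rejection described in §2.4: the face is nearest to `_2018-19`, but a year delta of 7 exceeds the limit of 5.
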
+### 2.4 Run results
+
+50 unbucketed → 21 with EXIF year → **14 accepted**:
+
+| anchor | dom_year | added |
+|--------|---------:|------:|
+| `_2005-10` | 2010 | +2 |
+| `_2010-13` | 2011 | +1 |
+| `_2014-17` | 2016 | **+9** |
+| `_2018-20` | 2019 | +2 |
+
+29 PNGs were skipped for missing EXIF year (mostly immich-stripped
+photos). 7 were rejected on distance or year (e.g. two PNGs from 2025 are
+nearest to `_2018-19`, but their year delta of 7 exceeds 5).
+
+### 2.5 Reconciliation side effect
+
+The apply rebuilds each affected era bucket's `faces/` from staging. This
+incidentally reconciled the per-bucket manifests with disk after the prior
+occlusion filter run had left era manifests stale at 282/126/132 entries vs
+~248/125/129 actual files (occlusion filter only updates the master
+manifest, never per-faceset manifests — see
+`docs/analysis/clip-occlusion-filter.md` §7). 42 occlusion-dropped era PNGs
+inside the old `faces/_dropped/` were removed during rebuild. The
+parent `faceset_001/faces/_dropped/` still has the corpus-level audit; all
+source images are intact at `/mnt/x/src/`, so the era-level dropped PNGs
+are regeneratable via `cmd_export_swap`.
+
+## 3. Re-running
+
+Always run both passes after any new identity import (Immich, osrc,
+hand-sorted folder):
+
+```bash
+# 1. Find duplicate identities
+python work/consolidate_facesets.py analyze \
+  --out work/merge_review/candidates.json [--edge 0.55 --confident 0.65]
+python work/consolidate_facesets.py report \
+  --candidates work/merge_review/candidates.json --out work/merge_review
+# inspect work/merge_review/index.html
+python work/consolidate_facesets.py apply \
+  --candidates work/merge_review/candidates.json [--include-uncertain]
+
+# 2. Slot new faceset_001 PNGs into existing era buckets
+python work/age_extend_001.py analyze --out work/age_extend/candidates.json
+python work/age_extend_001.py report \
+  --candidates work/age_extend/candidates.json --out work/age_extend
+python work/age_extend_001.py apply --candidates work/age_extend/candidates.json
+```
+
+Both are idempotent. `consolidate_facesets` skips secondaries already in
+`_merged/`; `age_extend_001` recomputes anchor centroids + dom-year fresh
+on every run.