face-sets/docs/analysis/identity-consolidation-and-age-extend.md
Peter 49a43c7685 Add post-export corpus maintenance pipeline
Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores
  via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level
  quarantine at 40% domain dominance.

- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.

- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.

- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes — cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega — ~7x embed_worker because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes
preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full
reversibility. Master manifest gains masked[], merged[], plus per-run
provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:41:18 +02:00

# Identity consolidation + age-bucket extension
_Run date: 2026-04-27. Driver scripts: `work/consolidate_facesets.py`, `work/age_extend_001.py`._
After the Immich peter + nic imports added 280 new facesets to a corpus that
had ~25 canonical identities, many "new" identities were duplicates of
existing household members at lower clustering confidence. Two cooperating
passes clean this up: identity consolidation merges duplicates, then
age-extend slots newly-merged PNGs into the existing era buckets of
`faceset_001`.
## 1. Identity consolidation
### 1.1 Approach
For each active faceset, pull cached arcface embeddings from
`work/cache/{nl_full,immich_peter,immich_nic}.npz`, keyed by
`(source, bbox)` from the `faces[]` array of the per-faceset manifest.
Compute the L2-normalized centroid of each faceset, then the pairwise
cosine similarity matrix over those centroids.
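
A minimal sketch of the centroid and similarity step, assuming each faceset's embeddings have already been looked up into an `(n_faces, dim)` array. The function names, the toy 3-D vectors, and the choice to normalize each embedding before averaging are illustrative assumptions, not taken from `consolidate_facesets.py`.

```python
import numpy as np

def l2_normalized_centroid(embeddings: np.ndarray) -> np.ndarray:
    """Mean of unit-normalized face embeddings, re-normalized to unit length."""
    # ASSUMPTION: each embedding is normalized before averaging.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    return centroid / np.linalg.norm(centroid)

def pairwise_cosine(centroids: np.ndarray) -> np.ndarray:
    """Cosine similarity matrix; rows must already be unit vectors."""
    return centroids @ centroids.T

# Toy data (hypothetical): two facesets pointing roughly the same way,
# and a third that is nearly orthogonal to both.
facesets = {
    "faceset_a": np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]),
    "faceset_b": np.array([[1.0, 0.0, 0.1], [0.8, 0.1, 0.0]]),
    "faceset_c": np.array([[0.0, 0.0, 1.0]]),
}
names = sorted(facesets)
centroids = np.stack([l2_normalized_centroid(facesets[n]) for n in names])
sim = pairwise_cosine(centroids)  # sim[i, j] compares names[i] with names[j]
```

Because the centroids are unit vectors, the matrix product is already the cosine similarity and no further division is needed.
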
**Tier-based primary selection** (lowest tier number wins, size breaks ties):
| tier | sources | rationale |
|-----:|---------|-----------|
| 0 | `faceset_013..019` (hand-sorted) | user's curated labels |
| 1 | `faceset_001..012` (auto-clustered) | well-established household |
| 2 | `faceset_020..025` (osrc) | mixed-bucket discovery |
| 3 | `faceset_026..264` (immich peter) | speculative |
| 4 | `faceset_265+` (immich nic) | speculative |
**Era splits and quarantines excluded:** `faceset_NNN_<era>`, `_masked/`, and
`_thin/` are skipped during analysis.
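
The tier rule can be sketched as a sort key. This is an illustration only: `tier_of` and `pick_primary` are hypothetical names, "size" is assumed to mean PNG count with the larger faceset winning ties, and every count below except the 209 for `faceset_002` is invented for the example.

```python
def tier_of(faceset_number: int) -> int:
    """Map a faceset number to its tier, using the ranges in the tier table."""
    if faceset_number < 1:
        raise ValueError(f"invalid faceset number: {faceset_number}")
    if 13 <= faceset_number <= 19:
        return 0  # hand-sorted
    if faceset_number <= 12:
        return 1  # auto-clustered
    if faceset_number <= 25:
        return 2  # osrc
    if faceset_number <= 264:
        return 3  # immich peter
    return 4      # immich nic (265+)

def pick_primary(group: dict[int, int]) -> int:
    """group maps faceset number -> PNG count.

    Lowest tier wins. ASSUMPTION: within a tier, the larger faceset wins.
    """
    return min(group, key=lambda n: (tier_of(n), -group[n]))

# Group [10, 45, 135] -> faceset_002; secondary sizes are hypothetical.
example_group = {2: 209, 10: 20, 45: 10, 135: 8}
primary = pick_primary(example_group)  # 2
```
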
### 1.2 Single-linkage chains catastrophically — complete-linkage required
First attempt used connected-components on edge ≥ 0.45 → produced a
**60-faceset cluster** around `faceset_001` with min within-group sim of
**0.16** (definitely-different people bridged via chains
`A↔B↔C` where `A`, `C` are not similar). Bumping to edge ≥ 0.55 still
chained (group of 17 with min 0.20).
Real fix: `scipy.cluster.hierarchy.linkage(method='complete')` then
`fcluster(Z, t=1-edge_threshold, criterion='distance')`. Complete-linkage
**guarantees** every within-group pair sim ≥ edge threshold. Without this
guarantee the report is unusable and the apply step would produce
identity-poisoned merges.
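
The difference can be reproduced on a three-centroid toy case. Connected components on a thresholded graph yield the same groups as single-linkage cut at that threshold, so the sketch below compares `method='single'` with `method='complete'`. The helper names and the similarity values are assumptions made for the illustration; only the `linkage` and `fcluster` calls mirror the text.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_centroids(sim: np.ndarray, edge: float, method: str) -> np.ndarray:
    """Cut a hierarchical clustering of cosine distances at 1 - edge."""
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)  # linkage expects condensed form
    Z = linkage(condensed, method=method)
    return fcluster(Z, t=1.0 - edge, criterion="distance")

def min_within_group_sim(sim: np.ndarray, labels: np.ndarray) -> float:
    """Lowest pairwise similarity found inside any multi-member group."""
    worst = 1.0
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        if len(idx) > 1:
            worst = min(worst, float(sim[np.ix_(idx, idx)].min()))
    return worst

# Hypothetical chain: A~B and B~C are similar, but A and C are not.
sim = np.array([
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.6],
    [0.1, 0.6, 1.0],
])
edge = 0.55
single = cluster_centroids(sim, edge, "single")      # A, B, C all in one group
complete = cluster_centroids(sim, edge, "complete")  # {A, B} and {C}
```

In this toy case single-linkage bridges A and C through B despite their similarity of 0.1, while complete-linkage keeps C apart because joining it would put a pair below the edge threshold in the same group.
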
### 1.3 Thresholds + run results
`edge=0.55`, `confident=0.65` → 48 multi-faceset groups (29 confident, 19
uncertain). Max group size 7; most groups are pairs or small triplets after
complete-linkage.
After applying all 48 (with `--include-uncertain` after visual approval):
- **74 facesets consumed** (some groups had multiple secondaries:
`[10, 45, 135] → faceset_002`; `[113, 96, 178, 109, 110, 286] → faceset_095`;
etc.)
- Active count 255 → 181
- Notable absorptions: `faceset_001` (peter) 707 → 753 PNGs (+ 7, 132, 151);
`faceset_002` 209 → 247; `faceset_026` 60 → 262 (+ 168, 146, 325);
`faceset_028` → 207
- Master manifest gained `merged[]` array (parallel to `thin_eras[]`); each
entry has `merged_into` field pointing at the primary
### 1.4 Apply mechanics
Combine all PNGs from primary + secondaries, re-rank by existing
`quality.composite` desc (no re-enrich), renumber `0001..NNNN`, copy into a
fresh staging dir, atomic swap. Move secondary directories to
`_merged/<original_name>/` (preserved in full for reversibility). Re-zip
`_topN.fsz` and `_all.fsz`.
The primary's existing per-PNG quality scores are reused — re-ranking does
not require re-running `enrich`-equivalent landmarks/pose on the cropped
PNGs. The primary's `_dropped/` (from prior occlusion filter) is preserved
through the merge.
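
The re-rank and renumber step might look like the sketch below. It assumes each manifest entry is a dict with a `file` name and a `quality.composite` score, and that output files are named `0001.png` onward; both are assumptions about the manifest layout. The copy into staging and the directory swap are omitted.

```python
def merge_and_renumber(primary: list[dict], secondaries: list[list[dict]]) -> list[dict]:
    """Combine face entries, rank by existing composite quality, renumber from 0001."""
    combined = list(primary)
    for faces in secondaries:
        combined.extend(faces)
    # Reuse the stored scores; nothing is re-enriched here.
    ranked = sorted(combined, key=lambda f: f["quality"]["composite"], reverse=True)
    width = max(4, len(str(len(ranked))))
    return [
        {**face, "src_file": face["file"], "file": f"{rank:0{width}d}.png"}
        for rank, face in enumerate(ranked, start=1)
    ]

# Hypothetical entries from a primary and one secondary faceset.
primary_faces = [
    {"file": "0001.png", "quality": {"composite": 0.91}},
    {"file": "0002.png", "quality": {"composite": 0.40}},
]
secondary_faces = [{"file": "0001.png", "quality": {"composite": 0.75}}]
merged = merge_and_renumber(primary_faces, [secondary_faces])
```

Keeping the original name in a separate `src_file` field is one way to let a staging copy map old files to new ones when primary and secondary names collide.
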
## 2. Age extension of faceset_001 era buckets
### 2.1 Why a follow-on pass
Consolidation absorbed faceset_007/132/151 into faceset_001 (+46 PNGs).
The original `age_split_001.py` had bucketed peter into 6 era anchors
(`_2005-10`, `_2010-13`, `_2011`, `_2014-17`, `_2018-19`, `_2018-20`), but
those new PNGs had never been seen by age_split. They sat in faceset_001's
parent-only set, missing from every era .fsz.
### 2.2 Era-label pitfall
The 6 anchor era labels are NOT strict year ranges. Each anchor carries a
dom-year (dominant year), derived via `Counter(years).most_common(1)` from
the original sub-cluster:
| label | dom_year | actual span of members |
|-------|---------:|-----------------------:|
| `_2005-10` | 2010 | 2005–2010 |
| `_2010-13` | 2011 | **2007–2024** |
| `_2011` | 2011 | 2011 only |
| `_2014-17` | 2016 | 2005–2018 |
| `_2018-19` | 2018 | 2012–2020 |
| `_2018-20` | 2019 | 2014–2022 |
The clusters are *appearance-anchored*, not year-bounded. Year is a
descriptive label. Assignment rule must use dom-year, not member span.
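
A small illustration of why dom-year and member span diverge. The year list is hypothetical, chosen only to match the `_2010-13` row (dom-year 2011, span 2007–2024).

```python
from collections import Counter

def dom_year(years: list[int]) -> int:
    """Most frequent year among a sub-cluster's members."""
    return Counter(years).most_common(1)[0][0]

# Hypothetical member years: mostly 2011, with two far outliers.
years = [2007, 2011, 2011, 2011, 2012, 2013, 2024]
dominant = dom_year(years)       # 2011
span = (min(years), max(years))  # (2007, 2024)
```

A rule keyed on the span would accept any photo from 2007 through 2024 here, whereas a rule keyed on the dom-year stays centered on 2011.
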
### 2.3 Algorithm
For each unbucketed face entry in `faceset_001`'s manifest (50 of 753):
1. Look up embedding in cache by `(source, bbox)`.
2. Look up EXIF year via `work/cache/age_split_exif.json`; fetch on cache miss.
3. Find single nearest era anchor by cosine distance to its centroid.
4. Accept iff `dist ≤ 0.40` AND `|year - anchor.dom_year| ≤ 5`.
   These thresholds match `age_split_001.py`'s anchor-fragment rule.
5. Anchors are NOT re-centered after absorption (preserves age_split's
   drift-prevention guarantee).
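
Steps 3 and 4 can be sketched as follows. The `Anchor` class, `assign_era`, and the 2-D toy centroids are hypothetical; only the two thresholds and the nearest-anchor rule come from the text. Anchors are read-only here, consistent with step 5.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

MAX_DIST = 0.40      # cosine distance to the nearest anchor centroid
MAX_YEAR_DELTA = 5   # |year - anchor.dom_year|

@dataclass
class Anchor:
    label: str
    centroid: np.ndarray  # assumed unit length
    dom_year: int

def assign_era(embedding: np.ndarray, year: Optional[int],
               anchors: list[Anchor]) -> Optional[str]:
    """Return the era label for a face, or None if it is skipped or rejected."""
    if year is None:
        return None  # no EXIF year available: skipped
    unit = embedding / np.linalg.norm(embedding)
    dists = [1.0 - float(unit @ a.centroid) for a in anchors]
    best = int(np.argmin(dists))  # single nearest anchor only
    anchor = anchors[best]
    if dists[best] <= MAX_DIST and abs(year - anchor.dom_year) <= MAX_YEAR_DELTA:
        return anchor.label
    return None

# Toy anchors (hypothetical centroids; dom-years taken from the era table).
anchors = [
    Anchor("_2014-17", np.array([1.0, 0.0]), 2016),
    Anchor("_2018-19", np.array([0.0, 1.0]), 2018),
]
accepted = assign_era(np.array([1.0, 0.1]), 2015, anchors)  # "_2014-17"
rejected = assign_era(np.array([0.1, 1.0]), 2025, anchors)  # None: year delta 7 > 5
```

A rejected face is not offered to the second-nearest anchor in this sketch, since the rule considers the single nearest anchor only.
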
### 2.4 Run results
50 unbucketed → 21 with EXIF year → **14 accepted**:
| anchor | dom_year | added |
|--------|---------:|------:|
| `_2005-10` | 2010 | +2 |
| `_2010-13` | 2011 | +1 |
| `_2014-17` | 2016 | **+9** |
| `_2018-20` | 2019 | +2 |
29 PNGs skipped for missing EXIF year (mostly immich-stripped
photos). 7 dist/year-rejected (e.g. two PNGs from 2025 want
`_2018-19` but year-delta 7 > 5).
### 2.5 Reconciliation side effect
The apply rebuilds each affected era bucket's `faces/` from staging. This
incidentally reconciled the per-bucket manifests with disk after the prior
occlusion filter run had left era manifests stale at 282/126/132 entries vs
~248/125/129 actual files (occlusion filter only updates the master
manifest, never per-faceset manifests — see
`docs/analysis/clip-occlusion-filter.md` §7). 42 occlusion-dropped era PNGs
inside the old `faces/_dropped/` were removed during rebuild. The
parent `faceset_001/faces/_dropped/` still has the corpus-level audit; all
source images are intact at `/mnt/x/src/`, so the era-level dropped PNGs
are regeneratable via `cmd_export_swap`.
## 3. Re-running
Always run both passes after any new identity import (Immich, osrc,
hand-sorted folder):
```bash
# 1. Find duplicate identities.
#    --edge / --confident are optional; shown at the values used for this run.
python work/consolidate_facesets.py analyze \
    --out work/merge_review/candidates.json --edge 0.55 --confident 0.65
python work/consolidate_facesets.py report \
    --candidates work/merge_review/candidates.json --out work/merge_review
# Inspect work/merge_review/index.html before applying.
# Append --include-uncertain to also merge the uncertain groups.
python work/consolidate_facesets.py apply \
    --candidates work/merge_review/candidates.json

# 2. Slot new faceset_001 PNGs into existing era buckets.
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
python work/age_extend_001.py report \
    --candidates work/age_extend/candidates.json --out work/age_extend
python work/age_extend_001.py apply --candidates work/age_extend/candidates.json
```
Both are idempotent. `consolidate_facesets` skips secondaries already in
`_merged/`; `age_extend_001` recomputes anchor centroids + dom-year fresh
on every run.