Adds four new orchestration scripts that operate on an already-built facesets_swap_ready/ to clean it up over time:
- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level quarantine at 40% domain dominance.
- consolidate_facesets.py: duplicate-identity merger using complete-linkage centroid clustering on cached arcface embeddings. Single-linkage chains catastrophically (60-faceset clusters with min sim < 0); complete-linkage guarantees within-group sim >= edge.
- age_extend_001.py: slots newly-added PNGs into existing era buckets of faceset_001 using the same anchor-fragment rule as age_split_001.py (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.
- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three passes: cross-family SHA256 byte-dedup (preserves intra-family era duplication), within-faceset near-dup at sim >= 0.95, and a multi-face audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s on AMD Vega, ~7x embed_worker, because input is 512x512 crops.
Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged → 181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full reversibility. Master manifest gains masked[], merged[], plus per-run provenance blocks.
Three new docs/analysis/ writeups cover model choice, threshold rationale, and per-pass run results.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Identity consolidation + age-bucket extension
Run date: 2026-04-27. Driver scripts: work/consolidate_facesets.py, work/age_extend_001.py.
After the Immich peter + nic imports added 280 new facesets to a corpus that
had ~25 canonical identities, many "new" identities were duplicates of
existing household members at lower clustering confidence. Two cooperating
passes clean this up: identity consolidation merges duplicates, then
age-extend slots newly-merged PNGs into the existing era buckets of
faceset_001.
1. Identity consolidation
1.1 Approach
For each active faceset, pull cached arcface embeddings from
work/cache/{nl_full,immich_peter,immich_nic}.npz, keyed by
(source, bbox) from the per-faceset manifest's faces[]. Compute an
L2-normalized centroid per faceset, then the pairwise cosine similarity
matrix across all centroids.
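The centroid and similarity step can be sketched as follows. This is a minimal illustration, not the code in consolidate_facesets.py: the function names and the toy 4-dimensional data are hypothetical, and normalizing each face embedding before averaging is an assumption.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector (or each row of a matrix) to unit L2 norm."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.clip(norm, 1e-12, None)

def faceset_centroid(embeddings: np.ndarray) -> np.ndarray:
    """Mean of unit-normalized face embeddings, re-normalized to unit length.

    Normalizing each face first is an assumption of this sketch.
    """
    return l2_normalize(l2_normalize(embeddings).mean(axis=0))

def centroid_similarity(centroids: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity; rows are unit-length, so a dot product suffices."""
    return centroids @ centroids.T

# Hypothetical toy data: 3 facesets with 4-dimensional embeddings.
rng = np.random.default_rng(0)
facesets = {
    "faceset_a": rng.normal(size=(5, 4)),
    "faceset_b": rng.normal(size=(3, 4)),
    "faceset_c": rng.normal(size=(7, 4)),
}
names = sorted(facesets)
centroids = np.stack([faceset_centroid(facesets[n]) for n in names])
sim = centroid_similarity(centroids)  # sim[i, j] is the similarity of names[i] and names[j]
```

Because every centroid is unit-length, the matrix is symmetric with ones on the diagonal, and `1 - sim` can be used directly as a cosine distance.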
Tier-based primary selection (lowest tier number wins, size breaks ties):
| tier | sources | rationale |
|---|---|---|
| 0 | faceset_013..019 (hand-sorted) | user's curated labels |
| 1 | faceset_001..012 (auto-clustered) | well-established household |
| 2 | faceset_020..025 (osrc) | mixed-bucket discovery |
| 3 | faceset_026..264 (immich peter) | speculative |
| 4 | faceset_265+ (immich nic) | speculative |
Era splits and quarantines excluded — faceset_NNN_<era>, _masked/,
_thin/ are skipped during analysis.
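A sketch of the primary-selection rule, using the tier boundaries from the table above. The helper names are hypothetical, and reading "size breaks ties" as "the larger faceset wins" is an assumption. Names that are not plain faceset_NNN (era splits, for example) are rejected, mirroring their exclusion from analysis.

```python
import re

# (first, last, tier) copied from the tier table; last=None means open-ended.
TIER_RANGES = [
    (13, 19, 0),     # hand-sorted
    (1, 12, 1),      # auto-clustered
    (20, 25, 2),     # osrc
    (26, 264, 3),    # immich peter
    (265, None, 4),  # immich nic
]

def tier_of(name: str) -> int:
    """Map a plain 'faceset_NNN' name to its tier number."""
    m = re.fullmatch(r"faceset_(\d+)", name)
    if m is None:
        raise ValueError(f"not a plain faceset name: {name!r}")
    n = int(m.group(1))
    for first, last, tier in TIER_RANGES:
        if n >= first and (last is None or n <= last):
            return tier
    raise ValueError(f"no tier covers {name!r}")

def pick_primary(group: dict[str, int]) -> str:
    """group maps faceset name -> PNG count.

    Lowest tier wins; the larger faceset breaks ties (assumed direction);
    the name is a final deterministic tie-break.
    """
    return min(group, key=lambda name: (tier_of(name), -group[name], name))

# Sizes other than faceset_002's 209 are made up for illustration.
group = {"faceset_010": 40, "faceset_002": 209, "faceset_045": 12, "faceset_135": 9}
primary = pick_primary(group)  # faceset_002: tier 1 like faceset_010, but larger
```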
1.2 Single-linkage chains catastrophically — complete-linkage required
First attempt used connected-components on edge ≥ 0.45 → produced a
60-faceset cluster around faceset_001 with min within-group sim of
−0.16 (definitely-different people bridged via chains
A↔B↔C where A, C are not similar). Bumping to edge ≥ 0.55 still
chained (group of 17 with min 0.20).
Real fix: scipy.cluster.hierarchy.linkage(method='complete') then
fcluster(Z, t=1-edge_threshold, criterion='distance'). Complete-linkage
guarantees every within-group pair sim ≥ edge threshold. Without this
guarantee the report is unusable and the apply step would produce
identity-poisoned merges.
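The chaining failure and the fix can be reproduced on a toy example. The sketch below assumes cosine distance is defined as 1 − similarity and uses three hypothetical 2-D unit vectors (at 0°, 50°, 100°) in place of real centroids; the helper names are illustrative. Single-linkage cut at the same threshold is equivalent to connected-components on edges ≥ the threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_centroids(sim: np.ndarray, edge_threshold: float, method: str) -> np.ndarray:
    """Flat cluster labels from a cosine-similarity matrix, cut at distance 1 - edge_threshold."""
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)                 # remove float noise on the diagonal
    condensed = squareform(dist, checks=False)  # linkage() expects the condensed form
    Z = linkage(condensed, method=method)
    return fcluster(Z, t=1.0 - edge_threshold, criterion="distance")

def min_within_group_sim(sim: np.ndarray, labels: np.ndarray) -> float:
    """Smallest pairwise similarity found inside any multi-member group."""
    worst = 1.0
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        if len(idx) > 1:
            worst = min(worst, float(sim[np.ix_(idx, idx)].min()))
    return worst

# A and B are similar, B and C are similar, A and C are not (cos 100° is about -0.17).
angles = np.deg2rad([0.0, 50.0, 100.0])
centroids = np.stack([np.cos(angles), np.sin(angles)], axis=1)
sim = centroids @ centroids.T

single = cluster_centroids(sim, 0.55, "single")      # chains A-B-C into one group
complete = cluster_centroids(sim, 0.55, "complete")  # keeps the dissimilar endpoint apart
```

With single-linkage the three vectors form one group whose minimum similarity is negative. With complete-linkage one pair is merged and the third vector stays alone, so every within-group pair stays at or above the 0.55 edge threshold.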
1.3 Thresholds + run results
edge=0.55, confident=0.65 → 48 multi-faceset groups (29 confident, 19
uncertain). Max group size is 7; most groups are pairs or small triplets
after complete-linkage.
After applying all 48 (with --include-uncertain after visual approval):
- 74 facesets consumed (some groups had multiple secondaries: [10, 45, 135] → faceset_002; [113, 96, 178, 109, 110, 286] → faceset_095; etc.)
- Active count 255 → 181
- Notable absorptions: faceset_001 (peter) 707 → 753 PNGs (+ 7, 132, 151); faceset_002 209 → 247; faceset_026 60 → 262 (+ 168, 146, 325); faceset_028 → 207
- Master manifest gained a merged[] array (parallel to thin_eras[]); each entry has a merged_into field pointing at the primary
1.4 Apply mechanics
Combine all PNGs from primary + secondaries, re-rank by existing
quality.composite desc (no re-enrich), renumber 0001..NNNN, copy into a
fresh staging dir, atomic swap. Move secondary directories to
_merged/<original_name>/ (preserved in full for reversibility). Re-zip
_topN.fsz and _all.fsz.
The primary's existing per-PNG quality scores are reused — re-ranking does
not require re-running enrich-equivalent landmarks/pose on the cropped
PNGs. The primary's _dropped/ (from prior occlusion filter) is preserved
through the merge.
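The re-rank, renumber, and swap steps might look like the sketch below. It is not the actual apply code: the entry layout ({"path": ..., "quality": {"composite": ...}}), the NNNN.png naming, and the function name are assumptions. Moving secondaries to _merged/ and re-zipping the .fsz files are left out.

```python
import os
import shutil
import tempfile
from pathlib import Path

def merge_into_primary(primary_faces: Path, entries: list[dict]) -> list[dict]:
    """Rebuild primary_faces from entries ranked by quality.composite (descending).

    entries: hypothetical records {"path": Path, "quality": {"composite": float}}
    gathered from the primary and all of its secondaries.
    """
    # Reuse existing scores; nothing is re-enriched.
    ranked = sorted(entries, key=lambda e: e["quality"]["composite"], reverse=True)

    # Stage next to the target so the final renames stay on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix="faces_staging_", dir=primary_faces.parent))
    rebuilt = []
    for i, entry in enumerate(ranked, start=1):
        dest = staging / f"{i:04d}.png"  # renumber 0001..NNNN
        shutil.copy2(entry["path"], dest)
        rebuilt.append({**entry, "path": primary_faces / dest.name})

    # Carry the primary's _dropped/ through the merge.
    dropped = primary_faces / "_dropped"
    if dropped.is_dir():
        shutil.copytree(dropped, staging / "_dropped")

    # Swap: move the old dir aside, rename staging into place, drop the old copy.
    old = primary_faces.with_name(primary_faces.name + ".old")
    os.rename(primary_faces, old)
    os.rename(staging, primary_faces)
    shutil.rmtree(old)
    return rebuilt
```

Each rename is atomic on a single filesystem, but the pair is not, so a crash between the two renames would leave the `.old` directory to recover from by hand.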
2. Age extension of faceset_001 era buckets
2.1 Why a follow-on pass
Consolidation absorbed faceset_007/132/151 into faceset_001 (+46 PNGs).
The original age_split_001.py had bucketed peter into 6 era anchors
(_2005-10, _2010-13, _2011, _2014-17, _2018-19, _2018-20), but
those new PNGs had never been seen by age_split. They sat in faceset_001's
parent-only set, missing from every era .fsz.
2.2 Era-label pitfall
The 6 anchor era labels are NOT strict year ranges. Each anchor carries a
dom-year (the dominant, i.e. most common, member year), derived via
Counter(years).most_common(1) over the original sub-cluster:
| label | dom_year | actual span of members |
|---|---|---|
| _2005-10 | 2010 | 2005–2010 |
| _2010-13 | 2011 | 2007–2024 |
| _2011 | 2011 | 2011 only |
| _2014-17 | 2016 | 2005–2018 |
| _2018-19 | 2018 | 2012–2020 |
| _2018-20 | 2019 | 2014–2022 |
The clusters are appearance-anchored, not year-bounded. Year is a descriptive label. Assignment rule must use dom-year, not member span.
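To make the dom-year versus span distinction concrete, here is a small illustration. The member years are made up, chosen only so that the span matches the 2005–2018 row; the function name is hypothetical.

```python
from collections import Counter

def dom_year(years: list[int]) -> int:
    """Most common year among an anchor's members."""
    return Counter(years).most_common(1)[0][0]

# Hypothetical member years for an anchor shaped like _2014-17.
years = [2005, 2014, 2015, 2016, 2016, 2016, 2017, 2018]
anchor_dom_year = dom_year(years)  # 2016
span = (min(years), max(years))    # (2005, 2018)
```

A 2009 photo falls inside the span but is 7 years from the dom-year, so a rule keyed on dom-year with a ±5 window would reject it while a span-based rule would accept it.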
2.3 Algorithm
For each unbucketed face entry in faceset_001's manifest (50 of 753):
- Look up embedding in cache by (source, bbox).
- Look up EXIF year via work/cache/age_split_exif.json; fetch on cache miss.
- Find single nearest era anchor by cosine distance to its centroid.
- Accept iff dist ≤ 0.40 AND |year − anchor.dom_year| ≤ 5. These thresholds match age_split_001.py's anchor-fragment rule.
- Anchors are NOT re-centered after absorption (preserves age_split's drift-prevention guarantee).
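The acceptance rule above can be sketched as a single function. The anchor layout (label → (centroid, dom_year)), the function names, and the 2-D toy centroids are hypothetical, and cosine distance is assumed to be 1 − cosine similarity. Only the single nearest anchor is tried; there is no fallback to the second nearest.

```python
import math

MAX_DIST = 0.40       # same threshold as the anchor-fragment rule
MAX_YEAR_DELTA = 5

def cosine_distance(a, b) -> float:
    """1 - cosine similarity (assumed definition of 'dist')."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def assign_era(embedding, year, anchors):
    """Return the accepted anchor label, or None if the face stays unbucketed.

    anchors: label -> (centroid, dom_year). Centroids are fixed inputs;
    nothing here re-centers them after a face is accepted.
    """
    if year is None:  # no EXIF year: skipped
        return None
    label, (centroid, dom) = min(
        anchors.items(), key=lambda kv: cosine_distance(embedding, kv[1][0])
    )
    dist = cosine_distance(embedding, centroid)
    if dist <= MAX_DIST and abs(year - dom) <= MAX_YEAR_DELTA:
        return label
    return None

# Toy anchors: real centroids are arcface vectors, not 2-D points.
anchors = {
    "_2014-17": ([1.0, 0.0], 2016),
    "_2018-19": ([0.0, 1.0], 2018),
}
accepted = assign_era([0.9, 0.1], 2013, anchors)       # close to _2014-17, delta 3
year_rejected = assign_era([0.1, 0.9], 2025, anchors)  # nearest is _2018-19, delta 7 > 5
```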
2.4 Run results
50 unbucketed → 21 with EXIF year → 14 accepted:
| anchor | dom_year | added |
|---|---|---|
| _2005-10 | 2010 | +2 |
| _2010-13 | 2011 | +1 |
| _2014-17 | 2016 | +9 |
| _2018-20 | 2019 | +2 |
29 PNGs skipped for missing EXIF year (mostly immich-stripped
photos). 7 dist/year-rejected (e.g. two PNGs from 2025 want
_2018-19 but year-delta 7 > 5).
2.5 Reconciliation side effect
The apply rebuilds each affected era bucket's faces/ from staging. This
incidentally reconciled the per-bucket manifests with disk after the prior
occlusion filter run had left era manifests stale at 282/126/132 entries vs
~248/125/129 actual files (occlusion filter only updates the master
manifest, never per-faceset manifests — see
docs/analysis/clip-occlusion-filter.md §7). 42 occlusion-dropped era PNGs
inside the old faces/_dropped/ were removed during rebuild. The
parent faceset_001/faces/_dropped/ still has the corpus-level audit; all
source images are intact at /mnt/x/src/, so the era-level dropped PNGs
are regeneratable via cmd_export_swap.
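Manifest-versus-disk drift of this kind can be detected with a simple set comparison. The sketch below is an illustration only: the function name is hypothetical, and it assumes a manifest can be reduced to a list of PNG filenames.

```python
from pathlib import Path

def manifest_drift(manifest_files: list[str], faces_dir: Path) -> tuple[set[str], set[str]]:
    """Return (listed but missing on disk, on disk but not listed).

    Only top-level PNGs are considered, so files under _dropped/ do not count.
    """
    on_disk = {p.name for p in faces_dir.glob("*.png")}
    listed = set(manifest_files)
    return listed - on_disk, on_disk - listed
```

A non-empty first set corresponds to the stale entries described above: files the manifest still lists after the occlusion filter moved them out of faces/.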
3. Re-running
Always run both passes after any new identity import (Immich, osrc, hand-sorted folder):

    # 1. Find duplicate identities
    python work/consolidate_facesets.py analyze \
        --out work/merge_review/candidates.json [--edge 0.55 --confident 0.65]
    python work/consolidate_facesets.py report \
        --candidates work/merge_review/candidates.json --out work/merge_review
    # inspect work/merge_review/index.html
    python work/consolidate_facesets.py apply \
        --candidates work/merge_review/candidates.json [--include-uncertain]

    # 2. Slot new faceset_001 PNGs into existing era buckets
    python work/age_extend_001.py analyze --out work/age_extend/candidates.json
    python work/age_extend_001.py report \
        --candidates work/age_extend/candidates.json --out work/age_extend
    python work/age_extend_001.py apply --candidates work/age_extend/candidates.json

Both are idempotent. consolidate_facesets skips secondaries already in
_merged/; age_extend_001 recomputes anchor centroids + dom-year fresh
on every run.