Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:
- filter_occlusions.py + clip_worker.py: CLIP zero-shot filter for masks
and sunglasses (open_clip ViT-L-14/dfn2b_s39b). WSL stages the images;
Windows scores them on DirectML via the new C:\clip_dml_venv. Image-level
threshold 0.7; faceset-level quarantine at 40% domain dominance.
- consolidate_facesets.py: duplicate-identity merger using complete-linkage
centroid clustering on cached arcface embeddings. Single-linkage chains
catastrophically (60-faceset clusters with min sim < 0); complete-linkage
guarantees every within-group pair has sim >= the edge threshold.
- age_extend_001.py: slots newly-added PNGs into existing era buckets of
faceset_001 using the same anchor-fragment rule as age_split_001.py
(dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.
- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
passes: cross-family SHA256 byte-dedup (preserves intra-family era
duplication), within-faceset near-dup pruning at sim >= 0.95, and a
multi-face audit (the load-bearing roop invariant). Multi-face worker hits
~19 img/s on AMD Vega, ~7x embed_worker's throughput, because its input is
512x512 crops.
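
Why complete-linkage rather than single-linkage, as a minimal sketch. This
is not the code in consolidate_facesets.py; the `cluster` helper, the 2-D
toy vectors, and the 0.7 edge value are assumptions made for illustration.
Three centroids form a chain: neighbours are similar, the two ends are not.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cluster(centroids, edge, linkage=min):
    """Greedy agglomerative clustering over cosine similarity.

    linkage=min -> complete-linkage: merge only if the WORST cross pair >= edge.
    linkage=max -> single-linkage: merge if the BEST cross pair >= edge.
    """
    clusters = [[i] for i in range(len(centroids))]

    def link(c1, c2):
        return linkage(cosine(centroids[i], centroids[j]) for i in c1 for j in c2)

    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = link(clusters[a], clusters[b])
                if s >= edge and (best is None or s > best[0]):
                    best = (s, a, b)
        if best is None:
            return clusters
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

def unit(deg):
    r = math.radians(deg)
    return (math.cos(r), math.sin(r))

# Toy chain: sim(0,1) ~ 0.82, sim(1,2) ~ 0.71, sim(0,2) ~ 0.17
chain = [unit(0), unit(35), unit(80)]
complete = cluster(chain, edge=0.7, linkage=min)  # [[0, 1], [2]]
single = cluster(chain, edge=0.7, linkage=max)    # [[0, 1, 2]]
```

Single-linkage pulls centroid 2 into the group through centroid 1 even
though it is nearly orthogonal to centroid 0; complete-linkage refuses the
merge because the worst pair falls below the edge.
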
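
The anchor-fragment rule used by age_extend_001.py (dist <= 0.40 AND
|year_delta| <= 5) can be sketched as below. Only the two thresholds come
from this commit; the `Anchor` type, the `pick_bucket` name, the bucket
labels, and the nearest-eligible tie-break are hypothetical, and `dists`
is assumed to be precomputed with whatever metric age_split_001.py uses.

```python
from dataclasses import dataclass

MAX_DIST = 0.40        # threshold from the commit message
MAX_YEAR_DELTA = 5     # threshold from the commit message

@dataclass(frozen=True)  # frozen: anchors are read, never re-centered
class Anchor:
    bucket: str
    year: int

def pick_bucket(dists, year, anchors):
    """Return the bucket of the nearest anchor passing both tests, else None.

    dists[i] is the precomputed distance from the new image to anchors[i].
    """
    eligible = [
        (d, a.bucket)
        for d, a in zip(dists, anchors)
        if d <= MAX_DIST and abs(year - a.year) <= MAX_YEAR_DELTA
    ]
    return min(eligible)[1] if eligible else None

anchors = [Anchor("era_2005", 2005), Anchor("era_2015", 2015)]
near_2005 = pick_bucket([0.30, 0.25], 2008, anchors)  # 2015 fails the year test
no_match = pick_bucket([0.45, 0.50], 2008, anchors)   # both fail the dist test
```
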
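
A rough sketch of the cross-family byte-dedup pass, under stated
assumptions: `cross_family_dupes` is a hypothetical name, each item carries
an explicit family label, and the first family seen for a digest is the one
kept. dedup_optimize.py may choose the surviving family differently.

```python
import hashlib
from collections import defaultdict

def cross_family_dupes(items):
    """items: iterable of (path, family, data_bytes). Returns paths to drop.

    Byte-identical files inside one family are left alone (era duplication);
    only copies that live in a DIFFERENT family are flagged.
    """
    by_digest = defaultdict(list)
    for path, family, data in items:
        by_digest[hashlib.sha256(data).hexdigest()].append((path, family))

    drop = []
    for group in by_digest.values():
        keep_family = group[0][1]  # assumption: first family seen wins
        drop.extend(p for p, fam in group if fam != keep_family)
    return drop

items = [
    ("famA_era1/x.png", "famA", b"same-bytes"),
    ("famA_era2/x.png", "famA", b"same-bytes"),  # intra-family copy: kept
    ("famB/y.png", "famB", b"same-bytes"),       # cross-family copy: dropped
    ("famB/z.png", "famB", b"other-bytes"),
]
to_drop = cross_family_dupes(items)
```
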
Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 active / 51 masked / 71 thin / 74 merged; active PNGs 6,440 → 3,849.
All quarantines and prunes are preserved on disk (faces/_dropped/,
_masked/, _merged/, _thin/) for full reversibility. Master manifest gains
masked[], merged[], and per-run provenance blocks.
Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>