Add post-export corpus maintenance pipeline

Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores via
  new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level quarantine
  at 40% domain dominance.
- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.
- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.
- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes: cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega, ~7x embed_worker, because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs.

All quarantines and prunes preserved on disk (faces/_dropped/, _masked/,
_merged/, _thin/) for full reversibility. Master manifest gains masked[],
merged[], plus per-run provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:41:18 +02:00
parent e66c97fd58
commit 49a43c7685
10 changed files with 3250 additions and 1 deletions
--- a/docs/analysis/dedup-and-roop-optimization.md
+++ b/docs/analysis/dedup-and-roop-optimization.md
@@ -0,0 +1,155 @@
+# Corpus dedup + roop-unleashed optimization
+
+_Run date: 2026-04-27. Driver scripts: `work/dedup_optimize.py`, `work/multiface_worker.py`._
+
+After consolidation collapsed duplicate identities and age-extend slotted
+new PNGs into era buckets, the corpus still carried artifacts that hurt
+roop's averaged-embedding quality:
+
+- **Burst-photo near-duplicates** within facesets, especially in
+  immich-discovered identities where source libraries had many similar
+  shots within seconds.
+- **Cross-faceset byte-identical PNGs** that escaped consolidation's
+  centroid-similarity matching when individual PNGs matched exactly but
+  cluster centroids diverged.
+- **Multi-face PNGs** that polluted identity averaging because the roop
+  loader appends every detected face per PNG to the FaceSet (a load-bearing
+  invariant; see § 3).
+
+This pipeline runs three independent passes and an optional fourth, all
+moving dropped PNGs to `<faceset>/faces/_dropped/` for reversibility.
+
+## 1. Cross-family byte-dedup
+
+SHA256-hash every PNG in the active corpus (parallel I/O via
+`ThreadPoolExecutor(max_workers=16)`, ~17 s for 5,386 PNGs over the
+`/mnt/e/` Windows mount). Group by hash; for groups with members in
+multiple identity families, keep the higher-tier copy.
+
+**Family detection**: regex `^(faceset_\d+)(?:_.+)?$` — captures the parent
+identity. Same family includes parent + era splits (e.g. `faceset_001` +
+`faceset_001_2010-13`); these are intentional duplications for the era
+.fsz files and are preserved.
+
+Run results: 20 cross-family hash groups → 24 PNGs dropped. Most cases were
+small immich identity-cluster errors that consolidation missed because
+individual PNG embeddings matched but the cluster mean did not.
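
A minimal sketch of the hashing and family-grouping step described in this section, not the code in `work/dedup_optimize.py`. The helper names (`family_of`, `sha256_of`, `cross_family_groups`) and the assumption that active PNGs sit directly under `<faceset>/faces/` are illustrative. It only reports cross-family groups; choosing the "higher-tier" copy is left out because the tier rule is not specified here.

```python
import hashlib
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Regex quoted in the section above: group 1 is the parent identity.
FAMILY_RE = re.compile(r"^(faceset_\d+)(?:_.+)?$")


def family_of(faceset_name: str) -> str:
    """Map 'faceset_001_2010-13' to 'faceset_001'; unknown names map to themselves."""
    m = FAMILY_RE.match(faceset_name)
    return m.group(1) if m else faceset_name


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large PNGs are not read in one piece."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cross_family_groups(corpus_root: Path, workers: int = 16) -> list[list[Path]]:
    """Return hash groups whose members span more than one identity family."""
    # Assumed layout: <corpus_root>/<faceset>/faces/*.png. Quarantine
    # subdirectories such as faces/_dropped/ are one level deeper and are skipped.
    pngs = sorted(corpus_root.glob("*/faces/*.png"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(sha256_of, pngs))

    by_hash = defaultdict(list)
    for path, digest in zip(pngs, digests):
        by_hash[digest].append(path)

    groups = []
    for paths in by_hash.values():
        # parts[-3] is the faceset directory name under the assumed layout.
        families = {family_of(p.parts[-3]) for p in paths}
        if len(families) > 1:
            groups.append(paths)
    return groups
```

Threads fit here because the work is dominated by file reads over the Windows mount rather than by Python-level computation. Duplicates that stay inside one family (parent plus era splits) never form a group, which matches the intent of preserving era duplication.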
+
+## 2. Within-faceset near-dup at sim ≥ 0.95
+
+For each faceset, compute pairwise cosine similarity on the cached arcface
+embeddings and take the connected components of the `sim ≥ 0.95` graph.
+Keep the PNG with the highest `quality.composite` in each component and
+drop the rest.
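
The pass above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the repository's implementation: `near_dup_drops` is a hypothetical helper, embeddings are plain lists of floats, and `quality` stands in for the per-PNG `quality.composite` score.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def near_dup_drops(embeddings, quality, threshold=0.95):
    """Return sorted indices to drop: everything except the best-quality
    member of each connected component of the sim >= threshold graph."""
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair at or above the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)

    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)

    drops = []
    for members in components.values():
        keep = max(members, key=lambda i: quality[i])
        drops.extend(i for i in members if i != keep)
    return sorted(drops)
```

For example, `near_dup_drops([[1, 0], [0.999, 0.01], [0, 1]], [0.5, 0.9, 0.7])` links only the first two vectors and drops index 0, the lower-quality member of that pair.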
+
+**Threshold rationale**: legitimate same-person-different-pose pairs land at
+0.5–0.85; ≥ 0.95 means essentially the same shot (burst frames or
+recompressed dupes). Roop's `FaceSet.AverageEmbeddings()` averages all faces
+into `faces[0].embedding`. Several near-identical embeddings average to
+almost the same vector as any one of them, so removing the extras loses no
+identity information; it only removes the extra weight that the
+most-photographed moments would otherwise carry in the mean.
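
A toy illustration of that bias, assuming the averaging is a plain arithmetic mean (the roop internals are not reproduced here). The two-dimensional vectors are made up for the example: `burst` stands for a shot that appears five times, `other` for a different pose that appears once.

```python
import math


def mean_vec(vectors):
    """Component-wise arithmetic mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


burst = [1.0, 0.0]   # hypothetical embedding of a burst-photographed moment
other = [0.0, 1.0]   # hypothetical embedding of a different pose

with_dupes = mean_vec([burst] * 5 + [other])  # five burst frames, one other shot
deduped = mean_vec([burst, other])            # one representative of each

print(cosine(with_dupes, burst))  # about 0.98: the mean is pulled toward the burst
print(cosine(deduped, burst))     # about 0.71: both shots weigh equally
```

With the duplicates in place the averaged vector sits almost on top of the burst shot; after dedup each distinct shot contributes equally.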
+
+Run results: 851 groups → **1,225 PNGs dropped** (23 % of corpus).
+Most-affected: `faceset_026` (-132 of 262), `faceset_027` (-107),
+`faceset_028` (-92), `faceset_030` (-92). All immich-discovered identities
+where the source library had burst sequences.
+
+## 3. Multi-face audit (load-bearing roop invariant)
+
+The roop loader at `roop/ui/tabs/faceswap_tab.py:661–691` runs
+`extract_face_images(filename, (False, 0))` on every PNG and **appends every
+detected face** to `face_set.faces`. A multi-face PNG therefore pollutes the
+averaged identity. The export-swap pipeline drops multi-face crops at
+creation, but post-pipeline operations (consolidation, age-extend) move
+PNGs across facesets without re-checking.
+
+**This audit re-detects every PNG** with insightface FaceAnalysis, counts
+the detections that pass `det_score ≥ 0.5` and `face_short ≥ 40`, and flags
+any PNG with `face_count ≠ 1`. Two cases are flagged:
+- ≥ 2 faces → loader will inject extra identities into averaging
+- 0 faces → insightface can't find a face on the cropped PNG; useless for
+  roop, would silently fail
+
+Run results: 4,146 PNGs scored, 332 flagged (272 with 2 faces, 9 with 3,
+2 with 4, **49 with 0**). 82 facesets affected.
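
The counting rule can be sketched independently of the detector. This is a hedged illustration: `count_valid_faces` and `classify` are hypothetical names, detections are plain `(bbox, det_score)` pairs rather than insightface objects, and `face_short` is assumed to mean the shorter side of the bounding box.

```python
def count_valid_faces(detections, min_score=0.5, min_short=40):
    """Count detections that pass both filters.

    Each detection is ((x1, y1, x2, y2), det_score). face_short is assumed
    to be the shorter side of the bounding box.
    """
    count = 0
    for (x1, y1, x2, y2), det_score in detections:
        face_short = min(x2 - x1, y2 - y1)
        if det_score >= min_score and face_short >= min_short:
            count += 1
    return count


def classify(face_count):
    """Map a face count to an audit verdict; only exactly one face passes."""
    if face_count == 0:
        return "no_face"
    if face_count == 1:
        return "ok"
    return "multi_face"


# One clear face plus a tiny background face that the size filter ignores.
detections = [((100, 90, 400, 420), 0.91), ((10, 10, 35, 40), 0.80)]
print(classify(count_valid_faces(detections)))  # ok
```

Applying the filters before counting keeps small or low-confidence background detections from flagging an otherwise clean crop.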
+
+## 4. DML throughput jump for face crops
+
+The audit reuses the same insightface + onnxruntime-directml stack as
+`embed_worker.py` but achieves **~19 img/s** on AMD Vega vs embed_worker's
+2.6 img/s — same model, same hardware. The difference is input size:
+
+| stage | typical input | DML throughput |
+|-------|--------------|---------------:|
+| `embed_worker.py` (Immich import) | 1024–4000 px source | 2.6 img/s |
+| `multiface_worker.py` (this audit) | 512×512 face crops | **19 img/s** |
+
+Detection on small inputs is fast; recognition on aligned 112×112 inputs is
+the same cost either way. Implication: **any pipeline operating on
+already-cropped face PNGs can rely on a roughly 7× higher DML throughput
+ceiling than full-resolution embedding**.
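
As a worked check of the ratio, using only the figures quoted in this document (the second wall-clock estimate is hypothetical: it shows what the same audit would cost at the `embed_worker.py` rate):

```python
EMBED_RATE = 2.6        # img/s, embed_worker.py on full-resolution sources
MULTIFACE_RATE = 19.0   # img/s, multiface_worker.py on 512x512 crops
AUDITED_PNGS = 4146     # PNGs scored in the multi-face audit

speedup = MULTIFACE_RATE / EMBED_RATE
audit_seconds = AUDITED_PNGS / MULTIFACE_RATE
slow_seconds = AUDITED_PNGS / EMBED_RATE  # hypothetical, for comparison only

print(f"speedup: {speedup:.1f}x")
print(f"audit at 19 img/s: {audit_seconds / 60:.1f} min")
print(f"same audit at 2.6 img/s: {slow_seconds / 60:.1f} min")
```

The ratio works out to about 7.3×, which is where the "roughly 7×" figure comes from.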
+
+## 5. Architecture
+
+```
+   ┌────────────────────────────────────────────┐
+   │ WSL  /opt/face-sets/work/dedup_optimize.py │
+   │  • analyze:      hashes + within-faceset sim │
+   │  • apply:         move + re-zip (no GPU)     │
+   │  • stage_multiface: write queue.json         │
+   │  • merge_multiface: ingest worker results    │
+   │  • apply_multiface: move + re-zip             │
+   │  • report:        HTML audit                  │
+   └────────────┬───────────────────────────────┘
+                │ queue.json via \\wsl.localhost\
+                ▼
+   ┌────────────────────────────────────────────┐
+   │ Windows  C:\face_embed_venv\               │
+   │  /opt/face-sets/work/multiface_worker.py    │
+   │  insightface FaceAnalysis on DmlExecutionProvider │
+   │  Reads PNGs from native E:\, writes face_count │
+   └────────────────────────────────────────────┘
+```
+
+Reuses the existing `C:\face_embed_venv\` (no new venv needed — same
+insightface stack as `embed_worker.py`).
+
+## 6. Final corpus state (2026-04-27 night)
+
+| metric | start of day | after occlusion filter | after consolidation | after age-extend | after this dedup + multiface |
+|--------|-------------:|----------------------:|-------------------:|-----------------:|----------------------------:|
+| active facesets | 311 | 255 | 181 | 181 | **181** |
+| active PNGs | ~6,440 | 5,386 | 5,386 | 5,400 | **3,849** |
+| `_masked/` | 0 | 51 | 51 | 51 | 51 |
+| `_thin/` | 68 | 71 | 71 | 71 | 71 |
+| `_merged/` | 0 | 0 | 74 | 74 | 74 |
+
+Net reduction at the end of the day: **2,591 PNGs and 130 facesets** removed
+or quarantined from the active pool. All are preserved on disk for
+reversibility (`<faceset>/faces/_dropped/` for prunes; `_masked/`,
+`_merged/`, and `_thin/` for quarantines).
+
+## 7. Re-running
+
+Run after any new import / consolidation / extend:
+
+```bash
+# 1. Byte-dedup + within-faceset near-dup (CPU only)
+python work/dedup_optimize.py analyze --out work/dedup_audit/dedup_plan.json
+python work/dedup_optimize.py apply  --plan work/dedup_audit/dedup_plan.json
+
+# 2. Multi-face audit on Windows DML (resumable)
+python work/dedup_optimize.py stage_multiface --out work/dedup_audit/multiface_queue.json
+"/mnt/c/face_embed_venv/Scripts/python.exe" work/multiface_worker.py \
+  work/dedup_audit/multiface_queue.json work/dedup_audit/multiface_results.json
+python work/dedup_optimize.py merge_multiface \
+  --results work/dedup_audit/multiface_results.json \
+  --out work/dedup_audit/multiface_plan.json
+python work/dedup_optimize.py apply_multiface \
+  --plan work/dedup_audit/multiface_plan.json
+
+# 3. HTML audit
+python work/dedup_optimize.py report \
+  --dedup work/dedup_audit/dedup_plan.json \
+  --multiface work/dedup_audit/multiface_plan.json \
+  --out work/dedup_audit
+```