# Corpus dedup + roop-unleashed optimization

_Run date: 2026-04-27. Driver scripts: `work/dedup_optimize.py`, `work/multiface_worker.py`._

After consolidation collapsed duplicate identities and age-extend slotted new PNGs into era buckets, the corpus still carried artifacts that hurt roop's averaged-embedding quality:

- **Burst-photo near-duplicates** within facesets, especially in immich-discovered identities where source libraries had many similar shots within seconds.
- **Cross-faceset byte-identical PNGs** that escaped consolidation's centroid-similarity matching when individual PNGs matched exactly but cluster centroids diverged.
- **Multi-face PNGs** that polluted identity averaging because the roop loader appends every detected face per PNG to the FaceSet (load-bearing invariant; see § 3).

This pipeline runs three independent passes and an optional fourth, all moving dropped PNGs to `/faces/_dropped/` for reversibility.

## 1. Cross-family byte-dedup

SHA256-hash every PNG in the active corpus (parallel I/O via `ThreadPoolExecutor(max_workers=16)`, ~17 s for 5,386 PNGs over the `/mnt/e/` Windows mount). Group by hash; for groups with members in multiple identity families, keep the higher-tier copy.

**Family detection**: the regex `^(faceset_\d+)(?:_.+)?$` captures the parent identity. A family comprises the parent plus its era splits (e.g. `faceset_001` + `faceset_001_2010-13`); duplicates within a family are intentional, because the era `.fsz` files need them, and are preserved.

Run results: 20 cross-family hash groups → 24 PNGs dropped. Most cases were small immich identity-cluster errors that consolidation missed because individual PNG embeddings matched but the cluster mean did not.

## 2. Within-faceset near-dup at sim ≥ 0.95

Compute per-faceset pairwise cosine similarity on cached arcface embeddings and take the connected components of the `sim ≥ 0.95` graph. Keep the member with the highest `quality.composite` in each component and drop the rest.
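The component-and-keep-best step above can be sketched as follows. This is a minimal illustration, not the code in `work/dedup_optimize.py`: the function name `near_dup_drops`, the plain list of quality scores standing in for `quality.composite`, and the union-find implementation are all assumptions made for the example.

```python
import numpy as np


def near_dup_drops(embeddings, quality, threshold=0.95):
    """Return sorted indices to drop within one faceset.

    Hypothetical helper: builds the `sim >= threshold` graph on cosine
    similarity, finds connected components with union-find, and keeps
    only the highest-quality member of each component.
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    # L2-normalise so the dot product equals cosine similarity
    # (assumes no all-zero embedding).
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    n = len(emb)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair (i < j) whose similarity reaches the threshold.
    for i, j in zip(*np.where(np.triu(sim >= threshold, k=1))):
        ri, rj = find(int(i)), find(int(j))
        if ri != rj:
            parent[rj] = ri

    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)

    drops = []
    for members in components.values():
        keep = max(members, key=lambda i: quality[i])
        drops.extend(i for i in members if i != keep)
    return sorted(drops)
```

Because components are transitive, a burst of frames A, B, C collapses to one survivor even if only A–B and B–C clear the threshold. The dense `n × n` similarity matrix is fine at faceset scale (hundreds of PNGs) but would need chunking for much larger sets.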
**Threshold rationale**: legitimate same-person-different-pose pairs land at 0.5–0.85; ≥ 0.95 means essentially the same shot (burst frames or recompressed dupes). Roop's `FaceSet.AverageEmbeddings()` averages all faces into `faces[0].embedding`, so a group of near-identical embeddings contributes about the same direction as a single copy, only with more weight. Removing them does not lose identity information; it removes a bias weight on the most-photographed moments.

Run results: 851 groups → **1,225 PNGs dropped** (23 % of corpus). Most-affected: `faceset_026` (-132 of 262), `faceset_027` (-107), `faceset_028` (-92), `faceset_030` (-92). All are immich-discovered identities where the source library had burst sequences.

## 3. Multi-face audit (load-bearing roop invariant)

The roop loader at `roop/ui/tabs/faceswap_tab.py:661–691` runs `extract_face_images(filename, (False, 0))` on every PNG and **appends every detected face** to `face_set.faces`. A multi-face PNG therefore pollutes the averaged identity. The export-swap pipeline drops multi-face crops at creation, but post-pipeline operations (consolidation, age-extend) move PNGs across facesets without re-checking.

**This audit re-detects every PNG** with insightface FaceAnalysis and flags any with `face_count ≠ 1` (detections filtered by `det_score ≥ 0.5` and `face_short ≥ 40`). Flagged cases:

- ≥ 2 faces → the loader will inject extra identities into averaging
- 0 faces → insightface can't find a face on the cropped PNG; it is useless for roop and would silently fail

Run results: 4,146 PNGs scored, 332 flagged (272 with 2 faces, 9 with 3, 2 with 4, **49 with 0**). 82 facesets affected.

## 4. DML throughput jump for face crops

The audit reuses the same insightface + onnxruntime-directml stack as `embed_worker.py` but achieves **~19 img/s** on AMD Vega vs embed_worker's 2.6 img/s, with the same model and the same hardware.
The difference is input size:

| stage | typical input | DML throughput |
|-------|--------------|---------------:|
| `embed_worker.py` (Immich import) | 1024–4000 px source | 2.6 img/s |
| `multiface_worker.py` (this audit) | 512×512 face crops | **19 img/s** |

Detection on small inputs is fast; recognition on aligned 112×112 inputs is the same cost either way. Implication: **any pipeline operating on already-cropped face PNGs can rely on a roughly 7× higher DML throughput ceiling than full-resolution embedding**.

## 5. Architecture

```
┌──────────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/dedup_optimize.py        │
│  • analyze: hashes + within-faceset sim          │
│  • apply: move + re-zip (no GPU)                 │
│  • stage_multiface: write queue.json             │
│  • merge_multiface: ingest worker results        │
│  • apply_multiface: move + re-zip                │
│  • report: HTML audit                            │
└────────────┬─────────────────────────────────────┘
             │ queue.json via \\wsl.localhost\
             ▼
┌──────────────────────────────────────────────────┐
│ Windows C:\face_embed_venv\                      │
│ /opt/face-sets/work/multiface_worker.py          │
│ insightface FaceAnalysis on DmlExecutionProvider │
│ Reads PNGs from native E:\, writes face_count    │
└──────────────────────────────────────────────────┘
```

Reuses the existing `C:\face_embed_venv\` (no new venv needed; it is the same insightface stack as `embed_worker.py`).

## 6. Final corpus state (2026-04-27 night)

| metric | start of day | after occlusion filter | after consolidation | after age-extend | after this dedup + multiface |
|--------|-------------:|----------------------:|-------------------:|-----------------:|----------------------------:|
| active facesets | 311 | 255 | 181 | 181 | **181** |
| active PNGs | ~6,440 | 5,386 | 5,386 | 5,400 | **3,849** |
| `_masked/` | 0 | 51 | 51 | 51 | 51 |
| `_thin/` | 68 | 71 | 71 | 71 | 71 |
| `_merged/` | 0 | 0 | 74 | 74 | 74 |

Net reduction at the end of the day: **2,591 PNGs and 130 facesets** removed or quarantined from the active pool.
All preserved on disk for reversibility (`/faces/_dropped/` for prunes; `_masked/`, `_merged/`, `_thin/` for quarantines).

## 7. Re-running

Run after any new import / consolidation / extend:

```bash
# 1. Byte-dedup + within-faceset near-dup (CPU only)
python work/dedup_optimize.py analyze --out work/dedup_audit/dedup_plan.json
python work/dedup_optimize.py apply --plan work/dedup_audit/dedup_plan.json

# 2. Multi-face audit on Windows DML (resumable)
python work/dedup_optimize.py stage_multiface --out work/dedup_audit/multiface_queue.json
"/mnt/c/face_embed_venv/Scripts/python.exe" work/multiface_worker.py \
    work/dedup_audit/multiface_queue.json work/dedup_audit/multiface_results.json
python work/dedup_optimize.py merge_multiface \
    --results work/dedup_audit/multiface_results.json \
    --out work/dedup_audit/multiface_plan.json
python work/dedup_optimize.py apply_multiface \
    --plan work/dedup_audit/multiface_plan.json

# 3. HTML audit
python work/dedup_optimize.py report \
    --dedup work/dedup_audit/dedup_plan.json \
    --multiface work/dedup_audit/multiface_plan.json \
    --out work/dedup_audit
```
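For readers adapting the `analyze` step, the byte-dedup grouping described in § 1 can be sketched roughly as below. Only the regex and the `ThreadPoolExecutor(max_workers=16)` choice come from this document. The function names and the assumption that each PNG sits in a directory named after its faceset are hypothetical. Tier-based selection of which copy to keep is left out because the tier rules are not described here.

```python
import hashlib
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Regex from § 1: group 1 is the parent identity, era suffix is optional.
FAMILY_RE = re.compile(r"^(faceset_\d+)(?:_.+)?$")


def family_of(faceset_name):
    """Map e.g. 'faceset_001_2010-13' to its parent 'faceset_001'."""
    m = FAMILY_RE.match(faceset_name)
    return m.group(1) if m else faceset_name


def sha256_file(path):
    """Hash a file in 1 MiB chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cross_family_groups(png_paths, max_workers=16):
    """Return {digest: [paths]} for hashes seen in more than one family.

    Assumes (hypothetically) that each PNG's parent directory name is
    its faceset name.
    """
    paths = [Path(p) for p in png_paths]
    # Threads help here because the work is dominated by file I/O.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        digests = list(pool.map(sha256_file, paths))

    by_hash = defaultdict(list)
    for path, digest in zip(paths, digests):
        by_hash[digest].append(path)

    return {
        digest: members
        for digest, members in by_hash.items()
        if len({family_of(p.parent.name) for p in members}) > 1
    }
```

Groups whose members all belong to one family (a parent plus its era splits) are deliberately excluded, matching the rule that intra-family duplicates are preserved.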