face-sets/docs/analysis/dedup-and-roop-optimization.md
Peter 49a43c7685 Add post-export corpus maintenance pipeline
Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores
  via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level
  quarantine at 40% domain dominance.

- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.

- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.

- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes — cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega — ~7x embed_worker because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes
preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full
reversibility. Master manifest gains masked[], merged[], plus per-run
provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:41:18 +02:00

# Corpus dedup + roop-unleashed optimization
_Run date: 2026-04-27. Driver scripts: `work/dedup_optimize.py`, `work/multiface_worker.py`._
After consolidation collapsed duplicate identities and age-extend slotted
new PNGs into era buckets, the corpus still carried artifacts that hurt
roop's averaged-embedding quality:
- **Burst-photo near-duplicates** within facesets, especially in
immich-discovered identities where source libraries had many similar
shots within seconds.
- **Cross-faceset byte-identical PNGs** that escaped consolidation's
centroid-similarity matching when individual PNGs matched exactly but
cluster centroids diverged.
- **Multi-face PNGs** that polluted identity averaging because the roop
loader appends every detected face per PNG to the FaceSet (a load-bearing
invariant; see § 3).
This pipeline runs three independent passes and an optional fourth, all
moving dropped PNGs to `<faceset>/faces/_dropped/` for reversibility.
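
The reversible drop described above can be sketched as a move rather than a delete. This is a minimal illustration, not the code in `dedup_optimize.py`: the helper names `drop_png` and `restore_png` are hypothetical, and it assumes each active PNG sits directly in `<faceset>/faces/`.

```python
import shutil
from pathlib import Path


def drop_png(png: Path) -> Path:
    """Move a PNG into a sibling _dropped/ directory instead of deleting it."""
    dropped_dir = png.parent / "_dropped"
    dropped_dir.mkdir(exist_ok=True)
    target = dropped_dir / png.name
    if target.exists():
        # Never overwrite an earlier drop; that would make the move irreversible.
        raise FileExistsError(f"refusing to overwrite {target}")
    shutil.move(str(png), str(target))
    return target


def restore_png(dropped: Path) -> Path:
    """Undo drop_png by moving the file back up into faces/."""
    target = dropped.parent.parent / dropped.name
    shutil.move(str(dropped), str(target))
    return target
```

Because nothing is deleted, a bad threshold choice can be undone by restoring the files and re-zipping the faceset.
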
## 1. Cross-family byte-dedup
SHA256-hash every PNG in the active corpus (parallel I/O via
`ThreadPoolExecutor(max_workers=16)`, ~17 s for 5,386 PNGs over the
`/mnt/e/` Windows mount). Group by hash; for groups with members in
multiple identity families, keep the higher-tier copy.
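
A minimal sketch of the hash-and-group step, assuming plain file reads are enough (the function names are illustrative, not taken from `dedup_optimize.py`). Threads help here because the work is I/O-bound. The "higher-tier" choice is not shown, since the tier ordering is defined elsewhere in the pipeline.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PNGs are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(paths: List[Path], max_workers: int = 16) -> Dict[str, List[Path]]:
    """Return only the hash groups that contain more than one file."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        hashes = list(pool.map(sha256_file, paths))  # map preserves input order
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path, file_hash in zip(paths, hashes):
        groups[file_hash].append(path)
    return {h: members for h, members in groups.items() if len(members) > 1}
```
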
**Family detection**: regex `^(faceset_\d+)(?:_.+)?$` — captures the parent
identity. Same family includes parent + era splits (e.g. `faceset_001` +
`faceset_001_2010-13`); these are intentional duplications for the era
.fsz files and are preserved.
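
The family rule can be illustrated with the regex quoted above. The helper names `family_of` and `is_cross_family` are hypothetical; only the pattern itself comes from this document.

```python
import re
from typing import Iterable, Optional

FAMILY_RE = re.compile(r"^(faceset_\d+)(?:_.+)?$")


def family_of(faceset_name: str) -> Optional[str]:
    """Return the parent identity, e.g. 'faceset_001' for 'faceset_001_2010-13'."""
    match = FAMILY_RE.match(faceset_name)
    return match.group(1) if match else None


def is_cross_family(faceset_names: Iterable[str]) -> bool:
    """True when a hash group spans more than one identity family."""
    return len({family_of(name) for name in faceset_names}) > 1
```

A hash group containing `faceset_001` and `faceset_001_2010-13` stays untouched under this rule, while one containing `faceset_001` and `faceset_026` is a dedup candidate.
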
Run results: 20 cross-family hash groups → 24 PNGs dropped. Most cases were
small immich identity-cluster errors that consolidation missed because
individual PNG embeddings matched but the cluster mean did not.
## 2. Within-faceset near-dup at sim ≥ 0.95
Per-faceset pairwise cosine similarity on cached arcface embeddings.
Connected components in the `sim ≥ 0.95` graph. Keep highest
`quality.composite` per component, drop the rest.
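
A sketch of this pass under stated assumptions: embeddings arrive as a non-zero `(n, d)` NumPy array, `quality` holds one `quality.composite` score per row, and components are found with a small union-find. The function name `near_dup_drops` is illustrative.

```python
from typing import Dict, List

import numpy as np


def near_dup_drops(embeddings: np.ndarray, quality: np.ndarray,
                   threshold: float = 0.95) -> List[int]:
    """Indices to drop: every member of a sim >= threshold component
    except the one with the highest quality score."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T  # pairwise cosine similarity
    n = len(normed)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union every pair above the threshold (upper triangle, diagonal excluded).
    for i, j in zip(*np.nonzero(np.triu(sim >= threshold, k=1))):
        root_i, root_j = find(int(i)), find(int(j))
        if root_i != root_j:
            parent[root_j] = root_i

    components: Dict[int, List[int]] = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)

    drops: List[int] = []
    for members in components.values():
        keep = max(members, key=lambda idx: quality[idx])
        drops.extend(idx for idx in members if idx != keep)
    return sorted(drops)
```

Connected components are transitive, so two members of one component can sit slightly below the threshold relative to each other while still being linked through an intermediate frame.
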
**Threshold rationale**: legitimate same-person-different-pose pairs land at
0.5–0.85; ≥ 0.95 means essentially the same shot (burst frames or
recompressed dupes). Roop's `FaceSet.AverageEmbeddings()` averages all faces
into `faces[0].embedding`, so a group of near-identical embeddings adds
almost nothing that a single copy of that shot would not. Removing the
extras does not lose identity information; it removes the extra weight the
most-photographed moments would otherwise carry in the average.
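
A toy illustration of that bias, using two orthogonal 2-D vectors as stand-ins for two poses. The re-normalisation after averaging is an assumption made only so the cosine comparison is readable; it is not a claim about what roop does internally.

```python
import numpy as np


def mean_embedding(embeddings) -> np.ndarray:
    """Average the embeddings, then L2-normalise (assumed, for comparison only)."""
    mean = np.mean(embeddings, axis=0)
    return mean / np.linalg.norm(mean)


pose_a = np.array([1.0, 0.0])  # stand-in for a burst-photographed pose
pose_b = np.array([0.0, 1.0])  # stand-in for a rarely photographed pose

balanced = mean_embedding([pose_a, pose_b])
burst = mean_embedding([pose_a] * 9 + [pose_b])

sim_balanced_to_a = float(balanced @ pose_a)  # about 0.71
sim_burst_to_a = float(burst @ pose_a)        # about 0.99
```

With nine copies of one pose, the averaged vector collapses almost entirely onto that pose; after deduplication it sits evenly between the two.
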
Run results: 851 groups → **1,225 PNGs dropped** (23 % of corpus).
Most-affected: `faceset_026` (-132 of 262), `faceset_027` (-107),
`faceset_028` (-92), `faceset_030` (-92). All immich-discovered identities
where the source library had burst sequences.
## 3. Multi-face audit (load-bearing roop invariant)
The roop loader at `roop/ui/tabs/faceswap_tab.py:661-691` runs
`extract_face_images(filename, (False, 0))` on every PNG and **appends every
detected face** to `face_set.faces`. A multi-face PNG therefore pollutes the
averaged identity. The export-swap pipeline drops multi-face crops at
creation, but post-pipeline operations (consolidation, age-extend) move
PNGs across facesets without re-checking.
**This audit re-detects every PNG** with insightface FaceAnalysis and flags
any with `face_count ≠ 1` (filtered by `det_score ≥ 0.5` and
`face_short ≥ 40`). Includes:
- ≥ 2 faces → loader will inject extra identities into averaging
- 0 faces → insightface can't find a face on the cropped PNG; useless for
roop, would silently fail
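
The flagging rule can be sketched as a pure function over detections. The `Detection` class here is a hypothetical stand-in for whatever the worker collects per face (insightface face objects expose `bbox` and `det_score`), and reading `face_short` as the shorter bounding-box side in pixels is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Detection:
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    det_score: float


def count_valid_faces(detections: Iterable[Detection],
                      min_score: float = 0.5,
                      min_short: float = 40.0) -> int:
    """Count detections that pass both the score and the size filter."""
    count = 0
    for det in detections:
        x1, y1, x2, y2 = det.bbox
        short_side = min(x2 - x1, y2 - y1)  # assumed meaning of face_short
        if det.det_score >= min_score and short_side >= min_short:
            count += 1
    return count


def should_flag(detections: Iterable[Detection]) -> bool:
    """Flag a PNG when the filtered face count is anything other than 1."""
    return count_valid_faces(detections) != 1
```

Filtering before counting matters: a tiny or low-confidence background face does not cause a flag, while a PNG whose only detection fails the filter is flagged as a 0-face case.
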
Run results: 4,146 PNGs scored, 332 flagged (272 with 2 faces, 9 with 3,
2 with 4, **49 with 0**). 82 facesets affected.
## 4. DML throughput jump for face crops
The audit reuses the same insightface + onnxruntime-directml stack as
`embed_worker.py` but achieves **~19 img/s** on AMD Vega vs embed_worker's
2.6 img/s — same model, same hardware. The difference is input size:
| stage | typical input | DML throughput |
|-------|--------------|---------------:|
| `embed_worker.py` (Immich import) | 1024–4000 px source | 2.6 img/s |
| `multiface_worker.py` (this audit) | 512×512 face crops | **19 img/s** |
Detection on small inputs is fast; recognition on aligned 112×112 inputs is
the same cost either way. Implication: **any pipeline operating on
already-cropped face PNGs can rely on a roughly 7× higher DML throughput
ceiling than full-resolution embedding**.
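
As a back-of-the-envelope check, the two measured rates translate into wall-clock time for the 4,146 PNGs scored in § 3 as follows. This is simple arithmetic on the figures above, not a new measurement.

```python
def audit_seconds(n_images: int, images_per_second: float) -> float:
    """Estimated wall-clock time at a constant throughput."""
    return n_images / images_per_second


crop_rate = 19.0  # img/s, multiface_worker.py on 512x512 crops
full_rate = 2.6   # img/s, embed_worker.py on full-resolution sources

speedup = crop_rate / full_rate                     # about 7.3x
crop_audit_s = audit_seconds(4146, crop_rate)       # about 218 s
full_res_equiv_s = audit_seconds(4146, full_rate)   # about 1,595 s
```
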
## 5. Architecture
```
┌──────────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/dedup_optimize.py        │
│ • analyze: hashes + within-faceset sim           │
│ • apply: move + re-zip (no GPU)                  │
│ • stage_multiface: write queue.json              │
│ • merge_multiface: ingest worker results         │
│ • apply_multiface: move + re-zip                 │
│ • report: HTML audit                             │
└────────────┬─────────────────────────────────────┘
             │ queue.json via \\wsl.localhost\
┌──────────────────────────────────────────────────┐
│ Windows C:\face_embed_venv\                      │
│ /opt/face-sets/work/multiface_worker.py          │
│ insightface FaceAnalysis on DmlExecutionProvider │
│ Reads PNGs from native E:\, writes face_count    │
└──────────────────────────────────────────────────┘
```
Reuses the existing `C:\face_embed_venv\` (no new venv needed — same
insightface stack as `embed_worker.py`).
## 6. Final corpus state (2026-04-27 night)
| metric | start of day | after occlusion filter | after consolidation | after age-extend | after this dedup + multiface |
|--------|-------------:|----------------------:|-------------------:|-----------------:|----------------------------:|
| active facesets | 311 | 255 | 181 | 181 | **181** |
| active PNGs | ~6,440 | 5,386 | 5,386 | 5,400 | **3,849** |
| `_masked/` | 0 | 51 | 51 | 51 | 51 |
| `_thin/` | 68 | 71 | 71 | 71 | 71 |
| `_merged/` | 0 | 0 | 74 | 74 | 74 |
Net reduction at the end of the day: **2,591 PNGs and 130 facesets** removed
or quarantined from the active pool. All are preserved on disk for
reversibility (`<faceset>/faces/_dropped/` for prunes; `_masked/`,
`_merged/`, and `_thin/` for quarantines).
## 7. Re-running
Run after any new import / consolidation / extend:
```bash
# 1. Byte-dedup + within-faceset near-dup (CPU only)
python work/dedup_optimize.py analyze --out work/dedup_audit/dedup_plan.json
python work/dedup_optimize.py apply --plan work/dedup_audit/dedup_plan.json
# 2. Multi-face audit on Windows DML (resumable)
python work/dedup_optimize.py stage_multiface --out work/dedup_audit/multiface_queue.json
"/mnt/c/face_embed_venv/Scripts/python.exe" work/multiface_worker.py \
work/dedup_audit/multiface_queue.json work/dedup_audit/multiface_results.json
python work/dedup_optimize.py merge_multiface \
--results work/dedup_audit/multiface_results.json \
--out work/dedup_audit/multiface_plan.json
python work/dedup_optimize.py apply_multiface \
--plan work/dedup_audit/multiface_plan.json
# 3. HTML audit
python work/dedup_optimize.py report \
--dedup work/dedup_audit/dedup_plan.json \
--multiface work/dedup_audit/multiface_plan.json \
--out work/dedup_audit
```