
Peter 49a43c7685 Add post-export corpus maintenance pipeline

Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores
  via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level
  quarantine at 40% domain dominance.

- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.

- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.

- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes — cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega — ~7x embed_worker because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes
preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full
reversibility. Master manifest gains masked[], merged[], plus per-run
provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 15:41:18 +02:00


Corpus dedup + roop-unleashed optimization

Run date: 2026-04-27. Driver scripts: work/dedup_optimize.py, work/multiface_worker.py.

After consolidation collapsed duplicate identities and age-extend slotted new PNGs into era buckets, the corpus still carried artifacts that hurt roop's averaged-embedding quality:

- Burst-photo near-duplicates within facesets, especially in immich-discovered identities where source libraries had many similar shots within seconds.
- Cross-faceset byte-identical PNGs that escaped consolidation's centroid-similarity matching when individual PNGs matched exactly but cluster centroids diverged.
- Multi-face PNGs that polluted identity averaging because the roop loader appends every detected face per PNG to the FaceSet (load-bearing invariant, see § 3).

This pipeline runs three independent passes and an optional fourth, all moving dropped PNGs to <faceset>/faces/_dropped/ for reversibility.
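The reversible drop can be pictured with a small sketch. It assumes the active PNGs sit directly in `<faceset>/faces/`, so that `_dropped/` is a sibling folder of the files it receives; `drop_png` and `restore_png` are hypothetical helper names, not functions from `dedup_optimize.py`.

```python
import shutil
from pathlib import Path


def drop_png(png: Path) -> Path:
    """Move a PNG into a _dropped/ folder next to it and return the new path."""
    dropped_dir = png.parent / "_dropped"
    dropped_dir.mkdir(exist_ok=True)
    target = dropped_dir / png.name
    if target.exists():
        # Refuse to overwrite an earlier drop with the same file name.
        raise FileExistsError(target)
    shutil.move(str(png), str(target))
    return target


def restore_png(dropped: Path) -> Path:
    """Undo drop_png by moving the file back up one level."""
    target = dropped.parent.parent / dropped.name
    shutil.move(str(dropped), str(target))
    return target
```

Because nothing is deleted, undoing a pass is a matter of moving files back and rebuilding the affected archives.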

1. Cross-family byte-dedup

SHA256-hash every PNG in the active corpus (parallel I/O via ThreadPoolExecutor(max_workers=16), ~17 s for 5,386 PNGs over the /mnt/e/ Windows mount). Group by hash; for groups with members in multiple identity families, keep the higher-tier copy.
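A minimal sketch of the hashing step, assuming the list of PNG paths has already been collected. The function names are illustrative; only the use of SHA256 and `ThreadPoolExecutor(max_workers=16)` comes from the description above.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large PNGs are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(paths, workers: int = 16):
    """Return {sha256: [paths]} for every hash shared by two or more files."""
    # Threads are enough here: the work is dominated by file I/O.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(sha256_of, paths))
    groups = defaultdict(list)
    for path, digest in zip(paths, digests):
        groups[digest].append(path)
    return {d: ps for d, ps in groups.items() if len(ps) > 1}
```

Threads rather than processes suit this step because reading over the `/mnt/e/` mount is the likely bottleneck, not the hashing itself.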

Family detection: regex ^(faceset_\d+)(?:_.+)?$ — captures the parent identity. Same family includes parent + era splits (e.g. faceset_001 + faceset_001_2010-13); these are intentional duplications for the era .fsz files and are preserved.
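The family rule can be sketched as follows. The regex is the one quoted above; `family_of` and `is_cross_family` are hypothetical helper names, and the sketch assumes every faceset directory name matches the pattern.

```python
import re
from typing import Optional

FAMILY_RE = re.compile(r"^(faceset_\d+)(?:_.+)?$")


def family_of(faceset_name: str) -> Optional[str]:
    """Return the parent identity, e.g. 'faceset_001' for 'faceset_001_2010-13'."""
    match = FAMILY_RE.match(faceset_name)
    return match.group(1) if match else None


def is_cross_family(faceset_names) -> bool:
    """True when one hash group spans more than one identity family."""
    return len({family_of(name) for name in faceset_names}) > 1
```

A hash group made of `faceset_001` and `faceset_001_2010-13` stays untouched under this rule, while a group made of `faceset_001` and `faceset_026` becomes a dedup candidate.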

Run results: 20 cross-family hash groups → 24 PNGs dropped. Most cases were small immich identity-cluster errors that consolidation missed because individual PNG embeddings matched but the cluster mean did not.

2. Within-faceset near-dup at sim ≥ 0.95

Per-faceset pairwise cosine similarity on cached arcface embeddings. Connected components in the sim ≥ 0.95 graph. Keep highest quality.composite per component, drop the rest.
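A minimal sketch of this pass for one faceset, assuming the embeddings arrive as an `(n, d)` NumPy array and the `quality.composite` scores as a parallel array. The union-find grouping is one straightforward way to get connected components and is not necessarily how `dedup_optimize.py` does it.

```python
import numpy as np


def near_dup_drops(embeddings, quality, threshold: float = 0.95):
    """Return sorted indices to drop: all but the best member of each component."""
    embeddings = np.asarray(embeddings, dtype=float)
    n = len(embeddings)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T  # pairwise cosine similarity

    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union every pair above the threshold (upper triangle only, no self-pairs).
    rows, cols = np.where(np.triu(sim >= threshold, k=1))
    for i, j in zip(rows, cols):
        root_i, root_j = find(int(i)), find(int(j))
        if root_i != root_j:
            parent[root_j] = root_i

    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)

    drops = []
    for members in components.values():
        keep = max(members, key=lambda i: quality[i])
        drops.extend(i for i in members if i != keep)
    return sorted(drops)
```

Connected components chain transitively, so a long burst collapses to a single PNG even when its first and last frames fall slightly below 0.95 against each other.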

Threshold rationale: legitimate same-person-different-pose pairs land at 0.5–0.85; ≥ 0.95 means essentially the same shot (burst frames or recompressed dupes). Roop's FaceSet.AverageEmbeddings() averages all faces into faces[0].embedding, so a run of near-identical embeddings adds no new direction to the mean; it only counts the same shot several times. Removing them loses no identity information and removes the extra weight on the most-photographed moments.
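The bias argument can be checked with a toy example. It assumes the average is a plain arithmetic mean of the embeddings, which is a simplification and not a statement about roop's exact implementation; the two 2-D vectors stand in for two distinct poses.

```python
import numpy as np


def mean_direction(embeddings):
    """Arithmetic mean of the embeddings, rescaled to unit length."""
    mean = np.mean(embeddings, axis=0)
    return mean / np.linalg.norm(mean)


pose_a = np.array([1.0, 0.0])  # toy embedding for one pose
pose_b = np.array([0.0, 1.0])  # toy embedding for a different pose

balanced = mean_direction([pose_a, pose_b])         # one shot of each pose
burst = mean_direction([pose_a] * 9 + [pose_b])     # nine burst frames of pose_a

# Cosine similarity of each averaged identity to pose_a.
print(round(float(balanced @ pose_a), 3), round(float(burst @ pose_a), 3))
```

With nine copies of one pose the mean points almost entirely at that pose (cosine about 0.99 instead of about 0.71), which is the weighting that the near-dup pass removes.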

Run results: 851 groups → 1,225 PNGs dropped (23 % of corpus). Most-affected: faceset_026 (-132 of 262), faceset_027 (-107), faceset_028 (-92), faceset_030 (-92). All immich-discovered identities where the source library had burst sequences.

3. Multi-face audit (load-bearing roop invariant)

The roop loader at roop/ui/tabs/faceswap_tab.py:661–691 runs extract_face_images(filename, (False, 0)) on every PNG and appends every detected face to face_set.faces. A multi-face PNG therefore pollutes the averaged identity. The export-swap pipeline drops multi-face crops at creation, but post-pipeline operations (consolidation, age-extend) move PNGs across facesets without re-checking.

This audit re-detects every PNG with insightface FaceAnalysis, counts only detections with det_score ≥ 0.5 and face_short ≥ 40, and flags any PNG whose face_count ≠ 1. Two cases are flagged:

- ≥ 2 faces → the loader will inject extra identities into the average
- 0 faces → insightface cannot find a face in the cropped PNG; the file is useless for roop and would fail silently
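The flagging rule could look roughly like this. The sketch works on plain dicts with `det_score` and `bbox` keys so it runs without a GPU stack, and it assumes `face_short` means the shorter side of the detection box in pixels; that reading, and both function names, are assumptions.

```python
def count_valid_faces(detections, min_score: float = 0.5, min_short_side: float = 40):
    """Count detections that pass both the score and the size filter."""
    count = 0
    for det in detections:
        x1, y1, x2, y2 = det["bbox"]
        short_side = min(x2 - x1, y2 - y1)  # assumed meaning of face_short
        if det["det_score"] >= min_score and short_side >= min_short_side:
            count += 1
    return count


def audit_flag(detections):
    """Return None for a clean single-face PNG, else the reason it is flagged."""
    n = count_valid_faces(detections)
    if n == 1:
        return None
    return "no_face" if n == 0 else "multi_face"
```

A tiny or low-confidence background face is ignored by the filter, so a crop with one clear face and one weak detection still counts as a single-face PNG.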

Run results: 4,146 PNGs scored, 332 flagged (272 with 2 faces, 9 with 3, 2 with 4, 49 with 0). 82 facesets affected.

4. DML throughput jump for face crops

The audit reuses the same insightface + onnxruntime-directml stack as embed_worker.py but achieves ~19 img/s on AMD Vega vs embed_worker's 2.6 img/s — same model, same hardware. The difference is input size:

| stage | typical input | DML throughput |
| --- | --- | --- |
| `embed_worker.py` (Immich import) | 1024–4000 px source | 2.6 img/s |
| `multiface_worker.py` (this audit) | 512×512 face crops | 19 img/s |

Detection on small inputs is fast; recognition on aligned 112×112 inputs is the same cost either way. Implication: any pipeline operating on already-cropped face PNGs can rely on a roughly 7× higher DML throughput ceiling than full-resolution embedding.
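As a back-of-envelope check on the ratio, the snippet below uses only figures quoted in this document (2.6 img/s, 19 img/s, 4,146 PNGs scored). The full-resolution time is hypothetical, since the audit was never run at that rate.

```python
EMBED_RATE = 2.6       # img/s, embed_worker.py on full-resolution sources
MULTIFACE_RATE = 19.0  # img/s, multiface_worker.py on 512x512 crops
PNG_COUNT = 4146       # PNGs scored by the multi-face audit

speedup = MULTIFACE_RATE / EMBED_RATE
crop_seconds = PNG_COUNT / MULTIFACE_RATE
full_res_seconds = PNG_COUNT / EMBED_RATE  # hypothetical: same count at the slower rate

print(f"speedup ~{speedup:.1f}x")
print(f"audit at crop rate: ~{crop_seconds / 60:.1f} min")
print(f"audit at full-res rate: ~{full_res_seconds / 60:.1f} min")
```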

5. Architecture

   ┌────────────────────────────────────────────────────┐
   │ WSL  /opt/face-sets/work/dedup_optimize.py         │
   │  • analyze:          hashes + within-faceset sim   │
   │  • apply:            move + re-zip (no GPU)        │
   │  • stage_multiface:  write queue.json              │
   │  • merge_multiface:  ingest worker results         │
   │  • apply_multiface:  move + re-zip                 │
   │  • report:           HTML audit                    │
   └────────────┬───────────────────────────────────────┘
                │ queue.json via \\wsl.localhost\
                ▼
   ┌────────────────────────────────────────────────────┐
   │ Windows  C:\face_embed_venv\                       │
   │  /opt/face-sets/work/multiface_worker.py           │
   │  insightface FaceAnalysis on DmlExecutionProvider  │
   │  Reads PNGs from native E:\, writes face_count     │
   └────────────────────────────────────────────────────┘

The worker reuses the existing C:\face_embed_venv\ (no new venv needed; it runs the same insightface stack as embed_worker.py).
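Because the WSL side sees the corpus under `/mnt/e/` while the Windows worker reads from native `E:\`, the queue has to carry paths the worker can open. The sketch below shows one way to do that translation; the queue schema (`wsl` and `win` keys) and both function names are hypothetical, not the format `dedup_optimize.py` writes.

```python
import json
from pathlib import PurePosixPath, PureWindowsPath


def wsl_to_windows(path: str) -> str:
    """Translate /mnt/<drive>/rest/of/path into <DRIVE>:\\rest\\of\\path."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or parts[0] != "/" or parts[1] != "mnt" or len(parts[2]) != 1:
        raise ValueError(f"not a /mnt/<drive>/ path: {path}")
    drive_root = parts[2].upper() + ":\\"
    return str(PureWindowsPath(drive_root, *parts[3:]))


def build_queue(wsl_paths) -> str:
    """Serialize a queue with both spellings of each path (hypothetical schema)."""
    entries = [{"wsl": p, "win": wsl_to_windows(p)} for p in wsl_paths]
    return json.dumps(entries, indent=2)
```

Keeping the WSL spelling next to the Windows one lets the merge step map worker results back to the paths the orchestrator uses.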

6. Final corpus state (2026-04-27 night)

| metric | start of day | after occlusion filter | after consolidation | after age-extend | after this dedup + multiface |
| --- | --- | --- | --- | --- | --- |
| active facesets | 311 | 255 | 181 | 181 | 181 |
| active PNGs | ~6,440 | 5,386 | 5,386 | 5,400 | 3,849 |
| `_masked/` | 0 | 51 | 51 | 51 | 51 |
| `_thin/` | 68 | 71 | 71 | 71 | 71 |
| `_merged/` | 0 | 0 | 74 | 74 | 74 |

Net reduction at the end of the day: 2,591 PNGs and 130 facesets removed or quarantined from the active pool. All are preserved on disk for reversibility (<faceset>/faces/_dropped/ for prunes; _masked/, _merged/ and _thin/ for quarantines).

7. Re-running

Run after any new import / consolidation / extend:

# 1. Byte-dedup + within-faceset near-dup (CPU only)
python work/dedup_optimize.py analyze --out work/dedup_audit/dedup_plan.json
python work/dedup_optimize.py apply  --plan work/dedup_audit/dedup_plan.json

# 2. Multi-face audit on Windows DML (resumable)
python work/dedup_optimize.py stage_multiface --out work/dedup_audit/multiface_queue.json
"/mnt/c/face_embed_venv/Scripts/python.exe" work/multiface_worker.py \
  work/dedup_audit/multiface_queue.json work/dedup_audit/multiface_results.json
python work/dedup_optimize.py merge_multiface \
  --results work/dedup_audit/multiface_results.json \
  --out work/dedup_audit/multiface_plan.json
python work/dedup_optimize.py apply_multiface \
  --plan work/dedup_audit/multiface_plan.json

# 3. HTML audit
python work/dedup_optimize.py report \
  --dedup work/dedup_audit/dedup_plan.json \
  --multiface work/dedup_audit/multiface_plan.json \
  --out work/dedup_audit
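The "resumable" note on step 2 suggests the worker can be interrupted and restarted without rescoring finished PNGs. A minimal sketch of such a loop follows; it assumes the queue is a JSON list of path strings and the results file a JSON object keyed by path, and `run_resumable` and `score_fn` are hypothetical names, none of which is confirmed by `multiface_worker.py`.

```python
import json
from pathlib import Path


def run_resumable(queue_path, results_path, score_fn, flush_every: int = 100) -> int:
    """Score every queued PNG not yet in the results file; return how many ran."""
    queue = json.loads(Path(queue_path).read_text())
    results_file = Path(results_path)
    # Load earlier results so a restart skips work that is already done.
    results = json.loads(results_file.read_text()) if results_file.exists() else {}

    pending = [png for png in queue if png not in results]
    for n, png in enumerate(pending, start=1):
        results[png] = score_fn(png)
        if n % flush_every == 0:
            # Periodic flush bounds the work lost if the run is interrupted.
            results_file.write_text(json.dumps(results))
    results_file.write_text(json.dumps(results))
    return len(pending)
```

Rerunning the same command after an interruption then only processes the remaining entries.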
