Add post-export corpus maintenance pipeline

Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores via
  new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level quarantine
  at 40% domain dominance.
- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.
- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.
- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes: cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega, ~7x embed_worker, because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs.

All quarantines and prunes preserved on disk (faces/_dropped/, _masked/,
_merged/, _thin/) for full reversibility. Master manifest gains masked[],
merged[], plus per-run provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
29
README.md
@@ -331,6 +331,27 @@ from the saved `state.json` without re-fetching what was already done.
The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each normalized to `[0, 1]`.
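
As a rough illustration of the formula above, the weighted sum can be sketched in a few lines. This is not the `export-swap` code: the function name `composite_quality`, the dict-based input, and the clamping step are assumptions made for the example; only the five weights come from the formula.

```python
# Weights copied from the formula above; they sum to 1.0.
WEIGHTS = {
    "frontality": 0.30,
    "det_score": 0.20,
    "landmark_symmetry": 0.20,
    "face_size": 0.15,
    "sharpness": 0.15,
}


def composite_quality(components: dict[str, float]) -> float:
    """Weighted sum of the five components (hypothetical helper).

    Each component is expected to be normalized to [0, 1] already;
    clamping here is a defensive assumption, not documented behavior.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components[name]))
        total += weight * value
    return total


example = {
    "frontality": 0.9,
    "det_score": 0.8,
    "landmark_symmetry": 0.7,
    "face_size": 0.5,
    "sharpness": 0.6,
}
score = composite_quality(example)  # 0.27 + 0.16 + 0.14 + 0.075 + 0.09
```

Because the weights sum to 1.0, the result stays in `[0, 1]` whenever the inputs do, which keeps scores comparable across facesets.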
## Post-export corpus maintenance
The `sort_faces.py` pipeline above produces `facesets_swap_ready/`. Four
orchestration scripts under `work/` operate on that already-built corpus to
clean it up over time:

| script | purpose |
|--------|---------|
| `work/filter_occlusions.py` (+ Windows `work/clip_worker.py`) | Drop PNGs of masked / sun-glassed faces using open_clip ViT-L-14/dfn2b_s39b zero-shot scoring. Image-level threshold 0.7; faceset-level quarantine at 40% domain dominance. WSL stages a queue, Windows DML scores, WSL applies. See `docs/analysis/clip-occlusion-filter.md`. |
| `work/consolidate_facesets.py` | Merge duplicate identities (centroid cosine sim ≥ 0.55 with confident ≥ 0.65, **complete-linkage** to defeat single-link chaining). Pulls embeddings from cache, no GPU. See `docs/analysis/identity-consolidation-and-age-extend.md`. |
| `work/age_extend_001.py` | Slot newly-added PNGs into existing era buckets of `faceset_001` (anchor cosine distance ≤ 0.40 AND `|year_delta|` ≤ 5). Same anchor-fragment rule as `age_split_001.py`. |
| `work/dedup_optimize.py` (+ Windows `work/multiface_worker.py`) | (a) cross-family SHA256 byte-dedup, (b) within-faceset near-dup at cosine sim ≥ 0.95, (c) multi-face audit (re-detect via insightface, drop PNGs with face_count ≠ 1). Multi-face is the load-bearing roop invariant. See `docs/analysis/dedup-and-roop-optimization.md`. |
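
To make the complete-linkage choice in `consolidate_facesets.py` concrete, here is a minimal pure-Python sketch of threshold-based complete-linkage grouping over faceset centroids. It is an illustration under assumptions, not the script's implementation: `complete_linkage_groups` and `cosine` are hypothetical names, the toy centroids are 2-D, and the separate "confident ≥ 0.65" tier is not modeled. Only the 0.55 edge threshold comes from the table.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def complete_linkage_groups(
    centroids: dict[str, list[float]], edge: float = 0.55
) -> list[list[str]]:
    """Greedy agglomerative grouping (hypothetical helper).

    Two groups merge only if their WORST cross pair still has sim >= edge,
    so every pair inside a finished group has sim >= edge.
    """
    names = list(centroids)
    sim = {(a, b): cosine(centroids[a], centroids[b]) for a in names for b in names}
    groups = [[n] for n in names]
    while True:
        best = None  # (link similarity, i, j) of the best mergeable pair
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                link = min(sim[a, b] for a in groups[i] for b in groups[j])
                if link >= edge and (best is None or link > best[0]):
                    best = (link, i, j)
        if best is None:
            return groups
        _, i, j = best
        groups[i] = groups[i] + groups[j]
        del groups[j]  # j > i, so index i is unaffected


def unit(deg: float) -> list[float]:
    """Toy 2-D unit vector at the given angle."""
    return [math.cos(math.radians(deg)), math.sin(math.radians(deg))]


# A~B and B~C are both above 0.55, but A~C is negative: a single-link chain.
toy = {"A": unit(0), "B": unit(50), "C": unit(100)}
groups = complete_linkage_groups(toy)
```

With these toy centroids, single-linkage would chain A, B, and C into one group via B, whereas the complete-linkage rule refuses the second merge because the A/C pair falls below the edge.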

All four operate idempotently and reversibly: dropped PNGs go to
`<faceset>/faces/_dropped/`, quarantined whole facesets go to
`facesets_swap_ready/_masked/` or `_merged/` (parallel to the existing
`_thin/`). The master `manifest.json` partitions entries across `facesets[]`,
`masked[]`, `thin_eras[]`, and `merged[]` arrays, plus per-run provenance
blocks (`occlusion_filter_run`, `merge_run`, `age_extend_runs`, `dedup_runs`,
`multiface_runs`).
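
The reversible-drop convention can be sketched for the cross-family SHA256 pass as follows. This is an assumption-laden illustration, not `dedup_optimize.py` itself: `family_of`, `sha256_of`, and `dedup_cross_family` are hypothetical names, the `<faceset>/faces/*.png` layout is inferred from the `_dropped/` path above, and the rule that a family is the first two underscore-separated tokens of the faceset name is invented for the example.

```python
import hashlib
import shutil
from pathlib import Path


def family_of(faceset: str) -> str:
    # HYPOTHETICAL rule: "faceset_001_1990s" and "faceset_001" share a family.
    return "_".join(faceset.split("_")[:2])


def sha256_of(path: Path) -> str:
    """Hash file bytes in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def dedup_cross_family(root: Path) -> list[Path]:
    """Move byte-identical PNGs owned by a *different* family into _dropped/.

    Intra-family duplicates are kept. Nothing is deleted, and a second run
    finds nothing left to move, so the pass is reversible and idempotent.
    """
    first_owner: dict[str, str] = {}  # digest -> family that first held it
    moved: list[Path] = []
    # The glob does not descend into faces/_dropped/, so earlier drops are skipped.
    for png in sorted(root.glob("*/faces/*.png")):
        faceset = png.parent.parent.name
        if faceset.startswith("_"):  # skip quarantine dirs such as _masked/
            continue
        family = family_of(faceset)
        owner = first_owner.setdefault(sha256_of(png), family)
        if owner == family:
            continue  # first sighting, or an intra-family duplicate
        dropped = png.parent / "_dropped"
        dropped.mkdir(exist_ok=True)
        target = dropped / png.name
        shutil.move(str(png), str(target))
        moved.append(target)
    return moved
```

Undoing a run in this sketch amounts to moving the files under `_dropped/` back into `faces/`; a real run would presumably also record the moves in the manifest's `dedup_runs` block.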
## Downstream: roop-unleashed
The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.
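
A toy calculation shows why a stray second face matters when embeddings are averaged. The sketch below uses made-up 2-D vectors (real face embeddings are much higher-dimensional) and hypothetical helper names; it does not call roop-unleashed or insightface.

```python
import math


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


target = [1.0, 0.0]     # toy embedding of the intended identity
bystander = [0.0, 1.0]  # toy embedding of an unrelated face in one PNG

clean = mean_vector([target] * 3)
contaminated = mean_vector([target] * 3 + [bystander])

clean_sim = cosine(clean, target)
contaminated_sim = cosine(contaminated, target)
```

In this toy setup a single bystander among four detected faces already pulls the averaged vector away from the target, which is the effect the multi-face audit is meant to prevent.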
@@ -350,11 +371,17 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with
├─ build_folders.py (hand-sorted-folder orchestration)
├─ check_faceset001_age.py (age-split readiness probe)
├─ age_split_001.py (age-split orchestration; faceset_001)
├─ age_extend_001.py (extends existing era buckets with new PNGs)
├─ cluster_osrc.py (mixed-bucket identity discovery)
├─ immich_stage.py (Immich library staging, parallel)
├─ embed_worker.py (Windows DML embed worker, runs from C:\face_embed_venv\)
├─ embed_worker.py (Windows DML embed worker; C:\face_embed_venv\)
├─ cluster_immich.py (Immich identity discovery + export)
├─ finalize_immich.sh (chains queue → embed → cluster)
├─ filter_occlusions.py (CLIP zero-shot mask + sunglasses filter)
├─ clip_worker.py (Windows DML CLIP worker; C:\clip_dml_venv\)
├─ consolidate_facesets.py (duplicate-identity merger; complete-linkage)
├─ dedup_optimize.py (byte + near-dup + multi-face audit driver)
├─ multiface_worker.py (Windows DML multi-face audit worker)
├─ synthetic_*_manifest.json (per-run synthetic refine manifests)
├─ immich/
│ ├─ users.json (label -> userId map; gitignored)