Add post-export corpus maintenance pipeline

Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores via
  new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level quarantine
  at 40% domain dominance.
- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.
- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.
- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes: cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega, ~7x embed_worker, because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged
→ 181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs.

All quarantines and prunes preserved on disk (faces/_dropped/, _masked/,
_merged/, _thin/) for full reversibility. Master manifest gains masked[],
merged[], plus per-run provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
154
docs/analysis/clip-occlusion-filter.md
Normal file
@@ -0,0 +1,154 @@
# CLIP zero-shot occlusion filter (masks + sunglasses)

_Run date: 2026-04-27. Driver scripts: `work/filter_occlusions.py`, `work/clip_worker.py`._

## 1. Why

`facesets_swap_ready/` ended the Immich import day with 311 substantive
facesets and a long tail of identities whose clusters had latched onto
*eyewear or mask appearance* instead of identity (covid-era shots, vacation
photos with sunglasses dominating the frame). Two failure modes:

1. **Pollution of averaged identity**: roop's `FaceSet.AverageEmbeddings()`
   averages every face in the .fsz. A faceset where 40 % of images are
   sunglassed gives a biased centroid; the swap reproduces sunglass-shaped
   eye sockets.
2. **Whole-cluster identity drift**: clustering at the embedding level
   sometimes anchors on the eyewear silhouette rather than the face,
   producing clusters of "the same sunglasses across multiple people".

A targeted attribute scorer was the cleanest fix.

## 2. Model + prompts

**Model**: `open_clip` `ViT-L-14` / `dfn2b_s39b` (Apple Data Filtering Networks).
Best public zero-shot at this size. Loads weights from HF Hub (~890 MB).
Bit-identical scores between WSL CPU and Windows DML.

**Prompt design**: per-attribute ensembles of 5–6 positive + 5–6 negative
prompts. Positive ensembles are mean-pooled and L2-normalized before softmax.
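
The pooling step can be sketched in plain Python. This is a minimal illustration, not the code in `clip_worker.py`: the helper names are hypothetical, the 3-d vectors stand in for real CLIP text embeddings, and normalizing each prompt before averaging is an assumption borrowed from common zero-shot practice.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def pool_ensemble(prompt_embeddings):
    """Mean-pool a prompt ensemble into one unit-length class vector.

    Assumption: each prompt embedding is normalized first so that no
    single prompt dominates the mean.
    """
    unit = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return l2_normalize(mean)

# Toy 3-d vectors standing in for text embeddings of two positive prompts.
positive = pool_ensemble([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])
```

The resulting vector has unit length, so its dot product with a normalized image embedding is a cosine similarity.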

**Critical bug if forgotten**: CLIP cosine similarities are tiny (0.2–0.3
range). Raw `softmax([sim_pos, sim_neg])` collapses to ~0.5/0.5 on every
image. **Multiply by `model.logit_scale.exp()` (~100) before softmax.**
Without that scale the entire scorer outputs a uniform 0.5.
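
A small numeric sketch shows why the scale matters. The similarity values are illustrative, and the constant `LOGIT_SCALE = 100.0` stands in for `model.logit_scale.exp()`; neither is taken from the actual run.

```python
import math

def softmax2(a, b):
    """Return the softmax probability of the first of two logits."""
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)

sim_pos, sim_neg = 0.28, 0.22  # illustrative cosine similarities
LOGIT_SCALE = 100.0            # stand-in for model.logit_scale.exp()

p_raw = softmax2(sim_pos, sim_neg)                              # about 0.515
p_scaled = softmax2(LOGIT_SCALE * sim_pos, LOGIT_SCALE * sim_neg)  # about 0.998
```

A 0.06 gap in cosine similarity is nearly invisible to an unscaled softmax but becomes a 6-logit gap after scaling, which is what makes a 0.7 threshold meaningful.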

**Sunglasses prompt pitfall**: the first set caught faces with sunglasses
*pushed up on the forehead* with the same probability as faces with
sunglasses *covering the eyes*, because CLIP detects "presence of sunglasses
in frame", not "eyes occluded". Fixed by putting the false positive into the
*negative* class explicitly:

```
positive: "a face with dark sunglasses covering the eyes"
          "a portrait with the eyes hidden behind opaque sunglasses"
          ...
negative: "a face with sunglasses pushed up on the forehead, eyes visible below"
          "a face with sunglasses resting on top of the head, eyes visible"
          "a face wearing clear prescription eyeglasses with visible eyes"
          ...
```

Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead
→ 0.39. The 0.7 threshold cleanly separates the two.

## 3. Architecture

```
┌────────────────────────────────────────────────┐
│ WSL  /opt/face-sets/work/filter_occlusions.py  │
│  • stage: walk facesets/, write queue.json     │
│  • merge: ingest worker results                │
│  • report: HTML contact sheet                  │
│  • apply: prune + quarantine + re-zip          │
└────────────┬───────────────────────────────────┘
             │ queue.json (paths) via \\wsl.localhost\
             ▼
┌────────────────────────────────────────────────┐
│ Windows  C:\clip_dml_venv\                     │
│  /opt/face-sets/work/clip_worker.py            │
│  Python 3.12 + torch 2.4.1 CPU                 │
│  + torch-directml 0.2.5 + open_clip_torch      │
│  Reads PNGs from native E:\, writes scores     │
└────────────────────────────────────────────────┘
```

A separate Windows venv (not the existing `C:\face_embed_venv\`) is needed
because `torch-directml` brings ~1.5 GB of wheels and version-pinned
numpy/pillow that risk breaking the embed_worker venv's
`onnxruntime-directml` + `insightface` stack.

## 4. DML throughput surprise

Measured on AMD Radeon RX Vega:

| model (worker) | library | throughput | speedup vs WSL CPU |
|----------------|---------|-----------:|-------------------:|
| ViT-L-14 (CLIP, this filter) | open_clip | **1.43 img/s** | **2.4×** |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |

The CLIP speedup is only 2.4× because `aten::_native_multi_head_attention`
is not implemented in the directml plugin and falls back to CPU. The rest of
the vision encoder runs on the GPU while attention runs on the CPU in every
layer, so execution alternates between the two devices. A silenced
UserWarning makes this near-invisible. Workable for a one-shot 73-min corpus
run, but the embed_worker pattern (pure ONNX) remains the gold standard for
DML.
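
As a sanity check on these figures, the arithmetic below relates throughput to wall-clock time. It assumes the 73-min run covered the 6,318 PNGs quoted in the thresholds section; the implied CPU numbers are derived from the table, not measured separately.

```python
PNG_COUNT = 6318        # corpus size quoted in the thresholds section
DML_IMG_PER_S = 1.43    # measured CLIP throughput on DML
SPEEDUP_VS_CPU = 2.4    # measured speedup over WSL CPU

# Wall-clock time on DML at the measured rate.
dml_minutes = PNG_COUNT / DML_IMG_PER_S / 60

# CPU baseline implied by the speedup column.
cpu_img_per_s = DML_IMG_PER_S / SPEEDUP_VS_CPU
cpu_minutes = PNG_COUNT / cpu_img_per_s / 60

print(f"DML: {dml_minutes:.1f} min")
print(f"implied WSL CPU: {cpu_img_per_s:.2f} img/s, {cpu_minutes:.1f} min")
```

Under that assumption the DML run comes out at roughly 73.6 minutes, consistent with the figure above, against close to three hours on CPU.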

## 5. Thresholds (validated 2026-04-27 on 6,318 PNGs)

| level | threshold | semantics |
|-------|----------:|-----------|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40 % of images flagged for either attr | quarantine whole faceset to `_masked/` |
| min-survivors | < 5 surviving AND something pruned | quarantine to `_thin/` |

The `AND something pruned` guard is essential: without it, naturally-small
facesets (hand-sorted with ≤4 PNGs) get incorrectly quarantined for being
small even when they have zero occlusions.
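
The three rules can be sketched as one decision function. This is a hypothetical reconstruction, not the logic in `filter_occlusions.py`: the function name, the input shape, the order of the checks, and the reading of "either attr" as the union of both attributes are all assumptions.

```python
IMAGE_THRESHOLD = 0.7    # P(positive) at or above this flags a PNG
FACESET_FRACTION = 0.40  # flagged share that quarantines the whole faceset
MIN_SURVIVORS = 5        # fewer survivors than this is "thin"

def plan_faceset(scores):
    """Decide an action for one faceset.

    scores: {png_name: {"mask": p, "sunglasses": p}} (assumed shape).
    Returns (action, flagged_pngs).
    """
    if not scores:
        return "keep", []
    flagged = sorted(
        name for name, attrs in scores.items()
        if max(attrs.values()) >= IMAGE_THRESHOLD
    )
    if len(flagged) / len(scores) >= FACESET_FRACTION:
        return "quarantine_masked", flagged
    survivors = len(scores) - len(flagged)
    # The guard: only call a faceset thin if this run actually pruned it.
    if flagged and survivors < MIN_SURVIVORS:
        return "quarantine_thin", flagged
    if flagged:
        return "prune", flagged
    return "keep", []

clean = {"mask": 0.1, "sunglasses": 0.1}
occluded = {"mask": 0.1, "sunglasses": 0.9}

# A hand-sorted 3-PNG faceset with no occlusions stays untouched.
small_clean = plan_faceset({f"{i}.png": clean for i in range(3)})
```

Without the `flagged and` condition, `small_clean` would come back as `quarantine_thin` purely because it has three images.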

## 6. Run results

| action | facesets | net effect |
|--------|---------:|------------|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → `_masked/` (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → `_thin/` |

Net: 311 active → 255 active after the filter run. 763 PNGs quarantined
whole-faceset, 183 pruned within survivors. All dropped PNGs preserved at
`<faceset>/faces/_dropped/` for reversibility. Master manifest gained a
`masked[]` array parallel to `thin_eras[]`, plus an `occlusion_filter_run`
provenance block.

## 7. Known limitations

- **Per-faceset manifests are NOT updated by `apply`**: only the master
  manifest is. Each faceset's own `<faceset>/manifest.json` retains stale
  `faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless
  for `.fsz` consumers (the .fsz is re-zipped from current disk state) but
  downstream tools reading `faces[]` will see broken references. Discovered
  later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG
  warnings before being caught.
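
A downstream tool can guard against this by checking `faces[]` against disk before trusting it. The sketch below is hypothetical: the per-faceset manifest schema is not documented here, so it assumes each `faces[]` entry is either a relative path string or an object holding the path under a key (`"file"` is a placeholder name, not the real field).

```python
import json
from pathlib import Path

def find_stale_faces(faceset_dir, path_key="file"):
    """List faces[] entries whose PNG no longer exists on disk.

    path_key is a placeholder; adjust it to the real manifest field.
    """
    faceset_dir = Path(faceset_dir)
    manifest_path = faceset_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    stale = []
    for entry in manifest.get("faces", []):
        # Accept both plain path strings and {path_key: path} objects.
        rel = entry if isinstance(entry, str) else entry.get(path_key)
        if rel and not (faceset_dir / rel).is_file():
            stale.append(rel)
    return stale
```

Calling `find_stale_faces("<faceset>")` after an `apply` run would surface the moved PNGs up front rather than as warnings in a later rebuild loop.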

## 8. Re-running

```bash
# 1. Stage queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json

# 2. Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
    work/clip_dml/queue.json work/clip_dml/scores.json --batch 8

# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
    --scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
    --scores work/occlusion_scores.json --out work/occlusion_review

# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json
```