Adds four new orchestration scripts that operate on an already-built facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level quarantine at 40% domain dominance.
- consolidate_facesets.py: duplicate-identity merger using complete-linkage centroid clustering on cached arcface embeddings. Single-linkage chains catastrophically (60-faceset clusters with min sim < 0); complete-linkage guarantees within-group sim >= edge.
- age_extend_001.py: slots newly-added PNGs into existing era buckets of faceset_001 using the same anchor-fragment rule as age_split_001.py (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.
- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three passes — cross-family SHA256 byte-dedup (preserves intra-family era duplication), within-faceset near-dup at sim >= 0.95, and a multi-face audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s on AMD Vega — ~7x embed_worker because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged → 181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full reversibility. Master manifest gains masked[], merged[], plus per-run provenance blocks. Three new docs/analysis/ writeups cover model choice, threshold rationale, and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLIP zero-shot occlusion filter (masks + sunglasses)
Run date: 2026-04-27. Driver scripts: work/filter_occlusions.py, work/clip_worker.py.
1. Why
facesets_swap_ready/ ended the Immich import day with 311 substantive
facesets and a long tail of identities whose clusters had latched onto
eyewear or mask appearance instead of identity (covid-era shots, vacation
photos with sunglasses dominating the frame). Two failure modes:
- Pollution of the averaged identity — roop's `FaceSet.AverageEmbeddings()` averages every face in the .fsz. A faceset where 40% of images are sunglassed gives a biased centroid; the swap reproduces sunglass-shaped eye sockets.
- Whole-cluster identity drift — clustering at the embedding level sometimes anchors on the eyewear silhouette rather than the face, producing clusters of "the same sunglasses across multiple people".
A targeted attribute scorer was the cleanest fix.
2. Model + prompts
Model: open_clip ViT-L-14 / dfn2b_s39b (Apple Data Filtering Networks).
Best public zero-shot at this size. Loads weights from HF Hub (~890 MB).
Bit-identical scores between WSL CPU and Windows DML.
Prompt design: per-attribute ensembles of 5–6 positive + 5–6 negative prompts. Positive ensembles are mean-pooled and L2-normalized before softmax.
Critical bug if forgotten: CLIP cosine similarities are tiny (0.2–0.3
range). Raw softmax([sim_pos, sim_neg]) collapses to ~0.5/0.5 on every
image. Multiply by model.logit_scale.exp() (~100) before softmax.
Without that scale the entire scorer outputs a uniform 0.5.
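The collapse is easy to demonstrate with plain softmax arithmetic. A minimal sketch — the 0.25/0.23 similarities and the scale of 100 are illustrative values, not measurements from this run:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Raw CLIP cosine similarities sit in a narrow ~0.2-0.3 band, so their
# softmax is nearly uniform no matter what the image contains:
sim_pos, sim_neg = 0.25, 0.23
raw = softmax([sim_pos, sim_neg])        # ~[0.505, 0.495] -- useless

# Multiplying by logit_scale.exp() (~100 for trained CLIP models)
# spreads the logits enough for softmax to actually discriminate:
scale = 100.0
scaled = softmax([scale * sim_pos, scale * sim_neg])  # ~[0.88, 0.12]
```

The same two similarities go from an unusable 0.505/0.495 split to a decisive 0.88/0.12 once the learned temperature is applied.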
Sunglasses prompt pitfall: the first set caught faces with sunglasses pushed up on the forehead with the same probability as faces with sunglasses covering the eyes — CLIP detects "presence of sunglasses in frame", not "eyes occluded". Fixed by putting the false positive into the negative class explicitly:
positive: "a face with dark sunglasses covering the eyes"
"a portrait with the eyes hidden behind opaque sunglasses"
...
negative: "a face with sunglasses pushed up on the forehead, eyes visible below"
"a face with sunglasses resting on top of the head, eyes visible"
"a face wearing clear prescription eyeglasses with visible eyes"
...
Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead → 0.39. Threshold 0.7 cleanly separates.
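The ensemble mechanics from §2 can be sketched in plain Python. Mock 3-d unit vectors stand in for CLIP's real text/image features here; the actual filter presumably does the equivalent with torch tensors:

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def ensemble_embedding(prompt_embs):
    """Mean-pool an ensemble of prompt embeddings, then L2-normalize,
    so each attribute class becomes a single unit vector."""
    dim = len(prompt_embs[0])
    mean = [sum(e[d] for e in prompt_embs) / len(prompt_embs) for d in range(dim)]
    return l2_normalize(mean)

def attribute_prob(img_emb, pos_emb, neg_emb, logit_scale=100.0):
    """P(positive) via the scaled two-way softmax over cosine similarities."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    z_pos = logit_scale * dot(img_emb, pos_emb)
    z_neg = logit_scale * dot(img_emb, neg_emb)
    m = max(z_pos, z_neg)
    e_pos, e_neg = math.exp(z_pos - m), math.exp(z_neg - m)
    return e_pos / (e_pos + e_neg)

# Mock data: an image embedding that leans toward the positive class.
pos = ensemble_embedding([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
neg = ensemble_embedding([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]])
img = l2_normalize([0.8, 0.2, 0.0])
p = attribute_prob(img, pos, neg)   # well above the 0.7 threshold
```

Note the order of operations: pool first, normalize second — normalizing individual prompts before pooling gives a slightly different (non-unit) centroid.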
3. Architecture
┌─────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/filter_occlusions.py │
│ • stage: walk facesets/, write queue.json │
│ • merge: ingest worker results │
│ • report: HTML contact sheet │
│ • apply: prune + quarantine + re-zip │
└────────────┬────────────────────────────────┘
│ queue.json (paths) via \\wsl.localhost\
▼
┌─────────────────────────────────────────────┐
│ Windows C:\clip_dml_venv\ │
│ /opt/face-sets/work/clip_worker.py │
│ Python 3.12 + torch 2.4.1 CPU │
│ + torch-directml 0.2.5 + open_clip_torch │
│ Reads PNGs from native E:\, writes scores │
└─────────────────────────────────────────────┘
A separate Windows venv (not the existing C:\face_embed_venv\) is needed
because torch-directml brings ~1.5 GB of wheels and version-pinned
numpy/pillow that risk breaking the embed_worker venv's
onnxruntime-directml + insightface stack.
4. DML throughput surprise
Measured on AMD Radeon RX Vega:
| input | model | throughput | speedup vs WSL CPU |
|---|---|---|---|
| ViT-L-14 (CLIP, this filter) | open_clip | 1.43 img/s | 2.4× |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |
Only 2.4× because aten::_native_multi_head_attention is not implemented in
the torch-directml plugin and falls back to CPU: the vision encoder runs on
GPU, attention runs on CPU, alternating every layer. A silenced UserWarning
makes the fallback near-invisible. Workable for a one-shot 73-min corpus
run, but the embed_worker pattern (pure ONNX) remains the gold standard for
DML.
5. Thresholds (validated 2026-04-27 on 6,318 PNGs)
| level | threshold | semantics |
|---|---|---|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40 % of images flagged for either attr | quarantine whole faceset to _masked/ |
| min-survivors | < 5 surviving AND something pruned | quarantine to _thin/ |
The "AND something pruned" guard is essential — without it, naturally small
facesets (hand-sorted, ≤4 PNGs) would be quarantined merely for being small
even when they contain zero occlusions.
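The three levels compose into one small decision function. A sketch of the logic only — thresholds mirror the table above, but the function and return-value names are hypothetical, not filter_occlusions.py's actual API:

```python
IMG_THRESH = 0.7      # image-level: P(positive) at or above this drops the PNG
FACESET_FRAC = 0.40   # faceset-level: this fraction flagged quarantines the set
MIN_SURVIVORS = 5     # thin guard

def classify_faceset(probs):
    """probs: per-image max P(positive) across the two attributes.
    Returns 'keep', 'prune', 'quarantine_masked', or 'quarantine_thin'."""
    flagged = [p for p in probs if p >= IMG_THRESH]
    if len(flagged) >= FACESET_FRAC * len(probs):
        return "quarantine_masked"          # whole faceset -> _masked/
    survivors = len(probs) - len(flagged)
    # The guard: small-but-clean facesets (nothing flagged) stay active.
    if survivors < MIN_SURVIVORS and flagged:
        return "quarantine_thin"            # -> _thin/
    return "prune" if flagged else "keep"
```

A 4-image faceset with no flags returns "keep" here, which is exactly the case the guard exists to protect.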
6. Run results
| action | count | net effect |
|---|---|---|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → _masked/ (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → _thin/ |
Net: 311 active → 255 active after the filter run. 763 PNGs quarantined
whole-faceset, 183 pruned within survivors. All dropped PNGs preserved at
<faceset>/faces/_dropped/ for reversibility. Master manifest gained a
masked[] array parallel to thin_eras[], plus an occlusion_filter_run
provenance block.
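For orientation, the manifest additions might look like the following. This is a guessed shape shown as a Python literal — the real field names are defined by filter_occlusions.py and the faceset IDs below are invented:

```python
# Hypothetical sketch of the master-manifest additions described above.
masked = ["faceset_012", "faceset_087"]   # parallel in spirit to thin_eras[]

occlusion_filter_run = {
    "run_date": "2026-04-27",
    "model": "ViT-L-14/dfn2b_s39b",
    "image_threshold": 0.7,
    "faceset_fraction": 0.40,
    "quarantined_masked": 51,
    "quarantined_thin": 3,
    "pruned_pngs": 183,
}
```

Keeping the thresholds inside the provenance block means a later re-run with different knobs is distinguishable from this one without consulting the docs.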
7. Known limitations
- Per-faceset manifests are NOT updated by `apply` — only the master manifest is. Each faceset's own `<faceset>/manifest.json` retains stale `faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless for `.fsz` consumers (the .fsz is re-zipped from current disk state), but downstream tools reading `faces[]` will see broken references. Discovered later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG warnings before being caught.
8. Re-running
# 1. Stage queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json
# 2. Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
work/clip_dml/queue.json work/clip_dml/scores.json --batch 8
# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
--scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
--scores work/occlusion_scores.json --out work/occlusion_review
# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json