# CLIP zero-shot occlusion filter (masks + sunglasses)

_Run date: 2026-04-27. Driver scripts: `work/filter_occlusions.py`, `work/clip_worker.py`._

## 1. Why

`facesets_swap_ready/` ended the Immich import day with 311 substantive facesets and a long tail of identities whose clusters had latched onto *eyewear or mask appearance* instead of identity (COVID-era shots, vacation photos with sunglasses dominating the frame). Two failure modes:

1. **Pollution of the averaged identity** — roop's `FaceSet.AverageEmbeddings()` averages every face in the .fsz. A faceset where 40 % of images are sunglassed gives a biased centroid; the swap reproduces sunglass-shaped eye sockets.
2. **Whole-cluster identity drift** — clustering at the embedding level sometimes anchors on the eyewear silhouette rather than the face, producing clusters of "the same sunglasses across multiple people".

A targeted attribute scorer was the cleanest fix.

## 2. Model + prompts

**Model**: `open_clip` `ViT-L-14` / `dfn2b_s39b` (Apple Data Filtering Networks). Best public zero-shot accuracy at this size. Loads weights from HF Hub (~890 MB). Scores are bit-identical between WSL CPU and Windows DML.

**Prompt design**: per-attribute ensembles of 5–6 positive and 5–6 negative prompts. Positive ensembles are mean-pooled and L2-normalized before the softmax.

**Critical bug if forgotten**: CLIP cosine similarities are tiny (0.2–0.3 range), so raw `softmax([sim_pos, sim_neg])` collapses to ~0.5/0.5 on every image. **Multiply by `model.logit_scale.exp()` (~100) before the softmax.** Without that scale the entire scorer outputs a uniform 0.5.

**Sunglasses prompt pitfall**: the first prompt set scored faces with sunglasses *pushed up on the forehead* with the same probability as faces with sunglasses *covering the eyes* — CLIP detects "presence of sunglasses in frame", not "eyes occluded".
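The logit-scale bug above is easy to reproduce with plain arithmetic; a minimal sketch, using illustrative similarity values rather than measured ones:

```python
import math

def softmax2(a: float, b: float) -> float:
    """P(first class) under a two-class softmax, numerically stable."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)

# Illustrative cosine similarities in CLIP's typical 0.2-0.3 range.
sim_pos, sim_neg = 0.27, 0.23

p_raw = softmax2(sim_pos, sim_neg)                 # ~0.51: useless
p_scaled = softmax2(100 * sim_pos, 100 * sim_neg)  # ~0.98: usable
print(f"raw={p_raw:.3f}  scaled={p_scaled:.3f}")
```

Here `100` stands in for `model.logit_scale.exp()`; the learned value is clamped near `exp(4.6) ≈ 100` in released CLIP checkpoints.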
The forehead false positive was fixed by putting it into the *negative* class explicitly:

```
positive:
  "a face with dark sunglasses covering the eyes"
  "a portrait with the eyes hidden behind opaque sunglasses"
  ...
negative:
  "a face with sunglasses pushed up on the forehead, eyes visible below"
  "a face with sunglasses resting on top of the head, eyes visible"
  "a face wearing clear prescription eyeglasses with visible eyes"
  ...
```

Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead → 0.39. A threshold of 0.7 cleanly separates them.

## 3. Architecture

```
┌───────────────────────────────────────────────┐
│ WSL  /opt/face-sets/work/filter_occlusions.py │
│   • stage:  walk facesets/, write queue.json  │
│   • merge:  ingest worker results             │
│   • report: HTML contact sheet                │
│   • apply:  prune + quarantine + re-zip       │
└────────────┬──────────────────────────────────┘
             │ queue.json (paths) via \\wsl.localhost\
             ▼
┌───────────────────────────────────────────────┐
│ Windows  C:\clip_dml_venv\                    │
│   /opt/face-sets/work/clip_worker.py          │
│   Python 3.12 + torch 2.4.1 CPU               │
│   + torch-directml 0.2.5 + open_clip_torch    │
│   Reads PNGs from native E:\, writes scores   │
└───────────────────────────────────────────────┘
```

A separate Windows venv (not the existing `C:\face_embed_venv\`) is needed because `torch-directml` brings ~1.5 GB of wheels plus version-pinned numpy/pillow that risk breaking the embed_worker venv's `onnxruntime-directml` + `insightface` stack.

## 4. DML throughput surprise

Measured on an AMD Radeon RX Vega:

| model | runtime | throughput | speedup vs WSL CPU |
|-------|---------|-----------:|-------------------:|
| ViT-L-14 (CLIP, this filter) | open_clip | **1.43 img/s** | **2.4×** |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |

Only 2.4× because `aten::_native_multi_head_attention` is not implemented in the DirectML plugin and falls back to CPU: the rest of the vision encoder runs on GPU, attention runs on CPU, and the two alternate every layer.
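At 1.43 img/s the scored corpus (6,318 PNGs, the section-5 validation count) works out to just over 73 minutes, which is where the one-shot run time comes from:

```python
corpus_pngs = 6318   # PNGs scored (section-5 validation count)
dml_rate = 1.43      # img/s for ViT-L-14 under torch-directml

minutes = corpus_pngs / dml_rate / 60
print(f"{minutes:.1f} min")  # 73.6 min
```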
A silenced UserWarning makes this fallback near-invisible. Workable for a one-shot 73-min corpus run, but the embed_worker pattern (pure ONNX) remains the gold standard for DML.

## 5. Thresholds (validated 2026-04-27 on 6,318 PNGs)

| level | threshold | semantics |
|-------|----------:|-----------|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40 % of images flagged for either attribute | quarantine the whole faceset to `_masked/` |
| min-survivors | < 5 surviving AND something pruned | quarantine to `_thin/` |

The `AND something pruned` guard is essential — without it, naturally small facesets (hand-sorted, ≤ 4 PNGs) would be incorrectly quarantined for being small even when they contain zero occlusions.

## 6. Run results

| action | count | net effect |
|--------|------:|------------|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → `_masked/` (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → `_thin/` |

Net: 311 active facesets → 255 after the filter run. 763 PNGs were quarantined whole-faceset and 183 pruned within survivors. All dropped PNGs are preserved at `/faces/_dropped/` for reversibility. The master manifest gained a `masked[]` array parallel to `thin_eras[]`, plus an `occlusion_filter_run` provenance block.

## 7. Known limitations

- **Per-faceset manifests are NOT updated by `apply`** — only the master manifest is. Each faceset's own `/manifest.json` retains stale `faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless for `.fsz` consumers (the .fsz is re-zipped from current disk state), but downstream tools reading `faces[]` will see broken references. Discovered later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG warnings before being caught.

## 8. Re-running

```bash
# 1. Stage queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json

# 2.
#    Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
    work/clip_dml/queue.json work/clip_dml/scores.json --batch 8

# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
    --scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
    --scores work/occlusion_scores.json --out work/occlusion_review

# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json
```
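The section-5 decision order can be sketched as pure logic. This is a minimal sketch, not the real `filter_occlusions.py` API; the function name and the `scores` shape (png path → per-attribute probability) are hypothetical:

```python
def classify_faceset(scores, img_thresh=0.7, set_frac=0.40, min_survivors=5):
    """Apply the section-5 thresholds, in order, to one faceset.

    scores: {png_path: {"mask": prob, "sunglasses": prob}} (hypothetical shape).
    Returns (action, pngs_to_drop).
    """
    # Image level: flagged if EITHER attribute crosses the 0.7 threshold.
    flagged = [png for png, attrs in scores.items()
               if max(attrs.values()) >= img_thresh]
    # Faceset level: >= 40 % flagged quarantines the whole set to _masked/.
    if len(flagged) / len(scores) >= set_frac:
        return "quarantine_masked", list(scores)
    # Min-survivors: the "AND something pruned" guard keeps naturally small,
    # occlusion-free facesets out of _thin/.
    if flagged and len(scores) - len(flagged) < min_survivors:
        return "quarantine_thin", list(scores)
    return ("prune" if flagged else "keep"), flagged

# Example: 6 images, 2 sunglassed -> too few survivors, whole set leaves.
demo = {f"img{i}.png": {"mask": 0.05, "sunglasses": 0.9 if i < 2 else 0.1}
        for i in range(6)}
print(classify_faceset(demo)[0])  # quarantine_thin
```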