Add post-export corpus maintenance pipeline

Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores
  via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level
  quarantine at 40% domain dominance.

- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.

- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.

- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes — cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega — ~7x embed_worker because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes
preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full
reversibility. Master manifest gains masked[], merged[], plus per-run
provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:41:18 +02:00
parent e66c97fd58
commit 49a43c7685
10 changed files with 3250 additions and 1 deletions


@@ -0,0 +1,154 @@
# CLIP zero-shot occlusion filter (masks + sunglasses)
_Run date: 2026-04-27. Driver scripts: `work/filter_occlusions.py`, `work/clip_worker.py`._
## 1. Why
`facesets_swap_ready/` ended the Immich import day with 311 substantive
facesets and a long tail of identities whose clusters had latched onto
*eyewear or mask appearance* instead of identity (covid-era shots, vacation
photos with sunglasses dominating the frame). Two failure modes:
1. **Pollution of averaged identity** — roop's `FaceSet.AverageEmbeddings()`
averages every face in the .fsz. A faceset where 40 % of images are
sunglassed gives a biased centroid; the swap reproduces sunglass-shaped
eye sockets.
2. **Whole-cluster identity drift** — clustering at the embedding level
sometimes anchors on the eyewear silhouette rather than the face,
producing clusters of "the same sunglasses across multiple people".
A targeted attribute scorer was the cleanest fix.
## 2. Model + prompts
**Model**: `open_clip` `ViT-L-14` / `dfn2b_s39b` (Apple Data Filtering Networks).
Best public zero-shot at this size. Loads weights from HF Hub (~890 MB).
Bit-identical scores between WSL CPU and Windows DML.
**Prompt design**: per-attribute ensembles of 56 positive + 56 negative
prompts. Positive ensembles are mean-pooled and L2-normalized before softmax.
**Critical bug if forgotten**: CLIP cosine similarities are tiny (0.2–0.3
range). Raw `softmax([sim_pos, sim_neg])` collapses to ~0.5/0.5 on every
image. **Multiply by `model.logit_scale.exp()` (~100) before softmax.**
Without that scale the entire scorer outputs a uniform 0.5.
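A minimal numpy sketch of the failure mode (similarity values illustrative, not from the run):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Typical raw CLIP cosine similarities: tiny and close together.
sim_pos, sim_neg = 0.27, 0.22

# Without the logit scale: near-uniform, useless for a 0.7 threshold.
p_raw = softmax(np.array([sim_pos, sim_neg]))

# With logit_scale.exp() ~ 100 (the value CLIP was trained with):
# the same pair becomes a decisive probability.
logit_scale = 100.0
p_scaled = softmax(logit_scale * np.array([sim_pos, sim_neg]))

print(p_raw[0])     # ~0.51 -- every image looks like this unscaled
print(p_scaled[0])  # ~0.99
```

The scale turns a 0.05 similarity gap into a 5-logit gap, which is what makes the two-way softmax discriminative at all.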
**Sunglasses prompt pitfall**: the first set caught faces with sunglasses
*pushed up on the forehead* with the same probability as faces with
sunglasses *covering the eyes* — CLIP detects "presence of sunglasses in
frame", not "eyes occluded". Fixed by putting the false positive into the
*negative* class explicitly:
```
positive: "a face with dark sunglasses covering the eyes"
"a portrait with the eyes hidden behind opaque sunglasses"
...
negative: "a face with sunglasses pushed up on the forehead, eyes visible below"
"a face with sunglasses resting on top of the head, eyes visible"
"a face wearing clear prescription eyeglasses with visible eyes"
...
```
Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead
→ 0.39. Threshold 0.7 cleanly separates.
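The ensemble scoring described above can be sketched in pure numpy, with random stand-ins for the CLIP embeddings (the real scripts use the `open_clip` text/image encoders; function name here is hypothetical):

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def attribute_prob(img_emb, pos_prompt_embs, neg_prompt_embs, logit_scale=100.0):
    """P(attribute) for one image from a positive and a negative prompt
    ensemble. Each ensemble is mean-pooled, then L2-normalized, before
    the scaled two-way softmax."""
    pos = l2norm(l2norm(pos_prompt_embs).mean(axis=0))
    neg = l2norm(l2norm(neg_prompt_embs).mean(axis=0))
    img = l2norm(img_emb)
    logits = logit_scale * np.array([img @ pos, img @ neg])
    e = np.exp(logits - logits.max())
    return (e / e.sum())[0]

rng = np.random.default_rng(0)
dim = 768  # ViT-L-14 embedding width
p = attribute_prob(rng.normal(size=dim),
                   rng.normal(size=(56, dim)),   # 56 positive prompts
                   rng.normal(size=(56, dim)))   # 56 negative prompts
```

Mean-pooling before normalizing makes the ensemble behave like a single prompt vector, so the logit-scale fix from section 2 applies unchanged.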
## 3. Architecture
```
┌─────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/filter_occlusions.py │
│ • stage: walk facesets/, write queue.json │
│ • merge: ingest worker results │
│ • report: HTML contact sheet │
│ • apply: prune + quarantine + re-zip │
└────────────┬────────────────────────────────┘
│ queue.json (paths) via \\wsl.localhost\
┌─────────────────────────────────────────────┐
│ Windows C:\clip_dml_venv\ │
│ /opt/face-sets/work/clip_worker.py │
│ Python 3.12 + torch 2.4.1 CPU │
│ + torch-directml 0.2.5 + open_clip_torch │
│ Reads PNGs from native E:\, writes scores │
└─────────────────────────────────────────────┘
```
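The `stage` step amounts to a corpus walk that emits a flat work queue for the Windows worker. A simplified sketch (directory layout and JSON field names assumed, not taken from the real script):

```python
import json
from pathlib import Path

def stage_queue(corpus_root: str, out_path: str) -> int:
    """Walk <corpus>/<faceset>/faces/*.png and write a flat queue.json.
    Quarantine subdirs under faces/ (_dropped/ etc.) are skipped because
    glob() is non-recursive. Returns the number of queued PNGs."""
    root = Path(corpus_root)
    items = []
    for faceset in sorted(p for p in root.iterdir() if p.is_dir()):
        faces = faceset / "faces"
        if not faces.is_dir():
            continue
        for png in sorted(faces.glob("*.png")):
            items.append({"faceset": faceset.name, "path": str(png)})
    Path(out_path).write_text(json.dumps({"items": items}, indent=2))
    return len(items)
```

The worker only needs paths it can resolve from the Windows side (via `\\wsl.localhost\` or a native drive), which is why the queue carries plain path strings rather than open file handles.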
A separate Windows venv (not the existing `C:\face_embed_venv\`) is needed
because `torch-directml` brings ~1.5 GB of wheels and version-pinned
numpy/pillow that risk breaking the embed_worker venv's
`onnxruntime-directml` + `insightface` stack.
## 4. DML throughput surprise
Measured on AMD Radeon RX Vega:
| model | runtime | throughput | speedup vs WSL CPU |
|-------|---------|-----------:|-------------------:|
| ViT-L-14 (CLIP, this filter) | open_clip | **1.43 img/s** | **2.4×** |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |
Only 2.4× because `aten::_native_multi_head_attention` is not implemented in
the directml plugin and falls back to CPU. The vision encoder runs on GPU,
attention runs on CPU per layer, both alternating. A silenced UserWarning
makes this near-invisible. Workable for a one-shot 73-min corpus run, but
the embed_worker pattern (pure ONNX) remains the gold standard for DML.
## 5. Thresholds (validated 2026-04-27 on 6,318 PNGs)
| level | threshold | semantics |
|-------|----------:|-----------|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40 % of images flagged for either attr | quarantine whole faceset to `_masked/` |
| min-survivors | < 5 surviving AND something pruned | quarantine to `_thin/` |
The `AND something pruned` guard is essential — without it, naturally-small
facesets (hand-sorted with ≤4 PNGs) get incorrectly quarantined for being
small even when they have zero occlusions.
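The three rules compose into a single per-faceset decision. A sketch of that logic (function name hypothetical; image-level flags are assumed already computed at P(positive) ≥ 0.7):

```python
def faceset_action(n_total: int, n_flagged: int) -> str:
    """Decide a faceset's fate from its image count and flag count,
    per the threshold table above."""
    if n_total == 0:
        return "empty"
    if n_flagged / n_total >= 0.40:
        return "quarantine_masked"        # whole faceset -> _masked/
    survivors = n_total - n_flagged
    if survivors < 5 and n_flagged > 0:   # the "AND something pruned" guard
        return "quarantine_thin"          # -> _thin/
    return "prune" if n_flagged else "keep"

assert faceset_action(10, 4) == "quarantine_masked"  # 40% dominance
assert faceset_action(4, 0)  == "keep"               # small but untouched
assert faceset_action(6, 2)  == "quarantine_thin"    # 4 survivors, pruned
```

Note the ordering: the 40 % check runs first, so a heavily-occluded faceset is quarantined whole rather than being pruned down into `_thin/`.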
## 6. Run results
| action | count | net effect |
|--------|------:|------------|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → `_masked/` (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → `_thin/` |
Net: 311 active → 255 active after the filter run. 763 PNGs quarantined
whole-faceset, 183 pruned within survivors. All dropped PNGs preserved at
`<faceset>/faces/_dropped/` for reversibility. Master manifest gained a
`masked[]` array parallel to `thin_eras[]`, plus an `occlusion_filter_run`
provenance block.
## 7. Known limitations
- **Per-faceset manifests are NOT updated by `apply`** — only the master
manifest is. Each faceset's own `<faceset>/manifest.json` retains stale
`faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless
for `.fsz` consumers (the .fsz is re-zipped from current disk state) but
downstream tools reading `faces[]` will see broken references. Discovered
later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG
warnings before being caught.
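One possible repair for the stale-manifest problem: resync each faceset's `faces[]` against what actually sits in `faces/` on disk. Helper name and the per-entry `path` field are assumptions about the manifest schema, not taken from the real scripts:

```python
import json
from pathlib import Path

def resync_manifest(faceset_dir: str) -> int:
    """Drop faces[] entries whose PNG is no longer directly in faces/
    (i.e. was moved into _dropped/ or another quarantine subdir).
    Returns the number of stale entries removed."""
    fs = Path(faceset_dir)
    manifest_path = fs / "manifest.json"
    manifest = json.loads(manifest_path.read_text())
    on_disk = {p.name for p in (fs / "faces").glob("*.png")}
    before = len(manifest.get("faces", []))
    manifest["faces"] = [f for f in manifest.get("faces", [])
                         if Path(f["path"]).name in on_disk]
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return before - len(manifest["faces"])
```

Running something like this as a post-`apply` step would have spared `age_extend_001.py` its 42 missing-PNG warnings.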
## 8. Re-running
```bash
# 1. Stage queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json
# 2. Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
work/clip_dml/queue.json work/clip_dml/scores.json --batch 8
# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
--scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
--scores work/occlusion_scores.json --out work/occlusion_review
# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json
```