Adds four new orchestration scripts that operate on an already-built facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level quarantine at 40% domain dominance.
- consolidate_facesets.py: duplicate-identity merger using complete-linkage centroid clustering on cached arcface embeddings. Single-linkage chains catastrophically (60-faceset clusters with min sim < 0); complete-linkage guarantees within-group sim >= edge.
- age_extend_001.py: slots newly-added PNGs into existing era buckets of faceset_001 using the same anchor-fragment rule as age_split_001.py (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.
- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three passes — cross-family SHA256 byte-dedup (preserves intra-family era duplication), within-faceset near-dup at sim >= 0.95, and a multi-face audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s on AMD Vega — ~7x embed_worker because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged → 181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full reversibility. Master manifest gains masked[], merged[], plus per-run provenance blocks. Three new docs/analysis/ writeups cover model choice, threshold rationale, and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLIP zero-shot occlusion filter (masks + sunglasses)
Run date: 2026-04-27. Driver scripts: work/filter_occlusions.py, work/clip_worker.py.
1. Why
facesets_swap_ready/ ended the Immich import day with 311 substantive
facesets and a long tail of identities whose clusters had latched onto
eyewear or mask appearance instead of identity (covid-era shots, vacation
photos with sunglasses dominating the frame). Two failure modes:
- Pollution of the averaged identity — roop's `FaceSet.AverageEmbeddings()` averages every face in the .fsz. A faceset where 40% of images are sunglassed gives a biased centroid; the swap reproduces sunglass-shaped eye sockets.
- Whole-cluster identity drift — clustering at the embedding level sometimes anchors on the eyewear silhouette rather than the face, producing clusters of "the same sunglasses across multiple people".
A targeted attribute scorer was the cleanest fix.
2. Model + prompts
Model: open_clip ViT-L-14 / dfn2b_s39b (Apple Data Filtering Networks).
Best public zero-shot at this size. Loads weights from HF Hub (~890 MB).
Bit-identical scores between WSL CPU and Windows DML.
Prompt design: per-attribute ensembles of 5–6 positive + 5–6 negative prompts. Positive ensembles are mean-pooled and L2-normalized before softmax.
Critical bug if forgotten: CLIP cosine similarities are tiny (0.2–0.3
range). Raw softmax([sim_pos, sim_neg]) collapses to ~0.5/0.5 on every
image. Multiply by model.logit_scale.exp() (~100) before softmax.
Without that scale the entire scorer outputs a uniform 0.5.
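The collapse is easy to demonstrate with plain softmax arithmetic. A minimal sketch — the 0.25/0.23 similarities and the scale of 100 are illustrative values, not measurements from this run:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Raw CLIP cosine similarities sit in a narrow ~0.2-0.3 band, so their
# softmax is nearly uniform no matter what the image contains:
sim_pos, sim_neg = 0.25, 0.23
raw = softmax([sim_pos, sim_neg])        # ~[0.505, 0.495] -- useless

# Multiplying by logit_scale.exp() (~100 for trained CLIP models)
# spreads the logits enough for softmax to actually discriminate:
scale = 100.0
scaled = softmax([scale * sim_pos, scale * sim_neg])  # ~[0.88, 0.12]
```

The same two similarities go from an unusable 0.505/0.495 split to a decisive 0.88/0.12 once the learned temperature is applied.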
Sunglasses prompt pitfall: the first set caught faces with sunglasses pushed up on the forehead with the same probability as faces with sunglasses covering the eyes — CLIP detects "presence of sunglasses in frame", not "eyes occluded". Fixed by putting the false positive into the negative class explicitly:
positive: "a face with dark sunglasses covering the eyes"
"a portrait with the eyes hidden behind opaque sunglasses"
...
negative: "a face with sunglasses pushed up on the forehead, eyes visible below"
"a face with sunglasses resting on top of the head, eyes visible"
"a face wearing clear prescription eyeglasses with visible eyes"
...
Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead → 0.39. Threshold 0.7 cleanly separates.
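The ensemble mechanics from §2 can be sketched in plain Python. Mock 3-d unit vectors stand in for CLIP's real text/image features here; the actual filter presumably does the equivalent with torch tensors:

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def ensemble_embedding(prompt_embs):
    """Mean-pool an ensemble of prompt embeddings, then L2-normalize,
    so each attribute class becomes a single unit vector."""
    dim = len(prompt_embs[0])
    mean = [sum(e[d] for e in prompt_embs) / len(prompt_embs) for d in range(dim)]
    return l2_normalize(mean)

def attribute_prob(img_emb, pos_emb, neg_emb, logit_scale=100.0):
    """P(positive) via the scaled two-way softmax over cosine similarities."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    z_pos = logit_scale * dot(img_emb, pos_emb)
    z_neg = logit_scale * dot(img_emb, neg_emb)
    m = max(z_pos, z_neg)
    e_pos, e_neg = math.exp(z_pos - m), math.exp(z_neg - m)
    return e_pos / (e_pos + e_neg)

# Mock data: an image embedding that leans toward the positive class.
pos = ensemble_embedding([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
neg = ensemble_embedding([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]])
img = l2_normalize([0.8, 0.2, 0.0])
p = attribute_prob(img, pos, neg)   # well above the 0.7 threshold
```

Note the order of operations: pool first, normalize second — normalizing individual prompts before pooling gives a slightly different (non-unit) centroid.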
3. Architecture
┌─────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/filter_occlusions.py │
│ • stage: walk facesets/, write queue.json │
│ • merge: ingest worker results │
│ • report: HTML contact sheet │
│ • apply: prune + quarantine + re-zip │
└────────────┬────────────────────────────────┘
│ queue.json (paths) via \\wsl.localhost\
▼
┌─────────────────────────────────────────────┐
│ Windows C:\clip_dml_venv\ │
│ /opt/face-sets/work/clip_worker.py │
│ Python 3.12 + torch 2.4.1 CPU │
│ + torch-directml 0.2.5 + open_clip_torch │
│ Reads PNGs from native E:\, writes scores │
└─────────────────────────────────────────────┘
A separate Windows venv (not the existing C:\face_embed_venv\) is needed
because torch-directml brings ~1.5 GB of wheels and version-pinned
numpy/pillow that risk breaking the embed_worker venv's
onnxruntime-directml + insightface stack.
4. DML throughput surprise
Measured on AMD Radeon RX Vega:
| input | model | throughput | speedup vs WSL CPU |
|---|---|---|---|
| ViT-L-14 (CLIP, this filter) | open_clip | 1.43 img/s | 2.4× |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |
Only 2.4× because aten::_native_multi_head_attention is not implemented in
the torch-directml plugin and falls back to CPU: the vision encoder runs on
GPU, attention runs on CPU, alternating every layer. A silenced UserWarning
makes the fallback near-invisible. Workable for a one-shot 73-min corpus
run, but the embed_worker pattern (pure ONNX) remains the gold standard for
DML.
5. Thresholds (validated 2026-04-27 on 6,318 PNGs)
| level | threshold | semantics |
|---|---|---|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40 % of images flagged for either attr | quarantine whole faceset to _masked/ |
| min-survivors | < 5 surviving AND something pruned | quarantine to _thin/ |
The "AND something pruned" guard is essential — without it, naturally small
facesets (hand-sorted, ≤4 PNGs) would be quarantined merely for being small
even when they contain zero occlusions.
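The three levels compose into one small decision function. A sketch of the logic only — thresholds mirror the table above, but the function and return-value names are hypothetical, not filter_occlusions.py's actual API:

```python
IMG_THRESH = 0.7      # image-level: P(positive) at or above this drops the PNG
FACESET_FRAC = 0.40   # faceset-level: this fraction flagged quarantines the set
MIN_SURVIVORS = 5     # thin guard

def classify_faceset(probs):
    """probs: per-image max P(positive) across the two attributes.
    Returns 'keep', 'prune', 'quarantine_masked', or 'quarantine_thin'."""
    flagged = [p for p in probs if p >= IMG_THRESH]
    if len(flagged) >= FACESET_FRAC * len(probs):
        return "quarantine_masked"          # whole faceset -> _masked/
    survivors = len(probs) - len(flagged)
    # The guard: small-but-clean facesets (nothing flagged) stay active.
    if survivors < MIN_SURVIVORS and flagged:
        return "quarantine_thin"            # -> _thin/
    return "prune" if flagged else "keep"
```

A 4-image faceset with no flags returns "keep" here, which is exactly the case the guard exists to protect.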
6. Run results
| action | count | net effect |
|---|---|---|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → _masked/ (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → _thin/ |
Net: 311 active → 255 active after the filter run. 763 PNGs quarantined
whole-faceset, 183 pruned within survivors. All dropped PNGs preserved at
<faceset>/faces/_dropped/ for reversibility. Master manifest gained a
masked[] array parallel to thin_eras[], plus an occlusion_filter_run
provenance block.
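For orientation, the manifest additions might look like the following. This is a guessed shape shown as a Python literal — the real field names are defined by filter_occlusions.py and the faceset IDs below are invented:

```python
# Hypothetical sketch of the master-manifest additions described above.
masked = ["faceset_012", "faceset_087"]   # parallel in spirit to thin_eras[]

occlusion_filter_run = {
    "run_date": "2026-04-27",
    "model": "ViT-L-14/dfn2b_s39b",
    "image_threshold": 0.7,
    "faceset_fraction": 0.40,
    "quarantined_masked": 51,
    "quarantined_thin": 3,
    "pruned_pngs": 183,
}
```

Keeping the thresholds inside the provenance block means a later re-run with different knobs is distinguishable from this one without consulting the docs.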
7. Known limitations
- Per-faceset manifests are NOT updated by `apply` — only the master manifest is. Each faceset's own `<faceset>/manifest.json` retains stale `faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless for `.fsz` consumers (the .fsz is re-zipped from current disk state), but downstream tools reading `faces[]` will see broken references. Discovered later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG warnings before being caught.
8. Re-running
# 1. Stage queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json
# 2. Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
work/clip_dml/queue.json work/clip_dml/scores.json --batch 8
# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
--scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
--scores work/occlusion_scores.json --out work/occlusion_review
# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json