Add post-export corpus maintenance pipeline

Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores
  via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level
  quarantine at 40% domain dominance.

- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.

- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.

- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes — cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega — ~7x embed_worker because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes
preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full
reversibility. Master manifest gains masked[], merged[], plus per-run
provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:41:18 +02:00
parent e66c97fd58
commit 49a43c7685
10 changed files with 3250 additions and 1 deletions


@@ -0,0 +1,154 @@
# CLIP zero-shot occlusion filter (masks + sunglasses)
_Run date: 2026-04-27. Driver scripts: `work/filter_occlusions.py`, `work/clip_worker.py`._
## 1. Why
`facesets_swap_ready/` ended the Immich import day with 311 substantive
facesets and a long tail of identities whose clusters had latched onto
*eyewear or mask appearance* instead of identity (covid-era shots, vacation
photos with sunglasses dominating the frame). Two failure modes:
1. **Pollution of averaged identity** — roop's `FaceSet.AverageEmbeddings()`
averages every face in the .fsz. A faceset where 40 % of images are
sunglassed gives a biased centroid; the swap reproduces sunglass-shaped
eye sockets.
2. **Whole-cluster identity drift** — clustering at the embedding level
sometimes anchors on the eyewear silhouette rather than the face,
producing clusters of "the same sunglasses across multiple people".
A targeted attribute scorer was the cleanest fix.
## 2. Model + prompts
**Model**: `open_clip` `ViT-L-14` / `dfn2b_s39b` (Apple Data Filtering Networks).
Best public zero-shot at this size. Loads weights from HF Hub (~890 MB).
Bit-identical scores between WSL CPU and Windows DML.
**Prompt design**: per-attribute ensembles of 5–6 positive + 5–6 negative
prompts. Positive ensembles are mean-pooled and L2-normalized before softmax.
**Critical bug if forgotten**: CLIP cosine similarities are tiny (0.2–0.3
range). Raw `softmax([sim_pos, sim_neg])` collapses to ~0.5/0.5 on every
image. **Multiply by `model.logit_scale.exp()` (~100) before softmax.**
Without that scale the entire scorer outputs a uniform 0.5.
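The prompt pooling and the logit-scale fix can be sketched in a few lines of numpy (hypothetical helper names — the real scoring lives in `work/clip_worker.py`). Feeding two similarities 0.04 apart shows the collapse without the scale and the clean separation with it:

```python
import numpy as np

def pool_ensemble(prompt_embs):
    # Mean-pool a (n_prompts, dim) block of per-prompt text embeddings
    # into one class vector, then L2-normalize (the prompt design above).
    v = np.asarray(prompt_embs, dtype=np.float64).mean(axis=0)
    return v / np.linalg.norm(v)

def attr_probability(img_emb, pos_emb, neg_emb, logit_scale=100.0):
    # P(positive) = softmax over the two cosine similarities, scaled
    # by logit_scale first. With logit_scale=1.0 the raw sims collapse
    # the softmax to ~0.5/0.5 on every image.
    s = logit_scale * np.array([img_emb @ pos_emb, img_emb @ neg_emb])
    z = np.exp(s - s.max())  # numerically stable softmax
    return z[0] / z.sum()
```

Two unit vectors with cosine sims 0.28 and 0.24 to an image embedding give ~0.98 with the scale and ~0.51 without it.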
**Sunglasses prompt pitfall**: the first set caught faces with sunglasses
*pushed up on the forehead* with the same probability as faces with
sunglasses *covering the eyes* — CLIP detects "presence of sunglasses in
frame", not "eyes occluded". Fixed by putting the false positive into the
*negative* class explicitly:
```
positive: "a face with dark sunglasses covering the eyes"
"a portrait with the eyes hidden behind opaque sunglasses"
...
negative: "a face with sunglasses pushed up on the forehead, eyes visible below"
"a face with sunglasses resting on top of the head, eyes visible"
"a face wearing clear prescription eyeglasses with visible eyes"
...
```
Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead
→ 0.39. Threshold 0.7 cleanly separates.
## 3. Architecture
```
┌─────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/filter_occlusions.py │
│ • stage: walk facesets/, write queue.json │
│ • merge: ingest worker results │
│ • report: HTML contact sheet │
│ • apply: prune + quarantine + re-zip │
└────────────┬────────────────────────────────┘
│ queue.json (paths) via \\wsl.localhost\
┌─────────────────────────────────────────────┐
│ Windows C:\clip_dml_venv\ │
│ /opt/face-sets/work/clip_worker.py │
│ Python 3.12 + torch 2.4.1 CPU │
│ + torch-directml 0.2.5 + open_clip_torch │
│ Reads PNGs from native E:\, writes scores │
└─────────────────────────────────────────────┘
```
A separate Windows venv (not the existing `C:\face_embed_venv\`) is needed
because `torch-directml` brings ~1.5 GB of wheels and version-pinned
numpy/pillow that risk breaking the embed_worker venv's
`onnxruntime-directml` + `insightface` stack.
## 4. DML throughput surprise
Measured on AMD Radeon RX Vega:
| model | runtime | throughput | speedup vs WSL CPU |
|------|-------|-----------:|-------------------:|
| ViT-L-14 (CLIP, this filter) | open_clip | **1.43 img/s** | **2.4×** |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |
Only 2.4× because `aten::_native_multi_head_attention` is not implemented in
the directml plugin and falls back to CPU. The vision encoder runs on GPU,
attention runs on CPU per layer, both alternating. A silenced UserWarning
makes this near-invisible. Workable for a one-shot 73-min corpus run, but
the embed_worker pattern (pure ONNX) remains the gold standard for DML.
## 5. Thresholds (validated 2026-04-27 on 6,318 PNGs)
| level | threshold | semantics |
|-------|----------:|-----------|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40 % of images flagged for either attr | quarantine whole faceset to `_masked/` |
| min-survivors | < 5 surviving AND something pruned | quarantine to `_thin/` |
The `AND something pruned` guard is essential — without it, naturally-small
facesets (hand-sorted with ≤4 PNGs) get incorrectly quarantined for being
small even when they have zero occlusions.
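The three-level decision, including the guard, reduces to one function (sketch with a hypothetical name; the real logic is inside `filter_occlusions.py apply`). Input is each image's max attribute probability:

```python
def plan_faceset(probs, img_thresh=0.7, domain_frac=0.40, min_keep=5):
    # probs: per-image P(positive), max over the two attributes.
    flagged = [p for p in probs if p >= img_thresh]
    # Faceset-level domain dominance: quarantine the whole faceset.
    if len(flagged) / len(probs) >= domain_frac:
        return "quarantine_masked"
    survivors = len(probs) - len(flagged)
    # The "AND something pruned" guard: naturally-small facesets
    # with zero occlusions must stay active.
    if flagged and survivors < min_keep:
        return "quarantine_thin"
    return "prune" if flagged else "keep"
```

A hand-sorted 4-PNG faceset with no flags returns `keep`, not `quarantine_thin`.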
## 6. Run results
| action | count | net effect |
|--------|------:|------------|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → `_masked/` (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → `_thin/` |
Net: 311 active → 255 active after the filter run. 763 PNGs quarantined
whole-faceset, 183 pruned within survivors. All dropped PNGs preserved at
`<faceset>/faces/_dropped/` for reversibility. Master manifest gained a
`masked[]` array parallel to `thin_eras[]`, plus an `occlusion_filter_run`
provenance block.
## 7. Known limitations
- **Per-faceset manifests are NOT updated by `apply`** — only the master
manifest is. Each faceset's own `<faceset>/manifest.json` retains stale
`faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless
for `.fsz` consumers (the .fsz is re-zipped from current disk state) but
downstream tools reading `faces[]` will see broken references. Discovered
later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG
warnings before being caught.
## 8. Re-running
```bash
# 1. Stage queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json
# 2. Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
work/clip_dml/queue.json work/clip_dml/scores.json --batch 8
# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
--scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
--scores work/occlusion_scores.json --out work/occlusion_review
# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json
```


@@ -0,0 +1,155 @@
# Corpus dedup + roop-unleashed optimization
_Run date: 2026-04-27. Driver scripts: `work/dedup_optimize.py`, `work/multiface_worker.py`._
After consolidation collapsed duplicate identities and age-extend slotted
new PNGs into era buckets, the corpus still carried artifacts that hurt
roop's averaged-embedding quality:
- **Burst-photo near-duplicates** within facesets, especially in
immich-discovered identities where source libraries had many similar
shots within seconds.
- **Cross-faceset byte-identical PNGs** that escaped consolidation's
centroid-similarity matching when individual PNGs matched exactly but
cluster centroids diverged.
- **Multi-face PNGs** that polluted identity averaging because the roop
loader appends every detected face per PNG to the FaceSet (load-bearing
invariant — see § 2).
This pipeline runs three independent passes and an optional fourth, all
moving dropped PNGs to `<faceset>/faces/_dropped/` for reversibility.
## 1. Cross-family byte-dedup
SHA256-hash every PNG in the active corpus (parallel I/O via
`ThreadPoolExecutor(max_workers=16)`, ~17 s for 5,386 PNGs over the
`/mnt/e/` Windows mount). Group by hash; for groups with members in
multiple identity families, keep the higher-tier copy.
**Family detection**: regex `^(faceset_\d+)(?:_.+)?$` — captures the parent
identity. Same family includes parent + era splits (e.g. `faceset_001` +
`faceset_001_2010-13`); these are intentional duplications for the era
.fsz files and are preserved.
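The family rule can be sketched as follows (illustrative helper names; the real pass hashes file contents first). Hash groups survive as dedup candidates only when they span more than one family:

```python
import re

FAMILY_RE = re.compile(r"^(faceset_\d+)(?:_.+)?$")

def family_of(faceset_name):
    # faceset_001 and faceset_001_2010-13 share family faceset_001.
    m = FAMILY_RE.match(faceset_name)
    return m.group(1) if m else None

def cross_family_groups(hash_groups):
    # hash_groups: sha256 -> list of faceset names holding that PNG.
    # Same-family duplicates (parent + era split) are intentional;
    # only groups spanning >1 family get deduplicated.
    return {h: names for h, names in hash_groups.items()
            if len({family_of(n) for n in names}) > 1}
```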
Run results: 20 cross-family hash groups → 24 PNGs dropped. Most cases were
small immich identity-cluster errors that consolidation missed because
individual PNG embeddings matched but the cluster mean did not.
## 2. Within-faceset near-dup at sim ≥ 0.95
Per-faceset pairwise cosine similarity on cached arcface embeddings.
Connected components in the `sim ≥ 0.95` graph. Keep highest
`quality.composite` per component, drop the rest.
**Threshold rationale**: legitimate same-person-different-pose pairs land at
0.50–0.85; ≥ 0.95 means essentially the same shot (burst frames or
recompressed dupes). Roop's `FaceSet.AverageEmbeddings()` averages all faces
into `faces[0].embedding`; near-identical embeddings averaged ≈ averaging
once. Removing them does not lose identity information; it removes a bias
weight on the most-photographed moments.
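The pass can be sketched with a tiny union-find (hypothetical function name; the real script pulls embeddings from the cached npz files):

```python
import numpy as np

def near_dup_drops(embs, composite, thresh=0.95):
    # embs: (n, dim) L2-normalized embeddings for one faceset.
    # Connected components of the sim >= thresh graph; keep the
    # highest quality.composite per component, drop the rest.
    n = len(embs)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    sims = embs @ embs.T
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= thresh:
                parent[find(i)] = find(j)
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    drops = []
    for members in comps.values():
        members.sort(key=lambda i: composite[i], reverse=True)
        drops.extend(members[1:])
    return sorted(drops)
```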
Run results: 851 groups → **1,225 PNGs dropped** (23 % of corpus).
Most-affected: `faceset_026` (-132 of 262), `faceset_027` (-107),
`faceset_028` (-92), `faceset_030` (-92). All immich-discovered identities
where the source library had burst sequences.
## 3. Multi-face audit (load-bearing roop invariant)
The roop loader at `roop/ui/tabs/faceswap_tab.py:661-691` runs
`extract_face_images(filename, (False, 0))` on every PNG and **appends every
detected face** to `face_set.faces`. A multi-face PNG therefore pollutes the
averaged identity. The export-swap pipeline drops multi-face crops at
creation, but post-pipeline operations (consolidation, age-extend) move
PNGs across facesets without re-checking.
**This audit re-detects every PNG** with insightface FaceAnalysis and flags
any with `face_count ≠ 1` (filtered by `det_score ≥ 0.5` and
`face_short ≥ 40`). Includes:
- ≥ 2 faces → loader will inject extra identities into averaging
- 0 faces → insightface can't find a face on the cropped PNG; useless for
roop, would silently fail
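The flag rule itself is small (sketch; `detections` stands in for insightface FaceAnalysis per-face results, reduced to the two fields the filter needs):

```python
def audit_flag(detections, det_thresh=0.5, min_short=40):
    # detections: list of (det_score, bbox_short_side) per detected face.
    # Count faces that pass both quality gates; anything other than
    # exactly one face is flagged (extra identities or zero faces).
    count = sum(1 for score, short in detections
                if score >= det_thresh and short >= min_short)
    return count, count != 1
```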
Run results: 4,146 PNGs scored, 332 flagged (272 with 2 faces, 9 with 3,
2 with 4, **49 with 0**). 82 facesets affected.
## 4. DML throughput jump for face crops
The audit reuses the same insightface + onnxruntime-directml stack as
`embed_worker.py` but achieves **~19 img/s** on AMD Vega vs embed_worker's
2.6 img/s — same model, same hardware. The difference is input size:
| stage | typical input | DML throughput |
|-------|--------------|---------------:|
| `embed_worker.py` (Immich import) | 1024–4000 px source | 2.6 img/s |
| `multiface_worker.py` (this audit) | 512×512 face crops | **19 img/s** |
Detection on small inputs is fast; recognition on aligned 112×112 inputs is
the same cost either way. Implication: **any pipeline operating on
already-cropped face PNGs can rely on a roughly 7× higher DML throughput
ceiling than full-resolution embedding**.
## 5. Architecture
```
┌────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/dedup_optimize.py │
│ • analyze: hashes + within-faceset sim │
│ • apply: move + re-zip (no GPU) │
│ • stage_multiface: write queue.json │
│ • merge_multiface: ingest worker results │
│ • apply_multiface: move + re-zip │
│ • report: HTML audit │
└────────────┬───────────────────────────────┘
│ queue.json via \\wsl.localhost\
┌────────────────────────────────────────────┐
│ Windows C:\face_embed_venv\ │
│ /opt/face-sets/work/multiface_worker.py │
│ insightface FaceAnalysis on DmlExecutionProvider │
│ Reads PNGs from native E:\, writes face_count │
└────────────────────────────────────────────┘
```
Reuses the existing `C:\face_embed_venv\` (no new venv needed — same
insightface stack as `embed_worker.py`).
## 6. Final corpus state (2026-04-27 night)
| metric | start of day | after occlusion filter | after consolidation | after age-extend | after this dedup + multiface |
|--------|-------------:|----------------------:|-------------------:|-----------------:|----------------------------:|
| active facesets | 311 | 255 | 181 | 181 | **181** |
| active PNGs | ~6,440 | 5,386 | 5,386 | 5,400 | **3,849** |
| `_masked/` | 0 | 51 | 51 | 51 | 51 |
| `_thin/` | 68 | 71 | 71 | 71 | 71 |
| `_merged/` | 0 | 0 | 74 | 74 | 74 |
Net reduction at the end of the day: **2,591 PNGs and 130 facesets** removed
or quarantined from the active pool. All preserved on disk for
reversibility (`<faceset>/faces/_dropped/` for prunes, `_masked/_merged/_thin/`
for quarantines).
## 7. Re-running
Run after any new import / consolidation / extend:
```bash
# 1. Byte-dedup + within-faceset near-dup (CPU only)
python work/dedup_optimize.py analyze --out work/dedup_audit/dedup_plan.json
python work/dedup_optimize.py apply --plan work/dedup_audit/dedup_plan.json
# 2. Multi-face audit on Windows DML (resumable)
python work/dedup_optimize.py stage_multiface --out work/dedup_audit/multiface_queue.json
"/mnt/c/face_embed_venv/Scripts/python.exe" work/multiface_worker.py \
work/dedup_audit/multiface_queue.json work/dedup_audit/multiface_results.json
python work/dedup_optimize.py merge_multiface \
--results work/dedup_audit/multiface_results.json \
--out work/dedup_audit/multiface_plan.json
python work/dedup_optimize.py apply_multiface \
--plan work/dedup_audit/multiface_plan.json
# 3. HTML audit
python work/dedup_optimize.py report \
--dedup work/dedup_audit/dedup_plan.json \
--multiface work/dedup_audit/multiface_plan.json \
--out work/dedup_audit
```


@@ -0,0 +1,170 @@
# Identity consolidation + age-bucket extension
_Run date: 2026-04-27. Driver scripts: `work/consolidate_facesets.py`, `work/age_extend_001.py`._
After the Immich peter + nic imports added 280 new facesets to a corpus that
had ~25 canonical identities, many "new" identities were duplicates of
existing household members at lower clustering confidence. Two cooperating
passes clean this up: identity consolidation merges duplicates, then
age-extend slots newly-merged PNGs into the existing era buckets of
`faceset_001`.
## 1. Identity consolidation
### 1.1 Approach
For each active faceset, pull cached arcface embeddings from
`work/cache/{nl_full,immich_peter,immich_nic}.npz` keyed by
`(source, bbox)` from the per-faceset manifest's `faces[]`. Compute
L2-normalized centroid. Pairwise cosine similarity matrix.
**Tier-based primary selection** (lowest tier number wins, size breaks ties):
| tier | sources | rationale |
|-----:|---------|-----------|
| 0 | `faceset_013..019` (hand-sorted) | user's curated labels |
| 1 | `faceset_001..012` (auto-clustered) | well-established household |
| 2 | `faceset_020..025` (osrc) | mixed-bucket discovery |
| 3 | `faceset_026..264` (immich peter) | speculative |
| 4 | `faceset_265+` (immich nic) | speculative |
**Era splits and quarantines excluded**: `faceset_NNN_<era>`, `_masked/`, and
`_thin/` are skipped during analysis.
### 1.2 Single-linkage chains catastrophically — complete-linkage required
First attempt used connected-components on edge ≥ 0.45 → produced a
**60-faceset cluster** around `faceset_001` with min within-group sim of
**0.16** (definitely-different people bridged via chains
`A↔B↔C` where `A`, `C` are not similar). Bumping to edge ≥ 0.55 still
chained (group of 17 with min 0.20).
Real fix: `scipy.cluster.hierarchy.linkage(method='complete')` then
`fcluster(Z, t=1-edge_threshold, criterion='distance')`. Complete-linkage
**guarantees** every within-group pair sim ≥ edge threshold. Without this
guarantee the report is unusable and the apply step would produce
identity-poisoned merges.
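A minimal reproduction of the difference, using the same scipy calls as the fix: three centroids where A~B and B~C but A and C are clearly different people chain into one group under single-linkage and stay apart under complete-linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def group_facesets(sim, edge=0.55, method="complete"):
    # sim: symmetric cosine-similarity matrix over faceset centroids.
    # method="complete" guarantees every within-group pair >= edge.
    dist = 1.0 - np.asarray(sim, dtype=np.float64)
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method=method)
    return fcluster(Z, t=1.0 - edge, criterion="distance")
```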
### 1.3 Thresholds + run results
`edge=0.55`, `confident=0.65` → 48 multi-faceset groups (29 confident, 19
uncertain). Max group size 7, all bilateral or small triplets after
complete-linkage.
After applying all 48 (with `--include-uncertain` after visual approval):
- **74 facesets consumed** (some groups had multiple secondaries:
`[10, 45, 135] → faceset_002`; `[113, 96, 178, 109, 110, 286] → faceset_095`;
etc.)
- Active count 255 → 181
- Notable absorptions: `faceset_001` (peter) 707 → 753 PNGs (+ 7, 132, 151);
`faceset_002` 209 → 247; `faceset_026` 60 → 262 (+ 168, 146, 325);
`faceset_028` → 207
- Master manifest gained `merged[]` array (parallel to `thin_eras[]`); each
entry has `merged_into` field pointing at the primary
### 1.4 Apply mechanics
Combine all PNGs from primary + secondaries, re-rank by existing
`quality.composite` desc (no re-enrich), renumber `0001..NNNN`, copy into a
fresh staging dir, atomic swap. Move secondary directories to
`_merged/<original_name>/` (preserved in full for reversibility). Re-zip
`_topN.fsz` and `_all.fsz`.
The primary's existing per-PNG quality scores are reused — re-ranking does
not require re-running `enrich`-equivalent landmarks/pose on the cropped
PNGs. The primary's `_dropped/` (from prior occlusion filter) is preserved
through the merge.
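The re-rank + renumber step is just a sort on the cached scores (sketch with a hypothetical name; the staging copy, atomic swap, and re-zip are omitted):

```python
def renumber_by_quality(records):
    # records: face entries carrying their existing quality.composite.
    # No re-enrich: re-rank descending and assign names 0001..NNNN.
    ranked = sorted(records, key=lambda r: r["composite"], reverse=True)
    return [(r["path"], f"{i:04d}.png") for i, r in enumerate(ranked, 1)]
```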
## 2. Age extension of faceset_001 era buckets
### 2.1 Why a follow-on pass
Consolidation absorbed faceset_007/132/151 into faceset_001 (+46 PNGs).
The original `age_split_001.py` had bucketed peter into 6 era anchors
(`_2005-10`, `_2010-13`, `_2011`, `_2014-17`, `_2018-19`, `_2018-20`), but
those new PNGs had never been seen by age_split. They sat in faceset_001's
parent-only set, missing from every era .fsz.
### 2.2 Era-label pitfall
The 6 anchor era labels are NOT strict year ranges. They are
`Counter(years).most_common(1)`-derived dom-years from the original sub-cluster:
| label | dom_year | actual span of members |
|-------|---------:|-----------------------:|
| `_2005-10` | 2010 | 2005–2010 |
| `_2010-13` | 2011 | **2007–2024** |
| `_2011` | 2011 | 2011 only |
| `_2014-17` | 2016 | 2005–2018 |
| `_2018-19` | 2018 | 2012–2020 |
| `_2018-20` | 2019 | 2014–2022 |
The clusters are *appearance-anchored*, not year-bounded. Year is a
descriptive label. Assignment rule must use dom-year, not member span.
### 2.3 Algorithm
For each unbucketed face entry in `faceset_001`'s manifest (50 of 753):
1. Look up embedding in cache by `(source, bbox)`.
2. Look up EXIF year via `work/cache/age_split_exif.json`; fetch on cache miss.
3. Find single nearest era anchor by cosine distance to its centroid.
4. Accept iff `dist ≤ 0.40` AND `|year - anchor.dom_year| ≤ 5`.
These thresholds match `age_split_001.py`'s anchor-fragment rule.
5. Anchors are NOT re-centered after absorption (preserves age_split's
drift-prevention guarantee).
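The acceptance step (3–5) reduces to a single nearest-anchor check (sketch with hypothetical names; real anchors come from `age_split_001.py`'s centroids, and nothing here mutates them):

```python
import numpy as np

def assign_era(emb, year, anchors, max_dist=0.40, max_year_delta=5):
    # anchors: {label: (unit_centroid, dom_year)}. Pick the single
    # nearest anchor by cosine distance, then gate on both distance
    # and dom-year delta. Anchors are never re-centered.
    emb = np.asarray(emb, dtype=np.float64)
    emb = emb / np.linalg.norm(emb)
    label, (centroid, dom_year) = min(
        anchors.items(), key=lambda kv: 1.0 - float(emb @ kv[1][0]))
    dist = 1.0 - float(emb @ centroid)
    if dist <= max_dist and abs(year - dom_year) <= max_year_delta:
        return label
    return None
```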
### 2.4 Run results
50 unbucketed → 21 with EXIF year → **14 accepted**:
| anchor | dom_year | added |
|--------|---------:|------:|
| `_2005-10` | 2010 | +2 |
| `_2010-13` | 2011 | +1 |
| `_2014-17` | 2016 | **+9** |
| `_2018-20` | 2019 | +2 |
29 PNGs skipped for missing EXIF year (mostly immich-stripped
photos). 7 dist/year-rejected (e.g. two PNGs from 2025 want
`_2018-19` but year-delta 7 > 5).
### 2.5 Reconciliation side effect
The apply rebuilds each affected era bucket's `faces/` from staging. This
incidentally reconciled the per-bucket manifests with disk after the prior
occlusion filter run had left era manifests stale at 282/126/132 entries vs
~248/125/129 actual files (occlusion filter only updates the master
manifest, never per-faceset manifests — see
`docs/analysis/clip-occlusion-filter.md` §7). 42 occlusion-dropped era PNGs
inside the old `faces/_dropped/` were removed during rebuild. The
parent `faceset_001/faces/_dropped/` still has the corpus-level audit; all
source images are intact at `/mnt/x/src/`, so the era-level dropped PNGs
are regeneratable via `cmd_export_swap`.
## 3. Re-running
Always run both passes after any new identity import (Immich, osrc,
hand-sorted folder):
```bash
# 1. Find duplicate identities
python work/consolidate_facesets.py analyze \
--out work/merge_review/candidates.json [--edge 0.55 --confident 0.65]
python work/consolidate_facesets.py report \
--candidates work/merge_review/candidates.json --out work/merge_review
# inspect work/merge_review/index.html
python work/consolidate_facesets.py apply \
--candidates work/merge_review/candidates.json [--include-uncertain]
# 2. Slot new faceset_001 PNGs into existing era buckets
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
python work/age_extend_001.py report \
--candidates work/age_extend/candidates.json --out work/age_extend
python work/age_extend_001.py apply --candidates work/age_extend/candidates.json
```
Both are idempotent. `consolidate_facesets` skips secondaries already in
`_merged/`; `age_extend_001` recomputes anchor centroids + dom-year fresh
on every run.