Add post-export corpus maintenance pipeline
Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores via
  new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level quarantine
  at 40% domain dominance.
- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.
- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.
- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes: cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega, ~7x embed_worker, because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes
preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full
reversibility.

Master manifest gains masked[], merged[], plus per-run provenance blocks.
Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
154
docs/analysis/clip-occlusion-filter.md
Normal file
@@ -0,0 +1,154 @@
# CLIP zero-shot occlusion filter (masks + sunglasses)

_Run date: 2026-04-27. Driver scripts: `work/filter_occlusions.py`, `work/clip_worker.py`._

## 1. Why

`facesets_swap_ready/` ended the Immich import day with 311 substantive
facesets and a long tail of identities whose clusters had latched onto
*eyewear or mask appearance* instead of identity (covid-era shots, vacation
photos with sunglasses dominating the frame). Two failure modes:

1. **Pollution of averaged identity**: roop's `FaceSet.AverageEmbeddings()`
   averages every face in the .fsz. A faceset where 40 % of images are
   sunglassed gives a biased centroid; the swap reproduces sunglass-shaped
   eye sockets.
2. **Whole-cluster identity drift**: clustering at the embedding level
   sometimes anchors on the eyewear silhouette rather than the face,
   producing clusters of "the same sunglasses across multiple people".

A targeted attribute scorer was the cleanest fix.

## 2. Model + prompts

**Model**: `open_clip` `ViT-L-14` / `dfn2b_s39b` (Apple Data Filtering Networks).
Best public zero-shot at this size. Loads weights from HF Hub (~890 MB).
Bit-identical scores between WSL CPU and Windows DML.

**Prompt design**: per-attribute ensembles of 5–6 positive + 5–6 negative
prompts. Positive ensembles are mean-pooled and L2-normalized before softmax.

**Critical bug if forgotten**: CLIP cosine similarities are tiny (0.2–0.3
range). Raw `softmax([sim_pos, sim_neg])` collapses to ~0.5/0.5 on every
image. **Multiply by `model.logit_scale.exp()` (~100) before softmax.**
Without that scale the entire scorer outputs a uniform 0.5.

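The effect of the logit scale can be reproduced without loading CLIP at all. The sketch below is a stdlib-only illustration, not the code in `clip_worker.py`: the two similarity values are made up to sit in the 0.2–0.3 range, `p_positive` is a hypothetical helper name, and `logit_scale=100.0` stands in for `model.logit_scale.exp()`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def p_positive(sim_pos, sim_neg, logit_scale=100.0):
    """P(positive) from two cosine similarities, after applying the logit scale.

    logit_scale=100.0 is an assumed stand-in for model.logit_scale.exp().
    """
    return softmax([logit_scale * sim_pos, logit_scale * sim_neg])[0]


# Illustrative cosine similarities in the typical 0.2-0.3 range.
sim_pos, sim_neg = 0.27, 0.23

unscaled = p_positive(sim_pos, sim_neg, logit_scale=1.0)
scaled = p_positive(sim_pos, sim_neg, logit_scale=100.0)
print(f"unscaled: {unscaled:.3f}")  # 0.510, indistinguishable from a coin flip
print(f"scaled:   {scaled:.3f}")    # 0.982, clears the 0.7 image threshold
```

A 0.04 gap in cosine similarity is a strong signal for CLIP, but it only becomes visible to the 0.7 threshold once it is multiplied up to a 4.0 gap in logits.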
**Sunglasses prompt pitfall**: the first set caught faces with sunglasses
*pushed up on the forehead* with the same probability as faces with
sunglasses *covering the eyes*. CLIP detects "presence of sunglasses in
frame", not "eyes occluded". Fixed by putting the false positive into the
*negative* class explicitly:

```
positive: "a face with dark sunglasses covering the eyes"
          "a portrait with the eyes hidden behind opaque sunglasses"
          ...
negative: "a face with sunglasses pushed up on the forehead, eyes visible below"
          "a face with sunglasses resting on top of the head, eyes visible"
          "a face wearing clear prescription eyeglasses with visible eyes"
          ...
```

Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead
→ 0.39. A threshold of 0.7 cleanly separates the two.

## 3. Architecture

```
┌───────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/filter_occlusions.py  │
│   • stage: walk facesets/, write queue.json   │
│   • merge: ingest worker results              │
│   • report: HTML contact sheet                │
│   • apply: prune + quarantine + re-zip        │
└────────────┬──────────────────────────────────┘
             │ queue.json (paths) via \\wsl.localhost\
             ▼
┌───────────────────────────────────────────────┐
│ Windows C:\clip_dml_venv\                     │
│   /opt/face-sets/work/clip_worker.py          │
│   Python 3.12 + torch 2.4.1 CPU               │
│   + torch-directml 0.2.5 + open_clip_torch    │
│   Reads PNGs from native E:\, writes scores   │
└───────────────────────────────────────────────┘
```

A separate Windows venv (not the existing `C:\face_embed_venv\`) is needed
because `torch-directml` brings ~1.5 GB of wheels and version-pinned
numpy/pillow that risk breaking the embed_worker venv's
`onnxruntime-directml` + `insightface` stack.

## 4. DML throughput surprise

Measured on AMD Radeon RX Vega:

| model (stage) | runtime | throughput | speedup vs WSL CPU |
|------|-------|-----------:|-------------------:|
| ViT-L-14 (CLIP, this filter) | open_clip | **1.43 img/s** | **2.4×** |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |

Only 2.4× because `aten::_native_multi_head_attention` is not implemented in
the directml plugin and falls back to CPU. The rest of the vision encoder
runs on the GPU while attention runs on the CPU in every layer, so execution
alternates between the two devices. A silenced UserWarning makes this
near-invisible. Workable for a one-shot 73-min corpus run, but the
embed_worker pattern (pure ONNX) remains the gold standard for DML.

## 5. Thresholds (validated 2026-04-27 on 6,318 PNGs)

| level | threshold | semantics |
|-------|----------:|-----------|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40 % of images flagged for either attr | quarantine whole faceset to `_masked/` |
| min-survivors | < 5 surviving AND something pruned | quarantine to `_thin/` |

The `AND something pruned` guard is essential: without it, naturally-small
facesets (hand-sorted with ≤4 PNGs) get incorrectly quarantined for being
small even when they have zero occlusions.

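The three threshold levels combine into a single per-faceset decision. The function below is a sketch of that logic, not the code in `filter_occlusions.py`: the score-record shape, the name `decide_action`, the order of the checks, and the reading of "either attr" as a per-attribute 40 % share are all assumptions.

```python
ATTRS = ("mask", "sunglasses")


def decide_action(scores, image_thr=0.7, dominance_thr=0.40, min_survivors=5):
    """Return keep / prune / quarantine_masked / quarantine_thin for one faceset.

    scores: list of dicts holding per-image P(positive) for each attribute,
    e.g. {"mask": 0.03, "sunglasses": 0.91}. The record shape is hypothetical.
    """
    total = len(scores)
    flagged = [s for s in scores if any(s[a] >= image_thr for a in ATTRS)]

    # Faceset level: one attribute dominates at least 40 % of the images.
    if total and any(
        sum(s[a] >= image_thr for s in scores) / total >= dominance_thr
        for a in ATTRS
    ):
        return "quarantine_masked"

    # Min-survivors level: only fires when something was actually pruned,
    # so naturally small, occlusion-free facesets are left alone.
    survivors = total - len(flagged)
    if flagged and survivors < min_survivors:
        return "quarantine_thin"

    return "prune" if flagged else "keep"
```

With this shape, a hand-sorted faceset of three clean PNGs returns `keep`, because the `flagged` guard prevents the min-survivors rule from firing.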
## 6. Run results

| action | count | net effect |
|--------|------:|------------|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → `_masked/` (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → `_thin/` |

Net: 311 active → 255 active after the filter run. 763 PNGs quarantined
whole-faceset, 183 pruned within survivors. All dropped PNGs preserved at
`<faceset>/faces/_dropped/` for reversibility. Master manifest gained a
`masked[]` array parallel to `thin_eras[]`, plus an `occlusion_filter_run`
provenance block.

## 7. Known limitations

- **Per-faceset manifests are NOT updated by `apply`**: only the master
  manifest is. Each faceset's own `<faceset>/manifest.json` retains stale
  `faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless
  for `.fsz` consumers (the .fsz is re-zipped from current disk state) but
  downstream tools reading `faces[]` will see broken references. Discovered
  later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG
  warnings before being caught.

## 8. Re-running

```bash
# 1. Stage queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json

# 2. Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
    work/clip_dml/queue.json work/clip_dml/scores.json --batch 8

# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
    --scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
    --scores work/occlusion_scores.json --out work/occlusion_review

# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json
```

155
docs/analysis/dedup-and-roop-optimization.md
Normal file
@@ -0,0 +1,155 @@
# Corpus dedup + roop-unleashed optimization

_Run date: 2026-04-27. Driver scripts: `work/dedup_optimize.py`, `work/multiface_worker.py`._

After consolidation collapsed duplicate identities and age-extend slotted
new PNGs into era buckets, the corpus still carried artifacts that hurt
roop's averaged-embedding quality:

- **Burst-photo near-duplicates** within facesets, especially in
  immich-discovered identities where source libraries had many similar
  shots within seconds.
- **Cross-faceset byte-identical PNGs** that escaped consolidation's
  centroid-similarity matching when individual PNGs matched exactly but
  cluster centroids diverged.
- **Multi-face PNGs** that polluted identity averaging because the roop
  loader appends every detected face per PNG to the FaceSet (load-bearing
  invariant; see § 3).

This pipeline runs three independent passes and an optional fourth, all
moving dropped PNGs to `<faceset>/faces/_dropped/` for reversibility.

## 1. Cross-family byte-dedup

SHA256-hash every PNG in the active corpus (parallel I/O via
`ThreadPoolExecutor(max_workers=16)`, ~17 s for 5,386 PNGs over the
`/mnt/e/` Windows mount). Group by hash; for groups with members in
multiple identity families, keep the higher-tier copy.

**Family detection**: regex `^(faceset_\d+)(?:_.+)?$` captures the parent
identity. Same family includes parent + era splits (e.g. `faceset_001` +
`faceset_001_2010-13`); these are intentional duplications for the era
.fsz files and are preserved.

Run results: 20 cross-family hash groups → 24 PNGs dropped. Most cases were
small immich identity-cluster errors that consolidation missed because
individual PNG embeddings matched but the cluster mean did not.

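The family rule and the hash grouping fit in a few lines. This is a sketch under assumptions, not the code in `dedup_optimize.py`: it hashes in-memory bytes rather than files read through a thread pool, and the names `family_of` and `cross_family_groups` plus the `(faceset, filename)` key shape are hypothetical. Only the regex is taken from the text above.

```python
import hashlib
import re
from collections import defaultdict

# Regex from the doc: the first group is the parent identity ("family").
FAMILY_RE = re.compile(r"^(faceset_\d+)(?:_.+)?$")


def family_of(faceset_name):
    """Map 'faceset_001_2010-13' to 'faceset_001'; unknown names map to themselves."""
    match = FAMILY_RE.match(faceset_name)
    return match.group(1) if match else faceset_name


def cross_family_groups(png_bytes):
    """Return {sha256: members} for hashes whose members span more than one family.

    png_bytes: {(faceset_name, filename): raw PNG bytes} (hypothetical shape).
    Same-family duplicates (parent + era splits) are deliberately not reported.
    """
    by_hash = defaultdict(list)
    for (faceset, filename), data in png_bytes.items():
        by_hash[hashlib.sha256(data).hexdigest()].append((faceset, filename))
    return {
        digest: members
        for digest, members in by_hash.items()
        if len({family_of(faceset) for faceset, _ in members}) > 1
    }
```

Choosing which copy to keep within each reported group (the higher-tier one) is left out here; it depends on the tier table in the consolidation writeup.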
## 2. Within-faceset near-dup at sim ≥ 0.95

Per-faceset pairwise cosine similarity on cached arcface embeddings.
Connected components in the `sim ≥ 0.95` graph. Keep highest
`quality.composite` per component, drop the rest.

**Threshold rationale**: legitimate same-person-different-pose pairs land at
0.5–0.85; ≥ 0.95 means essentially the same shot (burst frames or
recompressed dupes). Roop's `FaceSet.AverageEmbeddings()` averages all faces
into `faces[0].embedding`, so a run of near-identical embeddings adds almost
nothing that a single copy would not. Removing them does not lose identity
information; it removes the extra weight the average would otherwise give to
the most-photographed moments.

Run results: 851 groups → **1,225 PNGs dropped** (23 % of corpus).
Most-affected: `faceset_026` (-132 of 262), `faceset_027` (-107),
`faceset_028` (-92), `faceset_030` (-92). All immich-discovered identities
where the source library had burst sequences.

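A minimal sketch of the component-and-keep-best step, assuming each face record carries a name, an embedding, and a scalar quality (the record shape and the name `near_dup_drops` are hypothetical). It uses a small union-find in pure Python; the real script presumably works on numpy arrays from the embedding cache.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def near_dup_drops(faces, threshold=0.95):
    """Return the names to drop: everything except the best face per component.

    faces: list of dicts with 'name', 'embedding', 'quality' (hypothetical shape).
    Components are connected components of the graph with an edge at sim >= threshold.
    """
    parent = list(range(len(faces)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(faces)):
        for j in range(i + 1, len(faces)):
            if cosine(faces[i]["embedding"], faces[j]["embedding"]) >= threshold:
                parent[find(i)] = find(j)

    best = {}  # component root -> index of the highest-quality face
    for i, face in enumerate(faces):
        root = find(i)
        if root not in best or face["quality"] > faces[best[root]]["quality"]:
            best[root] = i

    keep = set(best.values())
    return sorted(f["name"] for i, f in enumerate(faces) if i not in keep)
```

Connected components are acceptable here even though they chain, because at 0.95 every link in a chain is already "essentially the same shot"; the identity-merge step uses a much lower threshold, where chaining is a real hazard.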
## 3. Multi-face audit (load-bearing roop invariant)

The roop loader at `roop/ui/tabs/faceswap_tab.py:661–691` runs
`extract_face_images(filename, (False, 0))` on every PNG and **appends every
detected face** to `face_set.faces`. A multi-face PNG therefore pollutes the
averaged identity. The export-swap pipeline drops multi-face crops at
creation, but post-pipeline operations (consolidation, age-extend) move
PNGs across facesets without re-checking.

**This audit re-detects every PNG** with insightface FaceAnalysis and flags
any with `face_count ≠ 1` (filtered by `det_score ≥ 0.5` and
`face_short ≥ 40`). Includes:
- ≥ 2 faces → loader will inject extra identities into averaging
- 0 faces → insightface can't find a face on the cropped PNG; useless for
  roop, would silently fail

Run results: 4,146 PNGs scored, 332 flagged (272 with 2 faces, 9 with 3,
2 with 4, **49 with 0**). 82 facesets affected.

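The flagging rule reduces to counting detections that pass both filters. The sketch below works on plain dicts so it runs without insightface; the detection shape, the helper names, and the reading of `face_short` as the shorter side of the bounding box in pixels are assumptions.

```python
def count_valid_faces(detections, min_det_score=0.5, min_face_short=40):
    """Count detections that pass the score and size filters.

    detections: list of dicts with 'det_score' and 'bbox' = (x1, y1, x2, y2).
    face_short is assumed to mean the shorter bbox side in pixels.
    """
    count = 0
    for det in detections:
        x1, y1, x2, y2 = det["bbox"]
        face_short = min(x2 - x1, y2 - y1)
        if det["det_score"] >= min_det_score and face_short >= min_face_short:
            count += 1
    return count


def is_flagged(detections):
    """A PNG is flagged when it does not contain exactly one valid face."""
    return count_valid_faces(detections) != 1
```

A crop with one clear face and a tiny background face under 40 px passes, since the background face is filtered out before counting; a crop with no surviving detection is flagged just like one with two.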
## 4. DML throughput jump for face crops

The audit reuses the same insightface + onnxruntime-directml stack as
`embed_worker.py` but achieves **~19 img/s** on AMD Vega vs embed_worker's
2.6 img/s (same model, same hardware). The difference is input size:

| stage | typical input | DML throughput |
|-------|--------------|---------------:|
| `embed_worker.py` (Immich import) | 1024–4000 px source | 2.6 img/s |
| `multiface_worker.py` (this audit) | 512×512 face crops | **19 img/s** |

Detection on small inputs is fast; recognition on aligned 112×112 inputs is
the same cost either way. Implication: **any pipeline operating on
already-cropped face PNGs can rely on a roughly 7× higher DML throughput
ceiling than full-resolution embedding**.

## 5. Architecture

```
┌────────────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/dedup_optimize.py          │
│   • analyze: hashes + within-faceset sim           │
│   • apply: move + re-zip (no GPU)                  │
│   • stage_multiface: write queue.json              │
│   • merge_multiface: ingest worker results         │
│   • apply_multiface: move + re-zip                 │
│   • report: HTML audit                             │
└────────────┬───────────────────────────────────────┘
             │ queue.json via \\wsl.localhost\
             ▼
┌────────────────────────────────────────────────────┐
│ Windows C:\face_embed_venv\                        │
│   /opt/face-sets/work/multiface_worker.py          │
│   insightface FaceAnalysis on DmlExecutionProvider │
│   Reads PNGs from native E:\, writes face_count    │
└────────────────────────────────────────────────────┘
```

Reuses the existing `C:\face_embed_venv\` (no new venv needed; same
insightface stack as `embed_worker.py`).

## 6. Final corpus state (2026-04-27 night)

| metric | start of day | after occlusion filter | after consolidation | after age-extend | after this dedup + multiface |
|--------|-------------:|----------------------:|-------------------:|-----------------:|----------------------------:|
| active facesets | 311 | 255 | 181 | 181 | **181** |
| active PNGs | ~6,440 | 5,386 | 5,386 | 5,400 | **3,849** |
| `_masked/` | 0 | 51 | 51 | 51 | 51 |
| `_thin/` | 68 | 71 | 71 | 71 | 71 |
| `_merged/` | 0 | 0 | 74 | 74 | 74 |

Net reduction at the end of the day: **2,591 PNGs and 130 facesets** removed
or quarantined from the active pool. All preserved on disk for
reversibility (`<faceset>/faces/_dropped/` for prunes, `_masked/_merged/_thin/`
for quarantines).

## 7. Re-running

Run after any new import / consolidation / extend:

```bash
# 1. Byte-dedup + within-faceset near-dup (CPU only)
python work/dedup_optimize.py analyze --out work/dedup_audit/dedup_plan.json
python work/dedup_optimize.py apply --plan work/dedup_audit/dedup_plan.json

# 2. Multi-face audit on Windows DML (resumable)
python work/dedup_optimize.py stage_multiface --out work/dedup_audit/multiface_queue.json
"/mnt/c/face_embed_venv/Scripts/python.exe" work/multiface_worker.py \
    work/dedup_audit/multiface_queue.json work/dedup_audit/multiface_results.json
python work/dedup_optimize.py merge_multiface \
    --results work/dedup_audit/multiface_results.json \
    --out work/dedup_audit/multiface_plan.json
python work/dedup_optimize.py apply_multiface \
    --plan work/dedup_audit/multiface_plan.json

# 3. HTML audit
python work/dedup_optimize.py report \
    --dedup work/dedup_audit/dedup_plan.json \
    --multiface work/dedup_audit/multiface_plan.json \
    --out work/dedup_audit
```

170
docs/analysis/identity-consolidation-and-age-extend.md
Normal file
@@ -0,0 +1,170 @@
# Identity consolidation + age-bucket extension

_Run date: 2026-04-27. Driver scripts: `work/consolidate_facesets.py`, `work/age_extend_001.py`._

After the Immich peter + nic imports added 280 new facesets to a corpus that
had ~25 canonical identities, many "new" identities were duplicates of
existing household members at lower clustering confidence. Two cooperating
passes clean this up: identity consolidation merges duplicates, then
age-extend slots newly-merged PNGs into the existing era buckets of
`faceset_001`.

## 1. Identity consolidation

### 1.1 Approach

For each active faceset, pull cached arcface embeddings from
`work/cache/{nl_full,immich_peter,immich_nic}.npz` keyed by
`(source, bbox)` from the per-faceset manifest's `faces[]`. Compute
L2-normalized centroid. Pairwise cosine similarity matrix.

**Tier-based primary selection** (lowest tier number wins, size breaks ties):

| tier | sources | rationale |
|-----:|---------|-----------|
| 0 | `faceset_013..019` (hand-sorted) | user's curated labels |
| 1 | `faceset_001..012` (auto-clustered) | well-established household |
| 2 | `faceset_020..025` (osrc) | mixed-bucket discovery |
| 3 | `faceset_026..264` (immich peter) | speculative |
| 4 | `faceset_265+` (immich nic) | speculative |

**Era splits and quarantines excluded**: `faceset_NNN_<era>`, `_masked/`,
`_thin/` are skipped during analysis.

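The tier table translates directly into a sort key. The sketch below is one plausible reading rather than the script's code: `tier_of` and `pick_primary` are hypothetical names, and "size breaks ties" is assumed to mean that the faceset with more PNGs wins within a tier.

```python
def tier_of(faceset_number):
    """Tier from the table above; a lower tier is a more trusted source."""
    if 13 <= faceset_number <= 19:
        return 0  # hand-sorted
    if 1 <= faceset_number <= 12:
        return 1  # auto-clustered household
    if 20 <= faceset_number <= 25:
        return 2  # osrc
    if 26 <= faceset_number <= 264:
        return 3  # immich peter
    return 4      # immich nic (265+)


def pick_primary(group):
    """Pick the primary faceset number from a merge group.

    group: list of (faceset_number, png_count) pairs.
    Lowest tier wins; within a tier the larger faceset wins (assumed tie-break).
    """
    return min(group, key=lambda item: (tier_of(item[0]), -item[1]))[0]
```

Every other member of the group becomes a secondary and is moved to `_merged/` by the apply step.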
### 1.2 Single-linkage chains catastrophically: complete-linkage required

First attempt used connected-components on edge ≥ 0.45 → produced a
**60-faceset cluster** around `faceset_001` with min within-group sim of
**−0.16** (definitely-different people bridged via chains
`A↔B↔C` where `A`, `C` are not similar). Bumping to edge ≥ 0.55 still
chained (group of 17 with min 0.20).

Real fix: `scipy.cluster.hierarchy.linkage(method='complete')` then
`fcluster(Z, t=1-edge_threshold, criterion='distance')`. Complete-linkage
**guarantees** every within-group pair sim ≥ edge threshold. Without this
guarantee the report is unusable and the apply step would produce
identity-poisoned merges.

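The chaining failure is easy to reproduce on three facesets. The sketch below is a dependency-free illustration, not the scipy call the script uses: the similarity values are invented, and `complete_linkage_groups` is a naive greedy agglomeration that should agree with `linkage(method='complete')` plus `fcluster(..., criterion='distance')` at `t = 1 - edge`, up to tie-breaking.

```python
def single_linkage_groups(sim, edge):
    """Connected components of the graph with an edge wherever sim >= edge."""
    n = len(sim)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            i = stack.pop()
            if i in component:
                continue
            component.add(i)
            stack.extend(j for j in range(n) if j != i and sim[i][j] >= edge)
        seen |= component
        groups.append(sorted(component))
    return groups


def complete_linkage_groups(sim, edge):
    """Merge the two clusters whose worst cross pair is best; stop below edge.

    Every pair inside a returned group therefore has sim >= edge.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                worst = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or worst > best[0]:
                    best = (worst, a, b)
        if best[0] < edge:
            break
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Illustrative values: 0 and 2 are different people, both loosely similar to 1.
sim = [
    [1.00, 0.60, 0.10],
    [0.60, 1.00, 0.58],
    [0.10, 0.58, 1.00],
]
print(single_linkage_groups(sim, 0.55))    # [[0, 1, 2]]
print(complete_linkage_groups(sim, 0.55))  # [[0, 1], [2]]
```

Single-linkage merges all three because each link individually clears 0.55, even though the 0↔2 pair sits at 0.10. Complete-linkage refuses the second merge because the worst pair it would create falls below the edge threshold.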
### 1.3 Thresholds + run results

`edge=0.55`, `confident=0.65` → 48 multi-faceset groups (29 confident, 19
uncertain). Max group size 7; most groups are pairs or small triplets after
complete-linkage.

After applying all 48 (with `--include-uncertain` after visual approval):

- **74 facesets consumed** (some groups had multiple secondaries:
  `[10, 45, 135] → faceset_002`; `[113, 96, 178, 109, 110, 286] → faceset_095`;
  etc.)
- Active count 255 → 181
- Notable absorptions: `faceset_001` (peter) 707 → 753 PNGs (+ 7, 132, 151);
  `faceset_002` 209 → 247; `faceset_026` 60 → 262 (+ 168, 146, 325);
  `faceset_028` → 207
- Master manifest gained `merged[]` array (parallel to `thin_eras[]`); each
  entry has `merged_into` field pointing at the primary

### 1.4 Apply mechanics

Combine all PNGs from primary + secondaries, re-rank by existing
`quality.composite` desc (no re-enrich), renumber `0001..NNNN`, copy into a
fresh staging dir, atomic swap. Move secondary directories to
`_merged/<original_name>/` (preserved in full for reversibility). Re-zip
`_topN.fsz` and `_all.fsz`.

The primary's existing per-PNG quality scores are reused: re-ranking does
not require re-running `enrich`-equivalent landmarks/pose on the cropped
PNGs. The primary's `_dropped/` (from prior occlusion filter) is preserved
through the merge.

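The re-rank and renumber step can be sketched as a pure function that returns a copy plan. The entry shape (`file` plus a nested `quality.composite`), the `.png` target names, and the name `renumber_by_quality` are assumptions; copying into the staging dir and the atomic swap are left to the caller.

```python
def renumber_by_quality(entries):
    """Return (old_file, new_file) pairs, best quality.composite first.

    entries: list of dicts like {"file": "0007.png", "quality": {"composite": 0.8}}
    (hypothetical manifest shape). New names are zero-padded to at least 4 digits.
    """
    ranked = sorted(entries, key=lambda e: e["quality"]["composite"], reverse=True)
    width = max(4, len(str(len(ranked))))
    return [
        (entry["file"], f"{rank:0{width}d}.png")
        for rank, entry in enumerate(ranked, start=1)
    ]
```

Because `sorted` is stable, entries with equal scores keep their incoming order, so listing the primary's PNGs before the secondaries' keeps the primary ahead on ties.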
## 2. Age extension of faceset_001 era buckets

### 2.1 Why a follow-on pass

Consolidation absorbed faceset_007/132/151 into faceset_001 (+46 PNGs).
The original `age_split_001.py` had bucketed peter into 6 era anchors
(`_2005-10`, `_2010-13`, `_2011`, `_2014-17`, `_2018-19`, `_2018-20`), but
those new PNGs had never been seen by age_split. They sat in faceset_001's
parent-only set, missing from every era .fsz.

### 2.2 Era-label pitfall

The 6 anchor era labels are NOT strict year ranges. They are
`Counter(years).most_common(1)`-derived dom-years from the original sub-cluster:

| label | dom_year | actual span of members |
|-------|---------:|-----------------------:|
| `_2005-10` | 2010 | 2005–2010 |
| `_2010-13` | 2011 | **2007–2024** |
| `_2011` | 2011 | 2011 only |
| `_2014-17` | 2016 | 2005–2018 |
| `_2018-19` | 2018 | 2012–2020 |
| `_2018-20` | 2019 | 2014–2022 |

The clusters are *appearance-anchored*, not year-bounded. Year is a
descriptive label. Assignment rule must use dom-year, not member span.

### 2.3 Algorithm

For each unbucketed face entry in `faceset_001`'s manifest (50 of 753):

1. Look up embedding in cache by `(source, bbox)`.
2. Look up EXIF year via `work/cache/age_split_exif.json`; fetch on cache miss.
3. Find single nearest era anchor by cosine distance to its centroid.
4. Accept iff `dist ≤ 0.40` AND `|year − anchor.dom_year| ≤ 5`.
   These thresholds match `age_split_001.py`'s anchor-fragment rule.
5. Anchors are NOT re-centered after absorption (preserves age_split's
   drift-prevention guarantee).

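Steps 3 and 4 can be sketched as one function. The anchor shape (a centroid plus the member years), the names `dom_year` and `assign_era`, and the `None` return for skipped or rejected faces are assumptions; the thresholds and the `Counter(...).most_common(1)` derivation are taken from the text.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dom_year(years):
    """Most common year among an anchor's members."""
    return Counter(years).most_common(1)[0][0]


def assign_era(embedding, year, anchors, max_dist=0.40, max_year_delta=5):
    """Return the accepted era label, or None if skipped or rejected.

    anchors: {label: {"centroid": [...], "years": [...]}} (hypothetical shape).
    Anchors are only read here, never re-centered.
    """
    if year is None:
        return None  # no EXIF year available: skipped
    label, anchor = min(
        anchors.items(),
        key=lambda item: cosine_distance(embedding, item[1]["centroid"]),
    )
    dist = cosine_distance(embedding, anchor["centroid"])
    year_delta = abs(year - dom_year(anchor["years"]))
    return label if dist <= max_dist and year_delta <= max_year_delta else None
```

Only the single nearest anchor is tested. A face that fails the year check against its nearest anchor is rejected outright instead of falling through to the second-nearest one.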
### 2.4 Run results

50 unbucketed → 21 with EXIF year → **14 accepted**:

| anchor | dom_year | added |
|--------|---------:|------:|
| `_2005-10` | 2010 | +2 |
| `_2010-13` | 2011 | +1 |
| `_2014-17` | 2016 | **+9** |
| `_2018-20` | 2019 | +2 |

29 PNGs skipped for missing EXIF year (mostly immich-stripped
photos). 7 dist/year-rejected (e.g. two PNGs from 2025 want
`_2018-19` but year-delta 7 > 5).

### 2.5 Reconciliation side effect

The apply rebuilds each affected era bucket's `faces/` from staging. This
incidentally reconciled the per-bucket manifests with disk after the prior
occlusion filter run had left era manifests stale at 282/126/132 entries vs
~248/125/129 actual files (occlusion filter only updates the master
manifest, never per-faceset manifests; see
`docs/analysis/clip-occlusion-filter.md` §7). 42 occlusion-dropped era PNGs
inside the old `faces/_dropped/` were removed during rebuild. The
parent `faceset_001/faces/_dropped/` still has the corpus-level audit; all
source images are intact at `/mnt/x/src/`, so the era-level dropped PNGs
are regeneratable via `cmd_export_swap`.

## 3. Re-running

Always run both passes after any new identity import (Immich, osrc,
hand-sorted folder):

```bash
# 1. Find duplicate identities
python work/consolidate_facesets.py analyze \
    --out work/merge_review/candidates.json [--edge 0.55 --confident 0.65]
python work/consolidate_facesets.py report \
    --candidates work/merge_review/candidates.json --out work/merge_review
# inspect work/merge_review/index.html
python work/consolidate_facesets.py apply \
    --candidates work/merge_review/candidates.json [--include-uncertain]

# 2. Slot new faceset_001 PNGs into existing era buckets
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
python work/age_extend_001.py report \
    --candidates work/age_extend/candidates.json --out work/age_extend
python work/age_extend_001.py apply --candidates work/age_extend/candidates.json
```

Both are idempotent. `consolidate_facesets` skips secondaries already in
`_merged/`; `age_extend_001` recomputes anchor centroids + dom-year fresh
on every run.