Add post-export corpus maintenance pipeline
Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores via
  new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level quarantine
  at 40% domain dominance.
- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.
- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.
- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes — cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega — ~7x embed_worker because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes
preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full
reversibility. Master manifest gains masked[], merged[], plus per-run
provenance blocks. Three new docs/analysis/ writeups cover model choice,
threshold rationale, and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.md (29 lines changed)
@@ -331,6 +331,27 @@ from the saved `state.json` without re-fetching what was already done.

The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each normalized to `[0, 1]`.
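A minimal sketch of that weighting, assuming the five component scores are already normalized to `[0, 1]` (the helper name and input dict are illustrative, not the pipeline's actual API):

```python
def composite_quality(m: dict[str, float]) -> float:
    """Weighted sum from the formula above; every input assumed in [0, 1]."""
    return (0.30 * m["frontality"]
            + 0.20 * m["det_score"]
            + 0.20 * m["landmark_symmetry"]
            + 0.15 * m["face_size"]
            + 0.15 * m["sharpness"])

# a sharp, frontal, well-detected crop scores high
print(composite_quality({"frontality": 0.9, "det_score": 0.95,
                         "landmark_symmetry": 0.8, "face_size": 0.7,
                         "sharpness": 0.85}))  # ≈ 0.85
```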
## Post-export corpus maintenance

The `sort_faces.py` pipeline above produces `facesets_swap_ready/`. Four
orchestration scripts under `work/` operate on that already-built corpus to
clean it up over time:

| script | purpose |
|--------|---------|
| `work/filter_occlusions.py` (+ Windows `work/clip_worker.py`) | Drop PNGs of masked / sun-glassed faces using open_clip ViT-L-14/dfn2b_s39b zero-shot scoring. Image-level threshold 0.7; faceset-level quarantine at 40% domain dominance. WSL stages a queue, Windows DML scores, WSL applies. See `docs/analysis/clip-occlusion-filter.md`. |
| `work/consolidate_facesets.py` | Merge duplicate identities (centroid cosine sim ≥ 0.55 with confident ≥ 0.65, **complete-linkage** to defeat single-link chaining). Pulls embeddings from cache, no GPU. See `docs/analysis/identity-consolidation-and-age-extend.md`. |
| `work/age_extend_001.py` | Slot newly-added PNGs into existing era buckets of `faceset_001` (anchor cosine distance ≤ 0.40 AND `|year_delta|` ≤ 5). Same anchor-fragment rule as `age_split_001.py`. |
| `work/dedup_optimize.py` (+ Windows `work/multiface_worker.py`) | (a) cross-family SHA256 byte-dedup, (b) within-faceset near-dup at cosine sim ≥ 0.95, (c) multi-face audit (re-detect via insightface, drop PNGs with face_count ≠ 1). Multi-face is the load-bearing roop invariant. See `docs/analysis/dedup-and-roop-optimization.md`. |

All four operate idempotently and reversibly: dropped PNGs go to
`<faceset>/faces/_dropped/`, quarantined whole facesets go to
`facesets_swap_ready/_masked/` or `_merged/` (parallel to the existing
`_thin/`). The master `manifest.json` partitions entries across `facesets[]`,
`masked[]`, `thin_eras[]`, and `merged[]` arrays, plus per-run provenance
blocks (`occlusion_filter_run`, `merge_run`, `age_extend_runs`, `dedup_runs`,
`multiface_runs`).

## Downstream: roop-unleashed

The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.

@@ -350,11 +371,17 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with

├─ build_folders.py (hand-sorted-folder orchestration)
├─ check_faceset001_age.py (age-split readiness probe)
├─ age_split_001.py (age-split orchestration; faceset_001)
├─ age_extend_001.py (extends existing era buckets with new PNGs)
├─ cluster_osrc.py (mixed-bucket identity discovery)
├─ immich_stage.py (Immich library staging, parallel)
├─ embed_worker.py (Windows DML embed worker; C:\face_embed_venv\)
├─ cluster_immich.py (Immich identity discovery + export)
├─ finalize_immich.sh (chains queue → embed → cluster)
├─ filter_occlusions.py (CLIP zero-shot mask + sunglasses filter)
├─ clip_worker.py (Windows DML CLIP worker; C:\clip_dml_venv\)
├─ consolidate_facesets.py (duplicate-identity merger; complete-linkage)
├─ dedup_optimize.py (byte + near-dup + multi-face audit driver)
├─ multiface_worker.py (Windows DML multi-face audit worker)
├─ synthetic_*_manifest.json (per-run synthetic refine manifests)
├─ immich/
│  ├─ users.json (label -> userId map; gitignored)
docs/analysis/clip-occlusion-filter.md (new file, 154 lines)
@@ -0,0 +1,154 @@
# CLIP zero-shot occlusion filter (masks + sunglasses)

_Run date: 2026-04-27. Driver scripts: `work/filter_occlusions.py`, `work/clip_worker.py`._

## 1. Why

`facesets_swap_ready/` ended the Immich import day with 311 substantive
facesets and a long tail of identities whose clusters had latched onto
*eyewear or mask appearance* instead of identity (covid-era shots, vacation
photos with sunglasses dominating the frame). Two failure modes:

1. **Pollution of averaged identity** — roop's `FaceSet.AverageEmbeddings()`
   averages every face in the .fsz. A faceset where 40 % of images are
   sunglassed gives a biased centroid; the swap reproduces sunglass-shaped
   eye sockets.
2. **Whole-cluster identity drift** — clustering at the embedding level
   sometimes anchors on the eyewear silhouette rather than the face,
   producing clusters of "the same sunglasses across multiple people".

A targeted attribute scorer was the cleanest fix.

## 2. Model + prompts

**Model**: `open_clip` `ViT-L-14` / `dfn2b_s39b` (Apple Data Filtering Networks).
Best public zero-shot at this size. Loads weights from HF Hub (~890 MB).
Bit-identical scores between WSL CPU and Windows DML.

**Prompt design**: per-attribute ensembles of 5–6 positive + 5–6 negative
prompts. Each ensemble is mean-pooled and L2-normalized before softmax.

**Critical bug if forgotten**: CLIP cosine similarities are tiny (0.2–0.3
range). Raw `softmax([sim_pos, sim_neg])` collapses to ~0.5/0.5 on every
image. **Multiply by `model.logit_scale.exp()` (~100) before softmax.**
Without that scale the entire scorer outputs a uniform 0.5.
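To make the failure mode concrete, here is a minimal numpy sketch of the scoring step (plain numpy standing in for the torch code in `clip_worker.py`; the similarity values are illustrative):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

sim_pos, sim_neg = 0.28, 0.22           # typical raw CLIP cosine similarities
print(softmax(np.array([sim_pos, sim_neg])))          # ≈ [0.515, 0.485], useless
print(softmax(np.array([sim_pos, sim_neg]) * 100.0))  # ≈ [0.998, 0.002] with logit_scale applied
```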
**Sunglasses prompt pitfall**: the first set caught faces with sunglasses
*pushed up on the forehead* with the same probability as faces with
sunglasses *covering the eyes* — CLIP detects "presence of sunglasses in
frame", not "eyes occluded". Fixed by putting the false positive into the
*negative* class explicitly:

```
positive: "a face with dark sunglasses covering the eyes"
          "a portrait with the eyes hidden behind opaque sunglasses"
          ...
negative: "a face with sunglasses pushed up on the forehead, eyes visible below"
          "a face with sunglasses resting on top of the head, eyes visible"
          "a face wearing clear prescription eyeglasses with visible eyes"
          ...
```

Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead
→ 0.39. Threshold 0.7 cleanly separates.

## 3. Architecture

```
┌─────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/filter_occlusions.py │
│ • stage: walk facesets/, write queue.json │
│ • merge: ingest worker results │
│ • report: HTML contact sheet │
│ • apply: prune + quarantine + re-zip │
└────────────┬────────────────────────────────┘
             │ queue.json (paths) via \\wsl.localhost\
             ▼
┌─────────────────────────────────────────────┐
│ Windows C:\clip_dml_venv\ │
│ /opt/face-sets/work/clip_worker.py │
│ Python 3.12 + torch 2.4.1 CPU │
│ + torch-directml 0.2.5 + open_clip_torch │
│ Reads PNGs from native E:\, writes scores │
└─────────────────────────────────────────────┘
```

A separate Windows venv (not the existing `C:\face_embed_venv\`) is needed
because `torch-directml` brings ~1.5 GB of wheels and version-pinned
numpy/pillow that risk breaking the embed_worker venv's
`onnxruntime-directml` + `insightface` stack.

## 4. DML throughput surprise

Measured on AMD Radeon RX Vega:

| model | stack | throughput | speedup vs WSL CPU |
|-------|-------|-----------:|-------------------:|
| ViT-L-14 (CLIP, this filter) | open_clip | **1.43 img/s** | **2.4×** |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |

Only 2.4× because `aten::_native_multi_head_attention` is not implemented in
the directml plugin and falls back to CPU. The vision encoder runs on GPU,
attention runs on CPU per layer, both alternating. A silenced UserWarning
makes this near-invisible. Workable for a one-shot 73-min corpus run, but
the embed_worker pattern (pure ONNX) remains the gold standard for DML.

## 5. Thresholds (validated 2026-04-27 on 6,318 PNGs)

| level | threshold | semantics |
|-------|----------:|-----------|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40 % of images flagged for either attr | quarantine whole faceset to `_masked/` |
| min-survivors | < 5 surviving AND something pruned | quarantine to `_thin/` |

The `AND something pruned` guard is essential — without it, naturally-small
facesets (hand-sorted with ≤4 PNGs) get incorrectly quarantined for being
small even when they have zero occlusions.
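A minimal sketch of that faceset-level decision, assuming per-image flags have already been computed (function and return names are illustrative, not `filter_occlusions.py`'s actual API):

```python
def faceset_action(n_total: int, n_flagged: int, *,
                   domain_frac: float = 0.40, min_survivors: int = 5) -> str:
    """Decide what to do with one faceset after image-level scoring."""
    survivors = n_total - n_flagged
    if n_total and n_flagged / n_total >= domain_frac:
        return "quarantine_masked"          # whole faceset to _masked/
    if n_flagged > 0 and survivors < min_survivors:
        return "quarantine_thin"            # only fires if something was pruned
    return "prune" if n_flagged else "keep"

print(faceset_action(10, 5))   # quarantine_masked (50% flagged ≥ 40%)
print(faceset_action(6, 2))    # quarantine_thin   (4 survivors, something pruned)
print(faceset_action(4, 0))    # keep              (naturally small, untouched)
```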
## 6. Run results

| action | count | net effect |
|--------|------:|------------|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → `_masked/` (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → `_thin/` |

Net: 311 active → 255 active after the filter run. 763 PNGs quarantined
whole-faceset, 183 pruned within survivors. All dropped PNGs preserved at
`<faceset>/faces/_dropped/` for reversibility. Master manifest gained a
`masked[]` array parallel to `thin_eras[]`, plus an `occlusion_filter_run`
provenance block.

## 7. Known limitations

- **Per-faceset manifests are NOT updated by `apply`** — only the master
  manifest is. Each faceset's own `<faceset>/manifest.json` retains stale
  `faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless
  for `.fsz` consumers (the .fsz is re-zipped from current disk state) but
  downstream tools reading `faces[]` will see broken references. Discovered
  later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG
  warnings before being caught.

## 8. Re-running

```bash
# 1. Stage queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json

# 2. Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
    work/clip_dml/queue.json work/clip_dml/scores.json --batch 8

# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
    --scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
    --scores work/occlusion_scores.json --out work/occlusion_review

# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json
```
docs/analysis/dedup-and-roop-optimization.md (new file, 155 lines)
@@ -0,0 +1,155 @@
# Corpus dedup + roop-unleashed optimization

_Run date: 2026-04-27. Driver scripts: `work/dedup_optimize.py`, `work/multiface_worker.py`._

After consolidation collapsed duplicate identities and age-extend slotted
new PNGs into era buckets, the corpus still carried artifacts that hurt
roop's averaged-embedding quality:

- **Burst-photo near-duplicates** within facesets, especially in
  immich-discovered identities where source libraries had many similar
  shots within seconds.
- **Cross-faceset byte-identical PNGs** that escaped consolidation's
  centroid-similarity matching when individual PNGs matched exactly but
  cluster centroids diverged.
- **Multi-face PNGs** that polluted identity averaging because the roop
  loader appends every detected face per PNG to the FaceSet (load-bearing
  invariant — see § 3).

This pipeline runs three independent passes and an optional fourth, all
moving dropped PNGs to `<faceset>/faces/_dropped/` for reversibility.

## 1. Cross-family byte-dedup

SHA256-hash every PNG in the active corpus (parallel I/O via
`ThreadPoolExecutor(max_workers=16)`, ~17 s for 5,386 PNGs over the
`/mnt/e/` Windows mount). Group by hash; for groups with members in
multiple identity families, keep the higher-tier copy.

**Family detection**: regex `^(faceset_\d+)(?:_.+)?$` — captures the parent
identity. Same family includes parent + era splits (e.g. `faceset_001` +
`faceset_001_2010-13`); these are intentional duplications for the era
.fsz files and are preserved.
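A minimal sketch of the grouping step under the `<faceset>/faces/NNNN.png` layout described above (the keep-the-higher-tier resolution is omitted; names are illustrative, not `dedup_optimize.py`'s actual API):

```python
import hashlib
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

FAMILY_RE = re.compile(r"^(faceset_\d+)(?:_.+)?$")  # parent identity; era splits share a family

def sha256_of(png: Path) -> str:
    return hashlib.sha256(png.read_bytes()).hexdigest()

def cross_family_dupe_groups(pngs: list[Path]) -> list[list[Path]]:
    """Return byte-identical groups whose members span more than one identity family."""
    with ThreadPoolExecutor(max_workers=16) as ex:   # parallel I/O over the Windows mount
        digests = list(ex.map(sha256_of, pngs))
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for png, digest in zip(pngs, digests):
        by_hash[digest].append(png)
    groups = []
    for members in by_hash.values():
        families = set()
        for p in members:                            # <faceset>/faces/NNNN.png
            m = FAMILY_RE.match(p.parent.parent.name)
            if m:
                families.add(m.group(1))
        if len(families) > 1:                        # intra-family era duplication is kept
            groups.append(members)
    return groups
```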
Run results: 20 cross-family hash groups → 24 PNGs dropped. Most cases were
small immich identity-cluster errors that consolidation missed because
individual PNG embeddings matched but the cluster mean did not.

## 2. Within-faceset near-dup at sim ≥ 0.95

Per-faceset pairwise cosine similarity on cached arcface embeddings.
Connected components in the `sim ≥ 0.95` graph. Keep highest
`quality.composite` per component, drop the rest.

**Threshold rationale**: legitimate same-person-different-pose pairs land at
0.5–0.85; ≥ 0.95 means essentially the same shot (burst frames or
recompressed dupes). Roop's `FaceSet.AverageEmbeddings()` averages all faces
into `faces[0].embedding`; near-identical embeddings averaged ≈ averaging
once. Removing them does not lose identity information; it removes a bias
weight on the most-photographed moments.
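A minimal sketch of the per-faceset selection, assuming an `(n, d)` array of L2-normalized embeddings and a parallel list of `quality.composite` scores (a simple union-find over the ≥ 0.95 graph; illustrative rather than the driver's actual code):

```python
import numpy as np

def near_dup_drops(emb: np.ndarray, quality: list[float], thresh: float = 0.95) -> list[int]:
    """emb: (n, d) L2-normalized rows. Return indices to drop, keeping the best-quality PNG per component."""
    n = emb.shape[0]
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path compression
            i = parent[i]
        return i

    sim = emb @ emb.T                        # cosine similarity matrix
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= thresh:
                parent[find(i)] = find(j)    # union the near-duplicate pair

    components: dict[int, list[int]] = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)

    drops = []
    for members in components.values():
        keep = max(members, key=lambda i: quality[i])   # highest composite survives
        drops.extend(i for i in members if i != keep)
    return drops
```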
Run results: 851 groups → **1,225 PNGs dropped** (23 % of corpus).
Most-affected: `faceset_026` (-132 of 262), `faceset_027` (-107),
`faceset_028` (-92), `faceset_030` (-92). All immich-discovered identities
where the source library had burst sequences.

## 3. Multi-face audit (load-bearing roop invariant)

The roop loader at `roop/ui/tabs/faceswap_tab.py:661–691` runs
`extract_face_images(filename, (False, 0))` on every PNG and **appends every
detected face** to `face_set.faces`. A multi-face PNG therefore pollutes the
averaged identity. The export-swap pipeline drops multi-face crops at
creation, but post-pipeline operations (consolidation, age-extend) move
PNGs across facesets without re-checking.

**This audit re-detects every PNG** with insightface FaceAnalysis and flags
any with `face_count ≠ 1` (filtered by `det_score ≥ 0.5` and
`face_short ≥ 40`). Two flag classes (a minimal sketch of the rule follows
the list):

- ≥ 2 faces → loader will inject extra identities into averaging
- 0 faces → insightface can't find a face on the cropped PNG; useless for
  roop, would silently fail
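The sketch below assumes detections come back as insightface face objects carrying `det_score` and `bbox` attributes; threshold names mirror the text, and the function names are illustrative rather than `multiface_worker.py` verbatim:

```python
def count_valid_faces(faces, det_min: float = 0.5, short_side_min: int = 40) -> int:
    """Count detections that are confident enough and large enough to matter."""
    n = 0
    for f in faces:
        x1, y1, x2, y2 = f.bbox
        short_side = min(x2 - x1, y2 - y1)
        if f.det_score >= det_min and short_side >= short_side_min:
            n += 1
    return n

def flag(faces) -> str | None:
    n = count_valid_faces(faces)
    if n == 0:
        return "no_face"        # useless for roop, would silently fail
    if n > 1:
        return "multi_face"     # loader would average in a second identity
    return None                 # exactly one face: the invariant holds
```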
Run results: 4,146 PNGs scored, 332 flagged (272 with 2 faces, 9 with 3,
2 with 4, **49 with 0**). 82 facesets affected.

## 4. DML throughput jump for face crops

The audit reuses the same insightface + onnxruntime-directml stack as
`embed_worker.py` but achieves **~19 img/s** on AMD Vega vs embed_worker's
2.6 img/s — same model, same hardware. The difference is input size:

| stage | typical input | DML throughput |
|-------|--------------|---------------:|
| `embed_worker.py` (Immich import) | 1024–4000 px source | 2.6 img/s |
| `multiface_worker.py` (this audit) | 512×512 face crops | **19 img/s** |

Detection on small inputs is fast; recognition on aligned 112×112 inputs is
the same cost either way. Implication: **any pipeline operating on
already-cropped face PNGs can rely on a roughly 7× higher DML throughput
ceiling than full-resolution embedding**.

## 5. Architecture

```
┌────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/dedup_optimize.py │
│ • analyze: hashes + within-faceset sim │
│ • apply: move + re-zip (no GPU) │
│ • stage_multiface: write queue.json │
│ • merge_multiface: ingest worker results │
│ • apply_multiface: move + re-zip │
│ • report: HTML audit │
└────────────┬───────────────────────────────┘
             │ queue.json via \\wsl.localhost\
             ▼
┌────────────────────────────────────────────┐
│ Windows C:\face_embed_venv\ │
│ /opt/face-sets/work/multiface_worker.py │
│ insightface FaceAnalysis on DmlExecutionProvider │
│ Reads PNGs from native E:\, writes face_count │
└────────────────────────────────────────────┘
```

Reuses the existing `C:\face_embed_venv\` (no new venv needed — same
insightface stack as `embed_worker.py`).

## 6. Final corpus state (2026-04-27 night)

| metric | start of day | after occlusion filter | after consolidation | after age-extend | after this dedup + multiface |
|--------|-------------:|----------------------:|-------------------:|-----------------:|----------------------------:|
| active facesets | 311 | 255 | 181 | 181 | **181** |
| active PNGs | ~6,440 | 5,386 | 5,386 | 5,400 | **3,849** |
| `_masked/` | 0 | 51 | 51 | 51 | 51 |
| `_thin/` | 68 | 71 | 71 | 71 | 71 |
| `_merged/` | 0 | 0 | 74 | 74 | 74 |

Net reduction at the end of the day: **2,591 PNGs and 130 facesets** removed
or quarantined from the active pool. All preserved on disk for
reversibility (`<faceset>/faces/_dropped/` for prunes, `_masked/_merged/_thin/`
for quarantines).

## 7. Re-running

Run after any new import / consolidation / extend:

```bash
# 1. Byte-dedup + within-faceset near-dup (CPU only)
python work/dedup_optimize.py analyze --out work/dedup_audit/dedup_plan.json
python work/dedup_optimize.py apply --plan work/dedup_audit/dedup_plan.json

# 2. Multi-face audit on Windows DML (resumable)
python work/dedup_optimize.py stage_multiface --out work/dedup_audit/multiface_queue.json
"/mnt/c/face_embed_venv/Scripts/python.exe" work/multiface_worker.py \
    work/dedup_audit/multiface_queue.json work/dedup_audit/multiface_results.json
python work/dedup_optimize.py merge_multiface \
    --results work/dedup_audit/multiface_results.json \
    --out work/dedup_audit/multiface_plan.json
python work/dedup_optimize.py apply_multiface \
    --plan work/dedup_audit/multiface_plan.json

# 3. HTML audit
python work/dedup_optimize.py report \
    --dedup work/dedup_audit/dedup_plan.json \
    --multiface work/dedup_audit/multiface_plan.json \
    --out work/dedup_audit
```
docs/analysis/identity-consolidation-and-age-extend.md (new file, 170 lines)
@@ -0,0 +1,170 @@
# Identity consolidation + age-bucket extension

_Run date: 2026-04-27. Driver scripts: `work/consolidate_facesets.py`, `work/age_extend_001.py`._

After the Immich peter + nic imports added 280 new facesets to a corpus that
had ~25 canonical identities, many "new" identities were duplicates of
existing household members at lower clustering confidence. Two cooperating
passes clean this up: identity consolidation merges duplicates, then
age-extend slots newly-merged PNGs into the existing era buckets of
`faceset_001`.

## 1. Identity consolidation

### 1.1 Approach

For each active faceset, pull cached arcface embeddings from
`work/cache/{nl_full,immich_peter,immich_nic}.npz` keyed by
`(source, bbox)` from the per-faceset manifest's `faces[]`. Compute
L2-normalized centroid. Pairwise cosine similarity matrix.

**Tier-based primary selection** (lowest tier number wins, size breaks ties):

| tier | sources | rationale |
|-----:|---------|-----------|
| 0 | `faceset_013..019` (hand-sorted) | user's curated labels |
| 1 | `faceset_001..012` (auto-clustered) | well-established household |
| 2 | `faceset_020..025` (osrc) | mixed-bucket discovery |
| 3 | `faceset_026..264` (immich peter) | speculative |
| 4 | `faceset_265+` (immich nic) | speculative |

**Era splits and quarantines excluded** — `faceset_NNN_<era>`, `_masked/`,
`_thin/` are skipped during analysis.

### 1.2 Single-linkage chains catastrophically — complete-linkage required

First attempt used connected-components on edge ≥ 0.45 → produced a
**60-faceset cluster** around `faceset_001` with min within-group sim of
**−0.16** (definitely-different people bridged via chains
`A↔B↔C` where `A`, `C` are not similar). Bumping to edge ≥ 0.55 still
chained (group of 17 with min 0.20).

Real fix: `scipy.cluster.hierarchy.linkage(method='complete')` then
`fcluster(Z, t=1-edge_threshold, criterion='distance')`. Complete-linkage
**guarantees** every within-group pair sim ≥ edge threshold. Without this
guarantee the report is unusable and the apply step would produce
identity-poisoned merges.
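A minimal sketch of that clustering step over faceset centroids, assuming an `(n, d)` matrix of L2-normalized centroids (variable names are illustrative, not the script's actual API):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def complete_link_groups(centroids: np.ndarray, edge: float = 0.55) -> list[list[int]]:
    """Group facesets so every within-group pair has cosine sim >= edge."""
    dist = 1.0 - centroids @ centroids.T             # cosine distance matrix
    np.fill_diagonal(dist, 0.0)
    dist = np.clip(dist, 0.0, None)                  # guard tiny negative fp error
    Z = linkage(squareform(dist, checks=False), method="complete")
    labels = fcluster(Z, t=1.0 - edge, criterion="distance")
    groups: dict[int, list[int]] = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    return [g for g in groups.values() if len(g) > 1]   # multi-faceset merge candidates
```

Because the complete-linkage merge height is the maximum pairwise distance inside a cluster, cutting at `t = 1 − edge` is exactly the "every within-group pair sim ≥ edge" guarantee described above.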
### 1.3 Thresholds + run results

`edge=0.55`, `confident=0.65` → 48 multi-faceset groups (29 confident, 19
uncertain). Max group size 7; mostly bilateral pairs or small triplets after
complete-linkage.

After applying all 48 (with `--include-uncertain` after visual approval):

- **74 facesets consumed** (some groups had multiple secondaries:
  `[10, 45, 135] → faceset_002`; `[113, 96, 178, 109, 110, 286] → faceset_095`;
  etc.)
- Active count 255 → 181
- Notable absorptions: `faceset_001` (peter) 707 → 753 PNGs (+ 7, 132, 151);
  `faceset_002` 209 → 247; `faceset_026` 60 → 262 (+ 168, 146, 325);
  `faceset_028` → 207
- Master manifest gained a `merged[]` array (parallel to `thin_eras[]`); each
  entry has a `merged_into` field pointing at the primary

### 1.4 Apply mechanics

Combine all PNGs from primary + secondaries, re-rank by existing
`quality.composite` desc (no re-enrich), renumber `0001..NNNN`, copy into a
fresh staging dir, atomic swap. Move secondary directories to
`_merged/<original_name>/` (preserved in full for reversibility). Re-zip
`_topN.fsz` and `_all.fsz`.

The primary's existing per-PNG quality scores are reused — re-ranking does
not require re-running `enrich`-equivalent landmarks/pose on the cropped
PNGs. The primary's `_dropped/` (from the prior occlusion filter) is preserved
through the merge.

## 2. Age extension of faceset_001 era buckets

### 2.1 Why a follow-on pass

Consolidation absorbed faceset_007/132/151 into faceset_001 (+46 PNGs).
The original `age_split_001.py` had bucketed peter into 6 era anchors
(`_2005-10`, `_2010-13`, `_2011`, `_2014-17`, `_2018-19`, `_2018-20`), but
those new PNGs had never been seen by age_split. They sat in faceset_001's
parent-only set, missing from every era .fsz.

### 2.2 Era-label pitfall

The 6 anchor era labels are NOT strict year ranges. They are
`Counter(years).most_common(1)`-derived dom-years from the original sub-cluster:

| label | dom_year | actual span of members |
|-------|---------:|-----------------------:|
| `_2005-10` | 2010 | 2005–2010 |
| `_2010-13` | 2011 | **2007–2024** |
| `_2011` | 2011 | 2011 only |
| `_2014-17` | 2016 | 2005–2018 |
| `_2018-19` | 2018 | 2012–2020 |
| `_2018-20` | 2019 | 2014–2022 |

The clusters are *appearance-anchored*, not year-bounded. Year is a
descriptive label. The assignment rule must use dom-year, not member span.

### 2.3 Algorithm

For each unbucketed face entry in `faceset_001`'s manifest (50 of 753); a
condensed sketch of the accept rule follows the list:

1. Look up the embedding in cache by `(source, bbox)`.
2. Look up the EXIF year via `work/cache/age_split_exif.json`; fetch on cache miss.
3. Find the single nearest era anchor by cosine distance to its centroid.
4. Accept iff `dist ≤ 0.40` AND `|year − anchor.dom_year| ≤ 5`.
   These thresholds match `age_split_001.py`'s anchor-fragment rule.
5. Anchors are NOT re-centered after absorption (preserves age_split's
   drift-prevention guarantee).
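A condensed sketch of steps 3–4 (the full implementation is `work/age_extend_001.py`, included later in this commit; this just isolates the accept rule, with illustrative names):

```python
import numpy as np

DIST_MAX, YEAR_MAX = 0.40, 5

def assign(v: np.ndarray, year: int, anchors: list[dict]) -> str | None:
    """v: L2-normalized face embedding. anchors: [{'name', 'centroid', 'dom_year'}, ...]."""
    best = min(anchors, key=lambda a: 1.0 - float(np.dot(a["centroid"], v)))
    dist = 1.0 - float(np.dot(best["centroid"], v))
    if dist <= DIST_MAX and abs(year - best["dom_year"]) <= YEAR_MAX:
        return best["name"]          # slot into this era bucket
    return None                      # stays in the parent-only set
```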
### 2.4 Run results

50 unbucketed → 21 with EXIF year → **14 accepted**:

| anchor | dom_year | added |
|--------|---------:|------:|
| `_2005-10` | 2010 | +2 |
| `_2010-13` | 2011 | +1 |
| `_2014-17` | 2016 | **+9** |
| `_2018-20` | 2019 | +2 |

29 PNGs skipped for missing EXIF year (mostly immich-stripped
photos). 7 dist/year-rejected (e.g. two PNGs from 2025 want
`_2018-19` but year-delta 7 > 5).

### 2.5 Reconciliation side effect

The apply rebuilds each affected era bucket's `faces/` from staging. This
incidentally reconciled the per-bucket manifests with disk after the prior
occlusion filter run had left era manifests stale at 282/126/132 entries vs
~248/125/129 actual files (occlusion filter only updates the master
manifest, never per-faceset manifests — see
`docs/analysis/clip-occlusion-filter.md` §7). 42 occlusion-dropped era PNGs
inside the old `faces/_dropped/` were removed during rebuild. The
parent `faceset_001/faces/_dropped/` still has the corpus-level audit; all
source images are intact at `/mnt/x/src/`, so the era-level dropped PNGs
are regeneratable via `cmd_export_swap`.

## 3. Re-running

Always run both passes after any new identity import (Immich, osrc,
hand-sorted folder):

```bash
# 1. Find duplicate identities
python work/consolidate_facesets.py analyze \
    --out work/merge_review/candidates.json [--edge 0.55 --confident 0.65]
python work/consolidate_facesets.py report \
    --candidates work/merge_review/candidates.json --out work/merge_review
# inspect work/merge_review/index.html
python work/consolidate_facesets.py apply \
    --candidates work/merge_review/candidates.json [--include-uncertain]

# 2. Slot new faceset_001 PNGs into existing era buckets
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
python work/age_extend_001.py report \
    --candidates work/age_extend/candidates.json --out work/age_extend
python work/age_extend_001.py apply --candidates work/age_extend/candidates.json
```

Both are idempotent. `consolidate_facesets` skips secondaries already in
`_merged/`; `age_extend_001` recomputes anchor centroids + dom-year fresh
on every run.
work/age_extend_001.py (new file, 576 lines)
@@ -0,0 +1,576 @@
|
||||
"""Extend the existing 6 era buckets of faceset_001 by absorbing PNGs that
|
||||
post-date the original age_split run (from consolidation merges, etc.).
|
||||
|
||||
Mirrors the anchor-fragment assignment logic in age_split_001.py:
|
||||
- For each unbucketed face in faceset_001's manifest, find the nearest active
|
||||
era anchor by cosine distance to the anchor's centroid.
|
||||
- Accept the assignment iff dist <= 0.40 AND |year_delta| <= 5
|
||||
(where year_delta = exif_year(face) - dom_year(anchor)).
|
||||
- Undated PNGs are skipped (no assignment).
|
||||
- Anchors are NOT re-centered after absorption (preserves the same drift
|
||||
guarantees as the original age_split).
|
||||
|
||||
CLI:
|
||||
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
|
||||
python work/age_extend_001.py report --candidates ... --out work/age_extend
|
||||
python work/age_extend_001.py apply --candidates ... [--dry-run]
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import shutil
|
||||
import sys
|
||||
import time
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from PIL import Image, ExifTags
|
||||
|
||||
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
|
||||
PARENT = "faceset_001"
|
||||
ACTIVE_ERAS = [
|
||||
"faceset_001_2005-10",
|
||||
"faceset_001_2010-13",
|
||||
"faceset_001_2011",
|
||||
"faceset_001_2014-17",
|
||||
"faceset_001_2018-19",
|
||||
"faceset_001_2018-20",
|
||||
]
|
||||
CACHES = [
|
||||
Path("/opt/face-sets/work/cache/nl_full.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_peter.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_nic.npz"),
|
||||
]
|
||||
EXIF_CACHE = Path("/opt/face-sets/work/cache/age_split_exif.json")
|
||||
|
||||
# anchor-fragment thresholds (mirror age_split_001.py)
|
||||
DIST_MAX = 0.40
|
||||
YEAR_MAX = 5
|
||||
|
||||
|
||||
# ----------------------------- caches -----------------------------
|
||||
|
||||
def load_caches():
|
||||
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
|
||||
alias_map: dict[str, str] = {}
|
||||
for c in CACHES:
|
||||
if not c.exists():
|
||||
print(f"[warn] cache missing: {c}", file=sys.stderr)
|
||||
continue
|
||||
d = np.load(c, allow_pickle=True)
|
||||
emb = d["embeddings"]
|
||||
meta = json.loads(str(d["meta"]))
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if len(face_records) != len(emb):
|
||||
raise SystemExit(f"meta/emb mismatch in {c}: {len(face_records)} vs {len(emb)}")
|
||||
if "path_aliases" in d.files:
|
||||
paliases = json.loads(str(d["path_aliases"]))
|
||||
for canon, alist in paliases.items():
|
||||
alias_map.setdefault(canon, canon)
|
||||
for a in alist:
|
||||
alias_map[a] = canon
|
||||
for i, rec in enumerate(face_records):
|
||||
p = rec["path"]
|
||||
bbox = tuple(int(x) for x in rec["bbox"])
|
||||
v = emb[i].astype(np.float32)
|
||||
n = float(np.linalg.norm(v))
|
||||
if n > 0:
|
||||
v = v / n
|
||||
rec_index[(p, bbox)] = v
|
||||
alias_map.setdefault(p, p)
|
||||
print(f"[cache] indexed {len(rec_index)} face records, {len(alias_map)} aliases", file=sys.stderr)
|
||||
return rec_index, alias_map
|
||||
|
||||
|
||||
def lookup_emb(rec_index, alias_map, src: str, bbox):
|
||||
bbox_t = tuple(int(x) for x in bbox)
|
||||
canon = alias_map.get(src, src)
|
||||
v = rec_index.get((canon, bbox_t))
|
||||
if v is None and canon != src:
|
||||
v = rec_index.get((src, bbox_t))
|
||||
return v
|
||||
|
||||
|
||||
# ----------------------------- exif -----------------------------
|
||||
|
||||
def load_exif_cache():
|
||||
if not EXIF_CACHE.exists():
|
||||
return {}
|
||||
return json.loads(EXIF_CACHE.read_text())
|
||||
|
||||
|
||||
def save_exif_cache(cache):
|
||||
tmp = EXIF_CACHE.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(cache, indent=2))
|
||||
tmp.replace(EXIF_CACHE)
|
||||
|
||||
|
||||
def exif_year(path: Path) -> int | None:
|
||||
try:
|
||||
with Image.open(path) as im:
|
||||
ex = im._getexif()
|
||||
if not ex:
|
||||
return None
|
||||
for tag_id, val in ex.items():
|
||||
tag = ExifTags.TAGS.get(tag_id, tag_id)
|
||||
if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
|
||||
return int(val[:4])
|
||||
except Exception:
|
||||
return None
|
||||
return None
|
||||
|
||||
|
||||
def get_year(src: str, exif_cache) -> int | None:
|
||||
"""Return EXIF year for src, using cache. Mutates cache for new lookups."""
|
||||
if src in exif_cache:
|
||||
return exif_cache[src]
|
||||
p = Path(src)
|
||||
y = exif_year(p) if p.exists() else None
|
||||
exif_cache[src] = y
|
||||
return y
|
||||
|
||||
|
||||
# ----------------------------- analyze -----------------------------
|
||||
|
||||
def cmd_analyze(args):
|
||||
rec_index, alias_map = load_caches()
|
||||
exif_cache = load_exif_cache()
|
||||
exif_cache_dirty = False
|
||||
|
||||
parent_dir = ROOT / PARENT
|
||||
parent_manifest = json.loads((parent_dir / "manifest.json").read_text())
|
||||
parent_faces = parent_manifest.get("faces", [])
|
||||
print(f"[parent] {PARENT}: {len(parent_faces)} face entries", file=sys.stderr)
|
||||
|
||||
# Build "in_bucket" set + each anchor's centroid + dom_year
|
||||
anchors = []
|
||||
in_bucket: set[tuple[str, tuple[int, int, int, int]]] = set()
|
||||
for era in ACTIVE_ERAS:
|
||||
ed = ROOT / era
|
||||
if not ed.is_dir():
|
||||
print(f"[warn] missing era bucket: {era}", file=sys.stderr)
|
||||
continue
|
||||
em = json.loads((ed / "manifest.json").read_text())
|
||||
emb_list = []
|
||||
years = []
|
||||
n_missing_emb = 0
|
||||
for f in em.get("faces", []):
|
||||
src = f.get("source")
|
||||
bbox = f.get("bbox")
|
||||
if not src or not bbox:
|
||||
continue
|
||||
key = (alias_map.get(src, src), tuple(int(x) for x in bbox))
|
||||
in_bucket.add(key)
|
||||
in_bucket.add((src, tuple(int(x) for x in bbox))) # cover both alias and raw
|
||||
v = lookup_emb(rec_index, alias_map, src, bbox)
|
||||
if v is None:
|
||||
n_missing_emb += 1
|
||||
else:
|
||||
emb_list.append(v)
|
||||
y = get_year(src, exif_cache)
|
||||
if y is None:
|
||||
exif_cache_dirty = True
|
||||
else:
|
||||
years.append(y)
|
||||
if src not in exif_cache:
|
||||
exif_cache_dirty = True
|
||||
if not emb_list:
|
||||
print(f"[warn] {era}: no embeddings found, skipping anchor", file=sys.stderr)
|
||||
continue
|
||||
arr = np.stack(emb_list).astype(np.float32)
|
||||
c = arr.mean(axis=0)
|
||||
n = float(np.linalg.norm(c))
|
||||
if n > 0:
|
||||
c = c / n
|
||||
dom_year = Counter(years).most_common(1)[0][0] if years else None
|
||||
anchors.append({
|
||||
"name": era, "centroid": c, "n_faces": len(em.get("faces", [])),
|
||||
"n_emb_used": len(emb_list), "n_emb_missing": n_missing_emb,
|
||||
"dom_year": dom_year,
|
||||
"year_min": min(years) if years else None,
|
||||
"year_max": max(years) if years else None,
|
||||
})
|
||||
print(f"[anchor] {era}: n={len(em.get('faces', []))} emb_used={len(emb_list)} "
|
||||
f"emb_miss={n_missing_emb} dom_year={dom_year} years=[{min(years) if years else '-'}..{max(years) if years else '-'}]",
|
||||
file=sys.stderr)
|
||||
|
||||
# Find unbucketed faces in parent
|
||||
unbucketed = []
|
||||
for f in parent_faces:
|
||||
src = f.get("source")
|
||||
bbox = f.get("bbox")
|
||||
if not src or not bbox:
|
||||
continue
|
||||
bbox_t = tuple(int(x) for x in bbox)
|
||||
key1 = (alias_map.get(src, src), bbox_t)
|
||||
key2 = (src, bbox_t)
|
||||
if key1 in in_bucket or key2 in in_bucket:
|
||||
continue
|
||||
unbucketed.append(f)
|
||||
print(f"[parent] {len(unbucketed)} unbucketed face entries (in {PARENT} but no era bucket)", file=sys.stderr)
|
||||
|
||||
# Score each unbucketed face against every anchor
|
||||
proposals = []
|
||||
skipped_no_emb = 0
|
||||
skipped_no_year = 0
|
||||
for f in unbucketed:
|
||||
src = f["source"]
|
||||
bbox = f["bbox"]
|
||||
v = lookup_emb(rec_index, alias_map, src, bbox)
|
||||
if v is None:
|
||||
skipped_no_emb += 1
|
||||
continue
|
||||
y = get_year(src, exif_cache)
|
||||
if y is None:
|
||||
skipped_no_year += 1
|
||||
exif_cache_dirty = True
|
||||
continue
|
||||
if src not in exif_cache:
|
||||
exif_cache_dirty = True
|
||||
# nearest anchor
|
||||
best = None # (dist, idx)
|
||||
for i, a in enumerate(anchors):
|
||||
d = 1.0 - float(np.dot(a["centroid"], v))
|
||||
if best is None or d < best[0]:
|
||||
best = (d, i)
|
||||
if best is None:
|
||||
continue
|
||||
dist, bidx = best
|
||||
anchor = anchors[bidx]
|
||||
year_delta = abs(y - anchor["dom_year"]) if anchor["dom_year"] is not None else None
|
||||
accept = (dist <= DIST_MAX and year_delta is not None and year_delta <= YEAR_MAX)
|
||||
proposals.append({
|
||||
"png": f["png"],
|
||||
"source": src,
|
||||
"bbox": [int(x) for x in bbox],
|
||||
"year": y,
|
||||
"rank_in_parent": f.get("rank"),
|
||||
"quality_composite": f.get("quality", {}).get("composite"),
|
||||
"quality": f.get("quality", {}),
|
||||
"best_anchor": anchor["name"],
|
||||
"best_anchor_dom_year": anchor["dom_year"],
|
||||
"centroid_dist": round(dist, 4),
|
||||
"year_delta": year_delta,
|
||||
"accept": bool(accept),
|
||||
"all_anchor_dists": {
|
||||
a["name"]: round(1.0 - float(np.dot(a["centroid"], v)), 4) for a in anchors
|
||||
},
|
||||
})
|
||||
|
||||
if exif_cache_dirty:
|
||||
save_exif_cache(exif_cache)
|
||||
print(f"[exif] cache flushed ({len(exif_cache)} entries total)", file=sys.stderr)
|
||||
|
||||
# Summarize
|
||||
accepted = [p for p in proposals if p["accept"]]
|
||||
rejected = [p for p in proposals if not p["accept"]]
|
||||
by_anchor = Counter(p["best_anchor"] for p in accepted)
|
||||
print(f"[summary] unbucketed={len(unbucketed)} scored={len(proposals)} "
|
||||
f"accepted={len(accepted)} rejected={len(rejected)} "
|
||||
f"skipped(no_emb={skipped_no_emb}, no_year={skipped_no_year})", file=sys.stderr)
|
||||
for k, v in by_anchor.most_common():
|
||||
print(f" {k}: +{v}", file=sys.stderr)
|
||||
|
||||
out = {
|
||||
"thresholds": {"dist_max": DIST_MAX, "year_max": YEAR_MAX},
|
||||
"anchors": [
|
||||
{k: v for k, v in a.items() if k != "centroid"}
|
||||
for a in anchors
|
||||
],
|
||||
"n_unbucketed": len(unbucketed),
|
||||
"skipped": {"no_emb": skipped_no_emb, "no_year": skipped_no_year},
|
||||
"proposals": sorted(proposals, key=lambda p: (not p["accept"], p["best_anchor"], -1 * (p["quality_composite"] or 0))),
|
||||
"by_anchor": dict(by_anchor),
|
||||
}
|
||||
op = Path(args.out)
|
||||
op.parent.mkdir(parents=True, exist_ok=True)
|
||||
op.write_text(json.dumps(out, indent=2))
|
||||
print(f"[done] {len(proposals)} proposals -> {op}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- report -----------------------------
|
||||
|
||||
def cmd_report(args):
|
||||
cand = json.loads(Path(args.candidates).read_text())
|
||||
out_dir = Path(args.out)
|
||||
thumbs_dir = out_dir / "thumbs"
|
||||
thumbs_dir.mkdir(parents=True, exist_ok=True)
|
||||
THUMB = 140
|
||||
|
||||
def make_thumb(png_relpath: str) -> str:
|
||||
# png_relpath looks like "faces/0042.png"
|
||||
src = ROOT / PARENT / png_relpath
|
||||
name = Path(png_relpath).stem
|
||||
dst = thumbs_dir / f"{name}.jpg"
|
||||
if not dst.exists():
|
||||
try:
|
||||
img = Image.open(src).convert("RGB")
|
||||
img.thumbnail((THUMB, THUMB), Image.LANCZOS)
|
||||
img.save(dst, "JPEG", quality=82)
|
||||
except Exception as e:
|
||||
print(f"[thumb-skip] {src}: {e}", file=sys.stderr)
|
||||
return ""
|
||||
return f"thumbs/{name}.jpg"
|
||||
|
||||
# group accepted proposals by target anchor
|
||||
by_anchor: dict[str, list] = {}
|
||||
rejected = []
|
||||
for p in cand["proposals"]:
|
||||
if p["accept"]:
|
||||
by_anchor.setdefault(p["best_anchor"], []).append(p)
|
||||
else:
|
||||
rejected.append(p)
|
||||
|
||||
rows = []
|
||||
rows.append("<h1>faceset_001 age extension — review</h1>")
|
||||
rows.append(f"<p>{cand['n_unbucketed']} unbucketed faces in {PARENT}; "
|
||||
f"{sum(len(v) for v in by_anchor.values())} accepted / {len(rejected)} rejected; "
|
||||
f"thresholds dist≤{cand['thresholds']['dist_max']} AND |year_delta|≤{cand['thresholds']['year_max']}.</p>")
|
||||
nav = " · ".join(f"<a href='#{a}'>{a} (+{len(by_anchor[a])})</a>" for a in by_anchor) + " · <a href='#rejected'>rejected</a>"
|
||||
rows.append(f"<div class='nav'>{nav}</div>")
|
||||
|
||||
for anchor_name in ACTIVE_ERAS:
|
||||
if anchor_name not in by_anchor:
|
||||
continue
|
||||
items = by_anchor[anchor_name]
|
||||
anchor_meta = next((a for a in cand["anchors"] if a["name"] == anchor_name), {})
|
||||
rows.append(f"<section id='{anchor_name}' class='grp'>")
|
||||
rows.append(f"<h2>{anchor_name} <small>(dom_year={anchor_meta.get('dom_year')}; "
|
||||
f"existing n={anchor_meta.get('n_faces')}; +{len(items)} new)</small></h2>")
|
||||
rows.append("<div class='cells'>")
|
||||
for p in sorted(items, key=lambda x: (x["centroid_dist"], -1 * (x["quality_composite"] or 0))):
|
||||
thumb = make_thumb(p["png"])
|
||||
cls = "hi" if p["centroid_dist"] <= 0.30 else "mid"
|
||||
rows.append(
|
||||
f"<div class='cell'>"
|
||||
f"<img src='{thumb}' loading='lazy' title='{p['png']}'>"
|
||||
f"<div class='meta'>{p['png']}<br>year {p['year']} (Δ{p['year_delta']})<br>"
|
||||
f"<span class='{cls}'>dist {p['centroid_dist']:.3f}</span></div>"
|
||||
f"</div>"
|
||||
)
|
||||
rows.append("</div></section>")
|
||||
|
||||
if rejected:
|
||||
rows.append("<section id='rejected' class='grp rej'>")
|
||||
rows.append(f"<h2>rejected <small>({len(rejected)} faces don't fit any anchor)</small></h2>")
|
||||
rows.append("<div class='cells'>")
|
||||
for p in sorted(rejected, key=lambda x: x["centroid_dist"])[:200]:
|
||||
thumb = make_thumb(p["png"])
|
||||
why = []
|
||||
if p["centroid_dist"] > cand['thresholds']['dist_max']:
|
||||
why.append(f"dist {p['centroid_dist']:.2f}>{cand['thresholds']['dist_max']}")
|
||||
if p["year_delta"] is None or p["year_delta"] > cand['thresholds']['year_max']:
|
||||
why.append(f"yΔ{p['year_delta']}>{cand['thresholds']['year_max']}")
|
||||
rows.append(
|
||||
f"<div class='cell'>"
|
||||
f"<img src='{thumb}' loading='lazy'>"
|
||||
f"<div class='meta'>{p['png']}<br>year {p['year']} → best {p['best_anchor']}<br>"
|
||||
f"<span class='lo'>{'; '.join(why)}</span></div>"
|
||||
f"</div>"
|
||||
)
|
||||
if len(rejected) > 200:
|
||||
rows.append(f"<p>...{len(rejected)-200} more truncated.</p>")
|
||||
rows.append("</div></section>")
|
||||
|
||||
html = f"""<!doctype html>
|
||||
<html><head><meta charset='utf-8'><title>faceset_001 age extension</title>
|
||||
<style>
|
||||
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
|
||||
h1 {{ margin-top:0; }} h2 {{ margin:0; }}
|
||||
small {{ color:#999; font-weight:normal; }}
|
||||
section.grp {{ background:#1a1a1a; border-radius:6px; padding:12px; margin:12px 0; }}
|
||||
section.grp.rej {{ border-left:4px solid #ff5050; }}
|
||||
.cells {{ display:flex; flex-wrap:wrap; gap:6px; }}
|
||||
.cell {{ background:#222; border-radius:4px; padding:4px; width:160px; font-size:11px; font-family:monospace; text-align:center; }}
|
||||
.cell img {{ height:140px; width:auto; border-radius:3px; }}
|
||||
.meta {{ padding-top:4px; line-height:1.3; }}
|
||||
.hi {{ color:#5fa05f; font-weight:bold; }}
|
||||
.mid {{ color:#ffb050; }}
|
||||
.lo {{ color:#ff5050; }}
|
||||
.nav {{ position:sticky; top:0; background:#111; padding:.5em 0; border-bottom:1px solid #333; font-size:13px; }}
|
||||
a {{ color:#6cf; }}
|
||||
</style></head>
|
||||
<body>
|
||||
{''.join(rows)}
|
||||
</body></html>"""
|
||||
out_html = out_dir / "index.html"
|
||||
out_html.write_text(html)
|
||||
print(f"[done] {out_html}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- apply -----------------------------
|
||||
|
||||
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
|
||||
import zipfile
|
||||
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
|
||||
for i, p in enumerate(pngs):
|
||||
zf.write(p, arcname=f"{i:04d}.png")
|
||||
|
||||
|
||||
def cmd_apply(args):
|
||||
cand = json.loads(Path(args.candidates).read_text())
|
||||
accepted = [p for p in cand["proposals"] if p["accept"]]
|
||||
if args.dry_run:
|
||||
from collections import Counter as C
|
||||
by = C(p["best_anchor"] for p in accepted)
|
||||
print(f"=== dry-run: {len(accepted)} assignments across {len(by)} anchors ===")
|
||||
for k, v in by.most_common():
|
||||
print(f" {k}: +{v}")
|
||||
return
|
||||
|
||||
parent_dir = ROOT / PARENT
|
||||
master_path = ROOT / "manifest.json"
|
||||
master = json.loads(master_path.read_text())
|
||||
facesets_by_name = {f["name"]: f for f in master.get("facesets", [])}
|
||||
|
||||
by_anchor: dict[str, list] = {}
|
||||
for p in accepted:
|
||||
by_anchor.setdefault(p["best_anchor"], []).append(p)
|
||||
|
||||
total_added = 0
|
||||
for anchor_name, props in by_anchor.items():
|
||||
ed = ROOT / anchor_name
|
||||
em_path = ed / "manifest.json"
|
||||
em = json.loads(em_path.read_text())
|
||||
existing = list(em.get("faces", []))
|
||||
|
||||
# gather new entries with their source PNG paths in faceset_001/faces/
|
||||
new_with_src = []
|
||||
for p in props:
|
||||
src_png = parent_dir / p["png"]
|
||||
if not src_png.exists():
|
||||
print(f"[warn] missing parent PNG {src_png}; skip", file=sys.stderr)
|
||||
continue
|
||||
face_entry = {
|
||||
"source": p["source"],
|
||||
"bbox": p["bbox"],
|
||||
"quality": p["quality"],
|
||||
"exif_year": p["year"],
|
||||
"centroid_dist_at_assign": p["centroid_dist"],
|
||||
"year_delta_at_assign": p["year_delta"],
|
||||
"extended_from_parent": True,
|
||||
}
|
||||
new_with_src.append((face_entry, src_png))
|
||||
|
||||
# combine; rank by quality.composite desc (existing entries already have rank,
|
||||
# but we re-rank globally so new entries slot in by quality)
|
||||
combined: list[tuple[dict, Path | None]] = []
|
||||
for f in existing:
|
||||
combined.append((f, None))
|
||||
combined.extend(new_with_src)
|
||||
combined.sort(key=lambda x: -x[0].get("quality", {}).get("composite", 0))
|
||||
|
||||
# stage fresh
|
||||
staging = ed / "_faces_new"
|
||||
if staging.exists():
|
||||
shutil.rmtree(staging)
|
||||
staging.mkdir()
|
||||
new_face_entries = []
|
||||
for new_rank, (face, src_png_or_none) in enumerate(combined, start=1):
|
||||
new_name = f"{new_rank:04d}.png"
|
||||
if src_png_or_none is None:
|
||||
# existing entry: copy from current era bucket faces/
|
||||
old_name = Path(face["png"]).name
|
||||
src = ed / "faces" / old_name
|
||||
if not src.exists():
|
||||
print(f"[warn] {anchor_name}: missing existing PNG {src}; skip", file=sys.stderr)
|
||||
continue
|
||||
shutil.copy2(src, staging / new_name)
|
||||
else:
|
||||
shutil.copy2(src_png_or_none, staging / new_name)
|
||||
face = dict(face)
|
||||
face["rank"] = new_rank
|
||||
face["png"] = f"faces/{new_name}"
|
||||
new_face_entries.append(face)
|
||||
|
||||
# swap dirs
|
||||
old_holding = ed / "_faces_old"
|
||||
if old_holding.exists():
|
||||
shutil.rmtree(old_holding)
|
||||
(ed / "faces").rename(old_holding)
|
||||
staging.rename(ed / "faces")
|
||||
shutil.rmtree(old_holding)
|
||||
|
||||
# re-zip .fsz
|
||||
survivor_pngs = sorted((ed / "faces").glob("*.png"))
|
||||
top_n = em.get("top_n", 30)
|
||||
top_n_eff = min(top_n, len(survivor_pngs))
|
||||
for old in ed.glob("*.fsz"):
|
||||
old.unlink()
|
||||
top_fsz_name = f"{anchor_name}_top{top_n_eff}.fsz"
|
||||
all_fsz_name = f"{anchor_name}_all.fsz"
|
||||
_zip_png_list(survivor_pngs[:top_n_eff], ed / top_fsz_name)
|
||||
if len(survivor_pngs) > top_n_eff:
|
||||
_zip_png_list(survivor_pngs, ed / all_fsz_name)
|
||||
all_fsz_used = all_fsz_name
|
||||
else:
|
||||
all_fsz_used = None
|
||||
|
||||
# update local + master manifests
|
||||
em["faces"] = new_face_entries
|
||||
em["exported"] = len(new_face_entries)
|
||||
em["fsz_top"] = top_fsz_name
|
||||
em["fsz_all"] = all_fsz_used
|
||||
em["top_n"] = top_n_eff
|
||||
em.setdefault("age_extend_history", []).append({
|
||||
"added": len(new_with_src),
|
||||
"thresholds": cand["thresholds"],
|
||||
})
|
||||
em_path.write_text(json.dumps(em, indent=2))
|
||||
|
||||
if anchor_name in facesets_by_name:
|
||||
facesets_by_name[anchor_name]["exported"] = len(new_face_entries)
|
||||
facesets_by_name[anchor_name]["fsz_top"] = top_fsz_name
|
||||
facesets_by_name[anchor_name]["fsz_all"] = all_fsz_used
|
||||
facesets_by_name[anchor_name]["top_n"] = top_n_eff
|
||||
|
||||
added_here = len(new_with_src)
|
||||
total_added += added_here
|
||||
print(f"[applied] {anchor_name}: +{added_here} (now {len(new_face_entries)} faces)", file=sys.stderr)
|
||||
|
||||
# rewrite master with ordering preserved
|
||||
new_facesets = []
|
||||
for entry in master.get("facesets", []):
|
||||
new_facesets.append(facesets_by_name.get(entry["name"], entry))
|
||||
master["facesets"] = new_facesets
|
||||
master.setdefault("age_extend_runs", []).append({
|
||||
"parent": PARENT,
|
||||
"thresholds": cand["thresholds"],
|
||||
"anchors": list(by_anchor.keys()),
|
||||
"added_total": total_added,
|
||||
})
|
||||
tmp = master_path.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(master, indent=2))
|
||||
tmp.replace(master_path)
|
||||
print(f"[done] +{total_added} faces across {len(by_anchor)} anchors", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- main -----------------------------
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
sub = ap.add_subparsers(dest="cmd", required=True)
|
||||
|
||||
a = sub.add_parser("analyze")
|
||||
a.add_argument("--out", required=True)
|
||||
a.set_defaults(func=cmd_analyze)
|
||||
|
||||
r = sub.add_parser("report")
|
||||
r.add_argument("--candidates", required=True)
|
||||
r.add_argument("--out", required=True)
|
||||
r.set_defaults(func=cmd_report)
|
||||
|
||||
p = sub.add_parser("apply")
|
||||
p.add_argument("--candidates", required=True)
|
||||
p.add_argument("--dry-run", action="store_true")
|
||||
p.set_defaults(func=cmd_apply)
|
||||
|
||||
args = ap.parse_args()
|
||||
args.func(args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
work/clip_worker.py (new file, 221 lines)
@@ -0,0 +1,221 @@
|
||||
"""Windows / DirectML CLIP worker for occlusion scoring.
|
||||
|
||||
Reads a queue.json staged by /opt/face-sets/work/filter_occlusions.py (WSL side),
|
||||
runs open_clip ViT-L-14 (dfn2b_s39b) on each PNG via torch-directml on the AMD
|
||||
Vega, and writes a scores.json with mask + sunglasses softmax probabilities.
|
||||
|
||||
CLI:
|
||||
py -3.12 clip_worker.py <queue.json> <out_scores.json> [--limit N] [--batch 8]
|
||||
|
||||
queue.json shape: list of objects
|
||||
{"wsl_path": "...", "win_path": "E:\\...\\faceset_NNN\\faces\\NNNN.png",
|
||||
"faceset": "faceset_NNN", "file": "NNNN.png"}
|
||||
|
||||
scores.json shape:
|
||||
{"model": "ViT-L-14/dfn2b_s39b",
|
||||
"logit_scale": 100.0,
|
||||
"prompts": {...},
|
||||
"results": [{"wsl_path": "...", "faceset": "...", "file": "...",
|
||||
"mask": float, "sunglasses": float}],
|
||||
"processed": [wsl_path, ...]}
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
import warnings
|
||||
from pathlib import Path
|
||||
|
||||
# DML emits a verbose UserWarning per attention call -- silence at import time
|
||||
warnings.filterwarnings("ignore", category=UserWarning)
|
||||
|
||||
import torch
|
||||
import torch_directml
|
||||
import open_clip
|
||||
from PIL import Image
|
||||
|
||||
MODEL_NAME = "ViT-L-14"
|
||||
PRETRAINED = "dfn2b_s39b"
|
||||
|
||||
# kept in sync with /opt/face-sets/work/filter_occlusions.py PROMPTS
|
||||
PROMPTS = {
|
||||
"mask": {
|
||||
"pos": [
|
||||
"a photo of a person wearing a surgical face mask",
|
||||
"a photo of a person wearing an FFP2 respirator covering mouth and nose",
|
||||
"a photo of a person wearing a cloth face mask",
|
||||
"a face partially covered by a medical mask",
|
||||
"a person whose mouth and nose are hidden by a face mask",
|
||||
],
|
||||
"neg": [
|
||||
"a photo of a person's face with mouth and nose clearly visible",
|
||||
"a clear, unobstructed photo of a face",
|
||||
"a photo of a face without any mask or covering",
|
||||
"a portrait of a person showing their full face",
|
||||
"a photo of a person with a beard and visible mouth",
|
||||
],
|
||||
},
|
||||
"sunglasses": {
|
||||
"pos": [
|
||||
"a face with dark sunglasses covering the eyes",
|
||||
"a portrait with the eyes hidden behind opaque sunglasses",
|
||||
"a person wearing dark sunglasses over their eyes, eyes not visible",
|
||||
"a face where the eyes are completely concealed by tinted lenses",
|
||||
"a close-up portrait wearing aviator sunglasses on the eyes",
|
||||
],
|
||||
"neg": [
|
||||
"a portrait with both eyes clearly visible and uncovered",
|
||||
"a face with sunglasses pushed up on the forehead, eyes visible below",
|
||||
"a face with sunglasses resting on top of the head, eyes visible",
|
||||
"a person with sunglasses hanging from their shirt, eyes visible",
|
||||
"a face wearing clear prescription eyeglasses with visible eyes",
|
||||
"a portrait with no eyewear and visible eyes",
|
||||
],
|
||||
},
|
||||
}
|
||||
|
||||
FLUSH_EVERY = 100
|
||||
|
||||
|
||||
def load_existing(out_path: Path):
|
||||
if not out_path.exists():
|
||||
return None, set()
|
||||
try:
|
||||
d = json.loads(out_path.read_text())
|
||||
processed = set(d.get("processed", []))
|
||||
return d, processed
|
||||
except Exception as e:
|
||||
print(f"[warn] could not parse existing {out_path}: {e}; starting fresh", file=sys.stderr)
|
||||
return None, set()
|
||||
|
||||
|
||||
def save_atomic(out_path: Path, data: dict):
|
||||
tmp = out_path.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(data, indent=2))
|
||||
os.replace(tmp, out_path)
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def build_text_features(model, tokenizer, device):
|
||||
out = {}
|
||||
for attr, sides in PROMPTS.items():
|
||||
feats = {}
|
||||
for side in ("pos", "neg"):
|
||||
tokens = tokenizer(sides[side]).to(device)
|
||||
f = model.encode_text(tokens)
|
||||
f = f / f.norm(dim=-1, keepdim=True)
|
||||
mean = f.mean(dim=0)
|
||||
feats[side] = mean / mean.norm()
|
||||
out[attr] = (feats["pos"], feats["neg"])
|
||||
return out
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("queue", type=Path)
|
||||
ap.add_argument("out", type=Path)
|
||||
ap.add_argument("--limit", type=int, default=None)
|
||||
ap.add_argument("--batch", type=int, default=8)
|
||||
args = ap.parse_args()
|
||||
|
||||
queue = json.loads(args.queue.read_text())
|
||||
print(f"[queue] {len(queue)} entries from {args.queue}")
|
||||
|
||||
args.out.parent.mkdir(parents=True, exist_ok=True)
|
||||
existing, processed = load_existing(args.out)
|
||||
if existing:
|
||||
print(f"[resume] {len(processed)} entries already scored")
|
||||
results = existing.get("results", [])
|
||||
else:
|
||||
results = []
|
||||
|
||||
pending = [e for e in queue if e["wsl_path"] not in processed]
|
||||
if args.limit is not None:
|
||||
pending = pending[: args.limit]
|
||||
print(f"[pending] {len(pending)} entries to score")
|
||||
|
||||
if not pending:
|
||||
print("[done] nothing to do")
|
||||
return
|
||||
|
||||
device = torch_directml.device()
|
||||
print(f"[load] {MODEL_NAME}/{PRETRAINED} on {torch_directml.device_name(0)}")
|
||||
t0 = time.time()
|
||||
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_NAME, pretrained=PRETRAINED)
|
||||
tokenizer = open_clip.get_tokenizer(MODEL_NAME)
|
||||
model = model.to(device).eval()
|
||||
logit_scale = float(model.logit_scale.exp().detach().cpu())
|
||||
print(f"[load] ready in {time.time()-t0:.1f}s logit_scale={logit_scale:.2f}")
|
||||
text_feats = build_text_features(model, tokenizer, device)
|
||||
|
||||
def flush():
|
||||
save_atomic(args.out, {
|
||||
"model": f"{MODEL_NAME}/{PRETRAINED}",
|
||||
"logit_scale": logit_scale,
|
||||
"prompts": PROMPTS,
|
||||
"results": results,
|
||||
"processed": sorted(processed),
|
||||
})
|
||||
|
||||
n_done_this_run = 0
|
||||
n_load_err = 0
|
||||
last_flush = time.time()
|
||||
t_start = time.time()
|
||||
|
||||
for i in range(0, len(pending), args.batch):
|
||||
chunk = pending[i:i + args.batch]
|
||||
imgs = []
|
||||
keep = []
|
||||
for entry in chunk:
|
||||
try:
|
||||
img = Image.open(entry["win_path"]).convert("RGB")
|
||||
imgs.append(preprocess(img))
|
||||
keep.append(entry)
|
||||
except Exception as e:
|
||||
print(f"[skip] {entry['win_path']}: {e}", file=sys.stderr)
|
||||
n_load_err += 1
|
||||
processed.add(entry["wsl_path"])
|
||||
if not imgs:
|
||||
continue
|
||||
x = torch.stack(imgs).to(device)
|
||||
with torch.no_grad():
|
||||
feats = model.encode_image(x)
|
||||
feats = feats / feats.norm(dim=-1, keepdim=True)
|
||||
scores_per_attr = {}
|
||||
for attr, (pos, neg) in text_feats.items():
|
||||
sims = torch.stack([feats @ pos, feats @ neg], dim=1) * logit_scale
|
||||
probs = sims.softmax(dim=1)[:, 0].detach().cpu().tolist()
|
||||
scores_per_attr[attr] = probs
|
||||
for j, entry in enumerate(keep):
|
||||
results.append({
|
||||
"wsl_path": entry["wsl_path"],
|
||||
"faceset": entry["faceset"],
|
||||
"file": entry["file"],
|
||||
"mask": round(scores_per_attr["mask"][j], 4),
|
||||
"sunglasses": round(scores_per_attr["sunglasses"][j], 4),
|
||||
})
|
||||
processed.add(entry["wsl_path"])
|
||||
n_done_this_run += 1
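        # Flush cadence (illustrative note): with FLUSH_EVERY = 100 the modulo check below
        # fires as the running count crosses each multiple of 100 scored images, and the
        # 30-second timer backstops runs where batches come in very slowly.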
|
||||
|
||||
if (n_done_this_run % FLUSH_EVERY < args.batch) or (time.time() - last_flush) > 30.0:
|
||||
flush()
|
||||
last_flush = time.time()
|
||||
elapsed = time.time() - t_start
|
||||
rate = n_done_this_run / max(0.1, elapsed)
|
||||
eta_min = (len(pending) - n_done_this_run) / max(0.1, rate) / 60.0
|
||||
print(f"[score] {n_done_this_run}/{len(pending)} "
|
||||
f"rate={rate:.2f} img/s eta={eta_min:.1f}min "
|
||||
f"load_err={n_load_err}", flush=True)
|
||||
|
||||
flush()
|
||||
elapsed = time.time() - t_start
|
||||
print(f"[done] {n_done_this_run} scored, {n_load_err} load errors, "
|
||||
f"{elapsed:.1f}s ({n_done_this_run/max(0.1,elapsed):.2f} img/s) -> {args.out}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
work/consolidate_facesets.py (new file, 634 lines)
@@ -0,0 +1,634 @@
|
||||
"""Consolidate facesets_swap_ready/ — find duplicate identities and merge.
|
||||
|
||||
Pipeline:
|
||||
1. analyze: pull arcface embeddings from work/cache/*.npz for every PNG in every
|
||||
active faceset (skipping _masked, _thin, era splits). Compute L2-normalized
|
||||
centroid per faceset. Build similarity graph at sim>=0.45, extract components.
|
||||
Pick primary per component by tier (hand-sorted > auto > osrc > immich) + size.
|
||||
2. report: HTML contact sheet at work/merge_review/index.html grouped by
|
||||
candidate cluster, with top-3 thumbs per faceset, all pairwise sims, and
|
||||
"merge X,Y -> Z" plan. Confident edges (sim>=0.65) are highlighted.
|
||||
3. apply: combine PNGs of secondaries into primary, re-rank by quality.composite
|
||||
descending, renumber 0001..NNNN, re-zip _topN.fsz + _all.fsz, move secondaries
|
||||
to facesets_swap_ready/_merged/<name>/, update master manifest with
|
||||
`merged[]` array + `merge_run` provenance block.
|
||||
|
||||
Embeddings come from caches (no GPU re-embed needed); the original clusterer used
|
||||
exactly these vectors so they are the right yardstick. Era splits are excluded
|
||||
entirely (intentional time-period segmentation, not a duplication).
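
Illustrative end-to-end invocation (paths are placeholders, not canonical locations):

    python work/consolidate_facesets.py analyze --out work/merge_review/candidates.json
    python work/consolidate_facesets.py report  --candidates work/merge_review/candidates.json --out work/merge_review
    python work/consolidate_facesets.py apply   --candidates work/merge_review/candidates.json --dry-run
    python work/consolidate_facesets.py apply   --candidates work/merge_review/candidates.json

analyze and report only write under --out; only apply (without --dry-run) touches
facesets_swap_ready/ itself.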
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import shutil
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from PIL import Image
|
||||
from scipy.cluster.hierarchy import linkage, fcluster
|
||||
from scipy.spatial.distance import squareform
|
||||
|
||||
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
|
||||
CACHES = [
|
||||
Path("/opt/face-sets/work/cache/nl_full.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_peter.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_nic.npz"),
|
||||
]
|
||||
|
||||
ERA_SPLIT_RE = re.compile(r"^faceset_\d+_(?:\d{4}-\d{2,4}|\d{4}|undated)$")
|
||||
|
||||
|
||||
# ----------------------------- helpers -----------------------------
|
||||
|
||||
def load_caches():
|
||||
"""Return (rec_index, alias_map). rec_index keyed by (path, bbox_tuple)
|
||||
-> embedding (np.float32, shape (512,) L2-normalized).
|
||||
alias_map maps every alias path -> canonical path."""
|
||||
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
|
||||
alias_map: dict[str, str] = {}
|
||||
n_total = 0
|
||||
for c in CACHES:
|
||||
if not c.exists():
|
||||
print(f"[warn] cache missing: {c}", file=sys.stderr)
|
||||
continue
|
||||
d = np.load(c, allow_pickle=True)
|
||||
emb = d["embeddings"]
|
||||
meta = json.loads(str(d["meta"]))
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if len(face_records) != len(emb):
|
||||
raise SystemExit(f"meta/emb mismatch in {c}: {len(face_records)} vs {len(emb)}")
|
||||
# path_aliases may be present
|
||||
if "path_aliases" in d.files:
|
||||
paliases = json.loads(str(d["path_aliases"]))
|
||||
for canon, alist in paliases.items():
|
||||
alias_map.setdefault(canon, canon)
|
||||
for a in alist:
|
||||
alias_map[a] = canon
|
||||
for i, rec in enumerate(face_records):
|
||||
p = rec["path"]
|
||||
bbox = tuple(int(x) for x in rec["bbox"])
|
||||
v = emb[i].astype(np.float32)
|
||||
n = float(np.linalg.norm(v))
|
||||
if n > 0:
|
||||
v = v / n
|
||||
rec_index[(p, bbox)] = v
|
||||
alias_map.setdefault(p, p)
|
||||
print(f"[cache] {c.name}: +{len(face_records)} face records (running total {len(rec_index)})", file=sys.stderr)
|
||||
n_total += len(face_records)
|
||||
print(f"[cache] indexed {n_total} face records, {len(alias_map)} path aliases", file=sys.stderr)
|
||||
return rec_index, alias_map
|
||||
|
||||
|
||||
def faceset_tier(name: str) -> int:
|
||||
"""Lower number = higher priority for primary selection."""
|
||||
m = re.match(r"^faceset_0*(\d+)$", name)
|
||||
if not m:
|
||||
return 99 # unknown structure
|
||||
n = int(m.group(1))
|
||||
if 13 <= n <= 19:
|
||||
return 0 # hand-sorted
|
||||
if 1 <= n <= 12:
|
||||
return 1 # auto-clustered
|
||||
if 20 <= n <= 25:
|
||||
return 2 # osrc
|
||||
if 26 <= n <= 264:
|
||||
return 3 # immich peter
|
||||
if 265 <= n:
|
||||
return 4 # immich nic and beyond
|
||||
return 99
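# Illustrative mapping (derived from the ranges above): faceset_015 -> 0 (hand-sorted),
# faceset_007 -> 1 (auto), faceset_023 -> 2 (osrc), faceset_150 -> 3 (immich-peter),
# faceset_270 -> 4 (immich-nic), anything unrecognized -> 99.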
|
||||
|
||||
|
||||
def is_era_split(name: str) -> bool:
|
||||
return bool(ERA_SPLIT_RE.match(name))
|
||||
|
||||
|
||||
def faceset_centroid(faceset_dir: Path, rec_index, alias_map):
|
||||
"""Return (centroid, n_used, n_missing) where centroid is L2-normalized mean
|
||||
of embeddings of the faces listed in the per-faceset manifest. Falls back to
|
||||
None if too few embeddings found."""
|
||||
manifest = faceset_dir / "manifest.json"
|
||||
if not manifest.exists():
|
||||
return None, 0, 0
|
||||
m = json.loads(manifest.read_text())
|
||||
vecs = []
|
||||
n_missing = 0
|
||||
for f in m.get("faces", []):
|
||||
src = f.get("source")
|
||||
bbox = f.get("bbox")
|
||||
if src is None or bbox is None:
|
||||
n_missing += 1
|
||||
continue
|
||||
bbox_t = tuple(int(x) for x in bbox)
|
||||
canon = alias_map.get(src, src)
|
||||
v = rec_index.get((canon, bbox_t))
|
||||
if v is None and canon != src:
|
||||
v = rec_index.get((src, bbox_t))
|
||||
if v is None:
|
||||
n_missing += 1
|
||||
continue
|
||||
vecs.append(v)
|
||||
if len(vecs) < 3:
|
||||
return None, len(vecs), n_missing
|
||||
arr = np.stack(vecs).astype(np.float32)
|
||||
c = arr.mean(axis=0)
|
||||
n = float(np.linalg.norm(c))
|
||||
if n > 0:
|
||||
c = c / n
|
||||
return c, len(vecs), n_missing
|
||||
|
||||
|
||||
def connected_components(adj: dict[int, set[int]]) -> list[list[int]]:
|
||||
seen: set[int] = set()
|
||||
comps = []
|
||||
for node in adj:
|
||||
if node in seen:
|
||||
continue
|
||||
stack = [node]
|
||||
comp = []
|
||||
while stack:
|
||||
x = stack.pop()
|
||||
if x in seen:
|
||||
continue
|
||||
seen.add(x)
|
||||
comp.append(x)
|
||||
for y in adj.get(x, set()):
|
||||
if y not in seen:
|
||||
stack.append(y)
|
||||
comps.append(sorted(comp))
|
||||
return comps
|
||||
|
||||
|
||||
# ----------------------------- analyze -----------------------------
|
||||
|
||||
def cmd_analyze(args):
|
||||
rec_index, alias_map = load_caches()
|
||||
|
||||
# collect active facesets
|
||||
active = []
|
||||
for d in sorted(ROOT.iterdir()):
|
||||
if not d.is_dir() or d.name.startswith("_"):
|
||||
continue
|
||||
if is_era_split(d.name):
|
||||
continue
|
||||
active.append(d)
|
||||
print(f"[scan] {len(active)} active facesets (era splits + _masked + _thin excluded)", file=sys.stderr)
|
||||
|
||||
centroids: dict[str, np.ndarray] = {}
|
||||
sizes: dict[str, int] = {}
|
||||
skipped = []
|
||||
t0 = time.time()
|
||||
for fs in active:
|
||||
c, n_used, n_miss = faceset_centroid(fs, rec_index, alias_map)
|
||||
if c is None:
|
||||
skipped.append((fs.name, n_used, n_miss))
|
||||
continue
|
||||
centroids[fs.name] = c
|
||||
sizes[fs.name] = n_used
|
||||
print(f"[centroid] {len(centroids)} facesets centroided in {time.time()-t0:.1f}s; "
|
||||
f"{len(skipped)} skipped (too few embeddings)", file=sys.stderr)
|
||||
if skipped:
|
||||
for n, u, m in skipped[:10]:
|
||||
print(f" skip {n}: used={u} missing={m}", file=sys.stderr)
|
||||
if len(skipped) > 10:
|
||||
print(f" ... +{len(skipped)-10} more", file=sys.stderr)
|
||||
|
||||
names = sorted(centroids.keys())
|
||||
if not names:
|
||||
raise SystemExit("no centroids built")
|
||||
|
||||
# similarity matrix
|
||||
M = np.stack([centroids[n] for n in names]).astype(np.float32) # (N, 512), normalized
|
||||
sim = M @ M.T # (N, N) cosine since unit-normalized
|
||||
np.clip(sim, -1.0, 1.0, out=sim)
|
||||
|
||||
edge_thr = args.edge
|
||||
confident_thr = args.confident
|
||||
|
||||
# complete-linkage agglomerative clustering on cosine distance.
|
||||
# Cut at edge threshold: groups are guaranteed to have ALL pairs sim >= edge_thr.
|
||||
# This avoids the chaining problem of single-link / connected-components.
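    # Toy example (illustrative): with centroid sims A-B = 0.60, B-C = 0.60, A-C = 0.10 and
    # edge_thr = 0.45, single-link would chain A, B, C into one group through B even though
    # A and C barely resemble each other; complete-link refuses the merge because the
    # weakest pair in the would-be group (0.10) falls below the cut.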
|
||||
n = len(names)
|
||||
dist = 1.0 - sim
|
||||
np.fill_diagonal(dist, 0.0)
|
||||
# symmetrize numerical noise
|
||||
dist = (dist + dist.T) / 2.0
|
||||
np.clip(dist, 0.0, 2.0, out=dist)
|
||||
cond = squareform(dist, checks=False)
|
||||
Z = linkage(cond, method="complete")
|
||||
cut_dist = 1.0 - edge_thr # complete-link distance corresponds to (1 - min sim)
|
||||
labels = fcluster(Z, t=cut_dist, criterion="distance") # 1-indexed cluster ids
|
||||
|
||||
cluster_members: dict[int, list[int]] = {}
|
||||
for idx, lbl in enumerate(labels):
|
||||
cluster_members.setdefault(int(lbl), []).append(idx)
|
||||
comps = [sorted(idxs) for idxs in cluster_members.values() if len(idxs) > 1]
|
||||
|
||||
n_pairs_in_groups = 0
|
||||
for c in comps:
|
||||
n_pairs_in_groups += len(c) * (len(c) - 1) // 2
|
||||
print(f"[graph] complete-linkage cut at sim>={edge_thr}: {len(comps)} multi-faceset groups "
|
||||
f"({n_pairs_in_groups} within-group pairs)", file=sys.stderr)
|
||||
|
||||
# pick primary per group: lowest tier number, then largest size
|
||||
groups_out = []
|
||||
for comp in comps:
|
||||
members = [names[i] for i in comp]
|
||||
members_sorted = sorted(members, key=lambda x: (faceset_tier(x), -sizes.get(x, 0), x))
|
||||
primary = members_sorted[0]
|
||||
secondaries = members_sorted[1:]
|
||||
# gather pairwise sims within group
|
||||
pair_sims = []
|
||||
idx_of = {names[i]: i for i in comp}
|
||||
for a in members:
|
||||
for b in members:
|
||||
if a >= b:
|
||||
continue
|
||||
pair_sims.append({"a": a, "b": b, "sim": round(float(sim[idx_of[a], idx_of[b]]), 4)})
|
||||
# confidence: minimum within-group sim (the weakest link)
|
||||
min_link = min(p["sim"] for p in pair_sims)
|
||||
max_link = max(p["sim"] for p in pair_sims)
|
||||
confidence = "confident" if min_link >= confident_thr else "uncertain"
|
||||
groups_out.append({
|
||||
"primary": primary,
|
||||
"secondaries": secondaries,
|
||||
"members": members_sorted,
|
||||
"tiers": {n: faceset_tier(n) for n in members},
|
||||
"sizes": {n: sizes.get(n, 0) for n in members},
|
||||
"pair_sims": pair_sims,
|
||||
"min_link": round(min_link, 4),
|
||||
"max_link": round(max_link, 4),
|
||||
"confidence": confidence,
|
||||
})
|
||||
# sort: confident first, then by max_link desc
|
||||
groups_out.sort(key=lambda g: (0 if g["confidence"] == "confident" else 1, -g["max_link"]))
|
||||
|
||||
out = {
|
||||
"thresholds": {"edge": edge_thr, "confident": confident_thr},
|
||||
"n_active": len(active),
|
||||
"n_centroided": len(centroids),
|
||||
"n_skipped": len(skipped),
|
||||
"skipped_reasons": [{"name": n, "used": u, "missing": m} for n, u, m in skipped],
|
||||
"n_groups": len(groups_out),
|
||||
"n_facesets_in_groups": sum(len(g["members"]) for g in groups_out),
|
||||
"groups": groups_out,
|
||||
}
|
||||
op = Path(args.out)
|
||||
op.parent.mkdir(parents=True, exist_ok=True)
|
||||
op.write_text(json.dumps(out, indent=2))
|
||||
confident = sum(1 for g in groups_out if g["confidence"] == "confident")
|
||||
uncertain = sum(1 for g in groups_out if g["confidence"] == "uncertain")
|
||||
print(f"[done] {len(groups_out)} groups ({confident} confident, {uncertain} uncertain) -> {op}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- report -----------------------------
|
||||
|
||||
def cmd_report(args):
|
||||
candidates = json.loads(Path(args.candidates).read_text())
|
||||
out_dir = Path(args.out)
|
||||
thumbs_dir = out_dir / "thumbs"
|
||||
thumbs_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
THUMB = 140
|
||||
THUMBS_PER_FACESET = 4
|
||||
|
||||
def make_thumb(faceset: str, fname: str) -> str:
|
||||
d = thumbs_dir / faceset
|
||||
d.mkdir(parents=True, exist_ok=True)
|
||||
dst = d / (Path(fname).stem + ".jpg")
|
||||
if not dst.exists():
|
||||
try:
|
||||
src = ROOT / faceset / "faces" / fname
|
||||
img = Image.open(src).convert("RGB")
|
||||
img.thumbnail((THUMB, THUMB), Image.LANCZOS)
|
||||
img.save(dst, "JPEG", quality=82)
|
||||
except Exception as e:
|
||||
print(f"[thumb-skip] {faceset}/{fname}: {e}", file=sys.stderr)
|
||||
return ""
|
||||
return f"thumbs/{faceset}/{Path(fname).stem}.jpg"
|
||||
|
||||
rows = []
|
||||
for gi, g in enumerate(candidates["groups"]):
|
||||
primary = g["primary"]
|
||||
sec = g["secondaries"]
|
||||
conf_cls = "confident" if g["confidence"] == "confident" else "uncertain"
|
||||
rows.append(f"<section class='grp {conf_cls}' id='g{gi}'>")
|
||||
rows.append(f"<h2>group #{gi+1} <small>({g['confidence']}; min_sim={g['min_link']:.3f}, max_sim={g['max_link']:.3f})</small></h2>")
|
||||
rows.append(f"<div class='plan'>merge <b>{', '.join(sec)}</b> → <b>{primary}</b></div>")
|
||||
# member rows
|
||||
for name in g["members"]:
|
||||
tier = g["tiers"][name]
|
||||
sz = g["sizes"][name]
|
||||
tier_label = ["hand-sorted", "auto", "osrc", "immich-peter", "immich-nic", "?"][min(tier, 5)]
|
||||
badge = "PRIMARY" if name == primary else "secondary"
|
||||
rows.append(f"<div class='member'>")
|
||||
rows.append(f"<div class='label'><span class='badge {badge.lower()}'>{badge}</span> "
|
||||
f"<b>{name}</b> <small>tier={tier_label} · n={sz}</small></div>")
|
||||
rows.append("<div class='thumbs'>")
|
||||
faces_dir = ROOT / name / "faces"
|
||||
files = sorted(faces_dir.glob("*.png"))[:THUMBS_PER_FACESET]
|
||||
for f in files:
|
||||
rel = make_thumb(name, f.name)
|
||||
if rel:
|
||||
rows.append(f"<img src='{rel}' loading='lazy' title='{f.name}'>")
|
||||
rows.append("</div></div>")
|
||||
# pairwise sims
|
||||
rows.append("<table class='sims'><tr><th>a</th><th>b</th><th>sim</th></tr>")
|
||||
for ps in sorted(g["pair_sims"], key=lambda x: -x["sim"]):
|
||||
cls = "hi" if ps["sim"] >= candidates["thresholds"]["confident"] else "mid"
|
||||
rows.append(f"<tr><td>{ps['a']}</td><td>{ps['b']}</td><td class='{cls}'>{ps['sim']:.3f}</td></tr>")
|
||||
rows.append("</table>")
|
||||
rows.append("</section>")
|
||||
|
||||
nav = " · ".join(f"<a href='#g{i}'>#{i+1}</a>" for i in range(len(candidates["groups"])))
|
||||
|
||||
html = f"""<!doctype html>
|
||||
<html><head><meta charset='utf-8'><title>Faceset merge review</title>
|
||||
<style>
|
||||
body {{ font-family: system-ui, sans-serif; background: #111; color: #eee; padding: 1em; }}
|
||||
h1 {{ margin-top: 0; }}
|
||||
h2 {{ margin: 0; }}
|
||||
small {{ color: #999; font-weight: normal; }}
|
||||
section.grp {{ background: #1a1a1a; border-radius: 6px; padding: 12px; margin: 12px 0; }}
|
||||
section.grp.confident {{ border-left: 4px solid #5fa05f; }}
|
||||
section.grp.uncertain {{ border-left: 4px solid #ffb050; }}
|
||||
.plan {{ margin: .5em 0; color: #6cf; }}
|
||||
.member {{ margin: 8px 0; padding: 6px; background: #222; border-radius: 4px; }}
|
||||
.label {{ font-family: monospace; font-size: 13px; }}
|
||||
.badge {{ display: inline-block; padding: 0 6px; font-size: 10px; border-radius: 2px; }}
|
||||
.badge.primary {{ background: #5fa05f; color: #000; font-weight: bold; }}
|
||||
.badge.secondary {{ background: #444; color: #ccc; }}
|
||||
.thumbs {{ display: flex; gap: 4px; margin-top: 4px; flex-wrap: wrap; }}
|
||||
.thumbs img {{ height: 140px; width: auto; border-radius: 3px; }}
|
||||
table.sims {{ font-family: monospace; font-size: 11px; margin-top: 6px; border-collapse: collapse; }}
|
||||
table.sims td, table.sims th {{ padding: 1px 8px; border: 1px solid #333; text-align: left; }}
|
||||
table.sims td.hi {{ color: #5fa05f; font-weight: bold; }}
|
||||
table.sims td.mid {{ color: #ffb050; }}
|
||||
.nav {{ position: sticky; top: 0; background: #111; padding: .5em 0; border-bottom: 1px solid #333; font-size: 12px; }}
|
||||
a {{ color: #6cf; }}
|
||||
</style></head>
|
||||
<body>
|
||||
<h1>Merge review — {len(candidates['groups'])} candidate groups
|
||||
<small>(edge>={candidates['thresholds']['edge']}, confident>={candidates['thresholds']['confident']})</small></h1>
|
||||
<p>{candidates['n_centroided']} of {candidates['n_active']} active facesets centroided
|
||||
(skipped {candidates['n_skipped']} for too few cached embeddings).
|
||||
Green = confident (min within-group sim >= {candidates['thresholds']['confident']}); orange = uncertain.</p>
|
||||
<div class='nav'>{nav}</div>
|
||||
{''.join(rows)}
|
||||
</body></html>"""
|
||||
|
||||
out_html = out_dir / "index.html"
|
||||
out_html.write_text(html)
|
||||
print(f"[done] {out_html}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- apply -----------------------------
|
||||
|
||||
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
|
||||
import zipfile
|
||||
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
|
||||
for i, p in enumerate(pngs):
|
||||
zf.write(p, arcname=f"{i:04d}.png")
|
||||
|
||||
|
||||
def cmd_apply(args):
|
||||
candidates = json.loads(Path(args.candidates).read_text())
|
||||
master_path = ROOT / "manifest.json"
|
||||
master = json.loads(master_path.read_text())
|
||||
by_name = {f["name"]: f for f in master.get("facesets", [])}
|
||||
|
||||
# filter: skip "uncertain" groups unless --include-uncertain
|
||||
accepted = [g for g in candidates["groups"]
|
||||
if g["confidence"] == "confident" or args.include_uncertain]
|
||||
skipped_unc = [g for g in candidates["groups"]
|
||||
if g["confidence"] == "uncertain" and not args.include_uncertain]
|
||||
# explicit --exclude / --only filters (group indices in the candidates file)
|
||||
if args.only:
|
||||
only = {int(s) for s in args.only.split(",")}
|
||||
accepted = [g for i, g in enumerate(candidates["groups"]) if i in only]
|
||||
    if args.exclude:
        excl = {int(s) for s in args.exclude.split(",")}
        # --exclude indices refer to the candidates file, so filter against the original
        # group list rather than the already-filtered `accepted` list
        excluded = [candidates["groups"][i] for i in excl if 0 <= i < len(candidates["groups"])]
        accepted = [g for g in accepted if g not in excluded]
|
||||
|
||||
print(f"[plan] {len(accepted)} groups will be merged "
|
||||
f"({len(skipped_unc)} uncertain skipped)", file=sys.stderr)
|
||||
|
||||
if args.dry_run:
|
||||
for g in accepted:
|
||||
print(f" merge {g['secondaries']} -> {g['primary']} "
|
||||
f"({g['confidence']}, min_sim={g['min_link']:.3f})")
|
||||
return
|
||||
|
||||
merged_dir = ROOT / "_merged"
|
||||
merged_dir.mkdir(exist_ok=True)
|
||||
new_facesets: list[dict] = []
|
||||
new_merged: list[dict] = list(master.get("merged", []))
|
||||
consumed_names: set[str] = set()
|
||||
primary_updates: dict[str, dict] = {} # name -> new entry
|
||||
primary_absorbed: dict[str, list[dict]] = {} # primary_name -> [secondary entries]
|
||||
|
||||
for g in accepted:
|
||||
primary = g["primary"]
|
||||
if primary not in by_name:
|
||||
print(f"[warn] primary {primary} not in master; skipping group", file=sys.stderr)
|
||||
continue
|
||||
primary_dir = ROOT / primary
|
||||
if not primary_dir.is_dir():
|
||||
print(f"[warn] primary dir {primary_dir} missing; skipping group", file=sys.stderr)
|
||||
continue
|
||||
primary_faces = primary_dir / "faces"
|
||||
primary_manifest_path = primary_dir / "manifest.json"
|
||||
primary_manifest = json.loads(primary_manifest_path.read_text())
|
||||
|
||||
# gather all face entries: primary + each secondary
|
||||
combined_faces: list[dict] = list(primary_manifest.get("faces", []))
|
||||
        # make sure every face (including the primary's own) carries an origin_faceset key
|
||||
for f in combined_faces:
|
||||
f.setdefault("origin_faceset", primary)
|
||||
|
||||
for sec in g["secondaries"]:
|
||||
sec_dir = ROOT / sec
|
||||
if not sec_dir.is_dir():
|
||||
print(f"[warn] secondary {sec} missing; skipping", file=sys.stderr)
|
||||
continue
|
||||
sec_manifest_path = sec_dir / "manifest.json"
|
||||
sec_manifest = json.loads(sec_manifest_path.read_text()) if sec_manifest_path.exists() else {"faces": []}
|
||||
for f in sec_manifest.get("faces", []):
|
||||
f = dict(f)
|
||||
f["origin_faceset"] = sec
|
||||
combined_faces.append(f)
|
||||
|
||||
# rank by quality.composite descending; ties broken by lower cosd_centroid
|
||||
def sort_key(f):
|
||||
q = f.get("quality", {}).get("composite", 0)
|
||||
d = f.get("cosd_centroid", 1.0)
|
||||
return (-q, d)
|
||||
combined_faces.sort(key=sort_key)
|
||||
|
||||
# renumber and stage PNGs into a fresh staging dir, then atomically swap
|
||||
staging = primary_dir / "_faces_new"
|
||||
if staging.exists():
|
||||
shutil.rmtree(staging)
|
||||
staging.mkdir()
|
||||
new_face_entries = []
|
||||
for new_rank, f in enumerate(combined_faces, start=1):
|
||||
origin = f.pop("origin_faceset")
|
||||
old_png_rel = f["png"] # e.g. "faces/0042.png"
|
||||
old_png_name = Path(old_png_rel).name
|
||||
origin_png = ROOT / origin / "faces" / old_png_name
|
||||
if not origin_png.exists():
|
||||
# could be in _dropped if occlusion-pruned; skip
|
||||
continue
|
||||
new_name = f"{new_rank:04d}.png"
|
||||
shutil.copy2(origin_png, staging / new_name)
|
||||
f = dict(f)
|
||||
f["rank"] = new_rank
|
||||
f["png"] = f"faces/{new_name}"
|
||||
f["origin_faceset"] = origin # preserve provenance in manifest
|
||||
new_face_entries.append(f)
|
||||
|
||||
# swap directories: primary/faces -> primary/_faces_old, staging -> primary/faces
|
||||
old_faces_holding = primary_dir / "_faces_old"
|
||||
if old_faces_holding.exists():
|
||||
shutil.rmtree(old_faces_holding)
|
||||
if primary_faces.exists():
|
||||
primary_faces.rename(old_faces_holding)
|
||||
staging.rename(primary_faces)
|
||||
# migrate _dropped/ from old holding (so occlusion-pruned PNGs remain accessible)
|
||||
old_dropped = old_faces_holding / "_dropped"
|
||||
if old_dropped.exists():
|
||||
(primary_faces / "_dropped").mkdir(exist_ok=True)
|
||||
for x in old_dropped.iterdir():
|
||||
shutil.move(str(x), str(primary_faces / "_dropped" / x.name))
|
||||
shutil.rmtree(old_faces_holding)
|
||||
|
||||
# re-zip .fsz
|
||||
survivor_pngs = sorted(primary_faces.glob("*.png"))
|
||||
top_n = primary_manifest.get("top_n", 30)
|
||||
top_n_eff = min(top_n, len(survivor_pngs))
|
||||
# remove old .fsz files
|
||||
for old in primary_dir.glob("*.fsz"):
|
||||
old.unlink()
|
||||
top_fsz_name = f"{primary}_top{top_n_eff}.fsz"
|
||||
all_fsz_name = f"{primary}_all.fsz"
|
||||
_zip_png_list(survivor_pngs[:top_n_eff], primary_dir / top_fsz_name)
|
||||
if len(survivor_pngs) > top_n_eff:
|
||||
_zip_png_list(survivor_pngs, primary_dir / all_fsz_name)
|
||||
all_fsz_used = all_fsz_name
|
||||
else:
|
||||
all_fsz_used = None
|
||||
|
||||
# update primary's local manifest
|
||||
primary_manifest["faces"] = new_face_entries
|
||||
primary_manifest["exported"] = len(new_face_entries)
|
||||
primary_manifest["fsz_top"] = top_fsz_name
|
||||
primary_manifest["fsz_all"] = all_fsz_used
|
||||
primary_manifest["top_n"] = top_n_eff
|
||||
primary_manifest.setdefault("merge_history", []).append({
|
||||
"absorbed": g["secondaries"],
|
||||
"min_link": g["min_link"],
|
||||
"max_link": g["max_link"],
|
||||
"confidence": g["confidence"],
|
||||
})
|
||||
primary_manifest_path.write_text(json.dumps(primary_manifest, indent=2))
|
||||
|
||||
# move secondary directories into _merged/
|
||||
absorbed_master_entries: list[dict] = []
|
||||
for sec in g["secondaries"]:
|
||||
sec_dir = ROOT / sec
|
||||
target = merged_dir / sec
|
||||
if not sec_dir.is_dir():
|
||||
continue
|
||||
if target.exists():
|
||||
shutil.rmtree(sec_dir) # already moved by previous run; clean stub
|
||||
else:
|
||||
shutil.move(str(sec_dir), str(target))
|
||||
sec_master = dict(by_name.get(sec, {"name": sec}))
|
||||
sec_master["merged_into"] = primary
|
||||
sec_master["relpath"] = f"_merged/{sec}"
|
||||
sec_master["fsz_top"] = None
|
||||
sec_master["fsz_all"] = None
|
||||
absorbed_master_entries.append(sec_master)
|
||||
consumed_names.add(sec)
|
||||
|
||||
new_merged.extend(absorbed_master_entries)
|
||||
|
||||
# bump primary master entry
|
||||
prim_master = dict(by_name[primary])
|
||||
prim_master["exported"] = len(new_face_entries)
|
||||
prim_master["top_n"] = top_n_eff
|
||||
prim_master["fsz_top"] = top_fsz_name
|
||||
prim_master["fsz_all"] = all_fsz_used
|
||||
prim_master.setdefault("merge_history", []).append({
|
||||
"absorbed": g["secondaries"],
|
||||
"min_link": g["min_link"],
|
||||
"max_link": g["max_link"],
|
||||
})
|
||||
primary_updates[primary] = prim_master
|
||||
|
||||
print(f"[merged] {g['secondaries']} -> {primary} "
|
||||
f"now {len(new_face_entries)} png", file=sys.stderr)
|
||||
|
||||
# rebuild master facesets list
|
||||
for entry in master.get("facesets", []):
|
||||
nm = entry["name"]
|
||||
if nm in consumed_names:
|
||||
continue
|
||||
if nm in primary_updates:
|
||||
new_facesets.append(primary_updates[nm])
|
||||
else:
|
||||
new_facesets.append(entry)
|
||||
|
||||
new_master = dict(master)
|
||||
new_master["facesets"] = new_facesets
|
||||
new_master["merged"] = new_merged
|
||||
new_master["merge_run"] = {
|
||||
"thresholds": candidates["thresholds"],
|
||||
"groups_applied": len(accepted),
|
||||
"facesets_consumed": len(consumed_names),
|
||||
"include_uncertain": bool(args.include_uncertain),
|
||||
}
|
||||
tmp = master_path.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(new_master, indent=2))
|
||||
tmp.replace(master_path)
|
||||
print(f"[done] master manifest updated: {len(new_facesets)} active, "
|
||||
f"{len(new_merged)} merged, {len(consumed_names)} consumed in this run",
|
||||
file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- main -----------------------------
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
sub = ap.add_subparsers(dest="cmd", required=True)
|
||||
|
||||
a = sub.add_parser("analyze")
|
||||
a.add_argument("--out", required=True)
|
||||
a.add_argument("--edge", type=float, default=0.45, help="min cosine sim to draw an edge (default 0.45)")
|
||||
a.add_argument("--confident", type=float, default=0.65, help="min within-group sim to be confident (default 0.65)")
|
||||
a.set_defaults(func=cmd_analyze)
|
||||
|
||||
r = sub.add_parser("report")
|
||||
r.add_argument("--candidates", required=True)
|
||||
r.add_argument("--out", required=True)
|
||||
r.set_defaults(func=cmd_report)
|
||||
|
||||
p = sub.add_parser("apply")
|
||||
p.add_argument("--candidates", required=True)
|
||||
p.add_argument("--include-uncertain", action="store_true",
|
||||
help="apply uncertain groups too (default: confident only)")
|
||||
p.add_argument("--only", default=None, help="comma-separated group indices to apply")
|
||||
p.add_argument("--exclude", default=None, help="comma-separated group indices to skip")
|
||||
p.add_argument("--dry-run", action="store_true")
|
||||
p.set_defaults(func=cmd_apply)
|
||||
|
||||
args = ap.parse_args()
|
||||
args.func(args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
work/dedup_optimize.py (new file, 594 lines)
@@ -0,0 +1,594 @@
|
||||
"""Corpus-wide dedup + roop-unleashed optimization.
|
||||
|
||||
Two passes:
|
||||
1. Cross-family byte-identical PNG dedup (same SHA256 in two different identity
|
||||
families) — keep the higher-tier family copy. Era splits of the same parent
|
||||
identity (faceset_NNN_*) are intentional duplications and are NOT deduped
|
||||
within their family.
|
||||
2. Within-faceset near-duplicate dedup using cached arcface embeddings
|
||||
(cosine sim >= 0.95). Keep highest quality.composite, drop the rest.
|
||||
|
||||
Plus a Windows-DML multi-face audit (separate phase via clip_worker-style split):
|
||||
3. Re-detect each PNG with insightface; flag any with 0 or >1 detected faces.
|
||||
The roop loader appends every detected face per PNG, so multi-face crops
|
||||
pollute identity averaging.
|
||||
|
||||
All flagged PNGs are MOVED to <faceset>/faces/_dropped/ (reversible). Affected
|
||||
.fsz files are re-zipped, manifests updated.
|
||||
|
||||
CLI:
|
||||
analyze --out work/dedup_audit/dedup_plan.json
|
||||
apply --plan ... [--dry-run]
|
||||
stage_multiface --out work/dedup_audit/multiface_queue.json
|
||||
merge_multiface --results <worker_out> --out work/dedup_audit/multiface_plan.json
|
||||
apply_multiface --plan ... [--dry-run]
|
||||
report --dedup ... --multiface ... --out work/dedup_audit
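
Typical ordering (illustrative; the Windows worker's exact invocation is approximate):
    analyze -> apply                      byte-dedup + near-dup passes, WSL only
    stage_multiface                       WSL writes the queue for the Windows side
    multiface_worker.py <queue> <out>     run under Windows / DirectML
    merge_multiface -> apply_multiface    WSL turns worker output into drops, then applies them
    report                                HTML summary of both plans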
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import re
|
||||
import shutil
|
||||
import sys
|
||||
import time
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
|
||||
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
|
||||
WIN_ROOT = r"E:\temp_things\fcswp\nl_sorted\facesets_swap_ready"
|
||||
CACHES = [
|
||||
Path("/opt/face-sets/work/cache/nl_full.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_peter.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_nic.npz"),
|
||||
]
|
||||
|
||||
NEAR_DUP_THRESHOLD = 0.95
|
||||
HASH_PARALLEL = 16
|
||||
|
||||
|
||||
# ----------------------------- helpers -----------------------------
|
||||
|
||||
def faceset_tier(name: str) -> int:
|
||||
m = re.match(r"^faceset_0*(\d+)(?:_.+)?$", name)
|
||||
if not m:
|
||||
return 99
|
||||
n = int(m.group(1))
|
||||
if 13 <= n <= 19:
|
||||
return 0
|
||||
if 1 <= n <= 12:
|
||||
return 1
|
||||
if 20 <= n <= 25:
|
||||
return 2
|
||||
if 26 <= n <= 264:
|
||||
return 3
|
||||
if 265 <= n:
|
||||
return 4
|
||||
return 99
|
||||
|
||||
|
||||
def faceset_family(name: str) -> str:
|
||||
"""faceset_001_2010-13 → faceset_001; faceset_001 → faceset_001."""
|
||||
m = re.match(r"^(faceset_\d+)(?:_.+)?$", name)
|
||||
return m.group(1) if m else name
|
||||
|
||||
|
||||
def wsl_to_win(p: str) -> str:
|
||||
s = str(p)
|
||||
if s.startswith("/mnt/"):
|
||||
return f"{s[5].upper()}:\\{s[7:].replace('/', chr(92))}"
|
||||
return s
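# e.g. (illustrative) "/mnt/e/temp_things/foo/0001.png" -> "E:\temp_things\foo\0001.png";
# anything that does not start with /mnt/ is passed through unchanged.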
|
||||
|
||||
|
||||
def iter_active_facesets() -> list[Path]:
|
||||
out = []
|
||||
for d in sorted(ROOT.iterdir()):
|
||||
if d.is_dir() and not d.name.startswith("_"):
|
||||
out.append(d)
|
||||
return out
|
||||
|
||||
|
||||
def sha256_file(p: Path) -> str:
|
||||
h = hashlib.sha256()
|
||||
with open(p, "rb") as f:
|
||||
while True:
|
||||
b = f.read(1 << 20)
|
||||
if not b:
|
||||
break
|
||||
h.update(b)
|
||||
return h.hexdigest()
|
||||
|
||||
|
||||
def load_caches():
|
||||
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
|
||||
alias_map: dict[str, str] = {}
|
||||
for c in CACHES:
|
||||
if not c.exists():
|
||||
continue
|
||||
d = np.load(c, allow_pickle=True)
|
||||
emb = d["embeddings"]
|
||||
meta = json.loads(str(d["meta"]))
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if "path_aliases" in d.files:
|
||||
paliases = json.loads(str(d["path_aliases"]))
|
||||
for canon, alist in paliases.items():
|
||||
alias_map.setdefault(canon, canon)
|
||||
for a in alist:
|
||||
alias_map[a] = canon
|
||||
for i, rec in enumerate(face_records):
|
||||
p = rec["path"]
|
||||
bbox = tuple(int(x) for x in rec["bbox"])
|
||||
v = emb[i].astype(np.float32)
|
||||
n = float(np.linalg.norm(v))
|
||||
if n > 0:
|
||||
v = v / n
|
||||
rec_index[(p, bbox)] = v
|
||||
alias_map.setdefault(p, p)
|
||||
return rec_index, alias_map
|
||||
|
||||
|
||||
def lookup_emb(rec_index, alias_map, src: str, bbox):
|
||||
bbox_t = tuple(int(x) for x in bbox)
|
||||
canon = alias_map.get(src, src)
|
||||
v = rec_index.get((canon, bbox_t))
|
||||
if v is None and canon != src:
|
||||
v = rec_index.get((src, bbox_t))
|
||||
return v
|
||||
|
||||
|
||||
# ----------------------------- analyze -----------------------------
|
||||
|
||||
def cmd_analyze(args):
|
||||
rec_index, alias_map = load_caches()
|
||||
facesets = iter_active_facesets()
|
||||
print(f"[scan] {len(facesets)} active facesets", file=sys.stderr)
|
||||
|
||||
# Phase 1: walk every PNG, collect (faceset, file, src, bbox, quality, emb, sha256)
|
||||
all_pngs = [] # list of dicts
|
||||
t0 = time.time()
|
||||
for fs in facesets:
|
||||
manifest_path = fs / "manifest.json"
|
||||
if not manifest_path.exists():
|
||||
continue
|
||||
m = json.loads(manifest_path.read_text())
|
||||
for f in m.get("faces", []):
|
||||
png_rel = f.get("png")
|
||||
if not png_rel:
|
||||
continue
|
||||
disk_path = fs / png_rel
|
||||
if not disk_path.exists():
|
||||
continue
|
||||
all_pngs.append({
|
||||
"faceset": fs.name,
|
||||
"family": faceset_family(fs.name),
|
||||
"tier": faceset_tier(fs.name),
|
||||
"file": Path(png_rel).name,
|
||||
"rank": f.get("rank"),
|
||||
"source": f.get("source"),
|
||||
"bbox": f.get("bbox"),
|
||||
"quality": f.get("quality", {}).get("composite", 0),
|
||||
"disk_path": str(disk_path),
|
||||
})
|
||||
print(f"[scan] {len(all_pngs)} PNGs walked in {time.time()-t0:.1f}s", file=sys.stderr)
|
||||
|
||||
# Phase 2: SHA256 hash each PNG (parallel I/O)
|
||||
t0 = time.time()
|
||||
def _hash_one(idx):
|
||||
all_pngs[idx]["sha256"] = sha256_file(Path(all_pngs[idx]["disk_path"]))
|
||||
with ThreadPoolExecutor(max_workers=HASH_PARALLEL) as ex:
|
||||
        # drain the map iterator so every hash completes before the timing print below
|
||||
for _ in ex.map(_hash_one, range(len(all_pngs)), chunksize=16):
|
||||
pass
|
||||
print(f"[hash] {len(all_pngs)} PNGs hashed in {time.time()-t0:.1f}s", file=sys.stderr)
|
||||
|
||||
# Phase 3: cross-family byte-dedup
|
||||
by_sha: dict[str, list[int]] = {}
|
||||
for i, p in enumerate(all_pngs):
|
||||
by_sha.setdefault(p["sha256"], []).append(i)
|
||||
|
||||
cross_family_groups = []
|
||||
byte_drops: set[int] = set() # indices of PNGs to drop
|
||||
for sha, idxs in by_sha.items():
|
||||
if len(idxs) < 2:
|
||||
continue
|
||||
families = {all_pngs[i]["family"] for i in idxs}
|
||||
if len(families) < 2:
|
||||
continue # all in same family — intentional era duplication
|
||||
# multiple families share this content → dedup keeping the best one
|
||||
cross_family_groups.append({"sha256": sha, "members": [
|
||||
{"faceset": all_pngs[i]["faceset"], "file": all_pngs[i]["file"],
|
||||
"tier": all_pngs[i]["tier"], "quality": all_pngs[i]["quality"],
|
||||
"rank": all_pngs[i]["rank"]} for i in idxs
|
||||
]})
|
||||
# keeper rule: lowest tier number, then highest quality
|
||||
best = sorted(idxs, key=lambda i: (all_pngs[i]["tier"], -all_pngs[i]["quality"]))[0]
|
||||
for i in idxs:
|
||||
# NEVER drop within-family copies (preserve era duplication intentionally)
|
||||
# We only drop indices whose family != best's family
|
||||
if i != best and all_pngs[i]["family"] != all_pngs[best]["family"]:
|
||||
byte_drops.add(i)
|
||||
print(f"[byte] {len(cross_family_groups)} cross-family hash groups; "
|
||||
f"{len(byte_drops)} PNGs marked for byte-dedup drop", file=sys.stderr)
|
||||
|
||||
# Phase 4: within-faceset near-dup (embedding sim >= threshold)
|
||||
by_faceset: dict[str, list[int]] = {}
|
||||
for i, p in enumerate(all_pngs):
|
||||
by_faceset.setdefault(p["faceset"], []).append(i)
|
||||
|
||||
near_dup_groups = []
|
||||
near_drops: set[int] = set()
|
||||
miss_emb_total = 0
|
||||
t0 = time.time()
|
||||
for fs_name, idxs in by_faceset.items():
|
||||
if len(idxs) < 2:
|
||||
continue
|
||||
# gather embeddings
|
||||
embs = []
|
||||
kept_idxs = []
|
||||
for i in idxs:
|
||||
v = lookup_emb(rec_index, alias_map, all_pngs[i]["source"], all_pngs[i]["bbox"])
|
||||
if v is None:
|
||||
miss_emb_total += 1
|
||||
continue
|
||||
embs.append(v)
|
||||
kept_idxs.append(i)
|
||||
if len(kept_idxs) < 2:
|
||||
continue
|
||||
M = np.stack(embs).astype(np.float32)
|
||||
sim = M @ M.T
|
||||
np.fill_diagonal(sim, -1) # ignore self
|
||||
# find connected components in the (sim >= threshold) graph
|
||||
adj = {k: set() for k in range(len(kept_idxs))}
|
||||
for a in range(len(kept_idxs)):
|
||||
# only check a < b to avoid double work
|
||||
hi = np.where(sim[a, a+1:] >= NEAR_DUP_THRESHOLD)[0]
|
||||
for off in hi:
|
||||
b = a + 1 + int(off)
|
||||
adj[a].add(b)
|
||||
adj[b].add(a)
|
||||
seen = set()
|
||||
for k in adj:
|
||||
if k in seen or not adj[k]:
|
||||
continue
|
||||
stack = [k]
|
||||
comp = []
|
||||
while stack:
|
||||
x = stack.pop()
|
||||
if x in seen:
|
||||
continue
|
||||
seen.add(x)
|
||||
comp.append(x)
|
||||
for y in adj[x]:
|
||||
if y not in seen:
|
||||
stack.append(y)
|
||||
if len(comp) < 2:
|
||||
continue
|
||||
comp_idxs = [kept_idxs[c] for c in comp]
|
||||
# keeper: highest quality.composite, tie-break: lowest rank
|
||||
best = sorted(comp_idxs, key=lambda i: (-all_pngs[i]["quality"], all_pngs[i]["rank"] or 9999))[0]
|
||||
sims_in_group = []
|
||||
for ci in range(len(comp)):
|
||||
for cj in range(ci+1, len(comp)):
|
||||
sims_in_group.append(float(sim[comp[ci], comp[cj]]))
|
||||
near_dup_groups.append({
|
||||
"faceset": fs_name,
|
||||
"members": [{"file": all_pngs[i]["file"], "rank": all_pngs[i]["rank"],
|
||||
"quality": all_pngs[i]["quality"]} for i in comp_idxs],
|
||||
"keeper": all_pngs[best]["file"],
|
||||
"min_sim": min(sims_in_group) if sims_in_group else None,
|
||||
"max_sim": max(sims_in_group) if sims_in_group else None,
|
||||
})
|
||||
for i in comp_idxs:
|
||||
if i != best:
|
||||
near_drops.add(i)
|
||||
print(f"[near] {len(near_dup_groups)} near-dup groups; "
|
||||
f"{len(near_drops)} PNGs marked for near-dup drop "
|
||||
f"(miss_emb={miss_emb_total}); {time.time()-t0:.1f}s", file=sys.stderr)
|
||||
|
||||
# Combined drop set; for output, group by faceset
|
||||
all_drops = byte_drops | near_drops
|
||||
drops_by_faceset: dict[str, list] = {}
|
||||
for i in all_drops:
|
||||
p = all_pngs[i]
|
||||
reason = []
|
||||
if i in byte_drops: reason.append("byte_dup")
|
||||
if i in near_drops: reason.append("near_dup")
|
||||
drops_by_faceset.setdefault(p["faceset"], []).append({
|
||||
"file": p["file"], "rank": p["rank"], "reason": "+".join(reason),
|
||||
"sha256": p["sha256"], "quality": p["quality"],
|
||||
})
|
||||
|
||||
plan = {
|
||||
"thresholds": {"near_dup_sim": NEAR_DUP_THRESHOLD},
|
||||
"totals": {
|
||||
"active_facesets": len(facesets),
|
||||
"active_pngs": len(all_pngs),
|
||||
"byte_dup_groups": len(cross_family_groups),
|
||||
"byte_dup_drops": len(byte_drops),
|
||||
"near_dup_groups": len(near_dup_groups),
|
||||
"near_dup_drops": len(near_drops),
|
||||
"all_drops": len(all_drops),
|
||||
"facesets_affected": len(drops_by_faceset),
|
||||
},
|
||||
"byte_dup_groups": cross_family_groups,
|
||||
"near_dup_groups": near_dup_groups,
|
||||
"drops_by_faceset": drops_by_faceset,
|
||||
}
|
||||
op = Path(args.out)
|
||||
op.parent.mkdir(parents=True, exist_ok=True)
|
||||
op.write_text(json.dumps(plan, indent=2))
|
||||
print(f"[done] plan -> {op}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- apply -----------------------------
|
||||
|
||||
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
|
||||
import zipfile
|
||||
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
|
||||
for i, p in enumerate(pngs):
|
||||
zf.write(p, arcname=f"{i:04d}.png")
|
||||
|
||||
|
||||
def _apply_drops_to_facesets(drops_by_faceset: dict[str, list], reason_label: str, master_path: Path):
|
||||
"""Move flagged PNGs to <faceset>/faces/_dropped/, rebuild manifests + .fsz.
|
||||
drops_by_faceset values are lists of {"file": str, ...}.
|
||||
Returns total moved + counts per faceset."""
|
||||
master = json.loads(master_path.read_text())
|
||||
by_name = {f["name"]: f for f in master.get("facesets", [])}
|
||||
total_moved = 0
|
||||
per_faceset_counts = {}
|
||||
|
||||
for fs_name, drops in drops_by_faceset.items():
|
||||
fs_dir = ROOT / fs_name
|
||||
if not fs_dir.is_dir():
|
||||
print(f"[warn] {fs_name}: dir missing, skip", file=sys.stderr)
|
||||
continue
|
||||
faces_dir = fs_dir / "faces"
|
||||
dropped_dir = faces_dir / "_dropped"
|
||||
dropped_dir.mkdir(exist_ok=True)
|
||||
drop_files = {d["file"] for d in drops}
|
||||
|
||||
moved_here = 0
|
||||
for fname in sorted(drop_files):
|
||||
src = faces_dir / fname
|
||||
if not src.exists():
|
||||
continue
|
||||
shutil.move(str(src), str(dropped_dir / fname))
|
||||
moved_here += 1
|
||||
|
||||
# rebuild manifest by filtering out dropped files
|
||||
manifest_path = fs_dir / "manifest.json"
|
||||
if manifest_path.exists():
|
||||
mm = json.loads(manifest_path.read_text())
|
||||
new_faces = [f for f in mm.get("faces", []) if Path(f.get("png", "")).name not in drop_files]
|
||||
mm["faces"] = new_faces
|
||||
mm["exported"] = len(new_faces)
|
||||
mm.setdefault(f"{reason_label}_history", []).append({"dropped": moved_here})
|
||||
|
||||
# re-zip
|
||||
survivor_pngs = sorted(faces_dir.glob("*.png"))
|
||||
top_n = mm.get("top_n", 30)
|
||||
top_n_eff = min(top_n, len(survivor_pngs))
|
||||
for old in fs_dir.glob("*.fsz"):
|
||||
old.unlink()
|
||||
top_fsz_name = f"{fs_name}_top{top_n_eff}.fsz"
|
||||
all_fsz_name = f"{fs_name}_all.fsz"
|
||||
if top_n_eff > 0:
|
||||
_zip_png_list(survivor_pngs[:top_n_eff], fs_dir / top_fsz_name)
|
||||
mm["fsz_top"] = top_fsz_name
|
||||
mm["top_n"] = top_n_eff
|
||||
else:
|
||||
mm["fsz_top"] = None
|
||||
mm["top_n"] = 0
|
||||
if len(survivor_pngs) > top_n_eff:
|
||||
_zip_png_list(survivor_pngs, fs_dir / all_fsz_name)
|
||||
mm["fsz_all"] = all_fsz_name
|
||||
else:
|
||||
mm["fsz_all"] = None
|
||||
manifest_path.write_text(json.dumps(mm, indent=2))
|
||||
|
||||
if fs_name in by_name:
|
||||
by_name[fs_name]["exported"] = len(new_faces)
|
||||
by_name[fs_name]["fsz_top"] = mm["fsz_top"]
|
||||
by_name[fs_name]["fsz_all"] = mm["fsz_all"]
|
||||
by_name[fs_name]["top_n"] = mm["top_n"]
|
||||
by_name[fs_name].setdefault(f"{reason_label}_dropped", 0)
|
||||
by_name[fs_name][f"{reason_label}_dropped"] += moved_here
|
||||
|
||||
total_moved += moved_here
|
||||
per_faceset_counts[fs_name] = moved_here
|
||||
|
||||
# rewrite master with same ordering
|
||||
new_facesets = [by_name.get(e["name"], e) for e in master.get("facesets", [])]
|
||||
master["facesets"] = new_facesets
|
||||
master.setdefault(f"{reason_label}_runs", []).append({
|
||||
"facesets_affected": len(per_faceset_counts),
|
||||
"pngs_moved": total_moved,
|
||||
})
|
||||
tmp = master_path.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(master, indent=2))
|
||||
tmp.replace(master_path)
|
||||
return total_moved, per_faceset_counts
|
||||
|
||||
|
||||
def cmd_apply(args):
|
||||
plan = json.loads(Path(args.plan).read_text())
|
||||
drops = plan["drops_by_faceset"]
|
||||
if args.dry_run:
|
||||
for fs, items in sorted(drops.items()):
|
||||
reasons = {}
|
||||
for it in items:
|
||||
reasons[it["reason"]] = reasons.get(it["reason"], 0) + 1
|
||||
print(f" {fs}: {len(items)} dropped ({reasons})")
|
||||
print(f"=== total: {sum(len(v) for v in drops.values())} PNGs across {len(drops)} facesets ===")
|
||||
return
|
||||
master_path = ROOT / "manifest.json"
|
||||
total, _ = _apply_drops_to_facesets(drops, "dedup", master_path)
|
||||
print(f"[done] {total} PNGs moved to faces/_dropped/ across {len(drops)} facesets", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- multiface staging + apply -----------------------------
|
||||
|
||||
def cmd_stage_multiface(args):
|
||||
"""Build queue.json of all currently-active PNGs in the corpus
|
||||
for the Windows DML multi-face audit worker."""
|
||||
queue = []
|
||||
for fs in iter_active_facesets():
|
||||
faces_dir = fs / "faces"
|
||||
if not faces_dir.is_dir():
|
||||
continue
|
||||
for p in sorted(faces_dir.glob("*.png")):
|
||||
queue.append({
|
||||
"wsl_path": str(p),
|
||||
"win_path": wsl_to_win(str(p)),
|
||||
"faceset": fs.name,
|
||||
"file": p.name,
|
||||
})
|
||||
op = Path(args.out)
|
||||
op.parent.mkdir(parents=True, exist_ok=True)
|
||||
op.write_text(json.dumps(queue, indent=2))
|
||||
print(f"[stage] {len(queue)} PNGs -> {op}", file=sys.stderr)
|
||||
|
||||
|
||||
def cmd_merge_multiface(args):
|
||||
"""Convert worker results.json into a drops_by_faceset plan."""
|
||||
src = json.loads(Path(args.results).read_text())
|
||||
drops_by_faceset: dict[str, list] = {}
|
||||
bad_count = 0
|
||||
for r in src.get("results", []):
|
||||
n_faces = r.get("face_count", -1)
|
||||
if n_faces == 1:
|
||||
continue
|
||||
bad_count += 1
|
||||
drops_by_faceset.setdefault(r["faceset"], []).append({
|
||||
"file": r["file"],
|
||||
"reason": f"multiface_{n_faces}",
|
||||
"face_count": n_faces,
|
||||
})
|
||||
plan = {
|
||||
"totals": {"bad_pngs": bad_count, "facesets_affected": len(drops_by_faceset),
|
||||
"scored": len(src.get("results", []))},
|
||||
"drops_by_faceset": drops_by_faceset,
|
||||
}
|
||||
op = Path(args.out)
|
||||
op.parent.mkdir(parents=True, exist_ok=True)
|
||||
op.write_text(json.dumps(plan, indent=2))
|
||||
print(f"[merge] {bad_count} bad PNGs across {len(drops_by_faceset)} facesets -> {op}", file=sys.stderr)
|
||||
|
||||
|
||||
def cmd_apply_multiface(args):
|
||||
plan = json.loads(Path(args.plan).read_text())
|
||||
drops = plan["drops_by_faceset"]
|
||||
if args.dry_run:
|
||||
for fs, items in sorted(drops.items()):
|
||||
print(f" {fs}: {len(items)} bad PNG(s)")
|
||||
print(f"=== total: {sum(len(v) for v in drops.values())} ===")
|
||||
return
|
||||
master_path = ROOT / "manifest.json"
|
||||
total, _ = _apply_drops_to_facesets(drops, "multiface", master_path)
|
||||
print(f"[done] {total} PNGs moved to faces/_dropped/ across {len(drops)} facesets", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- report -----------------------------
|
||||
|
||||
def cmd_report(args):
|
||||
out_dir = Path(args.out)
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
sections = []
|
||||
if args.dedup:
|
||||
d = json.loads(Path(args.dedup).read_text())
|
||||
t = d["totals"]
|
||||
sections.append(f"<h2>Dedup</h2>")
|
||||
sections.append(
|
||||
f"<ul>"
|
||||
f"<li>Active facesets: {t['active_facesets']}, active PNGs: {t['active_pngs']}</li>"
|
||||
f"<li>Cross-family byte-dup groups: {t['byte_dup_groups']} → {t['byte_dup_drops']} PNGs dropped</li>"
|
||||
f"<li>Within-faceset near-dup groups (sim≥{d['thresholds']['near_dup_sim']}): {t['near_dup_groups']} → {t['near_dup_drops']} PNGs dropped</li>"
|
||||
f"<li><b>Total dedup drops: {t['all_drops']}</b> across {t['facesets_affected']} facesets</li>"
|
||||
f"</ul>"
|
||||
)
|
||||
# top-N affected facesets
|
||||
rows = sorted(d["drops_by_faceset"].items(), key=lambda x: -len(x[1]))[:25]
|
||||
sections.append("<h3>Top 25 most-affected facesets</h3><table><tr><th>faceset</th><th>dropped</th><th>reasons</th></tr>")
|
||||
for fs, items in rows:
|
||||
r = {}
|
||||
for it in items:
|
||||
r[it["reason"]] = r.get(it["reason"], 0) + 1
|
||||
sections.append(f"<tr><td>{fs}</td><td>{len(items)}</td><td>{r}</td></tr>")
|
||||
sections.append("</table>")
|
||||
|
||||
if args.multiface:
|
||||
m = json.loads(Path(args.multiface).read_text())
|
||||
t = m["totals"]
|
||||
sections.append("<h2>Multi-face audit</h2>")
|
||||
sections.append(
|
||||
f"<ul>"
|
||||
f"<li>PNGs scored: {t['scored']}</li>"
|
||||
f"<li>Bad PNGs (0 or >1 face): {t['bad_pngs']} across {t['facesets_affected']} facesets</li>"
|
||||
f"</ul>"
|
||||
)
|
||||
|
||||
html = f"""<!doctype html>
|
||||
<html><head><meta charset='utf-8'><title>Dedup + multi-face audit</title>
|
||||
<style>
|
||||
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
|
||||
h1, h2, h3 {{ margin-top:1em; }}
|
||||
table {{ border-collapse: collapse; font-family: monospace; font-size: 12px; }}
|
||||
table td, table th {{ padding: 2px 8px; border: 1px solid #333; }}
|
||||
ul li {{ margin: 4px 0; }}
|
||||
</style></head>
|
||||
<body>
|
||||
<h1>facesets_swap_ready dedup + roop optimization audit</h1>
|
||||
{''.join(sections)}
|
||||
</body></html>"""
|
||||
out_html = out_dir / "index.html"
|
||||
out_html.write_text(html)
|
||||
print(f"[done] {out_html}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- main -----------------------------
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
sub = ap.add_subparsers(dest="cmd", required=True)
|
||||
|
||||
a = sub.add_parser("analyze")
|
||||
a.add_argument("--out", required=True)
|
||||
a.set_defaults(func=cmd_analyze)
|
||||
|
||||
p = sub.add_parser("apply")
|
||||
p.add_argument("--plan", required=True)
|
||||
p.add_argument("--dry-run", action="store_true")
|
||||
p.set_defaults(func=cmd_apply)
|
||||
|
||||
sm = sub.add_parser("stage_multiface")
|
||||
sm.add_argument("--out", required=True)
|
||||
sm.set_defaults(func=cmd_stage_multiface)
|
||||
|
||||
mm = sub.add_parser("merge_multiface")
|
||||
mm.add_argument("--results", required=True)
|
||||
mm.add_argument("--out", required=True)
|
||||
mm.set_defaults(func=cmd_merge_multiface)
|
||||
|
||||
am = sub.add_parser("apply_multiface")
|
||||
am.add_argument("--plan", required=True)
|
||||
am.add_argument("--dry-run", action="store_true")
|
||||
am.set_defaults(func=cmd_apply_multiface)
|
||||
|
||||
r = sub.add_parser("report")
|
||||
r.add_argument("--dedup", default=None)
|
||||
r.add_argument("--multiface", default=None)
|
||||
r.add_argument("--out", required=True)
|
||||
r.set_defaults(func=cmd_report)
|
||||
|
||||
args = ap.parse_args()
|
||||
args.func(args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
work/filter_occlusions.py (new file, 574 lines)
@@ -0,0 +1,574 @@
|
||||
"""CLIP zero-shot scoring for masks + sunglasses across facesets_swap_ready/.
|
||||
|
||||
Usage:
|
||||
# score one or more specific facesets (test mode)
|
||||
python work/filter_occlusions.py score --facesets faceset_001,faceset_050 \
|
||||
--out work/test_batch_occlusion/scores.json
|
||||
|
||||
# score everything (full corpus)
|
||||
python work/filter_occlusions.py score --out work/occlusion_scores.json
|
||||
|
||||
# render HTML contact sheet from a scores.json
|
||||
python work/filter_occlusions.py report --scores work/test_batch_occlusion/scores.json \
|
||||
--out work/test_batch_occlusion
|
||||
|
||||
Notes:
|
||||
- This script never modifies facesets_swap_ready/. An --apply step lives elsewhere
|
||||
(or will be added once thresholds are validated).
|
||||
- Model: open_clip ViT-L-14 / dfn2b_s39b (best public zero-shot at this size).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Iterable
|
||||
|
||||
import torch
|
||||
from PIL import Image
|
||||
import open_clip
|
||||
|
||||
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
|
||||
WIN_ROOT = r"E:\temp_things\fcswp\nl_sorted\facesets_swap_ready"
|
||||
|
||||
MODEL_NAME = "ViT-L-14"
|
||||
PRETRAINED = "dfn2b_s39b"
|
||||
|
||||
|
||||
def wsl_to_win(wsl_path: str) -> str:
|
||||
"""Translate a /mnt/e/... wsl path to E:\\... for the Windows worker."""
|
||||
s = str(wsl_path)
|
||||
if s.startswith("/mnt/"):
|
||||
drive = s[5]
|
||||
rest = s[7:].replace("/", "\\")
|
||||
return f"{drive.upper()}:\\{rest}"
|
||||
return s
|
||||
|
||||
# Prompt ensembles. Each pair (positive, negative) becomes one binary classifier.
|
||||
# We average text embeddings within each list, then softmax across the two means.
|
||||
PROMPTS = {
|
||||
"mask": {
|
||||
"pos": [
|
||||
"a photo of a person wearing a surgical face mask",
|
||||
"a photo of a person wearing an FFP2 respirator covering mouth and nose",
|
||||
"a photo of a person wearing a cloth face mask",
|
||||
"a face partially covered by a medical mask",
|
||||
"a person whose mouth and nose are hidden by a face mask",
|
||||
],
|
||||
"neg": [
|
||||
"a photo of a person's face with mouth and nose clearly visible",
|
||||
"a clear, unobstructed photo of a face",
|
||||
"a photo of a face without any mask or covering",
|
||||
"a portrait of a person showing their full face",
|
||||
"a photo of a person with a beard and visible mouth", # avoid beard false positives
|
||||
],
|
||||
},
|
||||
"sunglasses": {
|
||||
# We want to flag ONLY images where sunglasses occlude the eyes.
|
||||
# False positives to defeat: sunglasses pushed up on the head/forehead, hanging on a shirt collar.
|
||||
"pos": [
|
||||
"a face with dark sunglasses covering the eyes",
|
||||
"a portrait with the eyes hidden behind opaque sunglasses",
|
||||
"a person wearing dark sunglasses over their eyes, eyes not visible",
|
||||
"a face where the eyes are completely concealed by tinted lenses",
|
||||
"a close-up portrait wearing aviator sunglasses on the eyes",
|
||||
],
|
||||
"neg": [
|
||||
"a portrait with both eyes clearly visible and uncovered",
|
||||
"a face with sunglasses pushed up on the forehead, eyes visible below",
|
||||
"a face with sunglasses resting on top of the head, eyes visible",
|
||||
"a person with sunglasses hanging from their shirt, eyes visible",
|
||||
"a face wearing clear prescription eyeglasses with visible eyes",
|
||||
"a portrait with no eyewear and visible eyes",
|
||||
],
|
||||
},
|
||||
}
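# Scoring sketch (informal; build_text_features / score_images below are the exact code):
#   t_pos = renormalized mean of the L2-normalized "pos" prompt embeddings, t_neg likewise
#   P(attr | image) = softmax(logit_scale * [img_emb . t_pos, img_emb . t_neg])[0]
# An image scoring, say, 0.72 on "mask" means the positive ensemble wins the two-way
# softmax with 72% of the weight.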
|
||||
|
||||
|
||||
def load_model(device: str = "cpu"):
    print(f"[clip] loading {MODEL_NAME} / {PRETRAINED} on {device} ...", file=sys.stderr)
    t0 = time.time()
    model, _, preprocess = open_clip.create_model_and_transforms(MODEL_NAME, pretrained=PRETRAINED)
    tokenizer = open_clip.get_tokenizer(MODEL_NAME)
    model = model.to(device).eval()
    logit_scale = float(model.logit_scale.exp().detach().cpu())
    print(f"[clip] ready in {time.time()-t0:.1f}s, logit_scale={logit_scale:.2f}", file=sys.stderr)
    return model, preprocess, tokenizer, logit_scale


@torch.no_grad()
def build_text_features(model, tokenizer, device: str):
    """Return dict {attr: (pos_mean_emb, neg_mean_emb)} on device, both L2-normalized."""
    out = {}
    for attr, sides in PROMPTS.items():
        feats = {}
        for side in ("pos", "neg"):
            tokens = tokenizer(sides[side]).to(device)
            f = model.encode_text(tokens)
            f = f / f.norm(dim=-1, keepdim=True)
            mean = f.mean(dim=0)
            feats[side] = mean / mean.norm()
        out[attr] = (feats["pos"], feats["neg"])
    return out


@torch.no_grad()
def score_images(model, preprocess, text_feats, logit_scale: float, paths: list[Path], device: str, batch: int = 16):
    """Yield (path, {attr: pos_prob}) per image. logit_scale is CLIP's learned temperature (~100)."""
    for i in range(0, len(paths), batch):
        chunk = paths[i:i + batch]
        imgs = []
        keep = []
        for p in chunk:
            try:
                img = Image.open(p).convert("RGB")
                imgs.append(preprocess(img))
                keep.append(p)
            except Exception as e:
                print(f"[skip] {p}: {e}", file=sys.stderr)
        if not imgs:
            continue
        x = torch.stack(imgs).to(device)
        feats = model.encode_image(x)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # (B, D)
        results = {}
        for attr, (pos, neg) in text_feats.items():
            sims = torch.stack([feats @ pos, feats @ neg], dim=1) * logit_scale  # (B, 2)
            probs = sims.softmax(dim=1)[:, 0].tolist()  # P(pos)
            results[attr] = probs
        for j, p in enumerate(keep):
            yield p, {attr: results[attr][j] for attr in text_feats}


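# A minimal composition sketch of the three helpers above (assumes two local crops exist
# and runs on CPU; illustrative only, not invoked anywhere in this module):
#
#   model, preprocess, tokenizer, logit_scale = load_model("cpu")
#   text_feats = build_text_features(model, tokenizer, "cpu")
#   crops = [Path("0001.png"), Path("0002.png")]
#   for p, probs in score_images(model, preprocess, text_feats, logit_scale, crops, "cpu"):
#       print(p.name, f"mask={probs['mask']:.2f}", f"sunglasses={probs['sunglasses']:.2f}")
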
def iter_facesets(root: Path, only: list[str] | None) -> Iterable[Path]:
    if only:
        for name in only:
            d = root / name
            if d.is_dir():
                yield d
            else:
                print(f"[warn] not a directory: {d}", file=sys.stderr)
        return
    for d in sorted(root.iterdir()):
        if d.is_dir() and not d.name.startswith("_"):
            yield d


def cmd_score(args):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess, tokenizer, logit_scale = load_model(device)
    text_feats = build_text_features(model, tokenizer, device)

    only = [s.strip() for s in args.facesets.split(",")] if args.facesets else None
    facesets = list(iter_facesets(ROOT, only))

    report = {
        "model": f"{MODEL_NAME}/{PRETRAINED}",
        "root": str(ROOT),
        "prompts": PROMPTS,
        "facesets": {},
    }
    total_imgs = 0
    t0 = time.time()
    for fs in facesets:
        faces = sorted((fs / "faces").glob("*.png")) if (fs / "faces").is_dir() else sorted(fs.glob("*.png"))
        if args.sample_per_faceset:
            # take first N PNGs per faceset (cheap deterministic sample for test batches)
            faces = faces[: args.sample_per_faceset]
        if not faces:
            continue
        print(f"[scan] {fs.name}: {len(faces)} png", file=sys.stderr)
        per_image = []
        for p, scores in score_images(model, preprocess, text_feats, logit_scale, faces, device):
            per_image.append({"file": p.name, "mask": round(scores["mask"], 4), "sunglasses": round(scores["sunglasses"], 4)})
            total_imgs += 1
        report["facesets"][fs.name] = per_image

    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(report, indent=2))
    dt = time.time() - t0
    print(f"[done] {total_imgs} images, {dt:.1f}s ({total_imgs/max(dt,1e-3):.2f} img/s) -> {out}", file=sys.stderr)


def cmd_report(args):
    """Render an HTML contact sheet from scores.json. Generates JPG thumbs."""
    scores = json.loads(Path(args.scores).read_text())
    out_dir = Path(args.out)
    thumbs_dir = out_dir / "thumbs"
    thumbs_dir.mkdir(parents=True, exist_ok=True)

    THUMB = 160
    rows_html = []

    def thumb_path(faceset: str, fname: str) -> Path:
        d = thumbs_dir / faceset
        d.mkdir(parents=True, exist_ok=True)
        return d / (Path(fname).stem + ".jpg")

    def make_thumb(src: Path, dst: Path):
        if dst.exists():
            return
        try:
            img = Image.open(src).convert("RGB")
            img.thumbnail((THUMB, THUMB), Image.LANCZOS)
            img.save(dst, "JPEG", quality=82)
        except Exception as e:
            print(f"[thumb-skip] {src}: {e}", file=sys.stderr)

    facesets = scores["facesets"]
    for faceset, items in facesets.items():
        # sort: high score first so borderline cases group at the boundary
        items_sorted = sorted(items, key=lambda x: max(x["mask"], x["sunglasses"]), reverse=True)
        # faceset summary
        n = len(items)
        n_mask = sum(1 for x in items if x["mask"] >= 0.7)
        n_sg = sum(1 for x in items if x["sunglasses"] >= 0.7)
        pct_mask = (100 * n_mask / n) if n else 0
        pct_sg = (100 * n_sg / n) if n else 0
        rows_html.append(f"<h2 id='{faceset}'>{faceset} <small>({n} imgs · mask≥0.7: {n_mask} ({pct_mask:.0f}%) · sunglasses≥0.7: {n_sg} ({pct_sg:.0f}%))</small></h2>")
        rows_html.append("<div class='grid'>")
        src_root = ROOT / faceset
        faces_root = (src_root / "faces") if (src_root / "faces").is_dir() else src_root
        for it in items_sorted:
            src = faces_root / it["file"]
            dst = thumb_path(faceset, it["file"])
            make_thumb(src, dst)
            rel = f"thumbs/{faceset}/{Path(it['file']).stem}.jpg"
            m, s = it["mask"], it["sunglasses"]
            cls_m = "hi" if m >= 0.7 else ("mid" if m >= 0.4 else "lo")
            cls_s = "hi" if s >= 0.7 else ("mid" if s >= 0.4 else "lo")
            rows_html.append(
                f"<div class='cell'>"
                f"<img src='{rel}' loading='lazy' title='{it['file']}'>"
                f"<div class='scores'><span class='{cls_m}'>M {m:.2f}</span> <span class='{cls_s}'>S {s:.2f}</span></div>"
                f"</div>"
            )
        rows_html.append("</div>")

    nav = " · ".join(f"<a href='#{f}'>{f}</a>" for f in facesets)

    html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Occlusion test batch</title>
<style>
body {{ font-family: system-ui, sans-serif; background: #111; color: #eee; padding: 1em; }}
h1 {{ margin-top: 0; }}
h2 {{ margin-top: 1.5em; border-bottom: 1px solid #333; padding-bottom: .25em; }}
small {{ color: #999; font-weight: normal; }}
.grid {{ display: grid; grid-template-columns: repeat(auto-fill, minmax(170px, 1fr)); gap: .5em; }}
.cell {{ background: #1c1c1c; padding: 4px; border-radius: 4px; text-align: center; }}
.cell img {{ max-width: 100%; height: auto; display: block; margin: 0 auto; }}
.scores {{ font-family: monospace; font-size: 11px; padding-top: 4px; }}
.hi {{ color: #ff5050; font-weight: bold; }}
.mid {{ color: #ffb050; }}
.lo {{ color: #5fa05f; }}
.nav {{ position: sticky; top: 0; background: #111; padding: .5em 0; border-bottom: 1px solid #333; }}
a {{ color: #6cf; }}
</style></head>
<body>
<h1>Occlusion scores — {scores['model']}</h1>
<p>Sorted within each faceset by max(mask, sunglasses) descending.
Color: <span class='hi'>≥0.70</span> · <span class='mid'>0.40–0.70</span> · <span class='lo'><0.40</span></p>
<div class='nav'>{nav}</div>
{''.join(rows_html)}
</body></html>"""

    out_html = out_dir / "index.html"
    out_html.write_text(html)
    print(f"[done] {out_html}", file=sys.stderr)


def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
    """Mirror of sort_faces.py:_zip_png_list. Renames PNGs to 0000.png, 0001.png, ..."""
    import zipfile
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
        for i, p in enumerate(pngs):
            zf.write(p, arcname=f"{i:04d}.png")


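# For example (hypothetical filenames): _zip_png_list([Path("0007_a.png"), Path("0012_b.png")], zip_path)
# stores the two files inside the archive as 0000.png and 0001.png, preserving list order.
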
def cmd_apply(args):
    """Prune mask/sunglasses PNGs, quarantine occlusion-dominated facesets,
    re-zip .fsz, update top-level manifest. --dry-run prints the plan only."""
    import shutil

    threshold = args.threshold
    domain_pct = args.domain_pct
    min_survivors = args.min_survivors
    top_n_target = args.top_n

    scores = json.loads(Path(args.scores).read_text())
    master_path = ROOT / "manifest.json"
    master = json.loads(master_path.read_text())
    by_name = {f["name"]: f for f in master.get("facesets", [])}

    masked_dir = ROOT / "_masked"
    thin_dir = ROOT / "_thin"

    plan = []
    for faceset, items in scores["facesets"].items():
        if faceset not in by_name:
            print(f"[warn] {faceset} not in master manifest — skipping", file=sys.stderr)
            continue
        n = len(items)
        flagged_files = sorted(
            it["file"] for it in items
            if it["mask"] >= threshold or it["sunglasses"] >= threshold
        )
        survivors_items = [it for it in items if it["file"] not in set(flagged_files)]
        # preserve quality order from filename (0001.png is highest-rank)
        survivors_files = sorted(it["file"] for it in survivors_items)

        n_mask = sum(1 for it in items if it["mask"] >= threshold)
        n_sg = sum(1 for it in items if it["sunglasses"] >= threshold)
        pct_mask = n_mask / n if n else 0
        pct_sg = n_sg / n if n else 0

        if pct_mask >= domain_pct:
            action, reason = "quarantine_masked", f"mask_pct={pct_mask:.0%}"
        elif pct_sg >= domain_pct:
            action, reason = "quarantine_masked", f"sunglasses_pct={pct_sg:.0%}"
        elif flagged_files and len(survivors_files) < min_survivors:
            # only quarantine-as-thin if pruning is the cause of the drop below threshold;
            # pre-existing small facesets without occlusions are left alone
            action, reason = "quarantine_thin", f"survivors={len(survivors_files)}<{min_survivors}"
        elif flagged_files:
            action, reason = "prune", f"drop {len(flagged_files)}"
        else:
            action, reason = "keep", "clean"

        plan.append({
            "faceset": faceset, "action": action, "reason": reason,
            "n": n, "n_mask": n_mask, "n_sg": n_sg,
            "n_dropped": len(flagged_files), "n_survivors": len(survivors_files),
            "dropped_files": flagged_files,
        })

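    # Worked example of the ladder above (illustrative counts, default thresholds
    # threshold=0.7, domain_pct=0.40, min_survivors=5):
    #   20 images, 9 flagged  -> pct_mask=0.45 >= 0.40        -> quarantine_masked
    #   20 images, 3 flagged  -> 17 survivors, pct < 0.40     -> prune
    #    6 images, 2 flagged  -> 4 survivors < 5, pct < 0.40  -> quarantine_thin
    #   no flagged images     ->                              -> keep
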
    # Summary
    counts = {a: 0 for a in ("keep", "prune", "quarantine_masked", "quarantine_thin")}
    for p in plan:
        counts[p["action"]] += 1
    total_dropped_pngs = sum(p["n_dropped"] for p in plan if p["action"] == "prune")
    total_quarantined_pngs = sum(p["n"] for p in plan if p["action"].startswith("quarantine"))
    print(f"=== plan summary (threshold={threshold} domain_pct={domain_pct} min_survivors={min_survivors}) ===")
    for a, c in counts.items():
        print(f"  {a}: {c}")
    print(f"  PNGs to drop (prune): {total_dropped_pngs}")
    print(f"  PNGs to quarantine (whole): {total_quarantined_pngs}")
    print(f"  facesets in master: {len(master['facesets'])}")
    print(f"  facesets scored: {len(plan)}")

    # Write plan for audit
    plan_path = Path(args.out_plan)
    plan_path.parent.mkdir(parents=True, exist_ok=True)
    plan_path.write_text(json.dumps({
        "thresholds": {"image": threshold, "domain_pct": domain_pct, "min_survivors": min_survivors},
        "counts": counts,
        "totals": {"dropped_pngs": total_dropped_pngs, "quarantined_pngs": total_quarantined_pngs},
        "plan": plan,
    }, indent=2))
    print(f"  plan written to {plan_path}")

    if args.dry_run:
        # pretty list of quarantines
        for p in plan:
            if p["action"].startswith("quarantine"):
                print(f"  [{p['action']:>18s}] {p['faceset']} ({p['reason']}, n={p['n']})")
        return

    # ----- destructive section -----
    masked_dir.mkdir(parents=True, exist_ok=True)
    thin_dir.mkdir(parents=True, exist_ok=True)

    new_facesets = []
    new_masked = list(master.get("masked", []))  # preserve any prior runs
    new_thin = list(master.get("thin_eras", []))

    # build a name -> existing-thin/masked entry index, to update relpath if we re-quarantine
    by_name_thin = {e["name"]: e for e in new_thin}
    by_name_masked = {e["name"]: e for e in new_masked}

    for p in plan:
        entry = dict(by_name[p["faceset"]])  # copy
        fs_dir = ROOT / p["faceset"]
        faces_dir = fs_dir / "faces"

        if p["action"] == "keep":
            new_facesets.append(entry)
            continue

        # prune dropped PNGs first (applies to both prune and quarantine_thin paths)
        if p["dropped_files"]:
            dropped_holding = faces_dir / "_dropped"
            dropped_holding.mkdir(exist_ok=True)
            for fname in p["dropped_files"]:
                src = faces_dir / fname
                if src.exists():
                    shutil.move(str(src), str(dropped_holding / fname))

        if p["action"].startswith("quarantine"):
            target_root = masked_dir if p["action"] == "quarantine_masked" else thin_dir
            target = target_root / p["faceset"]
            if target.exists():
                # idempotency: if a previous run already moved it, skip move
                pass
            else:
                shutil.move(str(fs_dir), str(target))
            entry["occlusion_filter"] = {
                "action": p["action"], "reason": p["reason"],
                "n_input": p["n"], "n_mask": p["n_mask"], "n_sg": p["n_sg"],
                "n_dropped": p["n_dropped"], "n_survivors": p["n_survivors"],
                "threshold": threshold, "domain_pct": domain_pct,
            }
            entry["relpath"] = f"{'_masked' if p['action']=='quarantine_masked' else '_thin'}/{p['faceset']}"
            entry["fsz_top"] = None
            entry["fsz_all"] = None
            if p["action"] == "quarantine_masked":
                entry["masked"] = True
                new_masked.append(entry)
            else:
                entry["thin"] = True
                new_thin.append(entry)
            continue

        # action == prune
        survivor_pngs = sorted([pp for pp in faces_dir.glob("*.png")])
        if not survivor_pngs:
            print(f"[warn] {p['faceset']}: no survivor PNGs after prune", file=sys.stderr)
            new_facesets.append(entry)
            continue

        # re-zip .fsz from survivors in quality order
        top_n_eff = min(top_n_target, len(survivor_pngs))
        top_fsz = fs_dir / f"{p['faceset']}_top{top_n_eff}.fsz"
        all_fsz = fs_dir / f"{p['faceset']}_all.fsz"
        # remove old .fsz files (they may have different top_n in name)
        for old in fs_dir.glob("*.fsz"):
            old.unlink()
        _zip_png_list(survivor_pngs[:top_n_eff], top_fsz)
        if len(survivor_pngs) > top_n_eff:
            _zip_png_list(survivor_pngs, all_fsz)
            entry["fsz_all"] = all_fsz.name
        else:
            entry["fsz_all"] = None
        entry["fsz_top"] = top_fsz.name
        entry["top_n"] = top_n_eff
        entry["exported"] = len(survivor_pngs)
        entry["dropped_occlusion"] = p["n_dropped"]
        entry["occlusion_filter"] = {
            "action": "prune", "n_input": p["n"], "n_mask": p["n_mask"],
            "n_sg": p["n_sg"], "n_dropped": p["n_dropped"], "n_survivors": p["n_survivors"],
            "threshold": threshold,
        }
        new_facesets.append(entry)

    # write updated master manifest
    new_master = dict(master)
    new_master["facesets"] = new_facesets
    new_master["masked"] = new_masked
    new_master["thin_eras"] = new_thin
    new_master["occlusion_filter_run"] = {
        "model": scores.get("model"),
        "threshold": threshold,
        "domain_pct": domain_pct,
        "min_survivors": min_survivors,
        "counts": counts,
        "totals": {"dropped_pngs": total_dropped_pngs, "quarantined_pngs": total_quarantined_pngs},
    }
    tmp = master_path.with_suffix(".tmp.json")
    tmp.write_text(json.dumps(new_master, indent=2))
    tmp.replace(master_path)
    print(f"[done] master manifest updated: {len(new_facesets)} active, "
          f"{len(new_masked)} masked, {len(new_thin)} thin")


def cmd_stage(args):
    """Walk facesets_swap_ready/ and write a queue.json for the Windows clip_worker."""
    only = [s.strip() for s in args.facesets.split(",")] if args.facesets else None
    queue = []
    for fs in iter_facesets(ROOT, only):
        faces = sorted((fs / "faces").glob("*.png")) if (fs / "faces").is_dir() else sorted(fs.glob("*.png"))
        for p in faces:
            queue.append({
                "wsl_path": str(p),
                "win_path": wsl_to_win(str(p)),
                "faceset": fs.name,
                "file": p.name,
            })
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(queue, indent=2))
    print(f"[stage] {len(queue)} png paths -> {out}", file=sys.stderr)
    print(f"[stage] win queue file: {wsl_to_win(str(out))}", file=sys.stderr)


def cmd_merge(args):
    """Ingest worker scores.json into the per-faceset shape that `report` reads."""
    src = json.loads(Path(args.scores).read_text())
    by_faceset: dict[str, list] = {}
    for r in src.get("results", []):
        by_faceset.setdefault(r["faceset"], []).append({
            "file": r["file"],
            "mask": r["mask"],
            "sunglasses": r["sunglasses"],
        })
    # stable ordering: faceset by name, files by name
    out_data = {
        "model": src.get("model", f"{MODEL_NAME}/{PRETRAINED}"),
        "root": str(ROOT),
        "prompts": src.get("prompts", PROMPTS),
        "facesets": {fs: sorted(items, key=lambda x: x["file"]) for fs, items in sorted(by_faceset.items())},
    }
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(out_data, indent=2))
    total = sum(len(v) for v in by_faceset.values())
    print(f"[merge] {total} scores across {len(by_faceset)} facesets -> {out}", file=sys.stderr)


def main():
    ap = argparse.ArgumentParser()
    sub = ap.add_subparsers(dest="cmd", required=True)

    s = sub.add_parser("score", help="WSL CPU scoring (slow but no GPU dependency)")
    s.add_argument("--facesets", default=None, help="comma-separated faceset names; default = all")
    s.add_argument("--sample-per-faceset", type=int, default=0, help="cap PNGs per faceset (0 = all)")
    s.add_argument("--out", required=True)
    s.set_defaults(func=cmd_score)

    st = sub.add_parser("stage", help="Build queue.json for Windows clip_worker.py")
    st.add_argument("--facesets", default=None, help="comma-separated faceset names; default = all")
    st.add_argument("--out", required=True)
    st.set_defaults(func=cmd_stage)

    m = sub.add_parser("merge", help="Convert worker scores.json into per-faceset report format")
    m.add_argument("--scores", required=True, help="worker output (flat list of results)")
    m.add_argument("--out", required=True, help="output path for per-faceset format")
    m.set_defaults(func=cmd_merge)

    r = sub.add_parser("report", help="Render HTML contact sheet from a per-faceset scores.json")
    r.add_argument("--scores", required=True)
    r.add_argument("--out", required=True)
    r.set_defaults(func=cmd_report)

    a = sub.add_parser("apply", help="Prune flagged PNGs, quarantine dominated facesets, re-zip .fsz, update manifest")
    a.add_argument("--scores", required=True, help="per-faceset scores.json (output of `merge` or `score`)")
    a.add_argument("--out-plan", required=True, help="path to write the apply plan json (audit)")
    a.add_argument("--threshold", type=float, default=0.7, help="image-level drop threshold for mask/sunglasses (default 0.7)")
    a.add_argument("--domain-pct", type=float, default=0.40, help="faceset-level quarantine threshold (default 0.40)")
    a.add_argument("--min-survivors", type=int, default=5, help="quarantine to _thin if survivors below this (default 5)")
    a.add_argument("--top-n", type=int, default=30, help="top-N for re-zipped _topN.fsz (default 30)")
    a.add_argument("--dry-run", action="store_true", help="print plan only, no filesystem changes")
    a.set_defaults(func=cmd_apply)

    args = ap.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()
144
work/multiface_worker.py
Normal file
@@ -0,0 +1,144 @@
"""Windows / DirectML multi-face audit worker.
|
||||
|
||||
For every PNG in queue.json, run insightface FaceAnalysis and record how many
|
||||
faces were detected (filtering by det_score>=MIN_DET and face_short>=MIN_PIX).
|
||||
Surfaces the load-bearing roop invariant: each .fsz PNG must hold exactly one
|
||||
face, otherwise the loader's `extract_face_images` appends every detected face
|
||||
into the FaceSet and pollutes the averaged identity embedding.
|
||||
|
||||
CLI:
|
||||
py -3.12 multiface_worker.py <queue.json> <out_results.json> [--limit N]
|
||||
"""
|
||||
|
||||
from __future__ import annotations

import argparse
import json
import os
import sys
import time
from pathlib import Path

import numpy as np
from PIL import Image, ImageOps
from insightface.app import FaceAnalysis

MODEL_ROOT = r"C:\face_embed_venv\models"
MIN_DET = 0.5
MIN_FACE_PIX = 40
FLUSH_EVERY = 200


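# Gate sketch with assumed numbers: a detection whose box is 38x52 px (short side 38 < MIN_FACE_PIX)
# or whose det_score is 0.46 (< MIN_DET) does not count toward face_count; a 120x96 px face at
# det_score 0.83 does. face_count therefore counts only confident, usably sized faces.
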
def load_existing(out_path: Path):
    if not out_path.exists():
        return None, set()
    try:
        d = json.loads(out_path.read_text())
        processed = set(d.get("processed", []))
        return d, processed
    except Exception as e:
        print(f"[warn] could not parse {out_path}: {e}; starting fresh", file=sys.stderr)
        return None, set()


def save_atomic(out_path: Path, data: dict):
    tmp = out_path.with_suffix(".tmp.json")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, out_path)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("queue", type=Path)
    ap.add_argument("out", type=Path)
    ap.add_argument("--limit", type=int, default=None)
    args = ap.parse_args()

    queue = json.loads(args.queue.read_text())
    print(f"[queue] {len(queue)} entries from {args.queue}", flush=True)
    args.out.parent.mkdir(parents=True, exist_ok=True)
    existing, processed = load_existing(args.out)
    if existing:
        print(f"[resume] {len(processed)} already scored", flush=True)
        results = existing.get("results", [])
    else:
        results = []
    pending = [e for e in queue if e["wsl_path"] not in processed]
    if args.limit is not None:
        pending = pending[: args.limit]
    print(f"[pending] {len(pending)} entries", flush=True)
    if not pending:
        print("[done] nothing to do")
        return

print("[load] FaceAnalysis with DmlExecutionProvider", flush=True)
|
||||
app = FaceAnalysis(
|
||||
name="buffalo_l",
|
||||
root=MODEL_ROOT,
|
||||
providers=["DmlExecutionProvider", "CPUExecutionProvider"],
|
||||
)
|
||||
app.prepare(ctx_id=0, det_size=(640, 640))
|
||||
|
||||
n_done = 0
|
||||
n_load_err = 0
|
||||
last_flush = time.time()
|
||||
t_start = time.time()
|
||||
|
||||
def flush():
|
||||
save_atomic(args.out, {
|
||||
"results": results,
|
||||
"processed": sorted(processed),
|
||||
})
|
||||
|
||||
    for entry in pending:
        try:
            with Image.open(entry["win_path"]) as im:
                im = ImageOps.exif_transpose(im)
                im = im.convert("RGB")
                rgb = np.array(im)
                bgr = rgb[:, :, ::-1].copy()
        except Exception as e:
            n_load_err += 1
            results.append({
                "wsl_path": entry["wsl_path"], "faceset": entry["faceset"], "file": entry["file"],
                "face_count": -1, "error": "load",
            })
            processed.add(entry["wsl_path"])
            n_done += 1
            continue

        faces = app.get(bgr)
        kept = 0
        for f in faces:
            if float(f.det_score) < MIN_DET:
                continue
            x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
            short = min(max(x2 - x1, 0), max(y2 - y1, 0))
            if short < MIN_FACE_PIX:
                continue
            kept += 1

        results.append({
            "wsl_path": entry["wsl_path"], "faceset": entry["faceset"], "file": entry["file"],
            "face_count": kept,
        })
        processed.add(entry["wsl_path"])
        n_done += 1

        if (n_done % FLUSH_EVERY == 0) or (time.time() - last_flush) > 30.0:
            flush()
            last_flush = time.time()
            elapsed = time.time() - t_start
            rate = n_done / max(0.1, elapsed)
            eta = (len(pending) - n_done) / max(0.1, rate) / 60.0
            print(f"[scan] {n_done}/{len(pending)} rate={rate:.2f} img/s eta={eta:.1f}min "
                  f"load_err={n_load_err}", flush=True)

    flush()
    elapsed = time.time() - t_start
    print(f"[done] {n_done} scored, {n_load_err} load errors, {elapsed:.1f}s "
          f"({n_done/max(0.1,elapsed):.2f} img/s) -> {args.out}", flush=True)


if __name__ == "__main__":
    main()