# face-sets

Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).

## Pipeline

`sort_faces.py` is a single-file CLI with seven subcommands:
| step | what it does |
|-------------|---------|
| embed | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup. |
| cluster | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest. |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`. |
| dedup | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`. |
| extend | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich | Re-detect each cached face to persist landmark_2d_106, landmark_3d_68, pose (pitch/yaw/roll) into the cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |

### Design principles

- **embed is resumable and incremental.** It loads any existing cache at the target path and only hashes / embeds files it has not seen. Atomic flush every 50 new files, so a mid-run crash loses at most ~50 embeddings.
- **Byte-identical duplicates are sha256-grouped at listing time.** The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster` / `refine` / `export-swap`, so each on-disk location is represented.
- **`safe_dst_name` always flattens the absolute path.** This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- **Caches and outputs stay out of git** via `.gitignore`; defaults live under `work/`.
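A minimal, self-contained sketch of what the listing-time grouping amounts to (illustration only; `cmd_embed` in `sort_faces.py` is the authority, and the real cache layout differs):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream-hash a file so large originals never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        while block := fh.read(chunk):
            h.update(block)
    return h.hexdigest()

def group_by_content(src_root: str) -> tuple[list[Path], dict[str, list[str]]]:
    """Return (canonical paths to embed once, path_aliases map) as described above."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for p in Path(src_root).rglob("*"):
        if p.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            groups[sha256_of(p)].append(p)
    canonicals, aliases = [], {}
    for paths in groups.values():
        first, *rest = sorted(paths)
        canonicals.append(first)                          # embedded once
        if rest:
            aliases[str(first)] = [str(r) for r in rest]  # materialized by cluster/refine/export-swap
    return canonicals, aliases
```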

## Typical end-to-end run

```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted

# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"

# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"

# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"

# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"

# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"

# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
### Merging a new source into an existing result

```bash
# Embed the new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"

# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"

# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
### Importing hand-sorted folders as identities

When source folders are already hand-sorted by person (one folder per identity), the clustering path is the wrong tool — the identity is asserted, not inferred. The orchestration script `work/build_folders.py` covers this case:

- For each trusted folder, it filters cache records that fall under it, builds an identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) into every identity centroid within a tight cosine cutoff (default 0.45). A multi-identity photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and merges the new entries into the canonical `facesets_swap_ready/manifest.json` (existing facesets are left untouched).
```bash
# Embed each hand-sorted folder + the mixed bucket; the cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
  python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done

# Bring landmarks/pose + the visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"

# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```

The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.) is the only thing to edit when adding more hand-sorted folders later.
### Splitting an identity by era (age sub-clustering)

Long-running source corpora produce identities that span 10+ years. The 2009 face and the 2024 face of the same person sit in the same cluster (correctly — same identity), but a single averaged embedding pulled from that cluster blurs across ages. For face-swap output that should target a specific period, the identity needs to be split by era *after* the identity is established.

`work/age_split_001.py` is a worked example for `faceset_001` and a template for any other identity. The pipeline is:

- **Probe first** with `work/check_faceset001_age.py` — report the intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and the EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/` (the manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance agglomerative, average linkage).
- **Anchor-based fragment assignment** (not transitive merge — that caused year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments attach to the single nearest anchor only if both the centroid distance ≤ 0.40 AND the dominant EXIF year is within ±5 years. Fragments with no qualifying anchor remain standalone (and end up THIN-tagged downstream).
- **EXIF year per source path** with on-disk caching at `work/cache/age_split_exif.json` — the Windows-mount EXIF read is the slowest step, so re-runs after a parameter tweak are nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`, human-readable `<era>.txt` marker. Eras with < 20 face records also drop a `THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to `facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`), leaving only the substantive era buckets at the top level.

```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py

# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```

For the `faceset_001` run on the 5260-face `nl_full.npz`, this produced 6 substantive era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282) plus 68 thin/fragment buckets quarantined under `_thin/`.
### Discovering new identities in a mixed bucket

A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the hand-sorted case: identities have to be discovered, not asserted, but should not collide with already-known identities or scramble their numbering.

`work/cluster_osrc.py` is the worked example. The pipeline:

- **Filter the cache to the source root**, including any byte-aliased path that resolves under it.
- **Drop already-covered faces** by comparing each candidate to the centroids of the existing canonical facesets at `EXISTING_MATCH_THRESHOLD` (default 0.45 — the same cutoff as `build_folders.py`'s osrc routing). These faces are already routed by `extend` / `build_folders.py` and shouldn't seed new facesets.
- **Cluster the unmatched** at cos-dist 0.55 (matches the `extend` default for the new-cluster phase).
- **Apply `refine`-equivalent gates** per cluster: `face_short`, `blur`, `det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for clusters of size ≥ 4. Keep clusters whose surviving unique-source-path count is ≥ `MIN_FACES`.
- **Number new facesets past the existing maximum** (`START_NNN`), so `faceset_001..NNN` are never disturbed.
- **Synthesize a refine manifest** and run `cmd_export_swap` against it, then move the resulting dirs into `facesets_swap_ready/` and append to the top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance marker.

Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new source — the `cluster_osrc.py` step then operates against the canonical cache and doesn't need `raw_full/` for input:

```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
#    person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
    --refine-out "$OUT/facesets_full"

# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
#    without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run

# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```

For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by existing identities), this produced 6 new facesets (`faceset_020..025`, sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to export-swap's tighter `min_face_short=100` gate).
### Importing identities from a self-hosted Immich library

`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py` together import an Immich library at scale, with the embed step running on a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:
1. **`work/immich_stage.py` (WSL)** — pages every IMAGE asset via `/search/metadata`, fetches each asset's `/faces?id=` to read Immich's own ML-driven bboxes, scales each bbox to original-image coordinates, and prefilters by `face_short ≥ 90`. For survivors it downloads the original, sha256-deduplicates against the canonical `nl_full.npz` and against same-run staged files, and saves to `/mnt/x/src/immich/<user>/<rel>`. Writes a `queue.json` that the embed worker consumes. 8 concurrent worker threads run the full per-asset I/O chain (`/faces` → filter → `/original`), so 8 workers ≈ 8× the serial throughput.
2. **`work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`)** — loads `insightface.FaceAnalysis(buffalo_l)` with the `DmlExecutionProvider` and runs detection + landmarks + recognition over the queue. Produces a `.npz` cache that's bit-identical in schema to what `sort_faces.py:cmd_embed` writes, so the result is directly loadable by `load_cache()`. The cache already includes the post-`enrich` fields (`landmark_2d_106`, `landmark_3d_68`, `pose`) because FaceAnalysis returns them for free. AMD Vega gives ~7.5× real-pipeline speedup over CPU.
3. **`work/cluster_immich.py` (WSL)** — mirrors `cluster_osrc.py`'s shape but reads from `immich_<user>.npz`. Builds existing-identity centroids from every canonical `faceset_NNN/` in `facesets_swap_ready/` (skipping era splits and `_thin/`), drops immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55, applies refine gates, numbers new facesets past the existing maximum, and feeds `cmd_export_swap` via a synthetic manifest.

`work/finalize_immich.sh <user>` chains queue → Windows embed → cache copy-back → cluster_immich, with logging.
The Immich admin API key + base URL come from environment variables:

```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=...   # admin or per-user key
python work/immich_stage.py --user peter --workers 8
bash work/finalize_immich.sh peter
```
For the 2026-04-26 run against `https://fotos.computerliebe.org` (Immich v2.7.2), with the admin API key:

| step | result |
|------|--------|
| stage | 53,842 assets seen, **10,261 staged** (~10 GB), 978 byte-deduped against `nl_full.npz`, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
| Windows DML embed | 19,462 face records + 1 noface in **64.6 min** (2.6 img/s end-to-end) |
| matched existing identities | **8,103 of 19,480 (42%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → **185 emitted** as `faceset_026..264` (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |
A second 2026-04-26 run with **nic's per-user API key** confirmed the expected behavior: 25,777 of nic's IMAGE assets were enumerated (matching her `/server/statistics` count of 25,786, off by 9 ≈ the transient errors that didn't get marked seen), **7,834 staged** (30% face-bearing-with-big-face, denser than peter's 19%), 519 byte-deduped vs `nl_full.npz`, **0 internal byte-duplicates** (cleaner library than peter's 2,976), 54 transient errors.

Embed + cluster on the nic queue:
| step | result |
|------|--------|
| Windows DML embed | 15,627 face records + 1 noface in **59 min** (2.2 img/s end-to-end), 7 load errors |
| matched existing identities | **6,770 of 15,627 (43%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+3,261), faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408) |
| new clusters | 3,787 at threshold 0.55 → 129 surviving refine gates → **95 emitted** as `faceset_265..NNN` (gaps where export-swap's 0.45 outlier gate dropped clusters below the export bar) |
Top-level `facesets_swap_ready/manifest.json` after both Immich runs: **311 substantive facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted + 6 era splits + 6 osrc-discovered + 185 peter-Immich + 95 nic-Immich) + 68 thin_eras under `_thin/`.
`work/immich_stage.py` carries a built-in **outage circuit breaker**: after 12 consecutive HTTP errors it probes Immich; if that probe also fails, the script exits cleanly with code 2, state preserved. This let the nic run survive a mid-stage Immich outage — the script paused, the operator confirmed connectivity was back, and the same command resumed from the saved `state.json` without re-fetching what was already done.
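The shape of that breaker, as a hedged sketch (not the script's actual code; the probe endpoint path is an assumption):

```python
import sys
import requests

CONSECUTIVE_LIMIT = 12   # matches the behavior described above

def fetch_with_breaker(session: requests.Session, urls: list[str], base_url: str) -> None:
    """Stop cleanly (exit code 2) when Immich looks down, instead of burning the queue."""
    consecutive_errors = 0
    for url in urls:
        try:
            session.get(url, timeout=30).raise_for_status()
            consecutive_errors = 0
        except requests.RequestException:
            consecutive_errors += 1
            if consecutive_errors >= CONSECUTIVE_LIMIT:
                try:
                    # Probe the server itself before giving up; endpoint path is an assumption.
                    session.get(f"{base_url}/api/server/ping", timeout=10).raise_for_status()
                    consecutive_errors = 0       # server is fine; errors were per-asset
                except requests.RequestException:
                    # state.json is written incrementally elsewhere, so a re-run resumes here.
                    sys.exit(2)
```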
**Important caveats for Immich v2.7.2**:

- The `userIds` filter on `/search/metadata` is **silently ignored** when the API key is bound to a different user. "Import everything the API key can see" is the semantics you actually get; cross-user isolation is enforced server-side.
- `/server/statistics` reports counts that under-count what `/search/metadata` actually returns (e.g. external-library thumbnail dirs that got indexed because the import path included them). Don't trust the statistics number as a denominator.
- A meaningful fraction of `originalPath`-based assets are *Immich's own thumbnails* (`<library_root>/thumbs/.../-preview.jpeg`) — included if the external library's import path covers the thumbs directory and the exclusion patterns don't list `**/thumbs/**`. For our run, 5,563 of 10,261 staged files were thumbnails. They embed and cluster fine, but the resulting faces are lower-resolution.
## Key defaults

`refine`:

| flag | default | meaning |
|-------------------------|--------:|---------|
| `--initial-threshold` | 0.55 | cosine distance for stage-1 clustering |
| `--merge-threshold` | 0.40 | centroid-level merge of over-split clusters |
| `--outlier-threshold` | 0.55 | drop a face if its cosine distance from the cluster centroid exceeds this (only if cluster ≥ 4) |
| `--min-faces` | 15 | minimum unique images per faceset |
| `--min-short` | 90 | minimum short-edge pixels of the face bbox |
| `--min-blur` | 40.0 | Laplacian-variance blur gate |
| `--min-det-score` | 0.6 | InsightFace detector score gate |
| `--mode` | copy | copy / move / symlink |
`export-swap`:

| flag | default | meaning |
|-------------------------------|--------:|---------|
| `--top-n` | 30 | size of the `<faceset>_topN.fsz` bundle |
| `--outlier-threshold` | 0.45 | tighter than refine; trims the cluster boundary before averaging |
| `--pad-ratio` | 0.5 | padding around the face bbox for the PNG crop |
| `--out-size` | 512 | PNG output is square `out_size × out_size` |
| `--min-face-short` | 100 | export gate; stricter than refine's 90 |
| `--candidates` | off | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` | 0.55 | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score` | 0.40 | composite-quality floor for candidates |
Output lives outside the repo at `/mnt/e/temp_things/fcswp/`.

The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each normalized to `[0, 1]`.
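To make the weighting concrete, a direct transcription of the formula (how each component is normalized to [0, 1] lives in `sort_faces.py` and is not reproduced here):

```python
def composite_quality(frontality: float, det_score: float,
                      landmark_symmetry: float, face_size: float,
                      sharpness: float) -> float:
    """Weighted sum from the README; every input must already be in [0, 1]."""
    return (0.30 * frontality + 0.20 * det_score + 0.20 * landmark_symmetry
            + 0.15 * face_size + 0.15 * sharpness)

# e.g. a near-frontal, sharp, confidently detected but smallish face:
composite_quality(0.9, 0.95, 0.85, 0.6, 0.8)   # -> 0.84
```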
## Post-export corpus maintenance

The `sort_faces.py` pipeline above produces `facesets_swap_ready/`. Four orchestration scripts under `work/` operate on that already-built corpus to clean it up over time, and a fifth prepares target-side video material:

| script | purpose |
|--------|---------|
| `work/filter_occlusions.py` (+ Windows `work/clip_worker.py`) | Drop PNGs of masked / sun-glassed faces using open_clip ViT-L-14/dfn2b_s39b zero-shot scoring. Image-level threshold 0.7; faceset-level quarantine at 40% domain dominance. WSL stages a queue, Windows DML scores, WSL applies. See `docs/analysis/clip-occlusion-filter.md`. |
| `work/consolidate_facesets.py` | Merge duplicate identities (centroid cosine sim ≥ 0.55 with confident ≥ 0.65, **complete-linkage** to defeat single-link chaining). Pulls embeddings from cache, no GPU. See `docs/analysis/identity-consolidation-and-age-extend.md`. |
| `work/age_extend_001.py` | Slot newly-added PNGs into existing era buckets of `faceset_001` (anchor cosine distance ≤ 0.40 AND `\|year_delta\|` ≤ 5). Same anchor-fragment rule as `age_split_001.py`. |
| `work/dedup_optimize.py` (+ Windows `work/multiface_worker.py`) | (a) cross-family SHA256 byte-dedup, (b) within-faceset near-dup at cosine sim ≥ 0.95, (c) multi-face audit (re-detect via insightface, drop PNGs with face_count ≠ 1). Multi-face is the load-bearing roop invariant. See `docs/analysis/dedup-and-roop-optimization.md`. |
| `work/video_target_pipeline.py` (+ Windows `work/video_face_worker.py` + `work/run_video_pipeline.sh` chain) | Target-side preprocessing: scan a folder of videos → PySceneDetect shot-cuts → 2 fps frame sampling → DML face detection + embedding → IoU+embedding tracking → quality-gated segments (yaw ≤ 75°, face ≥ 80 px, det ≥ 0.5, ≥ 70% pass-rate, 1–120 s duration, 2 s cross-track merge gap) → ffmpeg stream-copy into UUID-named clips. Output organized into per-source subfolders. Provenance sidecars are opt-in (`cut --write-sidecar` or `SIDECAR=yes` env var); the full plan is always retained in the per-batch `plan.json`. See `docs/analysis/video-target-preprocessing.md`. |
All four corpus-maintenance scripts operate idempotently and reversibly: dropped PNGs go to `<faceset>/faces/_dropped/`, quarantined whole facesets go to `facesets_swap_ready/_masked/` or `_merged/` (parallel to the existing `_thin/`). The master `manifest.json` partitions entries across `facesets[]`, `masked[]`, `thin_eras[]`, and `merged[]` arrays, plus per-run provenance blocks (`occlusion_filter_run`, `merge_run`, `age_extend_runs`, `dedup_runs`, `multiface_runs`).
## Downstream: roop-unleashed

The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.

Highly recommended at swap time: enable **Select post-processing = GFPGAN** with the **Original/Enhanced image blend ratio = 0.85** (the default is 0.65, which is conservative). See `docs/analysis/facesets-downstream-refinement-evaluation.md` for the full evaluation.
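A simplified stand-in for why a stray second face in a PNG matters; it mirrors the loader behavior described above and in the evaluation doc, and is not roop-unleashed's actual code:

```python
import numpy as np

class Face:
    """Stand-in for InsightFace's Face: normed_embedding is derived on access,
    which is why overwriting `embedding` with an average reaches the swapper."""
    def __init__(self, embedding: np.ndarray):
        self.embedding = embedding

    @property
    def normed_embedding(self) -> np.ndarray:
        return self.embedding / np.linalg.norm(self.embedding)

def average_embeddings(faces: list[Face]) -> Face:
    """Every face loaded from the .fsz (including any extra face re-detected in
    a PNG) is averaged into one identity vector, per the behavior described above."""
    faces[0].embedding = np.mean([f.embedding for f in faces], axis=0)
    return faces[0]
```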
## Layout

```
/opt/face-sets/
├─ README.md                  (this file)
├─ sort_faces.py              (the tool)
├─ docs/
│  └─ analysis/
│     └─ facesets-downstream-refinement-evaluation.md
└─ work/                      (gitignored except force-tracked .py / .sh)
   ├─ build_folders.py        (hand-sorted-folder orchestration)
   ├─ check_faceset001_age.py (age-split readiness probe)
   ├─ age_split_001.py        (age-split orchestration; faceset_001)
   ├─ age_extend_001.py       (extends existing era buckets with new PNGs)
   ├─ cluster_osrc.py         (mixed-bucket identity discovery)
   ├─ immich_stage.py         (Immich library staging, parallel)
   ├─ embed_worker.py         (Windows DML embed worker; C:\face_embed_venv\)
   ├─ cluster_immich.py       (Immich identity discovery + export)
   ├─ finalize_immich.sh      (chains queue → embed → cluster)
   ├─ filter_occlusions.py    (CLIP zero-shot mask + sunglasses filter)
   ├─ clip_worker.py          (Windows DML CLIP worker; C:\clip_dml_venv\)
   ├─ consolidate_facesets.py (duplicate-identity merger; complete-linkage)
   ├─ dedup_optimize.py       (byte + near-dup + multi-face audit driver)
   ├─ multiface_worker.py     (Windows DML multi-face audit worker)
   ├─ video_target_pipeline.py (video → swappable segment cuts orchestration)
   ├─ video_face_worker.py    (Windows DML per-frame face worker; JSONL append-only)
   ├─ run_video_pipeline.sh   (generic chain driver: scenes → stage → worker → cut)
   ├─ status_video_pipeline.sh (status helper for any video_pipeline log)
   ├─ synthetic_*_manifest.json (per-run synthetic refine manifests)
   ├─ immich/
   │  ├─ users.json           (label -> userId map; gitignored)
   │  └─ <user>/{queue,state,aliases}.json (per-user staging artifacts)
   ├─ cache/
   │  ├─ nl_full.npz          (canonical cache + duplicates.json)
   │  ├─ immich_<user>.npz    (per-user immich embeddings)
   │  └─ age_split_exif.json  (path → EXIF-year cache)
   └─ logs/
      └─ *.log                (every long step writes here)
```
# Age-splitting faceset_001 into era-specific facesets

_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records). Source: `work/age_split_001.py` and `work/check_faceset001_age.py`._

## 1. Why split

`faceset_001` aggregates a single identity across roughly 20 years of source material. The averaged embedding consumed by roop-unleashed therefore mixes features from very different ages. For face-swap output that should target a specific period (e.g. "this person around 2011" or "this person around 2018–19"), the identity needs to be split *after* clustering — the cluster is correctly one identity, but the averaged embedding is the problem.
## 2. Evidence the identity is age-sortable

`work/check_faceset001_age.py` probes `faceset_001` (707 curated faces).

**Pairwise cos-distance histogram** (249,571 pairs):

| range | pairs |
|-------------|------:|
| [0.0, 0.2) | 1,250 |
| [0.2, 0.3) | 11,277 |
| [0.3, 0.4) | 63,920 |
| [0.4, 0.5) | 92,555 |
| [0.5, 0.6) | 63,288 |
| [0.6, 0.7) | 16,048 |
| [0.7, 0.8) | 1,217 |
| [0.8, 1.0) | 16 |

Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse — wide enough to admit non-trivial sub-structure without crossing the inter-identity boundary (which sits well above 0.6 in this dataset).

**Sub-clusters at threshold 0.35** (precomputed cos-dist, average linkage): 156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24]. The top sub-clusters align with distinct EXIF year medians (2011, 2019, 2018, 2011, 2010), so the split is meaningful.
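The histogram above boils down to pairwise cosine distances over L2-normalized embeddings. A minimal numpy sketch (the probe script itself may differ):

```python
import numpy as np

def cosdist_histogram(emb: np.ndarray,
                      edges=(0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0)):
    """emb: (N, 512) embeddings. Returns bin counts plus mean/median/max,
    binned like the table above (707 faces -> 707*706/2 = 249,571 pairs)."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = 1.0 - x @ x.T
    pairs = dist[np.triu_indices(len(x), k=1)]     # unique unordered pairs only
    counts, _ = np.histogram(pairs, bins=np.array(edges))
    return counts, pairs.mean(), np.median(pairs), pairs.max()
```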
## 3. Pipeline

`work/age_split_001.py`:

1. **Seed centroid.** Load the 707 face keys from `facesets_swap_ready/faceset_001/manifest.json`; resolve to cache rows; normalize the mean embedding.
2. **Wide recovery.** Pull every face record under `/mnt/x/src/{nl,lzbkp_red}` from the cache with cos-dist ≤ 0.55 from the seed. The seed is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501 faces from 4,756 candidates.
3. **Quality gate** (mirrors export-swap defaults): `face_short ≥ 100`, `blur ≥ 40.0`, `det_score ≥ 0.6`. Result: 892 → 856 after one re-centroid + tighten pass at 0.50 to absorb the recovery without drift.
4. **Sub-cluster** the survivors at cos-dist 0.35 (precomputed agglomerative, average linkage). 223 raw sub-clusters; top-10 sizes = [127, 97, 55, 42, 40, 25, 17, 14, 13, 11].
5. **EXIF year per source path.** Read `DateTimeOriginal` once per unique path; cache on disk at `work/cache/age_split_exif.json` so re-runs after parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths were dated.
6. **Anchor-based fragment assignment** (replaces the transitive union-find merge that caused observable year drift; see the sketch after this list):
   - sub-clusters with ≥ 20 faces are *anchors* (6 found: dom-years 2011, 2019, 2018, 2011, 2016, 2010);
   - smaller fragments attach to the single nearest anchor *only if* both `cent_dist ≤ 0.40` AND `|dom_year_anchor − dom_year_fragment| ≤ 5`;
   - anchors do not merge with each other (transitive merging produced anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier runs);
   - fragments with no qualifying anchor remain standalone.
7. **Per-era export.** Composite-quality rank, single-face square PNG crops (`pad_ratio=0.5`, `out_size=512`), top-N + `_all` `.fsz` bundles, per-era `manifest.json`, `<label>.txt` marker, `THIN.txt` for buckets < 20 faces.
8. **Top-level manifest merge.** New entries are appended to `facesets_swap_ready/manifest.json`. Operationally the THIN buckets are then moved into `_thin/` and partitioned into a `thin_eras` array (with `relpath: _thin/<name>`) so consumers reading `facesets` see only the substantive entries.
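A sketch of the anchor rule from step 6, using the constant names described in this doc; the real implementation in `age_split_001.py` is the authority:

```python
import numpy as np

ANCHOR_MIN_SIZE = 20
FRAGMENT_CENTROID_MAX = 0.40
FRAGMENT_YEAR_MAX = 5

def assign_fragments(clusters: list[dict]) -> dict[int, int | None]:
    """clusters: dicts with 'centroid' (unit vector), 'size', 'dom_year'.
    Returns fragment index -> anchor index, or None to stay standalone (THIN downstream)."""
    anchors = [i for i, c in enumerate(clusters) if c["size"] >= ANCHOR_MIN_SIZE]
    assignment: dict[int, int | None] = {}
    for i, frag in enumerate(clusters):
        if i in anchors:
            continue                                   # anchors never merge with each other
        best, best_d = None, None
        for a in anchors:
            d = 1.0 - float(np.dot(frag["centroid"], clusters[a]["centroid"]))
            if best_d is None or d < best_d:
                best, best_d = a, d
        ok = (best is not None
              and best_d <= FRAGMENT_CENTROID_MAX
              and frag["dom_year"] is not None
              and clusters[best]["dom_year"] is not None
              and abs(frag["dom_year"] - clusters[best]["dom_year"]) <= FRAGMENT_YEAR_MAX)
        assignment[i] = best if ok else None
    return assignment
```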
## 4. Result

74 era buckets emitted; 6 substantive + 68 thin/standalone fragments.

| era | faces | dom year(s) |
|-------------------|------:|-------------|
| `faceset_001_2010-13` | 282 | 2011 |
| `faceset_001_2018-20` | 129 | 2019 |
| `faceset_001_2014-17` | 125 | 2018 (anchor sub 15 dom_year=2018) |
| `faceset_001_2018-19` | 107 | 2018 |
| `faceset_001_2005-10` | 88 | 2010 |
| `faceset_001_2011` | 43 | 2011 |

Two distinct 2011 anchors and two 2018-area anchors persist by design — embedding-space distance separated them despite year overlap. Era-label collisions are disambiguated with `_v2` suffixes, but only when both anchors landed on the *same* literal label string (none of the substantive six did).

The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic embeddings; they are quarantined into `_thin/` rather than deleted because some are legitimate edge poses / lighting / age extremes that may be useful for narrow targeted swaps.
## 5. Re-running and applying to other identities

- **Re-run with different parameters**: just re-execute `age_split_001.py`. Embeddings are loaded from the cache, EXIF is loaded from `age_split_exif.json`, and only the sub-cluster + export steps re-run. Total runtime ~2 min.
- **Apply to a different identity**: copy `age_split_001.py` to `age_split_NNN.py` and change `FS001`. The `SCAN_ROOTS`, `RECOVERY_THRESHOLD`, `TIGHTEN_THRESHOLD`, `SUBCLUSTER_THRESHOLD`, `ANCHOR_MIN_SIZE`, `FRAGMENT_CENTROID_MAX`, and `FRAGMENT_YEAR_MAX` defaults are tuned for `faceset_001`'s ~707-face curated cluster; smaller identities likely need a lower `ANCHOR_MIN_SIZE`.
- **Always quarantine THIN buckets** afterwards using the same partition pattern (move to `_thin/`, split the top-level manifest into `facesets` + `thin_eras`). The script appends THIN entries to the top-level manifest as if they were full facesets, so the cleanup is a separate step.
# CLIP zero-shot occlusion filter (masks + sunglasses)

_Run date: 2026-04-27. Driver scripts: `work/filter_occlusions.py`, `work/clip_worker.py`._
## 1. Why

`facesets_swap_ready/` ended the Immich import day with 311 substantive facesets and a long tail of identities whose clusters had latched onto *eyewear or mask appearance* instead of identity (covid-era shots, vacation photos with sunglasses dominating the frame). Two failure modes:

1. **Pollution of the averaged identity** — roop's `FaceSet.AverageEmbeddings()` averages every face in the .fsz. A faceset where 40 % of images are sunglassed gives a biased centroid; the swap reproduces sunglass-shaped eye sockets.
2. **Whole-cluster identity drift** — clustering at the embedding level sometimes anchors on the eyewear silhouette rather than the face, producing clusters of "the same sunglasses across multiple people".

A targeted attribute scorer was the cleanest fix.
## 2. Model + prompts

**Model**: `open_clip` `ViT-L-14` / `dfn2b_s39b` (Apple Data Filtering Networks). Best public zero-shot at this size. Loads weights from HF Hub (~890 MB). Bit-identical scores between WSL CPU and Windows DML.

**Prompt design**: per-attribute ensembles of 5–6 positive + 5–6 negative prompts. Positive ensembles are mean-pooled and L2-normalized before softmax.

**Critical bug if forgotten**: CLIP cosine similarities are tiny (0.2–0.3 range). Raw `softmax([sim_pos, sim_neg])` collapses to ~0.5/0.5 on every image. **Multiply by `model.logit_scale.exp()` (~100) before softmax.** Without that scale the entire scorer outputs a uniform 0.5.

**Sunglasses prompt pitfall**: the first prompt set caught faces with sunglasses *pushed up on the forehead* with the same probability as faces with sunglasses *covering the eyes* — CLIP detects "presence of sunglasses in frame", not "eyes occluded". Fixed by putting the false positive into the *negative* class explicitly:
```
positive: "a face with dark sunglasses covering the eyes"
          "a portrait with the eyes hidden behind opaque sunglasses"
          ...
negative: "a face with sunglasses pushed up on the forehead, eyes visible below"
          "a face with sunglasses resting on top of the head, eyes visible"
          "a face wearing clear prescription eyeglasses with visible eyes"
          ...
```

Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead → 0.39. Threshold 0.7 cleanly separates.
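A minimal sketch of the scoring path, including the logit-scale fix. The actual `clip_worker.py` adds batching, the DML device, and resumability; the open_clip pretrained tag used here is an assumption:

```python
import torch
import open_clip

# "dfn2b" is the published ViT-L-14 DFN tag in open_clip; the doc refers to
# dfn2b_s39b, so treat the exact string as an assumption.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="dfn2b")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

POS = ["a face with dark sunglasses covering the eyes",
       "a portrait with the eyes hidden behind opaque sunglasses"]
NEG = ["a face with sunglasses pushed up on the forehead, eyes visible below",
       "a face wearing clear prescription eyeglasses with visible eyes"]

@torch.no_grad()
def occlusion_prob(pil_image) -> float:
    """P(positive) for one face crop, with mean-pooled prompt ensembles."""
    img = preprocess(pil_image).unsqueeze(0)
    img_feat = torch.nn.functional.normalize(model.encode_image(img), dim=-1)

    def ensemble(prompts):
        feats = torch.nn.functional.normalize(model.encode_text(tokenizer(prompts)), dim=-1)
        return torch.nn.functional.normalize(feats.mean(dim=0, keepdim=True), dim=-1)

    sims = torch.cat([img_feat @ ensemble(POS).T, img_feat @ ensemble(NEG).T], dim=-1)
    # The critical step from §2: scale by logit_scale before softmax,
    # otherwise every image collapses to ~0.5/0.5.
    probs = (model.logit_scale.exp() * sims).softmax(dim=-1)
    return float(probs[0, 0])
```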
## 3. Architecture

```
┌──────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/filter_occlusions.py │
│  • stage:  walk facesets/, write queue.json  │
│  • merge:  ingest worker results             │
│  • report: HTML contact sheet                │
│  • apply:  prune + quarantine + re-zip       │
└────────────┬─────────────────────────────────┘
             │ queue.json (paths) via \\wsl.localhost\
             ▼
┌──────────────────────────────────────────────┐
│ Windows C:\clip_dml_venv\                    │
│   /opt/face-sets/work/clip_worker.py         │
│   Python 3.12 + torch 2.4.1 CPU              │
│   + torch-directml 0.2.5 + open_clip_torch   │
│   Reads PNGs from native E:\, writes scores  │
└──────────────────────────────────────────────┘
```

A separate Windows venv (not the existing `C:\face_embed_venv\`) is needed because `torch-directml` brings ~1.5 GB of wheels and version-pinned numpy/pillow that risk breaking the embed_worker venv's `onnxruntime-directml` + `insightface` stack.
## 4. DML throughput surprise

Measured on an AMD Radeon RX Vega:

| model | runtime | throughput | speedup vs WSL CPU |
|------|-------|-----------:|-------------------:|
| ViT-L-14 (CLIP, this filter) | open_clip | **1.43 img/s** | **2.4×** |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |

Only 2.4× because `aten::_native_multi_head_attention` is not implemented in the directml plugin and falls back to CPU. The vision encoder runs on GPU, attention runs on CPU per layer, the two alternating. A silenced UserWarning makes this near-invisible. Workable for a one-shot 73-min corpus run, but the embed_worker pattern (pure ONNX) remains the gold standard for DML.
## 5. Thresholds (validated 2026-04-27 on 6,318 PNGs)

| level | threshold | semantics |
|-------|----------:|-----------|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40 % of images flagged for either attribute | quarantine the whole faceset to `_masked/` |
| min-survivors | < 5 surviving AND something pruned | quarantine to `_thin/` |

The `AND something pruned` guard is essential — without it, naturally small facesets (hand-sorted with ≤ 4 PNGs) get incorrectly quarantined for being small even when they have zero occlusions.
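Rolled into one decision function, the faceset-level rules look roughly like this (a sketch of the table above, not the actual `apply` code):

```python
def faceset_decision(n_images: int, n_flagged: int) -> str:
    """Decide one faceset's fate. n_flagged counts PNGs with P(positive) >= 0.7
    for either attribute; the image-level drop itself happens elsewhere."""
    if n_images and n_flagged / n_images >= 0.40:
        return "quarantine_masked"          # whole faceset -> _masked/
    survivors = n_images - n_flagged
    if n_flagged > 0 and survivors < 5:     # the "AND something pruned" guard
        return "quarantine_thin"            # -> _thin/
    return "prune" if n_flagged else "keep"
```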
## 6. Run results

| action | count | net effect |
|--------|------:|------------|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → `_masked/` (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → `_thin/` |

Net: 311 active → 255 active after the filter run. 763 PNGs quarantined whole-faceset, 183 pruned within survivors. All dropped PNGs are preserved at `<faceset>/faces/_dropped/` for reversibility. The master manifest gained a `masked[]` array parallel to `thin_eras[]`, plus an `occlusion_filter_run` provenance block.
## 7. Known limitations

- **Per-faceset manifests are NOT updated by `apply`** — only the master manifest is. Each faceset's own `<faceset>/manifest.json` retains stale `faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless for `.fsz` consumers (the .fsz is re-zipped from current disk state), but downstream tools reading `faces[]` will see broken references. Discovered later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG warnings before being caught.
## 8. Re-running

```bash
# 1. Stage the queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json

# 2. Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
    work/clip_dml/queue.json work/clip_dml/scores.json --batch 8

# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
    --scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
    --scores work/occlusion_scores.json --out work/occlusion_review

# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json
```
# Corpus dedup + roop-unleashed optimization

_Run date: 2026-04-27. Driver scripts: `work/dedup_optimize.py`, `work/multiface_worker.py`._
After consolidation collapsed duplicate identities and age-extend slotted new PNGs into era buckets, the corpus still carried artifacts that hurt roop's averaged-embedding quality:

- **Burst-photo near-duplicates** within facesets, especially in immich-discovered identities where the source libraries had many similar shots within seconds.
- **Cross-faceset byte-identical PNGs** that escaped consolidation's centroid-similarity matching when individual PNGs matched exactly but the cluster centroids diverged.
- **Multi-face PNGs** that polluted identity averaging because the roop loader appends every detected face per PNG to the FaceSet (the load-bearing invariant — see § 3).

This pipeline runs three independent passes and an optional fourth, all moving dropped PNGs to `<faceset>/faces/_dropped/` for reversibility.
## 1. Cross-family byte-dedup

SHA256-hash every PNG in the active corpus (parallel I/O via `ThreadPoolExecutor(max_workers=16)`, ~17 s for 5,386 PNGs over the `/mnt/e/` Windows mount). Group by hash; for groups with members in multiple identity families, keep the higher-tier copy.

**Family detection**: regex `^(faceset_\d+)(?:_.+)?$` — captures the parent identity. Same family includes parent + era splits (e.g. `faceset_001` + `faceset_001_2010-13`); these are intentional duplications for the era .fsz files and are preserved.

Run results: 20 cross-family hash groups → 24 PNGs dropped. Most cases were small immich identity-cluster errors that consolidation missed because individual PNG embeddings matched but the cluster mean did not.
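A sketch of the pass under one layout assumption (PNGs at `<faceset>/faces/*.png`); `dedup_optimize.py analyze` is the authority:

```python
import hashlib
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

FAMILY_RE = re.compile(r"^(faceset_\d+)(?:_.+)?$")   # parent identity, era splits included

def family_of(png: Path) -> str:
    # layout assumption: <faceset>/faces/<png>
    m = FAMILY_RE.match(png.parent.parent.name)
    return m.group(1) if m else png.parent.parent.name

def cross_family_dupes(pngs: list[Path]) -> list[list[Path]]:
    """Hash in parallel, then keep only hash groups that straddle two identity
    families; parent + era-split duplicates are intentional and preserved."""
    with ThreadPoolExecutor(max_workers=16) as pool:
        digests = list(pool.map(lambda p: hashlib.sha256(p.read_bytes()).hexdigest(), pngs))
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for p, d in zip(pngs, digests):
        by_hash[d].append(p)
    return [g for g in by_hash.values() if len({family_of(p) for p in g}) > 1]
```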
## 2. Within-faceset near-dup at sim ≥ 0.95

Per-faceset pairwise cosine similarity on the cached arcface embeddings. Connected components in the `sim ≥ 0.95` graph. Keep the highest `quality.composite` per component, drop the rest.

**Threshold rationale**: legitimate same-person-different-pose pairs land at 0.5–0.85; ≥ 0.95 means essentially the same shot (burst frames or recompressed dupes). Roop's `FaceSet.AverageEmbeddings()` averages all faces into `faces[0].embedding`; averaging near-identical embeddings ≈ averaging once. Removing them does not lose identity information; it removes a bias weight toward the most-photographed moments.

Run results: 851 groups → **1,225 PNGs dropped** (23 % of corpus). Most affected: `faceset_026` (-132 of 262), `faceset_027` (-107), `faceset_028` (-92), `faceset_030` (-92). All are immich-discovered identities where the source library had burst sequences.
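A minimal sketch of the component logic on one faceset's cached embeddings and composite scores (illustration only):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def near_dupe_drops(emb: np.ndarray, composite: np.ndarray, thr: float = 0.95) -> list[int]:
    """Indices to drop within one faceset: connected components of the
    sim >= thr graph, keeping the highest composite-quality member of each."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    adj = (x @ x.T) >= thr
    np.fill_diagonal(adj, False)
    n_comp, labels = connected_components(csr_matrix(adj), directed=False)
    drops: list[int] = []
    for c in range(n_comp):
        idx = np.flatnonzero(labels == c)
        if len(idx) > 1:
            keep = idx[np.argmax(composite[idx])]
            drops.extend(int(i) for i in idx if i != keep)
    return drops
```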
## 3. Multi-face audit (load-bearing roop invariant)

The roop loader at `roop/ui/tabs/faceswap_tab.py:661–691` runs `extract_face_images(filename, (False, 0))` on every PNG and **appends every detected face** to `face_set.faces`. A multi-face PNG therefore pollutes the averaged identity. The export-swap pipeline drops multi-face crops at creation, but post-pipeline operations (consolidation, age-extend) move PNGs across facesets without re-checking.

**This audit re-detects every PNG** with insightface FaceAnalysis and flags any with `face_count ≠ 1` (filtered by `det_score ≥ 0.5` and `face_short ≥ 40`):
- ≥ 2 faces → the loader will inject extra identities into averaging
- 0 faces → insightface can't find a face in the cropped PNG; useless for roop, would silently fail

Run results: 4,146 PNGs scored, 332 flagged (272 with 2 faces, 9 with 3, 2 with 4, **49 with 0**). 82 facesets affected.
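The worker's core check reduces to counting gated detections per crop. A sketch using the same insightface API, with the gate values from this section and everything else simplified:

```python
import cv2
from insightface.app import FaceAnalysis

# Same stack as embed_worker.py; on WSL/CPU, drop the DML provider.
app = FaceAnalysis(name="buffalo_l",
                   providers=["DmlExecutionProvider", "CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

def face_count(png_path: str) -> int:
    """Count detections surviving the audit gates (det_score >= 0.5, face_short >= 40).
    Anything != 1 gets flagged."""
    img = cv2.imread(png_path)
    if img is None:
        return 0
    n = 0
    for f in app.get(img):
        x1, y1, x2, y2 = f.bbox
        if f.det_score >= 0.5 and min(x2 - x1, y2 - y1) >= 40:
            n += 1
    return n
```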
## 4. DML throughput jump for face crops

The audit reuses the same insightface + onnxruntime-directml stack as `embed_worker.py` but achieves **~19 img/s** on AMD Vega vs embed_worker's 2.6 img/s — same model, same hardware. The difference is input size:

| stage | typical input | DML throughput |
|-------|--------------|---------------:|
| `embed_worker.py` (Immich import) | 1024–4000 px source | 2.6 img/s |
| `multiface_worker.py` (this audit) | 512×512 face crops | **19 img/s** |

Detection on small inputs is fast; recognition on aligned 112×112 inputs is the same cost either way. Implication: **any pipeline operating on already-cropped face PNGs can rely on a roughly 7× higher DML throughput ceiling than full-resolution embedding**.
## 5. Architecture

```
┌──────────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/dedup_optimize.py        │
│  • analyze:         hashes + within-faceset sim  │
│  • apply:           move + re-zip (no GPU)       │
│  • stage_multiface: write queue.json             │
│  • merge_multiface: ingest worker results        │
│  • apply_multiface: move + re-zip                │
│  • report:          HTML audit                   │
└────────────┬─────────────────────────────────────┘
             │ queue.json via \\wsl.localhost\
             ▼
┌──────────────────────────────────────────────────┐
│ Windows C:\face_embed_venv\                      │
│   /opt/face-sets/work/multiface_worker.py        │
│   insightface FaceAnalysis on DmlExecutionProvider │
│   Reads PNGs from native E:\, writes face_count  │
└──────────────────────────────────────────────────┘
```

Reuses the existing `C:\face_embed_venv\` (no new venv needed — same insightface stack as `embed_worker.py`).
## 6. Final corpus state (2026-04-27 night)

| metric | start of day | after occlusion filter | after consolidation | after age-extend | after this dedup + multiface |
|--------|-------------:|-----------------------:|--------------------:|-----------------:|-----------------------------:|
| active facesets | 311 | 255 | 181 | 181 | **181** |
| active PNGs | ~6,440 | 5,386 | 5,386 | 5,400 | **3,849** |
| `_masked/` | 0 | 51 | 51 | 51 | 51 |
| `_thin/` | 68 | 71 | 71 | 71 | 71 |
| `_merged/` | 0 | 0 | 74 | 74 | 74 |

Net reduction at the end of the day: **2,591 PNGs and 130 facesets** removed or quarantined from the active pool. Everything is preserved on disk for reversibility (`<faceset>/faces/_dropped/` for prunes; `_masked/`, `_merged/`, `_thin/` for quarantines).
## 7. Re-running

Run after any new import / consolidation / extend:

```bash
# 1. Byte-dedup + within-faceset near-dup (CPU only)
python work/dedup_optimize.py analyze --out work/dedup_audit/dedup_plan.json
python work/dedup_optimize.py apply --plan work/dedup_audit/dedup_plan.json

# 2. Multi-face audit on Windows DML (resumable)
python work/dedup_optimize.py stage_multiface --out work/dedup_audit/multiface_queue.json
"/mnt/c/face_embed_venv/Scripts/python.exe" work/multiface_worker.py \
    work/dedup_audit/multiface_queue.json work/dedup_audit/multiface_results.json
python work/dedup_optimize.py merge_multiface \
    --results work/dedup_audit/multiface_results.json \
    --out work/dedup_audit/multiface_plan.json
python work/dedup_optimize.py apply_multiface \
    --plan work/dedup_audit/multiface_plan.json

# 3. HTML audit
python work/dedup_optimize.py report \
    --dedup work/dedup_audit/dedup_plan.json \
    --multiface work/dedup_audit/multiface_plan.json \
    --out work/dedup_audit
```
# Facesets → roop-unleashed: downstream refinement evaluation

_Analysis date: 2026-04-23. Author: Peter (with Claude Code)._
## 1. Scope

**Objective.** Evaluate how the existing face-clustering / person-sorted results in `facesets_full/` can be refined so that the downstream project `roop-unleashed` produces the best practical face-swap results.

**Repositories / folders inspected**
- `/opt/face-sets/` — the upstream project (this repo); code, `README.md`, `sort_faces.py`, `refine_manifest.json`, `duplicates.json`.
- `/mnt/e/temp_things/fcswp/nl_sorted/facesets_full/` and `.../raw_full/` — current output.
- `/opt/face-sets/work/cache/nl_full.npz` — the underlying embedding cache used to produce the output.
- `/opt/roop-unleashed/` — the downstream consumer.
- InsightFace 0.7.3 Face class (`/home/peter/face_sort_env/lib/...insightface/app/common.py`) to resolve an ambiguity about embedding averaging.

**Agent usage.** Subagents (Explore, Trend Researcher) were attempted but rejected by the operator. All investigation was done directly via Read, Grep, Bash, WebFetch, WebSearch. `~/.claude/agents/` was enumerated; no face-swap-specific agent exists.

**Web research used.** Targeted WebSearch + WebFetch against the FaceSwapLab FAQ, the FaceFusion docs, and the GitHub roop-unleashed discussion page for faceset creation. The original `C0untFloyd/roop-unleashed` GitHub repo has been disabled by GitHub Staff for a ToS violation, so the code in `/opt/roop-unleashed/` is the authoritative source for this analysis.
## 2. Evidence base

### 2.1 Files read in `facesets` / output
- `sort_faces.py` (full) — current pipeline, esp. `cmd_embed` (embed + sha256 dedup + resume), `cmd_cluster`, `cmd_refine` (centroid-merge + quality gate + outlier rejection), `cmd_extend` (centroid-preserving merge), `cmd_dedup` (byte + visual).
- `refine_manifest.json` at `facesets_full/` — post-extend state; `extended: true`; 12 facesets, params `{initial_threshold: 0.55, merge_threshold: 0.40, outlier_threshold: 0.55, min_faces: 15, min_short: 90, min_blur: 40.0, min_det_score: 0.6}`.
- `nl_full.npz` — 4756 face embeddings + 133 noface records across 2667 unique files; 113 byte-dupe alias paths; 103 byte-groups + 115 visual-dupe groups in `nl_full.duplicates.json`.

### 2.2 Files read in `roop-unleashed`
- `roop/FaceSet.py` — the downstream identity container; `AverageEmbeddings()` at lines 15–20.
- `roop/face_util.py` — `get_face_analyser()` builds InsightFace `buffalo_l` (lines 35–50); `extract_face_images()` at lines 72–144 implements the .fsz unpack + detect path.
- `roop/processors/FaceSwapInsightFace.py` — the actual inswapper swap; `Run()` at lines 42–52 uses `source_face.normed_embedding`.
- `roop/core.py:178–179` — identifies the swap model as `inswapper_128.onnx` (HuggingFace `countfloyd/deepfake` + Codeberg mirror).
- `roop/ProcessMgr.py:626–634` — `process_face` confirms only `face_datas[face_index].faces[0]` is used per identity.
- `ui/tabs/facemgr_tab.py` (full) — how the .fsz is created by users (cv2.imwrite PNGs → zip).
- `ui/tabs/faceswap_tab.py:651–710` — how a .fsz / image source is loaded into `INPUT_FACESETS`; `AverageEmbeddings()` is called iff `len(faces) > 1` at line 690.
- InsightFace `common.py:Face` — `normed_embedding` is a `@property`, so it does re-derive from `self.embedding`; averaging therefore does propagate to the swap (resolves an ambiguity).
### 2.3 External sources
- [FaceSwapLab FAQ](https://glucauze.github.io/sd-webui-faceswaplab/faq/) — practitioner-level guidance on multi-image reference and the checkpoint builder.
- [FaceFusion face-swapper docs](https://docs.facefusion.io/usage/cli-arguments/processors/face-swapper) — model list including `inswapper_128_fp16`, `hyperswap_1a_256`, etc.
- [InsightFace blog: evolution of face swapping](https://www.insightface.ai/blog/the-evolution-of-neural-network-face-swapping-from-deepfakes-to-one-shot-innovation-with-insightface) — inswapper's internal face resolution is 128×128 RGB regardless of input.
- [DeepWiki: inswapper_128](https://deepwiki.com/deepinsight/inswapper-512-live/5.1-first-generation:-inswapper_128) — confirms the encoder-decoder structure; identity is taken from the embedding, target appearance preserved.
- [SDD-FIQA (CVPR 2021)](https://openaccess.thecvf.com/content/CVPR2021/papers/Ou_SDD-FIQA_Unsupervised_Face_Image_Quality_Assessment_With_Similarity_Distribution_Distance_CVPR_2021_paper.pdf) — unsupervised face quality metric; a modern alternative to `det_score + blur`.
## 3. Current upstream output assessment

### 3.1 Structure of `facesets_full/`
- 12 faceset folders (`faceset_001` … `faceset_012`) selected by the refine step (`min_faces=15`).
- Each folder contains the full original images (jpg / jpeg / png) that contributed a face to that cluster, filename-flattened from the absolute path so each file is traceable to its on-disk source.
- One `refine_manifest.json` at the root with per-faceset `{face_count, image_count, alias_count, images[]}`.
- `facesets_full/extended=true` (merged after the lzbkp_red run via `cmd_extend`).

Counts (manifest):

| faceset | images | face records | aliases |
|--------------|-------:|-------------:|--------:|
| faceset_001 | 771 | 1505 | 55 |
| faceset_002 | 238 | 543 | 6 |
| faceset_003 | 206 | 402 | 2 |
| faceset_004 | 103 | 273 | 2 |
| faceset_005 | 68 | 218 | 2 |
| faceset_006 | 51 | 153 | 1 |
| faceset_007 | 89 | 158 | 0 |
| faceset_008 | 44 | 131 | 1 |
| faceset_009 | 43 | 129 | 0 |
| faceset_010 | 25 | 73 | 0 |
| faceset_011 | 25 | 71 | 8 |
| faceset_012 | 17 | 55 | 0 |
### 3.2 Observed strengths
- **Identity grouping is directionally correct.** The top facesets are credibly large and coherent — the raw `raw_full/person_001` is 2.3 GB; refine extracted a 557→771-image faceset on top of that, which is a significant and useful identity pool by any standard.
- **Quality gate is applied.** `min_short=90`, `min_blur=40`, `min_det_score=0.6` are enforced; low-resolution and out-of-focus faces are rejected.
- **Outlier rejection is applied.** Faces with cosine distance > 0.55 from their cluster centroid are dropped (when the cluster has ≥ 4 members).
- **Aliasing preserves provenance.** Every on-disk copy (byte-duplicates between iCloud / manual backups / etc.) is preserved in the folder, so the user can trace every file in a faceset back to its original location.
- **Quality metrics are already captured per face.** `face_short`, `blur` (Laplacian variance), `det_score`, and `bbox` are persisted in the cache — available for any future ranking logic without re-embedding.
### 3.3 Observed weaknesses

Evidence is from direct computation on the cache (`nl_full.npz`) + the manifests.

**W1. face_records / image_count ratio ~2:1 in top facesets.**
- faceset_001: 1505 faces / 771 images = 1.95 faces per image.
- faceset_002: 543 / 238 = 2.28.
- faceset_003: 402 / 206 = 1.95.
- A healthy one-identity set should be ~1:1 (one face per image).
- **Interpretation**: many of these are multi-face photos (group / family shots) where multiple people's faces were placed into the same cluster, or the same image had multiple faces all passing the centroid gate for the same identity. Either way, the current facesets are contaminated with **faces of other people from the same photo**. This is the single biggest downstream risk — see §4.

**W2. Intra-faceset pairwise cosine distance is high.**
- Mean pairwise distance in faceset_001 = 0.835, p90 = 1.047, max = 1.242.
- For reference: same-identity ArcFace cosine distance typically clusters in [0.2, 0.6]. Pairs > 1.0 (negative cosine similarity) cannot be the same person.
- All 12 facesets have means in [0.82, 0.90] and p90 in [1.03, 1.07].
- **Interpretation**: the clusters were built with `linkage=average, threshold=0.55`, which admits chain effects — two points with direct distance > 1.0 can end up in the same cluster via intermediate points. Some of this spread is legitimate (the photo library spans 15+ years — same person at different ages and lighting), some is contamination from W1.

**W3. Near-duplicates inflate the effective size.**
- `nl_full.duplicates.json`: 103 byte-identical groups (the same file copied around) + 115 visual near-duplicate groups (cross-file cosine distance ≤ 0.03 with matching bbox size — likely re-encodes / resizes).
- faceset_001 alone carries 55 aliased paths.
- **Interpretation**: multiple copies of the same photograph contribute the same embedding (or a near-identical one) to the cluster's average. This does not add identity information — at best neutral, at worst it biases the average toward whatever pose/expression appears in the duplicate set.

**W4. The blur / quality gate is lax.**
- Cache-wide `blur` (Laplacian variance) p10/p25/p50 = 19/32/60. The refine gate is 40, so roughly the bottom ~35% of faces drop on blur.
- Per-faceset p10 blur is 36–90 — many included faces are visibly soft. For downstream swap this is acceptable (the identity embedding tolerates modest softness), but tightening would improve the average.

**W5. No pose / frontality filtering.**
- Neither detect-time nor refine uses landmarks / yaw / pitch. A strong profile shot with a clear det_score + size still passes. ArcFace embeddings degrade for |yaw| > ~45°. The current set has no way to prefer frontal faces.

**W6. 583 singletons + 133 noface drop to the floor.**
- `_singletons/` in raw_full has 583 face records (some of which are from legitimate subjects that just didn't cluster). `_noface/` has 133 files (hash-deduped images where detection failed). Some of these could belong to existing facesets with a looser centroid-match threshold.

**W7. The embedding-averaging quirk is latent but OK.**
- Investigated because `FaceSet.AverageEmbeddings()` at `FaceSet.py:15` overwrites `self.faces[0]["embedding"]` while the swapper reads `source_face.normed_embedding`. Confirmed via the InsightFace source that `normed_embedding` is a `@property` that re-normalizes from `embedding`. **So averaging does take effect in the swap.** No action needed; noted to avoid a future misdiagnosis.
### 3.4 Observed risks for downstream use
|
||||
1. **Multi-face photos in a single-identity folder** (W1) → when zipped into `.fsz` and loaded, roop-unleashed will detect and add ALL faces in each PNG to the FaceSet (`faceswap_tab.py:678–687` loops every face returned by `extract_face_images` into the set). This is identity contamination by design of the loader. **Highest-priority risk.**
|
||||
2. **High intra-faceset variance** (W2, W5) → the averaged embedding becomes a diffuse "average face" rather than a crisp identity vector. Downstream swap will produce generic likenesses, with identity drift on hard frames.
|
||||
3. **Near-dupes biasing the mean** (W3) → identity average tilts toward over-represented poses (e.g., ten copies of one iPhone screenshot skew the mean).
|
||||
4. **No per-face ranking** — users have no signal on which images to include / exclude when hand-curating a subset, and no way to pick "best representative" images for thumbnails.
|
||||
|
||||
## 4. Downstream consumer requirements
|
||||
|
||||
### 4.1 What `roop-unleashed` expects
|
||||
|
||||
- **Input format**: a `.fsz` file, which is a zip of `.png` files (one crop per reference face). Created by `ui/tabs/facemgr_tab.py:on_update_clicked()`:
|
||||
```python
|
||||
filename = os.path.join(roop.globals.output_path, f"{index}.png")
|
||||
cv2.imwrite(filename, img)
|
||||
…
|
||||
util.zip(imgnames, finalzip) # imgnames → "faceset.fsz"
|
||||
```
|
||||
Files inside are named `0.png`, `1.png`, … — only indices.
|
||||
- **Load path** (`ui/tabs/faceswap_tab.py:672–691`): unzip, iterate `*.png`, run `extract_face_images(filename, (False, 0))` (note: `extra_padding` default `-1.0` → plain bbox crop, no resize-to-512 dance). For **every** detected face in each PNG, append the InsightFace `Face` object (with its 512-dim embedding) to `face_set.faces`. If the resulting set has more than one face, call `face_set.AverageEmbeddings()`.
|
||||
- **Use at swap time** (`ProcessMgr.py:626–634` + `processors/FaceSwapInsightFace.py:42–52`): only `face_set.faces[0]` is used; its `normed_embedding` is fed to `inswapper_128.onnx`. The other faces in the set only exist to contribute to the averaged embedding.
|
||||
- **Swap backend**: `inswapper_128.onnx` (see `roop/core.py:178`). Internal face working resolution is 128×128 per the InsightFace blog and FaceSwapLab FAQ; identity is carried entirely in the 512-dim embedding.
|
||||
|
||||
### 4.2 Practical requirements derived from the code
|
||||
1. **One identity per `.fsz`.** Anything else corrupts the averaged embedding.
|
||||
2. **One face per PNG inside the `.fsz`.** Any multi-face PNG → every face gets appended to the set, polluting the average. This is enforced only by the PNG's content, not by the loader.
|
||||
3. **Faces must be detectable by InsightFace `buffalo_l` at `det_size=(640,640)` or `(320,320)`.** Extremely small or cut-off faces will fail detection and be silently skipped on load.
|
||||
4. **Input resolution**: there is no explicit requirement, but since inswapper works at 128×128 and InsightFace aligns on 5 landmarks, a face bbox with a short edge of at least ~100–150 px gives a reliable embedding. Below ~60 px, embedding quality drops measurably (literature). Our `min_short=90` gate is close to the lower end of useful.
|
||||
5. **Frontality helps**. ArcFace embeddings are trained with some pose augmentation, so near-frontal (|yaw| ≤ 30°) is ideal; beyond ~45° the embedding starts to drift. Roop applies no compensation for this.
|
||||
6. **Expression / lighting diversity is desirable but not required.** FaceSwapLab explicitly supports "face blending" and notes it "improves the face's representative accuracy" — so a diverse set of the same identity is better than 100 near-duplicate frames.
|
||||
7. **No metadata is consumed.** roop-unleashed ignores everything outside the PNG bytes — filename, EXIF, sidecar JSON are not read.
|
||||
|
||||
### 4.3 Constraints and uncertainties
|
||||
- The `roop-unleashed` GitHub is unreachable (disabled), so the closest thing to community guidance is the in-repo `CLAUDE.md` and the code itself. Treat this code as authoritative.
|
||||
- **Assumption**: the user will either provide the whole `facesets_full/faceset_NNN/` folder to roop-unleashed's Face Management tab (which accepts image files + a folder button — `faceswap_tab.py:644–647`), OR pre-build `.fsz` files. Both paths run through the same loader; the multi-face-per-PNG issue applies equally.
|
||||
|
||||
## 5. Refinement opportunity matrix
|
||||
|
||||
Each opportunity is scored qualitatively. "Automation feasibility" distinguishes fully automated (A), semi-automated with heuristics that need operator review (S), and manual-only (M). "Best place" is where implementation should live.
|
||||
|
||||
| # | Opportunity | Problem addressed | Evidence | Expected downstream benefit | Automation | Risk / downside | Best place | Priority | Confidence |
|
||||
|---|---|---|---|---|---|---|---|---|---|
|
||||
| R1 | **Pre-crop each faceset image to a single face (the identity's own face)** before export | W1 — multi-face photos pollute FaceSet on load | refine_manifest face/image ratio ~2:1 in top clusters; roop loader adds every detected face in a PNG (`faceswap_tab.py:678–687`) | Large. Cleans the single biggest identity-averaging contaminant | A (use the existing bbox per face record in the cache and cv2.crop with padding, save to a new `facesets_swap_ready/` mirror) | Must pick the correct face of multiple detected per image → use the bbox that the upstream cache already matched to this faceset | `facesets` | **P0** | High |
|
||||
| R2 | **Split known multi-face photos so only the identity's own bbox is included**, alternative to full image export | Same as R1, more conservative | Same as R1 | Same as R1 | A | — | `facesets` | P0 | High |
|
||||
| R3 | **Identity tightening — re-run refine with stricter outlier threshold** (e.g. outlier_threshold=0.45) | W2 — intra-cluster spread too wide, chain effects from average-linkage | pairwise distance max > 1.2 in every faceset | Sharpens averaged embedding; removes obviously-wrong faces | A | Some legitimate same-person faces (age / lighting extremes) may be dropped | `facesets` | P0 | High |
|
||||
| R4 | **Drop visual near-duplicates from the set** (keep the highest-quality representative per dupe group) | W3 — duplicate images bias the average | `duplicates.json` has 115 visual groups (2–5 images each) across 4756 faces | Removes silent bias toward over-represented frames; shrinks set size for faster load | A | Deciding which copy to keep is a tiny judgement call (pick highest det_score × face_short × blur) | `facesets` | P1 | High |
|
||||
| R5 | **Per-face composite quality score** (weighted `det_score · blur · face_short · frontality`) and **ranked export / top-N subset** | Need to give roop-unleashed a small, strong averaging pool rather than all 771 images | Cache already has det_score, blur, face_short; frontality = landmark symmetry, computable from `landmark_2d_106` which InsightFace already provides but we don't store | Smaller `.fsz` files, better average embedding, faster UI | A for the score; S for the top-N choice (operator picks N per identity) | Frontality adds a small extra compute step; needs a re-pass over the cache or a re-embed storing landmarks | `facesets` | P1 | Medium |
|
||||
| R6 | **Produce `.fsz` directly** (zip the cropped PNGs with integer filenames) as an export mode | Saves the operator the manual zipping step; guarantees filename correctness | `facemgr_tab.py:242–255` is the reference implementation; trivially reproducible | Zero-friction import into roop-unleashed | A | — | `facesets` | P1 | High |
|
||||
| R7 | **Pose / frontality filter at refine time** using `pose_2d_106` landmark symmetry or yaw estimation from `face.pose` (if available) | W5 — strong profile faces weaken the average | ArcFace literature; no measurement yet in our cache | Tighter identity average, especially for smaller facesets where one profile shot can dominate | A (compute from cached landmarks if we re-embed or store them; otherwise a one-off enrichment pass) | Landmarks not currently persisted in the cache; requires a small re-embed or enrichment command | `facesets` | P2 | Medium |
|
||||
| R8 | **Singleton rescue pass** — re-classify `_singletons/` against final faceset centroids with a looser threshold + quality gate | W6 — some singletons are legit faceset members | 583 singletons with p50 face_short=149, p50 det_score=0.76 — many look usable | Recovers lost identity examples; modest expansion of useful facesets | A | Some true singletons will be mis-assigned; threshold choice matters | `facesets` | P2 | Medium |
|
||||
| R9 | **Modern face-quality scorer** (SDD-FIQA / CR-FIQA) to replace the `det_score × blur` heuristic | More robust quality ranking than hand-rolled heuristics | Literature; current heuristic is crude | Marginal improvement over R5 for the same goal | A but adds a new model dependency | Model weights to download, more CPU cost at ranking time | `facesets` | P3 | Medium |
|
||||
| R10 | **Person-label sidecars** (e.g. `faceset_001/_label.txt` with an operator-provided name) | UX — the 12 facesets are anonymous; operator has to peek to find "mom" | No evidence; improvement to workflow | Operator-quality-of-life; no effect on swap quality | M | — | `facesets` | P3 | Low |
|
||||
| R11 | **Feed multiple source images selection UI in roop-unleashed improvements** (e.g. a "pick best 20 by quality" button on load) | Better use of large `.fsz` files | Not implemented downstream | Improvement happens at consumption time | A | Requires roop-unleashed patch, which is a disabled upstream | `roop-unleashed` | P4 | Low |
|
||||
| R12 | **Face alignment / crop standardization** (e.g. arcface-aligned 512×512 crops in the `.fsz`) | Some marginal consistency gain on detection | roop re-detects anyway on load (`extract_face_images`) so input alignment is discarded | Very small — roop's loader re-detects and re-aligns regardless | A | Extra compute for no practical gain | — (do not do) | **Not recommended** | High |
|
||||
| R13 | **Increase resolution via upscaling of low-res crops** | Make small faces "bigger" | Identity comes from the embedding, not the pixels | None — upscaling with GAN does not add identity info; inswapper reads 128×128 anyway | A | Can introduce synthetic artifacts | — (do not do) | **Not recommended** | High |
|
||||
| R14 | **Destructive reorganization of `facesets_full/` in place** | Simpler final layout | Operator explicitly told us yesterday to preserve existing output | Marginal tidiness | M | Loses the current "full cluster" reference view, which has diagnostic value | — (do not do without explicit go-ahead) | **Not recommended by default** | High |
|
||||
|
||||
## 6. Recommended target state
|
||||
|
||||
Define a new output view, `facesets_swap_ready/`, produced by a new subcommand (e.g. `sort_faces.py export-swap`). Original `facesets_full/` stays intact. Per faceset:
|
||||
|
||||
```
|
||||
facesets_swap_ready/
|
||||
faceset_001/
|
||||
manifest.json # provenance + per-image score + rank
|
||||
previews/ # 4-image contact sheet thumbnail
|
||||
top_20_grid.jpg
|
||||
faces/ # cropped-to-single-face PNGs named "000.png", "001.png", ...
|
||||
000.png # highest-ranked face, single face per PNG, 512x512 padded/aligned
|
||||
001.png
|
||||
...
|
||||
faceset.fsz # zip of faces/*.png — drop-in for roop-unleashed
|
||||
faceset_002/
|
||||
...
|
||||
```
|
||||
|
||||
Key properties:
|
||||
1. **One face per PNG** — each PNG is a crop of a single face (R1/R2), padded to a consistent 512×512 with the identity's bbox centred. Roop-unleashed's loader will re-detect exactly one face per file.
|
||||
2. **Ranked by composite quality** — `faces/000.png` is the best representative; later indices are weaker. Operator can trivially truncate by dropping later files.
|
||||
3. **Configurable top-N** — default `--top-n 30` per faceset with a `--include-all` flag for the current behaviour. 30 is conservative; FaceSwapLab's "face blending" tool (the most analogous public practitioner reference) shows that blending with diverse but consistent images materially helps; 20–40 is a common practitioner range.
|
||||
4. **Near-duplicates dropped** (R4) — one representative per visual-dupe group.
|
||||
5. **Tighter outlier gate** (R3) — outlier_threshold reduced from 0.55 to ~0.45 for this export, keeping the refine defaults on `facesets_full/`.
|
||||
6. **Ready-to-ship `.fsz`** (R6) in each folder.
|
||||
7. **manifest.json per faceset** — cites every source path and score. Lets the operator see *why* a face was kept (or dropped if we add a `_rejected/` sibling).
|
||||
|
||||
This lets the operator test swap quality end-to-end without any roop-unleashed modification, and preserves full fallback to the raw / full results if anything needs re-examination.
|
||||
|
||||
## 7. Recommended next steps
|
||||
|
||||
### 7.1 Quick wins (high value, low effort)
|
||||
1. **R1 — single-face crop export** as part of `export-swap`. Uses bbox already in the cache; zero new models. Delivers the biggest likely swap-quality improvement.
|
||||
2. **R4 — drop visual near-duplicates** inside the export. Uses `duplicates.json` already produced by `cmd_dedup`. Smaller sets, cleaner averages.
|
||||
3. **R5 — composite quality score + rank + top-N**. Uses existing fields (`det_score`, `blur`, `face_short`). Deliver `.fsz` + `faces/` sorted by descending score.
|
||||
4. **R6 — `.fsz` bundle emission** by simply zipping `faces/*.png` with integer names. Trivial given (1)-(3).
|
||||
|
||||
These four together give a clean, drop-in-usable export in one session of work.
|
||||
|
||||
### 7.2 Medium-effort improvements
|
||||
5. **R3 — re-run refine with stricter `outlier_threshold`** (e.g. 0.45) for the export path; keep `facesets_full/` at 0.55 for reference. Requires a re-cluster over existing embeddings — fast (seconds), no re-embed.
|
||||
6. **R7 — pose/frontality filter** using landmarks. Requires either (a) a re-embed pass that persists `landmark_2d_106`, or (b) an enrichment pass that re-loads each image and computes yaw without redoing the full embed. Modest CPU cost; meaningful for small facesets.
|
||||
7. **R8 — singleton rescue** against final centroids. Low code cost; likely yields a handful of additional good images per identity.
|
||||
|
||||
### 7.3 Items requiring operator decision
|
||||
- **Target top-N per faceset** for the export (proposal: 30, override per run). Affects the average-embedding quality trade-off vs. UI load time.
|
||||
- **Whether to name facesets** (R10) by operator — purely workflow.
|
||||
- **Whether `_singletons/` should be retired** or promoted to "uncertain identity" export with a lower-confidence tag.
|
||||
|
||||
### 7.4 Not recommended
|
||||
- **R11** — patching `roop-unleashed` itself. The upstream repo is disabled; touching it introduces fork-maintenance overhead for no proportional gain we can't already achieve upstream in `facesets`.
|
||||
- **R12 / R13** — pre-aligning or up-scaling source crops. Roop re-detects/aligns on load and inswapper caps at 128×128 internally; effort is wasted.
|
||||
- **R14** — destructive reorganization of `facesets_full/`. The operator already told us (yesterday) to preserve existing results; no new evidence supports re-opening that.
|
||||
|
||||
## 8. Open questions
|
||||
|
||||
- **OQ1**. Is the operator willing to have the export step **drop** faces rather than just rank them? R5-top-N drops everything past rank N; if the operator prefers to keep the full set but marked, we should export ranked without truncation and let the user pick in the UI.
|
||||
- **OQ2**. How many `.fsz` files does the operator actually plan to use? If only 3–4 identities will be used in practice, R5 can stay conservative (N=50) without cost. If all 12 are routinely used, leaner is better (N=20).
|
||||
- **OQ3**. Should singletons (R8) be rescued into existing facesets or exported as their own "candidate_NNN/" bucket for manual triage? The safer default is a separate bucket; the operator may prefer direct merge.
|
||||
- **OQ4**. Is frontality-filtering (R7) worth a re-embed, or should we settle for a cheap "bbox aspect ratio" proxy? A proper yaw estimate needs landmarks; a crude proxy (bbox width/height ratio) is free but weaker.
|
||||
- **OQ5**. Is there appetite for adding a modern FIQA model (R9) as a drop-in dependency? It adds ~50 MB download and a small CPU cost per face; benefit over the current heuristic is real but modest.
|
||||
- **OQ6**. For the export, should the operator name (R10) be **required** before an `.fsz` is emitted (forces thought about which identity is which), or optional (pure convenience)?
|
||||
|
||||
---
|
||||
|
||||
_End of evaluation. No code has been changed as part of this analysis._
|
||||
@@ -0,0 +1,170 @@
|
||||
# Identity consolidation + age-bucket extension
|
||||
|
||||
_Run date: 2026-04-27. Driver scripts: `work/consolidate_facesets.py`, `work/age_extend_001.py`._
|
||||
|
||||
After the Immich peter + nic imports added 280 new facesets to a corpus that
|
||||
had ~25 canonical identities, many "new" identities were duplicates of
|
||||
existing household members at lower clustering confidence. Two cooperating
|
||||
passes clean this up: identity consolidation merges duplicates, then
|
||||
age-extend slots newly-merged PNGs into the existing era buckets of
|
||||
`faceset_001`.
|
||||
|
||||
## 1. Identity consolidation
|
||||
|
||||
### 1.1 Approach
|
||||
|
||||
For each active faceset, pull cached arcface embeddings from
|
||||
`work/cache/{nl_full,immich_peter,immich_nic}.npz` keyed by
|
||||
`(source, bbox)` from the per-faceset manifest's `faces[]`. Compute
|
||||
L2-normalized centroid. Pairwise cosine similarity matrix.
|
||||
|
||||
**Tier-based primary selection** (lowest tier number wins, size breaks ties):
|
||||
|
||||
| tier | sources | rationale |
|
||||
|-----:|---------|-----------|
|
||||
| 0 | `faceset_013..019` (hand-sorted) | user's curated labels |
|
||||
| 1 | `faceset_001..012` (auto-clustered) | well-established household |
|
||||
| 2 | `faceset_020..025` (osrc) | mixed-bucket discovery |
|
||||
| 3 | `faceset_026..264` (immich peter) | speculative |
|
||||
| 4 | `faceset_265+` (immich nic) | speculative |
|
||||
|
||||
**Era splits and quarantines excluded** — `faceset_NNN_<era>`, `_masked/`,
|
||||
`_thin/` are skipped during analysis.
|
||||
|
||||
### 1.2 Single-linkage chains catastrophically — complete-linkage required
|
||||
|
||||
First attempt used connected-components on edge ≥ 0.45 → produced a
|
||||
**60-faceset cluster** around `faceset_001` with min within-group sim of
|
||||
**−0.16** (definitely-different people bridged via chains
|
||||
`A↔B↔C` where `A`, `C` are not similar). Bumping to edge ≥ 0.55 still
|
||||
chained (group of 17 with min 0.20).
|
||||
|
||||
Real fix: `scipy.cluster.hierarchy.linkage(method='complete')` then
|
||||
`fcluster(Z, t=1-edge_threshold, criterion='distance')`. Complete-linkage
|
||||
**guarantees** every within-group pair sim ≥ edge threshold. Without this
|
||||
guarantee the report is unusable and the apply step would produce
|
||||
identity-poisoned merges.
|
||||
|
||||
### 1.3 Thresholds + run results
|
||||
|
||||
`edge=0.55`, `confident=0.65` → 48 multi-faceset groups (29 confident, 19
|
||||
uncertain). Max group size 7, all bilateral or small triplets after
|
||||
complete-linkage.
|
||||
|
||||
After applying all 48 (with `--include-uncertain` after visual approval):
|
||||
|
||||
- **74 facesets consumed** (some groups had multiple secondaries:
|
||||
`[10, 45, 135] → faceset_002`; `[113, 96, 178, 109, 110, 286] → faceset_095`;
|
||||
etc.)
|
||||
- Active count 255 → 181
|
||||
- Notable absorptions: `faceset_001` (peter) 707 → 753 PNGs (+ 7, 132, 151);
|
||||
`faceset_002` 209 → 247; `faceset_026` 60 → 262 (+ 168, 146, 325);
|
||||
`faceset_028` → 207
|
||||
- Master manifest gained `merged[]` array (parallel to `thin_eras[]`); each
|
||||
entry has `merged_into` field pointing at the primary
|
||||
|
||||
### 1.4 Apply mechanics
|
||||
|
||||
Combine all PNGs from primary + secondaries, re-rank by existing
|
||||
`quality.composite` desc (no re-enrich), renumber `0001..NNNN`, copy into a
|
||||
fresh staging dir, atomic swap. Move secondary directories to
|
||||
`_merged/<original_name>/` (preserved in full for reversibility). Re-zip
|
||||
`_topN.fsz` and `_all.fsz`.
|
||||
|
||||
The primary's existing per-PNG quality scores are reused — re-ranking does
|
||||
not require re-running `enrich`-equivalent landmarks/pose on the cropped
|
||||
PNGs. The primary's `_dropped/` (from prior occlusion filter) is preserved
|
||||
through the merge.
|
||||
|
||||
## 2. Age extension of faceset_001 era buckets
|
||||
|
||||
### 2.1 Why a follow-on pass
|
||||
|
||||
Consolidation absorbed faceset_007/132/151 into faceset_001 (+46 PNGs).
|
||||
The original `age_split_001.py` had bucketed peter into 6 era anchors
|
||||
(`_2005-10`, `_2010-13`, `_2011`, `_2014-17`, `_2018-19`, `_2018-20`), but
|
||||
those new PNGs had never been seen by age_split. They sat in faceset_001's
|
||||
parent-only set, missing from every era .fsz.
|
||||
|
||||
### 2.2 Era-label pitfall
|
||||
|
||||
The 6 anchor era labels are NOT strict year ranges. They are
|
||||
`Counter(years).most_common(1)`-derived dom-years from the original sub-cluster:
|
||||
|
||||
| label | dom_year | actual span of members |
|
||||
|-------|---------:|-----------------------:|
|
||||
| `_2005-10` | 2010 | 2005–2010 |
|
||||
| `_2010-13` | 2011 | **2007–2024** |
|
||||
| `_2011` | 2011 | 2011 only |
|
||||
| `_2014-17` | 2016 | 2005–2018 |
|
||||
| `_2018-19` | 2018 | 2012–2020 |
|
||||
| `_2018-20` | 2019 | 2014–2022 |
|
||||
|
||||
The clusters are *appearance-anchored*, not year-bounded. Year is a
|
||||
descriptive label. Assignment rule must use dom-year, not member span.
|
||||
|
||||
### 2.3 Algorithm
|
||||
|
||||
For each unbucketed face entry in `faceset_001`'s manifest (50 of 753):
|
||||
|
||||
1. Look up embedding in cache by `(source, bbox)`.
|
||||
2. Look up EXIF year via `work/cache/age_split_exif.json`; fetch on cache miss.
|
||||
3. Find single nearest era anchor by cosine distance to its centroid.
|
||||
4. Accept iff `dist ≤ 0.40` AND `|year − anchor.dom_year| ≤ 5`.
|
||||
These thresholds match `age_split_001.py`'s anchor-fragment rule.
|
||||
5. Anchors are NOT re-centered after absorption (preserves age_split's
|
||||
drift-prevention guarantee).
|
||||
|
||||
### 2.4 Run results
|
||||
|
||||
50 unbucketed → 21 with EXIF year → **14 accepted**:
|
||||
|
||||
| anchor | dom_year | added |
|
||||
|--------|---------:|------:|
|
||||
| `_2005-10` | 2010 | +2 |
|
||||
| `_2010-13` | 2011 | +1 |
|
||||
| `_2014-17` | 2016 | **+9** |
|
||||
| `_2018-20` | 2019 | +2 |
|
||||
|
||||
29 PNGs skipped for missing EXIF year (mostly immich-stripped
|
||||
photos). 7 dist/year-rejected (e.g. two PNGs from 2025 want
|
||||
`_2018-19` but year-delta 7 > 5).
|
||||
|
||||
### 2.5 Reconciliation side effect
|
||||
|
||||
The apply rebuilds each affected era bucket's `faces/` from staging. This
|
||||
incidentally reconciled the per-bucket manifests with disk after the prior
|
||||
occlusion filter run had left era manifests stale at 282/126/132 entries vs
|
||||
~248/125/129 actual files (occlusion filter only updates the master
|
||||
manifest, never per-faceset manifests — see
|
||||
`docs/analysis/clip-occlusion-filter.md` §7). 42 occlusion-dropped era PNGs
|
||||
inside the old `faces/_dropped/` were removed during rebuild. The
|
||||
parent `faceset_001/faces/_dropped/` still has the corpus-level audit; all
|
||||
source images are intact at `/mnt/x/src/`, so the era-level dropped PNGs
|
||||
are regeneratable via `cmd_export_swap`.
|
||||
|
||||
## 3. Re-running
|
||||
|
||||
Always run both passes after any new identity import (Immich, osrc,
|
||||
hand-sorted folder):
|
||||
|
||||
```bash
|
||||
# 1. Find duplicate identities
|
||||
python work/consolidate_facesets.py analyze \
|
||||
--out work/merge_review/candidates.json [--edge 0.55 --confident 0.65]
|
||||
python work/consolidate_facesets.py report \
|
||||
--candidates work/merge_review/candidates.json --out work/merge_review
|
||||
# inspect work/merge_review/index.html
|
||||
python work/consolidate_facesets.py apply \
|
||||
--candidates work/merge_review/candidates.json [--include-uncertain]
|
||||
|
||||
# 2. Slot new faceset_001 PNGs into existing era buckets
|
||||
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
|
||||
python work/age_extend_001.py report \
|
||||
--candidates work/age_extend/candidates.json --out work/age_extend
|
||||
python work/age_extend_001.py apply --candidates work/age_extend/candidates.json
|
||||
```
|
||||
|
||||
Both are idempotent. `consolidate_facesets` skips secondaries already in
|
||||
`_merged/`; `age_extend_001` recomputes anchor centroids + dom-year fresh
|
||||
on every run.
|
||||
@@ -0,0 +1,279 @@
|
||||
# Importing identities from a self-hosted Immich library
|
||||
|
||||
_Run date: 2026-04-26. Target: Immich v2.7.2 at `https://fotos.computerliebe.org`.
|
||||
Driver scripts: `work/immich_stage.py`, `work/embed_worker.py`,
|
||||
`work/cluster_immich.py`, `work/finalize_immich.sh`._
|
||||
|
||||
## 1. Why a split workflow
|
||||
|
||||
InsightFace `buffalo_l` on the WSL CPU runs the full detection + landmarks +
|
||||
recognition stack at ~3–4 faces/second. Re-detecting all 79K Immich photos
|
||||
would have taken ~10–28 days. The available AMD Radeon RX Vega is unusable
|
||||
under WSL (no `/dev/dri/`, no ROCm), but **DirectML on Windows native**
|
||||
runs the same models bit-identically and ~7.5× faster end-to-end. The
|
||||
pipeline therefore splits:
|
||||
|
||||
- **WSL side** (`/opt/face-sets/`) — orchestration: API listing, download,
|
||||
sha256 dedup, file management, clustering, faceset emission.
|
||||
- **Windows side** (`C:\face_embed_venv\`) — the embed step only. A fresh
|
||||
Python 3.12 (installed via `winget install Python.Python.3.12`) with
|
||||
`numpy`, `pillow`, `opencv-python-headless`, `onnxruntime-directml`,
|
||||
`insightface`. Models copied from `/home/peter/.insightface/models/buffalo_l/`
|
||||
to `C:\face_embed_venv\models\buffalo_l\`.
|
||||
|
||||
A 30-iteration synthetic benchmark on Vega:
|
||||
|
||||
| model | DML | CPU | speedup |
|
||||
|-------------|----:|----:|--------:|
|
||||
| `det_10g.onnx` (640×640) | 10.0 ms | 183.5 ms | 18.4× |
|
||||
| `w600k_r50.onnx` (112×112) | 8.2 ms | 90.5 ms | 11.0× |
|
||||
|
||||
End-to-end FaceAnalysis on 5 real Immich-sourced images (excluding the
|
||||
first-call DML JIT warmup): ~7.5× speedup post-warmup. Per-face cosine
|
||||
similarity DML vs CPU was 1.0000 across all 8 detected faces — DML is
|
||||
bit-identical to CPU for arcface inference.
|
||||
|
||||
## 2. Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ WSL /opt/face-sets/work/immich_stage.py │
|
||||
│ ┌──────────────────────────────────────────┐│
|
||||
│ │ ThreadPoolExecutor.map(_fetch_for_asset, ││
|
||||
│ │ list_assets(user)) ││
|
||||
│ │ ─ /faces?id= (Immich, parallel x8) ││
|
||||
│ │ ─ filter face_short >= 90 ││
|
||||
│ │ ─ /assets/.../original (parallel x8) ││
|
||||
│ └──────────────────────────────────────────┘│
|
||||
│ consumer (main thread): │
|
||||
│ sha256 → dedup vs nl_full.npz │
|
||||
│ save to /mnt/x/src/immich/<user>/<rel>/ │
|
||||
│ append to queue.json │
|
||||
└────────────────┬────────────────────────────┘
|
||||
│
|
||||
▼ queue.json (with WSL + Windows paths)
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Windows embed_worker.py (C:\face_embed_venv) │
|
||||
│ insightface.FaceAnalysis( │
|
||||
│ providers=[DmlExecutionProvider, ...]) │
|
||||
│ per image: detection + landmarks + arcface │
|
||||
│ emit cache in sort_faces.py:cmd_embed │
|
||||
│ schema with embeddings + meta + processed │
|
||||
│ + path_aliases + schema=v2 │
|
||||
└────────────────┬────────────────────────────┘
|
||||
│
|
||||
▼ immich_<user>.npz
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ WSL cluster_immich.py │
|
||||
│ build centroids of canonical │
|
||||
│ faceset_NNN/ in facesets_swap_ready/ │
|
||||
│ drop matches at cos-dist <= 0.45 │
|
||||
│ cluster the rest at 0.55 │
|
||||
│ refine gates -> synthetic refine_manifest │
|
||||
│ cmd_export_swap -> facesets_swap_ready/ │
|
||||
│ merge top-level manifest │
|
||||
└─────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
Cache artifacts stay separate (per the architecture choice on this run):
|
||||
each user's results live in their own `immich_<user>.npz`. A future
|
||||
one-shot merge can fold them into `nl_full.npz` if needed; the existing
|
||||
`extend` command would do the right thing once schemas align.
|
||||
|
||||
## 3. Path mapping
|
||||
|
||||
`/mnt/x/` ↔ `X:\`. Cache stores WSL form (matching `nl_full.npz`'s
|
||||
existing convention). `wsl_to_win()` translates for the embed worker
|
||||
which runs natively on Windows.
|
||||
|
||||
`work/cluster_immich.py` always uses the canonical `facesets_swap_ready/`
|
||||
view to build identity centroids — meaning the comparison is against the
|
||||
*current* set of canonical facesets in the swap-ready directory (skipping
|
||||
era splits and `_thin/`), not against the older `facesets_full/` snapshot.
|
||||
|
||||
## 4. Result of the 2026-04-26 run (peter / admin)
|
||||
|
||||
### 4a. Stage
|
||||
|
||||
```
|
||||
total_assets_seen: 53842
|
||||
staged_count: 10261 (~10 GB on /mnt/x/)
|
||||
deduped_against_existing: 978 (sha256 in nl_full.npz already)
|
||||
deduped_against_staged: 2976 (internal byte-dupes inside Immich)
|
||||
skipped_no_big_face: 9539 (Immich detected only sub-90px faces)
|
||||
skipped_no_faces: 29390 (Immich detected zero faces)
|
||||
skipped_download_error: 698 (transient DNS / TLS, not seen-marked)
|
||||
elapsed: ~70 min (6.4 assets/s end-to-end at 8 workers)
|
||||
```
|
||||
|
||||
The 698 transient errors are recoverable on a re-run because
|
||||
`immich_stage.py` does not add them to the `seen` set. Each transient
|
||||
asset would be retried.
|
||||
|
||||
### 4b. Embed (Windows DML)
|
||||
|
||||
```
|
||||
queue: 10261 entries
|
||||
new face records: 19462
|
||||
new noface records: 1
|
||||
load errors: 125 (likely HEIC / unreadable)
|
||||
elapsed: 3878.0s (64.6 min, 2.6 img/s end-to-end)
|
||||
```
|
||||
|
||||
The 2.6 img/s end-to-end includes CIFS-share image load, image decode,
|
||||
DML inference (~50 ms/face), and JSON / NPZ flushing. Pure DML inference
|
||||
is faster; the rest of the pipeline dominates at scale.
|
||||
|
||||
### 4c. Cluster
|
||||
|
||||
```
|
||||
existing canonical centroids: 25
|
||||
faces already covered (cos-dist <= 0.45): 8103/19480 (42%)
|
||||
faceset_001: 1856
|
||||
faceset_002: 2666
|
||||
faceset_003: 670
|
||||
faceset_004: 48
|
||||
faceset_005: 40
|
||||
... (smaller hits to the remaining 20)
|
||||
unmatched faces to cluster: 11377
|
||||
clusters at threshold 0.55: 2534 (top sizes [469, 444, 342, 338, 262, ...])
|
||||
survived refine gates: 239
|
||||
emitted as new facesets: 185 (54 dropped by export-swap's 0.45 outlier)
|
||||
```
|
||||
|
||||
Top-level `facesets_swap_ready/manifest.json` after this run: **216
|
||||
facesets** (up from 31; ~7× growth) + 68 thin_eras under `_thin/`.
|
||||
|
||||
## 4d. Result of the 2026-04-26..27 run (nic, with per-user API key)
|
||||
|
||||
After issuing nic a per-user API key, the same pipeline ran end-to-end
|
||||
with no code changes (only the `IMMICH_API_KEY` env var changed). The
|
||||
run survived one Immich outage mid-stage thanks to the circuit breaker
|
||||
added in `work/immich_stage.py` (12 consecutive HTTP errors → probe →
|
||||
exit 2 with state preserved → resume on same command).
|
||||
|
||||
### Stage
|
||||
|
||||
```
|
||||
total_assets_seen: 25777 (matches /server/statistics 25,786)
|
||||
staged_count: 7834 (30% face-bearing-with-big-face;
|
||||
peter was 19%)
|
||||
deduped_against_existing: 519 (sha256 in nl_full.npz already)
|
||||
deduped_against_staged: 0 (nic's library has zero internal
|
||||
byte-dupes; peter had 2,976)
|
||||
skipped_no_big_face: 725
|
||||
skipped_no_faces: 16695
|
||||
skipped_download_error: 54 (transient; not marked seen ->
|
||||
would be retried on resume)
|
||||
elapsed: ~75 min wall (across two pause/resume sessions
|
||||
bracketing one Immich outage)
|
||||
```
|
||||
|
||||
### Embed (Windows DML)
|
||||
|
||||
```
|
||||
queue: 7834 entries
|
||||
new face records: 15627
|
||||
new noface records: 1
|
||||
load errors: 7
|
||||
elapsed: 3538.9s (59 min, 2.2 img/s end-to-end)
|
||||
```
|
||||
|
||||
### Cluster
|
||||
|
||||
```
|
||||
existing canonical centroids: 25
|
||||
faces already covered (cos-dist <= 0.45): 6770/15627 (43%)
|
||||
faceset_002: 3261 (the dominant family identity)
|
||||
faceset_008: 1461 (cross-match to hand-sorted 'sab')
|
||||
faceset_001: 955
|
||||
faceset_007: 408 (cross-match to hand-sorted 's')
|
||||
faceset_006: 114
|
||||
...
|
||||
unmatched: 8857
|
||||
clusters at threshold 0.55: 3787 (top sizes [165, 134, 106, 99, 92,
|
||||
67, 62, 61, 58, 53])
|
||||
survived refine gates: 129
|
||||
emitted as new facesets: 95 (faceset_265..NNN with gaps)
|
||||
```
|
||||
|
||||
Top-level `facesets_swap_ready/manifest.json` after the nic run: **311
|
||||
substantive facesets** + 68 thin_eras. Two-day cumulative growth:
|
||||
|
||||
| date | event | facesets total |
|
||||
|------|------|------:|
|
||||
| 2026-04-25 | hand-sorted folder import | 19 |
|
||||
| 2026-04-26 morning | osrc + age split + cleanup | 31 |
|
||||
| 2026-04-26 afternoon | Immich peter run | 216 |
|
||||
| 2026-04-27 (overnight) | Immich nic run | 311 |
|
||||
|
||||
## 5. Surprises and caveats
|
||||
|
||||
### 5a. `/search/metadata`'s `userIds` filter is silently ignored (Immich v2.7.2)
|
||||
|
||||
When the admin API key is used, passing `userIds=[<other-user-uuid>]`
|
||||
returns admin's own assets, not the other user's. The filter is
|
||||
silently dropped. Verified by sampling 200 returned items and
|
||||
confirming `ownerId` was admin for all of them.
|
||||
|
||||
To process another user's library, **a separate API key issued by that
|
||||
user is required** — the admin key cannot enumerate cross-user
|
||||
libraries through any documented endpoint we tried. `/timeline/buckets`
|
||||
with a `userId` query parameter returns
|
||||
`Not found or no timeline.read access`.
|
||||
|
||||
### 5b. `/server/statistics` undercounts what the search returns
|
||||
|
||||
`/server/statistics` reported admin = 53,842 photos. Our
|
||||
`/search/metadata` paginated through... **53,842** top-level. So the
|
||||
header agrees with the body in this case. But `/server/statistics` does
|
||||
NOT count items that live under external libraries' import paths —
|
||||
yet `/search/metadata` does include them. For this Immich, two external
|
||||
libraries (`/mnt/media/photos` and `/mnt/media/omv_photos`) are
|
||||
configured but `/libraries` reports `assetCount=0` for both. Yet 80% of
|
||||
our staged paths come from those library import paths. Don't trust
|
||||
statistics-vs-search consistency.
|
||||
|
||||
### 5c. Indexed Immich thumbnails masquerading as assets
|
||||
|
||||
5,563 of our 10,261 staged paths are `<library>/thumbs/.../-preview.jpeg`
|
||||
— Immich's own internally-generated thumbnails got indexed because the
|
||||
external library import path included the thumbs subdirectory and the
|
||||
exclusion patterns didn't list `**/thumbs/**`. They embed and cluster
|
||||
fine but produce lower-resolution face records. The fix on the Immich
|
||||
side is adding `**/thumbs/**` to the exclusion patterns.
|
||||
|
||||
### 5d. Internal byte-duplicates (2,976)
|
||||
|
||||
Many Immich assets are byte-identical to other Immich assets — typically
|
||||
because the same photo was uploaded both from a phone and from a
|
||||
synced cloud folder. sha256 dedup catches all of these on the second
|
||||
download (we still pay the bandwidth, but skip the disk write and
|
||||
embed work). With Immich v2.7.2's own `assets/duplicates` endpoint we
|
||||
could catch this earlier, but it's not currently used.
|
||||
|
||||
## 6. Re-running and applying to other Immich instances
|
||||
|
||||
```bash
|
||||
export IMMICH_URL=https://your-immich.example.com
|
||||
export IMMICH_API_KEY=... # admin or per-user key
|
||||
|
||||
# Optional: populate work/immich/users.json with label -> UUID map.
|
||||
|
||||
# 1. Stage (parallel /faces + downloads, resumable).
|
||||
python work/immich_stage.py --user peter --workers 8
|
||||
|
||||
# 2. End-to-end finalize: copy queue to /mnt/c/, run Windows embed worker,
|
||||
# copy the cache back, run cluster_immich.py.
|
||||
bash work/finalize_immich.sh peter
|
||||
```
|
||||
|
||||
For a different Immich instance, the only configuration is the env vars
|
||||
and the `users.json` sidecar. `cluster_immich.py`'s tunables (matching
|
||||
threshold, clustering threshold, refine gates, MIN_FACES) are at the
|
||||
top of the script.
|
||||
|
||||
To process a *second* user's library, issue a per-user API key in the
|
||||
Immich admin UI for that user, set `IMMICH_API_KEY` to that key, and
|
||||
re-run with their `--user <label>`. The admin key cannot impersonate
|
||||
other users via the search API.
|
||||
@@ -0,0 +1,119 @@
|
||||
# Identity discovery in `/mnt/x/src/osrc`
|
||||
|
||||
_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records).
|
||||
Driver script: `work/cluster_osrc.py`._
|
||||
|
||||
## 1. Source
|
||||
|
||||
`/mnt/x/src/osrc/` is a flat mixed-identity bucket: 213 files in root + a
|
||||
`psd/` subfolder with 41 PSD files + a single file in `[Originaldateien]/`.
|
||||
File extensions are 171 jpg + 1 jpeg + 41 psd. PSDs are not embedded
|
||||
(InsightFace's loader doesn't read PSD); the 41 PSDs were skipped, on the
|
||||
working assumption that the same identities are also present in the
|
||||
adjacent JPGs.
|
||||
|
||||
`nl_full.npz` already covered 160 of the 213 files (the remaining 53: 41
|
||||
psd + 12 jpg). Of the 12 missing JPGs, 11 are byte-duplicates of `00843resc.jpg`
|
||||
.. `00855resc.jpg` (same file sizes, paired by sha256) — already aliased
|
||||
in the cache. Only 1 jpg (`19554226_..._n.jpg`) is genuinely uncovered.
|
||||
|
||||
The 160 covered files yielded **336 face records / 10 noface**, with 64
|
||||
single-face / 35 two-face / 19 three-face / 24 four-face / 8 with 5–8
|
||||
faces. Quality is good: median `face_short=116px`, `det_score=0.85`,
|
||||
`blur=244`. Min `face_short=40px` will fail the 90px refine gate.
|
||||
|
||||
## 2. Coverage by existing identities
|
||||
|
||||
Computed cos-dist from each osrc face to the centroids of the canonical
|
||||
`faceset_001..019` (built from each manifest's `(source, bbox)` keys).
|
||||
Median nearest-cos-dist was 0.875 — i.e. the bulk of osrc is **not** the
|
||||
existing 19 identities.
|
||||
|
||||
At cos-dist ≤ 0.45 (matching `build_folders.py`'s `OSRC_THRESHOLD`):
|
||||
|
||||
| existing identity | osrc faces matched |
|
||||
|------------------|------------------:|
|
||||
| faceset_002 | 7 |
|
||||
| faceset_008 | 4 |
|
||||
| faceset_015 | 3 |
|
||||
| faceset_019 | 4 |
|
||||
|
||||
These 18 osrc faces are routed to existing identities by
|
||||
`build_folders.py` and `extend`; they are excluded from the
|
||||
identity-discovery step.
|
||||
|
||||
## 3. Pipeline
|
||||
|
||||
`work/cluster_osrc.py` mirrors `build_folders.py`'s structure (synthesize
|
||||
a refine manifest, hand off to `cmd_export_swap`, relocate, merge
|
||||
top-level manifest) but discovers identities by clustering rather than
|
||||
asserting them by folder.
|
||||
|
||||
1. Filter cache to face records under `/mnt/x/src/osrc` (canonical or
|
||||
byte-aliased path).
|
||||
2. Drop the 18 already-covered faces (cos-dist ≤ 0.45 to any existing
|
||||
identity centroid).
|
||||
3. Cluster the remaining 318 faces among themselves at cos-dist 0.55
|
||||
(matches the `extend` default for new-cluster formation).
|
||||
4. For each cluster, apply `refine`-equivalent per-face gates
|
||||
(`face_short ≥ 90`, `blur ≥ 40`, `det_score ≥ 0.6`); for clusters ≥ 4
|
||||
faces apply outlier rejection at cluster-centroid cos-dist 0.55. Keep
|
||||
clusters whose surviving unique-path count is ≥ 6 (the operator-
|
||||
chosen `MIN_FACES`, lower than the canonical 15 because osrc is small
|
||||
per-identity).
|
||||
5. Number kept clusters `faceset_020+` (past the existing
|
||||
`facesets_swap_ready/` max of 019) ordered by size descending.
|
||||
6. Synthesize a refine manifest and call `cmd_export_swap` on it. Move
|
||||
the emitted dirs into `facesets_swap_ready/`, drop an `osrc.txt`
|
||||
provenance marker, and append the new entries to the top-level
|
||||
`manifest.json` (without disturbing existing `facesets` / `thin_eras`).
|
||||
|
||||
## 4. Result (2026-04-26)
|
||||
|
||||
Phase 1 (clustering, before export-swap):
|
||||
|
||||
- 137 raw clusters at cos-dist 0.55; top sizes [37, 20, 12, 9, 7, 7, 6, 6, 6, 5].
|
||||
- After quality gate: 124 faces dropped (mostly `face_short < 90` from
|
||||
group-photo tertiary subjects).
|
||||
- Outlier rejection: 0 dropped (clusters were tight).
|
||||
- After `min_faces=6`: **7 candidate clusters kept** (sizes 6–28 unique
|
||||
source paths).
|
||||
|
||||
Phase 2 (`cmd_export_swap` with `min_face_short=100`,
|
||||
`outlier_threshold=0.45`):
|
||||
|
||||
| name | input | outlier drop | exported PNGs |
|
||||
|--------------|------:|-------------:|--------------:|
|
||||
| faceset_020 | 71 | 42 | 26 |
|
||||
| faceset_021 | 36 | 21 | 10 |
|
||||
| faceset_022 | 15 | 7 | 8 |
|
||||
| faceset_023 | 19 | 14 | 4 |
|
||||
| faceset_024 | 6 | 0 | 6 |
|
||||
| faceset_025 | 10 | 4 | 6 |
|
||||
| faceset_026 | — | — | 0 (skipped: empty after filter) |
|
||||
|
||||
`faceset_026`'s 6 cluster faces all failed export-swap's tighter
|
||||
`min_face_short=100` gate (vs. cluster's 90); it is not emitted.
|
||||
`faceset_023` is small (4 PNGs) but useful as an averaged identity at
|
||||
that size.
|
||||
|
||||
Top-level `facesets_swap_ready/manifest.json` now: **31 substantive
|
||||
facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted + 6 era splits + 6
|
||||
osrc-discovered) + **68 thin_eras** under `_thin/`.
|
||||
|
||||
## 5. Re-running and applying to other mixed buckets
|
||||
|
||||
- The cache holds osrc embeddings; to re-run with different parameters,
|
||||
edit `cluster_osrc.py`'s config block and re-execute. Cluster discovery
|
||||
+ export-swap is a few minutes total.
|
||||
- For a different mixed-bucket source, copy `cluster_osrc.py` to
|
||||
`cluster_<name>.py` and change `OSRC_DIR`, `OUT_TMP`, `SYNTH_MANIFEST`,
|
||||
`START_NNN`. The exclusion step compares against the *current* contents
|
||||
of `facesets_swap_ready/faceset_NNN/` so it picks up everything emitted
|
||||
by previous discovery / split / hand-sorted runs.
|
||||
- Lowering `MIN_FACES` from 6 to 4 would have admitted ~3 additional
|
||||
marginal clusters at this corpus size; the trade-off is a noisier
|
||||
identity average for small-N facesets.
|
||||
- `extend` should be run before `cluster_osrc.py` so `raw_full/` and
|
||||
`facesets_full/` stay in sync — `cluster_osrc.py` itself only writes
|
||||
to `facesets_swap_ready/`.
|
||||
@@ -0,0 +1,142 @@
|
||||
# Video target preprocessing for roop-unleashed
|
||||
|
||||
_Initial design + first batch run: 2026-04-27. Driver scripts: `work/video_target_pipeline.py`, `work/video_face_worker.py`, `work/run_video_pipeline.sh`._
|
||||
|
||||
Companion to the face-set side of the project: instead of building per-identity .fsz bundles for the *source* of a swap, this pipeline preprocesses the *target* (videos to swap into). Given a folder of video files, it identifies "swappable" segments — continuous shots where a face is detectable, sufficiently visible, and roughly within inswapper_128's working envelope — and cuts them into UUID-named clips ready to feed into roop-unleashed.
|
||||
|
||||
## 1. Why build it
|
||||
|
||||
I checked the obvious open-source projects for an existing implementation:
|
||||
|
||||
- **FaceFusion** ([github.com/facefusion/facefusion](https://github.com/facefusion/facefusion)) — CLI has `run`, `headless-run`, `batch-run`, `job-*`, `force-download`, `benchmark`. No scene-detection or clip-extraction subcommand. Its own guides recommend "split your video manually first."
|
||||
- **roop-unleashed** at `/opt/roop-unleashed/roop/util_ffmpeg.py` — has `cut_video(start_frame, end_frame)` for a manual GUI trim, no detection-driven segmentation.
|
||||
- **Deep-Live-Cam** ([github.com/hacksider/Deep-Live-Cam](https://github.com/hacksider/Deep-Live-Cam)) — real-time / single-shot, no batch preprocessing.
|
||||
- **DeepFaceLab** — `extract_video.bat` dumps every frame between user-supplied trim points; no quality gating.
|
||||
|
||||
Closest prior art for the cut-detection pattern is the two-stage hybrid in [SportSBD MMSys'26](https://dl.acm.org/doi/10.1145/3793853.3799803) (cheap detector for cuts, accurate net for verification), but the actual implementation has to be ours.
|
||||
|
||||
## 2. Pipeline architecture
|
||||
|
||||
```
|
||||
WSL /opt/face-sets/work/ Windows C:\face_embed_venv\
|
||||
───────────────────────────────────── ─────────────────────────────
|
||||
run_video_pipeline.sh (chain driver)
|
||||
│
|
||||
├─ scan (ffprobe metadata)
|
||||
├─ scenes (PySceneDetect AdaptiveDetector, CPU)
|
||||
├─ stage (sampled frame queue.json @ 2 fps)
|
||||
│ │
|
||||
│ ▼
|
||||
│ video_face_worker.py
|
||||
│ insightface FaceAnalysis
|
||||
│ on DmlExecutionProvider
|
||||
│ output: results.jsonl
|
||||
├─ merge (ingest results.jsonl)
|
||||
├─ track (IoU + embedding stitching, ~30 LOC)
|
||||
├─ score (track-level quality gate + cross-track merge)
|
||||
├─ cut (ffmpeg -c copy → per-source subfolders)
|
||||
└─ report (HTML preview)
|
||||
|
||||
Output: <output_dir>/<source_video_stem>/<uuid>.mp4
|
||||
/<uuid>.json (sidecar; opt-in via
|
||||
--write-sidecar)
|
||||
```
|
||||
|
||||
`run_video_pipeline.sh` is parameterized via env vars (`WORK`, `INPUT_DIR`, `OUTPUT_DIR`, `FILTER_FROM`, `SKIP_PATTERN`, `MAX_DUR`, `IDENTITY`, `SIDECAR`) so you can pin a particular batch without editing the script. Sidecars are off by default — the per-batch `plan.json` always carries the full provenance for every clip; the `<uuid>.json` files alongside the clips are redundant and only useful if you need each clip to be self-describing in isolation.
|
||||
|
||||
## 3. Quality signals (matched to inswapper_128's working envelope)
|
||||
|
||||
inswapper_128 is trained near-frontal at 128×128. The score gate uses defaults that admit side profiles (since rich face-sets can absorb non-frontal swap targets):
|
||||
|
||||
| signal | threshold | rationale |
|
||||
|--------|----------:|-----------|
|
||||
| `|yaw|` | ≤ 75° | covers full 3/4 + side profile |
|
||||
| `|pitch|` | ≤ 45° | covers extreme up/down looks |
|
||||
| `face_short` | ≥ 80 px | inswapper resamples to 128; ≥80 still produces clean output |
|
||||
| `det_score` | ≥ 0.5 | matches buffalo_l's MIN_DET; lower = unreliable detection |
|
||||
| track-gate | ≥ 70 % frames pass | binary track filter rather than per-frame |
|
||||
| duration | 1 s ≤ dur ≤ 120 s | below 1s = unusable slivers; above 120s probably contains a missed micro-cut |
|
||||
|
||||
Plus two segment-merging knobs:
|
||||
- `--bridge-gap` (default 3 s) — within a single track, brief pose-failure gaps shorter than this get bridged so single bad frames don't fragment a good run
|
||||
- `--merge-gap` (default 2 s) — across tracks within the same scene, segments closer than this get fused (cross-track merge fires when face detection briefly fails between adjacent good runs)
|
||||
|
||||
The defaults can be tightened (e.g. `--max-yaw 25` for portrait-only) or loosened (e.g. `--max-yaw 90 --merge-gap 5`) without re-running detection — `score` reads the existing `tracks.json`.
|
||||
|
||||
## 4. Performance + the JSONL append-only fix
|
||||
|
||||
This is where the engineering interest is. The first production run on 13 videos / 6.18 h of input went through three failure modes before settling at production speed:
|
||||
|
||||
| attempt | issue | rate observed |
|
||||
|---|---|---:|
|
||||
| 1. Original `cap.set(POS_FRAMES, N)` per sample | OpenCV seeks to nearest keyframe + decodes forward at every sample. Cost grows with depth into the video; on a 60-min H.264 it falls off a cliff. | 1.4 fps → degrading |
|
||||
| 2. Sequential `cap.grab()` from frame 0 | On resume, grab-walking from frame 0 to a deep target is unbounded. | 0.08 fps |
|
||||
| 3. Hybrid: seek-once-per-video + sequential within | Better in principle. But hit a different bug: `flush()` was re-serializing the entire `results.json` (245 MB at this point) every 100 frames or 30 sec. Save dominated wall-clock. | 0.5 fps |
|
||||
| 4. **JSONL append-only** | One result per line. Each flush is O(new records), not O(total records). | **13.77 fps** smoke / 7.57 fps cumulative across the full batch |
|
||||
|
||||
Lesson: when the output is large + grows monotonically + needs frequent checkpointing, *do not* re-serialize the whole structure on each flush. Append-only line-delimited JSON is the right tool. The legacy `results.json` is auto-converted to `.jsonl` on first load (one-time migration), so resumes survive the format switch.
|
||||
|
||||
## 5. Hardware decode/encode on AMD Vega + WSL
|
||||
|
||||
Skipped. Per [Microsoft's WSL D3D12 video acceleration post](https://devblogs.microsoft.com/commandline/d3d12-gpu-video-acceleration-in-the-windows-subsystem-for-linux-now-available/), VAAPI-via-Mesa-D3D12 exists but is fragile on older AMD. AMF on Windows would mean a Windows-side ffmpeg leg, doubling boundary crossings. CPU software decode of 1280×720 H.264 in WSL ffmpeg is faster than realtime, and the bottleneck is buffalo_l detection on DML, not decode.
|
||||
|
||||
For cutting we use `-c copy` stream-copy — no re-encode, hardware codecs are moot.
|
||||
|
||||
## 6. Full corpus run results
|
||||
|
||||
Three runs across the 61-video corpus at `/mnt/x/src/vd/`:
|
||||
|
||||
| | test (3 videos) | first batch (13 videos, 50–62) | rest (45 videos, 02–49 minus test) | **total** |
|
||||
|---|---:|---:|---:|---:|
|
||||
| input duration | 0.6 h | 6.18 h | 12.98 h | **19.76 h** |
|
||||
| sampled frames @ 2 fps | 4,472 | 44,635 | 94,030 | 143,137 |
|
||||
| tracks | 187 | 2,564 | 3,823 | 6,574 |
|
||||
| accepted tracks | 94 (50 %) | 1,193 (47 %) | 1,905 (50 %) | 3,192 (49 %) |
|
||||
| **emitted segments** | **83** | **600** | **1,301** | **1,984** |
|
||||
| cross-track-merged segments | 14 | 254 | 382 | 650 |
|
||||
| accepted content | 13 min | 239 min | 395 min | **647 min (= 10.78 h)** |
|
||||
| acceptance rate by time | 36 % | 64.6 % | 50.7 % | **54.6 %** |
|
||||
| output size | 0.135 GB | 3.63 GB | 4.84 GB | **8.6 GB** |
|
||||
|
||||
Phase timings (rest batch — best representative since it ran fully under JSONL append-only from a fresh start):
|
||||
- scenes: 117 min (PySceneDetect, 45 × ~3 min/video)
|
||||
- stage: instant
|
||||
- worker: 100 min @ **15.78 fps** sustained (vs 7.5 fps for first batch which migrated mid-run)
|
||||
- merge: 90 s
|
||||
- track: 92 s
|
||||
- score: 23 s
|
||||
- cut (1,301 ffmpeg stream-copies): 30 min
|
||||
- report (1,301 thumbs + HTML): 5.5 min
|
||||
- **total wall-clock: 4h16m**
|
||||
|
||||
Across all three runs, **0 worker errors on 143,137 sampled frames**.
|
||||
|
||||
## 7. Re-running
|
||||
|
||||
```bash
|
||||
# choose a per-batch workdir + log
|
||||
WORK=/opt/face-sets/work/video_preprocess_<batch_name> \
|
||||
FILTER_FROM=ct_src_00050.mp4 \
|
||||
bash work/run_video_pipeline.sh > work/logs/video_run_<batch_name>.log 2>&1 &
|
||||
|
||||
# check status anytime
|
||||
bash work/status_video_pipeline.sh work/logs/video_run_<batch_name>.log
|
||||
```
|
||||
|
||||
Skip patterns can exclude already-processed inputs (note that 5-digit numbers need full padding in the regex, e.g. `0005[0-9]` not `005[0-9]`):
|
||||
|
||||
```bash
|
||||
SKIP_PATTERN='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$' \
|
||||
WORK=/opt/face-sets/work/video_preprocess_rest \
|
||||
bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &
|
||||
```
|
||||
|
||||
To also emit per-clip provenance sidecars (off by default):
|
||||
|
||||
```bash
|
||||
SIDECAR=yes \
|
||||
WORK=/opt/face-sets/work/video_preprocess_<batch> \
|
||||
bash work/run_video_pipeline.sh > work/logs/video_run_<batch>.log 2>&1 &
|
||||
```
|
||||
|
||||
`scenes` outputs are cached in the batch's `WORK/scenes/` dir, so re-running the chain after an edit-to-score step doesn't redo detection. The worker is also resumable per `queue_id` — if killed mid-flight, just relaunch.
|
||||
+1200
-83
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,576 @@
|
||||
"""Extend the existing 6 era buckets of faceset_001 by absorbing PNGs that
|
||||
post-date the original age_split run (from consolidation merges, etc.).
|
||||
|
||||
Mirrors the anchor-fragment assignment logic in age_split_001.py:
|
||||
- For each unbucketed face in faceset_001's manifest, find the nearest active
|
||||
era anchor by cosine distance to the anchor's centroid.
|
||||
- Accept the assignment iff dist <= 0.40 AND |year_delta| <= 5
|
||||
(where year_delta = exif_year(face) - dom_year(anchor)).
|
||||
- Undated PNGs are skipped (no assignment).
|
||||
- Anchors are NOT re-centered after absorption (preserves the same drift
|
||||
guarantees as the original age_split).
|
||||
|
||||
CLI:
|
||||
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
|
||||
python work/age_extend_001.py report --candidates ... --out work/age_extend
|
||||
python work/age_extend_001.py apply --candidates ... [--dry-run]
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import shutil
|
||||
import sys
|
||||
import time
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from PIL import Image, ExifTags
|
||||
|
||||
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
|
||||
PARENT = "faceset_001"
|
||||
ACTIVE_ERAS = [
|
||||
"faceset_001_2005-10",
|
||||
"faceset_001_2010-13",
|
||||
"faceset_001_2011",
|
||||
"faceset_001_2014-17",
|
||||
"faceset_001_2018-19",
|
||||
"faceset_001_2018-20",
|
||||
]
|
||||
CACHES = [
|
||||
Path("/opt/face-sets/work/cache/nl_full.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_peter.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_nic.npz"),
|
||||
]
|
||||
EXIF_CACHE = Path("/opt/face-sets/work/cache/age_split_exif.json")
|
||||
|
||||
# anchor-fragment thresholds (mirror age_split_001.py)
|
||||
DIST_MAX = 0.40
|
||||
YEAR_MAX = 5
|
||||
|
||||
|
||||
# ----------------------------- caches -----------------------------
|
||||
|
||||
def load_caches():
|
||||
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
|
||||
alias_map: dict[str, str] = {}
|
||||
for c in CACHES:
|
||||
if not c.exists():
|
||||
print(f"[warn] cache missing: {c}", file=sys.stderr)
|
||||
continue
|
||||
d = np.load(c, allow_pickle=True)
|
||||
emb = d["embeddings"]
|
||||
meta = json.loads(str(d["meta"]))
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if len(face_records) != len(emb):
|
||||
raise SystemExit(f"meta/emb mismatch in {c}: {len(face_records)} vs {len(emb)}")
|
||||
if "path_aliases" in d.files:
|
||||
paliases = json.loads(str(d["path_aliases"]))
|
||||
for canon, alist in paliases.items():
|
||||
alias_map.setdefault(canon, canon)
|
||||
for a in alist:
|
||||
alias_map[a] = canon
|
||||
for i, rec in enumerate(face_records):
|
||||
p = rec["path"]
|
||||
bbox = tuple(int(x) for x in rec["bbox"])
|
||||
v = emb[i].astype(np.float32)
|
||||
n = float(np.linalg.norm(v))
|
||||
if n > 0:
|
||||
v = v / n
|
||||
rec_index[(p, bbox)] = v
|
||||
alias_map.setdefault(p, p)
|
||||
print(f"[cache] indexed {len(rec_index)} face records, {len(alias_map)} aliases", file=sys.stderr)
|
||||
return rec_index, alias_map
|
||||
|
||||
|
||||
def lookup_emb(rec_index, alias_map, src: str, bbox):
|
||||
bbox_t = tuple(int(x) for x in bbox)
|
||||
canon = alias_map.get(src, src)
|
||||
v = rec_index.get((canon, bbox_t))
|
||||
if v is None and canon != src:
|
||||
v = rec_index.get((src, bbox_t))
|
||||
return v
|
||||
|
||||
|
||||
# ----------------------------- exif -----------------------------
|
||||
|
||||
def load_exif_cache():
|
||||
if not EXIF_CACHE.exists():
|
||||
return {}
|
||||
return json.loads(EXIF_CACHE.read_text())
|
||||
|
||||
|
||||
def save_exif_cache(cache):
|
||||
tmp = EXIF_CACHE.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(cache, indent=2))
|
||||
tmp.replace(EXIF_CACHE)
|
||||
|
||||
|
||||
def exif_year(path: Path) -> int | None:
|
||||
try:
|
||||
with Image.open(path) as im:
|
||||
ex = im._getexif()
|
||||
if not ex:
|
||||
return None
|
||||
for tag_id, val in ex.items():
|
||||
tag = ExifTags.TAGS.get(tag_id, tag_id)
|
||||
if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
|
||||
return int(val[:4])
|
||||
except Exception:
|
||||
return None
|
||||
return None
|
||||
|
||||
|
||||
def get_year(src: str, exif_cache) -> int | None:
|
||||
"""Return EXIF year for src, using cache. Mutates cache for new lookups."""
|
||||
if src in exif_cache:
|
||||
return exif_cache[src]
|
||||
p = Path(src)
|
||||
y = exif_year(p) if p.exists() else None
|
||||
exif_cache[src] = y
|
||||
return y
|
||||
|
||||
|
||||
# ----------------------------- analyze -----------------------------
|
||||
|
||||
def cmd_analyze(args):
|
||||
rec_index, alias_map = load_caches()
|
||||
exif_cache = load_exif_cache()
|
||||
exif_cache_dirty = False
|
||||
|
||||
parent_dir = ROOT / PARENT
|
||||
parent_manifest = json.loads((parent_dir / "manifest.json").read_text())
|
||||
parent_faces = parent_manifest.get("faces", [])
|
||||
print(f"[parent] {PARENT}: {len(parent_faces)} face entries", file=sys.stderr)
|
||||
|
||||
# Build "in_bucket" set + each anchor's centroid + dom_year
|
||||
anchors = []
|
||||
in_bucket: set[tuple[str, tuple[int, int, int, int]]] = set()
|
||||
for era in ACTIVE_ERAS:
|
||||
ed = ROOT / era
|
||||
if not ed.is_dir():
|
||||
print(f"[warn] missing era bucket: {era}", file=sys.stderr)
|
||||
continue
|
||||
em = json.loads((ed / "manifest.json").read_text())
|
||||
emb_list = []
|
||||
years = []
|
||||
n_missing_emb = 0
|
||||
for f in em.get("faces", []):
|
||||
src = f.get("source")
|
||||
bbox = f.get("bbox")
|
||||
if not src or not bbox:
|
||||
continue
|
||||
key = (alias_map.get(src, src), tuple(int(x) for x in bbox))
|
||||
in_bucket.add(key)
|
||||
in_bucket.add((src, tuple(int(x) for x in bbox))) # cover both alias and raw
|
||||
v = lookup_emb(rec_index, alias_map, src, bbox)
|
||||
if v is None:
|
||||
n_missing_emb += 1
|
||||
else:
|
||||
emb_list.append(v)
|
||||
y = get_year(src, exif_cache)
|
||||
if y is None:
|
||||
exif_cache_dirty = True
|
||||
else:
|
||||
years.append(y)
|
||||
if src not in exif_cache:
|
||||
exif_cache_dirty = True
|
||||
if not emb_list:
|
||||
print(f"[warn] {era}: no embeddings found, skipping anchor", file=sys.stderr)
|
||||
continue
|
||||
arr = np.stack(emb_list).astype(np.float32)
|
||||
c = arr.mean(axis=0)
|
||||
n = float(np.linalg.norm(c))
|
||||
if n > 0:
|
||||
c = c / n
|
||||
dom_year = Counter(years).most_common(1)[0][0] if years else None
|
||||
anchors.append({
|
||||
"name": era, "centroid": c, "n_faces": len(em.get("faces", [])),
|
||||
"n_emb_used": len(emb_list), "n_emb_missing": n_missing_emb,
|
||||
"dom_year": dom_year,
|
||||
"year_min": min(years) if years else None,
|
||||
"year_max": max(years) if years else None,
|
||||
})
|
||||
print(f"[anchor] {era}: n={len(em.get('faces', []))} emb_used={len(emb_list)} "
|
||||
f"emb_miss={n_missing_emb} dom_year={dom_year} years=[{min(years) if years else '-'}..{max(years) if years else '-'}]",
|
||||
file=sys.stderr)
|
||||
|
||||
# Find unbucketed faces in parent
|
||||
unbucketed = []
|
||||
for f in parent_faces:
|
||||
src = f.get("source")
|
||||
bbox = f.get("bbox")
|
||||
if not src or not bbox:
|
||||
continue
|
||||
bbox_t = tuple(int(x) for x in bbox)
|
||||
key1 = (alias_map.get(src, src), bbox_t)
|
||||
key2 = (src, bbox_t)
|
||||
if key1 in in_bucket or key2 in in_bucket:
|
||||
continue
|
||||
unbucketed.append(f)
|
||||
print(f"[parent] {len(unbucketed)} unbucketed face entries (in {PARENT} but no era bucket)", file=sys.stderr)
|
||||
|
||||
# Score each unbucketed face against every anchor
|
||||
proposals = []
|
||||
skipped_no_emb = 0
|
||||
skipped_no_year = 0
|
||||
for f in unbucketed:
|
||||
src = f["source"]
|
||||
bbox = f["bbox"]
|
||||
v = lookup_emb(rec_index, alias_map, src, bbox)
|
||||
if v is None:
|
||||
skipped_no_emb += 1
|
||||
continue
|
||||
y = get_year(src, exif_cache)
|
||||
if y is None:
|
||||
skipped_no_year += 1
|
||||
exif_cache_dirty = True
|
||||
continue
|
||||
if src not in exif_cache:
|
||||
exif_cache_dirty = True
|
||||
# nearest anchor
|
||||
best = None # (dist, idx)
|
||||
for i, a in enumerate(anchors):
|
||||
d = 1.0 - float(np.dot(a["centroid"], v))
|
||||
if best is None or d < best[0]:
|
||||
best = (d, i)
|
||||
if best is None:
|
||||
continue
|
||||
dist, bidx = best
|
||||
anchor = anchors[bidx]
|
||||
year_delta = abs(y - anchor["dom_year"]) if anchor["dom_year"] is not None else None
|
||||
accept = (dist <= DIST_MAX and year_delta is not None and year_delta <= YEAR_MAX)
|
||||
proposals.append({
|
||||
"png": f["png"],
|
||||
"source": src,
|
||||
"bbox": [int(x) for x in bbox],
|
||||
"year": y,
|
||||
"rank_in_parent": f.get("rank"),
|
||||
"quality_composite": f.get("quality", {}).get("composite"),
|
||||
"quality": f.get("quality", {}),
|
||||
"best_anchor": anchor["name"],
|
||||
"best_anchor_dom_year": anchor["dom_year"],
|
||||
"centroid_dist": round(dist, 4),
|
||||
"year_delta": year_delta,
|
||||
"accept": bool(accept),
|
||||
"all_anchor_dists": {
|
||||
a["name"]: round(1.0 - float(np.dot(a["centroid"], v)), 4) for a in anchors
|
||||
},
|
||||
})
|
||||
|
||||
if exif_cache_dirty:
|
||||
save_exif_cache(exif_cache)
|
||||
print(f"[exif] cache flushed ({len(exif_cache)} entries total)", file=sys.stderr)
|
||||
|
||||
# Summarize
|
||||
accepted = [p for p in proposals if p["accept"]]
|
||||
rejected = [p for p in proposals if not p["accept"]]
|
||||
by_anchor = Counter(p["best_anchor"] for p in accepted)
|
||||
print(f"[summary] unbucketed={len(unbucketed)} scored={len(proposals)} "
|
||||
f"accepted={len(accepted)} rejected={len(rejected)} "
|
||||
f"skipped(no_emb={skipped_no_emb}, no_year={skipped_no_year})", file=sys.stderr)
|
||||
for k, v in by_anchor.most_common():
|
||||
print(f" {k}: +{v}", file=sys.stderr)
|
||||
|
||||
out = {
|
||||
"thresholds": {"dist_max": DIST_MAX, "year_max": YEAR_MAX},
|
||||
"anchors": [
|
||||
{k: v for k, v in a.items() if k != "centroid"}
|
||||
for a in anchors
|
||||
],
|
||||
"n_unbucketed": len(unbucketed),
|
||||
"skipped": {"no_emb": skipped_no_emb, "no_year": skipped_no_year},
|
||||
"proposals": sorted(proposals, key=lambda p: (not p["accept"], p["best_anchor"], -1 * (p["quality_composite"] or 0))),
|
||||
"by_anchor": dict(by_anchor),
|
||||
}
|
||||
op = Path(args.out)
|
||||
op.parent.mkdir(parents=True, exist_ok=True)
|
||||
op.write_text(json.dumps(out, indent=2))
|
||||
print(f"[done] {len(proposals)} proposals -> {op}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- report -----------------------------
|
||||
|
||||
def cmd_report(args):
|
||||
cand = json.loads(Path(args.candidates).read_text())
|
||||
out_dir = Path(args.out)
|
||||
thumbs_dir = out_dir / "thumbs"
|
||||
thumbs_dir.mkdir(parents=True, exist_ok=True)
|
||||
THUMB = 140
|
||||
|
||||
def make_thumb(png_relpath: str) -> str:
|
||||
# png_relpath looks like "faces/0042.png"
|
||||
src = ROOT / PARENT / png_relpath
|
||||
name = Path(png_relpath).stem
|
||||
dst = thumbs_dir / f"{name}.jpg"
|
||||
if not dst.exists():
|
||||
try:
|
||||
img = Image.open(src).convert("RGB")
|
||||
img.thumbnail((THUMB, THUMB), Image.LANCZOS)
|
||||
img.save(dst, "JPEG", quality=82)
|
||||
except Exception as e:
|
||||
print(f"[thumb-skip] {src}: {e}", file=sys.stderr)
|
||||
return ""
|
||||
return f"thumbs/{name}.jpg"
|
||||
|
||||
# group accepted proposals by target anchor
|
||||
by_anchor: dict[str, list] = {}
|
||||
rejected = []
|
||||
for p in cand["proposals"]:
|
||||
if p["accept"]:
|
||||
by_anchor.setdefault(p["best_anchor"], []).append(p)
|
||||
else:
|
||||
rejected.append(p)
|
||||
|
||||
rows = []
|
||||
rows.append("<h1>faceset_001 age extension — review</h1>")
|
||||
rows.append(f"<p>{cand['n_unbucketed']} unbucketed faces in {PARENT}; "
|
||||
f"{sum(len(v) for v in by_anchor.values())} accepted / {len(rejected)} rejected; "
|
||||
f"thresholds dist≤{cand['thresholds']['dist_max']} AND |year_delta|≤{cand['thresholds']['year_max']}.</p>")
|
||||
nav = " · ".join(f"<a href='#{a}'>{a} (+{len(by_anchor[a])})</a>" for a in by_anchor) + " · <a href='#rejected'>rejected</a>"
|
||||
rows.append(f"<div class='nav'>{nav}</div>")
|
||||
|
||||
for anchor_name in ACTIVE_ERAS:
|
||||
if anchor_name not in by_anchor:
|
||||
continue
|
||||
items = by_anchor[anchor_name]
|
||||
anchor_meta = next((a for a in cand["anchors"] if a["name"] == anchor_name), {})
|
||||
rows.append(f"<section id='{anchor_name}' class='grp'>")
|
||||
rows.append(f"<h2>{anchor_name} <small>(dom_year={anchor_meta.get('dom_year')}; "
|
||||
f"existing n={anchor_meta.get('n_faces')}; +{len(items)} new)</small></h2>")
|
||||
rows.append("<div class='cells'>")
|
||||
for p in sorted(items, key=lambda x: (x["centroid_dist"], -1 * (x["quality_composite"] or 0))):
|
||||
thumb = make_thumb(p["png"])
|
||||
cls = "hi" if p["centroid_dist"] <= 0.30 else "mid"
|
||||
rows.append(
|
||||
f"<div class='cell'>"
|
||||
f"<img src='{thumb}' loading='lazy' title='{p['png']}'>"
|
||||
f"<div class='meta'>{p['png']}<br>year {p['year']} (Δ{p['year_delta']})<br>"
|
||||
f"<span class='{cls}'>dist {p['centroid_dist']:.3f}</span></div>"
|
||||
f"</div>"
|
||||
)
|
||||
rows.append("</div></section>")
|
||||
|
||||
if rejected:
|
||||
rows.append("<section id='rejected' class='grp rej'>")
|
||||
rows.append(f"<h2>rejected <small>({len(rejected)} faces don't fit any anchor)</small></h2>")
|
||||
rows.append("<div class='cells'>")
|
||||
for p in sorted(rejected, key=lambda x: x["centroid_dist"])[:200]:
|
||||
thumb = make_thumb(p["png"])
|
||||
why = []
|
||||
if p["centroid_dist"] > cand['thresholds']['dist_max']:
|
||||
why.append(f"dist {p['centroid_dist']:.2f}>{cand['thresholds']['dist_max']}")
|
||||
if p["year_delta"] is None or p["year_delta"] > cand['thresholds']['year_max']:
|
||||
why.append(f"yΔ{p['year_delta']}>{cand['thresholds']['year_max']}")
|
||||
rows.append(
|
||||
f"<div class='cell'>"
|
||||
f"<img src='{thumb}' loading='lazy'>"
|
||||
f"<div class='meta'>{p['png']}<br>year {p['year']} → best {p['best_anchor']}<br>"
|
||||
f"<span class='lo'>{'; '.join(why)}</span></div>"
|
||||
f"</div>"
|
||||
)
|
||||
if len(rejected) > 200:
|
||||
rows.append(f"<p>...{len(rejected)-200} more truncated.</p>")
|
||||
rows.append("</div></section>")
|
||||
|
||||
html = f"""<!doctype html>
|
||||
<html><head><meta charset='utf-8'><title>faceset_001 age extension</title>
|
||||
<style>
|
||||
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
|
||||
h1 {{ margin-top:0; }} h2 {{ margin:0; }}
|
||||
small {{ color:#999; font-weight:normal; }}
|
||||
section.grp {{ background:#1a1a1a; border-radius:6px; padding:12px; margin:12px 0; }}
|
||||
section.grp.rej {{ border-left:4px solid #ff5050; }}
|
||||
.cells {{ display:flex; flex-wrap:wrap; gap:6px; }}
|
||||
.cell {{ background:#222; border-radius:4px; padding:4px; width:160px; font-size:11px; font-family:monospace; text-align:center; }}
|
||||
.cell img {{ height:140px; width:auto; border-radius:3px; }}
|
||||
.meta {{ padding-top:4px; line-height:1.3; }}
|
||||
.hi {{ color:#5fa05f; font-weight:bold; }}
|
||||
.mid {{ color:#ffb050; }}
|
||||
.lo {{ color:#ff5050; }}
|
||||
.nav {{ position:sticky; top:0; background:#111; padding:.5em 0; border-bottom:1px solid #333; font-size:13px; }}
|
||||
a {{ color:#6cf; }}
|
||||
</style></head>
|
||||
<body>
|
||||
{''.join(rows)}
|
||||
</body></html>"""
|
||||
out_html = out_dir / "index.html"
|
||||
out_html.write_text(html)
|
||||
print(f"[done] {out_html}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- apply -----------------------------
|
||||
|
||||
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
|
||||
import zipfile
|
||||
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
|
||||
for i, p in enumerate(pngs):
|
||||
zf.write(p, arcname=f"{i:04d}.png")
|
||||
|
||||
|
||||
def cmd_apply(args):
|
||||
cand = json.loads(Path(args.candidates).read_text())
|
||||
accepted = [p for p in cand["proposals"] if p["accept"]]
|
||||
if args.dry_run:
|
||||
from collections import Counter as C
|
||||
by = C(p["best_anchor"] for p in accepted)
|
||||
print(f"=== dry-run: {len(accepted)} assignments across {len(by)} anchors ===")
|
||||
for k, v in by.most_common():
|
||||
print(f" {k}: +{v}")
|
||||
return
|
||||
|
||||
parent_dir = ROOT / PARENT
|
||||
master_path = ROOT / "manifest.json"
|
||||
master = json.loads(master_path.read_text())
|
||||
facesets_by_name = {f["name"]: f for f in master.get("facesets", [])}
|
||||
|
||||
by_anchor: dict[str, list] = {}
|
||||
for p in accepted:
|
||||
by_anchor.setdefault(p["best_anchor"], []).append(p)
|
||||
|
||||
total_added = 0
|
||||
for anchor_name, props in by_anchor.items():
|
||||
ed = ROOT / anchor_name
|
||||
em_path = ed / "manifest.json"
|
||||
em = json.loads(em_path.read_text())
|
||||
existing = list(em.get("faces", []))
|
||||
|
||||
# gather new entries with their source PNG paths in faceset_001/faces/
|
||||
new_with_src = []
|
||||
for p in props:
|
||||
src_png = parent_dir / p["png"]
|
||||
if not src_png.exists():
|
||||
print(f"[warn] missing parent PNG {src_png}; skip", file=sys.stderr)
|
||||
continue
|
||||
face_entry = {
|
||||
"source": p["source"],
|
||||
"bbox": p["bbox"],
|
||||
"quality": p["quality"],
|
||||
"exif_year": p["year"],
|
||||
"centroid_dist_at_assign": p["centroid_dist"],
|
||||
"year_delta_at_assign": p["year_delta"],
|
||||
"extended_from_parent": True,
|
||||
}
|
||||
new_with_src.append((face_entry, src_png))
|
||||
|
||||
# combine; rank by quality.composite desc (existing entries already have rank,
|
||||
# but we re-rank globally so new entries slot in by quality)
|
||||
combined: list[tuple[dict, Path | None]] = []
|
||||
for f in existing:
|
||||
combined.append((f, None))
|
||||
combined.extend(new_with_src)
|
||||
combined.sort(key=lambda x: -x[0].get("quality", {}).get("composite", 0))
|
||||
|
||||
# stage fresh
|
||||
staging = ed / "_faces_new"
|
||||
if staging.exists():
|
||||
shutil.rmtree(staging)
|
||||
staging.mkdir()
|
||||
new_face_entries = []
|
||||
for new_rank, (face, src_png_or_none) in enumerate(combined, start=1):
|
||||
new_name = f"{new_rank:04d}.png"
|
||||
if src_png_or_none is None:
|
||||
# existing entry: copy from current era bucket faces/
|
||||
old_name = Path(face["png"]).name
|
||||
src = ed / "faces" / old_name
|
||||
if not src.exists():
|
||||
print(f"[warn] {anchor_name}: missing existing PNG {src}; skip", file=sys.stderr)
|
||||
continue
|
||||
shutil.copy2(src, staging / new_name)
|
||||
else:
|
||||
shutil.copy2(src_png_or_none, staging / new_name)
|
||||
face = dict(face)
|
||||
face["rank"] = new_rank
|
||||
face["png"] = f"faces/{new_name}"
|
||||
new_face_entries.append(face)
|
||||
|
||||
# swap dirs
|
||||
old_holding = ed / "_faces_old"
|
||||
if old_holding.exists():
|
||||
shutil.rmtree(old_holding)
|
||||
(ed / "faces").rename(old_holding)
|
||||
staging.rename(ed / "faces")
|
||||
shutil.rmtree(old_holding)
|
||||
|
||||
# re-zip .fsz
|
||||
survivor_pngs = sorted((ed / "faces").glob("*.png"))
|
||||
top_n = em.get("top_n", 30)
|
||||
top_n_eff = min(top_n, len(survivor_pngs))
|
||||
for old in ed.glob("*.fsz"):
|
||||
old.unlink()
|
||||
top_fsz_name = f"{anchor_name}_top{top_n_eff}.fsz"
|
||||
all_fsz_name = f"{anchor_name}_all.fsz"
|
||||
_zip_png_list(survivor_pngs[:top_n_eff], ed / top_fsz_name)
|
||||
if len(survivor_pngs) > top_n_eff:
|
||||
_zip_png_list(survivor_pngs, ed / all_fsz_name)
|
||||
all_fsz_used = all_fsz_name
|
||||
else:
|
||||
all_fsz_used = None
|
||||
|
||||
# update local + master manifests
|
||||
em["faces"] = new_face_entries
|
||||
em["exported"] = len(new_face_entries)
|
||||
em["fsz_top"] = top_fsz_name
|
||||
em["fsz_all"] = all_fsz_used
|
||||
em["top_n"] = top_n_eff
|
||||
em.setdefault("age_extend_history", []).append({
|
||||
"added": len(new_with_src),
|
||||
"thresholds": cand["thresholds"],
|
||||
})
|
||||
em_path.write_text(json.dumps(em, indent=2))
|
||||
|
||||
if anchor_name in facesets_by_name:
|
||||
facesets_by_name[anchor_name]["exported"] = len(new_face_entries)
|
||||
facesets_by_name[anchor_name]["fsz_top"] = top_fsz_name
|
||||
facesets_by_name[anchor_name]["fsz_all"] = all_fsz_used
|
||||
facesets_by_name[anchor_name]["top_n"] = top_n_eff
|
||||
|
||||
added_here = len(new_with_src)
|
||||
total_added += added_here
|
||||
print(f"[applied] {anchor_name}: +{added_here} (now {len(new_face_entries)} faces)", file=sys.stderr)
|
||||
|
||||
# rewrite master with ordering preserved
|
||||
new_facesets = []
|
||||
for entry in master.get("facesets", []):
|
||||
new_facesets.append(facesets_by_name.get(entry["name"], entry))
|
||||
master["facesets"] = new_facesets
|
||||
master.setdefault("age_extend_runs", []).append({
|
||||
"parent": PARENT,
|
||||
"thresholds": cand["thresholds"],
|
||||
"anchors": list(by_anchor.keys()),
|
||||
"added_total": total_added,
|
||||
})
|
||||
tmp = master_path.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(master, indent=2))
|
||||
tmp.replace(master_path)
|
||||
print(f"[done] +{total_added} faces across {len(by_anchor)} anchors", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- main -----------------------------
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
sub = ap.add_subparsers(dest="cmd", required=True)
|
||||
|
||||
a = sub.add_parser("analyze")
|
||||
a.add_argument("--out", required=True)
|
||||
a.set_defaults(func=cmd_analyze)
|
||||
|
||||
r = sub.add_parser("report")
|
||||
r.add_argument("--candidates", required=True)
|
||||
r.add_argument("--out", required=True)
|
||||
r.set_defaults(func=cmd_report)
|
||||
|
||||
p = sub.add_parser("apply")
|
||||
p.add_argument("--candidates", required=True)
|
||||
p.add_argument("--dry-run", action="store_true")
|
||||
p.set_defaults(func=cmd_apply)
|
||||
|
||||
args = ap.parse_args()
|
||||
args.func(args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,485 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Age-split person_001 into era-specific facesets.
|
||||
|
||||
Workflow:
|
||||
1. Seed a clean person_001 centroid from the existing curated 707-face
|
||||
`facesets_swap_ready/faceset_001/`.
|
||||
2. Wide-recovery scan: pull every face record under /mnt/x/src/{nl, lzbkp_red}
|
||||
from `nl_full.npz` with cos-dist <= 0.55 from the seed centroid.
|
||||
3. Apply export-swap-style per-face quality gates.
|
||||
4. One re-centroid + 0.50 tighten pass to absorb the recovery without drift.
|
||||
5. Agglomerative sub-clustering at cos-dist 0.35.
|
||||
6. Post-merge sub-clusters whose centroids <0.30 AND whose dominant EXIF
|
||||
years are within 2 years.
|
||||
7. Read EXIF DateTimeOriginal for each face's source path; era label =
|
||||
(p10 year, p90 year) over dated faces.
|
||||
8. Undated faces are assigned to the nearest era by embedding distance.
|
||||
9. For each era: composite-quality rank, single-face PNG crops, .fsz bundles
|
||||
(top-N and _all if era > top_n). `<era>_<range>.txt` marker file. Eras
|
||||
with <20 face records get a `THIN.txt` marker.
|
||||
10. Append era entries into the canonical
|
||||
`facesets_swap_ready/manifest.json` next to the existing 19.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import shutil
|
||||
import sys
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from PIL import Image, ExifTags, ImageOps
|
||||
|
||||
REPO = Path(__file__).resolve().parent.parent
|
||||
sys.path.insert(0, str(REPO))
|
||||
|
||||
from sort_faces import ( # noqa: E402
|
||||
QUALITY_WEIGHTS,
|
||||
_crop_face_square,
|
||||
_zip_png_list,
|
||||
compute_quality,
|
||||
load_cache,
|
||||
load_rgb_bgr,
|
||||
)
|
||||
|
||||
# ---- config -------------------------------------------------------------- #
|
||||
|
||||
CACHE = REPO / "work" / "cache" / "nl_full.npz"
|
||||
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
|
||||
FS001 = SWAP_READY / "faceset_001"
|
||||
|
||||
SCAN_ROOTS = [
|
||||
Path("/mnt/x/src/nl"),
|
||||
Path("/mnt/x/src/lzbkp_red"),
|
||||
]
|
||||
|
||||
# Recovery + identity refinement
|
||||
RECOVERY_THRESHOLD = 0.55 # initial centroid match
|
||||
TIGHTEN_THRESHOLD = 0.50 # post-recentroid drift trim
|
||||
# Quality gates (mirror export-swap defaults)
|
||||
MIN_FACE_SHORT = 100
|
||||
# Sub-cluster
|
||||
SUBCLUSTER_THRESHOLD = 0.35
|
||||
# Anchor-based fragment assignment (replaces transitive union-find merge):
|
||||
ANCHOR_MIN_SIZE = 20 # sub-cluster size to qualify as an era anchor
|
||||
FRAGMENT_CENTROID_MAX = 0.40 # small fragment may join an anchor only if cent_dist <=
|
||||
FRAGMENT_YEAR_MAX = 5 # AND |dom_year_anchor - dom_year_fragment| <=
|
||||
# Output
|
||||
TOP_N = 30
|
||||
PAD_RATIO = 0.5
|
||||
OUT_SIZE = 512
|
||||
THIN_THRESHOLD = 20
|
||||
|
||||
# EXIF cache (so re-runs skip the 30-min Windows-mount EXIF read)
|
||||
EXIF_CACHE = REPO / "work" / "cache" / "age_split_exif.json"
|
||||
|
||||
|
||||
# ---- helpers ------------------------------------------------------------- #
|
||||
|
||||
def _normalize(v: np.ndarray) -> np.ndarray:
|
||||
n = np.linalg.norm(v)
|
||||
return v / n if n > 0 else v
|
||||
|
||||
|
||||
def _under(roots: list[Path], p: str) -> bool:
|
||||
for r in roots:
|
||||
rs = str(r).rstrip("/") + "/"
|
||||
if p == str(r) or p.startswith(rs):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def _record_in_roots(rec: dict, roots: list[Path], path_aliases: dict) -> bool:
|
||||
if _under(roots, rec["path"]):
|
||||
return True
|
||||
for alias in path_aliases.get(rec["path"], []):
|
||||
if _under(roots, alias):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def exif_year(path: Path) -> int | None:
|
||||
try:
|
||||
with Image.open(path) as im:
|
||||
exif = im._getexif()
|
||||
if not exif:
|
||||
return None
|
||||
for tag_id, val in exif.items():
|
||||
tag = ExifTags.TAGS.get(tag_id, tag_id)
|
||||
if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
|
||||
return int(val[:4])
|
||||
except Exception:
|
||||
return None
|
||||
return None
|
||||
|
||||
|
||||
def label_for_era(years: list[int]) -> str:
|
||||
"""Era label as a year-range string. Falls back to 'undated' if no years."""
|
||||
if not years:
|
||||
return "undated"
|
||||
ys = sorted(years)
|
||||
lo = ys[len(ys) // 10] if len(ys) >= 10 else ys[0]
|
||||
hi = ys[-(len(ys) // 10) - 1] if len(ys) >= 10 else ys[-1]
|
||||
if lo == hi:
|
||||
return str(lo)
|
||||
# Compact year range like 2011-13 if same century, else 2009-2024.
|
||||
if (lo // 100) == (hi // 100):
|
||||
return f"{lo}-{hi % 100:02d}"
|
||||
return f"{lo}-{hi}"
|
||||
|
||||
|
||||
# ---- phase 1 + 2: seed centroid + recovery scan ------------------------- #
|
||||
|
||||
def main() -> None:
|
||||
if not FS001.exists():
|
||||
raise SystemExit(f"missing seed faceset: {FS001}")
|
||||
|
||||
print("=== loading cache ===")
|
||||
emb, meta, _src, _proc, path_aliases = load_cache(CACHE)
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if len(face_records) != len(emb):
|
||||
raise SystemExit(f"emb/meta mismatch: {len(face_records)} vs {len(emb)}")
|
||||
|
||||
bbox_idx = {(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)}
|
||||
|
||||
seed_manifest = json.loads((FS001 / "manifest.json").read_text())
|
||||
seed_face_keys = [(f["source"], tuple(f.get("bbox") or ())) for f in seed_manifest["faces"]]
|
||||
seed_indices = [bbox_idx[k] for k in seed_face_keys if k in bbox_idx]
|
||||
print(f"seed faces from faceset_001: {len(seed_indices)} (manifest had {len(seed_face_keys)})")
|
||||
|
||||
seed_centroid = _normalize(emb[seed_indices].mean(axis=0))
|
||||
|
||||
# Recovery: every face record under nl/ + lzbkp_red/ within RECOVERY_THRESHOLD.
|
||||
candidate_idxs = [
|
||||
i for i, rec in enumerate(face_records)
|
||||
if _record_in_roots(rec, SCAN_ROOTS, path_aliases)
|
||||
]
|
||||
print(f"\ncandidates under {[str(r) for r in SCAN_ROOTS]}: {len(candidate_idxs)}")
|
||||
|
||||
cand_emb = emb[candidate_idxs]
|
||||
cand_dists = 1.0 - cand_emb @ seed_centroid
|
||||
recovered_local = [k for k, d in enumerate(cand_dists) if d <= RECOVERY_THRESHOLD]
|
||||
recovered = [candidate_idxs[k] for k in recovered_local]
|
||||
print(f"recovered at cos-dist <= {RECOVERY_THRESHOLD}: {len(recovered)}")
|
||||
|
||||
# Quality gate.
|
||||
qualified = []
|
||||
drop_size = drop_blur = drop_det = 0
|
||||
for i in recovered:
|
||||
r = face_records[i]
|
||||
if r.get("face_short", 0) < MIN_FACE_SHORT:
|
||||
drop_size += 1
|
||||
continue
|
||||
if r.get("blur", 0.0) < 40.0:
|
||||
drop_blur += 1
|
||||
continue
|
||||
if r.get("det_score", 0.0) < 0.6:
|
||||
drop_det += 1
|
||||
continue
|
||||
qualified.append(i)
|
||||
print(f"after quality gate: {len(qualified)} (drop size={drop_size} blur={drop_blur} det={drop_det})")
|
||||
|
||||
# One tightening pass: re-centroid on qualified, drop anyone > TIGHTEN_THRESHOLD.
|
||||
qcent = _normalize(emb[qualified].mean(axis=0))
|
||||
qd = 1.0 - emb[qualified] @ qcent
|
||||
tight = [qualified[k] for k, d in enumerate(qd) if d <= TIGHTEN_THRESHOLD]
|
||||
print(f"after re-centroid tighten ({TIGHTEN_THRESHOLD}): {len(tight)}")
|
||||
|
||||
# ---- phase 5: sub-cluster -------------------------------------------- #
|
||||
print("\n=== sub-clustering ===")
|
||||
from sklearn.cluster import AgglomerativeClustering
|
||||
|
||||
E = emb[tight]
|
||||
sims = E @ E.T
|
||||
dists = 1.0 - sims
|
||||
# Floor numerical noise.
|
||||
np.fill_diagonal(dists, 0.0)
|
||||
dists = np.maximum(dists, 0.0)
|
||||
|
||||
ac = AgglomerativeClustering(
|
||||
n_clusters=None,
|
||||
metric="precomputed",
|
||||
linkage="average",
|
||||
distance_threshold=SUBCLUSTER_THRESHOLD,
|
||||
)
|
||||
labels = ac.fit_predict(dists)
|
||||
sub_sizes = Counter(labels)
|
||||
print(f"raw sub-clusters: {len(sub_sizes)} (sizes: top10={sorted(sub_sizes.values(), reverse=True)[:10]})")
|
||||
|
||||
# Per-cluster: indices, centroid, EXIF years.
|
||||
cluster_indices: dict[int, list[int]] = {}
|
||||
for k, lab in enumerate(labels):
|
||||
cluster_indices.setdefault(int(lab), []).append(tight[k])
|
||||
|
||||
cluster_centroids: dict[int, np.ndarray] = {}
|
||||
for lab, idxs in cluster_indices.items():
|
||||
cluster_centroids[lab] = _normalize(emb[idxs].mean(axis=0))
|
||||
|
||||
print("\n=== EXIF years (one read per source path; cached) ===")
|
||||
unique_paths = sorted({face_records[i]["path"] for i in tight})
|
||||
if EXIF_CACHE.exists():
|
||||
cached = json.loads(EXIF_CACHE.read_text())
|
||||
else:
|
||||
cached = {}
|
||||
path_year: dict[str, int | None] = {}
|
||||
new_reads = 0
|
||||
for p in unique_paths:
|
||||
if p in cached:
|
||||
path_year[p] = cached[p]
|
||||
else:
|
||||
y = exif_year(Path(p))
|
||||
path_year[p] = y
|
||||
cached[p] = y
|
||||
new_reads += 1
|
||||
EXIF_CACHE.parent.mkdir(parents=True, exist_ok=True)
|
||||
EXIF_CACHE.write_text(json.dumps(cached, indent=0))
|
||||
dated = sum(1 for v in path_year.values() if v is not None)
|
||||
print(f" EXIF cache: {len(cached)} entries, {new_reads} new reads, "
|
||||
f"{dated}/{len(unique_paths)} dated")
|
||||
|
||||
cluster_years: dict[int, list[int]] = {}
|
||||
cluster_dom_year: dict[int, int | None] = {}
|
||||
for lab, idxs in cluster_indices.items():
|
||||
ys = []
|
||||
for i in idxs:
|
||||
y = path_year.get(face_records[i]["path"])
|
||||
if y is not None:
|
||||
ys.append(y)
|
||||
cluster_years[lab] = ys
|
||||
cluster_dom_year[lab] = (Counter(ys).most_common(1)[0][0]) if ys else None
|
||||
|
||||
# ---- phase 6: anchor-based fragment assignment ----------------------- #
|
||||
# Each sub-cluster of size >= ANCHOR_MIN_SIZE is an "era anchor". Smaller
|
||||
# fragments are assigned to the single nearest anchor IFF (centroid distance
|
||||
# <= FRAGMENT_CENTROID_MAX AND |dom_year delta| <= FRAGMENT_YEAR_MAX).
|
||||
# Anchors do NOT merge with each other — that prevented transitive year drift
|
||||
# observed when union-find was used. Standalone fragments stay as their own
|
||||
# (likely THIN) eras.
|
||||
print("\n=== anchor-based assignment ===")
|
||||
anchors = [lab for lab, idxs in cluster_indices.items() if len(idxs) >= ANCHOR_MIN_SIZE]
|
||||
fragments = [lab for lab in cluster_indices if lab not in anchors]
|
||||
anchors.sort(key=lambda l: -len(cluster_indices[l]))
|
||||
print(f"anchors (size>={ANCHOR_MIN_SIZE}): {len(anchors)}; fragments: {len(fragments)}")
|
||||
for a in anchors:
|
||||
print(f" anchor sub {a}: size={len(cluster_indices[a])} dom_year={cluster_dom_year[a]}")
|
||||
|
||||
if anchors:
|
||||
a_cent = np.stack([cluster_centroids[a] for a in anchors])
|
||||
assignments: dict[int, int] = {a: a for a in anchors} # anchor -> self
|
||||
unassigned: list[int] = []
|
||||
for f in fragments:
|
||||
f_cent = cluster_centroids[f]
|
||||
f_year = cluster_dom_year[f]
|
||||
# cosine distances to each anchor
|
||||
cd = 1.0 - a_cent @ f_cent
|
||||
# year distance (inf if either dom-year unknown)
|
||||
yd = []
|
||||
for a in anchors:
|
||||
ay = cluster_dom_year[a]
|
||||
if f_year is None or ay is None:
|
||||
yd.append(float("inf"))
|
||||
else:
|
||||
yd.append(abs(f_year - ay))
|
||||
yd = np.array(yd)
|
||||
ok = (cd <= FRAGMENT_CENTROID_MAX) & (yd <= FRAGMENT_YEAR_MAX)
|
||||
if not ok.any():
|
||||
unassigned.append(f)
|
||||
continue
|
||||
# nearest qualifying anchor by centroid distance.
|
||||
cd_masked = np.where(ok, cd, np.inf)
|
||||
best = int(np.argmin(cd_masked))
|
||||
assignments[f] = anchors[best]
|
||||
print(f" assigned fragments: {sum(1 for k,v in assignments.items() if k!=v)}/{len(fragments)}; "
|
||||
f"unassigned (standalone): {len(unassigned)}")
|
||||
else:
|
||||
print(" no anchors; every sub-cluster stands alone")
|
||||
assignments = {lab: lab for lab in cluster_indices}
|
||||
unassigned = []
|
||||
|
||||
merged: dict[int, list[int]] = {}
|
||||
for lab, idxs in cluster_indices.items():
|
||||
root = assignments.get(lab, lab)
|
||||
merged.setdefault(root, []).extend(idxs)
|
||||
|
||||
merged_sizes = sorted(((r, len(v)) for r, v in merged.items()), key=lambda kv: -kv[1])
|
||||
print(f"era buckets: {len(merged)} (top10 sizes: {[s for _, s in merged_sizes[:10]]})")
|
||||
|
||||
# Recompute centroid + dom-year for merged eras.
|
||||
era_indices: dict[int, list[int]] = merged
|
||||
era_centroids: dict[int, np.ndarray] = {}
|
||||
era_year_label: dict[int, str] = {}
|
||||
era_years_full: dict[int, list[int]] = {}
|
||||
for root, idxs in era_indices.items():
|
||||
era_centroids[root] = _normalize(emb[idxs].mean(axis=0))
|
||||
ys = []
|
||||
for i in idxs:
|
||||
y = path_year.get(face_records[i]["path"])
|
||||
if y is not None:
|
||||
ys.append(y)
|
||||
era_years_full[root] = ys
|
||||
era_year_label[root] = label_for_era(ys)
|
||||
|
||||
# ---- phase 8: assign undated faces (no-EXIF) to nearest era ---------- #
|
||||
# NB: undated = path's EXIF was None. For era assignment we use embedding,
|
||||
# but the year *label* is unaffected because labels come from dated faces only.
|
||||
# Actually undated face is already in some sub-cluster; here we just note count.
|
||||
n_undated = sum(1 for i in tight if path_year.get(face_records[i]["path"]) is None)
|
||||
print(f"undated face records (no EXIF): {n_undated}/{len(tight)} (placed by embedding only)")
|
||||
|
||||
# ---- phase 9: per-era export ----------------------------------------- #
|
||||
import cv2
|
||||
|
||||
print("\n=== exporting era bundles ===")
|
||||
new_manifest_entries: list[dict] = []
|
||||
eras_sorted = sorted(era_indices.items(), key=lambda kv: -len(kv[1]))
|
||||
for root, idxs in eras_sorted:
|
||||
size = len(idxs)
|
||||
label = era_year_label[root]
|
||||
era_name = f"faceset_001_{label}"
|
||||
out_dir = SWAP_READY / era_name
|
||||
|
||||
# Disambiguate same-label collisions (e.g. two distinct embedding eras both 2019).
|
||||
collision = 2
|
||||
while out_dir.exists():
|
||||
era_name = f"faceset_001_{label}_v{collision}"
|
||||
out_dir = SWAP_READY / era_name
|
||||
collision += 1
|
||||
|
||||
faces_dir = out_dir / "faces"
|
||||
faces_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Composite quality + rank.
|
||||
ranked = []
|
||||
for ci in idxs:
|
||||
rec = face_records[ci]
|
||||
q = compute_quality(rec)
|
||||
ranked.append({"cache_idx": ci, "rec": rec, "quality": q})
|
||||
|
||||
# Dedup by source path within this era — keep highest-quality face per path.
|
||||
seen_path: dict[str, dict] = {}
|
||||
for r in ranked:
|
||||
p = r["rec"]["path"]
|
||||
prev = seen_path.get(p)
|
||||
if prev is None or r["quality"]["composite"] > prev["quality"]["composite"]:
|
||||
seen_path[p] = r
|
||||
unique = sorted(seen_path.values(), key=lambda r: -r["quality"]["composite"])
|
||||
|
||||
# Materialize crops.
|
||||
written: list[Path] = []
|
||||
face_entries: list[dict] = []
|
||||
for rank, r in enumerate(unique, start=1):
|
||||
rec = r["rec"]
|
||||
src = Path(rec["path"])
|
||||
if not src.exists():
|
||||
continue
|
||||
rgb, _ = load_rgb_bgr(src)
|
||||
if rgb is None:
|
||||
continue
|
||||
crop = _crop_face_square(rgb, rec["bbox"], PAD_RATIO, OUT_SIZE)
|
||||
png = faces_dir / f"{rank:04d}.png"
|
||||
cv2.imwrite(str(png), cv2.cvtColor(crop, cv2.COLOR_RGB2BGR))
|
||||
written.append(png)
|
||||
face_entries.append({
|
||||
"rank": rank,
|
||||
"png": f"faces/{rank:04d}.png",
|
||||
"source": rec["path"],
|
||||
"aliases": path_aliases.get(rec["path"], []),
|
||||
"bbox": rec["bbox"],
|
||||
"face_short": rec.get("face_short"),
|
||||
"det_score": rec.get("det_score"),
|
||||
"blur": rec.get("blur"),
|
||||
"pose": rec.get("pose"),
|
||||
"exif_year": path_year.get(rec["path"]),
|
||||
"quality": r["quality"],
|
||||
})
|
||||
|
||||
if not written:
|
||||
print(f"[{era_name}] empty after materialization; skipping")
|
||||
shutil.rmtree(out_dir)
|
||||
continue
|
||||
|
||||
# Bundle.
|
||||
top_n_eff = min(TOP_N, len(written))
|
||||
top_fsz = out_dir / f"{era_name}_top{top_n_eff}.fsz"
|
||||
_zip_png_list(written[:top_n_eff], top_fsz)
|
||||
all_fsz: Path | None = None
|
||||
if len(written) > top_n_eff:
|
||||
all_fsz = out_dir / f"{era_name}_all.fsz"
|
||||
_zip_png_list(written, all_fsz)
|
||||
|
||||
# Per-era manifest.
|
||||
ys = era_years_full[root]
|
||||
year_summary = {
|
||||
"label": label,
|
||||
"year_count": len(ys),
|
||||
"year_min": min(ys) if ys else None,
|
||||
"year_max": max(ys) if ys else None,
|
||||
"year_dist": dict(Counter(ys).most_common()),
|
||||
}
|
||||
is_thin = size < THIN_THRESHOLD
|
||||
manifest = {
|
||||
"name": era_name,
|
||||
"parent_identity": "faceset_001",
|
||||
"era": year_summary,
|
||||
"input_face_records": size,
|
||||
"exported": len(written),
|
||||
"top_n": top_n_eff,
|
||||
"fsz_top": top_fsz.name,
|
||||
"fsz_all": all_fsz.name if all_fsz else None,
|
||||
"thin": is_thin,
|
||||
"quality_weights": QUALITY_WEIGHTS,
|
||||
"params": {
|
||||
"recovery_threshold": RECOVERY_THRESHOLD,
|
||||
"tighten_threshold": TIGHTEN_THRESHOLD,
|
||||
"subcluster_threshold": SUBCLUSTER_THRESHOLD,
|
||||
"anchor_min_size": ANCHOR_MIN_SIZE,
|
||||
"fragment_centroid_max": FRAGMENT_CENTROID_MAX,
|
||||
"fragment_year_max": FRAGMENT_YEAR_MAX,
|
||||
"min_face_short": MIN_FACE_SHORT,
|
||||
},
|
||||
"faces": face_entries,
|
||||
}
|
||||
(out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
|
||||
|
||||
# Per-era marker file (always: <label>.txt for human reference).
|
||||
(out_dir / f"{label}.txt").write_text(
|
||||
f"{era_name}\n\nEra: {label}\n"
|
||||
f"Year span: {year_summary['year_min']}..{year_summary['year_max']} "
|
||||
f"({year_summary['year_count']} dated of {size} faces)\n"
|
||||
f"Sub-cluster size: {size} face records, {len(unique)} unique source paths, "
|
||||
f"{len(written)} exported PNGs.\n"
|
||||
)
|
||||
if is_thin:
|
||||
(out_dir / "THIN.txt").write_text(
|
||||
f"This era has only {size} face records (<{THIN_THRESHOLD}). "
|
||||
f"Averaged embedding may be dominated by single-photo idiosyncrasies.\n"
|
||||
)
|
||||
|
||||
# Append to top-level manifest summary.
|
||||
new_manifest_entries.append({k: v for k, v in manifest.items() if k != "faces"})
|
||||
|
||||
thin_tag = " THIN" if is_thin else ""
|
||||
print(
|
||||
f"[{era_name}] size={size} unique_paths={len(unique)} exported={len(written)} "
|
||||
f"top{top_n_eff}{thin_tag}"
|
||||
)
|
||||
|
||||
# ---- merge into top-level manifest ----------------------------------- #
|
||||
top_path = SWAP_READY / "manifest.json"
|
||||
existing = json.loads(top_path.read_text()) if top_path.exists() else {"facesets": []}
|
||||
existing_names = {fs.get("name") for fs in existing.get("facesets", [])}
|
||||
appended = 0
|
||||
for entry in new_manifest_entries:
|
||||
if entry["name"] in existing_names:
|
||||
continue
|
||||
existing["facesets"].append(entry)
|
||||
appended += 1
|
||||
top_path.write_text(json.dumps(existing, indent=2))
|
||||
print(f"\nAppended {appended} era entries to {top_path}")
|
||||
print(f"Done. {len(new_manifest_entries)} era buckets emitted (faceset_001/ left untouched).")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,323 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Build per-folder facesets from hand-sorted source directories.
|
||||
|
||||
Phase B + C of the folder-import workflow:
|
||||
- Filter cache records into per-folder identity sets, run 2-pass centroid+outlier
|
||||
rejection so non-target faces in group photos drop out.
|
||||
- Route every osrc face record to every trusted-folder identity within a tight
|
||||
cosine cutoff (multi-identity osrc photos land in multiple facesets;
|
||||
cmd_export_swap then per-bbox-filters so each faceset crops only the matching face).
|
||||
- Synthesize a refine_manifest.json compatible with cmd_export_swap.
|
||||
- Invoke cmd_export_swap to emit faceset_NNN/ dirs into a temp output dir.
|
||||
- Rename .fsz bundles after the source folder, replace NAME.txt with foldername.txt,
|
||||
move dirs into the canonical facesets_swap_ready/, merge top-level manifest
|
||||
preserving existing faceset_001..012 entries.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import shutil
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
|
||||
REPO = Path(__file__).resolve().parent.parent
|
||||
sys.path.insert(0, str(REPO))
|
||||
|
||||
from sort_faces import ( # noqa: E402
|
||||
cmd_export_swap,
|
||||
load_cache,
|
||||
)
|
||||
|
||||
# ---- config -------------------------------------------------------------- #
|
||||
|
||||
CACHE = REPO / "work" / "cache" / "nl_full.npz"
|
||||
OUT_FINAL = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
|
||||
OUT_TMP = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready_new")
|
||||
SYNTH_MANIFEST = REPO / "work" / "synthetic_refine_manifest.json"
|
||||
|
||||
# Trusted folders, in numbering order. faceset_NNN starts at 013.
|
||||
TRUSTED: list[tuple[str, Path]] = [
|
||||
("k", Path("/mnt/x/src/k")),
|
||||
("m", Path("/mnt/x/src/m")),
|
||||
("mi", Path("/mnt/x/src/mi")),
|
||||
("mir", Path("/mnt/x/src/mir")),
|
||||
("s", Path("/mnt/x/src/s")),
|
||||
("sab", Path("/mnt/x/src/sab")),
|
||||
("t", Path("/mnt/x/src/t")),
|
||||
]
|
||||
START_NNN = 13
|
||||
OSRC_DIR = Path("/mnt/x/src/osrc")
|
||||
|
||||
# Centroid-build outlier passes (loose then tight).
|
||||
PASS1_THRESHOLD = 0.55
|
||||
PASS2_THRESHOLD = 0.45
|
||||
# osrc routing cutoff (tight).
|
||||
OSRC_THRESHOLD = 0.45
|
||||
|
||||
# export-swap params (defaults from sort_faces.py).
|
||||
TOP_N = 30
|
||||
EXPORT_OUTLIER_THRESHOLD = 0.45
|
||||
PAD_RATIO = 0.5
|
||||
OUT_SIZE = 512
|
||||
MIN_FACE_SHORT = 100
|
||||
|
||||
|
||||
# ---- helpers ------------------------------------------------------------- #
|
||||
|
||||
def _normalize_rows(mat: np.ndarray) -> np.ndarray:
|
||||
n = np.linalg.norm(mat, axis=1, keepdims=True)
|
||||
n[n == 0] = 1.0
|
||||
return mat / n
|
||||
|
||||
|
||||
def _centroid(vecs: np.ndarray) -> np.ndarray:
|
||||
c = vecs.mean(axis=0)
|
||||
n = np.linalg.norm(c)
|
||||
return c / n if n > 0 else c
|
||||
|
||||
|
||||
def _under(folder: Path, p: str) -> bool:
|
||||
"""True iff path string p lies under folder."""
|
||||
fs = str(folder).rstrip("/") + "/"
|
||||
return p == str(folder) or p.startswith(fs)
|
||||
|
||||
|
||||
def _record_in_folder(rec: dict, folder: Path, path_aliases: dict[str, list[str]]) -> bool:
|
||||
if _under(folder, rec["path"]):
|
||||
return True
|
||||
for alias in path_aliases.get(rec["path"], []):
|
||||
if _under(folder, alias):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
# ---- phase B: identity centroids + osrc routing ------------------------- #
|
||||
|
||||
def build_synthetic_manifest() -> tuple[dict, dict[str, np.ndarray], dict[str, dict]]:
|
||||
emb, meta, _src_root, _processed, path_aliases = load_cache(CACHE)
|
||||
# emb is aligned with the no-noface-filtered records (matching cmd_export_swap's
|
||||
# invariant). Use indices into face_records to access emb.
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if len(face_records) != len(emb):
|
||||
raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")
|
||||
|
||||
print(f"Loaded cache: {len(face_records)} face records.")
|
||||
|
||||
# Per-folder identity centroids.
|
||||
centroids: dict[str, np.ndarray] = {}
|
||||
folder_paths: dict[str, set[str]] = {}
|
||||
folder_stats: dict[str, dict] = {}
|
||||
|
||||
for label, folder in TRUSTED:
|
||||
idxs = [i for i, m in enumerate(face_records) if _record_in_folder(m, folder, path_aliases)]
|
||||
if not idxs:
|
||||
print(f"[{label}] no face records found under {folder}; skipping")
|
||||
continue
|
||||
|
||||
vecs = emb[idxs]
|
||||
cent = _centroid(vecs)
|
||||
|
||||
# Pass 1: drop loose outliers.
|
||||
d1 = 1.0 - vecs @ cent
|
||||
keep1 = [idxs[k] for k, dist in enumerate(d1) if dist <= PASS1_THRESHOLD]
|
||||
if not keep1:
|
||||
print(f"[{label}] every face was a pass-1 outlier; using all faces as-is")
|
||||
keep1 = idxs
|
||||
cent = _centroid(emb[keep1])
|
||||
|
||||
# Pass 2: tight outlier rejection.
|
||||
d2 = 1.0 - emb[keep1] @ cent
|
||||
keep2 = [keep1[k] for k, dist in enumerate(d2) if dist <= PASS2_THRESHOLD]
|
||||
if not keep2:
|
||||
print(f"[{label}] every face was a pass-2 outlier; falling back to pass-1")
|
||||
keep2 = keep1
|
||||
cent = _centroid(emb[keep2])
|
||||
|
||||
centroids[label] = cent
|
||||
# Use canonical path strings; export-swap will look up indices by path.
|
||||
folder_paths[label] = {face_records[i]["path"] for i in keep2}
|
||||
folder_stats[label] = {
|
||||
"folder": str(folder),
|
||||
"input_records": len(idxs),
|
||||
"after_pass1": len(keep1),
|
||||
"after_pass2": len(keep2),
|
||||
"unique_paths": len(folder_paths[label]),
|
||||
}
|
||||
print(
|
||||
f"[{label}] in={len(idxs)} pass1={len(keep1)} pass2={len(keep2)} "
|
||||
f"unique_paths={len(folder_paths[label])}"
|
||||
)
|
||||
|
||||
# osrc routing: every osrc face -> every centroid within OSRC_THRESHOLD.
|
||||
osrc_idxs = [
|
||||
i for i, m in enumerate(face_records)
|
||||
if _record_in_folder(m, OSRC_DIR, path_aliases)
|
||||
]
|
||||
print(f"\nosrc: {len(osrc_idxs)} face records to route")
|
||||
if osrc_idxs and centroids:
|
||||
labels = list(centroids.keys())
|
||||
cent_mat = np.stack([centroids[lab] for lab in labels])
|
||||
# Build sims: (n_osrc, n_labels)
|
||||
osrc_emb = emb[osrc_idxs]
|
||||
sims = osrc_emb @ cent_mat.T # cosine similarity (vectors already normalized)
|
||||
dists = 1.0 - sims
|
||||
per_label_added: dict[str, int] = {lab: 0 for lab in labels}
|
||||
for row, ci in enumerate(osrc_idxs):
|
||||
p = face_records[ci]["path"]
|
||||
for col, lab in enumerate(labels):
|
||||
if dists[row, col] <= OSRC_THRESHOLD:
|
||||
if p not in folder_paths[lab]:
|
||||
folder_paths[lab].add(p)
|
||||
per_label_added[lab] += 1
|
||||
for lab in labels:
|
||||
folder_stats[lab]["osrc_paths_added"] = per_label_added[lab]
|
||||
print(f"[{lab}] osrc faces routed: +{per_label_added[lab]} unique paths")
|
||||
|
||||
# Build synthetic refine_manifest.
|
||||
facesets: list[dict] = []
|
||||
for n, (label, _folder) in enumerate(TRUSTED, start=START_NNN):
|
||||
if label not in folder_paths:
|
||||
continue
|
||||
facesets.append({
|
||||
"name": f"faceset_{n:03d}",
|
||||
"label": label,
|
||||
"image_count": len(folder_paths[label]),
|
||||
"images": sorted(folder_paths[label]),
|
||||
})
|
||||
|
||||
manifest = {
|
||||
"params": {
|
||||
"pass1_threshold": PASS1_THRESHOLD,
|
||||
"pass2_threshold": PASS2_THRESHOLD,
|
||||
"osrc_threshold": OSRC_THRESHOLD,
|
||||
"min_face_short": MIN_FACE_SHORT,
|
||||
},
|
||||
"facesets": facesets,
|
||||
"_per_folder_stats": folder_stats,
|
||||
}
|
||||
SYNTH_MANIFEST.write_text(json.dumps(manifest, indent=2))
|
||||
print(f"\nSynthetic manifest -> {SYNTH_MANIFEST}")
|
||||
return manifest, centroids, folder_stats
|
||||
|
||||
|
||||
# ---- phase C: export + rename + merge ----------------------------------- #
|
||||
|
||||
def export_and_relocate(manifest: dict) -> None:
|
||||
if OUT_TMP.exists():
|
||||
shutil.rmtree(OUT_TMP)
|
||||
OUT_TMP.mkdir(parents=True)
|
||||
|
||||
print(f"\nRunning cmd_export_swap -> {OUT_TMP}")
|
||||
cmd_export_swap(
|
||||
cache_path=CACHE,
|
||||
refine_manifest_path=SYNTH_MANIFEST,
|
||||
raw_manifest_path=None,
|
||||
out_dir=OUT_TMP,
|
||||
top_n=TOP_N,
|
||||
outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
|
||||
pad_ratio=PAD_RATIO,
|
||||
out_size=OUT_SIZE,
|
||||
include_candidates=False,
|
||||
candidate_match_threshold=0.55,
|
||||
candidate_min_score=0.40,
|
||||
min_face_short=MIN_FACE_SHORT,
|
||||
)
|
||||
|
||||
# Map name -> label from the synthetic manifest.
|
||||
name_to_label = {fs["name"]: fs["label"] for fs in manifest["facesets"]}
|
||||
|
||||
# Load the temp top-level manifest (export-swap just wrote it).
|
||||
new_top = json.loads((OUT_TMP / "manifest.json").read_text())
|
||||
new_entries = new_top.get("facesets", [])
|
||||
|
||||
# Per-faceset rename + relocate.
|
||||
for fs_meta in new_entries:
|
||||
name = fs_meta["name"]
|
||||
label = name_to_label.get(name)
|
||||
src_dir = OUT_TMP / name
|
||||
if not src_dir.exists():
|
||||
print(f"[{name}] export dir missing; skipping")
|
||||
continue
|
||||
|
||||
# Rename .fsz bundles to <label>_*.fsz; record updated names.
|
||||
renames = {}
|
||||
for fsz in sorted(src_dir.glob(f"{name}_top*.fsz")):
|
||||
new = src_dir / fsz.name.replace(name + "_", label + "_", 1)
|
||||
fsz.rename(new)
|
||||
renames[fsz.name] = new.name
|
||||
for fsz in sorted(src_dir.glob(f"{name}_all.fsz")):
|
||||
new = src_dir / fsz.name.replace(name + "_", label + "_", 1)
|
||||
fsz.rename(new)
|
||||
renames[fsz.name] = new.name
|
||||
|
||||
# Replace NAME.txt placeholder with <label>.txt.
|
||||
nametxt = src_dir / "NAME.txt"
|
||||
if nametxt.exists():
|
||||
nametxt.unlink()
|
||||
(src_dir / f"{label}.txt").write_text(
|
||||
f"{label}\n\nSource: /mnt/x/src/{label} (hand-sorted) + matched osrc faces.\n"
|
||||
)
|
||||
|
||||
# Update fs_meta entry's fsz fields to point at the renamed files.
|
||||
for k in ("fsz_top", "fsz_all"):
|
||||
if fs_meta.get(k) and fs_meta[k] in renames:
|
||||
fs_meta[k] = renames[fs_meta[k]]
|
||||
fs_meta["label"] = label
|
||||
|
||||
# Move the directory into the final output.
|
||||
dst_dir = OUT_FINAL / name
|
||||
if dst_dir.exists():
|
||||
print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
|
||||
continue
|
||||
shutil.move(str(src_dir), str(dst_dir))
|
||||
print(f"[{name}] -> {dst_dir} (label={label})")
|
||||
|
||||
# Merge top-level manifest, preserving existing faceset_001..012 entries.
|
||||
final_manifest_path = OUT_FINAL / "manifest.json"
|
||||
if final_manifest_path.exists():
|
||||
existing = json.loads(final_manifest_path.read_text())
|
||||
else:
|
||||
existing = {"facesets": []}
|
||||
|
||||
existing_names = {fs["name"] for fs in existing.get("facesets", [])}
|
||||
appended = 0
|
||||
for entry in new_entries:
|
||||
if entry["name"] in existing_names:
|
||||
print(f"[manifest] {entry['name']} already in top-level manifest; not duplicating")
|
||||
continue
|
||||
existing["facesets"].append(entry)
|
||||
appended += 1
|
||||
|
||||
# Carry over export-swap params if not already present.
|
||||
for k in ("quality_weights", "outlier_threshold", "top_n", "pad_ratio", "out_size"):
|
||||
if k not in existing and k in new_top:
|
||||
existing[k] = new_top[k]
|
||||
|
||||
final_manifest_path.write_text(json.dumps(existing, indent=2))
|
||||
print(f"\nMerged manifest: appended {appended} entries -> {final_manifest_path}")
|
||||
|
||||
# Clean up temp dir if empty.
|
||||
leftover = list(OUT_TMP.iterdir()) if OUT_TMP.exists() else []
|
||||
if not leftover:
|
||||
OUT_TMP.rmdir()
|
||||
else:
|
||||
# leave temp manifest.json for inspection
|
||||
pass
|
||||
|
||||
|
||||
# ---- main ---------------------------------------------------------------- #
|
||||
|
||||
def main() -> None:
|
||||
manifest, _centroids, _stats = build_synthetic_manifest()
|
||||
if not manifest.get("facesets"):
|
||||
print("No facesets to build; nothing to do.")
|
||||
return
|
||||
export_and_relocate(manifest)
|
||||
print("\nDone.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,151 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Probe faceset_001 for age-sortable sub-structure.
|
||||
|
||||
Three questions:
|
||||
1. How spread is the embedding cloud? (intra-cluster pairwise distance histogram)
|
||||
2. Does it split naturally into sub-clusters at a tight threshold?
|
||||
3. Do the sub-clusters correspond to distinct time periods (EXIF DateTimeOriginal)?
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import sys
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from PIL import Image, ExifTags
|
||||
|
||||
REPO = Path(__file__).resolve().parent.parent
|
||||
sys.path.insert(0, str(REPO))
|
||||
from sort_faces import load_cache # noqa: E402
|
||||
|
||||
CACHE = REPO / "work" / "cache" / "nl_full.npz"
|
||||
FS001 = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready/faceset_001")
|
||||
|
||||
|
||||
def exif_year(path: Path) -> int | None:
|
||||
try:
|
||||
with Image.open(path) as im:
|
||||
exif = im._getexif()
|
||||
if not exif:
|
||||
return None
|
||||
for tag_id, val in exif.items():
|
||||
tag = ExifTags.TAGS.get(tag_id, tag_id)
|
||||
if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
|
||||
return int(val[:4])
|
||||
except Exception:
|
||||
return None
|
||||
return None
|
||||
|
||||
|
||||
def main() -> None:
|
||||
manifest = json.loads((FS001 / "manifest.json").read_text())
|
||||
faces = manifest["faces"]
|
||||
paths = [Path(f["source"]) for f in faces]
|
||||
print(f"faceset_001 has {len(paths)} ranked faces in the swap-ready set")
|
||||
|
||||
# Pull embeddings for these face records by (path, bbox).
|
||||
emb, meta, _src, _proc, _aliases = load_cache(CACHE)
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if len(face_records) != len(emb):
|
||||
raise SystemExit("emb/meta mismatch")
|
||||
bbox_key = {}
|
||||
for i, m in enumerate(face_records):
|
||||
bbox_key[(m["path"], tuple(m.get("bbox") or ()))] = i
|
||||
|
||||
selected = []
|
||||
missing = 0
|
||||
for f in faces:
|
||||
key = (f["source"], tuple(f.get("bbox") or ()))
|
||||
i = bbox_key.get(key)
|
||||
if i is None:
|
||||
missing += 1
|
||||
continue
|
||||
selected.append(i)
|
||||
print(f"matched {len(selected)} embeddings (missing {missing})")
|
||||
|
||||
E = emb[selected]
|
||||
# All embeddings are L2-normalized -> cosine dist = 1 - dot.
|
||||
sims = E @ E.T
|
||||
dists = 1.0 - sims
|
||||
iu = np.triu_indices_from(dists, k=1)
|
||||
pw = dists[iu]
|
||||
print("\n-- intra-cluster pairwise cosine distance --")
|
||||
print(f" n_pairs = {len(pw):,}")
|
||||
print(f" mean = {pw.mean():.3f}")
|
||||
print(f" median = {np.median(pw):.3f}")
|
||||
print(f" p10/p25/p75/p90 = {np.percentile(pw, [10,25,75,90])}")
|
||||
print(f" max = {pw.max():.3f}")
|
||||
|
||||
# Histogram bins around interesting thresholds.
|
||||
edges = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0, 1.4]
|
||||
hist, _ = np.histogram(pw, bins=edges)
|
||||
print("\n histogram (cos-dist bin -> pair count):")
|
||||
for lo, hi, c in zip(edges[:-1], edges[1:], hist):
|
||||
bar = "#" * int(60 * c / max(hist.max(), 1))
|
||||
print(f" [{lo:.1f},{hi:.1f}) {c:7d} {bar}")
|
||||
|
||||
# Sub-cluster at three thresholds via agglomerative on the distance matrix.
|
||||
    from sklearn.cluster import AgglomerativeClustering
    print("\n-- sub-clustering --")
    for thr in (0.30, 0.35, 0.40, 0.45, 0.50):
        ac = AgglomerativeClustering(
            n_clusters=None,
            metric="precomputed",
            linkage="average",
            distance_threshold=thr,
        )
        labels = ac.fit_predict(dists)
        sizes = Counter(labels)
        n = len(sizes)
        big = sum(1 for s in sizes.values() if s >= 10)
        top5 = sorted(sizes.values(), reverse=True)[:5]
        print(f" threshold {thr:.2f}: {n} sub-clusters, {big} with >=10 images, top-5 sizes={top5}")

    # Pick the threshold that gives 2-5 substantial sub-clusters.
    target_thr = 0.35
    ac = AgglomerativeClustering(
        n_clusters=None, metric="precomputed", linkage="average",
        distance_threshold=target_thr,
    )
    labels = ac.fit_predict(dists)
    sizes = Counter(labels)
    big_labels = [lab for lab, s in sizes.most_common() if s >= 20]
    print(f"\n-- EXIF year analysis at threshold {target_thr} (sub-clusters with >=20 images) --")
    print(f" {len(big_labels)} substantial sub-clusters")

    # Build label -> list of source paths
    by_label: dict[int, list[Path]] = {}
    for ci, lab in zip(selected, labels):
        rec = face_records[ci]
        by_label.setdefault(int(lab), []).append(Path(rec["path"]))

    for lab in big_labels[:6]:
        paths_in = by_label[lab]
        years = []
        for p in paths_in:
            y = exif_year(p)
            if y is not None:
                years.append(y)
        n_paths = len(paths_in)
        n_years = len(years)
        if years:
            ys = np.array(years)
            ymin, ymax = int(ys.min()), int(ys.max())
            ymed = int(np.median(ys))
            yhist = Counter(years)
            top_years = ", ".join(f"{y}:{c}" for y, c in sorted(yhist.most_common(5)))
        else:
            ymin = ymax = ymed = None
            top_years = ""
        print(
            f" cluster {lab}: {n_paths} faces, EXIF on {n_years}/{n_paths}, "
            f"year range {ymin}..{ymax} (median {ymed})"
        )
        print(f" top years: {top_years}")


if __name__ == "__main__":
    main()
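The probe leans on one identity throughout: because every cached embedding is L2-normalized, cosine distance reduces to `1 - dot`, so the full pairwise-distance matrix is a single matrix multiply. A minimal, self-contained sketch of that histogram step with synthetic vectors (not the project's cache format):

```python
# For unit-length embeddings, cosine distance is 1 - dot product, so one
# matrix multiply yields every pairwise distance. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 512)).astype(np.float32)
E /= np.linalg.norm(E, axis=1, keepdims=True)    # L2-normalize, as the cache does

dists = 1.0 - E @ E.T                            # (50, 50) cosine distances
pairwise = dists[np.triu_indices_from(dists, k=1)]

# Same-identity faces typically sit well below ~0.5; a long right tail in this
# histogram is what motivates probing for sub-clusters at a tighter threshold.
hist, edges = np.histogram(pairwise, bins=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.4])
for lo, hi, c in zip(edges[:-1], edges[1:], hist):
    print(f"[{lo:.1f},{hi:.1f}) {c}")
```

The same precomputed `dists` matrix is what the script then feeds to `AgglomerativeClustering(metric="precomputed", linkage="average")` for the sub-clustering pass.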
@@ -0,0 +1,221 @@
"""Windows / DirectML CLIP worker for occlusion scoring.

Reads a queue.json staged by /opt/face-sets/work/filter_occlusions.py (WSL side),
runs open_clip ViT-L-14 (dfn2b_s39b) on each PNG via torch-directml on the AMD
Vega, and writes a scores.json with mask + sunglasses softmax probabilities.

CLI:
    py -3.12 clip_worker.py <queue.json> <out_scores.json> [--limit N] [--batch 8]

queue.json shape: list of objects
    {"wsl_path": "...", "win_path": "E:\\...\\faceset_NNN\\faces\\NNNN.png",
     "faceset": "faceset_NNN", "file": "NNNN.png"}

scores.json shape:
    {"model": "ViT-L-14/dfn2b_s39b",
     "logit_scale": 100.0,
     "prompts": {...},
     "results": [{"wsl_path": "...", "faceset": "...", "file": "...",
                  "mask": float, "sunglasses": float}],
     "processed": [wsl_path, ...]}
"""

from __future__ import annotations

import argparse
import json
import os
import sys
import time
import warnings
from pathlib import Path

# DML emits a verbose UserWarning per attention call -- silence at import time
warnings.filterwarnings("ignore", category=UserWarning)

import torch
import torch_directml
import open_clip
from PIL import Image

MODEL_NAME = "ViT-L-14"
PRETRAINED = "dfn2b_s39b"

# kept in sync with /opt/face-sets/work/filter_occlusions.py PROMPTS
PROMPTS = {
    "mask": {
        "pos": [
            "a photo of a person wearing a surgical face mask",
            "a photo of a person wearing an FFP2 respirator covering mouth and nose",
            "a photo of a person wearing a cloth face mask",
            "a face partially covered by a medical mask",
            "a person whose mouth and nose are hidden by a face mask",
        ],
        "neg": [
            "a photo of a person's face with mouth and nose clearly visible",
            "a clear, unobstructed photo of a face",
            "a photo of a face without any mask or covering",
            "a portrait of a person showing their full face",
            "a photo of a person with a beard and visible mouth",
        ],
    },
    "sunglasses": {
        "pos": [
            "a face with dark sunglasses covering the eyes",
            "a portrait with the eyes hidden behind opaque sunglasses",
            "a person wearing dark sunglasses over their eyes, eyes not visible",
            "a face where the eyes are completely concealed by tinted lenses",
            "a close-up portrait wearing aviator sunglasses on the eyes",
        ],
        "neg": [
            "a portrait with both eyes clearly visible and uncovered",
            "a face with sunglasses pushed up on the forehead, eyes visible below",
            "a face with sunglasses resting on top of the head, eyes visible",
            "a person with sunglasses hanging from their shirt, eyes visible",
            "a face wearing clear prescription eyeglasses with visible eyes",
            "a portrait with no eyewear and visible eyes",
        ],
    },
}

FLUSH_EVERY = 100


def load_existing(out_path: Path):
    if not out_path.exists():
        return None, set()
    try:
        d = json.loads(out_path.read_text())
        processed = set(d.get("processed", []))
        return d, processed
    except Exception as e:
        print(f"[warn] could not parse existing {out_path}: {e}; starting fresh", file=sys.stderr)
        return None, set()


def save_atomic(out_path: Path, data: dict):
    tmp = out_path.with_suffix(".tmp.json")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, out_path)


@torch.no_grad()
def build_text_features(model, tokenizer, device):
    out = {}
    for attr, sides in PROMPTS.items():
        feats = {}
        for side in ("pos", "neg"):
            tokens = tokenizer(sides[side]).to(device)
            f = model.encode_text(tokens)
            f = f / f.norm(dim=-1, keepdim=True)
            mean = f.mean(dim=0)
            feats[side] = mean / mean.norm()
        out[attr] = (feats["pos"], feats["neg"])
    return out


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("queue", type=Path)
    ap.add_argument("out", type=Path)
    ap.add_argument("--limit", type=int, default=None)
    ap.add_argument("--batch", type=int, default=8)
    args = ap.parse_args()

    queue = json.loads(args.queue.read_text())
    print(f"[queue] {len(queue)} entries from {args.queue}")

    args.out.parent.mkdir(parents=True, exist_ok=True)
    existing, processed = load_existing(args.out)
    if existing:
        print(f"[resume] {len(processed)} entries already scored")
        results = existing.get("results", [])
    else:
        results = []

    pending = [e for e in queue if e["wsl_path"] not in processed]
    if args.limit is not None:
        pending = pending[: args.limit]
    print(f"[pending] {len(pending)} entries to score")

    if not pending:
        print("[done] nothing to do")
        return

    device = torch_directml.device()
    print(f"[load] {MODEL_NAME}/{PRETRAINED} on {torch_directml.device_name(0)}")
    t0 = time.time()
    model, _, preprocess = open_clip.create_model_and_transforms(MODEL_NAME, pretrained=PRETRAINED)
    tokenizer = open_clip.get_tokenizer(MODEL_NAME)
    model = model.to(device).eval()
    logit_scale = float(model.logit_scale.exp().detach().cpu())
    print(f"[load] ready in {time.time()-t0:.1f}s logit_scale={logit_scale:.2f}")
    text_feats = build_text_features(model, tokenizer, device)

    def flush():
        save_atomic(args.out, {
            "model": f"{MODEL_NAME}/{PRETRAINED}",
            "logit_scale": logit_scale,
            "prompts": PROMPTS,
            "results": results,
            "processed": sorted(processed),
        })

    n_done_this_run = 0
    n_load_err = 0
    last_flush = time.time()
    t_start = time.time()

    for i in range(0, len(pending), args.batch):
        chunk = pending[i:i + args.batch]
        imgs = []
        keep = []
        for entry in chunk:
            try:
                img = Image.open(entry["win_path"]).convert("RGB")
                imgs.append(preprocess(img))
                keep.append(entry)
            except Exception as e:
                print(f"[skip] {entry['win_path']}: {e}", file=sys.stderr)
                n_load_err += 1
                processed.add(entry["wsl_path"])
        if not imgs:
            continue
        x = torch.stack(imgs).to(device)
        with torch.no_grad():
            feats = model.encode_image(x)
            feats = feats / feats.norm(dim=-1, keepdim=True)
        scores_per_attr = {}
        for attr, (pos, neg) in text_feats.items():
            sims = torch.stack([feats @ pos, feats @ neg], dim=1) * logit_scale
            probs = sims.softmax(dim=1)[:, 0].detach().cpu().tolist()
            scores_per_attr[attr] = probs
        for j, entry in enumerate(keep):
            results.append({
                "wsl_path": entry["wsl_path"],
                "faceset": entry["faceset"],
                "file": entry["file"],
                "mask": round(scores_per_attr["mask"][j], 4),
                "sunglasses": round(scores_per_attr["sunglasses"][j], 4),
            })
            processed.add(entry["wsl_path"])
            n_done_this_run += 1

        if (n_done_this_run % FLUSH_EVERY < args.batch) or (time.time() - last_flush) > 30.0:
            flush()
            last_flush = time.time()
            elapsed = time.time() - t_start
            rate = n_done_this_run / max(0.1, elapsed)
            eta_min = (len(pending) - n_done_this_run) / max(0.1, rate) / 60.0
            print(f"[score] {n_done_this_run}/{len(pending)} "
                  f"rate={rate:.2f} img/s eta={eta_min:.1f}min "
                  f"load_err={n_load_err}", flush=True)

    flush()
    elapsed = time.time() - t_start
    print(f"[done] {n_done_this_run} scored, {n_load_err} load errors, "
          f"{elapsed:.1f}s ({n_done_this_run/max(0.1,elapsed):.2f} img/s) -> {args.out}")


if __name__ == "__main__":
    main()
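Each attribute is scored by comparing the image embedding against a mean positive-prompt embedding and a mean negative-prompt embedding; the softmax over the two scaled similarities is the reported probability, so `mask` and `sunglasses` in `scores.json` are directly comparable values in [0, 1]. A hedged sketch of how the WSL side might consume that file; the cutoffs and the prune decision here are illustrative assumptions, since the real policy lives in `work/filter_occlusions.py`, which is not part of this diff:

```python
# Read the scores.json the DML worker writes and list crops whose mask or
# sunglasses probability crosses a cutoff. The 0.5 thresholds are assumed,
# not taken from the repo.
import json
from pathlib import Path

MASK_CUTOFF = 0.5         # illustrative assumption
SUNGLASSES_CUTOFF = 0.5   # illustrative assumption

scores = json.loads(Path("scores.json").read_text())
flagged = [
    r for r in scores["results"]
    if r["mask"] >= MASK_CUTOFF or r["sunglasses"] >= SUNGLASSES_CUTOFF
]
for r in sorted(flagged, key=lambda r: -max(r["mask"], r["sunglasses"])):
    print(f"{r['faceset']}/{r['file']}  mask={r['mask']:.2f}  sunglasses={r['sunglasses']:.2f}")
```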
@@ -0,0 +1,340 @@
#!/usr/bin/env python3
"""Discover new identities in an Immich-sourced cache and emit them as facesets.

Mirrors `work/cluster_osrc.py`, but the source corpus is an arbitrary
Immich user's `immich_<user>.npz` cache produced by the Windows DML embed
worker. Existing identity centroids come from the union of every faceset
already in `facesets_swap_ready/` (faceset_001..NNN, both auto-clustered
and hand-sorted).

Pipeline:
1. Load immich_<user>.npz; restrict to face records (drop noface).
2. Build centroids of every existing canonical faceset in
   facesets_swap_ready/ (skip era splits and _thin/).
3. Drop immich faces whose nearest existing centroid is within
   EXISTING_MATCH_THRESHOLD; those are already covered by the canonical set.
4. Cluster the remaining among themselves at INITIAL_THRESHOLD.
5. Per cluster: refine-equivalent gates (face_short, blur, det_score),
   plus outlier rejection at OUTLIER_THRESHOLD for clusters of size >= 4.
6. Keep clusters whose surviving unique source-path count is >= MIN_FACES.
7. Number kept clusters past the existing facesets_swap_ready/ max.
8. Synthesize a refine_manifest, hand off to cmd_export_swap, move dirs into
   facesets_swap_ready/, drop a provenance marker, append to top-level
   manifest.json (preserving facesets / thin_eras).
"""

from __future__ import annotations

import argparse
import json
import shutil
import sys
from pathlib import Path

import numpy as np

REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))

from sort_faces import (  # noqa: E402
    _cluster_embeddings,
    cmd_export_swap,
    load_cache,
)

# ---- config -------------------------------------------------------------- #

REPO_WORK = REPO / "work"
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")

EXISTING_MATCH_THRESHOLD = 0.45
INITIAL_THRESHOLD = 0.55

MIN_FACES = 6
MIN_SHORT = 90
MIN_BLUR = 40.0
MIN_DET_SCORE = 0.6
OUTLIER_THRESHOLD = 0.55

TOP_N = 30
EXPORT_OUTLIER_THRESHOLD = 0.45
PAD_RATIO = 0.5
OUT_SIZE = 512
EXPORT_MIN_FACE_SHORT = 100
|
||||
|
||||
|
||||
# ---- helpers ------------------------------------------------------------- #
|
||||
|
||||
def _normalize(v: np.ndarray) -> np.ndarray:
|
||||
n = np.linalg.norm(v)
|
||||
return v / n if n > 0 else v
|
||||
|
||||
|
||||
def _existing_identity_centroids(
|
||||
nl_cache: Path,
|
||||
) -> tuple[np.ndarray, list[str]]:
|
||||
"""Build identity centroids from every canonical faceset_NNN/ in
|
||||
facesets_swap_ready/. Era-split sub-dirs (faceset_001_<era>) and the
|
||||
_thin/ quarantine are skipped. Each faceset's manifest.json provides
|
||||
(source, bbox) keys we use to look up rows in nl_full.npz."""
|
||||
emb, meta, _src, _proc, _aliases = load_cache(nl_cache)
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if len(face_records) != len(emb):
|
||||
raise SystemExit(f"meta/embedding mismatch in {nl_cache}: {len(face_records)} vs {len(emb)}")
|
||||
bbox_idx = {(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)}
|
||||
|
||||
centroids: list[np.ndarray] = []
|
||||
names: list[str] = []
|
||||
for d in sorted(SWAP_READY.iterdir()):
|
||||
if not d.is_dir():
|
||||
continue
|
||||
if d.name.startswith("_"):
|
||||
continue
|
||||
# Skip era-split sub-facesets (faceset_NNN_*).
|
||||
if d.name.startswith("faceset_") and "_" in d.name[len("faceset_"):]:
|
||||
continue
|
||||
man = d / "manifest.json"
|
||||
if not man.exists():
|
||||
continue
|
||||
try:
|
||||
entries = json.loads(man.read_text()).get("faces", [])
|
||||
except Exception:
|
||||
continue
|
||||
keys = [(f["source"], tuple(f.get("bbox") or ())) for f in entries]
|
||||
idxs = [bbox_idx[k] for k in keys if k in bbox_idx]
|
||||
if not idxs:
|
||||
continue
|
||||
centroids.append(_normalize(emb[idxs].mean(axis=0)))
|
||||
names.append(d.name)
|
||||
if not centroids:
|
||||
raise SystemExit("no canonical identity centroids could be built; check facesets_swap_ready/")
|
||||
return np.stack(centroids), names
|
||||
|
||||
|
||||
def _next_faceset_number() -> int:
|
||||
nums = []
|
||||
for d in SWAP_READY.iterdir():
|
||||
if not d.is_dir() or not d.name.startswith("faceset_"):
|
||||
continue
|
||||
tail = d.name[len("faceset_"):]
|
||||
# Take only top-level numbered facesets (no era suffix).
|
||||
if "_" in tail:
|
||||
continue
|
||||
try:
|
||||
nums.append(int(tail))
|
||||
except ValueError:
|
||||
continue
|
||||
return (max(nums) + 1) if nums else 1
|
||||
|
||||
|
||||
# ---- phase 1: discover --------------------------------------------------- #
|
||||
|
||||
def discover_new_clusters(
|
||||
immich_cache: Path, nl_cache: Path, start_nnn: int, source_label: str
|
||||
) -> tuple[dict, list[dict]]:
|
||||
print(f"loading immich cache: {immich_cache}")
|
||||
emb, meta, _src, _proc, _aliases = load_cache(immich_cache)
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if len(face_records) != len(emb):
|
||||
raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")
|
||||
print(f" {len(face_records)} face records, {sum(1 for m in meta if m.get('noface'))} noface")
|
||||
|
||||
print(f"building existing-identity centroids from {SWAP_READY}")
|
||||
cents, cent_names = _existing_identity_centroids(nl_cache)
|
||||
print(f" {len(cent_names)} canonical centroids")
|
||||
|
||||
sims = emb @ cents.T
|
||||
nearest_d = 1.0 - sims.max(axis=1)
|
||||
nearest_id = sims.argmax(axis=1)
|
||||
covered = nearest_d <= EXISTING_MATCH_THRESHOLD
|
||||
print(f"\nfaces already covered (cos-dist <= {EXISTING_MATCH_THRESHOLD}): "
|
||||
f"{int(covered.sum())}/{len(emb)}")
|
||||
for j, name in enumerate(cent_names):
|
||||
c = int(((nearest_id == j) & covered).sum())
|
||||
if c:
|
||||
print(f" -> {name}: {c}")
|
||||
|
||||
new_idx = [i for i in range(len(emb)) if not covered[i]]
|
||||
print(f"\nunmatched immich faces to cluster: {len(new_idx)}")
|
||||
if len(new_idx) <= 1:
|
||||
labels = np.zeros(len(new_idx), dtype=int)
|
||||
else:
|
||||
labels = _cluster_embeddings(emb[new_idx], INITIAL_THRESHOLD)
|
||||
n_clusters = len(set(int(l) for l in labels))
|
||||
sizes = sorted([int((labels == l).sum()) for l in set(labels)], reverse=True)
|
||||
print(f"clusters at threshold {INITIAL_THRESHOLD}: {n_clusters} "
|
||||
f"top sizes: {sizes[:10]}")
|
||||
|
||||
clusters: dict[int, list[int]] = {}
|
||||
for k, lab in enumerate(labels):
|
||||
clusters.setdefault(int(lab), []).append(new_idx[k])
|
||||
|
||||
kept: list[dict] = []
|
||||
drop_quality_total = 0
|
||||
drop_outlier_total = 0
|
||||
for cid, idxs in clusters.items():
|
||||
good: list[int] = []
|
||||
for i in idxs:
|
||||
r = face_records[i]
|
||||
if r.get("face_short", 0) < MIN_SHORT:
|
||||
drop_quality_total += 1; continue
|
||||
if r.get("blur", 0.0) < MIN_BLUR:
|
||||
drop_quality_total += 1; continue
|
||||
if r.get("det_score", 0.0) < MIN_DET_SCORE:
|
||||
drop_quality_total += 1; continue
|
||||
good.append(i)
|
||||
if not good:
|
||||
continue
|
||||
if len(good) >= 4:
|
||||
cent = _normalize(emb[good].mean(axis=0))
|
||||
d = 1.0 - emb[good] @ cent
|
||||
tight = [good[k] for k, dist in enumerate(d) if dist <= OUTLIER_THRESHOLD]
|
||||
drop_outlier_total += len(good) - len(tight)
|
||||
good = tight
|
||||
if not good:
|
||||
continue
|
||||
unique_paths = sorted({face_records[i]["path"] for i in good})
|
||||
if len(unique_paths) < MIN_FACES:
|
||||
continue
|
||||
kept.append({
|
||||
"indices": good,
|
||||
"unique_paths": unique_paths,
|
||||
"size_face": len(good),
|
||||
"size_paths": len(unique_paths),
|
||||
})
|
||||
|
||||
kept.sort(key=lambda c: -c["size_paths"])
|
||||
print(f"\nafter quality+outlier+min_faces: {len(kept)} clusters kept "
|
||||
f"(dropped: quality={drop_quality_total} outlier={drop_outlier_total})")
|
||||
for rank, c in enumerate(kept, start=start_nnn):
|
||||
print(f" faceset_{rank:03d}: faces={c['size_face']:3d} "
|
||||
f"unique_paths={c['size_paths']:3d}")
|
||||
|
||||
facesets = [
|
||||
{
|
||||
"name": f"faceset_{rank:03d}",
|
||||
"image_count": c["size_paths"],
|
||||
"face_count": c["size_face"],
|
||||
"images": c["unique_paths"],
|
||||
}
|
||||
for rank, c in enumerate(kept, start=start_nnn)
|
||||
]
|
||||
manifest = {
|
||||
"params": {
|
||||
"existing_match_threshold": EXISTING_MATCH_THRESHOLD,
|
||||
"initial_threshold": INITIAL_THRESHOLD,
|
||||
"outlier_threshold": OUTLIER_THRESHOLD,
|
||||
"min_faces": MIN_FACES,
|
||||
"min_short": MIN_SHORT,
|
||||
"min_blur": MIN_BLUR,
|
||||
"min_det_score": MIN_DET_SCORE,
|
||||
"source_label": source_label,
|
||||
"source_cache": str(immich_cache),
|
||||
},
|
||||
"facesets": facesets,
|
||||
}
|
||||
return manifest, kept
|
||||
|
||||
|
||||
# ---- phase 2: export + relocate ----------------------------------------- #
|
||||
|
||||
def export_and_relocate(manifest: dict, immich_cache: Path, source_label: str) -> None:
|
||||
synth_path = REPO_WORK / f"synthetic_{source_label}_manifest.json"
|
||||
synth_path.write_text(json.dumps(manifest, indent=2))
|
||||
print(f"\nsynthetic manifest -> {synth_path}")
|
||||
|
||||
out_tmp = SWAP_READY.parent / f"facesets_swap_ready_{source_label}_new"
|
||||
if out_tmp.exists():
|
||||
shutil.rmtree(out_tmp)
|
||||
out_tmp.mkdir(parents=True)
|
||||
|
||||
print(f"running cmd_export_swap -> {out_tmp}")
|
||||
cmd_export_swap(
|
||||
cache_path=immich_cache,
|
||||
refine_manifest_path=synth_path,
|
||||
raw_manifest_path=None,
|
||||
out_dir=out_tmp,
|
||||
top_n=TOP_N,
|
||||
outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
|
||||
pad_ratio=PAD_RATIO,
|
||||
out_size=OUT_SIZE,
|
||||
include_candidates=False,
|
||||
candidate_match_threshold=0.55,
|
||||
candidate_min_score=0.40,
|
||||
min_face_short=EXPORT_MIN_FACE_SHORT,
|
||||
)
|
||||
|
||||
new_top = json.loads((out_tmp / "manifest.json").read_text())
|
||||
new_entries = new_top.get("facesets", [])
|
||||
|
||||
moved = 0
|
||||
for fs_meta in new_entries:
|
||||
name = fs_meta["name"]
|
||||
src_dir = out_tmp / name
|
||||
if not src_dir.exists():
|
||||
print(f"[{name}] export dir missing; skipping")
|
||||
continue
|
||||
dst_dir = SWAP_READY / name
|
||||
if dst_dir.exists():
|
||||
print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
|
||||
continue
|
||||
(src_dir / f"immich_{source_label}.txt").write_text(
|
||||
f"{name}\n\nSource: Immich user {source_label} cluster (auto-discovered).\n"
|
||||
)
|
||||
shutil.move(str(src_dir), str(dst_dir))
|
||||
moved += 1
|
||||
print(f"[{name}] -> {dst_dir}")
|
||||
|
||||
final_manifest_path = SWAP_READY / "manifest.json"
|
||||
if final_manifest_path.exists():
|
||||
existing = json.loads(final_manifest_path.read_text())
|
||||
else:
|
||||
existing = {"facesets": []}
|
||||
existing.setdefault("facesets", [])
|
||||
existing_names = {fs["name"] for fs in existing["facesets"]}
|
||||
appended = 0
|
||||
for entry in new_entries:
|
||||
if entry["name"] in existing_names:
|
||||
print(f"[manifest] {entry['name']} already present; not duplicating")
|
||||
continue
|
||||
existing["facesets"].append(entry)
|
||||
appended += 1
|
||||
final_manifest_path.write_text(json.dumps(existing, indent=2))
|
||||
print(f"\nmerged manifest: appended {appended} entries -> {final_manifest_path}")
|
||||
print(f"moved {moved} faceset directories into {SWAP_READY}")
|
||||
if out_tmp.exists() and not list(out_tmp.iterdir()):
|
||||
out_tmp.rmdir()
|
||||
|
||||
|
||||
# ---- main ---------------------------------------------------------------- #
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("immich_cache", type=Path,
|
||||
help="path to immich_<user>.npz produced by the embed worker")
|
||||
p.add_argument("--nl-cache", type=Path, default=REPO_WORK / "cache" / "nl_full.npz",
|
||||
help="canonical cache for existing identity centroids")
|
||||
p.add_argument("--source-label", default=None,
|
||||
help="short label used in marker filenames; default = stem of immich_cache")
|
||||
p.add_argument("--start-nnn", type=int, default=None,
|
||||
help="first faceset number to assign; default = current max+1 in facesets_swap_ready/")
|
||||
p.add_argument("--dry-run", action="store_true")
|
||||
args = p.parse_args()
|
||||
|
||||
label = args.source_label or args.immich_cache.stem.removeprefix("immich_") or args.immich_cache.stem
|
||||
start_nnn = args.start_nnn if args.start_nnn is not None else _next_faceset_number()
|
||||
print(f"source label: {label!r}; first faceset number: {start_nnn:03d}")
|
||||
|
||||
manifest, kept = discover_new_clusters(args.immich_cache, args.nl_cache, start_nnn, label)
|
||||
if args.dry_run:
|
||||
print("\n--dry-run: stopping after cluster discovery (no exports written).")
|
||||
return
|
||||
if not manifest.get("facesets"):
|
||||
print("no new facesets to build.")
|
||||
return
|
||||
export_and_relocate(manifest, args.immich_cache, label)
|
||||
print("\nDone.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
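The coverage test in step 3 is a single dense similarity against the stacked identity centroids. A sketch of that step on synthetic arrays (real runs use the `.npz` cache embeddings and centroids built from the per-faceset manifests):

```python
# A face is "already covered" when its cosine distance to the nearest existing
# identity centroid is at or below EXISTING_MATCH_THRESHOLD (0.45, as in the
# script). Synthetic arrays stand in for the cache and centroids.
import numpy as np

EXISTING_MATCH_THRESHOLD = 0.45

rng = np.random.default_rng(1)
emb = rng.normal(size=(200, 512)).astype(np.float32)     # new-source embeddings
cents = rng.normal(size=(19, 512)).astype(np.float32)    # existing identity centroids
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
cents /= np.linalg.norm(cents, axis=1, keepdims=True)

sims = emb @ cents.T                       # (n_faces, n_identities)
nearest_d = 1.0 - sims.max(axis=1)         # cos-dist to the closest centroid
nearest_id = sims.argmax(axis=1)           # which identity that was
covered = nearest_d <= EXISTING_MATCH_THRESHOLD

print(f"covered: {int(covered.sum())}/{len(emb)}")
new_idx = np.flatnonzero(~covered)         # only these are clustered at INITIAL_THRESHOLD
print(f"to cluster: {len(new_idx)}")
```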
@@ -0,0 +1,352 @@
#!/usr/bin/env python3
"""Discover new identities in /mnt/x/src/osrc and emit them as facesets.

Workflow (mirrors the shape of build_folders.py, but identities are
discovered by clustering rather than asserted by folder):

1. Load cache; restrict to face records whose canonical or alias path
   lies under /mnt/x/src/osrc/.
2. Build centroids of the existing 19 canonical identities in
   facesets_swap_ready/faceset_001..019. Drop any osrc face whose
   nearest-existing-identity cos-dist <= EXISTING_MATCH_THRESHOLD;
   those are already covered by `extend` and shouldn't seed new
   facesets.
3. Cluster the remaining osrc faces among themselves at
   INITIAL_THRESHOLD (matches `extend`'s new_cluster_threshold default).
4. Per cluster, apply refine-equivalent gates: face_short >= MIN_SHORT,
   blur >= MIN_BLUR, det_score >= MIN_DET_SCORE; for clusters >= 4,
   drop faces with cos-dist > OUTLIER_THRESHOLD from the cluster
   centroid.
5. Keep clusters whose surviving unique source-path count is >= MIN_FACES.
6. Number kept clusters faceset_020, 021, ... (past the highest existing
   in facesets_swap_ready, which is 019). Order by descending size.
7. Synthesize a refine_manifest.json and call cmd_export_swap on it,
   emitting into a temp dir. Move new dirs into facesets_swap_ready/.
8. Append new entries to the top-level facesets_swap_ready/manifest.json
   (preserving existing facesets / thin_eras).
"""

from __future__ import annotations

import json
import shutil
import sys
from pathlib import Path

import numpy as np

REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))

from sort_faces import (  # noqa: E402
    _cluster_embeddings,
    cmd_export_swap,
    load_cache,
)

# ---- config -------------------------------------------------------------- #

CACHE = REPO / "work" / "cache" / "nl_full.npz"
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
OUT_TMP = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready_osrc_new")
SYNTH_MANIFEST = REPO / "work" / "synthetic_osrc_manifest.json"

OSRC_DIR = Path("/mnt/x/src/osrc")
START_NNN = 20  # facesets_swap_ready max is 019; pick up here.

# Existing-identity exclusion: drop osrc faces whose nearest existing
# identity centroid is within this cosine distance. 0.45 matches the
# build_folders.py OSRC_THRESHOLD: at this cutoff the face is already
# routed to an existing identity by extend / build_folders.py.
EXISTING_MATCH_THRESHOLD = 0.45

# Cluster the unmatched.
INITIAL_THRESHOLD = 0.55

# Refine-equivalent gates (per the user's request: drop min_faces to 6).
MIN_FACES = 6
MIN_SHORT = 90
MIN_BLUR = 40.0
MIN_DET_SCORE = 0.6
OUTLIER_THRESHOLD = 0.55  # only applied if cluster >= 4

# export-swap params (defaults from sort_faces.py).
TOP_N = 30
EXPORT_OUTLIER_THRESHOLD = 0.45
PAD_RATIO = 0.5
OUT_SIZE = 512
EXPORT_MIN_FACE_SHORT = 100
|
||||
|
||||
|
||||
# ---- helpers ------------------------------------------------------------- #
|
||||
|
||||
def _normalize(v: np.ndarray) -> np.ndarray:
|
||||
n = np.linalg.norm(v)
|
||||
return v / n if n > 0 else v
|
||||
|
||||
|
||||
def _under(folder: Path, p: str) -> bool:
|
||||
fs = str(folder).rstrip("/") + "/"
|
||||
return p == str(folder) or p.startswith(fs)
|
||||
|
||||
|
||||
def _record_in_folder(rec: dict, folder: Path, path_aliases: dict[str, list[str]]) -> bool:
|
||||
if _under(folder, rec["path"]):
|
||||
return True
|
||||
for alias in path_aliases.get(rec["path"], []):
|
||||
if _under(folder, alias):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def _existing_identity_centroids(
|
||||
emb: np.ndarray, face_records: list[dict]
|
||||
) -> tuple[np.ndarray, list[str]]:
|
||||
"""Build a (n_identities, 512) matrix of L2-normalized centroids and a parallel name list,
|
||||
drawn from the canonical faceset_001..019 manifests in facesets_swap_ready/."""
|
||||
bbox_idx: dict[tuple[str, tuple], int] = {
|
||||
(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)
|
||||
}
|
||||
centroids: list[np.ndarray] = []
|
||||
names: list[str] = []
|
||||
for n in range(1, 20):
|
||||
d = SWAP_READY / f"faceset_{n:03d}"
|
||||
man_path = d / "manifest.json"
|
||||
if not man_path.exists():
|
||||
continue
|
||||
man = json.loads(man_path.read_text())
|
||||
keys = [(f["source"], tuple(f.get("bbox") or ())) for f in man.get("faces", [])]
|
||||
idxs = [bbox_idx[k] for k in keys if k in bbox_idx]
|
||||
if not idxs:
|
||||
continue
|
||||
centroids.append(_normalize(emb[idxs].mean(axis=0)))
|
||||
names.append(d.name)
|
||||
return np.stack(centroids), names
|
||||
|
||||
|
||||
# ---- phase 1: identify new osrc clusters --------------------------------- #
|
||||
|
||||
def discover_new_clusters() -> tuple[dict, list[dict]]:
|
||||
emb, meta, _src_root, _proc, path_aliases = load_cache(CACHE)
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if len(face_records) != len(emb):
|
||||
raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")
|
||||
print(f"Cache: {len(face_records)} face records.")
|
||||
|
||||
# Step 1: filter to osrc.
|
||||
osrc_idx = [
|
||||
i for i, m in enumerate(face_records)
|
||||
if _record_in_folder(m, OSRC_DIR, path_aliases)
|
||||
]
|
||||
print(f"osrc face records: {len(osrc_idx)}")
|
||||
|
||||
# Step 2: drop those already matching an existing identity.
|
||||
cents, cent_names = _existing_identity_centroids(emb, face_records)
|
||||
osrc_emb = emb[osrc_idx]
|
||||
sims = osrc_emb @ cents.T
|
||||
nearest_d = 1.0 - sims.max(axis=1)
|
||||
nearest_id = sims.argmax(axis=1)
|
||||
covered_mask = nearest_d <= EXISTING_MATCH_THRESHOLD
|
||||
n_covered = int(covered_mask.sum())
|
||||
print(
|
||||
f"Already covered by existing 19 identities at cos-dist <= "
|
||||
f"{EXISTING_MATCH_THRESHOLD}: {n_covered}/{len(osrc_idx)}"
|
||||
)
|
||||
# Per-identity coverage breakdown (for logging only).
|
||||
for j, name in enumerate(cent_names):
|
||||
c = int(((nearest_id == j) & covered_mask).sum())
|
||||
if c:
|
||||
print(f" -> {name}: {c}")
|
||||
|
||||
new_idx = [osrc_idx[k] for k in range(len(osrc_idx)) if not covered_mask[k]]
|
||||
print(f"\nUnmatched osrc faces to cluster: {len(new_idx)}")
|
||||
|
||||
# Step 3: cluster the unmatched among themselves.
|
||||
new_emb = emb[new_idx]
|
||||
if len(new_idx) <= 1:
|
||||
labels = np.zeros(len(new_idx), dtype=int)
|
||||
else:
|
||||
labels = _cluster_embeddings(new_emb, INITIAL_THRESHOLD)
|
||||
n_clusters = len(set(int(l) for l in labels))
|
||||
print(
|
||||
f"Initial clusters at threshold {INITIAL_THRESHOLD}: {n_clusters} "
|
||||
f"(top sizes: {sorted([int((labels==l).sum()) for l in set(labels)], reverse=True)[:10]})"
|
||||
)
|
||||
|
||||
# Step 4 + 5: per-cluster refine gates + min_faces.
|
||||
clusters: dict[int, list[int]] = {}
|
||||
for k, lab in enumerate(labels):
|
||||
clusters.setdefault(int(lab), []).append(new_idx[k])
|
||||
|
||||
kept_clusters: list[dict] = []
|
||||
drop_quality_total = 0
|
||||
drop_outlier_total = 0
|
||||
for cid, idxs in clusters.items():
|
||||
# Per-face quality gate.
|
||||
good: list[int] = []
|
||||
for i in idxs:
|
||||
r = face_records[i]
|
||||
if r.get("face_short", 0) < MIN_SHORT:
|
||||
drop_quality_total += 1
|
||||
continue
|
||||
if r.get("blur", 0.0) < MIN_BLUR:
|
||||
drop_quality_total += 1
|
||||
continue
|
||||
if r.get("det_score", 0.0) < MIN_DET_SCORE:
|
||||
drop_quality_total += 1
|
||||
continue
|
||||
good.append(i)
|
||||
if not good:
|
||||
continue
|
||||
|
||||
# Outlier rejection (only if cluster >= 4).
|
||||
if len(good) >= 4:
|
||||
cent = _normalize(emb[good].mean(axis=0))
|
||||
d = 1.0 - emb[good] @ cent
|
||||
tight = [good[k] for k, dist in enumerate(d) if dist <= OUTLIER_THRESHOLD]
|
||||
drop_outlier_total += len(good) - len(tight)
|
||||
good = tight
|
||||
if not good:
|
||||
continue
|
||||
|
||||
unique_paths = sorted({face_records[i]["path"] for i in good})
|
||||
if len(unique_paths) < MIN_FACES:
|
||||
continue
|
||||
|
||||
kept_clusters.append({
|
||||
"indices": good,
|
||||
"unique_paths": unique_paths,
|
||||
"size_face": len(good),
|
||||
"size_paths": len(unique_paths),
|
||||
})
|
||||
|
||||
kept_clusters.sort(key=lambda c: -c["size_paths"])
|
||||
print(
|
||||
f"\nAfter quality gate ({drop_quality_total} dropped) + outlier "
|
||||
f"rejection ({drop_outlier_total} dropped) + min_faces={MIN_FACES}: "
|
||||
f"{len(kept_clusters)} clusters kept"
|
||||
)
|
||||
for rank, c in enumerate(kept_clusters, start=START_NNN):
|
||||
print(
|
||||
f" faceset_{rank:03d}: faces={c['size_face']:3d} "
|
||||
f"unique_paths={c['size_paths']:3d}"
|
||||
)
|
||||
|
||||
# Build synthetic refine_manifest.json compatible with cmd_export_swap.
|
||||
facesets = [
|
||||
{
|
||||
"name": f"faceset_{rank:03d}",
|
||||
"image_count": c["size_paths"],
|
||||
"face_count": c["size_face"],
|
||||
"images": c["unique_paths"],
|
||||
}
|
||||
for rank, c in enumerate(kept_clusters, start=START_NNN)
|
||||
]
|
||||
manifest = {
|
||||
"params": {
|
||||
"existing_match_threshold": EXISTING_MATCH_THRESHOLD,
|
||||
"initial_threshold": INITIAL_THRESHOLD,
|
||||
"outlier_threshold": OUTLIER_THRESHOLD,
|
||||
"min_faces": MIN_FACES,
|
||||
"min_short": MIN_SHORT,
|
||||
"min_blur": MIN_BLUR,
|
||||
"min_det_score": MIN_DET_SCORE,
|
||||
"source_root": str(OSRC_DIR),
|
||||
},
|
||||
"facesets": facesets,
|
||||
}
|
||||
SYNTH_MANIFEST.write_text(json.dumps(manifest, indent=2))
|
||||
print(f"\nSynthetic manifest -> {SYNTH_MANIFEST}")
|
||||
return manifest, kept_clusters
|
||||
|
||||
|
||||
# ---- phase 2: export + relocate + merge top-level manifest -------------- #
|
||||
|
||||
def export_and_relocate(manifest: dict) -> None:
|
||||
if OUT_TMP.exists():
|
||||
shutil.rmtree(OUT_TMP)
|
||||
OUT_TMP.mkdir(parents=True)
|
||||
|
||||
print(f"\nRunning cmd_export_swap -> {OUT_TMP}")
|
||||
cmd_export_swap(
|
||||
cache_path=CACHE,
|
||||
refine_manifest_path=SYNTH_MANIFEST,
|
||||
raw_manifest_path=None,
|
||||
out_dir=OUT_TMP,
|
||||
top_n=TOP_N,
|
||||
outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
|
||||
pad_ratio=PAD_RATIO,
|
||||
out_size=OUT_SIZE,
|
||||
include_candidates=False,
|
||||
candidate_match_threshold=0.55,
|
||||
candidate_min_score=0.40,
|
||||
min_face_short=EXPORT_MIN_FACE_SHORT,
|
||||
)
|
||||
|
||||
new_top = json.loads((OUT_TMP / "manifest.json").read_text())
|
||||
new_entries = new_top.get("facesets", [])
|
||||
|
||||
moved = 0
|
||||
for fs_meta in new_entries:
|
||||
name = fs_meta["name"]
|
||||
src_dir = OUT_TMP / name
|
||||
if not src_dir.exists():
|
||||
print(f"[{name}] export dir missing; skipping")
|
||||
continue
|
||||
dst_dir = SWAP_READY / name
|
||||
if dst_dir.exists():
|
||||
print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
|
||||
continue
|
||||
# Add a marker file so the source provenance is obvious.
|
||||
(src_dir / "osrc.txt").write_text(
|
||||
f"{name}\n\nSource: osrc cluster (auto-discovered, {OSRC_DIR}).\n"
|
||||
)
|
||||
shutil.move(str(src_dir), str(dst_dir))
|
||||
moved += 1
|
||||
print(f"[{name}] -> {dst_dir}")
|
||||
|
||||
# Merge top-level manifest, preserving facesets / thin_eras / etc.
|
||||
final_manifest_path = SWAP_READY / "manifest.json"
|
||||
if final_manifest_path.exists():
|
||||
existing = json.loads(final_manifest_path.read_text())
|
||||
else:
|
||||
existing = {"facesets": []}
|
||||
existing.setdefault("facesets", [])
|
||||
|
||||
existing_names = {fs["name"] for fs in existing["facesets"]}
|
||||
appended = 0
|
||||
for entry in new_entries:
|
||||
if entry["name"] in existing_names:
|
||||
print(f"[manifest] {entry['name']} already present; not duplicating")
|
||||
continue
|
||||
existing["facesets"].append(entry)
|
||||
appended += 1
|
||||
|
||||
final_manifest_path.write_text(json.dumps(existing, indent=2))
|
||||
print(f"\nMerged manifest: appended {appended} entries -> {final_manifest_path}")
|
||||
print(f"Moved {moved} faceset directories into {SWAP_READY}")
|
||||
|
||||
# Clean up temp dir if empty.
|
||||
if OUT_TMP.exists():
|
||||
leftover = list(OUT_TMP.iterdir())
|
||||
if not leftover:
|
||||
OUT_TMP.rmdir()
|
||||
|
||||
|
||||
# ---- main ---------------------------------------------------------------- #
|
||||
|
||||
def main() -> None:
|
||||
dry = "--dry-run" in sys.argv
|
||||
manifest, kept = discover_new_clusters()
|
||||
if dry:
|
||||
print("\n--dry-run: stopping after cluster discovery (no exports written).")
|
||||
return
|
||||
if not manifest.get("facesets"):
|
||||
print("No new facesets to build; nothing to do.")
|
||||
return
|
||||
export_and_relocate(manifest)
|
||||
print("\nDone.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
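Both discovery scripts talk to `cmd_export_swap` through the same synthetic refine-manifest. Its shape, reconstructed from the dict the code above builds for the osrc variant (the immich variant swaps `source_root` for `source_label` / `source_cache`); the concrete values and paths below are illustrative placeholders, not real data:

```python
# Minimal synthetic refine-manifest as handed to cmd_export_swap.
import json

synthetic_manifest = {
    "params": {
        "existing_match_threshold": 0.45,
        "initial_threshold": 0.55,
        "outlier_threshold": 0.55,
        "min_faces": 6,
        "min_short": 90,
        "min_blur": 40.0,
        "min_det_score": 0.6,
        "source_root": "/mnt/x/src/osrc",
    },
    "facesets": [
        {
            "name": "faceset_020",
            "image_count": 2,   # unique source paths that survived the gates
            "face_count": 2,    # face records (can exceed image_count)
            "images": [
                "/mnt/x/src/osrc/example_a.jpg",   # placeholder path
                "/mnt/x/src/osrc/example_b.jpg",   # placeholder path
            ],
        },
    ],
}
print(json.dumps(synthetic_manifest, indent=2))
```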
@@ -0,0 +1,634 @@
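The consolidation script below groups faceset centroids with complete-linkage rather than a plain similarity graph, so a group's weakest within-group similarity still clears the edge threshold. A toy illustration of why that matters, using hand-made similarity values rather than project data:

```python
# Complete-linkage cut at distance (1 - edge_thr) guarantees that EVERY pair
# inside a group has similarity >= edge_thr, unlike single-link / connected
# components, which can chain A~B~C even when A and C are dissimilar.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

edge_thr = 0.45
# Hand-made centroid similarities: A~B strong, B~C borderline, A~C unrelated.
sim = np.array([
    [1.00, 0.70, 0.20],
    [0.70, 1.00, 0.48],
    [0.20, 0.48, 1.00],
])

dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="complete")
labels = fcluster(Z, t=1.0 - edge_thr, criterion="distance")
print(labels)  # A and B share a label; C stays out because sim(A, C) < edge_thr
```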
"""Consolidate facesets_swap_ready/ — find duplicate identities and merge.

Pipeline:
1. analyze: pull arcface embeddings from work/cache/*.npz for every PNG in every
   active faceset (skipping _masked, _thin, era splits). Compute L2-normalized
   centroid per faceset. Group centroids by complete-linkage at sim>=0.45, so
   every within-group pair clears the threshold. Pick primary per group by
   tier (hand-sorted > auto > osrc > immich) + size.
2. report: HTML contact sheet at work/merge_review/index.html grouped by
   candidate cluster, with the top thumbs per faceset, all pairwise sims, and
   a "merge X,Y -> Z" plan. Confident edges (sim>=0.65) are highlighted.
3. apply: combine PNGs of secondaries into primary, re-rank by quality.composite
   descending, renumber 0001..NNNN, re-zip _topN.fsz + _all.fsz, move secondaries
   to facesets_swap_ready/_merged/<name>/, update master manifest with a
   `merged[]` array + `merge_run` provenance block.

Embeddings come from caches (no GPU re-embed needed); the original clusterer used
exactly these vectors so they are the right yardstick. Era splits are excluded
entirely (intentional time-period segmentation, not duplication).
"""

from __future__ import annotations

import argparse
import json
import re
import shutil
import sys
import time
from pathlib import Path

import numpy as np
from PIL import Image
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
CACHES = [
    Path("/opt/face-sets/work/cache/nl_full.npz"),
    Path("/opt/face-sets/work/cache/immich_peter.npz"),
    Path("/opt/face-sets/work/cache/immich_nic.npz"),
]

ERA_SPLIT_RE = re.compile(r"^faceset_\d+_(?:\d{4}-\d{2,4}|\d{4}|undated)$")
|
||||
|
||||
|
||||
# ----------------------------- helpers -----------------------------
|
||||
|
||||
def load_caches():
|
||||
"""Return (rec_index, alias_map). rec_index keyed by (path, bbox_tuple)
|
||||
-> embedding (np.float32, shape (512,) L2-normalized).
|
||||
alias_map maps every alias path -> canonical path."""
|
||||
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
|
||||
alias_map: dict[str, str] = {}
|
||||
n_total = 0
|
||||
for c in CACHES:
|
||||
if not c.exists():
|
||||
print(f"[warn] cache missing: {c}", file=sys.stderr)
|
||||
continue
|
||||
d = np.load(c, allow_pickle=True)
|
||||
emb = d["embeddings"]
|
||||
meta = json.loads(str(d["meta"]))
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if len(face_records) != len(emb):
|
||||
raise SystemExit(f"meta/emb mismatch in {c}: {len(face_records)} vs {len(emb)}")
|
||||
# path_aliases may be present
|
||||
if "path_aliases" in d.files:
|
||||
paliases = json.loads(str(d["path_aliases"]))
|
||||
for canon, alist in paliases.items():
|
||||
alias_map.setdefault(canon, canon)
|
||||
for a in alist:
|
||||
alias_map[a] = canon
|
||||
for i, rec in enumerate(face_records):
|
||||
p = rec["path"]
|
||||
bbox = tuple(int(x) for x in rec["bbox"])
|
||||
v = emb[i].astype(np.float32)
|
||||
n = float(np.linalg.norm(v))
|
||||
if n > 0:
|
||||
v = v / n
|
||||
rec_index[(p, bbox)] = v
|
||||
alias_map.setdefault(p, p)
|
||||
print(f"[cache] {c.name}: +{len(face_records)} face records (running total {len(rec_index)})", file=sys.stderr)
|
||||
n_total += len(face_records)
|
||||
print(f"[cache] indexed {n_total} face records, {len(alias_map)} path aliases", file=sys.stderr)
|
||||
return rec_index, alias_map
|
||||
|
||||
|
||||
def faceset_tier(name: str) -> int:
|
||||
"""Lower number = higher priority for primary selection."""
|
||||
m = re.match(r"^faceset_0*(\d+)$", name)
|
||||
if not m:
|
||||
return 99 # unknown structure
|
||||
n = int(m.group(1))
|
||||
if 13 <= n <= 19:
|
||||
return 0 # hand-sorted
|
||||
if 1 <= n <= 12:
|
||||
return 1 # auto-clustered
|
||||
if 20 <= n <= 25:
|
||||
return 2 # osrc
|
||||
if 26 <= n <= 264:
|
||||
return 3 # immich peter
|
||||
if 265 <= n:
|
||||
return 4 # immich nic and beyond
|
||||
return 99
|
||||
|
||||
|
||||
def is_era_split(name: str) -> bool:
|
||||
return bool(ERA_SPLIT_RE.match(name))
|
||||
|
||||
|
||||
def faceset_centroid(faceset_dir: Path, rec_index, alias_map):
|
||||
"""Return (centroid, n_used, n_missing) where centroid is L2-normalized mean
|
||||
of embeddings of the faces listed in the per-faceset manifest. Falls back to
|
||||
None if too few embeddings found."""
|
||||
manifest = faceset_dir / "manifest.json"
|
||||
if not manifest.exists():
|
||||
return None, 0, 0
|
||||
m = json.loads(manifest.read_text())
|
||||
vecs = []
|
||||
n_missing = 0
|
||||
for f in m.get("faces", []):
|
||||
src = f.get("source")
|
||||
bbox = f.get("bbox")
|
||||
if src is None or bbox is None:
|
||||
n_missing += 1
|
||||
continue
|
||||
bbox_t = tuple(int(x) for x in bbox)
|
||||
canon = alias_map.get(src, src)
|
||||
v = rec_index.get((canon, bbox_t))
|
||||
if v is None and canon != src:
|
||||
v = rec_index.get((src, bbox_t))
|
||||
if v is None:
|
||||
n_missing += 1
|
||||
continue
|
||||
vecs.append(v)
|
||||
if len(vecs) < 3:
|
||||
return None, len(vecs), n_missing
|
||||
arr = np.stack(vecs).astype(np.float32)
|
||||
c = arr.mean(axis=0)
|
||||
n = float(np.linalg.norm(c))
|
||||
if n > 0:
|
||||
c = c / n
|
||||
return c, len(vecs), n_missing
|
||||
|
||||
|
||||
def connected_components(adj: dict[int, set[int]]) -> list[list[int]]:
|
||||
seen: set[int] = set()
|
||||
comps = []
|
||||
for node in adj:
|
||||
if node in seen:
|
||||
continue
|
||||
stack = [node]
|
||||
comp = []
|
||||
while stack:
|
||||
x = stack.pop()
|
||||
if x in seen:
|
||||
continue
|
||||
seen.add(x)
|
||||
comp.append(x)
|
||||
for y in adj.get(x, set()):
|
||||
if y not in seen:
|
||||
stack.append(y)
|
||||
comps.append(sorted(comp))
|
||||
return comps
|
||||
|
||||
|
||||
# ----------------------------- analyze -----------------------------
|
||||
|
||||
def cmd_analyze(args):
|
||||
rec_index, alias_map = load_caches()
|
||||
|
||||
# collect active facesets
|
||||
active = []
|
||||
for d in sorted(ROOT.iterdir()):
|
||||
if not d.is_dir() or d.name.startswith("_"):
|
||||
continue
|
||||
if is_era_split(d.name):
|
||||
continue
|
||||
active.append(d)
|
||||
print(f"[scan] {len(active)} active facesets (era splits + _masked + _thin excluded)", file=sys.stderr)
|
||||
|
||||
centroids: dict[str, np.ndarray] = {}
|
||||
sizes: dict[str, int] = {}
|
||||
skipped = []
|
||||
t0 = time.time()
|
||||
for fs in active:
|
||||
c, n_used, n_miss = faceset_centroid(fs, rec_index, alias_map)
|
||||
if c is None:
|
||||
skipped.append((fs.name, n_used, n_miss))
|
||||
continue
|
||||
centroids[fs.name] = c
|
||||
sizes[fs.name] = n_used
|
||||
print(f"[centroid] {len(centroids)} facesets centroided in {time.time()-t0:.1f}s; "
|
||||
f"{len(skipped)} skipped (too few embeddings)", file=sys.stderr)
|
||||
if skipped:
|
||||
for n, u, m in skipped[:10]:
|
||||
print(f" skip {n}: used={u} missing={m}", file=sys.stderr)
|
||||
if len(skipped) > 10:
|
||||
print(f" ... +{len(skipped)-10} more", file=sys.stderr)
|
||||
|
||||
names = sorted(centroids.keys())
|
||||
if not names:
|
||||
raise SystemExit("no centroids built")
|
||||
|
||||
# similarity matrix
|
||||
M = np.stack([centroids[n] for n in names]).astype(np.float32) # (N, 512), normalized
|
||||
sim = M @ M.T # (N, N) cosine since unit-normalized
|
||||
np.clip(sim, -1.0, 1.0, out=sim)
|
||||
|
||||
edge_thr = args.edge
|
||||
confident_thr = args.confident
|
||||
|
||||
# complete-linkage agglomerative clustering on cosine distance.
|
||||
# Cut at edge threshold: groups are guaranteed to have ALL pairs sim >= edge_thr.
|
||||
# This avoids the chaining problem of single-link / connected-components.
|
||||
n = len(names)
|
||||
dist = 1.0 - sim
|
||||
np.fill_diagonal(dist, 0.0)
|
||||
# symmetrize numerical noise
|
||||
dist = (dist + dist.T) / 2.0
|
||||
np.clip(dist, 0.0, 2.0, out=dist)
|
||||
cond = squareform(dist, checks=False)
|
||||
Z = linkage(cond, method="complete")
|
||||
cut_dist = 1.0 - edge_thr # complete-link distance corresponds to (1 - min sim)
|
||||
labels = fcluster(Z, t=cut_dist, criterion="distance") # 1-indexed cluster ids
|
||||
|
||||
cluster_members: dict[int, list[int]] = {}
|
||||
for idx, lbl in enumerate(labels):
|
||||
cluster_members.setdefault(int(lbl), []).append(idx)
|
||||
comps = [sorted(idxs) for idxs in cluster_members.values() if len(idxs) > 1]
|
||||
|
||||
n_pairs_in_groups = 0
|
||||
for c in comps:
|
||||
n_pairs_in_groups += len(c) * (len(c) - 1) // 2
|
||||
print(f"[graph] complete-linkage cut at sim>={edge_thr}: {len(comps)} multi-faceset groups "
|
||||
f"({n_pairs_in_groups} within-group pairs)", file=sys.stderr)
|
||||
|
||||
# pick primary per group: lowest tier number, then largest size
|
||||
groups_out = []
|
||||
for comp in comps:
|
||||
members = [names[i] for i in comp]
|
||||
members_sorted = sorted(members, key=lambda x: (faceset_tier(x), -sizes.get(x, 0), x))
|
||||
primary = members_sorted[0]
|
||||
secondaries = members_sorted[1:]
|
||||
# gather pairwise sims within group
|
||||
pair_sims = []
|
||||
idx_of = {names[i]: i for i in comp}
|
||||
for a in members:
|
||||
for b in members:
|
||||
if a >= b:
|
||||
continue
|
||||
pair_sims.append({"a": a, "b": b, "sim": round(float(sim[idx_of[a], idx_of[b]]), 4)})
|
||||
# confidence: minimum within-group sim (the weakest link)
|
||||
min_link = min(p["sim"] for p in pair_sims)
|
||||
max_link = max(p["sim"] for p in pair_sims)
|
||||
confidence = "confident" if min_link >= confident_thr else "uncertain"
|
||||
groups_out.append({
|
||||
"primary": primary,
|
||||
"secondaries": secondaries,
|
||||
"members": members_sorted,
|
||||
"tiers": {n: faceset_tier(n) for n in members},
|
||||
"sizes": {n: sizes.get(n, 0) for n in members},
|
||||
"pair_sims": pair_sims,
|
||||
"min_link": round(min_link, 4),
|
||||
"max_link": round(max_link, 4),
|
||||
"confidence": confidence,
|
||||
})
|
||||
# sort: confident first, then by max_link desc
|
||||
groups_out.sort(key=lambda g: (0 if g["confidence"] == "confident" else 1, -g["max_link"]))
|
||||
|
||||
out = {
|
||||
"thresholds": {"edge": edge_thr, "confident": confident_thr},
|
||||
"n_active": len(active),
|
||||
"n_centroided": len(centroids),
|
||||
"n_skipped": len(skipped),
|
||||
"skipped_reasons": [{"name": n, "used": u, "missing": m} for n, u, m in skipped],
|
||||
"n_groups": len(groups_out),
|
||||
"n_facesets_in_groups": sum(len(g["members"]) for g in groups_out),
|
||||
"groups": groups_out,
|
||||
}
|
||||
op = Path(args.out)
|
||||
op.parent.mkdir(parents=True, exist_ok=True)
|
||||
op.write_text(json.dumps(out, indent=2))
|
||||
confident = sum(1 for g in groups_out if g["confidence"] == "confident")
|
||||
uncertain = sum(1 for g in groups_out if g["confidence"] == "uncertain")
|
||||
print(f"[done] {len(groups_out)} groups ({confident} confident, {uncertain} uncertain) -> {op}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- report -----------------------------
|
||||
|
||||
def cmd_report(args):
|
||||
candidates = json.loads(Path(args.candidates).read_text())
|
||||
out_dir = Path(args.out)
|
||||
thumbs_dir = out_dir / "thumbs"
|
||||
thumbs_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
THUMB = 140
|
||||
THUMBS_PER_FACESET = 4
|
||||
|
||||
def make_thumb(faceset: str, fname: str) -> str:
|
||||
d = thumbs_dir / faceset
|
||||
d.mkdir(parents=True, exist_ok=True)
|
||||
dst = d / (Path(fname).stem + ".jpg")
|
||||
if not dst.exists():
|
||||
try:
|
||||
src = ROOT / faceset / "faces" / fname
|
||||
img = Image.open(src).convert("RGB")
|
||||
img.thumbnail((THUMB, THUMB), Image.LANCZOS)
|
||||
img.save(dst, "JPEG", quality=82)
|
||||
except Exception as e:
|
||||
print(f"[thumb-skip] {faceset}/{fname}: {e}", file=sys.stderr)
|
||||
return ""
|
||||
return f"thumbs/{faceset}/{Path(fname).stem}.jpg"
|
||||
|
||||
rows = []
|
||||
for gi, g in enumerate(candidates["groups"]):
|
||||
primary = g["primary"]
|
||||
sec = g["secondaries"]
|
||||
conf_cls = "confident" if g["confidence"] == "confident" else "uncertain"
|
||||
rows.append(f"<section class='grp {conf_cls}' id='g{gi}'>")
|
||||
rows.append(f"<h2>group #{gi+1} <small>({g['confidence']}; min_sim={g['min_link']:.3f}, max_sim={g['max_link']:.3f})</small></h2>")
|
||||
rows.append(f"<div class='plan'>merge <b>{', '.join(sec)}</b> → <b>{primary}</b></div>")
|
||||
# member rows
|
||||
for name in g["members"]:
|
||||
tier = g["tiers"][name]
|
||||
sz = g["sizes"][name]
|
||||
tier_label = ["hand-sorted", "auto", "osrc", "immich-peter", "immich-nic", "?"][min(tier, 5)]
|
||||
badge = "PRIMARY" if name == primary else "secondary"
|
||||
rows.append(f"<div class='member'>")
|
||||
rows.append(f"<div class='label'><span class='badge {badge.lower()}'>{badge}</span> "
|
||||
f"<b>{name}</b> <small>tier={tier_label} · n={sz}</small></div>")
|
||||
rows.append("<div class='thumbs'>")
|
||||
faces_dir = ROOT / name / "faces"
|
||||
files = sorted(faces_dir.glob("*.png"))[:THUMBS_PER_FACESET]
|
||||
for f in files:
|
||||
rel = make_thumb(name, f.name)
|
||||
if rel:
|
||||
rows.append(f"<img src='{rel}' loading='lazy' title='{f.name}'>")
|
||||
rows.append("</div></div>")
|
||||
# pairwise sims
|
||||
rows.append("<table class='sims'><tr><th>a</th><th>b</th><th>sim</th></tr>")
|
||||
for ps in sorted(g["pair_sims"], key=lambda x: -x["sim"]):
|
||||
cls = "hi" if ps["sim"] >= candidates["thresholds"]["confident"] else "mid"
|
||||
rows.append(f"<tr><td>{ps['a']}</td><td>{ps['b']}</td><td class='{cls}'>{ps['sim']:.3f}</td></tr>")
|
||||
rows.append("</table>")
|
||||
rows.append("</section>")
|
||||
|
||||
nav = " · ".join(f"<a href='#g{i}'>#{i+1}</a>" for i in range(len(candidates["groups"])))
|
||||
|
||||
html = f"""<!doctype html>
|
||||
<html><head><meta charset='utf-8'><title>Faceset merge review</title>
|
||||
<style>
|
||||
body {{ font-family: system-ui, sans-serif; background: #111; color: #eee; padding: 1em; }}
|
||||
h1 {{ margin-top: 0; }}
|
||||
h2 {{ margin: 0; }}
|
||||
small {{ color: #999; font-weight: normal; }}
|
||||
section.grp {{ background: #1a1a1a; border-radius: 6px; padding: 12px; margin: 12px 0; }}
|
||||
section.grp.confident {{ border-left: 4px solid #5fa05f; }}
|
||||
section.grp.uncertain {{ border-left: 4px solid #ffb050; }}
|
||||
.plan {{ margin: .5em 0; color: #6cf; }}
|
||||
.member {{ margin: 8px 0; padding: 6px; background: #222; border-radius: 4px; }}
|
||||
.label {{ font-family: monospace; font-size: 13px; }}
|
||||
.badge {{ display: inline-block; padding: 0 6px; font-size: 10px; border-radius: 2px; }}
|
||||
.badge.primary {{ background: #5fa05f; color: #000; font-weight: bold; }}
|
||||
.badge.secondary {{ background: #444; color: #ccc; }}
|
||||
.thumbs {{ display: flex; gap: 4px; margin-top: 4px; flex-wrap: wrap; }}
|
||||
.thumbs img {{ height: 140px; width: auto; border-radius: 3px; }}
|
||||
table.sims {{ font-family: monospace; font-size: 11px; margin-top: 6px; border-collapse: collapse; }}
|
||||
table.sims td, table.sims th {{ padding: 1px 8px; border: 1px solid #333; text-align: left; }}
|
||||
table.sims td.hi {{ color: #5fa05f; font-weight: bold; }}
|
||||
table.sims td.mid {{ color: #ffb050; }}
|
||||
.nav {{ position: sticky; top: 0; background: #111; padding: .5em 0; border-bottom: 1px solid #333; font-size: 12px; }}
|
||||
a {{ color: #6cf; }}
|
||||
</style></head>
|
||||
<body>
|
||||
<h1>Merge review — {len(candidates['groups'])} candidate groups
|
||||
<small>(edge>={candidates['thresholds']['edge']}, confident>={candidates['thresholds']['confident']})</small></h1>
|
||||
<p>{candidates['n_centroided']} of {candidates['n_active']} active facesets centroided
|
||||
(skipped {candidates['n_skipped']} for too few cached embeddings).
|
||||
Green = confident (min within-group sim >= {candidates['thresholds']['confident']}); orange = uncertain.</p>
|
||||
<div class='nav'>{nav}</div>
|
||||
{''.join(rows)}
|
||||
</body></html>"""
|
||||
|
||||
out_html = out_dir / "index.html"
|
||||
out_html.write_text(html)
|
||||
print(f"[done] {out_html}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- apply -----------------------------
|
||||
|
||||
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
|
||||
import zipfile
|
||||
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
|
||||
for i, p in enumerate(pngs):
|
||||
zf.write(p, arcname=f"{i:04d}.png")
|
||||
|
||||
|
||||
def cmd_apply(args):
|
||||
candidates = json.loads(Path(args.candidates).read_text())
|
||||
master_path = ROOT / "manifest.json"
|
||||
master = json.loads(master_path.read_text())
|
||||
by_name = {f["name"]: f for f in master.get("facesets", [])}
|
||||
|
||||
# filter: skip "uncertain" groups unless --include-uncertain
|
||||
accepted = [g for g in candidates["groups"]
|
||||
if g["confidence"] == "confident" or args.include_uncertain]
|
||||
skipped_unc = [g for g in candidates["groups"]
|
||||
if g["confidence"] == "uncertain" and not args.include_uncertain]
|
||||
# explicit --exclude / --only filters (group indices in the candidates file)
|
||||
if args.only:
|
||||
only = {int(s) for s in args.only.split(",")}
|
||||
accepted = [g for i, g in enumerate(candidates["groups"]) if i in only]
|
||||
if args.exclude:
|
||||
excl = {int(s) for s in args.exclude.split(",")}
|
||||
accepted = [g for i, g in enumerate(accepted) if i not in excl]
|
||||
|
||||
print(f"[plan] {len(accepted)} groups will be merged "
|
||||
f"({len(skipped_unc)} uncertain skipped)", file=sys.stderr)
|
||||
|
||||
if args.dry_run:
|
||||
for g in accepted:
|
||||
print(f" merge {g['secondaries']} -> {g['primary']} "
|
||||
f"({g['confidence']}, min_sim={g['min_link']:.3f})")
|
||||
return
|
||||
|
||||
merged_dir = ROOT / "_merged"
|
||||
merged_dir.mkdir(exist_ok=True)
|
||||
new_facesets: list[dict] = []
|
||||
new_merged: list[dict] = list(master.get("merged", []))
|
||||
consumed_names: set[str] = set()
|
||||
primary_updates: dict[str, dict] = {} # name -> new entry
|
||||
primary_absorbed: dict[str, list[dict]] = {} # primary_name -> [secondary entries]
|
||||
|
||||
for g in accepted:
|
||||
primary = g["primary"]
|
||||
if primary not in by_name:
|
||||
print(f"[warn] primary {primary} not in master; skipping group", file=sys.stderr)
|
||||
continue
|
||||
primary_dir = ROOT / primary
|
||||
if not primary_dir.is_dir():
|
||||
print(f"[warn] primary dir {primary_dir} missing; skipping group", file=sys.stderr)
|
||||
continue
|
||||
primary_faces = primary_dir / "faces"
|
||||
primary_manifest_path = primary_dir / "manifest.json"
|
||||
primary_manifest = json.loads(primary_manifest_path.read_text())
|
||||
|
||||
# gather all face entries: primary + each secondary
|
||||
combined_faces: list[dict] = list(primary_manifest.get("faces", []))
|
||||
# adjust composite quality fall-back: ensure key exists
|
||||
for f in combined_faces:
|
||||
f.setdefault("origin_faceset", primary)
|
||||
|
||||
for sec in g["secondaries"]:
|
||||
sec_dir = ROOT / sec
|
||||
if not sec_dir.is_dir():
|
||||
print(f"[warn] secondary {sec} missing; skipping", file=sys.stderr)
|
||||
continue
|
||||
sec_manifest_path = sec_dir / "manifest.json"
|
||||
sec_manifest = json.loads(sec_manifest_path.read_text()) if sec_manifest_path.exists() else {"faces": []}
|
||||
for f in sec_manifest.get("faces", []):
|
||||
f = dict(f)
|
||||
f["origin_faceset"] = sec
|
||||
combined_faces.append(f)
|
||||
|
||||
# rank by quality.composite descending; ties broken by lower cosd_centroid
|
||||
def sort_key(f):
|
||||
q = f.get("quality", {}).get("composite", 0)
|
||||
d = f.get("cosd_centroid", 1.0)
|
||||
return (-q, d)
|
||||
combined_faces.sort(key=sort_key)
|
||||
|
||||
# renumber and stage PNGs into a fresh staging dir, then atomically swap
|
||||
staging = primary_dir / "_faces_new"
|
||||
if staging.exists():
|
||||
shutil.rmtree(staging)
|
||||
staging.mkdir()
|
||||
new_face_entries = []
|
||||
for new_rank, f in enumerate(combined_faces, start=1):
|
||||
origin = f.pop("origin_faceset")
|
||||
old_png_rel = f["png"] # e.g. "faces/0042.png"
|
||||
old_png_name = Path(old_png_rel).name
|
||||
origin_png = ROOT / origin / "faces" / old_png_name
|
||||
if not origin_png.exists():
|
||||
# could be in _dropped if occlusion-pruned; skip
|
||||
continue
|
||||
new_name = f"{new_rank:04d}.png"
|
||||
shutil.copy2(origin_png, staging / new_name)
|
||||
f = dict(f)
|
||||
f["rank"] = new_rank
|
||||
f["png"] = f"faces/{new_name}"
|
||||
f["origin_faceset"] = origin # preserve provenance in manifest
|
||||
new_face_entries.append(f)
|
||||
|
||||
# swap directories: primary/faces -> primary/_faces_old, staging -> primary/faces
|
||||
old_faces_holding = primary_dir / "_faces_old"
|
||||
if old_faces_holding.exists():
|
||||
shutil.rmtree(old_faces_holding)
|
||||
if primary_faces.exists():
|
||||
primary_faces.rename(old_faces_holding)
|
||||
staging.rename(primary_faces)
|
||||
# migrate _dropped/ from old holding (so occlusion-pruned PNGs remain accessible)
|
||||
old_dropped = old_faces_holding / "_dropped"
|
||||
if old_dropped.exists():
|
||||
(primary_faces / "_dropped").mkdir(exist_ok=True)
|
||||
for x in old_dropped.iterdir():
|
||||
shutil.move(str(x), str(primary_faces / "_dropped" / x.name))
|
||||
shutil.rmtree(old_faces_holding)
|
||||
|
||||
# re-zip .fsz
|
||||
survivor_pngs = sorted(primary_faces.glob("*.png"))
|
||||
top_n = primary_manifest.get("top_n", 30)
|
||||
top_n_eff = min(top_n, len(survivor_pngs))
|
||||
# remove old .fsz files
|
||||
for old in primary_dir.glob("*.fsz"):
|
||||
old.unlink()
|
||||
top_fsz_name = f"{primary}_top{top_n_eff}.fsz"
|
||||
all_fsz_name = f"{primary}_all.fsz"
|
||||
_zip_png_list(survivor_pngs[:top_n_eff], primary_dir / top_fsz_name)
|
||||
if len(survivor_pngs) > top_n_eff:
|
||||
_zip_png_list(survivor_pngs, primary_dir / all_fsz_name)
|
||||
all_fsz_used = all_fsz_name
|
||||
else:
|
||||
all_fsz_used = None
|
||||
|
||||
# update primary's local manifest
|
||||
primary_manifest["faces"] = new_face_entries
|
||||
primary_manifest["exported"] = len(new_face_entries)
|
||||
primary_manifest["fsz_top"] = top_fsz_name
|
||||
primary_manifest["fsz_all"] = all_fsz_used
|
||||
primary_manifest["top_n"] = top_n_eff
|
||||
primary_manifest.setdefault("merge_history", []).append({
|
||||
"absorbed": g["secondaries"],
|
||||
"min_link": g["min_link"],
|
||||
"max_link": g["max_link"],
|
||||
"confidence": g["confidence"],
|
||||
})
|
||||
primary_manifest_path.write_text(json.dumps(primary_manifest, indent=2))
|
||||
|
||||
# move secondary directories into _merged/
|
||||
absorbed_master_entries: list[dict] = []
|
||||
for sec in g["secondaries"]:
|
||||
sec_dir = ROOT / sec
|
||||
target = merged_dir / sec
|
||||
if not sec_dir.is_dir():
|
||||
continue
|
||||
if target.exists():
|
||||
shutil.rmtree(sec_dir) # already moved by previous run; clean stub
|
||||
else:
|
||||
shutil.move(str(sec_dir), str(target))
|
||||
sec_master = dict(by_name.get(sec, {"name": sec}))
|
||||
sec_master["merged_into"] = primary
|
||||
sec_master["relpath"] = f"_merged/{sec}"
|
||||
sec_master["fsz_top"] = None
|
||||
sec_master["fsz_all"] = None
|
||||
absorbed_master_entries.append(sec_master)
|
||||
consumed_names.add(sec)
|
||||
|
||||
new_merged.extend(absorbed_master_entries)
|
||||
|
||||
# bump primary master entry
|
||||
prim_master = dict(by_name[primary])
|
||||
prim_master["exported"] = len(new_face_entries)
|
||||
prim_master["top_n"] = top_n_eff
|
||||
prim_master["fsz_top"] = top_fsz_name
|
||||
prim_master["fsz_all"] = all_fsz_used
|
||||
prim_master.setdefault("merge_history", []).append({
|
||||
"absorbed": g["secondaries"],
|
||||
"min_link": g["min_link"],
|
||||
"max_link": g["max_link"],
|
||||
})
|
||||
primary_updates[primary] = prim_master
|
||||
|
||||
print(f"[merged] {g['secondaries']} -> {primary} "
|
||||
f"now {len(new_face_entries)} png", file=sys.stderr)
|
||||
|
||||
# rebuild master facesets list
|
||||
for entry in master.get("facesets", []):
|
||||
nm = entry["name"]
|
||||
if nm in consumed_names:
|
||||
continue
|
||||
if nm in primary_updates:
|
||||
new_facesets.append(primary_updates[nm])
|
||||
else:
|
||||
new_facesets.append(entry)
|
||||
|
||||
new_master = dict(master)
|
||||
new_master["facesets"] = new_facesets
|
||||
new_master["merged"] = new_merged
|
||||
new_master["merge_run"] = {
|
||||
"thresholds": candidates["thresholds"],
|
||||
"groups_applied": len(accepted),
|
||||
"facesets_consumed": len(consumed_names),
|
||||
"include_uncertain": bool(args.include_uncertain),
|
||||
}
|
||||
tmp = master_path.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(new_master, indent=2))
|
||||
tmp.replace(master_path)
|
||||
print(f"[done] master manifest updated: {len(new_facesets)} active, "
|
||||
f"{len(new_merged)} merged, {len(consumed_names)} consumed in this run",
|
||||
file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- main -----------------------------
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
sub = ap.add_subparsers(dest="cmd", required=True)
|
||||
|
||||
a = sub.add_parser("analyze")
|
||||
a.add_argument("--out", required=True)
|
||||
a.add_argument("--edge", type=float, default=0.45, help="min cosine sim to draw an edge (default 0.45)")
|
||||
a.add_argument("--confident", type=float, default=0.65, help="min within-group sim to be confident (default 0.65)")
|
||||
a.set_defaults(func=cmd_analyze)
|
||||
|
||||
r = sub.add_parser("report")
|
||||
r.add_argument("--candidates", required=True)
|
||||
r.add_argument("--out", required=True)
|
||||
r.set_defaults(func=cmd_report)
|
||||
|
||||
p = sub.add_parser("apply")
|
||||
p.add_argument("--candidates", required=True)
|
||||
p.add_argument("--include-uncertain", action="store_true",
|
||||
help="apply uncertain groups too (default: confident only)")
|
||||
p.add_argument("--only", default=None, help="comma-separated group indices to apply")
|
||||
p.add_argument("--exclude", default=None, help="comma-separated group indices to skip")
|
||||
p.add_argument("--dry-run", action="store_true")
|
||||
p.set_defaults(func=cmd_apply)
|
||||
|
||||
args = ap.parse_args()
|
||||
args.func(args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
@@ -0,0 +1,594 @@
|
||||
"""Corpus-wide dedup + roop-unleashed optimization.
|
||||
|
||||
Two passes:
|
||||
1. Cross-family byte-identical PNG dedup (same SHA256 in two different identity
|
||||
families) — keep the higher-tier family copy. Era splits of the same parent
|
||||
identity (faceset_NNN_*) are intentional duplications and are NOT deduped
|
||||
within their family.
|
||||
2. Within-faceset near-duplicate dedup using cached arcface embeddings
|
||||
(cosine sim >= 0.95). Keep highest quality.composite, drop the rest.
|
||||
|
||||
Plus a Windows-DML multi-face audit (separate phase via clip_worker-style split):
|
||||
3. Re-detect each PNG with insightface; flag any with 0 or >1 detected faces.
|
||||
The roop loader appends every detected face per PNG, so multi-face crops
|
||||
pollute identity averaging.
|
||||
|
||||
All flagged PNGs are MOVED to <faceset>/faces/_dropped/ (reversible). Affected
|
||||
.fsz files are re-zipped, manifests updated.
|
||||
|
||||
CLI:
|
||||
analyze --out work/dedup_audit/dedup_plan.json
|
||||
apply --plan ... [--dry-run]
|
||||
stage_multiface --out work/dedup_audit/multiface_queue.json
|
||||
merge_multiface --results <worker_out> --out work/dedup_audit/multiface_plan.json
|
||||
apply_multiface --plan ... [--dry-run]
|
||||
report --dedup ... --multiface ... --out work/dedup_audit
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import re
|
||||
import shutil
|
||||
import sys
|
||||
import time
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
|
||||
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
|
||||
WIN_ROOT = r"E:\temp_things\fcswp\nl_sorted\facesets_swap_ready"
|
||||
CACHES = [
|
||||
Path("/opt/face-sets/work/cache/nl_full.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_peter.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_nic.npz"),
|
||||
]
|
||||
|
||||
NEAR_DUP_THRESHOLD = 0.95
|
||||
HASH_PARALLEL = 16
# ----------------------------- helpers -----------------------------
def faceset_tier(name: str) -> int:
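# Keeper priority for the cross-family byte-dedup: lower tier wins. The ranges below
# appear to encode this corpus's curation order (13-19 first, then 1-12, 20-25,
# 26-264, 265+); anything unparseable falls through to 99, the lowest priority.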
|
||||
m = re.match(r"^faceset_0*(\d+)(?:_.+)?$", name)
|
||||
if not m:
|
||||
return 99
|
||||
n = int(m.group(1))
|
||||
if 13 <= n <= 19:
|
||||
return 0
|
||||
if 1 <= n <= 12:
|
||||
return 1
|
||||
if 20 <= n <= 25:
|
||||
return 2
|
||||
if 26 <= n <= 264:
|
||||
return 3
|
||||
if 265 <= n:
|
||||
return 4
|
||||
return 99
|
||||
|
||||
|
||||
def faceset_family(name: str) -> str:
|
||||
"""faceset_001_2010-13 → faceset_001; faceset_001 → faceset_001."""
|
||||
m = re.match(r"^(faceset_\d+)(?:_.+)?$", name)
|
||||
return m.group(1) if m else name
|
||||
|
||||
|
||||
def wsl_to_win(p: str) -> str:
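# /mnt/e/foo/bar -> E:\foo\bar. chr(92) is a literal backslash, kept out of the
# f-string expression because backslashes were not allowed there before Python 3.12.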
|
||||
s = str(p)
|
||||
if s.startswith("/mnt/"):
|
||||
return f"{s[5].upper()}:\\{s[7:].replace('/', chr(92))}"
|
||||
return s
|
||||
|
||||
|
||||
def iter_active_facesets() -> list[Path]:
|
||||
out = []
|
||||
for d in sorted(ROOT.iterdir()):
|
||||
if d.is_dir() and not d.name.startswith("_"):
|
||||
out.append(d)
|
||||
return out
|
||||
|
||||
|
||||
def sha256_file(p: Path) -> str:
|
||||
h = hashlib.sha256()
|
||||
with open(p, "rb") as f:
|
||||
while True:
|
||||
b = f.read(1 << 20)
|
||||
if not b:
|
||||
break
|
||||
h.update(b)
|
||||
return h.hexdigest()
|
||||
|
||||
|
||||
def load_caches():
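# Returns (rec_index, alias_map):
#   rec_index: (canonical source path, bbox tuple) -> L2-normalized 512-d ArcFace embedding
#   alias_map: any known on-disk path -> the canonical path that was actually embedded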
|
||||
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
|
||||
alias_map: dict[str, str] = {}
|
||||
for c in CACHES:
|
||||
if not c.exists():
|
||||
continue
|
||||
d = np.load(c, allow_pickle=True)
|
||||
emb = d["embeddings"]
|
||||
meta = json.loads(str(d["meta"]))
|
||||
face_records = [m for m in meta if not m.get("noface")]
|
||||
if "path_aliases" in d.files:
|
||||
paliases = json.loads(str(d["path_aliases"]))
|
||||
for canon, alist in paliases.items():
|
||||
alias_map.setdefault(canon, canon)
|
||||
for a in alist:
|
||||
alias_map[a] = canon
|
||||
for i, rec in enumerate(face_records):
|
||||
p = rec["path"]
|
||||
bbox = tuple(int(x) for x in rec["bbox"])
|
||||
v = emb[i].astype(np.float32)
|
||||
n = float(np.linalg.norm(v))
|
||||
if n > 0:
|
||||
v = v / n
|
||||
rec_index[(p, bbox)] = v
|
||||
alias_map.setdefault(p, p)
|
||||
return rec_index, alias_map
|
||||
|
||||
|
||||
def lookup_emb(rec_index, alias_map, src: str, bbox):
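# Resolve the manifest's source path to the canonical cached path first; fall back to
# the raw path if the canonical lookup misses.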
|
||||
bbox_t = tuple(int(x) for x in bbox)
|
||||
canon = alias_map.get(src, src)
|
||||
v = rec_index.get((canon, bbox_t))
|
||||
if v is None and canon != src:
|
||||
v = rec_index.get((src, bbox_t))
|
||||
return v
|
||||
|
||||
|
||||
# ----------------------------- analyze -----------------------------
|
||||
|
||||
def cmd_analyze(args):
|
||||
rec_index, alias_map = load_caches()
|
||||
facesets = iter_active_facesets()
|
||||
print(f"[scan] {len(facesets)} active facesets", file=sys.stderr)
|
||||
|
||||
# Phase 1: walk every PNG, collect (faceset, file, src, bbox, quality, emb, sha256)
|
||||
all_pngs = [] # list of dicts
|
||||
t0 = time.time()
|
||||
for fs in facesets:
|
||||
manifest_path = fs / "manifest.json"
|
||||
if not manifest_path.exists():
|
||||
continue
|
||||
m = json.loads(manifest_path.read_text())
|
||||
for f in m.get("faces", []):
|
||||
png_rel = f.get("png")
|
||||
if not png_rel:
|
||||
continue
|
||||
disk_path = fs / png_rel
|
||||
if not disk_path.exists():
|
||||
continue
|
||||
all_pngs.append({
|
||||
"faceset": fs.name,
|
||||
"family": faceset_family(fs.name),
|
||||
"tier": faceset_tier(fs.name),
|
||||
"file": Path(png_rel).name,
|
||||
"rank": f.get("rank"),
|
||||
"source": f.get("source"),
|
||||
"bbox": f.get("bbox"),
|
||||
"quality": f.get("quality", {}).get("composite", 0),
|
||||
"disk_path": str(disk_path),
|
||||
})
|
||||
print(f"[scan] {len(all_pngs)} PNGs walked in {time.time()-t0:.1f}s", file=sys.stderr)
|
||||
|
||||
# Phase 2: SHA256 hash each PNG (parallel I/O)
|
||||
t0 = time.time()
|
||||
def _hash_one(idx):
|
||||
all_pngs[idx]["sha256"] = sha256_file(Path(all_pngs[idx]["disk_path"]))
|
||||
with ThreadPoolExecutor(max_workers=HASH_PARALLEL) as ex:
|
||||
# exhaust the iterator so the hashing actually runs (chunksize is ignored by ThreadPoolExecutor)
|
||||
for _ in ex.map(_hash_one, range(len(all_pngs)), chunksize=16):
|
||||
pass
|
||||
print(f"[hash] {len(all_pngs)} PNGs hashed in {time.time()-t0:.1f}s", file=sys.stderr)
|
||||
|
||||
# Phase 3: cross-family byte-dedup
|
||||
by_sha: dict[str, list[int]] = {}
|
||||
for i, p in enumerate(all_pngs):
|
||||
by_sha.setdefault(p["sha256"], []).append(i)
|
||||
|
||||
cross_family_groups = []
|
||||
byte_drops: set[int] = set() # indices of PNGs to drop
|
||||
for sha, idxs in by_sha.items():
|
||||
if len(idxs) < 2:
|
||||
continue
|
||||
families = {all_pngs[i]["family"] for i in idxs}
|
||||
if len(families) < 2:
|
||||
continue # all in same family — intentional era duplication
|
||||
# multiple families share this content → dedup keeping the best one
|
||||
cross_family_groups.append({"sha256": sha, "members": [
|
||||
{"faceset": all_pngs[i]["faceset"], "file": all_pngs[i]["file"],
|
||||
"tier": all_pngs[i]["tier"], "quality": all_pngs[i]["quality"],
|
||||
"rank": all_pngs[i]["rank"]} for i in idxs
|
||||
]})
|
||||
# keeper rule: lowest tier number, then highest quality
|
||||
best = sorted(idxs, key=lambda i: (all_pngs[i]["tier"], -all_pngs[i]["quality"]))[0]
|
||||
for i in idxs:
|
||||
# NEVER drop within-family copies (preserve era duplication intentionally)
|
||||
# We only drop indices whose family != best's family
|
||||
if i != best and all_pngs[i]["family"] != all_pngs[best]["family"]:
|
||||
byte_drops.add(i)
|
||||
print(f"[byte] {len(cross_family_groups)} cross-family hash groups; "
|
||||
f"{len(byte_drops)} PNGs marked for byte-dedup drop", file=sys.stderr)
|
||||
|
||||
# Phase 4: within-faceset near-dup (embedding sim >= threshold)
|
||||
by_faceset: dict[str, list[int]] = {}
|
||||
for i, p in enumerate(all_pngs):
|
||||
by_faceset.setdefault(p["faceset"], []).append(i)
|
||||
|
||||
near_dup_groups = []
|
||||
near_drops: set[int] = set()
|
||||
miss_emb_total = 0
|
||||
t0 = time.time()
|
||||
for fs_name, idxs in by_faceset.items():
|
||||
if len(idxs) < 2:
|
||||
continue
|
||||
# gather embeddings
|
||||
embs = []
|
||||
kept_idxs = []
|
||||
for i in idxs:
|
||||
v = lookup_emb(rec_index, alias_map, all_pngs[i]["source"], all_pngs[i]["bbox"])
|
||||
if v is None:
|
||||
miss_emb_total += 1
|
||||
continue
|
||||
embs.append(v)
|
||||
kept_idxs.append(i)
|
||||
if len(kept_idxs) < 2:
|
||||
continue
|
||||
M = np.stack(embs).astype(np.float32)
|
||||
sim = M @ M.T
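# rows were L2-normalized in load_caches, so this dot product is cosine similarity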
|
||||
np.fill_diagonal(sim, -1) # ignore self
|
||||
# find connected components in the (sim >= threshold) graph
|
||||
adj = {k: set() for k in range(len(kept_idxs))}
|
||||
for a in range(len(kept_idxs)):
|
||||
# only check a < b to avoid double work
|
||||
hi = np.where(sim[a, a+1:] >= NEAR_DUP_THRESHOLD)[0]
|
||||
for off in hi:
|
||||
b = a + 1 + int(off)
|
||||
adj[a].add(b)
|
||||
adj[b].add(a)
|
||||
seen = set()
|
||||
for k in adj:
|
||||
if k in seen or not adj[k]:
|
||||
continue
|
||||
stack = [k]
|
||||
comp = []
|
||||
while stack:
|
||||
x = stack.pop()
|
||||
if x in seen:
|
||||
continue
|
||||
seen.add(x)
|
||||
comp.append(x)
|
||||
for y in adj[x]:
|
||||
if y not in seen:
|
||||
stack.append(y)
|
||||
if len(comp) < 2:
|
||||
continue
|
||||
comp_idxs = [kept_idxs[c] for c in comp]
|
||||
# keeper: highest quality.composite, tie-break: lowest rank
|
||||
best = sorted(comp_idxs, key=lambda i: (-all_pngs[i]["quality"], all_pngs[i]["rank"] or 9999))[0]
|
||||
sims_in_group = []
|
||||
for ci in range(len(comp)):
|
||||
for cj in range(ci+1, len(comp)):
|
||||
sims_in_group.append(float(sim[comp[ci], comp[cj]]))
|
||||
near_dup_groups.append({
|
||||
"faceset": fs_name,
|
||||
"members": [{"file": all_pngs[i]["file"], "rank": all_pngs[i]["rank"],
|
||||
"quality": all_pngs[i]["quality"]} for i in comp_idxs],
|
||||
"keeper": all_pngs[best]["file"],
|
||||
"min_sim": min(sims_in_group) if sims_in_group else None,
|
||||
"max_sim": max(sims_in_group) if sims_in_group else None,
|
||||
})
|
||||
for i in comp_idxs:
|
||||
if i != best:
|
||||
near_drops.add(i)
|
||||
print(f"[near] {len(near_dup_groups)} near-dup groups; "
|
||||
f"{len(near_drops)} PNGs marked for near-dup drop "
|
||||
f"(miss_emb={miss_emb_total}); {time.time()-t0:.1f}s", file=sys.stderr)
|
||||
|
||||
# Combined drop set; for output, group by faceset
|
||||
all_drops = byte_drops | near_drops
|
||||
drops_by_faceset: dict[str, list] = {}
|
||||
for i in all_drops:
|
||||
p = all_pngs[i]
|
||||
reason = []
|
||||
if i in byte_drops: reason.append("byte_dup")
|
||||
if i in near_drops: reason.append("near_dup")
|
||||
drops_by_faceset.setdefault(p["faceset"], []).append({
|
||||
"file": p["file"], "rank": p["rank"], "reason": "+".join(reason),
|
||||
"sha256": p["sha256"], "quality": p["quality"],
|
||||
})
|
||||
|
||||
plan = {
|
||||
"thresholds": {"near_dup_sim": NEAR_DUP_THRESHOLD},
|
||||
"totals": {
|
||||
"active_facesets": len(facesets),
|
||||
"active_pngs": len(all_pngs),
|
||||
"byte_dup_groups": len(cross_family_groups),
|
||||
"byte_dup_drops": len(byte_drops),
|
||||
"near_dup_groups": len(near_dup_groups),
|
||||
"near_dup_drops": len(near_drops),
|
||||
"all_drops": len(all_drops),
|
||||
"facesets_affected": len(drops_by_faceset),
|
||||
},
|
||||
"byte_dup_groups": cross_family_groups,
|
||||
"near_dup_groups": near_dup_groups,
|
||||
"drops_by_faceset": drops_by_faceset,
|
||||
}
|
||||
op = Path(args.out)
|
||||
op.parent.mkdir(parents=True, exist_ok=True)
|
||||
op.write_text(json.dumps(plan, indent=2))
|
||||
print(f"[done] plan -> {op}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- apply -----------------------------
|
||||
|
||||
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
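# Local copy of the .fsz zipper (mirrors sort_faces.py:_zip_png_list) so this audit
# script can run standalone.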
|
||||
import zipfile
|
||||
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
|
||||
for i, p in enumerate(pngs):
|
||||
zf.write(p, arcname=f"{i:04d}.png")
|
||||
|
||||
|
||||
def _apply_drops_to_facesets(drops_by_faceset: dict[str, list], reason_label: str, master_path: Path):
|
||||
"""Move flagged PNGs to <faceset>/faces/_dropped/, rebuild manifests + .fsz.
|
||||
drops_by_faceset values are lists of {"file": str, ...}.
|
||||
Returns total moved + counts per faceset."""
|
||||
master = json.loads(master_path.read_text())
|
||||
by_name = {f["name"]: f for f in master.get("facesets", [])}
|
||||
total_moved = 0
|
||||
per_faceset_counts = {}
|
||||
|
||||
for fs_name, drops in drops_by_faceset.items():
|
||||
fs_dir = ROOT / fs_name
|
||||
if not fs_dir.is_dir():
|
||||
print(f"[warn] {fs_name}: dir missing, skip", file=sys.stderr)
|
||||
continue
|
||||
faces_dir = fs_dir / "faces"
|
||||
dropped_dir = faces_dir / "_dropped"
|
||||
dropped_dir.mkdir(exist_ok=True)
|
||||
drop_files = {d["file"] for d in drops}
|
||||
|
||||
moved_here = 0
|
||||
for fname in sorted(drop_files):
|
||||
src = faces_dir / fname
|
||||
if not src.exists():
|
||||
continue
|
||||
shutil.move(str(src), str(dropped_dir / fname))
|
||||
moved_here += 1
|
||||
|
||||
# rebuild manifest by filtering out dropped files
|
||||
manifest_path = fs_dir / "manifest.json"
|
||||
if manifest_path.exists():
|
||||
mm = json.loads(manifest_path.read_text())
|
||||
new_faces = [f for f in mm.get("faces", []) if Path(f.get("png", "")).name not in drop_files]
|
||||
mm["faces"] = new_faces
|
||||
mm["exported"] = len(new_faces)
|
||||
mm.setdefault(f"{reason_label}_history", []).append({"dropped": moved_here})
|
||||
|
||||
# re-zip
|
||||
survivor_pngs = sorted(faces_dir.glob("*.png"))
|
||||
top_n = mm.get("top_n", 30)
|
||||
top_n_eff = min(top_n, len(survivor_pngs))
|
||||
for old in fs_dir.glob("*.fsz"):
|
||||
old.unlink()
|
||||
top_fsz_name = f"{fs_name}_top{top_n_eff}.fsz"
|
||||
all_fsz_name = f"{fs_name}_all.fsz"
|
||||
if top_n_eff > 0:
|
||||
_zip_png_list(survivor_pngs[:top_n_eff], fs_dir / top_fsz_name)
|
||||
mm["fsz_top"] = top_fsz_name
|
||||
mm["top_n"] = top_n_eff
|
||||
else:
|
||||
mm["fsz_top"] = None
|
||||
mm["top_n"] = 0
|
||||
if len(survivor_pngs) > top_n_eff:
|
||||
_zip_png_list(survivor_pngs, fs_dir / all_fsz_name)
|
||||
mm["fsz_all"] = all_fsz_name
|
||||
else:
|
||||
mm["fsz_all"] = None
|
||||
manifest_path.write_text(json.dumps(mm, indent=2))
|
||||
|
||||
if fs_name in by_name:
|
||||
by_name[fs_name]["exported"] = len(new_faces)
|
||||
by_name[fs_name]["fsz_top"] = mm["fsz_top"]
|
||||
by_name[fs_name]["fsz_all"] = mm["fsz_all"]
|
||||
by_name[fs_name]["top_n"] = mm["top_n"]
|
||||
by_name[fs_name].setdefault(f"{reason_label}_dropped", 0)
|
||||
by_name[fs_name][f"{reason_label}_dropped"] += moved_here
|
||||
|
||||
total_moved += moved_here
|
||||
per_faceset_counts[fs_name] = moved_here
|
||||
|
||||
# rewrite master with same ordering
|
||||
new_facesets = [by_name.get(e["name"], e) for e in master.get("facesets", [])]
|
||||
master["facesets"] = new_facesets
|
||||
master.setdefault(f"{reason_label}_runs", []).append({
|
||||
"facesets_affected": len(per_faceset_counts),
|
||||
"pngs_moved": total_moved,
|
||||
})
|
||||
tmp = master_path.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(master, indent=2))
|
||||
tmp.replace(master_path)
|
||||
return total_moved, per_faceset_counts
|
||||
|
||||
|
||||
def cmd_apply(args):
|
||||
plan = json.loads(Path(args.plan).read_text())
|
||||
drops = plan["drops_by_faceset"]
|
||||
if args.dry_run:
|
||||
for fs, items in sorted(drops.items()):
|
||||
reasons = {}
|
||||
for it in items:
|
||||
reasons[it["reason"]] = reasons.get(it["reason"], 0) + 1
|
||||
print(f" {fs}: {len(items)} dropped ({reasons})")
|
||||
print(f"=== total: {sum(len(v) for v in drops.values())} PNGs across {len(drops)} facesets ===")
|
||||
return
|
||||
master_path = ROOT / "manifest.json"
|
||||
total, _ = _apply_drops_to_facesets(drops, "dedup", master_path)
|
||||
print(f"[done] {total} PNGs moved to faces/_dropped/ across {len(drops)} facesets", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- multiface staging + apply -----------------------------
|
||||
|
||||
def cmd_stage_multiface(args):
|
||||
"""Build queue.json of all currently-active PNGs in the corpus
|
||||
for the Windows DML multi-face audit worker."""
|
||||
queue = []
|
||||
for fs in iter_active_facesets():
|
||||
faces_dir = fs / "faces"
|
||||
if not faces_dir.is_dir():
|
||||
continue
|
||||
for p in sorted(faces_dir.glob("*.png")):
|
||||
queue.append({
|
||||
"wsl_path": str(p),
|
||||
"win_path": wsl_to_win(str(p)),
|
||||
"faceset": fs.name,
|
||||
"file": p.name,
|
||||
})
|
||||
op = Path(args.out)
|
||||
op.parent.mkdir(parents=True, exist_ok=True)
|
||||
op.write_text(json.dumps(queue, indent=2))
|
||||
print(f"[stage] {len(queue)} PNGs -> {op}", file=sys.stderr)
|
||||
|
||||
|
||||
def cmd_merge_multiface(args):
|
||||
"""Convert worker results.json into a drops_by_faceset plan."""
|
||||
src = json.loads(Path(args.results).read_text())
|
||||
drops_by_faceset: dict[str, list] = {}
|
||||
bad_count = 0
|
||||
for r in src.get("results", []):
|
||||
n_faces = r.get("face_count", -1)
|
||||
if n_faces == 1:
|
||||
continue
|
||||
bad_count += 1
|
||||
drops_by_faceset.setdefault(r["faceset"], []).append({
|
||||
"file": r["file"],
|
||||
"reason": f"multiface_{n_faces}",
|
||||
"face_count": n_faces,
|
||||
})
|
||||
plan = {
|
||||
"totals": {"bad_pngs": bad_count, "facesets_affected": len(drops_by_faceset),
|
||||
"scored": len(src.get("results", []))},
|
||||
"drops_by_faceset": drops_by_faceset,
|
||||
}
|
||||
op = Path(args.out)
|
||||
op.parent.mkdir(parents=True, exist_ok=True)
|
||||
op.write_text(json.dumps(plan, indent=2))
|
||||
print(f"[merge] {bad_count} bad PNGs across {len(drops_by_faceset)} facesets -> {op}", file=sys.stderr)
|
||||
|
||||
|
||||
def cmd_apply_multiface(args):
|
||||
plan = json.loads(Path(args.plan).read_text())
|
||||
drops = plan["drops_by_faceset"]
|
||||
if args.dry_run:
|
||||
for fs, items in sorted(drops.items()):
|
||||
print(f" {fs}: {len(items)} bad PNG(s)")
|
||||
print(f"=== total: {sum(len(v) for v in drops.values())} ===")
|
||||
return
|
||||
master_path = ROOT / "manifest.json"
|
||||
total, _ = _apply_drops_to_facesets(drops, "multiface", master_path)
|
||||
print(f"[done] {total} PNGs moved to faces/_dropped/ across {len(drops)} facesets", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- report -----------------------------
|
||||
|
||||
def cmd_report(args):
|
||||
out_dir = Path(args.out)
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
sections = []
|
||||
if args.dedup:
|
||||
d = json.loads(Path(args.dedup).read_text())
|
||||
t = d["totals"]
|
||||
sections.append("<h2>Dedup</h2>")
|
||||
sections.append(
|
||||
f"<ul>"
|
||||
f"<li>Active facesets: {t['active_facesets']}, active PNGs: {t['active_pngs']}</li>"
|
||||
f"<li>Cross-family byte-dup groups: {t['byte_dup_groups']} → {t['byte_dup_drops']} PNGs dropped</li>"
|
||||
f"<li>Within-faceset near-dup groups (sim≥{d['thresholds']['near_dup_sim']}): {t['near_dup_groups']} → {t['near_dup_drops']} PNGs dropped</li>"
|
||||
f"<li><b>Total dedup drops: {t['all_drops']}</b> across {t['facesets_affected']} facesets</li>"
|
||||
f"</ul>"
|
||||
)
|
||||
# top-N affected facesets
|
||||
rows = sorted(d["drops_by_faceset"].items(), key=lambda x: -len(x[1]))[:25]
|
||||
sections.append("<h3>Top 25 most-affected facesets</h3><table><tr><th>faceset</th><th>dropped</th><th>reasons</th></tr>")
|
||||
for fs, items in rows:
|
||||
r = {}
|
||||
for it in items:
|
||||
r[it["reason"]] = r.get(it["reason"], 0) + 1
|
||||
sections.append(f"<tr><td>{fs}</td><td>{len(items)}</td><td>{r}</td></tr>")
|
||||
sections.append("</table>")
|
||||
|
||||
if args.multiface:
|
||||
m = json.loads(Path(args.multiface).read_text())
|
||||
t = m["totals"]
|
||||
sections.append("<h2>Multi-face audit</h2>")
|
||||
sections.append(
|
||||
f"<ul>"
|
||||
f"<li>PNGs scored: {t['scored']}</li>"
|
||||
f"<li>Bad PNGs (0 or >1 face): {t['bad_pngs']} across {t['facesets_affected']} facesets</li>"
|
||||
f"</ul>"
|
||||
)
|
||||
|
||||
html = f"""<!doctype html>
|
||||
<html><head><meta charset='utf-8'><title>Dedup + multi-face audit</title>
|
||||
<style>
|
||||
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
|
||||
h1, h2, h3 {{ margin-top:1em; }}
|
||||
table {{ border-collapse: collapse; font-family: monospace; font-size: 12px; }}
|
||||
table td, table th {{ padding: 2px 8px; border: 1px solid #333; }}
|
||||
ul li {{ margin: 4px 0; }}
|
||||
</style></head>
|
||||
<body>
|
||||
<h1>facesets_swap_ready dedup + roop optimization audit</h1>
|
||||
{''.join(sections)}
|
||||
</body></html>"""
|
||||
out_html = out_dir / "index.html"
|
||||
out_html.write_text(html)
|
||||
print(f"[done] {out_html}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- main -----------------------------
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
sub = ap.add_subparsers(dest="cmd", required=True)
|
||||
|
||||
a = sub.add_parser("analyze")
|
||||
a.add_argument("--out", required=True)
|
||||
a.set_defaults(func=cmd_analyze)
|
||||
|
||||
p = sub.add_parser("apply")
|
||||
p.add_argument("--plan", required=True)
|
||||
p.add_argument("--dry-run", action="store_true")
|
||||
p.set_defaults(func=cmd_apply)
|
||||
|
||||
sm = sub.add_parser("stage_multiface")
|
||||
sm.add_argument("--out", required=True)
|
||||
sm.set_defaults(func=cmd_stage_multiface)
|
||||
|
||||
mm = sub.add_parser("merge_multiface")
|
||||
mm.add_argument("--results", required=True)
|
||||
mm.add_argument("--out", required=True)
|
||||
mm.set_defaults(func=cmd_merge_multiface)
|
||||
|
||||
am = sub.add_parser("apply_multiface")
|
||||
am.add_argument("--plan", required=True)
|
||||
am.add_argument("--dry-run", action="store_true")
|
||||
am.set_defaults(func=cmd_apply_multiface)
|
||||
|
||||
r = sub.add_parser("report")
|
||||
r.add_argument("--dedup", default=None)
|
||||
r.add_argument("--multiface", default=None)
|
||||
r.add_argument("--out", required=True)
|
||||
r.set_defaults(func=cmd_report)
|
||||
|
||||
args = ap.parse_args()
|
||||
args.func(args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
Executable
+244
@@ -0,0 +1,244 @@
|
||||
"""Windows / DirectML embed worker.
|
||||
|
||||
Reads a queue.json staged by /opt/face-sets/work/immich_stage.py (WSL side),
|
||||
runs InsightFace's FaceAnalysis on each image with the DmlExecutionProvider
|
||||
backed by the AMD Vega, and writes a cache file in the schema produced by
|
||||
sort_faces.py:cmd_embed (so it can be merged into nl_full.npz).
|
||||
|
||||
CLI:
|
||||
py -3.12 embed_worker.py <queue.json> <out_cache.npz> [--limit N]
|
||||
|
||||
The queue.json entry shape (each item) is:
|
||||
{
|
||||
"asset_id": "...",
|
||||
"sha256": "...",
|
||||
"wsl_path": "/mnt/x/src/immich/<user>/<rel>", # canonical path stored
|
||||
"win_path": "X:\\src\\immich\\<user>\\<rel>", # what we read from
|
||||
"size_bytes": int,
|
||||
"width": int, "height": int,
|
||||
...
|
||||
}
|
||||
|
||||
Per face record matches cmd_embed's schema:
|
||||
path, face_idx, det_score, bbox, face_short, face_area, blur, noface=False, hash
|
||||
plus landmark_2d_106, landmark_3d_68, pose (FaceAnalysis returns these for
|
||||
free and the existing cache already carries them after `enrich`).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from PIL import Image, ImageOps
|
||||
from insightface.app import FaceAnalysis
|
||||
|
||||
MODEL_ROOT = r"C:\face_embed_venv\models"
|
||||
MIN_DET_SCORE = 0.5
|
||||
MIN_FACE_PIX = 40
|
||||
FLUSH_EVERY = 50
|
||||
|
||||
|
||||
def load_rgb_bgr(path: Path):
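# PIL decodes to RGB; InsightFace expects OpenCV-style BGR, so return both views:
# RGB for cropping/blur scoring, BGR for detection.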
|
||||
try:
|
||||
with Image.open(path) as im:
|
||||
im = ImageOps.exif_transpose(im)
|
||||
im = im.convert("RGB")
|
||||
rgb = np.array(im)
|
||||
bgr = rgb[:, :, ::-1].copy()
|
||||
return rgb, bgr
|
||||
except Exception as e:
|
||||
print(f"[warn] failed to load {path}: {e}", file=sys.stderr)
|
||||
return None, None
|
||||
|
||||
|
||||
def laplacian_variance(gray: np.ndarray) -> float:
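# Blur proxy: variance of a 4-neighbour Laplacian (a pure-NumPy stand-in for the usual
# cv2.Laplacian(gray, cv2.CV_64F).var()); higher means sharper.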
|
||||
g = gray.astype(np.float32)
|
||||
lap = (
|
||||
-4.0 * g[1:-1, 1:-1]
|
||||
+ g[:-2, 1:-1] + g[2:, 1:-1]
|
||||
+ g[1:-1, :-2] + g[1:-1, 2:]
|
||||
)
|
||||
return float(lap.var())
|
||||
|
||||
|
||||
def save_cache(out_path: Path, emb_chunks: list, meta: list, processed: set, src_root: str):
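# Atomic checkpoint: write to <out>.tmp.npz, then os.replace() it over the real cache,
# so a crash mid-save never leaves a truncated .npz behind.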
|
||||
emb = np.concatenate(emb_chunks) if emb_chunks else np.zeros((0, 512), dtype=np.float32)
|
||||
tmp = out_path.with_suffix(".tmp.npz")
|
||||
np.savez(
|
||||
str(tmp),
|
||||
embeddings=emb,
|
||||
meta=json.dumps(meta),
|
||||
src_root=str(src_root),
|
||||
processed_paths=json.dumps(sorted(processed)),
|
||||
path_aliases=json.dumps({}),
|
||||
schema="v2",
|
||||
)
|
||||
os.replace(tmp, out_path)
|
||||
|
||||
|
||||
def load_cache_if_exists(out_path: Path):
|
||||
"""Resume helper. Returns (emb_chunks, meta, processed_set)."""
|
||||
if not out_path.exists():
|
||||
return [], [], set()
|
||||
data = np.load(out_path, allow_pickle=True)
|
||||
emb = data["embeddings"]
|
||||
meta = json.loads(str(data["meta"]))
|
||||
processed = set(json.loads(str(data["processed_paths"])))
|
||||
return [emb] if len(emb) else [], list(meta), processed
|
||||
|
||||
|
||||
def main():
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("queue", type=Path)
|
||||
p.add_argument("out", type=Path)
|
||||
p.add_argument("--limit", type=int, default=None)
|
||||
args = p.parse_args()
|
||||
|
||||
queue = json.loads(args.queue.read_text())
|
||||
print(f"queue: {len(queue)} entries from {args.queue}")
|
||||
|
||||
args.out.parent.mkdir(parents=True, exist_ok=True)
|
||||
emb_chunks, meta, processed = load_cache_if_exists(args.out)
|
||||
n_existing_records = len(meta)
|
||||
n_existing_emb = sum(e.shape[0] for e in emb_chunks)
|
||||
if n_existing_records:
|
||||
print(f"resume: {n_existing_records} existing meta records "
|
||||
f"({n_existing_emb} embeddings, {len(processed)} processed paths)")
|
||||
|
||||
print("initializing FaceAnalysis with DmlExecutionProvider")
|
||||
app = FaceAnalysis(
|
||||
name="buffalo_l",
|
||||
root=MODEL_ROOT,
|
||||
providers=["DmlExecutionProvider", "CPUExecutionProvider"],
|
||||
)
|
||||
app.prepare(ctx_id=0, det_size=(640, 640))
|
||||
|
||||
src_root = "/mnt/x/src/immich"
|
||||
|
||||
n_done = 0
|
||||
n_face_records_added = 0
|
||||
n_noface_added = 0
|
||||
n_skipped = 0
|
||||
n_load_err = 0
|
||||
t0 = time.perf_counter()
|
||||
last_flush = time.perf_counter()
|
||||
new_emb_chunks: list[np.ndarray] = []
|
||||
new_meta: list[dict] = []
|
||||
|
||||
def flush():
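# Fold pending records into the persistent lists and rewrite the cache; with
# FLUSH_EVERY=50 (or the 30-second timer) a crash loses at most that much recent work.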
|
||||
nonlocal new_emb_chunks, new_meta, last_flush
|
||||
if not new_emb_chunks and not new_meta:
|
||||
return
|
||||
if new_emb_chunks:
|
||||
emb_chunks.append(np.concatenate(new_emb_chunks))
|
||||
new_emb_chunks = []
|
||||
for r in new_meta:
|
||||
meta.append(r)
|
||||
new_meta = []
|
||||
save_cache(args.out, emb_chunks, meta, processed, src_root)
|
||||
last_flush = time.perf_counter()
|
||||
|
||||
for i, entry in enumerate(queue):
|
||||
if args.limit is not None and n_done >= args.limit:
|
||||
break
|
||||
wsl_path = entry["wsl_path"]
|
||||
win_path = entry["win_path"]
|
||||
sha = entry["sha256"]
|
||||
|
||||
if wsl_path in processed:
|
||||
n_skipped += 1
|
||||
continue
|
||||
|
||||
rgb, bgr = load_rgb_bgr(Path(win_path))
|
||||
if bgr is None:
|
||||
new_meta.append({
|
||||
"path": wsl_path, "face_idx": -1, "noface": True,
|
||||
"hash": sha, "error": "load",
|
||||
})
|
||||
processed.add(wsl_path)
|
||||
n_load_err += 1
|
||||
n_done += 1
|
||||
continue
|
||||
|
||||
faces = app.get(bgr)
|
||||
kept_any = False
|
||||
for j, f in enumerate(faces):
|
||||
if float(f.det_score) < MIN_DET_SCORE:
|
||||
continue
|
||||
x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
|
||||
x1 = max(x1, 0); y1 = max(y1, 0)
|
||||
x2 = min(x2, rgb.shape[1]); y2 = min(y2, rgb.shape[0])
|
||||
w, h = x2 - x1, y2 - y1
|
||||
short = min(w, h)
|
||||
if short < MIN_FACE_PIX:
|
||||
continue
|
||||
crop = rgb[y1:y2, x1:x2]
|
||||
if crop.size == 0:
|
||||
continue
|
||||
gray = crop.mean(axis=2)
|
||||
blur = laplacian_variance(gray) if min(gray.shape) > 3 else 0.0
|
||||
|
||||
emb = f.normed_embedding.astype(np.float32)
|
||||
new_emb_chunks.append(emb[None, :])
|
||||
rec = {
|
||||
"path": wsl_path,
|
||||
"face_idx": j,
|
||||
"det_score": float(f.det_score),
|
||||
"bbox": [x1, y1, x2, y2],
|
||||
"face_short": int(short),
|
||||
"face_area": int(w * h),
|
||||
"blur": blur,
|
||||
"noface": False,
|
||||
"hash": sha,
|
||||
}
|
||||
# Enrichment-equivalent fields (FaceAnalysis returns these for free)
|
||||
if hasattr(f, "landmark_2d_106") and f.landmark_2d_106 is not None:
|
||||
rec["landmark_2d_106"] = f.landmark_2d_106.astype(np.float32).tolist()
|
||||
if hasattr(f, "landmark_3d_68") and f.landmark_3d_68 is not None:
|
||||
rec["landmark_3d_68"] = f.landmark_3d_68.astype(np.float32).tolist()
|
||||
if hasattr(f, "pose") and f.pose is not None:
|
||||
rec["pose"] = [float(x) for x in f.pose]
|
||||
new_meta.append(rec)
|
||||
kept_any = True
|
||||
n_face_records_added += 1
|
||||
if not kept_any:
|
||||
new_meta.append({
|
||||
"path": wsl_path, "face_idx": -1, "noface": True, "hash": sha,
|
||||
})
|
||||
n_noface_added += 1
|
||||
|
||||
processed.add(wsl_path)
|
||||
n_done += 1
|
||||
|
||||
if (n_done % FLUSH_EVERY == 0) or (time.perf_counter() - last_flush) > 30.0:
|
||||
flush()
|
||||
elapsed = time.perf_counter() - t0
|
||||
rate = n_done / max(0.1, elapsed)
|
||||
print(
|
||||
f"[embed] done={n_done:5d}/{len(queue)} faces+={n_face_records_added:5d} "
|
||||
f"noface+={n_noface_added:4d} skipped={n_skipped:4d} "
|
||||
f"load_err={n_load_err:3d} rate={rate:.1f} img/s "
|
||||
f"({elapsed:.1f}s elapsed)"
|
||||
)
|
||||
|
||||
flush()
|
||||
elapsed = time.perf_counter() - t0
|
||||
print()
|
||||
print("=== embed done ===")
|
||||
print(f" done: {n_done}")
|
||||
print(f" new face records: {n_face_records_added}")
|
||||
print(f" new noface records: {n_noface_added}")
|
||||
print(f" skipped (already done): {n_skipped}")
|
||||
print(f" load errors: {n_load_err}")
|
||||
print(f" elapsed: {elapsed:.1f}s ({n_done/max(0.1,elapsed):.1f} img/s)")
|
||||
print(f" cache: {args.out}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,574 @@
|
||||
"""CLIP zero-shot scoring for masks + sunglasses across facesets_swap_ready/.
|
||||
|
||||
Usage:
|
||||
# score one or more specific facesets (test mode)
|
||||
python work/filter_occlusions.py score --facesets faceset_001,faceset_050 \
|
||||
--out work/test_batch_occlusion/scores.json
|
||||
|
||||
# score everything (full corpus)
|
||||
python work/filter_occlusions.py score --out work/occlusion_scores.json
|
||||
|
||||
# render HTML contact sheet from a scores.json
|
||||
python work/filter_occlusions.py report --scores work/test_batch_occlusion/scores.json \
|
||||
--out work/test_batch_occlusion
|
||||
|
||||
Notes:
|
||||
- score and report never modify facesets_swap_ready/; only the apply subcommand does
(prune / quarantine / re-zip), and it supports --dry-run.
|
||||
- Model: open_clip ViT-L-14 / dfn2b_s39b (best public zero-shot at this size).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Iterable
|
||||
|
||||
import torch
|
||||
from PIL import Image
|
||||
import open_clip
|
||||
|
||||
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
|
||||
WIN_ROOT = r"E:\temp_things\fcswp\nl_sorted\facesets_swap_ready"
|
||||
|
||||
MODEL_NAME = "ViT-L-14"
|
||||
PRETRAINED = "dfn2b_s39b"
|
||||
|
||||
|
||||
def wsl_to_win(wsl_path: str) -> str:
|
||||
"""Translate a /mnt/e/... wsl path to E:\\... for the Windows worker."""
|
||||
s = str(wsl_path)
|
||||
if s.startswith("/mnt/"):
|
||||
drive = s[5]
|
||||
rest = s[7:].replace("/", "\\")
|
||||
return f"{drive.upper()}:\\{rest}"
|
||||
return s
|
||||
|
||||
# Prompt ensembles. Each pair (positive, negative) becomes one binary classifier.
|
||||
# We average text embeddings within each list, then softmax across the two means.
|
||||
PROMPTS = {
|
||||
"mask": {
|
||||
"pos": [
|
||||
"a photo of a person wearing a surgical face mask",
|
||||
"a photo of a person wearing an FFP2 respirator covering mouth and nose",
|
||||
"a photo of a person wearing a cloth face mask",
|
||||
"a face partially covered by a medical mask",
|
||||
"a person whose mouth and nose are hidden by a face mask",
|
||||
],
|
||||
"neg": [
|
||||
"a photo of a person's face with mouth and nose clearly visible",
|
||||
"a clear, unobstructed photo of a face",
|
||||
"a photo of a face without any mask or covering",
|
||||
"a portrait of a person showing their full face",
|
||||
"a photo of a person with a beard and visible mouth", # avoid beard false positives
|
||||
],
|
||||
},
|
||||
"sunglasses": {
|
||||
# We want to flag ONLY images where sunglasses occlude the eyes.
|
||||
# False positives to defeat: sunglasses pushed up on the head/forehead, hanging on a shirt collar.
|
||||
"pos": [
|
||||
"a face with dark sunglasses covering the eyes",
|
||||
"a portrait with the eyes hidden behind opaque sunglasses",
|
||||
"a person wearing dark sunglasses over their eyes, eyes not visible",
|
||||
"a face where the eyes are completely concealed by tinted lenses",
|
||||
"a close-up portrait wearing aviator sunglasses on the eyes",
|
||||
],
|
||||
"neg": [
|
||||
"a portrait with both eyes clearly visible and uncovered",
|
||||
"a face with sunglasses pushed up on the forehead, eyes visible below",
|
||||
"a face with sunglasses resting on top of the head, eyes visible",
|
||||
"a person with sunglasses hanging from their shirt, eyes visible",
|
||||
"a face wearing clear prescription eyeglasses with visible eyes",
|
||||
"a portrait with no eyewear and visible eyes",
|
||||
],
|
||||
},
|
||||
}
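# Worked example with assumed numbers: logit_scale for this model is typically ~100, so
# an image with cosine similarity 0.28 to the "pos" mean and 0.24 to the "neg" mean
# scores softmax([28.0, 24.0])[0] = 1 / (1 + exp(-4)) ~= 0.98, a confident positive;
# a gap of only 0.01 in similarity would instead give ~= 0.73.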
|
||||
|
||||
|
||||
def load_model(device: str = "cpu"):
|
||||
print(f"[clip] loading {MODEL_NAME} / {PRETRAINED} on {device} ...", file=sys.stderr)
|
||||
t0 = time.time()
|
||||
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_NAME, pretrained=PRETRAINED)
|
||||
tokenizer = open_clip.get_tokenizer(MODEL_NAME)
|
||||
model = model.to(device).eval()
|
||||
logit_scale = float(model.logit_scale.exp().detach().cpu())
|
||||
print(f"[clip] ready in {time.time()-t0:.1f}s, logit_scale={logit_scale:.2f}", file=sys.stderr)
|
||||
return model, preprocess, tokenizer, logit_scale
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def build_text_features(model, tokenizer, device: str):
|
||||
"""Return dict {attr: (pos_mean_emb, neg_mean_emb)} on device, both L2-normalized."""
|
||||
out = {}
|
||||
for attr, sides in PROMPTS.items():
|
||||
feats = {}
|
||||
for side in ("pos", "neg"):
|
||||
tokens = tokenizer(sides[side]).to(device)
|
||||
f = model.encode_text(tokens)
|
||||
f = f / f.norm(dim=-1, keepdim=True)
|
||||
mean = f.mean(dim=0)
|
||||
feats[side] = mean / mean.norm()
|
||||
out[attr] = (feats["pos"], feats["neg"])
|
||||
return out
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def score_images(model, preprocess, text_feats, logit_scale: float, paths: list[Path], device: str, batch: int = 16):
|
||||
"""Yield (path, {attr: pos_prob}) per image. logit_scale is CLIP's learned temperature (~100)."""
|
||||
for i in range(0, len(paths), batch):
|
||||
chunk = paths[i:i + batch]
|
||||
imgs = []
|
||||
keep = []
|
||||
for p in chunk:
|
||||
try:
|
||||
img = Image.open(p).convert("RGB")
|
||||
imgs.append(preprocess(img))
|
||||
keep.append(p)
|
||||
except Exception as e:
|
||||
print(f"[skip] {p}: {e}", file=sys.stderr)
|
||||
if not imgs:
|
||||
continue
|
||||
x = torch.stack(imgs).to(device)
|
||||
feats = model.encode_image(x)
|
||||
feats = feats / feats.norm(dim=-1, keepdim=True) # (B, D)
|
||||
results = {}
|
||||
for attr, (pos, neg) in text_feats.items():
|
||||
sims = torch.stack([feats @ pos, feats @ neg], dim=1) * logit_scale # (B, 2)
|
||||
probs = sims.softmax(dim=1)[:, 0].tolist() # P(pos)
|
||||
results[attr] = probs
|
||||
for j, p in enumerate(keep):
|
||||
yield p, {attr: results[attr][j] for attr in text_feats}
|
||||
|
||||
|
||||
def iter_facesets(root: Path, only: list[str] | None) -> Iterable[Path]:
|
||||
if only:
|
||||
for name in only:
|
||||
d = root / name
|
||||
if d.is_dir():
|
||||
yield d
|
||||
else:
|
||||
print(f"[warn] not a directory: {d}", file=sys.stderr)
|
||||
return
|
||||
for d in sorted(root.iterdir()):
|
||||
if d.is_dir() and not d.name.startswith("_"):
|
||||
yield d
|
||||
|
||||
|
||||
def cmd_score(args):
|
||||
device = "cuda" if torch.cuda.is_available() else "cpu"
|
||||
model, preprocess, tokenizer, logit_scale = load_model(device)
|
||||
text_feats = build_text_features(model, tokenizer, device)
|
||||
|
||||
only = [s.strip() for s in args.facesets.split(",")] if args.facesets else None
|
||||
facesets = list(iter_facesets(ROOT, only))
|
||||
# --sample-per-faceset takes the first N PNGs of each faceset; applied inside the loop below
|
||||
|
||||
report = {
|
||||
"model": f"{MODEL_NAME}/{PRETRAINED}",
|
||||
"root": str(ROOT),
|
||||
"prompts": PROMPTS,
|
||||
"facesets": {},
|
||||
}
|
||||
total_imgs = 0
|
||||
t0 = time.time()
|
||||
for fs in facesets:
|
||||
faces = sorted((fs / "faces").glob("*.png")) if (fs / "faces").is_dir() else sorted(fs.glob("*.png"))
|
||||
if args.sample_per_faceset:
|
||||
faces = faces[: args.sample_per_faceset]
|
||||
if not faces:
|
||||
continue
|
||||
print(f"[scan] {fs.name}: {len(faces)} png", file=sys.stderr)
|
||||
per_image = []
|
||||
for p, scores in score_images(model, preprocess, text_feats, logit_scale, faces, device):
|
||||
per_image.append({"file": p.name, "mask": round(scores["mask"], 4), "sunglasses": round(scores["sunglasses"], 4)})
|
||||
total_imgs += 1
|
||||
report["facesets"][fs.name] = per_image
|
||||
|
||||
out = Path(args.out)
|
||||
out.parent.mkdir(parents=True, exist_ok=True)
|
||||
out.write_text(json.dumps(report, indent=2))
|
||||
dt = time.time() - t0
|
||||
print(f"[done] {total_imgs} images, {dt:.1f}s ({total_imgs/max(dt,1e-3):.2f} img/s) -> {out}", file=sys.stderr)
|
||||
|
||||
|
||||
def cmd_report(args):
|
||||
"""Render an HTML contact sheet from scores.json. Generates JPG thumbs."""
|
||||
scores = json.loads(Path(args.scores).read_text())
|
||||
out_dir = Path(args.out)
|
||||
thumbs_dir = out_dir / "thumbs"
|
||||
thumbs_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
THUMB = 160
|
||||
rows_html = []
|
||||
|
||||
def thumb_path(faceset: str, fname: str) -> Path:
|
||||
d = thumbs_dir / faceset
|
||||
d.mkdir(parents=True, exist_ok=True)
|
||||
return d / (Path(fname).stem + ".jpg")
|
||||
|
||||
def make_thumb(src: Path, dst: Path):
|
||||
if dst.exists():
|
||||
return
|
||||
try:
|
||||
img = Image.open(src).convert("RGB")
|
||||
img.thumbnail((THUMB, THUMB), Image.LANCZOS)
|
||||
img.save(dst, "JPEG", quality=82)
|
||||
except Exception as e:
|
||||
print(f"[thumb-skip] {src}: {e}", file=sys.stderr)
|
||||
|
||||
facesets = scores["facesets"]
|
||||
for faceset, items in facesets.items():
|
||||
# sort: high score first so borderline cases group at the boundary
|
||||
items_sorted = sorted(items, key=lambda x: max(x["mask"], x["sunglasses"]), reverse=True)
|
||||
# faceset summary
|
||||
n = len(items)
|
||||
n_mask = sum(1 for x in items if x["mask"] >= 0.7)
|
||||
n_sg = sum(1 for x in items if x["sunglasses"] >= 0.7)
|
||||
pct_mask = (100 * n_mask / n) if n else 0
|
||||
pct_sg = (100 * n_sg / n) if n else 0
|
||||
rows_html.append(f"<h2 id='{faceset}'>{faceset} <small>({n} imgs · mask≥0.7: {n_mask} ({pct_mask:.0f}%) · sunglasses≥0.7: {n_sg} ({pct_sg:.0f}%))</small></h2>")
|
||||
rows_html.append("<div class='grid'>")
|
||||
src_root = ROOT / faceset
|
||||
faces_root = (src_root / "faces") if (src_root / "faces").is_dir() else src_root
|
||||
for it in items_sorted:
|
||||
src = faces_root / it["file"]
|
||||
dst = thumb_path(faceset, it["file"])
|
||||
make_thumb(src, dst)
|
||||
rel = f"thumbs/{faceset}/{Path(it['file']).stem}.jpg"
|
||||
m, s = it["mask"], it["sunglasses"]
|
||||
cls_m = "hi" if m >= 0.7 else ("mid" if m >= 0.4 else "lo")
|
||||
cls_s = "hi" if s >= 0.7 else ("mid" if s >= 0.4 else "lo")
|
||||
rows_html.append(
|
||||
f"<div class='cell'>"
|
||||
f"<img src='{rel}' loading='lazy' title='{it['file']}'>"
|
||||
f"<div class='scores'><span class='{cls_m}'>M {m:.2f}</span> <span class='{cls_s}'>S {s:.2f}</span></div>"
|
||||
f"</div>"
|
||||
)
|
||||
rows_html.append("</div>")
|
||||
|
||||
nav = " · ".join(f"<a href='#{f}'>{f}</a>" for f in facesets)
|
||||
|
||||
html = f"""<!doctype html>
|
||||
<html><head><meta charset='utf-8'><title>Occlusion test batch</title>
|
||||
<style>
|
||||
body {{ font-family: system-ui, sans-serif; background: #111; color: #eee; padding: 1em; }}
|
||||
h1 {{ margin-top: 0; }}
|
||||
h2 {{ margin-top: 1.5em; border-bottom: 1px solid #333; padding-bottom: .25em; }}
|
||||
small {{ color: #999; font-weight: normal; }}
|
||||
.grid {{ display: grid; grid-template-columns: repeat(auto-fill, minmax(170px, 1fr)); gap: .5em; }}
|
||||
.cell {{ background: #1c1c1c; padding: 4px; border-radius: 4px; text-align: center; }}
|
||||
.cell img {{ max-width: 100%; height: auto; display: block; margin: 0 auto; }}
|
||||
.scores {{ font-family: monospace; font-size: 11px; padding-top: 4px; }}
|
||||
.hi {{ color: #ff5050; font-weight: bold; }}
|
||||
.mid {{ color: #ffb050; }}
|
||||
.lo {{ color: #5fa05f; }}
|
||||
.nav {{ position: sticky; top: 0; background: #111; padding: .5em 0; border-bottom: 1px solid #333; }}
|
||||
a {{ color: #6cf; }}
|
||||
</style></head>
|
||||
<body>
|
||||
<h1>Occlusion scores — {scores['model']}</h1>
|
||||
<p>Sorted within each faceset by max(mask, sunglasses) descending.
|
||||
Color: <span class='hi'>≥0.70</span> · <span class='mid'>0.40–0.70</span> · <span class='lo'><0.40</span></p>
|
||||
<div class='nav'>{nav}</div>
|
||||
{''.join(rows_html)}
|
||||
</body></html>"""
|
||||
|
||||
out_html = out_dir / "index.html"
|
||||
out_html.write_text(html)
|
||||
print(f"[done] {out_html}", file=sys.stderr)
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
|
||||
"""Mirror of sort_faces.py:_zip_png_list. Renames PNGs to 0000.png, 0001.png, ..."""
|
||||
import zipfile
|
||||
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
|
||||
for i, p in enumerate(pngs):
|
||||
zf.write(p, arcname=f"{i:04d}.png")
|
||||
|
||||
|
||||
def cmd_apply(args):
|
||||
"""Prune mask/sunglasses PNGs, quarantine occlusion-dominated facesets,
|
||||
re-zip .fsz, update top-level manifest. --dry-run prints the plan only."""
|
||||
import shutil
|
||||
|
||||
threshold = args.threshold
|
||||
domain_pct = args.domain_pct
|
||||
min_survivors = args.min_survivors
|
||||
top_n_target = args.top_n
|
||||
|
||||
scores = json.loads(Path(args.scores).read_text())
|
||||
master_path = ROOT / "manifest.json"
|
||||
master = json.loads(master_path.read_text())
|
||||
by_name = {f["name"]: f for f in master.get("facesets", [])}
|
||||
|
||||
masked_dir = ROOT / "_masked"
|
||||
thin_dir = ROOT / "_thin"
|
||||
|
||||
plan = []
|
||||
for faceset, items in scores["facesets"].items():
|
||||
if faceset not in by_name:
|
||||
print(f"[warn] {faceset} not in master manifest — skipping", file=sys.stderr)
|
||||
continue
|
||||
n = len(items)
|
||||
flagged_files = sorted(
|
||||
it["file"] for it in items
|
||||
if it["mask"] >= threshold or it["sunglasses"] >= threshold
|
||||
)
|
||||
flagged_set = set(flagged_files)
survivors_items = [it for it in items if it["file"] not in flagged_set]
|
||||
# preserve quality order from filename (0001.png is highest-rank)
|
||||
survivors_files = sorted(it["file"] for it in survivors_items)
|
||||
|
||||
n_mask = sum(1 for it in items if it["mask"] >= threshold)
|
||||
n_sg = sum(1 for it in items if it["sunglasses"] >= threshold)
|
||||
pct_mask = n_mask / n if n else 0
|
||||
pct_sg = n_sg / n if n else 0
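# Decision ladder, first match wins: a faceset dominated by masks or sunglasses is
# quarantined whole; one that pruning would shrink below min_survivors is quarantined
# as thin; otherwise flagged PNGs are pruned in place and clean facesets are kept.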
|
||||
|
||||
if pct_mask >= domain_pct:
|
||||
action, reason = "quarantine_masked", f"mask_pct={pct_mask:.0%}"
|
||||
elif pct_sg >= domain_pct:
|
||||
action, reason = "quarantine_masked", f"sunglasses_pct={pct_sg:.0%}"
|
||||
elif flagged_files and len(survivors_files) < min_survivors:
|
||||
# only quarantine-as-thin if pruning is the cause of the drop below threshold;
|
||||
# pre-existing small facesets without occlusions are left alone
|
||||
action, reason = "quarantine_thin", f"survivors={len(survivors_files)}<{min_survivors}"
|
||||
elif flagged_files:
|
||||
action, reason = "prune", f"drop {len(flagged_files)}"
|
||||
else:
|
||||
action, reason = "keep", "clean"
|
||||
|
||||
plan.append({
|
||||
"faceset": faceset, "action": action, "reason": reason,
|
||||
"n": n, "n_mask": n_mask, "n_sg": n_sg,
|
||||
"n_dropped": len(flagged_files), "n_survivors": len(survivors_files),
|
||||
"dropped_files": flagged_files,
|
||||
})
|
||||
|
||||
# Summary
|
||||
counts = {a: 0 for a in ("keep", "prune", "quarantine_masked", "quarantine_thin")}
|
||||
for p in plan:
|
||||
counts[p["action"]] += 1
|
||||
total_dropped_pngs = sum(p["n_dropped"] for p in plan if p["action"] == "prune")
|
||||
total_quarantined_pngs = sum(p["n"] for p in plan if p["action"].startswith("quarantine"))
|
||||
print(f"=== plan summary (threshold={threshold} domain_pct={domain_pct} min_survivors={min_survivors}) ===")
|
||||
for a, c in counts.items():
|
||||
print(f" {a}: {c}")
|
||||
print(f" PNGs to drop (prune): {total_dropped_pngs}")
|
||||
print(f" PNGs to quarantine (whole): {total_quarantined_pngs}")
|
||||
print(f" facesets in master: {len(master['facesets'])}")
|
||||
print(f" facesets scored: {len(plan)}")
|
||||
|
||||
# Write plan for audit
|
||||
plan_path = Path(args.out_plan)
|
||||
plan_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
plan_path.write_text(json.dumps({
|
||||
"thresholds": {"image": threshold, "domain_pct": domain_pct, "min_survivors": min_survivors},
|
||||
"counts": counts,
|
||||
"totals": {"dropped_pngs": total_dropped_pngs, "quarantined_pngs": total_quarantined_pngs},
|
||||
"plan": plan,
|
||||
}, indent=2))
|
||||
print(f" plan written to {plan_path}")
|
||||
|
||||
if args.dry_run:
|
||||
# pretty list of quarantines
|
||||
for p in plan:
|
||||
if p["action"].startswith("quarantine"):
|
||||
print(f" [{p['action']:>18s}] {p['faceset']} ({p['reason']}, n={p['n']})")
|
||||
return
|
||||
|
||||
# ----- destructive section -----
|
||||
masked_dir.mkdir(parents=True, exist_ok=True)
|
||||
thin_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
new_facesets = []
|
||||
new_masked = list(master.get("masked", [])) # preserve any prior runs
|
||||
new_thin = list(master.get("thin_eras", []))
|
||||
|
||||
# build a name -> existing-thin/masked entry index, to update relpath if we re-quarantine
|
||||
by_name_thin = {e["name"]: e for e in new_thin}
|
||||
by_name_masked = {e["name"]: e for e in new_masked}
|
||||
|
||||
for p in plan:
|
||||
entry = dict(by_name[p["faceset"]]) # copy
|
||||
fs_dir = ROOT / p["faceset"]
|
||||
faces_dir = fs_dir / "faces"
|
||||
|
||||
if p["action"] == "keep":
|
||||
new_facesets.append(entry)
|
||||
continue
|
||||
|
||||
# prune dropped PNGs first (applies to both prune and quarantine_thin paths)
|
||||
if p["dropped_files"]:
|
||||
dropped_holding = faces_dir / "_dropped"
|
||||
dropped_holding.mkdir(exist_ok=True)
|
||||
for fname in p["dropped_files"]:
|
||||
src = faces_dir / fname
|
||||
if src.exists():
|
||||
shutil.move(str(src), str(dropped_holding / fname))
|
||||
|
||||
if p["action"].startswith("quarantine"):
|
||||
target_root = masked_dir if p["action"] == "quarantine_masked" else thin_dir
|
||||
target = target_root / p["faceset"]
|
||||
if target.exists():
|
||||
# idempotency: if a previous run already moved it, skip move
|
||||
pass
|
||||
else:
|
||||
shutil.move(str(fs_dir), str(target))
|
||||
entry["occlusion_filter"] = {
|
||||
"action": p["action"], "reason": p["reason"],
|
||||
"n_input": p["n"], "n_mask": p["n_mask"], "n_sg": p["n_sg"],
|
||||
"n_dropped": p["n_dropped"], "n_survivors": p["n_survivors"],
|
||||
"threshold": threshold, "domain_pct": domain_pct,
|
||||
}
|
||||
entry["relpath"] = f"{'_masked' if p['action']=='quarantine_masked' else '_thin'}/{p['faceset']}"
|
||||
entry["fsz_top"] = None
|
||||
entry["fsz_all"] = None
|
||||
if p["action"] == "quarantine_masked":
|
||||
entry["masked"] = True
|
||||
new_masked.append(entry)
|
||||
else:
|
||||
entry["thin"] = True
|
||||
new_thin.append(entry)
|
||||
continue
|
||||
|
||||
# action == prune
|
||||
survivor_pngs = sorted(faces_dir.glob("*.png"))
|
||||
if not survivor_pngs:
|
||||
print(f"[warn] {p['faceset']}: no survivor PNGs after prune", file=sys.stderr)
|
||||
new_facesets.append(entry)
|
||||
continue
|
||||
|
||||
# re-zip .fsz from survivors in quality order
|
||||
top_n_eff = min(top_n_target, len(survivor_pngs))
|
||||
top_fsz = fs_dir / f"{p['faceset']}_top{top_n_eff}.fsz"
|
||||
all_fsz = fs_dir / f"{p['faceset']}_all.fsz"
|
||||
# remove old .fsz files (they may have different top_n in name)
|
||||
for old in fs_dir.glob("*.fsz"):
|
||||
old.unlink()
|
||||
_zip_png_list(survivor_pngs[:top_n_eff], top_fsz)
|
||||
if len(survivor_pngs) > top_n_eff:
|
||||
_zip_png_list(survivor_pngs, all_fsz)
|
||||
entry["fsz_all"] = all_fsz.name
|
||||
else:
|
||||
entry["fsz_all"] = None
|
||||
entry["fsz_top"] = top_fsz.name
|
||||
entry["top_n"] = top_n_eff
|
||||
entry["exported"] = len(survivor_pngs)
|
||||
entry["dropped_occlusion"] = p["n_dropped"]
|
||||
entry["occlusion_filter"] = {
|
||||
"action": "prune", "n_input": p["n"], "n_mask": p["n_mask"],
|
||||
"n_sg": p["n_sg"], "n_dropped": p["n_dropped"], "n_survivors": p["n_survivors"],
|
||||
"threshold": threshold,
|
||||
}
|
||||
new_facesets.append(entry)
|
||||
|
||||
# write updated master manifest
|
||||
new_master = dict(master)
|
||||
new_master["facesets"] = new_facesets
|
||||
new_master["masked"] = new_masked
|
||||
new_master["thin_eras"] = new_thin
|
||||
new_master["occlusion_filter_run"] = {
|
||||
"model": scores.get("model"),
|
||||
"threshold": threshold,
|
||||
"domain_pct": domain_pct,
|
||||
"min_survivors": min_survivors,
|
||||
"counts": counts,
|
||||
"totals": {"dropped_pngs": total_dropped_pngs, "quarantined_pngs": total_quarantined_pngs},
|
||||
}
|
||||
tmp = master_path.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(new_master, indent=2))
|
||||
tmp.replace(master_path)
|
||||
print(f"[done] master manifest updated: {len(new_facesets)} active, "
|
||||
f"{len(new_masked)} masked, {len(new_thin)} thin")
|
||||
|
||||
|
||||
def cmd_stage(args):
|
||||
"""Walk facesets_swap_ready/ and write a queue.json for the Windows clip_worker."""
|
||||
only = [s.strip() for s in args.facesets.split(",")] if args.facesets else None
|
||||
queue = []
|
||||
for fs in iter_facesets(ROOT, only):
|
||||
faces = sorted((fs / "faces").glob("*.png")) if (fs / "faces").is_dir() else sorted(fs.glob("*.png"))
|
||||
for p in faces:
|
||||
queue.append({
|
||||
"wsl_path": str(p),
|
||||
"win_path": wsl_to_win(str(p)),
|
||||
"faceset": fs.name,
|
||||
"file": p.name,
|
||||
})
|
||||
out = Path(args.out)
|
||||
out.parent.mkdir(parents=True, exist_ok=True)
|
||||
out.write_text(json.dumps(queue, indent=2))
|
||||
print(f"[stage] {len(queue)} png paths -> {out}", file=sys.stderr)
|
||||
print(f"[stage] win queue file: {wsl_to_win(str(out))}", file=sys.stderr)
|
||||
|
||||
def cmd_merge(args):
|
||||
"""Ingest worker scores.json into the per-faceset shape that `report` reads."""
|
||||
src = json.loads(Path(args.scores).read_text())
|
||||
by_faceset: dict[str, list] = {}
|
||||
for r in src.get("results", []):
|
||||
by_faceset.setdefault(r["faceset"], []).append({
|
||||
"file": r["file"],
|
||||
"mask": r["mask"],
|
||||
"sunglasses": r["sunglasses"],
|
||||
})
|
||||
# stable ordering: faceset by name, files by name
|
||||
out_data = {
|
||||
"model": src.get("model", f"{MODEL_NAME}/{PRETRAINED}"),
|
||||
"root": str(ROOT),
|
||||
"prompts": src.get("prompts", PROMPTS),
|
||||
"facesets": {fs: sorted(items, key=lambda x: x["file"]) for fs, items in sorted(by_faceset.items())},
|
||||
}
|
||||
out = Path(args.out)
|
||||
out.parent.mkdir(parents=True, exist_ok=True)
|
||||
out.write_text(json.dumps(out_data, indent=2))
|
||||
total = sum(len(v) for v in by_faceset.values())
|
||||
print(f"[merge] {total} scores across {len(by_faceset)} facesets -> {out}", file=sys.stderr)
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
sub = ap.add_subparsers(dest="cmd", required=True)
|
||||
|
||||
s = sub.add_parser("score", help="WSL CPU scoring (slow but no GPU dependency)")
|
||||
s.add_argument("--facesets", default=None, help="comma-separated faceset names; default = all")
|
||||
s.add_argument("--sample-per-faceset", type=int, default=0, help="cap PNGs per faceset (0 = all)")
|
||||
s.add_argument("--out", required=True)
|
||||
s.set_defaults(func=cmd_score)
|
||||
|
||||
st = sub.add_parser("stage", help="Build queue.json for Windows clip_worker.py")
|
||||
st.add_argument("--facesets", default=None, help="comma-separated faceset names; default = all")
|
||||
st.add_argument("--out", required=True)
|
||||
st.set_defaults(func=cmd_stage)
|
||||
|
||||
m = sub.add_parser("merge", help="Convert worker scores.json into per-faceset report format")
|
||||
m.add_argument("--scores", required=True, help="worker output (flat list of results)")
|
||||
m.add_argument("--out", required=True, help="output path for per-faceset format")
|
||||
m.set_defaults(func=cmd_merge)
|
||||
|
||||
r = sub.add_parser("report", help="Render HTML contact sheet from a per-faceset scores.json")
|
||||
r.add_argument("--scores", required=True)
|
||||
r.add_argument("--out", required=True)
|
||||
r.set_defaults(func=cmd_report)
|
||||
|
||||
a = sub.add_parser("apply", help="Prune flagged PNGs, quarantine dominated facesets, re-zip .fsz, update manifest")
|
||||
a.add_argument("--scores", required=True, help="per-faceset scores.json (output of `merge` or `score`)")
|
||||
a.add_argument("--out-plan", required=True, help="path to write the apply plan json (audit)")
|
||||
a.add_argument("--threshold", type=float, default=0.7, help="image-level drop threshold for mask/sunglasses (default 0.7)")
|
||||
a.add_argument("--domain-pct", type=float, default=0.40, help="faceset-level quarantine threshold (default 0.40)")
|
||||
a.add_argument("--min-survivors", type=int, default=5, help="quarantine to _thin if survivors below this (default 5)")
|
||||
a.add_argument("--top-n", type=int, default=30, help="top-N for re-zipped _topN.fsz (default 30)")
|
||||
a.add_argument("--dry-run", action="store_true", help="print plan only, no filesystem changes")
|
||||
a.set_defaults(func=cmd_apply)
|
||||
|
||||
args = ap.parse_args()
|
||||
args.func(args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
Executable
+50
@@ -0,0 +1,50 @@
|
||||
#!/usr/bin/env bash
|
||||
# Finalize an Immich user's stage:
|
||||
# 1. Copy queue.json to /mnt/c so the Windows embed worker can read it
|
||||
# 2. Run the embed worker on Windows (DML)
|
||||
# 3. Copy the resulting cache back to /opt/face-sets/work/cache/
|
||||
# 4. Run cluster_immich.py to discover + emit new facesets
|
||||
#
|
||||
# Usage: ./work/finalize_immich.sh <user-label>
|
||||
set -euo pipefail
|
||||
|
||||
USER_LABEL="${1:?usage: $0 <user-label>}"
|
||||
|
||||
REPO="$(cd "$(dirname "$0")/.." && pwd)"
|
||||
WSL_QUEUE="$REPO/work/immich/$USER_LABEL/queue.json"
|
||||
WIN_QUEUE_DIR="/mnt/c/face_embed_venv/work/immich/$USER_LABEL"
|
||||
WIN_QUEUE="$WIN_QUEUE_DIR/queue.json"
|
||||
WIN_QUEUE_FOR_PS="C:\\face_embed_venv\\work\\immich\\$USER_LABEL\\queue.json"
|
||||
|
||||
WIN_CACHE_DIR="/mnt/c/face_embed_venv/work/cache"
|
||||
WIN_CACHE="$WIN_CACHE_DIR/immich_${USER_LABEL}.npz"
|
||||
WIN_CACHE_FOR_PS="C:\\face_embed_venv\\work\\cache\\immich_${USER_LABEL}.npz"
|
||||
WSL_CACHE="$REPO/work/cache/immich_${USER_LABEL}.npz"
|
||||
|
||||
LOG="$REPO/work/logs/immich_finalize_${USER_LABEL}.log"
|
||||
|
||||
[ -f "$WSL_QUEUE" ] || { echo "missing queue: $WSL_QUEUE" >&2; exit 1; }
|
||||
|
||||
echo "=== finalize: $USER_LABEL ===" | tee -a "$LOG"
|
||||
date | tee -a "$LOG"
|
||||
|
||||
mkdir -p "$WIN_QUEUE_DIR" "$WIN_CACHE_DIR" "$REPO/work/cache"
|
||||
|
||||
echo "[1/4] copying queue: $WSL_QUEUE -> $WIN_QUEUE" | tee -a "$LOG"
|
||||
cp "$WSL_QUEUE" "$WIN_QUEUE"
|
||||
echo " $(wc -c < "$WIN_QUEUE") bytes; $(/home/peter/face_sort_env/bin/python3 -c "import json,sys; print(len(json.load(open('$WIN_QUEUE'))))") entries"
|
||||
|
||||
echo "[2/4] running Windows DML embed worker" | tee -a "$LOG"
|
||||
powershell.exe -NoProfile -Command "C:\\face_embed_venv\\Scripts\\python.exe C:\\face_embed_venv\\bench\\embed_worker.py '$WIN_QUEUE_FOR_PS' '$WIN_CACHE_FOR_PS'" 2>&1 | tee -a "$LOG"
|
||||
|
||||
[ -f "$WIN_CACHE" ] || { echo "embed produced no cache file at $WIN_CACHE" | tee -a "$LOG"; exit 1; }
|
||||
|
||||
echo "[3/4] copying cache back: $WIN_CACHE -> $WSL_CACHE" | tee -a "$LOG"
|
||||
cp "$WIN_CACHE" "$WSL_CACHE"
|
||||
echo " $(/home/peter/face_sort_env/bin/python3 -c "import sys,json; sys.path.insert(0,'$REPO'); from sort_faces import load_cache; e,m,_,_,_=load_cache('$WSL_CACHE'); print(f'{len(e)} embeddings, {sum(1 for x in m if x.get(\"noface\"))} noface, {sum(1 for x in m if not x.get(\"noface\"))} faces')")"
|
||||
|
||||
echo "[4/4] running cluster_immich.py" | tee -a "$LOG"
|
||||
/home/peter/face_sort_env/bin/python3 "$REPO/work/cluster_immich.py" "$WSL_CACHE" 2>&1 | tee -a "$LOG"
|
||||
|
||||
echo "=== finalize done: $USER_LABEL ===" | tee -a "$LOG"
|
||||
date | tee -a "$LOG"
|
||||
@@ -0,0 +1,447 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Stage Immich assets for embedding (WSL side of the split workflow).
|
||||
|
||||
For one Immich user:
|
||||
1. Page through `/search/metadata` listing every IMAGE asset the user owns.
|
||||
2. For each asset, fetch `/faces?id=` and decide if any detected face has a
|
||||
scaled short side >= MIN_FACE_SHORT on the original. Skip assets that
|
||||
don't.
|
||||
3. Download the original. Compute sha256.
|
||||
4. Dedup against (a) the existing canonical cache `nl_full.npz` and
|
||||
(b) sha256s already staged in this run / earlier runs. If duplicate,
|
||||
do NOT save to disk; record the alias.
|
||||
5. Save survivors to /mnt/x/src/immich/<user>/<rel> mirroring the structure
|
||||
after Immich's `/upload/library/<owner>/` prefix.
|
||||
6. Write a queue file with WSL + Windows paths so the Windows DML embed
|
||||
worker can find them.
|
||||
7. Persist staging state continuously so the run is resumable.
|
||||
|
||||
Output artifacts:
|
||||
work/immich/<user>/queue.json - what the Windows worker should embed
|
||||
work/immich/<user>/state.json - resume state
|
||||
work/immich/<user>/aliases.json - asset_id -> existing canonical path
|
||||
when sha256 matched something already
|
||||
in nl_full.npz
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
import urllib.error
|
||||
import urllib.request
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
|
||||
REPO = Path(__file__).resolve().parent.parent
|
||||
sys.path.insert(0, str(REPO))
|
||||
|
||||
from sort_faces import load_cache # noqa: E402
|
||||
|
||||
# ---- config -------------------------------------------------------------- #
|
||||
|
||||
API = os.environ.get("IMMICH_URL", "").rstrip("/") + "/api" if os.environ.get("IMMICH_URL") else None
|
||||
KEY = os.environ.get("IMMICH_API_KEY")
|
||||
if not API or not KEY:
|
||||
raise SystemExit(
|
||||
"set IMMICH_URL and IMMICH_API_KEY env vars before running, e.g.\n"
|
||||
" export IMMICH_URL=https://fotos.example.org\n"
|
||||
" export IMMICH_API_KEY=... # admin API key"
|
||||
)
|
||||
HEADERS = {"x-api-key": KEY, "Accept": "application/json"}
|
||||
|
||||
# Short-label -> Immich userId. The user is responsible for filling this in for
|
||||
# their own Immich instance. NOTE: as of Immich v2.7.2, /search/metadata's
|
||||
# `userIds` filter is silently ignored when the API key is bound to a different
|
||||
# user, so changing this label/UUID does not actually change which assets the
|
||||
# API returns; we keep it here for naming output dirs and as future-proofing.
|
||||
USERS_FILE = REPO / "work" / "immich" / "users.json"
|
||||
USERS: dict[str, str] = {}
|
||||
if USERS_FILE.exists():
|
||||
USERS = json.loads(USERS_FILE.read_text())
|
||||
|
||||
CACHE_PATH = REPO / "work" / "cache" / "nl_full.npz" # for sha256 dedup
|
||||
STAGE_DIR = REPO / "work" / "immich"
|
||||
DEST_ROOT = Path("/mnt/x/src/immich")
|
||||
WIN_DEST_ROOT = "X:\\src\\immich" # equivalent on the Windows side
|
||||
|
||||
PAGE_SIZE = 1000
|
||||
MIN_FACE_SHORT = 90 # match refine's gate
|
||||
MIN_DET_SCORE = 0.5 # weaker than refine's 0.6, since Immich's score scale differs
|
||||
HTTP_TIMEOUT = 60 # seconds, conservative for big originals
|
||||
HTTP_RETRIES = 3
|
||||
HTTP_BACKOFF = 2.0
|
||||
|
||||
# Circuit breaker: if this many consecutive workers fail with network errors,
|
||||
# probe Immich; if probe also fails, exit cleanly with code 2 so the orchestrator
|
||||
# can pause until the user says resume. State is preserved (resume-safe).
|
||||
OUTAGE_FAIL_STREAK = 12
|
||||
OUTAGE_PROBE_TIMEOUT = 8
|
||||
|
||||
# ---- helpers ------------------------------------------------------------- #
|
||||
|
||||
def http_get(url: str, accept_bytes: bool = False) -> bytes | dict:
|
||||
"""GET with retries. Returns parsed JSON unless accept_bytes is True."""
|
||||
last_err = None
|
||||
for attempt in range(HTTP_RETRIES):
|
||||
try:
|
||||
req = urllib.request.Request(url, headers=HEADERS)
|
||||
with urllib.request.urlopen(req, timeout=HTTP_TIMEOUT) as resp:
|
||||
data = resp.read()
|
||||
return data if accept_bytes else json.loads(data)
|
||||
except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError) as e:
|
||||
last_err = e
|
||||
if attempt + 1 < HTTP_RETRIES:
|
||||
time.sleep(HTTP_BACKOFF * (attempt + 1))
|
||||
raise RuntimeError(f"GET {url} failed after {HTTP_RETRIES} attempts: {last_err}")
|
||||
|
||||
|
||||
def probe_immich() -> bool:
|
||||
"""Quick connectivity probe (no retry). Used by the circuit breaker."""
|
||||
try:
|
||||
req = urllib.request.Request(f"{API}/server/version", headers=HEADERS)
|
||||
urllib.request.urlopen(req, timeout=OUTAGE_PROBE_TIMEOUT).read()
|
||||
return True
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
|
||||
def http_post(url: str, payload: dict) -> dict:
|
||||
last_err = None
|
||||
body = json.dumps(payload).encode("utf-8")
|
||||
for attempt in range(HTTP_RETRIES):
|
||||
try:
|
||||
req = urllib.request.Request(
|
||||
url, data=body, headers={**HEADERS, "Content-Type": "application/json"}
|
||||
)
|
||||
with urllib.request.urlopen(req, timeout=HTTP_TIMEOUT) as resp:
|
||||
return json.loads(resp.read())
|
||||
except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError) as e:
|
||||
last_err = e
|
||||
if attempt + 1 < HTTP_RETRIES:
|
||||
time.sleep(HTTP_BACKOFF * (attempt + 1))
|
||||
raise RuntimeError(f"POST {url} failed after {HTTP_RETRIES} attempts: {last_err}")
|
||||
|
||||
|
||||
def sha256_bytes(b: bytes) -> str:
|
||||
return hashlib.sha256(b).hexdigest()
|
||||
|
||||
|
||||
def derive_relpath(original_path: str) -> str:
|
||||
"""Return a relative subpath rooted at the user dir, mirroring Immich.
|
||||
|
||||
/usr/src/app/upload/library/admin/2026/2026-02-18/foo.jpg
|
||||
-> 2026/2026-02-18/foo.jpg
|
||||
Anything that doesn't match the expected prefix falls back to the basename
|
||||
only.
|
||||
"""
|
||||
marker = "/upload/library/"
|
||||
i = original_path.find(marker)
|
||||
if i < 0:
|
||||
return Path(original_path).name
|
||||
rest = original_path[i + len(marker):]
|
||||
parts = rest.split("/", 1)
|
||||
return parts[1] if len(parts) == 2 else parts[0]
|
||||
|
||||
|
||||
def wsl_to_win(p: Path) -> str:
|
||||
"""Convert /mnt/x/.. -> X:\\.. for the embed worker that runs on Windows."""
|
||||
s = str(p)
|
||||
if s.startswith("/mnt/"):
|
||||
drive = s[5]
|
||||
rest = s[6:].lstrip("/")
|
||||
return f"{drive.upper()}:\\{rest.replace('/', chr(92))}"
|
||||
if s.startswith("/opt/face-sets/"):
|
||||
# /opt/face-sets/work/... is on the WSL ext4 filesystem; reachable from
|
||||
# Windows as \\wsl$\Ubuntu\opt\face-sets\... (slower than C:). For our
|
||||
# use we keep all stage outputs under /mnt/x or /mnt/c so this branch
|
||||
# should not be hit, but fall back rather than fail.
|
||||
return f"\\\\wsl$\\Ubuntu\\opt\\face-sets\\{s[len('/opt/face-sets/'):].replace('/', chr(92))}"
|
||||
return s
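# Quick sanity check of the mapping this helper performs (the path and user
# label below are hypothetical, not taken from the repo):
#   >>> wsl_to_win(Path("/mnt/x/src/immich/alice/2026/img.jpg"))
#   'X:\\src\\immich\\alice\\2026\\img.jpg'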
|
||||
|
||||
def keep_asset(asset: dict, faces: list) -> tuple[bool, list[dict]]:
|
||||
"""Return (keep, eligible_face_records). A face is 'eligible' iff its
|
||||
scaled-to-original short side >= MIN_FACE_SHORT and source-type is
|
||||
machine-learning."""
|
||||
W, H = asset.get("width"), asset.get("height")
|
||||
if not W or not H:
|
||||
return False, []
|
||||
eligible = []
|
||||
for f in faces:
|
||||
if f.get("sourceType") and f["sourceType"] != "machine-learning":
|
||||
continue
|
||||
iw = f.get("imageWidth") or W
|
||||
ih = f.get("imageHeight") or H
|
||||
sx = (W / iw) if iw else 1.0
|
||||
sy = (H / ih) if ih else 1.0
|
||||
bw = (f["boundingBoxX2"] - f["boundingBoxX1"]) * sx
|
||||
bh = (f["boundingBoxY2"] - f["boundingBoxY1"]) * sy
|
||||
if min(bw, bh) >= MIN_FACE_SHORT:
|
||||
eligible.append({
|
||||
"id": f["id"],
|
||||
"x1": int(round(f["boundingBoxX1"] * sx)),
|
||||
"y1": int(round(f["boundingBoxY1"] * sy)),
|
||||
"x2": int(round(f["boundingBoxX2"] * sx)),
|
||||
"y2": int(round(f["boundingBoxY2"] * sy)),
|
||||
"person": (f.get("person") or {}).get("name") or None,
|
||||
})
|
||||
return (len(eligible) > 0), eligible
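# Worked example of the scaling above (numbers are illustrative, not from any
# real asset): original W x H = 4000 x 3000, Immich detection ran at
# imageWidth x imageHeight = 1000 x 750, so sx = sy = 4.0. A face box of
# 30 x 40 px at detection scale becomes 120 x 160 px on the original;
# min(120, 160) = 120 >= MIN_FACE_SHORT (90), so the asset is kept.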
|
||||
|
||||
# ---- main staging loop --------------------------------------------------- #
|
||||
|
||||
def list_assets(user_id: str):
|
||||
"""Yield every IMAGE asset owned by user_id, paginated."""
|
||||
page = 1
|
||||
while True:
|
||||
resp = http_post(f"{API}/search/metadata", {
|
||||
"size": PAGE_SIZE,
|
||||
"type": "IMAGE",
|
||||
"page": page,
|
||||
"userIds": [user_id],
|
||||
})
|
||||
items = resp["assets"]["items"]
|
||||
if not items:
|
||||
return
|
||||
for a in items:
|
||||
yield a
|
||||
nxt = resp["assets"].get("nextPage")
|
||||
if not nxt:
|
||||
return
|
||||
page = int(nxt)
|
||||
|
||||
|
||||
def stage(user_label: str, limit: int | None, workers: int) -> None:
|
||||
user_id = USERS.get(user_label, user_label)  # fall back to the raw label when users.json has no mapping (matches the free-text --user path)
|
||||
user_dir = STAGE_DIR / user_label
|
||||
user_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
state_path = user_dir / "state.json"
|
||||
queue_path = user_dir / "queue.json"
|
||||
aliases_path = user_dir / "aliases.json"
|
||||
|
||||
# ---- load existing state for resume ---- #
|
||||
state = {
|
||||
"started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
|
||||
"user_label": user_label,
|
||||
"user_id": user_id,
|
||||
"seen_asset_ids": [],
|
||||
"staged_count": 0,
|
||||
"deduped_against_existing": 0,
|
||||
"deduped_against_staged": 0,
|
||||
"skipped_no_big_face": 0,
|
||||
"skipped_no_faces": 0,
|
||||
"skipped_download_error": 0,
|
||||
"total_assets_seen": 0,
|
||||
}
|
||||
queue: list[dict] = []
|
||||
aliases: dict[str, dict] = {} # asset_id -> {sha, canonical_path}
|
||||
staged_hashes: set[str] = set()
|
||||
if state_path.exists():
|
||||
prior = json.loads(state_path.read_text())
|
||||
state.update(prior)
|
||||
state["resumed_at"] = time.strftime("%Y-%m-%dT%H:%M:%S")
|
||||
if queue_path.exists():
|
||||
queue = json.loads(queue_path.read_text())
|
||||
staged_hashes = {q["sha256"] for q in queue}
|
||||
if aliases_path.exists():
|
||||
aliases = json.loads(aliases_path.read_text())
|
||||
print(f"[resume] {len(state['seen_asset_ids'])} asset_ids already seen, "
|
||||
f"{len(queue)} in queue, {len(aliases)} aliased to existing cache")
|
||||
seen = set(state["seen_asset_ids"])
|
||||
|
||||
# ---- startup connectivity probe ---- #
|
||||
if not probe_immich():
|
||||
print(f"[init] Immich probe failed at {API}/server/version -- exiting code 2")
|
||||
sys.exit(2)
|
||||
print("[init] Immich reachable")
|
||||
|
||||
# ---- load existing canonical cache hashes (sha256) ---- #
|
||||
print(f"[init] loading existing cache hashes from {CACHE_PATH}")
|
||||
_emb, meta, _src, _proc, _aliases = load_cache(CACHE_PATH)
|
||||
canonical_by_hash: dict[str, str] = {}
|
||||
for m in meta:
|
||||
h = m.get("hash")
|
||||
if h:
|
||||
canonical_by_hash.setdefault(h, m["path"])
|
||||
print(f"[init] {len(canonical_by_hash)} unique sha256s in nl_full.npz")
|
||||
|
||||
# ---- iterate assets ---- #
|
||||
# Each worker does the entire I/O chain for an asset: /faces -> filter ->
|
||||
# /original. That way 8 workers translate to ~8x parallelism end-to-end.
|
||||
# Main thread does sha256, dedup decisions, and writes (which are CPU/SMB
|
||||
# bound but cheap relative to two HTTPS round-trips per asset).
|
||||
# Worker result tuple:
|
||||
# (asset, faces|None, blob|None, eligible|None, error|None)
|
||||
def _fetch_for_asset(asset: dict):
|
||||
if asset.get("type") != "IMAGE":
|
||||
return asset, None, None, None, "not_image"
|
||||
aid = asset["id"]
|
||||
if aid in seen:
|
||||
return asset, None, None, None, "already_seen"
|
||||
try:
|
||||
faces = http_get(f"{API}/faces?id={aid}")
|
||||
except Exception as e:
|
||||
return asset, None, None, None, f"faces_error:{e}"
|
||||
if not faces:
|
||||
return asset, [], None, [], "no_faces"
|
||||
keep, eligible = keep_asset(asset, faces)
|
||||
if not keep:
|
||||
return asset, faces, None, eligible, "no_big_face"
|
||||
try:
|
||||
blob = http_get(f"{API}/assets/{aid}/original", accept_bytes=True)
|
||||
except Exception as e:
|
||||
return asset, faces, None, eligible, f"download_error:{e}"
|
||||
return asset, faces, blob, eligible, None
|
||||
|
||||
n = 0
|
||||
err_streak = 0
|
||||
last_flush = time.time()
|
||||
t0 = time.time()
|
||||
pool = ThreadPoolExecutor(max_workers=workers)
|
||||
try:
|
||||
for asset, faces, blob, eligible, err in pool.map(_fetch_for_asset, list_assets(user_id)):
|
||||
if asset.get("type") != "IMAGE":
|
||||
continue
|
||||
n += 1
|
||||
state["total_assets_seen"] = n
|
||||
if limit is not None and n > limit:
|
||||
print(f"[stop] hit --limit {limit}")
|
||||
break
|
||||
aid = asset["id"]
|
||||
|
||||
# Already-seen / non-image: silently skip.
|
||||
if err == "already_seen":
|
||||
continue
|
||||
|
||||
# Transient: count, but DON'T mark as seen so resume retries.
|
||||
if err and (err.startswith("faces_error") or err.startswith("download_error")):
|
||||
kind = err.split(":", 1)[0]
|
||||
detail = err.split(":", 1)[1][:160] if ":" in err else err
|
||||
print(f"[err] {kind} {aid}: {detail}")
|
||||
state["skipped_download_error"] += 1
|
||||
err_streak += 1
|
||||
# Circuit breaker: long streak -> probe; if down, save and exit.
|
||||
if err_streak >= OUTAGE_FAIL_STREAK:
|
||||
print(f"[breaker] {err_streak} consecutive errors; probing Immich...")
|
||||
if probe_immich():
|
||||
print("[breaker] probe ok, treating as transient; continuing")
|
||||
err_streak = 0
|
||||
else:
|
||||
print("[breaker] probe FAILED -- pausing run; resume with same command")
|
||||
queue_path.write_text(json.dumps(queue, indent=2))
|
||||
state_path.write_text(json.dumps(state, indent=2))
|
||||
aliases_path.write_text(json.dumps(aliases, indent=2))
|
||||
sys.exit(2)
|
||||
continue
|
||||
else:
|
||||
err_streak = 0
|
||||
|
||||
# Permanent classifications -> seen.
|
||||
if err == "no_faces":
|
||||
state["skipped_no_faces"] += 1
|
||||
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
|
||||
continue
|
||||
if err == "no_big_face":
|
||||
state["skipped_no_big_face"] += 1
|
||||
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
|
||||
continue
|
||||
|
||||
# Have faces + blob -> dedup + save.
|
||||
h = sha256_bytes(blob)
|
||||
if h in canonical_by_hash:
|
||||
aliases[aid] = {"sha256": h, "canonical": canonical_by_hash[h]}
|
||||
state["deduped_against_existing"] += 1
|
||||
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
|
||||
continue
|
||||
if h in staged_hashes:
|
||||
state["deduped_against_staged"] += 1
|
||||
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
|
||||
continue
|
||||
|
||||
rel = derive_relpath(asset.get("originalPath", asset.get("originalFileName", aid)))
|
||||
wsl_path = DEST_ROOT / user_label / rel
|
||||
wsl_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
wsl_path.write_bytes(blob)
|
||||
staged_hashes.add(h)
|
||||
|
||||
queue.append({
|
||||
"asset_id": aid,
|
||||
"sha256": h,
|
||||
"wsl_path": str(wsl_path),
|
||||
"win_path": wsl_to_win(wsl_path),
|
||||
"size_bytes": len(blob),
|
||||
"width": asset.get("width"),
|
||||
"height": asset.get("height"),
|
||||
"originalPath": asset.get("originalPath"),
|
||||
"originalFileName": asset.get("originalFileName"),
|
||||
"localDateTime": asset.get("localDateTime"),
|
||||
"immich_eligible_faces": eligible,
|
||||
})
|
||||
state["staged_count"] += 1
|
||||
seen.add(aid)
|
||||
state["seen_asset_ids"] = sorted(seen)
|
||||
|
||||
if time.time() - last_flush > 5.0 or len(queue) % 25 == 0:
|
||||
queue_path.write_text(json.dumps(queue, indent=2))
|
||||
state_path.write_text(json.dumps(state, indent=2))
|
||||
aliases_path.write_text(json.dumps(aliases, indent=2))
|
||||
last_flush = time.time()
|
||||
elapsed = time.time() - t0
|
||||
rate = state["total_assets_seen"] / max(0.1, elapsed)
|
||||
print(f"[stage] seen={state['total_assets_seen']:6d} "
|
||||
f"staged={state['staged_count']:5d} "
|
||||
f"dedup-existing={state['deduped_against_existing']:5d} "
|
||||
f"dedup-staged={state['deduped_against_staged']:5d} "
|
||||
f"no-big-face={state['skipped_no_big_face']:6d} "
|
||||
f"no-faces={state['skipped_no_faces']:6d} "
|
||||
f"errs={state['skipped_download_error']:3d} "
|
||||
f"({rate:.1f} assets/s)")
|
||||
finally:
|
||||
pool.shutdown(wait=False, cancel_futures=True)
|
||||
|
||||
# final flush
|
||||
queue_path.write_text(json.dumps(queue, indent=2))
|
||||
state_path.write_text(json.dumps(state, indent=2))
|
||||
aliases_path.write_text(json.dumps(aliases, indent=2))
|
||||
print()
|
||||
print(f"=== final state for user {user_label} ===")
|
||||
for k in [
|
||||
"total_assets_seen", "staged_count", "deduped_against_existing",
|
||||
"deduped_against_staged", "skipped_no_big_face", "skipped_no_faces",
|
||||
"skipped_download_error",
|
||||
]:
|
||||
print(f" {k}: {state[k]}")
|
||||
total_bytes = sum(q["size_bytes"] for q in queue)
|
||||
print(f" staged bytes: {total_bytes/1e9:.2f} GB across {len(queue)} files")
|
||||
print(f" queue: {queue_path}")
|
||||
print(f" state: {state_path}")
|
||||
print(f" aliases: {aliases_path}")
|
||||
|
||||
|
||||
# ---- cli ----------------------------------------------------------------- #
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser()
|
||||
if not USERS:
|
||||
p.add_argument("--user", required=True,
|
||||
help=f"label for output dir (USERS map empty; populate {USERS_FILE} to constrain)")
|
||||
else:
|
||||
p.add_argument("--user", choices=list(USERS.keys()), required=True)
|
||||
p.add_argument("--limit", type=int, default=None,
|
||||
help="stop after seeing N assets total (for testing)")
|
||||
p.add_argument("--workers", type=int, default=8,
|
||||
help="concurrent /faces fetches (default 8)")
|
||||
args = p.parse_args()
|
||||
stage(args.user, args.limit, args.workers)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,144 @@
|
||||
"""Windows / DirectML multi-face audit worker.
|
||||
|
||||
For every PNG in queue.json, run insightface FaceAnalysis and record how many
|
||||
faces were detected (filtering by det_score>=MIN_DET and face_short>=MIN_PIX).
|
||||
Surfaces the load-bearing roop invariant: each .fsz PNG must hold exactly one
|
||||
face, otherwise the loader's `extract_face_images` appends every detected face
|
||||
into the FaceSet and pollutes the averaged identity embedding.
|
||||
|
||||
CLI:
|
||||
py -3.12 multiface_worker.py <queue.json> <out_results.json> [--limit N]
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from PIL import Image, ImageOps
|
||||
from insightface.app import FaceAnalysis
|
||||
|
||||
MODEL_ROOT = r"C:\face_embed_venv\models"
|
||||
MIN_DET = 0.5
|
||||
MIN_FACE_PIX = 40
|
||||
FLUSH_EVERY = 200
|
||||
|
||||
|
||||
def load_existing(out_path: Path):
|
||||
if not out_path.exists():
|
||||
return None, set()
|
||||
try:
|
||||
d = json.loads(out_path.read_text())
|
||||
processed = set(d.get("processed", []))
|
||||
return d, processed
|
||||
except Exception as e:
|
||||
print(f"[warn] could not parse {out_path}: {e}; starting fresh", file=sys.stderr)
|
||||
return None, set()
|
||||
|
||||
|
||||
def save_atomic(out_path: Path, data: dict):
|
||||
tmp = out_path.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(data, indent=2))
|
||||
os.replace(tmp, out_path)
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("queue", type=Path)
|
||||
ap.add_argument("out", type=Path)
|
||||
ap.add_argument("--limit", type=int, default=None)
|
||||
args = ap.parse_args()
|
||||
|
||||
queue = json.loads(args.queue.read_text())
|
||||
print(f"[queue] {len(queue)} entries from {args.queue}", flush=True)
|
||||
args.out.parent.mkdir(parents=True, exist_ok=True)
|
||||
existing, processed = load_existing(args.out)
|
||||
if existing:
|
||||
print(f"[resume] {len(processed)} already scored", flush=True)
|
||||
results = existing.get("results", [])
|
||||
else:
|
||||
results = []
|
||||
pending = [e for e in queue if e["wsl_path"] not in processed]
|
||||
if args.limit is not None:
|
||||
pending = pending[: args.limit]
|
||||
print(f"[pending] {len(pending)} entries", flush=True)
|
||||
if not pending:
|
||||
print("[done] nothing to do")
|
||||
return
|
||||
|
||||
print("[load] FaceAnalysis with DmlExecutionProvider", flush=True)
|
||||
app = FaceAnalysis(
|
||||
name="buffalo_l",
|
||||
root=MODEL_ROOT,
|
||||
providers=["DmlExecutionProvider", "CPUExecutionProvider"],
|
||||
)
|
||||
app.prepare(ctx_id=0, det_size=(640, 640))
|
||||
|
||||
n_done = 0
|
||||
n_load_err = 0
|
||||
last_flush = time.time()
|
||||
t_start = time.time()
|
||||
|
||||
def flush():
|
||||
save_atomic(args.out, {
|
||||
"results": results,
|
||||
"processed": sorted(processed),
|
||||
})
|
||||
|
||||
for entry in pending:
|
||||
try:
|
||||
with Image.open(entry["win_path"]) as im:
|
||||
im = ImageOps.exif_transpose(im)
|
||||
im = im.convert("RGB")
|
||||
rgb = np.array(im)
|
||||
bgr = rgb[:, :, ::-1].copy()
|
||||
except Exception as e:
|
||||
n_load_err += 1
|
||||
results.append({
|
||||
"wsl_path": entry["wsl_path"], "faceset": entry["faceset"], "file": entry["file"],
|
||||
"face_count": -1, "error": "load",
|
||||
})
|
||||
processed.add(entry["wsl_path"])
|
||||
n_done += 1
|
||||
continue
|
||||
|
||||
faces = app.get(bgr)
|
||||
kept = 0
|
||||
for f in faces:
|
||||
if float(f.det_score) < MIN_DET:
|
||||
continue
|
||||
x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
|
||||
short = min(max(x2 - x1, 0), max(y2 - y1, 0))
|
||||
if short < MIN_FACE_PIX:
|
||||
continue
|
||||
kept += 1
|
||||
|
||||
results.append({
|
||||
"wsl_path": entry["wsl_path"], "faceset": entry["faceset"], "file": entry["file"],
|
||||
"face_count": kept,
|
||||
})
|
||||
processed.add(entry["wsl_path"])
|
||||
n_done += 1
|
||||
|
||||
if (n_done % FLUSH_EVERY == 0) or (time.time() - last_flush) > 30.0:
|
||||
flush()
|
||||
last_flush = time.time()
|
||||
elapsed = time.time() - t_start
|
||||
rate = n_done / max(0.1, elapsed)
|
||||
eta = (len(pending) - n_done) / max(0.1, rate) / 60.0
|
||||
print(f"[scan] {n_done}/{len(pending)} rate={rate:.2f} img/s eta={eta:.1f}min "
|
||||
f"load_err={n_load_err}", flush=True)
|
||||
|
||||
flush()
|
||||
elapsed = time.time() - t_start
|
||||
print(f"[done] {n_done} scored, {n_load_err} load errors, {elapsed:.1f}s "
|
||||
f"({n_done/max(0.1,elapsed):.2f} img/s) -> {args.out}", flush=True)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
Executable
+127
@@ -0,0 +1,127 @@
|
||||
#!/bin/bash
|
||||
# Generic chain driver for the video target preprocessing pipeline.
|
||||
#
|
||||
# Usage:
|
||||
# WORK=/path/to/workdir SKIP_PATTERN='ct_src_(0001[015]|005[0-9]|006[0-9])\.mp4' \
|
||||
# bash run_video_pipeline.sh > /opt/face-sets/work/logs/<name>.log 2>&1
|
||||
#
|
||||
# Required env vars:
|
||||
# WORK per-batch workdir (will hold scenes/, queue.json, results.jsonl, plan.json, review/)
|
||||
#
|
||||
# Optional env vars:
|
||||
# INPUT_DIR default /mnt/x/src/vd
|
||||
# OUTPUT_DIR default /mnt/x/src/vd/ct
|
||||
# FILTER_FROM basename floor; only files with name >= this go in (e.g. ct_src_00050.mp4)
|
||||
# SKIP_PATTERN regex of basenames to exclude (Python `re` syntax). Applied AFTER FILTER_FROM.
|
||||
# MAX_DUR score --max-dur (default 120)
|
||||
# IDENTITY "yes" to enable identity tagging; default "no"
|
||||
# SIDECAR "yes" to emit <uuid>.json provenance sidecars; default "no"
|
||||
|
||||
set -e
|
||||
|
||||
: ${WORK:?WORK env var must point at a workdir}
|
||||
: ${INPUT_DIR:=/mnt/x/src/vd}
|
||||
: ${OUTPUT_DIR:=/mnt/x/src/vd/ct}
|
||||
: ${MAX_DUR:=120}
|
||||
: ${IDENTITY:=no}
|
||||
: ${SIDECAR:=no}
|
||||
|
||||
mkdir -p "$WORK" "$WORK/scenes"
|
||||
|
||||
PY_WSL=/home/peter/face_sort_env/bin/python
|
||||
PY_WIN="/mnt/c/face_embed_venv/Scripts/python.exe"
|
||||
PIPELINE=/opt/face-sets/work/video_target_pipeline.py
|
||||
WORKER=/opt/face-sets/work/video_face_worker.py
|
||||
INVENTORY_FULL=/opt/face-sets/work/video_preprocess/inventory_full.json
|
||||
|
||||
ts() { date +"%Y-%m-%d %H:%M:%S"; }
|
||||
log() { echo "[$(ts)] [$PHASE] $*"; }
|
||||
|
||||
PHASE="setup"
|
||||
log "STARTED — host=$(hostname) pid=$$ work=$WORK"
|
||||
log "config: input=$INPUT_DIR output=$OUTPUT_DIR filter_from=${FILTER_FROM:-<none>} skip_pattern=${SKIP_PATTERN:-<none>} max_dur=$MAX_DUR identity=$IDENTITY sidecar=$SIDECAR"
|
||||
|
||||
PHASE="inventory"
|
||||
log "building subset inventory"
|
||||
T0=$(date +%s)
|
||||
# rebuild full inventory if missing
|
||||
if [ ! -f "$INVENTORY_FULL" ]; then
|
||||
log "(no full inventory cached — running fresh scan)"
|
||||
$PY_WSL $PIPELINE scan --input "$INPUT_DIR" --output-dir "$OUTPUT_DIR" --out "$INVENTORY_FULL"
|
||||
fi
|
||||
$PY_WSL <<EOF
|
||||
import json, re
|
||||
from pathlib import Path
|
||||
inv = json.load(open('$INVENTORY_FULL'))
|
||||
subset = list(inv['videos'])
|
||||
filter_from = '${FILTER_FROM}'
|
||||
skip_pat = '${SKIP_PATTERN}'
|
||||
if filter_from:
|
||||
subset = [v for v in subset if Path(v['path']).name >= filter_from]
|
||||
if skip_pat:
|
||||
pat = re.compile(skip_pat)
|
||||
subset = [v for v in subset if not pat.search(Path(v['path']).name)]
|
||||
subset.sort(key=lambda v: v['path'])
|
||||
inv['videos'] = subset
|
||||
json.dump(inv, open('$WORK/inventory.json','w'), indent=2)
|
||||
total_dur = sum(v.get('duration_s', 0) for v in inv['videos'] if 'error' not in v)
|
||||
print(f' {len(inv["videos"])} videos, total {total_dur/3600:.2f}h input')
|
||||
EOF
|
||||
log "done in $(($(date +%s)-T0))s"
|
||||
|
||||
PHASE="scenes"
|
||||
log "PySceneDetect AdaptiveDetector across all videos (cached entries skipped)"
|
||||
T0=$(date +%s)
|
||||
$PY_WSL $PIPELINE scenes --inventory "$WORK/inventory.json" --out-dir "$WORK/scenes"
|
||||
log "done in $(($(date +%s)-T0))s"
|
||||
|
||||
PHASE="stage"
|
||||
log "building frame queue @ 2 fps within scenes"
|
||||
T0=$(date +%s)
|
||||
$PY_WSL $PIPELINE stage --inventory "$WORK/inventory.json" --scenes-dir "$WORK/scenes" --out "$WORK/queue.json"
|
||||
log "done in $(($(date +%s)-T0))s"
|
||||
|
||||
PHASE="worker"
|
||||
log "Windows DML face detect+embed (resumable; the slow one)"
|
||||
T0=$(date +%s)
|
||||
$PY_WIN $WORKER "$WORK/queue.json" "$WORK/results.json"
|
||||
log "done in $(($(date +%s)-T0))s"
|
||||
|
||||
PHASE="merge"
|
||||
log "ingesting worker output (jsonl)"
|
||||
T0=$(date +%s)
|
||||
$PY_WSL $PIPELINE merge --results "$WORK/results.json" --out "$WORK/frames.json"
|
||||
log "done in $(($(date +%s)-T0))s"
|
||||
|
||||
PHASE="track"
|
||||
log "stitching detections into tracks"
|
||||
T0=$(date +%s)
|
||||
$PY_WSL $PIPELINE track --frames "$WORK/frames.json" --scenes-dir "$WORK/scenes" \
|
||||
--inventory "$WORK/inventory.json" --out "$WORK/tracks.json"
|
||||
log "done in $(($(date +%s)-T0))s"
|
||||
|
||||
PHASE="score"
|
||||
log "scoring with relaxed gates + max-dur=$MAX_DUR identity=$IDENTITY"
|
||||
T0=$(date +%s)
|
||||
ID_FLAG=""
|
||||
if [ "$IDENTITY" != "yes" ]; then ID_FLAG="--no-identity"; fi
|
||||
$PY_WSL $PIPELINE score --tracks "$WORK/tracks.json" --inventory "$WORK/inventory.json" \
|
||||
--out "$WORK/plan.json" --max-dur "$MAX_DUR" $ID_FLAG
|
||||
log "done in $(($(date +%s)-T0))s"
|
||||
|
||||
PHASE="cut"
|
||||
log "ffmpeg stream-copy into per-source subfolders (no --clean)"
|
||||
T0=$(date +%s)
|
||||
SIDECAR_FLAG=""
|
||||
if [ "$SIDECAR" = "yes" ]; then SIDECAR_FLAG="--write-sidecar"; fi
|
||||
$PY_WSL $PIPELINE cut --plan "$WORK/plan.json" --output-dir "$OUTPUT_DIR" $SIDECAR_FLAG
|
||||
log "done in $(($(date +%s)-T0))s"
|
||||
|
||||
PHASE="report"
|
||||
log "rendering HTML"
|
||||
T0=$(date +%s)
|
||||
$PY_WSL $PIPELINE report --plan "$WORK/plan.json" --output-dir "$OUTPUT_DIR" --out "$WORK/review"
|
||||
log "done in $(($(date +%s)-T0))s"
|
||||
|
||||
PHASE="done"
|
||||
log "PIPELINE COMPLETE — review at file://$WORK/review/index.html"
|
||||
Executable
+32
@@ -0,0 +1,32 @@
|
||||
#!/bin/bash
|
||||
# Generic status helper for run_video_pipeline.sh.
|
||||
# Usage: bash status_video_pipeline.sh <log_file>
|
||||
# Defaults to /opt/face-sets/work/logs/video_run.log if no arg.
|
||||
|
||||
LOG="${1:-/opt/face-sets/work/logs/video_run.log}"
|
||||
|
||||
if [ ! -f "$LOG" ]; then
|
||||
echo "no log at $LOG yet"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
echo "=== last 8 log lines ==="
|
||||
tail -8 "$LOG"
|
||||
echo
|
||||
|
||||
# worker progress
|
||||
last=$(grep -E "^\[scan\] [0-9]+/[0-9]+" "$LOG" | tail -1)
|
||||
if [ -n "$last" ]; then
|
||||
echo "=== DML worker progress ==="
|
||||
echo " $last"
|
||||
fi
|
||||
|
||||
# total elapsed
|
||||
start_epoch=$(head -1 "$LOG" | sed 's/.*\[\(.*\)\].*\[setup\].*/\1/' | xargs -I{} date -d "{}" +%s 2>/dev/null)
|
||||
now_epoch=$(date +%s)
|
||||
if [ -n "$start_epoch" ] && [ "$start_epoch" != "" ] 2>/dev/null; then
|
||||
elapsed=$((now_epoch - start_epoch))
|
||||
h=$((elapsed / 3600))
|
||||
m=$(( (elapsed % 3600) / 60 ))
|
||||
echo " elapsed: ${h}h${m}m"
|
||||
fi
|
||||
@@ -0,0 +1,274 @@
|
||||
"""Windows / DirectML video frame face worker.
|
||||
|
||||
Reads a queue.json from /opt/face-sets/work/video_target_pipeline.py:stage
|
||||
(WSL side), each entry: {video_path, win_video_path, frame_idx, time_s,
|
||||
queue_id}. Decodes frame N from the video, runs insightface FaceAnalysis,
|
||||
emits per-face records (bbox, det_score, pose, embedding, face_short).
|
||||
|
||||
CLI:
|
||||
py -3.12 video_face_worker.py <queue.json> <out_results.json> [--limit N]
|
||||
|
||||
Resumable: existing entries in out_results.json with the same queue_id are
|
||||
skipped.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
import cv2
|
||||
from insightface.app import FaceAnalysis
|
||||
|
||||
MODEL_ROOT = r"C:\face_embed_venv\models"
|
||||
MIN_DET = 0.5
|
||||
MIN_FACE_PIX = 40
|
||||
FLUSH_EVERY = 100
|
||||
|
||||
|
||||
def jsonl_path_for(out_path: Path) -> Path:
|
||||
"""Sister JSONL file: one result-record per line, append-only."""
|
||||
return out_path.with_suffix(".jsonl")
|
||||
|
||||
|
||||
def load_existing(out_path: Path):
|
||||
"""Load existing results from .jsonl (preferred) or legacy .json (one-time conversion).
|
||||
Returns (records_list, processed_set)."""
|
||||
jsonl = jsonl_path_for(out_path)
|
||||
if jsonl.exists():
|
||||
records = []
|
||||
processed = set()
|
||||
with open(jsonl) as f:
|
||||
for line_num, line in enumerate(f, 1):
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
try:
|
||||
r = json.loads(line)
|
||||
records.append(r)
|
||||
if r.get("queue_id"):
|
||||
processed.add(r["queue_id"])
|
||||
except json.JSONDecodeError:
|
||||
print(f"[warn] {jsonl}:{line_num} corrupt; skipping", file=sys.stderr)
|
||||
return records, processed
|
||||
# legacy JSON support: load once, convert to JSONL
|
||||
if out_path.exists():
|
||||
try:
|
||||
d = json.loads(out_path.read_text())
|
||||
records = d.get("results", [])
|
||||
processed = set(d.get("processed", []))
|
||||
print(f"[migrate] converting {len(records)} legacy JSON records to JSONL", file=sys.stderr)
|
||||
with open(jsonl, "w") as f:
|
||||
for r in records:
|
||||
f.write(json.dumps(r) + "\n")
|
||||
return records, processed
|
||||
except Exception as e:
|
||||
print(f"[warn] could not parse {out_path}: {e}; starting fresh", file=sys.stderr)
|
||||
return [], set()
|
||||
|
||||
|
||||
def append_records(out_path: Path, new_records: list):
|
||||
"""Append-only write to the sister .jsonl file. No re-serialization of prior records."""
|
||||
if not new_records:
|
||||
return
|
||||
jsonl = jsonl_path_for(out_path)
|
||||
with open(jsonl, "a") as f:
|
||||
for r in new_records:
|
||||
f.write(json.dumps(r) + "\n")
|
||||
|
||||
|
||||
def write_compat_summary(out_path: Path, total_records: int, processed: set):
|
||||
"""Write a tiny JSON pointer file at the legacy out_path so older consumers
|
||||
still see *something*, but the canonical store is the .jsonl. Cheap."""
|
||||
summary = {
|
||||
"_format": "jsonl-pointer",
|
||||
"_jsonl": str(jsonl_path_for(out_path).name),
|
||||
"results_count": total_records,
|
||||
"processed_count": len(processed),
|
||||
}
|
||||
tmp = out_path.with_suffix(".tmp.json")
|
||||
tmp.write_text(json.dumps(summary, indent=2))
|
||||
os.replace(tmp, out_path)
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("queue", type=Path)
|
||||
ap.add_argument("out", type=Path)
|
||||
ap.add_argument("--limit", type=int, default=None)
|
||||
args = ap.parse_args()
|
||||
|
||||
queue = json.loads(args.queue.read_text())
|
||||
print(f"[queue] {len(queue)} entries from {args.queue}", flush=True)
|
||||
args.out.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
results, processed = load_existing(args.out)
|
||||
if processed:
|
||||
print(f"[resume] {len(processed)} already scored", flush=True)
|
||||
|
||||
pending = [e for e in queue if e["queue_id"] not in processed]
|
||||
if args.limit is not None:
|
||||
pending = pending[: args.limit]
|
||||
print(f"[pending] {len(pending)} entries", flush=True)
|
||||
if not pending:
|
||||
print("[done] nothing to do")
|
||||
return
|
||||
|
||||
print("[load] FaceAnalysis with DmlExecutionProvider", flush=True)
|
||||
app = FaceAnalysis(
|
||||
name="buffalo_l",
|
||||
root=MODEL_ROOT,
|
||||
providers=["DmlExecutionProvider", "CPUExecutionProvider"],
|
||||
)
|
||||
app.prepare(ctx_id=0, det_size=(640, 640))
|
||||
|
||||
# group queue by video so we can keep one VideoCapture open and seek
|
||||
from collections import defaultdict
|
||||
by_video = defaultdict(list)
|
||||
for e in pending:
|
||||
by_video[e["win_video_path"]].append(e)
|
||||
|
||||
n_done = 0
|
||||
n_load_err = 0
|
||||
last_flush = time.time()
|
||||
t_start = time.time()
|
||||
new_buffer: list = []
|
||||
|
||||
def flush():
|
||||
# append-only: only NEW records since last flush get written. O(new_records),
|
||||
# not O(total_records). Was 11s/flush at 9k records; now <50ms.
|
||||
if new_buffer:
|
||||
append_records(args.out, new_buffer)
|
||||
new_buffer.clear()
|
||||
write_compat_summary(args.out, len(results), processed)
|
||||
|
||||
for vidpath, entries in by_video.items():
|
||||
# entries are already sorted by frame_idx. Hybrid decode strategy:
|
||||
# 1. Seek ONCE to the first pending target (cheap keyframe-seek).
|
||||
# 2. Sequential cap.grab() between subsequent targets (decode without
|
||||
# BGR conversion until we reach a target, then cap.retrieve()).
|
||||
# This avoids per-sample seek cost (the original pathology that
|
||||
# caused 1.4 fps deep in long videos) AND avoids grab-walking from
|
||||
# frame 0 on resume (the over-correction that gave 0.08 fps).
|
||||
entries.sort(key=lambda e: e["frame_idx"])
|
||||
cap = cv2.VideoCapture(vidpath)
|
||||
if not cap.isOpened():
|
||||
print(f"[err] cannot open {vidpath}", flush=True)
|
||||
for e in entries:
|
||||
rec = {
|
||||
"queue_id": e["queue_id"], "video_path": e["video_path"],
|
||||
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
|
||||
"faces": [], "error": "cap_open",
|
||||
}
|
||||
results.append(rec); new_buffer.append(rec)
|
||||
processed.add(e["queue_id"])
|
||||
n_done += 1
|
||||
n_load_err += 1
|
||||
continue
|
||||
first_target = entries[0]["frame_idx"]
|
||||
if first_target > 0:
|
||||
cap.set(cv2.CAP_PROP_POS_FRAMES, first_target)
|
||||
cur_frame_idx = first_target - 1
|
||||
else:
|
||||
cur_frame_idx = -1
|
||||
for e in entries:
|
||||
target = e["frame_idx"]
|
||||
if target < cur_frame_idx + 1:
|
||||
# backward jump (only triggers for unsorted entries — defensive)
|
||||
cap.set(cv2.CAP_PROP_POS_FRAMES, target)
|
||||
cur_frame_idx = target - 1
|
||||
# advance up to (but not including) target via grab()-only
|
||||
ran_out = False
|
||||
while cur_frame_idx + 1 < target:
|
||||
ok = cap.grab()
|
||||
if not ok:
|
||||
ran_out = True
|
||||
break
|
||||
cur_frame_idx += 1
|
||||
if not ran_out:
|
||||
ok = cap.grab()
|
||||
if not ok:
|
||||
ran_out = True
|
||||
else:
|
||||
cur_frame_idx = target
|
||||
if ran_out:
|
||||
rec = {
|
||||
"queue_id": e["queue_id"], "video_path": e["video_path"],
|
||||
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
|
||||
"faces": [], "error": "frame_read",
|
||||
}
|
||||
results.append(rec); new_buffer.append(rec)
|
||||
processed.add(e["queue_id"])
|
||||
n_done += 1
|
||||
n_load_err += 1
|
||||
continue
|
||||
ok, bgr = cap.retrieve()
|
||||
if not ok or bgr is None:
|
||||
rec = {
|
||||
"queue_id": e["queue_id"], "video_path": e["video_path"],
|
||||
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
|
||||
"faces": [], "error": "frame_read",
|
||||
}
|
||||
results.append(rec); new_buffer.append(rec)
|
||||
processed.add(e["queue_id"])
|
||||
n_done += 1
|
||||
n_load_err += 1
|
||||
continue
|
||||
|
||||
faces = app.get(bgr)
|
||||
kept_faces = []
|
||||
H, W = bgr.shape[:2]
|
||||
for f in faces:
|
||||
if float(f.det_score) < MIN_DET:
|
||||
continue
|
||||
x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
|
||||
x1 = max(x1, 0); y1 = max(y1, 0)
|
||||
x2 = min(x2, W); y2 = min(y2, H)
|
||||
w, h = x2 - x1, y2 - y1
|
||||
short = min(w, h)
|
||||
if short < MIN_FACE_PIX:
|
||||
continue
|
||||
rec = {
|
||||
"bbox": [x1, y1, x2, y2],
|
||||
"det_score": float(f.det_score),
|
||||
"face_short": int(short),
|
||||
}
|
||||
if hasattr(f, "pose") and f.pose is not None:
|
||||
rec["pose"] = [float(x) for x in f.pose] # pitch, yaw, roll
|
||||
if hasattr(f, "normed_embedding") and f.normed_embedding is not None:
|
||||
rec["embedding"] = f.normed_embedding.astype(np.float32).tolist()
|
||||
kept_faces.append(rec)
|
||||
|
||||
rec = {
|
||||
"queue_id": e["queue_id"], "video_path": e["video_path"],
|
||||
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
|
||||
"frame_w": W, "frame_h": H,
|
||||
"faces": kept_faces,
|
||||
}
|
||||
results.append(rec); new_buffer.append(rec)
|
||||
processed.add(e["queue_id"])
|
||||
n_done += 1
|
||||
|
||||
if (n_done % FLUSH_EVERY == 0) or (time.time() - last_flush) > 30.0:
|
||||
flush()
|
||||
last_flush = time.time()
|
||||
el = time.time() - t_start
|
||||
rate = n_done / max(0.1, el)
|
||||
eta = (len(pending) - n_done) / max(0.1, rate) / 60.0
|
||||
print(f"[scan] {n_done}/{len(pending)} rate={rate:.2f} fps eta={eta:.1f}min "
|
||||
f"errs={n_load_err}", flush=True)
|
||||
cap.release()
|
||||
|
||||
flush()
|
||||
el = time.time() - t_start
|
||||
print(f"[done] {n_done} scored, {n_load_err} errors, {el:.1f}s "
|
||||
f"({n_done/max(0.1,el):.2f} fps) -> {args.out}", flush=True)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,919 @@
|
||||
"""Video target preprocessing pipeline for roop-unleashed.
|
||||
|
||||
Discovers video files in an input folder, runs scene-cut detection, samples
|
||||
frames within each scene, runs face detection + embedding via Windows DML
|
||||
worker, stitches per-frame detections into face tracks, applies quality
|
||||
gates, cuts approved segments out with ffmpeg stream-copy, and writes a
|
||||
report. Output clips have generic UUID names + a sidecar JSON with full
|
||||
provenance.
|
||||
|
||||
Subcommands:
|
||||
scan list input videos, run ffprobe, write per-video index
|
||||
scenes PySceneDetect AdaptiveDetector per video; write scenes_<basename>.json
|
||||
stage write frame queue.json (sampled @ 2 fps within scenes)
|
||||
merge ingest worker results.json into per-video frame_results
|
||||
track IoU+embedding stitching of per-frame detections into tracks
|
||||
score track-level quality gating + segment plan
|
||||
cut ffmpeg -c copy each accepted segment to <out_dir>/<uuid>.mp4
|
||||
report HTML preview with thumbnails + identity tags
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import math
|
||||
import re
|
||||
import shutil
|
||||
import subprocess
|
||||
import sys
|
||||
import time
|
||||
import uuid
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
|
||||
DEFAULT_INPUT = Path("/mnt/x/src/vd")
|
||||
DEFAULT_OUTPUT = Path("/mnt/x/src/vd/ct")
|
||||
WORK_DIR = Path("/opt/face-sets/work/video_preprocess")
|
||||
|
||||
# defaults — first set was strict-portrait; second set loosened for side-profile + segment merging
|
||||
SAMPLE_FPS = 2.0
|
||||
QUALITY_YAW_MAX = 75.0 # was 25; allow full 3/4 + profile (face-sets handle it)
|
||||
QUALITY_PITCH_MAX = 45.0 # was 30
|
||||
QUALITY_FACE_MIN = 80 # was 96
|
||||
QUALITY_BLUR_MIN = 50.0
|
||||
QUALITY_DET_MIN = 0.5 # was 0.6
|
||||
TRACK_GATE_FRAC = 0.7 # >=70% of frames in track must pass per-frame gates
|
||||
SEGMENT_MIN_S = 1.0
|
||||
SEGMENT_MAX_S = 30.0 # was 10
|
||||
SEGMENT_BRIDGE_S = 3.0 # was 1.0 — within-track pose-failure bridging
|
||||
SEGMENT_MERGE_GAP_S = 2.0 # NEW — across-track merge if same scene + within this gap
|
||||
TRACK_IOU_MIN = 0.3
|
||||
TRACK_EMB_MIN = 0.5
|
||||
|
||||
CACHES = [
|
||||
Path("/opt/face-sets/work/cache/nl_full.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_peter.npz"),
|
||||
Path("/opt/face-sets/work/cache/immich_nic.npz"),
|
||||
]
|
||||
FACESETS_ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
|
||||
IDENTITY_TAG_THRESHOLD = 0.6 # cosine sim to faceset centroid
|
||||
|
||||
|
||||
def wsl_to_win(p: str) -> str:
|
||||
s = str(p)
|
||||
if s.startswith("/mnt/"):
|
||||
return f"{s[5].upper()}:\\{s[7:].replace('/', chr(92))}"
|
||||
return s
|
||||
|
||||
|
||||
# ----------------------------- ffprobe / scan -----------------------------
|
||||
|
||||
def ffprobe(video: Path) -> dict:
|
||||
cmd = [
|
||||
"ffprobe", "-v", "error", "-print_format", "json",
|
||||
"-show_format", "-show_streams", str(video),
|
||||
]
|
||||
r = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
|
||||
if r.returncode != 0:
|
||||
return {"error": r.stderr.strip()}
|
||||
return json.loads(r.stdout)
|
||||
|
||||
|
||||
def parse_video_meta(probe: dict) -> dict:
|
||||
if "error" in probe:
|
||||
return {"error": probe["error"]}
|
||||
fmt = probe.get("format", {})
|
||||
duration = float(fmt.get("duration", 0))
|
||||
vstream = next((s for s in probe.get("streams", []) if s.get("codec_type") == "video"), None)
|
||||
if vstream is None:
|
||||
return {"error": "no video stream"}
|
||||
fps_str = vstream.get("avg_frame_rate", "0/1")
|
||||
try:
|
||||
num, den = (int(x) for x in fps_str.split("/"))
|
||||
fps = num / den if den else 0.0
|
||||
except Exception:
|
||||
fps = 0.0
|
||||
nb_frames = int(vstream.get("nb_frames", 0)) or int(round(duration * fps))
|
||||
return {
|
||||
"duration_s": duration,
|
||||
"fps": fps,
|
||||
"frames": nb_frames,
|
||||
"width": int(vstream.get("width", 0)),
|
||||
"height": int(vstream.get("height", 0)),
|
||||
"codec": vstream.get("codec_name"),
|
||||
}
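# e.g. an avg_frame_rate of "30000/1001" parses to ~29.97 fps; a missing or
# zero denominator falls back to fps = 0.0.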
|
||||
|
||||
def cmd_scan(args):
|
||||
in_dir = Path(args.input)
|
||||
out = Path(args.out)
|
||||
out.parent.mkdir(parents=True, exist_ok=True)
|
||||
extensions = {".mp4", ".mov", ".mkv", ".m4v", ".avi", ".webm"}
|
||||
out_root = Path(args.output_dir).resolve()
|
||||
videos = []
|
||||
for p in sorted(in_dir.iterdir() if not args.recursive else in_dir.rglob("*")):
|
||||
if not p.is_file():
|
||||
continue
|
||||
if out_root in p.parents or p.resolve() == out_root:
|
||||
continue # never include the output dir
|
||||
if p.suffix.lower() not in extensions:
|
||||
continue
|
||||
videos.append(p)
|
||||
print(f"[scan] {len(videos)} candidate videos", file=sys.stderr)
|
||||
inventory = []
|
||||
for p in videos:
|
||||
meta = parse_video_meta(ffprobe(p))
|
||||
meta["path"] = str(p)
|
||||
meta["win_path"] = wsl_to_win(str(p))
|
||||
meta["size"] = p.stat().st_size
|
||||
inventory.append(meta)
|
||||
if "error" not in meta:
|
||||
print(f" {p.name}: {meta['duration_s']:.1f}s @ {meta['fps']:.1f}fps "
|
||||
f"{meta['width']}x{meta['height']} {meta['codec']}", file=sys.stderr)
|
||||
else:
|
||||
print(f" {p.name}: ERROR {meta['error']}", file=sys.stderr)
|
||||
out.write_text(json.dumps({"input": str(in_dir), "videos": inventory}, indent=2))
|
||||
print(f"[scan] inventory -> {out}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- scenes -----------------------------
|
||||
|
||||
def cmd_scenes(args):
|
||||
from scenedetect import open_video, SceneManager
|
||||
from scenedetect.detectors import AdaptiveDetector
|
||||
inv = json.loads(Path(args.inventory).read_text())
|
||||
out_dir = Path(args.out_dir)
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
only = set(args.only.split(",")) if args.only else None
|
||||
for v in inv["videos"]:
|
||||
if "error" in v:
|
||||
continue
|
||||
path = Path(v["path"])
|
||||
if only and path.name not in only:
|
||||
continue
|
||||
out_file = out_dir / (path.stem + ".scenes.json")
|
||||
if out_file.exists() and not args.force:
|
||||
continue
|
||||
print(f"[scenes] {path.name} ...", file=sys.stderr, flush=True)
|
||||
t0 = time.time()
|
||||
try:
|
||||
video = open_video(str(path))
|
||||
sm = SceneManager()
|
||||
sm.add_detector(AdaptiveDetector(min_scene_len=int(round(v.get("fps", 30) or 30) * 0.5)))  # min scene length ~= half a second worth of frames
|
||||
sm.detect_scenes(video, show_progress=False)
|
||||
scenes = sm.get_scene_list()
|
||||
entries = []
|
||||
for s, e in scenes:
|
||||
entries.append({
|
||||
"start_frame": s.frame_num, "end_frame": e.frame_num,
|
||||
"start_s": s.get_seconds(), "end_s": e.get_seconds(),
|
||||
"duration_s": e.get_seconds() - s.get_seconds(),
|
||||
})
|
||||
# if no cuts found, treat the whole video as one scene
|
||||
if not entries:
|
||||
entries = [{
|
||||
"start_frame": 0, "end_frame": v["frames"],
|
||||
"start_s": 0.0, "end_s": v["duration_s"],
|
||||
"duration_s": v["duration_s"],
|
||||
}]
|
||||
out_file.write_text(json.dumps({"video": str(path), "scenes": entries}, indent=2))
|
||||
print(f" {len(entries)} scenes in {time.time()-t0:.1f}s -> {out_file.name}",
|
||||
file=sys.stderr)
|
||||
except Exception as e:
|
||||
print(f" ERROR: {e}", file=sys.stderr)
|
||||
|
||||
|
||||
# ----------------------------- stage -----------------------------

def cmd_stage(args):
    inv = json.loads(Path(args.inventory).read_text())
    scenes_dir = Path(args.scenes_dir)
    queue = []
    qid = 0
    sample_every = 1.0 / args.sample_fps
    for v in inv["videos"]:
        if "error" in v:
            continue
        p = Path(v["path"])
        sf = scenes_dir / (p.stem + ".scenes.json")
        if not sf.exists():
            print(f"[warn] no scenes file for {p.name}; skipping", file=sys.stderr)
            continue
        scenes = json.loads(sf.read_text()).get("scenes", [])
        fps = v.get("fps", 30) or 30
        for sc in scenes:
            t = sc["start_s"]
            while t < sc["end_s"] - 0.01:
                fidx = int(round(t * fps))
                if fidx >= v["frames"]:
                    break
                queue.append({
                    "queue_id": f"q{qid:08d}",
                    "video_path": str(p),
                    "win_video_path": v["win_path"],
                    "frame_idx": fidx,
                    "time_s": t,
                })
                qid += 1
                t += sample_every
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(queue, indent=2))
    print(f"[stage] {len(queue)} sampled frames @ {args.sample_fps} fps -> {out}",
          file=sys.stderr)
    print(f"[stage] win path for worker: {wsl_to_win(str(out))}", file=sys.stderr)
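# Illustrative note (not part of the pipeline logic): stage samples each scene at a fixed
# wall-clock rate, so with --sample-fps 2 a scene spanning 0.0-7.3 s of a 29.97 fps video
# yields samples at t = 0.0, 0.5, ..., 7.0 s, each mapped to frame_idx = int(round(t * fps)).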
# ----------------------------- merge + track -----------------------------

def cmd_merge(args):
    """Read worker output and group by video_path. Supports either JSONL (one record
    per line, the new format) or legacy JSON (results.json with `results` list)."""
    src_path = Path(args.results)
    records = []
    # try JSONL first (sister .jsonl file or .results passed directly)
    jsonl_candidate = src_path.with_suffix(".jsonl")
    if jsonl_candidate.exists():
        with open(jsonl_candidate) as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    elif src_path.suffix == ".jsonl":
        with open(src_path) as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    else:
        # legacy: monolithic JSON
        src = json.loads(src_path.read_text())
        records = src.get("results", [])
    by_video: dict[str, list] = {}
    for r in records:
        by_video.setdefault(r["video_path"], []).append(r)
    for v in by_video:
        by_video[v].sort(key=lambda x: x["frame_idx"])
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"by_video": by_video}, indent=2))
    print(f"[merge] {sum(len(v) for v in by_video.values())} frames across {len(by_video)} videos "
          f"-> {out}", file=sys.stderr)
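# Shape assumed by the stages below (inferred from the fields they read, not a formal
# schema): each worker record carries at least video_path, frame_idx, time_s, and a faces
# list whose entries hold bbox, det_score, face_short, pose, and optionally an embedding.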
def _iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1 = max(ax1, bx1); iy1 = max(ay1, by1)
    ix2 = min(ax2, bx2); iy2 = min(ay2, by2)
    iw = max(ix2 - ix1, 0); ih = max(iy2 - iy1, 0)
    inter = iw * ih
    ua = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / ua if ua > 0 else 0.0
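# Worked example for _iou (illustration only): boxes (0, 0, 10, 10) and (5, 5, 15, 15)
# overlap in a 5x5 patch, so IoU = 25 / (100 + 100 - 25) = 25/175 ~= 0.14, well below the
# IoU >= 0.3 rule described in cmd_track's docstring.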
def cmd_track(args):
    """Stitch per-frame face detections into tracks within each scene of each video.
    Track = list of (frame_idx, face_idx) where adjacent samples have IoU>=0.3 OR
    cosine(emb)>=0.5. New face → new track. No cross-scene merging."""
    fr = json.loads(Path(args.frames).read_text())
    scenes_dir = Path(args.scenes_dir)
    inv = json.loads(Path(args.inventory).read_text())
    inv_by_path = {v["path"]: v for v in inv["videos"]}

    all_video_tracks: dict[str, list] = {}
    for video_path, frames in fr["by_video"].items():
        v = inv_by_path.get(video_path, {})
        sf = scenes_dir / (Path(video_path).stem + ".scenes.json")
        scenes = json.loads(sf.read_text()).get("scenes", []) if sf.exists() else []
        # group frames by scene
        scene_for_frame = {}
        for si, sc in enumerate(scenes):
            for f in frames:
                if f["frame_idx"] >= sc["start_frame"] and f["frame_idx"] < sc["end_frame"]:
                    scene_for_frame.setdefault(si, []).append(f)
        video_tracks = []
        for si, scene_frames in scene_for_frame.items():
            scene_frames.sort(key=lambda x: x["frame_idx"])
            # tracks = list of dict{ "members": [(frame_idx, face_idx, face_dict)], "last_bbox", "last_emb" }
            tracks = []
            for f in scene_frames:
                claimed = set()
                for face_idx, face in enumerate(f.get("faces", [])):
                    bbox = face["bbox"]
                    emb = np.array(face.get("embedding", []), dtype=np.float32) if face.get("embedding") else None
                    best_track = None
                    best_score = 0.0
                    for ti, tr in enumerate(tracks):
                        if ti in claimed:
                            continue
                        # staleness in TIME (sample period independent of source fps)
                        last_time = tr["members"][-1][3]
                        if f["time_s"] - last_time > 1.5:  # stale if >1.5s gap (3 sample periods @ 2fps)
                            continue
                        score = _iou(tr["last_bbox"], bbox)
                        if emb is not None and tr.get("last_emb") is not None:
                            score = max(score, float(np.dot(tr["last_emb"], emb)))
                        if score > best_score:
                            best_score = score
                            best_track = ti
                    if best_track is not None and best_score >= min(TRACK_IOU_MIN, TRACK_EMB_MIN):
                        tr = tracks[best_track]
                        tr["members"].append((f["frame_idx"], face_idx, face, f["time_s"]))
                        tr["last_bbox"] = bbox
                        if emb is not None:
                            tr["last_emb"] = emb
                        claimed.add(best_track)
                    else:
                        tracks.append({
                            "members": [(f["frame_idx"], face_idx, face, f["time_s"])],
                            "last_bbox": bbox,
                            "last_emb": emb,
                        })
            for tr in tracks:
                if len(tr["members"]) < 2:
                    continue
                video_tracks.append({
                    "scene_idx": si,
                    "members": [
                        {"frame_idx": m[0], "face_idx": m[1], "time_s": m[3], "face": m[2]}
                        for m in tr["members"]
                    ],
                })
        all_video_tracks[video_path] = video_tracks
        print(f"[track] {Path(video_path).name}: {sum(len(s) for s in scene_for_frame.values())} frames "
              f"-> {len(video_tracks)} tracks across {len(scene_for_frame)} scenes",
              file=sys.stderr)

    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"by_video": all_video_tracks}, indent=2))
    print(f"[track] -> {out}", file=sys.stderr)
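# Note on the acceptance test above: a face joins its best candidate track when
# max(IoU, cosine) >= min(TRACK_IOU_MIN, TRACK_EMB_MIN). With the 0.3 / 0.5 values from
# the docstring, that single threshold is 0.3, so either signal clearing 0.3 is enough to
# extend a track rather than the two limits acting as independent gates.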
# ----------------------------- score (quality gates) -----------------------------

def _track_passes(track, cfg):
    """Per-frame quality gating; return list of bool (does each member pass) +
    aggregate stats. cfg: dict with yaw_max, pitch_max, face_min, det_min."""
    passes = []
    yaws, pitches, sizes, dets = [], [], [], []
    for m in track["members"]:
        f = m["face"]
        yaw = abs(f.get("pose", [0, 0, 0])[1]) if f.get("pose") else 0
        pitch = abs(f.get("pose", [0, 0, 0])[0]) if f.get("pose") else 0
        size = f.get("face_short", 0)
        det = f.get("det_score", 0)
        ok = (yaw <= cfg["yaw_max"] and pitch <= cfg["pitch_max"]
              and size >= cfg["face_min"] and det >= cfg["det_min"])
        passes.append(ok)
        yaws.append(yaw); pitches.append(pitch); sizes.append(size); dets.append(det)
    return passes, {
        "n": len(passes), "n_pass": sum(passes), "frac_pass": sum(passes) / max(1, len(passes)),
        "yaw_med": float(np.median(yaws)) if yaws else None,
        "pitch_med": float(np.median(pitches)) if pitches else None,
        "size_med": float(np.median(sizes)) if sizes else None,
        "det_med": float(np.median(dets)) if dets else None,
    }
def _build_segments(track, cfg):
    """Return list of (start_s, end_s) accepted sub-segments of this track:
    contiguous runs of passing frames (min/max duration is enforced later by
    _split_long_segments). Pose-failure spans <= cfg['bridge_s'] long get bridged
    across (handles momentary head turns / detection misses)."""
    passes, stats = _track_passes(track, cfg)
    members = track["members"]
    if not members:
        return [], stats
    # bridge gaps of failing frames (any width) up to cfg["bridge_s"] seconds
    bridged = list(passes)
    n = len(bridged)
    i = 0
    while i < n:
        if bridged[i]:
            i += 1
            continue
        # find run of consecutive False starting at i
        j = i
        while j < n and not bridged[j]:
            j += 1
        # bridge if surrounded by True on both sides AND time gap <= bridge_s
        if i > 0 and j < n and bridged[i - 1] and bridged[j]:
            t_left = members[i - 1]["time_s"]
            t_right = members[j]["time_s"]
            if t_right - t_left <= cfg["bridge_s"]:
                for k in range(i, j):
                    bridged[k] = True
        i = j
    # find runs of True
    runs = []
    i = 0
    while i < n:
        if not bridged[i]:
            i += 1; continue
        j = i
        while j + 1 < n and bridged[j + 1]:
            j += 1
        s = members[i]["time_s"]
        # end is the time of the last passing sample plus one sample-period
        e = members[j]["time_s"] + 1.0 / max(SAMPLE_FPS, 1e-3)
        runs.append((s, e))
        i = j + 1
    return runs, stats
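# Bridging example for _build_segments (illustration only): with samples every 0.5 s and
# pass flags [T, T, F, F, T, T], the failing span runs from t_left = 0.5 s to
# t_right = 2.0 s, so any bridge_s >= 1.5 turns the whole track into one segment
# [0.0 s, 3.0 s); a smaller bridge_s leaves two shorter segments.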
def _merge_close_segments(segs_with_meta, merge_gap_s: float):
    """Merge segments within the same scene that are within merge_gap_s of each other.
    segs_with_meta: list of dicts with start_s, end_s, scene_idx, track_idx, stats.
    Returns list of merged dicts (one per merged group). Identity-tag and stats
    aggregation happen later."""
    by_scene: dict[int, list] = {}
    for s in segs_with_meta:
        by_scene.setdefault(s["scene_idx"], []).append(s)
    merged_all = []
    for scene_idx, segs in by_scene.items():
        segs.sort(key=lambda x: x["start_s"])
        cur = None
        for s in segs:
            if cur is None:
                cur = {**s, "track_idxs": [s["track_idx"]], "member_count": s["stats"]["n"],
                       "pass_count": s["stats"]["n_pass"]}
                continue
            gap = s["start_s"] - cur["end_s"]
            if gap <= merge_gap_s:
                # merge
                cur["end_s"] = max(cur["end_s"], s["end_s"])
                cur["track_idxs"].append(s["track_idx"])
                cur["member_count"] += s["stats"]["n"]
                cur["pass_count"] += s["stats"]["n_pass"]
                # take the better-quality stats for display
                if s["stats"]["n_pass"] > cur["stats"]["n_pass"]:
                    cur["stats"] = s["stats"]
            else:
                merged_all.append(cur)
                cur = {**s, "track_idxs": [s["track_idx"]], "member_count": s["stats"]["n"],
                       "pass_count": s["stats"]["n_pass"]}
        if cur is not None:
            merged_all.append(cur)
    return merged_all
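# Merging example (illustration only): two same-scene segments [3.0 s, 6.0 s) and
# [6.8 s, 9.0 s) are 0.8 s apart, so with merge_gap_s >= 0.8 they collapse into a single
# [3.0 s, 9.0 s) entry whose track_idxs lists both tracks and whose member/pass counts sum.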
def _split_long_segments(segs_with_meta, min_s: float, max_s: float):
    """Apply min/max duration: drop too-short, split too-long evenly."""
    out = []
    for s in segs_with_meta:
        dur = s["end_s"] - s["start_s"]
        if dur < min_s:
            continue
        if dur <= max_s:
            out.append(s)
            continue
        n = int(math.ceil(dur / max_s))
        chunk = dur / n
        base_start = s["start_s"]
        for k in range(n):
            piece = dict(s)
            piece["start_s"] = base_start + k * chunk
            piece["end_s"] = base_start + (k + 1) * chunk
            out.append(piece)
    return out
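# Splitting example (illustration only): with min_s = 1.5 and max_s = 6.0, a 0.9 s segment
# is dropped, a 5 s segment passes through unchanged, and a 14 s segment is cut into
# ceil(14 / 6) = 3 equal pieces of roughly 4.67 s each.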
# identity tagging via cached arcface centroids
def load_caches_index():
    rec_index = {}
    alias_map = {}
    for c in CACHES:
        if not c.exists():
            continue
        d = np.load(c, allow_pickle=True)
        emb = d["embeddings"]
        meta = json.loads(str(d["meta"]))
        face_records = [m for m in meta if not m.get("noface")]
        if "path_aliases" in d.files:
            paliases = json.loads(str(d["path_aliases"]))
            for canon, alist in paliases.items():
                alias_map.setdefault(canon, canon)
                for a in alist:
                    alias_map[a] = canon
        for i, rec in enumerate(face_records):
            v = emb[i].astype(np.float32)
            n = float(np.linalg.norm(v))
            if n > 0:
                v = v / n
            rec_index[(rec["path"], tuple(int(x) for x in rec["bbox"]))] = v
            alias_map.setdefault(rec["path"], rec["path"])
    return rec_index, alias_map
def load_faceset_centroids():
    """Return dict faceset_name -> normalized centroid embedding."""
    rec_index, alias_map = load_caches_index()
    centroids = {}
    for fs_dir in sorted(FACESETS_ROOT.iterdir()):
        if not fs_dir.is_dir() or fs_dir.name.startswith("_"):
            continue
        # exclude era splits to avoid double-tagging within a family
        if re.match(r"^faceset_\d+_(?:\d{4}-\d{2,4}|\d{4}|undated)", fs_dir.name):
            continue
        mp = fs_dir / "manifest.json"
        if not mp.exists():
            continue
        m = json.loads(mp.read_text())
        vecs = []
        for f in m.get("faces", []):
            src = f.get("source"); bbox = f.get("bbox")
            if not src or not bbox:
                continue
            canon = alias_map.get(src, src)
            v = rec_index.get((canon, tuple(int(x) for x in bbox)))
            if v is None and canon != src:
                v = rec_index.get((src, tuple(int(x) for x in bbox)))
            if v is not None:
                vecs.append(v)
        if len(vecs) < 3:
            continue
        c = np.stack(vecs).mean(axis=0)
        n = float(np.linalg.norm(c))
        if n > 0:
            c = c / n
        centroids[fs_dir.name] = c
    return centroids
def _track_centroid(track):
    embs = [m["face"].get("embedding") for m in track["members"] if m["face"].get("embedding")]
    if not embs:
        return None
    arr = np.array(embs, dtype=np.float32)
    c = arr.mean(axis=0)
    n = float(np.linalg.norm(c))
    return c / n if n > 0 else c
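# Identity tagging used by cmd_score below: the per-faceset centroids and the merged track
# centroid are all L2-normalized, so np.dot is cosine similarity; a segment only receives
# the best-matching faceset name when that similarity reaches IDENTITY_TAG_THRESHOLD,
# otherwise it stays untagged in the plan and the report.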
def cmd_score(args):
    tr = json.loads(Path(args.tracks).read_text())
    inv = json.loads(Path(args.inventory).read_text())
    inv_by_path = {v["path"]: v for v in inv["videos"]}

    cfg = {
        "yaw_max": args.max_yaw, "pitch_max": args.max_pitch,
        "face_min": args.min_face, "det_min": args.min_det,
        "bridge_s": args.bridge_gap,
    }

    centroids = {}
    if not args.no_identity:
        print("[score] loading faceset centroids ...", file=sys.stderr)
        t0 = time.time()
        centroids = load_faceset_centroids()
        print(f"[score] {len(centroids)} active faceset centroids loaded in {time.time()-t0:.1f}s",
              file=sys.stderr)

    n_total_tracks = 0
    n_accepted_tracks = 0
    # collect per-track candidate segments first; merging happens per-video below
    per_video_candidates: dict[str, list] = {}
    track_centroids_by_video: dict[str, dict] = {}
    for video_path, tracks in tr["by_video"].items():
        per_video_candidates.setdefault(video_path, [])
        track_centroids_by_video.setdefault(video_path, {})
        for ti, track in enumerate(tracks):
            n_total_tracks += 1
            runs, stats = _build_segments(track, cfg)
            if stats["frac_pass"] < args.track_gate_frac:
                continue
            if not runs:
                continue
            n_accepted_tracks += 1
            track_centroids_by_video[video_path][ti] = _track_centroid(track)
            for (s, e) in runs:
                per_video_candidates[video_path].append({
                    "video_path": video_path,
                    "track_idx": ti,
                    "scene_idx": track["scene_idx"],
                    "start_s": s,
                    "end_s": e,
                    "stats": stats,
                })

    plan = []
    for video_path, segs in per_video_candidates.items():
        if not segs:
            continue
        # merge across tracks within the same scene if gap <= merge_gap_s
        merged = _merge_close_segments(segs, args.merge_gap)
        # apply min/max duration (split long, drop short)
        merged = _split_long_segments(merged, args.min_dur, args.max_dur)
        for s in merged:
            tag = None
            tag_sim = None
            # identity from union of contributing tracks' centroids
            if centroids:
                track_centroid_list = [
                    track_centroids_by_video[video_path].get(ti)
                    for ti in s.get("track_idxs", [s.get("track_idx")])
                ]
                track_centroid_list = [c for c in track_centroid_list if c is not None]
                if track_centroid_list:
                    union = np.stack(track_centroid_list).mean(axis=0)
                    nm = float(np.linalg.norm(union))
                    if nm > 0:
                        union = union / nm
                    sims = {name: float(np.dot(c, union)) for name, c in centroids.items()}
                    best = max(sims, key=sims.get)
                    if sims[best] >= IDENTITY_TAG_THRESHOLD:
                        tag = best; tag_sim = round(sims[best], 4)
            plan.append({
                "video_path": video_path,
                "track_idxs": s.get("track_idxs", [s.get("track_idx")]),
                "scene_idx": s["scene_idx"],
                "start_s": round(s["start_s"], 3),
                "end_s": round(s["end_s"], 3),
                "duration_s": round(s["end_s"] - s["start_s"], 3),
                "member_count": s.get("member_count", s["stats"]["n"]),
                "pass_count": s.get("pass_count", s["stats"]["n_pass"]),
                "stats": s["stats"],
                "identity_tag": tag,
                "identity_sim": tag_sim,
                "uuid": uuid.uuid4().hex[:12],
            })

    plan.sort(key=lambda p: (p["video_path"], p["start_s"]))
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({
        "thresholds": {
            "yaw_max": args.max_yaw, "pitch_max": args.max_pitch,
            "face_min": args.min_face, "blur_min": QUALITY_BLUR_MIN,
            "det_min": args.min_det, "track_gate_frac": args.track_gate_frac,
            "bridge_s": args.bridge_gap, "merge_gap_s": args.merge_gap,
            "min_dur_s": args.min_dur, "max_dur_s": args.max_dur,
            "identity_tag_threshold": IDENTITY_TAG_THRESHOLD,
        },
        "totals": {
            "tracks_total": n_total_tracks, "tracks_accepted": n_accepted_tracks,
            "segments": len(plan),
        },
        "plan": plan,
    }, indent=2))
    print(f"[score] {n_accepted_tracks}/{n_total_tracks} tracks accepted -> {len(plan)} segments "
          f"-> {out}", file=sys.stderr)
# ----------------------------- cut -----------------------------

def cmd_cut(args):
    plan = json.loads(Path(args.plan).read_text())
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    if args.clean:
        # remove only existing UUID-named clips + sidecars (12-char hex), keeping any other files
        import re as _re
        uuid_pat = _re.compile(r"^[0-9a-f]{12}\.(mp4|json)$")
        n_removed = 0
        for child in out_dir.iterdir():
            if child.is_file() and uuid_pat.match(child.name):
                child.unlink()
                n_removed += 1
            elif child.is_dir() and _re.match(r"^[A-Za-z0-9_.-]+$", child.name):
                # subfolder of prior runs — clear UUID files inside, then remove if empty
                for inner in child.iterdir():
                    if inner.is_file() and uuid_pat.match(inner.name):
                        inner.unlink()
                        n_removed += 1
                try:
                    child.rmdir()
                except OSError:
                    pass
        if n_removed:
            print(f"[clean] removed {n_removed} prior UUID clips/sidecars", file=sys.stderr)

    n_done = 0
    n_err = 0
    sidecars = []
    for seg in plan["plan"]:
        sub = Path(seg["video_path"]).stem
        seg_dir = out_dir / sub
        seg_dir.mkdir(parents=True, exist_ok=True)
        out_video = seg_dir / f"{seg['uuid']}.mp4"
        if out_video.exists() and not args.force:
            continue
        s = seg["start_s"]; d = seg["duration_s"]
        cmd = [
            "ffmpeg", "-y", "-loglevel", "error",
            "-ss", f"{s}",
            "-i", seg["video_path"],
            "-t", f"{d}",
            "-c", "copy",
            "-avoid_negative_ts", "make_zero",
            str(out_video),
        ]
        r = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
        if r.returncode != 0 or not out_video.exists() or out_video.stat().st_size < 1024:
            print(f"[cut-err] {seg['uuid']} {seg['video_path']}@{s}+{d}: {r.stderr.strip()[:200]}",
                  file=sys.stderr)
            n_err += 1
            if out_video.exists() and out_video.stat().st_size < 1024:
                out_video.unlink()
            continue
        if args.write_sidecar:
            sidecar = seg_dir / f"{seg['uuid']}.json"
            sidecar.write_text(json.dumps({
                "uuid": seg["uuid"],
                "source_video": seg["video_path"],
                "source_basename": Path(seg["video_path"]).name,
                "start_s": s, "end_s": seg["end_s"], "duration_s": d,
                "scene_idx": seg["scene_idx"],
                "track_idxs": seg.get("track_idxs", [seg.get("track_idx")]),
                "member_count": seg.get("member_count"),
                "pass_count": seg.get("pass_count"),
                "stats": seg["stats"],
                "identity_tag": seg["identity_tag"],
                "identity_sim": seg["identity_sim"],
                "thresholds": plan["thresholds"],
            }, indent=2))
            sidecars.append(sidecar)
        n_done += 1
    print(f"[cut] {n_done} clips written, {n_err} errors -> {out_dir}", file=sys.stderr)
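# Note on the ffmpeg invocation in cmd_cut: putting -ss before -i together with "-c copy"
# seeks by keyframe and remuxes without re-encoding, so a clip can start slightly before
# the requested start_s (keyframe-aligned); -avoid_negative_ts make_zero shifts the copied
# timestamps so the clip starts at zero.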
# ----------------------------- report -----------------------------

def cmd_report(args):
    plan = json.loads(Path(args.plan).read_text())
    out_dir = Path(args.out)
    out_dir.mkdir(parents=True, exist_ok=True)
    thumbs_dir = out_dir / "thumbs"
    thumbs_dir.mkdir(exist_ok=True)
    output_dir = Path(args.output_dir)

    # group by video
    by_video: dict[str, list] = {}
    for seg in plan["plan"]:
        by_video.setdefault(seg["video_path"], []).append(seg)

    # generate thumbs from each segment's first frame via ffmpeg
    print(f"[report] generating thumbs for {len(plan['plan'])} segments", file=sys.stderr)
    for seg in plan["plan"]:
        thumb = thumbs_dir / f"{seg['uuid']}.jpg"
        if thumb.exists():
            continue
        s = seg["start_s"] + 0.1
        cmd = [
            "ffmpeg", "-y", "-loglevel", "error",
            "-ss", f"{s}",
            "-i", seg["video_path"],
            "-frames:v", "1",
            "-vf", "scale=240:-1",
            str(thumb),
        ]
        subprocess.run(cmd, capture_output=True, timeout=30)

    # render
    rows = []
    rows.append("<h1>Video target preprocessing — review</h1>")
    t = plan["totals"]
    th = plan["thresholds"]
    rows.append(f"<p>Tracks accepted: {t['tracks_accepted']}/{t['tracks_total']}; "
                f"segments emitted: {t['segments']}.<br>"
                f"Thresholds: pose ≤{th['yaw_max']}°yaw / {th['pitch_max']}°pitch, "
                f"face_short ≥{th['face_min']}px, det ≥{th['det_min']}, "
                f"track-gate ≥{int(100*th['track_gate_frac'])}%, "
                f"duration {th['min_dur_s']}–{th['max_dur_s']}s. "
                f"Output dir: <code>{output_dir}</code></p>")
    nav = " · ".join(f"<a href='#v{i}'>{Path(v).name}</a>"
                     for i, v in enumerate(by_video.keys()))
    rows.append(f"<div class='nav'>{nav}</div>")
    for vi, (video_path, segs) in enumerate(by_video.items()):
        rows.append(f"<section id='v{vi}' class='vid'>")
        rows.append(f"<h2>{Path(video_path).name} <small>({len(segs)} segments)</small></h2>")
        rows.append("<div class='cells'>")
        for seg in sorted(segs, key=lambda x: x["start_s"]):
            stats = seg["stats"]
            tag = seg["identity_tag"] or ""
            tag_sim = seg["identity_sim"]
            tag_html = (f"<span class='tag'>{tag} ({tag_sim:.2f})</span>" if tag else "<span class='tag none'>untagged</span>")
            sub_name = Path(seg['video_path']).stem
            rows.append(
                f"<div class='cell'>"
                f"<a href='{output_dir}/{sub_name}/{seg['uuid']}.mp4'><img src='thumbs/{seg['uuid']}.jpg' loading='lazy'></a>"
                f"<div class='meta'>"
                f"<code>{sub_name}/{seg['uuid']}.mp4</code><br>"
                f"{seg['start_s']:.1f}s → {seg['end_s']:.1f}s ({seg['duration_s']:.1f}s)<br>"
                f"yaw={stats['yaw_med']:.0f}° size={stats['size_med']:.0f}px det={stats['det_med']:.2f}<br>"
                f"pass {stats['n_pass']}/{stats['n']}<br>"
                f"{tag_html}"
                f"</div></div>"
            )
        rows.append("</div></section>")
    html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Video targets review</title>
<style>
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
h1, h2 {{ margin-top: 1em; }} h2 {{ border-bottom: 1px solid #333; padding-bottom: 4px; }}
small {{ color:#999; font-weight:normal; }}
section.vid {{ background:#1a1a1a; border-radius:6px; padding:12px; margin:12px 0; }}
.cells {{ display:flex; flex-wrap:wrap; gap:8px; }}
.cell {{ background:#222; border-radius:4px; padding:6px; width:260px; font-size:11px; font-family:monospace; }}
.cell img {{ width:100%; height:auto; border-radius:3px; }}
.meta {{ padding-top:4px; line-height:1.4; }}
.tag {{ display:inline-block; padding:1px 6px; background:#5fa05f; color:#000; border-radius:2px; }}
.tag.none {{ background:#444; color:#aaa; }}
.nav {{ position:sticky; top:0; background:#111; padding:.5em 0; border-bottom:1px solid #333; font-size:12px; }}
a {{ color:#6cf; }}
code {{ background:#000; padding:1px 4px; border-radius:2px; }}
</style></head>
<body>
{''.join(rows)}
</body></html>"""
    out_html = out_dir / "index.html"
    out_html.write_text(html)
    print(f"[report] -> {out_html}", file=sys.stderr)
# ----------------------------- main -----------------------------

def main():
    ap = argparse.ArgumentParser()
    sub = ap.add_subparsers(dest="cmd", required=True)

    s = sub.add_parser("scan")
    s.add_argument("--input", default=str(DEFAULT_INPUT))
    s.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
    s.add_argument("--recursive", action="store_true")
    s.add_argument("--out", required=True)
    s.set_defaults(func=cmd_scan)

    sc = sub.add_parser("scenes")
    sc.add_argument("--inventory", required=True)
    sc.add_argument("--out-dir", required=True)
    sc.add_argument("--only", default=None, help="comma-separated basenames to limit run")
    sc.add_argument("--force", action="store_true")
    sc.set_defaults(func=cmd_scenes)

    st = sub.add_parser("stage")
    st.add_argument("--inventory", required=True)
    st.add_argument("--scenes-dir", required=True)
    st.add_argument("--sample-fps", type=float, default=SAMPLE_FPS)
    st.add_argument("--out", required=True)
    st.set_defaults(func=cmd_stage)

    m = sub.add_parser("merge")
    m.add_argument("--results", required=True)
    m.add_argument("--out", required=True)
    m.set_defaults(func=cmd_merge)

    tr = sub.add_parser("track")
    tr.add_argument("--frames", required=True)
    tr.add_argument("--scenes-dir", required=True)
    tr.add_argument("--inventory", required=True)
    tr.add_argument("--sample-fps", type=float, default=SAMPLE_FPS)
    tr.add_argument("--out", required=True)
    tr.set_defaults(func=cmd_track)

    sc2 = sub.add_parser("score")
    sc2.add_argument("--tracks", required=True)
    sc2.add_argument("--inventory", required=True)
    sc2.add_argument("--out", required=True)
    sc2.add_argument("--no-identity", action="store_true")
    sc2.add_argument("--max-yaw", type=float, default=QUALITY_YAW_MAX)
    sc2.add_argument("--max-pitch", type=float, default=QUALITY_PITCH_MAX)
    sc2.add_argument("--min-face", type=int, default=QUALITY_FACE_MIN)
    sc2.add_argument("--min-det", type=float, default=QUALITY_DET_MIN)
    sc2.add_argument("--track-gate-frac", type=float, default=TRACK_GATE_FRAC)
    sc2.add_argument("--bridge-gap", type=float, default=SEGMENT_BRIDGE_S,
                     help="bridge within-track failure gaps up to this many seconds")
    sc2.add_argument("--merge-gap", type=float, default=SEGMENT_MERGE_GAP_S,
                     help="merge across-track segments in same scene if within this gap")
    sc2.add_argument("--min-dur", type=float, default=SEGMENT_MIN_S)
    sc2.add_argument("--max-dur", type=float, default=SEGMENT_MAX_S)
    sc2.set_defaults(func=cmd_score)

    cu = sub.add_parser("cut")
    cu.add_argument("--plan", required=True)
    cu.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
    cu.add_argument("--force", action="store_true")
    cu.add_argument("--clean", action="store_true",
                    help="remove prior UUID-named clips before cutting (preserves non-UUID files)")
    cu.add_argument("--write-sidecar", action="store_true",
                    help="emit <uuid>.json provenance sidecar alongside each clip (default off)")
    cu.set_defaults(func=cmd_cut)

    rp = sub.add_parser("report")
    rp.add_argument("--plan", required=True)
    rp.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
    rp.add_argument("--out", required=True)
    rp.set_defaults(func=cmd_report)

    args = ap.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()