Compare commits

...

12 Commits

Author SHA1 Message Date
Peter 308597ebf0 Update video preprocessing doc with full-corpus results
After completing the rest-of-corpus run, update docs/analysis to reflect
the final numbers across all three batches (test + 13-file + 45-file)
and surface the numerical lessons:
- 1,984 segments / 10.78h accepted content from 19.76h / 61 input videos
- 0 worker errors across 143,137 sampled frames
- rest batch sustained 15.78 fps from a fresh JSONL start (vs 7.5 fps for
  the migrated batch), confirming the append-only fix is the right
  steady-state design
- skip-pattern note: 5-digit basename numbers need full padding
  (0005[0-9] not 005[0-9]) — bit me on the first relaunch
- documented SIDECAR=yes opt-in for the chain script

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 16:47:59 +02:00
Peter 7960dec350 Make per-clip sidecar JSONs opt-in (default off)
Previously every video_target_pipeline cut wrote a <uuid>.json provenance
sidecar alongside each <uuid>.mp4. The same provenance is already in the
per-batch plan.json, so the per-clip sidecars are redundant unless a
downstream tool wants each clip self-describing in isolation.

- video_target_pipeline.py cut: new --write-sidecar flag, default off.
- run_video_pipeline.sh: new SIDECAR env var (default "no"), passes
  --write-sidecar when SIDECAR=yes.
- README + docs/analysis/video-target-preprocessing.md updated.

The 1,984 already-emitted sidecars in /mnt/x/src/vd/ct/ct_src_*/ have
been deleted (1.5 MB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 12:44:27 +02:00
Peter 998fa79f81 Add target-side video preprocessing pipeline
Preprocesses a folder of video files into UUID-named clips suitable as
target inputs for roop-unleashed-style face-swap. Counterpart to the
faceset (source-side) tooling.

work/video_target_pipeline.py — orchestration with subcommands
  scan / scenes / stage / merge / track / score / cut / report. Quality
  gates default to face-sets-can-handle-side-profile values (yaw<=75°,
  pitch<=45°, face_short>=80px, det>=0.5). Cross-track segment merge
  fuses adjacent-in-time tracks within the same scene up to 2s gap.
  Output organized into <output_dir>/<source_stem>/<uuid>.mp4 +
  <uuid>.json sidecar with full provenance.

work/video_face_worker.py — Windows DML face detect+embed worker. Uses
  JSONL append-only for results.jsonl: a critical perf fix (re-
  serializing the monolithic 245MB results.json on every flush was the
  dominant cost in the first attempt, dropping throughput to 0.5 fps).
  Append-only got it to 13+ fps, ~7.5 fps cumulative across the first
  6.18h batch. Also uses seek-once-per-video + sequential cap.grab()
  between samples to dodge cv2 per-sample seek pathology on long H.264.
  Legacy results.json is auto-migrated to .jsonl on first load.

work/run_video_pipeline.sh — generic chain driver, parameterized via
  WORK / INPUT_DIR / OUTPUT_DIR / FILTER_FROM / SKIP_PATTERN / MAX_DUR /
  IDENTITY env vars. work/status_video_pipeline.sh — generic status
  helper.

First production batch (ct_src_00050..00062, 13 files, 6.18h input):
600 emitted segments, 239.5min accepted content (64.6% of input), 254
segments built from >=2 tracks (cross-track merge), 1h43m wall clock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:38:50 +02:00
Peter 49a43c7685 Add post-export corpus maintenance pipeline
Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores
  via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level
  quarantine at 40% domain dominance.

- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.

- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.

- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes — cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega — ~7x embed_worker because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes
preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full
reversibility. Master manifest gains masked[], merged[], plus per-run
provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:41:18 +02:00
Peter e66c97fd58 Document Immich nic run: 95 new facesets, manifest 216 -> 311
Overnight 2026-04-27 nic finalize completed. Per-user API key worked as
expected. The pipeline survived one mid-stage Immich outage via the
circuit breaker added in 62dba3d -- script paused, operator confirmed
connectivity, same command resumed from saved state.json.

Embed (Windows DML): 7,834 images -> 15,627 face records + 1 noface in
59 minutes (2.2 img/s end-to-end).

Cluster: 6,770 of 15,627 faces (43%) matched existing canonical
identities at cos-dist <= 0.45; biggest hits faceset_002 (+3,261),
faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408). The
faceset_008 and faceset_007 hits are noteworthy cross-matches: those
are hand-sorted "sab" and "s" identities, recurring frequently in nic's
library.

Of the 8,857 unmatched faces, 3,787 raw clusters at threshold 0.55,
129 surviving refine gates, 95 emitted as new facesets at faceset_265+.

Top-level facesets_swap_ready/manifest.json: 216 -> 311 substantive
facesets + 68 thin_eras unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 00:32:11 +02:00
Peter 62dba3ddb3 Add Immich outage circuit breaker; document nic run + Tailscale quirk
work/immich_stage.py:
- Startup probe of /server/version (exit 2 if unreachable).
- Outage circuit breaker: after OUTAGE_FAIL_STREAK=12 consecutive
  faces_error/download_error results, run a quick probe; if the probe
  also fails, persist state and exit with code 2 so a long unattended
  run can pause rather than silently churning through tens of thousands
  of retries during an upstream outage. Resume by re-running the same
  command -- state.json + queue.json are intact.

README:
- Document the nic run (per-user API key necessary; second pipeline
  invocation confirmed expected behavior; cleaner library than peter's
  with 0 internal byte-dupes vs 2,976).
- Mention the circuit breaker as the mechanism that keeps long
  unattended runs safe under the known Tailscale flicker pattern at
  this site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:36:11 +02:00
Peter 321fed01cc Add Immich import pipeline (WSL stage + Windows DML embed + cluster)
Three-piece workflow that imports a self-hosted Immich library and emits
new facesets without disturbing existing identity numbering:

- work/immich_stage.py (WSL): pages /search/metadata, parallel-fetches
  /faces?id= per asset, prefilters by face_short>=90 against bbox scaled
  to original-image coords, downloads originals, sha256-dedups against
  nl_full.npz and same-run staged files. 8-worker ThreadPoolExecutor
  doing the full /faces->filter->/original chain per asset; resumable
  via state.json. API URL + key come from IMMICH_URL / IMMICH_API_KEY
  env vars, label->UUID map from work/immich/users.json (gitignored).
- work/embed_worker.py (Windows venv at C:\face_embed_venv): runs
  insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider on
  AMD Radeon Vega via onnxruntime-directml. Produces a cache file in
  the same .npz schema as sort_faces.cmd_embed (loadable via
  load_cache). ~7.5x speedup over CPU end-to-end; embeddings bit-
  identical to CPU (cosine similarity 1.0000 across 8 sample faces).
- work/cluster_immich.py (WSL): mirrors cluster_osrc.py against an
  immich_<user>.npz. Builds existing identity centroids from canonical
  faceset_NNN/ in facesets_swap_ready/, drops matches at <=0.45,
  clusters the rest at 0.55, applies refine gates, hands off to
  cmd_export_swap. Numbers new facesets past the existing maximum.
- work/finalize_immich.sh: chains queue->Windows embed->cache copy->
  cluster_immich, with logging.

The 2026-04-26 run on https://fotos.computerliebe.org (Immich v2.7.2)
processed 53,842 admin-accessible assets, staged 10,261, embedded
19,462 face records on Vega DML in 64.6 min, matched 8,103 (42%) to
existing identities, and emitted 185 new facesets (faceset_026..264
with gaps). facesets_swap_ready/ went from 31 to 216 substantive
facesets.

Important caveat surfaced: /search/metadata's userIds filter is
silently ignored when the API key is bound to a different user, so
this run can't enumerate other users' libraries from the admin key.
A per-user API key would be required for nic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 18:14:26 +02:00
Peter 7ecbfae981 Add osrc identity-discovery pipeline + run analysis
work/cluster_osrc.py mirrors build_folders.py's shape (synthesize a
refine_manifest, hand off to cmd_export_swap, relocate, merge top-level
manifest) but discovers identities by clustering rather than asserting
them by folder. Drops faces already covered by existing identity
centroids, clusters the rest at 0.55, applies refine-equivalent gates
with min_faces=6, numbers new facesets past the existing maximum so
faceset_001..NNN are never disturbed.

The 2026-04-26 run on /mnt/x/src/osrc produced faceset_020..025 (sizes
4-26 exported PNGs); analysis writeup in docs/analysis/.

README also notes the refine-renumbers caveat in passing — extend +
orchestration script is the safe pattern; cmd_refine is for fresh
clusters only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:40:19 +02:00
Peter 1d82d71e68 Force-track work/build_folders.py
The README documents work/build_folders.py as the orchestration script
for hand-sorted-folder identity import, but it was excluded by the
work/ gitignore. Force-track it for parity with the other orchestration
scripts (age_split_001.py, check_faceset001_age.py) so the documented
workflow points at code that exists in the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:13:56 +02:00
Peter e48dd8aec7 Add age-split run analysis for faceset_001
Documents the 2026-04-26 split of faceset_001 (707 curated faces) into
6 substantive era buckets + 68 thin fragments, including the readiness
probe evidence, the anchor-based assignment rationale (replaces
transitive union-find that caused year-drift), and the re-run / apply-
to-other-identity workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:10:37 +02:00
Peter 03a0c75531 Document hand-sorted-folder import + age-split workflow
- README: document work/build_folders.py (hand-sorted folder identities)
  and the new age-split workflow for splitting a long-running identity
  into era-specific facesets after clustering.
- Force-track work/age_split_001.py and work/check_faceset001_age.py;
  these are the worked example + readiness probe for faceset_001 and
  the template for splitting any other identity by EXIF era.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:08:25 +02:00
Peter 4d7a8780de Document enrich + export-swap + extend; add swap-ready usage guide
README.md now covers all six subcommands (embed, cluster, refine, dedup,
extend, enrich, export-swap), an end-to-end pipeline recipe, the delta
recipe for merging a new source into an existing result, the quality-
weight formula used by export-swap, and the GFPGAN blend recommendation
at swap time (0.85, overriding roop-unleashed's 0.65 default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:09:01 +02:00
26 changed files with 8005 additions and 36 deletions
+372 -28
@@ -1,56 +1,400 @@
# face-sets
Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).
## Pipeline
`sort_faces.py` is a single-file CLI with six subcommands:
| step | what it does |
|-------------|-------------------------------------------------------------------------------------------------------------|
| embed | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup. |
| cluster | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest. |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`. |
| dedup | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`. |
| extend | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich | Re-detect each cached face to persist landmark_2d_106, landmark_3d_68, pose (pitch/yaw/roll) into cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |
### Design principles
- **embed is resumable and incremental.** It loads any existing cache at the target path and only hashes / embeds files it has not seen. Atomic flush every 50 new files, so a mid-run crash loses at most ~50 embeddings.
- **Byte-identical duplicates are sha256-grouped at listing time.** The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster` / `refine` / `export-swap`, so each on-disk location is represented.
- **`safe_dst_name` always flattens the absolute path.** This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- **Caches and outputs stay out of git** via `.gitignore`; defaults live under `work/`.
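A minimal sketch of the listing-time sha256 grouping (helper names here are hypothetical; the real cache layout lives in `sort_faces.py`):
```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def group_by_hash(paths):
    """First path per hash is canonical (embedded once);
    later paths with the same hash become path_aliases entries."""
    canonical, aliases = {}, {}   # hash -> path, canonical path -> [alias paths]
    for p in sorted(paths):
        digest = sha256_file(p)
        if digest in canonical:
            aliases.setdefault(str(canonical[digest]), []).append(str(p))
        else:
            canonical[digest] = p
    return canonical, aliases
```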
## Typical end-to-end run
```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted
# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"
# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"
# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"
# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"
# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"
# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
"$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
--raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
### Merging a new source into an existing result
```bash
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"
# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"
# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
"$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
--raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
### Importing hand-sorted folders as identities
When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool — the identity is asserted, not inferred. The
orchestration script `work/build_folders.py` covers this case:
- For each trusted folder, it filters cache records that fall under it, builds an
identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so
bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) into every
identity centroid within a tight cosine cutoff (default 0.45). A multi-identity
photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures
each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the
emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and
merges the new entries into the canonical `facesets_swap_ready/manifest.json`
(existing facesets are left untouched).
```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done
# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"
# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```
The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.)
is the only thing to edit when adding more hand-sorted folders later.
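For reference, the two-pass outlier-rejection centroid reduces to a few lines of NumPy (a sketch using the 0.55 → 0.45 defaults quoted above; function names are illustrative, not the script's actual internals):
```python
import numpy as np

def identity_centroid(embs: np.ndarray, pass1: float = 0.55, pass2: float = 0.45) -> np.ndarray:
    """embs: (N, 512) L2-normalized arcface embeddings from one trusted folder.
    Two passes of cos-dist outlier rejection drop bystanders in group photos."""
    def tighten(e: np.ndarray, thresh: float) -> np.ndarray:
        c = e.mean(axis=0)
        c /= np.linalg.norm(c)
        dist = 1.0 - e @ c          # cosine distance to the current centroid
        return e[dist <= thresh]
    kept = tighten(embs, pass1)     # pass 1: loose cut
    kept = tighten(kept, pass2)     # pass 2: tight cut around the re-centered mean
    c = kept.mean(axis=0)
    return c / np.linalg.norm(c)
```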
### Splitting an identity by era (age sub-clustering)
Long-running source corpora produce identities that span 10+ years. The 2009 face
and the 2024 face of the same person sit in the same cluster (correctly — same
identity), but a single averaged embedding pulled from that cluster blurs across
ages. For face-swap output that should target a specific period, the identity
needs to be split by era *after* the identity is established.
`work/age_split_001.py` is a worked example for `faceset_001` and a template for
any other identity. The pipeline is:
- **Probe first** with `work/check_faceset001_age.py` — report intra-cluster
pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and
EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with
distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/`
(manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original
source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one
re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance
agglomerative, average linkage).
- **Anchor-based fragment assignment** (not transitive merge — that caused
year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments
attach to the single nearest anchor only if both the centroid distance ≤ 0.40
AND the dominant EXIF year is within ±5 years. Fragments with no qualifying
anchor remain standalone (and end up THIN-tagged downstream).
- **EXIF year per source path** with on-disk caching at
`work/cache/age_split_exif.json` — the Windows-mount EXIF read is the
slowest step, so re-runs after a parameter tweak are nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face
square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`,
human-readable `<era>.txt` marker. Eras with < 20 face records also drop a
`THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to
`facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be
moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`),
leaving only the substantive era buckets at the top level.
```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py
# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```
For the `faceset_001` run on the 5,260-face `nl_full.npz`, this produced 6 substantive
era buckets (2005-10, 2010-13, 2011, 2014-17, 2018-19, 2018-20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under `_thin/`.
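A condensed sketch of the anchor-based fragment assignment described above (data shapes are assumed; the real logic lives in `work/age_split_001.py`):
```python
import numpy as np

ANCHOR_MIN_SIZE, CENT_MAX, YEAR_MAX = 20, 0.40, 5

def assign_fragments(subs):
    """subs: list of dicts with 'centroid' (unit vector), 'size', 'dom_year'.
    Anchors never merge with each other; a fragment attaches to at most one anchor."""
    anchors = [s for s in subs if s["size"] >= ANCHOR_MIN_SIZE]
    fragments = [s for s in subs if s["size"] < ANCHOR_MIN_SIZE]
    assignment = {}   # fragment index -> anchor index, or None = standalone (THIN)
    for i, f in enumerate(fragments):
        dists = [1.0 - f["centroid"] @ a["centroid"] for a in anchors]
        j = int(np.argmin(dists)) if anchors else None
        ok = (
            j is not None
            and dists[j] <= CENT_MAX
            and f["dom_year"] is not None
            and anchors[j]["dom_year"] is not None
            and abs(f["dom_year"] - anchors[j]["dom_year"]) <= YEAR_MAX
        )
        assignment[i] = j if ok else None
    return anchors, fragments, assignment
```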
### Discovering new identities in a mixed bucket
A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the
hand-sorted case: identities have to be discovered, not asserted, but should
not collide with already-known identities or scramble their numbering.
`work/cluster_osrc.py` is the worked example. The pipeline:
- **Filter cache to the source root**, including any byte-aliased path that
resolves under it.
- **Drop already-covered faces** by comparing each candidate to the centroids
of the existing canonical facesets at the `EXISTING_MATCH_THRESHOLD`
(default 0.45 — same cutoff as `build_folders.py`'s osrc routing). These
faces are already routed by `extend` / `build_folders.py` and shouldn't
seed new facesets.
- **Cluster the unmatched** at cos-dist 0.55 (matches the `extend` default
for the new-cluster phase).
- **Apply `refine`-equivalent gates** per cluster: `face_short`, `blur`,
`det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for
clusters of size ≥ 4. Keep clusters whose surviving unique-source-path
count is ≥ `MIN_FACES`.
- **Number new facesets past the existing maximum** (`START_NNN`), so
`faceset_001..NNN` are never disturbed.
- **Synthesize a refine manifest** and run `cmd_export_swap` against it,
then move the resulting dirs into `facesets_swap_ready/` and append to the
top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance
marker.
Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new
source — the `cluster_osrc.py` step then operates against the canonical
cache and doesn't need `raw_full/` for input:
```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
# person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
--refine-out "$OUT/facesets_full"
# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
# without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run
# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```
For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by
existing identities), this produced 6 new facesets (`faceset_020..025`,
sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to
export-swap's tighter `min_face_short=100` gate).
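The drop-already-covered step is a nearest-centroid test; a minimal NumPy sketch (names hypothetical):
```python
import numpy as np

def split_covered(cands: np.ndarray, centroids: np.ndarray, thresh: float = 0.45):
    """cands: (N, 512), centroids: (K, 512), both L2-normalized.
    A candidate within cos-dist `thresh` of ANY existing centroid is already
    covered by extend / build_folders.py and must not seed a new faceset."""
    dist = 1.0 - cands @ centroids.T        # (N, K) cosine distances
    covered = dist.min(axis=1) <= thresh
    return cands[~covered], covered         # only unmatched faces go to clustering
```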
### Importing identities from a self-hosted Immich library
`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py`
together import an Immich library at scale, with the embed step running on
a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:
1. **`work/immich_stage.py` (WSL)** — pages every IMAGE asset via
`/search/metadata`, fetches each asset's `/faces?id=` to read Immich's
own ML-driven bboxes, scales each bbox to original-image coordinates,
and prefilters by `face_short ≥ 90`. For survivors it downloads the
original, sha256-deduplicates against the canonical `nl_full.npz` and
against same-run staged files, and saves to
`/mnt/x/src/immich/<user>/<rel>`. Writes a `queue.json` that the embed
worker consumes. 8 concurrent worker threads run the full per-asset
I/O chain (`/faces` → filter → `/original`) so 8 workers ≈ 8× the
serial throughput.
2. **`work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`)** —
loads `insightface.FaceAnalysis(buffalo_l)` with the
`DmlExecutionProvider` and runs detection + landmarks + recognition
   over the queue. Produces a `.npz` cache whose schema is identical to
   what `sort_faces.py:cmd_embed` writes, so the result is
directly loadable by `load_cache()`. The cache already includes the
post-`enrich` fields (`landmark_2d_106`, `landmark_3d_68`, `pose`)
because FaceAnalysis returns them for free. AMD Vega gives ~7.5×
real-pipeline speedup over CPU.
3. **`work/cluster_immich.py` (WSL)** — mirrors `cluster_osrc.py`'s
shape but reads from `immich_<user>.npz`. Builds existing-identity
centroids from every canonical `faceset_NNN/` in
`facesets_swap_ready/` (skipping era splits and `_thin/`), drops
immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55,
applies refine gates, numbers new facesets past the existing maximum,
and feeds `cmd_export_swap` via a synthetic manifest.
`work/finalize_immich.sh <user>` chains queue → Windows embed → cache
copy back → cluster_immich, with logging.
The Immich admin API key + base URL come from environment variables:
```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=... # admin or per-user key
python work/immich_stage.py --user peter --workers 8
bash work/finalize_immich.sh peter
```
For the 2026-04-26 run against `https://fotos.computerliebe.org` (Immich
v2.7.2), with the admin API key:
| step | result |
|------|------|
| stage | 53,842 assets seen, **10,261 staged** (~10 GB), 978 byte-deduped against `nl_full.npz`, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
| Windows DML embed | 19,462 face records + 1 noface in **64.6 min** (2.6 img/s end-to-end) |
| matched existing identities | **8,103 of 19,480 (42%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → **185 emitted** as `faceset_026..264` (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |
A second 2026-04-26 run with **nic's per-user API key** confirmed the
expected behavior: 25,777 of nic's IMAGE assets were enumerated (matching
her `/server/statistics` count of 25,786, off by 9 ≈ the transient errors
that didn't get marked seen), **7,834 staged** (30% face-bearing-with-big-face,
denser than peter's 19%), 519 byte-deduped vs `nl_full.npz`, **0 internal
byte-duplicates** (cleaner library than peter's 2,976), 54 transient errors.
Embed + cluster on the nic queue:
| step | result |
|------|------|
| Windows DML embed | 15,627 face records + 1 noface in **59 min** (2.2 img/s end-to-end), 7 load errors |
| matched existing identities | **6,770 of 15,627 (43%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+3,261), faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408) |
| new clusters | 3,787 at threshold 0.55 → 129 surviving refine gates → **95 emitted** as `faceset_265..NNN` (gaps where export-swap's 0.45 outlier dropped clusters below the export bar) |
Top-level `facesets_swap_ready/manifest.json` after both Immich runs:
**311 substantive facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted +
6 era splits + 6 osrc-discovered + 185 peter-Immich + 95 nic-Immich) +
68 thin_eras under `_thin/`.
`work/immich_stage.py` carries a built-in **outage circuit breaker**:
after 12 consecutive HTTP errors it probes Immich; if that probe also
fails, the script exits cleanly with code 2, state preserved. This made
the nic run survive a mid-stage Immich outage — the script paused, the
operator confirmed connectivity was back, and the same command resumed
from the saved `state.json` without re-fetching what was already done.
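The breaker's control flow, roughly (a sketch; `process_one`, `probe_server`, and `save_state` stand in for the script's internals):
```python
import sys

OUTAGE_FAIL_STREAK = 12

def drain_queue(assets, process_one, probe_server, save_state):
    """process_one returns e.g. "ok" / "faces_error" / "download_error".
    Hypothetical names; the real loop lives in work/immich_stage.py."""
    streak = 0
    for asset in assets:
        result = process_one(asset)
        if result in ("faces_error", "download_error"):
            streak += 1
            if streak >= OUTAGE_FAIL_STREAK:
                if not probe_server():    # quick /server/version probe
                    save_state()          # state.json + queue.json stay intact
                    sys.exit(2)           # pause; re-running the same command resumes
                streak = 0                # server is up: failures were asset-local
        else:
            streak = 0
```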
**Important caveats for Immich v2.7.2**:
- The `userIds` filter on `/search/metadata` is **silently ignored** when
the API key is bound to a different user. The "import everything the
API key can see" semantics are what you actually get; cross-user
isolation is enforced server-side.
- `/server/statistics` reports counts that under-count what
`/search/metadata` actually returns (e.g. external library
thumbnail-dirs that got indexed because the import path included them).
Don't trust the statistics number as a denominator.
- A meaningful fraction of `originalPath`-based assets are *Immich's own
thumbnails* (`<library_root>/thumbs/.../-preview.jpeg`) — included if
the external library's import path covers the thumbs directory and the
exclusion patterns don't list `**/thumbs/**`. For our run, 5,563 of
10,261 staged were thumbnails. They embed and cluster fine but the
resulting faces are lower-resolution.
## Key defaults
`refine`:
| flag | default | meaning |
|-------------------------|--------:|---------|
| `--initial-threshold` | 0.55 | cosine distance for stage-1 clustering |
| `--merge-threshold` | 0.40 | centroid-level merge of over-split clusters |
| `--outlier-threshold` | 0.55 | drop face if cosine dist from cluster centroid exceeds this (only if cluster ≥ 4) |
| `--min-faces` | 15 | minimum unique images per faceset |
| `--min-short` | 90 | minimum short-edge pixels of face bbox |
| `--min-blur` | 40.0 | Laplacian-variance blur gate |
| `--min-det-score` | 0.6 | InsightFace detector score gate |
| `--mode` | copy | copy / move / symlink |
`export-swap`:
| flag | default | meaning |
|-------------------------------|--------:|---------|
| `--top-n` | 30 | size of the `<faceset>_topN.fsz` bundle |
| `--outlier-threshold` | 0.45 | tighter than refine; trims cluster boundary for averaging |
| `--pad-ratio` | 0.5 | padding around face bbox for PNG crop |
| `--out-size` | 512 | PNG output is square `out_size × out_size` |
| `--min-face-short` | 100 | export gate; stricter than refine's 90 |
| `--candidates` | off | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` | 0.55 | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score` | 0.40 | composite-quality floor for candidates |
The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each normalized to `[0, 1]`.
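Transcribed directly (a trivial sketch; each component is assumed pre-normalized to `[0, 1]` upstream):
```python
def composite_quality(frontality: float, det_score: float,
                      landmark_symmetry: float, face_size: float,
                      sharpness: float) -> float:
    """export-swap's ranking score; every input already normalized to [0, 1]."""
    return (0.30 * frontality
            + 0.20 * det_score
            + 0.20 * landmark_symmetry
            + 0.15 * face_size
            + 0.15 * sharpness)
```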
## Post-export corpus maintenance
The `sort_faces.py` pipeline above produces `facesets_swap_ready/`. Four
orchestration scripts under `work/` operate on that already-built corpus to
clean it up over time:
| script | purpose |
|--------|---------|
| `work/filter_occlusions.py` (+ Windows `work/clip_worker.py`) | Drop PNGs of masked / sun-glassed faces using open_clip ViT-L-14/dfn2b_s39b zero-shot scoring. Image-level threshold 0.7; faceset-level quarantine at 40% domain dominance. WSL stages a queue, Windows DML scores, WSL applies. See `docs/analysis/clip-occlusion-filter.md`. |
| `work/consolidate_facesets.py` | Merge duplicate identities (centroid cosine sim ≥ 0.55 with confident ≥ 0.65, **complete-linkage** to defeat single-link chaining). Pulls embeddings from cache, no GPU. See `docs/analysis/identity-consolidation-and-age-extend.md`. |
| `work/age_extend_001.py` | Slot newly-added PNGs into existing era buckets of `faceset_001` (anchor cosine distance ≤ 0.40 AND `|year_delta|` ≤ 5). Same anchor-fragment rule as `age_split_001.py`. |
| `work/dedup_optimize.py` (+ Windows `work/multiface_worker.py`) | (a) cross-family SHA256 byte-dedup, (b) within-faceset near-dup at cosine sim ≥ 0.95, (c) multi-face audit (re-detect via insightface, drop PNGs with face_count ≠ 1). Multi-face is the load-bearing roop invariant. See `docs/analysis/dedup-and-roop-optimization.md`. |
| `work/video_target_pipeline.py` (+ Windows `work/video_face_worker.py` + `work/run_video_pipeline.sh` chain) | Target-side preprocessing: scan a folder of videos → PySceneDetect shot-cuts → 2 fps frame sampling → DML face detection + embedding → IoU+embedding tracking → quality-gated segments (yaw≤75°, face≥80px, det≥0.5, ≥70% pass-rate, 1–120s duration, 2s cross-track merge gap) → ffmpeg stream-copy into UUID-named clips. Output organized into per-source subfolders. Provenance sidecars are opt-in (`cut --write-sidecar` or `SIDECAR=yes` env var); the full plan is always retained in the per-batch `plan.json`. See `docs/analysis/video-target-preprocessing.md`. |
All four operate idempotently and reversibly: dropped PNGs go to
`<faceset>/faces/_dropped/`, quarantined whole facesets go to
`facesets_swap_ready/_masked/` or `_merged/` (parallel to the existing
`_thin/`). The master `manifest.json` partitions entries across `facesets[]`,
`masked[]`, `thin_eras[]`, and `merged[]` arrays, plus per-run provenance
blocks (`occlusion_filter_run`, `merge_run`, `age_extend_runs`, `dedup_runs`,
`multiface_runs`).
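One pattern worth pulling out of the video-pipeline row above is the worker's append-only `results.jsonl`: re-serializing a monolithic `results.json` on every flush is O(everything written so far), while JSONL append is O(new records). The pattern is small enough to show in full (a sketch under the obvious schema assumptions, not the worker's actual code):
```python
import json

def append_results(path: str, records: list[dict]) -> None:
    """One JSON object per line, appended: flush cost scales with NEW records
    only, instead of re-writing the whole results file each time."""
    with open(path, "a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def load_results(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```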
## Downstream: roop-unleashed
The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.
Highly recommended at swap time: enable **Select post-processing = GFPGAN** with the **Original/Enhanced image blend ratio = 0.85** (default is 0.65 which is conservative). See `docs/analysis/facesets-downstream-refinement-evaluation.md` for the full evaluation.
## Layout
```
/opt/face-sets/
├─ README.md (this file)
├─ sort_faces.py (the tool)
├─ docs/
│ └─ analysis/
│ └─ facesets-downstream-refinement-evaluation.md
└─ work/ (gitignored except force-tracked .py / .sh)
├─ build_folders.py (hand-sorted-folder orchestration)
├─ check_faceset001_age.py (age-split readiness probe)
├─ age_split_001.py (age-split orchestration; faceset_001)
├─ age_extend_001.py (extends existing era buckets with new PNGs)
├─ cluster_osrc.py (mixed-bucket identity discovery)
├─ immich_stage.py (Immich library staging, parallel)
├─ embed_worker.py (Windows DML embed worker; C:\face_embed_venv\)
├─ cluster_immich.py (Immich identity discovery + export)
├─ finalize_immich.sh (chains queue → embed → cluster)
├─ filter_occlusions.py (CLIP zero-shot mask + sunglasses filter)
├─ clip_worker.py (Windows DML CLIP worker; C:\clip_dml_venv\)
├─ consolidate_facesets.py (duplicate-identity merger; complete-linkage)
├─ dedup_optimize.py (byte + near-dup + multi-face audit driver)
├─ multiface_worker.py (Windows DML multi-face audit worker)
├─ video_target_pipeline.py (video → swappable segment cuts orchestration)
├─ video_face_worker.py (Windows DML per-frame face worker; JSONL append-only)
├─ run_video_pipeline.sh (generic chain driver: scenes → stage → worker → cut)
├─ status_video_pipeline.sh (status helper for any video_pipeline log)
├─ synthetic_*_manifest.json (per-run synthetic refine manifests)
├─ immich/
│ ├─ users.json (label -> userId map; gitignored)
│ └─ <user>/{queue,state,aliases}.json (per-user staging artifacts)
├─ cache/
│ ├─ nl_full.npz (canonical cache + duplicates.json)
│ ├─ immich_<user>.npz (per-user immich embeddings)
│ └─ age_split_exif.json (path → EXIF-year cache)
└─ logs/
└─ *.log (every long step writes here)
```
+119
@@ -0,0 +1,119 @@
# Age-splitting faceset_001 into era-specific facesets
_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records). Source: `work/age_split_001.py` and `work/check_faceset001_age.py`._
## 1. Why split
`faceset_001` aggregates a single identity across roughly 20 years of source
material. The averaged embedding consumed by roop-unleashed therefore mixes
features from very different ages. For face-swap output that should target a
specific period (e.g. "this person around 2011" or "this person around
201819"), the identity needs to be split *after* clustering — the cluster is
correctly one identity, but the averaged embedding is the problem.
## 2. Evidence the identity is age-sortable
`work/check_faceset001_age.py` probes `faceset_001` (707 curated faces).
**Pairwise cos-distance histogram** (249,571 pairs):
| range | pairs |
|-------------|------:|
| [0.0, 0.2) | 1,250 |
| [0.2, 0.3) | 11,277 |
| [0.3, 0.4) | 63,920 |
| [0.4, 0.5) | 92,555 |
| [0.5, 0.6) | 63,288 |
| [0.6, 0.7) | 16,048 |
| [0.7, 0.8) | 1,217 |
| [0.8, 1.0) | 16 |
Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse — wide
enough to admit non-trivial sub-structure without crossing the
inter-identity boundary (which sits well above 0.6 in this dataset).
**Sub-clusters at threshold 0.35** (precomputed cos-dist, average linkage):
156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24].
The top sub-clusters align with distinct EXIF year medians (2011, 2019,
2018, 2011, 2010), so the split is meaningful.
## 3. Pipeline
`work/age_split_001.py`:
1. **Seed centroid.** Load the 707 face keys from
`facesets_swap_ready/faceset_001/manifest.json`; resolve to cache rows;
normalize the mean embedding.
2. **Wide recovery.** Pull every face record under `/mnt/x/src/{nl,
lzbkp_red}` from the cache with cos-dist ≤ 0.55 from the seed. The seed
is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501
faces from 4,756 candidates.
3. **Quality gate** (mirrors export-swap defaults): `face_short ≥ 100`,
`blur ≥ 40.0`, `det_score ≥ 0.6`. Result: 892 → 856 after one
re-centroid + tighten pass at 0.50 to absorb the recovery without
drift.
4. **Sub-cluster** the survivors at cos-dist 0.35 (precomputed agglomerative,
average linkage). 223 raw sub-clusters; sizes top-10 = [127, 97, 55, 42,
40, 25, 17, 14, 13, 11].
5. **EXIF year per source path.** Read `DateTimeOriginal` once per unique
path; cache on disk at `work/cache/age_split_exif.json` so re-runs after
parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths
were dated.
6. **Anchor-based fragment assignment** (replaces transitive union-find merge
that caused observable year drift):
- sub-clusters with ≥ 20 faces are *anchors* (6 found: dom-years 2011,
2019, 2018, 2011, 2016, 2010);
- smaller fragments attach to the single nearest anchor *only if* both
    `cent_dist ≤ 0.40` AND `|dom_year_anchor − dom_year_fragment| ≤ 5`;
- anchors do not merge with each other (transitive merging produced
anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier
runs);
- fragments with no qualifying anchor remain standalone.
7. **Per-era export.** Composite-quality rank, single-face square PNG crops
(`pad_ratio=0.5`, `out_size=512`), top-N + `_all` `.fsz` bundles, per-era
`manifest.json`, `<label>.txt` marker, `THIN.txt` for buckets < 20 faces.
8. **Top-level manifest merge.** New entries are appended to
`facesets_swap_ready/manifest.json`. Operationally the THIN buckets are
then moved into `_thin/` and partitioned into a `thin_eras` array (with
`relpath: _thin/<name>`) so consumers reading `facesets` see only the
substantive entries.
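Step 5's cached EXIF read can be sketched as follows (assumes Pillow ≥ 8 for `getexif().get_ifd`; the cache file is the one named above):
```python
import json
from pathlib import Path
from PIL import Image

CACHE = Path("work/cache/age_split_exif.json")
DATETIME_ORIGINAL = 36867   # EXIF tag id, lives in the Exif SubIFD

def exif_year(path: str, cache: dict) -> int | None:
    if path in cache:                   # slow Windows-mount read happens once per path
        return cache[path]
    year = None
    try:
        ifd = Image.open(path).getexif().get_ifd(0x8769)   # Exif SubIFD
        dt = ifd.get(DATETIME_ORIGINAL)                    # e.g. "2011:07:04 14:02:33"
        if dt:
            year = int(str(dt)[:4])
    except Exception:
        pass                            # undated / unreadable paths stay None
    cache[path] = year
    return year

def save_cache(cache: dict) -> None:
    CACHE.write_text(json.dumps(cache))
```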
## 4. Result
74 era buckets emitted; 6 substantive + 68 thin/standalone fragments.
| era | faces | dom year(s) |
|-------------------|------:|-------------|
| `faceset_001_2010-13` | 282 | 2011 |
| `faceset_001_2018-20` | 129 | 2019 |
| `faceset_001_2014-17` | 125 | 2018 (anchor sub 15 dom_year=2018) |
| `faceset_001_2018-19` | 107 | 2018 |
| `faceset_001_2005-10` | 88 | 2010 |
| `faceset_001_2011` | 43 | 2011 |
Two distinct 2011 anchors and two 2018-area anchors persist by design —
embedding-space distance separated them despite year overlap. The era-label
collisions are disambiguated with `_v2` suffixes, but only when both anchors
landed on the *same* literal label string (none of the substantive six did).
The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic
embeddings; they are quarantined into `_thin/` rather than deleted because
some are legitimate edge poses / lighting / age extremes that may be useful
for narrow targeted swaps.
## 5. Re-running and applying to other identities
- **Re-run with different parameters**: just re-execute `age_split_001.py`.
Embeddings are loaded from cache, EXIF is loaded from
`age_split_exif.json`, and only the sub-cluster + export steps re-run.
Total runtime ~2 min.
- **Apply to a different identity**: copy `age_split_001.py` to
`age_split_NNN.py` and change `FS001`. The `SCAN_ROOTS`,
`RECOVERY_THRESHOLD`, `TIGHTEN_THRESHOLD`, `SUBCLUSTER_THRESHOLD`,
`ANCHOR_MIN_SIZE`, `FRAGMENT_CENTROID_MAX`, and `FRAGMENT_YEAR_MAX`
defaults are tuned for `faceset_001`'s ~707-face curated cluster; smaller
identities likely need lower `ANCHOR_MIN_SIZE`.
- **Always quarantine THIN buckets** afterwards using the same partition
pattern (move to `_thin/`, split top-level manifest into
`facesets` + `thin_eras`). The script appends THIN entries to the top-level
manifest as if they were full facesets, so the cleanup is a separate step.
+154
@@ -0,0 +1,154 @@
# CLIP zero-shot occlusion filter (masks + sunglasses)
_Run date: 2026-04-27. Driver scripts: `work/filter_occlusions.py`, `work/clip_worker.py`._
## 1. Why
`facesets_swap_ready/` ended the Immich import day with 311 substantive
facesets and a long tail of identities whose clusters had latched onto
*eyewear or mask appearance* instead of identity (covid-era shots, vacation
photos with sunglasses dominating the frame). Two failure modes:
1. **Pollution of averaged identity** — roop's `FaceSet.AverageEmbeddings()`
averages every face in the .fsz. A faceset where 40 % of images are
sunglassed gives a biased centroid; the swap reproduces sunglass-shaped
eye sockets.
2. **Whole-cluster identity drift** — clustering at the embedding level
sometimes anchors on the eyewear silhouette rather than the face,
producing clusters of "the same sunglasses across multiple people".
A targeted attribute scorer was the cleanest fix.
## 2. Model + prompts
**Model**: `open_clip` `ViT-L-14` / `dfn2b_s39b` (Apple Data Filtering Networks).
Best public zero-shot at this size. Loads weights from HF Hub (~890 MB).
Bit-identical scores between WSL CPU and Windows DML.
**Prompt design**: per-attribute ensembles of 5–6 positive + 5–6 negative
prompts. Positive ensembles are mean-pooled and L2-normalized before softmax.
**Critical bug if forgotten**: CLIP cosine similarities are tiny (0.2–0.3
range). Raw `softmax([sim_pos, sim_neg])` collapses to ~0.5/0.5 on every
image. **Multiply by `model.logit_scale.exp()` (~100) before softmax.**
Without that scale the entire scorer outputs a uniform 0.5.
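In scorer form, the whole fix is one multiplication (open_clip API; the `pretrained` tag here is illustrative — substitute the exact dfn2b_s39b weights used in the run):
```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="dfn2b")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

def attribute_prob(image, pos_prompts: list[str], neg_prompts: list[str]) -> float:
    """image: PIL.Image. Returns P(attribute present) for one positive/negative ensemble."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0))
        img = img / img.norm(dim=-1, keepdim=True)
        def pooled(prompts):
            t = model.encode_text(tokenizer(prompts))
            t = t / t.norm(dim=-1, keepdim=True)
            t = t.mean(dim=0)            # mean-pool the ensemble...
            return t / t.norm()          # ...then re-normalize
        sims = torch.stack([img @ pooled(pos_prompts), img @ pooled(neg_prompts)]).squeeze()
        # THE critical step: raw sims are ~0.2-0.3; without logit_scale (~100)
        # the softmax collapses to ~[0.5, 0.5] on every image.
        probs = (model.logit_scale.exp() * sims).softmax(dim=0)
    return probs[0].item()
```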
**Sunglasses prompt pitfall**: the first set caught faces with sunglasses
*pushed up on the forehead* with the same probability as faces with
sunglasses *covering the eyes* — CLIP detects "presence of sunglasses in
frame", not "eyes occluded". Fixed by putting the false positive into the
*negative* class explicitly:
```
positive: "a face with dark sunglasses covering the eyes"
"a portrait with the eyes hidden behind opaque sunglasses"
...
negative: "a face with sunglasses pushed up on the forehead, eyes visible below"
"a face with sunglasses resting on top of the head, eyes visible"
"a face wearing clear prescription eyeglasses with visible eyes"
...
```
Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead
→ 0.39. Threshold 0.7 cleanly separates.
## 3. Architecture
```
┌─────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/filter_occlusions.py │
│ • stage: walk facesets/, write queue.json │
│ • merge: ingest worker results │
│ • report: HTML contact sheet │
│ • apply: prune + quarantine + re-zip │
└────────────┬────────────────────────────────┘
│ queue.json (paths) via \\wsl.localhost\
┌─────────────────────────────────────────────┐
│ Windows C:\clip_dml_venv\ │
│ /opt/face-sets/work/clip_worker.py │
│ Python 3.12 + torch 2.4.1 CPU │
│ + torch-directml 0.2.5 + open_clip_torch │
│ Reads PNGs from native E:\, writes scores │
└─────────────────────────────────────────────┘
```
A separate Windows venv (not the existing `C:\face_embed_venv\`) is needed
because `torch-directml` brings ~1.5 GB of wheels and version-pinned
numpy/pillow that risk breaking the embed_worker venv's
`onnxruntime-directml` + `insightface` stack.
## 4. DML throughput surprise
Measured on AMD Radeon RX Vega:
| input | model | throughput | speedup vs WSL CPU |
|------|-------|-----------:|-------------------:|
| ViT-L-14 (CLIP, this filter) | open_clip | **1.43 img/s** | **2.4×** |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |
Only 2.4× because `aten::_native_multi_head_attention` is not implemented in
the directml plugin and falls back to CPU. The vision encoder runs on GPU,
attention runs on CPU per layer, both alternating. A silenced UserWarning
makes this near-invisible. Workable for a one-shot 73-min corpus run, but
the embed_worker pattern (pure ONNX) remains the gold standard for DML.
## 5. Thresholds (validated 2026-04-27 on 6,318 PNGs)
| level | threshold | semantics |
|-------|----------:|-----------|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40 % of images flagged for either attr | quarantine whole faceset to `_masked/` |
| min-survivors | < 5 surviving AND something pruned | quarantine to `_thin/` |
The `AND something pruned` guard is essential — without it, naturally-small
facesets (hand-sorted with ≤4 PNGs) get incorrectly quarantined for being
small even when they have zero occlusions.
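The faceset-level decision with that guard, as a sketch (names hypothetical):
```python
def faceset_action(n_total: int, n_flagged: int,
                   domain_dominance: float = 0.40, min_survivors: int = 5) -> str:
    """Returns keep / prune / quarantine_masked / quarantine_thin."""
    if n_total and n_flagged / n_total >= domain_dominance:
        return "quarantine_masked"                # whole faceset -> _masked/
    survivors = n_total - n_flagged
    if n_flagged and survivors < min_survivors:   # guard: only if something was pruned
        return "quarantine_thin"                  # so small hand-sorted sets survive
    return "prune" if n_flagged else "keep"
```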
## 6. Run results
| action | count | net effect |
|--------|------:|------------|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → `_masked/` (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → `_thin/` |
Net: 311 active → 255 active after the filter run. 763 PNGs quarantined
whole-faceset, 183 pruned within survivors. All dropped PNGs preserved at
`<faceset>/faces/_dropped/` for reversibility. Master manifest gained a
`masked[]` array parallel to `thin_eras[]`, plus an `occlusion_filter_run`
provenance block.
## 7. Known limitations
- **Per-faceset manifests are NOT updated by `apply`** — only the master
manifest is. Each faceset's own `<faceset>/manifest.json` retains stale
`faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless
for `.fsz` consumers (the .fsz is re-zipped from current disk state) but
downstream tools reading `faces[]` will see broken references. Discovered
later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG
warnings before being caught.
## 8. Re-running
```bash
# 1. Stage queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json
# 2. Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
work/clip_dml/queue.json work/clip_dml/scores.json --batch 8
# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
--scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
--scores work/occlusion_scores.json --out work/occlusion_review
# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json
```
@@ -0,0 +1,155 @@
# Corpus dedup + roop-unleashed optimization
_Run date: 2026-04-27. Driver scripts: `work/dedup_optimize.py`, `work/multiface_worker.py`._
After consolidation collapsed duplicate identities and age-extend slotted
new PNGs into era buckets, the corpus still carried artifacts that hurt
roop's averaged-embedding quality:
- **Burst-photo near-duplicates** within facesets, especially in
immich-discovered identities where source libraries had many similar
shots within seconds.
- **Cross-faceset byte-identical PNGs** that escaped consolidation's
centroid-similarity matching when individual PNGs matched exactly but
cluster centroids diverged.
- **Multi-face PNGs** that polluted identity averaging because the roop
loader appends every detected face per PNG to the FaceSet (load-bearing
invariant — see § 2).
This pipeline runs three independent passes and an optional fourth, all
moving dropped PNGs to `<faceset>/faces/_dropped/` for reversibility.
## 1. Cross-family byte-dedup
SHA256-hash every PNG in the active corpus (parallel I/O via
`ThreadPoolExecutor(max_workers=16)`, ~17 s for 5,386 PNGs over the
`/mnt/e/` Windows mount). Group by hash; for groups with members in
multiple identity families, keep the higher-tier copy.
**Family detection**: regex `^(faceset_\d+)(?:_.+)?$` — captures the parent
identity. Same family includes parent + era splits (e.g. `faceset_001` +
`faceset_001_2010-13`); these are intentional duplications for the era
.fsz files and are preserved.
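Family extraction and the keep-rule reduce to a short sketch (the `tier_of` callable is assumed to implement the tier table from the consolidation writeup):
```python
import re
from collections import defaultdict

FAMILY_RE = re.compile(r"^(faceset_\d+)(?:_.+)?$")

def family_of(faceset_name: str) -> str:
    """faceset_001 and faceset_001_2010-13 share a family: era duplication is intentional."""
    m = FAMILY_RE.match(faceset_name)
    return m.group(1) if m else faceset_name

def cross_family_drops(hash_groups: dict[str, list[tuple[str, str]]], tier_of) -> list[str]:
    """hash_groups: sha256 -> [(faceset_name, png_path)].
    Keep the copy in the lowest-tier (most trusted) family; drop the rest."""
    drops = []
    for members in hash_groups.values():
        families = {family_of(fs) for fs, _ in members}
        if len(families) < 2:
            continue                     # intra-family duplication is preserved
        keep = min(members, key=lambda m: tier_of(family_of(m[0])))
        drops += [path for fs, path in members if (fs, path) != keep]
    return drops
```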
Run results: 20 cross-family hash groups → 24 PNGs dropped. Most cases were
small immich identity-cluster errors that consolidation missed because
individual PNG embeddings matched but the cluster mean did not.
## 2. Within-faceset near-dup at sim ≥ 0.95
Per-faceset pairwise cosine similarity on cached arcface embeddings.
Connected components in the `sim ≥ 0.95` graph. Keep highest
`quality.composite` per component, drop the rest.
**Threshold rationale**: legitimate same-person-different-pose pairs land at
0.50–0.85; ≥ 0.95 means essentially the same shot (burst frames or
recompressed dupes). Roop's `FaceSet.AverageEmbeddings()` averages all faces
into `faces[0].embedding`; near-identical embeddings averaged ≈ averaging
once. Removing them does not lose identity information; it removes a bias
weight on the most-photographed moments.
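The pass itself is a breadth-first walk over the thresholded similarity graph; a NumPy-only sketch (array shapes assumed):
```python
import numpy as np

def near_dup_drops(embs: np.ndarray, quality: np.ndarray, thresh: float = 0.95):
    """embs: (N, 512) L2-normalized; quality: (N,) composite scores.
    Connected components of the sim >= thresh graph; keep the best per component."""
    sim = embs @ embs.T
    n = len(embs)
    seen, drops = np.zeros(n, bool), []
    for start in range(n):
        if seen[start]:
            continue
        comp, stack = [], [start]
        seen[start] = True
        while stack:                                  # BFS over the thresholded graph
            i = stack.pop()
            comp.append(i)
            for j in np.flatnonzero((sim[i] >= thresh) & ~seen):
                seen[j] = True
                stack.append(j)
        if len(comp) > 1:
            comp.remove(comp[int(np.argmax(quality[comp]))])   # keep the best
            drops += comp
    return drops
```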
Run results: 851 groups → **1,225 PNGs dropped** (23 % of corpus).
Most-affected: `faceset_026` (-132 of 262), `faceset_027` (-107),
`faceset_028` (-92), `faceset_030` (-92). All immich-discovered identities
where the source library had burst sequences.
## 3. Multi-face audit (load-bearing roop invariant)
The roop loader at `roop/ui/tabs/faceswap_tab.py:661–691` runs
`extract_face_images(filename, (False, 0))` on every PNG and **appends every
detected face** to `face_set.faces`. A multi-face PNG therefore pollutes the
averaged identity. The export-swap pipeline drops multi-face crops at
creation, but post-pipeline operations (consolidation, age-extend) move
PNGs across facesets without re-checking.
**This audit re-detects every PNG** with insightface FaceAnalysis and flags
any with `face_count ≠ 1` (filtered by `det_score ≥ 0.5` and
`face_short ≥ 40`). Includes:
- ≥ 2 faces → loader will inject extra identities into averaging
- 0 faces → insightface can't find a face on the cropped PNG; useless for
roop, would silently fail
Run results: 4,146 PNGs scored, 332 flagged (272 with 2 faces, 9 with 3,
2 with 4, **49 with 0**). 82 facesets affected.
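The worker's per-PNG check is essentially the following (insightface `FaceAnalysis` API; gates from above, provider swapped per platform):
```python
import cv2
from insightface.app import FaceAnalysis

# DmlExecutionProvider on the Windows box; CPUExecutionProvider works anywhere for testing.
app = FaceAnalysis(name="buffalo_l", providers=["DmlExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

def face_count(png_path: str, min_det: float = 0.5, min_short: int = 40) -> int:
    """roop's loader appends EVERY detected face per PNG, so anything != 1 is flagged."""
    img = cv2.imread(png_path)
    n = 0
    for f in app.get(img):
        x1, y1, x2, y2 = f.bbox
        if f.det_score >= min_det and min(x2 - x1, y2 - y1) >= min_short:
            n += 1
    return n   # 0 => insightface can't re-find the face; >= 2 => pollutes averaging
```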
## 4. DML throughput jump for face crops
The audit reuses the same insightface + onnxruntime-directml stack as
`embed_worker.py` but achieves **~19 img/s** on AMD Vega vs embed_worker's
2.6 img/s — same model, same hardware. The difference is input size:
| stage | typical input | DML throughput |
|-------|--------------|---------------:|
| `embed_worker.py` (Immich import) | 1024–4000 px source | 2.6 img/s |
| `multiface_worker.py` (this audit) | 512×512 face crops | **19 img/s** |
Detection on small inputs is fast; recognition on aligned 112×112 inputs is
the same cost either way. Implication: **any pipeline operating on
already-cropped face PNGs can rely on a roughly 7× higher DML throughput
ceiling than full-resolution embedding**.
## 5. Architecture
```
┌────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/dedup_optimize.py │
│ • analyze: hashes + within-faceset sim │
│ • apply: move + re-zip (no GPU) │
│ • stage_multiface: write queue.json │
│ • merge_multiface: ingest worker results │
│ • apply_multiface: move + re-zip │
│ • report: HTML audit │
└────────────┬───────────────────────────────┘
│ queue.json via \\wsl.localhost\
┌────────────────────────────────────────────┐
│ Windows C:\face_embed_venv\ │
│ /opt/face-sets/work/multiface_worker.py │
│ insightface FaceAnalysis on DmlExecutionProvider │
│ Reads PNGs from native E:\, writes face_count │
└────────────────────────────────────────────┘
```
Reuses the existing `C:\face_embed_venv\` (no new venv needed — same
insightface stack as `embed_worker.py`).
## 6. Final corpus state (2026-04-27 night)
| metric | start of day | after occlusion filter | after consolidation | after age-extend | after this dedup + multiface |
|--------|-------------:|----------------------:|-------------------:|-----------------:|----------------------------:|
| active facesets | 311 | 255 | 181 | 181 | **181** |
| active PNGs | ~6,440 | 5,386 | 5,386 | 5,400 | **3,849** |
| `_masked/` | 0 | 51 | 51 | 51 | 51 |
| `_thin/` | 68 | 71 | 71 | 71 | 71 |
| `_merged/` | 0 | 0 | 74 | 74 | 74 |
Net reduction at the end of the day: **2,591 PNGs and 130 facesets** removed
or quarantined from the active pool. All preserved on disk for
reversibility (`<faceset>/faces/_dropped/` for prunes, `_masked/_merged/_thin/`
for quarantines).
## 7. Re-running
Run after any new import / consolidation / extend:
```bash
# 1. Byte-dedup + within-faceset near-dup (CPU only)
python work/dedup_optimize.py analyze --out work/dedup_audit/dedup_plan.json
python work/dedup_optimize.py apply --plan work/dedup_audit/dedup_plan.json
# 2. Multi-face audit on Windows DML (resumable)
python work/dedup_optimize.py stage_multiface --out work/dedup_audit/multiface_queue.json
"/mnt/c/face_embed_venv/Scripts/python.exe" work/multiface_worker.py \
work/dedup_audit/multiface_queue.json work/dedup_audit/multiface_results.json
python work/dedup_optimize.py merge_multiface \
--results work/dedup_audit/multiface_results.json \
--out work/dedup_audit/multiface_plan.json
python work/dedup_optimize.py apply_multiface \
--plan work/dedup_audit/multiface_plan.json
# 3. HTML audit
python work/dedup_optimize.py report \
--dedup work/dedup_audit/dedup_plan.json \
--multiface work/dedup_audit/multiface_plan.json \
--out work/dedup_audit
```
@@ -0,0 +1,170 @@
# Identity consolidation + age-bucket extension
_Run date: 2026-04-27. Driver scripts: `work/consolidate_facesets.py`, `work/age_extend_001.py`._
After the Immich peter + nic imports added 280 new facesets to a corpus that
had ~25 canonical identities, many "new" identities were duplicates of
existing household members at lower clustering confidence. Two cooperating
passes clean this up: identity consolidation merges duplicates, then
age-extend slots newly-merged PNGs into the existing era buckets of
`faceset_001`.
## 1. Identity consolidation
### 1.1 Approach
For each active faceset, pull cached arcface embeddings from
`work/cache/{nl_full,immich_peter,immich_nic}.npz` keyed by
`(source, bbox)` from the per-faceset manifest's `faces[]`. Compute
L2-normalized centroid. Pairwise cosine similarity matrix.
**Tier-based primary selection** (lowest tier number wins, size breaks ties):
| tier | sources | rationale |
|-----:|---------|-----------|
| 0 | `faceset_013..019` (hand-sorted) | user's curated labels |
| 1 | `faceset_001..012` (auto-clustered) | well-established household |
| 2 | `faceset_020..025` (osrc) | mixed-bucket discovery |
| 3 | `faceset_026..264` (immich peter) | speculative |
| 4 | `faceset_265+` (immich nic) | speculative |
**Era splits and quarantines excluded:** `faceset_NNN_<era>`, `_masked/`,
`_thin/` are skipped during analysis.
### 1.2 Single-linkage chains catastrophically — complete-linkage required
First attempt used connected-components on edge ≥ 0.45 → produced a
**60-faceset cluster** around `faceset_001` with min within-group sim of
**0.16** (definitely-different people bridged via chains
`A↔B↔C` where `A`, `C` are not similar). Bumping to edge ≥ 0.55 still
chained (group of 17 with min 0.20).
Real fix: `scipy.cluster.hierarchy.linkage(method='complete')` then
`fcluster(Z, t=1-edge_threshold, criterion='distance')`. Complete-linkage
**guarantees** every within-group pair sim ≥ edge threshold. Without this
guarantee the report is unusable and the apply step would produce
identity-poisoned merges.
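A minimal sketch of that grouping step, assuming `centroids` is the `(N, 512)` array of L2-normalized faceset centroids from §1.1 (illustrative names):
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def group_facesets(centroids: np.ndarray, edge: float = 0.55) -> np.ndarray:
    sim = centroids @ centroids.T            # pairwise cosine similarity
    dist = np.clip(1.0 - sim, 0.0, None)     # cosine distance, floor noise
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="complete")
    # complete linkage: max within-group distance <= t, i.e. min sim >= edge
    return fcluster(Z, t=1.0 - edge, criterion="distance")
```
Swapping `method="complete"` for `"single"` reproduces the chaining failure above.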
### 1.3 Thresholds + run results
`edge=0.55`, `confident=0.65` → 48 multi-faceset groups (29 confident, 19
uncertain). Max group size 7, all bilateral or small triplets after
complete-linkage.
After applying all 48 (with `--include-uncertain` after visual approval):
- **74 facesets consumed** (some groups had multiple secondaries:
`[10, 45, 135] → faceset_002`; `[113, 96, 178, 109, 110, 286] → faceset_095`;
etc.)
- Active count 255 → 181
- Notable absorptions: `faceset_001` (peter) 707 → 753 PNGs (+ 7, 132, 151);
`faceset_002` 209 → 247; `faceset_026` 60 → 262 (+ 168, 146, 325);
`faceset_028` → 207
- Master manifest gained `merged[]` array (parallel to `thin_eras[]`); each
entry has `merged_into` field pointing at the primary
### 1.4 Apply mechanics
Combine all PNGs from primary + secondaries, re-rank by existing
`quality.composite` desc (no re-enrich), renumber `0001..NNNN`, copy into a
fresh staging dir, atomic swap. Move secondary directories to
`_merged/<original_name>/` (preserved in full for reversibility). Re-zip
`_topN.fsz` and `_all.fsz`.
The primary's existing per-PNG quality scores are reused — re-ranking does
not require re-running `enrich`-equivalent landmarks/pose on the cropped
PNGs. The primary's `_dropped/` (from prior occlusion filter) is preserved
through the merge.
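A staged-swap sketch of those mechanics (illustrative names; the real logic lives in `work/consolidate_facesets.py`):
```python
import shutil
from pathlib import Path

def apply_merge(root: Path, primary: Path, secondaries: list[Path],
                entries: list[dict]) -> None:
    # entries: one dict per PNG from primary + secondaries, each carrying its
    # existing quality.composite and its current path under "src_png".
    staging = primary / "_faces_new"
    staging.mkdir(exist_ok=True)
    entries.sort(key=lambda e: -e["quality"]["composite"])   # re-rank, desc
    for rank, e in enumerate(entries, start=1):              # renumber 0001..
        shutil.copy2(e["src_png"], staging / f"{rank:04d}.png")
    old = primary / "_faces_old"
    (primary / "faces").rename(old)                          # swap in staging
    staging.rename(primary / "faces")
    shutil.rmtree(old)
    merged = root / "_merged"
    merged.mkdir(exist_ok=True)
    for sec in secondaries:                                  # keep for undo
        shutil.move(str(sec), str(merged / sec.name))
```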
## 2. Age extension of faceset_001 era buckets
### 2.1 Why a follow-on pass
Consolidation absorbed faceset_007/132/151 into faceset_001 (+46 PNGs).
The original `age_split_001.py` had bucketed peter into 6 era anchors
(`_2005-10`, `_2010-13`, `_2011`, `_2014-17`, `_2018-19`, `_2018-20`), but
those new PNGs had never been seen by age_split. They sat in faceset_001's
parent-only set, missing from every era .fsz.
### 2.2 Era-label pitfall
The 6 anchor era labels are NOT strict year ranges. They are
`Counter(years).most_common(1)`-derived dom-years from the original sub-cluster:
| label | dom_year | actual span of members |
|-------|---------:|-----------------------:|
| `_2005-10` | 2010 | 2005–2010 |
| `_2010-13` | 2011 | **2007–2024** |
| `_2011` | 2011 | 2011 only |
| `_2014-17` | 2016 | 2005–2018 |
| `_2018-19` | 2018 | 2012–2020 |
| `_2018-20` | 2019 | 2014–2022 |
The clusters are *appearance-anchored*, not year-bounded. Year is a
descriptive label. Assignment rule must use dom-year, not member span.
### 2.3 Algorithm
For each unbucketed face entry in `faceset_001`'s manifest (50 of 753):
1. Look up the embedding in the cache by `(source, bbox)`.
2. Look up the EXIF year via `work/cache/age_split_exif.json`; fetch on cache miss.
3. Find the single nearest era anchor by cosine distance to its centroid.
4. Accept iff `dist ≤ 0.40` AND `|year − anchor.dom_year| ≤ 5`; these
thresholds match `age_split_001.py`'s anchor-fragment rule (see the sketch
after this list).
5. Anchors are NOT re-centered after absorption (preserves age_split's
drift-prevention guarantee).
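A minimal sketch of the acceptance rule, assuming `anchors` holds `(name, L2-normalized centroid, dom_year)` tuples (the full implementation is `work/age_extend_001.py`):
```python
import numpy as np

DIST_MAX, YEAR_MAX = 0.40, 5

def assign_era(v: np.ndarray, year: int | None,
               anchors: list[tuple[str, np.ndarray, int]]) -> str | None:
    if year is None:
        return None                          # undated PNGs are skipped
    dist, name, dom = min((1.0 - float(np.dot(c, v)), n, d)
                          for n, c, d in anchors)
    if dist <= DIST_MAX and abs(year - dom) <= YEAR_MAX:
        return name
    return None                              # rejected: stays unbucketed
```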
### 2.4 Run results
50 unbucketed → 21 with EXIF year → **14 accepted**:
| anchor | dom_year | added |
|--------|---------:|------:|
| `_2005-10` | 2010 | +2 |
| `_2010-13` | 2011 | +1 |
| `_2014-17` | 2016 | **+9** |
| `_2018-20` | 2019 | +2 |
29 PNGs skipped for missing EXIF year (mostly immich-stripped
photos). 7 dist/year-rejected (e.g. two PNGs from 2025 want
`_2018-19` but year-delta 7 > 5).
### 2.5 Reconciliation side effect
The apply rebuilds each affected era bucket's `faces/` from staging. This
incidentally reconciled the per-bucket manifests with disk after the prior
occlusion filter run had left era manifests stale at 282/126/132 entries vs
~248/125/129 actual files (occlusion filter only updates the master
manifest, never per-faceset manifests — see
`docs/analysis/clip-occlusion-filter.md` §7). 42 occlusion-dropped era PNGs
inside the old `faces/_dropped/` were removed during rebuild. The
parent `faceset_001/faces/_dropped/` still has the corpus-level audit; all
source images are intact at `/mnt/x/src/`, so the era-level dropped PNGs
are regeneratable via `cmd_export_swap`.
## 3. Re-running
Always run both passes after any new identity import (Immich, osrc,
hand-sorted folder):
```bash
# 1. Find duplicate identities
python work/consolidate_facesets.py analyze \
--out work/merge_review/candidates.json [--edge 0.55 --confident 0.65]
python work/consolidate_facesets.py report \
--candidates work/merge_review/candidates.json --out work/merge_review
# inspect work/merge_review/index.html
python work/consolidate_facesets.py apply \
--candidates work/merge_review/candidates.json [--include-uncertain]
# 2. Slot new faceset_001 PNGs into existing era buckets
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
python work/age_extend_001.py report \
--candidates work/age_extend/candidates.json --out work/age_extend
python work/age_extend_001.py apply --candidates work/age_extend/candidates.json
```
Both are idempotent. `consolidate_facesets` skips secondaries already in
`_merged/`; `age_extend_001` recomputes anchor centroids + dom-year fresh
on every run.
@@ -0,0 +1,279 @@
# Importing identities from a self-hosted Immich library
_Run date: 2026-04-26. Target: Immich v2.7.2 at `https://fotos.computerliebe.org`.
Driver scripts: `work/immich_stage.py`, `work/embed_worker.py`,
`work/cluster_immich.py`, `work/finalize_immich.sh`._
## 1. Why a split workflow
InsightFace `buffalo_l` on the WSL CPU runs the full detection + landmarks +
recognition stack at ~3–4 faces/second. Re-detecting all 79K Immich photos
would have taken ~10–28 days. The available AMD Radeon RX Vega is unusable
under WSL (no `/dev/dri/`, no ROCm), but **DirectML on Windows native**
runs the same models bit-identically and ~7.5× faster end-to-end. The
pipeline therefore splits:
- **WSL side** (`/opt/face-sets/`) — orchestration: API listing, download,
sha256 dedup, file management, clustering, faceset emission.
- **Windows side** (`C:\face_embed_venv\`) — the embed step only. A fresh
Python 3.12 (installed via `winget install Python.Python.3.12`) with
`numpy`, `pillow`, `opencv-python-headless`, `onnxruntime-directml`,
`insightface`. Models copied from `/home/peter/.insightface/models/buffalo_l/`
to `C:\face_embed_venv\models\buffalo_l\`.
A 30-iteration synthetic benchmark on Vega:
| model | DML | CPU | speedup |
|-------------|----:|----:|--------:|
| `det_10g.onnx` (640×640) | 10.0 ms | 183.5 ms | 18.4× |
| `w600k_r50.onnx` (112×112) | 8.2 ms | 90.5 ms | 11.0× |
End-to-end FaceAnalysis on 5 real Immich-sourced images (excluding the
first-call DML JIT warmup): ~7.5× speedup post-warmup. Per-face cosine
similarity DML vs CPU was 1.0000 across all 8 detected faces — DML is
bit-identical to CPU for arcface inference.
## 2. Architecture
```
┌─────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/immich_stage.py │
│ ┌──────────────────────────────────────────┐│
│ │ ThreadPoolExecutor.map(_fetch_for_asset, ││
│ │ list_assets(user)) ││
│ │ ─ /faces?id= (Immich, parallel x8) ││
│ │ ─ filter face_short >= 90 ││
│ │ ─ /assets/.../original (parallel x8) ││
│ └──────────────────────────────────────────┘│
│ consumer (main thread): │
│ sha256 → dedup vs nl_full.npz │
│ save to /mnt/x/src/immich/<user>/<rel>/ │
│ append to queue.json │
└────────────────┬────────────────────────────┘
▼ queue.json (with WSL + Windows paths)
┌─────────────────────────────────────────────┐
│ Windows embed_worker.py (C:\face_embed_venv) │
│ insightface.FaceAnalysis( │
│ providers=[DmlExecutionProvider, ...]) │
│ per image: detection + landmarks + arcface │
│ emit cache in sort_faces.py:cmd_embed │
│ schema with embeddings + meta + processed │
│ + path_aliases + schema=v2 │
└────────────────┬────────────────────────────┘
▼ immich_<user>.npz
┌─────────────────────────────────────────────┐
│ WSL cluster_immich.py │
│ build centroids of canonical │
│ faceset_NNN/ in facesets_swap_ready/ │
│ drop matches at cos-dist <= 0.45 │
│ cluster the rest at 0.55 │
│ refine gates -> synthetic refine_manifest │
│ cmd_export_swap -> facesets_swap_ready/ │
│ merge top-level manifest │
└─────────────────────────────────────────────┘
```
Cache artifacts stay separate (per the architecture choice on this run):
each user's results live in their own `immich_<user>.npz`. A future
one-shot merge can fold them into `nl_full.npz` if needed; the existing
`extend` command would do the right thing once schemas align.
## 3. Path mapping
`/mnt/x/` ↔ `X:\`. Cache stores WSL form (matching `nl_full.npz`'s
existing convention). `wsl_to_win()` translates for the embed worker
which runs natively on Windows.
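`wsl_to_win()` itself is a small pure-string mapping; a sketch of what it has to do (the body here is an assumption, not the actual helper):
```python
def wsl_to_win(p: str) -> str:
    # /mnt/x/src/immich/a.jpg -> X:\src\immich\a.jpg
    if p.startswith("/mnt/") and len(p) > 6 and p[6] == "/":
        drive = p[5].upper()
        return f"{drive}:\\" + p[7:].replace("/", "\\")
    return p
```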
`work/cluster_immich.py` always uses the canonical `facesets_swap_ready/`
view to build identity centroids — meaning the comparison is against the
*current* set of canonical facesets in the swap-ready directory (skipping
era splits and `_thin/`), not against the older `facesets_full/` snapshot.
## 4. Result of the 2026-04-26 run (peter / admin)
### 4a. Stage
```
total_assets_seen: 53842
staged_count: 10261 (~10 GB on /mnt/x/)
deduped_against_existing: 978 (sha256 in nl_full.npz already)
deduped_against_staged: 2976 (internal byte-dupes inside Immich)
skipped_no_big_face: 9539 (Immich detected only sub-90px faces)
skipped_no_faces: 29390 (Immich detected zero faces)
skipped_download_error: 698 (transient DNS / TLS, not seen-marked)
elapsed: ~70 min (6.4 assets/s end-to-end at 8 workers)
```
The 698 transient errors are recoverable on a re-run because
`immich_stage.py` does not add them to the `seen` set. Each transient
asset would be retried.
### 4b. Embed (Windows DML)
```
queue: 10261 entries
new face records: 19462
new noface records: 1
load errors: 125 (likely HEIC / unreadable)
elapsed: 3878.0s (64.6 min, 2.6 img/s end-to-end)
```
The 2.6 img/s end-to-end includes CIFS-share image load, image decode,
DML inference (~50 ms/face), and JSON / NPZ flushing. Pure DML inference
is faster; the rest of the pipeline dominates at scale.
### 4c. Cluster
```
existing canonical centroids: 25
faces already covered (cos-dist <= 0.45): 8103/19480 (42%)
faceset_001: 1856
faceset_002: 2666
faceset_003: 670
faceset_004: 48
faceset_005: 40
... (smaller hits to the remaining 20)
unmatched faces to cluster: 11377
clusters at threshold 0.55: 2534 (top sizes [469, 444, 342, 338, 262, ...])
survived refine gates: 239
emitted as new facesets: 185 (54 dropped by export-swap's 0.45 outlier)
```
Top-level `facesets_swap_ready/manifest.json` after this run: **216
facesets** (up from 31; ~7× growth) + 68 thin_eras under `_thin/`.
## 4d. Result of the 2026-04-26..27 run (nic, with per-user API key)
After issuing nic a per-user API key, the same pipeline ran end-to-end
with no code changes (only the `IMMICH_API_KEY` env var changed). The
run survived one Immich outage mid-stage thanks to the circuit breaker
added in `work/immich_stage.py` (12 consecutive HTTP errors → probe →
exit 2 with state preserved → resume on same command).
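A sketch of that breaker shape (hypothetical names; `probe` and `save_state` stand in for whatever `work/immich_stage.py` actually does, and state is flushed to disk throughout the run so exit 2 loses nothing):
```python
import sys

MAX_CONSECUTIVE = 12
consecutive = 0

def note_http_result(ok: bool, probe, save_state) -> None:
    global consecutive
    consecutive = 0 if ok else consecutive + 1
    if consecutive >= MAX_CONSECUTIVE:
        if probe():              # one cheap request: is the server back?
            consecutive = 0
        else:
            save_state()         # persist seen-set + queue markers
            sys.exit(2)          # distinct exit code: resume same command
```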
### Stage
```
total_assets_seen: 25777 (matches /server/statistics 25,786)
staged_count: 7834 (30% face-bearing-with-big-face;
peter was 19%)
deduped_against_existing: 519 (sha256 in nl_full.npz already)
deduped_against_staged: 0 (nic's library has zero internal
byte-dupes; peter had 2,976)
skipped_no_big_face: 725
skipped_no_faces: 16695
skipped_download_error: 54 (transient; not marked seen ->
would be retried on resume)
elapsed: ~75 min wall (across two pause/resume sessions
bracketing one Immich outage)
```
### Embed (Windows DML)
```
queue: 7834 entries
new face records: 15627
new noface records: 1
load errors: 7
elapsed: 3538.9s (59 min, 2.2 img/s end-to-end)
```
### Cluster
```
existing canonical centroids: 25
faces already covered (cos-dist <= 0.45): 6770/15627 (43%)
faceset_002: 3261 (the dominant family identity)
faceset_008: 1461 (cross-match to hand-sorted 'sab')
faceset_001: 955
faceset_007: 408 (cross-match to hand-sorted 's')
faceset_006: 114
...
unmatched: 8857
clusters at threshold 0.55: 3787 (top sizes [165, 134, 106, 99, 92,
67, 62, 61, 58, 53])
survived refine gates: 129
emitted as new facesets: 95 (faceset_265..NNN with gaps)
```
Top-level `facesets_swap_ready/manifest.json` after the nic run: **311
substantive facesets** + 68 thin_eras. Two-day cumulative growth:
| date | event | facesets total |
|------|------|------:|
| 2026-04-25 | hand-sorted folder import | 19 |
| 2026-04-26 morning | osrc + age split + cleanup | 31 |
| 2026-04-26 afternoon | Immich peter run | 216 |
| 2026-04-27 (overnight) | Immich nic run | 311 |
## 5. Surprises and caveats
### 5a. `/search/metadata`'s `userIds` filter is silently ignored (Immich v2.7.2)
When the admin API key is used, passing `userIds=[<other-user-uuid>]`
returns admin's own assets, not the other user's. The filter is
silently dropped. Verified by sampling 200 returned items and
confirming `ownerId` was admin for all of them.
To process another user's library, **a separate API key issued by that
user is required** — the admin key cannot enumerate cross-user
libraries through any documented endpoint we tried. `/timeline/buckets`
with a `userId` query parameter returns
`Not found or no timeline.read access`.
### 5b. `/server/statistics` undercounts what the search returns
`/server/statistics` reported admin = 53,842 photos. Our
`/search/metadata` paginated through... **53,842** top-level. So the
header agrees with the body in this case. But `/server/statistics` does
NOT count items that live under external libraries' import paths —
yet `/search/metadata` does include them. For this Immich, two external
libraries (`/mnt/media/photos` and `/mnt/media/omv_photos`) are
configured but `/libraries` reports `assetCount=0` for both. Yet 80% of
our staged paths come from those library import paths. Don't trust
statistics-vs-search consistency.
### 5c. Indexed Immich thumbnails masquerading as assets
5,563 of our 10,261 staged paths are `<library>/thumbs/.../-preview.jpeg`
— Immich's own internally-generated thumbnails got indexed because the
external library import path included the thumbs subdirectory and the
exclusion patterns didn't list `**/thumbs/**`. They embed and cluster
fine but produce lower-resolution face records. The fix on the Immich
side is adding `**/thumbs/**` to the exclusion patterns.
### 5d. Internal byte-duplicates (2,976)
Many Immich assets are byte-identical to other Immich assets — typically
because the same photo was uploaded both from a phone and from a
synced cloud folder. sha256 dedup catches all of these on the second
download (we still pay the bandwidth, but skip the disk write and
embed work). With Immich v2.7.2's own `assets/duplicates` endpoint we
could catch this earlier, but it's not currently used.
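The dedup itself is a one-hash gate; a minimal sketch (illustrative names):
```python
import hashlib

def maybe_stage(content: bytes, seen_sha256: set[str]) -> bool:
    # Bandwidth is already spent by the time we hash, but a hit skips the
    # disk write and never enters the embed queue.
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_sha256:
        return False
    seen_sha256.add(digest)
    return True
```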
## 6. Re-running and applying to other Immich instances
```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=... # admin or per-user key
# Optional: populate work/immich/users.json with label -> UUID map.
# 1. Stage (parallel /faces + downloads, resumable).
python work/immich_stage.py --user peter --workers 8
# 2. End-to-end finalize: copy queue to /mnt/c/, run Windows embed worker,
# copy the cache back, run cluster_immich.py.
bash work/finalize_immich.sh peter
```
For a different Immich instance, the only configuration is the env vars
and the `users.json` sidecar. `cluster_immich.py`'s tunables (matching
threshold, clustering threshold, refine gates, MIN_FACES) are at the
top of the script.
To process a *second* user's library, issue a per-user API key in the
Immich admin UI for that user, set `IMMICH_API_KEY` to that key, and
re-run with their `--user <label>`. The admin key cannot impersonate
other users via the search API.
@@ -0,0 +1,119 @@
# Identity discovery in `/mnt/x/src/osrc`
_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records).
Driver script: `work/cluster_osrc.py`._
## 1. Source
`/mnt/x/src/osrc/` is a flat mixed-identity bucket: 213 files in root + a
`psd/` subfolder with 41 PSD files + a single file in `[Originaldateien]/`.
File extensions are 171 jpg + 1 jpeg + 41 psd. PSDs are not embedded
(InsightFace's loader doesn't read PSD); the 41 PSDs were skipped, on the
working assumption that the same identities are also present in the
adjacent JPGs.
`nl_full.npz` already covered 160 of the 213 files (the remaining 53: 41
psd + 12 jpg). Of the 12 missing JPGs, 11 are byte-duplicates of `00843resc.jpg`
.. `00855resc.jpg` (same file sizes, paired by sha256) — already aliased
in the cache. Only 1 jpg (`19554226_..._n.jpg`) is genuinely uncovered.
The 160 covered files yielded **336 face records / 10 noface**, with 64
single-face / 35 two-face / 19 three-face / 24 four-face / 8 with 5–8
faces. Quality is good: median `face_short=116px`, `det_score=0.85`,
`blur=244`. The minimum `face_short` of 40px will fail the 90px refine gate.
## 2. Coverage by existing identities
Computed cos-dist from each osrc face to the centroids of the canonical
`faceset_001..019` (built from each manifest's `(source, bbox)` keys).
Median nearest-cos-dist was 0.875 — i.e. the bulk of osrc is **not** the
existing 19 identities.
At cos-dist ≤ 0.45 (matching `build_folders.py`'s `OSRC_THRESHOLD`):
| existing identity | osrc faces matched |
|------------------|------------------:|
| faceset_002 | 7 |
| faceset_008 | 4 |
| faceset_015 | 3 |
| faceset_019 | 4 |
These 18 osrc faces are routed to existing identities by
`build_folders.py` and `extend`; they are excluded from the
identity-discovery step.
## 3. Pipeline
`work/cluster_osrc.py` mirrors `build_folders.py`'s structure (synthesize
a refine manifest, hand off to `cmd_export_swap`, relocate, merge
top-level manifest) but discovers identities by clustering rather than
asserting them by folder.
1. Filter cache to face records under `/mnt/x/src/osrc` (canonical or
byte-aliased path).
2. Drop the 18 already-covered faces (cos-dist ≤ 0.45 to any existing
identity centroid).
3. Cluster the remaining 318 faces among themselves at cos-dist 0.55
(matches the `extend` default for new-cluster formation).
4. For each cluster, apply `refine`-equivalent per-face gates
(`face_short ≥ 90`, `blur ≥ 40`, `det_score ≥ 0.6`); for clusters of ≥ 4
faces, apply outlier rejection at cluster-centroid cos-dist 0.55 (see the
sketch after this list). Keep clusters whose surviving unique-path count
is ≥ 6 (the operator-chosen `MIN_FACES`, lower than the canonical 15
because osrc is small per-identity).
5. Number kept clusters `faceset_020+` (past the existing
`facesets_swap_ready/` max of 019) ordered by size descending.
6. Synthesize a refine manifest and call `cmd_export_swap` on it. Move
the emitted dirs into `facesets_swap_ready/`, drop an `osrc.txt`
provenance marker, and append the new entries to the top-level
`manifest.json` (without disturbing existing `facesets` / `thin_eras`).
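A minimal sketch of step 4's gate + outlier pass, assuming `faces` carries the cached per-face metrics and `embs` holds their L2-normalized embeddings (illustrative names, not `cluster_osrc.py`'s actual functions; whether the ≥ 4 cutoff counts raw or gated faces is an assumption here):
```python
import numpy as np

def gate_cluster(faces: list[dict], embs: np.ndarray) -> list[int]:
    # Per-face gates first (refine-equivalent).
    keep = [i for i, f in enumerate(faces)
            if f["face_short"] >= 90 and f["blur"] >= 40
            and f["det_score"] >= 0.6]
    # Outlier rejection only for clusters of >= 4 survivors.
    if len(keep) >= 4:
        E = embs[keep]
        c = E.mean(axis=0)
        c /= np.linalg.norm(c)
        d = 1.0 - E @ c                      # cosine distance to centroid
        keep = [k for k, dist in zip(keep, d) if dist <= 0.55]
    return keep
```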
## 4. Result (2026-04-26)
Phase 1 (clustering, before export-swap):
- 137 raw clusters at cos-dist 0.55; top sizes [37, 20, 12, 9, 7, 7, 6, 6, 6, 5].
- After quality gate: 124 faces dropped (mostly `face_short < 90` from
group-photo tertiary subjects).
- Outlier rejection: 0 dropped (clusters were tight).
- After `min_faces=6`: **7 candidate clusters kept** (sizes 6–28 unique
source paths).
Phase 2 (`cmd_export_swap` with `min_face_short=100`,
`outlier_threshold=0.45`):
| name | input | outlier drop | exported PNGs |
|--------------|------:|-------------:|--------------:|
| faceset_020 | 71 | 42 | 26 |
| faceset_021 | 36 | 21 | 10 |
| faceset_022 | 15 | 7 | 8 |
| faceset_023 | 19 | 14 | 4 |
| faceset_024 | 6 | 0 | 6 |
| faceset_025 | 10 | 4 | 6 |
| faceset_026 | — | — | 0 (skipped: empty after filter) |
`faceset_026`'s 6 cluster faces all failed export-swap's tighter
`min_face_short=100` gate (vs. cluster's 90); it is not emitted.
`faceset_023` is small (4 PNGs) but useful as an averaged identity at
that size.
Top-level `facesets_swap_ready/manifest.json` now: **31 substantive
facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted + 6 era splits + 6
osrc-discovered) + **68 thin_eras** under `_thin/`.
## 5. Re-running and applying to other mixed buckets
- The cache holds osrc embeddings; to re-run with different parameters,
edit `cluster_osrc.py`'s config block and re-execute. Cluster discovery
+ export-swap is a few minutes total.
- For a different mixed-bucket source, copy `cluster_osrc.py` to
`cluster_<name>.py` and change `OSRC_DIR`, `OUT_TMP`, `SYNTH_MANIFEST`,
`START_NNN`. The exclusion step compares against the *current* contents
of `facesets_swap_ready/faceset_NNN/` so it picks up everything emitted
by previous discovery / split / hand-sorted runs.
- Lowering `MIN_FACES` from 6 to 4 would have admitted ~3 additional
marginal clusters at this corpus size; the trade-off is a noisier
identity average for small-N facesets.
- `extend` should be run before `cluster_osrc.py` so `raw_full/` and
`facesets_full/` stay in sync — `cluster_osrc.py` itself only writes
to `facesets_swap_ready/`.
@@ -0,0 +1,142 @@
# Video target preprocessing for roop-unleashed
_Initial design + first batch run: 2026-04-27. Driver scripts: `work/video_target_pipeline.py`, `work/video_face_worker.py`, `work/run_video_pipeline.sh`._
Companion to the face-set side of the project: instead of building per-identity .fsz bundles for the *source* of a swap, this pipeline preprocesses the *target* (videos to swap into). Given a folder of video files, it identifies "swappable" segments — continuous shots where a face is detectable, sufficiently visible, and roughly within inswapper_128's working envelope — and cuts them into UUID-named clips ready to feed into roop-unleashed.
## 1. Why build it
I checked the obvious open-source projects for an existing implementation:
- **FaceFusion** ([github.com/facefusion/facefusion](https://github.com/facefusion/facefusion)) — CLI has `run`, `headless-run`, `batch-run`, `job-*`, `force-download`, `benchmark`. No scene-detection or clip-extraction subcommand. Its own guides recommend "split your video manually first."
- **roop-unleashed** at `/opt/roop-unleashed/roop/util_ffmpeg.py` — has `cut_video(start_frame, end_frame)` for a manual GUI trim, no detection-driven segmentation.
- **Deep-Live-Cam** ([github.com/hacksider/Deep-Live-Cam](https://github.com/hacksider/Deep-Live-Cam)) — real-time / single-shot, no batch preprocessing.
- **DeepFaceLab** — `extract_video.bat` dumps every frame between user-supplied trim points; no quality gating.
Closest prior art for the cut-detection pattern is the two-stage hybrid in [SportSBD MMSys'26](https://dl.acm.org/doi/10.1145/3793853.3799803) (cheap detector for cuts, accurate net for verification), but the actual implementation has to be ours.
## 2. Pipeline architecture
```
WSL /opt/face-sets/work/ Windows C:\face_embed_venv\
───────────────────────────────────── ─────────────────────────────
run_video_pipeline.sh (chain driver)
├─ scan (ffprobe metadata)
├─ scenes (PySceneDetect AdaptiveDetector, CPU)
├─ stage (sampled frame queue.json @ 2 fps)
│ │
│ ▼
│ video_face_worker.py
│ insightface FaceAnalysis
│ on DmlExecutionProvider
│ output: results.jsonl
├─ merge (ingest results.jsonl)
├─ track (IoU + embedding stitching, ~30 LOC)
├─ score (track-level quality gate + cross-track merge)
├─ cut (ffmpeg -c copy → per-source subfolders)
└─ report (HTML preview)
Output: <output_dir>/<source_video_stem>/<uuid>.mp4
/<uuid>.json (sidecar; opt-in via
--write-sidecar)
```
`run_video_pipeline.sh` is parameterized via env vars (`WORK`, `INPUT_DIR`, `OUTPUT_DIR`, `FILTER_FROM`, `SKIP_PATTERN`, `MAX_DUR`, `IDENTITY`, `SIDECAR`) so you can pin a particular batch without editing the script. Sidecars are off by default — the per-batch `plan.json` always carries the full provenance for every clip; the `<uuid>.json` files alongside the clips are redundant and only useful if you need each clip to be self-describing in isolation.
## 3. Quality signals (matched to inswapper_128's working envelope)
inswapper_128 is trained near-frontal at 128×128. The score gate uses defaults that admit side profiles (since rich face-sets can absorb non-frontal swap targets):
| signal | threshold | rationale |
|--------|----------:|-----------|
| `|yaw|` | ≤ 75° | covers full 3/4 + side profile |
| `|pitch|` | ≤ 45° | covers extreme up/down looks |
| `face_short` | ≥ 80 px | inswapper resamples to 128; ≥80 still produces clean output |
| `det_score` | ≥ 0.5 | matches buffalo_l's MIN_DET; lower = unreliable detection |
| track-gate | ≥ 70 % frames pass | binary track filter rather than per-frame |
| duration | 1 s ≤ dur ≤ 120 s | below 1s = unusable slivers; above 120s probably contains a missed micro-cut |
Plus two segment-merging knobs:
- `--bridge-gap` (default 3 s) — within a single track, brief pose-failure gaps shorter than this get bridged so single bad frames don't fragment a good run
- `--merge-gap` (default 2 s) — across tracks within the same scene, segments closer than this get fused (cross-track merge fires when face detection briefly fails between adjacent good runs)
The defaults can be tightened (e.g. `--max-yaw 25` for portrait-only) or loosened (e.g. `--max-yaw 90 --merge-gap 5`) without re-running detection — `score` reads the existing `tracks.json`.
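Cross-track fusion is plain interval merging; a minimal sketch of the `--merge-gap` knob over `(start, end)` segments in seconds within one scene (assumed representation, not the pipeline's actual data model; `--bridge-gap` works the same way within a single track):
```python
def fuse_segments(segs: list[tuple[float, float]],
                  merge_gap: float = 2.0) -> list[tuple[float, float]]:
    fused: list[list[float]] = []
    for s, e in sorted(segs):
        if fused and s - fused[-1][1] <= merge_gap:  # close enough: fuse
            fused[-1][1] = max(fused[-1][1], e)
        else:
            fused.append([s, e])
    return [(s, e) for s, e in fused]
```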
## 4. Performance + the JSONL append-only fix
This is where the engineering interest is. The first production run on 13 videos / 6.18 h of input went through three failure modes before settling at production speed:
| attempt | issue | rate observed |
|---|---|---:|
| 1. Original `cap.set(POS_FRAMES, N)` per sample | OpenCV seeks to nearest keyframe + decodes forward at every sample. Cost grows with depth into the video; on a 60-min H.264 it falls off a cliff. | 1.4 fps → degrading |
| 2. Sequential `cap.grab()` from frame 0 | On resume, grab-walking from frame 0 to a deep target is unbounded. | 0.08 fps |
| 3. Hybrid: seek-once-per-video + sequential within | Better in principle. But hit a different bug: `flush()` was re-serializing the entire `results.json` (245 MB at this point) every 100 frames or 30 sec. Save dominated wall-clock. | 0.5 fps |
| 4. **JSONL append-only** | One result per line. Each flush is O(new records), not O(total records). | **13.77 fps** smoke / 7.57 fps cumulative across the full batch |
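Attempts 1–3 all orbit OpenCV's per-sample seek pathology; the pattern that survived into production is one seek per video, then a sequential `grab()` walk. A minimal sketch (illustrative names, assuming cv2):
```python
import cv2

def sample_frames(path: str, start_frame: int, stride: int):
    cap = cv2.VideoCapture(path)
    try:
        cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)  # the ONE seek per video
        pos = start_frame
        while cap.grab():                   # walk forward; no per-sample seek
            if (pos - start_frame) % stride == 0:
                ok, frame = cap.retrieve()  # materialize only at sample points
                if ok:
                    yield pos, frame
            pos += 1
    finally:
        cap.release()
```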
Lesson: when the output is large + grows monotonically + needs frequent checkpointing, *do not* re-serialize the whole structure on each flush. Append-only line-delimited JSON is the right tool. The legacy `results.json` is auto-converted to `.jsonl` on first load (one-time migration), so resumes survive the format switch.
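A minimal sketch of the append-only pattern (hypothetical names; the real worker is `work/video_face_worker.py`):
```python
import json
import os

class JsonlSink:
    """Append-only results writer: each checkpoint costs O(new records)."""
    def __init__(self, path: str):
        self.f = open(path, "a", encoding="utf-8")  # append mode survives resume

    def write(self, record: dict) -> None:
        self.f.write(json.dumps(record) + "\n")

    def checkpoint(self) -> None:
        self.f.flush()
        os.fsync(self.f.fileno())           # durable against a mid-run kill

def load_results(path: str) -> list[dict]:
    # Resume path: one linear read; no monolithic parse/re-serialize cycle.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```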
## 5. Hardware decode/encode on AMD Vega + WSL
Skipped. Per [Microsoft's WSL D3D12 video acceleration post](https://devblogs.microsoft.com/commandline/d3d12-gpu-video-acceleration-in-the-windows-subsystem-for-linux-now-available/), VAAPI-via-Mesa-D3D12 exists but is fragile on older AMD. AMF on Windows would mean a Windows-side ffmpeg leg, doubling boundary crossings. CPU software decode of 1280×720 H.264 in WSL ffmpeg is faster than realtime, and the bottleneck is buffalo_l detection on DML, not decode.
For cutting we use `-c copy` stream-copy — no re-encode, hardware codecs are moot.
## 6. Full corpus run results
Three runs across the 61-video corpus at `/mnt/x/src/vd/`:
| | test (3 videos) | first batch (13 videos, 50–62) | rest (45 videos, 02–49 minus test) | **total** |
|---|---:|---:|---:|---:|
| input duration | 0.6 h | 6.18 h | 12.98 h | **19.76 h** |
| sampled frames @ 2 fps | 4,472 | 44,635 | 94,030 | 143,137 |
| tracks | 187 | 2,564 | 3,823 | 6,574 |
| accepted tracks | 94 (50 %) | 1,193 (47 %) | 1,905 (50 %) | 3,192 (49 %) |
| **emitted segments** | **83** | **600** | **1,301** | **1,984** |
| cross-track-merged segments | 14 | 254 | 382 | 650 |
| accepted content | 13 min | 239 min | 395 min | **647 min (= 10.78 h)** |
| acceptance rate by time | 36 % | 64.6 % | 50.7 % | **54.6 %** |
| output size | 0.135 GB | 3.63 GB | 4.84 GB | **8.6 GB** |
Phase timings (rest batch — best representative since it ran fully under JSONL append-only from a fresh start):
- scenes: 117 min (PySceneDetect, 45 × ~3 min/video)
- stage: instant
- worker: 100 min @ **15.78 fps** sustained (vs 7.5 fps for first batch which migrated mid-run)
- merge: 90 s
- track: 92 s
- score: 23 s
- cut (1,301 ffmpeg stream-copies): 30 min
- report (1,301 thumbs + HTML): 5.5 min
- **total wall-clock: 4h16m**
Across all three runs, **0 worker errors on 143,137 sampled frames**.
## 7. Re-running
```bash
# choose a per-batch workdir + log
WORK=/opt/face-sets/work/video_preprocess_<batch_name> \
FILTER_FROM=ct_src_00050.mp4 \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch_name>.log 2>&1 &
# check status anytime
bash work/status_video_pipeline.sh work/logs/video_run_<batch_name>.log
```
Skip patterns can exclude already-processed inputs (note that 5-digit numbers need full padding in the regex, e.g. `0005[0-9]` not `005[0-9]`):
```bash
SKIP_PATTERN='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$' \
WORK=/opt/face-sets/work/video_preprocess_rest \
bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &
```
To also emit per-clip provenance sidecars (off by default):
```bash
SIDECAR=yes \
WORK=/opt/face-sets/work/video_preprocess_<batch> \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch>.log 2>&1 &
```
`scenes` outputs are cached in the batch's `WORK/scenes/` dir, so re-running the chain after an edit-to-score step doesn't redo detection. The worker is also resumable per `queue_id` — if killed mid-flight, just relaunch.
@@ -0,0 +1,576 @@
"""Extend the existing 6 era buckets of faceset_001 by absorbing PNGs that
post-date the original age_split run (from consolidation merges, etc.).
Mirrors the anchor-fragment assignment logic in age_split_001.py:
- For each unbucketed face in faceset_001's manifest, find the nearest active
era anchor by cosine distance to the anchor's centroid.
- Accept the assignment iff dist <= 0.40 AND |year_delta| <= 5
(where year_delta = exif_year(face) - dom_year(anchor)).
- Undated PNGs are skipped (no assignment).
- Anchors are NOT re-centered after absorption (preserves the same drift
guarantees as the original age_split).
CLI:
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
python work/age_extend_001.py report --candidates ... --out work/age_extend
python work/age_extend_001.py apply --candidates ... [--dry-run]
"""
from __future__ import annotations
import argparse
import json
import shutil
import sys
import time
from collections import Counter
from pathlib import Path
import numpy as np
from PIL import Image, ExifTags
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
PARENT = "faceset_001"
ACTIVE_ERAS = [
"faceset_001_2005-10",
"faceset_001_2010-13",
"faceset_001_2011",
"faceset_001_2014-17",
"faceset_001_2018-19",
"faceset_001_2018-20",
]
CACHES = [
Path("/opt/face-sets/work/cache/nl_full.npz"),
Path("/opt/face-sets/work/cache/immich_peter.npz"),
Path("/opt/face-sets/work/cache/immich_nic.npz"),
]
EXIF_CACHE = Path("/opt/face-sets/work/cache/age_split_exif.json")
# anchor-fragment thresholds (mirror age_split_001.py)
DIST_MAX = 0.40
YEAR_MAX = 5
# ----------------------------- caches -----------------------------
def load_caches():
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
alias_map: dict[str, str] = {}
for c in CACHES:
if not c.exists():
print(f"[warn] cache missing: {c}", file=sys.stderr)
continue
d = np.load(c, allow_pickle=True)
emb = d["embeddings"]
meta = json.loads(str(d["meta"]))
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/emb mismatch in {c}: {len(face_records)} vs {len(emb)}")
if "path_aliases" in d.files:
paliases = json.loads(str(d["path_aliases"]))
for canon, alist in paliases.items():
alias_map.setdefault(canon, canon)
for a in alist:
alias_map[a] = canon
for i, rec in enumerate(face_records):
p = rec["path"]
bbox = tuple(int(x) for x in rec["bbox"])
v = emb[i].astype(np.float32)
n = float(np.linalg.norm(v))
if n > 0:
v = v / n
rec_index[(p, bbox)] = v
alias_map.setdefault(p, p)
print(f"[cache] indexed {len(rec_index)} face records, {len(alias_map)} aliases", file=sys.stderr)
return rec_index, alias_map
def lookup_emb(rec_index, alias_map, src: str, bbox):
bbox_t = tuple(int(x) for x in bbox)
canon = alias_map.get(src, src)
v = rec_index.get((canon, bbox_t))
if v is None and canon != src:
v = rec_index.get((src, bbox_t))
return v
# ----------------------------- exif -----------------------------
def load_exif_cache():
if not EXIF_CACHE.exists():
return {}
return json.loads(EXIF_CACHE.read_text())
def save_exif_cache(cache):
tmp = EXIF_CACHE.with_suffix(".tmp.json")
tmp.write_text(json.dumps(cache, indent=2))
tmp.replace(EXIF_CACHE)
def exif_year(path: Path) -> int | None:
try:
with Image.open(path) as im:
ex = im._getexif()
if not ex:
return None
for tag_id, val in ex.items():
tag = ExifTags.TAGS.get(tag_id, tag_id)
if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
return int(val[:4])
except Exception:
return None
return None
def get_year(src: str, exif_cache) -> int | None:
"""Return EXIF year for src, using cache. Mutates cache for new lookups."""
if src in exif_cache:
return exif_cache[src]
p = Path(src)
y = exif_year(p) if p.exists() else None
exif_cache[src] = y
return y
# ----------------------------- analyze -----------------------------
def cmd_analyze(args):
rec_index, alias_map = load_caches()
exif_cache = load_exif_cache()
exif_cache_dirty = False
parent_dir = ROOT / PARENT
parent_manifest = json.loads((parent_dir / "manifest.json").read_text())
parent_faces = parent_manifest.get("faces", [])
print(f"[parent] {PARENT}: {len(parent_faces)} face entries", file=sys.stderr)
# Build "in_bucket" set + each anchor's centroid + dom_year
anchors = []
in_bucket: set[tuple[str, tuple[int, int, int, int]]] = set()
for era in ACTIVE_ERAS:
ed = ROOT / era
if not ed.is_dir():
print(f"[warn] missing era bucket: {era}", file=sys.stderr)
continue
em = json.loads((ed / "manifest.json").read_text())
emb_list = []
years = []
n_missing_emb = 0
for f in em.get("faces", []):
src = f.get("source")
bbox = f.get("bbox")
if not src or not bbox:
continue
key = (alias_map.get(src, src), tuple(int(x) for x in bbox))
in_bucket.add(key)
in_bucket.add((src, tuple(int(x) for x in bbox))) # cover both alias and raw
v = lookup_emb(rec_index, alias_map, src, bbox)
if v is None:
n_missing_emb += 1
else:
emb_list.append(v)
y = get_year(src, exif_cache)
if y is None:
exif_cache_dirty = True
else:
years.append(y)
if src not in exif_cache:
exif_cache_dirty = True
if not emb_list:
print(f"[warn] {era}: no embeddings found, skipping anchor", file=sys.stderr)
continue
arr = np.stack(emb_list).astype(np.float32)
c = arr.mean(axis=0)
n = float(np.linalg.norm(c))
if n > 0:
c = c / n
dom_year = Counter(years).most_common(1)[0][0] if years else None
anchors.append({
"name": era, "centroid": c, "n_faces": len(em.get("faces", [])),
"n_emb_used": len(emb_list), "n_emb_missing": n_missing_emb,
"dom_year": dom_year,
"year_min": min(years) if years else None,
"year_max": max(years) if years else None,
})
print(f"[anchor] {era}: n={len(em.get('faces', []))} emb_used={len(emb_list)} "
f"emb_miss={n_missing_emb} dom_year={dom_year} years=[{min(years) if years else '-'}..{max(years) if years else '-'}]",
file=sys.stderr)
# Find unbucketed faces in parent
unbucketed = []
for f in parent_faces:
src = f.get("source")
bbox = f.get("bbox")
if not src or not bbox:
continue
bbox_t = tuple(int(x) for x in bbox)
key1 = (alias_map.get(src, src), bbox_t)
key2 = (src, bbox_t)
if key1 in in_bucket or key2 in in_bucket:
continue
unbucketed.append(f)
print(f"[parent] {len(unbucketed)} unbucketed face entries (in {PARENT} but no era bucket)", file=sys.stderr)
# Score each unbucketed face against every anchor
proposals = []
skipped_no_emb = 0
skipped_no_year = 0
for f in unbucketed:
src = f["source"]
bbox = f["bbox"]
v = lookup_emb(rec_index, alias_map, src, bbox)
if v is None:
skipped_no_emb += 1
continue
y = get_year(src, exif_cache)
if y is None:
skipped_no_year += 1
exif_cache_dirty = True
continue
if src not in exif_cache:
exif_cache_dirty = True
# nearest anchor
best = None # (dist, idx)
for i, a in enumerate(anchors):
d = 1.0 - float(np.dot(a["centroid"], v))
if best is None or d < best[0]:
best = (d, i)
if best is None:
continue
dist, bidx = best
anchor = anchors[bidx]
year_delta = abs(y - anchor["dom_year"]) if anchor["dom_year"] is not None else None
accept = (dist <= DIST_MAX and year_delta is not None and year_delta <= YEAR_MAX)
proposals.append({
"png": f["png"],
"source": src,
"bbox": [int(x) for x in bbox],
"year": y,
"rank_in_parent": f.get("rank"),
"quality_composite": f.get("quality", {}).get("composite"),
"quality": f.get("quality", {}),
"best_anchor": anchor["name"],
"best_anchor_dom_year": anchor["dom_year"],
"centroid_dist": round(dist, 4),
"year_delta": year_delta,
"accept": bool(accept),
"all_anchor_dists": {
a["name"]: round(1.0 - float(np.dot(a["centroid"], v)), 4) for a in anchors
},
})
if exif_cache_dirty:
save_exif_cache(exif_cache)
print(f"[exif] cache flushed ({len(exif_cache)} entries total)", file=sys.stderr)
# Summarize
accepted = [p for p in proposals if p["accept"]]
rejected = [p for p in proposals if not p["accept"]]
by_anchor = Counter(p["best_anchor"] for p in accepted)
print(f"[summary] unbucketed={len(unbucketed)} scored={len(proposals)} "
f"accepted={len(accepted)} rejected={len(rejected)} "
f"skipped(no_emb={skipped_no_emb}, no_year={skipped_no_year})", file=sys.stderr)
for k, v in by_anchor.most_common():
print(f" {k}: +{v}", file=sys.stderr)
out = {
"thresholds": {"dist_max": DIST_MAX, "year_max": YEAR_MAX},
"anchors": [
{k: v for k, v in a.items() if k != "centroid"}
for a in anchors
],
"n_unbucketed": len(unbucketed),
"skipped": {"no_emb": skipped_no_emb, "no_year": skipped_no_year},
"proposals": sorted(proposals, key=lambda p: (not p["accept"], p["best_anchor"], -1 * (p["quality_composite"] or 0))),
"by_anchor": dict(by_anchor),
}
op = Path(args.out)
op.parent.mkdir(parents=True, exist_ok=True)
op.write_text(json.dumps(out, indent=2))
print(f"[done] {len(proposals)} proposals -> {op}", file=sys.stderr)
# ----------------------------- report -----------------------------
def cmd_report(args):
cand = json.loads(Path(args.candidates).read_text())
out_dir = Path(args.out)
thumbs_dir = out_dir / "thumbs"
thumbs_dir.mkdir(parents=True, exist_ok=True)
THUMB = 140
def make_thumb(png_relpath: str) -> str:
# png_relpath looks like "faces/0042.png"
src = ROOT / PARENT / png_relpath
name = Path(png_relpath).stem
dst = thumbs_dir / f"{name}.jpg"
if not dst.exists():
try:
img = Image.open(src).convert("RGB")
img.thumbnail((THUMB, THUMB), Image.LANCZOS)
img.save(dst, "JPEG", quality=82)
except Exception as e:
print(f"[thumb-skip] {src}: {e}", file=sys.stderr)
return ""
return f"thumbs/{name}.jpg"
# group accepted proposals by target anchor
by_anchor: dict[str, list] = {}
rejected = []
for p in cand["proposals"]:
if p["accept"]:
by_anchor.setdefault(p["best_anchor"], []).append(p)
else:
rejected.append(p)
rows = []
rows.append("<h1>faceset_001 age extension &mdash; review</h1>")
rows.append(f"<p>{cand['n_unbucketed']} unbucketed faces in {PARENT}; "
f"{sum(len(v) for v in by_anchor.values())} accepted / {len(rejected)} rejected; "
f"thresholds dist&le;{cand['thresholds']['dist_max']} AND |year_delta|&le;{cand['thresholds']['year_max']}.</p>")
nav = " · ".join(f"<a href='#{a}'>{a} (+{len(by_anchor[a])})</a>" for a in by_anchor) + " · <a href='#rejected'>rejected</a>"
rows.append(f"<div class='nav'>{nav}</div>")
for anchor_name in ACTIVE_ERAS:
if anchor_name not in by_anchor:
continue
items = by_anchor[anchor_name]
anchor_meta = next((a for a in cand["anchors"] if a["name"] == anchor_name), {})
rows.append(f"<section id='{anchor_name}' class='grp'>")
rows.append(f"<h2>{anchor_name} <small>(dom_year={anchor_meta.get('dom_year')}; "
f"existing n={anchor_meta.get('n_faces')}; +{len(items)} new)</small></h2>")
rows.append("<div class='cells'>")
for p in sorted(items, key=lambda x: (x["centroid_dist"], -1 * (x["quality_composite"] or 0))):
thumb = make_thumb(p["png"])
cls = "hi" if p["centroid_dist"] <= 0.30 else "mid"
rows.append(
f"<div class='cell'>"
f"<img src='{thumb}' loading='lazy' title='{p['png']}'>"
f"<div class='meta'>{p['png']}<br>year {p['year']}{p['year_delta']})<br>"
f"<span class='{cls}'>dist {p['centroid_dist']:.3f}</span></div>"
f"</div>"
)
rows.append("</div></section>")
if rejected:
rows.append("<section id='rejected' class='grp rej'>")
rows.append(f"<h2>rejected <small>({len(rejected)} faces don't fit any anchor)</small></h2>")
rows.append("<div class='cells'>")
for p in sorted(rejected, key=lambda x: x["centroid_dist"])[:200]:
thumb = make_thumb(p["png"])
why = []
if p["centroid_dist"] > cand['thresholds']['dist_max']:
why.append(f"dist {p['centroid_dist']:.2f}>{cand['thresholds']['dist_max']}")
if p["year_delta"] is None or p["year_delta"] > cand['thresholds']['year_max']:
why.append(f"{p['year_delta']}>{cand['thresholds']['year_max']}")
rows.append(
f"<div class='cell'>"
f"<img src='{thumb}' loading='lazy'>"
f"<div class='meta'>{p['png']}<br>year {p['year']} → best {p['best_anchor']}<br>"
f"<span class='lo'>{'; '.join(why)}</span></div>"
f"</div>"
)
if len(rejected) > 200:
rows.append(f"<p>...{len(rejected)-200} more truncated.</p>")
rows.append("</div></section>")
html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>faceset_001 age extension</title>
<style>
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
h1 {{ margin-top:0; }} h2 {{ margin:0; }}
small {{ color:#999; font-weight:normal; }}
section.grp {{ background:#1a1a1a; border-radius:6px; padding:12px; margin:12px 0; }}
section.grp.rej {{ border-left:4px solid #ff5050; }}
.cells {{ display:flex; flex-wrap:wrap; gap:6px; }}
.cell {{ background:#222; border-radius:4px; padding:4px; width:160px; font-size:11px; font-family:monospace; text-align:center; }}
.cell img {{ height:140px; width:auto; border-radius:3px; }}
.meta {{ padding-top:4px; line-height:1.3; }}
.hi {{ color:#5fa05f; font-weight:bold; }}
.mid {{ color:#ffb050; }}
.lo {{ color:#ff5050; }}
.nav {{ position:sticky; top:0; background:#111; padding:.5em 0; border-bottom:1px solid #333; font-size:13px; }}
a {{ color:#6cf; }}
</style></head>
<body>
{''.join(rows)}
</body></html>"""
out_html = out_dir / "index.html"
out_html.write_text(html)
print(f"[done] {out_html}", file=sys.stderr)
# ----------------------------- apply -----------------------------
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
import zipfile
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
for i, p in enumerate(pngs):
zf.write(p, arcname=f"{i:04d}.png")
def cmd_apply(args):
cand = json.loads(Path(args.candidates).read_text())
accepted = [p for p in cand["proposals"] if p["accept"]]
if args.dry_run:
from collections import Counter as C
by = C(p["best_anchor"] for p in accepted)
print(f"=== dry-run: {len(accepted)} assignments across {len(by)} anchors ===")
for k, v in by.most_common():
print(f" {k}: +{v}")
return
parent_dir = ROOT / PARENT
master_path = ROOT / "manifest.json"
master = json.loads(master_path.read_text())
facesets_by_name = {f["name"]: f for f in master.get("facesets", [])}
by_anchor: dict[str, list] = {}
for p in accepted:
by_anchor.setdefault(p["best_anchor"], []).append(p)
total_added = 0
for anchor_name, props in by_anchor.items():
ed = ROOT / anchor_name
em_path = ed / "manifest.json"
em = json.loads(em_path.read_text())
existing = list(em.get("faces", []))
# gather new entries with their source PNG paths in faceset_001/faces/
new_with_src = []
for p in props:
src_png = parent_dir / p["png"]
if not src_png.exists():
print(f"[warn] missing parent PNG {src_png}; skip", file=sys.stderr)
continue
face_entry = {
"source": p["source"],
"bbox": p["bbox"],
"quality": p["quality"],
"exif_year": p["year"],
"centroid_dist_at_assign": p["centroid_dist"],
"year_delta_at_assign": p["year_delta"],
"extended_from_parent": True,
}
new_with_src.append((face_entry, src_png))
# combine; rank by quality.composite desc (existing entries already have rank,
# but we re-rank globally so new entries slot in by quality)
combined: list[tuple[dict, Path | None]] = []
for f in existing:
combined.append((f, None))
combined.extend(new_with_src)
combined.sort(key=lambda x: -(x[0].get("quality", {}).get("composite") or 0))
# stage fresh
staging = ed / "_faces_new"
if staging.exists():
shutil.rmtree(staging)
staging.mkdir()
new_face_entries = []
for new_rank, (face, src_png_or_none) in enumerate(combined, start=1):
new_name = f"{new_rank:04d}.png"
if src_png_or_none is None:
# existing entry: copy from current era bucket faces/
old_name = Path(face["png"]).name
src = ed / "faces" / old_name
if not src.exists():
print(f"[warn] {anchor_name}: missing existing PNG {src}; skip", file=sys.stderr)
continue
shutil.copy2(src, staging / new_name)
else:
shutil.copy2(src_png_or_none, staging / new_name)
face = dict(face)
face["rank"] = new_rank
face["png"] = f"faces/{new_name}"
new_face_entries.append(face)
# swap dirs
old_holding = ed / "_faces_old"
if old_holding.exists():
shutil.rmtree(old_holding)
(ed / "faces").rename(old_holding)
staging.rename(ed / "faces")
shutil.rmtree(old_holding)
# re-zip .fsz
survivor_pngs = sorted((ed / "faces").glob("*.png"))
top_n = em.get("top_n", 30)
top_n_eff = min(top_n, len(survivor_pngs))
for old in ed.glob("*.fsz"):
old.unlink()
top_fsz_name = f"{anchor_name}_top{top_n_eff}.fsz"
all_fsz_name = f"{anchor_name}_all.fsz"
_zip_png_list(survivor_pngs[:top_n_eff], ed / top_fsz_name)
if len(survivor_pngs) > top_n_eff:
_zip_png_list(survivor_pngs, ed / all_fsz_name)
all_fsz_used = all_fsz_name
else:
all_fsz_used = None
# update local + master manifests
em["faces"] = new_face_entries
em["exported"] = len(new_face_entries)
em["fsz_top"] = top_fsz_name
em["fsz_all"] = all_fsz_used
em["top_n"] = top_n_eff
em.setdefault("age_extend_history", []).append({
"added": len(new_with_src),
"thresholds": cand["thresholds"],
})
em_path.write_text(json.dumps(em, indent=2))
if anchor_name in facesets_by_name:
facesets_by_name[anchor_name]["exported"] = len(new_face_entries)
facesets_by_name[anchor_name]["fsz_top"] = top_fsz_name
facesets_by_name[anchor_name]["fsz_all"] = all_fsz_used
facesets_by_name[anchor_name]["top_n"] = top_n_eff
added_here = len(new_with_src)
total_added += added_here
print(f"[applied] {anchor_name}: +{added_here} (now {len(new_face_entries)} faces)", file=sys.stderr)
# rewrite master with ordering preserved
new_facesets = []
for entry in master.get("facesets", []):
new_facesets.append(facesets_by_name.get(entry["name"], entry))
master["facesets"] = new_facesets
master.setdefault("age_extend_runs", []).append({
"parent": PARENT,
"thresholds": cand["thresholds"],
"anchors": list(by_anchor.keys()),
"added_total": total_added,
})
tmp = master_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(master, indent=2))
tmp.replace(master_path)
print(f"[done] +{total_added} faces across {len(by_anchor)} anchors", file=sys.stderr)
# ----------------------------- main -----------------------------
def main():
ap = argparse.ArgumentParser()
sub = ap.add_subparsers(dest="cmd", required=True)
a = sub.add_parser("analyze")
a.add_argument("--out", required=True)
a.set_defaults(func=cmd_analyze)
r = sub.add_parser("report")
r.add_argument("--candidates", required=True)
r.add_argument("--out", required=True)
r.set_defaults(func=cmd_report)
p = sub.add_parser("apply")
p.add_argument("--candidates", required=True)
p.add_argument("--dry-run", action="store_true")
p.set_defaults(func=cmd_apply)
args = ap.parse_args()
args.func(args)
if __name__ == "__main__":
main()
@@ -0,0 +1,485 @@
#!/usr/bin/env python3
"""Age-split person_001 into era-specific facesets.
Workflow:
1. Seed a clean person_001 centroid from the existing curated 707-face
`facesets_swap_ready/faceset_001/`.
2. Wide-recovery scan: pull every face record under /mnt/x/src/{nl, lzbkp_red}
from `nl_full.npz` with cos-dist <= 0.55 from the seed centroid.
3. Apply export-swap-style per-face quality gates.
4. One re-centroid + 0.50 tighten pass to absorb the recovery without drift.
5. Agglomerative sub-clustering at cos-dist 0.35.
6. Post-merge sub-clusters whose centroids <0.30 AND whose dominant EXIF
years are within 2 years.
7. Read EXIF DateTimeOriginal for each face's source path; era label =
(p10 year, p90 year) over dated faces.
8. Undated faces are assigned to the nearest era by embedding distance.
9. For each era: composite-quality rank, single-face PNG crops, .fsz bundles
(top-N and _all if era > top_n). `<era>_<range>.txt` marker file. Eras
with <20 face records get a `THIN.txt` marker.
10. Append era entries into the canonical
`facesets_swap_ready/manifest.json` next to the existing 19.
"""
from __future__ import annotations
import json
import shutil
import sys
from collections import Counter
from pathlib import Path
import numpy as np
from PIL import Image, ExifTags, ImageOps
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import ( # noqa: E402
QUALITY_WEIGHTS,
_crop_face_square,
_zip_png_list,
compute_quality,
load_cache,
load_rgb_bgr,
)
# ---- config -------------------------------------------------------------- #
CACHE = REPO / "work" / "cache" / "nl_full.npz"
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
FS001 = SWAP_READY / "faceset_001"
SCAN_ROOTS = [
Path("/mnt/x/src/nl"),
Path("/mnt/x/src/lzbkp_red"),
]
# Recovery + identity refinement
RECOVERY_THRESHOLD = 0.55 # initial centroid match
TIGHTEN_THRESHOLD = 0.50 # post-recentroid drift trim
# Quality gates (mirror export-swap defaults)
MIN_FACE_SHORT = 100
# Sub-cluster
SUBCLUSTER_THRESHOLD = 0.35
# Anchor-based fragment assignment (replaces transitive union-find merge):
ANCHOR_MIN_SIZE = 20 # sub-cluster size to qualify as an era anchor
FRAGMENT_CENTROID_MAX = 0.40 # small fragment may join an anchor only if cent_dist <=
FRAGMENT_YEAR_MAX = 5 # AND |dom_year_anchor - dom_year_fragment| <=
# Output
TOP_N = 30
PAD_RATIO = 0.5
OUT_SIZE = 512
THIN_THRESHOLD = 20
# EXIF cache (so re-runs skip the 30-min Windows-mount EXIF read)
EXIF_CACHE = REPO / "work" / "cache" / "age_split_exif.json"
# ---- helpers ------------------------------------------------------------- #
def _normalize(v: np.ndarray) -> np.ndarray:
n = np.linalg.norm(v)
return v / n if n > 0 else v
def _under(roots: list[Path], p: str) -> bool:
for r in roots:
rs = str(r).rstrip("/") + "/"
if p == str(r) or p.startswith(rs):
return True
return False
def _record_in_roots(rec: dict, roots: list[Path], path_aliases: dict) -> bool:
if _under(roots, rec["path"]):
return True
for alias in path_aliases.get(rec["path"], []):
if _under(roots, alias):
return True
return False
def exif_year(path: Path) -> int | None:
try:
with Image.open(path) as im:
exif = im._getexif()
if not exif:
return None
for tag_id, val in exif.items():
tag = ExifTags.TAGS.get(tag_id, tag_id)
if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
return int(val[:4])
except Exception:
return None
return None
def label_for_era(years: list[int]) -> str:
"""Era label as a year-range string. Falls back to 'undated' if no years."""
if not years:
return "undated"
ys = sorted(years)
lo = ys[len(ys) // 10] if len(ys) >= 10 else ys[0]
hi = ys[-(len(ys) // 10) - 1] if len(ys) >= 10 else ys[-1]
if lo == hi:
return str(lo)
# Compact year range like 2011-13 if same century, else 2009-2024.
if (lo // 100) == (hi // 100):
return f"{lo}-{hi % 100:02d}"
return f"{lo}-{hi}"
# ---- phase 1 + 2: seed centroid + recovery scan ------------------------- #
def main() -> None:
if not FS001.exists():
raise SystemExit(f"missing seed faceset: {FS001}")
print("=== loading cache ===")
emb, meta, _src, _proc, path_aliases = load_cache(CACHE)
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"emb/meta mismatch: {len(face_records)} vs {len(emb)}")
bbox_idx = {(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)}
seed_manifest = json.loads((FS001 / "manifest.json").read_text())
seed_face_keys = [(f["source"], tuple(f.get("bbox") or ())) for f in seed_manifest["faces"]]
seed_indices = [bbox_idx[k] for k in seed_face_keys if k in bbox_idx]
print(f"seed faces from faceset_001: {len(seed_indices)} (manifest had {len(seed_face_keys)})")
seed_centroid = _normalize(emb[seed_indices].mean(axis=0))
# Recovery: every face record under nl/ + lzbkp_red/ within RECOVERY_THRESHOLD.
candidate_idxs = [
i for i, rec in enumerate(face_records)
if _record_in_roots(rec, SCAN_ROOTS, path_aliases)
]
print(f"\ncandidates under {[str(r) for r in SCAN_ROOTS]}: {len(candidate_idxs)}")
cand_emb = emb[candidate_idxs]
cand_dists = 1.0 - cand_emb @ seed_centroid
recovered_local = [k for k, d in enumerate(cand_dists) if d <= RECOVERY_THRESHOLD]
recovered = [candidate_idxs[k] for k in recovered_local]
print(f"recovered at cos-dist <= {RECOVERY_THRESHOLD}: {len(recovered)}")
# Quality gate.
qualified = []
drop_size = drop_blur = drop_det = 0
for i in recovered:
r = face_records[i]
if r.get("face_short", 0) < MIN_FACE_SHORT:
drop_size += 1
continue
if r.get("blur", 0.0) < 40.0:
drop_blur += 1
continue
if r.get("det_score", 0.0) < 0.6:
drop_det += 1
continue
qualified.append(i)
print(f"after quality gate: {len(qualified)} (drop size={drop_size} blur={drop_blur} det={drop_det})")
# One tightening pass: re-centroid on qualified, drop anyone > TIGHTEN_THRESHOLD.
qcent = _normalize(emb[qualified].mean(axis=0))
qd = 1.0 - emb[qualified] @ qcent
tight = [qualified[k] for k, d in enumerate(qd) if d <= TIGHTEN_THRESHOLD]
print(f"after re-centroid tighten ({TIGHTEN_THRESHOLD}): {len(tight)}")
# ---- phase 5: sub-cluster -------------------------------------------- #
print("\n=== sub-clustering ===")
from sklearn.cluster import AgglomerativeClustering
E = emb[tight]
sims = E @ E.T
dists = 1.0 - sims
# Floor numerical noise.
np.fill_diagonal(dists, 0.0)
dists = np.maximum(dists, 0.0)
ac = AgglomerativeClustering(
n_clusters=None,
metric="precomputed",
linkage="average",
distance_threshold=SUBCLUSTER_THRESHOLD,
)
labels = ac.fit_predict(dists)
sub_sizes = Counter(labels)
print(f"raw sub-clusters: {len(sub_sizes)} (sizes: top10={sorted(sub_sizes.values(), reverse=True)[:10]})")
# Per-cluster: indices, centroid, EXIF years.
cluster_indices: dict[int, list[int]] = {}
for k, lab in enumerate(labels):
cluster_indices.setdefault(int(lab), []).append(tight[k])
cluster_centroids: dict[int, np.ndarray] = {}
for lab, idxs in cluster_indices.items():
cluster_centroids[lab] = _normalize(emb[idxs].mean(axis=0))
print("\n=== EXIF years (one read per source path; cached) ===")
unique_paths = sorted({face_records[i]["path"] for i in tight})
if EXIF_CACHE.exists():
cached = json.loads(EXIF_CACHE.read_text())
else:
cached = {}
path_year: dict[str, int | None] = {}
new_reads = 0
for p in unique_paths:
if p in cached:
path_year[p] = cached[p]
else:
y = exif_year(Path(p))
path_year[p] = y
cached[p] = y
new_reads += 1
EXIF_CACHE.parent.mkdir(parents=True, exist_ok=True)
EXIF_CACHE.write_text(json.dumps(cached, indent=0))
dated = sum(1 for v in path_year.values() if v is not None)
print(f" EXIF cache: {len(cached)} entries, {new_reads} new reads, "
f"{dated}/{len(unique_paths)} dated")
cluster_years: dict[int, list[int]] = {}
cluster_dom_year: dict[int, int | None] = {}
for lab, idxs in cluster_indices.items():
ys = []
for i in idxs:
y = path_year.get(face_records[i]["path"])
if y is not None:
ys.append(y)
cluster_years[lab] = ys
cluster_dom_year[lab] = (Counter(ys).most_common(1)[0][0]) if ys else None
# ---- phase 6: anchor-based fragment assignment ----------------------- #
# Each sub-cluster of size >= ANCHOR_MIN_SIZE is an "era anchor". Smaller
# fragments are assigned to the single nearest anchor IFF (centroid distance
# <= FRAGMENT_CENTROID_MAX AND |dom_year delta| <= FRAGMENT_YEAR_MAX).
# Anchors do NOT merge with each other — that prevented transitive year drift
# observed when union-find was used. Standalone fragments stay as their own
# (likely THIN) eras.
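# In effect: a fragment near two anchors goes to the closer one in embedding
# space, but only if its dominant EXIF year is also close; a fragment that
# matches an anchor's embedding but sits years away stays its own era, and
# fragments with no dated photos always stand alone.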
print("\n=== anchor-based assignment ===")
anchors = [lab for lab, idxs in cluster_indices.items() if len(idxs) >= ANCHOR_MIN_SIZE]
fragments = [lab for lab in cluster_indices if lab not in anchors]
anchors.sort(key=lambda l: -len(cluster_indices[l]))
print(f"anchors (size>={ANCHOR_MIN_SIZE}): {len(anchors)}; fragments: {len(fragments)}")
for a in anchors:
print(f" anchor sub {a}: size={len(cluster_indices[a])} dom_year={cluster_dom_year[a]}")
if anchors:
a_cent = np.stack([cluster_centroids[a] for a in anchors])
assignments: dict[int, int] = {a: a for a in anchors} # anchor -> self
unassigned: list[int] = []
for f in fragments:
f_cent = cluster_centroids[f]
f_year = cluster_dom_year[f]
# cosine distances to each anchor
cd = 1.0 - a_cent @ f_cent
# year distance (inf if either dom-year unknown)
yd = []
for a in anchors:
ay = cluster_dom_year[a]
if f_year is None or ay is None:
yd.append(float("inf"))
else:
yd.append(abs(f_year - ay))
yd = np.array(yd)
ok = (cd <= FRAGMENT_CENTROID_MAX) & (yd <= FRAGMENT_YEAR_MAX)
if not ok.any():
unassigned.append(f)
continue
# nearest qualifying anchor by centroid distance.
cd_masked = np.where(ok, cd, np.inf)
best = int(np.argmin(cd_masked))
assignments[f] = anchors[best]
print(f" assigned fragments: {sum(1 for k,v in assignments.items() if k!=v)}/{len(fragments)}; "
f"unassigned (standalone): {len(unassigned)}")
else:
print(" no anchors; every sub-cluster stands alone")
assignments = {lab: lab for lab in cluster_indices}
unassigned = []
merged: dict[int, list[int]] = {}
for lab, idxs in cluster_indices.items():
root = assignments.get(lab, lab)
merged.setdefault(root, []).extend(idxs)
merged_sizes = sorted(((r, len(v)) for r, v in merged.items()), key=lambda kv: -kv[1])
print(f"era buckets: {len(merged)} (top10 sizes: {[s for _, s in merged_sizes[:10]]})")
# Recompute centroid + dom-year for merged eras.
era_indices: dict[int, list[int]] = merged
era_centroids: dict[int, np.ndarray] = {}
era_year_label: dict[int, str] = {}
era_years_full: dict[int, list[int]] = {}
for root, idxs in era_indices.items():
era_centroids[root] = _normalize(emb[idxs].mean(axis=0))
ys = []
for i in idxs:
y = path_year.get(face_records[i]["path"])
if y is not None:
ys.append(y)
era_years_full[root] = ys
era_year_label[root] = label_for_era(ys)
# ---- phase 8: assign undated faces (no-EXIF) to nearest era ---------- #
# NB: undated = path's EXIF was None. For era assignment we use embedding,
# but the year *label* is unaffected because labels come from dated faces only.
# Each undated face is already placed in some sub-cluster by its embedding; here we just report the count.
n_undated = sum(1 for i in tight if path_year.get(face_records[i]["path"]) is None)
print(f"undated face records (no EXIF): {n_undated}/{len(tight)} (placed by embedding only)")
# ---- phase 9: per-era export ----------------------------------------- #
import cv2
print("\n=== exporting era bundles ===")
new_manifest_entries: list[dict] = []
eras_sorted = sorted(era_indices.items(), key=lambda kv: -len(kv[1]))
for root, idxs in eras_sorted:
size = len(idxs)
label = era_year_label[root]
era_name = f"faceset_001_{label}"
out_dir = SWAP_READY / era_name
# Disambiguate same-label collisions (e.g. two distinct embedding eras both 2019).
collision = 2
while out_dir.exists():
era_name = f"faceset_001_{label}_v{collision}"
out_dir = SWAP_READY / era_name
collision += 1
faces_dir = out_dir / "faces"
faces_dir.mkdir(parents=True, exist_ok=True)
# Composite quality + rank.
ranked = []
for ci in idxs:
rec = face_records[ci]
q = compute_quality(rec)
ranked.append({"cache_idx": ci, "rec": rec, "quality": q})
# Dedup by source path within this era — keep highest-quality face per path.
seen_path: dict[str, dict] = {}
for r in ranked:
p = r["rec"]["path"]
prev = seen_path.get(p)
if prev is None or r["quality"]["composite"] > prev["quality"]["composite"]:
seen_path[p] = r
unique = sorted(seen_path.values(), key=lambda r: -r["quality"]["composite"])
# Materialize crops.
written: list[Path] = []
face_entries: list[dict] = []
for rank, r in enumerate(unique, start=1):
rec = r["rec"]
src = Path(rec["path"])
if not src.exists():
continue
rgb, _ = load_rgb_bgr(src)
if rgb is None:
continue
crop = _crop_face_square(rgb, rec["bbox"], PAD_RATIO, OUT_SIZE)
png = faces_dir / f"{rank:04d}.png"
cv2.imwrite(str(png), cv2.cvtColor(crop, cv2.COLOR_RGB2BGR))
written.append(png)
face_entries.append({
"rank": rank,
"png": f"faces/{rank:04d}.png",
"source": rec["path"],
"aliases": path_aliases.get(rec["path"], []),
"bbox": rec["bbox"],
"face_short": rec.get("face_short"),
"det_score": rec.get("det_score"),
"blur": rec.get("blur"),
"pose": rec.get("pose"),
"exif_year": path_year.get(rec["path"]),
"quality": r["quality"],
})
if not written:
print(f"[{era_name}] empty after materialization; skipping")
shutil.rmtree(out_dir)
continue
# Bundle.
top_n_eff = min(TOP_N, len(written))
top_fsz = out_dir / f"{era_name}_top{top_n_eff}.fsz"
_zip_png_list(written[:top_n_eff], top_fsz)
all_fsz: Path | None = None
if len(written) > top_n_eff:
all_fsz = out_dir / f"{era_name}_all.fsz"
_zip_png_list(written, all_fsz)
# Per-era manifest.
ys = era_years_full[root]
year_summary = {
"label": label,
"year_count": len(ys),
"year_min": min(ys) if ys else None,
"year_max": max(ys) if ys else None,
"year_dist": dict(Counter(ys).most_common()),
}
is_thin = size < THIN_THRESHOLD
manifest = {
"name": era_name,
"parent_identity": "faceset_001",
"era": year_summary,
"input_face_records": size,
"exported": len(written),
"top_n": top_n_eff,
"fsz_top": top_fsz.name,
"fsz_all": all_fsz.name if all_fsz else None,
"thin": is_thin,
"quality_weights": QUALITY_WEIGHTS,
"params": {
"recovery_threshold": RECOVERY_THRESHOLD,
"tighten_threshold": TIGHTEN_THRESHOLD,
"subcluster_threshold": SUBCLUSTER_THRESHOLD,
"anchor_min_size": ANCHOR_MIN_SIZE,
"fragment_centroid_max": FRAGMENT_CENTROID_MAX,
"fragment_year_max": FRAGMENT_YEAR_MAX,
"min_face_short": MIN_FACE_SHORT,
},
"faces": face_entries,
}
(out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
# Per-era marker file (always: <label>.txt for human reference).
(out_dir / f"{label}.txt").write_text(
f"{era_name}\n\nEra: {label}\n"
f"Year span: {year_summary['year_min']}..{year_summary['year_max']} "
f"({year_summary['year_count']} dated of {size} faces)\n"
f"Sub-cluster size: {size} face records, {len(unique)} unique source paths, "
f"{len(written)} exported PNGs.\n"
)
if is_thin:
(out_dir / "THIN.txt").write_text(
f"This era has only {size} face records (<{THIN_THRESHOLD}). "
f"Averaged embedding may be dominated by single-photo idiosyncrasies.\n"
)
# Append to top-level manifest summary.
new_manifest_entries.append({k: v for k, v in manifest.items() if k != "faces"})
thin_tag = " THIN" if is_thin else ""
print(
f"[{era_name}] size={size} unique_paths={len(unique)} exported={len(written)} "
f"top{top_n_eff}{thin_tag}"
)
# ---- merge into top-level manifest ----------------------------------- #
top_path = SWAP_READY / "manifest.json"
existing = json.loads(top_path.read_text()) if top_path.exists() else {"facesets": []}
existing_names = {fs.get("name") for fs in existing.get("facesets", [])}
appended = 0
for entry in new_manifest_entries:
if entry["name"] in existing_names:
continue
existing["facesets"].append(entry)
appended += 1
top_path.write_text(json.dumps(existing, indent=2))
print(f"\nAppended {appended} era entries to {top_path}")
print(f"Done. {len(new_manifest_entries)} era buckets emitted (faceset_001/ left untouched).")
if __name__ == "__main__":
main()
+323
@@ -0,0 +1,323 @@
#!/usr/bin/env python3
"""Build per-folder facesets from hand-sorted source directories.
Phase B + C of the folder-import workflow:
- Filter cache records into per-folder identity sets, run 2-pass centroid+outlier
rejection so non-target faces in group photos drop out.
- Route every osrc face record to every trusted-folder identity within a tight
cosine cutoff (multi-identity osrc photos land in multiple facesets;
cmd_export_swap then per-bbox-filters so each faceset crops only the matching face).
- Synthesize a refine_manifest.json compatible with cmd_export_swap.
- Invoke cmd_export_swap to emit faceset_NNN/ dirs into a temp output dir.
- Rename .fsz bundles after the source folder, replace NAME.txt with foldername.txt,
move dirs into the canonical facesets_swap_ready/, merge top-level manifest
preserving existing faceset_001..012 entries.
"""
from __future__ import annotations
import json
import shutil
import sys
from pathlib import Path
import numpy as np
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import ( # noqa: E402
cmd_export_swap,
load_cache,
)
# ---- config -------------------------------------------------------------- #
CACHE = REPO / "work" / "cache" / "nl_full.npz"
OUT_FINAL = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
OUT_TMP = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready_new")
SYNTH_MANIFEST = REPO / "work" / "synthetic_refine_manifest.json"
# Trusted folders, in numbering order. faceset_NNN starts at 013.
TRUSTED: list[tuple[str, Path]] = [
("k", Path("/mnt/x/src/k")),
("m", Path("/mnt/x/src/m")),
("mi", Path("/mnt/x/src/mi")),
("mir", Path("/mnt/x/src/mir")),
("s", Path("/mnt/x/src/s")),
("sab", Path("/mnt/x/src/sab")),
("t", Path("/mnt/x/src/t")),
]
START_NNN = 13
OSRC_DIR = Path("/mnt/x/src/osrc")
# Centroid-build outlier passes (loose then tight).
PASS1_THRESHOLD = 0.55
PASS2_THRESHOLD = 0.45
# osrc routing cutoff (tight).
OSRC_THRESHOLD = 0.45
# export-swap params (defaults from sort_faces.py).
TOP_N = 30
EXPORT_OUTLIER_THRESHOLD = 0.45
PAD_RATIO = 0.5
OUT_SIZE = 512
MIN_FACE_SHORT = 100
# ---- helpers ------------------------------------------------------------- #
def _normalize_rows(mat: np.ndarray) -> np.ndarray:
n = np.linalg.norm(mat, axis=1, keepdims=True)
n[n == 0] = 1.0
return mat / n
def _centroid(vecs: np.ndarray) -> np.ndarray:
c = vecs.mean(axis=0)
n = np.linalg.norm(c)
return c / n if n > 0 else c
def _under(folder: Path, p: str) -> bool:
"""True iff path string p lies under folder."""
fs = str(folder).rstrip("/") + "/"
return p == str(folder) or p.startswith(fs)
def _record_in_folder(rec: dict, folder: Path, path_aliases: dict[str, list[str]]) -> bool:
if _under(folder, rec["path"]):
return True
for alias in path_aliases.get(rec["path"], []):
if _under(folder, alias):
return True
return False
# ---- phase B: identity centroids + osrc routing ------------------------- #
def build_synthetic_manifest() -> tuple[dict, dict[str, np.ndarray], dict[str, dict]]:
emb, meta, _src_root, _processed, path_aliases = load_cache(CACHE)
# emb rows are aligned with the noface-filtered records (the invariant
# cmd_export_swap relies on), so indices into face_records index emb directly.
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")
print(f"Loaded cache: {len(face_records)} face records.")
# Per-folder identity centroids.
centroids: dict[str, np.ndarray] = {}
folder_paths: dict[str, set[str]] = {}
folder_stats: dict[str, dict] = {}
for label, folder in TRUSTED:
idxs = [i for i, m in enumerate(face_records) if _record_in_folder(m, folder, path_aliases)]
if not idxs:
print(f"[{label}] no face records found under {folder}; skipping")
continue
vecs = emb[idxs]
cent = _centroid(vecs)
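# Two-pass trim: the loose pass sheds gross non-target faces (e.g. other
# people in group photos), then the centroid is rebuilt and the tight pass
# enforces identity purity.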
# Pass 1: drop loose outliers.
d1 = 1.0 - vecs @ cent
keep1 = [idxs[k] for k, dist in enumerate(d1) if dist <= PASS1_THRESHOLD]
if not keep1:
print(f"[{label}] every face was a pass-1 outlier; using all faces as-is")
keep1 = idxs
cent = _centroid(emb[keep1])
# Pass 2: tight outlier rejection.
d2 = 1.0 - emb[keep1] @ cent
keep2 = [keep1[k] for k, dist in enumerate(d2) if dist <= PASS2_THRESHOLD]
if not keep2:
print(f"[{label}] every face was a pass-2 outlier; falling back to pass-1")
keep2 = keep1
cent = _centroid(emb[keep2])
centroids[label] = cent
# Use canonical path strings; export-swap will look up indices by path.
folder_paths[label] = {face_records[i]["path"] for i in keep2}
folder_stats[label] = {
"folder": str(folder),
"input_records": len(idxs),
"after_pass1": len(keep1),
"after_pass2": len(keep2),
"unique_paths": len(folder_paths[label]),
}
print(
f"[{label}] in={len(idxs)} pass1={len(keep1)} pass2={len(keep2)} "
f"unique_paths={len(folder_paths[label])}"
)
# osrc routing: every osrc face -> every centroid within OSRC_THRESHOLD.
osrc_idxs = [
i for i, m in enumerate(face_records)
if _record_in_folder(m, OSRC_DIR, path_aliases)
]
print(f"\nosrc: {len(osrc_idxs)} face records to route")
if osrc_idxs and centroids:
labels = list(centroids.keys())
cent_mat = np.stack([centroids[lab] for lab in labels])
# Build sims: (n_osrc, n_labels)
osrc_emb = emb[osrc_idxs]
sims = osrc_emb @ cent_mat.T # cosine similarity (vectors already normalized)
dists = 1.0 - sims
per_label_added: dict[str, int] = {lab: 0 for lab in labels}
for row, ci in enumerate(osrc_idxs):
p = face_records[ci]["path"]
for col, lab in enumerate(labels):
if dists[row, col] <= OSRC_THRESHOLD:
if p not in folder_paths[lab]:
folder_paths[lab].add(p)
per_label_added[lab] += 1
for lab in labels:
folder_stats[lab]["osrc_paths_added"] = per_label_added[lab]
print(f"[{lab}] osrc faces routed: +{per_label_added[lab]} unique paths")
# Build synthetic refine_manifest.
facesets: list[dict] = []
for n, (label, _folder) in enumerate(TRUSTED, start=START_NNN):
if label not in folder_paths:
continue
facesets.append({
"name": f"faceset_{n:03d}",
"label": label,
"image_count": len(folder_paths[label]),
"images": sorted(folder_paths[label]),
})
manifest = {
"params": {
"pass1_threshold": PASS1_THRESHOLD,
"pass2_threshold": PASS2_THRESHOLD,
"osrc_threshold": OSRC_THRESHOLD,
"min_face_short": MIN_FACE_SHORT,
},
"facesets": facesets,
"_per_folder_stats": folder_stats,
}
SYNTH_MANIFEST.write_text(json.dumps(manifest, indent=2))
print(f"\nSynthetic manifest -> {SYNTH_MANIFEST}")
return manifest, centroids, folder_stats
# ---- phase C: export + rename + merge ----------------------------------- #
def export_and_relocate(manifest: dict) -> None:
if OUT_TMP.exists():
shutil.rmtree(OUT_TMP)
OUT_TMP.mkdir(parents=True)
print(f"\nRunning cmd_export_swap -> {OUT_TMP}")
cmd_export_swap(
cache_path=CACHE,
refine_manifest_path=SYNTH_MANIFEST,
raw_manifest_path=None,
out_dir=OUT_TMP,
top_n=TOP_N,
outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
pad_ratio=PAD_RATIO,
out_size=OUT_SIZE,
include_candidates=False,
candidate_match_threshold=0.55,
candidate_min_score=0.40,
min_face_short=MIN_FACE_SHORT,
)
# Map name -> label from the synthetic manifest.
name_to_label = {fs["name"]: fs["label"] for fs in manifest["facesets"]}
# Load the temp top-level manifest (export-swap just wrote it).
new_top = json.loads((OUT_TMP / "manifest.json").read_text())
new_entries = new_top.get("facesets", [])
# Per-faceset rename + relocate.
for fs_meta in new_entries:
name = fs_meta["name"]
label = name_to_label.get(name)
src_dir = OUT_TMP / name
if not src_dir.exists():
print(f"[{name}] export dir missing; skipping")
continue
# Rename .fsz bundles to <label>_*.fsz; record updated names.
renames = {}
for fsz in sorted(src_dir.glob(f"{name}_top*.fsz")):
new = src_dir / fsz.name.replace(name + "_", label + "_", 1)
fsz.rename(new)
renames[fsz.name] = new.name
for fsz in sorted(src_dir.glob(f"{name}_all.fsz")):
new = src_dir / fsz.name.replace(name + "_", label + "_", 1)
fsz.rename(new)
renames[fsz.name] = new.name
# Replace NAME.txt placeholder with <label>.txt.
nametxt = src_dir / "NAME.txt"
if nametxt.exists():
nametxt.unlink()
(src_dir / f"{label}.txt").write_text(
f"{label}\n\nSource: /mnt/x/src/{label} (hand-sorted) + matched osrc faces.\n"
)
# Update fs_meta entry's fsz fields to point at the renamed files.
for k in ("fsz_top", "fsz_all"):
if fs_meta.get(k) and fs_meta[k] in renames:
fs_meta[k] = renames[fs_meta[k]]
fs_meta["label"] = label
# Move the directory into the final output.
dst_dir = OUT_FINAL / name
if dst_dir.exists():
print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
continue
shutil.move(str(src_dir), str(dst_dir))
print(f"[{name}] -> {dst_dir} (label={label})")
# Merge top-level manifest, preserving existing faceset_001..012 entries.
final_manifest_path = OUT_FINAL / "manifest.json"
if final_manifest_path.exists():
existing = json.loads(final_manifest_path.read_text())
else:
existing = {"facesets": []}
existing_names = {fs["name"] for fs in existing.get("facesets", [])}
appended = 0
for entry in new_entries:
if entry["name"] in existing_names:
print(f"[manifest] {entry['name']} already in top-level manifest; not duplicating")
continue
existing["facesets"].append(entry)
appended += 1
# Carry over export-swap params if not already present.
for k in ("quality_weights", "outlier_threshold", "top_n", "pad_ratio", "out_size"):
if k not in existing and k in new_top:
existing[k] = new_top[k]
final_manifest_path.write_text(json.dumps(existing, indent=2))
print(f"\nMerged manifest: appended {appended} entries -> {final_manifest_path}")
# Clean up temp dir if empty.
if OUT_TMP.exists() and not any(OUT_TMP.iterdir()):
OUT_TMP.rmdir()
# otherwise leave the temp manifest.json behind for inspection
# ---- main ---------------------------------------------------------------- #
def main() -> None:
manifest, _centroids, _stats = build_synthetic_manifest()
if not manifest.get("facesets"):
print("No facesets to build; nothing to do.")
return
export_and_relocate(manifest)
print("\nDone.")
if __name__ == "__main__":
main()
+151
@@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""Probe faceset_001 for age-sortable sub-structure.
Three questions:
1. How spread is the embedding cloud? (intra-cluster pairwise distance histogram)
2. Does it split naturally into sub-clusters at a tight threshold?
3. Do the sub-clusters correspond to distinct time periods (EXIF DateTimeOriginal)?
"""
from __future__ import annotations
import json
import sys
from collections import Counter
from pathlib import Path
import numpy as np
from PIL import Image, ExifTags
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import load_cache # noqa: E402
CACHE = REPO / "work" / "cache" / "nl_full.npz"
FS001 = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready/faceset_001")
def exif_year(path: Path) -> int | None:
try:
with Image.open(path) as im:
exif = im._getexif()
if not exif:
return None
for tag_id, val in exif.items():
tag = ExifTags.TAGS.get(tag_id, tag_id)
if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
return int(val[:4])
except Exception:
return None
return None
def main() -> None:
manifest = json.loads((FS001 / "manifest.json").read_text())
faces = manifest["faces"]
paths = [Path(f["source"]) for f in faces]
print(f"faceset_001 has {len(paths)} ranked faces in the swap-ready set")
# Pull embeddings for these face records by (path, bbox).
emb, meta, _src, _proc, _aliases = load_cache(CACHE)
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit("emb/meta mismatch")
bbox_key = {}
for i, m in enumerate(face_records):
bbox_key[(m["path"], tuple(m.get("bbox") or ()))] = i
selected = []
missing = 0
for f in faces:
key = (f["source"], tuple(f.get("bbox") or ()))
i = bbox_key.get(key)
if i is None:
missing += 1
continue
selected.append(i)
print(f"matched {len(selected)} embeddings (missing {missing})")
E = emb[selected]
# All embeddings are L2-normalized -> cosine dist = 1 - dot.
sims = E @ E.T
dists = 1.0 - sims
iu = np.triu_indices_from(dists, k=1)
pw = dists[iu]
print("\n-- intra-cluster pairwise cosine distance --")
print(f" n_pairs = {len(pw):,}")
print(f" mean = {pw.mean():.3f}")
print(f" median = {np.median(pw):.3f}")
print(f" p10/p25/p75/p90 = {np.percentile(pw, [10,25,75,90])}")
print(f" max = {pw.max():.3f}")
# Histogram bins around interesting thresholds.
edges = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0, 1.4]
hist, _ = np.histogram(pw, bins=edges)
print("\n histogram (cos-dist bin -> pair count):")
for lo, hi, c in zip(edges[:-1], edges[1:], hist):
bar = "#" * int(60 * c / max(hist.max(), 1))
print(f" [{lo:.1f},{hi:.1f}) {c:7d} {bar}")
# Sub-cluster at a sweep of thresholds via agglomerative clustering on the distance matrix.
from sklearn.cluster import AgglomerativeClustering
print("\n-- sub-clustering --")
for thr in (0.30, 0.35, 0.40, 0.45, 0.50):
ac = AgglomerativeClustering(
n_clusters=None,
metric="precomputed",
linkage="average",
distance_threshold=thr,
)
labels = ac.fit_predict(dists)
sizes = Counter(labels)
n = len(sizes)
big = sum(1 for s in sizes.values() if s >= 10)
top5 = sorted(sizes.values(), reverse=True)[:5]
print(f" threshold {thr:.2f}: {n} sub-clusters, {big} with >=10 images, top-5 sizes={top5}")
# Pick the threshold that gives 2-5 substantial sub-clusters.
target_thr = 0.35
ac = AgglomerativeClustering(
n_clusters=None, metric="precomputed", linkage="average",
distance_threshold=target_thr,
)
labels = ac.fit_predict(dists)
sizes = Counter(labels)
big_labels = [lab for lab, s in sizes.most_common() if s >= 20]
print(f"\n-- EXIF year analysis at threshold {target_thr} (sub-clusters with >=20 images) --")
print(f" {len(big_labels)} substantial sub-clusters")
# Build label -> list of source paths
by_label: dict[int, list[Path]] = {}
for ci, lab in zip(selected, labels):
rec = face_records[ci]
by_label.setdefault(int(lab), []).append(Path(rec["path"]))
for lab in big_labels[:6]:
paths_in = by_label[lab]
years = []
for p in paths_in:
y = exif_year(p)
if y is not None:
years.append(y)
n_paths = len(paths_in)
n_years = len(years)
if years:
ys = np.array(years)
ymin, ymax = int(ys.min()), int(ys.max())
ymed = int(np.median(ys))
yhist = Counter(years)
top_years = ", ".join(f"{y}:{c}" for y, c in sorted(yhist.most_common(5)))
else:
ymin = ymax = ymed = None
top_years = ""
print(
f" cluster {lab}: {n_paths} faces, EXIF on {n_years}/{n_paths}, "
f"year range {ymin}..{ymax} (median {ymed})"
)
print(f" top years: {top_years}")
if __name__ == "__main__":
main()
+221
@@ -0,0 +1,221 @@
"""Windows / DirectML CLIP worker for occlusion scoring.
Reads a queue.json staged by /opt/face-sets/work/filter_occlusions.py (WSL side),
runs open_clip ViT-L-14 (dfn2b_s39b) on each PNG via torch-directml on the AMD
Vega, and writes a scores.json with mask + sunglasses softmax probabilities.
CLI:
py -3.12 clip_worker.py <queue.json> <out_scores.json> [--limit N] [--batch 8]
queue.json shape: list of objects
{"wsl_path": "...", "win_path": "E:\\...\\faceset_NNN\\faces\\NNNN.png",
"faceset": "faceset_NNN", "file": "NNNN.png"}
scores.json shape:
{"model": "ViT-L-14/dfn2b_s39b",
"logit_scale": 100.0,
"prompts": {...},
"results": [{"wsl_path": "...", "faceset": "...", "file": "...",
"mask": float, "sunglasses": float}],
"processed": [wsl_path, ...]}
"""
from __future__ import annotations
import argparse
import json
import os
import sys
import time
import warnings
from pathlib import Path
# DML emits a verbose UserWarning per attention call -- silence at import time
warnings.filterwarnings("ignore", category=UserWarning)
import torch
import torch_directml
import open_clip
from PIL import Image
MODEL_NAME = "ViT-L-14"
PRETRAINED = "dfn2b_s39b"
# kept in sync with /opt/face-sets/work/filter_occlusions.py PROMPTS
PROMPTS = {
"mask": {
"pos": [
"a photo of a person wearing a surgical face mask",
"a photo of a person wearing an FFP2 respirator covering mouth and nose",
"a photo of a person wearing a cloth face mask",
"a face partially covered by a medical mask",
"a person whose mouth and nose are hidden by a face mask",
],
"neg": [
"a photo of a person's face with mouth and nose clearly visible",
"a clear, unobstructed photo of a face",
"a photo of a face without any mask or covering",
"a portrait of a person showing their full face",
"a photo of a person with a beard and visible mouth",
],
},
"sunglasses": {
"pos": [
"a face with dark sunglasses covering the eyes",
"a portrait with the eyes hidden behind opaque sunglasses",
"a person wearing dark sunglasses over their eyes, eyes not visible",
"a face where the eyes are completely concealed by tinted lenses",
"a close-up portrait wearing aviator sunglasses on the eyes",
],
"neg": [
"a portrait with both eyes clearly visible and uncovered",
"a face with sunglasses pushed up on the forehead, eyes visible below",
"a face with sunglasses resting on top of the head, eyes visible",
"a person with sunglasses hanging from their shirt, eyes visible",
"a face wearing clear prescription eyeglasses with visible eyes",
"a portrait with no eyewear and visible eyes",
],
},
}
FLUSH_EVERY = 100
def load_existing(out_path: Path):
if not out_path.exists():
return None, set()
try:
d = json.loads(out_path.read_text())
processed = set(d.get("processed", []))
return d, processed
except Exception as e:
print(f"[warn] could not parse existing {out_path}: {e}; starting fresh", file=sys.stderr)
return None, set()
def save_atomic(out_path: Path, data: dict):
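# Write to a temp file, then os.replace: the rename is atomic on the same
# volume, so a crash mid-write never corrupts the previous scores.json.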
tmp = out_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(data, indent=2))
os.replace(tmp, out_path)
@torch.no_grad()
def build_text_features(model, tokenizer, device):
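# Prompt ensembling: encode each side's prompt list, average the unit
# vectors, then renormalize so each pos/neg centroid is itself unit-length.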
out = {}
for attr, sides in PROMPTS.items():
feats = {}
for side in ("pos", "neg"):
tokens = tokenizer(sides[side]).to(device)
f = model.encode_text(tokens)
f = f / f.norm(dim=-1, keepdim=True)
mean = f.mean(dim=0)
feats[side] = mean / mean.norm()
out[attr] = (feats["pos"], feats["neg"])
return out
def main():
ap = argparse.ArgumentParser()
ap.add_argument("queue", type=Path)
ap.add_argument("out", type=Path)
ap.add_argument("--limit", type=int, default=None)
ap.add_argument("--batch", type=int, default=8)
args = ap.parse_args()
queue = json.loads(args.queue.read_text())
print(f"[queue] {len(queue)} entries from {args.queue}")
args.out.parent.mkdir(parents=True, exist_ok=True)
existing, processed = load_existing(args.out)
if existing:
print(f"[resume] {len(processed)} entries already scored")
results = existing.get("results", [])
else:
results = []
pending = [e for e in queue if e["wsl_path"] not in processed]
if args.limit is not None:
pending = pending[: args.limit]
print(f"[pending] {len(pending)} entries to score")
if not pending:
print("[done] nothing to do")
return
device = torch_directml.device()
print(f"[load] {MODEL_NAME}/{PRETRAINED} on {torch_directml.device_name(0)}")
t0 = time.time()
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_NAME, pretrained=PRETRAINED)
tokenizer = open_clip.get_tokenizer(MODEL_NAME)
model = model.to(device).eval()
logit_scale = float(model.logit_scale.exp().detach().cpu())
print(f"[load] ready in {time.time()-t0:.1f}s logit_scale={logit_scale:.2f}")
text_feats = build_text_features(model, tokenizer, device)
def flush():
save_atomic(args.out, {
"model": f"{MODEL_NAME}/{PRETRAINED}",
"logit_scale": logit_scale,
"prompts": PROMPTS,
"results": results,
"processed": sorted(processed),
})
n_done_this_run = 0
n_load_err = 0
last_flush = time.time()
t_start = time.time()
for i in range(0, len(pending), args.batch):
chunk = pending[i:i + args.batch]
imgs = []
keep = []
for entry in chunk:
try:
img = Image.open(entry["win_path"]).convert("RGB")
imgs.append(preprocess(img))
keep.append(entry)
except Exception as e:
print(f"[skip] {entry['win_path']}: {e}", file=sys.stderr)
n_load_err += 1
processed.add(entry["wsl_path"])
if not imgs:
continue
x = torch.stack(imgs).to(device)
with torch.no_grad():
feats = model.encode_image(x)
feats = feats / feats.norm(dim=-1, keepdim=True)
scores_per_attr = {}
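# Two-way softmax per attribute: column 0 holds the positive-prompt
# similarity, column 1 the negative, both scaled by the model's learned
# temperature; probs[:, 0] is the probability mass on the occluded side.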
for attr, (pos, neg) in text_feats.items():
sims = torch.stack([feats @ pos, feats @ neg], dim=1) * logit_scale
probs = sims.softmax(dim=1)[:, 0].detach().cpu().tolist()
scores_per_attr[attr] = probs
for j, entry in enumerate(keep):
results.append({
"wsl_path": entry["wsl_path"],
"faceset": entry["faceset"],
"file": entry["file"],
"mask": round(scores_per_attr["mask"][j], 4),
"sunglasses": round(scores_per_attr["sunglasses"][j], 4),
})
processed.add(entry["wsl_path"])
n_done_this_run += 1
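# Flush when the counter crosses a multiple of FLUSH_EVERY (the modulo
# window is batch-wide because the counter advances in batch steps), or at
# least every 30 s.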
if (n_done_this_run % FLUSH_EVERY < args.batch) or (time.time() - last_flush) > 30.0:
flush()
last_flush = time.time()
elapsed = time.time() - t_start
rate = n_done_this_run / max(0.1, elapsed)
eta_min = (len(pending) - n_done_this_run) / max(0.1, rate) / 60.0
print(f"[score] {n_done_this_run}/{len(pending)} "
f"rate={rate:.2f} img/s eta={eta_min:.1f}min "
f"load_err={n_load_err}", flush=True)
flush()
elapsed = time.time() - t_start
print(f"[done] {n_done_this_run} scored, {n_load_err} load errors, "
f"{elapsed:.1f}s ({n_done_this_run/max(0.1,elapsed):.2f} img/s) -> {args.out}")
if __name__ == "__main__":
main()
+340
@@ -0,0 +1,340 @@
#!/usr/bin/env python3
"""Discover new identities in an Immich-sourced cache and emit them as facesets.
Mirrors `work/cluster_osrc.py`, but the source corpus is an arbitrary
Immich user's `immich_<user>.npz` cache produced by the Windows DML embed
worker. Existing identity centroids come from the union of every faceset
already in `facesets_swap_ready/` (faceset_001..NNN, both auto-clustered
and hand-sorted).
Pipeline:
1. Load immich_<user>.npz; restrict to face records (drop noface).
2. Build centroids of every existing canonical faceset in
facesets_swap_ready/ (skip era splits and _thin/).
3. Drop immich faces whose nearest existing centroid is within
EXISTING_MATCH_THRESHOLD; those are already covered by the canonical set.
4. Cluster the remaining among themselves at INITIAL_THRESHOLD.
5. Per cluster: refine-equivalent gates (face_short, blur, det_score),
plus outlier rejection at OUTLIER_THRESHOLD for clusters of size >= 4.
6. Keep clusters whose surviving unique source-path count is >= MIN_FACES.
7. Number kept clusters past the existing facesets_swap_ready/ max.
8. Synthesize a refine_manifest, hand off to cmd_export_swap, move dirs into
facesets_swap_ready/, drop a provenance marker, append to top-level
manifest.json (preserving facesets / thin_eras).
"""
from __future__ import annotations
import argparse
import json
import shutil
import sys
from pathlib import Path
import numpy as np
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import ( # noqa: E402
_cluster_embeddings,
cmd_export_swap,
load_cache,
)
# ---- config -------------------------------------------------------------- #
REPO_WORK = REPO / "work"
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
EXISTING_MATCH_THRESHOLD = 0.45
INITIAL_THRESHOLD = 0.55
MIN_FACES = 6
MIN_SHORT = 90
MIN_BLUR = 40.0
MIN_DET_SCORE = 0.6
OUTLIER_THRESHOLD = 0.55
TOP_N = 30
EXPORT_OUTLIER_THRESHOLD = 0.45
PAD_RATIO = 0.5
OUT_SIZE = 512
EXPORT_MIN_FACE_SHORT = 100
# ---- helpers ------------------------------------------------------------- #
def _normalize(v: np.ndarray) -> np.ndarray:
n = np.linalg.norm(v)
return v / n if n > 0 else v
def _existing_identity_centroids(
nl_cache: Path,
) -> tuple[np.ndarray, list[str]]:
"""Build identity centroids from every canonical faceset_NNN/ in
facesets_swap_ready/. Era-split sub-dirs (faceset_001_<era>) and the
_thin/ quarantine are skipped. Each faceset's manifest.json provides
(source, bbox) keys we use to look up rows in nl_full.npz."""
emb, meta, _src, _proc, _aliases = load_cache(nl_cache)
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/embedding mismatch in {nl_cache}: {len(face_records)} vs {len(emb)}")
bbox_idx = {(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)}
centroids: list[np.ndarray] = []
names: list[str] = []
for d in sorted(SWAP_READY.iterdir()):
if not d.is_dir():
continue
if d.name.startswith("_"):
continue
# Skip era-split sub-facesets (faceset_NNN_*).
if d.name.startswith("faceset_") and "_" in d.name[len("faceset_"):]:
continue
man = d / "manifest.json"
if not man.exists():
continue
try:
entries = json.loads(man.read_text()).get("faces", [])
except Exception:
continue
keys = [(f["source"], tuple(f.get("bbox") or ())) for f in entries]
idxs = [bbox_idx[k] for k in keys if k in bbox_idx]
if not idxs:
continue
centroids.append(_normalize(emb[idxs].mean(axis=0)))
names.append(d.name)
if not centroids:
raise SystemExit("no canonical identity centroids could be built; check facesets_swap_ready/")
return np.stack(centroids), names
def _next_faceset_number() -> int:
nums = []
for d in SWAP_READY.iterdir():
if not d.is_dir() or not d.name.startswith("faceset_"):
continue
tail = d.name[len("faceset_"):]
# Take only top-level numbered facesets (no era suffix).
if "_" in tail:
continue
try:
nums.append(int(tail))
except ValueError:
continue
return (max(nums) + 1) if nums else 1
# ---- phase 1: discover --------------------------------------------------- #
def discover_new_clusters(
immich_cache: Path, nl_cache: Path, start_nnn: int, source_label: str
) -> tuple[dict, list[dict]]:
print(f"loading immich cache: {immich_cache}")
emb, meta, _src, _proc, _aliases = load_cache(immich_cache)
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")
print(f" {len(face_records)} face records, {sum(1 for m in meta if m.get('noface'))} noface")
print(f"building existing-identity centroids from {SWAP_READY}")
cents, cent_names = _existing_identity_centroids(nl_cache)
print(f" {len(cent_names)} canonical centroids")
sims = emb @ cents.T
nearest_d = 1.0 - sims.max(axis=1)
nearest_id = sims.argmax(axis=1)
covered = nearest_d <= EXISTING_MATCH_THRESHOLD
print(f"\nfaces already covered (cos-dist <= {EXISTING_MATCH_THRESHOLD}): "
f"{int(covered.sum())}/{len(emb)}")
for j, name in enumerate(cent_names):
c = int(((nearest_id == j) & covered).sum())
if c:
print(f" -> {name}: {c}")
new_idx = [i for i in range(len(emb)) if not covered[i]]
print(f"\nunmatched immich faces to cluster: {len(new_idx)}")
if len(new_idx) <= 1:
labels = np.zeros(len(new_idx), dtype=int)
else:
labels = _cluster_embeddings(emb[new_idx], INITIAL_THRESHOLD)
n_clusters = len(set(int(l) for l in labels))
sizes = sorted([int((labels == l).sum()) for l in set(labels)], reverse=True)
print(f"clusters at threshold {INITIAL_THRESHOLD}: {n_clusters} "
f"top sizes: {sizes[:10]}")
clusters: dict[int, list[int]] = {}
for k, lab in enumerate(labels):
clusters.setdefault(int(lab), []).append(new_idx[k])
kept: list[dict] = []
drop_quality_total = 0
drop_outlier_total = 0
for cid, idxs in clusters.items():
good: list[int] = []
for i in idxs:
r = face_records[i]
if r.get("face_short", 0) < MIN_SHORT:
drop_quality_total += 1; continue
if r.get("blur", 0.0) < MIN_BLUR:
drop_quality_total += 1; continue
if r.get("det_score", 0.0) < MIN_DET_SCORE:
drop_quality_total += 1; continue
good.append(i)
if not good:
continue
if len(good) >= 4:
cent = _normalize(emb[good].mean(axis=0))
d = 1.0 - emb[good] @ cent
tight = [good[k] for k, dist in enumerate(d) if dist <= OUTLIER_THRESHOLD]
drop_outlier_total += len(good) - len(tight)
good = tight
if not good:
continue
unique_paths = sorted({face_records[i]["path"] for i in good})
if len(unique_paths) < MIN_FACES:
continue
kept.append({
"indices": good,
"unique_paths": unique_paths,
"size_face": len(good),
"size_paths": len(unique_paths),
})
kept.sort(key=lambda c: -c["size_paths"])
print(f"\nafter quality+outlier+min_faces: {len(kept)} clusters kept "
f"(dropped: quality={drop_quality_total} outlier={drop_outlier_total})")
for rank, c in enumerate(kept, start=start_nnn):
print(f" faceset_{rank:03d}: faces={c['size_face']:3d} "
f"unique_paths={c['size_paths']:3d}")
facesets = [
{
"name": f"faceset_{rank:03d}",
"image_count": c["size_paths"],
"face_count": c["size_face"],
"images": c["unique_paths"],
}
for rank, c in enumerate(kept, start=start_nnn)
]
manifest = {
"params": {
"existing_match_threshold": EXISTING_MATCH_THRESHOLD,
"initial_threshold": INITIAL_THRESHOLD,
"outlier_threshold": OUTLIER_THRESHOLD,
"min_faces": MIN_FACES,
"min_short": MIN_SHORT,
"min_blur": MIN_BLUR,
"min_det_score": MIN_DET_SCORE,
"source_label": source_label,
"source_cache": str(immich_cache),
},
"facesets": facesets,
}
return manifest, kept
# ---- phase 2: export + relocate ----------------------------------------- #
def export_and_relocate(manifest: dict, immich_cache: Path, source_label: str) -> None:
synth_path = REPO_WORK / f"synthetic_{source_label}_manifest.json"
synth_path.write_text(json.dumps(manifest, indent=2))
print(f"\nsynthetic manifest -> {synth_path}")
out_tmp = SWAP_READY.parent / f"facesets_swap_ready_{source_label}_new"
if out_tmp.exists():
shutil.rmtree(out_tmp)
out_tmp.mkdir(parents=True)
print(f"running cmd_export_swap -> {out_tmp}")
cmd_export_swap(
cache_path=immich_cache,
refine_manifest_path=synth_path,
raw_manifest_path=None,
out_dir=out_tmp,
top_n=TOP_N,
outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
pad_ratio=PAD_RATIO,
out_size=OUT_SIZE,
include_candidates=False,
candidate_match_threshold=0.55,
candidate_min_score=0.40,
min_face_short=EXPORT_MIN_FACE_SHORT,
)
new_top = json.loads((out_tmp / "manifest.json").read_text())
new_entries = new_top.get("facesets", [])
moved = 0
for fs_meta in new_entries:
name = fs_meta["name"]
src_dir = out_tmp / name
if not src_dir.exists():
print(f"[{name}] export dir missing; skipping")
continue
dst_dir = SWAP_READY / name
if dst_dir.exists():
print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
continue
(src_dir / f"immich_{source_label}.txt").write_text(
f"{name}\n\nSource: Immich user {source_label} cluster (auto-discovered).\n"
)
shutil.move(str(src_dir), str(dst_dir))
moved += 1
print(f"[{name}] -> {dst_dir}")
final_manifest_path = SWAP_READY / "manifest.json"
if final_manifest_path.exists():
existing = json.loads(final_manifest_path.read_text())
else:
existing = {"facesets": []}
existing.setdefault("facesets", [])
existing_names = {fs["name"] for fs in existing["facesets"]}
appended = 0
for entry in new_entries:
if entry["name"] in existing_names:
print(f"[manifest] {entry['name']} already present; not duplicating")
continue
existing["facesets"].append(entry)
appended += 1
final_manifest_path.write_text(json.dumps(existing, indent=2))
print(f"\nmerged manifest: appended {appended} entries -> {final_manifest_path}")
print(f"moved {moved} faceset directories into {SWAP_READY}")
if out_tmp.exists() and not list(out_tmp.iterdir()):
out_tmp.rmdir()
# ---- main ---------------------------------------------------------------- #
def main() -> None:
p = argparse.ArgumentParser()
p.add_argument("immich_cache", type=Path,
help="path to immich_<user>.npz produced by the embed worker")
p.add_argument("--nl-cache", type=Path, default=REPO_WORK / "cache" / "nl_full.npz",
help="canonical cache for existing identity centroids")
p.add_argument("--source-label", default=None,
help="short label used in marker filenames; default = stem of immich_cache")
p.add_argument("--start-nnn", type=int, default=None,
help="first faceset number to assign; default = current max+1 in facesets_swap_ready/")
p.add_argument("--dry-run", action="store_true")
args = p.parse_args()
label = args.source_label or args.immich_cache.stem.removeprefix("immich_") or args.immich_cache.stem
start_nnn = args.start_nnn if args.start_nnn is not None else _next_faceset_number()
print(f"source label: {label!r}; first faceset number: {start_nnn:03d}")
manifest, kept = discover_new_clusters(args.immich_cache, args.nl_cache, start_nnn, label)
if args.dry_run:
print("\n--dry-run: stopping after cluster discovery (no exports written).")
return
if not manifest.get("facesets"):
print("no new facesets to build.")
return
export_and_relocate(manifest, args.immich_cache, label)
print("\nDone.")
if __name__ == "__main__":
main()
+352
@@ -0,0 +1,352 @@
#!/usr/bin/env python3
"""Discover new identities in /mnt/x/src/osrc and emit them as facesets.
Workflow (mirrors the shape of build_folders.py, but identities are
discovered by clustering rather than asserted by folder):
1. Load cache; restrict to face records whose canonical or alias path
lies under /mnt/x/src/osrc/.
2. Build centroids of the existing 19 canonical identities in
facesets_swap_ready/faceset_001..019. Drop any osrc face whose
nearest-existing-identity cos-dist <= EXISTING_MATCH_THRESHOLD;
those are already covered by `extend` and shouldn't seed new
facesets.
3. Cluster the remaining osrc faces among themselves at
INITIAL_THRESHOLD (matches `extend`'s new_cluster_threshold default).
4. Per cluster, apply refine-equivalent gates: face_short >= MIN_SHORT,
blur >= MIN_BLUR, det_score >= MIN_DET_SCORE; for clusters >= 4,
drop faces with cos-dist > OUTLIER_THRESHOLD from the cluster
centroid.
5. Keep clusters whose surviving unique source-path count is >= MIN_FACES.
6. Number kept clusters faceset_020, 021, ... (past the highest existing
in facesets_swap_ready, which is 019). Order by descending size.
7. Synthesize a refine_manifest.json and call cmd_export_swap on it,
emitting into a temp dir. Move new dirs into facesets_swap_ready/.
8. Append new entries to the top-level facesets_swap_ready/manifest.json
(preserving existing facesets / thin_eras).
"""
from __future__ import annotations
import json
import shutil
import sys
from pathlib import Path
import numpy as np
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import ( # noqa: E402
_cluster_embeddings,
cmd_export_swap,
load_cache,
)
# ---- config -------------------------------------------------------------- #
CACHE = REPO / "work" / "cache" / "nl_full.npz"
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
OUT_TMP = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready_osrc_new")
SYNTH_MANIFEST = REPO / "work" / "synthetic_osrc_manifest.json"
OSRC_DIR = Path("/mnt/x/src/osrc")
START_NNN = 20 # facesets_swap_ready max is 019; pick up here.
# Existing-identity exclusion: drop osrc faces whose nearest existing
# identity centroid is within this cosine distance. 0.45 matches the
# build_folders.py OSRC_THRESHOLD: at this cutoff the face is already
# routed to an existing identity by extend / build_folders.py.
EXISTING_MATCH_THRESHOLD = 0.45
# Cluster the unmatched.
INITIAL_THRESHOLD = 0.55
# Refine-equivalent gates (min_faces deliberately dropped to 6).
MIN_FACES = 6
MIN_SHORT = 90
MIN_BLUR = 40.0
MIN_DET_SCORE = 0.6
OUTLIER_THRESHOLD = 0.55 # only applied if cluster >= 4
# export-swap params (defaults from sort_faces.py).
TOP_N = 30
EXPORT_OUTLIER_THRESHOLD = 0.45
PAD_RATIO = 0.5
OUT_SIZE = 512
EXPORT_MIN_FACE_SHORT = 100
# ---- helpers ------------------------------------------------------------- #
def _normalize(v: np.ndarray) -> np.ndarray:
n = np.linalg.norm(v)
return v / n if n > 0 else v
def _under(folder: Path, p: str) -> bool:
fs = str(folder).rstrip("/") + "/"
return p == str(folder) or p.startswith(fs)
def _record_in_folder(rec: dict, folder: Path, path_aliases: dict[str, list[str]]) -> bool:
if _under(folder, rec["path"]):
return True
for alias in path_aliases.get(rec["path"], []):
if _under(folder, alias):
return True
return False
def _existing_identity_centroids(
emb: np.ndarray, face_records: list[dict]
) -> tuple[np.ndarray, list[str]]:
"""Build a (n_identities, 512) matrix of L2-normalized centroids and a parallel name list,
drawn from the canonical faceset_001..019 manifests in facesets_swap_ready/."""
bbox_idx: dict[tuple[str, tuple], int] = {
(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)
}
centroids: list[np.ndarray] = []
names: list[str] = []
for n in range(1, 20):
d = SWAP_READY / f"faceset_{n:03d}"
man_path = d / "manifest.json"
if not man_path.exists():
continue
man = json.loads(man_path.read_text())
keys = [(f["source"], tuple(f.get("bbox") or ())) for f in man.get("faces", [])]
idxs = [bbox_idx[k] for k in keys if k in bbox_idx]
if not idxs:
continue
centroids.append(_normalize(emb[idxs].mean(axis=0)))
names.append(d.name)
return np.stack(centroids), names
# ---- phase 1: identify new osrc clusters --------------------------------- #
def discover_new_clusters() -> tuple[dict, list[dict]]:
emb, meta, _src_root, _proc, path_aliases = load_cache(CACHE)
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")
print(f"Cache: {len(face_records)} face records.")
# Step 1: filter to osrc.
osrc_idx = [
i for i, m in enumerate(face_records)
if _record_in_folder(m, OSRC_DIR, path_aliases)
]
print(f"osrc face records: {len(osrc_idx)}")
# Step 2: drop those already matching an existing identity.
cents, cent_names = _existing_identity_centroids(emb, face_records)
osrc_emb = emb[osrc_idx]
sims = osrc_emb @ cents.T
nearest_d = 1.0 - sims.max(axis=1)
nearest_id = sims.argmax(axis=1)
covered_mask = nearest_d <= EXISTING_MATCH_THRESHOLD
n_covered = int(covered_mask.sum())
print(
f"Already covered by existing 19 identities at cos-dist <= "
f"{EXISTING_MATCH_THRESHOLD}: {n_covered}/{len(osrc_idx)}"
)
# Per-identity coverage breakdown (for logging only).
for j, name in enumerate(cent_names):
c = int(((nearest_id == j) & covered_mask).sum())
if c:
print(f" -> {name}: {c}")
new_idx = [osrc_idx[k] for k in range(len(osrc_idx)) if not covered_mask[k]]
print(f"\nUnmatched osrc faces to cluster: {len(new_idx)}")
# Step 3: cluster the unmatched among themselves.
new_emb = emb[new_idx]
if len(new_idx) <= 1:
labels = np.zeros(len(new_idx), dtype=int)
else:
labels = _cluster_embeddings(new_emb, INITIAL_THRESHOLD)
n_clusters = len(set(int(l) for l in labels))
print(
f"Initial clusters at threshold {INITIAL_THRESHOLD}: {n_clusters} "
f"(top sizes: {sorted([int((labels==l).sum()) for l in set(labels)], reverse=True)[:10]})"
)
# Step 4 + 5: per-cluster refine gates + min_faces.
clusters: dict[int, list[int]] = {}
for k, lab in enumerate(labels):
clusters.setdefault(int(lab), []).append(new_idx[k])
kept_clusters: list[dict] = []
drop_quality_total = 0
drop_outlier_total = 0
for cid, idxs in clusters.items():
# Per-face quality gate.
good: list[int] = []
for i in idxs:
r = face_records[i]
if r.get("face_short", 0) < MIN_SHORT:
drop_quality_total += 1
continue
if r.get("blur", 0.0) < MIN_BLUR:
drop_quality_total += 1
continue
if r.get("det_score", 0.0) < MIN_DET_SCORE:
drop_quality_total += 1
continue
good.append(i)
if not good:
continue
# Outlier rejection (only if cluster >= 4).
if len(good) >= 4:
cent = _normalize(emb[good].mean(axis=0))
d = 1.0 - emb[good] @ cent
tight = [good[k] for k, dist in enumerate(d) if dist <= OUTLIER_THRESHOLD]
drop_outlier_total += len(good) - len(tight)
good = tight
if not good:
continue
unique_paths = sorted({face_records[i]["path"] for i in good})
if len(unique_paths) < MIN_FACES:
continue
kept_clusters.append({
"indices": good,
"unique_paths": unique_paths,
"size_face": len(good),
"size_paths": len(unique_paths),
})
kept_clusters.sort(key=lambda c: -c["size_paths"])
print(
f"\nAfter quality gate ({drop_quality_total} dropped) + outlier "
f"rejection ({drop_outlier_total} dropped) + min_faces={MIN_FACES}: "
f"{len(kept_clusters)} clusters kept"
)
for rank, c in enumerate(kept_clusters, start=START_NNN):
print(
f" faceset_{rank:03d}: faces={c['size_face']:3d} "
f"unique_paths={c['size_paths']:3d}"
)
# Build synthetic refine_manifest.json compatible with cmd_export_swap.
facesets = [
{
"name": f"faceset_{rank:03d}",
"image_count": c["size_paths"],
"face_count": c["size_face"],
"images": c["unique_paths"],
}
for rank, c in enumerate(kept_clusters, start=START_NNN)
]
manifest = {
"params": {
"existing_match_threshold": EXISTING_MATCH_THRESHOLD,
"initial_threshold": INITIAL_THRESHOLD,
"outlier_threshold": OUTLIER_THRESHOLD,
"min_faces": MIN_FACES,
"min_short": MIN_SHORT,
"min_blur": MIN_BLUR,
"min_det_score": MIN_DET_SCORE,
"source_root": str(OSRC_DIR),
},
"facesets": facesets,
}
SYNTH_MANIFEST.write_text(json.dumps(manifest, indent=2))
print(f"\nSynthetic manifest -> {SYNTH_MANIFEST}")
return manifest, kept_clusters
# ---- phase 2: export + relocate + merge top-level manifest -------------- #
def export_and_relocate(manifest: dict) -> None:
if OUT_TMP.exists():
shutil.rmtree(OUT_TMP)
OUT_TMP.mkdir(parents=True)
print(f"\nRunning cmd_export_swap -> {OUT_TMP}")
cmd_export_swap(
cache_path=CACHE,
refine_manifest_path=SYNTH_MANIFEST,
raw_manifest_path=None,
out_dir=OUT_TMP,
top_n=TOP_N,
outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
pad_ratio=PAD_RATIO,
out_size=OUT_SIZE,
include_candidates=False,
candidate_match_threshold=0.55,
candidate_min_score=0.40,
min_face_short=EXPORT_MIN_FACE_SHORT,
)
new_top = json.loads((OUT_TMP / "manifest.json").read_text())
new_entries = new_top.get("facesets", [])
moved = 0
for fs_meta in new_entries:
name = fs_meta["name"]
src_dir = OUT_TMP / name
if not src_dir.exists():
print(f"[{name}] export dir missing; skipping")
continue
dst_dir = SWAP_READY / name
if dst_dir.exists():
print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
continue
# Add a marker file so the source provenance is obvious.
(src_dir / "osrc.txt").write_text(
f"{name}\n\nSource: osrc cluster (auto-discovered, {OSRC_DIR}).\n"
)
shutil.move(str(src_dir), str(dst_dir))
moved += 1
print(f"[{name}] -> {dst_dir}")
# Merge top-level manifest, preserving facesets / thin_eras / etc.
final_manifest_path = SWAP_READY / "manifest.json"
if final_manifest_path.exists():
existing = json.loads(final_manifest_path.read_text())
else:
existing = {"facesets": []}
existing.setdefault("facesets", [])
existing_names = {fs["name"] for fs in existing["facesets"]}
appended = 0
for entry in new_entries:
if entry["name"] in existing_names:
print(f"[manifest] {entry['name']} already present; not duplicating")
continue
existing["facesets"].append(entry)
appended += 1
final_manifest_path.write_text(json.dumps(existing, indent=2))
print(f"\nMerged manifest: appended {appended} entries -> {final_manifest_path}")
print(f"Moved {moved} faceset directories into {SWAP_READY}")
# Clean up temp dir if empty.
if OUT_TMP.exists():
leftover = list(OUT_TMP.iterdir())
if not leftover:
OUT_TMP.rmdir()
# ---- main ---------------------------------------------------------------- #
def main() -> None:
dry = "--dry-run" in sys.argv
manifest, kept = discover_new_clusters()
if dry:
print("\n--dry-run: stopping after cluster discovery (no exports written).")
return
if not manifest.get("facesets"):
print("No new facesets to build; nothing to do.")
return
export_and_relocate(manifest)
print("\nDone.")
if __name__ == "__main__":
main()
+634
@@ -0,0 +1,634 @@
"""Consolidate facesets_swap_ready/ — find duplicate identities and merge.
Pipeline:
1. analyze: pull arcface embeddings from work/cache/*.npz for every PNG in every
active faceset (skipping _masked, _thin, era splits). Compute L2-normalized
centroid per faceset. Build similarity graph at sim>=0.45, extract components.
Pick primary per component by tier (hand-sorted > auto > osrc > immich) + size.
2. report: HTML contact sheet at work/merge_review/index.html grouped by
candidate cluster, with top-3 thumbs per faceset, all pairwise sims, and
"merge X,Y -> Z" plan. Confident edges (sim>=0.65) are highlighted.
3. apply: combine PNGs of secondaries into primary, re-rank by quality.composite
descending, renumber 0001..NNNN, re-zip _topN.fsz + _all.fsz, move secondaries
to facesets_swap_ready/_merged/<name>/, update master manifest with
`merged[]` array + `merge_run` provenance block.
Embeddings come from caches (no GPU re-embed needed); the original clusterer used
exactly these vectors so they are the right yardstick. Era splits are excluded
entirely (intentional time-period segmentation, not a duplication).
"""
from __future__ import annotations
import argparse
import json
import re
import shutil
import sys
import time
from pathlib import Path
import numpy as np
from PIL import Image
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
CACHES = [
Path("/opt/face-sets/work/cache/nl_full.npz"),
Path("/opt/face-sets/work/cache/immich_peter.npz"),
Path("/opt/face-sets/work/cache/immich_nic.npz"),
]
ERA_SPLIT_RE = re.compile(r"^faceset_\d+_(?:\d{4}-\d{2,4}|\d{4}|undated)$")
# ----------------------------- helpers -----------------------------
def load_caches():
"""Return (rec_index, alias_map). rec_index keyed by (path, bbox_tuple)
-> embedding (np.float32, shape (512,) L2-normalized).
alias_map maps every alias path -> canonical path."""
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
alias_map: dict[str, str] = {}
n_total = 0
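# Note: on a duplicate (path, bbox) key a later cache overwrites an earlier
# one; presumably harmless here since the caches cover different corpora.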
for c in CACHES:
if not c.exists():
print(f"[warn] cache missing: {c}", file=sys.stderr)
continue
d = np.load(c, allow_pickle=True)
emb = d["embeddings"]
meta = json.loads(str(d["meta"]))
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/emb mismatch in {c}: {len(face_records)} vs {len(emb)}")
# path_aliases may be present
if "path_aliases" in d.files:
paliases = json.loads(str(d["path_aliases"]))
for canon, alist in paliases.items():
alias_map.setdefault(canon, canon)
for a in alist:
alias_map[a] = canon
for i, rec in enumerate(face_records):
p = rec["path"]
bbox = tuple(int(x) for x in rec["bbox"])
v = emb[i].astype(np.float32)
n = float(np.linalg.norm(v))
if n > 0:
v = v / n
rec_index[(p, bbox)] = v
alias_map.setdefault(p, p)
print(f"[cache] {c.name}: +{len(face_records)} face records (running total {len(rec_index)})", file=sys.stderr)
n_total += len(face_records)
print(f"[cache] indexed {n_total} face records, {len(alias_map)} path aliases", file=sys.stderr)
return rec_index, alias_map
def faceset_tier(name: str) -> int:
"""Lower number = higher priority for primary selection."""
m = re.match(r"^faceset_0*(\d+)$", name)
if not m:
return 99 # unknown structure
n = int(m.group(1))
if 13 <= n <= 19:
return 0 # hand-sorted
if 1 <= n <= 12:
return 1 # auto-clustered
if 20 <= n <= 25:
return 2 # osrc
if 26 <= n <= 264:
return 3 # immich peter
if 265 <= n:
return 4 # immich nic and beyond
return 99
def is_era_split(name: str) -> bool:
return bool(ERA_SPLIT_RE.match(name))
def faceset_centroid(faceset_dir: Path, rec_index, alias_map):
"""Return (centroid, n_used, n_missing) where centroid is L2-normalized mean
of embeddings of the faces listed in the per-faceset manifest. Falls back to
None if too few embeddings found."""
manifest = faceset_dir / "manifest.json"
if not manifest.exists():
return None, 0, 0
m = json.loads(manifest.read_text())
vecs = []
n_missing = 0
for f in m.get("faces", []):
src = f.get("source")
bbox = f.get("bbox")
if src is None or bbox is None:
n_missing += 1
continue
bbox_t = tuple(int(x) for x in bbox)
canon = alias_map.get(src, src)
v = rec_index.get((canon, bbox_t))
if v is None and canon != src:
v = rec_index.get((src, bbox_t))
if v is None:
n_missing += 1
continue
vecs.append(v)
if len(vecs) < 3:
return None, len(vecs), n_missing
arr = np.stack(vecs).astype(np.float32)
c = arr.mean(axis=0)
n = float(np.linalg.norm(c))
if n > 0:
c = c / n
return c, len(vecs), n_missing
def connected_components(adj: dict[int, set[int]]) -> list[list[int]]:
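# Iterative DFS with an explicit stack, so large graphs cannot hit Python's
# recursion limit.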
seen: set[int] = set()
comps = []
for node in adj:
if node in seen:
continue
stack = [node]
comp = []
while stack:
x = stack.pop()
if x in seen:
continue
seen.add(x)
comp.append(x)
for y in adj.get(x, set()):
if y not in seen:
stack.append(y)
comps.append(sorted(comp))
return comps
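# Toy sketch (invented numbers) of why cmd_analyze below cuts a complete-linkage
# tree instead of using the connected_components helper above: with
# sim(A,B)=0.60, sim(B,C)=0.55, sim(A,C)=0.20 and edge=0.45, a components cut
# would chain all three through B, while complete link refuses to merge A with C
# (pair sim 0.20 < 0.45), so the cut yields {A, B} and {C}; every surviving
# group has ALL pairwise sims >= edge.
def _demo_complete_link_cut():
    sim = np.array([[1.00, 0.60, 0.20],
                    [0.60, 1.00, 0.55],
                    [0.20, 0.55, 1.00]], dtype=np.float32)
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="complete")
    return fcluster(Z, t=1.0 - 0.45, criterion="distance")  # e.g. [1, 1, 2]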
# ----------------------------- analyze -----------------------------
def cmd_analyze(args):
rec_index, alias_map = load_caches()
# collect active facesets
active = []
for d in sorted(ROOT.iterdir()):
if not d.is_dir() or d.name.startswith("_"):
continue
if is_era_split(d.name):
continue
active.append(d)
print(f"[scan] {len(active)} active facesets (era splits + _masked + _thin excluded)", file=sys.stderr)
centroids: dict[str, np.ndarray] = {}
sizes: dict[str, int] = {}
skipped = []
t0 = time.time()
for fs in active:
c, n_used, n_miss = faceset_centroid(fs, rec_index, alias_map)
if c is None:
skipped.append((fs.name, n_used, n_miss))
continue
centroids[fs.name] = c
sizes[fs.name] = n_used
print(f"[centroid] {len(centroids)} facesets centroided in {time.time()-t0:.1f}s; "
f"{len(skipped)} skipped (too few embeddings)", file=sys.stderr)
if skipped:
for n, u, m in skipped[:10]:
print(f" skip {n}: used={u} missing={m}", file=sys.stderr)
if len(skipped) > 10:
print(f" ... +{len(skipped)-10} more", file=sys.stderr)
names = sorted(centroids.keys())
if not names:
raise SystemExit("no centroids built")
# similarity matrix
M = np.stack([centroids[n] for n in names]).astype(np.float32) # (N, 512), normalized
sim = M @ M.T # (N, N) cosine since unit-normalized
np.clip(sim, -1.0, 1.0, out=sim)
edge_thr = args.edge
confident_thr = args.confident
# complete-linkage agglomerative clustering on cosine distance.
# Cut at edge threshold: groups are guaranteed to have ALL pairs sim >= edge_thr.
# This avoids the chaining problem of single-link / connected-components.
n = len(names)
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)
# symmetrize numerical noise
dist = (dist + dist.T) / 2.0
np.clip(dist, 0.0, 2.0, out=dist)
cond = squareform(dist, checks=False)
Z = linkage(cond, method="complete")
cut_dist = 1.0 - edge_thr # complete-link distance corresponds to (1 - min sim)
labels = fcluster(Z, t=cut_dist, criterion="distance") # 1-indexed cluster ids
cluster_members: dict[int, list[int]] = {}
for idx, lbl in enumerate(labels):
cluster_members.setdefault(int(lbl), []).append(idx)
comps = [sorted(idxs) for idxs in cluster_members.values() if len(idxs) > 1]
n_pairs_in_groups = 0
for c in comps:
n_pairs_in_groups += len(c) * (len(c) - 1) // 2
print(f"[graph] complete-linkage cut at sim>={edge_thr}: {len(comps)} multi-faceset groups "
f"({n_pairs_in_groups} within-group pairs)", file=sys.stderr)
# pick primary per group: lowest tier number, then largest size
groups_out = []
for comp in comps:
members = [names[i] for i in comp]
members_sorted = sorted(members, key=lambda x: (faceset_tier(x), -sizes.get(x, 0), x))
primary = members_sorted[0]
secondaries = members_sorted[1:]
# gather pairwise sims within group
pair_sims = []
idx_of = {names[i]: i for i in comp}
for a in members:
for b in members:
if a >= b:
continue
pair_sims.append({"a": a, "b": b, "sim": round(float(sim[idx_of[a], idx_of[b]]), 4)})
# confidence: minimum within-group sim (the weakest link)
min_link = min(p["sim"] for p in pair_sims)
max_link = max(p["sim"] for p in pair_sims)
confidence = "confident" if min_link >= confident_thr else "uncertain"
groups_out.append({
"primary": primary,
"secondaries": secondaries,
"members": members_sorted,
"tiers": {n: faceset_tier(n) for n in members},
"sizes": {n: sizes.get(n, 0) for n in members},
"pair_sims": pair_sims,
"min_link": round(min_link, 4),
"max_link": round(max_link, 4),
"confidence": confidence,
})
# sort: confident first, then by max_link desc
groups_out.sort(key=lambda g: (0 if g["confidence"] == "confident" else 1, -g["max_link"]))
out = {
"thresholds": {"edge": edge_thr, "confident": confident_thr},
"n_active": len(active),
"n_centroided": len(centroids),
"n_skipped": len(skipped),
"skipped_reasons": [{"name": n, "used": u, "missing": m} for n, u, m in skipped],
"n_groups": len(groups_out),
"n_facesets_in_groups": sum(len(g["members"]) for g in groups_out),
"groups": groups_out,
}
op = Path(args.out)
op.parent.mkdir(parents=True, exist_ok=True)
op.write_text(json.dumps(out, indent=2))
confident = sum(1 for g in groups_out if g["confidence"] == "confident")
uncertain = sum(1 for g in groups_out if g["confidence"] == "uncertain")
print(f"[done] {len(groups_out)} groups ({confident} confident, {uncertain} uncertain) -> {op}", file=sys.stderr)
# ----------------------------- report -----------------------------
def cmd_report(args):
candidates = json.loads(Path(args.candidates).read_text())
out_dir = Path(args.out)
thumbs_dir = out_dir / "thumbs"
thumbs_dir.mkdir(parents=True, exist_ok=True)
THUMB = 140
THUMBS_PER_FACESET = 4
def make_thumb(faceset: str, fname: str) -> str:
d = thumbs_dir / faceset
d.mkdir(parents=True, exist_ok=True)
dst = d / (Path(fname).stem + ".jpg")
if not dst.exists():
try:
src = ROOT / faceset / "faces" / fname
img = Image.open(src).convert("RGB")
img.thumbnail((THUMB, THUMB), Image.LANCZOS)
img.save(dst, "JPEG", quality=82)
except Exception as e:
print(f"[thumb-skip] {faceset}/{fname}: {e}", file=sys.stderr)
return ""
return f"thumbs/{faceset}/{Path(fname).stem}.jpg"
rows = []
for gi, g in enumerate(candidates["groups"]):
primary = g["primary"]
sec = g["secondaries"]
conf_cls = "confident" if g["confidence"] == "confident" else "uncertain"
rows.append(f"<section class='grp {conf_cls}' id='g{gi}'>")
rows.append(f"<h2>group #{gi+1} <small>({g['confidence']}; min_sim={g['min_link']:.3f}, max_sim={g['max_link']:.3f})</small></h2>")
rows.append(f"<div class='plan'>merge <b>{', '.join(sec)}</b> &rarr; <b>{primary}</b></div>")
# member rows
for name in g["members"]:
tier = g["tiers"][name]
sz = g["sizes"][name]
tier_label = ["hand-sorted", "auto", "osrc", "immich-peter", "immich-nic", "?"][min(tier, 5)]
badge = "PRIMARY" if name == primary else "secondary"
rows.append(f"<div class='member'>")
rows.append(f"<div class='label'><span class='badge {badge.lower()}'>{badge}</span> "
f"<b>{name}</b> <small>tier={tier_label} · n={sz}</small></div>")
rows.append("<div class='thumbs'>")
faces_dir = ROOT / name / "faces"
files = sorted(faces_dir.glob("*.png"))[:THUMBS_PER_FACESET]
for f in files:
rel = make_thumb(name, f.name)
if rel:
rows.append(f"<img src='{rel}' loading='lazy' title='{f.name}'>")
rows.append("</div></div>")
# pairwise sims
rows.append("<table class='sims'><tr><th>a</th><th>b</th><th>sim</th></tr>")
for ps in sorted(g["pair_sims"], key=lambda x: -x["sim"]):
cls = "hi" if ps["sim"] >= candidates["thresholds"]["confident"] else "mid"
rows.append(f"<tr><td>{ps['a']}</td><td>{ps['b']}</td><td class='{cls}'>{ps['sim']:.3f}</td></tr>")
rows.append("</table>")
rows.append("</section>")
nav = " · ".join(f"<a href='#g{i}'>#{i+1}</a>" for i in range(len(candidates["groups"])))
html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Faceset merge review</title>
<style>
body {{ font-family: system-ui, sans-serif; background: #111; color: #eee; padding: 1em; }}
h1 {{ margin-top: 0; }}
h2 {{ margin: 0; }}
small {{ color: #999; font-weight: normal; }}
section.grp {{ background: #1a1a1a; border-radius: 6px; padding: 12px; margin: 12px 0; }}
section.grp.confident {{ border-left: 4px solid #5fa05f; }}
section.grp.uncertain {{ border-left: 4px solid #ffb050; }}
.plan {{ margin: .5em 0; color: #6cf; }}
.member {{ margin: 8px 0; padding: 6px; background: #222; border-radius: 4px; }}
.label {{ font-family: monospace; font-size: 13px; }}
.badge {{ display: inline-block; padding: 0 6px; font-size: 10px; border-radius: 2px; }}
.badge.primary {{ background: #5fa05f; color: #000; font-weight: bold; }}
.badge.secondary {{ background: #444; color: #ccc; }}
.thumbs {{ display: flex; gap: 4px; margin-top: 4px; flex-wrap: wrap; }}
.thumbs img {{ height: 140px; width: auto; border-radius: 3px; }}
table.sims {{ font-family: monospace; font-size: 11px; margin-top: 6px; border-collapse: collapse; }}
table.sims td, table.sims th {{ padding: 1px 8px; border: 1px solid #333; text-align: left; }}
table.sims td.hi {{ color: #5fa05f; font-weight: bold; }}
table.sims td.mid {{ color: #ffb050; }}
.nav {{ position: sticky; top: 0; background: #111; padding: .5em 0; border-bottom: 1px solid #333; font-size: 12px; }}
a {{ color: #6cf; }}
</style></head>
<body>
<h1>Merge review &mdash; {len(candidates['groups'])} candidate groups
<small>(edge>={candidates['thresholds']['edge']}, confident>={candidates['thresholds']['confident']})</small></h1>
<p>{candidates['n_centroided']} of {candidates['n_active']} active facesets centroided
(skipped {candidates['n_skipped']} for too few cached embeddings).
Green = confident (min within-group sim >= {candidates['thresholds']['confident']}); orange = uncertain.</p>
<div class='nav'>{nav}</div>
{''.join(rows)}
</body></html>"""
out_html = out_dir / "index.html"
out_html.write_text(html)
print(f"[done] {out_html}", file=sys.stderr)
# ----------------------------- apply -----------------------------
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
import zipfile
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
for i, p in enumerate(pngs):
zf.write(p, arcname=f"{i:04d}.png")
def cmd_apply(args):
candidates = json.loads(Path(args.candidates).read_text())
master_path = ROOT / "manifest.json"
master = json.loads(master_path.read_text())
by_name = {f["name"]: f for f in master.get("facesets", [])}
# filter: skip "uncertain" groups unless --include-uncertain
accepted = [g for g in candidates["groups"]
if g["confidence"] == "confident" or args.include_uncertain]
skipped_unc = [g for g in candidates["groups"]
if g["confidence"] == "uncertain" and not args.include_uncertain]
# explicit --exclude / --only filters (group indices in the candidates file)
    if args.only:
        only = {int(s) for s in args.only.split(",")}
        accepted = [g for i, g in enumerate(candidates["groups"]) if i in only]
    if args.exclude:
        excl = {int(s) for s in args.exclude.split(",")}
        # indices refer to positions in the candidates file, not the filtered list
        excl_groups = {id(g) for i, g in enumerate(candidates["groups"]) if i in excl}
        accepted = [g for g in accepted if id(g) not in excl_groups]
print(f"[plan] {len(accepted)} groups will be merged "
f"({len(skipped_unc)} uncertain skipped)", file=sys.stderr)
if args.dry_run:
for g in accepted:
print(f" merge {g['secondaries']} -> {g['primary']} "
f"({g['confidence']}, min_sim={g['min_link']:.3f})")
return
merged_dir = ROOT / "_merged"
merged_dir.mkdir(exist_ok=True)
new_facesets: list[dict] = []
new_merged: list[dict] = list(master.get("merged", []))
consumed_names: set[str] = set()
primary_updates: dict[str, dict] = {} # name -> new entry
primary_absorbed: dict[str, list[dict]] = {} # primary_name -> [secondary entries]
for g in accepted:
primary = g["primary"]
if primary not in by_name:
print(f"[warn] primary {primary} not in master; skipping group", file=sys.stderr)
continue
primary_dir = ROOT / primary
if not primary_dir.is_dir():
print(f"[warn] primary dir {primary_dir} missing; skipping group", file=sys.stderr)
continue
primary_faces = primary_dir / "faces"
primary_manifest_path = primary_dir / "manifest.json"
primary_manifest = json.loads(primary_manifest_path.read_text())
# gather all face entries: primary + each secondary
combined_faces: list[dict] = list(primary_manifest.get("faces", []))
    # tag the primary's own faces so every combined entry carries provenance
for f in combined_faces:
f.setdefault("origin_faceset", primary)
for sec in g["secondaries"]:
sec_dir = ROOT / sec
if not sec_dir.is_dir():
print(f"[warn] secondary {sec} missing; skipping", file=sys.stderr)
continue
sec_manifest_path = sec_dir / "manifest.json"
sec_manifest = json.loads(sec_manifest_path.read_text()) if sec_manifest_path.exists() else {"faces": []}
for f in sec_manifest.get("faces", []):
f = dict(f)
f["origin_faceset"] = sec
combined_faces.append(f)
# rank by quality.composite descending; ties broken by lower cosd_centroid
def sort_key(f):
q = f.get("quality", {}).get("composite", 0)
d = f.get("cosd_centroid", 1.0)
return (-q, d)
combined_faces.sort(key=sort_key)
# renumber and stage PNGs into a fresh staging dir, then atomically swap
staging = primary_dir / "_faces_new"
if staging.exists():
shutil.rmtree(staging)
staging.mkdir()
new_face_entries = []
for new_rank, f in enumerate(combined_faces, start=1):
origin = f.pop("origin_faceset")
old_png_rel = f["png"] # e.g. "faces/0042.png"
old_png_name = Path(old_png_rel).name
origin_png = ROOT / origin / "faces" / old_png_name
if not origin_png.exists():
# could be in _dropped if occlusion-pruned; skip
continue
new_name = f"{new_rank:04d}.png"
shutil.copy2(origin_png, staging / new_name)
f = dict(f)
f["rank"] = new_rank
f["png"] = f"faces/{new_name}"
f["origin_faceset"] = origin # preserve provenance in manifest
new_face_entries.append(f)
# swap directories: primary/faces -> primary/_faces_old, staging -> primary/faces
old_faces_holding = primary_dir / "_faces_old"
if old_faces_holding.exists():
shutil.rmtree(old_faces_holding)
if primary_faces.exists():
primary_faces.rename(old_faces_holding)
staging.rename(primary_faces)
# migrate _dropped/ from old holding (so occlusion-pruned PNGs remain accessible)
old_dropped = old_faces_holding / "_dropped"
if old_dropped.exists():
(primary_faces / "_dropped").mkdir(exist_ok=True)
for x in old_dropped.iterdir():
shutil.move(str(x), str(primary_faces / "_dropped" / x.name))
shutil.rmtree(old_faces_holding)
# re-zip .fsz
survivor_pngs = sorted(primary_faces.glob("*.png"))
top_n = primary_manifest.get("top_n", 30)
top_n_eff = min(top_n, len(survivor_pngs))
# remove old .fsz files
for old in primary_dir.glob("*.fsz"):
old.unlink()
top_fsz_name = f"{primary}_top{top_n_eff}.fsz"
all_fsz_name = f"{primary}_all.fsz"
_zip_png_list(survivor_pngs[:top_n_eff], primary_dir / top_fsz_name)
if len(survivor_pngs) > top_n_eff:
_zip_png_list(survivor_pngs, primary_dir / all_fsz_name)
all_fsz_used = all_fsz_name
else:
all_fsz_used = None
# update primary's local manifest
primary_manifest["faces"] = new_face_entries
primary_manifest["exported"] = len(new_face_entries)
primary_manifest["fsz_top"] = top_fsz_name
primary_manifest["fsz_all"] = all_fsz_used
primary_manifest["top_n"] = top_n_eff
primary_manifest.setdefault("merge_history", []).append({
"absorbed": g["secondaries"],
"min_link": g["min_link"],
"max_link": g["max_link"],
"confidence": g["confidence"],
})
primary_manifest_path.write_text(json.dumps(primary_manifest, indent=2))
# move secondary directories into _merged/
absorbed_master_entries: list[dict] = []
for sec in g["secondaries"]:
sec_dir = ROOT / sec
target = merged_dir / sec
if not sec_dir.is_dir():
continue
if target.exists():
shutil.rmtree(sec_dir) # already moved by previous run; clean stub
else:
shutil.move(str(sec_dir), str(target))
sec_master = dict(by_name.get(sec, {"name": sec}))
sec_master["merged_into"] = primary
sec_master["relpath"] = f"_merged/{sec}"
sec_master["fsz_top"] = None
sec_master["fsz_all"] = None
absorbed_master_entries.append(sec_master)
consumed_names.add(sec)
new_merged.extend(absorbed_master_entries)
# bump primary master entry
prim_master = dict(by_name[primary])
prim_master["exported"] = len(new_face_entries)
prim_master["top_n"] = top_n_eff
prim_master["fsz_top"] = top_fsz_name
prim_master["fsz_all"] = all_fsz_used
prim_master.setdefault("merge_history", []).append({
"absorbed": g["secondaries"],
"min_link": g["min_link"],
"max_link": g["max_link"],
})
primary_updates[primary] = prim_master
print(f"[merged] {g['secondaries']} -> {primary} "
f"now {len(new_face_entries)} png", file=sys.stderr)
# rebuild master facesets list
for entry in master.get("facesets", []):
nm = entry["name"]
if nm in consumed_names:
continue
if nm in primary_updates:
new_facesets.append(primary_updates[nm])
else:
new_facesets.append(entry)
new_master = dict(master)
new_master["facesets"] = new_facesets
new_master["merged"] = new_merged
new_master["merge_run"] = {
"thresholds": candidates["thresholds"],
"groups_applied": len(accepted),
"facesets_consumed": len(consumed_names),
"include_uncertain": bool(args.include_uncertain),
}
tmp = master_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(new_master, indent=2))
tmp.replace(master_path)
print(f"[done] master manifest updated: {len(new_facesets)} active, "
f"{len(new_merged)} merged, {len(consumed_names)} consumed in this run",
file=sys.stderr)
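# Minimal sketch of the stage-then-rename swap used in cmd_apply above, run on a
# throwaway temp tree (names here are hypothetical): build the new faces/ next
# to the old one, then two renames make the cutover; a crash mid-way leaves
# either the old tree or both trees on disk, never a half-written mix.
def _demo_atomic_dir_swap():
    import tempfile
    base = Path(tempfile.mkdtemp())
    (base / "faces").mkdir()
    (base / "faces" / "0001.png").write_bytes(b"old")
    staging = base / "_faces_new"
    staging.mkdir()
    (staging / "0001.png").write_bytes(b"new")
    (base / "faces").rename(base / "_faces_old")   # step 1: park the old tree
    staging.rename(base / "faces")                 # step 2: promote staging
    shutil.rmtree(base / "_faces_old")             # cleanup once promoted
    data = (base / "faces" / "0001.png").read_bytes()  # b"new"
    shutil.rmtree(base)  # tidy up the throwaway tree
    return data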
# ----------------------------- main -----------------------------
def main():
ap = argparse.ArgumentParser()
sub = ap.add_subparsers(dest="cmd", required=True)
a = sub.add_parser("analyze")
a.add_argument("--out", required=True)
a.add_argument("--edge", type=float, default=0.45, help="min cosine sim to draw an edge (default 0.45)")
a.add_argument("--confident", type=float, default=0.65, help="min within-group sim to be confident (default 0.65)")
a.set_defaults(func=cmd_analyze)
r = sub.add_parser("report")
r.add_argument("--candidates", required=True)
r.add_argument("--out", required=True)
r.set_defaults(func=cmd_report)
p = sub.add_parser("apply")
p.add_argument("--candidates", required=True)
p.add_argument("--include-uncertain", action="store_true",
help="apply uncertain groups too (default: confident only)")
p.add_argument("--only", default=None, help="comma-separated group indices to apply")
p.add_argument("--exclude", default=None, help="comma-separated group indices to skip")
p.add_argument("--dry-run", action="store_true")
p.set_defaults(func=cmd_apply)
args = ap.parse_args()
args.func(args)
if __name__ == "__main__":
main()
@@ -0,0 +1,594 @@
"""Corpus-wide dedup + roop-unleashed optimization.
Two passes:
1. Cross-family byte-identical PNG dedup (same SHA256 in two different identity
families) — keep the higher-tier family copy. Era splits of the same parent
identity (faceset_NNN_*) are intentional duplications and are NOT deduped
within their family.
2. Within-faceset near-duplicate dedup using cached arcface embeddings
(cosine sim >= 0.95). Keep highest quality.composite, drop the rest.
Plus a Windows-DML multi-face audit (separate phase via clip_worker-style split):
3. Re-detect each PNG with insightface; flag any with 0 or >1 detected faces.
The roop loader appends every detected face per PNG, so multi-face crops
pollute identity averaging.
All flagged PNGs are MOVED to <faceset>/faces/_dropped/ (reversible). Affected
.fsz files are re-zipped, manifests updated.
CLI:
analyze --out work/dedup_audit/dedup_plan.json
apply --plan ... [--dry-run]
stage_multiface --out work/dedup_audit/multiface_queue.json
merge_multiface --results <worker_out> --out work/dedup_audit/multiface_plan.json
apply_multiface --plan ... [--dry-run]
report --dedup ... --multiface ... --out work/dedup_audit
"""
from __future__ import annotations
import argparse
import hashlib
import json
import re
import shutil
import sys
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import numpy as np
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
WIN_ROOT = r"E:\temp_things\fcswp\nl_sorted\facesets_swap_ready"
CACHES = [
Path("/opt/face-sets/work/cache/nl_full.npz"),
Path("/opt/face-sets/work/cache/immich_peter.npz"),
Path("/opt/face-sets/work/cache/immich_nic.npz"),
]
NEAR_DUP_THRESHOLD = 0.95
HASH_PARALLEL = 16
# ----------------------------- helpers -----------------------------
def faceset_tier(name: str) -> int:
m = re.match(r"^faceset_0*(\d+)(?:_.+)?$", name)
if not m:
return 99
n = int(m.group(1))
if 13 <= n <= 19:
return 0
if 1 <= n <= 12:
return 1
if 20 <= n <= 25:
return 2
if 26 <= n <= 264:
return 3
if 265 <= n:
return 4
return 99
def faceset_family(name: str) -> str:
"""faceset_001_2010-13 → faceset_001; faceset_001 → faceset_001."""
m = re.match(r"^(faceset_\d+)(?:_.+)?$", name)
return m.group(1) if m else name
def wsl_to_win(p: str) -> str:
s = str(p)
if s.startswith("/mnt/"):
return f"{s[5].upper()}:\\{s[7:].replace('/', chr(92))}"
return s
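# e.g. wsl_to_win("/mnt/e/temp_things/x.png") -> "E:\temp_things\x.png" (illustrative path)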
def iter_active_facesets() -> list[Path]:
out = []
for d in sorted(ROOT.iterdir()):
if d.is_dir() and not d.name.startswith("_"):
out.append(d)
return out
def sha256_file(p: Path) -> str:
h = hashlib.sha256()
with open(p, "rb") as f:
while True:
b = f.read(1 << 20)
if not b:
break
h.update(b)
return h.hexdigest()
def load_caches():
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
alias_map: dict[str, str] = {}
for c in CACHES:
if not c.exists():
continue
d = np.load(c, allow_pickle=True)
emb = d["embeddings"]
meta = json.loads(str(d["meta"]))
face_records = [m for m in meta if not m.get("noface")]
if "path_aliases" in d.files:
paliases = json.loads(str(d["path_aliases"]))
for canon, alist in paliases.items():
alias_map.setdefault(canon, canon)
for a in alist:
alias_map[a] = canon
for i, rec in enumerate(face_records):
p = rec["path"]
bbox = tuple(int(x) for x in rec["bbox"])
v = emb[i].astype(np.float32)
n = float(np.linalg.norm(v))
if n > 0:
v = v / n
rec_index[(p, bbox)] = v
alias_map.setdefault(p, p)
return rec_index, alias_map
def lookup_emb(rec_index, alias_map, src: str, bbox):
bbox_t = tuple(int(x) for x in bbox)
canon = alias_map.get(src, src)
v = rec_index.get((canon, bbox_t))
if v is None and canon != src:
v = rec_index.get((src, bbox_t))
return v
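# Toy sketch (invented vectors) of the near-dup grouping cmd_analyze performs
# below: unit-normalize, take the pairwise cosine matrix, blank the diagonal,
# and threshold; a and b land in one group, c stays apart, and the
# highest-quality member of each group is the keeper.
def _demo_near_dup_groups():
    a = np.array([1.0, 0.0], dtype=np.float32)
    b = np.array([0.999, 0.045], dtype=np.float32)  # ~0.999 cosine to a
    c = np.array([0.0, 1.0], dtype=np.float32)
    M = np.stack([v / np.linalg.norm(v) for v in (a, b, c)])
    sim = M @ M.T
    np.fill_diagonal(sim, -1)  # ignore self-similarity
    return np.argwhere(sim >= NEAR_DUP_THRESHOLD)  # -> [[0, 1], [1, 0]]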
# ----------------------------- analyze -----------------------------
def cmd_analyze(args):
rec_index, alias_map = load_caches()
facesets = iter_active_facesets()
print(f"[scan] {len(facesets)} active facesets", file=sys.stderr)
# Phase 1: walk every PNG, collect (faceset, file, src, bbox, quality, emb, sha256)
all_pngs = [] # list of dicts
t0 = time.time()
for fs in facesets:
manifest_path = fs / "manifest.json"
if not manifest_path.exists():
continue
m = json.loads(manifest_path.read_text())
for f in m.get("faces", []):
png_rel = f.get("png")
if not png_rel:
continue
disk_path = fs / png_rel
if not disk_path.exists():
continue
all_pngs.append({
"faceset": fs.name,
"family": faceset_family(fs.name),
"tier": faceset_tier(fs.name),
"file": Path(png_rel).name,
"rank": f.get("rank"),
"source": f.get("source"),
"bbox": f.get("bbox"),
"quality": f.get("quality", {}).get("composite", 0),
"disk_path": str(disk_path),
})
print(f"[scan] {len(all_pngs)} PNGs walked in {time.time()-t0:.1f}s", file=sys.stderr)
# Phase 2: SHA256 hash each PNG (parallel I/O)
t0 = time.time()
def _hash_one(idx):
all_pngs[idx]["sha256"] = sha256_file(Path(all_pngs[idx]["disk_path"]))
with ThreadPoolExecutor(max_workers=HASH_PARALLEL) as ex:
# exhaust the iterator to actually run
for _ in ex.map(_hash_one, range(len(all_pngs)), chunksize=16):
pass
print(f"[hash] {len(all_pngs)} PNGs hashed in {time.time()-t0:.1f}s", file=sys.stderr)
# Phase 3: cross-family byte-dedup
by_sha: dict[str, list[int]] = {}
for i, p in enumerate(all_pngs):
by_sha.setdefault(p["sha256"], []).append(i)
cross_family_groups = []
byte_drops: set[int] = set() # indices of PNGs to drop
for sha, idxs in by_sha.items():
if len(idxs) < 2:
continue
families = {all_pngs[i]["family"] for i in idxs}
if len(families) < 2:
continue # all in same family — intentional era duplication
# multiple families share this content → dedup keeping the best one
cross_family_groups.append({"sha256": sha, "members": [
{"faceset": all_pngs[i]["faceset"], "file": all_pngs[i]["file"],
"tier": all_pngs[i]["tier"], "quality": all_pngs[i]["quality"],
"rank": all_pngs[i]["rank"]} for i in idxs
]})
# keeper rule: lowest tier number, then highest quality
best = sorted(idxs, key=lambda i: (all_pngs[i]["tier"], -all_pngs[i]["quality"]))[0]
for i in idxs:
# NEVER drop within-family copies (preserve era duplication intentionally)
# We only drop indices whose family != best's family
if i != best and all_pngs[i]["family"] != all_pngs[best]["family"]:
byte_drops.add(i)
print(f"[byte] {len(cross_family_groups)} cross-family hash groups; "
f"{len(byte_drops)} PNGs marked for byte-dedup drop", file=sys.stderr)
# Phase 4: within-faceset near-dup (embedding sim >= threshold)
by_faceset: dict[str, list[int]] = {}
for i, p in enumerate(all_pngs):
by_faceset.setdefault(p["faceset"], []).append(i)
near_dup_groups = []
near_drops: set[int] = set()
miss_emb_total = 0
t0 = time.time()
for fs_name, idxs in by_faceset.items():
if len(idxs) < 2:
continue
# gather embeddings
embs = []
kept_idxs = []
for i in idxs:
v = lookup_emb(rec_index, alias_map, all_pngs[i]["source"], all_pngs[i]["bbox"])
if v is None:
miss_emb_total += 1
continue
embs.append(v)
kept_idxs.append(i)
if len(kept_idxs) < 2:
continue
M = np.stack(embs).astype(np.float32)
sim = M @ M.T
np.fill_diagonal(sim, -1) # ignore self
# find connected components in the (sim >= threshold) graph
adj = {k: set() for k in range(len(kept_idxs))}
for a in range(len(kept_idxs)):
# only check a < b to avoid double work
hi = np.where(sim[a, a+1:] >= NEAR_DUP_THRESHOLD)[0]
for off in hi:
b = a + 1 + int(off)
adj[a].add(b)
adj[b].add(a)
seen = set()
for k in adj:
if k in seen or not adj[k]:
continue
stack = [k]
comp = []
while stack:
x = stack.pop()
if x in seen:
continue
seen.add(x)
comp.append(x)
for y in adj[x]:
if y not in seen:
stack.append(y)
if len(comp) < 2:
continue
comp_idxs = [kept_idxs[c] for c in comp]
# keeper: highest quality.composite, tie-break: lowest rank
best = sorted(comp_idxs, key=lambda i: (-all_pngs[i]["quality"], all_pngs[i]["rank"] or 9999))[0]
sims_in_group = []
for ci in range(len(comp)):
for cj in range(ci+1, len(comp)):
sims_in_group.append(float(sim[comp[ci], comp[cj]]))
near_dup_groups.append({
"faceset": fs_name,
"members": [{"file": all_pngs[i]["file"], "rank": all_pngs[i]["rank"],
"quality": all_pngs[i]["quality"]} for i in comp_idxs],
"keeper": all_pngs[best]["file"],
"min_sim": min(sims_in_group) if sims_in_group else None,
"max_sim": max(sims_in_group) if sims_in_group else None,
})
for i in comp_idxs:
if i != best:
near_drops.add(i)
print(f"[near] {len(near_dup_groups)} near-dup groups; "
f"{len(near_drops)} PNGs marked for near-dup drop "
f"(miss_emb={miss_emb_total}); {time.time()-t0:.1f}s", file=sys.stderr)
# Combined drop set; for output, group by faceset
all_drops = byte_drops | near_drops
drops_by_faceset: dict[str, list] = {}
for i in all_drops:
p = all_pngs[i]
reason = []
if i in byte_drops: reason.append("byte_dup")
if i in near_drops: reason.append("near_dup")
drops_by_faceset.setdefault(p["faceset"], []).append({
"file": p["file"], "rank": p["rank"], "reason": "+".join(reason),
"sha256": p["sha256"], "quality": p["quality"],
})
plan = {
"thresholds": {"near_dup_sim": NEAR_DUP_THRESHOLD},
"totals": {
"active_facesets": len(facesets),
"active_pngs": len(all_pngs),
"byte_dup_groups": len(cross_family_groups),
"byte_dup_drops": len(byte_drops),
"near_dup_groups": len(near_dup_groups),
"near_dup_drops": len(near_drops),
"all_drops": len(all_drops),
"facesets_affected": len(drops_by_faceset),
},
"byte_dup_groups": cross_family_groups,
"near_dup_groups": near_dup_groups,
"drops_by_faceset": drops_by_faceset,
}
op = Path(args.out)
op.parent.mkdir(parents=True, exist_ok=True)
op.write_text(json.dumps(plan, indent=2))
print(f"[done] plan -> {op}", file=sys.stderr)
# ----------------------------- apply -----------------------------
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
import zipfile
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
for i, p in enumerate(pngs):
zf.write(p, arcname=f"{i:04d}.png")
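# Illustrative helper (hypothetical caller): a .fsz produced by _zip_png_list is
# a plain DEFLATE zip of renumbered PNGs, so a downstream tool can inspect one
# with the stdlib alone.
def _demo_read_fsz(zip_path: Path) -> list[str]:
    import zipfile
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()  # ["0000.png", "0001.png", ...]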
def _apply_drops_to_facesets(drops_by_faceset: dict[str, list], reason_label: str, master_path: Path):
"""Move flagged PNGs to <faceset>/faces/_dropped/, rebuild manifests + .fsz.
drops_by_faceset values are lists of {"file": str, ...}.
Returns total moved + counts per faceset."""
master = json.loads(master_path.read_text())
by_name = {f["name"]: f for f in master.get("facesets", [])}
total_moved = 0
per_faceset_counts = {}
for fs_name, drops in drops_by_faceset.items():
fs_dir = ROOT / fs_name
if not fs_dir.is_dir():
print(f"[warn] {fs_name}: dir missing, skip", file=sys.stderr)
continue
faces_dir = fs_dir / "faces"
dropped_dir = faces_dir / "_dropped"
dropped_dir.mkdir(exist_ok=True)
drop_files = {d["file"] for d in drops}
moved_here = 0
for fname in sorted(drop_files):
src = faces_dir / fname
if not src.exists():
continue
shutil.move(str(src), str(dropped_dir / fname))
moved_here += 1
# rebuild manifest by filtering out dropped files
manifest_path = fs_dir / "manifest.json"
if manifest_path.exists():
mm = json.loads(manifest_path.read_text())
new_faces = [f for f in mm.get("faces", []) if Path(f.get("png", "")).name not in drop_files]
mm["faces"] = new_faces
mm["exported"] = len(new_faces)
mm.setdefault(f"{reason_label}_history", []).append({"dropped": moved_here})
# re-zip
survivor_pngs = sorted(faces_dir.glob("*.png"))
top_n = mm.get("top_n", 30)
top_n_eff = min(top_n, len(survivor_pngs))
for old in fs_dir.glob("*.fsz"):
old.unlink()
top_fsz_name = f"{fs_name}_top{top_n_eff}.fsz"
all_fsz_name = f"{fs_name}_all.fsz"
if top_n_eff > 0:
_zip_png_list(survivor_pngs[:top_n_eff], fs_dir / top_fsz_name)
mm["fsz_top"] = top_fsz_name
mm["top_n"] = top_n_eff
else:
mm["fsz_top"] = None
mm["top_n"] = 0
if len(survivor_pngs) > top_n_eff:
_zip_png_list(survivor_pngs, fs_dir / all_fsz_name)
mm["fsz_all"] = all_fsz_name
else:
mm["fsz_all"] = None
manifest_path.write_text(json.dumps(mm, indent=2))
if fs_name in by_name:
by_name[fs_name]["exported"] = len(new_faces)
by_name[fs_name]["fsz_top"] = mm["fsz_top"]
by_name[fs_name]["fsz_all"] = mm["fsz_all"]
by_name[fs_name]["top_n"] = mm["top_n"]
by_name[fs_name].setdefault(f"{reason_label}_dropped", 0)
by_name[fs_name][f"{reason_label}_dropped"] += moved_here
total_moved += moved_here
per_faceset_counts[fs_name] = moved_here
# rewrite master with same ordering
new_facesets = [by_name.get(e["name"], e) for e in master.get("facesets", [])]
master["facesets"] = new_facesets
master.setdefault(f"{reason_label}_runs", []).append({
"facesets_affected": len(per_faceset_counts),
"pngs_moved": total_moved,
})
tmp = master_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(master, indent=2))
tmp.replace(master_path)
return total_moved, per_faceset_counts
def cmd_apply(args):
plan = json.loads(Path(args.plan).read_text())
drops = plan["drops_by_faceset"]
if args.dry_run:
for fs, items in sorted(drops.items()):
reasons = {}
for it in items:
reasons[it["reason"]] = reasons.get(it["reason"], 0) + 1
print(f" {fs}: {len(items)} dropped ({reasons})")
print(f"=== total: {sum(len(v) for v in drops.values())} PNGs across {len(drops)} facesets ===")
return
master_path = ROOT / "manifest.json"
total, _ = _apply_drops_to_facesets(drops, "dedup", master_path)
print(f"[done] {total} PNGs moved to faces/_dropped/ across {len(drops)} facesets", file=sys.stderr)
# ----------------------------- multiface staging + apply -----------------------------
def cmd_stage_multiface(args):
"""Build queue.json of all currently-active PNGs in the corpus
for the Windows DML multi-face audit worker."""
queue = []
for fs in iter_active_facesets():
faces_dir = fs / "faces"
if not faces_dir.is_dir():
continue
for p in sorted(faces_dir.glob("*.png")):
queue.append({
"wsl_path": str(p),
"win_path": wsl_to_win(str(p)),
"faceset": fs.name,
"file": p.name,
})
op = Path(args.out)
op.parent.mkdir(parents=True, exist_ok=True)
op.write_text(json.dumps(queue, indent=2))
print(f"[stage] {len(queue)} PNGs -> {op}", file=sys.stderr)
def cmd_merge_multiface(args):
"""Convert worker results.json into a drops_by_faceset plan."""
src = json.loads(Path(args.results).read_text())
drops_by_faceset: dict[str, list] = {}
bad_count = 0
for r in src.get("results", []):
n_faces = r.get("face_count", -1)
if n_faces == 1:
continue
bad_count += 1
drops_by_faceset.setdefault(r["faceset"], []).append({
"file": r["file"],
"reason": f"multiface_{n_faces}",
"face_count": n_faces,
})
plan = {
"totals": {"bad_pngs": bad_count, "facesets_affected": len(drops_by_faceset),
"scored": len(src.get("results", []))},
"drops_by_faceset": drops_by_faceset,
}
op = Path(args.out)
op.parent.mkdir(parents=True, exist_ok=True)
op.write_text(json.dumps(plan, indent=2))
print(f"[merge] {bad_count} bad PNGs across {len(drops_by_faceset)} facesets -> {op}", file=sys.stderr)
def cmd_apply_multiface(args):
plan = json.loads(Path(args.plan).read_text())
drops = plan["drops_by_faceset"]
if args.dry_run:
for fs, items in sorted(drops.items()):
print(f" {fs}: {len(items)} bad PNG(s)")
print(f"=== total: {sum(len(v) for v in drops.values())} ===")
return
master_path = ROOT / "manifest.json"
total, _ = _apply_drops_to_facesets(drops, "multiface", master_path)
print(f"[done] {total} PNGs moved to faces/_dropped/ across {len(drops)} facesets", file=sys.stderr)
# ----------------------------- report -----------------------------
def cmd_report(args):
out_dir = Path(args.out)
out_dir.mkdir(parents=True, exist_ok=True)
sections = []
if args.dedup:
d = json.loads(Path(args.dedup).read_text())
t = d["totals"]
sections.append(f"<h2>Dedup</h2>")
sections.append(
f"<ul>"
f"<li>Active facesets: {t['active_facesets']}, active PNGs: {t['active_pngs']}</li>"
f"<li>Cross-family byte-dup groups: {t['byte_dup_groups']}{t['byte_dup_drops']} PNGs dropped</li>"
f"<li>Within-faceset near-dup groups (sim≥{d['thresholds']['near_dup_sim']}): {t['near_dup_groups']}{t['near_dup_drops']} PNGs dropped</li>"
f"<li><b>Total dedup drops: {t['all_drops']}</b> across {t['facesets_affected']} facesets</li>"
f"</ul>"
)
# top-N affected facesets
rows = sorted(d["drops_by_faceset"].items(), key=lambda x: -len(x[1]))[:25]
sections.append("<h3>Top 25 most-affected facesets</h3><table><tr><th>faceset</th><th>dropped</th><th>reasons</th></tr>")
for fs, items in rows:
r = {}
for it in items:
r[it["reason"]] = r.get(it["reason"], 0) + 1
sections.append(f"<tr><td>{fs}</td><td>{len(items)}</td><td>{r}</td></tr>")
sections.append("</table>")
if args.multiface:
m = json.loads(Path(args.multiface).read_text())
t = m["totals"]
sections.append("<h2>Multi-face audit</h2>")
sections.append(
f"<ul>"
f"<li>PNGs scored: {t['scored']}</li>"
f"<li>Bad PNGs (0 or >1 face): {t['bad_pngs']} across {t['facesets_affected']} facesets</li>"
f"</ul>"
)
html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Dedup + multi-face audit</title>
<style>
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
h1, h2, h3 {{ margin-top:1em; }}
table {{ border-collapse: collapse; font-family: monospace; font-size: 12px; }}
table td, table th {{ padding: 2px 8px; border: 1px solid #333; }}
ul li {{ margin: 4px 0; }}
</style></head>
<body>
<h1>facesets_swap_ready dedup + roop optimization audit</h1>
{''.join(sections)}
</body></html>"""
out_html = out_dir / "index.html"
out_html.write_text(html)
print(f"[done] {out_html}", file=sys.stderr)
# ----------------------------- main -----------------------------
def main():
ap = argparse.ArgumentParser()
sub = ap.add_subparsers(dest="cmd", required=True)
a = sub.add_parser("analyze")
a.add_argument("--out", required=True)
a.set_defaults(func=cmd_analyze)
p = sub.add_parser("apply")
p.add_argument("--plan", required=True)
p.add_argument("--dry-run", action="store_true")
p.set_defaults(func=cmd_apply)
sm = sub.add_parser("stage_multiface")
sm.add_argument("--out", required=True)
sm.set_defaults(func=cmd_stage_multiface)
mm = sub.add_parser("merge_multiface")
mm.add_argument("--results", required=True)
mm.add_argument("--out", required=True)
mm.set_defaults(func=cmd_merge_multiface)
am = sub.add_parser("apply_multiface")
am.add_argument("--plan", required=True)
am.add_argument("--dry-run", action="store_true")
am.set_defaults(func=cmd_apply_multiface)
r = sub.add_parser("report")
r.add_argument("--dedup", default=None)
r.add_argument("--multiface", default=None)
r.add_argument("--out", required=True)
r.set_defaults(func=cmd_report)
args = ap.parse_args()
args.func(args)
if __name__ == "__main__":
main()
@@ -0,0 +1,244 @@
"""Windows / DirectML embed worker.
Reads a queue.json staged by /opt/face-sets/work/immich_stage.py (WSL side),
runs InsightFace's FaceAnalysis on each image with the DmlExecutionProvider
backed by the AMD Vega, and writes a cache file in the schema produced by
sort_faces.py:cmd_embed (so it can be merged into nl_full.npz).
CLI:
py -3.12 embed_worker.py <queue.json> <out_cache.npz> [--limit N]
The queue.json entry shape (each item) is:
{
"asset_id": "...",
"sha256": "...",
"wsl_path": "/mnt/x/src/immich/<user>/<rel>", # canonical path stored
"win_path": "X:\\src\\immich\\<user>\\<rel>", # what we read from
"size_bytes": int,
"width": int, "height": int,
...
}
Per face record matches cmd_embed's schema:
path, face_idx, det_score, bbox, face_short, face_area, blur, noface=False, hash
plus landmark_2d_106, landmark_3d_68, pose (FaceAnalysis returns these for
free and the existing cache already carries them after `enrich`).
"""
from __future__ import annotations
import argparse
import json
import os
import sys
import time
from pathlib import Path
import numpy as np
from PIL import Image, ImageOps
from insightface.app import FaceAnalysis
MODEL_ROOT = r"C:\face_embed_venv\models"
MIN_DET_SCORE = 0.5
MIN_FACE_PIX = 40
FLUSH_EVERY = 50
def load_rgb_bgr(path: Path):
try:
with Image.open(path) as im:
im = ImageOps.exif_transpose(im)
im = im.convert("RGB")
rgb = np.array(im)
bgr = rgb[:, :, ::-1].copy()
return rgb, bgr
except Exception as e:
print(f"[warn] failed to load {path}: {e}", file=sys.stderr)
return None, None
def laplacian_variance(gray: np.ndarray) -> float:
g = gray.astype(np.float32)
lap = (
-4.0 * g[1:-1, 1:-1]
+ g[:-2, 1:-1] + g[2:, 1:-1]
+ g[1:-1, :-2] + g[1:-1, 2:]
)
return float(lap.var())
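# Quick sanity sketch for the blur metric (synthetic arrays, illustration only):
# the 4-neighbour Laplacian responds to local structure, so a noisy crop scores
# far higher variance than a flat one; low values flag blur.
def _demo_laplacian_variance():
    rng = np.random.default_rng(0)
    flat = np.full((64, 64), 128.0)
    noisy = 128.0 + 40.0 * rng.standard_normal((64, 64))
    return laplacian_variance(flat), laplacian_variance(noisy)  # (0.0, large)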
def save_cache(out_path: Path, emb_chunks: list, meta: list, processed: set, src_root: str):
emb = np.concatenate(emb_chunks) if emb_chunks else np.zeros((0, 512), dtype=np.float32)
tmp = out_path.with_suffix(".tmp.npz")
np.savez(
str(tmp),
embeddings=emb,
meta=json.dumps(meta),
src_root=str(src_root),
processed_paths=json.dumps(sorted(processed)),
path_aliases=json.dumps({}),
schema="v2",
)
os.replace(tmp, out_path)
def load_cache_if_exists(out_path: Path):
"""Resume helper. Returns (emb_chunks, meta, processed_set)."""
if not out_path.exists():
return [], [], set()
data = np.load(out_path, allow_pickle=True)
emb = data["embeddings"]
meta = json.loads(str(data["meta"]))
processed = set(json.loads(str(data["processed_paths"])))
return [emb] if len(emb) else [], list(meta), processed
def main():
p = argparse.ArgumentParser()
p.add_argument("queue", type=Path)
p.add_argument("out", type=Path)
p.add_argument("--limit", type=int, default=None)
args = p.parse_args()
queue = json.loads(args.queue.read_text())
print(f"queue: {len(queue)} entries from {args.queue}")
args.out.parent.mkdir(parents=True, exist_ok=True)
emb_chunks, meta, processed = load_cache_if_exists(args.out)
n_existing_records = len(meta)
n_existing_emb = sum(e.shape[0] for e in emb_chunks)
if n_existing_records:
print(f"resume: {n_existing_records} existing meta records "
f"({n_existing_emb} embeddings, {len(processed)} processed paths)")
print("initializing FaceAnalysis with DmlExecutionProvider")
app = FaceAnalysis(
name="buffalo_l",
root=MODEL_ROOT,
providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
app.prepare(ctx_id=0, det_size=(640, 640))
src_root = "/mnt/x/src/immich"
n_done = 0
n_face_records_added = 0
n_noface_added = 0
n_skipped = 0
n_load_err = 0
t0 = time.perf_counter()
last_flush = time.perf_counter()
new_emb_chunks: list[np.ndarray] = []
new_meta: list[dict] = []
def flush():
nonlocal new_emb_chunks, new_meta, last_flush
if not new_emb_chunks and not new_meta:
return
if new_emb_chunks:
emb_chunks.append(np.concatenate(new_emb_chunks))
new_emb_chunks = []
for r in new_meta:
meta.append(r)
new_meta = []
save_cache(args.out, emb_chunks, meta, processed, src_root)
last_flush = time.perf_counter()
for i, entry in enumerate(queue):
if args.limit is not None and n_done >= args.limit:
break
wsl_path = entry["wsl_path"]
win_path = entry["win_path"]
sha = entry["sha256"]
if wsl_path in processed:
n_skipped += 1
continue
rgb, bgr = load_rgb_bgr(Path(win_path))
if bgr is None:
new_meta.append({
"path": wsl_path, "face_idx": -1, "noface": True,
"hash": sha, "error": "load",
})
processed.add(wsl_path)
n_load_err += 1
n_done += 1
continue
faces = app.get(bgr)
kept_any = False
for j, f in enumerate(faces):
if float(f.det_score) < MIN_DET_SCORE:
continue
x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
x1 = max(x1, 0); y1 = max(y1, 0)
x2 = min(x2, rgb.shape[1]); y2 = min(y2, rgb.shape[0])
w, h = x2 - x1, y2 - y1
short = min(w, h)
if short < MIN_FACE_PIX:
continue
crop = rgb[y1:y2, x1:x2]
if crop.size == 0:
continue
gray = crop.mean(axis=2)
blur = laplacian_variance(gray) if min(gray.shape) > 3 else 0.0
emb = f.normed_embedding.astype(np.float32)
new_emb_chunks.append(emb[None, :])
rec = {
"path": wsl_path,
"face_idx": j,
"det_score": float(f.det_score),
"bbox": [x1, y1, x2, y2],
"face_short": int(short),
"face_area": int(w * h),
"blur": blur,
"noface": False,
"hash": sha,
}
# Enrichment-equivalent fields (FaceAnalysis returns these for free)
if hasattr(f, "landmark_2d_106") and f.landmark_2d_106 is not None:
rec["landmark_2d_106"] = f.landmark_2d_106.astype(np.float32).tolist()
if hasattr(f, "landmark_3d_68") and f.landmark_3d_68 is not None:
rec["landmark_3d_68"] = f.landmark_3d_68.astype(np.float32).tolist()
if hasattr(f, "pose") and f.pose is not None:
rec["pose"] = [float(x) for x in f.pose]
new_meta.append(rec)
kept_any = True
n_face_records_added += 1
if not kept_any:
new_meta.append({
"path": wsl_path, "face_idx": -1, "noface": True, "hash": sha,
})
n_noface_added += 1
processed.add(wsl_path)
n_done += 1
if (n_done % FLUSH_EVERY == 0) or (time.perf_counter() - last_flush) > 30.0:
flush()
elapsed = time.perf_counter() - t0
rate = n_done / max(0.1, elapsed)
print(
f"[embed] done={n_done:5d}/{len(queue)} faces+={n_face_records_added:5d} "
f"noface+={n_noface_added:4d} skipped={n_skipped:4d} "
f"load_err={n_load_err:3d} rate={rate:.1f} img/s "
f"({elapsed:.1f}s elapsed)"
)
flush()
elapsed = time.perf_counter() - t0
print()
print("=== embed done ===")
print(f" done: {n_done}")
print(f" new face records: {n_face_records_added}")
print(f" new noface records: {n_noface_added}")
print(f" skipped (already done): {n_skipped}")
print(f" load errors: {n_load_err}")
print(f" elapsed: {elapsed:.1f}s ({n_done/max(0.1,elapsed):.1f} img/s)")
print(f" cache: {args.out}")
if __name__ == "__main__":
main()
@@ -0,0 +1,574 @@
"""CLIP zero-shot scoring for masks + sunglasses across facesets_swap_ready/.
Usage:
# score one or more specific facesets (test mode)
python work/filter_occlusions.py score --facesets faceset_001,faceset_050 \
--out work/test_batch_occlusion/scores.json
# score everything (full corpus)
python work/filter_occlusions.py score --out work/occlusion_scores.json
# render HTML contact sheet from a scores.json
python work/filter_occlusions.py report --scores work/test_batch_occlusion/scores.json \
--out work/test_batch_occlusion
Notes:
- This script never modifies facesets_swap_ready/. An --apply step lives elsewhere
(or will be added once thresholds are validated).
- Model: open_clip ViT-L-14 / dfn2b_s39b (best public zero-shot at this size).
"""
from __future__ import annotations
import argparse
import json
import sys
import time
from pathlib import Path
from typing import Iterable
import torch
from PIL import Image
import open_clip
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
WIN_ROOT = r"E:\temp_things\fcswp\nl_sorted\facesets_swap_ready"
MODEL_NAME = "ViT-L-14"
PRETRAINED = "dfn2b_s39b"
def wsl_to_win(wsl_path: str) -> str:
"""Translate a /mnt/e/... wsl path to E:\\... for the Windows worker."""
s = str(wsl_path)
if s.startswith("/mnt/"):
drive = s[5]
rest = s[7:].replace("/", "\\")
return f"{drive.upper()}:\\{rest}"
return s
# Prompt ensembles. Each pair (positive, negative) becomes one binary classifier.
# We average text embeddings within each list, then softmax across the two means.
PROMPTS = {
"mask": {
"pos": [
"a photo of a person wearing a surgical face mask",
"a photo of a person wearing an FFP2 respirator covering mouth and nose",
"a photo of a person wearing a cloth face mask",
"a face partially covered by a medical mask",
"a person whose mouth and nose are hidden by a face mask",
],
"neg": [
"a photo of a person's face with mouth and nose clearly visible",
"a clear, unobstructed photo of a face",
"a photo of a face without any mask or covering",
"a portrait of a person showing their full face",
"a photo of a person with a beard and visible mouth", # avoid beard false positives
],
},
"sunglasses": {
# We want to flag ONLY images where sunglasses occlude the eyes.
# False positives to defeat: sunglasses pushed up on the head/forehead, hanging on a shirt collar.
"pos": [
"a face with dark sunglasses covering the eyes",
"a portrait with the eyes hidden behind opaque sunglasses",
"a person wearing dark sunglasses over their eyes, eyes not visible",
"a face where the eyes are completely concealed by tinted lenses",
"a close-up portrait wearing aviator sunglasses on the eyes",
],
"neg": [
"a portrait with both eyes clearly visible and uncovered",
"a face with sunglasses pushed up on the forehead, eyes visible below",
"a face with sunglasses resting on top of the head, eyes visible",
"a person with sunglasses hanging from their shirt, eyes visible",
"a face wearing clear prescription eyeglasses with visible eyes",
"a portrait with no eyewear and visible eyes",
],
},
}
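# Toy sketch of the ensemble-then-softmax scoring described above (all vectors
# invented): average the per-side text embeddings, renormalize, then softmax the
# temperature-scaled cosine sims of an image feature against the two means to
# get P(pos). Mirrors build_text_features / score_images below.
@torch.no_grad()
def _demo_prompt_softmax():
    pos = torch.nn.functional.normalize(torch.randn(5, 512), dim=-1).mean(dim=0)
    neg = torch.nn.functional.normalize(torch.randn(5, 512), dim=-1).mean(dim=0)
    pos, neg = pos / pos.norm(), neg / neg.norm()
    img = torch.nn.functional.normalize(torch.randn(1, 512), dim=-1)
    sims = torch.stack([img @ pos, img @ neg], dim=1) * 100.0  # logit_scale ~ 100
    return sims.softmax(dim=1)[:, 0]  # P(pos) per image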
def load_model(device: str = "cpu"):
print(f"[clip] loading {MODEL_NAME} / {PRETRAINED} on {device} ...", file=sys.stderr)
t0 = time.time()
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_NAME, pretrained=PRETRAINED)
tokenizer = open_clip.get_tokenizer(MODEL_NAME)
model = model.to(device).eval()
logit_scale = float(model.logit_scale.exp().detach().cpu())
print(f"[clip] ready in {time.time()-t0:.1f}s, logit_scale={logit_scale:.2f}", file=sys.stderr)
return model, preprocess, tokenizer, logit_scale
@torch.no_grad()
def build_text_features(model, tokenizer, device: str):
"""Return dict {attr: (pos_mean_emb, neg_mean_emb)} on device, both L2-normalized."""
out = {}
for attr, sides in PROMPTS.items():
feats = {}
for side in ("pos", "neg"):
tokens = tokenizer(sides[side]).to(device)
f = model.encode_text(tokens)
f = f / f.norm(dim=-1, keepdim=True)
mean = f.mean(dim=0)
feats[side] = mean / mean.norm()
out[attr] = (feats["pos"], feats["neg"])
return out
@torch.no_grad()
def score_images(model, preprocess, text_feats, logit_scale: float, paths: list[Path], device: str, batch: int = 16):
"""Yield (path, {attr: pos_prob}) per image. logit_scale is CLIP's learned temperature (~100)."""
for i in range(0, len(paths), batch):
chunk = paths[i:i + batch]
imgs = []
keep = []
for p in chunk:
try:
img = Image.open(p).convert("RGB")
imgs.append(preprocess(img))
keep.append(p)
except Exception as e:
print(f"[skip] {p}: {e}", file=sys.stderr)
if not imgs:
continue
x = torch.stack(imgs).to(device)
feats = model.encode_image(x)
feats = feats / feats.norm(dim=-1, keepdim=True) # (B, D)
results = {}
for attr, (pos, neg) in text_feats.items():
sims = torch.stack([feats @ pos, feats @ neg], dim=1) * logit_scale # (B, 2)
probs = sims.softmax(dim=1)[:, 0].tolist() # P(pos)
results[attr] = probs
for j, p in enumerate(keep):
yield p, {attr: results[attr][j] for attr in text_feats}
def iter_facesets(root: Path, only: list[str] | None) -> Iterable[Path]:
if only:
for name in only:
d = root / name
if d.is_dir():
yield d
else:
print(f"[warn] not a directory: {d}", file=sys.stderr)
return
for d in sorted(root.iterdir()):
if d.is_dir() and not d.name.startswith("_"):
yield d
def cmd_score(args):
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess, tokenizer, logit_scale = load_model(device)
text_feats = build_text_features(model, tokenizer, device)
only = [s.strip() for s in args.facesets.split(",")] if args.facesets else None
facesets = list(iter_facesets(ROOT, only))
    # --sample-per-faceset takes the first N PNGs per faceset (cheap
    # deterministic sample for test batches); applied inside the loop below.
report = {
"model": f"{MODEL_NAME}/{PRETRAINED}",
"root": str(ROOT),
"prompts": PROMPTS,
"facesets": {},
}
total_imgs = 0
t0 = time.time()
for fs in facesets:
faces = sorted((fs / "faces").glob("*.png")) if (fs / "faces").is_dir() else sorted(fs.glob("*.png"))
if args.sample_per_faceset:
faces = faces[: args.sample_per_faceset]
if not faces:
continue
print(f"[scan] {fs.name}: {len(faces)} png", file=sys.stderr)
per_image = []
for p, scores in score_images(model, preprocess, text_feats, logit_scale, faces, device):
per_image.append({"file": p.name, "mask": round(scores["mask"], 4), "sunglasses": round(scores["sunglasses"], 4)})
total_imgs += 1
report["facesets"][fs.name] = per_image
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(report, indent=2))
dt = time.time() - t0
print(f"[done] {total_imgs} images, {dt:.1f}s ({total_imgs/max(dt,1e-3):.2f} img/s) -> {out}", file=sys.stderr)
def cmd_report(args):
"""Render an HTML contact sheet from scores.json. Generates JPG thumbs."""
scores = json.loads(Path(args.scores).read_text())
out_dir = Path(args.out)
thumbs_dir = out_dir / "thumbs"
thumbs_dir.mkdir(parents=True, exist_ok=True)
THUMB = 160
rows_html = []
def thumb_path(faceset: str, fname: str) -> Path:
d = thumbs_dir / faceset
d.mkdir(parents=True, exist_ok=True)
return d / (Path(fname).stem + ".jpg")
def make_thumb(src: Path, dst: Path):
if dst.exists():
return
try:
img = Image.open(src).convert("RGB")
img.thumbnail((THUMB, THUMB), Image.LANCZOS)
img.save(dst, "JPEG", quality=82)
except Exception as e:
print(f"[thumb-skip] {src}: {e}", file=sys.stderr)
facesets = scores["facesets"]
for faceset, items in facesets.items():
# sort: high score first so borderline cases group at the boundary
items_sorted = sorted(items, key=lambda x: max(x["mask"], x["sunglasses"]), reverse=True)
# faceset summary
n = len(items)
n_mask = sum(1 for x in items if x["mask"] >= 0.7)
n_sg = sum(1 for x in items if x["sunglasses"] >= 0.7)
pct_mask = (100 * n_mask / n) if n else 0
pct_sg = (100 * n_sg / n) if n else 0
rows_html.append(f"<h2 id='{faceset}'>{faceset} <small>({n} imgs &middot; mask&ge;0.7: {n_mask} ({pct_mask:.0f}%) &middot; sunglasses&ge;0.7: {n_sg} ({pct_sg:.0f}%))</small></h2>")
rows_html.append("<div class='grid'>")
src_root = ROOT / faceset
faces_root = (src_root / "faces") if (src_root / "faces").is_dir() else src_root
for it in items_sorted:
src = faces_root / it["file"]
dst = thumb_path(faceset, it["file"])
make_thumb(src, dst)
rel = f"thumbs/{faceset}/{Path(it['file']).stem}.jpg"
m, s = it["mask"], it["sunglasses"]
cls_m = "hi" if m >= 0.7 else ("mid" if m >= 0.4 else "lo")
cls_s = "hi" if s >= 0.7 else ("mid" if s >= 0.4 else "lo")
rows_html.append(
f"<div class='cell'>"
f"<img src='{rel}' loading='lazy' title='{it['file']}'>"
f"<div class='scores'><span class='{cls_m}'>M {m:.2f}</span> <span class='{cls_s}'>S {s:.2f}</span></div>"
f"</div>"
)
rows_html.append("</div>")
nav = " · ".join(f"<a href='#{f}'>{f}</a>" for f in facesets)
html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Occlusion test batch</title>
<style>
body {{ font-family: system-ui, sans-serif; background: #111; color: #eee; padding: 1em; }}
h1 {{ margin-top: 0; }}
h2 {{ margin-top: 1.5em; border-bottom: 1px solid #333; padding-bottom: .25em; }}
small {{ color: #999; font-weight: normal; }}
.grid {{ display: grid; grid-template-columns: repeat(auto-fill, minmax(170px, 1fr)); gap: .5em; }}
.cell {{ background: #1c1c1c; padding: 4px; border-radius: 4px; text-align: center; }}
.cell img {{ max-width: 100%; height: auto; display: block; margin: 0 auto; }}
.scores {{ font-family: monospace; font-size: 11px; padding-top: 4px; }}
.hi {{ color: #ff5050; font-weight: bold; }}
.mid {{ color: #ffb050; }}
.lo {{ color: #5fa05f; }}
.nav {{ position: sticky; top: 0; background: #111; padding: .5em 0; border-bottom: 1px solid #333; }}
a {{ color: #6cf; }}
</style></head>
<body>
<h1>Occlusion scores &mdash; {scores['model']}</h1>
<p>Sorted within each faceset by max(mask, sunglasses) descending.
Color: <span class='hi'>&ge;0.70</span> &middot; <span class='mid'>0.40&ndash;0.70</span> &middot; <span class='lo'>&lt;0.40</span></p>
<div class='nav'>{nav}</div>
{''.join(rows_html)}
</body></html>"""
out_html = out_dir / "index.html"
out_html.write_text(html)
print(f"[done] {out_html}", file=sys.stderr)
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
"""Mirror of sort_faces.py:_zip_png_list. Renames PNGs to 0000.png, 0001.png, ..."""
import zipfile
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
for i, p in enumerate(pngs):
zf.write(p, arcname=f"{i:04d}.png")
def cmd_apply(args):
"""Prune mask/sunglasses PNGs, quarantine occlusion-dominated facesets,
re-zip .fsz, update top-level manifest. --dry-run prints the plan only."""
import shutil
threshold = args.threshold
domain_pct = args.domain_pct
min_survivors = args.min_survivors
top_n_target = args.top_n
scores = json.loads(Path(args.scores).read_text())
master_path = ROOT / "manifest.json"
master = json.loads(master_path.read_text())
by_name = {f["name"]: f for f in master.get("facesets", [])}
masked_dir = ROOT / "_masked"
thin_dir = ROOT / "_thin"
plan = []
for faceset, items in scores["facesets"].items():
if faceset not in by_name:
print(f"[warn] {faceset} not in master manifest — skipping", file=sys.stderr)
continue
n = len(items)
flagged_files = sorted(
it["file"] for it in items
if it["mask"] >= threshold or it["sunglasses"] >= threshold
)
        flagged_set = set(flagged_files)
        survivors_items = [it for it in items if it["file"] not in flagged_set]
        # preserve quality order from filename (0001.png is highest-rank)
        survivors_files = sorted(it["file"] for it in survivors_items)
n_mask = sum(1 for it in items if it["mask"] >= threshold)
n_sg = sum(1 for it in items if it["sunglasses"] >= threshold)
pct_mask = n_mask / n if n else 0
pct_sg = n_sg / n if n else 0
if pct_mask >= domain_pct:
action, reason = "quarantine_masked", f"mask_pct={pct_mask:.0%}"
elif pct_sg >= domain_pct:
action, reason = "quarantine_masked", f"sunglasses_pct={pct_sg:.0%}"
elif flagged_files and len(survivors_files) < min_survivors:
# only quarantine-as-thin if pruning is the cause of the drop below threshold;
# pre-existing small facesets without occlusions are left alone
action, reason = "quarantine_thin", f"survivors={len(survivors_files)}<{min_survivors}"
elif flagged_files:
action, reason = "prune", f"drop {len(flagged_files)}"
else:
action, reason = "keep", "clean"
plan.append({
"faceset": faceset, "action": action, "reason": reason,
"n": n, "n_mask": n_mask, "n_sg": n_sg,
"n_dropped": len(flagged_files), "n_survivors": len(survivors_files),
"dropped_files": flagged_files,
})
# Summary
counts = {a: 0 for a in ("keep", "prune", "quarantine_masked", "quarantine_thin")}
for p in plan:
counts[p["action"]] += 1
total_dropped_pngs = sum(p["n_dropped"] for p in plan if p["action"] == "prune")
total_quarantined_pngs = sum(p["n"] for p in plan if p["action"].startswith("quarantine"))
print(f"=== plan summary (threshold={threshold} domain_pct={domain_pct} min_survivors={min_survivors}) ===")
for a, c in counts.items():
print(f" {a}: {c}")
print(f" PNGs to drop (prune): {total_dropped_pngs}")
print(f" PNGs to quarantine (whole): {total_quarantined_pngs}")
print(f" facesets in master: {len(master['facesets'])}")
print(f" facesets scored: {len(plan)}")
# Write plan for audit
plan_path = Path(args.out_plan)
plan_path.parent.mkdir(parents=True, exist_ok=True)
plan_path.write_text(json.dumps({
"thresholds": {"image": threshold, "domain_pct": domain_pct, "min_survivors": min_survivors},
"counts": counts,
"totals": {"dropped_pngs": total_dropped_pngs, "quarantined_pngs": total_quarantined_pngs},
"plan": plan,
}, indent=2))
print(f" plan written to {plan_path}")
if args.dry_run:
# pretty list of quarantines
for p in plan:
if p["action"].startswith("quarantine"):
print(f" [{p['action']:>18s}] {p['faceset']} ({p['reason']}, n={p['n']})")
return
# ----- destructive section -----
masked_dir.mkdir(parents=True, exist_ok=True)
thin_dir.mkdir(parents=True, exist_ok=True)
new_facesets = []
new_masked = list(master.get("masked", [])) # preserve any prior runs
new_thin = list(master.get("thin_eras", []))
# build a name -> existing-thin/masked entry index, to update relpath if we re-quarantine
by_name_thin = {e["name"]: e for e in new_thin}
by_name_masked = {e["name"]: e for e in new_masked}
for p in plan:
entry = dict(by_name[p["faceset"]]) # copy
fs_dir = ROOT / p["faceset"]
faces_dir = fs_dir / "faces"
if p["action"] == "keep":
new_facesets.append(entry)
continue
# prune dropped PNGs first (applies to both prune and quarantine_thin paths)
if p["dropped_files"]:
dropped_holding = faces_dir / "_dropped"
dropped_holding.mkdir(exist_ok=True)
for fname in p["dropped_files"]:
src = faces_dir / fname
if src.exists():
shutil.move(str(src), str(dropped_holding / fname))
if p["action"].startswith("quarantine"):
target_root = masked_dir if p["action"] == "quarantine_masked" else thin_dir
target = target_root / p["faceset"]
# idempotency: if a previous run already moved it, skip the move
if not target.exists():
shutil.move(str(fs_dir), str(target))
entry["occlusion_filter"] = {
"action": p["action"], "reason": p["reason"],
"n_input": p["n"], "n_mask": p["n_mask"], "n_sg": p["n_sg"],
"n_dropped": p["n_dropped"], "n_survivors": p["n_survivors"],
"threshold": threshold, "domain_pct": domain_pct,
}
entry["relpath"] = f"{'_masked' if p['action']=='quarantine_masked' else '_thin'}/{p['faceset']}"
entry["fsz_top"] = None
entry["fsz_all"] = None
if p["action"] == "quarantine_masked":
entry["masked"] = True
new_masked.append(entry)
else:
entry["thin"] = True
new_thin.append(entry)
continue
# action == prune
survivor_pngs = sorted(faces_dir.glob("*.png"))
if not survivor_pngs:
print(f"[warn] {p['faceset']}: no survivor PNGs after prune", file=sys.stderr)
new_facesets.append(entry)
continue
# re-zip .fsz from survivors in quality order
top_n_eff = min(top_n_target, len(survivor_pngs))
top_fsz = fs_dir / f"{p['faceset']}_top{top_n_eff}.fsz"
all_fsz = fs_dir / f"{p['faceset']}_all.fsz"
# remove old .fsz files (they may have different top_n in name)
for old in fs_dir.glob("*.fsz"):
old.unlink()
_zip_png_list(survivor_pngs[:top_n_eff], top_fsz)
if len(survivor_pngs) > top_n_eff:
_zip_png_list(survivor_pngs, all_fsz)
entry["fsz_all"] = all_fsz.name
else:
entry["fsz_all"] = None
entry["fsz_top"] = top_fsz.name
entry["top_n"] = top_n_eff
entry["exported"] = len(survivor_pngs)
entry["dropped_occlusion"] = p["n_dropped"]
entry["occlusion_filter"] = {
"action": "prune", "n_input": p["n"], "n_mask": p["n_mask"],
"n_sg": p["n_sg"], "n_dropped": p["n_dropped"], "n_survivors": p["n_survivors"],
"threshold": threshold,
}
new_facesets.append(entry)
# write updated master manifest
new_master = dict(master)
new_master["facesets"] = new_facesets
new_master["masked"] = new_masked
new_master["thin_eras"] = new_thin
new_master["occlusion_filter_run"] = {
"model": scores.get("model"),
"threshold": threshold,
"domain_pct": domain_pct,
"min_survivors": min_survivors,
"counts": counts,
"totals": {"dropped_pngs": total_dropped_pngs, "quarantined_pngs": total_quarantined_pngs},
}
tmp = master_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(new_master, indent=2))
tmp.replace(master_path)
print(f"[done] master manifest updated: {len(new_facesets)} active, "
f"{len(new_masked)} masked, {len(new_thin)} thin")
def cmd_stage(args):
"""Walk facesets_swap_ready/ and write a queue.json for the Windows clip_worker."""
only = [s.strip() for s in args.facesets.split(",")] if args.facesets else None
queue = []
for fs in iter_facesets(ROOT, only):
faces = sorted((fs / "faces").glob("*.png")) if (fs / "faces").is_dir() else sorted(fs.glob("*.png"))
for p in faces:
queue.append({
"wsl_path": str(p),
"win_path": wsl_to_win(str(p)),
"faceset": fs.name,
"file": p.name,
})
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(queue, indent=2))
print(f"[stage] {len(queue)} png paths -> {out}", file=sys.stderr)
print(f"[stage] win queue file: {wsl_to_win(str(out))}", file=sys.stderr)
def cmd_merge(args):
"""Ingest worker scores.json into the per-faceset shape that `report` reads."""
src = json.loads(Path(args.scores).read_text())
by_faceset: dict[str, list] = {}
for r in src.get("results", []):
by_faceset.setdefault(r["faceset"], []).append({
"file": r["file"],
"mask": r["mask"],
"sunglasses": r["sunglasses"],
})
# stable ordering: faceset by name, files by name
out_data = {
"model": src.get("model", f"{MODEL_NAME}/{PRETRAINED}"),
"root": str(ROOT),
"prompts": src.get("prompts", PROMPTS),
"facesets": {fs: sorted(items, key=lambda x: x["file"]) for fs, items in sorted(by_faceset.items())},
}
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(out_data, indent=2))
total = sum(len(v) for v in by_faceset.values())
print(f"[merge] {total} scores across {len(by_faceset)} facesets -> {out}", file=sys.stderr)
def main():
ap = argparse.ArgumentParser()
sub = ap.add_subparsers(dest="cmd", required=True)
s = sub.add_parser("score", help="WSL CPU scoring (slow but no GPU dependency)")
s.add_argument("--facesets", default=None, help="comma-separated faceset names; default = all")
s.add_argument("--sample-per-faceset", type=int, default=0, help="cap PNGs per faceset (0 = all)")
s.add_argument("--out", required=True)
s.set_defaults(func=cmd_score)
st = sub.add_parser("stage", help="Build queue.json for Windows clip_worker.py")
st.add_argument("--facesets", default=None, help="comma-separated faceset names; default = all")
st.add_argument("--out", required=True)
st.set_defaults(func=cmd_stage)
m = sub.add_parser("merge", help="Convert worker scores.json into per-faceset report format")
m.add_argument("--scores", required=True, help="worker output (flat list of results)")
m.add_argument("--out", required=True, help="output path for per-faceset format")
m.set_defaults(func=cmd_merge)
r = sub.add_parser("report", help="Render HTML contact sheet from a per-faceset scores.json")
r.add_argument("--scores", required=True)
r.add_argument("--out", required=True)
r.set_defaults(func=cmd_report)
a = sub.add_parser("apply", help="Prune flagged PNGs, quarantine dominated facesets, re-zip .fsz, update manifest")
a.add_argument("--scores", required=True, help="per-faceset scores.json (output of `merge` or `score`)")
a.add_argument("--out-plan", required=True, help="path to write the apply plan json (audit)")
a.add_argument("--threshold", type=float, default=0.7, help="image-level drop threshold for mask/sunglasses (default 0.7)")
a.add_argument("--domain-pct", type=float, default=0.40, help="faceset-level quarantine threshold (default 0.40)")
a.add_argument("--min-survivors", type=int, default=5, help="quarantine to _thin if survivors below this (default 5)")
a.add_argument("--top-n", type=int, default=30, help="top-N for re-zipped _topN.fsz (default 30)")
a.add_argument("--dry-run", action="store_true", help="print plan only, no filesystem changes")
a.set_defaults(func=cmd_apply)
args = ap.parse_args()
args.func(args)
if __name__ == "__main__":
main()
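# Typical chain (a sketch; file locations illustrative). Either `score` on WSL CPU
# directly, or stage -> Windows clip_worker -> merge for the DML path:
#   python3 <this script> stage --out work/occl/queue.json
#   (run the Windows clip_worker on queue.json -> flat worker scores)
#   python3 <this script> merge --scores work/occl/worker_scores.json --out work/occl/scores.json
#   python3 <this script> report --scores work/occl/scores.json --out work/occl/report.html
#   python3 <this script> apply --scores work/occl/scores.json --out-plan work/occl/plan.json --dry-run
#   ...inspect plan.json, then re-run apply without --dry-run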
+50
@@ -0,0 +1,50 @@
#!/usr/bin/env bash
# Finalize an Immich user's stage:
# 1. Copy queue.json to /mnt/c so the Windows embed worker can read it
# 2. Run the embed worker on Windows (DML)
# 3. Copy the resulting cache back to /opt/face-sets/work/cache/
# 4. Run cluster_immich.py to discover + emit new facesets
#
# Usage: ./work/finalize_immich.sh <user-label>
set -euo pipefail
USER_LABEL="${1:?usage: $0 <user-label>}"
REPO="$(cd "$(dirname "$0")/.." && pwd)"
WSL_QUEUE="$REPO/work/immich/$USER_LABEL/queue.json"
WIN_QUEUE_DIR="/mnt/c/face_embed_venv/work/immich/$USER_LABEL"
WIN_QUEUE="$WIN_QUEUE_DIR/queue.json"
WIN_QUEUE_FOR_PS="C:\\face_embed_venv\\work\\immich\\$USER_LABEL\\queue.json"
WIN_CACHE_DIR="/mnt/c/face_embed_venv/work/cache"
WIN_CACHE="$WIN_CACHE_DIR/immich_${USER_LABEL}.npz"
WIN_CACHE_FOR_PS="C:\\face_embed_venv\\work\\cache\\immich_${USER_LABEL}.npz"
WSL_CACHE="$REPO/work/cache/immich_${USER_LABEL}.npz"
LOG="$REPO/work/logs/immich_finalize_${USER_LABEL}.log"
[ -f "$WSL_QUEUE" ] || { echo "missing queue: $WSL_QUEUE" >&2; exit 1; }
echo "=== finalize: $USER_LABEL ===" | tee -a "$LOG"
date | tee -a "$LOG"
mkdir -p "$WIN_QUEUE_DIR" "$WIN_CACHE_DIR" "$REPO/work/cache"
echo "[1/4] copying queue: $WSL_QUEUE -> $WIN_QUEUE" | tee -a "$LOG"
cp "$WSL_QUEUE" "$WIN_QUEUE"
echo " $(wc -c < "$WIN_QUEUE") bytes; $(/home/peter/face_sort_env/bin/python3 -c "import json,sys; print(len(json.load(open('$WIN_QUEUE'))))") entries"
echo "[2/4] running Windows DML embed worker" | tee -a "$LOG"
powershell.exe -NoProfile -Command "C:\\face_embed_venv\\Scripts\\python.exe C:\\face_embed_venv\\bench\\embed_worker.py '$WIN_QUEUE_FOR_PS' '$WIN_CACHE_FOR_PS'" 2>&1 | tee -a "$LOG"
[ -f "$WIN_CACHE" ] || { echo "embed produced no cache file at $WIN_CACHE" | tee -a "$LOG"; exit 1; }
echo "[3/4] copying cache back: $WIN_CACHE -> $WSL_CACHE" | tee -a "$LOG"
cp "$WIN_CACHE" "$WSL_CACHE"
echo " $(/home/peter/face_sort_env/bin/python3 -c "import sys,json; sys.path.insert(0,'$REPO'); from sort_faces import load_cache; e,m,_,_,_=load_cache('$WSL_CACHE'); print(f'{len(e)} embeddings, {sum(1 for x in m if x.get(\"noface\"))} noface, {sum(1 for x in m if not x.get(\"noface\"))} faces')")"
echo "[4/4] running cluster_immich.py" | tee -a "$LOG"
/home/peter/face_sort_env/bin/python3 "$REPO/work/cluster_immich.py" "$WSL_CACHE" 2>&1 | tee -a "$LOG"
echo "=== finalize done: $USER_LABEL ===" | tee -a "$LOG"
date | tee -a "$LOG"
+447
@@ -0,0 +1,447 @@
#!/usr/bin/env python3
"""Stage Immich assets for embedding (WSL side of the split workflow).
For one Immich user:
1. Page through `/search/metadata` listing every IMAGE asset the user owns.
2. For each asset, fetch `/faces?id=` and decide if any detected face has a
scaled short side >= MIN_FACE_SHORT on the original. Skip assets that
don't.
3. Download the original. Compute sha256.
4. Dedup against (a) the existing canonical cache `nl_full.npz` and
(b) sha256s already staged in this run / earlier runs. If duplicate,
do NOT save to disk; record the alias.
5. Save survivors to /mnt/x/src/immich/<user>/<rel> mirroring the structure
after Immich's `/upload/library/<owner>/` prefix.
6. Write a queue file with WSL + Windows paths so the Windows DML embed
worker can find them.
7. Persist staging state continuously so the run is resumable.
Output artifacts:
work/immich/<user>/queue.json - what the Windows worker should embed
work/immich/<user>/state.json - resume state
work/immich/<user>/aliases.json - asset_id -> existing canonical path
when sha256 matched something already
in nl_full.npz
"""
from __future__ import annotations
import argparse
import hashlib
import json
import os
import sys
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import numpy as np
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import load_cache # noqa: E402
# ---- config -------------------------------------------------------------- #
API = os.environ.get("IMMICH_URL", "").rstrip("/") + "/api" if os.environ.get("IMMICH_URL") else None
KEY = os.environ.get("IMMICH_API_KEY")
if not API or not KEY:
raise SystemExit(
"set IMMICH_URL and IMMICH_API_KEY env vars before running, e.g.\n"
" export IMMICH_URL=https://fotos.example.org\n"
" export IMMICH_API_KEY=... # admin API key"
)
HEADERS = {"x-api-key": KEY, "Accept": "application/json"}
# Short-label -> Immich userId. The user is responsible for filling this in for
# their own Immich instance. NOTE: as of Immich v2.7.2, /search/metadata's
# `userIds` filter is silently ignored when the API key is bound to a different
# user, so changing this label/UUID does not actually change which assets the
# API returns; we keep it here for naming output dirs and as future-proofing.
USERS_FILE = REPO / "work" / "immich" / "users.json"
USERS: dict[str, str] = {}
if USERS_FILE.exists():
USERS = json.loads(USERS_FILE.read_text())
CACHE_PATH = REPO / "work" / "cache" / "nl_full.npz" # for sha256 dedup
STAGE_DIR = REPO / "work" / "immich"
DEST_ROOT = Path("/mnt/x/src/immich")
WIN_DEST_ROOT = "X:\\src\\immich" # equivalent on the Windows side
PAGE_SIZE = 1000
MIN_FACE_SHORT = 90 # match refine's gate
MIN_DET_SCORE = 0.5 # weaker than refine's 0.6, since Immich's score scale differs
HTTP_TIMEOUT = 60 # seconds, conservative for big originals
HTTP_RETRIES = 3
HTTP_BACKOFF = 2.0
# Circuit breaker: if this many consecutive workers fail with network errors,
# probe Immich; if probe also fails, exit cleanly with code 2 so the orchestrator
# can pause until the user says resume. State is preserved (resume-safe).
OUTAGE_FAIL_STREAK = 12
OUTAGE_PROBE_TIMEOUT = 8
# ---- helpers ------------------------------------------------------------- #
def http_get(url: str, accept_bytes: bool = False) -> bytes | dict:
"""GET with retries. Returns parsed JSON unless accept_bytes is True."""
last_err = None
for attempt in range(HTTP_RETRIES):
try:
req = urllib.request.Request(url, headers=HEADERS)
with urllib.request.urlopen(req, timeout=HTTP_TIMEOUT) as resp:
data = resp.read()
return data if accept_bytes else json.loads(data)
except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError) as e:
last_err = e
if attempt + 1 < HTTP_RETRIES:
time.sleep(HTTP_BACKOFF * (attempt + 1))
raise RuntimeError(f"GET {url} failed after {HTTP_RETRIES} attempts: {last_err}")
def probe_immich() -> bool:
"""Quick connectivity probe (no retry). Used by the circuit breaker."""
try:
req = urllib.request.Request(f"{API}/server/version", headers=HEADERS)
urllib.request.urlopen(req, timeout=OUTAGE_PROBE_TIMEOUT).read()
return True
except Exception:
return False
def http_post(url: str, payload: dict) -> dict:
last_err = None
body = json.dumps(payload).encode("utf-8")
for attempt in range(HTTP_RETRIES):
try:
req = urllib.request.Request(
url, data=body, headers={**HEADERS, "Content-Type": "application/json"}
)
with urllib.request.urlopen(req, timeout=HTTP_TIMEOUT) as resp:
return json.loads(resp.read())
except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError) as e:
last_err = e
if attempt + 1 < HTTP_RETRIES:
time.sleep(HTTP_BACKOFF * (attempt + 1))
raise RuntimeError(f"POST {url} failed after {HTTP_RETRIES} attempts: {last_err}")
def sha256_bytes(b: bytes) -> str:
return hashlib.sha256(b).hexdigest()
def derive_relpath(original_path: str) -> str:
"""Return a relative subpath rooted at the user dir, mirroring Immich.
/usr/src/app/upload/library/admin/2026/2026-02-18/foo.jpg
-> 2026/2026-02-18/foo.jpg
Anything that doesn't match the expected prefix falls back to the basename
only.
"""
marker = "/upload/library/"
i = original_path.find(marker)
if i < 0:
return Path(original_path).name
rest = original_path[i + len(marker):]
parts = rest.split("/", 1)
return parts[1] if len(parts) == 2 else parts[0]
def wsl_to_win(p: Path) -> str:
"""Convert /mnt/x/.. -> X:\\.. for the embed worker that runs on Windows."""
s = str(p)
if s.startswith("/mnt/"):
drive = s[5]
rest = s[6:].lstrip("/")
return f"{drive.upper()}:\\{rest.replace('/', chr(92))}"
if s.startswith("/opt/face-sets/"):
# /opt/face-sets/work/... is on the WSL ext4 filesystem; reachable from
# Windows as \\wsl$\Ubuntu\opt\face-sets\... (slower than C:). For our
# use we keep all stage outputs under /mnt/x or /mnt/c so this branch
# should not be hit, but fall back rather than fail.
return f"\\\\wsl$\\Ubuntu\\opt\\face-sets\\{s[len('/opt/face-sets/'):].replace('/', chr(92))}"
return s
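# e.g. wsl_to_win(Path("/mnt/x/src/immich/peter/2026/img.jpg"))
#   -> r"X:\src\immich\peter\2026\img.jpg"   (illustrative path)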
def keep_asset(asset: dict, faces: list) -> tuple[bool, list[dict]]:
"""Return (keep, eligible_face_records). A face is 'eligible' iff its
scaled-to-original short side >= MIN_FACE_SHORT and source-type is
machine-learning."""
W, H = asset.get("width"), asset.get("height")
if not W or not H:
return False, []
eligible = []
for f in faces:
if f.get("sourceType") and f["sourceType"] != "machine-learning":
continue
iw = f.get("imageWidth") or W
ih = f.get("imageHeight") or H
sx = (W / iw) if iw else 1.0
sy = (H / ih) if ih else 1.0
bw = (f["boundingBoxX2"] - f["boundingBoxX1"]) * sx
bh = (f["boundingBoxY2"] - f["boundingBoxY1"]) * sy
if min(bw, bh) >= MIN_FACE_SHORT:
eligible.append({
"id": f["id"],
"x1": int(round(f["boundingBoxX1"] * sx)),
"y1": int(round(f["boundingBoxY1"] * sy)),
"x2": int(round(f["boundingBoxX2"] * sx)),
"y2": int(round(f["boundingBoxY2"] * sy)),
"person": (f.get("person") or {}).get("name") or None,
})
return (len(eligible) > 0), eligible
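# Worked example (illustrative numbers): original is 4000x3000 but Immich reports
# face boxes on a 2000x1500 ML input -> sx = sy = 2.0; a box with a 60px short
# side on the ML input scales to 120px on the original, which clears
# MIN_FACE_SHORT=90 and makes the face eligible.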
# ---- main staging loop --------------------------------------------------- #
def list_assets(user_id: str):
"""Yield every IMAGE asset owned by user_id, paginated."""
page = 1
while True:
resp = http_post(f"{API}/search/metadata", {
"size": PAGE_SIZE,
"type": "IMAGE",
"page": page,
"userIds": [user_id],
})
items = resp["assets"]["items"]
if not items:
return
for a in items:
yield a
nxt = resp["assets"].get("nextPage")
if not nxt:
return
page = int(nxt)
def stage(user_label: str, limit: int | None, workers: int) -> None:
user_id = USERS.get(user_label, user_label)  # USERS may be empty (see main); fall back to the label
user_dir = STAGE_DIR / user_label
user_dir.mkdir(parents=True, exist_ok=True)
state_path = user_dir / "state.json"
queue_path = user_dir / "queue.json"
aliases_path = user_dir / "aliases.json"
# ---- load existing state for resume ---- #
state = {
"started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
"user_label": user_label,
"user_id": user_id,
"seen_asset_ids": [],
"staged_count": 0,
"deduped_against_existing": 0,
"deduped_against_staged": 0,
"skipped_no_big_face": 0,
"skipped_no_faces": 0,
"skipped_download_error": 0,
"total_assets_seen": 0,
}
queue: list[dict] = []
aliases: dict[str, dict] = {} # asset_id -> {sha, canonical_path}
staged_hashes: set[str] = set()
if state_path.exists():
prior = json.loads(state_path.read_text())
state.update(prior)
state["resumed_at"] = time.strftime("%Y-%m-%dT%H:%M:%S")
if queue_path.exists():
queue = json.loads(queue_path.read_text())
staged_hashes = {q["sha256"] for q in queue}
if aliases_path.exists():
aliases = json.loads(aliases_path.read_text())
print(f"[resume] {len(state['seen_asset_ids'])} asset_ids already seen, "
f"{len(queue)} in queue, {len(aliases)} aliased to existing cache")
seen = set(state["seen_asset_ids"])
# ---- startup connectivity probe ---- #
if not probe_immich():
print(f"[init] Immich probe failed at {API}/server/version -- exiting code 2")
sys.exit(2)
print("[init] Immich reachable")
# ---- load existing canonical cache hashes (sha256) ---- #
print(f"[init] loading existing cache hashes from {CACHE_PATH}")
_emb, meta, _src, _proc, _aliases = load_cache(CACHE_PATH)
canonical_by_hash: dict[str, str] = {}
for m in meta:
h = m.get("hash")
if h:
canonical_by_hash.setdefault(h, m["path"])
print(f"[init] {len(canonical_by_hash)} unique sha256s in nl_full.npz")
# ---- iterate assets ---- #
# Each worker does the entire I/O chain for an asset: /faces -> filter ->
# /original. That way 8 workers translate to ~8x parallelism end-to-end.
# Main thread does sha256, dedup decisions, and writes (which are CPU/SMB
# bound but cheap relative to two HTTPS round-trips per asset).
# Worker result tuple:
# (asset, faces|None, blob|None, eligible|None, error|None)
def _fetch_for_asset(asset: dict):
if asset.get("type") != "IMAGE":
return asset, None, None, None, "not_image"
aid = asset["id"]
if aid in seen:
return asset, None, None, None, "already_seen"
try:
faces = http_get(f"{API}/faces?id={aid}")
except Exception as e:
return asset, None, None, None, f"faces_error:{e}"
if not faces:
return asset, [], None, [], "no_faces"
keep, eligible = keep_asset(asset, faces)
if not keep:
return asset, faces, None, eligible, "no_big_face"
try:
blob = http_get(f"{API}/assets/{aid}/original", accept_bytes=True)
except Exception as e:
return asset, faces, None, eligible, f"download_error:{e}"
return asset, faces, blob, eligible, None
n = 0
err_streak = 0
last_flush = time.time()
t0 = time.time()
pool = ThreadPoolExecutor(max_workers=workers)
try:
for asset, faces, blob, eligible, err in pool.map(_fetch_for_asset, list_assets(user_id)):
if asset.get("type") != "IMAGE":
continue
n += 1
state["total_assets_seen"] = n
if limit is not None and n > limit:
print(f"[stop] hit --limit {limit}")
break
aid = asset["id"]
# Already-seen / non-image: silently skip.
if err == "already_seen":
continue
# Transient: count, but DON'T mark as seen so resume retries.
if err and (err.startswith("faces_error") or err.startswith("download_error")):
kind = err.split(":", 1)[0]
detail = err.split(":", 1)[1][:160] if ":" in err else err
print(f"[err] {kind} {aid}: {detail}")
state["skipped_download_error"] += 1
err_streak += 1
# Circuit breaker: long streak -> probe; if down, save and exit.
if err_streak >= OUTAGE_FAIL_STREAK:
print(f"[breaker] {err_streak} consecutive errors; probing Immich...")
if probe_immich():
print("[breaker] probe ok, treating as transient; continuing")
err_streak = 0
else:
print("[breaker] probe FAILED -- pausing run; resume with same command")
queue_path.write_text(json.dumps(queue, indent=2))
state_path.write_text(json.dumps(state, indent=2))
aliases_path.write_text(json.dumps(aliases, indent=2))
sys.exit(2)
continue
else:
err_streak = 0
# Permanent classifications -> seen.
if err == "no_faces":
state["skipped_no_faces"] += 1
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
continue
if err == "no_big_face":
state["skipped_no_big_face"] += 1
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
continue
# Have faces + blob -> dedup + save.
h = sha256_bytes(blob)
if h in canonical_by_hash:
aliases[aid] = {"sha256": h, "canonical": canonical_by_hash[h]}
state["deduped_against_existing"] += 1
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
continue
if h in staged_hashes:
state["deduped_against_staged"] += 1
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
continue
rel = derive_relpath(asset.get("originalPath", asset.get("originalFileName", aid)))
wsl_path = DEST_ROOT / user_label / rel
wsl_path.parent.mkdir(parents=True, exist_ok=True)
wsl_path.write_bytes(blob)
staged_hashes.add(h)
queue.append({
"asset_id": aid,
"sha256": h,
"wsl_path": str(wsl_path),
"win_path": wsl_to_win(wsl_path),
"size_bytes": len(blob),
"width": asset.get("width"),
"height": asset.get("height"),
"originalPath": asset.get("originalPath"),
"originalFileName": asset.get("originalFileName"),
"localDateTime": asset.get("localDateTime"),
"immich_eligible_faces": eligible,
})
state["staged_count"] += 1
seen.add(aid)
state["seen_asset_ids"] = sorted(seen)
if time.time() - last_flush > 5.0 or len(queue) % 25 == 0:
queue_path.write_text(json.dumps(queue, indent=2))
state_path.write_text(json.dumps(state, indent=2))
aliases_path.write_text(json.dumps(aliases, indent=2))
last_flush = time.time()
elapsed = time.time() - t0
rate = state["total_assets_seen"] / max(0.1, elapsed)
print(f"[stage] seen={state['total_assets_seen']:6d} "
f"staged={state['staged_count']:5d} "
f"dedup-existing={state['deduped_against_existing']:5d} "
f"dedup-staged={state['deduped_against_staged']:5d} "
f"no-big-face={state['skipped_no_big_face']:6d} "
f"no-faces={state['skipped_no_faces']:6d} "
f"errs={state['skipped_download_error']:3d} "
f"({rate:.1f} assets/s)")
finally:
pool.shutdown(wait=False, cancel_futures=True)
# final flush
queue_path.write_text(json.dumps(queue, indent=2))
state_path.write_text(json.dumps(state, indent=2))
aliases_path.write_text(json.dumps(aliases, indent=2))
print()
print(f"=== final state for user {user_label} ===")
for k in [
"total_assets_seen", "staged_count", "deduped_against_existing",
"deduped_against_staged", "skipped_no_big_face", "skipped_no_faces",
"skipped_download_error",
]:
print(f" {k}: {state[k]}")
total_bytes = sum(q["size_bytes"] for q in queue)
print(f" staged bytes: {total_bytes/1e9:.2f} GB across {len(queue)} files")
print(f" queue: {queue_path}")
print(f" state: {state_path}")
print(f" aliases: {aliases_path}")
# ---- cli ----------------------------------------------------------------- #
def main() -> None:
p = argparse.ArgumentParser()
if not USERS:
p.add_argument("--user", required=True,
help=f"label for output dir (USERS map empty; populate {USERS_FILE} to constrain)")
else:
p.add_argument("--user", choices=list(USERS.keys()), required=True)
p.add_argument("--limit", type=int, default=None,
help="stop after seeing N assets total (for testing)")
p.add_argument("--workers", type=int, default=8,
help="concurrent /faces fetches (default 8)")
args = p.parse_args()
stage(args.user, args.limit, args.workers)
if __name__ == "__main__":
main()
+144
@@ -0,0 +1,144 @@
"""Windows / DirectML multi-face audit worker.
For every PNG in queue.json, run insightface FaceAnalysis and record how many
faces were detected (filtering by det_score>=MIN_DET and face_short>=MIN_PIX).
Surfaces the load-bearing roop invariant: each .fsz PNG must hold exactly one
face, otherwise the loader's `extract_face_images` appends every detected face
into the FaceSet and pollutes the averaged identity embedding.
CLI:
py -3.12 multiface_worker.py <queue.json> <out_results.json> [--limit N]
"""
from __future__ import annotations
import argparse
import json
import os
import sys
import time
from pathlib import Path
import numpy as np
from PIL import Image, ImageOps
from insightface.app import FaceAnalysis
MODEL_ROOT = r"C:\face_embed_venv\models"
MIN_DET = 0.5
MIN_FACE_PIX = 40
FLUSH_EVERY = 200
def load_existing(out_path: Path):
if not out_path.exists():
return None, set()
try:
d = json.loads(out_path.read_text())
processed = set(d.get("processed", []))
return d, processed
except Exception as e:
print(f"[warn] could not parse {out_path}: {e}; starting fresh", file=sys.stderr)
return None, set()
def save_atomic(out_path: Path, data: dict):
tmp = out_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(data, indent=2))
os.replace(tmp, out_path)
def main():
ap = argparse.ArgumentParser()
ap.add_argument("queue", type=Path)
ap.add_argument("out", type=Path)
ap.add_argument("--limit", type=int, default=None)
args = ap.parse_args()
queue = json.loads(args.queue.read_text())
print(f"[queue] {len(queue)} entries from {args.queue}", flush=True)
args.out.parent.mkdir(parents=True, exist_ok=True)
existing, processed = load_existing(args.out)
if existing:
print(f"[resume] {len(processed)} already scored", flush=True)
results = existing.get("results", [])
else:
results = []
pending = [e for e in queue if e["wsl_path"] not in processed]
if args.limit is not None:
pending = pending[: args.limit]
print(f"[pending] {len(pending)} entries", flush=True)
if not pending:
print("[done] nothing to do")
return
print("[load] FaceAnalysis with DmlExecutionProvider", flush=True)
app = FaceAnalysis(
name="buffalo_l",
root=MODEL_ROOT,
providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
app.prepare(ctx_id=0, det_size=(640, 640))
n_done = 0
n_load_err = 0
last_flush = time.time()
t_start = time.time()
def flush():
save_atomic(args.out, {
"results": results,
"processed": sorted(processed),
})
for entry in pending:
try:
with Image.open(entry["win_path"]) as im:
im = ImageOps.exif_transpose(im)
im = im.convert("RGB")
rgb = np.array(im)
bgr = rgb[:, :, ::-1].copy()
except Exception as e:
n_load_err += 1
results.append({
"wsl_path": entry["wsl_path"], "faceset": entry["faceset"], "file": entry["file"],
"face_count": -1, "error": "load",
})
processed.add(entry["wsl_path"])
n_done += 1
continue
faces = app.get(bgr)
kept = 0
for f in faces:
if float(f.det_score) < MIN_DET:
continue
x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
short = min(max(x2 - x1, 0), max(y2 - y1, 0))
if short < MIN_FACE_PIX:
continue
kept += 1
results.append({
"wsl_path": entry["wsl_path"], "faceset": entry["faceset"], "file": entry["file"],
"face_count": kept,
})
processed.add(entry["wsl_path"])
n_done += 1
if (n_done % FLUSH_EVERY == 0) or (time.time() - last_flush) > 30.0:
flush()
last_flush = time.time()
elapsed = time.time() - t_start
rate = n_done / max(0.1, elapsed)
eta = (len(pending) - n_done) / max(0.1, rate) / 60.0
print(f"[scan] {n_done}/{len(pending)} rate={rate:.2f} img/s eta={eta:.1f}min "
f"load_err={n_load_err}", flush=True)
flush()
elapsed = time.time() - t_start
print(f"[done] {n_done} scored, {n_load_err} load errors, {elapsed:.1f}s "
f"({n_done/max(0.1,elapsed):.2f} img/s) -> {args.out}", flush=True)
if __name__ == "__main__":
main()
+127
@@ -0,0 +1,127 @@
#!/bin/bash
# Generic chain driver for the video target preprocessing pipeline.
#
# Usage:
# WORK=/path/to/workdir SKIP_PATTERN='ct_src_(0001[015]|0005[0-9]|0006[0-9])\.mp4' \
# bash run_video_pipeline.sh > /opt/face-sets/work/logs/<name>.log 2>&1
#
# Required env vars:
# WORK per-batch workdir (will hold scenes/, queue.json, results.jsonl, plan.json, review/)
#
# Optional env vars:
# INPUT_DIR default /mnt/x/src/vd
# OUTPUT_DIR default /mnt/x/src/vd/ct
# FILTER_FROM basename floor; only files with name >= this go in (e.g. ct_src_00050.mp4)
# SKIP_PATTERN regex of basenames to exclude (Python `re` syntax). Applied AFTER FILTER_FROM.
#              Pad numeric groups to the full 5-digit basename width (0005[0-9], not 005[0-9]).
# MAX_DUR score --max-dur (default 120)
# IDENTITY "yes" to enable identity tagging; default "no"
# SIDECAR "yes" to emit <uuid>.json provenance sidecars; default "no"
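#
# Concrete invocation sketch (values illustrative, not a recorded run):
#   WORK=/opt/face-sets/work/video_preprocess/rest \
#   FILTER_FROM=ct_src_00063.mp4 \
#   SKIP_PATTERN='ct_src_00099\.mp4' \
#   SIDECAR=no \
#   bash run_video_pipeline.sh > /opt/face-sets/work/logs/video_rest.log 2>&1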
set -e
: ${WORK:?WORK env var must point at a workdir}
: ${INPUT_DIR:=/mnt/x/src/vd}
: ${OUTPUT_DIR:=/mnt/x/src/vd/ct}
: ${MAX_DUR:=120}
: ${IDENTITY:=no}
: ${SIDECAR:=no}
mkdir -p "$WORK" "$WORK/scenes"
PY_WSL=/home/peter/face_sort_env/bin/python
PY_WIN="/mnt/c/face_embed_venv/Scripts/python.exe"
PIPELINE=/opt/face-sets/work/video_target_pipeline.py
WORKER=/opt/face-sets/work/video_face_worker.py
INVENTORY_FULL=/opt/face-sets/work/video_preprocess/inventory_full.json
ts() { date +"%Y-%m-%d %H:%M:%S"; }
log() { echo "[$(ts)] [$PHASE] $*"; }
PHASE="setup"
log "STARTED — host=$(hostname) pid=$$ work=$WORK"
log "config: input=$INPUT_DIR output=$OUTPUT_DIR filter_from=${FILTER_FROM:-<none>} skip_pattern=${SKIP_PATTERN:-<none>} max_dur=$MAX_DUR identity=$IDENTITY sidecar=$SIDECAR"
PHASE="inventory"
log "building subset inventory"
T0=$(date +%s)
# rebuild full inventory if missing
if [ ! -f "$INVENTORY_FULL" ]; then
log "(no full inventory cached — running fresh scan)"
$PY_WSL $PIPELINE scan --input "$INPUT_DIR" --output-dir "$OUTPUT_DIR" --out "$INVENTORY_FULL"
fi
$PY_WSL <<EOF
import json, re
from pathlib import Path
inv = json.load(open('$INVENTORY_FULL'))
subset = list(inv['videos'])
filter_from = '${FILTER_FROM}'
skip_pat = '${SKIP_PATTERN}'
if filter_from:
subset = [v for v in subset if Path(v['path']).name >= filter_from]
if skip_pat:
pat = re.compile(skip_pat)
subset = [v for v in subset if not pat.search(Path(v['path']).name)]
subset.sort(key=lambda v: v['path'])
inv['videos'] = subset
json.dump(inv, open('$WORK/inventory.json','w'), indent=2)
total_dur = sum(v.get('duration_s', 0) for v in inv['videos'] if 'error' not in v)
print(f' {len(inv["videos"])} videos, total {total_dur/3600:.2f}h input')
EOF
log "done in $(($(date +%s)-T0))s"
PHASE="scenes"
log "PySceneDetect AdaptiveDetector across all videos (cached entries skipped)"
T0=$(date +%s)
$PY_WSL $PIPELINE scenes --inventory "$WORK/inventory.json" --out-dir "$WORK/scenes"
log "done in $(($(date +%s)-T0))s"
PHASE="stage"
log "building frame queue @ 2 fps within scenes"
T0=$(date +%s)
$PY_WSL $PIPELINE stage --inventory "$WORK/inventory.json" --scenes-dir "$WORK/scenes" --out "$WORK/queue.json"
log "done in $(($(date +%s)-T0))s"
PHASE="worker"
log "Windows DML face detect+embed (resumable; the slow one)"
T0=$(date +%s)
$PY_WIN $WORKER "$WORK/queue.json" "$WORK/results.json"
log "done in $(($(date +%s)-T0))s"
PHASE="merge"
log "ingesting worker output (jsonl)"
T0=$(date +%s)
$PY_WSL $PIPELINE merge --results "$WORK/results.json" --out "$WORK/frames.json"
log "done in $(($(date +%s)-T0))s"
PHASE="track"
log "stitching detections into tracks"
T0=$(date +%s)
$PY_WSL $PIPELINE track --frames "$WORK/frames.json" --scenes-dir "$WORK/scenes" \
--inventory "$WORK/inventory.json" --out "$WORK/tracks.json"
log "done in $(($(date +%s)-T0))s"
PHASE="score"
log "scoring with relaxed gates + max-dur=$MAX_DUR identity=$IDENTITY"
T0=$(date +%s)
ID_FLAG=""
if [ "$IDENTITY" != "yes" ]; then ID_FLAG="--no-identity"; fi
$PY_WSL $PIPELINE score --tracks "$WORK/tracks.json" --inventory "$WORK/inventory.json" \
--out "$WORK/plan.json" --max-dur "$MAX_DUR" $ID_FLAG
log "done in $(($(date +%s)-T0))s"
PHASE="cut"
log "ffmpeg stream-copy into per-source subfolders (no --clean)"
T0=$(date +%s)
SIDECAR_FLAG=""
if [ "$SIDECAR" = "yes" ]; then SIDECAR_FLAG="--write-sidecar"; fi
$PY_WSL $PIPELINE cut --plan "$WORK/plan.json" --output-dir "$OUTPUT_DIR" $SIDECAR_FLAG
log "done in $(($(date +%s)-T0))s"
PHASE="report"
log "rendering HTML"
T0=$(date +%s)
$PY_WSL $PIPELINE report --plan "$WORK/plan.json" --output-dir "$OUTPUT_DIR" --out "$WORK/review"
log "done in $(($(date +%s)-T0))s"
PHASE="done"
log "PIPELINE COMPLETE — review at file://$WORK/review/index.html"
+32
@@ -0,0 +1,32 @@
#!/bin/bash
# Generic status helper for run_video_pipeline.sh.
# Usage: bash status_video_pipeline.sh <log_file>
# Defaults to /opt/face-sets/work/logs/video_run.log if no arg.
LOG="${1:-/opt/face-sets/work/logs/video_run.log}"
if [ ! -f "$LOG" ]; then
echo "no log at $LOG yet"
exit 0
fi
echo "=== last 8 log lines ==="
tail -8 "$LOG"
echo
# worker progress
last=$(grep -E "^\[scan\] [0-9]+/[0-9]+" "$LOG" | tail -1)
if [ -n "$last" ]; then
echo "=== DML worker progress ==="
echo " $last"
fi
# total elapsed
start_epoch=$(head -1 "$LOG" | sed 's/.*\[\(.*\)\].*\[setup\].*/\1/' | xargs -I{} date -d "{}" +%s 2>/dev/null)
now_epoch=$(date +%s)
if [ -n "$start_epoch" ] && [ "$start_epoch" != "" ] 2>/dev/null; then
elapsed=$((now_epoch - start_epoch))
h=$((elapsed / 3600))
m=$(( (elapsed % 3600) / 60 ))
echo " elapsed: ${h}h${m}m"
fi
+274
@@ -0,0 +1,274 @@
"""Windows / DirectML video frame face worker.
Reads a queue.json from /opt/face-sets/work/video_target_pipeline.py:stage
(WSL side), each entry: {video_path, win_video_path, frame_idx, time_s,
queue_id}. Decodes frame N from the video, runs insightface FaceAnalysis,
emits per-face records (bbox, det_score, pose, embedding, face_short).
CLI:
py -3.12 video_face_worker.py <queue.json> <out_results.json> [--limit N]
Resumable: records already present in the sister out_results.jsonl (or migrated
from a legacy out_results.json on first load) with the same queue_id are skipped.
"""
from __future__ import annotations
import argparse
import json
import os
import sys
import time
from pathlib import Path
import numpy as np
import cv2
from insightface.app import FaceAnalysis
MODEL_ROOT = r"C:\face_embed_venv\models"
MIN_DET = 0.5
MIN_FACE_PIX = 40
FLUSH_EVERY = 100
def jsonl_path_for(out_path: Path) -> Path:
"""Sister JSONL file: one result-record per line, append-only."""
return out_path.with_suffix(".jsonl")
def load_existing(out_path: Path):
"""Load existing results from .jsonl (preferred) or legacy .json (one-time conversion).
Returns (records_list, processed_set)."""
jsonl = jsonl_path_for(out_path)
if jsonl.exists():
records = []
processed = set()
with open(jsonl) as f:
for line_num, line in enumerate(f, 1):
line = line.strip()
if not line:
continue
try:
r = json.loads(line)
records.append(r)
if r.get("queue_id"):
processed.add(r["queue_id"])
except json.JSONDecodeError:
print(f"[warn] {jsonl}:{line_num} corrupt; skipping", file=sys.stderr)
return records, processed
# legacy JSON support: load once, convert to JSONL
if out_path.exists():
try:
d = json.loads(out_path.read_text())
records = d.get("results", [])
processed = set(d.get("processed", []))
print(f"[migrate] converting {len(records)} legacy JSON records to JSONL", file=sys.stderr)
with open(jsonl, "w") as f:
for r in records:
f.write(json.dumps(r) + "\n")
return records, processed
except Exception as e:
print(f"[warn] could not parse {out_path}: {e}; starting fresh", file=sys.stderr)
return [], set()
def append_records(out_path: Path, new_records: list):
"""Append-only write to the sister .jsonl file. No re-serialization of prior records."""
if not new_records:
return
jsonl = jsonl_path_for(out_path)
with open(jsonl, "a") as f:
for r in new_records:
f.write(json.dumps(r) + "\n")
def write_compat_summary(out_path: Path, total_records: int, processed: set):
"""Write a tiny JSON pointer file at the legacy out_path so older consumers
still see *something*, but the canonical store is the .jsonl. Cheap."""
summary = {
"_format": "jsonl-pointer",
"_jsonl": str(jsonl_path_for(out_path).name),
"results_count": total_records,
"processed_count": len(processed),
}
tmp = out_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(summary, indent=2))
os.replace(tmp, out_path)
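# One results.jsonl line looks like this (a sketch; values illustrative, embedding
# elided -- see the record construction in main() below):
#   {"queue_id": "q00001234", "video_path": "/mnt/x/src/vd/clip.mp4",
#    "frame_idx": 482, "time_s": 16.07, "frame_w": 1920, "frame_h": 1080,
#    "faces": [{"bbox": [812, 240, 1024, 512], "det_score": 0.87,
#               "face_short": 212, "pose": [2.1, -18.4, 0.9], "embedding": [...]}]}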
def main():
ap = argparse.ArgumentParser()
ap.add_argument("queue", type=Path)
ap.add_argument("out", type=Path)
ap.add_argument("--limit", type=int, default=None)
args = ap.parse_args()
queue = json.loads(args.queue.read_text())
print(f"[queue] {len(queue)} entries from {args.queue}", flush=True)
args.out.parent.mkdir(parents=True, exist_ok=True)
results, processed = load_existing(args.out)
if processed:
print(f"[resume] {len(processed)} already scored", flush=True)
pending = [e for e in queue if e["queue_id"] not in processed]
if args.limit is not None:
pending = pending[: args.limit]
print(f"[pending] {len(pending)} entries", flush=True)
if not pending:
print("[done] nothing to do")
return
print("[load] FaceAnalysis with DmlExecutionProvider", flush=True)
app = FaceAnalysis(
name="buffalo_l",
root=MODEL_ROOT,
providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
app.prepare(ctx_id=0, det_size=(640, 640))
# group queue by video so we can keep one VideoCapture open and seek
from collections import defaultdict
by_video = defaultdict(list)
for e in pending:
by_video[e["win_video_path"]].append(e)
n_done = 0
n_load_err = 0
last_flush = time.time()
t_start = time.time()
new_buffer: list = []
def flush():
# append-only: only NEW records since last flush get written. O(new_records),
# not O(total_records). Was 11s/flush at 9k records; now <50ms.
if new_buffer:
append_records(args.out, new_buffer)
new_buffer.clear()
write_compat_summary(args.out, len(results), processed)
for vidpath, entries in by_video.items():
# entries are already sorted by frame_idx. Hybrid decode strategy:
# 1. Seek ONCE to the first pending target (cheap keyframe-seek).
# 2. Sequential cap.grab() between subsequent targets (decode without
# BGR conversion until we reach a target, then cap.retrieve()).
# This avoids per-sample seek cost (the original pathology that
# caused 1.4 fps deep in long videos) AND avoids grab-walking from
# frame 0 on resume (the over-correction that gave 0.08 fps).
entries.sort(key=lambda e: e["frame_idx"])
cap = cv2.VideoCapture(vidpath)
if not cap.isOpened():
print(f"[err] cannot open {vidpath}", flush=True)
for e in entries:
rec = {
"queue_id": e["queue_id"], "video_path": e["video_path"],
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
"faces": [], "error": "cap_open",
}
results.append(rec); new_buffer.append(rec)
processed.add(e["queue_id"])
n_done += 1
n_load_err += 1
continue
first_target = entries[0]["frame_idx"]
if first_target > 0:
cap.set(cv2.CAP_PROP_POS_FRAMES, first_target)
cur_frame_idx = first_target - 1
else:
cur_frame_idx = -1
for e in entries:
target = e["frame_idx"]
if target < cur_frame_idx + 1:
# backward jump (only triggers for unsorted entries — defensive)
cap.set(cv2.CAP_PROP_POS_FRAMES, target)
cur_frame_idx = target - 1
# advance up to (but not including) target via grab()-only
ran_out = False
while cur_frame_idx + 1 < target:
ok = cap.grab()
if not ok:
ran_out = True
break
cur_frame_idx += 1
if not ran_out:
ok = cap.grab()
if not ok:
ran_out = True
else:
cur_frame_idx = target
if ran_out:
rec = {
"queue_id": e["queue_id"], "video_path": e["video_path"],
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
"faces": [], "error": "frame_read",
}
results.append(rec); new_buffer.append(rec)
processed.add(e["queue_id"])
n_done += 1
n_load_err += 1
continue
ok, bgr = cap.retrieve()
if not ok or bgr is None:
rec = {
"queue_id": e["queue_id"], "video_path": e["video_path"],
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
"faces": [], "error": "frame_read",
}
results.append(rec); new_buffer.append(rec)
processed.add(e["queue_id"])
n_done += 1
n_load_err += 1
continue
faces = app.get(bgr)
kept_faces = []
H, W = bgr.shape[:2]
for f in faces:
if float(f.det_score) < MIN_DET:
continue
x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
x1 = max(x1, 0); y1 = max(y1, 0)
x2 = min(x2, W); y2 = min(y2, H)
w, h = x2 - x1, y2 - y1
short = min(w, h)
if short < MIN_FACE_PIX:
continue
rec = {
"bbox": [x1, y1, x2, y2],
"det_score": float(f.det_score),
"face_short": int(short),
}
if hasattr(f, "pose") and f.pose is not None:
rec["pose"] = [float(x) for x in f.pose] # pitch, yaw, roll
if hasattr(f, "normed_embedding") and f.normed_embedding is not None:
rec["embedding"] = f.normed_embedding.astype(np.float32).tolist()
kept_faces.append(rec)
rec = {
"queue_id": e["queue_id"], "video_path": e["video_path"],
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
"frame_w": W, "frame_h": H,
"faces": kept_faces,
}
results.append(rec); new_buffer.append(rec)
processed.add(e["queue_id"])
n_done += 1
if (n_done % FLUSH_EVERY == 0) or (time.time() - last_flush) > 30.0:
flush()
last_flush = time.time()
el = time.time() - t_start
rate = n_done / max(0.1, el)
eta = (len(pending) - n_done) / max(0.1, rate) / 60.0
print(f"[scan] {n_done}/{len(pending)} rate={rate:.2f} fps eta={eta:.1f}min "
f"errs={n_load_err}", flush=True)
cap.release()
flush()
el = time.time() - t_start
print(f"[done] {n_done} scored, {n_load_err} errors, {el:.1f}s "
f"({n_done/max(0.1,el):.2f} fps) -> {args.out}", flush=True)
if __name__ == "__main__":
main()
+919
@@ -0,0 +1,919 @@
"""Video target preprocessing pipeline for roop-unleashed.
Discovers video files in an input folder, runs scene-cut detection, samples
frames within each scene, runs face detection + embedding via Windows DML
worker, stitches per-frame detections into face tracks, applies quality
gates, cuts approved segments out with ffmpeg stream-copy, and writes a
report. Output clips have generic UUID names + a sidecar JSON with full
provenance.
Subcommands:
scan list input videos, run ffprobe, write per-video index
scenes PySceneDetect AdaptiveDetector per video; write scenes_<basename>.json
stage write frame queue.json (sampled @ 2 fps within scenes)
merge ingest worker results.json into per-video frame_results
track IoU+embedding stitching of per-frame detections into tracks
score track-level quality gating + segment plan
cut ffmpeg -c copy each accepted segment to <out_dir>/<uuid>.mp4
report HTML preview with thumbnails + identity tags
"""
from __future__ import annotations
import argparse
import json
import math
import re
import shutil
import subprocess
import sys
import time
import uuid
from collections import defaultdict
from pathlib import Path
import numpy as np
DEFAULT_INPUT = Path("/mnt/x/src/vd")
DEFAULT_OUTPUT = Path("/mnt/x/src/vd/ct")
WORK_DIR = Path("/opt/face-sets/work/video_preprocess")
# defaults — first set was strict-portrait; second set loosened for side-profile + segment merging
SAMPLE_FPS = 2.0
QUALITY_YAW_MAX = 75.0 # was 25; allow full 3/4 + profile (face-sets handle it)
QUALITY_PITCH_MAX = 45.0 # was 30
QUALITY_FACE_MIN = 80 # was 96
QUALITY_BLUR_MIN = 50.0
QUALITY_DET_MIN = 0.5 # was 0.6
TRACK_GATE_FRAC = 0.7 # >=70% of frames in track must pass per-frame gates
SEGMENT_MIN_S = 1.0
SEGMENT_MAX_S = 30.0 # was 10
SEGMENT_BRIDGE_S = 3.0 # was 1.0 — within-track pose-failure bridging
SEGMENT_MERGE_GAP_S = 2.0 # NEW — across-track merge if same scene + within this gap
TRACK_IOU_MIN = 0.3
TRACK_EMB_MIN = 0.5
CACHES = [
Path("/opt/face-sets/work/cache/nl_full.npz"),
Path("/opt/face-sets/work/cache/immich_peter.npz"),
Path("/opt/face-sets/work/cache/immich_nic.npz"),
]
FACESETS_ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
IDENTITY_TAG_THRESHOLD = 0.6 # cosine sim to faceset centroid
def wsl_to_win(p: str) -> str:
s = str(p)
if s.startswith("/mnt/"):
return f"{s[5].upper()}:\\{s[7:].replace('/', chr(92))}"
return s
# ----------------------------- ffprobe / scan -----------------------------
def ffprobe(video: Path) -> dict:
cmd = [
"ffprobe", "-v", "error", "-print_format", "json",
"-show_format", "-show_streams", str(video),
]
r = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
if r.returncode != 0:
return {"error": r.stderr.strip()}
return json.loads(r.stdout)
def parse_video_meta(probe: dict) -> dict:
if "error" in probe:
return {"error": probe["error"]}
fmt = probe.get("format", {})
duration = float(fmt.get("duration", 0))
vstream = next((s for s in probe.get("streams", []) if s.get("codec_type") == "video"), None)
if vstream is None:
return {"error": "no video stream"}
fps_str = vstream.get("avg_frame_rate", "0/1")
try:
num, den = (int(x) for x in fps_str.split("/"))
fps = num / den if den else 0.0
except Exception:
fps = 0.0
nb_frames = int(vstream.get("nb_frames", 0)) or int(round(duration * fps))
return {
"duration_s": duration,
"fps": fps,
"frames": nb_frames,
"width": int(vstream.get("width", 0)),
"height": int(vstream.get("height", 0)),
"codec": vstream.get("codec_name"),
}
def cmd_scan(args):
in_dir = Path(args.input)
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
extensions = {".mp4", ".mov", ".mkv", ".m4v", ".avi", ".webm"}
out_root = Path(args.output_dir).resolve()
videos = []
for p in sorted(in_dir.iterdir() if not args.recursive else in_dir.rglob("*")):
if not p.is_file():
continue
if out_root in p.parents or p.resolve() == out_root:
continue # never include the output dir
if p.suffix.lower() not in extensions:
continue
videos.append(p)
print(f"[scan] {len(videos)} candidate videos", file=sys.stderr)
inventory = []
for p in videos:
meta = parse_video_meta(ffprobe(p))
meta["path"] = str(p)
meta["win_path"] = wsl_to_win(str(p))
meta["size"] = p.stat().st_size
inventory.append(meta)
if "error" not in meta:
print(f" {p.name}: {meta['duration_s']:.1f}s @ {meta['fps']:.1f}fps "
f"{meta['width']}x{meta['height']} {meta['codec']}", file=sys.stderr)
else:
print(f" {p.name}: ERROR {meta['error']}", file=sys.stderr)
out.write_text(json.dumps({"input": str(in_dir), "videos": inventory}, indent=2))
print(f"[scan] inventory -> {out}", file=sys.stderr)
# ----------------------------- scenes -----------------------------
def cmd_scenes(args):
from scenedetect import open_video, SceneManager
from scenedetect.detectors import AdaptiveDetector
inv = json.loads(Path(args.inventory).read_text())
out_dir = Path(args.out_dir)
out_dir.mkdir(parents=True, exist_ok=True)
only = set(args.only.split(",")) if args.only else None
for v in inv["videos"]:
if "error" in v:
continue
path = Path(v["path"])
if only and path.name not in only:
continue
out_file = out_dir / (path.stem + ".scenes.json")
if out_file.exists() and not args.force:
continue
print(f"[scenes] {path.name} ...", file=sys.stderr, flush=True)
t0 = time.time()
try:
video = open_video(str(path))
sm = SceneManager()
sm.add_detector(AdaptiveDetector(min_scene_len=int(round(v.get("fps", 30) or 30) * 0.5)))
sm.detect_scenes(video, show_progress=False)
scenes = sm.get_scene_list()
entries = []
for s, e in scenes:
entries.append({
"start_frame": s.frame_num, "end_frame": e.frame_num,
"start_s": s.get_seconds(), "end_s": e.get_seconds(),
"duration_s": e.get_seconds() - s.get_seconds(),
})
# if no cuts found, treat the whole video as one scene
if not entries:
entries = [{
"start_frame": 0, "end_frame": v["frames"],
"start_s": 0.0, "end_s": v["duration_s"],
"duration_s": v["duration_s"],
}]
out_file.write_text(json.dumps({"video": str(path), "scenes": entries}, indent=2))
print(f" {len(entries)} scenes in {time.time()-t0:.1f}s -> {out_file.name}",
file=sys.stderr)
except Exception as e:
print(f" ERROR: {e}", file=sys.stderr)
# ----------------------------- stage -----------------------------
def cmd_stage(args):
inv = json.loads(Path(args.inventory).read_text())
scenes_dir = Path(args.scenes_dir)
queue = []
qid = 0
sample_every = 1.0 / args.sample_fps
for v in inv["videos"]:
if "error" in v:
continue
p = Path(v["path"])
sf = scenes_dir / (p.stem + ".scenes.json")
if not sf.exists():
print(f"[warn] no scenes file for {p.name}; skipping", file=sys.stderr)
continue
scenes = json.loads(sf.read_text()).get("scenes", [])
fps = v.get("fps", 30) or 30
for sc in scenes:
t = sc["start_s"]
while t < sc["end_s"] - 0.01:
fidx = int(round(t * fps))
if fidx >= v["frames"]:
break
queue.append({
"queue_id": f"q{qid:08d}",
"video_path": str(p),
"win_video_path": v["win_path"],
"frame_idx": fidx,
"time_s": t,
})
qid += 1
t += sample_every
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(queue, indent=2))
print(f"[stage] {len(queue)} sampled frames @ {args.sample_fps} fps -> {out}",
file=sys.stderr)
print(f"[stage] win path for worker: {wsl_to_win(str(out))}", file=sys.stderr)
# ----------------------------- merge + track -----------------------------
def cmd_merge(args):
"""Read worker output and group by video_path. Supports either JSONL (one record
per line, the new format) or legacy JSON (results.json with `results` list)."""
src_path = Path(args.results)
records = []
# try JSONL first (sister .jsonl file or .results passed directly)
jsonl_candidate = src_path.with_suffix(".jsonl")
if jsonl_candidate.exists():
with open(jsonl_candidate) as f:
for line in f:
line = line.strip()
if line:
records.append(json.loads(line))
elif src_path.suffix == ".jsonl":
with open(src_path) as f:
for line in f:
line = line.strip()
if line:
records.append(json.loads(line))
else:
# legacy: monolithic JSON
src = json.loads(src_path.read_text())
records = src.get("results", [])
by_video: dict[str, list] = {}
for r in records:
by_video.setdefault(r["video_path"], []).append(r)
for v in by_video:
by_video[v].sort(key=lambda x: x["frame_idx"])
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps({"by_video": by_video}, indent=2))
print(f"[merge] {sum(len(v) for v in by_video.values())} frames across {len(by_video)} videos "
f"-> {out}", file=sys.stderr)
def _iou(a, b):
ax1, ay1, ax2, ay2 = a
bx1, by1, bx2, by2 = b
ix1 = max(ax1, bx1); iy1 = max(ay1, by1)
ix2 = min(ax2, bx2); iy2 = min(ay2, by2)
iw = max(ix2 - ix1, 0); ih = max(iy2 - iy1, 0)
inter = iw * ih
ua = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
return inter / ua if ua > 0 else 0.0
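# e.g. _iou([0, 0, 10, 10], [5, 0, 15, 10]) = 50 / (100 + 100 - 50) = 1/3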
def cmd_track(args):
"""Stitch per-frame face detections into tracks within each scene of each video.
Track = list of (frame_idx, face_idx) where adjacent samples have IoU>=0.3 OR
cosine(emb)>=0.5. New face → new track. No cross-scene merging."""
fr = json.loads(Path(args.frames).read_text())
scenes_dir = Path(args.scenes_dir)
inv = json.loads(Path(args.inventory).read_text())
inv_by_path = {v["path"]: v for v in inv["videos"]}
all_video_tracks: dict[str, list] = {}
for video_path, frames in fr["by_video"].items():
v = inv_by_path.get(video_path, {})
sf = scenes_dir / (Path(video_path).stem + ".scenes.json")
scenes = json.loads(sf.read_text()).get("scenes", []) if sf.exists() else []
# group frames by scene
scene_for_frame = {}
for si, sc in enumerate(scenes):
for f in frames:
if f["frame_idx"] >= sc["start_frame"] and f["frame_idx"] < sc["end_frame"]:
scene_for_frame.setdefault(si, []).append(f)
video_tracks = []
for si, scene_frames in scene_for_frame.items():
scene_frames.sort(key=lambda x: x["frame_idx"])
# tracks = list of dict{ "members": [(frame_idx, face_idx, face_dict)], "last_bbox", "last_emb" }
tracks = []
for f in scene_frames:
claimed = set()
for face_idx, face in enumerate(f.get("faces", [])):
bbox = face["bbox"]
emb = np.array(face.get("embedding", []), dtype=np.float32) if face.get("embedding") else None
best_track = None
best_score = 0.0
for ti, tr in enumerate(tracks):
if ti in claimed:
continue
# staleness in TIME (sample period independent of source fps)
last_time = tr["members"][-1][3]
if f["time_s"] - last_time > 1.5: # stale if >1.5s gap (3 sample periods @ 2fps)
continue
score = _iou(tr["last_bbox"], bbox)
if emb is not None and tr.get("last_emb") is not None:
score = max(score, float(np.dot(tr["last_emb"], emb)))
if score > best_score:
best_score = score
best_track = ti
if best_track is not None and best_score >= min(TRACK_IOU_MIN, TRACK_EMB_MIN):
tr = tracks[best_track]
tr["members"].append((f["frame_idx"], face_idx, face, f["time_s"]))
tr["last_bbox"] = bbox
if emb is not None:
tr["last_emb"] = emb
claimed.add(best_track)
else:
tracks.append({
"members": [(f["frame_idx"], face_idx, face, f["time_s"])],
"last_bbox": bbox,
"last_emb": emb,
})
for tr in tracks:
if len(tr["members"]) < 2:
continue
video_tracks.append({
"scene_idx": si,
"members": [
{"frame_idx": m[0], "face_idx": m[1], "time_s": m[3], "face": m[2]}
for m in tr["members"]
],
})
all_video_tracks[video_path] = video_tracks
print(f"[track] {Path(video_path).name}: {sum(len(s) for s in scene_for_frame.values())} frames "
f"-> {len(video_tracks)} tracks across {len(scene_for_frame)} scenes",
file=sys.stderr)
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps({"by_video": all_video_tracks}, indent=2))
print(f"[track] -> {out}", file=sys.stderr)
# ----------------------------- score (quality gates) -----------------------------
def _track_passes(track, cfg):
"""Per-frame quality gating; return list of bool (does each member pass) +
aggregate stats. cfg: dict with yaw_max, pitch_max, face_min, det_min."""
passes = []
yaws, pitches, sizes, dets = [], [], [], []
for m in track["members"]:
f = m["face"]
yaw = abs(f.get("pose", [0, 0, 0])[1]) if f.get("pose") else 0
pitch = abs(f.get("pose", [0, 0, 0])[0]) if f.get("pose") else 0
size = f.get("face_short", 0)
det = f.get("det_score", 0)
ok = (yaw <= cfg["yaw_max"] and pitch <= cfg["pitch_max"]
and size >= cfg["face_min"] and det >= cfg["det_min"])
passes.append(ok)
yaws.append(yaw); pitches.append(pitch); sizes.append(size); dets.append(det)
return passes, {
"n": len(passes), "n_pass": sum(passes), "frac_pass": sum(passes) / max(1, len(passes)),
"yaw_med": float(np.median(yaws)) if yaws else None,
"pitch_med": float(np.median(pitches)) if pitches else None,
"size_med": float(np.median(sizes)) if sizes else None,
"det_med": float(np.median(dets)) if dets else None,
}
def _build_segments(track, cfg):
"""Return list of (start_s, end_s) accepted sub-segments of this track:
contiguous runs of passing frames meeting min/max duration. Pose-failure
spans <= cfg['bridge_s'] long get bridged across (handles momentary head
turns / detection misses)."""
passes, stats = _track_passes(track, cfg)
members = track["members"]
if not members:
return [], stats
# bridge gaps of failing frames (any width) up to cfg["bridge_s"] seconds
bridged = list(passes)
n = len(bridged)
i = 0
while i < n:
if bridged[i]:
i += 1
continue
# find run of consecutive False starting at i
j = i
while j < n and not bridged[j]:
j += 1
# bridge if surrounded by True on both sides AND time gap <= bridge_s
if i > 0 and j < n and bridged[i - 1] and bridged[j]:
t_left = members[i - 1]["time_s"]
t_right = members[j]["time_s"]
if t_right - t_left <= cfg["bridge_s"]:
for k in range(i, j):
bridged[k] = True
i = j
# find runs of True
runs = []
i = 0
while i < n:
if not bridged[i]:
i += 1; continue
j = i
while j + 1 < n and bridged[j + 1]:
j += 1
s = members[i]["time_s"]
# end is the time of the last passing sample plus one sample-period
e = members[j]["time_s"] + 1.0 / max(SAMPLE_FPS, 1e-3)
runs.append((s, e))
i = j + 1
return runs, stats
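# Bridging, worked through (illustrative; at SAMPLE_FPS=2 samples sit 0.5s apart):
#   passes = [T, T, F, F, T, T] at times [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
#   failing span covers t=1.0..1.5; gap = t_right - t_left = 2.0 - 0.5 = 1.5s <= bridge_s
#   -> bridged to all-True -> single run (0.0, 2.5 + 0.5) = (0.0, 3.0)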
def _merge_close_segments(segs_with_meta, merge_gap_s: float):
"""Merge segments within the same scene that are within merge_gap_s of each other.
segs_with_meta: list of dicts with start_s, end_s, scene_idx, track_idx, stats.
Returns list of merged dicts (one per merged group). Identity-tag and stats
aggregation happen later."""
by_scene: dict[int, list] = {}
for s in segs_with_meta:
by_scene.setdefault(s["scene_idx"], []).append(s)
merged_all = []
for scene_idx, segs in by_scene.items():
segs.sort(key=lambda x: x["start_s"])
cur = None
for s in segs:
if cur is None:
cur = {**s, "track_idxs": [s["track_idx"]], "member_count": s["stats"]["n"],
"pass_count": s["stats"]["n_pass"]}
continue
gap = s["start_s"] - cur["end_s"]
if gap <= merge_gap_s:
# merge
cur["end_s"] = max(cur["end_s"], s["end_s"])
cur["track_idxs"].append(s["track_idx"])
cur["member_count"] += s["stats"]["n"]
cur["pass_count"] += s["stats"]["n_pass"]
# take the better-quality stats for display
if s["stats"]["n_pass"] > cur["stats"]["n_pass"]:
cur["stats"] = s["stats"]
else:
merged_all.append(cur)
cur = {**s, "track_idxs": [s["track_idx"]], "member_count": s["stats"]["n"],
"pass_count": s["stats"]["n_pass"]}
if cur is not None:
merged_all.append(cur)
return merged_all
def _split_long_segments(segs_with_meta, min_s: float, max_s: float):
"""Apply min/max duration: drop too-short, split too-long evenly."""
out = []
for s in segs_with_meta:
dur = s["end_s"] - s["start_s"]
if dur < min_s:
continue
if dur <= max_s:
out.append(s)
continue
n = int(math.ceil(dur / max_s))
chunk = dur / n
base_start = s["start_s"]
for k in range(n):
piece = dict(s)
piece["start_s"] = base_start + k * chunk
piece["end_s"] = base_start + (k + 1) * chunk
out.append(piece)
return out
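# Example (comment-only; values invented): with min_s=2.0 and max_s=10.0,
# a 1.5 s segment is dropped outright, while a 25 s segment becomes
# ceil(25/10) = 3 equal pieces of ~8.33 s each. Splitting evenly keeps
# every piece under max_s and tiles the original span with no gaps.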
# identity tagging via cached arcface centroids
def load_caches_index():
rec_index = {}
alias_map = {}
for c in CACHES:
if not c.exists():
continue
d = np.load(c, allow_pickle=True)
emb = d["embeddings"]
meta = json.loads(str(d["meta"]))
face_records = [m for m in meta if not m.get("noface")]
if "path_aliases" in d.files:
paliases = json.loads(str(d["path_aliases"]))
for canon, alist in paliases.items():
alias_map.setdefault(canon, canon)
for a in alist:
alias_map[a] = canon
for i, rec in enumerate(face_records):
v = emb[i].astype(np.float32)
n = float(np.linalg.norm(v))
if n > 0:
v = v / n
rec_index[(rec["path"], tuple(int(x) for x in rec["bbox"]))] = v
alias_map.setdefault(rec["path"], rec["path"])
return rec_index, alias_map
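# The index keys embeddings by (canonical_path, bbox) so a manifest face can
# be matched to its cached vector even after source files moved. Sketch of a
# lookup (comment-only; paths and boxes invented):
#   canon = alias_map.get("/old/loc/a.jpg", "/old/loc/a.jpg")   # -> "/data/a.jpg"
#   vec = rec_index.get((canon, (10, 20, 110, 140)))            # unit-norm vector or None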
def load_faceset_centroids():
"""Return dict faceset_name -> normalized centroid embedding."""
rec_index, alias_map = load_caches_index()
centroids = {}
for fs_dir in sorted(FACESETS_ROOT.iterdir()):
if not fs_dir.is_dir() or fs_dir.name.startswith("_"):
continue
# exclude era splits to avoid double-tagging within a family
if re.match(r"^faceset_\d+_(?:\d{4}-\d{2,4}|\d{4}|undated)", fs_dir.name):
continue
mp = fs_dir / "manifest.json"
if not mp.exists():
continue
m = json.loads(mp.read_text())
vecs = []
for f in m.get("faces", []):
src = f.get("source"); bbox = f.get("bbox")
if not src or not bbox:
continue
canon = alias_map.get(src, src)
v = rec_index.get((canon, tuple(int(x) for x in bbox)))
if v is None and canon != src:
v = rec_index.get((src, tuple(int(x) for x in bbox)))
if v is not None:
vecs.append(v)
if len(vecs) < 3:
continue
c = np.stack(vecs).mean(axis=0)
n = float(np.linalg.norm(c))
if n > 0:
c = c / n
centroids[fs_dir.name] = c
return centroids
def _track_centroid(track):
embs = [m["face"].get("embedding") for m in track["members"] if m["face"].get("embedding")]
if not embs:
return None
arr = np.array(embs, dtype=np.float32)
c = arr.mean(axis=0)
n = float(np.linalg.norm(c))
return c / n if n > 0 else c
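# Both _track_centroid() and the faceset centroids above are L2-normalized,
# so the np.dot() in cmd_score below is exactly cosine similarity in [-1, 1].
# The tagging rule, in brief (names as used below):
#   sim = float(np.dot(faceset_centroid, union_centroid))
#   tag the segment with the best-matching faceset iff sim >= IDENTITY_TAG_THRESHOLD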
def cmd_score(args):
tr = json.loads(Path(args.tracks).read_text())
inv = json.loads(Path(args.inventory).read_text())
inv_by_path = {v["path"]: v for v in inv["videos"]}
cfg = {
"yaw_max": args.max_yaw, "pitch_max": args.max_pitch,
"face_min": args.min_face, "det_min": args.min_det,
"bridge_s": args.bridge_gap,
}
centroids = {}
if not args.no_identity:
print("[score] loading faceset centroids ...", file=sys.stderr)
t0 = time.time()
centroids = load_faceset_centroids()
print(f"[score] {len(centroids)} active faceset centroids loaded in {time.time()-t0:.1f}s",
file=sys.stderr)
n_total_tracks = 0
n_accepted_tracks = 0
# collect per-track candidate segments first; merging happens per-video below
per_video_candidates: dict[str, list] = {}
track_centroids_by_video: dict[str, dict] = {}
for video_path, tracks in tr["by_video"].items():
per_video_candidates.setdefault(video_path, [])
track_centroids_by_video.setdefault(video_path, {})
for ti, track in enumerate(tracks):
n_total_tracks += 1
runs, stats = _build_segments(track, cfg)
if stats["frac_pass"] < args.track_gate_frac:
continue
if not runs:
continue
n_accepted_tracks += 1
track_centroids_by_video[video_path][ti] = _track_centroid(track)
for (s, e) in runs:
per_video_candidates[video_path].append({
"video_path": video_path,
"track_idx": ti,
"scene_idx": track["scene_idx"],
"start_s": s,
"end_s": e,
"stats": stats,
})
plan = []
for video_path, segs in per_video_candidates.items():
if not segs:
continue
# merge across tracks within the same scene if gap <= merge_gap_s
merged = _merge_close_segments(segs, args.merge_gap)
# apply min/max duration (split long, drop short)
merged = _split_long_segments(merged, args.min_dur, args.max_dur)
for s in merged:
tag = None
tag_sim = None
# identity from union of contributing tracks' centroids
if centroids:
track_centroid_list = [
track_centroids_by_video[video_path].get(ti)
for ti in s.get("track_idxs", [s.get("track_idx")])
]
track_centroid_list = [c for c in track_centroid_list if c is not None]
if track_centroid_list:
union = np.stack(track_centroid_list).mean(axis=0)
nm = float(np.linalg.norm(union))
if nm > 0:
union = union / nm
sims = {name: float(np.dot(c, union)) for name, c in centroids.items()}
best = max(sims, key=sims.get)
if sims[best] >= IDENTITY_TAG_THRESHOLD:
tag = best; tag_sim = round(sims[best], 4)
plan.append({
"video_path": video_path,
"track_idxs": s.get("track_idxs", [s.get("track_idx")]),
"scene_idx": s["scene_idx"],
"start_s": round(s["start_s"], 3),
"end_s": round(s["end_s"], 3),
"duration_s": round(s["end_s"] - s["start_s"], 3),
"member_count": s.get("member_count", s["stats"]["n"]),
"pass_count": s.get("pass_count", s["stats"]["n_pass"]),
"stats": s["stats"],
"identity_tag": tag,
"identity_sim": tag_sim,
"uuid": uuid.uuid4().hex[:12],
})
plan.sort(key=lambda p: (p["video_path"], p["start_s"]))
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps({
"thresholds": {
"yaw_max": args.max_yaw, "pitch_max": args.max_pitch,
"face_min": args.min_face, "blur_min": QUALITY_BLUR_MIN,
"det_min": args.min_det, "track_gate_frac": args.track_gate_frac,
"bridge_s": args.bridge_gap, "merge_gap_s": args.merge_gap,
"min_dur_s": args.min_dur, "max_dur_s": args.max_dur,
"identity_tag_threshold": IDENTITY_TAG_THRESHOLD,
},
"totals": {
"tracks_total": n_total_tracks, "tracks_accepted": n_accepted_tracks,
"segments": len(plan),
},
"plan": plan,
}, indent=2))
print(f"[score] {n_accepted_tracks}/{n_total_tracks} tracks accepted -> {len(plan)} segments "
f"-> {out}", file=sys.stderr)
# ----------------------------- cut -----------------------------
def cmd_cut(args):
plan = json.loads(Path(args.plan).read_text())
out_dir = Path(args.output_dir)
out_dir.mkdir(parents=True, exist_ok=True)
if args.clean:
# remove only existing UUID-named clips + sidecars (12-char hex), keeping any other files
uuid_pat = re.compile(r"^[0-9a-f]{12}\.(mp4|json)$")
n_removed = 0
for child in out_dir.iterdir():
if child.is_file() and uuid_pat.match(child.name):
child.unlink()
n_removed += 1
elif child.is_dir() and re.match(r"^[A-Za-z0-9_.-]+$", child.name):
# subfolder of prior runs — clear UUID files inside, then remove if empty
for inner in child.iterdir():
if inner.is_file() and uuid_pat.match(inner.name):
inner.unlink()
n_removed += 1
try:
child.rmdir()
except OSError:
pass
if n_removed:
print(f"[clean] removed {n_removed} prior UUID clips/sidecars", file=sys.stderr)
n_done = 0
n_err = 0
sidecars = []
for seg in plan["plan"]:
sub = Path(seg["video_path"]).stem
seg_dir = out_dir / sub
seg_dir.mkdir(parents=True, exist_ok=True)
out_video = seg_dir / f"{seg['uuid']}.mp4"
if out_video.exists() and not args.force:
continue
s = seg["start_s"]; d = seg["duration_s"]
cmd = [
"ffmpeg", "-y", "-loglevel", "error",
"-ss", f"{s}",
"-i", seg["video_path"],
"-t", f"{d}",
"-c", "copy",
"-avoid_negative_ts", "make_zero",
str(out_video),
]
r = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
if r.returncode != 0 or not out_video.exists() or out_video.stat().st_size < 1024:
print(f"[cut-err] {seg['uuid']} {seg['video_path']}@{s}+{d}: {r.stderr.strip()[:200]}",
file=sys.stderr)
n_err += 1
if out_video.exists() and out_video.stat().st_size < 1024:
out_video.unlink()
continue
if args.write_sidecar:
sidecar = seg_dir / f"{seg['uuid']}.json"
sidecar.write_text(json.dumps({
"uuid": seg["uuid"],
"source_video": seg["video_path"],
"source_basename": Path(seg["video_path"]).name,
"start_s": s, "end_s": seg["end_s"], "duration_s": d,
"scene_idx": seg["scene_idx"],
"track_idxs": seg.get("track_idxs", [seg.get("track_idx")]),
"member_count": seg.get("member_count"),
"pass_count": seg.get("pass_count"),
"stats": seg["stats"],
"identity_tag": seg["identity_tag"],
"identity_sim": seg["identity_sim"],
"thresholds": plan["thresholds"],
}, indent=2))
sidecars.append(sidecar)
n_done += 1
print(f"[cut] {n_done} clips written, {n_err} errors -> {out_dir}", file=sys.stderr)
# ----------------------------- report -----------------------------
def cmd_report(args):
plan = json.loads(Path(args.plan).read_text())
out_dir = Path(args.out)
out_dir.mkdir(parents=True, exist_ok=True)
thumbs_dir = out_dir / "thumbs"
thumbs_dir.mkdir(exist_ok=True)
output_dir = Path(args.output_dir)
# group by video
by_video: dict[str, list] = {}
for seg in plan["plan"]:
by_video.setdefault(seg["video_path"], []).append(seg)
# generate a thumbnail ~0.1 s into each segment via ffmpeg
print(f"[report] generating thumbs for {len(plan['plan'])} segments", file=sys.stderr)
for seg in plan["plan"]:
thumb = thumbs_dir / f"{seg['uuid']}.jpg"
if thumb.exists():
continue
s = seg["start_s"] + 0.1
cmd = [
"ffmpeg", "-y", "-loglevel", "error",
"-ss", f"{s}",
"-i", seg["video_path"],
"-frames:v", "1",
"-vf", "scale=240:-1",
str(thumb),
]
subprocess.run(cmd, capture_output=True, timeout=30)
# render
rows = []
rows.append("<h1>Video target preprocessing &mdash; review</h1>")
t = plan["totals"]
th = plan["thresholds"]
rows.append(f"<p>Tracks accepted: {t['tracks_accepted']}/{t['tracks_total']}; "
f"segments emitted: {t['segments']}.<br>"
f"Thresholds: pose &le;{th['yaw_max']}&deg;yaw / {th['pitch_max']}&deg;pitch, "
f"face_short &ge;{th['face_min']}px, det &ge;{th['det_min']}, "
f"track-gate &ge;{int(100*th['track_gate_frac'])}%, "
f"duration {th['min_dur_s']}{th['max_dur_s']}s. "
f"Output dir: <code>{output_dir}</code></p>")
nav = " · ".join(f"<a href='#v{i}'>{Path(v).name}</a>"
for i, v in enumerate(by_video.keys()))
rows.append(f"<div class='nav'>{nav}</div>")
for vi, (video_path, segs) in enumerate(by_video.items()):
rows.append(f"<section id='v{vi}' class='vid'>")
rows.append(f"<h2>{Path(video_path).name} <small>({len(segs)} segments)</small></h2>")
rows.append("<div class='cells'>")
for seg in sorted(segs, key=lambda x: x["start_s"]):
stats = seg["stats"]
tag = seg["identity_tag"] or ""
tag_sim = seg["identity_sim"]
tag_html = (f"<span class='tag'>{tag} ({tag_sim:.2f})</span>" if tag else "<span class='tag none'>untagged</span>")
sub_name = Path(seg['video_path']).stem
rows.append(
f"<div class='cell'>"
f"<a href='{output_dir}/{sub_name}/{seg['uuid']}.mp4'><img src='thumbs/{seg['uuid']}.jpg' loading='lazy'></a>"
f"<div class='meta'>"
f"<code>{sub_name}/{seg['uuid']}.mp4</code><br>"
f"{seg['start_s']:.1f}s &rarr; {seg['end_s']:.1f}s ({seg['duration_s']:.1f}s)<br>"
f"yaw={stats['yaw_med']:.0f}&deg; size={stats['size_med']:.0f}px det={stats['det_med']:.2f}<br>"
f"pass {stats['n_pass']}/{stats['n']}<br>"
f"{tag_html}"
f"</div></div>"
)
rows.append("</div></section>")
html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Video targets review</title>
<style>
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
h1, h2 {{ margin-top: 1em; }} h2 {{ border-bottom: 1px solid #333; padding-bottom: 4px; }}
small {{ color:#999; font-weight:normal; }}
section.vid {{ background:#1a1a1a; border-radius:6px; padding:12px; margin:12px 0; }}
.cells {{ display:flex; flex-wrap:wrap; gap:8px; }}
.cell {{ background:#222; border-radius:4px; padding:6px; width:260px; font-size:11px; font-family:monospace; }}
.cell img {{ width:100%; height:auto; border-radius:3px; }}
.meta {{ padding-top:4px; line-height:1.4; }}
.tag {{ display:inline-block; padding:1px 6px; background:#5fa05f; color:#000; border-radius:2px; }}
.tag.none {{ background:#444; color:#aaa; }}
.nav {{ position:sticky; top:0; background:#111; padding:.5em 0; border-bottom:1px solid #333; font-size:12px; }}
a {{ color:#6cf; }}
code {{ background:#000; padding:1px 4px; border-radius:2px; }}
</style></head>
<body>
{''.join(rows)}
</body></html>"""
out_html = out_dir / "index.html"
out_html.write_text(html)
print(f"[report] -> {out_html}", file=sys.stderr)
# ----------------------------- main -----------------------------
def main():
ap = argparse.ArgumentParser()
sub = ap.add_subparsers(dest="cmd", required=True)
s = sub.add_parser("scan")
s.add_argument("--input", default=str(DEFAULT_INPUT))
s.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
s.add_argument("--recursive", action="store_true")
s.add_argument("--out", required=True)
s.set_defaults(func=cmd_scan)
sc = sub.add_parser("scenes")
sc.add_argument("--inventory", required=True)
sc.add_argument("--out-dir", required=True)
sc.add_argument("--only", default=None, help="comma-separated basenames to limit run")
sc.add_argument("--force", action="store_true")
sc.set_defaults(func=cmd_scenes)
st = sub.add_parser("stage")
st.add_argument("--inventory", required=True)
st.add_argument("--scenes-dir", required=True)
st.add_argument("--sample-fps", type=float, default=SAMPLE_FPS)
st.add_argument("--out", required=True)
st.set_defaults(func=cmd_stage)
m = sub.add_parser("merge")
m.add_argument("--results", required=True)
m.add_argument("--out", required=True)
m.set_defaults(func=cmd_merge)
tr = sub.add_parser("track")
tr.add_argument("--frames", required=True)
tr.add_argument("--scenes-dir", required=True)
tr.add_argument("--inventory", required=True)
tr.add_argument("--sample-fps", type=float, default=SAMPLE_FPS)
tr.add_argument("--out", required=True)
tr.set_defaults(func=cmd_track)
sc2 = sub.add_parser("score")
sc2.add_argument("--tracks", required=True)
sc2.add_argument("--inventory", required=True)
sc2.add_argument("--out", required=True)
sc2.add_argument("--no-identity", action="store_true")
sc2.add_argument("--max-yaw", type=float, default=QUALITY_YAW_MAX)
sc2.add_argument("--max-pitch", type=float, default=QUALITY_PITCH_MAX)
sc2.add_argument("--min-face", type=int, default=QUALITY_FACE_MIN)
sc2.add_argument("--min-det", type=float, default=QUALITY_DET_MIN)
sc2.add_argument("--track-gate-frac", type=float, default=TRACK_GATE_FRAC)
sc2.add_argument("--bridge-gap", type=float, default=SEGMENT_BRIDGE_S,
help="bridge within-track failure gaps up to this many seconds")
sc2.add_argument("--merge-gap", type=float, default=SEGMENT_MERGE_GAP_S,
help="merge across-track segments in same scene if within this gap")
sc2.add_argument("--min-dur", type=float, default=SEGMENT_MIN_S)
sc2.add_argument("--max-dur", type=float, default=SEGMENT_MAX_S)
sc2.set_defaults(func=cmd_score)
cu = sub.add_parser("cut")
cu.add_argument("--plan", required=True)
cu.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
cu.add_argument("--force", action="store_true")
cu.add_argument("--clean", action="store_true",
help="remove prior UUID-named clips before cutting (preserves non-UUID files)")
cu.add_argument("--write-sidecar", action="store_true",
help="emit <uuid>.json provenance sidecar alongside each clip (default off)")
cu.set_defaults(func=cmd_cut)
rp = sub.add_parser("report")
rp.add_argument("--plan", required=True)
rp.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
rp.add_argument("--out", required=True)
rp.set_defaults(func=cmd_report)
args = ap.parse_args()
args.func(args)
if __name__ == "__main__":
main()
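# Illustrative end-to-end chain (comment-only; every path is a placeholder
# and the intermediate filenames are assumptions, not fixed by this script):
#   python video_target_pipeline.py scan   --input IN_DIR --out work/inventory.json
#   python video_target_pipeline.py scenes --inventory work/inventory.json --out-dir work/scenes
#   python video_target_pipeline.py stage  --inventory work/inventory.json --scenes-dir work/scenes --out work/stage.json
#   # ...face detect/embed runs externally, producing a results file; then:
#   python video_target_pipeline.py merge  --results RESULTS --out work/frames.json
#   python video_target_pipeline.py track  --frames work/frames.json --scenes-dir work/scenes \
#     --inventory work/inventory.json --out work/tracks.json
#   python video_target_pipeline.py score  --tracks work/tracks.json --inventory work/inventory.json --out work/plan.json
#   python video_target_pipeline.py cut    --plan work/plan.json --output-dir OUT_DIR
#   python video_target_pipeline.py report --plan work/plan.json --output-dir OUT_DIR --out work/report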