Add Immich import pipeline (WSL stage + Windows DML embed + cluster)

Three-piece workflow that imports a self-hosted Immich library and emits
new facesets without disturbing existing identity numbering:

- work/immich_stage.py (WSL): pages /search/metadata, parallel-fetches
  /faces?id= per asset, prefilters by face_short>=90 against bbox scaled
  to original-image coords, downloads originals, sha256-dedups against
  nl_full.npz and same-run staged files. 8-worker ThreadPoolExecutor
  doing the full /faces->filter->/original chain per asset; resumable
  via state.json. API URL + key come from IMMICH_URL / IMMICH_API_KEY
  env vars, label->UUID map from work/immich/users.json (gitignored).
- work/embed_worker.py (Windows venv at C:\face_embed_venv): runs
  insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider on
  AMD Radeon Vega via onnxruntime-directml. Produces a cache file in the
  same .npz schema as sort_faces.cmd_embed (loadable via load_cache).
  ~7.5x speedup over CPU end-to-end; embeddings bit-identical to CPU
  (cosine similarity 1.0000 across 8 sample faces).
- work/cluster_immich.py (WSL): mirrors cluster_osrc.py against an
  immich_<user>.npz. Builds existing identity centroids from canonical
  faceset_NNN/ in facesets_swap_ready/, drops matches at <=0.45,
  clusters the rest at 0.55, applies refine gates, hands off to
  cmd_export_swap. Numbers new facesets past the existing maximum.
- work/finalize_immich.sh: chains queue->Windows embed->cache copy->
  cluster_immich, with logging.

The 2026-04-26 run on https://fotos.computerliebe.org (Immich v2.7.2)
processed 53,842 admin-accessible assets, staged 10,261, embedded 19,462
face records on Vega DML in 64.6 min, matched 8,103 (42%) to existing
identities, and emitted 185 new facesets (faceset_026..264 with gaps).
facesets_swap_ready/ went from 31 to 216 substantive facesets.

Important caveat surfaced: /search/metadata's userIds filter is silently
ignored when the API key is bound to a different user, so this run can't
enumerate other users' libraries from the admin key. A per-user API key
would be required for nic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 18:14:26 +02:00
parent 7ecbfae981
commit 321fed01cc
6 changed files with 1340 additions and 3 deletions
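
The staging prefilter described in the commit message (scale each Immich bbox to original-image coordinates, then keep only assets with a face whose short side is at least 90 px) could look roughly like the sketch below. It is not taken from `work/immich_stage.py`: the `Face` field layout and the helper names `face_short_in_original` and `has_big_face` are hypothetical, and it assumes each face record carries the dimensions of the image the bbox was measured on.

```python
from dataclasses import dataclass
from typing import Iterable

MIN_FACE_SHORT = 90  # px; the threshold quoted in the commit message


@dataclass
class Face:
    # Hypothetical layout: bbox corners in the coordinate space of the image
    # that detection ran on, plus that image's width and height.
    x1: float
    y1: float
    x2: float
    y2: float
    image_width: int
    image_height: int


def face_short_in_original(face: Face, orig_width: int, orig_height: int) -> float:
    """Return the bbox short side after scaling to original-image pixels."""
    if face.image_width <= 0 or face.image_height <= 0:
        return 0.0  # unusable record; treat as "no big face"
    sx = orig_width / face.image_width
    sy = orig_height / face.image_height
    return min((face.x2 - face.x1) * sx, (face.y2 - face.y1) * sy)


def has_big_face(faces: Iterable[Face], orig_width: int, orig_height: int,
                 min_short: float = MIN_FACE_SHORT) -> bool:
    """True if any face passes the prefilter, i.e. the original is worth downloading."""
    return any(
        face_short_in_original(f, orig_width, orig_height) >= min_short
        for f in faces
    )


if __name__ == "__main__":
    # A bbox measured on a 1440x960 preview of a 6000x4000 original.
    preview_face = Face(100, 100, 130, 140, 1440, 960)
    print(face_short_in_original(preview_face, 6000, 4000))
    print(has_big_face([preview_face], 6000, 4000))
```

Filtering before the download is the point of the design: an asset with no face above the threshold never triggers the `/original` request.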
--- a/README.md
+++ b/README.md
@@ -204,6 +204,77 @@ existing identities), this produced 6 new facesets (`faceset_020..025`,
 sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to
 export-swap's tighter `min_face_short=100` gate).

+### Importing identities from a self-hosted Immich library
+
+`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py`
+together import an Immich library at scale, with the embed step running on
+a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:
+
+1. **`work/immich_stage.py` (WSL)** — pages every IMAGE asset via
+   `/search/metadata`, fetches each asset's `/faces?id=` to read Immich's
+   own ML-driven bboxes, scales each bbox to original-image coordinates,
+   and prefilters by `face_short ≥ 90`. For survivors it downloads the
+   original, sha256-deduplicates against the canonical `nl_full.npz` and
+   against same-run staged files, and saves to
+   `/mnt/x/src/immich/<user>/<rel>`. It writes a `queue.json` that the
+   embed worker consumes. Eight worker threads each run the full
+   per-asset I/O chain (`/faces` → filter → `/original`), so throughput
+   scales roughly 8× over a serial run.
+2. **`work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`)** —
+   loads `insightface.FaceAnalysis(buffalo_l)` with the
+   `DmlExecutionProvider` and runs detection + landmarks + recognition
+   over the queue. It produces a `.npz` cache with the same schema that
+   `sort_faces.py:cmd_embed` writes, so the result is directly loadable
+   by `load_cache()`. The cache already includes the post-`enrich`
+   fields (`landmark_2d_106`, `landmark_3d_68`, `pose`) because
+   FaceAnalysis returns them at no extra cost. On the AMD Radeon Vega
+   the end-to-end speedup over CPU is about 7.5×.
+3. **`work/cluster_immich.py` (WSL)** — mirrors `cluster_osrc.py`'s
+   shape but reads from `immich_<user>.npz`. Builds existing-identity
+   centroids from every canonical `faceset_NNN/` in
+   `facesets_swap_ready/` (skipping era splits and `_thin/`), drops
+   immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55,
+   applies refine gates, numbers new facesets past the existing maximum,
+   and feeds `cmd_export_swap` via a synthetic manifest.
+
+`work/finalize_immich.sh <user>` chains queue → Windows embed → cache
+copy back → cluster_immich, with logging.
+
+The Immich API key and base URL come from environment variables:
+
+```bash
+export IMMICH_URL=https://your-immich.example.com
+export IMMICH_API_KEY=...                # admin or per-user key
+python work/immich_stage.py --user peter --workers 8
+bash   work/finalize_immich.sh peter
+```
+
+For the 2026-04-26 run against `https://fotos.computerliebe.org` (Immich
+v2.7.2), with the admin API key:
+
+| step | result |
+|------|------|
+| stage | 53,842 assets seen, **10,261 staged** (~10 GB), 978 byte-deduped against `nl_full.npz`, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
+| Windows DML embed | 19,462 face records + 1 noface in **64.6 min** (2.6 img/s end-to-end) |
+| matched existing identities | **8,103 of 19,480 (42%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
+| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → **185 emitted** as `faceset_026..264` (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |
+
+**Important caveats for Immich v2.7.2**:
+- The `userIds` filter on `/search/metadata` is **silently ignored** when
+  the API key is bound to a different user: you get everything the key
+  can see, and cross-user isolation is enforced server-side. Importing
+  another user's library therefore needs that user's own API key.
+- `/server/statistics` reports fewer assets than `/search/metadata`
+  actually returns (the difference includes, e.g., external-library
+  thumbnail directories that were indexed because the import path
+  covered them). Don't use the statistics number as a denominator.
+- A meaningful fraction of `originalPath`-based assets are *Immich's own
+  thumbnails* (`<library_root>/thumbs/.../-preview.jpeg`) — included if
+  the external library's import path covers the thumbs directory and the
+  exclusion patterns don't list `**/thumbs/**`. For our run, 5,563 of
+  10,261 staged were thumbnails. They embed and cluster fine but the
+  resulting faces are lower-resolution.
+
 ## Key defaults

 `refine`:
@@ -248,15 +319,22 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with
 ├─ docs/
 │  └─ analysis/
 │     └─ facesets-downstream-refinement-evaluation.md
-└─ work/                                         (gitignored except force-tracked .py)
+└─ work/                                         (gitignored except force-tracked .py / .sh)
   ├─ build_folders.py                           (hand-sorted-folder orchestration)
   ├─ check_faceset001_age.py                    (age-split readiness probe)
   ├─ age_split_001.py                           (age-split orchestration; faceset_001)
   ├─ cluster_osrc.py                            (mixed-bucket identity discovery)
-   ├─ synthetic_refine_manifest.json             (last build_folders.py output)
-   ├─ synthetic_osrc_manifest.json               (last cluster_osrc.py output)
+   ├─ immich_stage.py                            (Immich library staging, parallel)
+   ├─ embed_worker.py                            (Windows DML embed worker, runs from C:\face_embed_venv\)
+   ├─ cluster_immich.py                          (Immich identity discovery + export)
+   ├─ finalize_immich.sh                         (chains queue → embed → cluster)
+   ├─ synthetic_*_manifest.json                  (per-run synthetic refine manifests)
+   ├─ immich/
+   │  ├─ users.json                              (label -> userId map; gitignored)
+   │  └─ <user>/{queue,state,aliases}.json       (per-user staging artifacts)
   ├─ cache/
   │  ├─ nl_full.npz                             (canonical cache + duplicates.json)
+   │  ├─ immich_<user>.npz                       (per-user immich embeddings)
   │  └─ age_split_exif.json                     (path → EXIF-year cache)
   └─ logs/
      └─ *.log                                   (every long step writes here)