# Importing identities from a self-hosted Immich library

_Run date: 2026-04-26. Target: Immich v2.7.2 at `https://fotos.computerliebe.org`. Driver scripts: `work/immich_stage.py`, `work/embed_worker.py`, `work/cluster_immich.py`, `work/finalize_immich.sh`._

## 1. Why a split workflow

InsightFace `buffalo_l` on the WSL CPU runs the full detection + landmarks + recognition stack at ~3–4 faces/second. Re-detecting all 79K Immich photos would have taken ~10–28 days. The available AMD Radeon RX Vega is unusable under WSL (no `/dev/dri/`, no ROCm), but **DirectML on Windows native** runs the same models bit-identically and ~7.5× faster end-to-end. The pipeline therefore splits:

- **WSL side** (`/opt/face-sets/`) — orchestration: API listing, download, sha256 dedup, file management, clustering, faceset emission.
- **Windows side** (`C:\face_embed_venv\`) — the embed step only. A fresh Python 3.12 (installed via `winget install Python.Python.3.12`) with `numpy`, `pillow`, `opencv-python-headless`, `onnxruntime-directml`, `insightface`. Models copied from `/home/peter/.insightface/models/buffalo_l/` to `C:\face_embed_venv\models\buffalo_l\`.

A 30-iteration synthetic benchmark on Vega:

| model                      | DML     | CPU      | speedup |
|----------------------------|--------:|---------:|--------:|
| `det_10g.onnx` (640×640)   | 10.0 ms | 183.5 ms |   18.4× |
| `w600k_r50.onnx` (112×112) |  8.2 ms |  90.5 ms |   11.0× |

End-to-end FaceAnalysis on 5 real Immich-sourced images (excluding the first-call DML JIT warmup): ~7.5× speedup post-warmup. Per-face cosine similarity DML vs CPU was 1.0000 across all 8 detected faces — DML is bit-identical to CPU for arcface inference.

## 2. Architecture

```
┌─────────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/immich_stage.py         │
│ ┌─────────────────────────────────────────────┐ │
│ │ ThreadPoolExecutor.map(_fetch_for_asset,    │ │
│ │                        list_assets(user))   │ │
│ │  ─ /faces?id= (Immich, parallel x8)         │ │
│ │  ─ filter face_short >= 90                  │ │
│ │  ─ /assets/.../original (parallel x8)       │ │
│ └─────────────────────────────────────────────┘ │
│ consumer (main thread):                         │
│   sha256 → dedup vs nl_full.npz                 │
│   save to /mnt/x/src/immich///                  │
│   append to queue.json                          │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
       queue.json (with WSL + Windows paths)
┌─────────────────────────────────────────────────┐
│ Windows embed_worker.py (C:\face_embed_venv)    │
│ insightface.FaceAnalysis(                       │
│     providers=[DmlExecutionProvider, ...])      │
│ per image: detection + landmarks + arcface      │
│ emit cache in sort_faces.py:cmd_embed schema    │
│   with embeddings + meta + processed            │
│   + path_aliases + schema=v2                    │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
                   immich_.npz
┌─────────────────────────────────────────────────┐
│ WSL cluster_immich.py                           │
│ build centroids of canonical faceset_NNN/       │
│   in facesets_swap_ready/                       │
│ drop matches at cos-dist <= 0.45                │
│ cluster the rest at 0.55                        │
│ refine gates -> synthetic refine_manifest       │
│ cmd_export_swap -> facesets_swap_ready/         │
│ merge top-level manifest                        │
└─────────────────────────────────────────────────┘
```

Cache artifacts stay separate (per the architecture choice on this run): each user's results live in their own `immich_.npz`. A future one-shot merge can fold them into `nl_full.npz` if needed; the existing `extend` command would do the right thing once schemas align.

## 3. Path mapping

`/mnt/x/` ↔ `X:\`. The cache stores the WSL form (matching `nl_full.npz`'s existing convention). `wsl_to_win()` translates for the embed worker, which runs natively on Windows.
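A minimal sketch of that translation, assuming the `/mnt/<drive>/` mount convention (the actual `wsl_to_win()` lives in the staging code and may handle more cases):

```python
def wsl_to_win(path: str) -> str:
    """Translate a WSL mount path (/mnt/x/...) into its Windows form (X:\\...)."""
    prefix = "/mnt/"
    if not path.startswith(prefix) or len(path) <= len(prefix):
        raise ValueError(f"not a WSL mount path: {path}")
    # First component after /mnt/ is the drive letter; the rest is the
    # Windows-relative path with slashes flipped.
    drive, _, rest = path[len(prefix):].partition("/")
    return f"{drive.upper()}:\\" + rest.replace("/", "\\")
```

For example, `wsl_to_win("/mnt/x/src/immich/a.jpg")` yields `X:\src\immich\a.jpg`, which the Windows-side worker can open directly.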
`work/cluster_immich.py` always uses the canonical `facesets_swap_ready/` view to build identity centroids — meaning the comparison is against the *current* set of canonical facesets in the swap-ready directory (skipping era splits and `_thin/`), not against the older `facesets_full/` snapshot.

## 4. Result of the 2026-04-26 run (peter / admin)

### 4a. Stage

```
total_assets_seen:        53842
staged_count:             10261  (~10 GB on /mnt/x/)
deduped_against_existing:   978  (sha256 in nl_full.npz already)
deduped_against_staged:    2976  (internal byte-dupes inside Immich)
skipped_no_big_face:       9539  (Immich detected only sub-90px faces)
skipped_no_faces:         29390  (Immich detected zero faces)
skipped_download_error:     698  (transient DNS / TLS, not seen-marked)
elapsed:                ~70 min  (6.4 assets/s end-to-end at 8 workers)
```

The 698 transient errors are recoverable on a re-run: `immich_stage.py` does not add them to the `seen` set, so each affected asset would be retried.

### 4b. Embed (Windows DML)

```
queue:              10261 entries
new face records:   19462
new noface records:     1
load errors:          125  (likely HEIC / unreadable)
elapsed:           3878.0s (64.6 min, 2.6 img/s end-to-end)
```

The 2.6 img/s end-to-end includes CIFS-share image load, image decode, DML inference (~50 ms/face), and JSON / NPZ flushing. Pure DML inference is faster; the rest of the pipeline dominates at scale.

### 4c. Cluster

```
existing canonical centroids: 25
faces already covered (cos-dist <= 0.45): 8103/19480 (42%)
  faceset_001: 1856
  faceset_002: 2666
  faceset_003:  670
  faceset_004:   48
  faceset_005:   40
  ... (smaller hits to the remaining 20)
unmatched faces to cluster: 11377
clusters at threshold 0.55: 2534 (top sizes [469, 444, 342, 338, 262, ...])
survived refine gates: 239
emitted as new facesets: 185 (54 dropped by export-swap's 0.45 outlier gate)
```

Top-level `facesets_swap_ready/manifest.json` after this run: **216 facesets** (up from 31; ~7× growth) + 68 thin_eras under `_thin/`.

## 5. Surprises and caveats

### 5a. `/search/metadata`'s `userIds` filter is silently ignored (Immich v2.7.2)

When the admin API key is used, passing `userIds=[]` returns admin's own assets, not the other user's. The filter is silently dropped. We verified this by sampling 200 returned items and confirming `ownerId` was admin for all of them.

To process another user's library, **a separate API key issued by that user is required** — the admin key cannot enumerate cross-user libraries through any documented endpoint we tried. `/timeline/buckets` with a `userId` query parameter returns `Not found or no timeline.read access`.

### 5b. `/server/statistics` undercounts what the search returns

`/server/statistics` reported admin = 53,842 photos. Our `/search/metadata` pagination returned... **53,842** top-level items. So the header agrees with the body in this case. But `/server/statistics` does NOT count items that live under external libraries' import paths — yet `/search/metadata` does include them. For this Immich, two external libraries (`/mnt/media/photos` and `/mnt/media/omv_photos`) are configured but `/libraries` reports `assetCount=0` for both — yet 80% of our staged paths come from those library import paths. Don't trust statistics-vs-search consistency.

### 5c. Indexed Immich thumbnails masquerading as assets

5,563 of our 10,261 staged paths are `/thumbs/.../-preview.jpeg` — Immich's own internally-generated thumbnails got indexed because the external library import path included the thumbs subdirectory and the exclusion patterns didn't list `**/thumbs/**`. They embed and cluster fine but produce lower-resolution face records. The fix on the Immich side is adding `**/thumbs/**` to the exclusion patterns.

### 5d. Internal byte-duplicates (2,976)

Many Immich assets are byte-identical to other Immich assets — typically because the same photo was uploaded both from a phone and from a synced cloud folder.
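This kind of dedup reduces to keying every downloaded blob on its content hash. A minimal sketch (function and variable names are hypothetical; the real pipeline checks sha256 digests recorded in `nl_full.npz`):

```python
import hashlib

def stage_unique(blobs, seen_hashes):
    """Yield only byte-unique blobs; seen_hashes persists across runs."""
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest in seen_hashes:
            continue  # byte-duplicate: skip the disk write and embed work
        seen_hashes.add(digest)
        yield blob
```

Because `seen_hashes` is populated from the existing cache before staging, the same mechanism covers both "already embedded on a previous run" and "duplicated inside this run".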
sha256 dedup catches all of these on the second download (we still pay the bandwidth, but skip the disk write and embed work). Immich v2.7.2's own `assets/duplicates` endpoint could catch this earlier, but it is not currently used.

## 6. Re-running and applying to other Immich instances

```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=...   # admin or per-user key
# Optional: populate work/immich/users.json with label -> UUID map.

# 1. Stage (parallel /faces + downloads, resumable).
python work/immich_stage.py --user peter --workers 8

# 2. End-to-end finalize: copy queue to /mnt/c/, run Windows embed worker,
#    copy the cache back, run cluster_immich.py.
bash work/finalize_immich.sh peter
```

For a different Immich instance, the only configuration is the env vars and the `users.json` sidecar. `cluster_immich.py`'s tunables (matching threshold, clustering threshold, refine gates, `MIN_FACES`) are at the top of the script. To process a *second* user's library, issue a per-user API key in the Immich admin UI for that user, set `IMMICH_API_KEY` to that key, and re-run with their `--user`.
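The matching threshold is the first of those tunables: faces within cosine distance 0.45 of an existing canonical centroid are absorbed, everything else goes to clustering. A minimal sketch of that gate (names are hypothetical; embeddings are assumed to be arcface vectors, which are L2-normalized here before comparison):

```python
import numpy as np

MATCH_THRESHOLD = 0.45  # cos-dist gate used on this run

def match_to_centroids(face, centroids):
    """Return the index of the nearest centroid if within the gate, else None."""
    face = face / np.linalg.norm(face)
    cents = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    dists = 1.0 - cents @ face          # cosine distance to each centroid
    best = int(np.argmin(dists))
    return best if dists[best] <= MATCH_THRESHOLD else None
```

Faces for which this returns `None` form the "unmatched faces to cluster" pool from section 4c.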