From 321fed01ccd04e09003181e6a884a031d6593ecf Mon Sep 17 00:00:00 2001
From: Peter
Date: Sun, 26 Apr 2026 18:14:26 +0200
Subject: [PATCH] Add Immich import pipeline (WSL stage + Windows DML embed +
 cluster)

Three-piece workflow that imports a self-hosted Immich library and emits
new facesets without disturbing existing identity numbering:

- work/immich_stage.py (WSL): pages /search/metadata, parallel-fetches
  /faces?id= per asset, prefilters by face_short>=90 against bbox scaled
  to original-image coords, downloads originals, sha256-dedups against
  nl_full.npz and same-run staged files. 8-worker ThreadPoolExecutor
  doing the full /faces->filter->/original chain per asset; resumable
  via state.json. API URL + key come from IMMICH_URL / IMMICH_API_KEY
  env vars, label->UUID map from work/immich/users.json (gitignored).
- work/embed_worker.py (Windows venv at C:\face_embed_venv): runs
  insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider on
  AMD Radeon Vega via onnxruntime-directml. Produces a cache file in the
  same .npz schema as sort_faces.cmd_embed (loadable via load_cache).
  ~7.5x speedup over CPU end-to-end; embeddings bit-identical to CPU
  (cosine similarity 1.0000 across 8 sample faces).
- work/cluster_immich.py (WSL): mirrors cluster_osrc.py against an
  immich_.npz. Builds existing identity centroids from canonical
  faceset_NNN/ in facesets_swap_ready/, drops matches at <=0.45,
  clusters the rest at 0.55, applies refine gates, hands off to
  cmd_export_swap. Numbers new facesets past the existing maximum.
- work/finalize_immich.sh: chains queue->Windows embed->cache copy->
  cluster_immich, with logging.

The 2026-04-26 run on https://fotos.computerliebe.org (Immich v2.7.2)
processed 53,842 admin-accessible assets, staged 10,261, embedded 19,462
face records on Vega DML in 64.6 min, matched 8,103 (42%) to existing
identities, and emitted 185 new facesets (faceset_026..264 with gaps).
facesets_swap_ready/ went from 31 to 216 substantive facesets.
Important caveat surfaced: /search/metadata's userIds filter is silently
ignored when the API key is bound to a different user, so this run can't
enumerate other users' libraries from the admin key. A per-user API key
would be required for nic.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 README.md                               |  84 ++++-
 docs/analysis/immich-import-pipeline.md | 216 +++++++++++++
 work/cluster_immich.py                  | 340 ++++++++++++++++++++
 work/embed_worker.py                    | 244 ++++++++++++++
 work/finalize_immich.sh                 |  50 +++
 work/immich_stage.py                    | 409 ++++++++++++++++++++++++
 6 files changed, 1340 insertions(+), 3 deletions(-)
 create mode 100644 docs/analysis/immich-import-pipeline.md
 create mode 100644 work/cluster_immich.py
 create mode 100755 work/embed_worker.py
 create mode 100755 work/finalize_immich.sh
 create mode 100644 work/immich_stage.py

diff --git a/README.md b/README.md
index b6a9cf1..336b7c4 100644
--- a/README.md
+++ b/README.md
@@ -204,6 +204,77 @@
 existing identities), this produced 6 new facesets (`faceset_020..025`,
 sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to
 export-swap's tighter `min_face_short=100` gate).
 
+### Importing identities from a self-hosted Immich library
+
+`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py`
+together import an Immich library at scale, with the embed step running on
+a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:
+
+1. **`work/immich_stage.py` (WSL)** — pages every IMAGE asset via
+   `/search/metadata`, fetches each asset's `/faces?id=` to read Immich's
+   own ML-driven bboxes, scales each bbox to original-image coordinates,
+   and prefilters by `face_short ≥ 90`. For survivors it downloads the
+   original, sha256-deduplicates against the canonical `nl_full.npz` and
+   against same-run staged files, and saves to
+   `/mnt/x/src/immich//`. Writes a `queue.json` that the embed
+   worker consumes.
+   8 concurrent worker threads run the full per-asset
+   I/O chain (`/faces` → filter → `/original`) so 8 workers ≈ 8× the
+   serial throughput.
+2. **`work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`)** —
+   loads `insightface.FaceAnalysis(buffalo_l)` with the
+   `DmlExecutionProvider` and runs detection + landmarks + recognition
+   over the queue. Produces a `.npz` cache that's bit-identical in
+   schema to what `sort_faces.py:cmd_embed` writes, so the result is
+   directly loadable by `load_cache()`. The cache already includes the
+   post-`enrich` fields (`landmark_2d_106`, `landmark_3d_68`, `pose`)
+   because FaceAnalysis returns them for free. AMD Vega gives ~7.5×
+   real-pipeline speedup over CPU.
+3. **`work/cluster_immich.py` (WSL)** — mirrors `cluster_osrc.py`'s
+   shape but reads from `immich_.npz`. Builds existing-identity
+   centroids from every canonical `faceset_NNN/` in
+   `facesets_swap_ready/` (skipping era splits and `_thin/`), drops
+   immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55,
+   applies refine gates, numbers new facesets past the existing maximum,
+   and feeds `cmd_export_swap` via a synthetic manifest.
+
+`work/finalize_immich.sh ` chains queue → Windows embed → cache
+copy back → cluster_immich, with logging.
+
+The Immich admin API key + base URL come from environment variables:
+
+```bash
+export IMMICH_URL=https://your-immich.example.com
+export IMMICH_API_KEY=...   # admin or per-user key
+python work/immich_stage.py --user peter --workers 8
+bash work/finalize_immich.sh peter
+```
+
+For the 2026-04-26 run against `https://fotos.computerliebe.org` (Immich
+v2.7.2), with the admin API key:
+
+| step | result |
+|------|--------|
+| stage | 53,842 assets seen, **10,261 staged** (~10 GB), 978 byte-deduped against `nl_full.npz`, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
+| Windows DML embed | 19,462 face records + 1 noface in **64.6 min** (2.6 img/s end-to-end) |
+| matched existing identities | **8,103 of 19,480 (42%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
+| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → **185 emitted** as `faceset_026..264` (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |
+
+**Important caveats for Immich v2.7.2**:
+- The `userIds` filter on `/search/metadata` is **silently ignored** when
+  the API key is bound to a different user. The "import everything the
+  API key can see" semantics are what you actually get; cross-user
+  isolation is enforced server-side.
+- `/server/statistics` reports counts that under-count what
+  `/search/metadata` actually returns (e.g. external library
+  thumbnail-dirs that got indexed because the import path included them).
+  Don't trust the statistics number as a denominator.
+- A meaningful fraction of `originalPath`-based assets are *Immich's own
+  thumbnails* (`/thumbs/.../-preview.jpeg`) — included if
+  the external library's import path covers the thumbs directory and the
+  exclusion patterns don't list `**/thumbs/**`. For our run, 5,563 of
+  10,261 staged were thumbnails. They embed and cluster fine but the
+  resulting faces are lower-resolution.
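The `face_short ≥ 90` prefilter in step 1 can be sketched roughly as below. This is a minimal sketch, not code lifted from `immich_stage.py`: the field names for Immich's per-face response (`boundingBoxX1/Y1/X2/Y2` plus the `imageWidth`/`imageHeight` the bbox was computed against) and both helper names are assumptions here.

```python
def face_short_px(face: dict, orig_w: int, orig_h: int) -> float:
    """Short side of one face bbox, scaled to original-image pixels.

    Immich reports bboxes against the (possibly resized) image its ML
    pipeline saw, so scale by original/reported dimensions first.
    """
    sx = orig_w / face["imageWidth"]
    sy = orig_h / face["imageHeight"]
    w = (face["boundingBoxX2"] - face["boundingBoxX1"]) * sx
    h = (face["boundingBoxY2"] - face["boundingBoxY1"]) * sy
    return min(w, h)


def has_big_face(faces: list[dict], orig_w: int, orig_h: int,
                 min_short: float = 90.0) -> bool:
    """Stage an asset only if at least one face clears min_short px."""
    return any(face_short_px(f, orig_w, orig_h) >= min_short for f in faces)
```

The point of scaling before thresholding is that a face which looks tiny against Immich's ML-resized copy may still be large in the original, and vice versa; the gate must apply in original-image pixels.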
+
 
 ## Key defaults
 
 `refine`:
@@ -248,15 +319,22 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with
 ├─ docs/
 │  └─ analysis/
 │     └─ facesets-downstream-refinement-evaluation.md
-└─ work/ (gitignored except force-tracked .py)
+└─ work/ (gitignored except force-tracked .py / .sh)
    ├─ build_folders.py (hand-sorted-folder orchestration)
    ├─ check_faceset001_age.py (age-split readiness probe)
    ├─ age_split_001.py (age-split orchestration; faceset_001)
    ├─ cluster_osrc.py (mixed-bucket identity discovery)
-   ├─ synthetic_refine_manifest.json (last build_folders.py output)
-   ├─ synthetic_osrc_manifest.json (last cluster_osrc.py output)
+   ├─ immich_stage.py (Immich library staging, parallel)
+   ├─ embed_worker.py (Windows DML embed worker, runs from C:\face_embed_venv\)
+   ├─ cluster_immich.py (Immich identity discovery + export)
+   ├─ finalize_immich.sh (chains queue → embed → cluster)
+   ├─ synthetic_*_manifest.json (per-run synthetic refine manifests)
+   ├─ immich/
+   │  ├─ users.json (label -> userId map; gitignored)
+   │  └─ /{queue,state,aliases}.json (per-user staging artifacts)
    ├─ cache/
    │  ├─ nl_full.npz (canonical cache + duplicates.json)
+   │  ├─ immich_.npz (per-user immich embeddings)
    │  └─ age_split_exif.json (path → EXIF-year cache)
    └─ logs/
       └─ *.log (every long step writes here)
diff --git a/docs/analysis/immich-import-pipeline.md b/docs/analysis/immich-import-pipeline.md
new file mode 100644
index 0000000..fee6a89
--- /dev/null
+++ b/docs/analysis/immich-import-pipeline.md
@@ -0,0 +1,216 @@
+# Importing identities from a self-hosted Immich library
+
+_Run date: 2026-04-26. Target: Immich v2.7.2 at `https://fotos.computerliebe.org`.
+Driver scripts: `work/immich_stage.py`, `work/embed_worker.py`,
+`work/cluster_immich.py`, `work/finalize_immich.sh`._
+
+## 1. Why a split workflow
+
+InsightFace `buffalo_l` on the WSL CPU runs the full detection +
+landmarks + recognition stack at ~3–4 faces/second.
+Re-detecting all 79K Immich photos
+would have taken ~10–28 days. The available AMD Radeon RX Vega is unusable
+under WSL (no `/dev/dri/`, no ROCm), but **DirectML on Windows native**
+runs the same models bit-identically and ~7.5× faster end-to-end. The
+pipeline therefore splits:
+
+- **WSL side** (`/opt/face-sets/`) — orchestration: API listing, download,
+  sha256 dedup, file management, clustering, faceset emission.
+- **Windows side** (`C:\face_embed_venv\`) — the embed step only. A fresh
+  Python 3.12 (installed via `winget install Python.Python.3.12`) with
+  `numpy`, `pillow`, `opencv-python-headless`, `onnxruntime-directml`,
+  `insightface`. Models copied from `/home/peter/.insightface/models/buffalo_l/`
+  to `C:\face_embed_venv\models\buffalo_l\`.
+
+A 30-iteration synthetic benchmark on Vega:
+
+| model | DML | CPU | speedup |
+|-------------|----:|----:|--------:|
+| `det_10g.onnx` (640×640) | 10.0 ms | 183.5 ms | 18.4× |
+| `w600k_r50.onnx` (112×112) | 8.2 ms | 90.5 ms | 11.0× |
+
+End-to-end FaceAnalysis on 5 real Immich-sourced images (excluding the
+first-call DML JIT warmup): ~7.5× speedup post-warmup. Per-face cosine
+similarity DML vs CPU was 1.0000 across all 8 detected faces — DML is
+bit-identical to CPU for arcface inference.
+
+## 2. Architecture
+
+```
+ ┌─────────────────────────────────────────────┐
+ │ WSL /opt/face-sets/work/immich_stage.py     │
+ │ ┌──────────────────────────────────────────┐│
+ │ │ ThreadPoolExecutor.map(_fetch_for_asset, ││
+ │ │                  list_assets(user))      ││
+ │ │  ─ /faces?id= (Immich, parallel x8)      ││
+ │ │  ─ filter face_short >= 90               ││
+ │ │  ─ /assets/.../original (parallel x8)    ││
+ │ └──────────────────────────────────────────┘│
+ │ consumer (main thread):                     │
+ │   sha256 → dedup vs nl_full.npz             │
+ │   save to /mnt/x/src/immich///              │
+ │   append to queue.json                      │
+ └────────────────┬────────────────────────────┘
+                  │
+                  ▼ queue.json (with WSL + Windows paths)
+ ┌─────────────────────────────────────────────┐
+ │ Windows embed_worker.py (C:\face_embed_venv)│
+ │ insightface.FaceAnalysis(                   │
+ │   providers=[DmlExecutionProvider, ...])    │
+ │ per image: detection + landmarks + arcface  │
+ │ emit cache in sort_faces.py:cmd_embed       │
+ │   schema with embeddings + meta + processed │
+ │   + path_aliases + schema=v2                │
+ └────────────────┬────────────────────────────┘
+                  │
+                  ▼ immich_.npz
+ ┌─────────────────────────────────────────────┐
+ │ WSL cluster_immich.py                       │
+ │   build centroids of canonical              │
+ │     faceset_NNN/ in facesets_swap_ready/    │
+ │   drop matches at cos-dist <= 0.45          │
+ │   cluster the rest at 0.55                  │
+ │   refine gates -> synthetic refine_manifest │
+ │   cmd_export_swap -> facesets_swap_ready/   │
+ │   merge top-level manifest                  │
+ └─────────────────────────────────────────────┘
+```
+
+Cache artifacts stay separate (per the architecture choice on this run):
+each user's results live in their own `immich_.npz`. A future
+one-shot merge can fold them into `nl_full.npz` if needed; the existing
+`extend` command would do the right thing once schemas align.
+
+## 3. Path mapping
+
+`/mnt/x/` ↔ `X:\`. Cache stores WSL form (matching `nl_full.npz`'s
+existing convention). `wsl_to_win()` translates for the embed worker
+which runs natively on Windows.
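The translation can be sketched as below — a minimal sketch assuming only the `/mnt/<drive>/…` convention described above; the real `wsl_to_win()` in `work/immich_stage.py` may handle more cases (UNC shares, paths outside `/mnt/`).

```python
from pathlib import PureWindowsPath


def wsl_to_win(path: str) -> str:
    """Map a WSL path like /mnt/x/foo/bar.jpg to X:\\foo\\bar.jpg."""
    parts = path.split("/")  # '/mnt/x/foo' -> ['', 'mnt', 'x', 'foo']
    if len(parts) < 4 or parts[0] or parts[1] != "mnt" or len(parts[2]) != 1:
        raise ValueError(f"not a /mnt/<drive>/... path: {path}")
    # 'x' -> 'X:/' (trailing slash keeps the path rooted, not drive-relative)
    drive = parts[2].upper() + ":/"
    return str(PureWindowsPath(drive, *parts[3:]))
```

Building via `PureWindowsPath` rather than string concatenation gets the backslash separators and the rooted-drive form (`X:\…`, not the drive-relative `X:…`) right for free.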
+
+`work/cluster_immich.py` always uses the canonical `facesets_swap_ready/`
+view to build identity centroids — meaning the comparison is against the
+*current* set of canonical facesets in the swap-ready directory (skipping
+era splits and `_thin/`), not against the older `facesets_full/` snapshot.
+
+## 4. Result of the 2026-04-26 run (peter / admin)
+
+### 4a. Stage
+
+```
+total_assets_seen: 53842
+staged_count: 10261 (~10 GB on /mnt/x/)
+deduped_against_existing: 978 (sha256 in nl_full.npz already)
+deduped_against_staged: 2976 (internal byte-dupes inside Immich)
+skipped_no_big_face: 9539 (Immich detected only sub-90px faces)
+skipped_no_faces: 29390 (Immich detected zero faces)
+skipped_download_error: 698 (transient DNS / TLS, not seen-marked)
+elapsed: ~70 min (6.4 assets/s end-to-end at 8 workers)
+```
+
+The 698 transient errors are recoverable on a re-run because
+`immich_stage.py` does not add them to the `seen` set. Each transient
+asset would be retried.
+
+### 4b. Embed (Windows DML)
+
+```
+queue: 10261 entries
+new face records: 19462
+new noface records: 1
+load errors: 125 (likely HEIC / unreadable)
+elapsed: 3878.0s (64.6 min, 2.6 img/s end-to-end)
+```
+
+The 2.6 img/s end-to-end includes CIFS-share image load, image decode,
+DML inference (~50 ms/face), and JSON / NPZ flushing. Pure DML inference
+is faster; the rest of the pipeline dominates at scale.
+
+### 4c. Cluster
+
+```
+existing canonical centroids: 25
+faces already covered (cos-dist <= 0.45): 8103/19480 (42%)
+  faceset_001: 1856
+  faceset_002: 2666
+  faceset_003: 670
+  faceset_004: 48
+  faceset_005: 40
+  ... (smaller hits to the remaining 20)
+unmatched faces to cluster: 11377
+clusters at threshold 0.55: 2534 (top sizes [469, 444, 342, 338, 262, ...])
+survived refine gates: 239
+emitted as new facesets: 185 (54 dropped by export-swap's 0.45 outlier)
+```
+
+Top-level `facesets_swap_ready/manifest.json` after this run: **216
+facesets** (up from 31; ~7× growth) + 68 thin_eras under `_thin/`.
+
+## 5. Surprises and caveats
+
+### 5a. `/search/metadata`'s `userIds` filter is silently ignored (Immich v2.7.2)
+
+When the admin API key is used, passing `userIds=[]`
+returns admin's own assets, not the other user's. The filter is
+silently dropped. Verified by sampling 200 returned items and
+confirming `ownerId` was admin for all of them.
+
+To process another user's library, **a separate API key issued by that
+user is required** — the admin key cannot enumerate cross-user
+libraries through any documented endpoint we tried. `/timeline/buckets`
+with a `userId` query parameter returns
+`Not found or no timeline.read access`.
+
+### 5b. `/server/statistics` undercounts what the search returns
+
+`/server/statistics` reported admin = 53,842 photos. Our
+`/search/metadata` paginated through... **53,842** top-level. So the
+header agrees with the body in this case. But `/server/statistics` does
+NOT count items that live under external libraries' import paths —
+yet `/search/metadata` does include them. For this Immich, two external
+libraries (`/mnt/media/photos` and `/mnt/media/omv_photos`) are
+configured but `/libraries` reports `assetCount=0` for both. Yet 80% of
+our staged paths come from those library import paths. Don't trust
+statistics-vs-search consistency.
+
+### 5c. Indexed Immich thumbnails masquerading as assets
+
+5,563 of our 10,261 staged paths are `/thumbs/.../-preview.jpeg`
+— Immich's own internally-generated thumbnails got indexed because the
+external library import path included the thumbs subdirectory and the
+exclusion patterns didn't list `**/thumbs/**`. They embed and cluster
+fine but produce lower-resolution face records. The fix on the Immich
+side is adding `**/thumbs/**` to the exclusion patterns.
+
+### 5d. Internal byte-duplicates (2,976)
+
+Many Immich assets are byte-identical to other Immich assets — typically
+because the same photo was uploaded both from a phone and from a
+synced cloud folder. sha256 dedup catches all of these on the second
+download (we still pay the bandwidth, but skip the disk write and
+embed work). With Immich v2.7.2's own `assets/duplicates` endpoint we
+could catch this earlier, but it's not currently used.
+
+## 6. Re-running and applying to other Immich instances
+
+```bash
+export IMMICH_URL=https://your-immich.example.com
+export IMMICH_API_KEY=...   # admin or per-user key
+
+# Optional: populate work/immich/users.json with label -> UUID map.
+
+# 1. Stage (parallel /faces + downloads, resumable).
+python work/immich_stage.py --user peter --workers 8
+
+# 2. End-to-end finalize: copy queue to /mnt/c/, run Windows embed worker,
+#    copy the cache back, run cluster_immich.py.
+bash work/finalize_immich.sh peter
+```
+
+For a different Immich instance, the only configuration is the env vars
+and the `users.json` sidecar. `cluster_immich.py`'s tunables (matching
+threshold, clustering threshold, refine gates, MIN_FACES) are at the
+top of the script.
+
+To process a *second* user's library, issue a per-user API key in the
+Immich admin UI for that user, set `IMMICH_API_KEY` to that key, and
+re-run with their `--user