Add Immich import pipeline (WSL stage + Windows DML embed + cluster)

Three-piece workflow that imports a self-hosted Immich library and emits
new facesets without disturbing existing identity numbering:

- work/immich_stage.py (WSL): pages /search/metadata, parallel-fetches
  /faces?id= per asset, prefilters by face_short>=90 against bbox scaled
  to original-image coords, downloads originals, sha256-dedups against
  nl_full.npz and same-run staged files. 8-worker ThreadPoolExecutor
  doing the full /faces->filter->/original chain per asset; resumable
  via state.json. API URL + key come from IMMICH_URL / IMMICH_API_KEY
  env vars, label->UUID map from work/immich/users.json (gitignored).
- work/embed_worker.py (Windows venv at C:\face_embed_venv): runs
  insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider on
  AMD Radeon Vega via onnxruntime-directml. Produces a cache file in the
  same .npz schema as sort_faces.cmd_embed (loadable via load_cache).
  ~7.5x speedup over CPU end-to-end; embeddings match CPU output
  (cosine similarity 1.0000 across 8 sample faces).
- work/cluster_immich.py (WSL): mirrors cluster_osrc.py against an
  immich_<user>.npz. Builds existing identity centroids from canonical
  faceset_NNN/ in facesets_swap_ready/, drops matches at <=0.45,
  clusters the rest at 0.55, applies refine gates, hands off to
  cmd_export_swap. Numbers new facesets past the existing maximum.
- work/finalize_immich.sh: chains queue->Windows embed->cache copy->
  cluster_immich, with logging.

The 2026-04-26 run on https://fotos.computerliebe.org (Immich v2.7.2)
processed 53,842 admin-accessible assets, staged 10,261, embedded 19,462
face records on Vega DML in 64.6 min, matched 8,103 (42%) to existing
identities, and emitted 185 new facesets (faceset_026..264 with gaps).
facesets_swap_ready/ went from 31 to 216 substantive facesets.

Important caveat surfaced: /search/metadata's userIds filter is silently
ignored when the API key is bound to a different user, so this run can't
enumerate other users' libraries from the admin key. A per-user API key
would be required for nic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/analysis/immich-import-pipeline.md
# Importing identities from a self-hosted Immich library

_Run date: 2026-04-26. Target: Immich v2.7.2 at `https://fotos.computerliebe.org`.
Driver scripts: `work/immich_stage.py`, `work/embed_worker.py`,
`work/cluster_immich.py`, `work/finalize_immich.sh`._

## 1. Why a split workflow

InsightFace `buffalo_l` on the WSL CPU runs the full detection + landmarks +
recognition stack at ~3–4 faces/second. Re-detecting all 79K Immich photos
would have taken ~10–28 days. The available AMD Radeon RX Vega is unusable
under WSL (no `/dev/dri/`, no ROCm), but **DirectML on native Windows**
runs the same models with matching output and ~7.5× faster end-to-end. The
pipeline is therefore split in two:

- **WSL side** (`/opt/face-sets/`): orchestration, i.e. API listing, download,
  sha256 dedup, file management, clustering, faceset emission.
- **Windows side** (`C:\face_embed_venv\`): the embed step only. A fresh
  Python 3.12 (installed via `winget install Python.Python.3.12`) with
  `numpy`, `pillow`, `opencv-python-headless`, `onnxruntime-directml`,
  `insightface`. Models copied from `/home/peter/.insightface/models/buffalo_l/`
  to `C:\face_embed_venv\models\buffalo_l\`.

A 30-iteration synthetic benchmark on Vega:

| model                      |     DML |      CPU | speedup |
|----------------------------|--------:|---------:|--------:|
| `det_10g.onnx` (640×640)   | 10.0 ms | 183.5 ms |   18.4× |
| `w600k_r50.onnx` (112×112) |  8.2 ms |  90.5 ms |   11.0× |

End-to-end FaceAnalysis on 5 real Immich-sourced images (excluding the
first-call DML JIT warmup): ~7.5× speedup post-warmup. Per-face cosine
similarity DML vs CPU was 1.0000 across all 8 detected faces, so DML
output matches CPU for ArcFace inference to four decimal places.
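
The DML-vs-CPU comparison can be sketched as a per-face cosine similarity over paired embeddings. This is a minimal stdlib-only illustration, not the code used for the run: the function names are hypothetical, and it assumes both runs return faces in the same order so that embeddings can be paired by index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def min_pairwise_similarity(cpu_embeddings, dml_embeddings):
    """Worst-case similarity across faces paired by index (assumed same order)."""
    if len(cpu_embeddings) != len(dml_embeddings):
        raise ValueError("CPU and DML runs detected a different number of faces")
    return min(
        cosine_similarity(c, d) for c, d in zip(cpu_embeddings, dml_embeddings)
    )
```

Reporting the minimum rather than the mean keeps a single diverging face from being averaged away. A value that prints as 1.0000 shows agreement to the printed precision; byte-level equality would need a separate array comparison.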

## 2. Architecture

```
┌─────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/immich_stage.py     │
│ ┌──────────────────────────────────────────┐│
│ │ ThreadPoolExecutor.map(_fetch_for_asset, ││
│ │                        list_assets(user))││
│ │ ─ /faces?id= (Immich, parallel x8)       ││
│ │ ─ filter face_short >= 90                ││
│ │ ─ /assets/.../original (parallel x8)     ││
│ └──────────────────────────────────────────┘│
│ consumer (main thread):                     │
│   sha256 → dedup vs nl_full.npz             │
│   save to /mnt/x/src/immich/<user>/<rel>/   │
│   append to queue.json                      │
└────────────────┬────────────────────────────┘
                 │
                 ▼ queue.json (with WSL + Windows paths)
┌─────────────────────────────────────────────┐
│ Windows embed_worker.py                     │
│   (C:\face_embed_venv)                      │
│ insightface.FaceAnalysis(                   │
│   providers=[DmlExecutionProvider, ...])    │
│ per image: detection + landmarks + arcface  │
│ emit cache in sort_faces.py:cmd_embed       │
│   schema with embeddings + meta + processed │
│   + path_aliases + schema=v2                │
└────────────────┬────────────────────────────┘
                 │
                 ▼ immich_<user>.npz
┌─────────────────────────────────────────────┐
│ WSL cluster_immich.py                       │
│ build centroids of canonical                │
│   faceset_NNN/ in facesets_swap_ready/      │
│ drop matches at cos-dist <= 0.45            │
│ cluster the rest at 0.55                    │
│ refine gates -> synthetic refine_manifest   │
│ cmd_export_swap -> facesets_swap_ready/     │
│ merge top-level manifest                    │
└─────────────────────────────────────────────┘
```
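
The `face_short >= 90` prefilter in the stage box can be sketched as follows. This is an illustration under assumptions, not the code in `immich_stage.py`: it assumes each `/faces` record carries `boundingBoxX1/Y1/X2/Y2` plus the `imageWidth`/`imageHeight` the box was measured against, and that the original's pixel dimensions are known separately. The helper names are hypothetical.

```python
MIN_FACE_SHORT = 90  # pixels, measured in original-image coordinates


def face_short_side(face, orig_w, orig_h):
    """Shorter bbox side after scaling from detection size to original size."""
    # Field names below are assumed, not confirmed by this document.
    scale_x = orig_w / face["imageWidth"]
    scale_y = orig_h / face["imageHeight"]
    width = (face["boundingBoxX2"] - face["boundingBoxX1"]) * scale_x
    height = (face["boundingBoxY2"] - face["boundingBoxY1"]) * scale_y
    return min(width, height)


def has_big_face(faces, orig_w, orig_h, min_short=MIN_FACE_SHORT):
    """True if at least one detected face is large enough to be worth staging."""
    return any(face_short_side(f, orig_w, orig_h) >= min_short for f in faces)
```

Filtering before the `/original` download is what lets assets with no faces or only small faces be skipped without paying for the transfer.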

Cache artifacts stay separate (per the architecture choice on this run):
each user's results live in their own `immich_<user>.npz`. A future
one-shot merge can fold them into `nl_full.npz` if needed; the existing
`extend` command would do the right thing once schemas align.

## 3. Path mapping

`/mnt/x/` ↔ `X:\`. The cache stores paths in WSL form (matching
`nl_full.npz`'s existing convention). `wsl_to_win()` translates them for
the embed worker, which runs natively on Windows.
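
A translation of this kind can be sketched as below. The function name `wsl_to_win` comes from the text above, but the body is an assumption about its behaviour (any `/mnt/<letter>/...` path maps to `<LETTER>:\...`), not the pipeline's actual implementation.

```python
from pathlib import PurePosixPath, PureWindowsPath


def wsl_to_win(path: str) -> str:
    """Map a WSL drive-mount path such as /mnt/x/a/b to X:\\a\\b."""
    parts = PurePosixPath(path).parts
    is_drive_mount = (
        len(parts) >= 3
        and parts[0] == "/"
        and parts[1] == "mnt"
        and len(parts[2]) == 1
        and parts[2].isalpha()
    )
    if not is_drive_mount:
        raise ValueError(f"not a /mnt/<drive>/ path: {path!r}")
    drive_root = parts[2].upper() + ":\\"
    return str(PureWindowsPath(drive_root, *parts[3:]))
```

Using the pure path classes keeps the conversion runnable on either side of the WSL boundary, since neither class touches the filesystem.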

`work/cluster_immich.py` always uses the canonical `facesets_swap_ready/`
view to build identity centroids. The comparison is therefore against the
*current* set of canonical facesets in the swap-ready directory (skipping
era splits and `_thin/`), not against the older `facesets_full/` snapshot.
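
Matching a new face against the existing identity centroids can be sketched as a nearest-centroid lookup with a cosine-distance cutoff. The 0.45 threshold is the one used in this run; everything else is an assumption made for illustration, in particular the centroid definition (mean of L2-normalised embeddings, renormalised) and the function names. The real script presumably works on NumPy arrays rather than lists.

```python
import math

MATCH_MAX_DIST = 0.45  # cosine distance at or below which a face counts as covered


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(embeddings):
    """Unit-length mean of unit-length embeddings (assumed centroid definition)."""
    unit = [normalize(e) for e in embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def match_identity(embedding, centroids, max_dist=MATCH_MAX_DIST):
    """Return (name, distance) of the nearest centroid, or (None, distance)."""
    e = normalize(embedding)
    best_name, best_dist = None, float("inf")
    for name, c in centroids.items():
        dist = 1.0 - sum(x * y for x, y in zip(e, c))
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist <= max_dist:
        return best_name, best_dist
    return None, best_dist
```

Faces for which the lookup returns `None` are the ones that would go on to the clustering step.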

## 4. Result of the 2026-04-26 run (peter / admin)

### 4a. Stage

```
total_assets_seen:         53842
staged_count:              10261   (~10 GB on /mnt/x/)
deduped_against_existing:    978   (sha256 in nl_full.npz already)
deduped_against_staged:     2976   (internal byte-dupes inside Immich)
skipped_no_big_face:        9539   (Immich detected only sub-90px faces)
skipped_no_faces:          29390   (Immich detected zero faces)
skipped_download_error:      698   (transient DNS / TLS, not seen-marked)
elapsed:                   ~70 min (6.4 assets/s end-to-end at 8 workers)
```

The 698 transient errors are recoverable on a re-run because
`immich_stage.py` does not add them to the `seen` set, so each failed
asset is retried on the next run.
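
The retry behaviour follows from only marking an asset as seen after it has been fetched successfully. A minimal sketch of that pattern, with a hypothetical `fetch` callable standing in for the `/faces` + `/original` chain and an in-memory set standing in for the persisted state:

```python
def process_assets(asset_ids, fetch, seen):
    """Fetch unseen assets; only successes enter `seen`, failures are returned."""
    failed = []
    for asset_id in asset_ids:
        if asset_id in seen:
            continue  # handled in a previous run
        try:
            fetch(asset_id)
        except OSError:
            # DNS, TLS and connection errors are OSError subclasses in the
            # standard library; leaving the id out of `seen` means the next
            # run will try it again.
            failed.append(asset_id)
            continue
        seen.add(asset_id)
    return failed
```

Calling the function a second time with the same `seen` set skips everything that already succeeded and revisits only the failures.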

### 4b. Embed (Windows DML)

```
queue:              10261 entries
new face records:   19462
new noface records: 1
load errors:        125     (likely HEIC / unreadable)
elapsed:            3878.0s (64.6 min, 2.6 img/s end-to-end)
```

The 2.6 img/s end-to-end includes CIFS-share image load, image decode,
DML inference (~50 ms/face), and JSON / NPZ flushing. Pure DML inference
is faster; the rest of the pipeline dominates at scale.

### 4c. Cluster

```
existing canonical centroids:             25
faces already covered (cos-dist <= 0.45): 8103/19480 (42%)
    faceset_001: 1856
    faceset_002: 2666
    faceset_003: 670
    faceset_004: 48
    faceset_005: 40
    ... (smaller hits to the remaining 20)
unmatched faces to cluster:               11377
clusters at threshold 0.55:               2534 (top sizes [469, 444, 342, 338, 262, ...])
survived refine gates:                    239
emitted as new facesets:                  185  (54 dropped by export-swap's 0.45 outlier)
```

Top-level `facesets_swap_ready/manifest.json` after this run: **216
facesets** (up from 31; ~7× growth) + 68 thin_eras under `_thin/`.
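
The commit message notes that new facesets are numbered past the existing maximum. One way to do that is sketched below; the helper names and the exact directory-name pattern are assumptions, not taken from `cluster_immich.py`.

```python
import re

FACESET_RE = re.compile(r"^faceset_(\d+)$")


def next_faceset_number(existing_names):
    """One past the highest faceset_NNN number among the given directory names."""
    numbers = [
        int(m.group(1))
        for name in existing_names
        if (m := FACESET_RE.match(name))
    ]
    return max(numbers, default=0) + 1


def assign_names(existing_names, n_new):
    """Consecutive zero-padded names for n_new clusters, starting past the max."""
    start = next_faceset_number(existing_names)
    return [f"faceset_{i:03d}" for i in range(start, start + n_new)]
```

With 25 existing canonical facesets and 239 clusters surviving the refine gates, this scheme would produce `faceset_026` through `faceset_264`. That is consistent with the range reported for this run if the 54 clusters dropped at export had already been numbered, which would explain the gaps.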

## 5. Surprises and caveats

### 5a. `/search/metadata`'s `userIds` filter is silently ignored (Immich v2.7.2)

When the admin API key is used, passing `userIds=[<other-user-uuid>]`
returns admin's own assets, not the other user's. The filter is
silently dropped. This was verified by sampling 200 returned items and
confirming that `ownerId` was admin for all of them.

To process another user's library, **a separate API key issued by that
user is required**: the admin key cannot enumerate cross-user
libraries through any documented endpoint we tried. `/timeline/buckets`
with a `userId` query parameter returns
`Not found or no timeline.read access`.
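
The sampling check described above can be sketched as a tally of `ownerId` over the returned items. The helper names are hypothetical, and the items are assumed to be plain dicts decoded from the `/search/metadata` response.

```python
from collections import Counter


def owner_breakdown(items):
    """Count returned assets per ownerId."""
    return Counter(item.get("ownerId") for item in items)


def filter_was_honoured(items, requested_user_id):
    """True only if every returned asset belongs to the requested user."""
    owners = owner_breakdown(items)
    return bool(owners) and set(owners) == {requested_user_id}
```

Running a check like this on the first page of results before staging anything makes a silently dropped filter visible early, instead of after a full download pass.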

### 5b. `/server/statistics` undercounts what the search returns

`/server/statistics` reported 53,842 photos for admin, and paginating
`/search/metadata` returned the same **53,842** top-level assets, so the
two totals agree in this case. That agreement should not be relied on:
`/server/statistics` does NOT count items that live under external
libraries' import paths, yet `/search/metadata` does include them. On
this instance, two external libraries (`/mnt/media/photos` and
`/mnt/media/omv_photos`) are configured, but `/libraries` reports
`assetCount=0` for both, even though 80% of our staged paths come from
those library import paths. Don't trust statistics-vs-search consistency.

### 5c. Indexed Immich thumbnails masquerading as assets

5,563 of our 10,261 staged paths are `<library>/thumbs/.../-preview.jpeg`.
These are Immich's own internally generated thumbnails, which were indexed
because the external library import path included the thumbs subdirectory
and the exclusion patterns didn't list `**/thumbs/**`. They embed and
cluster fine but produce lower-resolution face records. The fix on the
Immich side is adding `**/thumbs/**` to the exclusion patterns.
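
Until the exclusion pattern is in place on the Immich side, such paths could also be separated out on the staging side. A minimal sketch, assuming that any path with a `thumbs` directory component is a generated thumbnail; the helper names and the example paths are made up for illustration.

```python
from pathlib import PurePosixPath


def is_under_thumbs(path):
    """True if any directory component of the path is named 'thumbs'."""
    return "thumbs" in PurePosixPath(path).parent.parts


def split_staged(paths):
    """Partition staged paths into (originals, generated thumbnails)."""
    originals = [p for p in paths if not is_under_thumbs(p)]
    thumbnails = [p for p in paths if is_under_thumbs(p)]
    return originals, thumbnails


# Hypothetical example paths, not taken from the actual run.
originals, thumbnails = split_staged([
    "/mnt/media/photos/2019/IMG_0001.jpg",
    "/mnt/media/photos/thumbs/ab/cd/abcd-preview.jpeg",
])
```

Checking only the parent components mirrors the `**/thumbs/**` glob: a file that merely has "thumbs" in its own name is left alone.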

### 5d. Internal byte-duplicates (2,976)

Many Immich assets are byte-identical to other Immich assets, typically
because the same photo was uploaded both from a phone and from a
synced cloud folder. sha256 dedup catches all of these on the second
download (we still pay the bandwidth, but skip the disk write and
embed work). Immich v2.7.2's own `assets/duplicates` endpoint could
catch this earlier, but the pipeline does not currently use it.
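
The two dedup outcomes counted in section 4a can be sketched with `hashlib`. The outcome labels mirror the stage counters; how the known hashes are loaded from `nl_full.npz` is not shown, so `existing_hashes` is simply assumed to be a set of hex digests, and the function names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the downloaded bytes."""
    return hashlib.sha256(data).hexdigest()


def classify_download(data, existing_hashes, staged_hashes):
    """Decide whether freshly downloaded bytes should be written to disk."""
    digest = sha256_hex(data)
    if digest in existing_hashes:
        return "deduped_against_existing"  # already present in the main cache
    if digest in staged_hashes:
        return "deduped_against_staged"    # same bytes seen earlier in this run
    staged_hashes.add(digest)
    return "staged"
```

Because the hash is computed on the downloaded bytes, the check can only run after the transfer, which is why the bandwidth is still paid for duplicates.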

## 6. Re-running and applying to other Immich instances

```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=...   # admin or per-user key

# Optional: populate work/immich/users.json with label -> UUID map.

# 1. Stage (parallel /faces + downloads, resumable).
python work/immich_stage.py --user peter --workers 8

# 2. End-to-end finalize: copy queue to /mnt/c/, run Windows embed worker,
#    copy the cache back, run cluster_immich.py.
bash work/finalize_immich.sh peter
```

For a different Immich instance, the only configuration is the env vars
and the `users.json` sidecar. `cluster_immich.py`'s tunables (matching
threshold, clustering threshold, refine gates, MIN_FACES) are at the
top of the script.

To process a *second* user's library, issue a per-user API key in the
Immich admin UI for that user, set `IMMICH_API_KEY` to that key, and
re-run with their `--user <label>`. The admin key cannot impersonate
other users via the search API.