face-sets/docs/analysis/immich-import-pipeline.md
Peter 321fed01cc Add Immich import pipeline (WSL stage + Windows DML embed + cluster)
Three-piece workflow that imports a self-hosted Immich library and emits
new facesets without disturbing existing identity numbering:

- work/immich_stage.py (WSL): pages /search/metadata, parallel-fetches
  /faces?id= per asset, prefilters by face_short>=90 against bbox scaled
  to original-image coords, downloads originals, sha256-dedups against
  nl_full.npz and same-run staged files. 8-worker ThreadPoolExecutor
  doing the full /faces->filter->/original chain per asset; resumable
  via state.json. API URL + key come from IMMICH_URL / IMMICH_API_KEY
  env vars, label->UUID map from work/immich/users.json (gitignored).
- work/embed_worker.py (Windows venv at C:\face_embed_venv): runs
  insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider on
  AMD Radeon Vega via onnxruntime-directml. Produces a cache file in
  the same .npz schema as sort_faces.cmd_embed (loadable via
  load_cache). ~7.5x speedup over CPU end-to-end; embeddings bit-
  identical to CPU (cosine similarity 1.0000 across 8 sample faces).
- work/cluster_immich.py (WSL): mirrors cluster_osrc.py against an
  immich_<user>.npz. Builds existing identity centroids from canonical
  faceset_NNN/ in facesets_swap_ready/, drops matches at <=0.45,
  clusters the rest at 0.55, applies refine gates, hands off to
  cmd_export_swap. Numbers new facesets past the existing maximum.
- work/finalize_immich.sh: chains queue->Windows embed->cache copy->
  cluster_immich, with logging.

The 2026-04-26 run on https://fotos.computerliebe.org (Immich v2.7.2)
processed 53,842 admin-accessible assets, staged 10,261, embedded
19,462 face records on Vega DML in 64.6 min, matched 8,103 (42%) to
existing identities, and emitted 185 new facesets (faceset_026..264
with gaps). facesets_swap_ready/ went from 31 to 216 substantive
facesets.

Important caveat surfaced: /search/metadata's userIds filter is
silently ignored when the API key is bound to a different user, so
this run can't enumerate other users' libraries from the admin key.
A per-user API key would be required for nic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 18:14:26 +02:00

# Importing identities from a self-hosted Immich library
_Run date: 2026-04-26. Target: Immich v2.7.2 at `https://fotos.computerliebe.org`.
Driver scripts: `work/immich_stage.py`, `work/embed_worker.py`,
`work/cluster_immich.py`, `work/finalize_immich.sh`._
## 1. Why a split workflow
InsightFace `buffalo_l` on the WSL CPU runs the full detection + landmarks +
recognition stack at ~34 faces/second. Re-detecting all 79K Immich photos
would have taken ~1028 days. The available AMD Radeon RX Vega is unusable
under WSL (no `/dev/dri/`, no ROCm), but **DirectML on Windows native**
runs the same models bit-identically and ~7.5× faster end-to-end. The
pipeline therefore splits:
- **WSL side** (`/opt/face-sets/`) — orchestration: API listing, download,
sha256 dedup, file management, clustering, faceset emission.
- **Windows side** (`C:\face_embed_venv\`) — the embed step only. A fresh
Python 3.12 (installed via `winget install Python.Python.3.12`) with
`numpy`, `pillow`, `opencv-python-headless`, `onnxruntime-directml`,
`insightface`. Models copied from `/home/peter/.insightface/models/buffalo_l/`
to `C:\face_embed_venv\models\buffalo_l\`.
A 30-iteration synthetic benchmark on Vega:

| model                      |     DML |      CPU | speedup |
|----------------------------|--------:|---------:|--------:|
| `det_10g.onnx` (640×640)   | 10.0 ms | 183.5 ms |   18.4× |
| `w600k_r50.onnx` (112×112) |  8.2 ms |  90.5 ms |   11.0× |

End-to-end FaceAnalysis on 5 real Immich-sourced images, excluding the
first-call DML JIT warmup, showed the ~7.5× speedup. Per-face cosine
similarity between DML and CPU was 1.0000 across all 8 detected faces —
DML is bit-identical to CPU for ArcFace inference.
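
The comparison itself is a few lines of onnxruntime. A minimal sketch, assuming the
model path under `C:\face_embed_venv\models\buffalo_l\` and a synthetic input (the real
benchmark also timed `det_10g.onnx` at 640×640):
```python
import time
import numpy as np
import onnxruntime as ort

def bench(model_path, shape, provider, iters=30):
    """Average per-run latency of one ONNX model under a single execution provider."""
    sess = ort.InferenceSession(model_path, providers=[provider])
    inp = sess.get_inputs()[0].name
    x = np.random.rand(*shape).astype(np.float32)
    sess.run(None, {inp: x})                      # warmup -- the first DML call pays the JIT cost
    t0 = time.perf_counter()
    for _ in range(iters):
        out = sess.run(None, {inp: x})
    return (time.perf_counter() - t0) / iters, out[0]

model = r"C:\face_embed_venv\models\buffalo_l\w600k_r50.onnx"
dml_t, dml_emb = bench(model, (1, 3, 112, 112), "DmlExecutionProvider")
cpu_t, cpu_emb = bench(model, (1, 3, 112, 112), "CPUExecutionProvider")
print(f"DML {dml_t*1e3:.1f} ms  CPU {cpu_t*1e3:.1f} ms  speedup {cpu_t/dml_t:.1f}x")

a, b = dml_emb.ravel(), cpu_emb.ravel()
print("cosine DML vs CPU:", float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```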
## 2. Architecture
```
┌───────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/immich_stage.py       │
│ ┌───────────────────────────────────────────┐ │
│ │ ThreadPoolExecutor.map(_fetch_for_asset,  │ │
│ │                        list_assets(user)) │ │
│ │ ─ /faces?id= (Immich, parallel x8)        │ │
│ │ ─ filter face_short >= 90                 │ │
│ │ ─ /assets/.../original (parallel x8)      │ │
│ └───────────────────────────────────────────┘ │
│ consumer (main thread):                       │
│   sha256 → dedup vs nl_full.npz               │
│   save to /mnt/x/src/immich/<user>/<rel>/     │
│   append to queue.json                        │
└───────────────────────┬───────────────────────┘
                        ▼  queue.json (with WSL + Windows paths)
┌───────────────────────────────────────────────┐
│ Windows embed_worker.py (C:\face_embed_venv)  │
│ insightface.FaceAnalysis(                     │
│     providers=[DmlExecutionProvider, ...])    │
│ per image: detection + landmarks + arcface    │
│ emit cache in sort_faces.py:cmd_embed         │
│ schema with embeddings + meta + processed     │
│   + path_aliases + schema=v2                  │
└───────────────────────┬───────────────────────┘
                        ▼  immich_<user>.npz
┌───────────────────────────────────────────────┐
│ WSL cluster_immich.py                         │
│   build centroids of canonical                │
│     faceset_NNN/ in facesets_swap_ready/      │
│   drop matches at cos-dist <= 0.45            │
│   cluster the rest at 0.55                    │
│   refine gates -> synthetic refine_manifest   │
│   cmd_export_swap -> facesets_swap_ready/     │
│   merge top-level manifest                    │
└───────────────────────────────────────────────┘
```
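
In code, the stage step reduces to the producer/consumer shape above. A minimal sketch,
with hypothetical helpers (`list_assets`, `fetch_faces`, `download_original`,
`load_existing_sha256s`, `save_staged`) standing in for the real API wrappers, and the
face-size prefilter simplified to a bbox short-side check:
```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor

MIN_FACE_SHORT = 90  # smallest face bbox short side (in original-image pixels) worth staging

def _fetch_for_asset(asset):
    """Producer (worker thread): /faces -> prefilter -> /original for one asset."""
    faces = fetch_faces(asset["id"])                      # hypothetical wrapper around /faces?id=
    if not any(min(f["width"], f["height"]) >= MIN_FACE_SHORT for f in faces):
        return None
    return asset, download_original(asset["id"])          # raw image bytes

seen = load_existing_sha256s("nl_full.npz")               # hypothetical: sha256 dedup baseline
queue = []
with ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(_fetch_for_asset, list_assets("peter")):
        if result is None:
            continue                                      # no big-enough face, or no faces at all
        asset, blob = result
        digest = hashlib.sha256(blob).hexdigest()
        if digest in seen:
            continue                                      # byte-identical to an existing/staged file
        seen.add(digest)                                  # consumer runs on the main thread
        queue.append({"wsl_path": save_staged(asset, blob), "sha256": digest})

with open("queue.json", "w") as fh:
    json.dump(queue, fh, indent=2)
```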
Cache artifacts stay separate (per the architecture choice on this run):
each user's results live in their own `immich_<user>.npz`. A future
one-shot merge can fold them into `nl_full.npz` if needed; the existing
`extend` command would do the right thing once schemas align.
## 3. Path mapping
`/mnt/x/` on the WSL side corresponds to `X:\` on Windows. The cache
stores the WSL form (matching `nl_full.npz`'s existing convention);
`wsl_to_win()` translates for the embed worker, which runs natively on
Windows.
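
The translation is mechanical. A sketch of what a `wsl_to_win()` helper amounts to (the
actual implementation in the scripts may differ in details):
```python
def wsl_to_win(path: str) -> str:
    """Map a single-letter WSL mount path like /mnt/x/... to its Windows form X:\\..."""
    if not (path.startswith("/mnt/") and len(path) > 6 and path[6] == "/"):
        raise ValueError(f"not a single-letter WSL mount path: {path}")
    drive, rest = path[5], path[6:]
    return drive.upper() + ":" + rest.replace("/", "\\")

# wsl_to_win("/mnt/x/src/immich/peter/a.jpg") == r"X:\src\immich\peter\a.jpg"
```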
`work/cluster_immich.py` always uses the canonical `facesets_swap_ready/`
view to build identity centroids — meaning the comparison is against the
*current* set of canonical facesets in the swap-ready directory (skipping
era splits and `_thin/`), not against the older `facesets_full/` snapshot.
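
A sketch of that centroid step, assuming a hypothetical `load_faceset_embeddings()` that
returns the (n, 512) embedding matrix for one faceset directory; the strict
`faceset_NNN` name match is what skips the era splits and `_thin/`:
```python
import re
from pathlib import Path
import numpy as np

SWAP_READY = Path("facesets_swap_ready")

def load_canonical_centroids():
    """One L2-normalised mean embedding per canonical faceset_NNN/ directory."""
    centroids = {}
    for d in sorted(SWAP_READY.iterdir()):
        if not (d.is_dir() and re.fullmatch(r"faceset_\d{3}", d.name)):
            continue                              # skips _thin/ and era-split variants
        embs = load_faceset_embeddings(d)         # hypothetical, shape (n, 512)
        c = embs.mean(axis=0)
        centroids[d.name] = c / np.linalg.norm(c)
    return centroids
```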
## 4. Result of the 2026-04-26 run (peter / admin)
### 4a. Stage
```
total_assets_seen: 53842
staged_count: 10261 (~10 GB on /mnt/x/)
deduped_against_existing: 978 (sha256 in nl_full.npz already)
deduped_against_staged: 2976 (internal byte-dupes inside Immich)
skipped_no_big_face: 9539 (Immich detected only sub-90px faces)
skipped_no_faces: 29390 (Immich detected zero faces)
skipped_download_error: 698 (transient DNS / TLS, not seen-marked)
elapsed: ~70 min (6.4 assets/s end-to-end at 8 workers)
```
The 698 transient errors are recoverable: `immich_stage.py` does not add
them to the `seen` set, so each of those assets is retried on the next
run.
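
A sketch of the resumption rule (the `state.json` field names here are assumptions):
only assets that reach a terminal outcome get marked seen, so transient failures fall
through and are retried on the next invocation.
```python
import json
from pathlib import Path

STATE = Path("work/immich/state.json")

def load_seen() -> set[str]:
    return set(json.loads(STATE.read_text())["seen"]) if STATE.exists() else set()

def mark_seen(seen: set[str], asset_id: str) -> None:
    seen.add(asset_id)
    STATE.write_text(json.dumps({"seen": sorted(seen)}))

seen = load_seen()
for asset in list_assets("peter"):          # hypothetical pager over /search/metadata
    if asset["id"] in seen:
        continue
    try:
        stage(asset)                        # hypothetical: /faces -> filter -> /original -> disk
    except (OSError, ConnectionError):
        continue                            # transient DNS/TLS error: NOT marked seen, retried later
    mark_seen(seen, asset["id"])            # staged, deduped, or skipped for a permanent reason
```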
### 4b. Embed (Windows DML)
```
queue: 10261 entries
new face records: 19462
new noface records: 1
load errors: 125 (likely HEIC / unreadable)
elapsed: 3878.0s (64.6 min, 2.6 img/s end-to-end)
```
The 2.6 img/s end-to-end includes CIFS-share image load, image decode,
DML inference (~50 ms/face), and JSON / NPZ flushing. Pure DML inference
is faster; the rest of the pipeline dominates at scale.
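
The heart of `embed_worker.py` is the standard insightface loop; a minimal sketch, with
the queue reader and the `.npz` bookkeeping simplified away (the real worker writes the
full `cmd_embed`-compatible cache schema):
```python
import cv2
import numpy as np
from insightface.app import FaceAnalysis

# buffalo_l model files live under C:\face_embed_venv\models\buffalo_l\
app = FaceAnalysis(name="buffalo_l", root=r"C:\face_embed_venv",
                   providers=["DmlExecutionProvider", "CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

records = []
for entry in load_queue("queue.json"):         # hypothetical reader for the staged queue
    img = cv2.imread(entry["win_path"])        # BGR uint8; None for HEIC/unreadable files
    if img is None:
        continue                               # counted as a load error
    for face in app.get(img):                  # detection + landmarks + arcface in one call
        records.append({
            "path": entry["wsl_path"],         # the cache keeps the WSL path form
            "bbox": face.bbox.astype(np.float32),
            "embedding": face.normed_embedding,  # 512-d, L2-normalised
        })
```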
### 4c. Cluster
```
existing canonical centroids: 25
faces already covered (cos-dist <= 0.45): 8103/19480 (42%)
faceset_001: 1856
faceset_002: 2666
faceset_003: 670
faceset_004: 48
faceset_005: 40
... (smaller hits to the remaining 20)
unmatched faces to cluster: 11377
clusters at threshold 0.55: 2534 (top sizes [469, 444, 342, 338, 262, ...])
survived refine gates: 239
emitted as new facesets: 185 (54 dropped by export-swap's 0.45 outlier)
```
Top-level `facesets_swap_ready/manifest.json` after this run: **216
facesets** (up from 31; ~7× growth) + 68 thin_eras under `_thin/`.
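
With L2-normalised embeddings, cosine distance is just `1 - dot`, so the match-vs-cluster
split is a couple of numpy operations. A sketch with hypothetical names; the real
`cluster_immich.py` may use a different clustering strategy at the same 0.55 threshold:
```python
import numpy as np

MATCH_THRESHOLD = 0.45    # at or below this distance to an existing centroid: already covered
CLUSTER_THRESHOLD = 0.55  # linking distance when clustering the unmatched remainder

def split_matched(embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """embeddings (n, 512) and centroids (k, 512), both L2-normalised; returns a boolean mask."""
    dists = 1.0 - embeddings @ centroids.T          # cosine distance, shape (n, k)
    return dists.min(axis=1) <= MATCH_THRESHOLD

def greedy_cluster(embeddings: np.ndarray) -> list[list[int]]:
    """Assign each face to the first cluster whose running centroid is close enough."""
    clusters = []                                   # each entry: [embedding_sum, member_indices]
    for i, e in enumerate(embeddings):
        for c in clusters:
            centroid = c[0] / np.linalg.norm(c[0])
            if 1.0 - float(e @ centroid) <= CLUSTER_THRESHOLD:
                c[0] = c[0] + e
                c[1].append(i)
                break
        else:
            clusters.append([e.copy(), [i]])
    return [c[1] for c in clusters]
```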
## 5. Surprises and caveats
### 5a. `/search/metadata`'s `userIds` filter is silently ignored (Immich v2.7.2)
When the admin API key is used, passing `userIds=[<other-user-uuid>]`
returns admin's own assets, not the other user's. The filter is
silently dropped. Verified by sampling 200 returned items and
confirming `ownerId` was admin for all of them.
To process another user's library, **a separate API key issued by that
user is required** — the admin key cannot enumerate cross-user
libraries through any documented endpoint we tried. `/timeline/buckets`
with a `userId` query parameter returns
`Not found or no timeline.read access`.
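
The check is easy to repeat against another instance or a newer Immich. A sketch using
`requests`; the request/response field names (`size`, `page`, `assets.items`, `ownerId`)
reflect our reading of the Immich API and should be re-verified against the target
version:
```python
import os
import requests

IMMICH_URL = os.environ["IMMICH_URL"]
HEADERS = {"x-api-key": os.environ["IMMICH_API_KEY"], "Accept": "application/json"}

def sample_owner_ids(user_uuid: str, n: int = 200) -> set[str]:
    """POST /search/metadata with a userIds filter and report whose assets actually come back."""
    body = {"userIds": [user_uuid], "size": n, "page": 1}
    r = requests.post(f"{IMMICH_URL}/api/search/metadata", json=body, headers=HEADERS, timeout=30)
    r.raise_for_status()
    items = r.json()["assets"]["items"]
    owners = {item["ownerId"] for item in items}
    print(f"{len(items)} items returned, distinct ownerIds: {owners}")
    return owners

# On this instance, calling this with the admin key and another user's UUID returned
# only the admin's ownerId -- the filter was silently dropped.
```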
### 5b. `/server/statistics` undercounts what the search returns
`/server/statistics` reported 53,842 photos for admin. Paginating through
`/search/metadata` returned... **53,842** top-level assets, so the two
figures agree in this case. But `/server/statistics` does NOT count items
that live under external libraries' import paths —
yet `/search/metadata` does include them. For this Immich, two external
libraries (`/mnt/media/photos` and `/mnt/media/omv_photos`) are
configured but `/libraries` reports `assetCount=0` for both. Yet 80% of
our staged paths come from those library import paths. Don't trust
statistics-vs-search consistency.
### 5c. Indexed Immich thumbnails masquerading as assets
5,563 of our 10,261 staged paths are `<library>/thumbs/.../-preview.jpeg`
— Immich's own internally-generated thumbnails got indexed because the
external library import path included the thumbs subdirectory and the
exclusion patterns didn't list `**/thumbs/**`. They embed and cluster
fine but produce lower-resolution face records. The fix on the Immich
side is adding `**/thumbs/**` to the exclusion patterns.
### 5d. Internal byte-duplicates (2,976)
Many Immich assets are byte-identical to other Immich assets — typically
because the same photo was uploaded both from a phone and from a
synced cloud folder. sha256 dedup catches all of these on the second
download (we still pay the bandwidth, but skip the disk write and
embed work). With Immich v2.7.2's own `assets/duplicates` endpoint we
could catch this earlier, but it's not currently used.
## 6. Re-running and applying to other Immich instances
```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=... # admin or per-user key
# Optional: populate work/immich/users.json with label -> UUID map.
# 1. Stage (parallel /faces + downloads, resumable).
python work/immich_stage.py --user peter --workers 8
# 2. End-to-end finalize: copy queue to /mnt/c/, run Windows embed worker,
# copy the cache back, run cluster_immich.py.
bash work/finalize_immich.sh peter
```
For a different Immich instance, the only configuration is the env vars
and the `users.json` sidecar. `cluster_immich.py`'s tunables (matching
threshold, clustering threshold, refine gates, MIN_FACES) are at the
top of the script.
To process a *second* user's library, issue a per-user API key in the
Immich admin UI for that user, set `IMMICH_API_KEY` to that key, and
re-run with their `--user <label>`. The admin key cannot impersonate
other users via the search API.