Overnight 2026-04-27 nic finalize completed. Per-user API key worked as
expected. The pipeline survived one mid-stage Immich outage via the
circuit breaker added in 62dba3d: the script paused, the operator confirmed
connectivity, and the same command resumed from the saved state.json.

Embed (Windows DML): 7,834 images -> 15,627 face records + 1 noface in
59 minutes (2.2 img/s end-to-end).

Cluster: 6,770 of 15,627 faces (43%) matched existing canonical
identities at cos-dist <= 0.45; biggest hits faceset_002 (+3,261),
faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408). The
faceset_008 and faceset_007 hits are noteworthy cross-matches: those
are hand-sorted "sab" and "s" identities, recurring frequently in nic's
library.

The 8,857 unmatched faces formed 3,787 raw clusters at threshold 0.55;
129 survived the refine gates and 95 were emitted as new facesets at
faceset_265+.

Top-level facesets_swap_ready/manifest.json: 216 -> 311 substantive
facesets + 68 thin_eras unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Importing identities from a self-hosted Immich library

_Run date: 2026-04-26. Target: Immich v2.7.2 at `https://fotos.computerliebe.org`.
Driver scripts: `work/immich_stage.py`, `work/embed_worker.py`,
`work/cluster_immich.py`, `work/finalize_immich.sh`._

## 1. Why a split workflow

InsightFace `buffalo_l` on the WSL CPU runs the full detection + landmarks +
recognition stack at ~3–4 faces/second. Re-detecting all 79K Immich photos
would have taken ~10–28 days. The available AMD Radeon RX Vega is unusable
under WSL (no `/dev/dri/`, no ROCm), but **DirectML on Windows native**
runs the same models with matching embeddings and ~7.5× faster end-to-end.
The pipeline therefore splits:

- **WSL side** (`/opt/face-sets/`): orchestration, i.e. API listing, download,
  sha256 dedup, file management, clustering, faceset emission.
- **Windows side** (`C:\face_embed_venv\`): the embed step only. A fresh
  Python 3.12 (installed via `winget install Python.Python.3.12`) with
  `numpy`, `pillow`, `opencv-python-headless`, `onnxruntime-directml`,
  `insightface`. Models copied from `/home/peter/.insightface/models/buffalo_l/`
  to `C:\face_embed_venv\models\buffalo_l\`.

A 30-iteration synthetic benchmark on Vega:

| model | DML | CPU | speedup |
|-------------|----:|----:|--------:|
| `det_10g.onnx` (640×640) | 10.0 ms | 183.5 ms | 18.4× |
| `w600k_r50.onnx` (112×112) | 8.2 ms | 90.5 ms | 11.0× |

End-to-end FaceAnalysis on 5 real Immich-sourced images (excluding the
first-call DML JIT warmup): ~7.5× speedup post-warmup. Per-face cosine
similarity DML vs CPU was 1.0000 across all 8 detected faces, so DML
matches CPU to four decimal places for arcface inference.
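
The DML-vs-CPU comparison above amounts to pairing faces by index and computing a cosine similarity per pair. A minimal sketch of that check follows, using toy 4-dimensional vectors in place of real arcface embeddings; the function names `cosine_similarity` and `compare_runs` are illustrative and not taken from the driver scripts.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def compare_runs(cpu_embeddings, dml_embeddings):
    """Return per-face similarities for faces paired by index."""
    if len(cpu_embeddings) != len(dml_embeddings):
        raise ValueError("both runs must detect the same number of faces")
    return [cosine_similarity(c, d) for c, d in zip(cpu_embeddings, dml_embeddings)]

# Toy stand-ins for real embeddings (values invented for illustration).
cpu = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2]]
dml = [[0.1, 0.2, 0.3, 0.4000001], [0.5, -0.1, 0.0, 0.2]]
sims = compare_runs(cpu, dml)
print([f"{s:.4f}" for s in sims])  # ['1.0000', '1.0000']
```

A printed 1.0000 shows agreement to four decimal places; confirming bit-for-bit equality would need a direct array comparison instead.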

## 2. Architecture

```
┌───────────────────────────────────────────────┐
│ WSL  /opt/face-sets/work/immich_stage.py      │
│ ┌────────────────────────────────────────────┐│
│ │ ThreadPoolExecutor.map(_fetch_for_asset,   ││
│ │                        list_assets(user))  ││
│ │ ─ /faces?id= (Immich, parallel x8)         ││
│ │ ─ filter face_short >= 90                  ││
│ │ ─ /assets/.../original (parallel x8)       ││
│ └────────────────────────────────────────────┘│
│ consumer (main thread):                       │
│   sha256 → dedup vs nl_full.npz               │
│   save to /mnt/x/src/immich/<user>/<rel>/     │
│   append to queue.json                        │
└────────────────┬──────────────────────────────┘
                 │
                 ▼  queue.json (with WSL + Windows paths)
┌───────────────────────────────────────────────┐
│ Windows embed_worker.py (C:\face_embed_venv)  │
│   insightface.FaceAnalysis(                   │
│     providers=[DmlExecutionProvider, ...])    │
│   per image: detection + landmarks + arcface  │
│   emit cache in sort_faces.py:cmd_embed       │
│     schema with embeddings + meta + processed │
│     + path_aliases + schema=v2                │
└────────────────┬──────────────────────────────┘
                 │
                 ▼  immich_<user>.npz
┌───────────────────────────────────────────────┐
│ WSL  cluster_immich.py                        │
│   build centroids of canonical                │
│     faceset_NNN/ in facesets_swap_ready/      │
│   drop matches at cos-dist <= 0.45            │
│   cluster the rest at 0.55                    │
│   refine gates -> synthetic refine_manifest   │
│   cmd_export_swap -> facesets_swap_ready/     │
│   merge top-level manifest                    │
└───────────────────────────────────────────────┘
```

Cache artifacts stay separate (per the architecture choice on this run):
each user's results live in their own `immich_<user>.npz`. A future
one-shot merge can fold them into `nl_full.npz` if needed; the existing
`extend` command would do the right thing once schemas align.
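
The stage box in the diagram (parallel fetch workers feeding a single-threaded consumer that hashes and deduplicates) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `immich_stage.py`: `FAKE_LIBRARY`, `fetch_for_asset`, and `stage` are hypothetical names, and an in-memory dict stands in for the Immich API.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the Immich API: asset id -> bytes.
FAKE_LIBRARY = {
    "a1": b"photo-one",
    "a2": b"photo-two",
    "a3": b"photo-one",    # byte-identical duplicate of a1
    "a4": b"photo-known",  # already present in the existing cache
}

def fetch_for_asset(asset_id):
    """Worker: the real script would query faces and download the original."""
    return asset_id, FAKE_LIBRARY[asset_id]

def stage(asset_ids, known_hashes, workers=8):
    """Fetch in parallel; hash, dedup, and queue on the calling thread."""
    queue, staged_hashes = [], set()
    stats = {"staged": 0, "dup_existing": 0, "dup_staged": 0}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the consumer is deterministic.
        for asset_id, data in pool.map(fetch_for_asset, asset_ids):
            digest = hashlib.sha256(data).hexdigest()
            if digest in known_hashes:
                stats["dup_existing"] += 1
            elif digest in staged_hashes:
                stats["dup_staged"] += 1
            else:
                staged_hashes.add(digest)
                queue.append({"asset": asset_id, "sha256": digest})
                stats["staged"] += 1
    return queue, stats

known = {hashlib.sha256(b"photo-known").hexdigest()}
queue, stats = stage(list(FAKE_LIBRARY), known)
print(stats)  # {'staged': 2, 'dup_existing': 1, 'dup_staged': 1}
```

Keeping the dedup sets on the consumer side means they need no locking, at the cost of downloading a duplicate before it can be recognized.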

## 3. Path mapping

`/mnt/x/` ↔ `X:\`. The cache stores paths in WSL form (matching `nl_full.npz`'s
existing convention). `wsl_to_win()` translates them for the embed worker,
which runs natively on Windows.
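
The `/mnt/x/` to `X:\` translation could look roughly like the sketch below. Only the function name `wsl_to_win` comes from this document; the signature, the error handling, and the example file path are guesses made for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath

def wsl_to_win(path):
    """Translate a /mnt/<drive>/... path into its Windows drive form (sketch)."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or parts[0] != "/" or parts[1] != "mnt" or len(parts[2]) != 1:
        raise ValueError(f"not a /mnt/<drive>/ path: {path}")
    drive = parts[2].upper() + ":\\"
    # PureWindowsPath joins with backslashes regardless of the host OS.
    return str(PureWindowsPath(drive, *parts[3:]))

print(wsl_to_win("/mnt/x/src/immich/peter/2019/IMG_0001.jpg"))
# X:\src\immich\peter\2019\IMG_0001.jpg
```

Using the pure path classes keeps the function testable on either side of the WSL boundary, since neither touches the filesystem.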

`work/cluster_immich.py` always uses the canonical `facesets_swap_ready/`
view to build identity centroids. The comparison is therefore against the
*current* set of canonical facesets in the swap-ready directory (skipping
era splits and `_thin/`), not against the older `facesets_full/` snapshot.
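
As a rough illustration of centroid matching, the sketch below builds one centroid per faceset and assigns a new face to the nearest one if its cosine distance is at most 0.45. It assumes a centroid is the re-normalized mean of unit-length embeddings, which this document does not confirm; the helper names and the 3-dimensional toy vectors are hypothetical.

```python
import math

MATCH_THRESHOLD = 0.45  # cos-dist cutoff quoted in this document

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(vectors):
    """Mean of L2-normalized vectors, re-normalized (an assumed definition)."""
    normed = [normalize(v) for v in vectors]
    mean = [sum(col) / len(normed) for col in zip(*normed)]
    return normalize(mean)

def cos_dist(a, b):
    """Cosine distance for unit vectors: 1 - dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def match_face(embedding, centroids):
    """Return (faceset, distance) for the nearest centroid, or (None, distance)."""
    e = normalize(embedding)
    name, dist = min(
        ((n, cos_dist(e, c)) for n, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return (name, dist) if dist <= MATCH_THRESHOLD else (None, dist)

centroids = {
    "faceset_001": centroid([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
    "faceset_002": centroid([[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]),
}
print(match_face([0.95, 0.05, 0.0], centroids)[0])  # faceset_001
print(match_face([0.0, 0.0, 1.0], centroids)[0])    # None
```

Faces that come back as `None` are the "unmatched" pool that goes on to clustering at the looser 0.55 threshold.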

## 4. Result of the 2026-04-26 run (peter / admin)

### 4a. Stage

```
total_assets_seen:         53842
staged_count:              10261  (~10 GB on /mnt/x/)
deduped_against_existing:  978    (sha256 in nl_full.npz already)
deduped_against_staged:    2976   (internal byte-dupes inside Immich)
skipped_no_big_face:       9539   (Immich detected only sub-90px faces)
skipped_no_faces:          29390  (Immich detected zero faces)
skipped_download_error:    698    (transient DNS / TLS, not seen-marked)
elapsed:                   ~70 min (6.4 assets/s end-to-end at 8 workers)
```

The 698 transient errors are recoverable on a re-run because
`immich_stage.py` does not add them to the `seen` set, so each affected
asset is retried.
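
The retry behaviour described above follows from one rule: an asset enters the `seen` set only after it has been handled successfully. The sketch below shows that rule with a hypothetical state-file layout and a fake downloader; `TransientError`, `load_state`, `save_state`, and `run_once` are illustrative names, not the script's own.

```python
import json
import os
import tempfile

class TransientError(Exception):
    """Stand-in for DNS/TLS failures during download."""

def load_state(path):
    if os.path.exists(path):
        with open(path) as fh:
            return set(json.load(fh)["seen"])
    return set()

def save_state(path, seen):
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"seen": sorted(seen)}, fh)
    os.replace(tmp, path)  # atomic swap so a crash cannot truncate the state

def run_once(asset_ids, download, state_path):
    """Process unseen assets; return the number of transient failures."""
    seen = load_state(state_path)
    errors = 0
    for asset_id in asset_ids:
        if asset_id in seen:
            continue
        try:
            download(asset_id)
        except TransientError:
            errors += 1  # deliberately NOT added to seen, so a re-run retries it
            continue
        seen.add(asset_id)
    save_state(state_path, seen)
    return errors

attempts = {}

def flaky_download(asset_id):
    """Fails once for asset 'b', then succeeds."""
    attempts[asset_id] = attempts.get(asset_id, 0) + 1
    if asset_id == "b" and attempts[asset_id] == 1:
        raise TransientError(asset_id)

state_path = os.path.join(tempfile.mkdtemp(), "state.json")
first = run_once(["a", "b", "c"], flaky_download, state_path)
second = run_once(["a", "b", "c"], flaky_download, state_path)
print(first, second, attempts)  # 1 0 {'a': 1, 'b': 2, 'c': 1}
```

On the second pass only `b` is downloaded again; `a` and `c` are skipped because they were recorded as seen.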

### 4b. Embed (Windows DML)

```
queue:              10261 entries
new face records:   19462
new noface records: 1
load errors:        125  (likely HEIC / unreadable)
elapsed:            3878.0s (64.6 min, 2.6 img/s end-to-end)
```

The 2.6 img/s end-to-end figure includes CIFS-share image load, image decode,
DML inference (~50 ms/face), and JSON / NPZ flushing. Pure DML inference
is faster; the rest of the pipeline dominates at scale.
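
A back-of-envelope check of those figures, assuming the approximate 50 ms applies to every emitted face record (a simplification; the variable names below are made up for the arithmetic):

```python
queue_images = 10261
face_records = 19462
elapsed_s = 3878.0
dml_ms_per_face = 50  # approximate figure quoted above

images_per_s = queue_images / elapsed_s
inference_s = face_records * dml_ms_per_face / 1000
inference_share = inference_s / elapsed_s

print(f"{images_per_s:.1f} img/s")            # 2.6 img/s
print(f"{inference_s:.0f} s inference")       # 973 s inference
print(f"{inference_share:.0%} of wall time")  # 25% of wall time
```

Under that assumption, inference accounts for roughly a quarter of the wall time, which is consistent with I/O, decoding, and flushing dominating the run.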

### 4c. Cluster

```
existing canonical centroids:             25
faces already covered (cos-dist <= 0.45): 8103/19480 (42%)
  faceset_001: 1856
  faceset_002: 2666
  faceset_003: 670
  faceset_004: 48
  faceset_005: 40
  ... (smaller hits to the remaining 20)
unmatched faces to cluster:               11377
clusters at threshold 0.55:               2534 (top sizes [469, 444, 342, 338, 262, ...])
survived refine gates:                    239
emitted as new facesets:                  185 (54 dropped by export-swap's 0.45 outlier threshold)
```

Top-level `facesets_swap_ready/manifest.json` after this run: **216
facesets** (up from 31; ~7× growth) + 68 thin_eras under `_thin/`.

## 4d. Result of the 2026-04-26..27 run (nic, with per-user API key)

After issuing nic a per-user API key, the same pipeline ran end-to-end
with no code changes (only the `IMMICH_API_KEY` env var changed). The
run survived one Immich outage mid-stage thanks to the circuit breaker
added in `work/immich_stage.py` (12 consecutive HTTP errors → probe →
exit 2 with state preserved → resume on same command).
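
One way to read the "12 consecutive HTTP errors → probe → exit 2" sequence is sketched below. It is an interpretation, not the code in `immich_stage.py`: in particular, resetting the counter when the probe succeeds is an assumption, and `CircuitBreaker`, `CircuitOpen`, and `run` are hypothetical names.

```python
MAX_CONSECUTIVE_ERRORS = 12  # threshold quoted in this document
EXIT_PAUSED = 2              # exit code quoted in this document

class CircuitOpen(Exception):
    """Raised when the server looks down and the run should pause."""

class CircuitBreaker:
    def __init__(self, probe, limit=MAX_CONSECUTIVE_ERRORS):
        self.probe = probe  # callable returning True if the server responds
        self.limit = limit
        self.consecutive = 0

    def record_success(self):
        self.consecutive = 0

    def record_error(self):
        self.consecutive += 1
        if self.consecutive >= self.limit:
            if self.probe():
                # Server is reachable, so the errors were asset-specific (assumed).
                self.consecutive = 0
            else:
                raise CircuitOpen(f"{self.limit} consecutive errors and probe failed")

def run(results, probe):
    """results: iterable of booleans, True for a successful request."""
    breaker = CircuitBreaker(probe)
    done = 0
    try:
        for ok in results:
            if ok:
                breaker.record_success()
                done += 1
            else:
                breaker.record_error()
    except CircuitOpen:
        # A real script would save its state here and call sys.exit(EXIT_PAUSED).
        return EXIT_PAUSED, done
    return 0, done

print(run([True] * 3 + [False] * 12 + [True], probe=lambda: False))  # (2, 3)
```

Because the counter resets on every success, scattered failures such as the 54 transient download errors in this run would not trip the breaker; only an unbroken streak does.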

### Stage

```
total_assets_seen:         25777  (close to /server/statistics: 25,786)
staged_count:              7834   (30% face-bearing-with-big-face;
                                   peter was 19%)
deduped_against_existing:  519    (sha256 in nl_full.npz already)
deduped_against_staged:    0      (nic's library has zero internal
                                   byte-dupes; peter had 2,976)
skipped_no_big_face:       725
skipped_no_faces:          16695
skipped_download_error:    54     (transient; not marked seen ->
                                   would be retried on resume)
elapsed:                   ~75 min wall (across two pause/resume sessions
                                   bracketing one Immich outage)
```

### Embed (Windows DML)

```
queue:              7834 entries
new face records:   15627
new noface records: 1
load errors:        7
elapsed:            3538.9s (59 min, 2.2 img/s end-to-end)
```

### Cluster

```
existing canonical centroids:             25
faces already covered (cos-dist <= 0.45): 6770/15627 (43%)
  faceset_002: 3261  (the dominant family identity)
  faceset_008: 1461  (cross-match to hand-sorted 'sab')
  faceset_001: 955
  faceset_007: 408   (cross-match to hand-sorted 's')
  faceset_006: 114
  ...
unmatched:                                8857
clusters at threshold 0.55:               3787 (top sizes [165, 134, 106, 99, 92,
                                                67, 62, 61, 58, 53])
survived refine gates:                    129
emitted as new facesets:                  95 (faceset_265..NNN with gaps)
```

Top-level `facesets_swap_ready/manifest.json` after the nic run: **311
substantive facesets** + 68 thin_eras. Cumulative growth over three days:

| date | event | facesets total |
|------|------|------:|
| 2026-04-25 | hand-sorted folder import | 19 |
| 2026-04-26 morning | osrc + age split + cleanup | 31 |
| 2026-04-26 afternoon | Immich peter run | 216 |
| 2026-04-27 (overnight) | Immich nic run | 311 |

## 5. Surprises and caveats

### 5a. `/search/metadata`'s `userIds` filter is silently ignored (Immich v2.7.2)

When the admin API key is used, passing `userIds=[<other-user-uuid>]`
returns admin's own assets, not the other user's. The filter is
silently dropped. Verified by sampling 200 returned items and
confirming `ownerId` was admin for all of them.

To process another user's library, **a separate API key issued by that
user is required**: the admin key cannot enumerate cross-user
libraries through any documented endpoint we tried. `/timeline/buckets`
with a `userId` query parameter returns
`Not found or no timeline.read access`.
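
The sampling check described above can be reproduced offline on any list of returned search items. The sketch below assumes each item is a dict carrying an `ownerId` field, as referenced in the text; the helper names and the placeholder UUID strings are invented for illustration.

```python
import random
from collections import Counter

def owner_breakdown(items, sample_size=200, seed=0):
    """Count ownerId values in a random sample of search results."""
    rng = random.Random(seed)
    sample = items if len(items) <= sample_size else rng.sample(items, sample_size)
    return Counter(item["ownerId"] for item in sample)

def filter_was_honoured(items, requested_user_id, **kwargs):
    """True only if every sampled item belongs to the requested user."""
    counts = owner_breakdown(items, **kwargs)
    return set(counts) == {requested_user_id}

# Fabricated response: every item is owned by the admin account.
items = [{"id": f"asset-{i}", "ownerId": "admin-uuid"} for i in range(1000)]
print(owner_breakdown(items))                    # Counter({'admin-uuid': 200})
print(filter_was_honoured(items, "other-uuid"))  # False
```

A sample cannot prove the filter works, but a single foreign `ownerId` is enough to show that it does not.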

### 5b. `/server/statistics` undercounts what the search returns

`/server/statistics` reported 53,842 photos for admin, and paginating
`/search/metadata` returned the same 53,842 top-level assets, so the two
totals agree in this case. But `/server/statistics` does NOT count items
that live under external libraries' import paths, whereas
`/search/metadata` does include them. For this Immich, two external
libraries (`/mnt/media/photos` and `/mnt/media/omv_photos`) are
configured, and `/libraries` reports `assetCount=0` for both, yet 80% of
our staged paths come from those library import paths. Don't trust
statistics-vs-search consistency.

### 5c. Indexed Immich thumbnails masquerading as assets

5,563 of our 10,261 staged paths are `<library>/thumbs/.../-preview.jpeg`:
Immich's own internally generated thumbnails got indexed because the
external library import path included the thumbs subdirectory and the
exclusion patterns didn't list `**/thumbs/**`. They embed and cluster
fine but produce lower-resolution face records. The fix on the Immich
side is adding `**/thumbs/**` to the exclusion patterns.
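
Until the exclusion pattern is in place on the server, such paths could also be screened out on the client. The sketch below is one possible filter and is not something the driver scripts are stated to do; `is_immich_thumbnail` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath

def is_immich_thumbnail(path):
    """True if the path sits under a 'thumbs' directory and ends in -preview.jpeg."""
    p = PurePosixPath(path)
    return "thumbs" in p.parts[:-1] and p.name.endswith("-preview.jpeg")

paths = [
    "/mnt/media/photos/2019/IMG_0001.jpg",
    "/mnt/media/photos/thumbs/ab/cd/abcd-preview.jpeg",
    "/mnt/media/photos/thumbs-of-cats/IMG_0002.jpg",
]
originals = [p for p in paths if not is_immich_thumbnail(p)]
print(originals)
```

Matching on a whole path component rather than a substring keeps a folder such as `thumbs-of-cats` from being dropped by accident.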

### 5d. Internal byte-duplicates (2,976)

Many Immich assets are byte-identical to other Immich assets, typically
because the same photo was uploaded both from a phone and from a
synced cloud folder. sha256 dedup catches all of these on the second
download (we still pay the bandwidth, but skip the disk write and
embed work). With Immich v2.7.2's own `assets/duplicates` endpoint we
could catch this earlier, but it's not currently used.

## 6. Re-running and applying to other Immich instances

```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=...   # admin or per-user key

# Optional: populate work/immich/users.json with label -> UUID map.

# 1. Stage (parallel /faces + downloads, resumable).
python work/immich_stage.py --user peter --workers 8

# 2. End-to-end finalize: copy queue to /mnt/c/, run Windows embed worker,
#    copy the cache back, run cluster_immich.py.
bash work/finalize_immich.sh peter
```

For a different Immich instance, the only configuration is the env vars
and the `users.json` sidecar. `cluster_immich.py`'s tunables (matching
threshold, clustering threshold, refine gates, MIN_FACES) are at the
top of the script.

To process a *second* user's library, issue a per-user API key in the
Immich admin UI for that user, set `IMMICH_API_KEY` to that key, and
re-run with their `--user <label>`. The admin key cannot impersonate
other users via the search API.