face-sets/docs/analysis/immich-import-pipeline.md
Peter 321fed01cc Add Immich import pipeline (WSL stage + Windows DML embed + cluster)
Three-piece workflow that imports a self-hosted Immich library and emits
new facesets without disturbing existing identity numbering:

- work/immich_stage.py (WSL): pages /search/metadata, parallel-fetches
  /faces?id= per asset, prefilters by face_short>=90 against bbox scaled
  to original-image coords, downloads originals, sha256-dedups against
  nl_full.npz and same-run staged files. 8-worker ThreadPoolExecutor
  doing the full /faces->filter->/original chain per asset; resumable
  via state.json. API URL + key come from IMMICH_URL / IMMICH_API_KEY
  env vars, label->UUID map from work/immich/users.json (gitignored).
- work/embed_worker.py (Windows venv at C:\face_embed_venv): runs
  insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider on
  AMD Radeon Vega via onnxruntime-directml. Produces a cache file in
  the same .npz schema as sort_faces.cmd_embed (loadable via
  load_cache). ~7.5x speedup over CPU end-to-end; embeddings bit-
  identical to CPU (cosine similarity 1.0000 across 8 sample faces).
- work/cluster_immich.py (WSL): mirrors cluster_osrc.py against an
  immich_<user>.npz. Builds existing identity centroids from canonical
  faceset_NNN/ in facesets_swap_ready/, drops matches at <=0.45,
  clusters the rest at 0.55, applies refine gates, hands off to
  cmd_export_swap. Numbers new facesets past the existing maximum.
- work/finalize_immich.sh: chains queue->Windows embed->cache copy->
  cluster_immich, with logging.

The 2026-04-26 run on https://fotos.computerliebe.org (Immich v2.7.2)
processed 53,842 admin-accessible assets, staged 10,261, embedded
19,462 face records on Vega DML in 64.6 min, matched 8,103 (42%) to
existing identities, and emitted 185 new facesets (faceset_026..264
with gaps). facesets_swap_ready/ went from 31 to 216 substantive
facesets.

Important caveat surfaced: /search/metadata's userIds filter is
silently ignored when the API key is bound to a different user, so
this run can't enumerate other users' libraries from the admin key.
A per-user API key would be required for nic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 18:14:26 +02:00

# Importing identities from a self-hosted Immich library
_Run date: 2026-04-26. Target: Immich v2.7.2 at `https://fotos.computerliebe.org`.
Driver scripts: `work/immich_stage.py`, `work/embed_worker.py`,
`work/cluster_immich.py`, `work/finalize_immich.sh`._
## 1. Why a split workflow
InsightFace `buffalo_l` on the WSL CPU runs the full detection + landmarks +
recognition stack at ~34 faces/second. Re-detecting all 79K Immich photos
would have taken ~1028 days. The available AMD Radeon RX Vega is unusable
under WSL (no `/dev/dri/`, no ROCm), but **DirectML on Windows native**
runs the same models bit-identically and ~7.5× faster end-to-end. The
pipeline therefore splits:
- **WSL side** (`/opt/face-sets/`) — orchestration: API listing, download,
sha256 dedup, file management, clustering, faceset emission.
- **Windows side** (`C:\face_embed_venv\`) — the embed step only. A fresh
Python 3.12 (installed via `winget install Python.Python.3.12`) with
`numpy`, `pillow`, `opencv-python-headless`, `onnxruntime-directml`,
`insightface`. Models copied from `/home/peter/.insightface/models/buffalo_l/`
to `C:\face_embed_venv\models\buffalo_l\`.
A 30-iteration synthetic benchmark on Vega:

| model                      |     DML |      CPU | speedup |
|----------------------------|--------:|---------:|--------:|
| `det_10g.onnx` (640×640)   | 10.0 ms | 183.5 ms |   18.4× |
| `w600k_r50.onnx` (112×112) |  8.2 ms |  90.5 ms |   11.0× |

End-to-end FaceAnalysis on 5 real Immich-sourced images, excluding the
first-call DML JIT warmup, showed the ~7.5× speedup. Per-face cosine
similarity between DML and CPU was 1.0000 across all 8 detected faces —
DML is bit-identical to CPU for ArcFace inference.
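
The comparison itself is a few lines of onnxruntime. A minimal sketch, assuming the
model path under `C:\face_embed_venv\models\buffalo_l\` and a synthetic input (the real
benchmark also timed `det_10g.onnx` at 640×640):
```python
import time
import numpy as np
import onnxruntime as ort

def bench(model_path, shape, provider, iters=30):
    """Average per-run latency of one ONNX model under a single execution provider."""
    sess = ort.InferenceSession(model_path, providers=[provider])
    inp = sess.get_inputs()[0].name
    x = np.random.rand(*shape).astype(np.float32)
    sess.run(None, {inp: x})                      # warmup -- the first DML call pays the JIT cost
    t0 = time.perf_counter()
    for _ in range(iters):
        out = sess.run(None, {inp: x})
    return (time.perf_counter() - t0) / iters, out[0]

model = r"C:\face_embed_venv\models\buffalo_l\w600k_r50.onnx"
dml_t, dml_emb = bench(model, (1, 3, 112, 112), "DmlExecutionProvider")
cpu_t, cpu_emb = bench(model, (1, 3, 112, 112), "CPUExecutionProvider")
print(f"DML {dml_t*1e3:.1f} ms  CPU {cpu_t*1e3:.1f} ms  speedup {cpu_t/dml_t:.1f}x")

a, b = dml_emb.ravel(), cpu_emb.ravel()
print("cosine DML vs CPU:", float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```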
## 2. Architecture
```
┌───────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/immich_stage.py       │
│ ┌───────────────────────────────────────────┐ │
│ │ ThreadPoolExecutor.map(_fetch_for_asset,  │ │
│ │                        list_assets(user)) │ │
│ │ ─ /faces?id= (Immich, parallel x8)        │ │
│ │ ─ filter face_short >= 90                 │ │
│ │ ─ /assets/.../original (parallel x8)      │ │
│ └───────────────────────────────────────────┘ │
│ consumer (main thread):                       │
│   sha256 → dedup vs nl_full.npz               │
│   save to /mnt/x/src/immich/<user>/<rel>/     │
│   append to queue.json                        │
└───────────────────────┬───────────────────────┘
                        ▼  queue.json (with WSL + Windows paths)
┌───────────────────────────────────────────────┐
│ Windows embed_worker.py (C:\face_embed_venv)  │
│ insightface.FaceAnalysis(                     │
│     providers=[DmlExecutionProvider, ...])    │
│ per image: detection + landmarks + arcface    │
│ emit cache in sort_faces.py:cmd_embed         │
│ schema with embeddings + meta + processed     │
│   + path_aliases + schema=v2                  │
└───────────────────────┬───────────────────────┘
                        ▼  immich_<user>.npz
┌───────────────────────────────────────────────┐
│ WSL cluster_immich.py                         │
│   build centroids of canonical                │
│     faceset_NNN/ in facesets_swap_ready/      │
│   drop matches at cos-dist <= 0.45            │
│   cluster the rest at 0.55                    │
│   refine gates -> synthetic refine_manifest   │
│   cmd_export_swap -> facesets_swap_ready/     │
│   merge top-level manifest                    │
└───────────────────────────────────────────────┘
```
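
In code, the stage step reduces to the producer/consumer shape above. A minimal sketch,
with hypothetical helpers (`list_assets`, `fetch_faces`, `download_original`,
`load_existing_sha256s`, `save_staged`) standing in for the real API wrappers, and the
face-size prefilter simplified to a bbox short-side check:
```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor

MIN_FACE_SHORT = 90  # smallest face bbox short side (in original-image pixels) worth staging

def _fetch_for_asset(asset):
    """Producer (worker thread): /faces -> prefilter -> /original for one asset."""
    faces = fetch_faces(asset["id"])                      # hypothetical wrapper around /faces?id=
    if not any(min(f["width"], f["height"]) >= MIN_FACE_SHORT for f in faces):
        return None
    return asset, download_original(asset["id"])          # raw image bytes

seen = load_existing_sha256s("nl_full.npz")               # hypothetical: sha256 dedup baseline
queue = []
with ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(_fetch_for_asset, list_assets("peter")):
        if result is None:
            continue                                      # no big-enough face, or no faces at all
        asset, blob = result
        digest = hashlib.sha256(blob).hexdigest()
        if digest in seen:
            continue                                      # byte-identical to an existing/staged file
        seen.add(digest)                                  # consumer runs on the main thread
        queue.append({"wsl_path": save_staged(asset, blob), "sha256": digest})

with open("queue.json", "w") as fh:
    json.dump(queue, fh, indent=2)
```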
Cache artifacts stay separate (per the architecture choice on this run):
each user's results live in their own `immich_<user>.npz`. A future
one-shot merge can fold them into `nl_full.npz` if needed; the existing
`extend` command would do the right thing once schemas align.
## 3. Path mapping
`/mnt/x/` on the WSL side corresponds to `X:\` on Windows. The cache
stores the WSL form (matching `nl_full.npz`'s existing convention);
`wsl_to_win()` translates for the embed worker, which runs natively on
Windows.
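
The translation is mechanical. A sketch of what a `wsl_to_win()` helper amounts to (the
actual implementation in the scripts may differ in details):
```python
def wsl_to_win(path: str) -> str:
    """Map a single-letter WSL mount path like /mnt/x/... to its Windows form X:\\..."""
    if not (path.startswith("/mnt/") and len(path) > 6 and path[6] == "/"):
        raise ValueError(f"not a single-letter WSL mount path: {path}")
    drive, rest = path[5], path[6:]
    return drive.upper() + ":" + rest.replace("/", "\\")

# wsl_to_win("/mnt/x/src/immich/peter/a.jpg") == r"X:\src\immich\peter\a.jpg"
```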
`work/cluster_immich.py` always uses the canonical `facesets_swap_ready/`
view to build identity centroids — meaning the comparison is against the
*current* set of canonical facesets in the swap-ready directory (skipping
era splits and `_thin/`), not against the older `facesets_full/` snapshot.
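
A sketch of that centroid step, assuming a hypothetical `load_faceset_embeddings()` that
returns the (n, 512) embedding matrix for one faceset directory; the strict
`faceset_NNN` name match is what skips the era splits and `_thin/`:
```python
import re
from pathlib import Path
import numpy as np

SWAP_READY = Path("facesets_swap_ready")

def load_canonical_centroids():
    """One L2-normalised mean embedding per canonical faceset_NNN/ directory."""
    centroids = {}
    for d in sorted(SWAP_READY.iterdir()):
        if not (d.is_dir() and re.fullmatch(r"faceset_\d{3}", d.name)):
            continue                              # skips _thin/ and era-split variants
        embs = load_faceset_embeddings(d)         # hypothetical, shape (n, 512)
        c = embs.mean(axis=0)
        centroids[d.name] = c / np.linalg.norm(c)
    return centroids
```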
## 4. Result of the 2026-04-26 run (peter / admin)
### 4a. Stage
```
total_assets_seen: 53842
staged_count: 10261 (~10 GB on /mnt/x/)
deduped_against_existing: 978 (sha256 in nl_full.npz already)
deduped_against_staged: 2976 (internal byte-dupes inside Immich)
skipped_no_big_face: 9539 (Immich detected only sub-90px faces)
skipped_no_faces: 29390 (Immich detected zero faces)
skipped_download_error: 698 (transient DNS / TLS, not seen-marked)
elapsed: ~70 min (6.4 assets/s end-to-end at 8 workers)
```
The 698 transient errors are recoverable: `immich_stage.py` does not add
them to the `seen` set, so each of those assets is retried on the next
run.
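
A sketch of the resumption rule (the `state.json` field names here are assumptions):
only assets that reach a terminal outcome get marked seen, so transient failures fall
through and are retried on the next invocation.
```python
import json
from pathlib import Path

STATE = Path("work/immich/state.json")

def load_seen() -> set[str]:
    return set(json.loads(STATE.read_text())["seen"]) if STATE.exists() else set()

def mark_seen(seen: set[str], asset_id: str) -> None:
    seen.add(asset_id)
    STATE.write_text(json.dumps({"seen": sorted(seen)}))

seen = load_seen()
for asset in list_assets("peter"):          # hypothetical pager over /search/metadata
    if asset["id"] in seen:
        continue
    try:
        stage(asset)                        # hypothetical: /faces -> filter -> /original -> disk
    except (OSError, ConnectionError):
        continue                            # transient DNS/TLS error: NOT marked seen, retried later
    mark_seen(seen, asset["id"])            # staged, deduped, or skipped for a permanent reason
```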
### 4b. Embed (Windows DML)
```
queue: 10261 entries
new face records: 19462
new noface records: 1
load errors: 125 (likely HEIC / unreadable)
elapsed: 3878.0s (64.6 min, 2.6 img/s end-to-end)
```
The 2.6 img/s end-to-end includes CIFS-share image load, image decode,
DML inference (~50 ms/face), and JSON / NPZ flushing. Pure DML inference
is faster; the rest of the pipeline dominates at scale.
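
The heart of `embed_worker.py` is the standard insightface loop; a minimal sketch, with
the queue reader and the `.npz` bookkeeping simplified away (the real worker writes the
full `cmd_embed`-compatible cache schema):
```python
import cv2
import numpy as np
from insightface.app import FaceAnalysis

# buffalo_l model files live under C:\face_embed_venv\models\buffalo_l\
app = FaceAnalysis(name="buffalo_l", root=r"C:\face_embed_venv",
                   providers=["DmlExecutionProvider", "CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

records = []
for entry in load_queue("queue.json"):         # hypothetical reader for the staged queue
    img = cv2.imread(entry["win_path"])        # BGR uint8; None for HEIC/unreadable files
    if img is None:
        continue                               # counted as a load error
    for face in app.get(img):                  # detection + landmarks + arcface in one call
        records.append({
            "path": entry["wsl_path"],         # the cache keeps the WSL path form
            "bbox": face.bbox.astype(np.float32),
            "embedding": face.normed_embedding,  # 512-d, L2-normalised
        })
```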
### 4c. Cluster
```
existing canonical centroids: 25
faces already covered (cos-dist <= 0.45): 8103/19480 (42%)
faceset_001: 1856
faceset_002: 2666
faceset_003: 670
faceset_004: 48
faceset_005: 40
... (smaller hits to the remaining 20)
unmatched faces to cluster: 11377
clusters at threshold 0.55: 2534 (top sizes [469, 444, 342, 338, 262, ...])
survived refine gates: 239
emitted as new facesets: 185 (54 dropped by export-swap's 0.45 outlier)
```
Top-level `facesets_swap_ready/manifest.json` after this run: **216
facesets** (up from 31; ~7× growth) + 68 thin_eras under `_thin/`.
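
With L2-normalised embeddings, cosine distance is just `1 - dot`, so the match-vs-cluster
split is a couple of numpy operations. A sketch with hypothetical names; the real
`cluster_immich.py` may use a different clustering strategy at the same 0.55 threshold:
```python
import numpy as np

MATCH_THRESHOLD = 0.45    # at or below this distance to an existing centroid: already covered
CLUSTER_THRESHOLD = 0.55  # linking distance when clustering the unmatched remainder

def split_matched(embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """embeddings (n, 512) and centroids (k, 512), both L2-normalised; returns a boolean mask."""
    dists = 1.0 - embeddings @ centroids.T          # cosine distance, shape (n, k)
    return dists.min(axis=1) <= MATCH_THRESHOLD

def greedy_cluster(embeddings: np.ndarray) -> list[list[int]]:
    """Assign each face to the first cluster whose running centroid is close enough."""
    clusters = []                                   # each entry: [embedding_sum, member_indices]
    for i, e in enumerate(embeddings):
        for c in clusters:
            centroid = c[0] / np.linalg.norm(c[0])
            if 1.0 - float(e @ centroid) <= CLUSTER_THRESHOLD:
                c[0] = c[0] + e
                c[1].append(i)
                break
        else:
            clusters.append([e.copy(), [i]])
    return [c[1] for c in clusters]
```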
## 5. Surprises and caveats
### 5a. `/search/metadata`'s `userIds` filter is silently ignored (Immich v2.7.2)
When the admin API key is used, passing `userIds=[<other-user-uuid>]`
returns admin's own assets, not the other user's. The filter is
silently dropped. Verified by sampling 200 returned items and
confirming `ownerId` was admin for all of them.
To process another user's library, **a separate API key issued by that
user is required** — the admin key cannot enumerate cross-user
libraries through any documented endpoint we tried. `/timeline/buckets`
with a `userId` query parameter returns
`Not found or no timeline.read access`.
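
The check is easy to repeat against another instance or a newer Immich. A sketch using
`requests`; the request/response field names (`size`, `page`, `assets.items`, `ownerId`)
reflect our reading of the Immich API and should be re-verified against the target
version:
```python
import os
import requests

IMMICH_URL = os.environ["IMMICH_URL"]
HEADERS = {"x-api-key": os.environ["IMMICH_API_KEY"], "Accept": "application/json"}

def sample_owner_ids(user_uuid: str, n: int = 200) -> set[str]:
    """POST /search/metadata with a userIds filter and report whose assets actually come back."""
    body = {"userIds": [user_uuid], "size": n, "page": 1}
    r = requests.post(f"{IMMICH_URL}/api/search/metadata", json=body, headers=HEADERS, timeout=30)
    r.raise_for_status()
    items = r.json()["assets"]["items"]
    owners = {item["ownerId"] for item in items}
    print(f"{len(items)} items returned, distinct ownerIds: {owners}")
    return owners

# On this instance, calling this with the admin key and another user's UUID returned
# only the admin's ownerId -- the filter was silently dropped.
```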
### 5b. `/server/statistics` undercounts what the search returns
`/server/statistics` reported 53,842 photos for admin. Paginating through
`/search/metadata` returned... **53,842** top-level assets, so the two
figures agree in this case. But `/server/statistics` does NOT count items
that live under external libraries' import paths —
yet `/search/metadata` does include them. For this Immich, two external
libraries (`/mnt/media/photos` and `/mnt/media/omv_photos`) are
configured but `/libraries` reports `assetCount=0` for both. Yet 80% of
our staged paths come from those library import paths. Don't trust
statistics-vs-search consistency.
### 5c. Indexed Immich thumbnails masquerading as assets
5,563 of our 10,261 staged paths are `<library>/thumbs/.../-preview.jpeg`
— Immich's own internally-generated thumbnails got indexed because the
external library import path included the thumbs subdirectory and the
exclusion patterns didn't list `**/thumbs/**`. They embed and cluster
fine but produce lower-resolution face records. The fix on the Immich
side is adding `**/thumbs/**` to the exclusion patterns.
### 5d. Internal byte-duplicates (2,976)
Many Immich assets are byte-identical to other Immich assets — typically
because the same photo was uploaded both from a phone and from a
synced cloud folder. sha256 dedup catches all of these on the second
download (we still pay the bandwidth, but skip the disk write and
embed work). With Immich v2.7.2's own `assets/duplicates` endpoint we
could catch this earlier, but it's not currently used.
## 6. Re-running and applying to other Immich instances
```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=... # admin or per-user key
# Optional: populate work/immich/users.json with label -> UUID map.
# 1. Stage (parallel /faces + downloads, resumable).
python work/immich_stage.py --user peter --workers 8
# 2. End-to-end finalize: copy queue to /mnt/c/, run Windows embed worker,
# copy the cache back, run cluster_immich.py.
bash work/finalize_immich.sh peter
```
For a different Immich instance, the only configuration is the env vars
and the `users.json` sidecar. `cluster_immich.py`'s tunables (matching
threshold, clustering threshold, refine gates, MIN_FACES) are at the
top of the script.
To process a *second* user's library, issue a per-user API key in the
Immich admin UI for that user, set `IMMICH_API_KEY` to that key, and
re-run with their `--user <label>`. The admin key cannot impersonate
other users via the search API.