Add Immich import pipeline (WSL stage + Windows DML embed + cluster)

Three-piece workflow (plus a driver script) that imports a self-hosted
Immich library and emits new facesets without disturbing existing
identity numbering:

- work/immich_stage.py (WSL): pages /search/metadata, parallel-fetches
  /faces?id= per asset, prefilters by face_short>=90 against bbox scaled
  to original-image coords, downloads originals, sha256-dedups against
  nl_full.npz and same-run staged files. 8-worker ThreadPoolExecutor
  doing the full /faces->filter->/original chain per asset; resumable
  via state.json. API URL + key come from IMMICH_URL / IMMICH_API_KEY
  env vars, label->UUID map from work/immich/users.json (gitignored).
- work/embed_worker.py (Windows venv at C:\face_embed_venv): runs
  insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider on
  AMD Radeon Vega via onnxruntime-directml. Produces a cache file in
  the same .npz schema as sort_faces.cmd_embed (loadable via
  load_cache). ~7.5x speedup over CPU end-to-end; embeddings match
  CPU output (cosine similarity 1.0000 across 8 sample faces).
- work/cluster_immich.py (WSL): mirrors cluster_osrc.py against an
  immich_<user>.npz. Builds existing identity centroids from canonical
  faceset_NNN/ in facesets_swap_ready/, drops matches at <=0.45,
  clusters the rest at 0.55, applies refine gates, hands off to
  cmd_export_swap. Numbers new facesets past the existing maximum.
- work/finalize_immich.sh: chains queue->Windows embed->cache copy->
  cluster_immich, with logging.

The 2026-04-26 run on https://fotos.computerliebe.org (Immich v2.7.2)
processed 53,842 admin-accessible assets, staged 10,261, embedded
19,462 face records on Vega DML in 64.6 min, matched 8,103 (42%) to
existing identities, and emitted 185 new facesets (faceset_026..264
with gaps). facesets_swap_ready/ went from 31 to 216 substantive
facesets.

Important caveat surfaced: /search/metadata's userIds filter is
silently ignored when the API key is bound to a different user, so
this run can't enumerate other users' libraries from the admin key.
A per-user API key would be required for nic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 18:14:26 +02:00
parent 7ecbfae981
commit 321fed01cc
6 changed files with 1340 additions and 3 deletions


@@ -204,6 +204,77 @@ existing identities), this produced 6 new facesets (`faceset_020..025`,
sizes 426 exported PNGs; the 7th candidate cluster lost all 6 faces to
export-swap's tighter `min_face_short=100` gate).
### Importing identities from a self-hosted Immich library
`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py`
together import an Immich library at scale, with the embed step running on
a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:
1. **`work/immich_stage.py` (WSL)** — pages every IMAGE asset via
`/search/metadata`, fetches each asset's `/faces?id=` to read Immich's
own ML-driven bboxes, scales each bbox to original-image coordinates,
and prefilters by `face_short ≥ 90`. For survivors it downloads the
original, sha256-deduplicates against the canonical `nl_full.npz` and
against same-run staged files, and saves to
`/mnt/x/src/immich/<user>/<rel>`. Writes a `queue.json` that the embed
worker consumes. 8 concurrent worker threads each run the full per-asset
I/O chain (`/faces` → filter → `/original`); because the work is
I/O-bound, 8 workers give roughly 8× the serial throughput.
2. **`work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`)** —
loads `insightface.FaceAnalysis(buffalo_l)` with the
`DmlExecutionProvider` and runs detection + landmarks + recognition
over the queue. Produces a `.npz` cache with the same schema as the
one `sort_faces.py:cmd_embed` writes, so the result is
directly loadable by `load_cache()`. The cache already includes the
post-`enrich` fields (`landmark_2d_106`, `landmark_3d_68`, `pose`)
because FaceAnalysis returns them for free. On the AMD Radeon Vega GPU
the pipeline runs ~7.5× faster end-to-end than on CPU.
3. **`work/cluster_immich.py` (WSL)** — mirrors `cluster_osrc.py`'s
shape but reads from `immich_<user>.npz`. Builds existing-identity
centroids from every canonical `faceset_NNN/` in
`facesets_swap_ready/` (skipping era splits and `_thin/`), drops
immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55,
applies refine gates, numbers new facesets past the existing maximum,
and feeds `cmd_export_swap` via a synthetic manifest.
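
The prefilter in step 1 can be sketched as follows. This is a minimal illustration, not the code in `work/immich_stage.py`: it assumes each `/faces` entry carries `boundingBoxX1/Y1/X2/Y2` together with the `imageWidth`/`imageHeight` of the frame the detector ran on, and that `face_short` means the shorter bbox side in original-image pixels. The function names are hypothetical.

```python
from typing import Iterable, Mapping

MIN_FACE_SHORT = 90  # prefilter threshold quoted in the text


def face_short_in_original(face: Mapping[str, float], orig_w: int, orig_h: int) -> float:
    """Shorter bbox side after scaling the bbox to original-image pixels.

    Assumes the bbox is expressed in the coordinate frame described by
    face["imageWidth"] x face["imageHeight"] (typically a downscaled preview).
    """
    scale_x = orig_w / face["imageWidth"]
    scale_y = orig_h / face["imageHeight"]
    width = (face["boundingBoxX2"] - face["boundingBoxX1"]) * scale_x
    height = (face["boundingBoxY2"] - face["boundingBoxY1"]) * scale_y
    return min(width, height)


def has_big_face(faces: Iterable[Mapping[str, float]], orig_w: int, orig_h: int,
                 min_short: float = MIN_FACE_SHORT) -> bool:
    """True if at least one face passes the face_short prefilter."""
    return any(face_short_in_original(f, orig_w, orig_h) >= min_short for f in faces)
```

Under these assumptions, an asset is only worth downloading when `has_big_face(...)` is true, which is what keeps the expensive `/original` fetch off the no-face and small-face assets.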
`work/finalize_immich.sh <user>` chains queue → Windows embed → cache
copy back → cluster_immich, with logging.
The Immich base URL and API key (admin or per-user) come from environment variables:
```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=... # admin or per-user key
python work/immich_stage.py --user peter --workers 8
bash work/finalize_immich.sh peter
```
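
On the Python side, reading that configuration might look like the sketch below. It assumes `work/immich/users.json` is a flat JSON object mapping a label to a user UUID, and it uses Immich's `x-api-key` request header; `load_immich_config` and the shape of its return value are hypothetical, not taken from `immich_stage.py`.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Union


def load_immich_config(user_label: str,
                       users_path: Union[str, Path] = "work/immich/users.json",
                       env: Mapping[str, str] = os.environ) -> dict:
    """Collect base URL, auth header and user UUID for one staging run."""
    base_url = env.get("IMMICH_URL", "").rstrip("/")
    api_key = env.get("IMMICH_API_KEY", "")
    if not base_url or not api_key:
        raise RuntimeError("IMMICH_URL and IMMICH_API_KEY must both be set")

    # Assumed layout: {"peter": "<uuid>", ...}
    users = json.loads(Path(users_path).read_text(encoding="utf-8"))
    if user_label not in users:
        raise KeyError(f"no entry for {user_label!r} in {users_path}")

    return {
        "base_url": base_url,
        "headers": {"x-api-key": api_key, "Accept": "application/json"},
        "user_id": users[user_label],
    }
```

Failing fast on a missing variable avoids starting a long paging run that would only produce authentication errors.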
For the 2026-04-26 run against `https://fotos.computerliebe.org` (Immich
v2.7.2), with the admin API key:
| step | result |
|------|------|
| stage | 53,842 assets seen, **10,261 staged** (~10 GB), 978 byte-deduped against `nl_full.npz`, 2,976 internal byte-duplicates, ~39K skipped (no face, or no face passing the `face_short ≥ 90` prefilter) |
| Windows DML embed | 19,462 face records + 1 noface in **64.6 min** (2.6 img/s end-to-end) |
| matched existing identities | **8,103 of 19,480 (42%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → **185 emitted** as `faceset_026..264` (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |
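
The "matched existing identities" row comes from comparing each new embedding against per-identity centroids. A minimal sketch of that split, in plain Python for readability (the real scripts presumably operate on NumPy arrays loaded from the `.npz` cache); all function names here are hypothetical, and only the 0.45 threshold is taken from the text:

```python
import math
from typing import Dict, List, Sequence, Tuple

MATCH_MAX_DIST = 0.45  # cos-dist threshold quoted in the text


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of one identity's embeddings."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def split_known_unknown(embeddings: Sequence[Sequence[float]],
                        centroids: Dict[str, Sequence[float]],
                        max_dist: float = MATCH_MAX_DIST
                        ) -> Tuple[Dict[int, str], List[int]]:
    """Assign each embedding to its nearest centroid if close enough.

    Returns (matched, unmatched): matched maps embedding index -> identity
    name; unmatched lists the indices left over for clustering.
    """
    matched: Dict[int, str] = {}
    unmatched: List[int] = []
    for idx, emb in enumerate(embeddings):
        if not centroids:
            unmatched.append(idx)
            continue
        name, dist = min(((n, cosine_distance(emb, c)) for n, c in centroids.items()),
                         key=lambda pair: pair[1])
        if dist <= max_dist:
            matched[idx] = name
        else:
            unmatched.append(idx)
    return matched, unmatched
```

Only the `unmatched` indices would go on to the 0.55 clustering pass, which is how existing identity numbering stays untouched.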
**Important caveats for Immich v2.7.2**:
- The `userIds` filter on `/search/metadata` is **silently ignored** when
the API key is bound to a different user. What you actually get is
"import everything the API key can see"; cross-user isolation is
enforced server-side, so importing another user's library requires
that user's own API key.
- `/server/statistics` reports lower counts than `/search/metadata`
actually returns (e.g. external-library thumbnail directories that got
indexed because the import path included them).
Don't use the statistics number as a denominator.
- A large fraction of `originalPath`-based assets are *Immich's own
thumbnails* (`<library_root>/thumbs/.../-preview.jpeg`). They are
included whenever the external library's import path covers the thumbs
directory and the exclusion patterns don't list `**/thumbs/**`. In our
run, 5,563 of 10,261 staged files (54%) were thumbnails. They embed and
cluster fine, but the resulting faces are lower-resolution.
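
Two of these caveats can be checked client-side on the asset records that `/search/metadata` returns. The sketch below is a heuristic under stated assumptions: it relies on each asset dict having `ownerId` and `originalPath` fields, and it treats any path with a `thumbs` directory component as an Immich thumbnail. Both helper names are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, Mapping


def is_immich_thumbnail(original_path: str) -> bool:
    """Heuristic: the file sits somewhere below a directory named 'thumbs'."""
    return "thumbs" in PurePosixPath(original_path).parts[:-1]


def owner_breakdown(assets: Iterable[Mapping[str, str]]) -> Counter:
    """Count assets per ownerId.

    If the requested user UUID is absent from the result, the userIds
    filter was most likely ignored for this API key.
    """
    return Counter(asset.get("ownerId") for asset in assets)
```

For example, a staging script could log `owner_breakdown(page)` for the first page of results and abort if the requested UUID is missing, and could skip or tag assets where `is_immich_thumbnail(asset["originalPath"])` is true.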
## Key defaults
`refine`:
@@ -248,15 +319,22 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with
├─ docs/
│ └─ analysis/
│ └─ facesets-downstream-refinement-evaluation.md
└─ work/ (gitignored except force-tracked .py)
└─ work/ (gitignored except force-tracked .py / .sh)
├─ build_folders.py (hand-sorted-folder orchestration)
├─ check_faceset001_age.py (age-split readiness probe)
├─ age_split_001.py (age-split orchestration; faceset_001)
├─ cluster_osrc.py (mixed-bucket identity discovery)
├─ synthetic_refine_manifest.json (last build_folders.py output)
├─ synthetic_osrc_manifest.json (last cluster_osrc.py output)
├─ immich_stage.py (Immich library staging, parallel)
├─ embed_worker.py (Windows DML embed worker, runs from C:\face_embed_venv\)
├─ cluster_immich.py (Immich identity discovery + export)
├─ finalize_immich.sh (chains queue → embed → cluster)
├─ synthetic_*_manifest.json (per-run synthetic refine manifests)
├─ immich/
│ ├─ users.json (label -> userId map; gitignored)
│ └─ <user>/{queue,state,aliases}.json (per-user staging artifacts)
├─ cache/
│ ├─ nl_full.npz (canonical cache + duplicates.json)
│ ├─ immich_<user>.npz (per-user immich embeddings)
│ └─ age_split_exif.json (path → EXIF-year cache)
└─ logs/
└─ *.log (every long step writes here)