Overnight 2026-04-27 nic finalize completed. Per-user API key worked as
expected. The pipeline survived one mid-stage Immich outage via the
circuit breaker added in 62dba3d -- script paused, operator confirmed
connectivity, same command resumed from saved state.json.
Embed (Windows DML): 7,834 images -> 15,627 face records + 1 noface in
59 minutes (2.2 img/s end-to-end).
Cluster: 6,770 of 15,627 faces (43%) matched existing canonical
identities at cos-dist <= 0.45; biggest hits faceset_002 (+3,261),
faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408). The
faceset_008 and faceset_007 hits are noteworthy cross-matches: those
are hand-sorted "sab" and "s" identities, recurring frequently in nic's
library.
Of the 8,857 unmatched faces, 3,787 raw clusters at threshold 0.55,
129 surviving refine gates, 95 emitted as new facesets at faceset_265+.
Top-level facesets_swap_ready/manifest.json: 216 -> 311 substantive
facesets + 68 thin_eras unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
369 lines
20 KiB
Markdown
# face-sets

Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).

## Pipeline

`sort_faces.py` is a single-file CLI with seven subcommands:

| step        | what it does                                                                                                |
|-------------|-------------------------------------------------------------------------------------------------------------|
| embed       | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup.     |
| cluster     | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest.   |
| refine      | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`.         |
| dedup       | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`.       |
| extend      | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering.     |
| enrich      | Re-detect each cached face to persist landmark_2d_106, landmark_3d_68, pose (pitch/yaw/roll) into cache.    |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |

### Design principles

- **embed is resumable and incremental.** It loads any existing cache at the target path and only hashes / embeds files it has not seen. Atomic flush every 50 new files so a mid-run crash loses at most ~50 embeddings.
- **Byte-identical duplicates are sha256-grouped at listing time.** The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster` / `refine` / `export-swap`, so each on-disk location is represented.
- **`safe_dst_name` always flattens the absolute path.** This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- **Caches and outputs stay out of git** via `.gitignore`; defaults live under `work/`.

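
The sha256 grouping described in the second principle can be sketched as follows. This is a minimal illustration, not the code in `sort_faces.py`: the function names `sha256_of` and `group_by_hash` are hypothetical, and picking the first path in sorted order as the canonical one is an assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(paths):
    """Return {canonical_path: [alias_path, ...]} for byte-identical files.

    Assumption: the first path in sorted order is treated as canonical;
    every other path with the same hash becomes an alias of it.
    """
    by_hash = defaultdict(list)
    for p in sorted(map(str, paths)):
        by_hash[sha256_of(Path(p))].append(p)
    return {group[0]: group[1:] for group in by_hash.values()}
```

Only the canonical paths would then be passed to the detector; the alias lists play the role of `path_aliases`.
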
## Typical end-to-end run

```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted

# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"

# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"

# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"

# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"

# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"

# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Merging a new source into an existing result

```bash
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"

# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"

# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Importing hand-sorted folders as identities

When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool: the identity is asserted, not inferred. The
orchestration script `work/build_folders.py` covers this case:

- For each trusted folder, it filters cache records that fall under it, builds an
  identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so
  bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) to every
  identity whose centroid lies within a tight cosine cutoff (default 0.45). A
  multi-identity photo lands in multiple facesets; `export-swap`'s per-bbox outlier
  filter ensures each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the
  emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and
  merges the new entries into the canonical `facesets_swap_ready/manifest.json`
  (existing facesets are left untouched).

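
The two-pass centroid and the routing step above can be sketched like this. It is a dependency-free illustration under stated assumptions, not the code in `work/build_folders.py` (which presumably works on NumPy arrays): the function names are hypothetical, and so is the choice to recompute the centroid after each pass.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cos_dist(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    """Unit-length mean of a list of unit-length vectors."""
    dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    return normalize(mean)


def two_pass_centroid(embeddings, thresholds=(0.55, 0.45)):
    """Drop faces beyond each threshold in turn, recomputing the centroid each time."""
    kept = [normalize(e) for e in embeddings]
    center = centroid(kept)
    for limit in thresholds:
        survivors = [e for e in kept if cos_dist(e, center) <= limit]
        if not survivors:  # assumption: never let a pass empty the identity
            break
        kept = survivors
        center = centroid(kept)
    return center, kept


def route_face(embedding, centroids, cutoff=0.45):
    """Return every identity label whose centroid lies within the cutoff."""
    e = normalize(embedding)
    return [label for label, c in centroids.items() if cos_dist(e, c) <= cutoff]
```

Because `route_face` returns a list rather than a single best match, a face that sits close to two identities is routed to both, which matches the "lands in multiple facesets" behavior described above.
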
```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
    python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done

# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"

# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```

The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.)
is the only thing to edit when adding more hand-sorted folders later.

### Splitting an identity by era (age sub-clustering)

Long-running source corpora produce identities that span 10+ years. The 2009 face
and the 2024 face of the same person sit in the same cluster (correctly, since it is
the same identity), but a single averaged embedding pulled from that cluster blurs
across ages. For face-swap output that should target a specific period, the identity
needs to be split by era *after* the identity is established.

`work/age_split_001.py` is a worked example for `faceset_001` and a template for
any other identity. The pipeline is:

- **Probe first** with `work/check_faceset001_age.py`, which reports the
  intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds
  0.30..0.50, and the EXIF-year distribution per sub-cluster. If sub-clusters at
  0.35 align with distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/`
  (manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original
  source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one
  re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance
  agglomerative, average linkage).
- **Anchor-based fragment assignment** (not transitive merge, which caused
  year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments
  attach to the single nearest anchor only if both the centroid distance ≤ 0.40
  AND the dominant EXIF year is within ±5 years. Fragments with no qualifying
  anchor remain standalone (and end up THIN-tagged downstream).
- **EXIF year per source path** with on-disk caching at
  `work/cache/age_split_exif.json`. The Windows-mount EXIF read is the
  slowest step, so re-runs after a parameter tweak are nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face
  square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`,
  human-readable `<era>.txt` marker. Eras with < 20 face records also drop a
  `THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to
  `facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be
  moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`),
  leaving only the substantive era buckets at the top level.

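
The anchor-based fragment assignment is the least obvious step in the list above, so here is a small sketch of the rule as described. It is not the code in `work/age_split_001.py`: the data layout, the constant names, and the handling of fragments with no readable EXIF year (left standalone) are assumptions.

```python
import math
from collections import Counter

ANCHOR_MIN_SIZE = 20      # sub-clusters at least this large are anchors
MAX_CENTROID_DIST = 0.40  # fragment-to-anchor centroid cutoff
MAX_YEAR_GAP = 5          # dominant EXIF years must agree within this many years


def cos_dist(a, b):
    """Cosine distance between two vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def dominant_year(years):
    """Most common EXIF year in a sub-cluster, or None if no year is known."""
    known = [y for y in years if y is not None]
    return Counter(known).most_common(1)[0][0] if known else None


def assign_fragments(clusters):
    """clusters: {cid: {"centroid": [...], "years": [...], "size": int}}.

    Returns {fragment_id: anchor_id or None}. Fragments are compared to
    anchors only, never to other fragments, so assignments cannot chain.
    """
    anchors = {c: v for c, v in clusters.items() if v["size"] >= ANCHOR_MIN_SIZE}
    result = {}
    for cid, frag in clusters.items():
        if cid in anchors:
            continue
        result[cid] = None
        if not anchors:
            continue
        nearest = min(
            anchors,
            key=lambda a: cos_dist(frag["centroid"], anchors[a]["centroid"]),
        )
        dist = cos_dist(frag["centroid"], anchors[nearest]["centroid"])
        fy = dominant_year(frag["years"])
        ay = dominant_year(anchors[nearest]["years"])
        years_ok = fy is not None and ay is not None and abs(fy - ay) <= MAX_YEAR_GAP
        if dist <= MAX_CENTROID_DIST and years_ok:
            result[cid] = nearest
    return result
```

Only the single nearest anchor is tested. A fragment that is close to an anchor in embedding space but a decade away in EXIF year stays standalone instead of falling through to the second-nearest anchor.
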
```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py

# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```

For the `faceset_001` run on the 5260-face `nl_full.npz`, this produced 6 substantive
era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under `_thin/`.

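
Re-runs are cheap because of the on-disk EXIF-year cache. A minimal sketch of such a cache follows, assuming a flat `{path: year}` JSON file. The function names are hypothetical, and the actual EXIF reader is passed in as `read_year` so the sketch does not depend on any imaging library.

```python
import json
import os
from pathlib import Path


def load_year_cache(cache_file):
    """Load the {path: year} map, or start empty if the file does not exist."""
    p = Path(cache_file)
    return json.loads(p.read_text()) if p.exists() else {}


def save_year_cache(cache_file, cache):
    """Write to a temp file, then rename, so a crash never leaves a partial cache."""
    tmp = f"{cache_file}.tmp"
    Path(tmp).write_text(json.dumps(cache))
    os.replace(tmp, cache_file)


def years_for(paths, cache_file, read_year):
    """Return {path: year or None}, calling read_year only for uncached paths."""
    cache = load_year_cache(cache_file)
    missing = [p for p in paths if p not in cache]
    for p in missing:
        cache[p] = read_year(p)
    if missing:
        save_year_cache(cache_file, cache)
    return {p: cache[p] for p in paths}
```

Paths with no readable EXIF year are cached as `None` too, so they are not re-read on every run.
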
### Discovering new identities in a mixed bucket

A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the
hand-sorted case: identities have to be discovered, not asserted, but should
not collide with already-known identities or scramble their numbering.

`work/cluster_osrc.py` is the worked example. The pipeline:

- **Filter cache to the source root**, including any byte-aliased path that
  resolves under it.
- **Drop already-covered faces** by comparing each candidate to the centroids
  of the existing canonical facesets at the `EXISTING_MATCH_THRESHOLD`
  (default 0.45, the same cutoff as `build_folders.py`'s osrc routing). These
  faces are already routed by `extend` / `build_folders.py` and shouldn't
  seed new facesets.
- **Cluster the unmatched** at cos-dist 0.55 (matches the `extend` default
  for the new-cluster phase).
- **Apply `refine`-equivalent gates** per cluster: `face_short`, `blur`,
  `det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for
  clusters of size ≥ 4. Keep clusters whose surviving unique-source-path
  count is ≥ `MIN_FACES`.
- **Number new facesets past the existing maximum** (`START_NNN`), so
  `faceset_001..NNN` are never disturbed.
- **Synthesize a refine manifest** and run `cmd_export_swap` against it,
  then move the resulting dirs into `facesets_swap_ready/` and append to the
  top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance
  marker.

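
Two of the steps above, dropping already-covered faces and numbering new facesets past the existing maximum, are simple enough to sketch. This is an illustration only, not the code in `work/cluster_osrc.py`; the function names and the dict-based data layout are assumptions.

```python
import math


def cos_dist(a, b):
    """Cosine distance between two vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def split_covered(faces, existing_centroids, threshold=0.45):
    """Partition face keys into (covered, unmatched).

    faces: {face_key: embedding}; existing_centroids: {faceset_name: centroid}.
    A face is covered if its nearest existing centroid is within the threshold.
    """
    covered, unmatched = [], []
    for key, emb in faces.items():
        nearest = min(
            (cos_dist(emb, c) for c in existing_centroids.values()),
            default=math.inf,  # no existing facesets: everything is unmatched
        )
        (covered if nearest <= threshold else unmatched).append(key)
    return covered, unmatched


def number_new_facesets(cluster_ids, start_nnn):
    """Name clusters faceset_NNN from start_nnn up, leaving lower numbers untouched."""
    return {cid: f"faceset_{start_nnn + i:03d}" for i, cid in enumerate(cluster_ids)}
```

Only the `unmatched` keys would go on to clustering at 0.55, and `start_nnn` would be set to one past the highest existing faceset number.
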
Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new
source. The `cluster_osrc.py` step then operates against the canonical
cache and doesn't need `raw_full/` for input:

```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
#    person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
    --refine-out "$OUT/facesets_full"

# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
#    without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run

# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```

For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by
existing identities), this produced 6 new facesets (`faceset_020..025`,
sizes 4–26 exported PNGs). A 7th candidate cluster lost all 6 faces to
export-swap's tighter `min_face_short=100` gate.

### Importing identities from a self-hosted Immich library

`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py`
together import an Immich library at scale, with the embed step running on
a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:

1. **`work/immich_stage.py` (WSL)**: pages every IMAGE asset via
   `/search/metadata`, fetches each asset's `/faces?id=` to read Immich's
   own ML-driven bboxes, scales each bbox to original-image coordinates,
   and prefilters by `face_short ≥ 90`. For survivors it downloads the
   original, sha256-deduplicates against the canonical `nl_full.npz` and
   against same-run staged files, and saves to
   `/mnt/x/src/immich/<user>/<rel>`. Writes a `queue.json` that the embed
   worker consumes. 8 concurrent worker threads run the full per-asset
   I/O chain (`/faces` → filter → `/original`) so 8 workers ≈ 8× the
   serial throughput.
2. **`work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`)**:
   loads `insightface.FaceAnalysis(buffalo_l)` with the
   `DmlExecutionProvider` and runs detection + landmarks + recognition
   over the queue. Produces a `.npz` cache whose schema is identical to
   what `sort_faces.py:cmd_embed` writes, so the result is
   directly loadable by `load_cache()`. The cache already includes the
   post-`enrich` fields (`landmark_2d_106`, `landmark_3d_68`, `pose`)
   because FaceAnalysis returns them for free. AMD Vega gives ~7.5×
   real-pipeline speedup over CPU.
3. **`work/cluster_immich.py` (WSL)**: mirrors `cluster_osrc.py`'s
   shape but reads from `immich_<user>.npz`. Builds existing-identity
   centroids from every canonical `faceset_NNN/` in
   `facesets_swap_ready/` (skipping era splits and `_thin/`), drops
   immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55,
   applies refine gates, numbers new facesets past the existing maximum,
   and feeds `cmd_export_swap` via a synthetic manifest.

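
The bbox scaling and `face_short ≥ 90` prefilter from the staging step can be sketched as below. The dict keys `bbox` and `detect_size` are hypothetical and do not reflect Immich's response schema; the sketch only assumes each face comes with a box plus the dimensions of the image the box was measured on.

```python
def scale_bbox(bbox, detect_size, original_size):
    """Map (x1, y1, x2, y2) from the detection image's pixel grid to the original's.

    detect_size and original_size are (width, height) tuples.
    """
    x1, y1, x2, y2 = bbox
    sx = original_size[0] / detect_size[0]
    sy = original_size[1] / detect_size[1]
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


def face_short(bbox):
    """Short edge of a face bbox in pixels."""
    x1, y1, x2, y2 = bbox
    return min(x2 - x1, y2 - y1)


def has_big_face(faces, original_size, min_short=90):
    """True if at least one face is large enough once scaled to the original image."""
    return any(
        face_short(scale_bbox(f["bbox"], f["detect_size"], original_size)) >= min_short
        for f in faces
    )
```

Running the check on the scaled bbox lets an asset be skipped before its original is downloaded.
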
`work/finalize_immich.sh <user>` chains queue → Windows embed → cache
copy back → cluster_immich, with logging.

The Immich API key (admin or per-user) and base URL come from environment variables:

```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=...   # admin or per-user key
python work/immich_stage.py --user peter --workers 8
bash work/finalize_immich.sh peter
```

For the 2026-04-26 run against `https://fotos.computerliebe.org` (Immich
v2.7.2), with the admin API key:

| step | result |
|------|------|
| stage | 53,842 assets seen, **10,261 staged** (~10 GB), 978 byte-deduped against `nl_full.npz`, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
| Windows DML embed | 19,462 face records + 1 noface in **64.6 min** (2.6 img/s end-to-end) |
| matched existing identities | **8,103 of 19,480 (42%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → **185 emitted** as `faceset_026..264` (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |

A second 2026-04-26 run with **nic's per-user API key** confirmed the
expected behavior: 25,777 of nic's IMAGE assets were enumerated (matching
her `/server/statistics` count of 25,786, off by 9 ≈ the transient errors
that didn't get marked seen), **7,834 staged** (30% face-bearing-with-big-face,
denser than peter's 19%), 519 byte-deduped vs `nl_full.npz`, **0 internal
byte-duplicates** (cleaner library than peter's 2,976), 54 transient errors.

Embed + cluster on the nic queue:

| step | result |
|------|------|
| Windows DML embed | 15,627 face records + 1 noface in **59 min** (2.2 img/s end-to-end), 7 load errors |
| matched existing identities | **6,770 of 15,627 (43%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+3,261), faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408) |
| new clusters | 3,787 at threshold 0.55 → 129 surviving refine gates → **95 emitted** as `faceset_265..NNN` (gaps where export-swap's 0.45 outlier filter dropped clusters below the export bar) |

Top-level `facesets_swap_ready/manifest.json` after both Immich runs:
**311 substantive facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted +
6 era splits + 6 osrc-discovered + 185 peter-Immich + 95 nic-Immich) +
68 thin_eras under `_thin/`.

`work/immich_stage.py` carries a built-in **outage circuit breaker**:
after 12 consecutive HTTP errors it probes Immich; if that probe also
fails, the script exits cleanly with code 2, state preserved. This let
the nic run survive a mid-stage Immich outage: the script stopped, the
operator confirmed connectivity was back, and the same command resumed
from the saved `state.json` without re-fetching what was already done.

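
The circuit-breaker pattern described above might look roughly like this. It is a sketch, not the code in `work/immich_stage.py`: the `fetch` and `probe` callables, the `OutageDetected` exception, and the state-file layout (a JSON list of finished asset ids) are all assumptions.

```python
import json
from pathlib import Path

MAX_CONSECUTIVE_ERRORS = 12


class OutageDetected(Exception):
    """Raised when the error streak hits the limit and the health probe also fails."""


def run_with_breaker(asset_ids, fetch, probe, state_file):
    """Process assets, skipping ones already recorded in state_file.

    fetch(asset_id) is expected to raise OSError on network failure
    (urllib and requests errors both derive from it); probe() returns
    True if the server answers. State is saved even when the run aborts.
    """
    state_path = Path(state_file)
    done = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    consecutive = 0
    try:
        for asset_id in asset_ids:
            if asset_id in done:
                continue
            try:
                fetch(asset_id)
            except OSError:
                consecutive += 1
                if consecutive >= MAX_CONSECUTIVE_ERRORS:
                    if not probe():
                        raise OutageDetected(asset_id)
                    consecutive = 0  # server is reachable; start a new streak
                continue
            consecutive = 0
            done.add(asset_id)
    finally:
        state_path.write_text(json.dumps(sorted(done)))
    return done
```

A command-line wrapper would catch `OutageDetected` and call `sys.exit(2)`. Re-running the same command then skips everything already listed in the state file.
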
**Important caveats for Immich v2.7.2**:

- The `userIds` filter on `/search/metadata` is **silently ignored** when
  the API key is bound to a different user. The "import everything the
  API key can see" semantics are what you actually get; cross-user
  isolation is enforced server-side.
- `/server/statistics` reports counts that under-count what
  `/search/metadata` actually returns (e.g. external library
  thumbnail-dirs that got indexed because the import path included them).
  Don't trust the statistics number as a denominator.
- A meaningful fraction of `originalPath`-based assets are *Immich's own
  thumbnails* (`<library_root>/thumbs/.../-preview.jpeg`). They are included if
  the external library's import path covers the thumbs directory and the
  exclusion patterns don't list `**/thumbs/**`. For our run, 5,563 of
  10,261 staged were thumbnails. They embed and cluster fine but the
  resulting faces are lower-resolution.

## Key defaults

`refine`:

| flag                    | default | meaning |
|-------------------------|--------:|---------|
| `--initial-threshold`   | 0.55    | cosine distance for stage-1 clustering |
| `--merge-threshold`     | 0.40    | centroid-level merge of over-split clusters |
| `--outlier-threshold`   | 0.55    | drop a face if its cosine distance from the centroid exceeds this (only if cluster size ≥ 4) |
| `--min-faces`           | 15      | minimum unique images per faceset |
| `--min-short`           | 90      | minimum short-edge pixels of face bbox |
| `--min-blur`            | 40.0    | Laplacian-variance blur gate |
| `--min-det-score`       | 0.6     | InsightFace detector score gate |

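
As an illustration of how the three per-face quality gates combine, here is a small sketch using the `refine` defaults from the table. The record layout (a dict with `face_short`, `blur`, and `det_score` keys) and the inclusive comparisons are assumptions, not taken from `sort_faces.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityGates:
    """Per-face thresholds; defaults mirror the refine table above."""
    min_short: int = 90
    min_blur: float = 40.0
    min_det_score: float = 0.6


def passes_gates(face, gates=QualityGates()):
    """A face survives only if it clears every gate."""
    return (
        face["face_short"] >= gates.min_short
        and face["blur"] >= gates.min_blur
        and face["det_score"] >= gates.min_det_score
    )
```

The same helper covers the stricter export gate by passing `QualityGates(min_short=100)`.
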
`export-swap`:

| flag                          | default | meaning |
|-------------------------------|--------:|---------|
| `--top-n`                     | 30      | size of the `<faceset>_topN.fsz` bundle |
| `--outlier-threshold`         | 0.45    | tighter than refine; trims cluster boundary for averaging |
| `--pad-ratio`                 | 0.5     | padding around face bbox for PNG crop |
| `--out-size`                  | 512     | PNG output is square `out_size × out_size` |
| `--min-face-short`            | 100     | export gate; stricter than refine's 90 |
| `--candidates`                | off     | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` | 0.55    | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score`       | 0.40    | composite-quality floor for candidates |

The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each normalized to `[0, 1]`.

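
The weighted sum above translates directly into code. The sketch below assumes each component has already been normalized to `[0, 1]` (how `export-swap` derives them is not shown here); the clamp is a defensive addition, and the function names and the `components` dict layout are hypothetical.

```python
WEIGHTS = {
    "frontality": 0.30,
    "det_score": 0.20,
    "landmark_symmetry": 0.20,
    "face_size": 0.15,
    "sharpness": 0.15,
}


def clamp01(x):
    """Keep a component inside [0, 1] even if normalization overshoots."""
    return max(0.0, min(1.0, x))


def composite_quality(components):
    """Weighted sum of the five normalized components; the result is in [0, 1]."""
    return sum(w * clamp01(components[name]) for name, w in WEIGHTS.items())


def top_n(faces, n=30):
    """Rank faces by composite quality, best first, and keep the first n."""
    return sorted(faces, key=lambda f: composite_quality(f["components"]), reverse=True)[:n]
```

Since the weights sum to 1.0, a face that is perfect on every component scores 1.0, and frontality alone can contribute at most 0.30.
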
## Downstream: roop-unleashed

The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop, which is critical because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.

Highly recommended at swap time: enable **Select post-processing = GFPGAN** with the **Original/Enhanced image blend ratio = 0.85** (the default of 0.65 is conservative). See `docs/analysis/facesets-downstream-refinement-evaluation.md` for the full evaluation.

## Layout

```
/opt/face-sets/
├─ README.md                        (this file)
├─ sort_faces.py                    (the tool)
├─ docs/
│  └─ analysis/
│     └─ facesets-downstream-refinement-evaluation.md
└─ work/                            (gitignored except force-tracked .py / .sh)
   ├─ build_folders.py              (hand-sorted-folder orchestration)
   ├─ check_faceset001_age.py       (age-split readiness probe)
   ├─ age_split_001.py              (age-split orchestration; faceset_001)
   ├─ cluster_osrc.py               (mixed-bucket identity discovery)
   ├─ immich_stage.py               (Immich library staging, parallel)
   ├─ embed_worker.py               (Windows DML embed worker, runs from C:\face_embed_venv\)
   ├─ cluster_immich.py             (Immich identity discovery + export)
   ├─ finalize_immich.sh            (chains queue → embed → cluster)
   ├─ synthetic_*_manifest.json     (per-run synthetic refine manifests)
   ├─ immich/
   │  ├─ users.json                 (label -> userId map; gitignored)
   │  └─ <user>/{queue,state,aliases}.json  (per-user staging artifacts)
   ├─ cache/
   │  ├─ nl_full.npz                (canonical cache + duplicates.json)
   │  ├─ immich_<user>.npz          (per-user immich embeddings)
   │  └─ age_split_exif.json        (path → EXIF-year cache)
   └─ logs/
      └─ *.log                      (every long step writes here)
```