# face-sets

Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).

## Pipeline

`sort_faces.py` is a single-file CLI with seven subcommands:

| step        | what it does                                                                                                |
|-------------|-------------------------------------------------------------------------------------------------------------|
| embed       | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup.    |
| cluster     | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest.  |
| refine      | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`.        |
| dedup       | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`.      |
| extend      | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering.    |
| enrich      | Re-detect each cached face to persist landmark_2d_106, landmark_3d_68, pose (pitch/yaw/roll) into cache.   |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |

### Design principles

- **embed is resumable and incremental.** It loads any existing cache at the target path and only hashes / embeds files it has not seen. The cache is flushed atomically every 50 new files, so a mid-run crash loses at most ~50 embeddings.
- **Byte-identical duplicates are sha256-grouped at listing time.** The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster` / `refine` / `export-swap`, so each on-disk location is represented.
- **`safe_dst_name` always flattens the absolute path.** This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- **Caches and outputs stay out of git** via `.gitignore`; defaults live under `work/`.
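
The sha256 grouping described above can be sketched roughly as follows. This is an illustration only, not the code in `sort_faces.py`: the helper names `sha256_of` and `group_by_content` are hypothetical, and choosing the first path in sorted order as the canonical file is an assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_content(paths):
    """Return {canonical_path: [alias_paths]} for byte-identical files.

    Assumption: the canonical path is the first one in sorted order.
    The real tool may pick it differently.
    """
    by_hash = defaultdict(list)
    for p in sorted(Path(p) for p in paths):
        by_hash[sha256_of(p)].append(p)
    # Only the canonical file would be embedded; the rest become aliases.
    return {group[0]: group[1:] for group in by_hash.values()}
```

Under this scheme each alias costs one hash but no detection or embedding pass, which is why re-listing a tree full of copies stays cheap.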

## Typical end-to-end run

```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted

# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"

# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"

# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine  "$CACHE" "$OUT/facesets_full"

# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup   "$CACHE"

# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich  "$CACHE"

# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
  "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
  --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Merging a new source into an existing result

```bash
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"

# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"

# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
  "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
  --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Importing hand-sorted folders as identities

When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool — the identity is asserted, not inferred. The
orchestration script `work/build_folders.py` covers this case:

- For each trusted folder, it filters cache records that fall under it, builds an
  identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so
  bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) into every
  identity centroid within a tight cosine cutoff (default 0.45). A multi-identity
  photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures
  each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the
  emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and
  merges the new entries into the canonical `facesets_swap_ready/manifest.json`
  (existing facesets are left untouched).

```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
  python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done

# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup  "$CACHE"

# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```

The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.)
is the only thing to edit when adding more hand-sorted folders later.
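
The two-pass outlier rejection (cos-dist 0.55, then 0.45) used to build each identity centroid could look roughly like the sketch below. It uses plain Python lists so it runs without dependencies; the function names are hypothetical and the actual `build_folders.py` logic may differ in detail.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def cos_dist(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    """Unit-length mean of a list of unit vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)


def two_pass_centroid(embeddings, thresholds=(0.55, 0.45)):
    """Return (centroid, kept_indices) after successive outlier passes.

    Each pass recomputes the centroid from the current survivors and
    drops faces farther than that pass's cosine-distance limit.
    """
    vectors = [normalize(v) for v in embeddings]
    kept = list(range(len(vectors)))
    for limit in thresholds:
        c = centroid([vectors[i] for i in kept])
        survivors = [i for i in kept if cos_dist(vectors[i], c) <= limit]
        if not survivors:  # assumption: never let a pass empty the set
            break
        kept = survivors
    return centroid([vectors[i] for i in kept]), kept
```

The loose first pass removes obvious bystanders so that they no longer pull the centroid; the tighter second pass then measures against a cleaner mean.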

### Splitting an identity by era (age sub-clustering)

Long-running source corpora produce identities that span 10+ years. The 2009 face
and the 2024 face of the same person sit in the same cluster (correctly — same
identity), but a single averaged embedding pulled from that cluster blurs across
ages. For face-swap output that should target a specific period, the identity
needs to be split by era *after* the identity is established.

`work/age_split_001.py` is a worked example for `faceset_001` and a template for
any other identity. The pipeline is:

- **Probe first** with `work/check_faceset001_age.py` — report intra-cluster
  pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and
  EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with
  distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/`
  (manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original
  source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one
  re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance
  agglomerative, average linkage).
- **Anchor-based fragment assignment** (not transitive merge — that caused
  year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments
  attach to the single nearest anchor only if both the centroid distance ≤ 0.40
  AND the dominant EXIF year is within ±5 years. Fragments with no qualifying
  anchor remain standalone (and end up THIN-tagged downstream).
- **EXIF year per source path** with on-disk caching at
  `work/cache/age_split_exif.json` — the Windows-mount EXIF read is the
  slowest step, so re-runs after a parameter tweak are nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face
  square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`,
  human-readable `<era>.txt` marker. Eras with < 20 face records also drop a
  `THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to
  `facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be
  moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`),
  leaving only the substantive era buckets at the top level.

```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py

# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```

For the `faceset_001` run on 5260-face `nl_full.npz`, this produced 6 substantive
era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under `_thin/`.
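
As a rough illustration of the anchor-based fragment assignment, the sketch below applies the size ≥ 20, distance ≤ 0.40 and ±5 year rules to a list of sub-clusters. The `SubCluster` type and `assign_fragments` function are hypothetical, and it reads "single nearest anchor" as: pick the nearest anchor by centroid distance first, then test both conditions against that one anchor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubCluster:
    name: str
    size: int
    centroid: List[float]   # unit-length embedding
    year: Optional[int]     # dominant EXIF year, None if unknown


def cos_dist(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def assign_fragments(clusters, anchor_min=20, max_dist=0.40, max_year_gap=5):
    """Map each fragment name to an anchor name, or None if it stays standalone."""
    anchors = [c for c in clusters if c.size >= anchor_min]
    assignment = {}
    for frag in clusters:
        if frag.size >= anchor_min:
            continue  # anchors are never reassigned
        assignment[frag.name] = None
        if not anchors or frag.year is None:
            continue
        nearest = min(anchors, key=lambda a: cos_dist(frag.centroid, a.centroid))
        close = cos_dist(frag.centroid, nearest.centroid) <= max_dist
        same_era = (
            nearest.year is not None
            and abs(frag.year - nearest.year) <= max_year_gap
        )
        if close and same_era:
            assignment[frag.name] = nearest.name
    return assignment
```

Because a fragment can only attach to an anchor and never to another fragment, there is no chain of small merges along which the year range could drift.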

### Discovering new identities in a mixed bucket

A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the
hand-sorted case: identities have to be discovered, not asserted, but should
not collide with already-known identities or scramble their numbering.

`work/cluster_osrc.py` is the worked example. The pipeline:

- **Filter cache to the source root**, including any byte-aliased path that
  resolves under it.
- **Drop already-covered faces** by comparing each candidate to the centroids
  of the existing canonical facesets at the `EXISTING_MATCH_THRESHOLD`
  (default 0.45 — same cutoff as `build_folders.py`'s osrc routing). These
  faces are already routed by `extend` / `build_folders.py` and shouldn't
  seed new facesets.
- **Cluster the unmatched** at cos-dist 0.55 (matches the `extend` default
  for the new-cluster phase).
- **Apply `refine`-equivalent gates** per cluster: `face_short`, `blur`,
  `det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for
  clusters of size ≥ 4. Keep clusters whose surviving unique-source-path
  count is ≥ `MIN_FACES`.
- **Number new facesets past the existing maximum** (`START_NNN`), so
  `faceset_001..NNN` are never disturbed.
- **Synthesize a refine manifest** and run `cmd_export_swap` against it,
  then move the resulting dirs into `facesets_swap_ready/` and append to the
  top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance
  marker.

Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new
source — the `cluster_osrc.py` step then operates against the canonical
cache and doesn't need `raw_full/` for input:

```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
#    person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
  --refine-out "$OUT/facesets_full"

# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
#    without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run

# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```

For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by
existing identities), this produced 6 new facesets (`faceset_020..025`,
sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to
export-swap's tighter `min_face_short=100` gate).
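
Two of the steps above, dropping already-covered faces and numbering new facesets past the existing maximum, can be sketched as follows. The function names and the dict-based inputs are hypothetical stand-ins for whatever `cluster_osrc.py` actually uses; only the 0.45 cutoff and the `faceset_NNN` naming come from this README.

```python
import re


def cos_dist(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def split_covered(faces, existing_centroids, threshold=0.45):
    """Split faces into those matching a known identity and the rest.

    faces:              {face_key: unit embedding}
    existing_centroids: {faceset_name: unit centroid}
    Returns ({face_key: faceset_name}, [unmatched face_keys]).
    """
    covered, unmatched = {}, []
    for key, vec in faces.items():
        best_name, best_dist = None, float("inf")
        for name, c in existing_centroids.items():
            d = cos_dist(vec, c)
            if d < best_dist:
                best_name, best_dist = name, d
        if best_name is not None and best_dist <= threshold:
            covered[key] = best_name
        else:
            unmatched.append(key)
    return covered, unmatched


def next_faceset_names(existing_names, count):
    """Allocate `count` names after the highest existing faceset_NNN.

    Names that are not exactly faceset_<digits> (for example `_thin`)
    are ignored when looking for the maximum.
    """
    nums = []
    for name in existing_names:
        m = re.fullmatch(r"faceset_(\d+)", name)
        if m:
            nums.append(int(m.group(1)))
    start = max(nums, default=0) + 1
    return [f"faceset_{start + i:03d}" for i in range(count)]
```

Only the unmatched keys would go on to clustering, so a face that already belongs to a known identity cannot seed a duplicate faceset.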

### Importing identities from a self-hosted Immich library

`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py`
together import an Immich library at scale, with the embed step running on
a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:

1. **`work/immich_stage.py` (WSL)** — pages every IMAGE asset via
   `/search/metadata`, fetches each asset's `/faces?id=` to read Immich's
   own ML-driven bboxes, scales each bbox to original-image coordinates,
   and prefilters by `face_short ≥ 90`. For survivors it downloads the
   original, sha256-deduplicates against the canonical `nl_full.npz` and
   against same-run staged files, and saves to
   `/mnt/x/src/immich/<user>/<rel>`. Writes a `queue.json` that the embed
   worker consumes. 8 concurrent worker threads run the full per-asset
   I/O chain (`/faces` → filter → `/original`) so 8 workers ≈ 8× the
   serial throughput.
2. **`work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`)** —
   loads `insightface.FaceAnalysis(buffalo_l)` with the
   `DmlExecutionProvider` and runs detection + landmarks + recognition
   over the queue. Produces a `.npz` cache with the same schema that
   `sort_faces.py:cmd_embed` writes, so the result is
   directly loadable by `load_cache()`. The cache already includes the
   post-`enrich` fields (`landmark_2d_106`, `landmark_3d_68`, `pose`)
   because FaceAnalysis returns them for free. AMD Vega gives ~7.5×
   real-pipeline speedup over CPU.
3. **`work/cluster_immich.py` (WSL)** — mirrors `cluster_osrc.py`'s
   shape but reads from `immich_<user>.npz`. Builds existing-identity
   centroids from every canonical `faceset_NNN/` in
   `facesets_swap_ready/` (skipping era splits and `_thin/`), drops
   immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55,
   applies refine gates, numbers new facesets past the existing maximum,
   and feeds `cmd_export_swap` via a synthetic manifest.

`work/finalize_immich.sh <user>` chains queue → Windows embed → cache
copy back → cluster_immich, with logging.
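
The bbox scaling and `face_short ≥ 90` prefilter from step 1 can be sketched as below. The dictionary keys (`boundingBoxX1`, `imageWidth`, and so on) are an assumption about the shape of Immich's `/faces` response, and the helper names are hypothetical; check both against the actual API payload and `immich_stage.py` before relying on them.

```python
def scale_bbox(face, orig_w, orig_h):
    """Scale a face bbox from the detection image to original-image pixels.

    `face` is assumed to carry the bbox corners plus the width/height of
    the image the detector ran on.
    """
    sx = orig_w / face["imageWidth"]
    sy = orig_h / face["imageHeight"]
    return (
        face["boundingBoxX1"] * sx,
        face["boundingBoxY1"] * sy,
        face["boundingBoxX2"] * sx,
        face["boundingBoxY2"] * sy,
    )


def face_short(bbox):
    """Short edge of a bbox given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    return min(x2 - x1, y2 - y1)


def has_big_face(faces, orig_w, orig_h, min_short=90):
    """True if at least one face is large enough to be worth downloading."""
    return any(
        face_short(scale_bbox(f, orig_w, orig_h)) >= min_short for f in faces
    )
```

Filtering on the server-reported boxes means an asset without a usable face is skipped before its original is ever downloaded.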

The Immich admin API key + base URL come from environment variables:

```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=...                # admin or per-user key
python work/immich_stage.py --user peter --workers 8
bash   work/finalize_immich.sh peter
```

For the 2026-04-26 run against `https://fotos.computerliebe.org` (Immich
v2.7.2), with the admin API key:

| step | result |
|------|------|
| stage | 53,842 assets seen, **10,261 staged** (~10 GB), 978 byte-deduped against `nl_full.npz`, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
| Windows DML embed | 19,462 face records + 1 noface in **64.6 min** (2.6 img/s end-to-end) |
| matched existing identities | **8,103 of 19,480 (42%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → **185 emitted** as `faceset_026..264` (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |

A second 2026-04-26 run with **nic's per-user API key** confirmed the
expected per-key scoping: 25,777 of nic's IMAGE assets were enumerated (matching
her `/server/statistics` count of 25,786, off by 9 ≈ the transient errors
that didn't get marked seen), **7,834 staged** (30% face-bearing-with-big-face,
denser than peter's 19%), 519 byte-deduped vs `nl_full.npz`, **0 internal
byte-duplicates** (cleaner library than peter's 2,976), 54 transient errors.

Embed + cluster on the nic queue:

| step | result |
|------|------|
| Windows DML embed | 15,627 face records + 1 noface in **59 min** (2.2 img/s end-to-end), 7 load errors |
| matched existing identities | **6,770 of 15,627 (43%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+3,261), faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408) |
| new clusters | 3,787 at threshold 0.55 → 129 surviving refine gates → **95 emitted** as `faceset_265..NNN` (gaps where export-swap's 0.45 outlier threshold dropped clusters below the export bar) |

Top-level `facesets_swap_ready/manifest.json` after both Immich runs:
**311 substantive facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted +
6 era splits + 6 osrc-discovered + 185 peter-Immich + 95 nic-Immich) +
68 thin_eras under `_thin/`.

`work/immich_stage.py` carries a built-in **outage circuit breaker**:
after 12 consecutive HTTP errors it probes Immich; if that probe also
fails, the script exits cleanly with code 2, state preserved. This made
the nic run survive a mid-stage Immich outage — the script paused, the
operator confirmed connectivity was back, and the same command resumed
from the saved `state.json` without re-fetching what was already done.
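
A minimal sketch of such a circuit breaker, assuming the 12-error limit and exit code 2 described above. The `OutageBreaker` class and `run` loop are hypothetical, HTTP failures are modelled as `OSError`, and resetting the counter when the probe succeeds is an assumption about behavior this README does not spell out.

```python
class OutageBreaker:
    """Trip after `limit` consecutive errors if a health probe also fails."""

    def __init__(self, probe, limit=12):
        self.probe = probe          # callable returning True if the server is up
        self.limit = limit
        self.consecutive = 0

    def record_success(self):
        self.consecutive = 0

    def record_error(self):
        """Return True when the run should stop."""
        self.consecutive += 1
        if self.consecutive < self.limit:
            return False
        if self.probe():
            # Assumption: a healthy probe means the errors were per-asset.
            self.consecutive = 0
            return False
        return True


def run(assets, fetch, breaker, state, save_state):
    """Process assets; return 0 when done, 2 if the breaker tripped."""
    for asset in assets:
        if asset in state["seen"]:
            continue                # resumed run: skip finished work
        try:
            fetch(asset)
        except OSError:
            if breaker.record_error():
                save_state(state)   # preserve progress before exiting
                return 2
            continue
        breaker.record_success()
        state["seen"].add(asset)
    save_state(state)
    return 0
```

Because failed assets are never added to `state["seen"]`, rerunning the same command after an outage retries them and skips everything that already succeeded.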

**Important caveats for Immich v2.7.2**:
- The `userIds` filter on `/search/metadata` is **silently ignored** when
  the API key is bound to a different user. The "import everything the
  API key can see" semantics are what you actually get; cross-user
  isolation is enforced server-side.
- `/server/statistics` reports counts that under-count what
  `/search/metadata` actually returns (e.g. external library
  thumbnail-dirs that got indexed because the import path included them).
  Don't trust the statistics number as a denominator.
- A meaningful fraction of `originalPath`-based assets are *Immich's own
  thumbnails* (`<library_root>/thumbs/.../-preview.jpeg`) — included if
  the external library's import path covers the thumbs directory and the
  exclusion patterns don't list `**/thumbs/**`. For our run, 5,563 of
  10,261 staged were thumbnails. They embed and cluster fine but the
  resulting faces are lower-resolution.

## Key defaults

`refine`:

| flag                    | default | meaning |
|-------------------------|--------:|---------|
| `--initial-threshold`   | 0.55    | cosine distance for stage-1 clustering |
| `--merge-threshold`     | 0.40    | centroid-level merge of over-split clusters |
| `--outlier-threshold`   | 0.55    | drop face if cosine dist from centroid exceeds (only if cluster ≥ 4) |
| `--min-faces`           | 15      | minimum unique images per faceset |
| `--min-short`           | 90      | minimum short-edge pixels of face bbox |
| `--min-blur`            | 40.0    | Laplacian-variance blur gate |
| `--min-det-score`       | 0.6     | InsightFace detector score gate |
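
To make the three per-face gates concrete, here is a small sketch using the defaults from the table. The `FaceRecord` type and `passes_gate` function are hypothetical, and treating each threshold as inclusive (`>=`) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FaceRecord:
    face_short: float   # short edge of the face bbox, in pixels
    blur: float         # Laplacian variance; higher means sharper
    det_score: float    # InsightFace detector confidence


def passes_gate(face, min_short=90, min_blur=40.0, min_det_score=0.6):
    """True if the face clears all three quality gates."""
    return (
        face.face_short >= min_short
        and face.blur >= min_blur
        and face.det_score >= min_det_score
    )
```

A face with a 95 px short edge would pass here but fail the same check with `min_short=100`, which is how a cluster can survive `refine` and still lose faces at export time.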

`export-swap`:

| flag                          | default | meaning |
|-------------------------------|--------:|---------|
| `--top-n`                     | 30      | size of the `<faceset>_topN.fsz` bundle |
| `--outlier-threshold`         | 0.45    | tighter than refine; trims cluster boundary for averaging |
| `--pad-ratio`                 | 0.5     | padding around face bbox for PNG crop |
| `--out-size`                  | 512     | PNG output is square `out_size × out_size` |
| `--min-face-short`            | 100     | export gate; stricter than refine's 90 |
| `--candidates`                | off     | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` | 0.55    | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score`       | 0.40    | composite-quality floor for candidates |

The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each normalized to `[0, 1]`.
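
As a worked example of that formula, the sketch below computes the weighted sum and ranks faces for a top-N bundle. Only the weights come from this README; the function names are hypothetical, and how each component is normalized to `[0, 1]` is left out because it is not described here.

```python
WEIGHTS = {
    "frontality": 0.30,
    "det_score": 0.20,
    "landmark_symmetry": 0.20,
    "face_size": 0.15,
    "sharpness": 0.15,
}


def composite_quality(components):
    """Weighted sum of pre-normalized components, each expected in [0, 1]."""
    for name in WEIGHTS:
        value = components[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside [0, 1]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def rank_faces(faces, n=30):
    """Return the keys of the n best faces, highest score first.

    faces: {face_key: {component_name: normalized value}}
    """
    ranked = sorted(faces, key=lambda k: composite_quality(faces[k]), reverse=True)
    return ranked[:n]
```

Since the weights sum to 1, the score itself stays in `[0, 1]`, so a floor such as `--candidate-min-score 0.40` reads directly against it.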

## Downstream: roop-unleashed

The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.

Highly recommended at swap time: enable **Select post-processing = GFPGAN** with **Original/Enhanced image blend ratio = 0.85** (the default of 0.65 is conservative). See `docs/analysis/facesets-downstream-refinement-evaluation.md` for the full evaluation.

## Layout

```
/opt/face-sets/
├─ README.md                                     (this file)
├─ sort_faces.py                                 (the tool)
├─ docs/
│  └─ analysis/
│     └─ facesets-downstream-refinement-evaluation.md
└─ work/                                         (gitignored except force-tracked .py / .sh)
   ├─ build_folders.py                           (hand-sorted-folder orchestration)
   ├─ check_faceset001_age.py                    (age-split readiness probe)
   ├─ age_split_001.py                           (age-split orchestration; faceset_001)
   ├─ cluster_osrc.py                            (mixed-bucket identity discovery)
   ├─ immich_stage.py                            (Immich library staging, parallel)
   ├─ embed_worker.py                            (Windows DML embed worker, runs from C:\face_embed_venv\)
   ├─ cluster_immich.py                          (Immich identity discovery + export)
   ├─ finalize_immich.sh                         (chains queue → embed → cluster)
   ├─ synthetic_*_manifest.json                  (per-run synthetic refine manifests)
   ├─ immich/
   │  ├─ users.json                              (label -> userId map; gitignored)
   │  └─ <user>/{queue,state,aliases}.json       (per-user staging artifacts)
   ├─ cache/
   │  ├─ nl_full.npz                             (canonical cache + duplicates.json)
   │  ├─ immich_<user>.npz                       (per-user immich embeddings)
   │  └─ age_split_exif.json                     (path → EXIF-year cache)
   └─ logs/
      └─ *.log                                   (every long step writes here)
```