# face-sets
Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).
## Pipeline
`sort_faces.py` is a single-file CLI with seven subcommands:
| step | what it does |
|---|---|
| `embed` | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup. |
| `cluster` | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest. |
| `refine` | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`. |
| `dedup` | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`. |
| `extend` | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| `enrich` | Re-detect each cached face to persist `landmark_2d_106`, `landmark_3d_68`, pose (pitch/yaw/roll) into the cache. |
| `export-swap` | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |
## Design principles
- `embed` is resumable and incremental. It loads any existing cache at the target path and only hashes / embeds files it has not seen. Atomic flush every 50 new files, so a mid-run crash loses at most ~50 embeddings.
- Byte-identical duplicates are sha256-grouped at listing time. The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster`/`refine`/`export-swap`, so each on-disk location is represented.
- `safe_dst_name` always flattens the absolute path. This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- Caches and outputs stay out of git via `.gitignore`; defaults live under `work/`.
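The listing-time dedup and crash-safe flush can be sketched as follows (a minimal sketch with hypothetical helper names; the real cache layout is whatever `sort_faces.py` writes):

```python
import hashlib
import os

def sha256_file(path, chunk=1 << 20):
    """Stream-hash a file so large originals don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def group_by_hash(paths):
    """Group candidate files by content hash at listing time.

    The first path per hash (in sorted order) is the canonical file that
    gets embedded; the remaining paths become path_aliases in the cache.
    """
    groups = {}
    for p in sorted(paths):
        groups.setdefault(sha256_file(p), []).append(p)
    return {h: {"canonical": ps[0], "path_aliases": ps[1:]}
            for h, ps in groups.items()}

def atomic_flush(cache_path, payload):
    # Write-to-temp + rename: a crash mid-flush never corrupts the cache,
    # so an interrupted run loses at most the unflushed tail of new files.
    tmp = cache_path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
    os.replace(tmp, cache_path)
```

The same grouping is why re-embedding a moved or copied tree is nearly free: unchanged bytes hash to an already-known key and only gain an alias.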
## Typical end-to-end run
```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted

# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"

# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"

# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"

# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"

# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"

# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
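As a rough sketch of what the `cluster` step does under the hood, assuming the cache yields an `(n, d)` embedding matrix (the real command also writes folders, singleton handling, and a manifest), average-linkage agglomerative clustering on cosine distance looks like:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_embeddings(embs, threshold=0.55):
    """Average-linkage agglomerative clustering on cosine distance.

    embs: (n, d) embedding matrix, n >= 2. Returns an integer cluster
    label per row; singleton faces simply get their own label.
    """
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    dists = pdist(embs, metric="cosine")      # condensed cos-dist matrix
    z = linkage(dists, method="average")      # precomputed-distance tree
    return fcluster(z, t=threshold, criterion="distance")
```

Average linkage at a 0.55 cos-dist cut is also the default the `refine` and `extend` stages reuse, per the Key defaults table below.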
## Merging a new source into an existing result
```bash
# Embed the new source into the same cache (resumes from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"

# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"

# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
## Importing hand-sorted folders as identities
When source folders are already hand-sorted by person (one folder per identity), the clustering path is the wrong tool — the identity is asserted, not inferred. The orchestration script `work/build_folders.py` covers this case:

- For each trusted folder, it filters cache records that fall under it, builds an identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a mixed folder (e.g. `osrc/`) into every identity centroid within a tight cosine cutoff (default 0.45). A multi-identity photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and merges the new entries into the canonical `facesets_swap_ready/manifest.json` (existing facesets are left untouched).
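The two-pass centroid build can be sketched like this (a simplified stand-in for the script's actual logic; `identity_centroid` is a hypothetical name):

```python
import numpy as np

def identity_centroid(embs, pass_thresholds=(0.55, 0.45)):
    """Two-pass outlier rejection for a trusted folder.

    Each pass recomputes the centroid from the surviving members, then
    drops members whose cosine distance exceeds the (tightening)
    threshold. Bystander faces from group photos fall out here.
    Returns (centroid, boolean keep-mask over the input rows).
    """
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    keep = np.ones(len(embs), dtype=bool)
    for t in pass_thresholds:
        c = embs[keep].mean(axis=0)
        c /= np.linalg.norm(c)
        dist = 1.0 - embs @ c        # cosine distance of every face
        keep = dist <= t             # tighten membership for next pass
    return embs[keep].mean(axis=0), keep
```

Tightening 0.55 → 0.45 matters because a bystander can sit close enough to a contaminated first-pass centroid to survive a single cut.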
```bash
# Embed each hand-sorted folder + the mixed bucket; the cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
    python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done

# Bring landmarks/pose + the visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"

# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```
The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.) is the only thing to edit when adding more hand-sorted folders later.
## Splitting an identity by era (age sub-clustering)
Long-running source corpora produce identities that span 10+ years. The 2009 face and the 2024 face of the same person sit in the same cluster (correctly — same identity), but a single averaged embedding pulled from that cluster blurs across ages. For face-swap output that should target a specific period, the identity needs to be split by era after the identity is established.
`work/age_split_001.py` is a worked example for faceset_001 and a template for any other identity. The pipeline is:

- Probe first with `work/check_faceset001_age.py` — report intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with distinct year ranges, the identity is age-sortable.
- Seed centroid from the curated `facesets_swap_ready/faceset_001/` (the manifest provides face keys → cache rows).
- Wide recovery at cos-dist ≤ 0.55 against the seed under the original source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- Sub-cluster the survivors at cos-dist 0.35 (precomputed-distance agglomerative, average linkage).
- Anchor-based fragment assignment (not transitive merge — that caused year-drift): sub-clusters with size ≥ 20 are anchors; smaller fragments attach to the single nearest anchor only if both the centroid distance ≤ 0.40 AND the dominant EXIF year is within ±5 years. Fragments with no qualifying anchor remain standalone (and end up THIN-tagged downstream).
- EXIF year per source path with on-disk caching at `work/cache/age_split_exif.json` — the Windows-mount EXIF read is the slowest step, so re-runs after a parameter tweak are nearly instant.
- Per-era export mirrors `export-swap`: composite-quality rank, single-face square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`, human-readable `<era>.txt` marker. Eras with < 20 face records also drop a `THIN.txt` marker so they can be quarantined.
- Top-level manifest merge: era buckets are appended to `facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`), leaving only the substantive era buckets at the top level.
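The anchor-based fragment assignment rule reduces to a few lines (a sketch with hypothetical field names, not the script's actual data model):

```python
import numpy as np

def assign_fragments(anchors, fragments, dist_cut=0.40, year_cut=5):
    """anchors/fragments: dicts with 'centroid' (unit vector) and 'year'
    (dominant EXIF year). A fragment attaches to its single nearest
    anchor only if BOTH gates pass; otherwise it stays standalone and
    gets THIN-tagged downstream. Returns anchor index or None per fragment.
    """
    out = []
    for f in fragments:
        dists = [1.0 - float(f["centroid"] @ a["centroid"]) for a in anchors]
        i = int(np.argmin(dists))
        if dists[i] <= dist_cut and abs(f["year"] - anchors[i]["year"]) <= year_cut:
            out.append(i)        # attach to the one nearest anchor
        else:
            out.append(None)     # standalone fragment
    return out
```

Restricting each fragment to a single nearest anchor (rather than merging transitively) is what prevents chains of borderline fragments from dragging an era bucket across years.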
```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py

# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```
For the faceset_001 run on the 5,260-face `nl_full.npz`, this produced 6 substantive era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282) plus 68 thin/fragment buckets quarantined under `_thin/`.
## Discovering new identities in a mixed bucket
A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the hand-sorted case: identities have to be discovered, not asserted, but they should not collide with already-known identities or scramble their numbering.
`work/cluster_osrc.py` is the worked example. The pipeline:

- Filter the cache to the source root, including any byte-aliased path that resolves under it.
- Drop already-covered faces by comparing each candidate to the centroids of the existing canonical facesets at `EXISTING_MATCH_THRESHOLD` (default 0.45 — the same cutoff as `build_folders.py`'s osrc routing). These faces are already routed by `extend`/`build_folders.py` and shouldn't seed new facesets.
- Cluster the unmatched at cos-dist 0.55 (matches the `extend` default for the new-cluster phase).
- Apply `refine`-equivalent gates per cluster: `face_short`, `blur`, `det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for clusters of size ≥ 4. Keep clusters whose surviving unique-source-path count is ≥ `MIN_FACES`.
- Number new facesets past the existing maximum (`START_NNN`), so `faceset_001..NNN` are never disturbed.
- Synthesize a refine manifest and run `cmd_export_swap` against it, then move the resulting dirs into `facesets_swap_ready/` and append to the top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance marker.
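The already-covered drop is a single matrix comparison (a sketch; `drop_covered` is a hypothetical name, and the real script works on cache records rather than bare arrays):

```python
import numpy as np

def drop_covered(cands, centroids, threshold=0.45):
    """cands: (n, d) unit embeddings; centroids: (k, d) unit identity
    centroids. A candidate is 'covered' if its nearest centroid is
    within the cosine cutoff; covered faces are already routed and must
    not seed new facesets. Returns (surviving rows, covered mask).
    """
    dist = 1.0 - cands @ centroids.T        # (n, k) cosine distances
    covered = dist.min(axis=1) <= threshold
    return cands[~covered], covered
```

Keeping this cutoff equal to the routing cutoff (0.45) is what guarantees the two passes partition the faces instead of double-counting them.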
Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new source — the `cluster_osrc.py` step then operates against the canonical cache and doesn't need `raw_full/` for input:
```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
#    person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
    --refine-out "$OUT/facesets_full"

# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
#    without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run

# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```
For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by existing identities), this produced 6 new facesets (faceset_020..025, sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to export-swap's tighter `min_face_short=100` gate).
## Importing identities from a self-hosted Immich library
`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py` together import an Immich library at scale, with the embed step running on a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:

- `work/immich_stage.py` (WSL) — pages every IMAGE asset via `/search/metadata`, fetches each asset's `/faces?id=` to read Immich's own ML-driven bboxes, scales each bbox to original-image coordinates, and prefilters by `face_short ≥ 90`. For survivors it downloads the original, sha256-deduplicates against the canonical `nl_full.npz` and against same-run staged files, and saves to `/mnt/x/src/immich/<user>/<rel>`. Writes a `queue.json` that the embed worker consumes. 8 concurrent worker threads run the full per-asset I/O chain (`/faces` → filter → `/original`), so 8 workers ≈ 8× the serial throughput.
- `work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`) — loads `insightface.FaceAnalysis(buffalo_l)` with the `DmlExecutionProvider` and runs detection + landmarks + recognition over the queue. Produces a `.npz` cache that's bit-identical in schema to what `sort_faces.py:cmd_embed` writes, so the result is directly loadable by `load_cache()`. The cache already includes the post-`enrich` fields (`landmark_2d_106`, `landmark_3d_68`, `pose`) because FaceAnalysis returns them for free. AMD Vega gives ~7.5× real-pipeline speedup over CPU.
- `work/cluster_immich.py` (WSL) — mirrors `cluster_osrc.py`'s shape but reads from `immich_<user>.npz`. Builds existing-identity centroids from every canonical `faceset_NNN/` in `facesets_swap_ready/` (skipping era splits and `_thin/`), drops immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55, applies refine gates, numbers new facesets past the existing maximum, and feeds `cmd_export_swap` via a synthetic manifest.
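The staged fan-out can be sketched with stdlib threads (hypothetical `fetch_faces`/`fetch_original` callables standing in for the Immich HTTP calls; the real script also sha256-dedups and writes `queue.json`):

```python
from concurrent.futures import ThreadPoolExecutor

def stage_assets(assets, fetch_faces, fetch_original, min_short=90, workers=8):
    """Run the full per-asset I/O chain inside each worker thread, so a
    network-bound stage scales roughly linearly with worker count.

    fetch_faces(asset) -> list of {'w': ..., 'h': ...} bbox dicts
    fetch_original(asset) -> staged file path (only called for survivors)
    """
    def one(asset):
        faces = fetch_faces(asset)
        # Prefilter: skip the (expensive) original download unless at
        # least one bbox short edge clears the min_short gate.
        if not any(min(f["w"], f["h"]) >= min_short for f in faces):
            return None
        return fetch_original(asset)

    with ThreadPoolExecutor(max_workers=workers) as ex:
        return [r for r in ex.map(one, assets) if r is not None]
```

Threads (not processes) are the right fit here because each worker spends its time waiting on HTTP and disk, not on Python bytecode.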
`work/finalize_immich.sh <user>` chains queue → Windows embed → cache copy-back → `cluster_immich`, with logging.
The Immich admin API key + base URL come from environment variables:
```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=...   # admin or per-user key

python work/immich_stage.py --user peter --workers 8
bash work/finalize_immich.sh peter
```
For the 2026-04-26 run against https://fotos.computerliebe.org (Immich
v2.7.2), with the admin API key:
| step | result |
|---|---|
| stage | 53,842 assets seen, 10,261 staged (~10 GB), 978 byte-deduped against nl_full.npz, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
| Windows DML embed | 19,462 face records + 1 noface in 64.6 min (2.6 img/s end-to-end) |
| matched existing identities | 8,103 of 19,480 (42%) at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → 185 emitted as faceset_026..264 (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |
A second 2026-04-26 run with nic's per-user API key confirmed the
expected behavior: 25,777 of nic's IMAGE assets were enumerated (matching
her /server/statistics count of 25,786, off by 9 ≈ the transient errors
that didn't get marked seen), 7,834 staged (30% face-bearing-with-big-face,
denser than peter's 19%), 519 byte-deduped vs nl_full.npz, 0 internal
byte-duplicates (cleaner library than peter's 2,976), 54 transient errors.
Embed + cluster on the nic queue:
| step | result |
|---|---|
| Windows DML embed | 15,627 face records + 1 noface in 59 min (2.2 img/s end-to-end), 7 load errors |
| matched existing identities | 6,770 of 15,627 (43%) at cos-dist ≤ 0.45; biggest hits faceset_002 (+3,261), faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408) |
| new clusters | 3,787 at threshold 0.55 → 129 surviving refine gates → 95 emitted as faceset_265..NNN (gaps where export-swap's 0.45 outlier dropped clusters below the export bar) |
Top-level facesets_swap_ready/manifest.json after both Immich runs:
311 substantive facesets (12 auto-cluster nl/lzbkp + 7 hand-sorted +
6 era splits + 6 osrc-discovered + 185 peter-Immich + 95 nic-Immich) +
68 thin_eras under _thin/.
work/immich_stage.py carries a built-in outage circuit breaker:
after 12 consecutive HTTP errors it probes Immich; if that probe also
fails, the script exits cleanly with code 2, state preserved. This made
the nic run survive a mid-stage Immich outage — the script paused, the
operator confirmed connectivity was back, and the same command resumed
from the saved state.json without re-fetching what was already done.
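The breaker logic is small enough to sketch in full (a simplified stand-in, not the script's actual class):

```python
class CircuitBreaker:
    """After `limit` consecutive failures, run a probe request; if the
    probe also fails, signal a clean shutdown with state preserved
    instead of churning through the rest of the queue as errors."""

    def __init__(self, probe, limit=12):
        self.probe = probe          # callable: is the server reachable?
        self.limit = limit
        self.errors = 0             # consecutive failures seen so far

    def record(self, ok):
        self.errors = 0 if ok else self.errors + 1
        if self.errors >= self.limit and not self.probe():
            raise SystemExit(2)     # exit code 2: outage, resume later
```

Resetting the counter on any success keeps scattered transient errors (like the 54 in the nic run) from ever tripping the breaker.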
Important caveats for Immich v2.7.2:

- The `userIds` filter on `/search/metadata` is silently ignored when the API key is bound to a different user. The "import everything the API key can see" semantics are what you actually get; cross-user isolation is enforced server-side.
- `/server/statistics` reports counts that under-count what `/search/metadata` actually returns (e.g. external library thumbnail-dirs that got indexed because the import path included them). Don't trust the statistics number as a denominator.
- A meaningful fraction of `originalPath`-based assets are Immich's own thumbnails (`<library_root>/thumbs/.../-preview.jpeg`) — included if the external library's import path covers the thumbs directory and the exclusion patterns don't list `**/thumbs/**`. For our run, 5,563 of 10,261 staged were thumbnails. They embed and cluster fine but the resulting faces are lower-resolution.
## Key defaults
`refine`:

| flag | default | meaning |
|---|---|---|
| `--initial-threshold` | 0.55 | cosine distance for stage-1 clustering |
| `--merge-threshold` | 0.40 | centroid-level merge of over-split clusters |
| `--outlier-threshold` | 0.55 | drop a face if its cosine distance from the centroid exceeds this (only if cluster ≥ 4) |
| `--min-faces` | 15 | minimum unique images per faceset |
| `--min-short` | 90 | minimum short-edge pixels of the face bbox |
| `--min-blur` | 40.0 | Laplacian-variance blur gate |
| `--min-det-score` | 0.6 | InsightFace detector score gate |
`export-swap`:

| flag | default | meaning |
|---|---|---|
| `--top-n` | 30 | size of the `<faceset>_topN.fsz` bundle |
| `--outlier-threshold` | 0.45 | tighter than refine; trims the cluster boundary for averaging |
| `--pad-ratio` | 0.5 | padding around the face bbox for the PNG crop |
| `--out-size` | 512 | PNG output is square, out_size × out_size |
| `--min-face-short` | 100 | export gate; stricter than refine's 90 |
| `--candidates` | off | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` | 0.55 | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score` | 0.40 | composite-quality floor for candidates |
The composite quality score in export-swap is 0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness, each normalized to [0, 1].
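Restated as code (assuming the five component scores are already normalized to [0, 1]):

```python
def composite_quality(frontality, det_score, landmark_symmetry,
                      face_size, sharpness):
    """Weighted composite quality score used to rank export crops.

    All five inputs are pre-normalized to [0, 1]; the weights sum to 1.0,
    so the composite also lands in [0, 1].
    """
    return (0.30 * frontality
            + 0.20 * det_score
            + 0.20 * landmark_symmetry
            + 0.15 * face_size
            + 0.15 * sharpness)
```

Frontality carrying the largest weight matches the export goal: a near-frontal, sharp crop averages into a cleaner identity embedding than a large but profiled face.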
## Post-export corpus maintenance
The `sort_faces.py` pipeline above produces `facesets_swap_ready/`. Five orchestration scripts under `work/` operate on that already-built corpus to clean it up over time:
| script | purpose |
|---|---|
| `work/filter_occlusions.py` (+ Windows `work/clip_worker.py`) | Drop PNGs of masked / sunglassed faces using open_clip ViT-L-14/dfn2b_s39b zero-shot scoring. Image-level threshold 0.7; faceset-level quarantine at 40% domain dominance. WSL stages a queue, Windows DML scores, WSL applies. See docs/analysis/clip-occlusion-filter.md. |
| `work/consolidate_facesets.py` | Merge duplicate identities (centroid cosine sim ≥ 0.55, confident at ≥ 0.65, complete-linkage to defeat single-link chaining). Pulls embeddings from the cache; no GPU. See docs/analysis/identity-consolidation-and-age-extend.md. |
| `work/age_extend_001.py` | Slot newly-added PNGs into the existing era buckets of faceset_001 (anchor cosine distance ≤ 0.40 AND an EXIF-year gate, mirroring the age-split anchor rule). |
| `work/dedup_optimize.py` (+ Windows `work/multiface_worker.py`) | (a) cross-family SHA256 byte-dedup, (b) within-faceset near-dup at cosine sim ≥ 0.95, (c) multi-face audit (re-detect via insightface, drop PNGs with face_count ≠ 1). The single-face invariant is load-bearing for roop. See docs/analysis/dedup-and-roop-optimization.md. |
| `work/video_target_pipeline.py` (+ Windows `work/video_face_worker.py` + `work/run_video_pipeline.sh` chain) | Target-side preprocessing: scan a folder of videos → PySceneDetect shot-cuts → 2 fps frame sampling → DML face detection + embedding → IoU + embedding tracking → quality-gated segments (yaw ≤ 75°, face ≥ 80 px, det ≥ 0.5, ≥ 70% pass-rate, 1–120 s duration, 2 s cross-track merge gap) → ffmpeg stream-copy into UUID-named clips. Output is organized into per-source subfolders. Provenance sidecars are opt-in (`cut --write-sidecar` or the `SIDECAR=yes` env var); the full plan is always retained in the per-batch plan.json. See docs/analysis/video-target-preprocessing.md. |
All five operate idempotently and reversibly: dropped PNGs go to `<faceset>/faces/_dropped/`, and quarantined whole facesets go to `facesets_swap_ready/_masked/` or `_merged/` (parallel to the existing `_thin/`). The master `manifest.json` partitions entries across `facesets[]`, `masked[]`, `thin_eras[]`, and `merged[]` arrays, plus per-run provenance blocks (`occlusion_filter_run`, `merge_run`, `age_extend_runs`, `dedup_runs`, `multiface_runs`).
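The quarantine move can be sketched as (hypothetical helper; the real scripts also record the per-run provenance blocks):

```python
import json
import os
import shutil

def quarantine(manifest_path, name, bucket="masked"):
    """Move a faceset entry from facesets[] into masked[]/merged[]/... and
    relocate its directory into the matching _<bucket>/ sibling.
    A re-run with the same name is a no-op (returns False)."""
    with open(manifest_path) as f:
        m = json.load(f)
    entry = next((e for e in m["facesets"] if e["name"] == name), None)
    if entry is None:
        return False                      # already quarantined: idempotent
    m["facesets"].remove(entry)
    m.setdefault(bucket, []).append(entry)

    root = os.path.dirname(manifest_path)
    src = os.path.join(root, name)
    dst_dir = os.path.join(root, "_" + bucket)
    os.makedirs(dst_dir, exist_ok=True)
    if os.path.isdir(src):
        shutil.move(src, os.path.join(dst_dir, name))

    tmp = manifest_path + ".tmp"          # atomic manifest rewrite
    with open(tmp, "w") as f:
        json.dump(m, f, indent=2)
    os.replace(tmp, manifest_path)
    return True
```

Reversal is the mirror image: move the directory back and shift the entry from the bucket array back into `facesets[]`.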
## Downstream: roop-unleashed
The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.

Highly recommended at swap time: enable Select post-processing = GFPGAN with the Original/Enhanced image blend ratio = 0.85 (the 0.65 default is conservative). See docs/analysis/facesets-downstream-refinement-evaluation.md for the full evaluation.
## Layout
```
/opt/face-sets/
├─ README.md                    (this file)
├─ sort_faces.py                (the tool)
├─ docs/
│  └─ analysis/
│     └─ facesets-downstream-refinement-evaluation.md
└─ work/                        (gitignored except force-tracked .py / .sh)
   ├─ build_folders.py          (hand-sorted-folder orchestration)
   ├─ check_faceset001_age.py   (age-split readiness probe)
   ├─ age_split_001.py          (age-split orchestration; faceset_001)
   ├─ age_extend_001.py         (extends existing era buckets with new PNGs)
   ├─ cluster_osrc.py           (mixed-bucket identity discovery)
   ├─ immich_stage.py           (Immich library staging, parallel)
   ├─ embed_worker.py           (Windows DML embed worker; C:\face_embed_venv\)
   ├─ cluster_immich.py         (Immich identity discovery + export)
   ├─ finalize_immich.sh        (chains queue → embed → cluster)
   ├─ filter_occlusions.py      (CLIP zero-shot mask + sunglasses filter)
   ├─ clip_worker.py            (Windows DML CLIP worker; C:\clip_dml_venv\)
   ├─ consolidate_facesets.py   (duplicate-identity merger; complete-linkage)
   ├─ dedup_optimize.py         (byte + near-dup + multi-face audit driver)
   ├─ multiface_worker.py       (Windows DML multi-face audit worker)
   ├─ video_target_pipeline.py  (video → swappable segment cuts orchestration)
   ├─ video_face_worker.py      (Windows DML per-frame face worker; JSONL append-only)
   ├─ run_video_pipeline.sh     (generic chain driver: scenes → stage → worker → cut)
   ├─ status_video_pipeline.sh  (status helper for any video_pipeline log)
   ├─ synthetic_*_manifest.json (per-run synthetic refine manifests)
   ├─ immich/
   │  ├─ users.json             (label -> userId map; gitignored)
   │  └─ <user>/{queue,state,aliases}.json (per-user staging artifacts)
   ├─ cache/
   │  ├─ nl_full.npz            (canonical cache + duplicates.json)
   │  ├─ immich_<user>.npz      (per-user immich embeddings)
   │  └─ age_split_exif.json    (path → EXIF-year cache)
   └─ logs/
      └─ *.log                  (every long step writes here)
```