Previously every video_target_pipeline cut wrote a <uuid>.json provenance
sidecar alongside each <uuid>.mp4. The same provenance is already in the
per-batch plan.json, so the per-clip sidecars are redundant unless a
downstream tool wants each clip self-describing in isolation.
- video_target_pipeline.py cut: new --write-sidecar flag, default off.
- run_video_pipeline.sh: new SIDECAR env var (default "no"), passes
--write-sidecar when SIDECAR=yes.
- README + docs/analysis/video-target-preprocessing.md updated.
The 1,984 already-emitted sidecars in /mnt/x/src/vd/ct/ct_src_*/ have
been deleted (1.5 MB).
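
A minimal sketch of the opt-in flag, assuming the cut subcommand is parsed
with argparse. Only --write-sidecar comes from this commit; the parser layout
and the build_cut_parser name are hypothetical.

```python
import argparse


def build_cut_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real cut subcommand has more options.
    parser = argparse.ArgumentParser(prog="video_target_pipeline.py cut")
    parser.add_argument(
        "--write-sidecar",
        action="store_true",  # default off: plan.json already holds provenance
        help="also write <uuid>.json next to each <uuid>.mp4",
    )
    return parser


args_default = build_cut_parser().parse_args([])
args_enabled = build_cut_parser().parse_args(["--write-sidecar"])
```

With no flag, args_default.write_sidecar is False, so sidecars are only
written when a caller asks for them.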
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Preprocesses a folder of video files into UUID-named clips suitable as
target inputs for roop-unleashed-style face-swap. Counterpart to the
faceset (source-side) tooling.
work/video_target_pipeline.py — orchestration with subcommands
scan / scenes / stage / merge / track / score / cut / report. Quality
gates default to permissive values, on the assumption that the facesets
can handle side profiles (yaw<=75°, pitch<=45°, face_short>=80px,
det>=0.5). Cross-track segment merge
fuses adjacent-in-time tracks within the same scene up to 2s gap.
Output organized into <output_dir>/<source_stem>/<uuid>.mp4 +
<uuid>.json sidecar with full provenance.
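
The cross-track merge rule can be sketched roughly as below. This is an
illustration under stated assumptions, not the code in
video_target_pipeline.py: the Segment type, its fields, and merge_segments are
hypothetical names; only the same-scene condition and the 2s gap come from the
description above.

```python
from dataclasses import dataclass

MAX_GAP_S = 2.0  # from the commit: fuse tracks separated by up to 2s


@dataclass
class Segment:
    scene: int
    start: float  # seconds
    end: float    # seconds
    tracks: int = 1  # number of face tracks fused into this segment


def merge_segments(segments, max_gap=MAX_GAP_S):
    """Fuse time-adjacent segments that belong to the same scene."""
    merged = []
    for seg in sorted(segments, key=lambda s: (s.scene, s.start)):
        last = merged[-1] if merged else None
        if last and last.scene == seg.scene and seg.start - last.end <= max_gap:
            last.end = max(last.end, seg.end)
            last.tracks += seg.tracks
        else:
            # Copy so the caller's input objects are never mutated.
            merged.append(Segment(seg.scene, seg.start, seg.end, seg.tracks))
    return merged
```

Segments in different scenes are never fused, even when the gap between them
is under the limit.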
work/video_face_worker.py — Windows DML face detect+embed worker. Uses
JSONL append-only for results.jsonl: a critical perf fix (re-
serializing the monolithic 245MB results.json on every flush was the
dominant cost in the first attempt, dropping throughput to 0.5 fps).
Append-only got it to 13+ fps, ~7.5 fps cumulative across the first
6.18h batch. Also uses seek-once-per-video + sequential cap.grab()
between samples to dodge cv2 per-sample seek pathology on long H.264.
Legacy results.json is auto-migrated to .jsonl on first load.
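
A sketch of the append-only pattern, assuming the legacy results.json holds a
JSON list of records (an assumption; the real schema may differ). The function
names are illustrative, not the worker's actual API.

```python
import json
import os


def append_result(jsonl_path, record):
    # One JSON object per line; cost is proportional to the record,
    # not to the size of the whole results file.
    with open(jsonl_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_results(jsonl_path, legacy_path=None):
    # One-time migration: assumed legacy layout is a JSON list of records.
    if legacy_path and os.path.exists(legacy_path) and not os.path.exists(jsonl_path):
        with open(legacy_path, encoding="utf-8") as fh:
            for record in json.load(fh):
                append_result(jsonl_path, record)
    if not os.path.exists(jsonl_path):
        return []
    with open(jsonl_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Each flush then writes only the new records instead of re-serializing
everything accumulated so far.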
work/run_video_pipeline.sh — generic chain driver, parameterized via
WORK / INPUT_DIR / OUTPUT_DIR / FILTER_FROM / SKIP_PATTERN / MAX_DUR /
IDENTITY env vars. work/status_video_pipeline.sh — generic status
helper.
First production batch (ct_src_00050..00062, 13 files, 6.18h input):
600 emitted segments, 239.5min accepted content (64.6% of input), 254
segments built from >=2 tracks (cross-track merge), 1h43m wall clock.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:
- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores
via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level
quarantine at 40% domain dominance.
- consolidate_facesets.py: duplicate-identity merger using complete-linkage
centroid clustering on cached arcface embeddings. Single-linkage chains
catastrophically (60-faceset clusters with min sim < 0); complete-linkage
guarantees within-group sim >= edge.
- age_extend_001.py: slots newly-added PNGs into existing era buckets of
faceset_001 using the same anchor-fragment rule as age_split_001.py
(dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.
- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
passes — cross-family SHA256 byte-dedup (preserves intra-family era
duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
on AMD Vega — ~7x embed_worker because input is 512x512 crops.
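
Why single-linkage chains and complete-linkage does not can be shown with a
toy sketch. It works on a precomputed similarity matrix in pure Python; the
agglomerate function and its parameters are hypothetical and stand in for
whatever consolidate_facesets.py actually does.

```python
def agglomerate(sim, threshold, linkage=min):
    """Greedy agglomerative clustering on a similarity matrix.

    linkage=min is complete-linkage (weakest pair decides),
    linkage=max is single-linkage (strongest pair decides).
    """
    clusters = [[i] for i in range(len(sim))]
    while True:
        best, pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = linkage(sim[i][j] for i in clusters[x] for j in clusters[y])
                if score >= best:
                    best, pair = score, (x, y)
        if pair is None:
            return clusters
        x, y = pair
        clusters[x] += clusters[y]
        del clusters[y]  # y > x, so index x stays valid


# A~B and B~C look alike, but A and C are clearly different people.
sim = [
    [1.0, 0.9, -0.1],
    [0.9, 1.0, 0.8],
    [-0.1, 0.8, 1.0],
]
complete = agglomerate(sim, threshold=0.5, linkage=min)
single = agglomerate(sim, threshold=0.5, linkage=max)
```

Here single-linkage pulls all three items into one group through the B
bridge, while complete-linkage keeps C apart because its similarity to A is
below the threshold.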
Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes
preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full
reversibility. Master manifest gains masked[], merged[], plus per-run
provenance blocks.
Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Overnight 2026-04-27 nic finalize completed. Per-user API key worked as
expected. The pipeline survived one mid-stage Immich outage via the
circuit breaker added in 62dba3d -- script paused, operator confirmed
connectivity, same command resumed from saved state.json.
Embed (Windows DML): 7,834 images -> 15,627 face records + 1 noface in
59 minutes (2.2 img/s end-to-end).
Cluster: 6,770 of 15,627 faces (43%) matched existing canonical
identities at cos-dist <= 0.45; biggest hits faceset_002 (+3,261),
faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408). The
faceset_008 and faceset_007 hits are noteworthy cross-matches: those
are hand-sorted "sab" and "s" identities, recurring frequently in nic's
library.
The 8,857 unmatched faces formed 3,787 raw clusters at threshold 0.55;
129 survived the refine gates and 95 were emitted as new facesets at
faceset_265+.
Top-level facesets_swap_ready/manifest.json: 216 -> 311 substantive
facesets + 68 thin_eras unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
work/immich_stage.py:
- Startup probe of /server/version (exit 2 if unreachable).
- Outage circuit breaker: after OUTAGE_FAIL_STREAK=12 consecutive
faces_error/download_error results, run a quick probe; if the probe
also fails, persist state and exit with code 2 so a long unattended
run can pause rather than silently churning through tens of thousands
of retries during an upstream outage. Resume by re-running the same
command -- state.json + queue.json are intact.
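
The breaker logic, sketched under assumptions: the OutageBreaker class, its
probe callback, and the reset-on-successful-probe behavior are illustrative;
only OUTAGE_FAIL_STREAK=12, the two error result names, and exit code 2 come
from the description above.

```python
OUTAGE_FAIL_STREAK = 12  # from the commit
FAILURE_RESULTS = {"faces_error", "download_error"}


class OutageBreaker:
    def __init__(self, probe, limit=OUTAGE_FAIL_STREAK):
        self.probe = probe  # callable returning True if the server answers
        self.limit = limit
        self.streak = 0

    def record(self, result):
        """Return True when the caller should persist state and exit."""
        if result not in FAILURE_RESULTS:
            self.streak = 0
            return False
        self.streak += 1
        if self.streak < self.limit:
            return False
        if self.probe():
            # Assumption: a healthy probe means the failures were per-asset,
            # so the streak starts over.
            self.streak = 0
            return False
        return True
```

A caller would save state.json and queue.json and then call sys.exit(2)
whenever record() returns True.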
README:
- Document the nic run (per-user API key necessary; second pipeline
invocation confirmed expected behavior; cleaner library than peter's
with 0 internal byte-dupes vs 2,976).
- Mention the circuit breaker as the mechanism that keeps long
unattended runs safe under the known Tailscale flicker pattern at
this site.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three-piece workflow that imports a self-hosted Immich library and emits
new facesets without disturbing existing identity numbering:
- work/immich_stage.py (WSL): pages /search/metadata, parallel-fetches
/faces?id= per asset, prefilters by face_short>=90 against bbox scaled
to original-image coords, downloads originals, sha256-dedups against
nl_full.npz and same-run staged files. 8-worker ThreadPoolExecutor
doing the full /faces->filter->/original chain per asset; resumable
via state.json. API URL + key come from IMMICH_URL / IMMICH_API_KEY
env vars, label->UUID map from work/immich/users.json (gitignored).
- work/embed_worker.py (Windows venv at C:\face_embed_venv): runs
insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider on
AMD Radeon Vega via onnxruntime-directml. Produces a cache file in
the same .npz schema as sort_faces.cmd_embed (loadable via
load_cache). ~7.5x speedup over CPU end-to-end; embeddings match the CPU
output to four decimals (cosine similarity 1.0000 across 8 sample faces).
- work/cluster_immich.py (WSL): mirrors cluster_osrc.py against an
immich_<user>.npz. Builds existing identity centroids from canonical
faceset_NNN/ in facesets_swap_ready/, drops matches at <=0.45,
clusters the rest at 0.55, applies refine gates, hands off to
cmd_export_swap. Numbers new facesets past the existing maximum.
- work/finalize_immich.sh: chains queue->Windows embed->cache copy->
cluster_immich, with logging.
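
The face_short prefilter in immich_stage.py can be sketched as follows,
assuming the server reports each bounding box together with the dimensions of
the image it was detected on. The function and parameter names are
hypothetical; only the 90px threshold and the scaling to original-image
coordinates come from the description above.

```python
MIN_FACE_SHORT = 90  # from the commit: face_short>=90


def face_short_in_original(bbox, bbox_frame, original):
    """Shorter bbox side in pixels after scaling to the original image.

    bbox:       (x1, y1, x2, y2) in the coordinates of the detection image
    bbox_frame: (width, height) of the detection image
    original:   (width, height) of the original asset
    """
    x1, y1, x2, y2 = bbox
    scale_x = original[0] / bbox_frame[0]
    scale_y = original[1] / bbox_frame[1]
    return min((x2 - x1) * scale_x, (y2 - y1) * scale_y)


def passes_prefilter(bbox, bbox_frame, original, min_short=MIN_FACE_SHORT):
    return face_short_in_original(bbox, bbox_frame, original) >= min_short
```

Filtering on the scaled size lets the stage skip the download of originals
whose faces would be too small anyway.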
The 2026-04-26 run on https://fotos.computerliebe.org (Immich v2.7.2)
processed 53,842 admin-accessible assets, staged 10,261, embedded
19,462 face records on Vega DML in 64.6 min, matched 8,103 (42%) to
existing identities, and emitted 185 new facesets (faceset_026..264
with gaps). facesets_swap_ready/ went from 31 to 216 substantive
facesets.
Important caveat surfaced: /search/metadata's userIds filter is
silently ignored when the API key is bound to a different user, so
this run can't enumerate other users' libraries from the admin key.
A per-user API key would be required for nic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
work/cluster_osrc.py mirrors build_folders.py's shape (synthesize a
refine_manifest, hand off to cmd_export_swap, relocate, merge top-level
manifest) but discovers identities by clustering rather than asserting
them by folder. Drops faces already covered by existing identity
centroids, clusters the rest at 0.55, applies refine-equivalent gates
with min_faces=6, numbers new facesets past the existing maximum so
faceset_001..NNN are never disturbed.
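
A pure-Python sketch of the drop-then-number steps, assuming cosine distance
against per-identity centroids and the 0.45 match threshold used by the
sibling cluster_immich.py. All function names are illustrative, and at least
one existing centroid is assumed.

```python
import math

MATCH_DIST = 0.45  # assumed to match the threshold used by cluster_immich.py


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def split_unmatched(faces, centroids, max_dist=MATCH_DIST):
    """Assign each face to its nearest centroid, or leave it for clustering."""
    matched, unmatched = {}, []
    for idx, emb in enumerate(faces):
        name, dist = min(
            ((n, cosine_distance(emb, c)) for n, c in centroids.items()),
            key=lambda item: item[1],
        )
        if dist <= max_dist:
            matched[idx] = name
        else:
            unmatched.append(idx)
    return matched, unmatched


def next_faceset_names(existing, count):
    """Number new facesets past the existing maximum; never reuse a slot."""
    start = max(int(name.rsplit("_", 1)[1]) for name in existing) + 1
    return [f"faceset_{i:03d}" for i in range(start, start + count)]
```

With existing facesets up to faceset_019, next_faceset_names starts at
faceset_020, which is consistent with the run described below.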
The 2026-04-26 run on /mnt/x/src/osrc produced faceset_020..025 (sizes
4-26 exported PNGs); analysis writeup in docs/analysis/.
README also notes the refine-renumbers caveat in passing — extend +
orchestration script is the safe pattern; cmd_refine is for fresh
clusters only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- README: document work/build_folders.py (hand-sorted folder identities)
and the new age-split workflow for splitting a long-running identity
into era-specific facesets after clustering.
- Force-track work/age_split_001.py and work/check_faceset001_age.py;
these are the worked example + readiness probe for faceset_001 and
the template for splitting any other identity by EXIF era.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.md now covers all seven subcommands (embed, cluster, refine, dedup,
extend, enrich, export-swap), an end-to-end pipeline recipe, the delta
recipe for merging a new source into an existing result, the quality-
weight formula used by export-swap, and the GFPGAN blend recommendation
at swap time (0.85, overriding roop-unleashed's 0.65 default).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- embed: sha256-based dedup at listing (embed each unique hash once, carry
other paths as aliases via a top-level path_aliases dict); resumable from
any existing cache; atomic incremental flush every 50 files; explicit
skip-ext filtering; schema bumped with processed_paths + path_aliases.
- extend: new subcommand that merges new embeddings into an existing raw +
facesets output without renumbering. Nearest person-centroid match above
threshold, unmatched faces re-clustered into new person_NNN / _singletons.
Optional --refine-out also extends facesets by centroid + quality gate.
- dedup: new subcommand producing byte-identical + visual near-duplicate
groups as a JSON report.
- cluster/refine: fan every placement across canonical + aliases so each
on-disk location gets represented.
- safe_dst_name now always flattens the absolute path so filenames stay
stable across runs when src_root shifts (fixes duplicate-copy bug that
surfaced during the lzbkp_red extend).
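
The listing-time dedup in embed can be sketched as below. The shape of
path_aliases (canonical path mapped to a list of alias paths) and the
first-path-in-sorted-order rule for picking the canonical are assumptions for
illustration; the real schema in sort_faces may differ.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large originals never sit fully in memory.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def dedup_listing(paths):
    """Return (unique paths to embed, path_aliases) for a file listing."""
    canonical_by_hash = {}
    path_aliases = {}  # assumed shape: canonical path -> [alias paths]
    for path in sorted(paths):
        canonical = canonical_by_hash.setdefault(sha256_of(path), path)
        if canonical != path:
            path_aliases.setdefault(canonical, []).append(path)
    return sorted(canonical_by_hash.values()), path_aliases
```

Only the unique paths go through the embedder; cluster and refine can then
fan each placement out to the aliases recorded for that canonical path.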
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-file CLI (embed / cluster / refine) using InsightFace buffalo_l
embeddings and agglomerative clustering, migrated in from the ad-hoc
/home/peter/face_sort/ directory so this repo is the canonical home for
faceset preparation feeding roop-unleashed and similar tools.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>