9 Commits

Author SHA1 Message Date
308597ebf0 Update video preprocessing doc with full-corpus results
After completing the rest-of-corpus run, update docs/analysis to reflect
the final numbers across all three batches (test + 13-file + 45-file)
and surface the numerical lessons:
- 1,984 segments / 10.78h accepted content from 19.76h / 61 input videos
- 0 worker errors across 143,137 sampled frames
- rest batch sustained 15.78 fps from a fresh JSONL start (vs 7.5 fps for
  the migrated batch), confirming the append-only fix is the right
  steady-state design
- skip-pattern note: 5-digit basename numbers need full padding
  (0005[0-9] not 005[0-9]) — bit me on the first relaunch
- documented SIDECAR=yes opt-in for the chain script

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 16:47:59 +02:00
7960dec350 Make per-clip sidecar JSONs opt-in (default off)
Previously every video_target_pipeline cut wrote a <uuid>.json provenance
sidecar alongside each <uuid>.mp4. The same provenance is already in the
per-batch plan.json, so the per-clip sidecars are redundant unless a
downstream tool wants each clip self-describing in isolation.

- video_target_pipeline.py cut: new --write-sidecar flag, default off.
- run_video_pipeline.sh: new SIDECAR env var (default "no"), passes
  --write-sidecar when SIDECAR=yes.
- README + docs/analysis/video-target-preprocessing.md updated.

The 1,984 already-emitted sidecars in /mnt/x/src/vd/ct/ct_src_*/ have
been deleted (1.5 MB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 12:44:27 +02:00
998fa79f81 Add target-side video preprocessing pipeline
Preprocesses a folder of video files into UUID-named clips suitable as
target inputs for roop-unleashed-style face-swap. Counterpart to the
faceset (source-side) tooling.

work/video_target_pipeline.py — orchestration with subcommands
  scan / scenes / stage / merge / track / score / cut / report. Quality
  gates default to face-sets-can-handle-side-profile values (yaw<=75°,
  pitch<=45°, face_short>=80px, det>=0.5). Cross-track segment merge
  fuses adjacent-in-time tracks within the same scene up to 2s gap.
  Output organized into <output_dir>/<source_stem>/<uuid>.mp4 +
  <uuid>.json sidecar with full provenance.

work/video_face_worker.py — Windows DML face detect+embed worker. Uses
  JSONL append-only for results.jsonl: a critical perf fix (re-
  serializing the monolithic 245MB results.json on every flush was the
  dominant cost in the first attempt, dropping throughput to 0.5 fps).
  Append-only got it to 13+ fps, ~7.5 fps cumulative across the first
  6.18h batch. Also uses seek-once-per-video + sequential cap.grab()
  between samples to dodge cv2 per-sample seek pathology on long H.264.
  Legacy results.json is auto-migrated to .jsonl on first load.

work/run_video_pipeline.sh — generic chain driver, parameterized via
  WORK / INPUT_DIR / OUTPUT_DIR / FILTER_FROM / SKIP_PATTERN / MAX_DUR /
  IDENTITY env vars. work/status_video_pipeline.sh — generic status
  helper.

First production batch (ct_src_00050..00062, 13 files, 6.18h input):
600 emitted segments, 239.5min accepted content (64.6% of input), 254
segments built from >=2 tracks (cross-track merge), 1h43m wall clock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:38:50 +02:00
49a43c7685 Add post-export corpus maintenance pipeline
Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores
  via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level
  quarantine at 40% domain dominance.

- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.

- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.

- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes — cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega — ~7x embed_worker because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes
preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full
reversibility. Master manifest gains masked[], merged[], plus per-run
provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:41:18 +02:00
e66c97fd58 Document Immich nic run: 95 new facesets, manifest 216 -> 311
Overnight 2026-04-27 nic finalize completed. Per-user API key worked as
expected. The pipeline survived one mid-stage Immich outage via the
circuit breaker added in 62dba3d -- script paused, operator confirmed
connectivity, same command resumed from saved state.json.

Embed (Windows DML): 7,834 images -> 15,627 face records + 1 noface in
59 minutes (2.2 img/s end-to-end).

Cluster: 6,770 of 15,627 faces (43%) matched existing canonical
identities at cos-dist <= 0.45; biggest hits faceset_002 (+3,261),
faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408). The
faceset_008 and faceset_007 hits are noteworthy cross-matches: those
are hand-sorted "sab" and "s" identities, recurring frequently in nic's
library.

Of the 8,857 unmatched faces, 3,787 raw clusters at threshold 0.55,
129 surviving refine gates, 95 emitted as new facesets at faceset_265+.

Top-level facesets_swap_ready/manifest.json: 216 -> 311 substantive
facesets + 68 thin_eras unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 00:32:11 +02:00
321fed01cc Add Immich import pipeline (WSL stage + Windows DML embed + cluster)
Three-piece workflow that imports a self-hosted Immich library and emits
new facesets without disturbing existing identity numbering:

- work/immich_stage.py (WSL): pages /search/metadata, parallel-fetches
  /faces?id= per asset, prefilters by face_short>=90 against bbox scaled
  to original-image coords, downloads originals, sha256-dedups against
  nl_full.npz and same-run staged files. 8-worker ThreadPoolExecutor
  doing the full /faces->filter->/original chain per asset; resumable
  via state.json. API URL + key come from IMMICH_URL / IMMICH_API_KEY
  env vars, label->UUID map from work/immich/users.json (gitignored).
- work/embed_worker.py (Windows venv at C:\face_embed_venv): runs
  insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider on
  AMD Radeon Vega via onnxruntime-directml. Produces a cache file in
  the same .npz schema as sort_faces.cmd_embed (loadable via
  load_cache). ~7.5x speedup over CPU end-to-end; embeddings bit-
  identical to CPU (cosine similarity 1.0000 across 8 sample faces).
- work/cluster_immich.py (WSL): mirrors cluster_osrc.py against an
  immich_<user>.npz. Builds existing identity centroids from canonical
  faceset_NNN/ in facesets_swap_ready/, drops matches at <=0.45,
  clusters the rest at 0.55, applies refine gates, hands off to
  cmd_export_swap. Numbers new facesets past the existing maximum.
- work/finalize_immich.sh: chains queue->Windows embed->cache copy->
  cluster_immich, with logging.

The 2026-04-26 run on https://fotos.computerliebe.org (Immich v2.7.2)
processed 53,842 admin-accessible assets, staged 10,261, embedded
19,462 face records on Vega DML in 64.6 min, matched 8,103 (42%) to
existing identities, and emitted 185 new facesets (faceset_026..264
with gaps). facesets_swap_ready/ went from 31 to 216 substantive
facesets.

Important caveat surfaced: /search/metadata's userIds filter is
silently ignored when the API key is bound to a different user, so
this run can't enumerate other users' libraries from the admin key.
A per-user API key would be required for nic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 18:14:26 +02:00
7ecbfae981 Add osrc identity-discovery pipeline + run analysis
work/cluster_osrc.py mirrors build_folders.py's shape (synthesize a
refine_manifest, hand off to cmd_export_swap, relocate, merge top-level
manifest) but discovers identities by clustering rather than asserting
them by folder. Drops faces already covered by existing identity
centroids, clusters the rest at 0.55, applies refine-equivalent gates
with min_faces=6, numbers new facesets past the existing maximum so
faceset_001..NNN are never disturbed.

The 2026-04-26 run on /mnt/x/src/osrc produced faceset_020..025 (sizes
4-26 exported PNGs); analysis writeup in docs/analysis/.

README also notes the refine-renumbers caveat in passing — extend +
orchestration script is the safe pattern; cmd_refine is for fresh
clusters only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:40:19 +02:00
e48dd8aec7 Add age-split run analysis for faceset_001
Documents the 2026-04-26 split of faceset_001 (707 curated faces) into
6 substantive era buckets + 68 thin fragments, including the readiness
probe evidence, the anchor-based assignment rationale (replaces
transitive union-find that caused year-drift), and the re-run / apply-
to-other-identity workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:10:37 +02:00
d53ab9fbfc Add enrich + export-swap pipeline for downstream face-swap ready output
- enrich: re-detects each cached face with buffalo_l (detection +
  landmark_2d_106 + landmark_3d_68, recognition module skipped for speed)
  and persists landmarks + pose into the cache so per-face frontality and
  landmark-symmetry quality signals become available.
- compute_quality: composite score combining det_score, face short-edge,
  blur, frontality (from pose pitch/yaw), and 2D-landmark symmetry with
  tunable weights. Default weighting 0.30/0.20/0.20/0.15/0.15.
- export-swap: builds facesets_swap_ready/ from an existing refine
  manifest. Per identity: tighter outlier gate (default 0.45), visual-
  near-dupe collapse (keep best representative per group), multi-face-
  per-source-image collapse (keep best bbox), rank by composite score,
  single-face-per-PNG crops at 512x512 with 0.5 bbox padding, ready-to-
  drop .fsz bundles (top-N + full), per-faceset manifest.json, NAME.txt
  placeholder for the operator. The multi-face-per-PNG collapse is the
  critical fix: roop-unleashed's .fsz loader appends every detected face
  in each PNG to the FaceSet, so any multi-face crop would contaminate
  the averaged embedding.
- Optional --candidates rescues raw_full singletons: matches against the
  final per-faceset centroids and routes to _candidates/to_<faceset>/
  for manual review; orphaned singletons that still cluster among
  themselves land in _candidates/new_<NNN>/.
- docs/analysis/: evaluation document captures the evidence, downstream
  requirements (FaceSet averaging, inswapper_128), opportunity matrix
  (R1-R14), and the recommended target state this export implements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 22:37:32 +02:00