Add target-side video preprocessing pipeline
Preprocesses a folder of video files into UUID-named clips suitable as target inputs for roop-unleashed-style face swaps. Counterpart to the faceset (source-side) tooling.

work/video_target_pipeline.py — orchestration with subcommands scan / scenes / stage / merge / track / score / cut / report. Quality gates default to side-profile-tolerant values (yaw<=75°, pitch<=45°, face_short>=80px, det>=0.5), since rich face sets can absorb side-profile targets. Cross-track segment merge fuses adjacent-in-time tracks within the same scene up to a 2s gap. Output is organized into <output_dir>/<source_stem>/<uuid>.mp4 + <uuid>.json sidecar with full provenance.

work/video_face_worker.py — Windows DML face detect+embed worker. Uses append-only JSONL for results.jsonl: a critical perf fix (re-serializing the monolithic 245MB results.json on every flush was the dominant cost in the first attempt, dropping throughput to 0.5 fps). Append-only got it to 13+ fps, ~7.5 fps cumulative across the first 6.18h batch. Also uses seek-once-per-video + sequential cap.grab() between samples to dodge cv2's per-sample seek pathology on long H.264. Legacy results.json is auto-migrated to .jsonl on first load.

work/run_video_pipeline.sh — generic chain driver, parameterized via WORK / INPUT_DIR / OUTPUT_DIR / FILTER_FROM / SKIP_PATTERN / MAX_DUR / IDENTITY env vars.

work/status_video_pipeline.sh — generic status helper.

First production batch (ct_src_00050..00062, 13 files, 6.18h input): 600 emitted segments, 239.5min accepted content (64.6% of input), 254 segments built from >=2 tracks (cross-track merge), 1h43m wall clock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/analysis/video-target-preprocessing.md — new file, 129 lines

# Video target preprocessing for roop-unleashed

_Initial design + first batch run: 2026-04-27. Driver scripts: `work/video_target_pipeline.py`, `work/video_face_worker.py`, `work/run_video_pipeline.sh`._

Companion to the face-set side of the project: instead of building per-identity .fsz bundles for the *source* of a swap, this pipeline preprocesses the *target* (videos to swap into). Given a folder of video files, it identifies "swappable" segments — continuous shots where a face is detectable, sufficiently visible, and roughly within inswapper_128's working envelope — and cuts them into UUID-named clips ready to feed into roop-unleashed.

## 1. Why build it

I checked the obvious open-source projects for an existing implementation:

- **FaceFusion** ([github.com/facefusion/facefusion](https://github.com/facefusion/facefusion)) — CLI has `run`, `headless-run`, `batch-run`, `job-*`, `force-download`, `benchmark`. No scene-detection or clip-extraction subcommand. Its own guides recommend "split your video manually first."
- **roop-unleashed** at `/opt/roop-unleashed/roop/util_ffmpeg.py` — has `cut_video(start_frame, end_frame)` for a manual GUI trim, no detection-driven segmentation.
- **Deep-Live-Cam** ([github.com/hacksider/Deep-Live-Cam](https://github.com/hacksider/Deep-Live-Cam)) — real-time / single-shot, no batch preprocessing.
- **DeepFaceLab** — `extract_video.bat` dumps every frame between user-supplied trim points; no quality gating.

Closest prior art for the cut-detection pattern is the two-stage hybrid in [SportSBD MMSys'26](https://dl.acm.org/doi/10.1145/3793853.3799803) (cheap detector for cuts, accurate net for verification), but the actual implementation has to be ours.

## 2. Pipeline architecture

```
WSL /opt/face-sets/work/                    Windows C:\face_embed_venv\
─────────────────────────────────────       ─────────────────────────────
run_video_pipeline.sh (chain driver)
  │
  ├─ scan   (ffprobe metadata)
  ├─ scenes (PySceneDetect AdaptiveDetector, CPU)
  ├─ stage  (sampled frame queue.json @ 2 fps)
  │             │
  │             ▼
  │                                         video_face_worker.py
  │                                         insightface FaceAnalysis
  │                                         on DmlExecutionProvider
  │                                         output: results.jsonl
  ├─ merge  (ingest results.jsonl)
  ├─ track  (IoU + embedding stitching, ~30 LOC)
  ├─ score  (track-level quality gate + cross-track merge)
  ├─ cut    (ffmpeg -c copy → per-source subfolders)
  └─ report (HTML preview)

Output: <output_dir>/<source_video_stem>/<uuid>.mp4
                                         /<uuid>.json (sidecar)
```
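
The `scenes` step leans on PySceneDetect's `AdaptiveDetector`. A minimal sketch of that call, assuming the functional API of PySceneDetect ≥ 0.6 (the real subcommand additionally caches its output under the batch's `WORK/scenes/` dir):

```python
# Sketch: shot-boundary detection with PySceneDetect's AdaptiveDetector.
from scenedetect import detect, AdaptiveDetector

def scene_spans(video_path: str) -> list[tuple[float, float]]:
    # AdaptiveDetector scores frame-to-frame content change against a rolling
    # average, which is more robust to fast camera motion than a fixed threshold.
    scene_list = detect(video_path, AdaptiveDetector())
    return [(start.get_seconds(), end.get_seconds()) for start, end in scene_list]
```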

`run_video_pipeline.sh` is parameterized via env vars (`WORK`, `INPUT_DIR`, `OUTPUT_DIR`, `FILTER_FROM`, `SKIP_PATTERN`, `MAX_DUR`, `IDENTITY`) so you can pin a particular batch without editing the script.
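
The `track` step stitches per-frame detections into face tracks by combining bbox IoU with embedding similarity (the real version is roughly 30 lines). The sketch below conveys the general idea with hypothetical record fields (`t` in seconds, `bbox` as xyxy, `emb` as a vector); the actual script may weight or threshold the two signals differently.

```python
# Sketch: greedy track stitching by bbox IoU + embedding cosine similarity.
import numpy as np

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cos_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def stitch(detections, iou_thr=0.3, emb_thr=0.4, max_gap_s=1.5):
    """detections: dicts with 't', 'bbox', 'emb', sorted by time."""
    tracks = []  # each track is a list of detections
    for det in detections:
        match = None
        for tr in tracks:
            last = tr[-1]
            if det["t"] - last["t"] > max_gap_s:
                continue  # track has gone stale at this timestamp
            if (iou(det["bbox"], last["bbox"]) >= iou_thr
                    or cos_sim(det["emb"], last["emb"]) >= emb_thr):
                match = tr
                break
        if match is not None:
            match.append(det)
        else:
            tracks.append([det])
    return tracks
```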

## 3. Quality signals (matched to inswapper_128's working envelope)

inswapper_128 is trained on near-frontal faces at 128×128. The score gate uses defaults that admit side profiles (since rich face-sets can absorb non-frontal swap targets):

| signal | threshold | rationale |
|--------|----------:|-----------|
| `\|yaw\|` | ≤ 75° | covers full 3/4 + side profile |
| `\|pitch\|` | ≤ 45° | covers extreme up/down looks |
| `face_short` | ≥ 80 px | inswapper resamples to 128; ≥80 still produces clean output |
| `det_score` | ≥ 0.5 | matches buffalo_l's MIN_DET; lower = unreliable detection |
| track-gate | ≥ 70 % frames pass | binary track filter rather than per-frame |
| duration | 1 s ≤ dur ≤ 120 s | below 1 s = unusable slivers; above 120 s probably contains a missed micro-cut |

Plus two segment-merging knobs:

- `--bridge-gap` (default 3 s) — within a single track, brief pose-failure gaps shorter than this get bridged so single bad frames don't fragment a good run
- `--merge-gap` (default 2 s) — across tracks within the same scene, segments closer than this get fused (cross-track merge fires when face detection briefly fails between adjacent good runs)

The defaults can be tightened (e.g. `--max-yaw 25` for portrait-only) or loosened (e.g. `--max-yaw 90 --merge-gap 5`) without re-running detection — `score` reads the existing `tracks.json`.
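
A minimal sketch of these gates, mirroring the default thresholds above (field names are hypothetical; the real logic lives in the `score` step of `video_target_pipeline.py`):

```python
# Sketch only: per-frame gate + track-level gate with the documented defaults.
from dataclasses import dataclass

@dataclass
class FrameFace:
    yaw: float         # degrees, signed
    pitch: float       # degrees, signed
    face_short: float  # shorter side of the face bbox, px
    det_score: float   # detector confidence

def frame_passes(f: FrameFace, max_yaw=75.0, max_pitch=45.0,
                 min_face=80.0, min_det=0.5) -> bool:
    return (abs(f.yaw) <= max_yaw and abs(f.pitch) <= max_pitch
            and f.face_short >= min_face and f.det_score >= min_det)

def track_passes(frames: list[FrameFace], duration_s: float,
                 min_pass_ratio=0.70, min_dur=1.0, max_dur=120.0) -> bool:
    # Binary track filter: enough passing frames and a sane duration.
    if not frames or not (min_dur <= duration_s <= max_dur):
        return False
    return sum(frame_passes(f) for f in frames) / len(frames) >= min_pass_ratio
```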

## 4. Performance + the JSONL append-only fix

This is where the engineering interest is. The first production run on 13 videos / 6.18 h of input went through three failure modes before settling at production speed:

| attempt | issue | rate observed |
|---|---|---:|
| 1. Original `cap.set(POS_FRAMES, N)` per sample | OpenCV seeks to nearest keyframe + decodes forward at every sample. Cost grows with depth into the video; on a 60-min H.264 it falls off a cliff. | 1.4 fps → degrading |
| 2. Sequential `cap.grab()` from frame 0 | On resume, grab-walking from frame 0 to a deep target is unbounded. | 0.08 fps |
| 3. Hybrid: seek-once-per-video + sequential within | Better in principle. But hit a different bug: `flush()` was re-serializing the entire `results.json` (245 MB at this point) every 100 frames or 30 sec. The save dominated wall-clock time. | 0.5 fps |
| 4. **JSONL append-only** | One result per line. Each flush is O(new records), not O(total records). | **13.77 fps** smoke / 7.57 fps cumulative across the full batch |
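
The decode strategy that held up is the hybrid from attempts 3–4: seek once per video to the first sampled frame, then advance with sequential `cap.grab()` calls, fully decoding only the sampled frames. A sketch of that pattern with standard `cv2` calls (the real worker also resumes from the first unprocessed sample rather than always from the start):

```python
# Sketch: seek once, then grab() forward; retrieve() only the sampled frames.
import cv2

def iter_sampled_frames(path: str, sample_fps: float = 2.0):
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(1, round(native_fps / sample_fps))
    targets = list(range(0, total, step))
    if not targets:
        cap.release()
        return
    cap.set(cv2.CAP_PROP_POS_FRAMES, targets[0])  # the one seek per video
    pos, want = targets[0], iter(targets)
    nxt = next(want)
    while True:
        if pos == nxt:
            ok, frame = cap.retrieve() if cap.grab() else (False, None)
            if not ok:
                break
            yield pos, frame
            try:
                nxt = next(want)
            except StopIteration:
                break
        elif not cap.grab():      # skip non-sampled frames without full decode
            break
        pos += 1
    cap.release()
```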

Lesson: when the output is large + grows monotonically + needs frequent checkpointing, *do not* re-serialize the whole structure on each flush. Append-only line-delimited JSON is the right tool. The legacy `results.json` is auto-converted to `.jsonl` on first load (one-time migration), so resumes survive the format switch.
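
A minimal sketch of the append-only pattern (record shape is hypothetical; the real worker writes one detection result per line to `results.jsonl`):

```python
# Sketch: append-only JSONL checkpointing; each flush costs O(new records).
import json, os

class JsonlSink:
    def __init__(self, path: str):
        self.path = path
        self._pending: list[dict] = []

    def add(self, record: dict) -> None:
        self._pending.append(record)

    def flush(self) -> None:
        if not self._pending:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            for rec in self._pending:
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")
            f.flush()
            os.fsync(f.fileno())   # a hard kill loses at most the pending batch
        self._pending.clear()

def load_results(path: str) -> list[dict]:
    # Resume path: read everything written so far, tolerating a torn last line.
    out = []
    if not os.path.exists(path):
        return out
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                out.append(json.loads(line))
            except json.JSONDecodeError:
                break   # interrupted write left a partial tail
    return out
```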

## 5. Hardware decode/encode on AMD Vega + WSL

Skipped. Per [Microsoft's WSL D3D12 video acceleration post](https://devblogs.microsoft.com/commandline/d3d12-gpu-video-acceleration-in-the-windows-subsystem-for-linux-now-available/), VAAPI-via-Mesa-D3D12 exists but is fragile on older AMD. AMF on Windows would mean a Windows-side ffmpeg leg, doubling boundary crossings. CPU software decode of 1280×720 H.264 in WSL ffmpeg is faster than realtime, and the bottleneck is buffalo_l detection on DML, not decode.

For cutting we use `-c copy` stream-copy — no re-encode, hardware codecs are moot.
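
For illustration, a stream-copy cut of one accepted segment could look like the sketch below (the exact flags are an assumption, not necessarily what the `cut` step passes). Note that with `-c copy` the cut can only start cleanly at a keyframe, so boundaries may drift slightly from the scored timestamps.

```python
# Sketch: stream-copy one accepted segment into a UUID-named clip.
import subprocess
import uuid
from pathlib import Path

def cut_segment(src: Path, start_s: float, end_s: float, out_dir: Path) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    clip = out_dir / f"{uuid.uuid4()}.mp4"
    subprocess.run(
        ["ffmpeg", "-nostdin", "-y",
         "-ss", f"{start_s:.3f}",           # input-side seek to the nearest keyframe
         "-i", str(src),
         "-t", f"{end_s - start_s:.3f}",    # segment duration
         "-c", "copy",                      # stream-copy, no re-encode
         "-avoid_negative_ts", "make_zero",
         str(clip)],
        check=True, capture_output=True)
    return clip
```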

## 6. First batch run results (ct_src_00050..00062)

| metric | value |
|---|---:|
| input videos | 13 |
| input duration | 6.18 h |
| sampled frames | 44,635 (@ 2 fps) |
| accepted tracks | 1,193 / 2,564 (47 %) |
| **emitted segments** | **600** |
| segments built from ≥2 tracks (cross-track merge fired) | 254 |
| accepted content total | 239.5 min (64.6 % of input) |
| segment duration min/median/mean/max | 1 / 12 / 24 / 119 s |
| output size | 3.63 GB |

Phase timings:

- scenes: 25 min (cached on later runs)
- stage: instant
- worker: 78 min @ ~7.5 fps cumulative
- merge: 73 s
- track: 77 s
- score: 21 s
- cut (600 ffmpeg stream-copies): 19 min
- report (600 thumbs + HTML): 3 min
- **total wall-clock: 1h43m**

## 7. Re-running

```bash
# choose a per-batch workdir + log
WORK=/opt/face-sets/work/video_preprocess_<batch_name> \
FILTER_FROM=ct_src_00050.mp4 \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch_name>.log 2>&1 &

# check status anytime
bash work/status_video_pipeline.sh work/logs/video_run_<batch_name>.log
```

Skip patterns can exclude already-processed inputs:

```bash
SKIP_PATTERN='^ct_src_(0001[015]|005[0-9]|006[0-9])\.mp4$' \
WORK=/opt/face-sets/work/video_preprocess_rest \
bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &
```

`scenes` outputs are cached in the batch's `WORK/scenes/` dir, so re-running the chain after tweaking `score` parameters doesn't redo detection. The worker is also resumable per `queue_id` — if killed mid-flight, just relaunch.
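
Conceptually, that resume is a diff of `results.jsonl` against `queue.json`: only items whose `queue_id` has not been written yet get reprocessed. A sketch with hypothetical field names:

```python
# Sketch: re-queue only items whose queue_id is absent from results.jsonl.
import json
from pathlib import Path

def pending_items(queue_path: Path, results_path: Path) -> list[dict]:
    queue = json.loads(queue_path.read_text(encoding="utf-8"))
    done = set()
    if results_path.exists():
        for line in results_path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["queue_id"])
            except (json.JSONDecodeError, KeyError):
                continue   # torn tail or malformed record: just redo it
    return [item for item in queue if item["queue_id"] not in done]
```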