face-sets/docs/analysis/video-target-preprocessing.md
Peter 7960dec350 Make per-clip sidecar JSONs opt-in (default off)
Previously every video_target_pipeline cut wrote a <uuid>.json provenance
sidecar alongside each <uuid>.mp4. The same provenance is already in the
per-batch plan.json, so the per-clip sidecars are redundant unless a
downstream tool wants each clip self-describing in isolation.

- video_target_pipeline.py cut: new --write-sidecar flag, default off.
- run_video_pipeline.sh: new SIDECAR env var (default "no"), passes
  --write-sidecar when SIDECAR=yes.
- README + docs/analysis/video-target-preprocessing.md updated.

The 1,984 already-emitted sidecars in /mnt/x/src/vd/ct/ct_src_*/ have
been deleted (1.5 MB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 12:44:27 +02:00

# Video target preprocessing for roop-unleashed
_Initial design + first batch run: 2026-04-27. Driver scripts: `work/video_target_pipeline.py`, `work/video_face_worker.py`, `work/run_video_pipeline.sh`._
Companion to the face-set side of the project: instead of building per-identity .fsz bundles for the *source* of a swap, this pipeline preprocesses the *target* (videos to swap into). Given a folder of video files, it identifies "swappable" segments — continuous shots where a face is detectable, sufficiently visible, and roughly within inswapper_128's working envelope — and cuts them into UUID-named clips ready to feed into roop-unleashed.
## 1. Why build it
I checked the obvious open-source projects for an existing implementation:
- **FaceFusion** ([github.com/facefusion/facefusion](https://github.com/facefusion/facefusion)) — CLI has `run`, `headless-run`, `batch-run`, `job-*`, `force-download`, `benchmark`. No scene-detection or clip-extraction subcommand. Its own guides recommend "split your video manually first."
- **roop-unleashed** at `/opt/roop-unleashed/roop/util_ffmpeg.py` — has `cut_video(start_frame, end_frame)` for a manual GUI trim, no detection-driven segmentation.
- **Deep-Live-Cam** ([github.com/hacksider/Deep-Live-Cam](https://github.com/hacksider/Deep-Live-Cam)) — real-time / single-shot, no batch preprocessing.
- **DeepFaceLab** — `extract_video.bat` dumps every frame between user-supplied trim points; no quality gating.
Closest prior art for the cut-detection pattern is the two-stage hybrid in [SportSBD MMSys'26](https://dl.acm.org/doi/10.1145/3793853.3799803) (cheap detector for cuts, accurate net for verification), but the actual implementation has to be ours.
## 2. Pipeline architecture
```
WSL  /opt/face-sets/work/                 Windows  C:\face_embed_venv\
─────────────────────────────────────     ─────────────────────────────
run_video_pipeline.sh (chain driver)
  ├─ scan    (ffprobe metadata)
  ├─ scenes  (PySceneDetect AdaptiveDetector, CPU)
  ├─ stage   (sampled frame queue.json @ 2 fps)
  │                                         │
  │                                         ▼
  │                                       video_face_worker.py
  │                                         insightface FaceAnalysis
  │                                         on DmlExecutionProvider
  │                                         output: results.jsonl
  ├─ merge   (ingest results.jsonl)
  ├─ track   (IoU + embedding stitching, ~30 LOC)
  ├─ score   (track-level quality gate + cross-track merge)
  ├─ cut     (ffmpeg -c copy → per-source subfolders)
  └─ report  (HTML preview)

Output: <output_dir>/<source_video_stem>/<uuid>.mp4
                                        /<uuid>.json   (sidecar; opt-in via
                                                        --write-sidecar)
```
`run_video_pipeline.sh` is parameterized via env vars (`WORK`, `INPUT_DIR`, `OUTPUT_DIR`, `FILTER_FROM`, `SKIP_PATTERN`, `MAX_DUR`, `IDENTITY`, `SIDECAR`) so you can pin a particular batch without editing the script. Sidecars are off by default — the per-batch `plan.json` always carries the full provenance for every clip; the `<uuid>.json` files alongside the clips are redundant and only useful if you need each clip to be self-describing in isolation.
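
The `track` step in the diagram ("IoU + embedding stitching") can be pictured roughly as below. This is a minimal sketch, not the project's code: the detection layout (`bbox`, `emb`), the `stitch_tracks` name, both thresholds, and the rule that either a box overlap or an embedding match extends a track are all assumptions made for illustration.

```python
import math


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def stitch_tracks(frames, min_iou=0.3, min_sim=0.5):
    """Greedily link detections in consecutive sampled frames into tracks.

    frames: list (in time order) of lists of {"bbox": ..., "emb": ...} dicts.
    Returns a list of tracks, each a list of (frame_index, detection).
    Thresholds are hypothetical placeholders.
    """
    tracks = []   # every track started so far
    active = []   # indices of tracks extended in the previous frame
    for frame_idx, detections in enumerate(frames):
        claimed = []
        for det in detections:
            match = None
            for t in active:
                if t in claimed:
                    continue  # each track takes at most one detection per frame
                prev = tracks[t][-1][1]
                if (iou(prev["bbox"], det["bbox"]) >= min_iou
                        or cosine(prev["emb"], det["emb"]) >= min_sim):
                    match = t
                    break
            if match is None:          # nothing to extend: start a new track
                tracks.append([])
                match = len(tracks) - 1
            tracks[match].append((frame_idx, det))
            claimed.append(match)
        active = claimed               # tracks not extended this frame are closed
    return tracks
```

In this sketch a track closes as soon as one sampled frame has no matching detection; reconnecting such fragments is left to the gap handling in `score`.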
## 3. Quality signals (matched to inswapper_128's working envelope)
inswapper_128 was trained on near-frontal faces at 128×128. The score gate's defaults are deliberately wider than that and admit side profiles, since rich face-sets can absorb non-frontal swap targets:
| signal | threshold | rationale |
|--------|----------:|-----------|
| `\|yaw\|` | ≤ 75° | covers full 3/4 + side profile |
| `\|pitch\|` | ≤ 45° | covers extreme up/down looks |
| `face_short` | ≥ 80 px | inswapper resamples to 128; ≥80 still produces clean output |
| `det_score` | ≥ 0.5 | matches buffalo_l's MIN_DET; lower = unreliable detection |
| track-gate | ≥ 70 % frames pass | binary track filter rather than per-frame |
| duration | 1 s ≤ dur ≤ 120 s | below 1s = unusable slivers; above 120s probably contains a missed micro-cut |
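
The thresholds in the table combine into a per-frame check, a track-level pass ratio, and a duration check. A minimal sketch of that logic follows; the `FaceSample` type and the function names are hypothetical, and only the default values come from the table.

```python
from dataclasses import dataclass


@dataclass
class FaceSample:
    """One sampled frame of a track (hypothetical record layout)."""
    yaw: float         # degrees, signed
    pitch: float       # degrees, signed
    face_short: float  # shorter side of the face box, in pixels
    det_score: float   # detector confidence


def frame_passes(s, max_yaw=75.0, max_pitch=45.0,
                 min_face_short=80.0, min_det=0.5):
    """Per-frame check against the pose, size and confidence thresholds."""
    return (abs(s.yaw) <= max_yaw
            and abs(s.pitch) <= max_pitch
            and s.face_short >= min_face_short
            and s.det_score >= min_det)


def track_passes(samples, min_pass_ratio=0.70):
    """Binary track gate: accept when enough sampled frames pass."""
    if not samples:
        return False
    passed = sum(1 for s in samples if frame_passes(s))
    return passed / len(samples) >= min_pass_ratio


def duration_ok(seconds, min_s=1.0, max_s=120.0):
    """Segment duration gate: 1 s <= duration <= 120 s."""
    return min_s <= seconds <= max_s
```

Gating at track level rather than per frame means a few bad samples (a blink of occlusion, one extreme head turn) do not reject an otherwise usable shot.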
Plus two segment-merging knobs:
- `--bridge-gap` (default 3 s) — within a single track, brief pose-failure gaps shorter than this get bridged so single bad frames don't fragment a good run
- `--merge-gap` (default 2 s) — across tracks within the same scene, segments closer than this get fused (cross-track merge fires when face detection briefly fails between adjacent good runs)
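
Both knobs reduce to the same operation: fuse time intervals whose gap is shorter than a limit. A minimal sketch, assuming segments are `(start_s, end_s)` tuples; the `fuse` helper is illustrative and not taken from the pipeline:

```python
def fuse(segments, max_gap):
    """Fuse (start, end) segments separated by less than max_gap seconds."""
    fused = []
    for start, end in sorted(segments):
        if fused and start - fused[-1][1] < max_gap:
            # Gap is short enough: extend the previous segment.
            fused[-1] = (fused[-1][0], max(fused[-1][1], end))
        else:
            fused.append((start, end))
    return fused


# Within one track (--bridge-gap 3): the 1.5 s gap is bridged, the 4 s gap is not.
bridged = fuse([(0.0, 4.0), (5.5, 9.0), (13.0, 20.0)], max_gap=3.0)

# Across tracks in one scene (--merge-gap 2): a gap of exactly 2 s stays split.
merged = fuse([(0.0, 5.0), (6.5, 10.0), (12.0, 15.0)], max_gap=2.0)
```

Under this reading, bridging would run per track first and merging would then run over all accepted segments in a scene; that ordering is an assumption.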
The defaults can be tightened (e.g. `--max-yaw 25` for portrait-only) or loosened (e.g. `--max-yaw 90 --merge-gap 5`) without re-running detection — `score` reads the existing `tracks.json`.
## 4. Performance + the JSONL append-only fix
This is the part with the most engineering interest. The first production run (13 videos, 6.18 h of input) went through three failed attempts before the fourth reached production speed:
| attempt | issue | rate observed |
|---|---|---:|
| 1. Original `cap.set(POS_FRAMES, N)` per sample | OpenCV seeks to nearest keyframe + decodes forward at every sample. Cost grows with depth into the video; on a 60-min H.264 it falls off a cliff. | 1.4 fps → degrading |
| 2. Sequential `cap.grab()` from frame 0 | On resume, grab-walking from frame 0 to a deep target is unbounded. | 0.08 fps |
| 3. Hybrid: seek-once-per-video + sequential within | Better in principle. But hit a different bug: `flush()` was re-serializing the entire `results.json` (245 MB at this point) every 100 frames or 30 sec. Save dominated wall-clock. | 0.5 fps |
| 4. **JSONL append-only** | One result per line. Each flush is O(new records), not O(total records). | **13.77 fps** smoke / 7.57 fps cumulative across the full batch |
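
The hybrid access pattern from attempt 3 (one seek per video, then sequential grabs) can be sketched as follows. This is an illustration, not the worker's code: `sample_frames` is a hypothetical name, and `cap` is assumed to be a `cv2.VideoCapture` or any object with the same `set` / `grab` / `retrieve` methods. The constant is written out so the sketch runs without OpenCV installed.

```python
CAP_PROP_POS_FRAMES = 1  # same numeric value as cv2.CAP_PROP_POS_FRAMES


def sample_frames(cap, targets):
    """Yield (frame_index, image) for each target frame index.

    Seeks once to the first target, then walks forward with grab(),
    calling retrieve() only on the frames that are actually sampled.
    On resume, pass only the not-yet-processed targets so the single
    seek lands near the resume point instead of at frame 0.
    """
    targets = sorted(targets)
    if not targets:
        return
    cap.set(CAP_PROP_POS_FRAMES, targets[0])  # the only seek for this video
    pos = targets[0]                          # index of the next frame grab() reads
    for target in targets:
        while pos < target:                   # advance without retrieving the image
            if not cap.grab():
                return                        # end of stream
            pos += 1
        if not cap.grab():
            return
        pos += 1
        ok, image = cap.retrieve()
        if ok:
            yield target, image
```

With a real capture this would be called as `sample_frames(cv2.VideoCapture(path), remaining_targets)`. Whether a seek lands exactly on the requested frame depends on the backend and codec, so treat the indices as approximate for long-GOP video.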
Lesson: when the output is large + grows monotonically + needs frequent checkpointing, *do not* re-serialize the whole structure on each flush. Append-only line-delimited JSON is the right tool. The legacy `results.json` is auto-converted to `.jsonl` on first load (one-time migration), so resumes survive the format switch.
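
A minimal sketch of the append-only pattern, including a tolerant loader for resumes and the one-time legacy conversion. The function names and the record shape are hypothetical, and the legacy `results.json` is assumed here to be a JSON list of records.

```python
import json
import os


def append_results(path, records):
    """Append records as one JSON object per line; cost is O(new records)."""
    with open(path, "a", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # make the checkpoint durable


def load_results(path):
    """Read all complete records; stop at a torn final line."""
    results = []
    if not os.path.exists(path):
        return results
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                results.append(json.loads(line))
            except json.JSONDecodeError:
                break  # partial write from a killed worker
    return results


def migrate_legacy(json_path, jsonl_path):
    """One-time conversion of a legacy results.json (assumed: a list)."""
    if os.path.exists(jsonl_path) or not os.path.exists(json_path):
        return
    with open(json_path, encoding="utf-8") as fh:
        legacy = json.load(fh)
    append_results(jsonl_path, legacy)
```

On resume, the set of already-processed `queue_id` values can be rebuilt from `load_results()` and those entries skipped.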
## 5. Hardware decode/encode on AMD Vega + WSL
Skipped. Per [Microsoft's WSL D3D12 video acceleration post](https://devblogs.microsoft.com/commandline/d3d12-gpu-video-acceleration-in-the-windows-subsystem-for-linux-now-available/), VAAPI-via-Mesa-D3D12 exists but is fragile on older AMD. AMF on Windows would mean a Windows-side ffmpeg leg, doubling boundary crossings. CPU software decode of 1280×720 H.264 in WSL ffmpeg is faster than realtime, and the bottleneck is buffalo_l detection on DML, not decode.
For cutting we use `-c copy` stream-copy — no re-encode, hardware codecs are moot.
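
A stream-copy cut can be assembled as shown below. This is a sketch under assumptions: `build_cut_command` is a hypothetical helper and the exact flags used by the `cut` step may differ. Only the `-c copy` choice and the `<output_dir>/<source_video_stem>/<uuid>.mp4` layout come from this document.

```python
import uuid
from pathlib import Path


def build_cut_command(source, start_s, end_s, output_dir):
    """Return (ffmpeg argv, output path) for a stream-copy cut."""
    out_path = Path(output_dir) / Path(source).stem / f"{uuid.uuid4()}.mp4"
    cmd = [
        "ffmpeg", "-nostdin", "-y",
        "-ss", f"{start_s:.3f}",          # input seek to the segment start
        "-i", str(source),
        "-t", f"{end_s - start_s:.3f}",   # segment duration in seconds
        "-c", "copy",                     # copy streams, no re-encode
        str(out_path),
    ]
    return cmd, out_path


cmd, out_path = build_cut_command("/in/ct_src_00050.mp4", 12.0, 24.5, "/out")
```

To run it, create `out_path.parent` first and pass `cmd` to `subprocess.run(cmd, check=True)`. Because nothing is re-encoded, a copied clip can only begin on a keyframe, so real clip boundaries may differ slightly from the requested times.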
## 6. First batch run results (ct_src_00050..00062)
| | |
|---|---:|
| input videos | 13 |
| input duration | 6.18 h |
| sampled frames | 44,635 (@ 2 fps) |
| accepted tracks | 1,193 / 2,564 (47 %) |
| **emitted segments** | **600** |
| segments built from ≥2 tracks (cross-track merge fired) | 254 |
| accepted content total | 239.5 min (64.6 % of input) |
| segment duration min/median/mean/max | 1 / 12 / 24 / 119 s |
| output size | 3.63 GB |
Phase timings:
- scenes: 25 min (cached on later runs)
- stage: instant
- worker: 78 min @ ~7.5 fps cumulative
- merge: 73 s
- track: 77 s
- score: 21 s
- cut (600 ffmpeg stream-copies): 19 min
- report (600 thumbs + HTML): 3 min
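- **total wall-clock: 1h43m** (this matches the sum of `worker` through `report`, about 103 min, so it appears to exclude the 25 min `scenes` pass)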
## 7. Re-running
```bash
# choose a per-batch workdir + log
WORK=/opt/face-sets/work/video_preprocess_<batch_name> \
FILTER_FROM=ct_src_00050.mp4 \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch_name>.log 2>&1 &
# check status anytime
bash work/status_video_pipeline.sh work/logs/video_run_<batch_name>.log
```
Skip patterns can exclude already-processed inputs:
```bash
SKIP_PATTERN='^ct_src_(0001[015]|005[0-9]|006[0-9])\.mp4$' \
WORK=/opt/face-sets/work/video_preprocess_rest \
bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &
```
`scenes` outputs are cached in the batch's `WORK/scenes/` dir, so re-running the chain after changing only `score` settings doesn't redo scene detection. The worker is also resumable per `queue_id`: if it is killed mid-flight, just relaunch.