# Video target preprocessing for roop-unleashed
_Initial design + first batch run: 2026-04-27. Driver scripts: `work/video_target_pipeline.py`, `work/video_face_worker.py`, `work/run_video_pipeline.sh`._
Companion to the face-set side of the project: instead of building per-identity .fsz bundles for the *source* of a swap, this pipeline preprocesses the *target* (videos to swap into). Given a folder of video files, it identifies "swappable" segments — continuous shots where a face is detectable, sufficiently visible, and roughly within inswapper_128's working envelope — and cuts them into UUID-named clips ready to feed into roop-unleashed.
## 1. Why build it
I checked the obvious open-source projects for an existing implementation:
- **FaceFusion** ([github.com/facefusion/facefusion](https://github.com/facefusion/facefusion)) — CLI has `run`, `headless-run`, `batch-run`, `job-*`, `force-download`, `benchmark`. No scene-detection or clip-extraction subcommand. Its own guides recommend "split your video manually first."
- **roop-unleashed** at `/opt/roop-unleashed/roop/util_ffmpeg.py` — has `cut_video(start_frame, end_frame)` for a manual GUI trim, no detection-driven segmentation.
- **Deep-Live-Cam** ([github.com/hacksider/Deep-Live-Cam](https://github.com/hacksider/Deep-Live-Cam)) — real-time / single-shot, no batch preprocessing.
- **DeepFaceLab** — `extract_video.bat` dumps every frame between user-supplied trim points; no quality gating.
Closest prior art for the cut-detection pattern is the two-stage hybrid in [SportSBD MMSys'26](https://dl.acm.org/doi/10.1145/3793853.3799803) (cheap detector for cuts, accurate net for verification), but the actual implementation has to be ours.
## 2. Pipeline architecture
```
WSL /opt/face-sets/work/                  Windows C:\face_embed_venv\
────────────────────────────────────────  ─────────────────────────────
run_video_pipeline.sh (chain driver)
├─ scan   (ffprobe metadata)
├─ scenes (PySceneDetect AdaptiveDetector, CPU)
├─ stage  (sampled frame queue.json @ 2 fps)
│              │
│              ▼
│                                         video_face_worker.py
│                                           insightface FaceAnalysis
│                                           on DmlExecutionProvider
│                                           output: results.jsonl
├─ merge  (ingest results.jsonl)
├─ track  (IoU + embedding stitching, ~30 LOC)
├─ score  (track-level quality gate + cross-track merge)
├─ cut    (ffmpeg -c copy → per-source subfolders)
└─ report (HTML preview)

Output: <output_dir>/<source_video_stem>/<uuid>.mp4
                                        /<uuid>.json  (sidecar; opt-in via --write-sidecar)
```
`run_video_pipeline.sh` is parameterized via env vars (`WORK`, `INPUT_DIR`, `OUTPUT_DIR`, `FILTER_FROM`, `SKIP_PATTERN`, `MAX_DUR`, `IDENTITY`, `SIDECAR`) so you can pin a particular batch without editing the script. Sidecars are off by default — the per-batch `plan.json` always carries the full provenance for every clip; the `<uuid>.json` files alongside the clips are redundant and only useful if you need each clip to be self-describing in isolation.
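The `track` step's stitching rule is compact enough to paraphrase. A minimal sketch of the pattern — not the actual ~30 LOC in `work/video_target_pipeline.py`; function names, field names, and thresholds here are illustrative assumptions:

```python
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def stitch(dets, iou_min=0.3, cos_min=0.5, max_gap=1.0):
    """Chain per-frame detections (sorted by timestamp 't', in seconds)
    into tracks. 'emb' is assumed to be a unit-normalized insightface
    embedding; at the 2 fps sample rate, max_gap=1.0 tolerates one
    missed sample before a track breaks."""
    tracks = []
    for d in dets:
        home = None
        for tr in tracks:
            last = tr[-1]
            if (d["t"] - last["t"] <= max_gap
                    and iou(last["bbox"], d["bbox"]) >= iou_min
                    and float(np.dot(last["emb"], d["emb"])) >= cos_min):
                home = tr
                break
        if home is None:
            tracks.append([d])   # no spatial+identity match: start a new track
        else:
            home.append(d)
    return tracks
```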
## 3. Quality signals (matched to inswapper_128's working envelope)
inswapper_128 is trained near-frontal at 128×128. The score gate uses defaults that admit side profiles (since rich face-sets can absorb non-frontal swap targets):
| signal | threshold | rationale |
|--------|----------:|-----------|
| `\|yaw\|` | ≤ 75° | covers full 3/4 + side profile |
| `\|pitch\|` | ≤ 45° | covers extreme up/down looks |
| `face_short` | ≥ 80 px | inswapper resamples to 128; ≥80 still produces clean output |
| `det_score` | ≥ 0.5 | matches buffalo_l's MIN_DET; lower = unreliable detection |
| track-gate | ≥ 70 % frames pass | binary track filter rather than per-frame |
| duration | 1 s ≤ dur ≤ 120 s | below 1s = unusable slivers; above 120s probably contains a missed micro-cut |
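Spelled out as code, the gate is a per-frame predicate plus a track-level fraction test. A sketch under the defaults above (field names are assumptions, not the pipeline's actual schema):

```python
def frame_passes(f, max_yaw=75, max_pitch=45, min_short=80, min_det=0.5):
    # One sampled frame clears the gate only if all four signals do.
    return (abs(f["yaw"]) <= max_yaw
            and abs(f["pitch"]) <= max_pitch
            and f["face_short"] >= min_short
            and f["det_score"] >= min_det)

def track_passes(frames, min_frac=0.70):
    # Binary track filter: keep or drop the whole track on the fraction
    # of its sampled frames that pass, rather than trimming per-frame.
    return sum(frame_passes(f) for f in frames) / max(len(frames), 1) >= min_frac
```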
Plus two segment-merging knobs:
- `--bridge-gap` (default 3 s) — within a single track, brief pose-failure gaps shorter than this get bridged so single bad frames don't fragment a good run
- `--merge-gap` (default 2 s) — across tracks within the same scene, segments closer than this get fused (cross-track merge fires when face detection briefly fails between adjacent good runs)
The defaults can be tightened (e.g. `--max-yaw 25` for portrait-only) or loosened (e.g. `--max-yaw 90 --merge-gap 5`) without re-running detection — `score` reads the existing `tracks.json`.
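Both knobs reduce to the same interval-fusion primitive applied at two levels. A sketch (`good_runs` and `track_segments` are placeholder names, not pipeline identifiers):

```python
def fuse(intervals, max_gap):
    """Merge sorted (start, end) second-pairs whose gap is <= max_gap."""
    out = []
    for s, e in sorted(intervals):
        if out and s - out[-1][1] <= max_gap:
            out[-1][1] = max(out[-1][1], e)   # close the gap
        else:
            out.append([s, e])
    return out

# within one track: close pose-failure holes up to --bridge-gap
track_segments = fuse(good_runs, max_gap=3.0)
# across tracks in the same scene: fuse neighbors closer than --merge-gap
scene_segments = fuse(track_segments, max_gap=2.0)
```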
## 4. Performance + the JSONL append-only fix
This is where the engineering interest is. The first production run on 13 videos / 6.18 h of input went through three failure modes before settling at production speed:
| attempt | issue | rate observed |
|---|---|---:|
| 1. Original `cap.set(POS_FRAMES, N)` per sample | OpenCV seeks to nearest keyframe + decodes forward at every sample. Cost grows with depth into the video; on a 60-min H.264 it falls off a cliff. | 1.4 fps → degrading |
| 2. Sequential `cap.grab()` from frame 0 | On resume, grab-walking from frame 0 to a deep target is unbounded. | 0.08 fps |
| 3. Hybrid: seek-once-per-video + sequential within | Better in principle. But hit a different bug: `flush()` was re-serializing the entire `results.json` (245 MB at this point) every 100 frames or 30 sec. Save dominated wall-clock. | 0.5 fps |
| 4. **JSONL append-only** | One result per line. Each flush is O(new records), not O(total records). | **13.77 fps** smoke / 7.57 fps cumulative across the full batch |
Lesson: when the output is large + grows monotonically + needs frequent checkpointing, *do not* re-serialize the whole structure on each flush. Append-only line-delimited JSON is the right tool. The legacy `results.json` is auto-converted to `.jsonl` on first load (one-time migration), so resumes survive the format switch.
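For reference, the shape of the surviving design — combining the seek-once-then-`grab()` walk from attempt 3 with the append-only flush from attempt 4. A sketch only; the real logic lives in `work/video_face_worker.py`, and `analyze` here stands in for the insightface call:

```python
import cv2, json

def process_video(path, sample_idxs, results_path, analyze):
    cap = cv2.VideoCapture(path)
    out = open(results_path, "a")                     # append-only JSONL
    sample_idxs = sorted(sample_idxs)
    cap.set(cv2.CAP_PROP_POS_FRAMES, sample_idxs[0])  # the one seek per video
    pos = sample_idxs[0]
    for idx in sample_idxs:
        while pos < idx:
            cap.grab()        # sequential walk: decode without retrieving
            pos += 1
        ok, frame = cap.read()
        pos += 1
        if not ok:
            break
        rec = {"video": path, "frame": idx, **analyze(frame)}
        out.write(json.dumps(rec) + "\n")
        out.flush()           # O(new bytes) per flush, not O(total results)
    out.close()
    cap.release()
```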
## 5. Hardware decode/encode on AMD Vega + WSL
Skipped. Per [Microsoft's WSL D3D12 video acceleration post](https://devblogs.microsoft.com/commandline/d3d12-gpu-video-acceleration-in-the-windows-subsystem-for-linux-now-available/), VAAPI-via-Mesa-D3D12 exists but is fragile on older AMD. AMF on Windows would mean a Windows-side ffmpeg leg, doubling boundary crossings. CPU software decode of 1280×720 H.264 in WSL ffmpeg is faster than realtime, and the bottleneck is buffalo_l detection on DML, not decode.
For cutting we use `-c copy` stream-copy — no re-encode, hardware codecs are moot.
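Concretely, each emitted clip is one stream-copy invocation of roughly this shape (the pipeline's exact ffmpeg flags may differ; this is the pattern, not the code):

```python
import pathlib, subprocess, uuid

def cut(src, start, end, out_dir):
    out = pathlib.Path(out_dir) / f"{uuid.uuid4()}.mp4"
    subprocess.run([
        "ffmpeg", "-nostdin", "-y",
        "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
        "-i", str(src),
        "-c", "copy",          # container-level copy: no decode/encode
        str(out),
    ], check=True)
    return out
```

One property of stream copy worth remembering: output can only start on a keyframe, so cut points snap to the nearest preceding keyframe rather than the exact requested timestamp.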
## 6. Full corpus run results
Three runs across the 61-video corpus at `/mnt/x/src/vd/`:
| | test (3 videos) | first batch (13 videos, 00050–00062) | rest (45 videos, 00002–00049 minus test) | **total** |
|---|---:|---:|---:|---:|
| input duration | 0.6 h | 6.18 h | 12.98 h | **19.76 h** |
| sampled frames @ 2 fps | 4,472 | 44,635 | 94,030 | 143,137 |
| tracks | 187 | 2,564 | 3,823 | 6,574 |
| accepted tracks | 94 (50 %) | 1,193 (47 %) | 1,905 (50 %) | 3,192 (49 %) |
| **emitted segments** | **83** | **600** | **1,301** | **1,984** |
| cross-track-merged segments | 14 | 254 | 382 | 650 |
| accepted content | 13 min | 239 min | 395 min | **647 min (= 10.78 h)** |
| acceptance rate by time | 36 % | 64.6 % | 50.7 % | **54.6 %** |
| output size | 0.135 GB | 3.63 GB | 4.84 GB | **8.6 GB** |
Phase timings (rest batch — best representative since it ran fully under JSONL append-only from a fresh start):
- scenes: 117 min (PySceneDetect, 45 × ~3 min/video)
- stage: instant
- worker: 100 min @ **15.78 fps** sustained (vs 7.5 fps for first batch which migrated mid-run)
- merge: 90 s
- track: 92 s
- score: 23 s
- cut (1,301 ffmpeg stream-copies): 30 min
- report (1,301 thumbs + HTML): 5.5 min
- **total wall-clock: 4h16m**
Across all three runs, **0 worker errors on 143,137 sampled frames**.
## 7. Re-running
```bash
# choose a per-batch workdir + log
WORK=/opt/face-sets/work/video_preprocess_<batch_name> \
FILTER_FROM=ct_src_00050.mp4 \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch_name>.log 2>&1 &
# check status anytime
bash work/status_video_pipeline.sh work/logs/video_run_<batch_name>.log
```
Skip patterns can exclude already-processed inputs (note that 5-digit numbers need full padding in the regex, e.g. `0005[0-9]` not `005[0-9]`):
```bash
SKIP_PATTERN='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$' \
WORK=/opt/face-sets/work/video_preprocess_rest \
bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &
```
To also emit per-clip provenance sidecars (off by default):
```bash
SIDECAR=yes \
WORK=/opt/face-sets/work/video_preprocess_<batch> \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch>.log 2>&1 &
```
`scenes` outputs are cached in the batch's `WORK/scenes/` dir, so re-running the chain after tweaking `score` parameters doesn't redo detection. The worker is also resumable per `queue_id` — if killed mid-flight, just relaunch.
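What "resumable per `queue_id`" amounts to on relaunch, sketched (field names are assumptions): replay `results.jsonl`, collect the IDs already answered, and skip them.

```python
import json

def load_done(results_path):
    done = set()
    try:
        with open(results_path) as fh:
            for line in fh:
                try:
                    done.add(json.loads(line)["queue_id"])
                except (ValueError, KeyError):
                    pass   # tolerate a torn final line from a mid-write kill
    except FileNotFoundError:
        pass               # fresh start: nothing done yet
    return done

queue = json.load(open("queue.json"))
todo = [q for q in queue if q["queue_id"] not in load_done("results.jsonl")]
```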