# Video target preprocessing for roop-unleashed
_Initial design + first batch run: 2026-04-27. Driver scripts: `work/video_target_pipeline.py`, `work/video_face_worker.py`, `work/run_video_pipeline.sh`._

Companion to the face-set side of the project: instead of building per-identity .fsz bundles for the *source* of a swap, this pipeline preprocesses the *target* (videos to swap into). Given a folder of video files, it identifies "swappable" segments — continuous shots where a face is detectable, sufficiently visible, and roughly within inswapper_128's working envelope — and cuts them into UUID-named clips ready to feed into roop-unleashed.

## 1. Why build it
I checked the obvious open-source projects for an existing implementation:

- **FaceFusion** ([github.com/facefusion/facefusion](https://github.com/facefusion/facefusion)) — CLI has `run`, `headless-run`, `batch-run`, `job-*`, `force-download`, `benchmark`. No scene-detection or clip-extraction subcommand. Its own guides recommend "split your video manually first."
- **roop-unleashed** at `/opt/roop-unleashed/roop/util_ffmpeg.py` — has `cut_video(start_frame, end_frame)` for a manual GUI trim, no detection-driven segmentation.
- **Deep-Live-Cam** ([github.com/hacksider/Deep-Live-Cam](https://github.com/hacksider/Deep-Live-Cam)) — real-time / single-shot, no batch preprocessing.
- **DeepFaceLab** — `extract_video.bat` dumps every frame between user-supplied trim points; no quality gating.

Closest prior art for the cut-detection pattern is the two-stage hybrid in [SportSBD MMSys'26](https://dl.acm.org/doi/10.1145/3793853.3799803) (cheap detector for cuts, accurate net for verification), but the actual implementation has to be ours.

## 2. Pipeline architecture
```
WSL /opt/face-sets/work/                  Windows C:\face_embed_venv\
─────────────────────────────────────     ─────────────────────────────
run_video_pipeline.sh (chain driver)
│
├─ scan   (ffprobe metadata)
├─ scenes (PySceneDetect AdaptiveDetector, CPU)
├─ stage  (sampled frame queue.json @ 2 fps)
│                                             │
│                                             ▼
│                                         video_face_worker.py
│                                         insightface FaceAnalysis
│                                         on DmlExecutionProvider
│                                         output: results.jsonl
├─ merge  (ingest results.jsonl)
├─ track  (IoU + embedding stitching, ~30 LOC)
├─ score  (track-level quality gate + cross-track merge)
├─ cut    (ffmpeg -c copy → per-source subfolders)
└─ report (HTML preview)

Output: <output_dir>/<source_video_stem>/<uuid>.mp4
                                        /<uuid>.json  (sidecar; opt-in via --write-sidecar)
```

`run_video_pipeline.sh` is parameterized via env vars (`WORK`, `INPUT_DIR`, `OUTPUT_DIR`, `FILTER_FROM`, `SKIP_PATTERN`, `MAX_DUR`, `IDENTITY`, `SIDECAR`) so you can pin a particular batch without editing the script. Sidecars are off by default — the per-batch `plan.json` always carries the full provenance for every clip; the `<uuid>.json` files alongside the clips are redundant and only useful if you need each clip to be self-describing in isolation.
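
The `track` step above is deliberately small (the "~30 LOC" in the diagram): per-frame detections are stitched into tracks by bounding-box IoU, with the insightface embedding as a fallback when boxes drift between samples. A minimal sketch of that idea, using illustrative names and thresholds rather than the actual implementation:

```python
import numpy as np

# Illustrative thresholds, not the pipeline's actual values.
IOU_MIN = 0.3   # box overlap needed to continue a track
SIM_MIN = 0.4   # embedding cosine similarity used as a fallback

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def cos_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def stitch(samples):
    """samples: detections within one scene, each {'frame', 'bbox', 'embedding'}.
    Returns a list of tracks, each a chronological list of detections."""
    tracks = []
    for det in sorted(samples, key=lambda d: d["frame"]):
        match = None
        for tr in tracks:
            last = tr[-1]
            if (iou(last["bbox"], det["bbox"]) >= IOU_MIN
                    or cos_sim(last["embedding"], det["embedding"]) >= SIM_MIN):
                match = tr
                break
        if match is not None:
            match.append(det)      # continue the existing track
        else:
            tracks.append([det])   # start a new track
    return tracks
```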
## 3. Quality signals (matched to inswapper_128's working envelope)

inswapper_128 is trained on near-frontal faces at 128×128. The score gate uses defaults that admit side profiles (since rich face-sets can absorb non-frontal swap targets):

| signal | threshold | rationale |
|--------|----------:|-----------|
| `\|yaw\|` | ≤ 75° | covers full 3/4 + side profile |
| `\|pitch\|` | ≤ 45° | covers extreme up/down looks |
| `face_short` | ≥ 80 px | inswapper resamples to 128; ≥80 still produces clean output |
| `det_score` | ≥ 0.5 | matches buffalo_l's MIN_DET; lower = unreliable detection |
| track-gate | ≥ 70 % frames pass | binary track filter rather than per-frame |
| duration | 1 s ≤ dur ≤ 120 s | below 1 s = unusable slivers; above 120 s probably contains a missed micro-cut |

Plus two segment-merging knobs:

- `--bridge-gap` (default 3 s) — within a single track, brief pose-failure gaps shorter than this get bridged so single bad frames don't fragment a good run
- `--merge-gap` (default 2 s) — across tracks within the same scene, segments closer than this get fused (cross-track merge fires when face detection briefly fails between adjacent good runs)

The defaults can be tightened (e.g. `--max-yaw 25` for portrait-only) or loosened (e.g. `--max-yaw 90 --merge-gap 5`) without re-running detection — `score` reads the existing `tracks.json`.
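
As a concrete reading of the table and the two knobs, here is a minimal sketch of the gate; the field names, helper shape, and the plain pass-ratio loop are illustrative assumptions, not the actual `score` implementation:

```python
from dataclasses import dataclass

@dataclass
class Gate:
    # Defaults mirror the table above.
    max_yaw: float = 75.0
    max_pitch: float = 45.0
    min_face_short: float = 80.0   # shorter bbox side, in pixels
    min_det_score: float = 0.5
    min_pass_ratio: float = 0.70   # track-level gate
    min_dur: float = 1.0           # seconds
    max_dur: float = 120.0

def frame_ok(f: dict, g: Gate) -> bool:
    """Per-frame pose / size / detection-confidence check."""
    return (abs(f["yaw"]) <= g.max_yaw
            and abs(f["pitch"]) <= g.max_pitch
            and f["face_short"] >= g.min_face_short
            and f["det_score"] >= g.min_det_score)

def track_ok(frames: list[dict], sample_fps: float, g: Gate) -> bool:
    """Binary track filter: enough passing frames and a sane duration.
    --bridge-gap and --merge-gap are applied afterwards, on the surviving segments."""
    if not frames:
        return False
    dur = len(frames) / sample_fps   # frames are sampled @ 2 fps in this pipeline
    passed = sum(frame_ok(f, g) for f in frames)
    return (g.min_dur <= dur <= g.max_dur
            and passed / len(frames) >= g.min_pass_ratio)
```

Because a gate like this runs against the already-written `tracks.json`, re-running it with different thresholds costs seconds, which is what makes the `--max-yaw` experiments above cheap.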
## 4. Performance + the JSONL append-only fix

This is where the engineering interest is. The first production run on 13 videos / 6.18 h of input went through three failure modes before settling at production speed:

| attempt | issue | rate observed |
|---|---|---:|
| 1. Original `cap.set(POS_FRAMES, N)` per sample | OpenCV seeks to the nearest keyframe + decodes forward at every sample. Cost grows with depth into the video; on a 60-min H.264 it falls off a cliff. | 1.4 fps → degrading |
| 2. Sequential `cap.grab()` from frame 0 | On resume, grab-walking from frame 0 to a deep target is unbounded. | 0.08 fps |
| 3. Hybrid: seek-once-per-video + sequential within | Better in principle. But hit a different bug: `flush()` was re-serializing the entire `results.json` (245 MB at this point) every 100 frames or 30 sec. Save dominated wall-clock. | 0.5 fps |
| 4. **JSONL append-only** | One result per line. Each flush is O(new records), not O(total records). | **13.77 fps** smoke / 7.57 fps cumulative across the full batch |

Lesson: when the output is large + grows monotonically + needs frequent checkpointing, *do not* re-serialize the whole structure on each flush. Append-only line-delimited JSON is the right tool. The legacy `results.json` is auto-converted to `.jsonl` on first load (one-time migration), so resumes survive the format switch.
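
A minimal sketch of the append-only pattern, assuming one JSON object per queue entry keyed by `queue_id` for resume (the class and field names are illustrative, not the worker's actual code):

```python
import json
import os

class JsonlWriter:
    """Append-only results file: each flush costs O(new records), not O(total)."""

    def __init__(self, path: str):
        self.path = path
        self.done = set()
        # Resume support: anything already on disk is skipped after a relaunch.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                for line in fh:
                    self.done.add(json.loads(line)["queue_id"])
        self.fh = open(path, "a", encoding="utf-8")

    def write(self, rec: dict) -> None:
        if rec["queue_id"] in self.done:
            return                      # already processed in a previous run
        self.fh.write(json.dumps(rec) + "\n")
        self.done.add(rec["queue_id"])

    def flush(self) -> None:
        self.fh.flush()
        os.fsync(self.fh.fileno())      # survive a hard kill mid-batch
```

A crash loses at most the unflushed tail, and a resume only has to re-read the file once instead of re-serializing everything on every checkpoint.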
## 5. Hardware decode/encode on AMD Vega + WSL

Skipped. Per [Microsoft's WSL D3D12 video acceleration post](https://devblogs.microsoft.com/commandline/d3d12-gpu-video-acceleration-in-the-windows-subsystem-for-linux-now-available/), VAAPI-via-Mesa-D3D12 exists but is fragile on older AMD. AMF on Windows would mean a Windows-side ffmpeg leg, doubling boundary crossings. CPU software decode of 1280×720 H.264 in WSL ffmpeg is faster than realtime, and the bottleneck is buffalo_l detection on DML, not decode.

For cutting we use `-c copy` stream-copy — no re-encode, hardware codecs are moot.
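
The cut stage's per-segment ffmpeg call is roughly of this shape; a sketch only, with the exact flags and helper an assumption rather than the script's real command line:

```python
import pathlib
import subprocess
import uuid

def cut_segment(src: str, start: float, dur: float, out_dir: str) -> pathlib.Path:
    """Stream-copy one segment: -ss before -i seeks on keyframes, -c copy skips re-encoding."""
    out = pathlib.Path(out_dir) / f"{uuid.uuid4()}.mp4"
    subprocess.run(
        ["ffmpeg", "-nostdin", "-y",
         "-ss", f"{start:.3f}", "-i", src,
         "-t", f"{dur:.3f}",
         "-c", "copy", "-avoid_negative_ts", "make_zero",
         str(out)],
        check=True,
    )
    return out
```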
## 6. Full corpus run results

Three runs across the 61-video corpus at `/mnt/x/src/vd/`:

| | test (3 videos) | first batch (13 videos, 50–62) | rest (45 videos, 02–49 minus test) | **total** |
|---|---:|---:|---:|---:|
| input duration | 0.6 h | 6.18 h | 12.98 h | **19.76 h** |
| sampled frames @ 2 fps | 4,472 | 44,635 | 94,030 | 143,137 |
| tracks | 187 | 2,564 | 3,823 | 6,574 |
| accepted tracks | 94 (50 %) | 1,193 (47 %) | 1,905 (50 %) | 3,192 (49 %) |
| **emitted segments** | **83** | **600** | **1,301** | **1,984** |
| cross-track-merged segments | 14 | 254 | 382 | 650 |
| accepted content | 13 min | 239 min | 395 min | **647 min (= 10.78 h)** |
| acceptance rate by time | 36 % | 64.6 % | 50.7 % | **54.6 %** |
| output size | 0.135 GB | 3.63 GB | 4.84 GB | **8.6 GB** |

Phase timings (rest batch — best representative since it ran fully under JSONL append-only from a fresh start):

- scenes: 117 min (PySceneDetect, 45 × ~3 min/video)
- stage: instant
- worker: 100 min @ **15.78 fps** sustained (vs 7.5 fps for the first batch, which migrated mid-run)
- merge: 90 s
- track: 92 s
- score: 23 s
- cut (1,301 ffmpeg stream-copies): 30 min
- report (1,301 thumbs + HTML): 5.5 min
- **total wall-clock: 4h16m**

Across all three runs, **0 worker errors on 143,137 sampled frames**.
## 7. Re-running

```bash
# choose a per-batch workdir + log
WORK=/opt/face-sets/work/video_preprocess_<batch_name> \
FILTER_FROM=ct_src_00050.mp4 \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch_name>.log 2>&1 &

# check status anytime
bash work/status_video_pipeline.sh work/logs/video_run_<batch_name>.log
```

Skip patterns can exclude already-processed inputs (note that 5-digit basename numbers need full padding in the regex, e.g. `0005[0-9]`, not `005[0-9]`):

```bash
SKIP_PATTERN='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$' \
WORK=/opt/face-sets/work/video_preprocess_rest \
bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &
```

To also emit per-clip provenance sidecars (off by default):

```bash
SIDECAR=yes \
WORK=/opt/face-sets/work/video_preprocess_<batch> \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch>.log 2>&1 &
```

`scenes` outputs are cached in the batch's `WORK/scenes/` dir, so re-running the chain after tweaking the score step doesn't redo detection. The worker is also resumable per `queue_id` — if killed mid-flight, just relaunch.