Video target preprocessing for roop-unleashed

Initial design + first batch run: 2026-04-27. Driver scripts: work/video_target_pipeline.py, work/video_face_worker.py, work/run_video_pipeline.sh.

Companion to the face-set side of the project: instead of building per-identity .fsz bundles for the source of a swap, this pipeline preprocesses the target (videos to swap into). Given a folder of video files, it identifies "swappable" segments — continuous shots where a face is detectable, sufficiently visible, and roughly within inswapper_128's working envelope — and cuts them into UUID-named clips ready to feed into roop-unleashed.

1. Why build it

I checked the obvious open-source projects for an existing implementation:

  • FaceFusion (github.com/facefusion/facefusion) — CLI has run, headless-run, batch-run, job-*, force-download, benchmark. No scene-detection or clip-extraction subcommand. Its own guides recommend "split your video manually first."
  • roop-unleashed at /opt/roop-unleashed/roop/util_ffmpeg.py — has cut_video(start_frame, end_frame) for a manual GUI trim, no detection-driven segmentation.
  • Deep-Live-Cam (github.com/hacksider/Deep-Live-Cam) — real-time / single-shot, no batch preprocessing.
  • DeepFaceLab — extract_video.bat dumps every frame between user-supplied trim points; no quality gating.

Closest prior art for the cut-detection pattern is the two-stage hybrid in SportSBD MMSys'26 (cheap detector for cuts, accurate net for verification), but the actual implementation has to be ours.

2. Pipeline architecture

WSL  /opt/face-sets/work/                   Windows  C:\face_embed_venv\
─────────────────────────────────────       ─────────────────────────────
run_video_pipeline.sh (chain driver)
   │
   ├─ scan         (ffprobe metadata)
   ├─ scenes       (PySceneDetect AdaptiveDetector, CPU)
   ├─ stage        (sampled frame queue.json @ 2 fps)
   │                                  │
   │                                  ▼
   │                            video_face_worker.py
   │                            insightface FaceAnalysis
   │                            on DmlExecutionProvider
   │                            output: results.jsonl
   ├─ merge        (ingest results.jsonl)
   ├─ track        (IoU + embedding stitching, ~30 LOC)
   ├─ score        (track-level quality gate + cross-track merge)
   ├─ cut          (ffmpeg -c copy → per-source subfolders)
   └─ report       (HTML preview)

   Output: <output_dir>/<source_video_stem>/<uuid>.mp4
                                           /<uuid>.json (sidecar)

run_video_pipeline.sh is parameterized via env vars (WORK, INPUT_DIR, OUTPUT_DIR, FILTER_FROM, SKIP_PATTERN, MAX_DUR, IDENTITY) so you can pin a particular batch without editing the script.
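
Of the steps above, track is the smallest piece of real logic (~30 LOC of IoU + embedding stitching). A minimal sketch of the idea, assuming each sampled frame carries detections with a bbox and an L2-normalized buffalo_l embedding; the threshold values and field names here are illustrative, not the actual implementation:

import numpy as np

IOU_MIN = 0.3   # spatial overlap needed to extend a track (illustrative)
SIM_MIN = 0.4   # cosine similarity needed to extend a track (illustrative)

def iou(a, b):
    # a, b: (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def stitch(samples):
    # samples: [(frame_idx, [{"bbox": ..., "emb": ...}, ...]), ...] in time order
    tracks = []
    for idx, faces in samples:
        for f in faces:
            emb = np.asarray(f["emb"])
            match = next((t for t in tracks
                          if iou(t["bbox"], f["bbox"]) >= IOU_MIN
                          and float(t["emb"] @ emb) >= SIM_MIN), None)
            if match is None:
                match = {"frames": [], "bbox": f["bbox"], "emb": emb}
                tracks.append(match)
            match["frames"].append((idx, f))
            match["bbox"] = f["bbox"]                  # follow the face spatially
            e = 0.9 * match["emb"] + 0.1 * emb         # slow-moving identity estimate
            match["emb"] = e / (np.linalg.norm(e) + 1e-9)
    return tracks

(Real code would also close tracks that haven't been seen for a few samples and respect scene boundaries.)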

3. Quality signals (matched to inswapper_128's working envelope)

inswapper_128 is trained near-frontal at 128×128. The score gate uses defaults that admit side profiles (since rich face-sets can absorb non-frontal swap targets):

signal        threshold               rationale
yaw           |yaw| ≤ 75°             admits side profiles; a rich face-set can absorb non-frontal swap targets
pitch         |pitch| ≤ 45°           same reasoning as yaw
face_short    ≥ 80 px                 inswapper resamples to 128; ≥ 80 still produces clean output
det_score     ≥ 0.5                   matches buffalo_l's MIN_DET; lower = unreliable detection
track-gate    ≥ 70 % of frames pass   binary track-level filter rather than per-frame
duration      1 s ≤ dur ≤ 120 s       below 1 s = unusable slivers; above 120 s probably contains a missed micro-cut
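
The gate is applied per track, not per frame: a track passes if at least 70 % of its sampled frames clear the per-frame checks. A minimal sketch of that rule (field names assumed, not the actual tracks.json schema):

def frame_ok(f, max_yaw=75, max_pitch=45, min_face=80, min_det=0.5):
    # f: one sampled detection with pose angles, short bbox side in px, detector confidence
    return (abs(f["yaw"]) <= max_yaw and abs(f["pitch"]) <= max_pitch
            and f["face_short"] >= min_face and f["det_score"] >= min_det)

def track_ok(track, pass_ratio=0.70):
    frames = track["frames"]
    good = sum(frame_ok(f) for f in frames)
    return bool(frames) and good / len(frames) >= pass_ratio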

Plus two segment-merging knobs:

  • --bridge-gap (default 3 s) — within a single track, brief pose-failure gaps shorter than this get bridged so single bad frames don't fragment a good run
  • --merge-gap (default 2 s) — across tracks within the same scene, segments closer than this get fused (cross-track merge fires when face detection briefly fails between adjacent good runs)

The defaults can be tightened (e.g. --max-yaw 25 for portrait-only) or loosened (e.g. --max-yaw 90 --merge-gap 5) without re-running detection — score reads the existing tracks.json.
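
A sketch of how the two knobs turn per-frame pass/fail into emitted segments; the timestamps and grouping shown are assumptions, not the pipeline's actual code:

def to_segments(times_ok, bridge_gap=3.0):
    # times_ok: sorted timestamps (s) of passing frames within one track
    segs = []
    for t in times_ok:
        if segs and t - segs[-1][1] <= bridge_gap:
            segs[-1][1] = t                  # --bridge-gap: span a brief pose-failure gap
        else:
            segs.append([t, t])              # otherwise start a new segment
    return segs

def merge_across_tracks(segs, merge_gap=2.0):
    # segs: [start, end] segments from all tracks in one scene
    merged = []
    for s in sorted(segs):
        if merged and s[0] - merged[-1][1] <= merge_gap:
            merged[-1][1] = max(merged[-1][1], s[1])   # --merge-gap: cross-track merge
        else:
            merged.append(list(s))
    return merged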

4. Performance + the JSONL append-only fix

This is where the engineering interest is. The first production run on 13 videos / 6.18 h of input went through three failure modes before settling at production speed:

  1. Original: cap.set(POS_FRAMES, N) per sample. OpenCV seeks to the nearest keyframe and decodes forward on every sample; the cost grows with depth into the video, and on a 60-min H.264 it falls off a cliff. Observed: 1.4 fps and degrading.
  2. Sequential cap.grab() from frame 0. On resume, grab-walking from frame 0 to a deep target is unbounded. Observed: 0.08 fps.
  3. Hybrid: seek-once-per-video + sequential within. Better in principle, but it hit a different bug: flush() was re-serializing the entire results.json (245 MB at that point) every 100 frames or 30 s, so the save dominated wall-clock. Observed: 0.5 fps.
  4. JSONL append-only. One result per line, so each flush is O(new records), not O(total records). Observed: 13.77 fps smoke test, 7.57 fps cumulative across the full batch.
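
The decode pattern that survived (seek once per video, then walk forward with cap.grab() and retrieve only the sampled frames) looks roughly like this; the helper is hypothetical, not the worker's actual code:

import cv2

def iter_samples(path, sample_idxs):
    # sample_idxs: sorted frame indices to hand to the detector
    cap = cv2.VideoCapture(path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, sample_idxs[0])   # the one and only seek
    pos, wanted, last = sample_idxs[0], set(sample_idxs), sample_idxs[-1]
    while pos <= last:
        if not cap.grab():                             # advance one frame without converting it
            break
        if pos in wanted:
            ok, frame = cap.retrieve()                 # decode/convert only the frames we keep
            if ok:
                yield pos, frame
        pos += 1
    cap.release()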

Lesson: when the output is large + grows monotonically + needs frequent checkpointing, do not re-serialize the whole structure on each flush. Append-only line-delimited JSON is the right tool. The legacy results.json is auto-converted to .jsonl on first load (one-time migration), so resumes survive the format switch.
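
The fix itself is small. A sketch of append-only checkpointing with JSONL (names illustrative; the worker's buffering details differ):

import json

def flush_append(pending, path="results.jsonl"):
    # O(new records): write only what hasn't been persisted yet, then forget it
    with open(path, "a", encoding="utf-8") as f:
        for rec in pending:
            f.write(json.dumps(rec) + "\n")
    pending.clear()

def load_results(path="results.jsonl"):
    # resume: one JSON object per line, read back in full on startup
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]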

5. Hardware decode/encode on AMD Vega + WSL

Skipped. Per Microsoft's WSL D3D12 video acceleration post, VAAPI-via-Mesa-D3D12 exists but is fragile on older AMD. AMF on Windows would mean a Windows-side ffmpeg leg, doubling boundary crossings. CPU software decode of 1280×720 H.264 in WSL ffmpeg is faster than realtime, and the bottleneck is buffalo_l detection on DML, not decode.

For cutting we use -c copy stream-copy — no re-encode, hardware codecs are moot.
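
As a reference point for the cut step, a stream-copy cut is a single ffmpeg invocation per segment; this sketch mirrors the idea but is not the pipeline's exact argument list:

import subprocess, uuid

def cut_segment(src, start_s, end_s, out_dir):
    out = f"{out_dir}/{uuid.uuid4()}.mp4"
    subprocess.run([
        "ffmpeg", "-ss", f"{start_s:.3f}", "-i", src,
        "-t", f"{end_s - start_s:.3f}",
        "-c", "copy",                       # no re-encode; cuts snap to keyframes
        "-avoid_negative_ts", "make_zero",
        out,
    ], check=True)
    return out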

6. First batch run results (ct_src_00050..00062)

input videos                              13
input duration                            6.18 h
sampled frames                            44,635 (@ 2 fps)
accepted tracks                           1,193 / 2,564 (47 %)
emitted segments                          600
segments built from ≥ 2 tracks            254 (cross-track merge fired)
accepted content                          239.5 min (64.6 % of input)
segment duration (min/median/mean/max)    1 / 12 / 24 / 119 s
output size                               3.63 GB

Phase timings:

  • scenes: 25 min (cached on later runs)
  • stage: instant
  • worker: 78 min @ ~7.5 fps cumulative
  • merge: 73 s
  • track: 77 s
  • score: 21 s
  • cut (600 ffmpeg stream-copies): 19 min
  • report (600 thumbs + HTML): 3 min
  • total wall-clock: 1h43m

7. Re-running

# choose a per-batch workdir + log
WORK=/opt/face-sets/work/video_preprocess_<batch_name> \
  FILTER_FROM=ct_src_00050.mp4 \
  bash work/run_video_pipeline.sh > work/logs/video_run_<batch_name>.log 2>&1 &

# check status anytime
bash work/status_video_pipeline.sh work/logs/video_run_<batch_name>.log

Skip patterns can exclude already-processed inputs:

SKIP_PATTERN='^ct_src_(0001[015]|005[0-9]|006[0-9])\.mp4$' \
  WORK=/opt/face-sets/work/video_preprocess_rest \
  bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &

scenes outputs are cached in the batch's WORK/scenes/ dir, so re-running the chain after an edit to the score step (e.g. changed thresholds) doesn't redo detection. The worker is also resumable per queue_id — if killed mid-flight, just relaunch.
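
On the worker side, resumability falls out of the JSONL format: on startup, everything already present in results.jsonl can be skipped. A sketch of that check (the queue_id keying is per the design above; the field and file layouts are otherwise assumptions):

import json

def pending_items(queue_path="queue.json", results_path="results.jsonl"):
    done = set()
    try:
        with open(results_path, encoding="utf-8") as f:
            done = {json.loads(line)["queue_id"] for line in f if line.strip()}
    except FileNotFoundError:
        pass                                # first run: nothing done yet
    with open(queue_path, encoding="utf-8") as f:
        queue = json.load(f)
    return [item for item in queue if item["queue_id"] not in done]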