From 308597ebf0f2c918b41390d575b7c8bc1ec5ad7f Mon Sep 17 00:00:00 2001
From: Peter
Date: Tue, 28 Apr 2026 16:47:59 +0200
Subject: [PATCH] Update video preprocessing doc with full-corpus results
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

After completing the rest-of-corpus run, update docs/analysis to reflect
the final numbers across all three batches (test + 13-file + 45-file)
and surface the numerical lessons:

- 1,984 segments / 10.78h accepted content from 19.76h / 61 input videos
- 0 worker errors across 143,137 sampled frames
- rest batch sustained 15.78 fps from a fresh JSONL start (vs 7.5 fps for
  the migrated batch), confirming the append-only fix is the right
  steady-state design
- skip-pattern note: 5-digit basename numbers need full padding
  (0005[0-9] not 005[0-9]) — bit me on the first relaunch
- documented SIDECAR=yes opt-in for the chain script

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/analysis/video-target-preprocessing.md | 58 +++++++++++++--------
 1 file changed, 35 insertions(+), 23 deletions(-)

diff --git a/docs/analysis/video-target-preprocessing.md b/docs/analysis/video-target-preprocessing.md
index 5bc2652..3612489 100644
--- a/docs/analysis/video-target-preprocessing.md
+++ b/docs/analysis/video-target-preprocessing.md
@@ -82,30 +82,34 @@ Skipped. Per [Microsoft's WSL D3D12 video acceleration post](https://devblogs.mi
 
 For cutting we use `-c copy` stream-copy — no re-encode, hardware codecs are moot.
 
-## 6. First batch run results (ct_src_00050..00062)
+## 6. Full corpus run results
 
-| | |
-|---|---:|
-| input videos | 13 |
-| input duration | 6.18 h |
-| sampled frames | 44,635 (@ 2 fps) |
-| accepted tracks | 1,193 / 2,564 (47 %) |
-| **emitted segments** | **600** |
-| segments built from ≥2 tracks (cross-track merge fired) | 254 |
-| accepted content total | 239.5 min (64.6 % of input) |
-| segment duration min/median/mean/max | 1 / 12 / 24 / 119 s |
-| output size | 3.63 GB |
+Three runs across the 61-video corpus at `/mnt/x/src/vd/`:
 
-Phase timings:
-- scenes: 25 min (cached on later runs)
+| | test (3 videos) | first batch (13 videos, 50–62) | rest (45 videos, 02–49 minus test) | **total** |
+|---|---:|---:|---:|---:|
+| input duration | 0.6 h | 6.18 h | 12.98 h | **19.76 h** |
+| sampled frames @ 2 fps | 4,472 | 44,635 | 94,030 | 143,137 |
+| tracks | 187 | 2,564 | 3,823 | 6,574 |
+| accepted tracks | 94 (50 %) | 1,193 (47 %) | 1,905 (50 %) | 3,192 (49 %) |
+| **emitted segments** | **83** | **600** | **1,301** | **1,984** |
+| cross-track-merged segments | 14 | 254 | 382 | 650 |
+| accepted content | 13 min | 239 min | 395 min | **647 min (= 10.78 h)** |
+| acceptance rate by time | 36 % | 64.6 % | 50.7 % | **54.6 %** |
+| output size | 0.135 GB | 3.63 GB | 4.84 GB | **8.6 GB** |
+
+Phase timings (rest batch — best representative since it ran fully under JSONL append-only from a fresh start):
+- scenes: 117 min (PySceneDetect, 45 × ~3 min/video)
 - stage: instant
-- worker: 78 min @ ~7.5 fps cumulative
-- merge: 73 s
-- track: 77 s
-- score: 21 s
-- cut (600 ffmpeg stream-copies): 19 min
-- report (600 thumbs + HTML): 3 min
-- **total wall-clock: 1h43m**
+- worker: 100 min @ **15.78 fps** sustained (vs 7.5 fps for first batch which migrated mid-run)
+- merge: 90 s
+- track: 92 s
+- score: 23 s
+- cut (1,301 ffmpeg stream-copies): 30 min
+- report (1,301 thumbs + HTML): 5.5 min
+- **total wall-clock: 4h16m**
+
+Across all three runs, **0 worker errors on 143,137 sampled frames**.
 
 ## 7. Re-running
 
@@ -119,12 +123,20 @@ WORK=/opt/face-sets/work/video_preprocess_ \
   bash work/status_video_pipeline.sh work/logs/video_run_.log
 ```
 
-Skip patterns can exclude already-processed inputs:
+Skip patterns can exclude already-processed inputs (note that 5-digit numbers need full padding in the regex, e.g. `0005[0-9]` not `005[0-9]`):
 
 ```bash
-SKIP_PATTERN='^ct_src_(0001[015]|005[0-9]|006[0-9])\.mp4$' \
+SKIP_PATTERN='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$' \
   WORK=/opt/face-sets/work/video_preprocess_rest \
   bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &
 ```
 
+To also emit per-clip provenance sidecars (off by default):
+
+```bash
+SIDECAR=yes \
+  WORK=/opt/face-sets/work/video_preprocess_ \
+  bash work/run_video_pipeline.sh > work/logs/video_run_.log 2>&1 &
+```
+
 `scenes` outputs are cached in the batch's `WORK/scenes/` dir, so re-running the chain after an edit-to-score step doesn't redo detection. The worker is also resumable per `queue_id` — if killed mid-flight, just relaunch.
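
Reviewer note on the skip-pattern padding fix: the difference between `005[0-9]` and `0005[0-9]` can be checked without launching the pipeline. The sketch below is a minimal dry run, assuming the chain script applies `SKIP_PATTERN` to each input basename with POSIX extended-regex semantics (this patch does not show how the script does the matching). The `should_skip` helper and the sample basenames are hypothetical and exist only for illustration.

```bash
#!/bin/sh
# Hypothetical helper, not part of the pipeline: succeeds when the
# basename ($2) matches the skip pattern ($1).
# Assumption: extended-regex matching, as done here by `grep -E`.
should_skip() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Pattern before this patch (3-digit alternatives) and after it (full padding).
OLD='^ct_src_(0001[015]|005[0-9]|006[0-9])\.mp4$'
NEW='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$'

# Sample basenames in the ct_src_NNNNN.mp4 shape used by the doc.
for name in ct_src_00010.mp4 ct_src_00050.mp4 ct_src_00062.mp4 ct_src_00063.mp4; do
  for label in old new; do
    case $label in
      old) pattern=$OLD ;;
      new) pattern=$NEW ;;
    esac
    if should_skip "$pattern" "$name"; then
      verdict=skip
    else
      verdict=keep
    fi
    printf '%-3s %s -> %s\n' "$label" "$name" "$verdict"
  done
done
```

Under these assumptions the old pattern still skips `ct_src_00010.mp4` but keeps `ct_src_00050.mp4` and `ct_src_00062.mp4`, because `005[0-9]` cannot match a five-digit number that begins with `000`. The padded pattern skips all three and keeps `ct_src_00063.mp4`, which lies outside the 50–62 batch.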