From 308597ebf0f2c918b41390d575b7c8bc1ec5ad7f Mon Sep 17 00:00:00 2001
From: Peter <peter@computerlie.be>
Date: Tue, 28 Apr 2026 16:47:59 +0200
Subject: [PATCH] Update video preprocessing doc with full-corpus results
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

After completing the rest-of-corpus run, update docs/analysis to reflect
the final numbers across all three batches (test + 13-file + 45-file)
and surface the numerical lessons:
- 1,984 segments / 10.78h accepted content from 19.76h / 61 input videos
- 0 worker errors across 143,137 sampled frames
- rest batch sustained 15.78 fps from a fresh JSONL start (vs 7.5 fps for
  the migrated batch), confirming the append-only fix is the right
  steady-state design
- skip-pattern note: 5-digit basename numbers need full padding
  (0005[0-9] not 005[0-9]) — bit me on the first relaunch
- documented SIDECAR=yes opt-in for the chain script

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/analysis/video-target-preprocessing.md | 58 +++++++++++++--------
 1 file changed, 35 insertions(+), 23 deletions(-)
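
Reviewer note on the skip-pattern change: the padding fix can be checked offline before a relaunch. The sketch below is a minimal check, assuming the chain script treats `SKIP_PATTERN` as a POSIX extended regex matched against the input basename (the actual matching in `run_video_pipeline.sh` may differ). The helper `would_skip` is a hypothetical name, not part of the pipeline.

```shell
#!/usr/bin/env bash
# Sketch: report which basenames a SKIP_PATTERN would exclude.
# Assumption: the pattern is an extended regex applied to the basename.

# Hypothetical helper: succeeds when the name matches the pattern.
would_skip() {
  local pattern=$1 name=$2
  [[ $name =~ $pattern ]]
}

# Patterns taken from the old and new versions of the doc.
old='^ct_src_(0001[015]|005[0-9]|006[0-9])\.mp4$'
new='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$'

for name in ct_src_00010.mp4 ct_src_00049.mp4 ct_src_00050.mp4 ct_src_00062.mp4; do
  for label in old new; do
    # ${!label} expands to the value of the variable named by $label.
    if would_skip "${!label}" "$name"; then verdict=skip; else verdict=keep; fi
    printf '%-4s %-18s %s\n' "$label" "$name" "$verdict"
  done
done
```

Under that assumption, the old pattern keeps `ct_src_00050.mp4` and `ct_src_00062.mp4` (so they would be reprocessed), while the new pattern skips both and still keeps `ct_src_00049.mp4`.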

diff --git a/docs/analysis/video-target-preprocessing.md b/docs/analysis/video-target-preprocessing.md
index 5bc2652..3612489 100644
--- a/docs/analysis/video-target-preprocessing.md
+++ b/docs/analysis/video-target-preprocessing.md
@@ -82,30 +82,34 @@ Skipped. Per [Microsoft's WSL D3D12 video acceleration post](https://devblogs.mi
 
 For cutting we use `-c copy` stream-copy — no re-encode, hardware codecs are moot.
 
-## 6. First batch run results (ct_src_00050..00062)
+## 6. Full corpus run results
 
-| | |
-|---|---:|
-| input videos | 13 |
-| input duration | 6.18 h |
-| sampled frames | 44,635 (@ 2 fps) |
-| accepted tracks | 1,193 / 2,564 (47 %) |
-| **emitted segments** | **600** |
-| segments built from ≥2 tracks (cross-track merge fired) | 254 |
-| accepted content total | 239.5 min (64.6 % of input) |
-| segment duration min/median/mean/max | 1 / 12 / 24 / 119 s |
-| output size | 3.63 GB |
+Three runs across the 61-video corpus at `/mnt/x/src/vd/`:
 
-Phase timings:
-- scenes: 25 min (cached on later runs)
+| | test (3 videos) | first batch (13 videos, 50–62) | rest (45 videos, 02–49 minus test) | **total** |
+|---|---:|---:|---:|---:|
+| input duration | 0.6 h | 6.18 h | 12.98 h | **19.76 h** |
+| sampled frames @ 2 fps | 4,472 | 44,635 | 94,030 | 143,137 |
+| tracks | 187 | 2,564 | 3,823 | 6,574 |
+| accepted tracks | 94 (50 %) | 1,193 (47 %) | 1,905 (50 %) | 3,192 (49 %) |
+| **emitted segments** | **83** | **600** | **1,301** | **1,984** |
+| cross-track-merged segments | 14 | 254 | 382 | 650 |
+| accepted content | 13 min | 239 min | 395 min | **647 min (= 10.78 h)** |
+| acceptance rate by time | 36 % | 64.6 % | 50.7 % | **54.6 %** |
+| output size | 0.135 GB | 3.63 GB | 4.84 GB | **8.6 GB** |
+
+Phase timings for the rest batch (the most representative run, since it ran entirely under append-only JSONL from a fresh start):
+- scenes: 117 min (PySceneDetect, 45 videos × ~2.6 min/video)
 - stage: instant
-- worker: 78 min @ ~7.5 fps cumulative
-- merge: 73 s
-- track: 77 s
-- score: 21 s
-- cut (600 ffmpeg stream-copies): 19 min
-- report (600 thumbs + HTML): 3 min
-- **total wall-clock: 1h43m**
+- worker: 100 min @ **15.78 fps** sustained (vs 7.5 fps for the first batch, which migrated mid-run)
+- merge: 90 s
+- track: 92 s
+- score: 23 s
+- cut (1,301 ffmpeg stream-copies): 30 min
+- report (1,301 thumbs + HTML): 5.5 min
+- **total wall-clock: 4h16m**
+
+Across all three runs, **0 worker errors on 143,137 sampled frames**.
 
 ## 7. Re-running
 
@@ -119,12 +123,20 @@ WORK=/opt/face-sets/work/video_preprocess_<batch_name> \
 bash work/status_video_pipeline.sh work/logs/video_run_<batch_name>.log
 ```
 
-Skip patterns can exclude already-processed inputs:
+Skip patterns can exclude already-processed inputs. The numeric part of the basename is 5 digits, so the regex needs the full zero padding: `0005[0-9]`, not `005[0-9]`, which never matches `ct_src_00050.mp4`:
 
 ```bash
-SKIP_PATTERN='^ct_src_(0001[015]|005[0-9]|006[0-9])\.mp4$' \
+SKIP_PATTERN='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$' \
   WORK=/opt/face-sets/work/video_preprocess_rest \
   bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &
 ```
 
+To also emit per-clip provenance sidecars (off by default):
+
+```bash
+SIDECAR=yes \
+  WORK=/opt/face-sets/work/video_preprocess_<batch> \
+  bash work/run_video_pipeline.sh > work/logs/video_run_<batch>.log 2>&1 &
+```
+
 `scenes` outputs are cached in the batch's `WORK/scenes/` dir, so re-running the chain after an edit-to-score step doesn't redo detection. The worker is also resumable per `queue_id` — if killed mid-flight, just relaunch.