Update video preprocessing doc with full-corpus results

After completing the rest-of-corpus run, update docs/analysis to reflect
the final numbers across all three batches (test + 13-file + 45-file)
and surface the numerical lessons:
- 1,984 segments / 10.78h accepted content from 19.76h / 61 input videos
- 0 worker errors across 143,137 sampled frames
- rest batch sustained 15.78 fps from a fresh JSONL start (vs 7.5 fps for
  the migrated batch), confirming the append-only fix is the right
  steady-state design
- skip-pattern note: 5-digit basename numbers need full padding
  (0005[0-9] not 005[0-9]) — bit me on the first relaunch
- documented SIDECAR=yes opt-in for the chain script

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 16:47:59 +02:00
parent 7960dec350
commit 308597ebf0

@@ -82,30 +82,34 @@ Skipped. Per [Microsoft's WSL D3D12 video acceleration post](https://devblogs.mi
 For cutting we use `-c copy` stream-copy: no re-encode, hardware codecs are moot.
 
-## 6. First batch run results (ct_src_00050..00062)
+## 6. Full corpus run results
 
-| | |
-|---|---:|
-| input videos | 13 |
-| input duration | 6.18 h |
-| sampled frames | 44,635 (@ 2 fps) |
-| accepted tracks | 1,193 / 2,564 (47 %) |
-| **emitted segments** | **600** |
-| segments built from ≥2 tracks (cross-track merge fired) | 254 |
-| accepted content total | 239.5 min (64.6 % of input) |
-| segment duration min/median/mean/max | 1 / 12 / 24 / 119 s |
-| output size | 3.63 GB |
+Three runs across the 61-video corpus at `/mnt/x/src/vd/`:
 
-Phase timings:
-- scenes: 25 min (cached on later runs)
+| | test (3 videos) | first batch (13 videos, 50-62) | rest (45 videos, 02-49 minus test) | **total** |
+|---|---:|---:|---:|---:|
+| input duration | 0.6 h | 6.18 h | 12.98 h | **19.76 h** |
+| sampled frames @ 2 fps | 4,472 | 44,635 | 94,030 | 143,137 |
+| tracks | 187 | 2,564 | 3,823 | 6,574 |
+| accepted tracks | 94 (50 %) | 1,193 (47 %) | 1,905 (50 %) | 3,192 (49 %) |
+| **emitted segments** | **83** | **600** | **1,301** | **1,984** |
+| cross-track-merged segments | 14 | 254 | 382 | 650 |
+| accepted content | 13 min | 239 min | 395 min | **647 min (= 10.78 h)** |
+| acceptance rate by time | 36 % | 64.6 % | 50.7 % | **54.6 %** |
+| output size | 0.135 GB | 3.63 GB | 4.84 GB | **8.6 GB** |
+
+Phase timings (rest batch, the best representative since it ran fully under JSONL append-only from a fresh start):
+- scenes: 117 min (PySceneDetect, 45 × ~3 min/video)
 - stage: instant
-- worker: 78 min @ ~7.5 fps cumulative
-- merge: 73 s
-- track: 77 s
-- score: 21 s
-- cut (600 ffmpeg stream-copies): 19 min
-- report (600 thumbs + HTML): 3 min
-- **total wall-clock: 1h43m**
+- worker: 100 min @ **15.78 fps** sustained (vs 7.5 fps for the first batch, which migrated mid-run)
+- merge: 90 s
+- track: 92 s
+- score: 23 s
+- cut (1,301 ffmpeg stream-copies): 30 min
+- report (1,301 thumbs + HTML): 5.5 min
+- **total wall-clock: 4h16m**
 
+Across all three runs, **0 worker errors on 143,137 sampled frames**.
+
 ## 7. Re-running
 
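
The totals in the section 6 table can be re-derived from its per-batch columns. The following is a minimal shell sketch, assuming `awk` is available. The helper names `sum`, `pct`, and `minutes_at` are hypothetical, and every input figure is copied from the table.

```bash
#!/usr/bin/env bash
# Cross-check of the section 6 totals. All figures are copied from the table;
# the helper names below are hypothetical and not part of the pipeline.

# Sum any number of numeric arguments.
sum() { awk 'BEGIN { for (i = 1; i < ARGC; i++) s += ARGV[i]; print s + 0 }' "$@"; }

# Accepted minutes as a percentage of input hours, one decimal place.
pct() { awk -v min="$1" -v hours="$2" 'BEGIN { printf "%.1f\n", 100 * min / (hours * 60) }'; }

# Minutes needed to process a frame count at a given rate, one decimal place.
minutes_at() { awk -v frames="$1" -v fps="$2" 'BEGIN { printf "%.1f\n", frames / fps / 60 }'; }

echo "sampled frames:   $(sum 4472 44635 94030)"      # test + first batch + rest
echo "emitted segments: $(sum 83 600 1301)"
echo "acceptance rate:  $(pct 647 19.76) %"           # 647 min out of 19.76 h
echo "rest worker time: $(minutes_at 94030 15.78) min"
```

The worker figure comes out at 99.3 min, which agrees with the rounded 100 min listed in the phase timings for 94,030 frames at 15.78 fps.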
@@ -119,12 +123,20 @@ WORK=/opt/face-sets/work/video_preprocess_<batch_name> \
 bash work/status_video_pipeline.sh work/logs/video_run_<batch_name>.log
 ```
 
-Skip patterns can exclude already-processed inputs:
+Skip patterns can exclude already-processed inputs (note that 5-digit numbers need full padding in the regex, e.g. `0005[0-9]` not `005[0-9]`):
 
 ```bash
-SKIP_PATTERN='^ct_src_(0001[015]|005[0-9]|006[0-9])\.mp4$' \
+SKIP_PATTERN='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$' \
 WORK=/opt/face-sets/work/video_preprocess_rest \
 bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &
 ```
 
+To also emit per-clip provenance sidecars (off by default):
+
+```bash
+SIDECAR=yes \
+WORK=/opt/face-sets/work/video_preprocess_<batch> \
+bash work/run_video_pipeline.sh > work/logs/video_run_<batch>.log 2>&1 &
+```
+
 `scenes` outputs are cached in the batch's `WORK/scenes/` dir, so re-running the chain after an edit-to-score step doesn't redo detection. The worker is also resumable per `queue_id`; if killed mid-flight, just relaunch.
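
The unpadded alternative fails because the pattern is anchored at `^ct_src_`: `005[0-9]` has to match immediately after the prefix, but a 5-digit basename such as `ct_src_00050.mp4` continues with `000`. Below is a minimal sketch for checking a pattern before launching a run. It assumes the pipeline applies `SKIP_PATTERN` as a POSIX extended regex to the bare filename; the matching code itself is not shown in this change, and `is_skipped` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: is_skipped is a hypothetical helper. It assumes SKIP_PATTERN is
# matched as an extended regex (grep -E) against the bare filename.
is_skipped() {
  # $1 = pattern, $2 = filename; exit status 0 means "would be skipped".
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

old_pattern='^ct_src_(0001[015]|005[0-9]|006[0-9])\.mp4$'    # unpadded, from the first relaunch
new_pattern='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$'  # fully padded

for f in ct_src_00015.mp4 ct_src_00049.mp4 ct_src_00050.mp4 ct_src_00062.mp4; do
  is_skipped "$old_pattern" "$f" && old=skip || old=process
  is_skipped "$new_pattern" "$f" && new=skip || new=process
  printf '%s  old=%s  new=%s\n' "$f" "$old" "$new"
done
```

Under that assumption, the old pattern skips only `ct_src_00015.mp4` and lets `00050` and `00062` through to be processed again, while the padded pattern skips all three and still processes `ct_src_00049.mp4`.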