Make per-clip sidecar JSONs opt-in (default off)
Previously every video_target_pipeline cut wrote a <uuid>.json provenance sidecar alongside each <uuid>.mp4. The same provenance is already in the per-batch plan.json, so the per-clip sidecars are redundant unless a downstream tool wants each clip self-describing in isolation.

- video_target_pipeline.py cut: new --write-sidecar flag, default off.
- run_video_pipeline.sh: new SIDECAR env var (default "no"), passes --write-sidecar when SIDECAR=yes.
- README + docs/analysis/video-target-preprocessing.md updated.

The 1,984 already-emitted sidecars in /mnt/x/src/vd/ct/ct_src_*/ have been deleted (1.5 MB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
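The commit message does not say how the already-emitted sidecars were removed. The following is only a sketch of one way to do such a cleanup, not the command that was actually run. The helper names `find_redundant_sidecars` and `remove_sidecars` are hypothetical, and it assumes that a sidecar is any `.json` file whose stem parses as a UUID and that has a sibling `.mp4` with the same stem.

```python
import uuid
from pathlib import Path


def _is_uuid(stem: str) -> bool:
    """True if the file stem parses as a UUID (hyphenated or plain hex)."""
    try:
        uuid.UUID(stem)
    except ValueError:
        return False
    return True


def find_redundant_sidecars(root: Path) -> list[Path]:
    """Return <uuid>.json files that sit next to a matching <uuid>.mp4.

    Assumption: anything else (plan.json, non-UUID names, orphaned JSON
    without a clip) is left alone.
    """
    hits = []
    for json_path in sorted(root.rglob("*.json")):
        if _is_uuid(json_path.stem) and json_path.with_suffix(".mp4").exists():
            hits.append(json_path)
    return hits


def remove_sidecars(root: Path, dry_run: bool = True) -> int:
    """Delete redundant sidecars under root.

    Returns the number of files that were (or, in dry-run mode, would be)
    removed. Dry-run is the default so nothing is deleted by accident.
    """
    sidecars = find_redundant_sidecars(root)
    if not dry_run:
        for path in sidecars:
            path.unlink()
    return len(sidecars)
```

Calling `remove_sidecars(Path(output_dir))` first reports the count without touching anything. Passing `dry_run=False` performs the deletion.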
@@ -343,7 +343,7 @@ clean it up over time:
 | `work/consolidate_facesets.py` | Merge duplicate identities (centroid cosine sim ≥ 0.55 with confident ≥ 0.65, **complete-linkage** to defeat single-link chaining). Pulls embeddings from cache, no GPU. See `docs/analysis/identity-consolidation-and-age-extend.md`. |
 | `work/age_extend_001.py` | Slot newly-added PNGs into existing era buckets of `faceset_001` (anchor cosine distance ≤ 0.40 AND `|year_delta|` ≤ 5). Same anchor-fragment rule as `age_split_001.py`. |
 | `work/dedup_optimize.py` (+ Windows `work/multiface_worker.py`) | (a) cross-family SHA256 byte-dedup, (b) within-faceset near-dup at cosine sim ≥ 0.95, (c) multi-face audit (re-detect via insightface, drop PNGs with face_count ≠ 1). Multi-face is the load-bearing roop invariant. See `docs/analysis/dedup-and-roop-optimization.md`. |
-| `work/video_target_pipeline.py` (+ Windows `work/video_face_worker.py` + `work/run_video_pipeline.sh` chain) | Target-side preprocessing: scan a folder of videos → PySceneDetect shot-cuts → 2 fps frame sampling → DML face detection + embedding → IoU+embedding tracking → quality-gated segments (yaw≤75°, face≥80px, det≥0.5, ≥70% pass-rate, 1–120s duration, 2s cross-track merge gap) → ffmpeg stream-copy into UUID-named clips with sidecar JSON. Output organized into per-source subfolders. See `docs/analysis/video-target-preprocessing.md`. |
+| `work/video_target_pipeline.py` (+ Windows `work/video_face_worker.py` + `work/run_video_pipeline.sh` chain) | Target-side preprocessing: scan a folder of videos → PySceneDetect shot-cuts → 2 fps frame sampling → DML face detection + embedding → IoU+embedding tracking → quality-gated segments (yaw≤75°, face≥80px, det≥0.5, ≥70% pass-rate, 1–120s duration, 2s cross-track merge gap) → ffmpeg stream-copy into UUID-named clips. Output organized into per-source subfolders. Provenance sidecars are opt-in (`cut --write-sidecar` or `SIDECAR=yes` env var); the full plan is always retained in the per-batch `plan.json`. See `docs/analysis/video-target-preprocessing.md`. |
 
 All four operate idempotently and reversibly: dropped PNGs go to
 `<faceset>/faces/_dropped/`, quarantined whole facesets go to
@@ -38,10 +38,11 @@ run_video_pipeline.sh (chain driver)
 └─ report (HTML preview)
 
 Output: <output_dir>/<source_video_stem>/<uuid>.mp4
-/<uuid>.json (sidecar)
+/<uuid>.json (sidecar; opt-in via
+--write-sidecar)
 ```
 
-`run_video_pipeline.sh` is parameterized via env vars (`WORK`, `INPUT_DIR`, `OUTPUT_DIR`, `FILTER_FROM`, `SKIP_PATTERN`, `MAX_DUR`, `IDENTITY`) so you can pin a particular batch without editing the script.
+`run_video_pipeline.sh` is parameterized via env vars (`WORK`, `INPUT_DIR`, `OUTPUT_DIR`, `FILTER_FROM`, `SKIP_PATTERN`, `MAX_DUR`, `IDENTITY`, `SIDECAR`) so you can pin a particular batch without editing the script. Sidecars are off by default — the per-batch `plan.json` always carries the full provenance for every clip; the `<uuid>.json` files alongside the clips are redundant and only useful if you need each clip to be self-describing in isolation.
 
 ## 3. Quality signals (matched to inswapper_128's working envelope)
 
@@ -15,6 +15,7 @@
 # SKIP_PATTERN regex of basenames to exclude (Python `re` syntax). Applied AFTER FILTER_FROM.
 # MAX_DUR score --max-dur (default 120)
 # IDENTITY "yes" to enable identity tagging; default "no"
+# SIDECAR "yes" to emit <uuid>.json provenance sidecars; default "no"
 
 set -e
 
@@ -23,6 +24,7 @@ set -e
 : ${OUTPUT_DIR:=/mnt/x/src/vd/ct}
 : ${MAX_DUR:=120}
 : ${IDENTITY:=no}
+: ${SIDECAR:=no}
 
 mkdir -p "$WORK" "$WORK/scenes"
 
@@ -37,7 +39,7 @@ log() { echo "[$(ts)] [$PHASE] $*"; }
 
 PHASE="setup"
 log "STARTED — host=$(hostname) pid=$$ work=$WORK"
-log "config: input=$INPUT_DIR output=$OUTPUT_DIR filter_from=${FILTER_FROM:-<none>} skip_pattern=${SKIP_PATTERN:-<none>} max_dur=$MAX_DUR identity=$IDENTITY"
+log "config: input=$INPUT_DIR output=$OUTPUT_DIR filter_from=${FILTER_FROM:-<none>} skip_pattern=${SKIP_PATTERN:-<none>} max_dur=$MAX_DUR identity=$IDENTITY sidecar=$SIDECAR"
 
 PHASE="inventory"
 log "building subset inventory"
@@ -110,7 +112,9 @@ log "done in $(($(date +%s)-T0))s"
 PHASE="cut"
 log "ffmpeg stream-copy into per-source subfolders (no --clean)"
 T0=$(date +%s)
-$PY_WSL $PIPELINE cut --plan "$WORK/plan.json" --output-dir "$OUTPUT_DIR"
+SIDECAR_FLAG=""
+if [ "$SIDECAR" = "yes" ]; then SIDECAR_FLAG="--write-sidecar"; fi
+$PY_WSL $PIPELINE cut --plan "$WORK/plan.json" --output-dir "$OUTPUT_DIR" $SIDECAR_FLAG
 log "done in $(($(date +%s)-T0))s"
 
 PHASE="report"
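The `SIDECAR` handling in the shell driver maps an environment variable onto an optional CLI flag: only the exact value `yes` adds `--write-sidecar`, and an unset or empty variable falls back to `no`. Below is a sketch of the same mapping in Python, for readers who drive the `cut` step from a script instead. The function name `build_cut_command` and the `python` / `video_target_pipeline.py` placeholders are assumptions standing in for `$PY_WSL` and `$PIPELINE`, whose values are not shown in this diff.

```python
import os
from typing import Mapping, Optional


def build_cut_command(
    plan: str,
    output_dir: str,
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Build the argv for the `cut` step, adding --write-sidecar on request.

    Mirrors `: ${SIDECAR:=no}` in the shell script: an unset *or empty*
    SIDECAR is treated as "no", and only the literal "yes" enables sidecars.
    """
    if env is None:
        env = os.environ
    sidecar = env.get("SIDECAR") or "no"

    # Placeholder interpreter and script path (assumptions, see lead-in).
    cmd = [
        "python", "video_target_pipeline.py", "cut",
        "--plan", plan,
        "--output-dir", output_dir,
    ]
    if sidecar == "yes":
        cmd.append("--write-sidecar")
    return cmd
```

Building the command as a list sidesteps the shell idiom of expanding an unquoted, possibly empty `$SIDECAR_FLAG`: the flag is either appended or absent, so no empty argument can reach the parser.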
@@ -722,7 +722,7 @@ def cmd_cut(args):
         if out_video.exists() and out_video.stat().st_size < 1024:
             out_video.unlink()
             continue
-        # sidecar (alongside the clip in the source-named subfolder)
+        if args.write_sidecar:
             sidecar = seg_dir / f"{seg['uuid']}.json"
             sidecar.write_text(json.dumps({
                 "uuid": seg["uuid"],
@@ -901,6 +901,8 @@ def main():
     cu.add_argument("--force", action="store_true")
     cu.add_argument("--clean", action="store_true",
                     help="remove prior UUID-named clips before cutting (preserves non-UUID files)")
+    cu.add_argument("--write-sidecar", action="store_true",
+                    help="emit <uuid>.json provenance sidecar alongside each clip (default off)")
     cu.set_defaults(func=cmd_cut)
 
     rp = sub.add_parser("report")
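The flag is default-off because of `action="store_true"`: argparse stores `False` when the option is absent and exposes it as `args.write_sidecar`, with the dash converted to an underscore. The standalone sketch below shows just that behaviour. The surrounding parser layout, and marking `--plan` and `--output-dir` as required, are assumptions for illustration rather than the pipeline's real `main()`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Minimal stand-in for the pipeline's CLI with only the `cut` subcommand."""
    parser = argparse.ArgumentParser(prog="video_target_pipeline.py")
    sub = parser.add_subparsers(dest="command", required=True)

    cu = sub.add_parser("cut")
    cu.add_argument("--plan", required=True)        # required-ness is assumed
    cu.add_argument("--output-dir", required=True)  # required-ness is assumed
    # store_true: the attribute is False unless the flag is given.
    cu.add_argument("--write-sidecar", action="store_true",
                    help="emit <uuid>.json provenance sidecar alongside each clip (default off)")
    return parser


parser = build_parser()
default_args = parser.parse_args(["cut", "--plan", "plan.json", "--output-dir", "out"])
opt_in_args = parser.parse_args(
    ["cut", "--plan", "plan.json", "--output-dir", "out", "--write-sidecar"]
)
```

Here `default_args.write_sidecar` is `False` and `opt_in_args.write_sidecar` is `True`. This is the value that the `if args.write_sidecar:` guard in `cmd_cut` checks before writing the JSON.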