Compare commits

...

12 Commits

Author SHA1 Message Date
Peter 308597ebf0 Update video preprocessing doc with full-corpus results
After completing the rest-of-corpus run, update docs/analysis to reflect
the final numbers across all three batches (test + 13-file + 45-file)
and surface the numerical lessons:
- 1,984 segments / 10.78h accepted content from 19.76h / 61 input videos
- 0 worker errors across 143,137 sampled frames
- rest batch sustained 15.78 fps from a fresh JSONL start (vs 7.5 fps for
  the migrated batch), confirming the append-only fix is the right
  steady-state design
- skip-pattern note: 5-digit basename numbers need full padding
  (0005[0-9] not 005[0-9]) — bit me on the first relaunch
- documented SIDECAR=yes opt-in for the chain script

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 16:47:59 +02:00
Peter 7960dec350 Make per-clip sidecar JSONs opt-in (default off)
Previously every video_target_pipeline cut wrote a <uuid>.json provenance
sidecar alongside each <uuid>.mp4. The same provenance is already in the
per-batch plan.json, so the per-clip sidecars are redundant unless a
downstream tool wants each clip self-describing in isolation.

- video_target_pipeline.py cut: new --write-sidecar flag, default off.
- run_video_pipeline.sh: new SIDECAR env var (default "no"), passes
  --write-sidecar when SIDECAR=yes.
- README + docs/analysis/video-target-preprocessing.md updated.

The 1,984 already-emitted sidecars in /mnt/x/src/vd/ct/ct_src_*/ have
been deleted (1.5 MB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 12:44:27 +02:00
Peter 998fa79f81 Add target-side video preprocessing pipeline
Preprocesses a folder of video files into UUID-named clips suitable as
target inputs for roop-unleashed-style face-swap. Counterpart to the
faceset (source-side) tooling.

work/video_target_pipeline.py — orchestration with subcommands
  scan / scenes / stage / merge / track / score / cut / report. Quality
  gates default to face-sets-can-handle-side-profile values (yaw<=75°,
  pitch<=45°, face_short>=80px, det>=0.5). Cross-track segment merge
  fuses adjacent-in-time tracks within the same scene up to 2s gap.
  Output organized into <output_dir>/<source_stem>/<uuid>.mp4 +
  <uuid>.json sidecar with full provenance.

work/video_face_worker.py — Windows DML face detect+embed worker. Uses
  JSONL append-only for results.jsonl: a critical perf fix (re-
  serializing the monolithic 245MB results.json on every flush was the
  dominant cost in the first attempt, dropping throughput to 0.5 fps).
  Append-only got it to 13+ fps, ~7.5 fps cumulative across the first
  6.18h batch. Also uses seek-once-per-video + sequential cap.grab()
  between samples to dodge cv2 per-sample seek pathology on long H.264.
  Legacy results.json is auto-migrated to .jsonl on first load.

work/run_video_pipeline.sh — generic chain driver, parameterized via
  WORK / INPUT_DIR / OUTPUT_DIR / FILTER_FROM / SKIP_PATTERN / MAX_DUR /
  IDENTITY env vars. work/status_video_pipeline.sh — generic status
  helper.

First production batch (ct_src_00050..00062, 13 files, 6.18h input):
600 emitted segments, 239.5min accepted content (64.6% of input), 254
segments built from >=2 tracks (cross-track merge), 1h43m wall clock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:38:50 +02:00
Peter 49a43c7685 Add post-export corpus maintenance pipeline
Adds four new orchestration scripts that operate on an already-built
facesets_swap_ready/ to clean it up over time:

- filter_occlusions.py + clip_worker.py: CLIP zero-shot mask + sunglasses
  filter (open_clip ViT-L-14/dfn2b_s39b). WSL stages, Windows DML scores
  via new C:\clip_dml_venv. Image-level threshold 0.7; faceset-level
  quarantine at 40% domain dominance.

- consolidate_facesets.py: duplicate-identity merger using complete-linkage
  centroid clustering on cached arcface embeddings. Single-linkage chains
  catastrophically (60-faceset clusters with min sim < 0); complete-linkage
  guarantees within-group sim >= edge.

- age_extend_001.py: slots newly-added PNGs into existing era buckets of
  faceset_001 using the same anchor-fragment rule as age_split_001.py
  (dist <= 0.40 AND |year_delta| <= 5). Anchors not re-centered.

- dedup_optimize.py + multiface_worker.py: corpus-wide cleanup with three
  passes — cross-family SHA256 byte-dedup (preserves intra-family era
  duplication), within-faceset near-dup at sim >= 0.95, and a multi-face
  audit (the load-bearing roop invariant). Multi-face worker hits ~19 img/s
  on AMD Vega — ~7x embed_worker because input is 512x512 crops.

Same-day corpus evolution: 311 active / 0 masked / 68 thin / 0 merged →
181 / 51 / 71 / 74; 6,440 → 3,849 active PNGs. All quarantines and prunes
preserved on disk (faces/_dropped/, _masked/, _merged/, _thin/) for full
reversibility. Master manifest gains masked[], merged[], plus per-run
provenance blocks.

Three new docs/analysis/ writeups cover model choice, threshold rationale,
and per-pass run results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:41:18 +02:00
Peter e66c97fd58 Document Immich nic run: 95 new facesets, manifest 216 -> 311
Overnight 2026-04-27 nic finalize completed. Per-user API key worked as
expected. The pipeline survived one mid-stage Immich outage via the
circuit breaker added in 62dba3d -- script paused, operator confirmed
connectivity, same command resumed from saved state.json.

Embed (Windows DML): 7,834 images -> 15,627 face records + 1 noface in
59 minutes (2.2 img/s end-to-end).

Cluster: 6,770 of 15,627 faces (43%) matched existing canonical
identities at cos-dist <= 0.45; biggest hits faceset_002 (+3,261),
faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408). The
faceset_008 and faceset_007 hits are noteworthy cross-matches: those
are hand-sorted "sab" and "s" identities, recurring frequently in nic's
library.

Of the 8,857 unmatched faces, 3,787 raw clusters at threshold 0.55,
129 surviving refine gates, 95 emitted as new facesets at faceset_265+.

Top-level facesets_swap_ready/manifest.json: 216 -> 311 substantive
facesets + 68 thin_eras unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 00:32:11 +02:00
Peter 62dba3ddb3 Add Immich outage circuit breaker; document nic run + Tailscale quirk
work/immich_stage.py:
- Startup probe of /server/version (exit 2 if unreachable).
- Outage circuit breaker: after OUTAGE_FAIL_STREAK=12 consecutive
  faces_error/download_error results, run a quick probe; if the probe
  also fails, persist state and exit with code 2 so a long unattended
  run can pause rather than silently churning through tens of thousands
  of retries during an upstream outage. Resume by re-running the same
  command -- state.json + queue.json are intact.

README:
- Document the nic run (per-user API key necessary; second pipeline
  invocation confirmed expected behavior; cleaner library than peter's
  with 0 internal byte-dupes vs 2,976).
- Mention the circuit breaker as the mechanism that keeps long
  unattended runs safe under the known Tailscale flicker pattern at
  this site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:36:11 +02:00
Peter 321fed01cc Add Immich import pipeline (WSL stage + Windows DML embed + cluster)
Three-piece workflow that imports a self-hosted Immich library and emits
new facesets without disturbing existing identity numbering:

- work/immich_stage.py (WSL): pages /search/metadata, parallel-fetches
  /faces?id= per asset, prefilters by face_short>=90 against bbox scaled
  to original-image coords, downloads originals, sha256-dedups against
  nl_full.npz and same-run staged files. 8-worker ThreadPoolExecutor
  doing the full /faces->filter->/original chain per asset; resumable
  via state.json. API URL + key come from IMMICH_URL / IMMICH_API_KEY
  env vars, label->UUID map from work/immich/users.json (gitignored).
- work/embed_worker.py (Windows venv at C:\face_embed_venv): runs
  insightface.FaceAnalysis(buffalo_l) with the DmlExecutionProvider on
  AMD Radeon Vega via onnxruntime-directml. Produces a cache file in
  the same .npz schema as sort_faces.cmd_embed (loadable via
  load_cache). ~7.5x speedup over CPU end-to-end; embeddings bit-
  identical to CPU (cosine similarity 1.0000 across 8 sample faces).
- work/cluster_immich.py (WSL): mirrors cluster_osrc.py against an
  immich_<user>.npz. Builds existing identity centroids from canonical
  faceset_NNN/ in facesets_swap_ready/, drops matches at <=0.45,
  clusters the rest at 0.55, applies refine gates, hands off to
  cmd_export_swap. Numbers new facesets past the existing maximum.
- work/finalize_immich.sh: chains queue->Windows embed->cache copy->
  cluster_immich, with logging.

The 2026-04-26 run on https://fotos.computerliebe.org (Immich v2.7.2)
processed 53,842 admin-accessible assets, staged 10,261, embedded
19,462 face records on Vega DML in 64.6 min, matched 8,103 (42%) to
existing identities, and emitted 185 new facesets (faceset_026..264
with gaps). facesets_swap_ready/ went from 31 to 216 substantive
facesets.

Important caveat surfaced: /search/metadata's userIds filter is
silently ignored when the API key is bound to a different user, so
this run can't enumerate other users' libraries from the admin key.
A per-user API key would be required for nic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 18:14:26 +02:00
Peter 7ecbfae981 Add osrc identity-discovery pipeline + run analysis
work/cluster_osrc.py mirrors build_folders.py's shape (synthesize a
refine_manifest, hand off to cmd_export_swap, relocate, merge top-level
manifest) but discovers identities by clustering rather than asserting
them by folder. Drops faces already covered by existing identity
centroids, clusters the rest at 0.55, applies refine-equivalent gates
with min_faces=6, numbers new facesets past the existing maximum so
faceset_001..NNN are never disturbed.

The 2026-04-26 run on /mnt/x/src/osrc produced faceset_020..025 (sizes
4-26 exported PNGs); analysis writeup in docs/analysis/.

README also notes the refine-renumbers caveat in passing — extend +
orchestration script is the safe pattern; cmd_refine is for fresh
clusters only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:40:19 +02:00
Peter 1d82d71e68 Force-track work/build_folders.py
The README documents work/build_folders.py as the orchestration script
for hand-sorted-folder identity import, but it was excluded by the
work/ gitignore. Force-track it for parity with the other orchestration
scripts (age_split_001.py, check_faceset001_age.py) so the documented
workflow points at code that exists in the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:13:56 +02:00
Peter e48dd8aec7 Add age-split run analysis for faceset_001
Documents the 2026-04-26 split of faceset_001 (707 curated faces) into
6 substantive era buckets + 68 thin fragments, including the readiness
probe evidence, the anchor-based assignment rationale (replaces
transitive union-find that caused year-drift), and the re-run / apply-
to-other-identity workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:10:37 +02:00
Peter 03a0c75531 Document hand-sorted-folder import + age-split workflow
- README: document work/build_folders.py (hand-sorted folder identities)
  and the new age-split workflow for splitting a long-running identity
  into era-specific facesets after clustering.
- Force-track work/age_split_001.py and work/check_faceset001_age.py;
  these are the worked example + readiness probe for faceset_001 and
  the template for splitting any other identity by EXIF era.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:08:25 +02:00
Peter 4d7a8780de Document enrich + export-swap + extend; add swap-ready usage guide
README.md now covers all six subcommands (embed, cluster, refine, dedup,
extend, enrich, export-swap), an end-to-end pipeline recipe, the delta
recipe for merging a new source into an existing result, the quality-
weight formula used by export-swap, and the GFPGAN blend recommendation
at swap time (0.85, overriding roop-unleashed's 0.65 default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:09:01 +02:00
26 changed files with 8005 additions and 36 deletions
+372 -28
@@ -1,56 +1,400 @@
# face-sets
Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).
## Pipeline
`sort_faces.py` is a single-file CLI with six subcommands:
| step | what it does |
|-------------|-------------------------------------------------------------------------------------------------------------|
| embed | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup. |
| cluster | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest. |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`. |
| dedup | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`. |
| extend | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich | Re-detect each cached face to persist landmark_2d_106, landmark_3d_68, pose (pitch/yaw/roll) into cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |
### Design principles
- **embed is resumable and incremental.** It loads any existing cache at the target path and only hashes / embeds files it has not seen. Atomic flush every 50 new files, so a mid-run crash loses at most ~50 embeddings.
- **Byte-identical duplicates are sha256-grouped at listing time.** The canonical file is embedded once; other paths with the same hash become `path_aliases` in the cache. Every alias is materialized by `cluster` / `refine` / `export-swap`, so each on-disk location is represented.
- **`safe_dst_name` always flattens the absolute path.** This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- **Caches and outputs stay out of git** via `.gitignore`; defaults live under `work/`.
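A minimal sketch of the listing-time sha256 grouping (helper names here are hypothetical; the real cache layout lives in `sort_faces.py`):
```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def group_by_hash(paths):
    """First path per hash is canonical (embedded once);
    later paths with the same hash become path_aliases entries."""
    canonical, aliases = {}, {}   # hash -> path, canonical path -> [alias paths]
    for p in sorted(paths):
        digest = sha256_file(p)
        if digest in canonical:
            aliases.setdefault(str(canonical[digest]), []).append(str(p))
        else:
            canonical[digest] = p
    return canonical, aliases
```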
## Typical end-to-end run
```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted
# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"
# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"
# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"
# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"
# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"
# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
"$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
--raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
### Merging a new source into an existing result
```bash
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"
# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"
# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
"$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
--raw-manifest "$OUT/raw_full/manifest.json" --candidates
```
### Importing hand-sorted folders as identities
When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool — the identity is asserted, not inferred. The
orchestration script `work/build_folders.py` covers this case:
- For each trusted folder, it filters cache records that fall under it, builds an
identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so
bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) into every
identity centroid within a tight cosine cutoff (default 0.45). A multi-identity
photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures
each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the
emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and
merges the new entries into the canonical `facesets_swap_ready/manifest.json`
(existing facesets are left untouched).
```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done
# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"
# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```
The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.)
is the only thing to edit when adding more hand-sorted folders later.
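For reference, the two-pass outlier-rejection centroid reduces to a few lines of NumPy (a sketch using the 0.55 → 0.45 defaults quoted above; function names are illustrative, not the script's actual internals):
```python
import numpy as np

def identity_centroid(embs: np.ndarray, pass1: float = 0.55, pass2: float = 0.45) -> np.ndarray:
    """embs: (N, 512) L2-normalized arcface embeddings from one trusted folder.
    Two passes of cos-dist outlier rejection drop bystanders in group photos."""
    def tighten(e: np.ndarray, thresh: float) -> np.ndarray:
        c = e.mean(axis=0)
        c /= np.linalg.norm(c)
        dist = 1.0 - e @ c          # cosine distance to the current centroid
        return e[dist <= thresh]
    kept = tighten(embs, pass1)     # pass 1: loose cut
    kept = tighten(kept, pass2)     # pass 2: tight cut around the re-centered mean
    c = kept.mean(axis=0)
    return c / np.linalg.norm(c)
```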
### Splitting an identity by era (age sub-clustering)
Long-running source corpora produce identities that span 10+ years. The 2009 face
and the 2024 face of the same person sit in the same cluster (correctly — same
identity), but a single averaged embedding pulled from that cluster blurs across
ages. For face-swap output that should target a specific period, the identity
needs to be split by era *after* the identity is established.
`work/age_split_001.py` is a worked example for `faceset_001` and a template for
any other identity. The pipeline is:
- **Probe first** with `work/check_faceset001_age.py` — report intra-cluster
pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and
EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with
distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/`
(manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original
source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one
re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance
agglomerative, average linkage).
- **Anchor-based fragment assignment** (not transitive merge — that caused
year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments
attach to the single nearest anchor only if both the centroid distance ≤ 0.40
AND the dominant EXIF year is within ±5 years. Fragments with no qualifying
anchor remain standalone (and end up THIN-tagged downstream).
- **EXIF year per source path** with on-disk caching at
`work/cache/age_split_exif.json` — the Windows-mount EXIF read is the
slowest step, so re-runs after a parameter tweak are nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face
square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`,
human-readable `<era>.txt` marker. Eras with < 20 face records also drop a
`THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to
`facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be
moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`),
leaving only the substantive era buckets at the top level.
```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py
# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```
For the `faceset_001` run on the 5,260-face `nl_full.npz`, this produced 6 substantive
era buckets (2005-10, 2010-13, 2011, 2014-17, 2018-19, 2018-20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under `_thin/`.
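A condensed sketch of the anchor-based fragment assignment described above (data shapes are assumed; the real logic lives in `work/age_split_001.py`):
```python
import numpy as np

ANCHOR_MIN_SIZE, CENT_MAX, YEAR_MAX = 20, 0.40, 5

def assign_fragments(subs):
    """subs: list of dicts with 'centroid' (unit vector), 'size', 'dom_year'.
    Anchors never merge with each other; a fragment attaches to at most one anchor."""
    anchors = [s for s in subs if s["size"] >= ANCHOR_MIN_SIZE]
    fragments = [s for s in subs if s["size"] < ANCHOR_MIN_SIZE]
    assignment = {}   # fragment index -> anchor index, or None = standalone (THIN)
    for i, f in enumerate(fragments):
        dists = [1.0 - f["centroid"] @ a["centroid"] for a in anchors]
        j = int(np.argmin(dists)) if anchors else None
        ok = (
            j is not None
            and dists[j] <= CENT_MAX
            and f["dom_year"] is not None
            and anchors[j]["dom_year"] is not None
            and abs(f["dom_year"] - anchors[j]["dom_year"]) <= YEAR_MAX
        )
        assignment[i] = j if ok else None
    return anchors, fragments, assignment
```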
### Discovering new identities in a mixed bucket
A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the
hand-sorted case: identities have to be discovered, not asserted, but should
not collide with already-known identities or scramble their numbering.
`work/cluster_osrc.py` is the worked example. The pipeline:
- **Filter cache to the source root**, including any byte-aliased path that
resolves under it.
- **Drop already-covered faces** by comparing each candidate to the centroids
of the existing canonical facesets at the `EXISTING_MATCH_THRESHOLD`
(default 0.45 — same cutoff as `build_folders.py`'s osrc routing). These
faces are already routed by `extend` / `build_folders.py` and shouldn't
seed new facesets.
- **Cluster the unmatched** at cos-dist 0.55 (matches the `extend` default
for the new-cluster phase).
- **Apply `refine`-equivalent gates** per cluster: `face_short`, `blur`,
`det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for
clusters of size ≥ 4. Keep clusters whose surviving unique-source-path
count is ≥ `MIN_FACES`.
- **Number new facesets past the existing maximum** (`START_NNN`), so
`faceset_001..NNN` are never disturbed.
- **Synthesize a refine manifest** and run `cmd_export_swap` against it,
then move the resulting dirs into `facesets_swap_ready/` and append to the
top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance
marker.
Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new
source — the `cluster_osrc.py` step then operates against the canonical
cache and doesn't need `raw_full/` for input:
```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
# person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
--refine-out "$OUT/facesets_full"
# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
# without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run
# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```
For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by
existing identities), this produced 6 new facesets (`faceset_020..025`,
sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to
export-swap's tighter `min_face_short=100` gate).
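The drop-already-covered step is a nearest-centroid test; a minimal NumPy sketch (names hypothetical):
```python
import numpy as np

def split_covered(cands: np.ndarray, centroids: np.ndarray, thresh: float = 0.45):
    """cands: (N, 512), centroids: (K, 512), both L2-normalized.
    A candidate within cos-dist `thresh` of ANY existing centroid is already
    covered by extend / build_folders.py and must not seed a new faceset."""
    dist = 1.0 - cands @ centroids.T        # (N, K) cosine distances
    covered = dist.min(axis=1) <= thresh
    return cands[~covered], covered         # only unmatched faces go to clustering
```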
### Importing identities from a self-hosted Immich library
`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py`
together import an Immich library at scale, with the embed step running on
a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:
1. **`work/immich_stage.py` (WSL)** — pages every IMAGE asset via
`/search/metadata`, fetches each asset's `/faces?id=` to read Immich's
own ML-driven bboxes, scales each bbox to original-image coordinates,
and prefilters by `face_short ≥ 90`. For survivors it downloads the
original, sha256-deduplicates against the canonical `nl_full.npz` and
against same-run staged files, and saves to
`/mnt/x/src/immich/<user>/<rel>`. Writes a `queue.json` that the embed
worker consumes. 8 concurrent worker threads run the full per-asset
I/O chain (`/faces` → filter → `/original`) so 8 workers ≈ 8× the
serial throughput.
2. **`work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`)** —
loads `insightface.FaceAnalysis(buffalo_l)` with the
`DmlExecutionProvider` and runs detection + landmarks + recognition
   over the queue. Produces a `.npz` cache whose schema is identical to
   what `sort_faces.py:cmd_embed` writes, so the result is
directly loadable by `load_cache()`. The cache already includes the
post-`enrich` fields (`landmark_2d_106`, `landmark_3d_68`, `pose`)
because FaceAnalysis returns them for free. AMD Vega gives ~7.5×
real-pipeline speedup over CPU.
3. **`work/cluster_immich.py` (WSL)** — mirrors `cluster_osrc.py`'s
shape but reads from `immich_<user>.npz`. Builds existing-identity
centroids from every canonical `faceset_NNN/` in
`facesets_swap_ready/` (skipping era splits and `_thin/`), drops
immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55,
applies refine gates, numbers new facesets past the existing maximum,
and feeds `cmd_export_swap` via a synthetic manifest.
`work/finalize_immich.sh <user>` chains queue → Windows embed → cache
copy back → cluster_immich, with logging.
The Immich admin API key + base URL come from environment variables:
```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=... # admin or per-user key
python work/immich_stage.py --user peter --workers 8
bash work/finalize_immich.sh peter
```
For the 2026-04-26 run against `https://fotos.computerliebe.org` (Immich
v2.7.2), with the admin API key:
| step | result |
|------|------|
| stage | 53,842 assets seen, **10,261 staged** (~10 GB), 978 byte-deduped against `nl_full.npz`, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
| Windows DML embed | 19,462 face records + 1 noface in **64.6 min** (2.6 img/s end-to-end) |
| matched existing identities | **8,103 of 19,480 (42%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → **185 emitted** as `faceset_026..264` (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |
A second 2026-04-26 run with **nic's per-user API key** confirmed the
expected behavior: 25,777 of nic's IMAGE assets were enumerated (matching
her `/server/statistics` count of 25,786, off by 9 ≈ the transient errors
that didn't get marked seen), **7,834 staged** (30% face-bearing-with-big-face,
denser than peter's 19%), 519 byte-deduped vs `nl_full.npz`, **0 internal
byte-duplicates** (cleaner library than peter's 2,976), 54 transient errors.
Embed + cluster on the nic queue:
| step | result |
|------|------|
| Windows DML embed | 15,627 face records + 1 noface in **59 min** (2.2 img/s end-to-end), 7 load errors |
| matched existing identities | **6,770 of 15,627 (43%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+3,261), faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408) |
| new clusters | 3,787 at threshold 0.55 → 129 surviving refine gates → **95 emitted** as `faceset_265..NNN` (gaps where export-swap's 0.45 outlier dropped clusters below the export bar) |
Top-level `facesets_swap_ready/manifest.json` after both Immich runs:
**311 substantive facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted +
6 era splits + 6 osrc-discovered + 185 peter-Immich + 95 nic-Immich) +
68 thin_eras under `_thin/`.
`work/immich_stage.py` carries a built-in **outage circuit breaker**:
after 12 consecutive HTTP errors it probes Immich; if that probe also
fails, the script exits cleanly with code 2, state preserved. This made
the nic run survive a mid-stage Immich outage — the script paused, the
operator confirmed connectivity was back, and the same command resumed
from the saved `state.json` without re-fetching what was already done.
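The breaker's control flow, roughly (a sketch; `process_one`, `probe_server`, and `save_state` stand in for the script's internals):
```python
import sys

OUTAGE_FAIL_STREAK = 12

def drain_queue(assets, process_one, probe_server, save_state):
    """process_one returns e.g. "ok" / "faces_error" / "download_error".
    Hypothetical names; the real loop lives in work/immich_stage.py."""
    streak = 0
    for asset in assets:
        result = process_one(asset)
        if result in ("faces_error", "download_error"):
            streak += 1
            if streak >= OUTAGE_FAIL_STREAK:
                if not probe_server():    # quick /server/version probe
                    save_state()          # state.json + queue.json stay intact
                    sys.exit(2)           # pause; re-running the same command resumes
                streak = 0                # server is up: failures were asset-local
        else:
            streak = 0
```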
**Important caveats for Immich v2.7.2**:
- The `userIds` filter on `/search/metadata` is **silently ignored** when
the API key is bound to a different user. The "import everything the
API key can see" semantics are what you actually get; cross-user
isolation is enforced server-side.
- `/server/statistics` reports counts that under-count what
`/search/metadata` actually returns (e.g. external library
thumbnail-dirs that got indexed because the import path included them).
Don't trust the statistics number as a denominator.
- A meaningful fraction of `originalPath`-based assets are *Immich's own
thumbnails* (`<library_root>/thumbs/.../-preview.jpeg`) — included if
the external library's import path covers the thumbs directory and the
exclusion patterns don't list `**/thumbs/**`. For our run, 5,563 of
10,261 staged were thumbnails. They embed and cluster fine but the
resulting faces are lower-resolution.
## Key defaults
`refine`:
| flag | default | meaning |
|-------------------------|--------:|---------|
| `--initial-threshold` | 0.55 | cosine distance for stage-1 clustering |
| `--merge-threshold` | 0.40 | centroid-level merge of over-split clusters |
| `--outlier-threshold` | 0.55 | drop face if cosine dist from cluster centroid exceeds this (only if cluster ≥ 4) |
| `--min-faces` | 15 | minimum unique images per faceset |
| `--min-short` | 90 | minimum short-edge pixels of face bbox |
| `--min-blur` | 40.0 | Laplacian-variance blur gate |
| `--min-det-score` | 0.6 | InsightFace detector score gate |
| `--mode` | copy | copy / move / symlink |
`export-swap`:
| flag | default | meaning |
|-------------------------------|--------:|---------|
| `--top-n` | 30 | size of the `<faceset>_topN.fsz` bundle |
| `--outlier-threshold` | 0.45 | tighter than refine; trims cluster boundary for averaging |
| `--pad-ratio` | 0.5 | padding around face bbox for PNG crop |
| `--out-size` | 512 | PNG output is square `out_size × out_size` |
| `--min-face-short` | 100 | export gate; stricter than refine's 90 |
| `--candidates` | off | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` | 0.55 | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score` | 0.40 | composite-quality floor for candidates |
The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each normalized to `[0, 1]`.
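Transcribed directly (a trivial sketch; each component is assumed pre-normalized to `[0, 1]` upstream):
```python
def composite_quality(frontality: float, det_score: float,
                      landmark_symmetry: float, face_size: float,
                      sharpness: float) -> float:
    """export-swap's ranking score; every input already normalized to [0, 1]."""
    return (0.30 * frontality
            + 0.20 * det_score
            + 0.20 * landmark_symmetry
            + 0.15 * face_size
            + 0.15 * sharpness)
```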
## Post-export corpus maintenance
The `sort_faces.py` pipeline above produces `facesets_swap_ready/`. Four
orchestration scripts under `work/` operate on that already-built corpus to
clean it up over time:
| script | purpose |
|--------|---------|
| `work/filter_occlusions.py` (+ Windows `work/clip_worker.py`) | Drop PNGs of masked / sun-glassed faces using open_clip ViT-L-14/dfn2b_s39b zero-shot scoring. Image-level threshold 0.7; faceset-level quarantine at 40% domain dominance. WSL stages a queue, Windows DML scores, WSL applies. See `docs/analysis/clip-occlusion-filter.md`. |
| `work/consolidate_facesets.py` | Merge duplicate identities (centroid cosine sim ≥ 0.55 with confident ≥ 0.65, **complete-linkage** to defeat single-link chaining). Pulls embeddings from cache, no GPU. See `docs/analysis/identity-consolidation-and-age-extend.md`. |
| `work/age_extend_001.py` | Slot newly-added PNGs into existing era buckets of `faceset_001` (anchor cosine distance ≤ 0.40 AND `|year_delta|` ≤ 5). Same anchor-fragment rule as `age_split_001.py`. |
| `work/dedup_optimize.py` (+ Windows `work/multiface_worker.py`) | (a) cross-family SHA256 byte-dedup, (b) within-faceset near-dup at cosine sim ≥ 0.95, (c) multi-face audit (re-detect via insightface, drop PNGs with face_count ≠ 1). Multi-face is the load-bearing roop invariant. See `docs/analysis/dedup-and-roop-optimization.md`. |
| `work/video_target_pipeline.py` (+ Windows `work/video_face_worker.py` + `work/run_video_pipeline.sh` chain) | Target-side preprocessing: scan a folder of videos → PySceneDetect shot-cuts → 2 fps frame sampling → DML face detection + embedding → IoU+embedding tracking → quality-gated segments (yaw≤75°, face≥80px, det≥0.5, ≥70% pass-rate, 1–120s duration, 2s cross-track merge gap) → ffmpeg stream-copy into UUID-named clips. Output organized into per-source subfolders. Provenance sidecars are opt-in (`cut --write-sidecar` or `SIDECAR=yes` env var); the full plan is always retained in the per-batch `plan.json`. See `docs/analysis/video-target-preprocessing.md`. |
All four operate idempotently and reversibly: dropped PNGs go to
`<faceset>/faces/_dropped/`, quarantined whole facesets go to
`facesets_swap_ready/_masked/` or `_merged/` (parallel to the existing
`_thin/`). The master `manifest.json` partitions entries across `facesets[]`,
`masked[]`, `thin_eras[]`, and `merged[]` arrays, plus per-run provenance
blocks (`occlusion_filter_run`, `merge_run`, `age_extend_runs`, `dedup_runs`,
`multiface_runs`).
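One pattern worth pulling out of the video-pipeline row above is the worker's append-only `results.jsonl`: re-serializing a monolithic `results.json` on every flush is O(everything written so far), while JSONL append is O(new records). The pattern is small enough to show in full (a sketch under the obvious schema assumptions, not the worker's actual code):
```python
import json

def append_results(path: str, records: list[dict]) -> None:
    """One JSON object per line, appended: flush cost scales with NEW records
    only, instead of re-writing the whole results file each time."""
    with open(path, "a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def load_results(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```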
## Downstream: roop-unleashed
The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.
Highly recommended at swap time: enable **Select post-processing = GFPGAN** with the **Original/Enhanced image blend ratio = 0.85** (default is 0.65 which is conservative). See `docs/analysis/facesets-downstream-refinement-evaluation.md` for the full evaluation.
## Layout
```
/opt/face-sets/
├─ README.md (this file)
├─ sort_faces.py (the tool)
├─ docs/
│ └─ analysis/
│ └─ facesets-downstream-refinement-evaluation.md
└─ work/ (gitignored except force-tracked .py / .sh)
├─ build_folders.py (hand-sorted-folder orchestration)
├─ check_faceset001_age.py (age-split readiness probe)
├─ age_split_001.py (age-split orchestration; faceset_001)
├─ age_extend_001.py (extends existing era buckets with new PNGs)
├─ cluster_osrc.py (mixed-bucket identity discovery)
├─ immich_stage.py (Immich library staging, parallel)
├─ embed_worker.py (Windows DML embed worker; C:\face_embed_venv\)
├─ cluster_immich.py (Immich identity discovery + export)
├─ finalize_immich.sh (chains queue → embed → cluster)
├─ filter_occlusions.py (CLIP zero-shot mask + sunglasses filter)
├─ clip_worker.py (Windows DML CLIP worker; C:\clip_dml_venv\)
├─ consolidate_facesets.py (duplicate-identity merger; complete-linkage)
├─ dedup_optimize.py (byte + near-dup + multi-face audit driver)
├─ multiface_worker.py (Windows DML multi-face audit worker)
├─ video_target_pipeline.py (video → swappable segment cuts orchestration)
├─ video_face_worker.py (Windows DML per-frame face worker; JSONL append-only)
├─ run_video_pipeline.sh (generic chain driver: scenes → stage → worker → cut)
├─ status_video_pipeline.sh (status helper for any video_pipeline log)
├─ synthetic_*_manifest.json (per-run synthetic refine manifests)
├─ immich/
│ ├─ users.json (label -> userId map; gitignored)
│ └─ <user>/{queue,state,aliases}.json (per-user staging artifacts)
├─ cache/
│ ├─ nl_full.npz (canonical cache + duplicates.json)
│ ├─ immich_<user>.npz (per-user immich embeddings)
│ └─ age_split_exif.json (path → EXIF-year cache)
└─ logs/
└─ *.log (every long step writes here)
```
+119
@@ -0,0 +1,119 @@
# Age-splitting faceset_001 into era-specific facesets
_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records). Source: `work/age_split_001.py` and `work/check_faceset001_age.py`._
## 1. Why split
`faceset_001` aggregates a single identity across roughly 20 years of source
material. The averaged embedding consumed by roop-unleashed therefore mixes
features from very different ages. For face-swap output that should target a
specific period (e.g. "this person around 2011" or "this person around
201819"), the identity needs to be split *after* clustering — the cluster is
correctly one identity, but the averaged embedding is the problem.
## 2. Evidence the identity is age-sortable
`work/check_faceset001_age.py` probes `faceset_001` (707 curated faces).
**Pairwise cos-distance histogram** (249,571 pairs):
| range | pairs |
|-------------|------:|
| [0.0, 0.2) | 1,250 |
| [0.2, 0.3) | 11,277 |
| [0.3, 0.4) | 63,920 |
| [0.4, 0.5) | 92,555 |
| [0.5, 0.6) | 63,288 |
| [0.6, 0.7) | 16,048 |
| [0.7, 0.8) | 1,217 |
| [0.8, 1.0) | 16 |
Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse — wide
enough to admit non-trivial sub-structure without crossing the
inter-identity boundary (which sits well above 0.6 in this dataset).
**Sub-clusters at threshold 0.35** (precomputed cos-dist, average linkage):
156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24].
The top sub-clusters align with distinct EXIF year medians (2011, 2019,
2018, 2011, 2010), so the split is meaningful.
## 3. Pipeline
`work/age_split_001.py`:
1. **Seed centroid.** Load the 707 face keys from
`facesets_swap_ready/faceset_001/manifest.json`; resolve to cache rows;
normalize the mean embedding.
2. **Wide recovery.** Pull every face record under `/mnt/x/src/{nl,
lzbkp_red}` from the cache with cos-dist ≤ 0.55 from the seed. The seed
is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501
faces from 4,756 candidates.
3. **Quality gate** (mirrors export-swap defaults): `face_short ≥ 100`,
`blur ≥ 40.0`, `det_score ≥ 0.6`. Result: 892 → 856 after one
re-centroid + tighten pass at 0.50 to absorb the recovery without
drift.
4. **Sub-cluster** the survivors at cos-dist 0.35 (precomputed agglomerative,
average linkage). 223 raw sub-clusters; sizes top-10 = [127, 97, 55, 42,
40, 25, 17, 14, 13, 11].
5. **EXIF year per source path.** Read `DateTimeOriginal` once per unique
path; cache on disk at `work/cache/age_split_exif.json` so re-runs after
parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths
were dated.
6. **Anchor-based fragment assignment** (replaces transitive union-find merge
that caused observable year drift):
- sub-clusters with ≥ 20 faces are *anchors* (6 found: dom-years 2011,
2019, 2018, 2011, 2016, 2010);
- smaller fragments attach to the single nearest anchor *only if* both
    `cent_dist ≤ 0.40` AND `|dom_year_anchor − dom_year_fragment| ≤ 5`;
- anchors do not merge with each other (transitive merging produced
anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier
runs);
- fragments with no qualifying anchor remain standalone.
7. **Per-era export.** Composite-quality rank, single-face square PNG crops
(`pad_ratio=0.5`, `out_size=512`), top-N + `_all` `.fsz` bundles, per-era
`manifest.json`, `<label>.txt` marker, `THIN.txt` for buckets < 20 faces.
8. **Top-level manifest merge.** New entries are appended to
`facesets_swap_ready/manifest.json`. Operationally the THIN buckets are
then moved into `_thin/` and partitioned into a `thin_eras` array (with
`relpath: _thin/<name>`) so consumers reading `facesets` see only the
substantive entries.
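Step 5's cached EXIF read can be sketched as follows (assumes Pillow ≥ 8 for `getexif().get_ifd`; the cache file is the one named above):
```python
import json
from pathlib import Path
from PIL import Image

CACHE = Path("work/cache/age_split_exif.json")
DATETIME_ORIGINAL = 36867   # EXIF tag id, lives in the Exif SubIFD

def exif_year(path: str, cache: dict) -> int | None:
    if path in cache:                   # slow Windows-mount read happens once per path
        return cache[path]
    year = None
    try:
        ifd = Image.open(path).getexif().get_ifd(0x8769)   # Exif SubIFD
        dt = ifd.get(DATETIME_ORIGINAL)                    # e.g. "2011:07:04 14:02:33"
        if dt:
            year = int(str(dt)[:4])
    except Exception:
        pass                            # undated / unreadable paths stay None
    cache[path] = year
    return year

def save_cache(cache: dict) -> None:
    CACHE.write_text(json.dumps(cache))
```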
## 4. Result
74 era buckets emitted; 6 substantive + 68 thin/standalone fragments.
| era | faces | dom year(s) |
|-------------------|------:|-------------|
| `faceset_001_2010-13` | 282 | 2011 |
| `faceset_001_2018-20` | 129 | 2019 |
| `faceset_001_2014-17` | 125 | 2018 (anchor sub 15 dom_year=2018) |
| `faceset_001_2018-19` | 107 | 2018 |
| `faceset_001_2005-10` | 88 | 2010 |
| `faceset_001_2011` | 43 | 2011 |
Two distinct 2011 anchors and two 2018-area anchors persist by design —
embedding-space distance separated them despite year overlap. The era-label
collisions are disambiguated with `_v2` suffixes, but only when both anchors
landed on the *same* literal label string (none of the substantive six did).
The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic
embeddings; they are quarantined into `_thin/` rather than deleted because
some are legitimate edge poses / lighting / age extremes that may be useful
for narrow targeted swaps.
## 5. Re-running and applying to other identities
- **Re-run with different parameters**: just re-execute `age_split_001.py`.
Embeddings are loaded from cache, EXIF is loaded from
`age_split_exif.json`, and only the sub-cluster + export steps re-run.
Total runtime ~2 min.
- **Apply to a different identity**: copy `age_split_001.py` to
`age_split_NNN.py` and change `FS001`. The `SCAN_ROOTS`,
`RECOVERY_THRESHOLD`, `TIGHTEN_THRESHOLD`, `SUBCLUSTER_THRESHOLD`,
`ANCHOR_MIN_SIZE`, `FRAGMENT_CENTROID_MAX`, and `FRAGMENT_YEAR_MAX`
defaults are tuned for `faceset_001`'s ~707-face curated cluster; smaller
identities likely need lower `ANCHOR_MIN_SIZE`.
- **Always quarantine THIN buckets** afterwards using the same partition
pattern (move to `_thin/`, split top-level manifest into
`facesets` + `thin_eras`). The script appends THIN entries to the top-level
manifest as if they were full facesets, so the cleanup is a separate step.
+154
@@ -0,0 +1,154 @@
# CLIP zero-shot occlusion filter (masks + sunglasses)
_Run date: 2026-04-27. Driver scripts: `work/filter_occlusions.py`, `work/clip_worker.py`._
## 1. Why
`facesets_swap_ready/` ended the Immich import day with 311 substantive
facesets and a long tail of identities whose clusters had latched onto
*eyewear or mask appearance* instead of identity (covid-era shots, vacation
photos with sunglasses dominating the frame). Two failure modes:
1. **Pollution of averaged identity** — roop's `FaceSet.AverageEmbeddings()`
averages every face in the .fsz. A faceset where 40 % of images are
sunglassed gives a biased centroid; the swap reproduces sunglass-shaped
eye sockets.
2. **Whole-cluster identity drift** — clustering at the embedding level
sometimes anchors on the eyewear silhouette rather than the face,
producing clusters of "the same sunglasses across multiple people".
A targeted attribute scorer was the cleanest fix.
## 2. Model + prompts
**Model**: `open_clip` `ViT-L-14` / `dfn2b_s39b` (Apple Data Filtering Networks).
Best public zero-shot at this size. Loads weights from HF Hub (~890 MB).
Bit-identical scores between WSL CPU and Windows DML.
**Prompt design**: per-attribute ensembles of 5–6 positive + 5–6 negative
prompts. Positive ensembles are mean-pooled and L2-normalized before softmax.
**Critical bug if forgotten**: CLIP cosine similarities are tiny (0.2–0.3
range). Raw `softmax([sim_pos, sim_neg])` collapses to ~0.5/0.5 on every
image. **Multiply by `model.logit_scale.exp()` (~100) before softmax.**
Without that scale the entire scorer outputs a uniform 0.5.
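In scorer form, the whole fix is one multiplication (open_clip API; the `pretrained` tag here is illustrative — substitute the exact dfn2b_s39b weights used in the run):
```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="dfn2b")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

def attribute_prob(image, pos_prompts: list[str], neg_prompts: list[str]) -> float:
    """image: PIL.Image. Returns P(attribute present) for one positive/negative ensemble."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0))
        img = img / img.norm(dim=-1, keepdim=True)
        def pooled(prompts):
            t = model.encode_text(tokenizer(prompts))
            t = t / t.norm(dim=-1, keepdim=True)
            t = t.mean(dim=0)            # mean-pool the ensemble...
            return t / t.norm()          # ...then re-normalize
        sims = torch.stack([img @ pooled(pos_prompts), img @ pooled(neg_prompts)]).squeeze()
        # THE critical step: raw sims are ~0.2-0.3; without logit_scale (~100)
        # the softmax collapses to ~[0.5, 0.5] on every image.
        probs = (model.logit_scale.exp() * sims).softmax(dim=0)
    return probs[0].item()
```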
**Sunglasses prompt pitfall**: the first set caught faces with sunglasses
*pushed up on the forehead* with the same probability as faces with
sunglasses *covering the eyes* — CLIP detects "presence of sunglasses in
frame", not "eyes occluded". Fixed by putting the false positive into the
*negative* class explicitly:
```
positive: "a face with dark sunglasses covering the eyes"
"a portrait with the eyes hidden behind opaque sunglasses"
...
negative: "a face with sunglasses pushed up on the forehead, eyes visible below"
"a face with sunglasses resting on top of the head, eyes visible"
"a face wearing clear prescription eyeglasses with visible eyes"
...
```
Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead
→ 0.39. Threshold 0.7 cleanly separates.
## 3. Architecture
```
┌─────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/filter_occlusions.py │
│ • stage: walk facesets/, write queue.json │
│ • merge: ingest worker results │
│ • report: HTML contact sheet │
│ • apply: prune + quarantine + re-zip │
└────────────┬────────────────────────────────┘
│ queue.json (paths) via \\wsl.localhost\
┌─────────────────────────────────────────────┐
│ Windows C:\clip_dml_venv\ │
│ /opt/face-sets/work/clip_worker.py │
│ Python 3.12 + torch 2.4.1 CPU │
│ + torch-directml 0.2.5 + open_clip_torch │
│ Reads PNGs from native E:\, writes scores │
└─────────────────────────────────────────────┘
```
A separate Windows venv (not the existing `C:\face_embed_venv\`) is needed
because `torch-directml` brings ~1.5 GB of wheels and version-pinned
numpy/pillow that risk breaking the embed_worker venv's
`onnxruntime-directml` + `insightface` stack.
## 4. DML throughput surprise
Measured on AMD Radeon RX Vega:
| input | model | throughput | speedup vs WSL CPU |
|------|-------|-----------:|-------------------:|
| ViT-L-14 (CLIP, this filter) | open_clip | **1.43 img/s** | **2.4×** |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |
Only 2.4× because `aten::_native_multi_head_attention` is not implemented in
the directml plugin and falls back to CPU. The vision encoder runs on GPU,
attention runs on CPU per layer, both alternating. A silenced UserWarning
makes this near-invisible. Workable for a one-shot 73-min corpus run, but
the embed_worker pattern (pure ONNX) remains the gold standard for DML.
## 5. Thresholds (validated 2026-04-27 on 6,318 PNGs)
| level | threshold | semantics |
|-------|----------:|-----------|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40 % of images flagged for either attr | quarantine whole faceset to `_masked/` |
| min-survivors | < 5 surviving AND something pruned | quarantine to `_thin/` |
The `AND something pruned` guard is essential — without it, naturally-small
facesets (hand-sorted with ≤4 PNGs) get incorrectly quarantined for being
small even when they have zero occlusions.
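The faceset-level decision with that guard, as a sketch (names hypothetical):
```python
def faceset_action(n_total: int, n_flagged: int,
                   domain_dominance: float = 0.40, min_survivors: int = 5) -> str:
    """Returns keep / prune / quarantine_masked / quarantine_thin."""
    if n_total and n_flagged / n_total >= domain_dominance:
        return "quarantine_masked"                # whole faceset -> _masked/
    survivors = n_total - n_flagged
    if n_flagged and survivors < min_survivors:   # guard: only if something was pruned
        return "quarantine_thin"                  # so small hand-sorted sets survive
    return "prune" if n_flagged else "keep"
```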
## 6. Run results
| action | count | net effect |
|--------|------:|------------|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → `_masked/` (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → `_thin/` |
Net: 311 active → 255 active after the filter run. 763 PNGs quarantined
whole-faceset, 183 pruned within survivors. All dropped PNGs preserved at
`<faceset>/faces/_dropped/` for reversibility. Master manifest gained a
`masked[]` array parallel to `thin_eras[]`, plus an `occlusion_filter_run`
provenance block.
## 7. Known limitations
- **Per-faceset manifests are NOT updated by `apply`** — only the master
manifest is. Each faceset's own `<faceset>/manifest.json` retains stale
`faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless
for `.fsz` consumers (the .fsz is re-zipped from current disk state) but
downstream tools reading `faces[]` will see broken references. Discovered
later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG
warnings before being caught.
## 8. Re-running
```bash
# 1. Stage queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json
# 2. Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
work/clip_dml/queue.json work/clip_dml/scores.json --batch 8
# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
--scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
--scores work/occlusion_scores.json --out work/occlusion_review
# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
--scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json
```
@@ -0,0 +1,155 @@
# Corpus dedup + roop-unleashed optimization
_Run date: 2026-04-27. Driver scripts: `work/dedup_optimize.py`, `work/multiface_worker.py`._
After consolidation collapsed duplicate identities and age-extend slotted
new PNGs into era buckets, the corpus still carried artifacts that hurt
roop's averaged-embedding quality:
- **Burst-photo near-duplicates** within facesets, especially in
immich-discovered identities where source libraries had many similar
shots within seconds.
- **Cross-faceset byte-identical PNGs** that escaped consolidation's
centroid-similarity matching when individual PNGs matched exactly but
cluster centroids diverged.
- **Multi-face PNGs** that polluted identity averaging because the roop
loader appends every detected face per PNG to the FaceSet (load-bearing
invariant — see § 2).
This pipeline runs three independent passes and an optional fourth, all
moving dropped PNGs to `<faceset>/faces/_dropped/` for reversibility.
## 1. Cross-family byte-dedup
SHA256-hash every PNG in the active corpus (parallel I/O via
`ThreadPoolExecutor(max_workers=16)`, ~17 s for 5,386 PNGs over the
`/mnt/e/` Windows mount). Group by hash; for groups with members in
multiple identity families, keep the higher-tier copy.
**Family detection**: regex `^(faceset_\d+)(?:_.+)?$` — captures the parent
identity. Same family includes parent + era splits (e.g. `faceset_001` +
`faceset_001_2010-13`); these are intentional duplications for the era
.fsz files and are preserved.
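Family extraction and the keep-rule reduce to a short sketch (the `tier_of` callable is assumed to implement the tier table from the consolidation writeup):
```python
import re
from collections import defaultdict

FAMILY_RE = re.compile(r"^(faceset_\d+)(?:_.+)?$")

def family_of(faceset_name: str) -> str:
    """faceset_001 and faceset_001_2010-13 share a family: era duplication is intentional."""
    m = FAMILY_RE.match(faceset_name)
    return m.group(1) if m else faceset_name

def cross_family_drops(hash_groups: dict[str, list[tuple[str, str]]], tier_of) -> list[str]:
    """hash_groups: sha256 -> [(faceset_name, png_path)].
    Keep the copy in the lowest-tier (most trusted) family; drop the rest."""
    drops = []
    for members in hash_groups.values():
        families = {family_of(fs) for fs, _ in members}
        if len(families) < 2:
            continue                     # intra-family duplication is preserved
        keep = min(members, key=lambda m: tier_of(family_of(m[0])))
        drops += [path for fs, path in members if (fs, path) != keep]
    return drops
```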
Run results: 20 cross-family hash groups → 24 PNGs dropped. Most cases were
small immich identity-cluster errors that consolidation missed because
individual PNG embeddings matched but the cluster mean did not.
## 2. Within-faceset near-dup at sim ≥ 0.95
Per-faceset pairwise cosine similarity on cached arcface embeddings.
Connected components in the `sim ≥ 0.95` graph. Keep highest
`quality.composite` per component, drop the rest.
**Threshold rationale**: legitimate same-person-different-pose pairs land at
0.50–0.85; ≥ 0.95 means essentially the same shot (burst frames or
recompressed dupes). Roop's `FaceSet.AverageEmbeddings()` averages all faces
into `faces[0].embedding`; near-identical embeddings averaged ≈ averaging
once. Removing them does not lose identity information; it removes a bias
weight on the most-photographed moments.
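The pass itself is a breadth-first walk over the thresholded similarity graph; a NumPy-only sketch (array shapes assumed):
```python
import numpy as np

def near_dup_drops(embs: np.ndarray, quality: np.ndarray, thresh: float = 0.95):
    """embs: (N, 512) L2-normalized; quality: (N,) composite scores.
    Connected components of the sim >= thresh graph; keep the best per component."""
    sim = embs @ embs.T
    n = len(embs)
    seen, drops = np.zeros(n, bool), []
    for start in range(n):
        if seen[start]:
            continue
        comp, stack = [], [start]
        seen[start] = True
        while stack:                                  # BFS over the thresholded graph
            i = stack.pop()
            comp.append(i)
            for j in np.flatnonzero((sim[i] >= thresh) & ~seen):
                seen[j] = True
                stack.append(j)
        if len(comp) > 1:
            comp.remove(comp[int(np.argmax(quality[comp]))])   # keep the best
            drops += comp
    return drops
```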
Run results: 851 groups → **1,225 PNGs dropped** (23 % of corpus).
Most-affected: `faceset_026` (-132 of 262), `faceset_027` (-107),
`faceset_028` (-92), `faceset_030` (-92). All immich-discovered identities
where the source library had burst sequences.
## 3. Multi-face audit (load-bearing roop invariant)
The roop loader at `roop/ui/tabs/faceswap_tab.py:661–691` runs
`extract_face_images(filename, (False, 0))` on every PNG and **appends every
detected face** to `face_set.faces`. A multi-face PNG therefore pollutes the
averaged identity. The export-swap pipeline drops multi-face crops at
creation, but post-pipeline operations (consolidation, age-extend) move
PNGs across facesets without re-checking.
**This audit re-detects every PNG** with insightface FaceAnalysis and flags
any with `face_count ≠ 1` (filtered by `det_score ≥ 0.5` and
`face_short ≥ 40`). Includes:
- ≥ 2 faces → loader will inject extra identities into averaging
- 0 faces → insightface can't find a face on the cropped PNG; useless for
roop, would silently fail
Run results: 4,146 PNGs scored, 332 flagged (272 with 2 faces, 9 with 3,
2 with 4, **49 with 0**). 82 facesets affected.
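The worker's per-PNG check is essentially the following (insightface `FaceAnalysis` API; gates from above, provider swapped per platform):
```python
import cv2
from insightface.app import FaceAnalysis

# DmlExecutionProvider on the Windows box; CPUExecutionProvider works anywhere for testing.
app = FaceAnalysis(name="buffalo_l", providers=["DmlExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

def face_count(png_path: str, min_det: float = 0.5, min_short: int = 40) -> int:
    """roop's loader appends EVERY detected face per PNG, so anything != 1 is flagged."""
    img = cv2.imread(png_path)
    n = 0
    for f in app.get(img):
        x1, y1, x2, y2 = f.bbox
        if f.det_score >= min_det and min(x2 - x1, y2 - y1) >= min_short:
            n += 1
    return n   # 0 => insightface can't re-find the face; >= 2 => pollutes averaging
```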
## 4. DML throughput jump for face crops
The audit reuses the same insightface + onnxruntime-directml stack as
`embed_worker.py` but achieves **~19 img/s** on AMD Vega vs embed_worker's
2.6 img/s — same model, same hardware. The difference is input size:
| stage | typical input | DML throughput |
|-------|--------------|---------------:|
| `embed_worker.py` (Immich import) | 1024–4000 px source | 2.6 img/s |
| `multiface_worker.py` (this audit) | 512×512 face crops | **19 img/s** |
Detection on small inputs is fast; recognition on aligned 112×112 inputs is
the same cost either way. Implication: **any pipeline operating on
already-cropped face PNGs can rely on a roughly 7× higher DML throughput
ceiling than full-resolution embedding**.
## 5. Architecture
```
┌────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/dedup_optimize.py │
│ • analyze: hashes + within-faceset sim │
│ • apply: move + re-zip (no GPU) │
│ • stage_multiface: write queue.json │
│ • merge_multiface: ingest worker results │
│ • apply_multiface: move + re-zip │
│ • report: HTML audit │
└────────────┬───────────────────────────────┘
│ queue.json via \\wsl.localhost\
┌────────────────────────────────────────────┐
│ Windows C:\face_embed_venv\ │
│ /opt/face-sets/work/multiface_worker.py │
│ insightface FaceAnalysis on DmlExecutionProvider │
│ Reads PNGs from native E:\, writes face_count │
└────────────────────────────────────────────┘
```
Reuses the existing `C:\face_embed_venv\` (no new venv needed — same
insightface stack as `embed_worker.py`).
## 6. Final corpus state (2026-04-27 night)
| metric | start of day | after occlusion filter | after consolidation | after age-extend | after this dedup + multiface |
|--------|-------------:|----------------------:|-------------------:|-----------------:|----------------------------:|
| active facesets | 311 | 255 | 181 | 181 | **181** |
| active PNGs | ~6,440 | 5,386 | 5,386 | 5,400 | **3,849** |
| `_masked/` | 0 | 51 | 51 | 51 | 51 |
| `_thin/` | 68 | 71 | 71 | 71 | 71 |
| `_merged/` | 0 | 0 | 74 | 74 | 74 |
Net reduction at the end of the day: **2,591 PNGs and 130 facesets** removed
or quarantined from the active pool. All preserved on disk for
reversibility (`<faceset>/faces/_dropped/` for prunes, `_masked/_merged/_thin/`
for quarantines).
## 7. Re-running
Run after any new import / consolidation / extend:
```bash
# 1. Byte-dedup + within-faceset near-dup (CPU only)
python work/dedup_optimize.py analyze --out work/dedup_audit/dedup_plan.json
python work/dedup_optimize.py apply --plan work/dedup_audit/dedup_plan.json
# 2. Multi-face audit on Windows DML (resumable)
python work/dedup_optimize.py stage_multiface --out work/dedup_audit/multiface_queue.json
"/mnt/c/face_embed_venv/Scripts/python.exe" work/multiface_worker.py \
work/dedup_audit/multiface_queue.json work/dedup_audit/multiface_results.json
python work/dedup_optimize.py merge_multiface \
--results work/dedup_audit/multiface_results.json \
--out work/dedup_audit/multiface_plan.json
python work/dedup_optimize.py apply_multiface \
--plan work/dedup_audit/multiface_plan.json
# 3. HTML audit
python work/dedup_optimize.py report \
--dedup work/dedup_audit/dedup_plan.json \
--multiface work/dedup_audit/multiface_plan.json \
--out work/dedup_audit
```
@@ -0,0 +1,170 @@
# Identity consolidation + age-bucket extension
_Run date: 2026-04-27. Driver scripts: `work/consolidate_facesets.py`, `work/age_extend_001.py`._
After the Immich peter + nic imports added 280 new facesets to a corpus that
had ~25 canonical identities, many "new" identities were duplicates of
existing household members at lower clustering confidence. Two cooperating
passes clean this up: identity consolidation merges duplicates, then
age-extend slots newly-merged PNGs into the existing era buckets of
`faceset_001`.
## 1. Identity consolidation
### 1.1 Approach
For each active faceset, pull cached arcface embeddings from
`work/cache/{nl_full,immich_peter,immich_nic}.npz` keyed by
`(source, bbox)` from the per-faceset manifest's `faces[]`. Compute
L2-normalized centroid. Pairwise cosine similarity matrix.
**Tier-based primary selection** (lowest tier number wins, size breaks ties):
| tier | sources | rationale |
|-----:|---------|-----------|
| 0 | `faceset_013..019` (hand-sorted) | user's curated labels |
| 1 | `faceset_001..012` (auto-clustered) | well-established household |
| 2 | `faceset_020..025` (osrc) | mixed-bucket discovery |
| 3 | `faceset_026..264` (immich peter) | speculative |
| 4 | `faceset_265+` (immich nic) | speculative |
**Era splits and quarantines excluded:** `faceset_NNN_<era>`, `_masked/`,
`_thin/` are skipped during analysis.
### 1.2 Single-linkage chains catastrophically — complete-linkage required
First attempt used connected-components on edge ≥ 0.45 → produced a
**60-faceset cluster** around `faceset_001` with min within-group sim of
**0.16** (definitely-different people bridged via chains
`A↔B↔C` where `A`, `C` are not similar). Bumping to edge ≥ 0.55 still
chained (group of 17 with min 0.20).
Real fix: `scipy.cluster.hierarchy.linkage(method='complete')` then
`fcluster(Z, t=1-edge_threshold, criterion='distance')`. Complete-linkage
**guarantees** every within-group pair sim ≥ edge threshold. Without this
guarantee the report is unusable and the apply step would produce
identity-poisoned merges.
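A minimal sketch of that grouping step, assuming `centroids` is the `(N, 512)` array of L2-normalized faceset centroids from §1.1 (illustrative names):
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def group_facesets(centroids: np.ndarray, edge: float = 0.55) -> np.ndarray:
    sim = centroids @ centroids.T            # pairwise cosine similarity
    dist = np.clip(1.0 - sim, 0.0, None)     # cosine distance, floor noise
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="complete")
    # complete linkage: max within-group distance <= t, i.e. min sim >= edge
    return fcluster(Z, t=1.0 - edge, criterion="distance")
```
Swapping `method="complete"` for `"single"` reproduces the chaining failure above.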
### 1.3 Thresholds + run results
`edge=0.55`, `confident=0.65` → 48 multi-faceset groups (29 confident, 19
uncertain). Max group size 7, all bilateral or small triplets after
complete-linkage.
After applying all 48 (with `--include-uncertain` after visual approval):
- **74 facesets consumed** (some groups had multiple secondaries:
`[10, 45, 135] → faceset_002`; `[113, 96, 178, 109, 110, 286] → faceset_095`;
etc.)
- Active count 255 → 181
- Notable absorptions: `faceset_001` (peter) 707 → 753 PNGs (+ 7, 132, 151);
`faceset_002` 209 → 247; `faceset_026` 60 → 262 (+ 168, 146, 325);
`faceset_028` → 207
- Master manifest gained `merged[]` array (parallel to `thin_eras[]`); each
entry has `merged_into` field pointing at the primary
### 1.4 Apply mechanics
Combine all PNGs from primary + secondaries, re-rank by existing
`quality.composite` desc (no re-enrich), renumber `0001..NNNN`, copy into a
fresh staging dir, atomic swap. Move secondary directories to
`_merged/<original_name>/` (preserved in full for reversibility). Re-zip
`_topN.fsz` and `_all.fsz`.
The primary's existing per-PNG quality scores are reused — re-ranking does
not require re-running `enrich`-equivalent landmarks/pose on the cropped
PNGs. The primary's `_dropped/` (from prior occlusion filter) is preserved
through the merge.
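A staged-swap sketch of those mechanics (illustrative names; the real logic lives in `work/consolidate_facesets.py`):
```python
import shutil
from pathlib import Path

def apply_merge(root: Path, primary: Path, secondaries: list[Path],
                entries: list[dict]) -> None:
    # entries: one dict per PNG from primary + secondaries, each carrying its
    # existing quality.composite and its current path under "src_png".
    staging = primary / "_faces_new"
    staging.mkdir(exist_ok=True)
    entries.sort(key=lambda e: -e["quality"]["composite"])   # re-rank, desc
    for rank, e in enumerate(entries, start=1):              # renumber 0001..
        shutil.copy2(e["src_png"], staging / f"{rank:04d}.png")
    old = primary / "_faces_old"
    (primary / "faces").rename(old)                          # swap in staging
    staging.rename(primary / "faces")
    shutil.rmtree(old)
    merged = root / "_merged"
    merged.mkdir(exist_ok=True)
    for sec in secondaries:                                  # keep for undo
        shutil.move(str(sec), str(merged / sec.name))
```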
## 2. Age extension of faceset_001 era buckets
### 2.1 Why a follow-on pass
Consolidation absorbed faceset_007/132/151 into faceset_001 (+46 PNGs).
The original `age_split_001.py` had bucketed peter into 6 era anchors
(`_2005-10`, `_2010-13`, `_2011`, `_2014-17`, `_2018-19`, `_2018-20`), but
those new PNGs had never been seen by age_split. They sat in faceset_001's
parent-only set, missing from every era .fsz.
### 2.2 Era-label pitfall
The 6 anchor era labels are NOT strict year ranges. They are
`Counter(years).most_common(1)`-derived dom-years from the original sub-cluster:
| label | dom_year | actual span of members |
|-------|---------:|-----------------------:|
| `_2005-10` | 2010 | 2005–2010 |
| `_2010-13` | 2011 | **2007–2024** |
| `_2011` | 2011 | 2011 only |
| `_2014-17` | 2016 | 2005–2018 |
| `_2018-19` | 2018 | 2012–2020 |
| `_2018-20` | 2019 | 2014–2022 |
The clusters are *appearance-anchored*, not year-bounded. Year is a
descriptive label. Assignment rule must use dom-year, not member span.
### 2.3 Algorithm
For each unbucketed face entry in `faceset_001`'s manifest (50 of 753):
1. Look up the embedding in the cache by `(source, bbox)`.
2. Look up the EXIF year via `work/cache/age_split_exif.json`; fetch on cache miss.
3. Find the single nearest era anchor by cosine distance to its centroid.
4. Accept iff `dist ≤ 0.40` AND `|year − anchor.dom_year| ≤ 5`; these
thresholds match `age_split_001.py`'s anchor-fragment rule (see the sketch
after this list).
5. Anchors are NOT re-centered after absorption (preserves age_split's
drift-prevention guarantee).
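A minimal sketch of the acceptance rule, assuming `anchors` holds `(name, L2-normalized centroid, dom_year)` tuples (the full implementation is `work/age_extend_001.py`):
```python
import numpy as np

DIST_MAX, YEAR_MAX = 0.40, 5

def assign_era(v: np.ndarray, year: int | None,
               anchors: list[tuple[str, np.ndarray, int]]) -> str | None:
    if year is None:
        return None                          # undated PNGs are skipped
    dist, name, dom = min((1.0 - float(np.dot(c, v)), n, d)
                          for n, c, d in anchors)
    if dist <= DIST_MAX and abs(year - dom) <= YEAR_MAX:
        return name
    return None                              # rejected: stays unbucketed
```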
### 2.4 Run results
50 unbucketed → 21 with EXIF year → **14 accepted**:
| anchor | dom_year | added |
|--------|---------:|------:|
| `_2005-10` | 2010 | +2 |
| `_2010-13` | 2011 | +1 |
| `_2014-17` | 2016 | **+9** |
| `_2018-20` | 2019 | +2 |
29 PNGs skipped for missing EXIF year (mostly immich-stripped
photos). 7 dist/year-rejected (e.g. two PNGs from 2025 want
`_2018-19` but year-delta 7 > 5).
### 2.5 Reconciliation side effect
The apply rebuilds each affected era bucket's `faces/` from staging. This
incidentally reconciled the per-bucket manifests with disk after the prior
occlusion filter run had left era manifests stale at 282/126/132 entries vs
~248/125/129 actual files (occlusion filter only updates the master
manifest, never per-faceset manifests — see
`docs/analysis/clip-occlusion-filter.md` §7). 42 occlusion-dropped era PNGs
inside the old `faces/_dropped/` were removed during rebuild. The
parent `faceset_001/faces/_dropped/` still has the corpus-level audit; all
source images are intact at `/mnt/x/src/`, so the era-level dropped PNGs
are regeneratable via `cmd_export_swap`.
## 3. Re-running
Always run both passes after any new identity import (Immich, osrc,
hand-sorted folder):
```bash
# 1. Find duplicate identities
python work/consolidate_facesets.py analyze \
--out work/merge_review/candidates.json [--edge 0.55 --confident 0.65]
python work/consolidate_facesets.py report \
--candidates work/merge_review/candidates.json --out work/merge_review
# inspect work/merge_review/index.html
python work/consolidate_facesets.py apply \
--candidates work/merge_review/candidates.json [--include-uncertain]
# 2. Slot new faceset_001 PNGs into existing era buckets
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
python work/age_extend_001.py report \
--candidates work/age_extend/candidates.json --out work/age_extend
python work/age_extend_001.py apply --candidates work/age_extend/candidates.json
```
Both are idempotent. `consolidate_facesets` skips secondaries already in
`_merged/`; `age_extend_001` recomputes anchor centroids + dom-year fresh
on every run.
@@ -0,0 +1,279 @@
# Importing identities from a self-hosted Immich library
_Run date: 2026-04-26. Target: Immich v2.7.2 at `https://fotos.computerliebe.org`.
Driver scripts: `work/immich_stage.py`, `work/embed_worker.py`,
`work/cluster_immich.py`, `work/finalize_immich.sh`._
## 1. Why a split workflow
InsightFace `buffalo_l` on the WSL CPU runs the full detection + landmarks +
recognition stack at ~3–4 faces/second. Re-detecting all 79K Immich photos
would have taken ~10–28 days. The available AMD Radeon RX Vega is unusable
under WSL (no `/dev/dri/`, no ROCm), but **DirectML on Windows native**
runs the same models bit-identically and ~7.5× faster end-to-end. The
pipeline therefore splits:
- **WSL side** (`/opt/face-sets/`) — orchestration: API listing, download,
sha256 dedup, file management, clustering, faceset emission.
- **Windows side** (`C:\face_embed_venv\`) — the embed step only. A fresh
Python 3.12 (installed via `winget install Python.Python.3.12`) with
`numpy`, `pillow`, `opencv-python-headless`, `onnxruntime-directml`,
`insightface`. Models copied from `/home/peter/.insightface/models/buffalo_l/`
to `C:\face_embed_venv\models\buffalo_l\`.
A 30-iteration synthetic benchmark on Vega:
| model | DML | CPU | speedup |
|-------------|----:|----:|--------:|
| `det_10g.onnx` (640×640) | 10.0 ms | 183.5 ms | 18.4× |
| `w600k_r50.onnx` (112×112) | 8.2 ms | 90.5 ms | 11.0× |
End-to-end FaceAnalysis on 5 real Immich-sourced images (excluding the
first-call DML JIT warmup): ~7.5× speedup post-warmup. Per-face cosine
similarity DML vs CPU was 1.0000 across all 8 detected faces — DML is
bit-identical to CPU for arcface inference.
## 2. Architecture
```
┌─────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/immich_stage.py │
│ ┌──────────────────────────────────────────┐│
│ │ ThreadPoolExecutor.map(_fetch_for_asset, ││
│ │ list_assets(user)) ││
│ │ ─ /faces?id= (Immich, parallel x8) ││
│ │ ─ filter face_short >= 90 ││
│ │ ─ /assets/.../original (parallel x8) ││
│ └──────────────────────────────────────────┘│
│ consumer (main thread): │
│ sha256 → dedup vs nl_full.npz │
│ save to /mnt/x/src/immich/<user>/<rel>/ │
│ append to queue.json │
└────────────────┬────────────────────────────┘
▼ queue.json (with WSL + Windows paths)
┌─────────────────────────────────────────────┐
│ Windows embed_worker.py (C:\face_embed_venv) │
│ insightface.FaceAnalysis( │
│ providers=[DmlExecutionProvider, ...]) │
│ per image: detection + landmarks + arcface │
│ emit cache in sort_faces.py:cmd_embed │
│ schema with embeddings + meta + processed │
│ + path_aliases + schema=v2 │
└────────────────┬────────────────────────────┘
▼ immich_<user>.npz
┌─────────────────────────────────────────────┐
│ WSL cluster_immich.py │
│ build centroids of canonical │
│ faceset_NNN/ in facesets_swap_ready/ │
│ drop matches at cos-dist <= 0.45 │
│ cluster the rest at 0.55 │
│ refine gates -> synthetic refine_manifest │
│ cmd_export_swap -> facesets_swap_ready/ │
│ merge top-level manifest │
└─────────────────────────────────────────────┘
```
Cache artifacts stay separate (per the architecture choice on this run):
each user's results live in their own `immich_<user>.npz`. A future
one-shot merge can fold them into `nl_full.npz` if needed; the existing
`extend` command would do the right thing once schemas align.
## 3. Path mapping
`/mnt/x/` ↔ `X:\`. Cache stores WSL form (matching `nl_full.npz`'s
existing convention). `wsl_to_win()` translates for the embed worker
which runs natively on Windows.
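`wsl_to_win()` itself is a small pure-string mapping; a sketch of what it has to do (the body here is an assumption, not the actual helper):
```python
def wsl_to_win(p: str) -> str:
    # /mnt/x/src/immich/a.jpg -> X:\src\immich\a.jpg
    if p.startswith("/mnt/") and len(p) > 6 and p[6] == "/":
        drive = p[5].upper()
        return f"{drive}:\\" + p[7:].replace("/", "\\")
    return p
```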
`work/cluster_immich.py` always uses the canonical `facesets_swap_ready/`
view to build identity centroids — meaning the comparison is against the
*current* set of canonical facesets in the swap-ready directory (skipping
era splits and `_thin/`), not against the older `facesets_full/` snapshot.
## 4. Result of the 2026-04-26 run (peter / admin)
### 4a. Stage
```
total_assets_seen: 53842
staged_count: 10261 (~10 GB on /mnt/x/)
deduped_against_existing: 978 (sha256 in nl_full.npz already)
deduped_against_staged: 2976 (internal byte-dupes inside Immich)
skipped_no_big_face: 9539 (Immich detected only sub-90px faces)
skipped_no_faces: 29390 (Immich detected zero faces)
skipped_download_error: 698 (transient DNS / TLS, not seen-marked)
elapsed: ~70 min (6.4 assets/s end-to-end at 8 workers)
```
The 698 transient errors are recoverable on a re-run because
`immich_stage.py` does not add them to the `seen` set. Each transient
asset would be retried.
### 4b. Embed (Windows DML)
```
queue: 10261 entries
new face records: 19462
new noface records: 1
load errors: 125 (likely HEIC / unreadable)
elapsed: 3878.0s (64.6 min, 2.6 img/s end-to-end)
```
The 2.6 img/s end-to-end includes CIFS-share image load, image decode,
DML inference (~50 ms/face), and JSON / NPZ flushing. Pure DML inference
is faster; the rest of the pipeline dominates at scale.
### 4c. Cluster
```
existing canonical centroids: 25
faces already covered (cos-dist <= 0.45): 8103/19480 (42%)
faceset_001: 1856
faceset_002: 2666
faceset_003: 670
faceset_004: 48
faceset_005: 40
... (smaller hits to the remaining 20)
unmatched faces to cluster: 11377
clusters at threshold 0.55: 2534 (top sizes [469, 444, 342, 338, 262, ...])
survived refine gates: 239
emitted as new facesets: 185 (54 dropped by export-swap's 0.45 outlier)
```
Top-level `facesets_swap_ready/manifest.json` after this run: **216
facesets** (up from 31; ~7× growth) + 68 thin_eras under `_thin/`.
## 4d. Result of the 2026-04-26..27 run (nic, with per-user API key)
After issuing nic a per-user API key, the same pipeline ran end-to-end
with no code changes (only the `IMMICH_API_KEY` env var changed). The
run survived one Immich outage mid-stage thanks to the circuit breaker
added in `work/immich_stage.py` (12 consecutive HTTP errors → probe →
exit 2 with state preserved → resume on same command).
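A sketch of that breaker shape (hypothetical names; `probe` and `save_state` stand in for whatever `work/immich_stage.py` actually does, and state is flushed to disk throughout the run so exit 2 loses nothing):
```python
import sys

MAX_CONSECUTIVE = 12
consecutive = 0

def note_http_result(ok: bool, probe, save_state) -> None:
    global consecutive
    consecutive = 0 if ok else consecutive + 1
    if consecutive >= MAX_CONSECUTIVE:
        if probe():              # one cheap request: is the server back?
            consecutive = 0
        else:
            save_state()         # persist seen-set + queue markers
            sys.exit(2)          # distinct exit code: resume same command
```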
### Stage
```
total_assets_seen: 25777 (matches /server/statistics 25,786)
staged_count: 7834 (30% face-bearing-with-big-face;
peter was 19%)
deduped_against_existing: 519 (sha256 in nl_full.npz already)
deduped_against_staged: 0 (nic's library has zero internal
byte-dupes; peter had 2,976)
skipped_no_big_face: 725
skipped_no_faces: 16695
skipped_download_error: 54 (transient; not marked seen ->
would be retried on resume)
elapsed: ~75 min wall (across two pause/resume sessions
bracketing one Immich outage)
```
### Embed (Windows DML)
```
queue: 7834 entries
new face records: 15627
new noface records: 1
load errors: 7
elapsed: 3538.9s (59 min, 2.2 img/s end-to-end)
```
### Cluster
```
existing canonical centroids: 25
faces already covered (cos-dist <= 0.45): 6770/15627 (43%)
faceset_002: 3261 (the dominant family identity)
faceset_008: 1461 (cross-match to hand-sorted 'sab')
faceset_001: 955
faceset_007: 408 (cross-match to hand-sorted 's')
faceset_006: 114
...
unmatched: 8857
clusters at threshold 0.55: 3787 (top sizes [165, 134, 106, 99, 92,
67, 62, 61, 58, 53])
survived refine gates: 129
emitted as new facesets: 95 (faceset_265..NNN with gaps)
```
Top-level `facesets_swap_ready/manifest.json` after the nic run: **311
substantive facesets** + 68 thin_eras. Two-day cumulative growth:
| date | event | facesets total |
|------|------|------:|
| 2026-04-25 | hand-sorted folder import | 19 |
| 2026-04-26 morning | osrc + age split + cleanup | 31 |
| 2026-04-26 afternoon | Immich peter run | 216 |
| 2026-04-27 (overnight) | Immich nic run | 311 |
## 5. Surprises and caveats
### 5a. `/search/metadata`'s `userIds` filter is silently ignored (Immich v2.7.2)
When the admin API key is used, passing `userIds=[<other-user-uuid>]`
returns admin's own assets, not the other user's. The filter is
silently dropped. Verified by sampling 200 returned items and
confirming `ownerId` was admin for all of them.
To process another user's library, **a separate API key issued by that
user is required** — the admin key cannot enumerate cross-user
libraries through any documented endpoint we tried. `/timeline/buckets`
with a `userId` query parameter returns
`Not found or no timeline.read access`.
### 5b. `/server/statistics` undercounts what the search returns
`/server/statistics` reported admin = 53,842 photos. Our
`/search/metadata` paginated through... **53,842** top-level. So the
header agrees with the body in this case. But `/server/statistics` does
NOT count items that live under external libraries' import paths —
yet `/search/metadata` does include them. For this Immich, two external
libraries (`/mnt/media/photos` and `/mnt/media/omv_photos`) are
configured but `/libraries` reports `assetCount=0` for both. Yet 80% of
our staged paths come from those library import paths. Don't trust
statistics-vs-search consistency.
### 5c. Indexed Immich thumbnails masquerading as assets
5,563 of our 10,261 staged paths are `<library>/thumbs/.../-preview.jpeg`
— Immich's own internally-generated thumbnails got indexed because the
external library import path included the thumbs subdirectory and the
exclusion patterns didn't list `**/thumbs/**`. They embed and cluster
fine but produce lower-resolution face records. The fix on the Immich
side is adding `**/thumbs/**` to the exclusion patterns.
### 5d. Internal byte-duplicates (2,976)
Many Immich assets are byte-identical to other Immich assets — typically
because the same photo was uploaded both from a phone and from a
synced cloud folder. sha256 dedup catches all of these on the second
download (we still pay the bandwidth, but skip the disk write and
embed work). With Immich v2.7.2's own `assets/duplicates` endpoint we
could catch this earlier, but it's not currently used.
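The dedup itself is a one-hash gate; a minimal sketch (illustrative names):
```python
import hashlib

def maybe_stage(content: bytes, seen_sha256: set[str]) -> bool:
    # Bandwidth is already spent by the time we hash, but a hit skips the
    # disk write and never enters the embed queue.
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_sha256:
        return False
    seen_sha256.add(digest)
    return True
```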
## 6. Re-running and applying to other Immich instances
```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=... # admin or per-user key
# Optional: populate work/immich/users.json with label -> UUID map.
# 1. Stage (parallel /faces + downloads, resumable).
python work/immich_stage.py --user peter --workers 8
# 2. End-to-end finalize: copy queue to /mnt/c/, run Windows embed worker,
# copy the cache back, run cluster_immich.py.
bash work/finalize_immich.sh peter
```
For a different Immich instance, the only configuration is the env vars
and the `users.json` sidecar. `cluster_immich.py`'s tunables (matching
threshold, clustering threshold, refine gates, MIN_FACES) are at the
top of the script.
To process a *second* user's library, issue a per-user API key in the
Immich admin UI for that user, set `IMMICH_API_KEY` to that key, and
re-run with their `--user <label>`. The admin key cannot impersonate
other users via the search API.
@@ -0,0 +1,119 @@
# Identity discovery in `/mnt/x/src/osrc`
_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records).
Driver script: `work/cluster_osrc.py`._
## 1. Source
`/mnt/x/src/osrc/` is a flat mixed-identity bucket: 213 files in root + a
`psd/` subfolder with 41 PSD files + a single file in `[Originaldateien]/`.
File extensions are 171 jpg + 1 jpeg + 41 psd. PSDs are not embedded
(InsightFace's loader doesn't read PSD); the 41 PSDs were skipped, on the
working assumption that the same identities are also present in the
adjacent JPGs.
`nl_full.npz` already covered 160 of the 213 files (the remaining 53: 41
psd + 12 jpg). Of the 12 missing JPGs, 11 are byte-duplicates of `00843resc.jpg`
.. `00855resc.jpg` (same file sizes, paired by sha256) — already aliased
in the cache. Only 1 jpg (`19554226_..._n.jpg`) is genuinely uncovered.
The 160 covered files yielded **336 face records / 10 noface**, with 64
single-face / 35 two-face / 19 three-face / 24 four-face / 8 with 5–8
faces. Quality is good: median `face_short=116px`, `det_score=0.85`,
`blur=244`. The minimum `face_short` of 40px will fail the 90px refine gate.
## 2. Coverage by existing identities
Computed cos-dist from each osrc face to the centroids of the canonical
`faceset_001..019` (built from each manifest's `(source, bbox)` keys).
Median nearest-cos-dist was 0.875 — i.e. the bulk of osrc is **not** the
existing 19 identities.
At cos-dist ≤ 0.45 (matching `build_folders.py`'s `OSRC_THRESHOLD`):
| existing identity | osrc faces matched |
|------------------|------------------:|
| faceset_002 | 7 |
| faceset_008 | 4 |
| faceset_015 | 3 |
| faceset_019 | 4 |
These 18 osrc faces are routed to existing identities by
`build_folders.py` and `extend`; they are excluded from the
identity-discovery step.
## 3. Pipeline
`work/cluster_osrc.py` mirrors `build_folders.py`'s structure (synthesize
a refine manifest, hand off to `cmd_export_swap`, relocate, merge
top-level manifest) but discovers identities by clustering rather than
asserting them by folder.
1. Filter cache to face records under `/mnt/x/src/osrc` (canonical or
byte-aliased path).
2. Drop the 18 already-covered faces (cos-dist ≤ 0.45 to any existing
identity centroid).
3. Cluster the remaining 318 faces among themselves at cos-dist 0.55
(matches the `extend` default for new-cluster formation).
4. For each cluster, apply `refine`-equivalent per-face gates
(`face_short ≥ 90`, `blur ≥ 40`, `det_score ≥ 0.6`); for clusters of ≥ 4
faces, apply outlier rejection at cluster-centroid cos-dist 0.55 (see the
sketch after this list). Keep clusters whose surviving unique-path count
is ≥ 6 (the operator-chosen `MIN_FACES`, lower than the canonical 15
because osrc is small per-identity).
5. Number kept clusters `faceset_020+` (past the existing
`facesets_swap_ready/` max of 019) ordered by size descending.
6. Synthesize a refine manifest and call `cmd_export_swap` on it. Move
the emitted dirs into `facesets_swap_ready/`, drop an `osrc.txt`
provenance marker, and append the new entries to the top-level
`manifest.json` (without disturbing existing `facesets` / `thin_eras`).
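A minimal sketch of step 4's gate + outlier pass, assuming `faces` carries the cached per-face metrics and `embs` holds their L2-normalized embeddings (illustrative names, not `cluster_osrc.py`'s actual functions; whether the ≥ 4 cutoff counts raw or gated faces is an assumption here):
```python
import numpy as np

def gate_cluster(faces: list[dict], embs: np.ndarray) -> list[int]:
    # Per-face gates first (refine-equivalent).
    keep = [i for i, f in enumerate(faces)
            if f["face_short"] >= 90 and f["blur"] >= 40
            and f["det_score"] >= 0.6]
    # Outlier rejection only for clusters of >= 4 survivors.
    if len(keep) >= 4:
        E = embs[keep]
        c = E.mean(axis=0)
        c /= np.linalg.norm(c)
        d = 1.0 - E @ c                      # cosine distance to centroid
        keep = [k for k, dist in zip(keep, d) if dist <= 0.55]
    return keep
```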
## 4. Result (2026-04-26)
Phase 1 (clustering, before export-swap):
- 137 raw clusters at cos-dist 0.55; top sizes [37, 20, 12, 9, 7, 7, 6, 6, 6, 5].
- After quality gate: 124 faces dropped (mostly `face_short < 90` from
group-photo tertiary subjects).
- Outlier rejection: 0 dropped (clusters were tight).
- After `min_faces=6`: **7 candidate clusters kept** (sizes 6–28 unique
source paths).
Phase 2 (`cmd_export_swap` with `min_face_short=100`,
`outlier_threshold=0.45`):
| name | input | outlier drop | exported PNGs |
|--------------|------:|-------------:|--------------:|
| faceset_020 | 71 | 42 | 26 |
| faceset_021 | 36 | 21 | 10 |
| faceset_022 | 15 | 7 | 8 |
| faceset_023 | 19 | 14 | 4 |
| faceset_024 | 6 | 0 | 6 |
| faceset_025 | 10 | 4 | 6 |
| faceset_026 | — | — | 0 (skipped: empty after filter) |
`faceset_026`'s 6 cluster faces all failed export-swap's tighter
`min_face_short=100` gate (vs. cluster's 90); it is not emitted.
`faceset_023` is small (4 PNGs) but useful as an averaged identity at
that size.
Top-level `facesets_swap_ready/manifest.json` now: **31 substantive
facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted + 6 era splits + 6
osrc-discovered) + **68 thin_eras** under `_thin/`.
## 5. Re-running and applying to other mixed buckets
- The cache holds osrc embeddings; to re-run with different parameters,
edit `cluster_osrc.py`'s config block and re-execute. Cluster discovery
+ export-swap is a few minutes total.
- For a different mixed-bucket source, copy `cluster_osrc.py` to
`cluster_<name>.py` and change `OSRC_DIR`, `OUT_TMP`, `SYNTH_MANIFEST`,
`START_NNN`. The exclusion step compares against the *current* contents
of `facesets_swap_ready/faceset_NNN/` so it picks up everything emitted
by previous discovery / split / hand-sorted runs.
- Lowering `MIN_FACES` from 6 to 4 would have admitted ~3 additional
marginal clusters at this corpus size; the trade-off is a noisier
identity average for small-N facesets.
- `extend` should be run before `cluster_osrc.py` so `raw_full/` and
`facesets_full/` stay in sync — `cluster_osrc.py` itself only writes
to `facesets_swap_ready/`.
@@ -0,0 +1,142 @@
# Video target preprocessing for roop-unleashed
_Initial design + first batch run: 2026-04-27. Driver scripts: `work/video_target_pipeline.py`, `work/video_face_worker.py`, `work/run_video_pipeline.sh`._
Companion to the face-set side of the project: instead of building per-identity .fsz bundles for the *source* of a swap, this pipeline preprocesses the *target* (videos to swap into). Given a folder of video files, it identifies "swappable" segments — continuous shots where a face is detectable, sufficiently visible, and roughly within inswapper_128's working envelope — and cuts them into UUID-named clips ready to feed into roop-unleashed.
## 1. Why build it
I checked the obvious open-source projects for an existing implementation:
- **FaceFusion** ([github.com/facefusion/facefusion](https://github.com/facefusion/facefusion)) — CLI has `run`, `headless-run`, `batch-run`, `job-*`, `force-download`, `benchmark`. No scene-detection or clip-extraction subcommand. Its own guides recommend "split your video manually first."
- **roop-unleashed** at `/opt/roop-unleashed/roop/util_ffmpeg.py` — has `cut_video(start_frame, end_frame)` for a manual GUI trim, no detection-driven segmentation.
- **Deep-Live-Cam** ([github.com/hacksider/Deep-Live-Cam](https://github.com/hacksider/Deep-Live-Cam)) — real-time / single-shot, no batch preprocessing.
- **DeepFaceLab** — `extract_video.bat` dumps every frame between user-supplied trim points; no quality gating.
Closest prior art for the cut-detection pattern is the two-stage hybrid in [SportSBD MMSys'26](https://dl.acm.org/doi/10.1145/3793853.3799803) (cheap detector for cuts, accurate net for verification), but the actual implementation has to be ours.
## 2. Pipeline architecture
```
WSL /opt/face-sets/work/ Windows C:\face_embed_venv\
───────────────────────────────────── ─────────────────────────────
run_video_pipeline.sh (chain driver)
├─ scan (ffprobe metadata)
├─ scenes (PySceneDetect AdaptiveDetector, CPU)
├─ stage (sampled frame queue.json @ 2 fps)
│ │
│ ▼
│ video_face_worker.py
│ insightface FaceAnalysis
│ on DmlExecutionProvider
│ output: results.jsonl
├─ merge (ingest results.jsonl)
├─ track (IoU + embedding stitching, ~30 LOC)
├─ score (track-level quality gate + cross-track merge)
├─ cut (ffmpeg -c copy → per-source subfolders)
└─ report (HTML preview)
Output: <output_dir>/<source_video_stem>/<uuid>.mp4
/<uuid>.json (sidecar; opt-in via
--write-sidecar)
```
`run_video_pipeline.sh` is parameterized via env vars (`WORK`, `INPUT_DIR`, `OUTPUT_DIR`, `FILTER_FROM`, `SKIP_PATTERN`, `MAX_DUR`, `IDENTITY`, `SIDECAR`) so you can pin a particular batch without editing the script. Sidecars are off by default — the per-batch `plan.json` always carries the full provenance for every clip; the `<uuid>.json` files alongside the clips are redundant and only useful if you need each clip to be self-describing in isolation.
## 3. Quality signals (matched to inswapper_128's working envelope)
inswapper_128 is trained near-frontal at 128×128. The score gate uses defaults that admit side profiles (since rich face-sets can absorb non-frontal swap targets):
| signal | threshold | rationale |
|--------|----------:|-----------|
| `|yaw|` | ≤ 75° | covers full 3/4 + side profile |
| `|pitch|` | ≤ 45° | covers extreme up/down looks |
| `face_short` | ≥ 80 px | inswapper resamples to 128; ≥80 still produces clean output |
| `det_score` | ≥ 0.5 | matches buffalo_l's MIN_DET; lower = unreliable detection |
| track-gate | ≥ 70 % frames pass | binary track filter rather than per-frame |
| duration | 1 s ≤ dur ≤ 120 s | below 1s = unusable slivers; above 120s probably contains a missed micro-cut |
Plus two segment-merging knobs:
- `--bridge-gap` (default 3 s) — within a single track, brief pose-failure gaps shorter than this get bridged so single bad frames don't fragment a good run
- `--merge-gap` (default 2 s) — across tracks within the same scene, segments closer than this get fused (cross-track merge fires when face detection briefly fails between adjacent good runs)
The defaults can be tightened (e.g. `--max-yaw 25` for portrait-only) or loosened (e.g. `--max-yaw 90 --merge-gap 5`) without re-running detection — `score` reads the existing `tracks.json`.
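Cross-track fusion is plain interval merging; a minimal sketch of the `--merge-gap` knob over `(start, end)` segments in seconds within one scene (assumed representation, not the pipeline's actual data model; `--bridge-gap` works the same way within a single track):
```python
def fuse_segments(segs: list[tuple[float, float]],
                  merge_gap: float = 2.0) -> list[tuple[float, float]]:
    fused: list[list[float]] = []
    for s, e in sorted(segs):
        if fused and s - fused[-1][1] <= merge_gap:  # close enough: fuse
            fused[-1][1] = max(fused[-1][1], e)
        else:
            fused.append([s, e])
    return [(s, e) for s, e in fused]
```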
## 4. Performance + the JSONL append-only fix
This is where the engineering interest is. The first production run on 13 videos / 6.18 h of input went through three failure modes before settling at production speed:
| attempt | issue | rate observed |
|---|---|---:|
| 1. Original `cap.set(POS_FRAMES, N)` per sample | OpenCV seeks to nearest keyframe + decodes forward at every sample. Cost grows with depth into the video; on a 60-min H.264 it falls off a cliff. | 1.4 fps → degrading |
| 2. Sequential `cap.grab()` from frame 0 | On resume, grab-walking from frame 0 to a deep target is unbounded. | 0.08 fps |
| 3. Hybrid: seek-once-per-video + sequential within | Better in principle. But hit a different bug: `flush()` was re-serializing the entire `results.json` (245 MB at this point) every 100 frames or 30 sec. Save dominated wall-clock. | 0.5 fps |
| 4. **JSONL append-only** | One result per line. Each flush is O(new records), not O(total records). | **13.77 fps** smoke / 7.57 fps cumulative across the full batch |
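Attempts 1–3 all orbit OpenCV's per-sample seek pathology; the pattern that survived into production is one seek per video, then a sequential `grab()` walk. A minimal sketch (illustrative names, assuming cv2):
```python
import cv2

def sample_frames(path: str, start_frame: int, stride: int):
    cap = cv2.VideoCapture(path)
    try:
        cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)  # the ONE seek per video
        pos = start_frame
        while cap.grab():                   # walk forward; no per-sample seek
            if (pos - start_frame) % stride == 0:
                ok, frame = cap.retrieve()  # materialize only at sample points
                if ok:
                    yield pos, frame
            pos += 1
    finally:
        cap.release()
```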
Lesson: when the output is large + grows monotonically + needs frequent checkpointing, *do not* re-serialize the whole structure on each flush. Append-only line-delimited JSON is the right tool. The legacy `results.json` is auto-converted to `.jsonl` on first load (one-time migration), so resumes survive the format switch.
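A minimal sketch of the append-only pattern (hypothetical names; the real worker is `work/video_face_worker.py`):
```python
import json
import os

class JsonlSink:
    """Append-only results writer: each checkpoint costs O(new records)."""
    def __init__(self, path: str):
        self.f = open(path, "a", encoding="utf-8")  # append mode survives resume

    def write(self, record: dict) -> None:
        self.f.write(json.dumps(record) + "\n")

    def checkpoint(self) -> None:
        self.f.flush()
        os.fsync(self.f.fileno())           # durable against a mid-run kill

def load_results(path: str) -> list[dict]:
    # Resume path: one linear read; no monolithic parse/re-serialize cycle.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```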
## 5. Hardware decode/encode on AMD Vega + WSL
Skipped. Per [Microsoft's WSL D3D12 video acceleration post](https://devblogs.microsoft.com/commandline/d3d12-gpu-video-acceleration-in-the-windows-subsystem-for-linux-now-available/), VAAPI-via-Mesa-D3D12 exists but is fragile on older AMD. AMF on Windows would mean a Windows-side ffmpeg leg, doubling boundary crossings. CPU software decode of 1280×720 H.264 in WSL ffmpeg is faster than realtime, and the bottleneck is buffalo_l detection on DML, not decode.
For cutting we use `-c copy` stream-copy — no re-encode, hardware codecs are moot.
## 6. Full corpus run results
Three runs across the 61-video corpus at `/mnt/x/src/vd/`:
| | test (3 videos) | first batch (13 videos, 50–62) | rest (45 videos, 02–49 minus test) | **total** |
|---|---:|---:|---:|---:|
| input duration | 0.6 h | 6.18 h | 12.98 h | **19.76 h** |
| sampled frames @ 2 fps | 4,472 | 44,635 | 94,030 | 143,137 |
| tracks | 187 | 2,564 | 3,823 | 6,574 |
| accepted tracks | 94 (50 %) | 1,193 (47 %) | 1,905 (50 %) | 3,192 (49 %) |
| **emitted segments** | **83** | **600** | **1,301** | **1,984** |
| cross-track-merged segments | 14 | 254 | 382 | 650 |
| accepted content | 13 min | 239 min | 395 min | **647 min (= 10.78 h)** |
| acceptance rate by time | 36 % | 64.6 % | 50.7 % | **54.6 %** |
| output size | 0.135 GB | 3.63 GB | 4.84 GB | **8.6 GB** |
Phase timings (rest batch — best representative since it ran fully under JSONL append-only from a fresh start):
- scenes: 117 min (PySceneDetect, 45 × ~3 min/video)
- stage: instant
- worker: 100 min @ **15.78 fps** sustained (vs 7.5 fps for first batch which migrated mid-run)
- merge: 90 s
- track: 92 s
- score: 23 s
- cut (1,301 ffmpeg stream-copies): 30 min
- report (1,301 thumbs + HTML): 5.5 min
- **total wall-clock: 4h16m**
Across all three runs, **0 worker errors on 143,137 sampled frames**.
## 7. Re-running
```bash
# choose a per-batch workdir + log
WORK=/opt/face-sets/work/video_preprocess_<batch_name> \
FILTER_FROM=ct_src_00050.mp4 \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch_name>.log 2>&1 &
# check status anytime
bash work/status_video_pipeline.sh work/logs/video_run_<batch_name>.log
```
Skip patterns can exclude already-processed inputs (note that 5-digit numbers need full padding in the regex, e.g. `0005[0-9]` not `005[0-9]`):
```bash
SKIP_PATTERN='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$' \
WORK=/opt/face-sets/work/video_preprocess_rest \
bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &
```
To also emit per-clip provenance sidecars (off by default):
```bash
SIDECAR=yes \
WORK=/opt/face-sets/work/video_preprocess_<batch> \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch>.log 2>&1 &
```
`scenes` outputs are cached in the batch's `WORK/scenes/` dir, so re-running the chain after an edit-to-score step doesn't redo detection. The worker is also resumable per `queue_id` — if killed mid-flight, just relaunch.
@@ -0,0 +1,576 @@
"""Extend the existing 6 era buckets of faceset_001 by absorbing PNGs that
post-date the original age_split run (from consolidation merges, etc.).
Mirrors the anchor-fragment assignment logic in age_split_001.py:
- For each unbucketed face in faceset_001's manifest, find the nearest active
era anchor by cosine distance to the anchor's centroid.
- Accept the assignment iff dist <= 0.40 AND |year_delta| <= 5
(where year_delta = exif_year(face) - dom_year(anchor)).
- Undated PNGs are skipped (no assignment).
- Anchors are NOT re-centered after absorption (preserves the same drift
guarantees as the original age_split).
CLI:
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
python work/age_extend_001.py report --candidates ... --out work/age_extend
python work/age_extend_001.py apply --candidates ... [--dry-run]
"""
from __future__ import annotations
import argparse
import json
import shutil
import sys
import time
from collections import Counter
from pathlib import Path
import numpy as np
from PIL import Image, ExifTags
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
PARENT = "faceset_001"
ACTIVE_ERAS = [
"faceset_001_2005-10",
"faceset_001_2010-13",
"faceset_001_2011",
"faceset_001_2014-17",
"faceset_001_2018-19",
"faceset_001_2018-20",
]
CACHES = [
Path("/opt/face-sets/work/cache/nl_full.npz"),
Path("/opt/face-sets/work/cache/immich_peter.npz"),
Path("/opt/face-sets/work/cache/immich_nic.npz"),
]
EXIF_CACHE = Path("/opt/face-sets/work/cache/age_split_exif.json")
# anchor-fragment thresholds (mirror age_split_001.py)
DIST_MAX = 0.40
YEAR_MAX = 5
# ----------------------------- caches -----------------------------
def load_caches():
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
alias_map: dict[str, str] = {}
for c in CACHES:
if not c.exists():
print(f"[warn] cache missing: {c}", file=sys.stderr)
continue
d = np.load(c, allow_pickle=True)
emb = d["embeddings"]
meta = json.loads(str(d["meta"]))
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/emb mismatch in {c}: {len(face_records)} vs {len(emb)}")
if "path_aliases" in d.files:
paliases = json.loads(str(d["path_aliases"]))
for canon, alist in paliases.items():
alias_map.setdefault(canon, canon)
for a in alist:
alias_map[a] = canon
for i, rec in enumerate(face_records):
p = rec["path"]
bbox = tuple(int(x) for x in rec["bbox"])
v = emb[i].astype(np.float32)
n = float(np.linalg.norm(v))
if n > 0:
v = v / n
rec_index[(p, bbox)] = v
alias_map.setdefault(p, p)
print(f"[cache] indexed {len(rec_index)} face records, {len(alias_map)} aliases", file=sys.stderr)
return rec_index, alias_map
def lookup_emb(rec_index, alias_map, src: str, bbox):
bbox_t = tuple(int(x) for x in bbox)
canon = alias_map.get(src, src)
v = rec_index.get((canon, bbox_t))
if v is None and canon != src:
v = rec_index.get((src, bbox_t))
return v
# ----------------------------- exif -----------------------------
def load_exif_cache():
if not EXIF_CACHE.exists():
return {}
return json.loads(EXIF_CACHE.read_text())
def save_exif_cache(cache):
tmp = EXIF_CACHE.with_suffix(".tmp.json")
tmp.write_text(json.dumps(cache, indent=2))
tmp.replace(EXIF_CACHE)
def exif_year(path: Path) -> int | None:
try:
with Image.open(path) as im:
ex = im._getexif()
if not ex:
return None
for tag_id, val in ex.items():
tag = ExifTags.TAGS.get(tag_id, tag_id)
if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
return int(val[:4])
except Exception:
return None
return None
def get_year(src: str, exif_cache) -> int | None:
"""Return EXIF year for src, using cache. Mutates cache for new lookups."""
if src in exif_cache:
return exif_cache[src]
p = Path(src)
y = exif_year(p) if p.exists() else None
exif_cache[src] = y
return y
# ----------------------------- analyze -----------------------------
def cmd_analyze(args):
rec_index, alias_map = load_caches()
exif_cache = load_exif_cache()
exif_cache_dirty = False
parent_dir = ROOT / PARENT
parent_manifest = json.loads((parent_dir / "manifest.json").read_text())
parent_faces = parent_manifest.get("faces", [])
print(f"[parent] {PARENT}: {len(parent_faces)} face entries", file=sys.stderr)
# Build "in_bucket" set + each anchor's centroid + dom_year
anchors = []
in_bucket: set[tuple[str, tuple[int, int, int, int]]] = set()
for era in ACTIVE_ERAS:
ed = ROOT / era
if not ed.is_dir():
print(f"[warn] missing era bucket: {era}", file=sys.stderr)
continue
em = json.loads((ed / "manifest.json").read_text())
emb_list = []
years = []
n_missing_emb = 0
for f in em.get("faces", []):
src = f.get("source")
bbox = f.get("bbox")
if not src or not bbox:
continue
key = (alias_map.get(src, src), tuple(int(x) for x in bbox))
in_bucket.add(key)
in_bucket.add((src, tuple(int(x) for x in bbox))) # cover both alias and raw
v = lookup_emb(rec_index, alias_map, src, bbox)
if v is None:
n_missing_emb += 1
else:
emb_list.append(v)
y = get_year(src, exif_cache)
if y is None:
exif_cache_dirty = True
else:
years.append(y)
if src not in exif_cache:
exif_cache_dirty = True
if not emb_list:
print(f"[warn] {era}: no embeddings found, skipping anchor", file=sys.stderr)
continue
arr = np.stack(emb_list).astype(np.float32)
c = arr.mean(axis=0)
n = float(np.linalg.norm(c))
if n > 0:
c = c / n
dom_year = Counter(years).most_common(1)[0][0] if years else None
anchors.append({
"name": era, "centroid": c, "n_faces": len(em.get("faces", [])),
"n_emb_used": len(emb_list), "n_emb_missing": n_missing_emb,
"dom_year": dom_year,
"year_min": min(years) if years else None,
"year_max": max(years) if years else None,
})
print(f"[anchor] {era}: n={len(em.get('faces', []))} emb_used={len(emb_list)} "
f"emb_miss={n_missing_emb} dom_year={dom_year} years=[{min(years) if years else '-'}..{max(years) if years else '-'}]",
file=sys.stderr)
# Find unbucketed faces in parent
unbucketed = []
for f in parent_faces:
src = f.get("source")
bbox = f.get("bbox")
if not src or not bbox:
continue
bbox_t = tuple(int(x) for x in bbox)
key1 = (alias_map.get(src, src), bbox_t)
key2 = (src, bbox_t)
if key1 in in_bucket or key2 in in_bucket:
continue
unbucketed.append(f)
print(f"[parent] {len(unbucketed)} unbucketed face entries (in {PARENT} but no era bucket)", file=sys.stderr)
# Score each unbucketed face against every anchor
proposals = []
skipped_no_emb = 0
skipped_no_year = 0
for f in unbucketed:
src = f["source"]
bbox = f["bbox"]
v = lookup_emb(rec_index, alias_map, src, bbox)
if v is None:
skipped_no_emb += 1
continue
y = get_year(src, exif_cache)
if y is None:
skipped_no_year += 1
exif_cache_dirty = True
continue
if src not in exif_cache:
exif_cache_dirty = True
# nearest anchor
best = None # (dist, idx)
for i, a in enumerate(anchors):
d = 1.0 - float(np.dot(a["centroid"], v))
if best is None or d < best[0]:
best = (d, i)
if best is None:
continue
dist, bidx = best
anchor = anchors[bidx]
year_delta = abs(y - anchor["dom_year"]) if anchor["dom_year"] is not None else None
accept = (dist <= DIST_MAX and year_delta is not None and year_delta <= YEAR_MAX)
proposals.append({
"png": f["png"],
"source": src,
"bbox": [int(x) for x in bbox],
"year": y,
"rank_in_parent": f.get("rank"),
"quality_composite": f.get("quality", {}).get("composite"),
"quality": f.get("quality", {}),
"best_anchor": anchor["name"],
"best_anchor_dom_year": anchor["dom_year"],
"centroid_dist": round(dist, 4),
"year_delta": year_delta,
"accept": bool(accept),
"all_anchor_dists": {
a["name"]: round(1.0 - float(np.dot(a["centroid"], v)), 4) for a in anchors
},
})
if exif_cache_dirty:
save_exif_cache(exif_cache)
print(f"[exif] cache flushed ({len(exif_cache)} entries total)", file=sys.stderr)
# Summarize
accepted = [p for p in proposals if p["accept"]]
rejected = [p for p in proposals if not p["accept"]]
by_anchor = Counter(p["best_anchor"] for p in accepted)
print(f"[summary] unbucketed={len(unbucketed)} scored={len(proposals)} "
f"accepted={len(accepted)} rejected={len(rejected)} "
f"skipped(no_emb={skipped_no_emb}, no_year={skipped_no_year})", file=sys.stderr)
for k, v in by_anchor.most_common():
print(f" {k}: +{v}", file=sys.stderr)
out = {
"thresholds": {"dist_max": DIST_MAX, "year_max": YEAR_MAX},
"anchors": [
{k: v for k, v in a.items() if k != "centroid"}
for a in anchors
],
"n_unbucketed": len(unbucketed),
"skipped": {"no_emb": skipped_no_emb, "no_year": skipped_no_year},
"proposals": sorted(proposals, key=lambda p: (not p["accept"], p["best_anchor"], -1 * (p["quality_composite"] or 0))),
"by_anchor": dict(by_anchor),
}
op = Path(args.out)
op.parent.mkdir(parents=True, exist_ok=True)
op.write_text(json.dumps(out, indent=2))
print(f"[done] {len(proposals)} proposals -> {op}", file=sys.stderr)
# ----------------------------- report -----------------------------
def cmd_report(args):
cand = json.loads(Path(args.candidates).read_text())
out_dir = Path(args.out)
thumbs_dir = out_dir / "thumbs"
thumbs_dir.mkdir(parents=True, exist_ok=True)
THUMB = 140
def make_thumb(png_relpath: str) -> str:
# png_relpath looks like "faces/0042.png"
src = ROOT / PARENT / png_relpath
name = Path(png_relpath).stem
dst = thumbs_dir / f"{name}.jpg"
if not dst.exists():
try:
img = Image.open(src).convert("RGB")
img.thumbnail((THUMB, THUMB), Image.LANCZOS)
img.save(dst, "JPEG", quality=82)
except Exception as e:
print(f"[thumb-skip] {src}: {e}", file=sys.stderr)
return ""
return f"thumbs/{name}.jpg"
# group accepted proposals by target anchor
by_anchor: dict[str, list] = {}
rejected = []
for p in cand["proposals"]:
if p["accept"]:
by_anchor.setdefault(p["best_anchor"], []).append(p)
else:
rejected.append(p)
rows = []
rows.append("<h1>faceset_001 age extension &mdash; review</h1>")
rows.append(f"<p>{cand['n_unbucketed']} unbucketed faces in {PARENT}; "
f"{sum(len(v) for v in by_anchor.values())} accepted / {len(rejected)} rejected; "
f"thresholds dist&le;{cand['thresholds']['dist_max']} AND |year_delta|&le;{cand['thresholds']['year_max']}.</p>")
nav = " · ".join(f"<a href='#{a}'>{a} (+{len(by_anchor[a])})</a>" for a in by_anchor) + " · <a href='#rejected'>rejected</a>"
rows.append(f"<div class='nav'>{nav}</div>")
for anchor_name in ACTIVE_ERAS:
if anchor_name not in by_anchor:
continue
items = by_anchor[anchor_name]
anchor_meta = next((a for a in cand["anchors"] if a["name"] == anchor_name), {})
rows.append(f"<section id='{anchor_name}' class='grp'>")
rows.append(f"<h2>{anchor_name} <small>(dom_year={anchor_meta.get('dom_year')}; "
f"existing n={anchor_meta.get('n_faces')}; +{len(items)} new)</small></h2>")
rows.append("<div class='cells'>")
for p in sorted(items, key=lambda x: (x["centroid_dist"], -1 * (x["quality_composite"] or 0))):
thumb = make_thumb(p["png"])
cls = "hi" if p["centroid_dist"] <= 0.30 else "mid"
rows.append(
f"<div class='cell'>"
f"<img src='{thumb}' loading='lazy' title='{p['png']}'>"
f"<div class='meta'>{p['png']}<br>year {p['year']}{p['year_delta']})<br>"
f"<span class='{cls}'>dist {p['centroid_dist']:.3f}</span></div>"
f"</div>"
)
rows.append("</div></section>")
if rejected:
rows.append("<section id='rejected' class='grp rej'>")
rows.append(f"<h2>rejected <small>({len(rejected)} faces don't fit any anchor)</small></h2>")
rows.append("<div class='cells'>")
for p in sorted(rejected, key=lambda x: x["centroid_dist"])[:200]:
thumb = make_thumb(p["png"])
why = []
if p["centroid_dist"] > cand['thresholds']['dist_max']:
why.append(f"dist {p['centroid_dist']:.2f}>{cand['thresholds']['dist_max']}")
if p["year_delta"] is None or p["year_delta"] > cand['thresholds']['year_max']:
why.append(f"{p['year_delta']}>{cand['thresholds']['year_max']}")
rows.append(
f"<div class='cell'>"
f"<img src='{thumb}' loading='lazy'>"
f"<div class='meta'>{p['png']}<br>year {p['year']} → best {p['best_anchor']}<br>"
f"<span class='lo'>{'; '.join(why)}</span></div>"
f"</div>"
)
if len(rejected) > 200:
rows.append(f"<p>...{len(rejected)-200} more truncated.</p>")
rows.append("</div></section>")
html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>faceset_001 age extension</title>
<style>
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
h1 {{ margin-top:0; }} h2 {{ margin:0; }}
small {{ color:#999; font-weight:normal; }}
section.grp {{ background:#1a1a1a; border-radius:6px; padding:12px; margin:12px 0; }}
section.grp.rej {{ border-left:4px solid #ff5050; }}
.cells {{ display:flex; flex-wrap:wrap; gap:6px; }}
.cell {{ background:#222; border-radius:4px; padding:4px; width:160px; font-size:11px; font-family:monospace; text-align:center; }}
.cell img {{ height:140px; width:auto; border-radius:3px; }}
.meta {{ padding-top:4px; line-height:1.3; }}
.hi {{ color:#5fa05f; font-weight:bold; }}
.mid {{ color:#ffb050; }}
.lo {{ color:#ff5050; }}
.nav {{ position:sticky; top:0; background:#111; padding:.5em 0; border-bottom:1px solid #333; font-size:13px; }}
a {{ color:#6cf; }}
</style></head>
<body>
{''.join(rows)}
</body></html>"""
out_html = out_dir / "index.html"
out_html.write_text(html)
print(f"[done] {out_html}", file=sys.stderr)
# ----------------------------- apply -----------------------------
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
import zipfile
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
for i, p in enumerate(pngs):
zf.write(p, arcname=f"{i:04d}.png")
def cmd_apply(args):
cand = json.loads(Path(args.candidates).read_text())
accepted = [p for p in cand["proposals"] if p["accept"]]
if args.dry_run:
from collections import Counter as C
by = C(p["best_anchor"] for p in accepted)
print(f"=== dry-run: {len(accepted)} assignments across {len(by)} anchors ===")
for k, v in by.most_common():
print(f" {k}: +{v}")
return
parent_dir = ROOT / PARENT
master_path = ROOT / "manifest.json"
master = json.loads(master_path.read_text())
facesets_by_name = {f["name"]: f for f in master.get("facesets", [])}
by_anchor: dict[str, list] = {}
for p in accepted:
by_anchor.setdefault(p["best_anchor"], []).append(p)
total_added = 0
for anchor_name, props in by_anchor.items():
ed = ROOT / anchor_name
em_path = ed / "manifest.json"
em = json.loads(em_path.read_text())
existing = list(em.get("faces", []))
# gather new entries with their source PNG paths in faceset_001/faces/
new_with_src = []
for p in props:
src_png = parent_dir / p["png"]
if not src_png.exists():
print(f"[warn] missing parent PNG {src_png}; skip", file=sys.stderr)
continue
face_entry = {
"source": p["source"],
"bbox": p["bbox"],
"quality": p["quality"],
"exif_year": p["year"],
"centroid_dist_at_assign": p["centroid_dist"],
"year_delta_at_assign": p["year_delta"],
"extended_from_parent": True,
}
new_with_src.append((face_entry, src_png))
# combine; rank by quality.composite desc (existing entries already have rank,
# but we re-rank globally so new entries slot in by quality)
combined: list[tuple[dict, Path | None]] = []
for f in existing:
combined.append((f, None))
combined.extend(new_with_src)
combined.sort(key=lambda x: -(x[0].get("quality", {}).get("composite") or 0))
# stage fresh
staging = ed / "_faces_new"
if staging.exists():
shutil.rmtree(staging)
staging.mkdir()
new_face_entries = []
for new_rank, (face, src_png_or_none) in enumerate(combined, start=1):
new_name = f"{new_rank:04d}.png"
if src_png_or_none is None:
# existing entry: copy from current era bucket faces/
old_name = Path(face["png"]).name
src = ed / "faces" / old_name
if not src.exists():
print(f"[warn] {anchor_name}: missing existing PNG {src}; skip", file=sys.stderr)
continue
shutil.copy2(src, staging / new_name)
else:
shutil.copy2(src_png_or_none, staging / new_name)
face = dict(face)
face["rank"] = new_rank
face["png"] = f"faces/{new_name}"
new_face_entries.append(face)
# swap dirs
old_holding = ed / "_faces_old"
if old_holding.exists():
shutil.rmtree(old_holding)
(ed / "faces").rename(old_holding)
staging.rename(ed / "faces")
shutil.rmtree(old_holding)
# re-zip .fsz
survivor_pngs = sorted((ed / "faces").glob("*.png"))
top_n = em.get("top_n", 30)
top_n_eff = min(top_n, len(survivor_pngs))
for old in ed.glob("*.fsz"):
old.unlink()
top_fsz_name = f"{anchor_name}_top{top_n_eff}.fsz"
all_fsz_name = f"{anchor_name}_all.fsz"
_zip_png_list(survivor_pngs[:top_n_eff], ed / top_fsz_name)
if len(survivor_pngs) > top_n_eff:
_zip_png_list(survivor_pngs, ed / all_fsz_name)
all_fsz_used = all_fsz_name
else:
all_fsz_used = None
# update local + master manifests
em["faces"] = new_face_entries
em["exported"] = len(new_face_entries)
em["fsz_top"] = top_fsz_name
em["fsz_all"] = all_fsz_used
em["top_n"] = top_n_eff
em.setdefault("age_extend_history", []).append({
"added": len(new_with_src),
"thresholds": cand["thresholds"],
})
em_path.write_text(json.dumps(em, indent=2))
if anchor_name in facesets_by_name:
facesets_by_name[anchor_name]["exported"] = len(new_face_entries)
facesets_by_name[anchor_name]["fsz_top"] = top_fsz_name
facesets_by_name[anchor_name]["fsz_all"] = all_fsz_used
facesets_by_name[anchor_name]["top_n"] = top_n_eff
added_here = len(new_with_src)
total_added += added_here
print(f"[applied] {anchor_name}: +{added_here} (now {len(new_face_entries)} faces)", file=sys.stderr)
# rewrite master with ordering preserved
new_facesets = []
for entry in master.get("facesets", []):
new_facesets.append(facesets_by_name.get(entry["name"], entry))
master["facesets"] = new_facesets
master.setdefault("age_extend_runs", []).append({
"parent": PARENT,
"thresholds": cand["thresholds"],
"anchors": list(by_anchor.keys()),
"added_total": total_added,
})
tmp = master_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(master, indent=2))
tmp.replace(master_path)
print(f"[done] +{total_added} faces across {len(by_anchor)} anchors", file=sys.stderr)
# ----------------------------- main -----------------------------
def main():
ap = argparse.ArgumentParser()
sub = ap.add_subparsers(dest="cmd", required=True)
a = sub.add_parser("analyze")
a.add_argument("--out", required=True)
a.set_defaults(func=cmd_analyze)
r = sub.add_parser("report")
r.add_argument("--candidates", required=True)
r.add_argument("--out", required=True)
r.set_defaults(func=cmd_report)
p = sub.add_parser("apply")
p.add_argument("--candidates", required=True)
p.add_argument("--dry-run", action="store_true")
p.set_defaults(func=cmd_apply)
args = ap.parse_args()
args.func(args)
if __name__ == "__main__":
main()
@@ -0,0 +1,485 @@
#!/usr/bin/env python3
"""Age-split person_001 into era-specific facesets.
Workflow:
1. Seed a clean person_001 centroid from the existing curated 707-face
`facesets_swap_ready/faceset_001/`.
2. Wide-recovery scan: pull every face record under /mnt/x/src/{nl, lzbkp_red}
from `nl_full.npz` with cos-dist <= 0.55 from the seed centroid.
3. Apply export-swap-style per-face quality gates.
4. One re-centroid + 0.50 tighten pass to absorb the recovery without drift.
5. Agglomerative sub-clustering at cos-dist 0.35.
6. Post-merge sub-clusters whose centroids <0.30 AND whose dominant EXIF
years are within 2 years.
7. Read EXIF DateTimeOriginal for each face's source path; era label =
(p10 year, p90 year) over dated faces.
8. Undated faces are assigned to the nearest era by embedding distance.
9. For each era: composite-quality rank, single-face PNG crops, .fsz bundles
(top-N and _all if era > top_n). `<era>_<range>.txt` marker file. Eras
with <20 face records get a `THIN.txt` marker.
10. Append era entries into the canonical
`facesets_swap_ready/manifest.json` next to the existing 19.
"""
from __future__ import annotations
import json
import shutil
import sys
from collections import Counter
from pathlib import Path
import numpy as np
from PIL import Image, ExifTags, ImageOps
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import ( # noqa: E402
QUALITY_WEIGHTS,
_crop_face_square,
_zip_png_list,
compute_quality,
load_cache,
load_rgb_bgr,
)
# ---- config -------------------------------------------------------------- #
CACHE = REPO / "work" / "cache" / "nl_full.npz"
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
FS001 = SWAP_READY / "faceset_001"
SCAN_ROOTS = [
Path("/mnt/x/src/nl"),
Path("/mnt/x/src/lzbkp_red"),
]
# Recovery + identity refinement
RECOVERY_THRESHOLD = 0.55 # initial centroid match
TIGHTEN_THRESHOLD = 0.50 # post-recentroid drift trim
# Quality gates (mirror export-swap defaults)
MIN_FACE_SHORT = 100
# Sub-cluster
SUBCLUSTER_THRESHOLD = 0.35
# Anchor-based fragment assignment (replaces transitive union-find merge):
ANCHOR_MIN_SIZE = 20 # sub-cluster size to qualify as an era anchor
FRAGMENT_CENTROID_MAX = 0.40 # small fragment may join an anchor only if cent_dist <=
FRAGMENT_YEAR_MAX = 5 # AND |dom_year_anchor - dom_year_fragment| <=
# Output
TOP_N = 30
PAD_RATIO = 0.5
OUT_SIZE = 512
THIN_THRESHOLD = 20
# EXIF cache (so re-runs skip the 30-min Windows-mount EXIF read)
EXIF_CACHE = REPO / "work" / "cache" / "age_split_exif.json"
# ---- helpers ------------------------------------------------------------- #
def _normalize(v: np.ndarray) -> np.ndarray:
n = np.linalg.norm(v)
return v / n if n > 0 else v
def _under(roots: list[Path], p: str) -> bool:
for r in roots:
rs = str(r).rstrip("/") + "/"
if p == str(r) or p.startswith(rs):
return True
return False
def _record_in_roots(rec: dict, roots: list[Path], path_aliases: dict) -> bool:
if _under(roots, rec["path"]):
return True
for alias in path_aliases.get(rec["path"], []):
if _under(roots, alias):
return True
return False
def exif_year(path: Path) -> int | None:
try:
with Image.open(path) as im:
exif = im._getexif()
if not exif:
return None
for tag_id, val in exif.items():
tag = ExifTags.TAGS.get(tag_id, tag_id)
if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
return int(val[:4])
except Exception:
return None
return None
def label_for_era(years: list[int]) -> str:
"""Era label as a year-range string. Falls back to 'undated' if no years."""
if not years:
return "undated"
ys = sorted(years)
lo = ys[len(ys) // 10] if len(ys) >= 10 else ys[0]
hi = ys[-(len(ys) // 10) - 1] if len(ys) >= 10 else ys[-1]
if lo == hi:
return str(lo)
# Compact year range like 2011-13 if same century, else 2009-2024.
if (lo // 100) == (hi // 100):
return f"{lo}-{hi % 100:02d}"
return f"{lo}-{hi}"
# ---- phase 1 + 2: seed centroid + recovery scan ------------------------- #
def main() -> None:
if not FS001.exists():
raise SystemExit(f"missing seed faceset: {FS001}")
print("=== loading cache ===")
emb, meta, _src, _proc, path_aliases = load_cache(CACHE)
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"emb/meta mismatch: {len(face_records)} vs {len(emb)}")
bbox_idx = {(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)}
seed_manifest = json.loads((FS001 / "manifest.json").read_text())
seed_face_keys = [(f["source"], tuple(f.get("bbox") or ())) for f in seed_manifest["faces"]]
seed_indices = [bbox_idx[k] for k in seed_face_keys if k in bbox_idx]
print(f"seed faces from faceset_001: {len(seed_indices)} (manifest had {len(seed_face_keys)})")
seed_centroid = _normalize(emb[seed_indices].mean(axis=0))
# Recovery: every face record under nl/ + lzbkp_red/ within RECOVERY_THRESHOLD.
candidate_idxs = [
i for i, rec in enumerate(face_records)
if _record_in_roots(rec, SCAN_ROOTS, path_aliases)
]
print(f"\ncandidates under {[str(r) for r in SCAN_ROOTS]}: {len(candidate_idxs)}")
cand_emb = emb[candidate_idxs]
cand_dists = 1.0 - cand_emb @ seed_centroid
recovered_local = [k for k, d in enumerate(cand_dists) if d <= RECOVERY_THRESHOLD]
recovered = [candidate_idxs[k] for k in recovered_local]
print(f"recovered at cos-dist <= {RECOVERY_THRESHOLD}: {len(recovered)}")
# Quality gate.
qualified = []
drop_size = drop_blur = drop_det = 0
for i in recovered:
r = face_records[i]
if r.get("face_short", 0) < MIN_FACE_SHORT:
drop_size += 1
continue
if r.get("blur", 0.0) < 40.0:
drop_blur += 1
continue
if r.get("det_score", 0.0) < 0.6:
drop_det += 1
continue
qualified.append(i)
print(f"after quality gate: {len(qualified)} (drop size={drop_size} blur={drop_blur} det={drop_det})")
# One tightening pass: re-centroid on qualified, drop anyone > TIGHTEN_THRESHOLD.
qcent = _normalize(emb[qualified].mean(axis=0))
qd = 1.0 - emb[qualified] @ qcent
tight = [qualified[k] for k, d in enumerate(qd) if d <= TIGHTEN_THRESHOLD]
print(f"after re-centroid tighten ({TIGHTEN_THRESHOLD}): {len(tight)}")
# ---- phase 5: sub-cluster -------------------------------------------- #
print("\n=== sub-clustering ===")
from sklearn.cluster import AgglomerativeClustering
E = emb[tight]
sims = E @ E.T
dists = 1.0 - sims
# Floor numerical noise.
np.fill_diagonal(dists, 0.0)
dists = np.maximum(dists, 0.0)
ac = AgglomerativeClustering(
n_clusters=None,
metric="precomputed",
linkage="average",
distance_threshold=SUBCLUSTER_THRESHOLD,
)
labels = ac.fit_predict(dists)
sub_sizes = Counter(labels)
print(f"raw sub-clusters: {len(sub_sizes)} (sizes: top10={sorted(sub_sizes.values(), reverse=True)[:10]})")
# Per-cluster: indices, centroid, EXIF years.
cluster_indices: dict[int, list[int]] = {}
for k, lab in enumerate(labels):
cluster_indices.setdefault(int(lab), []).append(tight[k])
cluster_centroids: dict[int, np.ndarray] = {}
for lab, idxs in cluster_indices.items():
cluster_centroids[lab] = _normalize(emb[idxs].mean(axis=0))
print("\n=== EXIF years (one read per source path; cached) ===")
unique_paths = sorted({face_records[i]["path"] for i in tight})
if EXIF_CACHE.exists():
cached = json.loads(EXIF_CACHE.read_text())
else:
cached = {}
path_year: dict[str, int | None] = {}
new_reads = 0
for p in unique_paths:
if p in cached:
path_year[p] = cached[p]
else:
y = exif_year(Path(p))
path_year[p] = y
cached[p] = y
new_reads += 1
EXIF_CACHE.parent.mkdir(parents=True, exist_ok=True)
EXIF_CACHE.write_text(json.dumps(cached, indent=0))
dated = sum(1 for v in path_year.values() if v is not None)
print(f" EXIF cache: {len(cached)} entries, {new_reads} new reads, "
f"{dated}/{len(unique_paths)} dated")
cluster_years: dict[int, list[int]] = {}
cluster_dom_year: dict[int, int | None] = {}
for lab, idxs in cluster_indices.items():
ys = []
for i in idxs:
y = path_year.get(face_records[i]["path"])
if y is not None:
ys.append(y)
cluster_years[lab] = ys
cluster_dom_year[lab] = (Counter(ys).most_common(1)[0][0]) if ys else None
# ---- phase 6: anchor-based fragment assignment ----------------------- #
# Each sub-cluster of size >= ANCHOR_MIN_SIZE is an "era anchor". Smaller
# fragments are assigned to the single nearest anchor IFF (centroid distance
# <= FRAGMENT_CENTROID_MAX AND |dom_year delta| <= FRAGMENT_YEAR_MAX).
# Anchors do NOT merge with each other — that prevented transitive year drift
# observed when union-find was used. Standalone fragments stay as their own
# (likely THIN) eras.
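# In effect: a fragment near two anchors goes to the closer one in embedding
# space, but only if its dominant EXIF year is also close; a fragment that
# matches an anchor's embedding but sits years away stays its own era, and
# fragments with no dated photos always stand alone.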
print("\n=== anchor-based assignment ===")
anchors = [lab for lab, idxs in cluster_indices.items() if len(idxs) >= ANCHOR_MIN_SIZE]
fragments = [lab for lab in cluster_indices if lab not in anchors]
anchors.sort(key=lambda l: -len(cluster_indices[l]))
print(f"anchors (size>={ANCHOR_MIN_SIZE}): {len(anchors)}; fragments: {len(fragments)}")
for a in anchors:
print(f" anchor sub {a}: size={len(cluster_indices[a])} dom_year={cluster_dom_year[a]}")
if anchors:
a_cent = np.stack([cluster_centroids[a] for a in anchors])
assignments: dict[int, int] = {a: a for a in anchors} # anchor -> self
unassigned: list[int] = []
for f in fragments:
f_cent = cluster_centroids[f]
f_year = cluster_dom_year[f]
# cosine distances to each anchor
cd = 1.0 - a_cent @ f_cent
# year distance (inf if either dom-year unknown)
yd = []
for a in anchors:
ay = cluster_dom_year[a]
if f_year is None or ay is None:
yd.append(float("inf"))
else:
yd.append(abs(f_year - ay))
yd = np.array(yd)
ok = (cd <= FRAGMENT_CENTROID_MAX) & (yd <= FRAGMENT_YEAR_MAX)
if not ok.any():
unassigned.append(f)
continue
# nearest qualifying anchor by centroid distance.
cd_masked = np.where(ok, cd, np.inf)
best = int(np.argmin(cd_masked))
assignments[f] = anchors[best]
print(f" assigned fragments: {sum(1 for k,v in assignments.items() if k!=v)}/{len(fragments)}; "
f"unassigned (standalone): {len(unassigned)}")
else:
print(" no anchors; every sub-cluster stands alone")
assignments = {lab: lab for lab in cluster_indices}
unassigned = []
merged: dict[int, list[int]] = {}
for lab, idxs in cluster_indices.items():
root = assignments.get(lab, lab)
merged.setdefault(root, []).extend(idxs)
merged_sizes = sorted(((r, len(v)) for r, v in merged.items()), key=lambda kv: -kv[1])
print(f"era buckets: {len(merged)} (top10 sizes: {[s for _, s in merged_sizes[:10]]})")
# Recompute centroid + dom-year for merged eras.
era_indices: dict[int, list[int]] = merged
era_centroids: dict[int, np.ndarray] = {}
era_year_label: dict[int, str] = {}
era_years_full: dict[int, list[int]] = {}
for root, idxs in era_indices.items():
era_centroids[root] = _normalize(emb[idxs].mean(axis=0))
ys = []
for i in idxs:
y = path_year.get(face_records[i]["path"])
if y is not None:
ys.append(y)
era_years_full[root] = ys
era_year_label[root] = label_for_era(ys)
# ---- phase 8: assign undated faces (no-EXIF) to nearest era ---------- #
# NB: undated = path's EXIF was None. For era assignment we use embedding,
# but the year *label* is unaffected because labels come from dated faces only.
# Each undated face is already placed in some sub-cluster by its embedding; here we just report the count.
n_undated = sum(1 for i in tight if path_year.get(face_records[i]["path"]) is None)
print(f"undated face records (no EXIF): {n_undated}/{len(tight)} (placed by embedding only)")
# ---- phase 9: per-era export ----------------------------------------- #
import cv2
print("\n=== exporting era bundles ===")
new_manifest_entries: list[dict] = []
eras_sorted = sorted(era_indices.items(), key=lambda kv: -len(kv[1]))
for root, idxs in eras_sorted:
size = len(idxs)
label = era_year_label[root]
era_name = f"faceset_001_{label}"
out_dir = SWAP_READY / era_name
# Disambiguate same-label collisions (e.g. two distinct embedding eras both 2019).
collision = 2
while out_dir.exists():
era_name = f"faceset_001_{label}_v{collision}"
out_dir = SWAP_READY / era_name
collision += 1
faces_dir = out_dir / "faces"
faces_dir.mkdir(parents=True, exist_ok=True)
# Composite quality + rank.
ranked = []
for ci in idxs:
rec = face_records[ci]
q = compute_quality(rec)
ranked.append({"cache_idx": ci, "rec": rec, "quality": q})
# Dedup by source path within this era — keep highest-quality face per path.
seen_path: dict[str, dict] = {}
for r in ranked:
p = r["rec"]["path"]
prev = seen_path.get(p)
if prev is None or r["quality"]["composite"] > prev["quality"]["composite"]:
seen_path[p] = r
unique = sorted(seen_path.values(), key=lambda r: -r["quality"]["composite"])
# Materialize crops.
written: list[Path] = []
face_entries: list[dict] = []
for rank, r in enumerate(unique, start=1):
rec = r["rec"]
src = Path(rec["path"])
if not src.exists():
continue
rgb, _ = load_rgb_bgr(src)
if rgb is None:
continue
crop = _crop_face_square(rgb, rec["bbox"], PAD_RATIO, OUT_SIZE)
png = faces_dir / f"{rank:04d}.png"
cv2.imwrite(str(png), cv2.cvtColor(crop, cv2.COLOR_RGB2BGR))
written.append(png)
face_entries.append({
"rank": rank,
"png": f"faces/{rank:04d}.png",
"source": rec["path"],
"aliases": path_aliases.get(rec["path"], []),
"bbox": rec["bbox"],
"face_short": rec.get("face_short"),
"det_score": rec.get("det_score"),
"blur": rec.get("blur"),
"pose": rec.get("pose"),
"exif_year": path_year.get(rec["path"]),
"quality": r["quality"],
})
if not written:
print(f"[{era_name}] empty after materialization; skipping")
shutil.rmtree(out_dir)
continue
# Bundle.
top_n_eff = min(TOP_N, len(written))
top_fsz = out_dir / f"{era_name}_top{top_n_eff}.fsz"
_zip_png_list(written[:top_n_eff], top_fsz)
all_fsz: Path | None = None
if len(written) > top_n_eff:
all_fsz = out_dir / f"{era_name}_all.fsz"
_zip_png_list(written, all_fsz)
# Per-era manifest.
ys = era_years_full[root]
year_summary = {
"label": label,
"year_count": len(ys),
"year_min": min(ys) if ys else None,
"year_max": max(ys) if ys else None,
"year_dist": dict(Counter(ys).most_common()),
}
is_thin = size < THIN_THRESHOLD
manifest = {
"name": era_name,
"parent_identity": "faceset_001",
"era": year_summary,
"input_face_records": size,
"exported": len(written),
"top_n": top_n_eff,
"fsz_top": top_fsz.name,
"fsz_all": all_fsz.name if all_fsz else None,
"thin": is_thin,
"quality_weights": QUALITY_WEIGHTS,
"params": {
"recovery_threshold": RECOVERY_THRESHOLD,
"tighten_threshold": TIGHTEN_THRESHOLD,
"subcluster_threshold": SUBCLUSTER_THRESHOLD,
"anchor_min_size": ANCHOR_MIN_SIZE,
"fragment_centroid_max": FRAGMENT_CENTROID_MAX,
"fragment_year_max": FRAGMENT_YEAR_MAX,
"min_face_short": MIN_FACE_SHORT,
},
"faces": face_entries,
}
(out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
# Per-era marker file (always: <label>.txt for human reference).
(out_dir / f"{label}.txt").write_text(
f"{era_name}\n\nEra: {label}\n"
f"Year span: {year_summary['year_min']}..{year_summary['year_max']} "
f"({year_summary['year_count']} dated of {size} faces)\n"
f"Sub-cluster size: {size} face records, {len(unique)} unique source paths, "
f"{len(written)} exported PNGs.\n"
)
if is_thin:
(out_dir / "THIN.txt").write_text(
f"This era has only {size} face records (<{THIN_THRESHOLD}). "
f"Averaged embedding may be dominated by single-photo idiosyncrasies.\n"
)
# Append to top-level manifest summary.
new_manifest_entries.append({k: v for k, v in manifest.items() if k != "faces"})
thin_tag = " THIN" if is_thin else ""
print(
f"[{era_name}] size={size} unique_paths={len(unique)} exported={len(written)} "
f"top{top_n_eff}{thin_tag}"
)
# ---- merge into top-level manifest ----------------------------------- #
top_path = SWAP_READY / "manifest.json"
existing = json.loads(top_path.read_text()) if top_path.exists() else {"facesets": []}
existing_names = {fs.get("name") for fs in existing.get("facesets", [])}
appended = 0
for entry in new_manifest_entries:
if entry["name"] in existing_names:
continue
existing["facesets"].append(entry)
appended += 1
top_path.write_text(json.dumps(existing, indent=2))
print(f"\nAppended {appended} era entries to {top_path}")
print(f"Done. {len(new_manifest_entries)} era buckets emitted (faceset_001/ left untouched).")
if __name__ == "__main__":
main()
+323
@@ -0,0 +1,323 @@
#!/usr/bin/env python3
"""Build per-folder facesets from hand-sorted source directories.
Phase B + C of the folder-import workflow:
- Filter cache records into per-folder identity sets, run 2-pass centroid+outlier
rejection so non-target faces in group photos drop out.
- Route every osrc face record to every trusted-folder identity within a tight
cosine cutoff (multi-identity osrc photos land in multiple facesets;
cmd_export_swap then per-bbox-filters so each faceset crops only the matching face).
- Synthesize a refine_manifest.json compatible with cmd_export_swap.
- Invoke cmd_export_swap to emit faceset_NNN/ dirs into a temp output dir.
- Rename .fsz bundles after the source folder, replace NAME.txt with foldername.txt,
move dirs into the canonical facesets_swap_ready/, merge top-level manifest
preserving existing faceset_001..012 entries.
"""
from __future__ import annotations
import json
import shutil
import sys
from pathlib import Path
import numpy as np
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import ( # noqa: E402
cmd_export_swap,
load_cache,
)
# ---- config -------------------------------------------------------------- #
CACHE = REPO / "work" / "cache" / "nl_full.npz"
OUT_FINAL = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
OUT_TMP = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready_new")
SYNTH_MANIFEST = REPO / "work" / "synthetic_refine_manifest.json"
# Trusted folders, in numbering order. faceset_NNN starts at 013.
TRUSTED: list[tuple[str, Path]] = [
("k", Path("/mnt/x/src/k")),
("m", Path("/mnt/x/src/m")),
("mi", Path("/mnt/x/src/mi")),
("mir", Path("/mnt/x/src/mir")),
("s", Path("/mnt/x/src/s")),
("sab", Path("/mnt/x/src/sab")),
("t", Path("/mnt/x/src/t")),
]
START_NNN = 13
OSRC_DIR = Path("/mnt/x/src/osrc")
# Centroid-build outlier passes (loose then tight).
PASS1_THRESHOLD = 0.55
PASS2_THRESHOLD = 0.45
# osrc routing cutoff (tight).
OSRC_THRESHOLD = 0.45
# export-swap params (defaults from sort_faces.py).
TOP_N = 30
EXPORT_OUTLIER_THRESHOLD = 0.45
PAD_RATIO = 0.5
OUT_SIZE = 512
MIN_FACE_SHORT = 100
# ---- helpers ------------------------------------------------------------- #
def _normalize_rows(mat: np.ndarray) -> np.ndarray:
n = np.linalg.norm(mat, axis=1, keepdims=True)
n[n == 0] = 1.0
return mat / n
def _centroid(vecs: np.ndarray) -> np.ndarray:
c = vecs.mean(axis=0)
n = np.linalg.norm(c)
return c / n if n > 0 else c
def _under(folder: Path, p: str) -> bool:
"""True iff path string p lies under folder."""
fs = str(folder).rstrip("/") + "/"
return p == str(folder) or p.startswith(fs)
def _record_in_folder(rec: dict, folder: Path, path_aliases: dict[str, list[str]]) -> bool:
if _under(folder, rec["path"]):
return True
for alias in path_aliases.get(rec["path"], []):
if _under(folder, alias):
return True
return False
# ---- phase B: identity centroids + osrc routing ------------------------- #
def build_synthetic_manifest() -> tuple[dict, dict[str, np.ndarray], dict[str, dict]]:
emb, meta, _src_root, _processed, path_aliases = load_cache(CACHE)
# emb rows are aligned with the noface-filtered records (the invariant
# cmd_export_swap relies on), so indices into face_records index emb directly.
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")
print(f"Loaded cache: {len(face_records)} face records.")
# Per-folder identity centroids.
centroids: dict[str, np.ndarray] = {}
folder_paths: dict[str, set[str]] = {}
folder_stats: dict[str, dict] = {}
for label, folder in TRUSTED:
idxs = [i for i, m in enumerate(face_records) if _record_in_folder(m, folder, path_aliases)]
if not idxs:
print(f"[{label}] no face records found under {folder}; skipping")
continue
vecs = emb[idxs]
cent = _centroid(vecs)
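# Two-pass trim: the loose pass sheds gross non-target faces (e.g. other
# people in group photos), then the centroid is rebuilt and the tight pass
# enforces identity purity.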
# Pass 1: drop loose outliers.
d1 = 1.0 - vecs @ cent
keep1 = [idxs[k] for k, dist in enumerate(d1) if dist <= PASS1_THRESHOLD]
if not keep1:
print(f"[{label}] every face was a pass-1 outlier; using all faces as-is")
keep1 = idxs
cent = _centroid(emb[keep1])
# Pass 2: tight outlier rejection.
d2 = 1.0 - emb[keep1] @ cent
keep2 = [keep1[k] for k, dist in enumerate(d2) if dist <= PASS2_THRESHOLD]
if not keep2:
print(f"[{label}] every face was a pass-2 outlier; falling back to pass-1")
keep2 = keep1
cent = _centroid(emb[keep2])
centroids[label] = cent
# Use canonical path strings; export-swap will look up indices by path.
folder_paths[label] = {face_records[i]["path"] for i in keep2}
folder_stats[label] = {
"folder": str(folder),
"input_records": len(idxs),
"after_pass1": len(keep1),
"after_pass2": len(keep2),
"unique_paths": len(folder_paths[label]),
}
print(
f"[{label}] in={len(idxs)} pass1={len(keep1)} pass2={len(keep2)} "
f"unique_paths={len(folder_paths[label])}"
)
# osrc routing: every osrc face -> every centroid within OSRC_THRESHOLD.
osrc_idxs = [
i for i, m in enumerate(face_records)
if _record_in_folder(m, OSRC_DIR, path_aliases)
]
print(f"\nosrc: {len(osrc_idxs)} face records to route")
if osrc_idxs and centroids:
labels = list(centroids.keys())
cent_mat = np.stack([centroids[lab] for lab in labels])
# Build sims: (n_osrc, n_labels)
osrc_emb = emb[osrc_idxs]
sims = osrc_emb @ cent_mat.T # cosine similarity (vectors already normalized)
dists = 1.0 - sims
per_label_added: dict[str, int] = {lab: 0 for lab in labels}
for row, ci in enumerate(osrc_idxs):
p = face_records[ci]["path"]
for col, lab in enumerate(labels):
if dists[row, col] <= OSRC_THRESHOLD:
if p not in folder_paths[lab]:
folder_paths[lab].add(p)
per_label_added[lab] += 1
for lab in labels:
folder_stats[lab]["osrc_paths_added"] = per_label_added[lab]
print(f"[{lab}] osrc faces routed: +{per_label_added[lab]} unique paths")
# Build synthetic refine_manifest.
facesets: list[dict] = []
for n, (label, _folder) in enumerate(TRUSTED, start=START_NNN):
if label not in folder_paths:
continue
facesets.append({
"name": f"faceset_{n:03d}",
"label": label,
"image_count": len(folder_paths[label]),
"images": sorted(folder_paths[label]),
})
manifest = {
"params": {
"pass1_threshold": PASS1_THRESHOLD,
"pass2_threshold": PASS2_THRESHOLD,
"osrc_threshold": OSRC_THRESHOLD,
"min_face_short": MIN_FACE_SHORT,
},
"facesets": facesets,
"_per_folder_stats": folder_stats,
}
SYNTH_MANIFEST.write_text(json.dumps(manifest, indent=2))
print(f"\nSynthetic manifest -> {SYNTH_MANIFEST}")
return manifest, centroids, folder_stats
# ---- phase C: export + rename + merge ----------------------------------- #
def export_and_relocate(manifest: dict) -> None:
if OUT_TMP.exists():
shutil.rmtree(OUT_TMP)
OUT_TMP.mkdir(parents=True)
print(f"\nRunning cmd_export_swap -> {OUT_TMP}")
cmd_export_swap(
cache_path=CACHE,
refine_manifest_path=SYNTH_MANIFEST,
raw_manifest_path=None,
out_dir=OUT_TMP,
top_n=TOP_N,
outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
pad_ratio=PAD_RATIO,
out_size=OUT_SIZE,
include_candidates=False,
candidate_match_threshold=0.55,
candidate_min_score=0.40,
min_face_short=MIN_FACE_SHORT,
)
# Map name -> label from the synthetic manifest.
name_to_label = {fs["name"]: fs["label"] for fs in manifest["facesets"]}
# Load the temp top-level manifest (export-swap just wrote it).
new_top = json.loads((OUT_TMP / "manifest.json").read_text())
new_entries = new_top.get("facesets", [])
# Per-faceset rename + relocate.
for fs_meta in new_entries:
name = fs_meta["name"]
label = name_to_label.get(name)
src_dir = OUT_TMP / name
if not src_dir.exists():
print(f"[{name}] export dir missing; skipping")
continue
# Rename .fsz bundles to <label>_*.fsz; record updated names.
renames = {}
for fsz in sorted(src_dir.glob(f"{name}_top*.fsz")):
new = src_dir / fsz.name.replace(name + "_", label + "_", 1)
fsz.rename(new)
renames[fsz.name] = new.name
for fsz in sorted(src_dir.glob(f"{name}_all.fsz")):
new = src_dir / fsz.name.replace(name + "_", label + "_", 1)
fsz.rename(new)
renames[fsz.name] = new.name
# Replace NAME.txt placeholder with <label>.txt.
nametxt = src_dir / "NAME.txt"
if nametxt.exists():
nametxt.unlink()
(src_dir / f"{label}.txt").write_text(
f"{label}\n\nSource: /mnt/x/src/{label} (hand-sorted) + matched osrc faces.\n"
)
# Update fs_meta entry's fsz fields to point at the renamed files.
for k in ("fsz_top", "fsz_all"):
if fs_meta.get(k) and fs_meta[k] in renames:
fs_meta[k] = renames[fs_meta[k]]
fs_meta["label"] = label
# Move the directory into the final output.
dst_dir = OUT_FINAL / name
if dst_dir.exists():
print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
continue
shutil.move(str(src_dir), str(dst_dir))
print(f"[{name}] -> {dst_dir} (label={label})")
# Merge top-level manifest, preserving existing faceset_001..012 entries.
final_manifest_path = OUT_FINAL / "manifest.json"
if final_manifest_path.exists():
existing = json.loads(final_manifest_path.read_text())
else:
existing = {"facesets": []}
existing_names = {fs["name"] for fs in existing.get("facesets", [])}
appended = 0
for entry in new_entries:
if entry["name"] in existing_names:
print(f"[manifest] {entry['name']} already in top-level manifest; not duplicating")
continue
existing["facesets"].append(entry)
appended += 1
# Carry over export-swap params if not already present.
for k in ("quality_weights", "outlier_threshold", "top_n", "pad_ratio", "out_size"):
if k not in existing and k in new_top:
existing[k] = new_top[k]
final_manifest_path.write_text(json.dumps(existing, indent=2))
print(f"\nMerged manifest: appended {appended} entries -> {final_manifest_path}")
# Clean up temp dir if empty.
if OUT_TMP.exists() and not any(OUT_TMP.iterdir()):
OUT_TMP.rmdir()
# otherwise leave the temp manifest.json behind for inspection
# ---- main ---------------------------------------------------------------- #
def main() -> None:
manifest, _centroids, _stats = build_synthetic_manifest()
if not manifest.get("facesets"):
print("No facesets to build; nothing to do.")
return
export_and_relocate(manifest)
print("\nDone.")
if __name__ == "__main__":
main()
+151
@@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""Probe faceset_001 for age-sortable sub-structure.
Three questions:
1. How spread is the embedding cloud? (intra-cluster pairwise distance histogram)
2. Does it split naturally into sub-clusters at a tight threshold?
3. Do the sub-clusters correspond to distinct time periods (EXIF DateTimeOriginal)?
"""
from __future__ import annotations
import json
import sys
from collections import Counter
from pathlib import Path
import numpy as np
from PIL import Image, ExifTags
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import load_cache # noqa: E402
CACHE = REPO / "work" / "cache" / "nl_full.npz"
FS001 = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready/faceset_001")
def exif_year(path: Path) -> int | None:
try:
with Image.open(path) as im:
exif = im._getexif()
if not exif:
return None
for tag_id, val in exif.items():
tag = ExifTags.TAGS.get(tag_id, tag_id)
if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
return int(val[:4])
except Exception:
return None
return None
def main() -> None:
manifest = json.loads((FS001 / "manifest.json").read_text())
faces = manifest["faces"]
paths = [Path(f["source"]) for f in faces]
print(f"faceset_001 has {len(paths)} ranked faces in the swap-ready set")
# Pull embeddings for these face records by (path, bbox).
emb, meta, _src, _proc, _aliases = load_cache(CACHE)
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit("emb/meta mismatch")
bbox_key = {}
for i, m in enumerate(face_records):
bbox_key[(m["path"], tuple(m.get("bbox") or ()))] = i
selected = []
missing = 0
for f in faces:
key = (f["source"], tuple(f.get("bbox") or ()))
i = bbox_key.get(key)
if i is None:
missing += 1
continue
selected.append(i)
print(f"matched {len(selected)} embeddings (missing {missing})")
E = emb[selected]
# All embeddings are L2-normalized -> cosine dist = 1 - dot.
sims = E @ E.T
dists = 1.0 - sims
iu = np.triu_indices_from(dists, k=1)
pw = dists[iu]
print("\n-- intra-cluster pairwise cosine distance --")
print(f" n_pairs = {len(pw):,}")
print(f" mean = {pw.mean():.3f}")
print(f" median = {np.median(pw):.3f}")
print(f" p10/p25/p75/p90 = {np.percentile(pw, [10,25,75,90])}")
print(f" max = {pw.max():.3f}")
# Histogram bins around interesting thresholds.
edges = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0, 1.4]
hist, _ = np.histogram(pw, bins=edges)
print("\n histogram (cos-dist bin -> pair count):")
for lo, hi, c in zip(edges[:-1], edges[1:], hist):
bar = "#" * int(60 * c / max(hist.max(), 1))
print(f" [{lo:.1f},{hi:.1f}) {c:7d} {bar}")
# Sub-cluster at a sweep of thresholds via agglomerative clustering on the distance matrix.
from sklearn.cluster import AgglomerativeClustering
print("\n-- sub-clustering --")
for thr in (0.30, 0.35, 0.40, 0.45, 0.50):
ac = AgglomerativeClustering(
n_clusters=None,
metric="precomputed",
linkage="average",
distance_threshold=thr,
)
labels = ac.fit_predict(dists)
sizes = Counter(labels)
n = len(sizes)
big = sum(1 for s in sizes.values() if s >= 10)
top5 = sorted(sizes.values(), reverse=True)[:5]
print(f" threshold {thr:.2f}: {n} sub-clusters, {big} with >=10 images, top-5 sizes={top5}")
# Pick the threshold that gives 2-5 substantial sub-clusters.
target_thr = 0.35
ac = AgglomerativeClustering(
n_clusters=None, metric="precomputed", linkage="average",
distance_threshold=target_thr,
)
labels = ac.fit_predict(dists)
sizes = Counter(labels)
big_labels = [lab for lab, s in sizes.most_common() if s >= 20]
print(f"\n-- EXIF year analysis at threshold {target_thr} (sub-clusters with >=20 images) --")
print(f" {len(big_labels)} substantial sub-clusters")
# Build label -> list of source paths
by_label: dict[int, list[Path]] = {}
for ci, lab in zip(selected, labels):
rec = face_records[ci]
by_label.setdefault(int(lab), []).append(Path(rec["path"]))
for lab in big_labels[:6]:
paths_in = by_label[lab]
years = []
for p in paths_in:
y = exif_year(p)
if y is not None:
years.append(y)
n_paths = len(paths_in)
n_years = len(years)
if years:
ys = np.array(years)
ymin, ymax = int(ys.min()), int(ys.max())
ymed = int(np.median(ys))
yhist = Counter(years)
top_years = ", ".join(f"{y}:{c}" for y, c in sorted(yhist.most_common(5)))
else:
ymin = ymax = ymed = None
top_years = ""
print(
f" cluster {lab}: {n_paths} faces, EXIF on {n_years}/{n_paths}, "
f"year range {ymin}..{ymax} (median {ymed})"
)
print(f" top years: {top_years}")
if __name__ == "__main__":
main()
+221
@@ -0,0 +1,221 @@
"""Windows / DirectML CLIP worker for occlusion scoring.
Reads a queue.json staged by /opt/face-sets/work/filter_occlusions.py (WSL side),
runs open_clip ViT-L-14 (dfn2b_s39b) on each PNG via torch-directml on the AMD
Vega, and writes a scores.json with mask + sunglasses softmax probabilities.
CLI:
py -3.12 clip_worker.py <queue.json> <out_scores.json> [--limit N] [--batch 8]
queue.json shape: list of objects
{"wsl_path": "...", "win_path": "E:\\...\\faceset_NNN\\faces\\NNNN.png",
"faceset": "faceset_NNN", "file": "NNNN.png"}
scores.json shape:
{"model": "ViT-L-14/dfn2b_s39b",
"logit_scale": 100.0,
"prompts": {...},
"results": [{"wsl_path": "...", "faceset": "...", "file": "...",
"mask": float, "sunglasses": float}],
"processed": [wsl_path, ...]}
"""
from __future__ import annotations
import argparse
import json
import os
import sys
import time
import warnings
from pathlib import Path
# DML emits a verbose UserWarning per attention call -- silence at import time
warnings.filterwarnings("ignore", category=UserWarning)
import torch
import torch_directml
import open_clip
from PIL import Image
MODEL_NAME = "ViT-L-14"
PRETRAINED = "dfn2b_s39b"
# kept in sync with /opt/face-sets/work/filter_occlusions.py PROMPTS
PROMPTS = {
"mask": {
"pos": [
"a photo of a person wearing a surgical face mask",
"a photo of a person wearing an FFP2 respirator covering mouth and nose",
"a photo of a person wearing a cloth face mask",
"a face partially covered by a medical mask",
"a person whose mouth and nose are hidden by a face mask",
],
"neg": [
"a photo of a person's face with mouth and nose clearly visible",
"a clear, unobstructed photo of a face",
"a photo of a face without any mask or covering",
"a portrait of a person showing their full face",
"a photo of a person with a beard and visible mouth",
],
},
"sunglasses": {
"pos": [
"a face with dark sunglasses covering the eyes",
"a portrait with the eyes hidden behind opaque sunglasses",
"a person wearing dark sunglasses over their eyes, eyes not visible",
"a face where the eyes are completely concealed by tinted lenses",
"a close-up portrait wearing aviator sunglasses on the eyes",
],
"neg": [
"a portrait with both eyes clearly visible and uncovered",
"a face with sunglasses pushed up on the forehead, eyes visible below",
"a face with sunglasses resting on top of the head, eyes visible",
"a person with sunglasses hanging from their shirt, eyes visible",
"a face wearing clear prescription eyeglasses with visible eyes",
"a portrait with no eyewear and visible eyes",
],
},
}
FLUSH_EVERY = 100
def load_existing(out_path: Path):
if not out_path.exists():
return None, set()
try:
d = json.loads(out_path.read_text())
processed = set(d.get("processed", []))
return d, processed
except Exception as e:
print(f"[warn] could not parse existing {out_path}: {e}; starting fresh", file=sys.stderr)
return None, set()
def save_atomic(out_path: Path, data: dict):
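# Write to a temp file, then os.replace: the rename is atomic on the same
# volume, so a crash mid-write never corrupts the previous scores.json.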
tmp = out_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(data, indent=2))
os.replace(tmp, out_path)
@torch.no_grad()
def build_text_features(model, tokenizer, device):
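# Prompt ensembling: encode each side's prompt list, average the unit
# vectors, then renormalize so each pos/neg centroid is itself unit-length.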
out = {}
for attr, sides in PROMPTS.items():
feats = {}
for side in ("pos", "neg"):
tokens = tokenizer(sides[side]).to(device)
f = model.encode_text(tokens)
f = f / f.norm(dim=-1, keepdim=True)
mean = f.mean(dim=0)
feats[side] = mean / mean.norm()
out[attr] = (feats["pos"], feats["neg"])
return out
def main():
ap = argparse.ArgumentParser()
ap.add_argument("queue", type=Path)
ap.add_argument("out", type=Path)
ap.add_argument("--limit", type=int, default=None)
ap.add_argument("--batch", type=int, default=8)
args = ap.parse_args()
queue = json.loads(args.queue.read_text())
print(f"[queue] {len(queue)} entries from {args.queue}")
args.out.parent.mkdir(parents=True, exist_ok=True)
existing, processed = load_existing(args.out)
if existing:
print(f"[resume] {len(processed)} entries already scored")
results = existing.get("results", [])
else:
results = []
pending = [e for e in queue if e["wsl_path"] not in processed]
if args.limit is not None:
pending = pending[: args.limit]
print(f"[pending] {len(pending)} entries to score")
if not pending:
print("[done] nothing to do")
return
device = torch_directml.device()
print(f"[load] {MODEL_NAME}/{PRETRAINED} on {torch_directml.device_name(0)}")
t0 = time.time()
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_NAME, pretrained=PRETRAINED)
tokenizer = open_clip.get_tokenizer(MODEL_NAME)
model = model.to(device).eval()
logit_scale = float(model.logit_scale.exp().detach().cpu())
print(f"[load] ready in {time.time()-t0:.1f}s logit_scale={logit_scale:.2f}")
text_feats = build_text_features(model, tokenizer, device)
def flush():
save_atomic(args.out, {
"model": f"{MODEL_NAME}/{PRETRAINED}",
"logit_scale": logit_scale,
"prompts": PROMPTS,
"results": results,
"processed": sorted(processed),
})
n_done_this_run = 0
n_load_err = 0
last_flush = time.time()
t_start = time.time()
for i in range(0, len(pending), args.batch):
chunk = pending[i:i + args.batch]
imgs = []
keep = []
for entry in chunk:
try:
img = Image.open(entry["win_path"]).convert("RGB")
imgs.append(preprocess(img))
keep.append(entry)
except Exception as e:
print(f"[skip] {entry['win_path']}: {e}", file=sys.stderr)
n_load_err += 1
processed.add(entry["wsl_path"])
if not imgs:
continue
x = torch.stack(imgs).to(device)
with torch.no_grad():
feats = model.encode_image(x)
feats = feats / feats.norm(dim=-1, keepdim=True)
scores_per_attr = {}
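# Two-way softmax per attribute: column 0 holds the positive-prompt
# similarity, column 1 the negative, both scaled by the model's learned
# temperature; probs[:, 0] is the probability mass on the occluded side.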
for attr, (pos, neg) in text_feats.items():
sims = torch.stack([feats @ pos, feats @ neg], dim=1) * logit_scale
probs = sims.softmax(dim=1)[:, 0].detach().cpu().tolist()
scores_per_attr[attr] = probs
for j, entry in enumerate(keep):
results.append({
"wsl_path": entry["wsl_path"],
"faceset": entry["faceset"],
"file": entry["file"],
"mask": round(scores_per_attr["mask"][j], 4),
"sunglasses": round(scores_per_attr["sunglasses"][j], 4),
})
processed.add(entry["wsl_path"])
n_done_this_run += 1
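# Flush when the counter crosses a multiple of FLUSH_EVERY (the modulo
# window is batch-wide because the counter advances in batch steps), or at
# least every 30 s.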
if (n_done_this_run % FLUSH_EVERY < args.batch) or (time.time() - last_flush) > 30.0:
flush()
last_flush = time.time()
elapsed = time.time() - t_start
rate = n_done_this_run / max(0.1, elapsed)
eta_min = (len(pending) - n_done_this_run) / max(0.1, rate) / 60.0
print(f"[score] {n_done_this_run}/{len(pending)} "
f"rate={rate:.2f} img/s eta={eta_min:.1f}min "
f"load_err={n_load_err}", flush=True)
flush()
elapsed = time.time() - t_start
print(f"[done] {n_done_this_run} scored, {n_load_err} load errors, "
f"{elapsed:.1f}s ({n_done_this_run/max(0.1,elapsed):.2f} img/s) -> {args.out}")
if __name__ == "__main__":
main()
+340
@@ -0,0 +1,340 @@
#!/usr/bin/env python3
"""Discover new identities in an Immich-sourced cache and emit them as facesets.
Mirrors `work/cluster_osrc.py`, but the source corpus is an arbitrary
Immich user's `immich_<user>.npz` cache produced by the Windows DML embed
worker. Existing identity centroids come from the union of every faceset
already in `facesets_swap_ready/` (faceset_001..NNN, both auto-clustered
and hand-sorted).
Pipeline:
1. Load immich_<user>.npz; restrict to face records (drop noface).
2. Build centroids of every existing canonical faceset in
facesets_swap_ready/ (skip era splits and _thin/).
3. Drop immich faces whose nearest existing centroid is within
EXISTING_MATCH_THRESHOLD; those are already covered by the canonical set.
4. Cluster the remaining among themselves at INITIAL_THRESHOLD.
5. Per cluster: refine-equivalent gates (face_short, blur, det_score),
plus outlier rejection at OUTLIER_THRESHOLD for clusters of size >= 4.
6. Keep clusters whose surviving unique source-path count is >= MIN_FACES.
7. Number kept clusters past the existing facesets_swap_ready/ max.
8. Synthesize a refine_manifest, hand off to cmd_export_swap, move dirs into
facesets_swap_ready/, drop a provenance marker, append to top-level
manifest.json (preserving facesets / thin_eras).
"""
from __future__ import annotations
import argparse
import json
import shutil
import sys
from pathlib import Path
import numpy as np
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import ( # noqa: E402
_cluster_embeddings,
cmd_export_swap,
load_cache,
)
# ---- config -------------------------------------------------------------- #
REPO_WORK = REPO / "work"
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
EXISTING_MATCH_THRESHOLD = 0.45
INITIAL_THRESHOLD = 0.55
MIN_FACES = 6
MIN_SHORT = 90
MIN_BLUR = 40.0
MIN_DET_SCORE = 0.6
OUTLIER_THRESHOLD = 0.55
TOP_N = 30
EXPORT_OUTLIER_THRESHOLD = 0.45
PAD_RATIO = 0.5
OUT_SIZE = 512
EXPORT_MIN_FACE_SHORT = 100
# ---- helpers ------------------------------------------------------------- #
def _normalize(v: np.ndarray) -> np.ndarray:
n = np.linalg.norm(v)
return v / n if n > 0 else v
def _existing_identity_centroids(
nl_cache: Path,
) -> tuple[np.ndarray, list[str]]:
"""Build identity centroids from every canonical faceset_NNN/ in
facesets_swap_ready/. Era-split sub-dirs (faceset_001_<era>) and the
_thin/ quarantine are skipped. Each faceset's manifest.json provides
(source, bbox) keys we use to look up rows in nl_full.npz."""
emb, meta, _src, _proc, _aliases = load_cache(nl_cache)
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/embedding mismatch in {nl_cache}: {len(face_records)} vs {len(emb)}")
bbox_idx = {(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)}
centroids: list[np.ndarray] = []
names: list[str] = []
for d in sorted(SWAP_READY.iterdir()):
if not d.is_dir():
continue
if d.name.startswith("_"):
continue
# Skip era-split sub-facesets (faceset_NNN_*).
if d.name.startswith("faceset_") and "_" in d.name[len("faceset_"):]:
continue
man = d / "manifest.json"
if not man.exists():
continue
try:
entries = json.loads(man.read_text()).get("faces", [])
except Exception:
continue
keys = [(f["source"], tuple(f.get("bbox") or ())) for f in entries]
idxs = [bbox_idx[k] for k in keys if k in bbox_idx]
if not idxs:
continue
centroids.append(_normalize(emb[idxs].mean(axis=0)))
names.append(d.name)
if not centroids:
raise SystemExit("no canonical identity centroids could be built; check facesets_swap_ready/")
return np.stack(centroids), names
def _next_faceset_number() -> int:
nums = []
for d in SWAP_READY.iterdir():
if not d.is_dir() or not d.name.startswith("faceset_"):
continue
tail = d.name[len("faceset_"):]
# Take only top-level numbered facesets (no era suffix).
if "_" in tail:
continue
try:
nums.append(int(tail))
except ValueError:
continue
return (max(nums) + 1) if nums else 1
# ---- phase 1: discover --------------------------------------------------- #
def discover_new_clusters(
immich_cache: Path, nl_cache: Path, start_nnn: int, source_label: str
) -> tuple[dict, list[dict]]:
print(f"loading immich cache: {immich_cache}")
emb, meta, _src, _proc, _aliases = load_cache(immich_cache)
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")
print(f" {len(face_records)} face records, {sum(1 for m in meta if m.get('noface'))} noface")
print(f"building existing-identity centroids from {SWAP_READY}")
cents, cent_names = _existing_identity_centroids(nl_cache)
print(f" {len(cent_names)} canonical centroids")
sims = emb @ cents.T
nearest_d = 1.0 - sims.max(axis=1)
nearest_id = sims.argmax(axis=1)
covered = nearest_d <= EXISTING_MATCH_THRESHOLD
print(f"\nfaces already covered (cos-dist <= {EXISTING_MATCH_THRESHOLD}): "
f"{int(covered.sum())}/{len(emb)}")
for j, name in enumerate(cent_names):
c = int(((nearest_id == j) & covered).sum())
if c:
print(f" -> {name}: {c}")
new_idx = [i for i in range(len(emb)) if not covered[i]]
print(f"\nunmatched immich faces to cluster: {len(new_idx)}")
if len(new_idx) <= 1:
labels = np.zeros(len(new_idx), dtype=int)
else:
labels = _cluster_embeddings(emb[new_idx], INITIAL_THRESHOLD)
n_clusters = len(set(int(l) for l in labels))
sizes = sorted([int((labels == l).sum()) for l in set(labels)], reverse=True)
print(f"clusters at threshold {INITIAL_THRESHOLD}: {n_clusters} "
f"top sizes: {sizes[:10]}")
clusters: dict[int, list[int]] = {}
for k, lab in enumerate(labels):
clusters.setdefault(int(lab), []).append(new_idx[k])
kept: list[dict] = []
drop_quality_total = 0
drop_outlier_total = 0
for cid, idxs in clusters.items():
good: list[int] = []
for i in idxs:
r = face_records[i]
if r.get("face_short", 0) < MIN_SHORT:
drop_quality_total += 1; continue
if r.get("blur", 0.0) < MIN_BLUR:
drop_quality_total += 1; continue
if r.get("det_score", 0.0) < MIN_DET_SCORE:
drop_quality_total += 1; continue
good.append(i)
if not good:
continue
if len(good) >= 4:
cent = _normalize(emb[good].mean(axis=0))
d = 1.0 - emb[good] @ cent
tight = [good[k] for k, dist in enumerate(d) if dist <= OUTLIER_THRESHOLD]
drop_outlier_total += len(good) - len(tight)
good = tight
if not good:
continue
unique_paths = sorted({face_records[i]["path"] for i in good})
if len(unique_paths) < MIN_FACES:
continue
kept.append({
"indices": good,
"unique_paths": unique_paths,
"size_face": len(good),
"size_paths": len(unique_paths),
})
kept.sort(key=lambda c: -c["size_paths"])
print(f"\nafter quality+outlier+min_faces: {len(kept)} clusters kept "
f"(dropped: quality={drop_quality_total} outlier={drop_outlier_total})")
for rank, c in enumerate(kept, start=start_nnn):
print(f" faceset_{rank:03d}: faces={c['size_face']:3d} "
f"unique_paths={c['size_paths']:3d}")
facesets = [
{
"name": f"faceset_{rank:03d}",
"image_count": c["size_paths"],
"face_count": c["size_face"],
"images": c["unique_paths"],
}
for rank, c in enumerate(kept, start=start_nnn)
]
manifest = {
"params": {
"existing_match_threshold": EXISTING_MATCH_THRESHOLD,
"initial_threshold": INITIAL_THRESHOLD,
"outlier_threshold": OUTLIER_THRESHOLD,
"min_faces": MIN_FACES,
"min_short": MIN_SHORT,
"min_blur": MIN_BLUR,
"min_det_score": MIN_DET_SCORE,
"source_label": source_label,
"source_cache": str(immich_cache),
},
"facesets": facesets,
}
return manifest, kept
# ---- phase 2: export + relocate ----------------------------------------- #
def export_and_relocate(manifest: dict, immich_cache: Path, source_label: str) -> None:
synth_path = REPO_WORK / f"synthetic_{source_label}_manifest.json"
synth_path.write_text(json.dumps(manifest, indent=2))
print(f"\nsynthetic manifest -> {synth_path}")
out_tmp = SWAP_READY.parent / f"facesets_swap_ready_{source_label}_new"
if out_tmp.exists():
shutil.rmtree(out_tmp)
out_tmp.mkdir(parents=True)
print(f"running cmd_export_swap -> {out_tmp}")
cmd_export_swap(
cache_path=immich_cache,
refine_manifest_path=synth_path,
raw_manifest_path=None,
out_dir=out_tmp,
top_n=TOP_N,
outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
pad_ratio=PAD_RATIO,
out_size=OUT_SIZE,
include_candidates=False,
candidate_match_threshold=0.55,
candidate_min_score=0.40,
min_face_short=EXPORT_MIN_FACE_SHORT,
)
new_top = json.loads((out_tmp / "manifest.json").read_text())
new_entries = new_top.get("facesets", [])
moved = 0
for fs_meta in new_entries:
name = fs_meta["name"]
src_dir = out_tmp / name
if not src_dir.exists():
print(f"[{name}] export dir missing; skipping")
continue
dst_dir = SWAP_READY / name
if dst_dir.exists():
print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
continue
(src_dir / f"immich_{source_label}.txt").write_text(
f"{name}\n\nSource: Immich user {source_label} cluster (auto-discovered).\n"
)
shutil.move(str(src_dir), str(dst_dir))
moved += 1
print(f"[{name}] -> {dst_dir}")
final_manifest_path = SWAP_READY / "manifest.json"
if final_manifest_path.exists():
existing = json.loads(final_manifest_path.read_text())
else:
existing = {"facesets": []}
existing.setdefault("facesets", [])
existing_names = {fs["name"] for fs in existing["facesets"]}
appended = 0
for entry in new_entries:
if entry["name"] in existing_names:
print(f"[manifest] {entry['name']} already present; not duplicating")
continue
existing["facesets"].append(entry)
appended += 1
final_manifest_path.write_text(json.dumps(existing, indent=2))
print(f"\nmerged manifest: appended {appended} entries -> {final_manifest_path}")
print(f"moved {moved} faceset directories into {SWAP_READY}")
if out_tmp.exists() and not list(out_tmp.iterdir()):
out_tmp.rmdir()
# ---- main ---------------------------------------------------------------- #
def main() -> None:
p = argparse.ArgumentParser()
p.add_argument("immich_cache", type=Path,
help="path to immich_<user>.npz produced by the embed worker")
p.add_argument("--nl-cache", type=Path, default=REPO_WORK / "cache" / "nl_full.npz",
help="canonical cache for existing identity centroids")
p.add_argument("--source-label", default=None,
help="short label used in marker filenames; default = stem of immich_cache")
p.add_argument("--start-nnn", type=int, default=None,
help="first faceset number to assign; default = current max+1 in facesets_swap_ready/")
p.add_argument("--dry-run", action="store_true")
args = p.parse_args()
label = args.source_label or args.immich_cache.stem.removeprefix("immich_") or args.immich_cache.stem
start_nnn = args.start_nnn if args.start_nnn is not None else _next_faceset_number()
print(f"source label: {label!r}; first faceset number: {start_nnn:03d}")
manifest, kept = discover_new_clusters(args.immich_cache, args.nl_cache, start_nnn, label)
if args.dry_run:
print("\n--dry-run: stopping after cluster discovery (no exports written).")
return
if not manifest.get("facesets"):
print("no new facesets to build.")
return
export_and_relocate(manifest, args.immich_cache, label)
print("\nDone.")
if __name__ == "__main__":
main()
+352
@@ -0,0 +1,352 @@
#!/usr/bin/env python3
"""Discover new identities in /mnt/x/src/osrc and emit them as facesets.
Workflow (mirrors the shape of build_folders.py, but identities are
discovered by clustering rather than asserted by folder):
1. Load cache; restrict to face records whose canonical or alias path
lies under /mnt/x/src/osrc/.
2. Build centroids of the existing 19 canonical identities in
facesets_swap_ready/faceset_001..019. Drop any osrc face whose
nearest-existing-identity cos-dist <= EXISTING_MATCH_THRESHOLD;
those are already covered by `extend` and shouldn't seed new
facesets.
3. Cluster the remaining osrc faces among themselves at
INITIAL_THRESHOLD (matches `extend`'s new_cluster_threshold default).
4. Per cluster, apply refine-equivalent gates: face_short >= MIN_SHORT,
blur >= MIN_BLUR, det_score >= MIN_DET_SCORE; for clusters >= 4,
drop faces with cos-dist > OUTLIER_THRESHOLD from the cluster
centroid.
5. Keep clusters whose surviving unique source-path count is >= MIN_FACES.
6. Number kept clusters faceset_020, 021, ... (past the highest existing
in facesets_swap_ready, which is 019). Order by descending size.
7. Synthesize a refine_manifest.json and call cmd_export_swap on it,
emitting into a temp dir. Move new dirs into facesets_swap_ready/.
8. Append new entries to the top-level facesets_swap_ready/manifest.json
(preserving existing facesets / thin_eras).
"""
from __future__ import annotations
import json
import shutil
import sys
from pathlib import Path
import numpy as np
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import ( # noqa: E402
_cluster_embeddings,
cmd_export_swap,
load_cache,
)
# ---- config -------------------------------------------------------------- #
CACHE = REPO / "work" / "cache" / "nl_full.npz"
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
OUT_TMP = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready_osrc_new")
SYNTH_MANIFEST = REPO / "work" / "synthetic_osrc_manifest.json"
OSRC_DIR = Path("/mnt/x/src/osrc")
START_NNN = 20 # facesets_swap_ready max is 019; pick up here.
# Existing-identity exclusion: drop osrc faces whose nearest existing
# identity centroid is within this cosine distance. 0.45 matches the
# build_folders.py OSRC_THRESHOLD: at this cutoff the face is already
# routed to an existing identity by extend / build_folders.py.
EXISTING_MATCH_THRESHOLD = 0.45
# Cluster the unmatched.
INITIAL_THRESHOLD = 0.55
# Refine-equivalent gates (min_faces deliberately dropped to 6).
MIN_FACES = 6
MIN_SHORT = 90
MIN_BLUR = 40.0
MIN_DET_SCORE = 0.6
OUTLIER_THRESHOLD = 0.55 # only applied if cluster >= 4
# export-swap params (defaults from sort_faces.py).
TOP_N = 30
EXPORT_OUTLIER_THRESHOLD = 0.45
PAD_RATIO = 0.5
OUT_SIZE = 512
EXPORT_MIN_FACE_SHORT = 100
# ---- helpers ------------------------------------------------------------- #
def _normalize(v: np.ndarray) -> np.ndarray:
n = np.linalg.norm(v)
return v / n if n > 0 else v
def _under(folder: Path, p: str) -> bool:
fs = str(folder).rstrip("/") + "/"
return p == str(folder) or p.startswith(fs)
def _record_in_folder(rec: dict, folder: Path, path_aliases: dict[str, list[str]]) -> bool:
if _under(folder, rec["path"]):
return True
for alias in path_aliases.get(rec["path"], []):
if _under(folder, alias):
return True
return False
def _existing_identity_centroids(
emb: np.ndarray, face_records: list[dict]
) -> tuple[np.ndarray, list[str]]:
"""Build a (n_identities, 512) matrix of L2-normalized centroids and a parallel name list,
drawn from the canonical faceset_001..019 manifests in facesets_swap_ready/."""
bbox_idx: dict[tuple[str, tuple], int] = {
(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)
}
centroids: list[np.ndarray] = []
names: list[str] = []
for n in range(1, 20):
d = SWAP_READY / f"faceset_{n:03d}"
man_path = d / "manifest.json"
if not man_path.exists():
continue
man = json.loads(man_path.read_text())
keys = [(f["source"], tuple(f.get("bbox") or ())) for f in man.get("faces", [])]
idxs = [bbox_idx[k] for k in keys if k in bbox_idx]
if not idxs:
continue
centroids.append(_normalize(emb[idxs].mean(axis=0)))
names.append(d.name)
return np.stack(centroids), names
# ---- phase 1: identify new osrc clusters --------------------------------- #
def discover_new_clusters() -> tuple[dict, list[dict]]:
emb, meta, _src_root, _proc, path_aliases = load_cache(CACHE)
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")
print(f"Cache: {len(face_records)} face records.")
# Step 1: filter to osrc.
osrc_idx = [
i for i, m in enumerate(face_records)
if _record_in_folder(m, OSRC_DIR, path_aliases)
]
print(f"osrc face records: {len(osrc_idx)}")
# Step 2: drop those already matching an existing identity.
cents, cent_names = _existing_identity_centroids(emb, face_records)
osrc_emb = emb[osrc_idx]
sims = osrc_emb @ cents.T
nearest_d = 1.0 - sims.max(axis=1)
nearest_id = sims.argmax(axis=1)
covered_mask = nearest_d <= EXISTING_MATCH_THRESHOLD
n_covered = int(covered_mask.sum())
print(
f"Already covered by existing 19 identities at cos-dist <= "
f"{EXISTING_MATCH_THRESHOLD}: {n_covered}/{len(osrc_idx)}"
)
# Per-identity coverage breakdown (for logging only).
for j, name in enumerate(cent_names):
c = int(((nearest_id == j) & covered_mask).sum())
if c:
print(f" -> {name}: {c}")
new_idx = [osrc_idx[k] for k in range(len(osrc_idx)) if not covered_mask[k]]
print(f"\nUnmatched osrc faces to cluster: {len(new_idx)}")
# Step 3: cluster the unmatched among themselves.
new_emb = emb[new_idx]
if len(new_idx) <= 1:
labels = np.zeros(len(new_idx), dtype=int)
else:
labels = _cluster_embeddings(new_emb, INITIAL_THRESHOLD)
n_clusters = len(set(int(l) for l in labels))
print(
f"Initial clusters at threshold {INITIAL_THRESHOLD}: {n_clusters} "
f"(top sizes: {sorted([int((labels==l).sum()) for l in set(labels)], reverse=True)[:10]})"
)
# Step 4 + 5: per-cluster refine gates + min_faces.
clusters: dict[int, list[int]] = {}
for k, lab in enumerate(labels):
clusters.setdefault(int(lab), []).append(new_idx[k])
kept_clusters: list[dict] = []
drop_quality_total = 0
drop_outlier_total = 0
for cid, idxs in clusters.items():
# Per-face quality gate.
good: list[int] = []
for i in idxs:
r = face_records[i]
if r.get("face_short", 0) < MIN_SHORT:
drop_quality_total += 1
continue
if r.get("blur", 0.0) < MIN_BLUR:
drop_quality_total += 1
continue
if r.get("det_score", 0.0) < MIN_DET_SCORE:
drop_quality_total += 1
continue
good.append(i)
if not good:
continue
# Outlier rejection (only if cluster >= 4).
if len(good) >= 4:
cent = _normalize(emb[good].mean(axis=0))
d = 1.0 - emb[good] @ cent
tight = [good[k] for k, dist in enumerate(d) if dist <= OUTLIER_THRESHOLD]
drop_outlier_total += len(good) - len(tight)
good = tight
if not good:
continue
unique_paths = sorted({face_records[i]["path"] for i in good})
if len(unique_paths) < MIN_FACES:
continue
kept_clusters.append({
"indices": good,
"unique_paths": unique_paths,
"size_face": len(good),
"size_paths": len(unique_paths),
})
kept_clusters.sort(key=lambda c: -c["size_paths"])
print(
f"\nAfter quality gate ({drop_quality_total} dropped) + outlier "
f"rejection ({drop_outlier_total} dropped) + min_faces={MIN_FACES}: "
f"{len(kept_clusters)} clusters kept"
)
for rank, c in enumerate(kept_clusters, start=START_NNN):
print(
f" faceset_{rank:03d}: faces={c['size_face']:3d} "
f"unique_paths={c['size_paths']:3d}"
)
# Build synthetic refine_manifest.json compatible with cmd_export_swap.
facesets = [
{
"name": f"faceset_{rank:03d}",
"image_count": c["size_paths"],
"face_count": c["size_face"],
"images": c["unique_paths"],
}
for rank, c in enumerate(kept_clusters, start=START_NNN)
]
manifest = {
"params": {
"existing_match_threshold": EXISTING_MATCH_THRESHOLD,
"initial_threshold": INITIAL_THRESHOLD,
"outlier_threshold": OUTLIER_THRESHOLD,
"min_faces": MIN_FACES,
"min_short": MIN_SHORT,
"min_blur": MIN_BLUR,
"min_det_score": MIN_DET_SCORE,
"source_root": str(OSRC_DIR),
},
"facesets": facesets,
}
SYNTH_MANIFEST.write_text(json.dumps(manifest, indent=2))
print(f"\nSynthetic manifest -> {SYNTH_MANIFEST}")
return manifest, kept_clusters
# ---- phase 2: export + relocate + merge top-level manifest -------------- #
def export_and_relocate(manifest: dict) -> None:
if OUT_TMP.exists():
shutil.rmtree(OUT_TMP)
OUT_TMP.mkdir(parents=True)
print(f"\nRunning cmd_export_swap -> {OUT_TMP}")
cmd_export_swap(
cache_path=CACHE,
refine_manifest_path=SYNTH_MANIFEST,
raw_manifest_path=None,
out_dir=OUT_TMP,
top_n=TOP_N,
outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
pad_ratio=PAD_RATIO,
out_size=OUT_SIZE,
include_candidates=False,
candidate_match_threshold=0.55,
candidate_min_score=0.40,
min_face_short=EXPORT_MIN_FACE_SHORT,
)
new_top = json.loads((OUT_TMP / "manifest.json").read_text())
new_entries = new_top.get("facesets", [])
moved = 0
for fs_meta in new_entries:
name = fs_meta["name"]
src_dir = OUT_TMP / name
if not src_dir.exists():
print(f"[{name}] export dir missing; skipping")
continue
dst_dir = SWAP_READY / name
if dst_dir.exists():
print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
continue
# Add a marker file so the source provenance is obvious.
(src_dir / "osrc.txt").write_text(
f"{name}\n\nSource: osrc cluster (auto-discovered, {OSRC_DIR}).\n"
)
shutil.move(str(src_dir), str(dst_dir))
moved += 1
print(f"[{name}] -> {dst_dir}")
# Merge top-level manifest, preserving facesets / thin_eras / etc.
final_manifest_path = SWAP_READY / "manifest.json"
if final_manifest_path.exists():
existing = json.loads(final_manifest_path.read_text())
else:
existing = {"facesets": []}
existing.setdefault("facesets", [])
existing_names = {fs["name"] for fs in existing["facesets"]}
appended = 0
for entry in new_entries:
if entry["name"] in existing_names:
print(f"[manifest] {entry['name']} already present; not duplicating")
continue
existing["facesets"].append(entry)
appended += 1
final_manifest_path.write_text(json.dumps(existing, indent=2))
print(f"\nMerged manifest: appended {appended} entries -> {final_manifest_path}")
print(f"Moved {moved} faceset directories into {SWAP_READY}")
# Clean up temp dir if empty.
if OUT_TMP.exists():
leftover = list(OUT_TMP.iterdir())
if not leftover:
OUT_TMP.rmdir()
# ---- main ---------------------------------------------------------------- #
def main() -> None:
dry = "--dry-run" in sys.argv
manifest, kept = discover_new_clusters()
if dry:
print("\n--dry-run: stopping after cluster discovery (no exports written).")
return
if not manifest.get("facesets"):
print("No new facesets to build; nothing to do.")
return
export_and_relocate(manifest)
print("\nDone.")
if __name__ == "__main__":
main()
+634
@@ -0,0 +1,634 @@
"""Consolidate facesets_swap_ready/ — find duplicate identities and merge.
Pipeline:
1. analyze: pull arcface embeddings from work/cache/*.npz for every PNG in every
active faceset (skipping _masked, _thin, era splits). Compute L2-normalized
centroid per faceset. Build similarity graph at sim>=0.45, extract components.
Pick primary per component by tier (hand-sorted > auto > osrc > immich) + size.
2. report: HTML contact sheet at work/merge_review/index.html grouped by
candidate cluster, with top-3 thumbs per faceset, all pairwise sims, and
"merge X,Y -> Z" plan. Confident edges (sim>=0.65) are highlighted.
3. apply: combine PNGs of secondaries into primary, re-rank by quality.composite
descending, renumber 0001..NNNN, re-zip _topN.fsz + _all.fsz, move secondaries
to facesets_swap_ready/_merged/<name>/, update master manifest with
`merged[]` array + `merge_run` provenance block.
Embeddings come from caches (no GPU re-embed needed); the original clusterer used
exactly these vectors so they are the right yardstick. Era splits are excluded
entirely (intentional time-period segmentation, not a duplication).
"""
from __future__ import annotations
import argparse
import json
import re
import shutil
import sys
import time
from pathlib import Path
import numpy as np
from PIL import Image
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
CACHES = [
Path("/opt/face-sets/work/cache/nl_full.npz"),
Path("/opt/face-sets/work/cache/immich_peter.npz"),
Path("/opt/face-sets/work/cache/immich_nic.npz"),
]
ERA_SPLIT_RE = re.compile(r"^faceset_\d+_(?:\d{4}-\d{2,4}|\d{4}|undated)$")
# ----------------------------- helpers -----------------------------
def load_caches():
"""Return (rec_index, alias_map). rec_index keyed by (path, bbox_tuple)
-> embedding (np.float32, shape (512,) L2-normalized).
alias_map maps every alias path -> canonical path."""
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
alias_map: dict[str, str] = {}
n_total = 0
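# Note: on a duplicate (path, bbox) key a later cache overwrites an earlier
# one; presumably harmless here since the caches cover different corpora.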
for c in CACHES:
if not c.exists():
print(f"[warn] cache missing: {c}", file=sys.stderr)
continue
d = np.load(c, allow_pickle=True)
emb = d["embeddings"]
meta = json.loads(str(d["meta"]))
face_records = [m for m in meta if not m.get("noface")]
if len(face_records) != len(emb):
raise SystemExit(f"meta/emb mismatch in {c}: {len(face_records)} vs {len(emb)}")
# path_aliases may be present
if "path_aliases" in d.files:
paliases = json.loads(str(d["path_aliases"]))
for canon, alist in paliases.items():
alias_map.setdefault(canon, canon)
for a in alist:
alias_map[a] = canon
for i, rec in enumerate(face_records):
p = rec["path"]
bbox = tuple(int(x) for x in rec["bbox"])
v = emb[i].astype(np.float32)
n = float(np.linalg.norm(v))
if n > 0:
v = v / n
rec_index[(p, bbox)] = v
alias_map.setdefault(p, p)
print(f"[cache] {c.name}: +{len(face_records)} face records (running total {len(rec_index)})", file=sys.stderr)
n_total += len(face_records)
print(f"[cache] indexed {n_total} face records, {len(alias_map)} path aliases", file=sys.stderr)
return rec_index, alias_map
def faceset_tier(name: str) -> int:
"""Lower number = higher priority for primary selection."""
m = re.match(r"^faceset_0*(\d+)$", name)
if not m:
return 99 # unknown structure
n = int(m.group(1))
if 13 <= n <= 19:
return 0 # hand-sorted
if 1 <= n <= 12:
return 1 # auto-clustered
if 20 <= n <= 25:
return 2 # osrc
if 26 <= n <= 264:
return 3 # immich peter
if 265 <= n:
return 4 # immich nic and beyond
return 99
def is_era_split(name: str) -> bool:
return bool(ERA_SPLIT_RE.match(name))
def faceset_centroid(faceset_dir: Path, rec_index, alias_map):
"""Return (centroid, n_used, n_missing) where centroid is L2-normalized mean
of embeddings of the faces listed in the per-faceset manifest. Falls back to
None if too few embeddings found."""
manifest = faceset_dir / "manifest.json"
if not manifest.exists():
return None, 0, 0
m = json.loads(manifest.read_text())
vecs = []
n_missing = 0
for f in m.get("faces", []):
src = f.get("source")
bbox = f.get("bbox")
if src is None or bbox is None:
n_missing += 1
continue
bbox_t = tuple(int(x) for x in bbox)
canon = alias_map.get(src, src)
v = rec_index.get((canon, bbox_t))
if v is None and canon != src:
v = rec_index.get((src, bbox_t))
if v is None:
n_missing += 1
continue
vecs.append(v)
if len(vecs) < 3:
return None, len(vecs), n_missing
arr = np.stack(vecs).astype(np.float32)
c = arr.mean(axis=0)
n = float(np.linalg.norm(c))
if n > 0:
c = c / n
return c, len(vecs), n_missing
def connected_components(adj: dict[int, set[int]]) -> list[list[int]]:
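# Iterative DFS with an explicit stack, so large graphs cannot hit Python's
# recursion limit.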
seen: set[int] = set()
comps = []
for node in adj:
if node in seen:
continue
stack = [node]
comp = []
while stack:
x = stack.pop()
if x in seen:
continue
seen.add(x)
comp.append(x)
for y in adj.get(x, set()):
if y not in seen:
stack.append(y)
comps.append(sorted(comp))
return comps
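# Toy sketch (invented numbers) of why cmd_analyze below cuts a complete-linkage
# tree instead of using the connected_components helper above: with
# sim(A,B)=0.60, sim(B,C)=0.55, sim(A,C)=0.20 and edge=0.45, a components cut
# would chain all three through B, while complete link refuses to merge A with C
# (pair sim 0.20 < 0.45), so the cut yields {A, B} and {C}; every surviving
# group has ALL pairwise sims >= edge.
def _demo_complete_link_cut():
    sim = np.array([[1.00, 0.60, 0.20],
                    [0.60, 1.00, 0.55],
                    [0.20, 0.55, 1.00]], dtype=np.float32)
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="complete")
    return fcluster(Z, t=1.0 - 0.45, criterion="distance")  # e.g. [1, 1, 2]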
# ----------------------------- analyze -----------------------------
def cmd_analyze(args):
rec_index, alias_map = load_caches()
# collect active facesets
active = []
for d in sorted(ROOT.iterdir()):
if not d.is_dir() or d.name.startswith("_"):
continue
if is_era_split(d.name):
continue
active.append(d)
print(f"[scan] {len(active)} active facesets (era splits + _masked + _thin excluded)", file=sys.stderr)
centroids: dict[str, np.ndarray] = {}
sizes: dict[str, int] = {}
skipped = []
t0 = time.time()
for fs in active:
c, n_used, n_miss = faceset_centroid(fs, rec_index, alias_map)
if c is None:
skipped.append((fs.name, n_used, n_miss))
continue
centroids[fs.name] = c
sizes[fs.name] = n_used
print(f"[centroid] {len(centroids)} facesets centroided in {time.time()-t0:.1f}s; "
f"{len(skipped)} skipped (too few embeddings)", file=sys.stderr)
if skipped:
for n, u, m in skipped[:10]:
print(f" skip {n}: used={u} missing={m}", file=sys.stderr)
if len(skipped) > 10:
print(f" ... +{len(skipped)-10} more", file=sys.stderr)
names = sorted(centroids.keys())
if not names:
raise SystemExit("no centroids built")
# similarity matrix
M = np.stack([centroids[n] for n in names]).astype(np.float32) # (N, 512), normalized
sim = M @ M.T # (N, N) cosine since unit-normalized
np.clip(sim, -1.0, 1.0, out=sim)
edge_thr = args.edge
confident_thr = args.confident
# complete-linkage agglomerative clustering on cosine distance.
# Cut at edge threshold: groups are guaranteed to have ALL pairs sim >= edge_thr.
# This avoids the chaining problem of single-link / connected-components.
n = len(names)
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)
# symmetrize numerical noise
dist = (dist + dist.T) / 2.0
np.clip(dist, 0.0, 2.0, out=dist)
cond = squareform(dist, checks=False)
Z = linkage(cond, method="complete")
cut_dist = 1.0 - edge_thr # complete-link distance corresponds to (1 - min sim)
labels = fcluster(Z, t=cut_dist, criterion="distance") # 1-indexed cluster ids
cluster_members: dict[int, list[int]] = {}
for idx, lbl in enumerate(labels):
cluster_members.setdefault(int(lbl), []).append(idx)
comps = [sorted(idxs) for idxs in cluster_members.values() if len(idxs) > 1]
n_pairs_in_groups = 0
for c in comps:
n_pairs_in_groups += len(c) * (len(c) - 1) // 2
print(f"[graph] complete-linkage cut at sim>={edge_thr}: {len(comps)} multi-faceset groups "
f"({n_pairs_in_groups} within-group pairs)", file=sys.stderr)
# pick primary per group: lowest tier number, then largest size
groups_out = []
for comp in comps:
members = [names[i] for i in comp]
members_sorted = sorted(members, key=lambda x: (faceset_tier(x), -sizes.get(x, 0), x))
primary = members_sorted[0]
secondaries = members_sorted[1:]
# gather pairwise sims within group
pair_sims = []
idx_of = {names[i]: i for i in comp}
for a in members:
for b in members:
if a >= b:
continue
pair_sims.append({"a": a, "b": b, "sim": round(float(sim[idx_of[a], idx_of[b]]), 4)})
# confidence: minimum within-group sim (the weakest link)
min_link = min(p["sim"] for p in pair_sims)
max_link = max(p["sim"] for p in pair_sims)
confidence = "confident" if min_link >= confident_thr else "uncertain"
groups_out.append({
"primary": primary,
"secondaries": secondaries,
"members": members_sorted,
"tiers": {n: faceset_tier(n) for n in members},
"sizes": {n: sizes.get(n, 0) for n in members},
"pair_sims": pair_sims,
"min_link": round(min_link, 4),
"max_link": round(max_link, 4),
"confidence": confidence,
})
# sort: confident first, then by max_link desc
groups_out.sort(key=lambda g: (0 if g["confidence"] == "confident" else 1, -g["max_link"]))
out = {
"thresholds": {"edge": edge_thr, "confident": confident_thr},
"n_active": len(active),
"n_centroided": len(centroids),
"n_skipped": len(skipped),
"skipped_reasons": [{"name": n, "used": u, "missing": m} for n, u, m in skipped],
"n_groups": len(groups_out),
"n_facesets_in_groups": sum(len(g["members"]) for g in groups_out),
"groups": groups_out,
}
op = Path(args.out)
op.parent.mkdir(parents=True, exist_ok=True)
op.write_text(json.dumps(out, indent=2))
confident = sum(1 for g in groups_out if g["confidence"] == "confident")
uncertain = sum(1 for g in groups_out if g["confidence"] == "uncertain")
print(f"[done] {len(groups_out)} groups ({confident} confident, {uncertain} uncertain) -> {op}", file=sys.stderr)
# ----------------------------- report -----------------------------
def cmd_report(args):
candidates = json.loads(Path(args.candidates).read_text())
out_dir = Path(args.out)
thumbs_dir = out_dir / "thumbs"
thumbs_dir.mkdir(parents=True, exist_ok=True)
THUMB = 140
THUMBS_PER_FACESET = 4
def make_thumb(faceset: str, fname: str) -> str:
d = thumbs_dir / faceset
d.mkdir(parents=True, exist_ok=True)
dst = d / (Path(fname).stem + ".jpg")
if not dst.exists():
try:
src = ROOT / faceset / "faces" / fname
img = Image.open(src).convert("RGB")
img.thumbnail((THUMB, THUMB), Image.LANCZOS)
img.save(dst, "JPEG", quality=82)
except Exception as e:
print(f"[thumb-skip] {faceset}/{fname}: {e}", file=sys.stderr)
return ""
return f"thumbs/{faceset}/{Path(fname).stem}.jpg"
rows = []
for gi, g in enumerate(candidates["groups"]):
primary = g["primary"]
sec = g["secondaries"]
conf_cls = "confident" if g["confidence"] == "confident" else "uncertain"
rows.append(f"<section class='grp {conf_cls}' id='g{gi}'>")
rows.append(f"<h2>group #{gi+1} <small>({g['confidence']}; min_sim={g['min_link']:.3f}, max_sim={g['max_link']:.3f})</small></h2>")
rows.append(f"<div class='plan'>merge <b>{', '.join(sec)}</b> &rarr; <b>{primary}</b></div>")
# member rows
for name in g["members"]:
tier = g["tiers"][name]
sz = g["sizes"][name]
tier_label = ["hand-sorted", "auto", "osrc", "immich-peter", "immich-nic", "?"][min(tier, 5)]
badge = "PRIMARY" if name == primary else "secondary"
rows.append(f"<div class='member'>")
rows.append(f"<div class='label'><span class='badge {badge.lower()}'>{badge}</span> "
f"<b>{name}</b> <small>tier={tier_label} · n={sz}</small></div>")
rows.append("<div class='thumbs'>")
faces_dir = ROOT / name / "faces"
files = sorted(faces_dir.glob("*.png"))[:THUMBS_PER_FACESET]
for f in files:
rel = make_thumb(name, f.name)
if rel:
rows.append(f"<img src='{rel}' loading='lazy' title='{f.name}'>")
rows.append("</div></div>")
# pairwise sims
rows.append("<table class='sims'><tr><th>a</th><th>b</th><th>sim</th></tr>")
for ps in sorted(g["pair_sims"], key=lambda x: -x["sim"]):
cls = "hi" if ps["sim"] >= candidates["thresholds"]["confident"] else "mid"
rows.append(f"<tr><td>{ps['a']}</td><td>{ps['b']}</td><td class='{cls}'>{ps['sim']:.3f}</td></tr>")
rows.append("</table>")
rows.append("</section>")
nav = " · ".join(f"<a href='#g{i}'>#{i+1}</a>" for i in range(len(candidates["groups"])))
html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Faceset merge review</title>
<style>
body {{ font-family: system-ui, sans-serif; background: #111; color: #eee; padding: 1em; }}
h1 {{ margin-top: 0; }}
h2 {{ margin: 0; }}
small {{ color: #999; font-weight: normal; }}
section.grp {{ background: #1a1a1a; border-radius: 6px; padding: 12px; margin: 12px 0; }}
section.grp.confident {{ border-left: 4px solid #5fa05f; }}
section.grp.uncertain {{ border-left: 4px solid #ffb050; }}
.plan {{ margin: .5em 0; color: #6cf; }}
.member {{ margin: 8px 0; padding: 6px; background: #222; border-radius: 4px; }}
.label {{ font-family: monospace; font-size: 13px; }}
.badge {{ display: inline-block; padding: 0 6px; font-size: 10px; border-radius: 2px; }}
.badge.primary {{ background: #5fa05f; color: #000; font-weight: bold; }}
.badge.secondary {{ background: #444; color: #ccc; }}
.thumbs {{ display: flex; gap: 4px; margin-top: 4px; flex-wrap: wrap; }}
.thumbs img {{ height: 140px; width: auto; border-radius: 3px; }}
table.sims {{ font-family: monospace; font-size: 11px; margin-top: 6px; border-collapse: collapse; }}
table.sims td, table.sims th {{ padding: 1px 8px; border: 1px solid #333; text-align: left; }}
table.sims td.hi {{ color: #5fa05f; font-weight: bold; }}
table.sims td.mid {{ color: #ffb050; }}
.nav {{ position: sticky; top: 0; background: #111; padding: .5em 0; border-bottom: 1px solid #333; font-size: 12px; }}
a {{ color: #6cf; }}
</style></head>
<body>
<h1>Merge review &mdash; {len(candidates['groups'])} candidate groups
<small>(edge>={candidates['thresholds']['edge']}, confident>={candidates['thresholds']['confident']})</small></h1>
<p>{candidates['n_centroided']} of {candidates['n_active']} active facesets centroided
(skipped {candidates['n_skipped']} for too few cached embeddings).
Green = confident (min within-group sim >= {candidates['thresholds']['confident']}); orange = uncertain.</p>
<div class='nav'>{nav}</div>
{''.join(rows)}
</body></html>"""
out_html = out_dir / "index.html"
out_html.write_text(html)
print(f"[done] {out_html}", file=sys.stderr)
# ----------------------------- apply -----------------------------
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
import zipfile
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
for i, p in enumerate(pngs):
zf.write(p, arcname=f"{i:04d}.png")
def cmd_apply(args):
candidates = json.loads(Path(args.candidates).read_text())
master_path = ROOT / "manifest.json"
master = json.loads(master_path.read_text())
by_name = {f["name"]: f for f in master.get("facesets", [])}
# filter: skip "uncertain" groups unless --include-uncertain
accepted = [g for g in candidates["groups"]
if g["confidence"] == "confident" or args.include_uncertain]
skipped_unc = [g for g in candidates["groups"]
if g["confidence"] == "uncertain" and not args.include_uncertain]
# explicit --exclude / --only filters (group indices in the candidates file)
    if args.only:
        only = {int(s) for s in args.only.split(",")}
        accepted = [g for i, g in enumerate(candidates["groups"]) if i in only]
    if args.exclude:
        excl = {int(s) for s in args.exclude.split(",")}
        # indices refer to positions in the candidates file, not the filtered list
        excl_groups = {id(g) for i, g in enumerate(candidates["groups"]) if i in excl}
        accepted = [g for g in accepted if id(g) not in excl_groups]
print(f"[plan] {len(accepted)} groups will be merged "
f"({len(skipped_unc)} uncertain skipped)", file=sys.stderr)
if args.dry_run:
for g in accepted:
print(f" merge {g['secondaries']} -> {g['primary']} "
f"({g['confidence']}, min_sim={g['min_link']:.3f})")
return
merged_dir = ROOT / "_merged"
merged_dir.mkdir(exist_ok=True)
new_facesets: list[dict] = []
new_merged: list[dict] = list(master.get("merged", []))
consumed_names: set[str] = set()
primary_updates: dict[str, dict] = {} # name -> new entry
primary_absorbed: dict[str, list[dict]] = {} # primary_name -> [secondary entries]
for g in accepted:
primary = g["primary"]
if primary not in by_name:
print(f"[warn] primary {primary} not in master; skipping group", file=sys.stderr)
continue
primary_dir = ROOT / primary
if not primary_dir.is_dir():
print(f"[warn] primary dir {primary_dir} missing; skipping group", file=sys.stderr)
continue
primary_faces = primary_dir / "faces"
primary_manifest_path = primary_dir / "manifest.json"
primary_manifest = json.loads(primary_manifest_path.read_text())
# gather all face entries: primary + each secondary
combined_faces: list[dict] = list(primary_manifest.get("faces", []))
    # tag the primary's own faces so every combined entry carries provenance
for f in combined_faces:
f.setdefault("origin_faceset", primary)
for sec in g["secondaries"]:
sec_dir = ROOT / sec
if not sec_dir.is_dir():
print(f"[warn] secondary {sec} missing; skipping", file=sys.stderr)
continue
sec_manifest_path = sec_dir / "manifest.json"
sec_manifest = json.loads(sec_manifest_path.read_text()) if sec_manifest_path.exists() else {"faces": []}
for f in sec_manifest.get("faces", []):
f = dict(f)
f["origin_faceset"] = sec
combined_faces.append(f)
# rank by quality.composite descending; ties broken by lower cosd_centroid
def sort_key(f):
q = f.get("quality", {}).get("composite", 0)
d = f.get("cosd_centroid", 1.0)
return (-q, d)
combined_faces.sort(key=sort_key)
# renumber and stage PNGs into a fresh staging dir, then atomically swap
staging = primary_dir / "_faces_new"
if staging.exists():
shutil.rmtree(staging)
staging.mkdir()
new_face_entries = []
for new_rank, f in enumerate(combined_faces, start=1):
origin = f.pop("origin_faceset")
old_png_rel = f["png"] # e.g. "faces/0042.png"
old_png_name = Path(old_png_rel).name
origin_png = ROOT / origin / "faces" / old_png_name
if not origin_png.exists():
# could be in _dropped if occlusion-pruned; skip
continue
new_name = f"{new_rank:04d}.png"
shutil.copy2(origin_png, staging / new_name)
f = dict(f)
f["rank"] = new_rank
f["png"] = f"faces/{new_name}"
f["origin_faceset"] = origin # preserve provenance in manifest
new_face_entries.append(f)
# swap directories: primary/faces -> primary/_faces_old, staging -> primary/faces
old_faces_holding = primary_dir / "_faces_old"
if old_faces_holding.exists():
shutil.rmtree(old_faces_holding)
if primary_faces.exists():
primary_faces.rename(old_faces_holding)
staging.rename(primary_faces)
# migrate _dropped/ from old holding (so occlusion-pruned PNGs remain accessible)
old_dropped = old_faces_holding / "_dropped"
if old_dropped.exists():
(primary_faces / "_dropped").mkdir(exist_ok=True)
for x in old_dropped.iterdir():
shutil.move(str(x), str(primary_faces / "_dropped" / x.name))
shutil.rmtree(old_faces_holding)
# re-zip .fsz
survivor_pngs = sorted(primary_faces.glob("*.png"))
top_n = primary_manifest.get("top_n", 30)
top_n_eff = min(top_n, len(survivor_pngs))
# remove old .fsz files
for old in primary_dir.glob("*.fsz"):
old.unlink()
top_fsz_name = f"{primary}_top{top_n_eff}.fsz"
all_fsz_name = f"{primary}_all.fsz"
_zip_png_list(survivor_pngs[:top_n_eff], primary_dir / top_fsz_name)
if len(survivor_pngs) > top_n_eff:
_zip_png_list(survivor_pngs, primary_dir / all_fsz_name)
all_fsz_used = all_fsz_name
else:
all_fsz_used = None
# update primary's local manifest
primary_manifest["faces"] = new_face_entries
primary_manifest["exported"] = len(new_face_entries)
primary_manifest["fsz_top"] = top_fsz_name
primary_manifest["fsz_all"] = all_fsz_used
primary_manifest["top_n"] = top_n_eff
primary_manifest.setdefault("merge_history", []).append({
"absorbed": g["secondaries"],
"min_link": g["min_link"],
"max_link": g["max_link"],
"confidence": g["confidence"],
})
primary_manifest_path.write_text(json.dumps(primary_manifest, indent=2))
# move secondary directories into _merged/
absorbed_master_entries: list[dict] = []
for sec in g["secondaries"]:
sec_dir = ROOT / sec
target = merged_dir / sec
if not sec_dir.is_dir():
continue
if target.exists():
shutil.rmtree(sec_dir) # already moved by previous run; clean stub
else:
shutil.move(str(sec_dir), str(target))
sec_master = dict(by_name.get(sec, {"name": sec}))
sec_master["merged_into"] = primary
sec_master["relpath"] = f"_merged/{sec}"
sec_master["fsz_top"] = None
sec_master["fsz_all"] = None
absorbed_master_entries.append(sec_master)
consumed_names.add(sec)
new_merged.extend(absorbed_master_entries)
# bump primary master entry
prim_master = dict(by_name[primary])
prim_master["exported"] = len(new_face_entries)
prim_master["top_n"] = top_n_eff
prim_master["fsz_top"] = top_fsz_name
prim_master["fsz_all"] = all_fsz_used
prim_master.setdefault("merge_history", []).append({
"absorbed": g["secondaries"],
"min_link": g["min_link"],
"max_link": g["max_link"],
})
primary_updates[primary] = prim_master
print(f"[merged] {g['secondaries']} -> {primary} "
f"now {len(new_face_entries)} png", file=sys.stderr)
# rebuild master facesets list
for entry in master.get("facesets", []):
nm = entry["name"]
if nm in consumed_names:
continue
if nm in primary_updates:
new_facesets.append(primary_updates[nm])
else:
new_facesets.append(entry)
new_master = dict(master)
new_master["facesets"] = new_facesets
new_master["merged"] = new_merged
new_master["merge_run"] = {
"thresholds": candidates["thresholds"],
"groups_applied": len(accepted),
"facesets_consumed": len(consumed_names),
"include_uncertain": bool(args.include_uncertain),
}
tmp = master_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(new_master, indent=2))
tmp.replace(master_path)
print(f"[done] master manifest updated: {len(new_facesets)} active, "
f"{len(new_merged)} merged, {len(consumed_names)} consumed in this run",
file=sys.stderr)
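# Minimal sketch of the stage-then-rename swap used in cmd_apply above, run on a
# throwaway temp tree (names here are hypothetical): build the new faces/ next
# to the old one, then two renames make the cutover; a crash mid-way leaves
# either the old tree or both trees on disk, never a half-written mix.
def _demo_atomic_dir_swap():
    import tempfile
    base = Path(tempfile.mkdtemp())
    (base / "faces").mkdir()
    (base / "faces" / "0001.png").write_bytes(b"old")
    staging = base / "_faces_new"
    staging.mkdir()
    (staging / "0001.png").write_bytes(b"new")
    (base / "faces").rename(base / "_faces_old")   # step 1: park the old tree
    staging.rename(base / "faces")                 # step 2: promote staging
    shutil.rmtree(base / "_faces_old")             # cleanup once promoted
    data = (base / "faces" / "0001.png").read_bytes()  # b"new"
    shutil.rmtree(base)  # tidy up the throwaway tree
    return data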
# ----------------------------- main -----------------------------
def main():
ap = argparse.ArgumentParser()
sub = ap.add_subparsers(dest="cmd", required=True)
a = sub.add_parser("analyze")
a.add_argument("--out", required=True)
a.add_argument("--edge", type=float, default=0.45, help="min cosine sim to draw an edge (default 0.45)")
a.add_argument("--confident", type=float, default=0.65, help="min within-group sim to be confident (default 0.65)")
a.set_defaults(func=cmd_analyze)
r = sub.add_parser("report")
r.add_argument("--candidates", required=True)
r.add_argument("--out", required=True)
r.set_defaults(func=cmd_report)
p = sub.add_parser("apply")
p.add_argument("--candidates", required=True)
p.add_argument("--include-uncertain", action="store_true",
help="apply uncertain groups too (default: confident only)")
p.add_argument("--only", default=None, help="comma-separated group indices to apply")
p.add_argument("--exclude", default=None, help="comma-separated group indices to skip")
p.add_argument("--dry-run", action="store_true")
p.set_defaults(func=cmd_apply)
args = ap.parse_args()
args.func(args)
if __name__ == "__main__":
main()
@@ -0,0 +1,594 @@
"""Corpus-wide dedup + roop-unleashed optimization.
Two passes:
1. Cross-family byte-identical PNG dedup (same SHA256 in two different identity
families) — keep the higher-tier family copy. Era splits of the same parent
identity (faceset_NNN_*) are intentional duplications and are NOT deduped
within their family.
2. Within-faceset near-duplicate dedup using cached arcface embeddings
(cosine sim >= 0.95). Keep highest quality.composite, drop the rest.
Plus a Windows-DML multi-face audit (separate phase via clip_worker-style split):
3. Re-detect each PNG with insightface; flag any with 0 or >1 detected faces.
The roop loader appends every detected face per PNG, so multi-face crops
pollute identity averaging.
All flagged PNGs are MOVED to <faceset>/faces/_dropped/ (reversible). Affected
.fsz files are re-zipped, manifests updated.
CLI:
analyze --out work/dedup_audit/dedup_plan.json
apply --plan ... [--dry-run]
stage_multiface --out work/dedup_audit/multiface_queue.json
merge_multiface --results <worker_out> --out work/dedup_audit/multiface_plan.json
apply_multiface --plan ... [--dry-run]
report --dedup ... --multiface ... --out work/dedup_audit
"""
from __future__ import annotations
import argparse
import hashlib
import json
import re
import shutil
import sys
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import numpy as np
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
WIN_ROOT = r"E:\temp_things\fcswp\nl_sorted\facesets_swap_ready"
CACHES = [
Path("/opt/face-sets/work/cache/nl_full.npz"),
Path("/opt/face-sets/work/cache/immich_peter.npz"),
Path("/opt/face-sets/work/cache/immich_nic.npz"),
]
NEAR_DUP_THRESHOLD = 0.95
HASH_PARALLEL = 16
# ----------------------------- helpers -----------------------------
def faceset_tier(name: str) -> int:
m = re.match(r"^faceset_0*(\d+)(?:_.+)?$", name)
if not m:
return 99
n = int(m.group(1))
if 13 <= n <= 19:
return 0
if 1 <= n <= 12:
return 1
if 20 <= n <= 25:
return 2
if 26 <= n <= 264:
return 3
if 265 <= n:
return 4
return 99
def faceset_family(name: str) -> str:
"""faceset_001_2010-13 → faceset_001; faceset_001 → faceset_001."""
m = re.match(r"^(faceset_\d+)(?:_.+)?$", name)
return m.group(1) if m else name
def wsl_to_win(p: str) -> str:
s = str(p)
if s.startswith("/mnt/"):
return f"{s[5].upper()}:\\{s[7:].replace('/', chr(92))}"
return s
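# e.g. wsl_to_win("/mnt/e/temp_things/x.png") -> "E:\temp_things\x.png" (illustrative path)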
def iter_active_facesets() -> list[Path]:
out = []
for d in sorted(ROOT.iterdir()):
if d.is_dir() and not d.name.startswith("_"):
out.append(d)
return out
def sha256_file(p: Path) -> str:
h = hashlib.sha256()
with open(p, "rb") as f:
while True:
b = f.read(1 << 20)
if not b:
break
h.update(b)
return h.hexdigest()
def load_caches():
rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
alias_map: dict[str, str] = {}
for c in CACHES:
if not c.exists():
continue
d = np.load(c, allow_pickle=True)
emb = d["embeddings"]
meta = json.loads(str(d["meta"]))
face_records = [m for m in meta if not m.get("noface")]
if "path_aliases" in d.files:
paliases = json.loads(str(d["path_aliases"]))
for canon, alist in paliases.items():
alias_map.setdefault(canon, canon)
for a in alist:
alias_map[a] = canon
for i, rec in enumerate(face_records):
p = rec["path"]
bbox = tuple(int(x) for x in rec["bbox"])
v = emb[i].astype(np.float32)
n = float(np.linalg.norm(v))
if n > 0:
v = v / n
rec_index[(p, bbox)] = v
alias_map.setdefault(p, p)
return rec_index, alias_map
def lookup_emb(rec_index, alias_map, src: str, bbox):
bbox_t = tuple(int(x) for x in bbox)
canon = alias_map.get(src, src)
v = rec_index.get((canon, bbox_t))
if v is None and canon != src:
v = rec_index.get((src, bbox_t))
return v
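# Toy sketch (invented vectors) of the near-dup grouping cmd_analyze performs
# below: unit-normalize, take the pairwise cosine matrix, blank the diagonal,
# and threshold; a and b land in one group, c stays apart, and the
# highest-quality member of each group is the keeper.
def _demo_near_dup_groups():
    a = np.array([1.0, 0.0], dtype=np.float32)
    b = np.array([0.999, 0.045], dtype=np.float32)  # ~0.999 cosine to a
    c = np.array([0.0, 1.0], dtype=np.float32)
    M = np.stack([v / np.linalg.norm(v) for v in (a, b, c)])
    sim = M @ M.T
    np.fill_diagonal(sim, -1)  # ignore self-similarity
    return np.argwhere(sim >= NEAR_DUP_THRESHOLD)  # -> [[0, 1], [1, 0]]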
# ----------------------------- analyze -----------------------------
def cmd_analyze(args):
rec_index, alias_map = load_caches()
facesets = iter_active_facesets()
print(f"[scan] {len(facesets)} active facesets", file=sys.stderr)
# Phase 1: walk every PNG, collect (faceset, file, src, bbox, quality, emb, sha256)
all_pngs = [] # list of dicts
t0 = time.time()
for fs in facesets:
manifest_path = fs / "manifest.json"
if not manifest_path.exists():
continue
m = json.loads(manifest_path.read_text())
for f in m.get("faces", []):
png_rel = f.get("png")
if not png_rel:
continue
disk_path = fs / png_rel
if not disk_path.exists():
continue
all_pngs.append({
"faceset": fs.name,
"family": faceset_family(fs.name),
"tier": faceset_tier(fs.name),
"file": Path(png_rel).name,
"rank": f.get("rank"),
"source": f.get("source"),
"bbox": f.get("bbox"),
"quality": f.get("quality", {}).get("composite", 0),
"disk_path": str(disk_path),
})
print(f"[scan] {len(all_pngs)} PNGs walked in {time.time()-t0:.1f}s", file=sys.stderr)
# Phase 2: SHA256 hash each PNG (parallel I/O)
t0 = time.time()
def _hash_one(idx):
all_pngs[idx]["sha256"] = sha256_file(Path(all_pngs[idx]["disk_path"]))
with ThreadPoolExecutor(max_workers=HASH_PARALLEL) as ex:
# exhaust the iterator to actually run
for _ in ex.map(_hash_one, range(len(all_pngs)), chunksize=16):
pass
print(f"[hash] {len(all_pngs)} PNGs hashed in {time.time()-t0:.1f}s", file=sys.stderr)
# Phase 3: cross-family byte-dedup
by_sha: dict[str, list[int]] = {}
for i, p in enumerate(all_pngs):
by_sha.setdefault(p["sha256"], []).append(i)
cross_family_groups = []
byte_drops: set[int] = set() # indices of PNGs to drop
for sha, idxs in by_sha.items():
if len(idxs) < 2:
continue
families = {all_pngs[i]["family"] for i in idxs}
if len(families) < 2:
continue # all in same family — intentional era duplication
# multiple families share this content → dedup keeping the best one
cross_family_groups.append({"sha256": sha, "members": [
{"faceset": all_pngs[i]["faceset"], "file": all_pngs[i]["file"],
"tier": all_pngs[i]["tier"], "quality": all_pngs[i]["quality"],
"rank": all_pngs[i]["rank"]} for i in idxs
]})
# keeper rule: lowest tier number, then highest quality
best = sorted(idxs, key=lambda i: (all_pngs[i]["tier"], -all_pngs[i]["quality"]))[0]
for i in idxs:
# NEVER drop within-family copies (preserve era duplication intentionally)
# We only drop indices whose family != best's family
if i != best and all_pngs[i]["family"] != all_pngs[best]["family"]:
byte_drops.add(i)
print(f"[byte] {len(cross_family_groups)} cross-family hash groups; "
f"{len(byte_drops)} PNGs marked for byte-dedup drop", file=sys.stderr)
# Phase 4: within-faceset near-dup (embedding sim >= threshold)
by_faceset: dict[str, list[int]] = {}
for i, p in enumerate(all_pngs):
by_faceset.setdefault(p["faceset"], []).append(i)
near_dup_groups = []
near_drops: set[int] = set()
miss_emb_total = 0
t0 = time.time()
for fs_name, idxs in by_faceset.items():
if len(idxs) < 2:
continue
# gather embeddings
embs = []
kept_idxs = []
for i in idxs:
v = lookup_emb(rec_index, alias_map, all_pngs[i]["source"], all_pngs[i]["bbox"])
if v is None:
miss_emb_total += 1
continue
embs.append(v)
kept_idxs.append(i)
if len(kept_idxs) < 2:
continue
M = np.stack(embs).astype(np.float32)
sim = M @ M.T
np.fill_diagonal(sim, -1) # ignore self
# find connected components in the (sim >= threshold) graph
adj = {k: set() for k in range(len(kept_idxs))}
for a in range(len(kept_idxs)):
# only check a < b to avoid double work
hi = np.where(sim[a, a+1:] >= NEAR_DUP_THRESHOLD)[0]
for off in hi:
b = a + 1 + int(off)
adj[a].add(b)
adj[b].add(a)
seen = set()
for k in adj:
if k in seen or not adj[k]:
continue
stack = [k]
comp = []
while stack:
x = stack.pop()
if x in seen:
continue
seen.add(x)
comp.append(x)
for y in adj[x]:
if y not in seen:
stack.append(y)
if len(comp) < 2:
continue
comp_idxs = [kept_idxs[c] for c in comp]
# keeper: highest quality.composite, tie-break: lowest rank
best = sorted(comp_idxs, key=lambda i: (-all_pngs[i]["quality"], all_pngs[i]["rank"] or 9999))[0]
sims_in_group = []
for ci in range(len(comp)):
for cj in range(ci+1, len(comp)):
sims_in_group.append(float(sim[comp[ci], comp[cj]]))
near_dup_groups.append({
"faceset": fs_name,
"members": [{"file": all_pngs[i]["file"], "rank": all_pngs[i]["rank"],
"quality": all_pngs[i]["quality"]} for i in comp_idxs],
"keeper": all_pngs[best]["file"],
"min_sim": min(sims_in_group) if sims_in_group else None,
"max_sim": max(sims_in_group) if sims_in_group else None,
})
for i in comp_idxs:
if i != best:
near_drops.add(i)
print(f"[near] {len(near_dup_groups)} near-dup groups; "
f"{len(near_drops)} PNGs marked for near-dup drop "
f"(miss_emb={miss_emb_total}); {time.time()-t0:.1f}s", file=sys.stderr)
# Combined drop set; for output, group by faceset
all_drops = byte_drops | near_drops
drops_by_faceset: dict[str, list] = {}
for i in all_drops:
p = all_pngs[i]
reason = []
if i in byte_drops: reason.append("byte_dup")
if i in near_drops: reason.append("near_dup")
drops_by_faceset.setdefault(p["faceset"], []).append({
"file": p["file"], "rank": p["rank"], "reason": "+".join(reason),
"sha256": p["sha256"], "quality": p["quality"],
})
plan = {
"thresholds": {"near_dup_sim": NEAR_DUP_THRESHOLD},
"totals": {
"active_facesets": len(facesets),
"active_pngs": len(all_pngs),
"byte_dup_groups": len(cross_family_groups),
"byte_dup_drops": len(byte_drops),
"near_dup_groups": len(near_dup_groups),
"near_dup_drops": len(near_drops),
"all_drops": len(all_drops),
"facesets_affected": len(drops_by_faceset),
},
"byte_dup_groups": cross_family_groups,
"near_dup_groups": near_dup_groups,
"drops_by_faceset": drops_by_faceset,
}
op = Path(args.out)
op.parent.mkdir(parents=True, exist_ok=True)
op.write_text(json.dumps(plan, indent=2))
print(f"[done] plan -> {op}", file=sys.stderr)
# ----------------------------- apply -----------------------------
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
import zipfile
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
for i, p in enumerate(pngs):
zf.write(p, arcname=f"{i:04d}.png")
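# Illustrative helper (hypothetical caller): a .fsz produced by _zip_png_list is
# a plain DEFLATE zip of renumbered PNGs, so a downstream tool can inspect one
# with the stdlib alone.
def _demo_read_fsz(zip_path: Path) -> list[str]:
    import zipfile
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()  # ["0000.png", "0001.png", ...]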
def _apply_drops_to_facesets(drops_by_faceset: dict[str, list], reason_label: str, master_path: Path):
"""Move flagged PNGs to <faceset>/faces/_dropped/, rebuild manifests + .fsz.
drops_by_faceset values are lists of {"file": str, ...}.
Returns total moved + counts per faceset."""
master = json.loads(master_path.read_text())
by_name = {f["name"]: f for f in master.get("facesets", [])}
total_moved = 0
per_faceset_counts = {}
for fs_name, drops in drops_by_faceset.items():
fs_dir = ROOT / fs_name
if not fs_dir.is_dir():
print(f"[warn] {fs_name}: dir missing, skip", file=sys.stderr)
continue
faces_dir = fs_dir / "faces"
dropped_dir = faces_dir / "_dropped"
dropped_dir.mkdir(exist_ok=True)
drop_files = {d["file"] for d in drops}
moved_here = 0
for fname in sorted(drop_files):
src = faces_dir / fname
if not src.exists():
continue
shutil.move(str(src), str(dropped_dir / fname))
moved_here += 1
# rebuild manifest by filtering out dropped files
manifest_path = fs_dir / "manifest.json"
if manifest_path.exists():
mm = json.loads(manifest_path.read_text())
new_faces = [f for f in mm.get("faces", []) if Path(f.get("png", "")).name not in drop_files]
mm["faces"] = new_faces
mm["exported"] = len(new_faces)
mm.setdefault(f"{reason_label}_history", []).append({"dropped": moved_here})
# re-zip
survivor_pngs = sorted(faces_dir.glob("*.png"))
top_n = mm.get("top_n", 30)
top_n_eff = min(top_n, len(survivor_pngs))
for old in fs_dir.glob("*.fsz"):
old.unlink()
top_fsz_name = f"{fs_name}_top{top_n_eff}.fsz"
all_fsz_name = f"{fs_name}_all.fsz"
if top_n_eff > 0:
_zip_png_list(survivor_pngs[:top_n_eff], fs_dir / top_fsz_name)
mm["fsz_top"] = top_fsz_name
mm["top_n"] = top_n_eff
else:
mm["fsz_top"] = None
mm["top_n"] = 0
if len(survivor_pngs) > top_n_eff:
_zip_png_list(survivor_pngs, fs_dir / all_fsz_name)
mm["fsz_all"] = all_fsz_name
else:
mm["fsz_all"] = None
manifest_path.write_text(json.dumps(mm, indent=2))
if fs_name in by_name:
by_name[fs_name]["exported"] = len(new_faces)
by_name[fs_name]["fsz_top"] = mm["fsz_top"]
by_name[fs_name]["fsz_all"] = mm["fsz_all"]
by_name[fs_name]["top_n"] = mm["top_n"]
by_name[fs_name].setdefault(f"{reason_label}_dropped", 0)
by_name[fs_name][f"{reason_label}_dropped"] += moved_here
total_moved += moved_here
per_faceset_counts[fs_name] = moved_here
# rewrite master with same ordering
new_facesets = [by_name.get(e["name"], e) for e in master.get("facesets", [])]
master["facesets"] = new_facesets
master.setdefault(f"{reason_label}_runs", []).append({
"facesets_affected": len(per_faceset_counts),
"pngs_moved": total_moved,
})
tmp = master_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(master, indent=2))
tmp.replace(master_path)
return total_moved, per_faceset_counts
def cmd_apply(args):
plan = json.loads(Path(args.plan).read_text())
drops = plan["drops_by_faceset"]
if args.dry_run:
for fs, items in sorted(drops.items()):
reasons = {}
for it in items:
reasons[it["reason"]] = reasons.get(it["reason"], 0) + 1
print(f" {fs}: {len(items)} dropped ({reasons})")
print(f"=== total: {sum(len(v) for v in drops.values())} PNGs across {len(drops)} facesets ===")
return
master_path = ROOT / "manifest.json"
total, _ = _apply_drops_to_facesets(drops, "dedup", master_path)
print(f"[done] {total} PNGs moved to faces/_dropped/ across {len(drops)} facesets", file=sys.stderr)
# ----------------------------- multiface staging + apply -----------------------------
def cmd_stage_multiface(args):
"""Build queue.json of all currently-active PNGs in the corpus
for the Windows DML multi-face audit worker."""
queue = []
for fs in iter_active_facesets():
faces_dir = fs / "faces"
if not faces_dir.is_dir():
continue
for p in sorted(faces_dir.glob("*.png")):
queue.append({
"wsl_path": str(p),
"win_path": wsl_to_win(str(p)),
"faceset": fs.name,
"file": p.name,
})
op = Path(args.out)
op.parent.mkdir(parents=True, exist_ok=True)
op.write_text(json.dumps(queue, indent=2))
print(f"[stage] {len(queue)} PNGs -> {op}", file=sys.stderr)
def cmd_merge_multiface(args):
"""Convert worker results.json into a drops_by_faceset plan."""
src = json.loads(Path(args.results).read_text())
drops_by_faceset: dict[str, list] = {}
bad_count = 0
for r in src.get("results", []):
n_faces = r.get("face_count", -1)
if n_faces == 1:
continue
bad_count += 1
drops_by_faceset.setdefault(r["faceset"], []).append({
"file": r["file"],
"reason": f"multiface_{n_faces}",
"face_count": n_faces,
})
plan = {
"totals": {"bad_pngs": bad_count, "facesets_affected": len(drops_by_faceset),
"scored": len(src.get("results", []))},
"drops_by_faceset": drops_by_faceset,
}
op = Path(args.out)
op.parent.mkdir(parents=True, exist_ok=True)
op.write_text(json.dumps(plan, indent=2))
print(f"[merge] {bad_count} bad PNGs across {len(drops_by_faceset)} facesets -> {op}", file=sys.stderr)
def cmd_apply_multiface(args):
plan = json.loads(Path(args.plan).read_text())
drops = plan["drops_by_faceset"]
if args.dry_run:
for fs, items in sorted(drops.items()):
print(f" {fs}: {len(items)} bad PNG(s)")
print(f"=== total: {sum(len(v) for v in drops.values())} ===")
return
master_path = ROOT / "manifest.json"
total, _ = _apply_drops_to_facesets(drops, "multiface", master_path)
print(f"[done] {total} PNGs moved to faces/_dropped/ across {len(drops)} facesets", file=sys.stderr)
# ----------------------------- report -----------------------------
def cmd_report(args):
out_dir = Path(args.out)
out_dir.mkdir(parents=True, exist_ok=True)
sections = []
if args.dedup:
d = json.loads(Path(args.dedup).read_text())
t = d["totals"]
sections.append(f"<h2>Dedup</h2>")
sections.append(
f"<ul>"
f"<li>Active facesets: {t['active_facesets']}, active PNGs: {t['active_pngs']}</li>"
f"<li>Cross-family byte-dup groups: {t['byte_dup_groups']}{t['byte_dup_drops']} PNGs dropped</li>"
f"<li>Within-faceset near-dup groups (sim≥{d['thresholds']['near_dup_sim']}): {t['near_dup_groups']}{t['near_dup_drops']} PNGs dropped</li>"
f"<li><b>Total dedup drops: {t['all_drops']}</b> across {t['facesets_affected']} facesets</li>"
f"</ul>"
)
# top-N affected facesets
rows = sorted(d["drops_by_faceset"].items(), key=lambda x: -len(x[1]))[:25]
sections.append("<h3>Top 25 most-affected facesets</h3><table><tr><th>faceset</th><th>dropped</th><th>reasons</th></tr>")
for fs, items in rows:
r = {}
for it in items:
r[it["reason"]] = r.get(it["reason"], 0) + 1
sections.append(f"<tr><td>{fs}</td><td>{len(items)}</td><td>{r}</td></tr>")
sections.append("</table>")
if args.multiface:
m = json.loads(Path(args.multiface).read_text())
t = m["totals"]
sections.append("<h2>Multi-face audit</h2>")
sections.append(
f"<ul>"
f"<li>PNGs scored: {t['scored']}</li>"
f"<li>Bad PNGs (0 or >1 face): {t['bad_pngs']} across {t['facesets_affected']} facesets</li>"
f"</ul>"
)
html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Dedup + multi-face audit</title>
<style>
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
h1, h2, h3 {{ margin-top:1em; }}
table {{ border-collapse: collapse; font-family: monospace; font-size: 12px; }}
table td, table th {{ padding: 2px 8px; border: 1px solid #333; }}
ul li {{ margin: 4px 0; }}
</style></head>
<body>
<h1>facesets_swap_ready dedup + roop optimization audit</h1>
{''.join(sections)}
</body></html>"""
out_html = out_dir / "index.html"
out_html.write_text(html)
print(f"[done] {out_html}", file=sys.stderr)
# ----------------------------- main -----------------------------
def main():
ap = argparse.ArgumentParser()
sub = ap.add_subparsers(dest="cmd", required=True)
a = sub.add_parser("analyze")
a.add_argument("--out", required=True)
a.set_defaults(func=cmd_analyze)
p = sub.add_parser("apply")
p.add_argument("--plan", required=True)
p.add_argument("--dry-run", action="store_true")
p.set_defaults(func=cmd_apply)
sm = sub.add_parser("stage_multiface")
sm.add_argument("--out", required=True)
sm.set_defaults(func=cmd_stage_multiface)
mm = sub.add_parser("merge_multiface")
mm.add_argument("--results", required=True)
mm.add_argument("--out", required=True)
mm.set_defaults(func=cmd_merge_multiface)
am = sub.add_parser("apply_multiface")
am.add_argument("--plan", required=True)
am.add_argument("--dry-run", action="store_true")
am.set_defaults(func=cmd_apply_multiface)
r = sub.add_parser("report")
r.add_argument("--dedup", default=None)
r.add_argument("--multiface", default=None)
r.add_argument("--out", required=True)
r.set_defaults(func=cmd_report)
args = ap.parse_args()
args.func(args)
if __name__ == "__main__":
main()
@@ -0,0 +1,244 @@
"""Windows / DirectML embed worker.
Reads a queue.json staged by /opt/face-sets/work/immich_stage.py (WSL side),
runs InsightFace's FaceAnalysis on each image with the DmlExecutionProvider
backed by the AMD Vega, and writes a cache file in the schema produced by
sort_faces.py:cmd_embed (so it can be merged into nl_full.npz).
CLI:
py -3.12 embed_worker.py <queue.json> <out_cache.npz> [--limit N]
The queue.json entry shape (each item) is:
{
"asset_id": "...",
"sha256": "...",
"wsl_path": "/mnt/x/src/immich/<user>/<rel>", # canonical path stored
"win_path": "X:\\src\\immich\\<user>\\<rel>", # what we read from
"size_bytes": int,
"width": int, "height": int,
...
}
Per face record matches cmd_embed's schema:
path, face_idx, det_score, bbox, face_short, face_area, blur, noface=False, hash
plus landmark_2d_106, landmark_3d_68, pose (FaceAnalysis returns these for
free and the existing cache already carries them after `enrich`).
"""
from __future__ import annotations
import argparse
import json
import os
import sys
import time
from pathlib import Path
import numpy as np
from PIL import Image, ImageOps
from insightface.app import FaceAnalysis
MODEL_ROOT = r"C:\face_embed_venv\models"
MIN_DET_SCORE = 0.5
MIN_FACE_PIX = 40
FLUSH_EVERY = 50
def load_rgb_bgr(path: Path):
try:
with Image.open(path) as im:
im = ImageOps.exif_transpose(im)
im = im.convert("RGB")
rgb = np.array(im)
bgr = rgb[:, :, ::-1].copy()
return rgb, bgr
except Exception as e:
print(f"[warn] failed to load {path}: {e}", file=sys.stderr)
return None, None
def laplacian_variance(gray: np.ndarray) -> float:
g = gray.astype(np.float32)
lap = (
-4.0 * g[1:-1, 1:-1]
+ g[:-2, 1:-1] + g[2:, 1:-1]
+ g[1:-1, :-2] + g[1:-1, 2:]
)
return float(lap.var())
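# Quick sanity sketch for the blur metric (synthetic arrays, illustration only):
# the 4-neighbour Laplacian responds to local structure, so a noisy crop scores
# far higher variance than a flat one; low values flag blur.
def _demo_laplacian_variance():
    rng = np.random.default_rng(0)
    flat = np.full((64, 64), 128.0)
    noisy = 128.0 + 40.0 * rng.standard_normal((64, 64))
    return laplacian_variance(flat), laplacian_variance(noisy)  # (0.0, large)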
def save_cache(out_path: Path, emb_chunks: list, meta: list, processed: set, src_root: str):
emb = np.concatenate(emb_chunks) if emb_chunks else np.zeros((0, 512), dtype=np.float32)
tmp = out_path.with_suffix(".tmp.npz")
np.savez(
str(tmp),
embeddings=emb,
meta=json.dumps(meta),
src_root=str(src_root),
processed_paths=json.dumps(sorted(processed)),
path_aliases=json.dumps({}),
schema="v2",
)
os.replace(tmp, out_path)
def load_cache_if_exists(out_path: Path):
"""Resume helper. Returns (emb_chunks, meta, processed_set)."""
if not out_path.exists():
return [], [], set()
data = np.load(out_path, allow_pickle=True)
emb = data["embeddings"]
meta = json.loads(str(data["meta"]))
processed = set(json.loads(str(data["processed_paths"])))
return [emb] if len(emb) else [], list(meta), processed
def main():
p = argparse.ArgumentParser()
p.add_argument("queue", type=Path)
p.add_argument("out", type=Path)
p.add_argument("--limit", type=int, default=None)
args = p.parse_args()
queue = json.loads(args.queue.read_text())
print(f"queue: {len(queue)} entries from {args.queue}")
args.out.parent.mkdir(parents=True, exist_ok=True)
emb_chunks, meta, processed = load_cache_if_exists(args.out)
n_existing_records = len(meta)
n_existing_emb = sum(e.shape[0] for e in emb_chunks)
if n_existing_records:
print(f"resume: {n_existing_records} existing meta records "
f"({n_existing_emb} embeddings, {len(processed)} processed paths)")
print("initializing FaceAnalysis with DmlExecutionProvider")
app = FaceAnalysis(
name="buffalo_l",
root=MODEL_ROOT,
providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
app.prepare(ctx_id=0, det_size=(640, 640))
src_root = "/mnt/x/src/immich"
n_done = 0
n_face_records_added = 0
n_noface_added = 0
n_skipped = 0
n_load_err = 0
t0 = time.perf_counter()
last_flush = time.perf_counter()
new_emb_chunks: list[np.ndarray] = []
new_meta: list[dict] = []
def flush():
nonlocal new_emb_chunks, new_meta, last_flush
if not new_emb_chunks and not new_meta:
return
if new_emb_chunks:
emb_chunks.append(np.concatenate(new_emb_chunks))
new_emb_chunks = []
for r in new_meta:
meta.append(r)
new_meta = []
save_cache(args.out, emb_chunks, meta, processed, src_root)
last_flush = time.perf_counter()
for i, entry in enumerate(queue):
if args.limit is not None and n_done >= args.limit:
break
wsl_path = entry["wsl_path"]
win_path = entry["win_path"]
sha = entry["sha256"]
if wsl_path in processed:
n_skipped += 1
continue
rgb, bgr = load_rgb_bgr(Path(win_path))
if bgr is None:
new_meta.append({
"path": wsl_path, "face_idx": -1, "noface": True,
"hash": sha, "error": "load",
})
processed.add(wsl_path)
n_load_err += 1
n_done += 1
continue
faces = app.get(bgr)
kept_any = False
for j, f in enumerate(faces):
if float(f.det_score) < MIN_DET_SCORE:
continue
x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
x1 = max(x1, 0); y1 = max(y1, 0)
x2 = min(x2, rgb.shape[1]); y2 = min(y2, rgb.shape[0])
w, h = x2 - x1, y2 - y1
short = min(w, h)
if short < MIN_FACE_PIX:
continue
crop = rgb[y1:y2, x1:x2]
if crop.size == 0:
continue
gray = crop.mean(axis=2)
blur = laplacian_variance(gray) if min(gray.shape) > 3 else 0.0
emb = f.normed_embedding.astype(np.float32)
new_emb_chunks.append(emb[None, :])
rec = {
"path": wsl_path,
"face_idx": j,
"det_score": float(f.det_score),
"bbox": [x1, y1, x2, y2],
"face_short": int(short),
"face_area": int(w * h),
"blur": blur,
"noface": False,
"hash": sha,
}
# Enrichment-equivalent fields (FaceAnalysis returns these for free)
if hasattr(f, "landmark_2d_106") and f.landmark_2d_106 is not None:
rec["landmark_2d_106"] = f.landmark_2d_106.astype(np.float32).tolist()
if hasattr(f, "landmark_3d_68") and f.landmark_3d_68 is not None:
rec["landmark_3d_68"] = f.landmark_3d_68.astype(np.float32).tolist()
if hasattr(f, "pose") and f.pose is not None:
rec["pose"] = [float(x) for x in f.pose]
new_meta.append(rec)
kept_any = True
n_face_records_added += 1
if not kept_any:
new_meta.append({
"path": wsl_path, "face_idx": -1, "noface": True, "hash": sha,
})
n_noface_added += 1
processed.add(wsl_path)
n_done += 1
if (n_done % FLUSH_EVERY == 0) or (time.perf_counter() - last_flush) > 30.0:
flush()
elapsed = time.perf_counter() - t0
rate = n_done / max(0.1, elapsed)
print(
f"[embed] done={n_done:5d}/{len(queue)} faces+={n_face_records_added:5d} "
f"noface+={n_noface_added:4d} skipped={n_skipped:4d} "
f"load_err={n_load_err:3d} rate={rate:.1f} img/s "
f"({elapsed:.1f}s elapsed)"
)
flush()
elapsed = time.perf_counter() - t0
print()
print("=== embed done ===")
print(f" done: {n_done}")
print(f" new face records: {n_face_records_added}")
print(f" new noface records: {n_noface_added}")
print(f" skipped (already done): {n_skipped}")
print(f" load errors: {n_load_err}")
print(f" elapsed: {elapsed:.1f}s ({n_done/max(0.1,elapsed):.1f} img/s)")
print(f" cache: {args.out}")
if __name__ == "__main__":
main()
@@ -0,0 +1,574 @@
"""CLIP zero-shot scoring for masks + sunglasses across facesets_swap_ready/.
Usage:
# score one or more specific facesets (test mode)
python work/filter_occlusions.py score --facesets faceset_001,faceset_050 \
--out work/test_batch_occlusion/scores.json
# score everything (full corpus)
python work/filter_occlusions.py score --out work/occlusion_scores.json
# render HTML contact sheet from a scores.json
python work/filter_occlusions.py report --scores work/test_batch_occlusion/scores.json \
--out work/test_batch_occlusion
Notes:
- This script never modifies facesets_swap_ready/. An --apply step lives elsewhere
(or will be added once thresholds are validated).
- Model: open_clip ViT-L-14 / dfn2b_s39b (best public zero-shot at this size).
"""
from __future__ import annotations
import argparse
import json
import sys
import time
from pathlib import Path
from typing import Iterable
import torch
from PIL import Image
import open_clip
ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
WIN_ROOT = r"E:\temp_things\fcswp\nl_sorted\facesets_swap_ready"
MODEL_NAME = "ViT-L-14"
PRETRAINED = "dfn2b_s39b"
def wsl_to_win(wsl_path: str) -> str:
"""Translate a /mnt/e/... wsl path to E:\\... for the Windows worker."""
s = str(wsl_path)
if s.startswith("/mnt/"):
drive = s[5]
rest = s[7:].replace("/", "\\")
return f"{drive.upper()}:\\{rest}"
return s
# Prompt ensembles. Each pair (positive, negative) becomes one binary classifier.
# We average text embeddings within each list, then softmax across the two means.
PROMPTS = {
"mask": {
"pos": [
"a photo of a person wearing a surgical face mask",
"a photo of a person wearing an FFP2 respirator covering mouth and nose",
"a photo of a person wearing a cloth face mask",
"a face partially covered by a medical mask",
"a person whose mouth and nose are hidden by a face mask",
],
"neg": [
"a photo of a person's face with mouth and nose clearly visible",
"a clear, unobstructed photo of a face",
"a photo of a face without any mask or covering",
"a portrait of a person showing their full face",
"a photo of a person with a beard and visible mouth", # avoid beard false positives
],
},
"sunglasses": {
# We want to flag ONLY images where sunglasses occlude the eyes.
# False positives to defeat: sunglasses pushed up on the head/forehead, hanging on a shirt collar.
"pos": [
"a face with dark sunglasses covering the eyes",
"a portrait with the eyes hidden behind opaque sunglasses",
"a person wearing dark sunglasses over their eyes, eyes not visible",
"a face where the eyes are completely concealed by tinted lenses",
"a close-up portrait wearing aviator sunglasses on the eyes",
],
"neg": [
"a portrait with both eyes clearly visible and uncovered",
"a face with sunglasses pushed up on the forehead, eyes visible below",
"a face with sunglasses resting on top of the head, eyes visible",
"a person with sunglasses hanging from their shirt, eyes visible",
"a face wearing clear prescription eyeglasses with visible eyes",
"a portrait with no eyewear and visible eyes",
],
},
}
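# Toy sketch of the ensemble-then-softmax scoring described above (all vectors
# invented): average the per-side text embeddings, renormalize, then softmax the
# temperature-scaled cosine sims of an image feature against the two means to
# get P(pos). Mirrors build_text_features / score_images below.
@torch.no_grad()
def _demo_prompt_softmax():
    pos = torch.nn.functional.normalize(torch.randn(5, 512), dim=-1).mean(dim=0)
    neg = torch.nn.functional.normalize(torch.randn(5, 512), dim=-1).mean(dim=0)
    pos, neg = pos / pos.norm(), neg / neg.norm()
    img = torch.nn.functional.normalize(torch.randn(1, 512), dim=-1)
    sims = torch.stack([img @ pos, img @ neg], dim=1) * 100.0  # logit_scale ~ 100
    return sims.softmax(dim=1)[:, 0]  # P(pos) per image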
def load_model(device: str = "cpu"):
print(f"[clip] loading {MODEL_NAME} / {PRETRAINED} on {device} ...", file=sys.stderr)
t0 = time.time()
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_NAME, pretrained=PRETRAINED)
tokenizer = open_clip.get_tokenizer(MODEL_NAME)
model = model.to(device).eval()
logit_scale = float(model.logit_scale.exp().detach().cpu())
print(f"[clip] ready in {time.time()-t0:.1f}s, logit_scale={logit_scale:.2f}", file=sys.stderr)
return model, preprocess, tokenizer, logit_scale
@torch.no_grad()
def build_text_features(model, tokenizer, device: str):
"""Return dict {attr: (pos_mean_emb, neg_mean_emb)} on device, both L2-normalized."""
out = {}
for attr, sides in PROMPTS.items():
feats = {}
for side in ("pos", "neg"):
tokens = tokenizer(sides[side]).to(device)
f = model.encode_text(tokens)
f = f / f.norm(dim=-1, keepdim=True)
mean = f.mean(dim=0)
feats[side] = mean / mean.norm()
out[attr] = (feats["pos"], feats["neg"])
return out
@torch.no_grad()
def score_images(model, preprocess, text_feats, logit_scale: float, paths: list[Path], device: str, batch: int = 16):
"""Yield (path, {attr: pos_prob}) per image. logit_scale is CLIP's learned temperature (~100)."""
for i in range(0, len(paths), batch):
chunk = paths[i:i + batch]
imgs = []
keep = []
for p in chunk:
try:
img = Image.open(p).convert("RGB")
imgs.append(preprocess(img))
keep.append(p)
except Exception as e:
print(f"[skip] {p}: {e}", file=sys.stderr)
if not imgs:
continue
x = torch.stack(imgs).to(device)
feats = model.encode_image(x)
feats = feats / feats.norm(dim=-1, keepdim=True) # (B, D)
results = {}
for attr, (pos, neg) in text_feats.items():
sims = torch.stack([feats @ pos, feats @ neg], dim=1) * logit_scale # (B, 2)
probs = sims.softmax(dim=1)[:, 0].tolist() # P(pos)
results[attr] = probs
for j, p in enumerate(keep):
yield p, {attr: results[attr][j] for attr in text_feats}
def iter_facesets(root: Path, only: list[str] | None) -> Iterable[Path]:
if only:
for name in only:
d = root / name
if d.is_dir():
yield d
else:
print(f"[warn] not a directory: {d}", file=sys.stderr)
return
for d in sorted(root.iterdir()):
if d.is_dir() and not d.name.startswith("_"):
yield d
def cmd_score(args):
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess, tokenizer, logit_scale = load_model(device)
text_feats = build_text_features(model, tokenizer, device)
only = [s.strip() for s in args.facesets.split(",")] if args.facesets else None
facesets = list(iter_facesets(ROOT, only))
    # --sample-per-faceset takes the first N PNGs per faceset (cheap
    # deterministic sample for test batches); applied inside the loop below.
report = {
"model": f"{MODEL_NAME}/{PRETRAINED}",
"root": str(ROOT),
"prompts": PROMPTS,
"facesets": {},
}
total_imgs = 0
t0 = time.time()
for fs in facesets:
faces = sorted((fs / "faces").glob("*.png")) if (fs / "faces").is_dir() else sorted(fs.glob("*.png"))
if args.sample_per_faceset:
faces = faces[: args.sample_per_faceset]
if not faces:
continue
print(f"[scan] {fs.name}: {len(faces)} png", file=sys.stderr)
per_image = []
for p, scores in score_images(model, preprocess, text_feats, logit_scale, faces, device):
per_image.append({"file": p.name, "mask": round(scores["mask"], 4), "sunglasses": round(scores["sunglasses"], 4)})
total_imgs += 1
report["facesets"][fs.name] = per_image
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(report, indent=2))
dt = time.time() - t0
print(f"[done] {total_imgs} images, {dt:.1f}s ({total_imgs/max(dt,1e-3):.2f} img/s) -> {out}", file=sys.stderr)
def cmd_report(args):
"""Render an HTML contact sheet from scores.json. Generates JPG thumbs."""
scores = json.loads(Path(args.scores).read_text())
out_dir = Path(args.out)
thumbs_dir = out_dir / "thumbs"
thumbs_dir.mkdir(parents=True, exist_ok=True)
THUMB = 160
rows_html = []
def thumb_path(faceset: str, fname: str) -> Path:
d = thumbs_dir / faceset
d.mkdir(parents=True, exist_ok=True)
return d / (Path(fname).stem + ".jpg")
def make_thumb(src: Path, dst: Path):
if dst.exists():
return
try:
img = Image.open(src).convert("RGB")
img.thumbnail((THUMB, THUMB), Image.LANCZOS)
img.save(dst, "JPEG", quality=82)
except Exception as e:
print(f"[thumb-skip] {src}: {e}", file=sys.stderr)
facesets = scores["facesets"]
for faceset, items in facesets.items():
# sort: high score first so borderline cases group at the boundary
items_sorted = sorted(items, key=lambda x: max(x["mask"], x["sunglasses"]), reverse=True)
# faceset summary
n = len(items)
n_mask = sum(1 for x in items if x["mask"] >= 0.7)
n_sg = sum(1 for x in items if x["sunglasses"] >= 0.7)
pct_mask = (100 * n_mask / n) if n else 0
pct_sg = (100 * n_sg / n) if n else 0
rows_html.append(f"<h2 id='{faceset}'>{faceset} <small>({n} imgs &middot; mask&ge;0.7: {n_mask} ({pct_mask:.0f}%) &middot; sunglasses&ge;0.7: {n_sg} ({pct_sg:.0f}%))</small></h2>")
rows_html.append("<div class='grid'>")
src_root = ROOT / faceset
faces_root = (src_root / "faces") if (src_root / "faces").is_dir() else src_root
for it in items_sorted:
src = faces_root / it["file"]
dst = thumb_path(faceset, it["file"])
make_thumb(src, dst)
rel = f"thumbs/{faceset}/{Path(it['file']).stem}.jpg"
m, s = it["mask"], it["sunglasses"]
cls_m = "hi" if m >= 0.7 else ("mid" if m >= 0.4 else "lo")
cls_s = "hi" if s >= 0.7 else ("mid" if s >= 0.4 else "lo")
rows_html.append(
f"<div class='cell'>"
f"<img src='{rel}' loading='lazy' title='{it['file']}'>"
f"<div class='scores'><span class='{cls_m}'>M {m:.2f}</span> <span class='{cls_s}'>S {s:.2f}</span></div>"
f"</div>"
)
rows_html.append("</div>")
nav = " · ".join(f"<a href='#{f}'>{f}</a>" for f in facesets)
html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Occlusion test batch</title>
<style>
body {{ font-family: system-ui, sans-serif; background: #111; color: #eee; padding: 1em; }}
h1 {{ margin-top: 0; }}
h2 {{ margin-top: 1.5em; border-bottom: 1px solid #333; padding-bottom: .25em; }}
small {{ color: #999; font-weight: normal; }}
.grid {{ display: grid; grid-template-columns: repeat(auto-fill, minmax(170px, 1fr)); gap: .5em; }}
.cell {{ background: #1c1c1c; padding: 4px; border-radius: 4px; text-align: center; }}
.cell img {{ max-width: 100%; height: auto; display: block; margin: 0 auto; }}
.scores {{ font-family: monospace; font-size: 11px; padding-top: 4px; }}
.hi {{ color: #ff5050; font-weight: bold; }}
.mid {{ color: #ffb050; }}
.lo {{ color: #5fa05f; }}
.nav {{ position: sticky; top: 0; background: #111; padding: .5em 0; border-bottom: 1px solid #333; }}
a {{ color: #6cf; }}
</style></head>
<body>
<h1>Occlusion scores &mdash; {scores['model']}</h1>
<p>Sorted within each faceset by max(mask, sunglasses) descending.
Color: <span class='hi'>&ge;0.70</span> &middot; <span class='mid'>0.40&ndash;0.70</span> &middot; <span class='lo'>&lt;0.40</span></p>
<div class='nav'>{nav}</div>
{''.join(rows_html)}
</body></html>"""
out_html = out_dir / "index.html"
out_html.write_text(html)
print(f"[done] {out_html}", file=sys.stderr)
def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
"""Mirror of sort_faces.py:_zip_png_list. Renames PNGs to 0000.png, 0001.png, ..."""
import zipfile
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
for i, p in enumerate(pngs):
zf.write(p, arcname=f"{i:04d}.png")
def cmd_apply(args):
"""Prune mask/sunglasses PNGs, quarantine occlusion-dominated facesets,
re-zip .fsz, update top-level manifest. --dry-run prints the plan only."""
import shutil
threshold = args.threshold
domain_pct = args.domain_pct
min_survivors = args.min_survivors
top_n_target = args.top_n
scores = json.loads(Path(args.scores).read_text())
master_path = ROOT / "manifest.json"
master = json.loads(master_path.read_text())
by_name = {f["name"]: f for f in master.get("facesets", [])}
masked_dir = ROOT / "_masked"
thin_dir = ROOT / "_thin"
plan = []
for faceset, items in scores["facesets"].items():
if faceset not in by_name:
print(f"[warn] {faceset} not in master manifest — skipping", file=sys.stderr)
continue
n = len(items)
flagged_files = sorted(
it["file"] for it in items
if it["mask"] >= threshold or it["sunglasses"] >= threshold
)
        flagged_set = set(flagged_files)
        survivors_items = [it for it in items if it["file"] not in flagged_set]
        # preserve quality order from filename (0001.png is highest-rank)
        survivors_files = sorted(it["file"] for it in survivors_items)
n_mask = sum(1 for it in items if it["mask"] >= threshold)
n_sg = sum(1 for it in items if it["sunglasses"] >= threshold)
pct_mask = n_mask / n if n else 0
pct_sg = n_sg / n if n else 0
if pct_mask >= domain_pct:
action, reason = "quarantine_masked", f"mask_pct={pct_mask:.0%}"
elif pct_sg >= domain_pct:
action, reason = "quarantine_masked", f"sunglasses_pct={pct_sg:.0%}"
elif flagged_files and len(survivors_files) < min_survivors:
# only quarantine-as-thin if pruning is the cause of the drop below threshold;
# pre-existing small facesets without occlusions are left alone
action, reason = "quarantine_thin", f"survivors={len(survivors_files)}<{min_survivors}"
elif flagged_files:
action, reason = "prune", f"drop {len(flagged_files)}"
else:
action, reason = "keep", "clean"
plan.append({
"faceset": faceset, "action": action, "reason": reason,
"n": n, "n_mask": n_mask, "n_sg": n_sg,
"n_dropped": len(flagged_files), "n_survivors": len(survivors_files),
"dropped_files": flagged_files,
})
# Summary
counts = {a: 0 for a in ("keep", "prune", "quarantine_masked", "quarantine_thin")}
for p in plan:
counts[p["action"]] += 1
total_dropped_pngs = sum(p["n_dropped"] for p in plan if p["action"] == "prune")
total_quarantined_pngs = sum(p["n"] for p in plan if p["action"].startswith("quarantine"))
print(f"=== plan summary (threshold={threshold} domain_pct={domain_pct} min_survivors={min_survivors}) ===")
for a, c in counts.items():
print(f" {a}: {c}")
print(f" PNGs to drop (prune): {total_dropped_pngs}")
print(f" PNGs to quarantine (whole): {total_quarantined_pngs}")
print(f" facesets in master: {len(master['facesets'])}")
print(f" facesets scored: {len(plan)}")
# Write plan for audit
plan_path = Path(args.out_plan)
plan_path.parent.mkdir(parents=True, exist_ok=True)
plan_path.write_text(json.dumps({
"thresholds": {"image": threshold, "domain_pct": domain_pct, "min_survivors": min_survivors},
"counts": counts,
"totals": {"dropped_pngs": total_dropped_pngs, "quarantined_pngs": total_quarantined_pngs},
"plan": plan,
}, indent=2))
print(f" plan written to {plan_path}")
if args.dry_run:
# pretty list of quarantines
for p in plan:
if p["action"].startswith("quarantine"):
print(f" [{p['action']:>18s}] {p['faceset']} ({p['reason']}, n={p['n']})")
return
# ----- destructive section -----
masked_dir.mkdir(parents=True, exist_ok=True)
thin_dir.mkdir(parents=True, exist_ok=True)
new_facesets = []
new_masked = list(master.get("masked", [])) # preserve any prior runs
new_thin = list(master.get("thin_eras", []))
# build a name -> existing-thin/masked entry index, to update relpath if we re-quarantine
by_name_thin = {e["name"]: e for e in new_thin}
by_name_masked = {e["name"]: e for e in new_masked}
for p in plan:
entry = dict(by_name[p["faceset"]]) # copy
fs_dir = ROOT / p["faceset"]
faces_dir = fs_dir / "faces"
if p["action"] == "keep":
new_facesets.append(entry)
continue
# prune dropped PNGs first (applies to both prune and quarantine_thin paths)
if p["dropped_files"]:
dropped_holding = faces_dir / "_dropped"
dropped_holding.mkdir(exist_ok=True)
for fname in p["dropped_files"]:
src = faces_dir / fname
if src.exists():
shutil.move(str(src), str(dropped_holding / fname))
if p["action"].startswith("quarantine"):
target_root = masked_dir if p["action"] == "quarantine_masked" else thin_dir
target = target_root / p["faceset"]
# idempotency: if a previous run already moved it, skip the move
if not target.exists():
shutil.move(str(fs_dir), str(target))
entry["occlusion_filter"] = {
"action": p["action"], "reason": p["reason"],
"n_input": p["n"], "n_mask": p["n_mask"], "n_sg": p["n_sg"],
"n_dropped": p["n_dropped"], "n_survivors": p["n_survivors"],
"threshold": threshold, "domain_pct": domain_pct,
}
entry["relpath"] = f"{'_masked' if p['action']=='quarantine_masked' else '_thin'}/{p['faceset']}"
entry["fsz_top"] = None
entry["fsz_all"] = None
if p["action"] == "quarantine_masked":
entry["masked"] = True
new_masked.append(entry)
else:
entry["thin"] = True
new_thin.append(entry)
continue
# action == prune
survivor_pngs = sorted(faces_dir.glob("*.png"))
if not survivor_pngs:
print(f"[warn] {p['faceset']}: no survivor PNGs after prune", file=sys.stderr)
new_facesets.append(entry)
continue
# re-zip .fsz from survivors in quality order
top_n_eff = min(top_n_target, len(survivor_pngs))
top_fsz = fs_dir / f"{p['faceset']}_top{top_n_eff}.fsz"
all_fsz = fs_dir / f"{p['faceset']}_all.fsz"
# remove old .fsz files (they may have different top_n in name)
for old in fs_dir.glob("*.fsz"):
old.unlink()
_zip_png_list(survivor_pngs[:top_n_eff], top_fsz)
if len(survivor_pngs) > top_n_eff:
_zip_png_list(survivor_pngs, all_fsz)
entry["fsz_all"] = all_fsz.name
else:
entry["fsz_all"] = None
entry["fsz_top"] = top_fsz.name
entry["top_n"] = top_n_eff
entry["exported"] = len(survivor_pngs)
entry["dropped_occlusion"] = p["n_dropped"]
entry["occlusion_filter"] = {
"action": "prune", "n_input": p["n"], "n_mask": p["n_mask"],
"n_sg": p["n_sg"], "n_dropped": p["n_dropped"], "n_survivors": p["n_survivors"],
"threshold": threshold,
}
new_facesets.append(entry)
# write updated master manifest
new_master = dict(master)
new_master["facesets"] = new_facesets
new_master["masked"] = new_masked
new_master["thin_eras"] = new_thin
new_master["occlusion_filter_run"] = {
"model": scores.get("model"),
"threshold": threshold,
"domain_pct": domain_pct,
"min_survivors": min_survivors,
"counts": counts,
"totals": {"dropped_pngs": total_dropped_pngs, "quarantined_pngs": total_quarantined_pngs},
}
tmp = master_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(new_master, indent=2))
tmp.replace(master_path)
print(f"[done] master manifest updated: {len(new_facesets)} active, "
f"{len(new_masked)} masked, {len(new_thin)} thin")
def cmd_stage(args):
"""Walk facesets_swap_ready/ and write a queue.json for the Windows clip_worker."""
only = [s.strip() for s in args.facesets.split(",")] if args.facesets else None
queue = []
for fs in iter_facesets(ROOT, only):
faces = sorted((fs / "faces").glob("*.png")) if (fs / "faces").is_dir() else sorted(fs.glob("*.png"))
for p in faces:
queue.append({
"wsl_path": str(p),
"win_path": wsl_to_win(str(p)),
"faceset": fs.name,
"file": p.name,
})
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(queue, indent=2))
print(f"[stage] {len(queue)} png paths -> {out}", file=sys.stderr)
print(f"[stage] win queue file: {wsl_to_win(str(out))}", file=sys.stderr)
def cmd_merge(args):
"""Ingest worker scores.json into the per-faceset shape that `report` reads."""
src = json.loads(Path(args.scores).read_text())
by_faceset: dict[str, list] = {}
for r in src.get("results", []):
by_faceset.setdefault(r["faceset"], []).append({
"file": r["file"],
"mask": r["mask"],
"sunglasses": r["sunglasses"],
})
# stable ordering: faceset by name, files by name
out_data = {
"model": src.get("model", f"{MODEL_NAME}/{PRETRAINED}"),
"root": str(ROOT),
"prompts": src.get("prompts", PROMPTS),
"facesets": {fs: sorted(items, key=lambda x: x["file"]) for fs, items in sorted(by_faceset.items())},
}
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(out_data, indent=2))
total = sum(len(v) for v in by_faceset.values())
print(f"[merge] {total} scores across {len(by_faceset)} facesets -> {out}", file=sys.stderr)
def main():
ap = argparse.ArgumentParser()
sub = ap.add_subparsers(dest="cmd", required=True)
s = sub.add_parser("score", help="WSL CPU scoring (slow but no GPU dependency)")
s.add_argument("--facesets", default=None, help="comma-separated faceset names; default = all")
s.add_argument("--sample-per-faceset", type=int, default=0, help="cap PNGs per faceset (0 = all)")
s.add_argument("--out", required=True)
s.set_defaults(func=cmd_score)
st = sub.add_parser("stage", help="Build queue.json for Windows clip_worker.py")
st.add_argument("--facesets", default=None, help="comma-separated faceset names; default = all")
st.add_argument("--out", required=True)
st.set_defaults(func=cmd_stage)
m = sub.add_parser("merge", help="Convert worker scores.json into per-faceset report format")
m.add_argument("--scores", required=True, help="worker output (flat list of results)")
m.add_argument("--out", required=True, help="output path for per-faceset format")
m.set_defaults(func=cmd_merge)
r = sub.add_parser("report", help="Render HTML contact sheet from a per-faceset scores.json")
r.add_argument("--scores", required=True)
r.add_argument("--out", required=True)
r.set_defaults(func=cmd_report)
a = sub.add_parser("apply", help="Prune flagged PNGs, quarantine dominated facesets, re-zip .fsz, update manifest")
a.add_argument("--scores", required=True, help="per-faceset scores.json (output of `merge` or `score`)")
a.add_argument("--out-plan", required=True, help="path to write the apply plan json (audit)")
a.add_argument("--threshold", type=float, default=0.7, help="image-level drop threshold for mask/sunglasses (default 0.7)")
a.add_argument("--domain-pct", type=float, default=0.40, help="faceset-level quarantine threshold (default 0.40)")
a.add_argument("--min-survivors", type=int, default=5, help="quarantine to _thin if survivors below this (default 5)")
a.add_argument("--top-n", type=int, default=30, help="top-N for re-zipped _topN.fsz (default 30)")
a.add_argument("--dry-run", action="store_true", help="print plan only, no filesystem changes")
a.set_defaults(func=cmd_apply)
args = ap.parse_args()
args.func(args)
if __name__ == "__main__":
main()
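# Typical chain (a sketch; file locations illustrative). Either `score` on WSL CPU
# directly, or stage -> Windows clip_worker -> merge for the DML path:
#   python3 <this script> stage --out work/occl/queue.json
#   (run the Windows clip_worker on queue.json -> flat worker scores)
#   python3 <this script> merge --scores work/occl/worker_scores.json --out work/occl/scores.json
#   python3 <this script> report --scores work/occl/scores.json --out work/occl/report.html
#   python3 <this script> apply --scores work/occl/scores.json --out-plan work/occl/plan.json --dry-run
#   ...inspect plan.json, then re-run apply without --dry-run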
+50
@@ -0,0 +1,50 @@
#!/usr/bin/env bash
# Finalize an Immich user's stage:
# 1. Copy queue.json to /mnt/c so the Windows embed worker can read it
# 2. Run the embed worker on Windows (DML)
# 3. Copy the resulting cache back to /opt/face-sets/work/cache/
# 4. Run cluster_immich.py to discover + emit new facesets
#
# Usage: ./work/finalize_immich.sh <user-label>
set -euo pipefail
USER_LABEL="${1:?usage: $0 <user-label>}"
REPO="$(cd "$(dirname "$0")/.." && pwd)"
WSL_QUEUE="$REPO/work/immich/$USER_LABEL/queue.json"
WIN_QUEUE_DIR="/mnt/c/face_embed_venv/work/immich/$USER_LABEL"
WIN_QUEUE="$WIN_QUEUE_DIR/queue.json"
WIN_QUEUE_FOR_PS="C:\\face_embed_venv\\work\\immich\\$USER_LABEL\\queue.json"
WIN_CACHE_DIR="/mnt/c/face_embed_venv/work/cache"
WIN_CACHE="$WIN_CACHE_DIR/immich_${USER_LABEL}.npz"
WIN_CACHE_FOR_PS="C:\\face_embed_venv\\work\\cache\\immich_${USER_LABEL}.npz"
WSL_CACHE="$REPO/work/cache/immich_${USER_LABEL}.npz"
LOG="$REPO/work/logs/immich_finalize_${USER_LABEL}.log"
[ -f "$WSL_QUEUE" ] || { echo "missing queue: $WSL_QUEUE" >&2; exit 1; }
echo "=== finalize: $USER_LABEL ===" | tee -a "$LOG"
date | tee -a "$LOG"
mkdir -p "$WIN_QUEUE_DIR" "$WIN_CACHE_DIR" "$REPO/work/cache"
echo "[1/4] copying queue: $WSL_QUEUE -> $WIN_QUEUE" | tee -a "$LOG"
cp "$WSL_QUEUE" "$WIN_QUEUE"
echo " $(wc -c < "$WIN_QUEUE") bytes; $(/home/peter/face_sort_env/bin/python3 -c "import json,sys; print(len(json.load(open('$WIN_QUEUE'))))") entries"
echo "[2/4] running Windows DML embed worker" | tee -a "$LOG"
powershell.exe -NoProfile -Command "C:\\face_embed_venv\\Scripts\\python.exe C:\\face_embed_venv\\bench\\embed_worker.py '$WIN_QUEUE_FOR_PS' '$WIN_CACHE_FOR_PS'" 2>&1 | tee -a "$LOG"
[ -f "$WIN_CACHE" ] || { echo "embed produced no cache file at $WIN_CACHE" | tee -a "$LOG"; exit 1; }
echo "[3/4] copying cache back: $WIN_CACHE -> $WSL_CACHE" | tee -a "$LOG"
cp "$WIN_CACHE" "$WSL_CACHE"
echo " $(/home/peter/face_sort_env/bin/python3 -c "import sys,json; sys.path.insert(0,'$REPO'); from sort_faces import load_cache; e,m,_,_,_=load_cache('$WSL_CACHE'); print(f'{len(e)} embeddings, {sum(1 for x in m if x.get(\"noface\"))} noface, {sum(1 for x in m if not x.get(\"noface\"))} faces')")"
echo "[4/4] running cluster_immich.py" | tee -a "$LOG"
/home/peter/face_sort_env/bin/python3 "$REPO/work/cluster_immich.py" "$WSL_CACHE" 2>&1 | tee -a "$LOG"
echo "=== finalize done: $USER_LABEL ===" | tee -a "$LOG"
date | tee -a "$LOG"
+447
@@ -0,0 +1,447 @@
#!/usr/bin/env python3
"""Stage Immich assets for embedding (WSL side of the split workflow).
For one Immich user:
1. Page through `/search/metadata` listing every IMAGE asset the user owns.
2. For each asset, fetch `/faces?id=` and decide if any detected face has a
scaled short side >= MIN_FACE_SHORT on the original. Skip assets that
don't.
3. Download the original. Compute sha256.
4. Dedup against (a) the existing canonical cache `nl_full.npz` and
(b) sha256s already staged in this run / earlier runs. If duplicate,
do NOT save to disk; record the alias.
5. Save survivors to /mnt/x/src/immich/<user>/<rel> mirroring the structure
after Immich's `/upload/library/<owner>/` prefix.
6. Write a queue file with WSL + Windows paths so the Windows DML embed
worker can find them.
7. Persist staging state continuously so the run is resumable.
Output artifacts:
work/immich/<user>/queue.json - what the Windows worker should embed
work/immich/<user>/state.json - resume state
work/immich/<user>/aliases.json - asset_id -> existing canonical path
when sha256 matched something already
in nl_full.npz
"""
from __future__ import annotations
import argparse
import hashlib
import json
import os
import sys
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import numpy as np
REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import load_cache # noqa: E402
# ---- config -------------------------------------------------------------- #
API = os.environ.get("IMMICH_URL", "").rstrip("/") + "/api" if os.environ.get("IMMICH_URL") else None
KEY = os.environ.get("IMMICH_API_KEY")
if not API or not KEY:
raise SystemExit(
"set IMMICH_URL and IMMICH_API_KEY env vars before running, e.g.\n"
" export IMMICH_URL=https://fotos.example.org\n"
" export IMMICH_API_KEY=... # admin API key"
)
HEADERS = {"x-api-key": KEY, "Accept": "application/json"}
# Short-label -> Immich userId. The user is responsible for filling this in for
# their own Immich instance. NOTE: as of Immich v2.7.2, /search/metadata's
# `userIds` filter is silently ignored when the API key is bound to a different
# user, so changing this label/UUID does not actually change which assets the
# API returns; we keep it here for naming output dirs and as future-proofing.
USERS_FILE = REPO / "work" / "immich" / "users.json"
USERS: dict[str, str] = {}
if USERS_FILE.exists():
USERS = json.loads(USERS_FILE.read_text())
CACHE_PATH = REPO / "work" / "cache" / "nl_full.npz" # for sha256 dedup
STAGE_DIR = REPO / "work" / "immich"
DEST_ROOT = Path("/mnt/x/src/immich")
WIN_DEST_ROOT = "X:\\src\\immich" # equivalent on the Windows side
PAGE_SIZE = 1000
MIN_FACE_SHORT = 90 # match refine's gate
MIN_DET_SCORE = 0.5 # weaker than refine's 0.6, since Immich's score scale differs
HTTP_TIMEOUT = 60 # seconds, conservative for big originals
HTTP_RETRIES = 3
HTTP_BACKOFF = 2.0
# Circuit breaker: if this many consecutive workers fail with network errors,
# probe Immich; if probe also fails, exit cleanly with code 2 so the orchestrator
# can pause until the user says resume. State is preserved (resume-safe).
OUTAGE_FAIL_STREAK = 12
OUTAGE_PROBE_TIMEOUT = 8
# ---- helpers ------------------------------------------------------------- #
def http_get(url: str, accept_bytes: bool = False) -> bytes | dict:
"""GET with retries. Returns parsed JSON unless accept_bytes is True."""
last_err = None
for attempt in range(HTTP_RETRIES):
try:
req = urllib.request.Request(url, headers=HEADERS)
with urllib.request.urlopen(req, timeout=HTTP_TIMEOUT) as resp:
data = resp.read()
return data if accept_bytes else json.loads(data)
except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError) as e:
last_err = e
if attempt + 1 < HTTP_RETRIES:
time.sleep(HTTP_BACKOFF * (attempt + 1))
raise RuntimeError(f"GET {url} failed after {HTTP_RETRIES} attempts: {last_err}")
def probe_immich() -> bool:
"""Quick connectivity probe (no retry). Used by the circuit breaker."""
try:
req = urllib.request.Request(f"{API}/server/version", headers=HEADERS)
urllib.request.urlopen(req, timeout=OUTAGE_PROBE_TIMEOUT).read()
return True
except Exception:
return False
def http_post(url: str, payload: dict) -> dict:
last_err = None
body = json.dumps(payload).encode("utf-8")
for attempt in range(HTTP_RETRIES):
try:
req = urllib.request.Request(
url, data=body, headers={**HEADERS, "Content-Type": "application/json"}
)
with urllib.request.urlopen(req, timeout=HTTP_TIMEOUT) as resp:
return json.loads(resp.read())
except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError) as e:
last_err = e
if attempt + 1 < HTTP_RETRIES:
time.sleep(HTTP_BACKOFF * (attempt + 1))
raise RuntimeError(f"POST {url} failed after {HTTP_RETRIES} attempts: {last_err}")
def sha256_bytes(b: bytes) -> str:
return hashlib.sha256(b).hexdigest()
def derive_relpath(original_path: str) -> str:
"""Return a relative subpath rooted at the user dir, mirroring Immich.
/usr/src/app/upload/library/admin/2026/2026-02-18/foo.jpg
-> 2026/2026-02-18/foo.jpg
Anything that doesn't match the expected prefix falls back to the basename
only.
"""
marker = "/upload/library/"
i = original_path.find(marker)
if i < 0:
return Path(original_path).name
rest = original_path[i + len(marker):]
parts = rest.split("/", 1)
return parts[1] if len(parts) == 2 else parts[0]
def wsl_to_win(p: Path) -> str:
"""Convert /mnt/x/.. -> X:\\.. for the embed worker that runs on Windows."""
s = str(p)
if s.startswith("/mnt/"):
drive = s[5]
rest = s[6:].lstrip("/")
return f"{drive.upper()}:\\{rest.replace('/', chr(92))}"
if s.startswith("/opt/face-sets/"):
# /opt/face-sets/work/... is on the WSL ext4 filesystem; reachable from
# Windows as \\wsl$\Ubuntu\opt\face-sets\... (slower than C:). For our
# use we keep all stage outputs under /mnt/x or /mnt/c so this branch
# should not be hit, but fall back rather than fail.
return f"\\\\wsl$\\Ubuntu\\opt\\face-sets\\{s[len('/opt/face-sets/'):].replace('/', chr(92))}"
return s
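# e.g. wsl_to_win(Path("/mnt/x/src/immich/peter/2026/img.jpg"))
#   -> r"X:\src\immich\peter\2026\img.jpg"   (illustrative path)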
def keep_asset(asset: dict, faces: list) -> tuple[bool, list[dict]]:
"""Return (keep, eligible_face_records). A face is 'eligible' iff its
scaled-to-original short side >= MIN_FACE_SHORT and source-type is
machine-learning."""
W, H = asset.get("width"), asset.get("height")
if not W or not H:
return False, []
eligible = []
for f in faces:
if f.get("sourceType") and f["sourceType"] != "machine-learning":
continue
iw = f.get("imageWidth") or W
ih = f.get("imageHeight") or H
sx = (W / iw) if iw else 1.0
sy = (H / ih) if ih else 1.0
bw = (f["boundingBoxX2"] - f["boundingBoxX1"]) * sx
bh = (f["boundingBoxY2"] - f["boundingBoxY1"]) * sy
if min(bw, bh) >= MIN_FACE_SHORT:
eligible.append({
"id": f["id"],
"x1": int(round(f["boundingBoxX1"] * sx)),
"y1": int(round(f["boundingBoxY1"] * sy)),
"x2": int(round(f["boundingBoxX2"] * sx)),
"y2": int(round(f["boundingBoxY2"] * sy)),
"person": (f.get("person") or {}).get("name") or None,
})
return (len(eligible) > 0), eligible
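# Worked example (illustrative numbers): original is 4000x3000 but Immich reports
# face boxes on a 2000x1500 ML input -> sx = sy = 2.0; a box with a 60px short
# side on the ML input scales to 120px on the original, which clears
# MIN_FACE_SHORT=90 and makes the face eligible.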
# ---- main staging loop --------------------------------------------------- #
def list_assets(user_id: str):
"""Yield every IMAGE asset owned by user_id, paginated."""
page = 1
while True:
resp = http_post(f"{API}/search/metadata", {
"size": PAGE_SIZE,
"type": "IMAGE",
"page": page,
"userIds": [user_id],
})
items = resp["assets"]["items"]
if not items:
return
for a in items:
yield a
nxt = resp["assets"].get("nextPage")
if not nxt:
return
page = int(nxt)
def stage(user_label: str, limit: int | None, workers: int) -> None:
user_id = USERS.get(user_label, user_label)  # USERS may be empty (see main); fall back to the label
user_dir = STAGE_DIR / user_label
user_dir.mkdir(parents=True, exist_ok=True)
state_path = user_dir / "state.json"
queue_path = user_dir / "queue.json"
aliases_path = user_dir / "aliases.json"
# ---- load existing state for resume ---- #
state = {
"started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
"user_label": user_label,
"user_id": user_id,
"seen_asset_ids": [],
"staged_count": 0,
"deduped_against_existing": 0,
"deduped_against_staged": 0,
"skipped_no_big_face": 0,
"skipped_no_faces": 0,
"skipped_download_error": 0,
"total_assets_seen": 0,
}
queue: list[dict] = []
aliases: dict[str, dict] = {} # asset_id -> {sha, canonical_path}
staged_hashes: set[str] = set()
if state_path.exists():
prior = json.loads(state_path.read_text())
state.update(prior)
state["resumed_at"] = time.strftime("%Y-%m-%dT%H:%M:%S")
if queue_path.exists():
queue = json.loads(queue_path.read_text())
staged_hashes = {q["sha256"] for q in queue}
if aliases_path.exists():
aliases = json.loads(aliases_path.read_text())
print(f"[resume] {len(state['seen_asset_ids'])} asset_ids already seen, "
f"{len(queue)} in queue, {len(aliases)} aliased to existing cache")
seen = set(state["seen_asset_ids"])
# ---- startup connectivity probe ---- #
if not probe_immich():
print(f"[init] Immich probe failed at {API}/server/version -- exiting code 2")
sys.exit(2)
print("[init] Immich reachable")
# ---- load existing canonical cache hashes (sha256) ---- #
print(f"[init] loading existing cache hashes from {CACHE_PATH}")
_emb, meta, _src, _proc, _aliases = load_cache(CACHE_PATH)
canonical_by_hash: dict[str, str] = {}
for m in meta:
h = m.get("hash")
if h:
canonical_by_hash.setdefault(h, m["path"])
print(f"[init] {len(canonical_by_hash)} unique sha256s in nl_full.npz")
# ---- iterate assets ---- #
# Each worker does the entire I/O chain for an asset: /faces -> filter ->
# /original. That way 8 workers translate to ~8x parallelism end-to-end.
# Main thread does sha256, dedup decisions, and writes (which are CPU/SMB
# bound but cheap relative to two HTTPS round-trips per asset).
# Worker result tuple:
# (asset, faces|None, blob|None, eligible|None, error|None)
def _fetch_for_asset(asset: dict):
if asset.get("type") != "IMAGE":
return asset, None, None, None, "not_image"
aid = asset["id"]
if aid in seen:
return asset, None, None, None, "already_seen"
try:
faces = http_get(f"{API}/faces?id={aid}")
except Exception as e:
return asset, None, None, None, f"faces_error:{e}"
if not faces:
return asset, [], None, [], "no_faces"
keep, eligible = keep_asset(asset, faces)
if not keep:
return asset, faces, None, eligible, "no_big_face"
try:
blob = http_get(f"{API}/assets/{aid}/original", accept_bytes=True)
except Exception as e:
return asset, faces, None, eligible, f"download_error:{e}"
return asset, faces, blob, eligible, None
n = 0
err_streak = 0
last_flush = time.time()
t0 = time.time()
pool = ThreadPoolExecutor(max_workers=workers)
try:
for asset, faces, blob, eligible, err in pool.map(_fetch_for_asset, list_assets(user_id)):
if asset.get("type") != "IMAGE":
continue
n += 1
state["total_assets_seen"] = n
if limit is not None and n > limit:
print(f"[stop] hit --limit {limit}")
break
aid = asset["id"]
# Already-seen / non-image: silently skip.
if err == "already_seen":
continue
# Transient: count, but DON'T mark as seen so resume retries.
if err and (err.startswith("faces_error") or err.startswith("download_error")):
kind = err.split(":", 1)[0]
detail = err.split(":", 1)[1][:160] if ":" in err else err
print(f"[err] {kind} {aid}: {detail}")
state["skipped_download_error"] += 1
err_streak += 1
# Circuit breaker: long streak -> probe; if down, save and exit.
if err_streak >= OUTAGE_FAIL_STREAK:
print(f"[breaker] {err_streak} consecutive errors; probing Immich...")
if probe_immich():
print("[breaker] probe ok, treating as transient; continuing")
err_streak = 0
else:
print("[breaker] probe FAILED -- pausing run; resume with same command")
queue_path.write_text(json.dumps(queue, indent=2))
state_path.write_text(json.dumps(state, indent=2))
aliases_path.write_text(json.dumps(aliases, indent=2))
sys.exit(2)
continue
else:
err_streak = 0
# Permanent classifications -> seen.
if err == "no_faces":
state["skipped_no_faces"] += 1
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
continue
if err == "no_big_face":
state["skipped_no_big_face"] += 1
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
continue
# Have faces + blob -> dedup + save.
h = sha256_bytes(blob)
if h in canonical_by_hash:
aliases[aid] = {"sha256": h, "canonical": canonical_by_hash[h]}
state["deduped_against_existing"] += 1
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
continue
if h in staged_hashes:
state["deduped_against_staged"] += 1
seen.add(aid); state["seen_asset_ids"] = sorted(seen)
continue
rel = derive_relpath(asset.get("originalPath", asset.get("originalFileName", aid)))
wsl_path = DEST_ROOT / user_label / rel
wsl_path.parent.mkdir(parents=True, exist_ok=True)
wsl_path.write_bytes(blob)
staged_hashes.add(h)
queue.append({
"asset_id": aid,
"sha256": h,
"wsl_path": str(wsl_path),
"win_path": wsl_to_win(wsl_path),
"size_bytes": len(blob),
"width": asset.get("width"),
"height": asset.get("height"),
"originalPath": asset.get("originalPath"),
"originalFileName": asset.get("originalFileName"),
"localDateTime": asset.get("localDateTime"),
"immich_eligible_faces": eligible,
})
state["staged_count"] += 1
seen.add(aid)
state["seen_asset_ids"] = sorted(seen)
if time.time() - last_flush > 5.0 or len(queue) % 25 == 0:
queue_path.write_text(json.dumps(queue, indent=2))
state_path.write_text(json.dumps(state, indent=2))
aliases_path.write_text(json.dumps(aliases, indent=2))
last_flush = time.time()
elapsed = time.time() - t0
rate = state["total_assets_seen"] / max(0.1, elapsed)
print(f"[stage] seen={state['total_assets_seen']:6d} "
f"staged={state['staged_count']:5d} "
f"dedup-existing={state['deduped_against_existing']:5d} "
f"dedup-staged={state['deduped_against_staged']:5d} "
f"no-big-face={state['skipped_no_big_face']:6d} "
f"no-faces={state['skipped_no_faces']:6d} "
f"errs={state['skipped_download_error']:3d} "
f"({rate:.1f} assets/s)")
finally:
pool.shutdown(wait=False, cancel_futures=True)
# final flush
queue_path.write_text(json.dumps(queue, indent=2))
state_path.write_text(json.dumps(state, indent=2))
aliases_path.write_text(json.dumps(aliases, indent=2))
print()
print(f"=== final state for user {user_label} ===")
for k in [
"total_assets_seen", "staged_count", "deduped_against_existing",
"deduped_against_staged", "skipped_no_big_face", "skipped_no_faces",
"skipped_download_error",
]:
print(f" {k}: {state[k]}")
total_bytes = sum(q["size_bytes"] for q in queue)
print(f" staged bytes: {total_bytes/1e9:.2f} GB across {len(queue)} files")
print(f" queue: {queue_path}")
print(f" state: {state_path}")
print(f" aliases: {aliases_path}")
# ---- cli ----------------------------------------------------------------- #
def main() -> None:
p = argparse.ArgumentParser()
if not USERS:
p.add_argument("--user", required=True,
help=f"label for output dir (USERS map empty; populate {USERS_FILE} to constrain)")
else:
p.add_argument("--user", choices=list(USERS.keys()), required=True)
p.add_argument("--limit", type=int, default=None,
help="stop after seeing N assets total (for testing)")
p.add_argument("--workers", type=int, default=8,
help="concurrent /faces fetches (default 8)")
args = p.parse_args()
stage(args.user, args.limit, args.workers)
if __name__ == "__main__":
main()
+144
@@ -0,0 +1,144 @@
"""Windows / DirectML multi-face audit worker.
For every PNG in queue.json, run insightface FaceAnalysis and record how many
faces were detected (filtering by det_score>=MIN_DET and face_short>=MIN_PIX).
Surfaces the load-bearing roop invariant: each .fsz PNG must hold exactly one
face, otherwise the loader's `extract_face_images` appends every detected face
into the FaceSet and pollutes the averaged identity embedding.
CLI:
py -3.12 multiface_worker.py <queue.json> <out_results.json> [--limit N]
"""
from __future__ import annotations
import argparse
import json
import os
import sys
import time
from pathlib import Path
import numpy as np
from PIL import Image, ImageOps
from insightface.app import FaceAnalysis
MODEL_ROOT = r"C:\face_embed_venv\models"
MIN_DET = 0.5
MIN_FACE_PIX = 40
FLUSH_EVERY = 200
def load_existing(out_path: Path):
if not out_path.exists():
return None, set()
try:
d = json.loads(out_path.read_text())
processed = set(d.get("processed", []))
return d, processed
except Exception as e:
print(f"[warn] could not parse {out_path}: {e}; starting fresh", file=sys.stderr)
return None, set()
def save_atomic(out_path: Path, data: dict):
tmp = out_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(data, indent=2))
os.replace(tmp, out_path)
def main():
ap = argparse.ArgumentParser()
ap.add_argument("queue", type=Path)
ap.add_argument("out", type=Path)
ap.add_argument("--limit", type=int, default=None)
args = ap.parse_args()
queue = json.loads(args.queue.read_text())
print(f"[queue] {len(queue)} entries from {args.queue}", flush=True)
args.out.parent.mkdir(parents=True, exist_ok=True)
existing, processed = load_existing(args.out)
if existing:
print(f"[resume] {len(processed)} already scored", flush=True)
results = existing.get("results", [])
else:
results = []
pending = [e for e in queue if e["wsl_path"] not in processed]
if args.limit is not None:
pending = pending[: args.limit]
print(f"[pending] {len(pending)} entries", flush=True)
if not pending:
print("[done] nothing to do")
return
print("[load] FaceAnalysis with DmlExecutionProvider", flush=True)
app = FaceAnalysis(
name="buffalo_l",
root=MODEL_ROOT,
providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
app.prepare(ctx_id=0, det_size=(640, 640))
n_done = 0
n_load_err = 0
last_flush = time.time()
t_start = time.time()
def flush():
save_atomic(args.out, {
"results": results,
"processed": sorted(processed),
})
for entry in pending:
try:
with Image.open(entry["win_path"]) as im:
im = ImageOps.exif_transpose(im)
im = im.convert("RGB")
rgb = np.array(im)
bgr = rgb[:, :, ::-1].copy()
except Exception as e:
n_load_err += 1
results.append({
"wsl_path": entry["wsl_path"], "faceset": entry["faceset"], "file": entry["file"],
"face_count": -1, "error": "load",
})
processed.add(entry["wsl_path"])
n_done += 1
continue
faces = app.get(bgr)
kept = 0
for f in faces:
if float(f.det_score) < MIN_DET:
continue
x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
short = min(max(x2 - x1, 0), max(y2 - y1, 0))
if short < MIN_FACE_PIX:
continue
kept += 1
results.append({
"wsl_path": entry["wsl_path"], "faceset": entry["faceset"], "file": entry["file"],
"face_count": kept,
})
processed.add(entry["wsl_path"])
n_done += 1
if (n_done % FLUSH_EVERY == 0) or (time.time() - last_flush) > 30.0:
flush()
last_flush = time.time()
elapsed = time.time() - t_start
rate = n_done / max(0.1, elapsed)
eta = (len(pending) - n_done) / max(0.1, rate) / 60.0
print(f"[scan] {n_done}/{len(pending)} rate={rate:.2f} img/s eta={eta:.1f}min "
f"load_err={n_load_err}", flush=True)
flush()
elapsed = time.time() - t_start
print(f"[done] {n_done} scored, {n_load_err} load errors, {elapsed:.1f}s "
f"({n_done/max(0.1,elapsed):.2f} img/s) -> {args.out}", flush=True)
if __name__ == "__main__":
main()
+127
@@ -0,0 +1,127 @@
#!/bin/bash
# Generic chain driver for the video target preprocessing pipeline.
#
# Usage:
# WORK=/path/to/workdir SKIP_PATTERN='ct_src_(0001[015]|0005[0-9]|0006[0-9])\.mp4' \
# bash run_video_pipeline.sh > /opt/face-sets/work/logs/<name>.log 2>&1
#
# Required env vars:
# WORK per-batch workdir (will hold scenes/, queue.json, results.jsonl, plan.json, review/)
#
# Optional env vars:
# INPUT_DIR default /mnt/x/src/vd
# OUTPUT_DIR default /mnt/x/src/vd/ct
# FILTER_FROM basename floor; only files with name >= this go in (e.g. ct_src_00050.mp4)
# SKIP_PATTERN regex of basenames to exclude (Python `re` syntax). Applied AFTER FILTER_FROM.
#              Pad numeric groups to the full 5-digit basename width (0005[0-9], not 005[0-9]).
# MAX_DUR score --max-dur (default 120)
# IDENTITY "yes" to enable identity tagging; default "no"
# SIDECAR "yes" to emit <uuid>.json provenance sidecars; default "no"
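#
# Concrete invocation sketch (values illustrative, not a recorded run):
#   WORK=/opt/face-sets/work/video_preprocess/rest \
#   FILTER_FROM=ct_src_00063.mp4 \
#   SKIP_PATTERN='ct_src_00099\.mp4' \
#   SIDECAR=no \
#   bash run_video_pipeline.sh > /opt/face-sets/work/logs/video_rest.log 2>&1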
set -e
: ${WORK:?WORK env var must point at a workdir}
: ${INPUT_DIR:=/mnt/x/src/vd}
: ${OUTPUT_DIR:=/mnt/x/src/vd/ct}
: ${MAX_DUR:=120}
: ${IDENTITY:=no}
: ${SIDECAR:=no}
mkdir -p "$WORK" "$WORK/scenes"
PY_WSL=/home/peter/face_sort_env/bin/python
PY_WIN="/mnt/c/face_embed_venv/Scripts/python.exe"
PIPELINE=/opt/face-sets/work/video_target_pipeline.py
WORKER=/opt/face-sets/work/video_face_worker.py
INVENTORY_FULL=/opt/face-sets/work/video_preprocess/inventory_full.json
ts() { date +"%Y-%m-%d %H:%M:%S"; }
log() { echo "[$(ts)] [$PHASE] $*"; }
PHASE="setup"
log "STARTED — host=$(hostname) pid=$$ work=$WORK"
log "config: input=$INPUT_DIR output=$OUTPUT_DIR filter_from=${FILTER_FROM:-<none>} skip_pattern=${SKIP_PATTERN:-<none>} max_dur=$MAX_DUR identity=$IDENTITY sidecar=$SIDECAR"
PHASE="inventory"
log "building subset inventory"
T0=$(date +%s)
# rebuild full inventory if missing
if [ ! -f "$INVENTORY_FULL" ]; then
log "(no full inventory cached — running fresh scan)"
$PY_WSL $PIPELINE scan --input "$INPUT_DIR" --output-dir "$OUTPUT_DIR" --out "$INVENTORY_FULL"
fi
$PY_WSL <<EOF
import json, re
from pathlib import Path
inv = json.load(open('$INVENTORY_FULL'))
subset = list(inv['videos'])
filter_from = '${FILTER_FROM}'
skip_pat = '${SKIP_PATTERN}'
if filter_from:
subset = [v for v in subset if Path(v['path']).name >= filter_from]
if skip_pat:
pat = re.compile(skip_pat)
subset = [v for v in subset if not pat.search(Path(v['path']).name)]
subset.sort(key=lambda v: v['path'])
inv['videos'] = subset
json.dump(inv, open('$WORK/inventory.json','w'), indent=2)
total_dur = sum(v.get('duration_s', 0) for v in inv['videos'] if 'error' not in v)
print(f' {len(inv["videos"])} videos, total {total_dur/3600:.2f}h input')
EOF
log "done in $(($(date +%s)-T0))s"
PHASE="scenes"
log "PySceneDetect AdaptiveDetector across all videos (cached entries skipped)"
T0=$(date +%s)
$PY_WSL $PIPELINE scenes --inventory "$WORK/inventory.json" --out-dir "$WORK/scenes"
log "done in $(($(date +%s)-T0))s"
PHASE="stage"
log "building frame queue @ 2 fps within scenes"
T0=$(date +%s)
$PY_WSL $PIPELINE stage --inventory "$WORK/inventory.json" --scenes-dir "$WORK/scenes" --out "$WORK/queue.json"
log "done in $(($(date +%s)-T0))s"
PHASE="worker"
log "Windows DML face detect+embed (resumable; the slow one)"
T0=$(date +%s)
$PY_WIN $WORKER "$WORK/queue.json" "$WORK/results.json"
log "done in $(($(date +%s)-T0))s"
PHASE="merge"
log "ingesting worker output (jsonl)"
T0=$(date +%s)
$PY_WSL $PIPELINE merge --results "$WORK/results.json" --out "$WORK/frames.json"
log "done in $(($(date +%s)-T0))s"
PHASE="track"
log "stitching detections into tracks"
T0=$(date +%s)
$PY_WSL $PIPELINE track --frames "$WORK/frames.json" --scenes-dir "$WORK/scenes" \
--inventory "$WORK/inventory.json" --out "$WORK/tracks.json"
log "done in $(($(date +%s)-T0))s"
PHASE="score"
log "scoring with relaxed gates + max-dur=$MAX_DUR identity=$IDENTITY"
T0=$(date +%s)
ID_FLAG=""
if [ "$IDENTITY" != "yes" ]; then ID_FLAG="--no-identity"; fi
$PY_WSL $PIPELINE score --tracks "$WORK/tracks.json" --inventory "$WORK/inventory.json" \
--out "$WORK/plan.json" --max-dur "$MAX_DUR" $ID_FLAG
log "done in $(($(date +%s)-T0))s"
PHASE="cut"
log "ffmpeg stream-copy into per-source subfolders (no --clean)"
T0=$(date +%s)
SIDECAR_FLAG=""
if [ "$SIDECAR" = "yes" ]; then SIDECAR_FLAG="--write-sidecar"; fi
$PY_WSL $PIPELINE cut --plan "$WORK/plan.json" --output-dir "$OUTPUT_DIR" $SIDECAR_FLAG
log "done in $(($(date +%s)-T0))s"
PHASE="report"
log "rendering HTML"
T0=$(date +%s)
$PY_WSL $PIPELINE report --plan "$WORK/plan.json" --output-dir "$OUTPUT_DIR" --out "$WORK/review"
log "done in $(($(date +%s)-T0))s"
PHASE="done"
log "PIPELINE COMPLETE — review at file://$WORK/review/index.html"
+32
@@ -0,0 +1,32 @@
#!/bin/bash
# Generic status helper for run_video_pipeline.sh.
# Usage: bash status_video_pipeline.sh <log_file>
# Defaults to /opt/face-sets/work/logs/video_run.log if no arg.
LOG="${1:-/opt/face-sets/work/logs/video_run.log}"
if [ ! -f "$LOG" ]; then
echo "no log at $LOG yet"
exit 0
fi
echo "=== last 8 log lines ==="
tail -8 "$LOG"
echo
# worker progress
last=$(grep -E "^\[scan\] [0-9]+/[0-9]+" "$LOG" | tail -1)
if [ -n "$last" ]; then
echo "=== DML worker progress ==="
echo " $last"
fi
# total elapsed
start_epoch=$(head -1 "$LOG" | sed 's/.*\[\(.*\)\].*\[setup\].*/\1/' | xargs -I{} date -d "{}" +%s 2>/dev/null)
now_epoch=$(date +%s)
if [ -n "$start_epoch" ] && [ "$start_epoch" != "" ] 2>/dev/null; then
elapsed=$((now_epoch - start_epoch))
h=$((elapsed / 3600))
m=$(( (elapsed % 3600) / 60 ))
echo " elapsed: ${h}h${m}m"
fi
+274
@@ -0,0 +1,274 @@
"""Windows / DirectML video frame face worker.
Reads a queue.json from /opt/face-sets/work/video_target_pipeline.py:stage
(WSL side), each entry: {video_path, win_video_path, frame_idx, time_s,
queue_id}. Decodes frame N from the video, runs insightface FaceAnalysis,
emits per-face records (bbox, det_score, pose, embedding, face_short).
CLI:
py -3.12 video_face_worker.py <queue.json> <out_results.json> [--limit N]
Resumable: records already present in the sister out_results.jsonl (or migrated
from a legacy out_results.json on first load) with the same queue_id are skipped.
"""
from __future__ import annotations
import argparse
import json
import os
import sys
import time
from pathlib import Path
import numpy as np
import cv2
from insightface.app import FaceAnalysis
MODEL_ROOT = r"C:\face_embed_venv\models"
MIN_DET = 0.5
MIN_FACE_PIX = 40
FLUSH_EVERY = 100
def jsonl_path_for(out_path: Path) -> Path:
"""Sister JSONL file: one result-record per line, append-only."""
return out_path.with_suffix(".jsonl")
def load_existing(out_path: Path):
"""Load existing results from .jsonl (preferred) or legacy .json (one-time conversion).
Returns (records_list, processed_set)."""
jsonl = jsonl_path_for(out_path)
if jsonl.exists():
records = []
processed = set()
with open(jsonl) as f:
for line_num, line in enumerate(f, 1):
line = line.strip()
if not line:
continue
try:
r = json.loads(line)
records.append(r)
if r.get("queue_id"):
processed.add(r["queue_id"])
except json.JSONDecodeError:
print(f"[warn] {jsonl}:{line_num} corrupt; skipping", file=sys.stderr)
return records, processed
# legacy JSON support: load once, convert to JSONL
if out_path.exists():
try:
d = json.loads(out_path.read_text())
records = d.get("results", [])
processed = set(d.get("processed", []))
print(f"[migrate] converting {len(records)} legacy JSON records to JSONL", file=sys.stderr)
with open(jsonl, "w") as f:
for r in records:
f.write(json.dumps(r) + "\n")
return records, processed
except Exception as e:
print(f"[warn] could not parse {out_path}: {e}; starting fresh", file=sys.stderr)
return [], set()
def append_records(out_path: Path, new_records: list):
"""Append-only write to the sister .jsonl file. No re-serialization of prior records."""
if not new_records:
return
jsonl = jsonl_path_for(out_path)
with open(jsonl, "a") as f:
for r in new_records:
f.write(json.dumps(r) + "\n")
def write_compat_summary(out_path: Path, total_records: int, processed: set):
"""Write a tiny JSON pointer file at the legacy out_path so older consumers
still see *something*, but the canonical store is the .jsonl. Cheap."""
summary = {
"_format": "jsonl-pointer",
"_jsonl": str(jsonl_path_for(out_path).name),
"results_count": total_records,
"processed_count": len(processed),
}
tmp = out_path.with_suffix(".tmp.json")
tmp.write_text(json.dumps(summary, indent=2))
os.replace(tmp, out_path)
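# One results.jsonl line looks like this (a sketch; values illustrative, embedding
# elided -- see the record construction in main() below):
#   {"queue_id": "q00001234", "video_path": "/mnt/x/src/vd/clip.mp4",
#    "frame_idx": 482, "time_s": 16.07, "frame_w": 1920, "frame_h": 1080,
#    "faces": [{"bbox": [812, 240, 1024, 512], "det_score": 0.87,
#               "face_short": 212, "pose": [2.1, -18.4, 0.9], "embedding": [...]}]}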
def main():
ap = argparse.ArgumentParser()
ap.add_argument("queue", type=Path)
ap.add_argument("out", type=Path)
ap.add_argument("--limit", type=int, default=None)
args = ap.parse_args()
queue = json.loads(args.queue.read_text())
print(f"[queue] {len(queue)} entries from {args.queue}", flush=True)
args.out.parent.mkdir(parents=True, exist_ok=True)
results, processed = load_existing(args.out)
if processed:
print(f"[resume] {len(processed)} already scored", flush=True)
pending = [e for e in queue if e["queue_id"] not in processed]
if args.limit is not None:
pending = pending[: args.limit]
print(f"[pending] {len(pending)} entries", flush=True)
if not pending:
print("[done] nothing to do")
return
print("[load] FaceAnalysis with DmlExecutionProvider", flush=True)
app = FaceAnalysis(
name="buffalo_l",
root=MODEL_ROOT,
providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
app.prepare(ctx_id=0, det_size=(640, 640))
# group queue by video so we can keep one VideoCapture open and seek
from collections import defaultdict
by_video = defaultdict(list)
for e in pending:
by_video[e["win_video_path"]].append(e)
n_done = 0
n_load_err = 0
last_flush = time.time()
t_start = time.time()
new_buffer: list = []
def flush():
# append-only: only NEW records since last flush get written. O(new_records),
# not O(total_records). Was 11s/flush at 9k records; now <50ms.
if new_buffer:
append_records(args.out, new_buffer)
new_buffer.clear()
write_compat_summary(args.out, len(results), processed)
for vidpath, entries in by_video.items():
# entries are already sorted by frame_idx. Hybrid decode strategy:
# 1. Seek ONCE to the first pending target (cheap keyframe-seek).
# 2. Sequential cap.grab() between subsequent targets (decode without
# BGR conversion until we reach a target, then cap.retrieve()).
# This avoids per-sample seek cost (the original pathology that
# caused 1.4 fps deep in long videos) AND avoids grab-walking from
# frame 0 on resume (the over-correction that gave 0.08 fps).
entries.sort(key=lambda e: e["frame_idx"])
cap = cv2.VideoCapture(vidpath)
if not cap.isOpened():
print(f"[err] cannot open {vidpath}", flush=True)
for e in entries:
rec = {
"queue_id": e["queue_id"], "video_path": e["video_path"],
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
"faces": [], "error": "cap_open",
}
results.append(rec); new_buffer.append(rec)
processed.add(e["queue_id"])
n_done += 1
n_load_err += 1
continue
first_target = entries[0]["frame_idx"]
if first_target > 0:
cap.set(cv2.CAP_PROP_POS_FRAMES, first_target)
cur_frame_idx = first_target - 1
else:
cur_frame_idx = -1
for e in entries:
target = e["frame_idx"]
if target < cur_frame_idx + 1:
# backward jump (only triggers for unsorted entries — defensive)
cap.set(cv2.CAP_PROP_POS_FRAMES, target)
cur_frame_idx = target - 1
# advance up to (but not including) target via grab()-only
ran_out = False
while cur_frame_idx + 1 < target:
ok = cap.grab()
if not ok:
ran_out = True
break
cur_frame_idx += 1
if not ran_out:
ok = cap.grab()
if not ok:
ran_out = True
else:
cur_frame_idx = target
if ran_out:
rec = {
"queue_id": e["queue_id"], "video_path": e["video_path"],
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
"faces": [], "error": "frame_read",
}
results.append(rec); new_buffer.append(rec)
processed.add(e["queue_id"])
n_done += 1
n_load_err += 1
continue
ok, bgr = cap.retrieve()
if not ok or bgr is None:
rec = {
"queue_id": e["queue_id"], "video_path": e["video_path"],
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
"faces": [], "error": "frame_read",
}
results.append(rec); new_buffer.append(rec)
processed.add(e["queue_id"])
n_done += 1
n_load_err += 1
continue
faces = app.get(bgr)
kept_faces = []
H, W = bgr.shape[:2]
for f in faces:
if float(f.det_score) < MIN_DET:
continue
x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
x1 = max(x1, 0); y1 = max(y1, 0)
x2 = min(x2, W); y2 = min(y2, H)
w, h = x2 - x1, y2 - y1
short = min(w, h)
if short < MIN_FACE_PIX:
continue
rec = {
"bbox": [x1, y1, x2, y2],
"det_score": float(f.det_score),
"face_short": int(short),
}
if hasattr(f, "pose") and f.pose is not None:
rec["pose"] = [float(x) for x in f.pose] # pitch, yaw, roll
if hasattr(f, "normed_embedding") and f.normed_embedding is not None:
rec["embedding"] = f.normed_embedding.astype(np.float32).tolist()
kept_faces.append(rec)
rec = {
"queue_id": e["queue_id"], "video_path": e["video_path"],
"frame_idx": e["frame_idx"], "time_s": e["time_s"],
"frame_w": W, "frame_h": H,
"faces": kept_faces,
}
results.append(rec); new_buffer.append(rec)
processed.add(e["queue_id"])
n_done += 1
if (n_done % FLUSH_EVERY == 0) or (time.time() - last_flush) > 30.0:
flush()
last_flush = time.time()
el = time.time() - t_start
rate = n_done / max(0.1, el)
eta = (len(pending) - n_done) / max(0.1, rate) / 60.0
print(f"[scan] {n_done}/{len(pending)} rate={rate:.2f} fps eta={eta:.1f}min "
f"errs={n_load_err}", flush=True)
cap.release()
flush()
el = time.time() - t_start
print(f"[done] {n_done} scored, {n_load_err} errors, {el:.1f}s "
f"({n_done/max(0.1,el):.2f} fps) -> {args.out}", flush=True)
if __name__ == "__main__":
main()
+919
@@ -0,0 +1,919 @@
"""Video target preprocessing pipeline for roop-unleashed.
Discovers video files in an input folder, runs scene-cut detection, samples
frames within each scene, runs face detection + embedding via Windows DML
worker, stitches per-frame detections into face tracks, applies quality
gates, cuts approved segments out with ffmpeg stream-copy, and writes a
report. Output clips have generic UUID names + a sidecar JSON with full
provenance.
Subcommands:
scan list input videos, run ffprobe, write per-video index
scenes PySceneDetect AdaptiveDetector per video; write scenes_<basename>.json
stage write frame queue.json (sampled @ 2 fps within scenes)
merge ingest worker results.json into per-video frame_results
track IoU+embedding stitching of per-frame detections into tracks
score track-level quality gating + segment plan
cut ffmpeg -c copy each accepted segment to <out_dir>/<uuid>.mp4
report HTML preview with thumbnails + identity tags
"""
from __future__ import annotations
import argparse
import json
import math
import re
import shutil
import subprocess
import sys
import time
import uuid
from collections import defaultdict
from pathlib import Path
import numpy as np
DEFAULT_INPUT = Path("/mnt/x/src/vd")
DEFAULT_OUTPUT = Path("/mnt/x/src/vd/ct")
WORK_DIR = Path("/opt/face-sets/work/video_preprocess")
# defaults — first set was strict-portrait; second set loosened for side-profile + segment merging
SAMPLE_FPS = 2.0
QUALITY_YAW_MAX = 75.0 # was 25; allow full 3/4 + profile (face-sets handle it)
QUALITY_PITCH_MAX = 45.0 # was 30
QUALITY_FACE_MIN = 80 # was 96
QUALITY_BLUR_MIN = 50.0
QUALITY_DET_MIN = 0.5 # was 0.6
TRACK_GATE_FRAC = 0.7 # >=70% of frames in track must pass per-frame gates
SEGMENT_MIN_S = 1.0
SEGMENT_MAX_S = 30.0 # was 10
SEGMENT_BRIDGE_S = 3.0 # was 1.0 — within-track pose-failure bridging
SEGMENT_MERGE_GAP_S = 2.0 # NEW — across-track merge if same scene + within this gap
TRACK_IOU_MIN = 0.3
TRACK_EMB_MIN = 0.5
CACHES = [
Path("/opt/face-sets/work/cache/nl_full.npz"),
Path("/opt/face-sets/work/cache/immich_peter.npz"),
Path("/opt/face-sets/work/cache/immich_nic.npz"),
]
FACESETS_ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
IDENTITY_TAG_THRESHOLD = 0.6 # cosine sim to faceset centroid
def wsl_to_win(p: str) -> str:
s = str(p)
if s.startswith("/mnt/"):
return f"{s[5].upper()}:\\{s[7:].replace('/', chr(92))}"
return s
# ----------------------------- ffprobe / scan -----------------------------
def ffprobe(video: Path) -> dict:
cmd = [
"ffprobe", "-v", "error", "-print_format", "json",
"-show_format", "-show_streams", str(video),
]
r = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
if r.returncode != 0:
return {"error": r.stderr.strip()}
return json.loads(r.stdout)
def parse_video_meta(probe: dict) -> dict:
if "error" in probe:
return {"error": probe["error"]}
fmt = probe.get("format", {})
duration = float(fmt.get("duration", 0))
vstream = next((s for s in probe.get("streams", []) if s.get("codec_type") == "video"), None)
if vstream is None:
return {"error": "no video stream"}
fps_str = vstream.get("avg_frame_rate", "0/1")
try:
num, den = (int(x) for x in fps_str.split("/"))
fps = num / den if den else 0.0
except Exception:
fps = 0.0
nb_frames = int(vstream.get("nb_frames", 0)) or int(round(duration * fps))
return {
"duration_s": duration,
"fps": fps,
"frames": nb_frames,
"width": int(vstream.get("width", 0)),
"height": int(vstream.get("height", 0)),
"codec": vstream.get("codec_name"),
}
def cmd_scan(args):
in_dir = Path(args.input)
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
extensions = {".mp4", ".mov", ".mkv", ".m4v", ".avi", ".webm"}
out_root = Path(args.output_dir).resolve()
videos = []
for p in sorted(in_dir.iterdir() if not args.recursive else in_dir.rglob("*")):
if not p.is_file():
continue
if out_root in p.parents or p.resolve() == out_root:
continue # never include the output dir
if p.suffix.lower() not in extensions:
continue
videos.append(p)
print(f"[scan] {len(videos)} candidate videos", file=sys.stderr)
inventory = []
for p in videos:
meta = parse_video_meta(ffprobe(p))
meta["path"] = str(p)
meta["win_path"] = wsl_to_win(str(p))
meta["size"] = p.stat().st_size
inventory.append(meta)
if "error" not in meta:
print(f" {p.name}: {meta['duration_s']:.1f}s @ {meta['fps']:.1f}fps "
f"{meta['width']}x{meta['height']} {meta['codec']}", file=sys.stderr)
else:
print(f" {p.name}: ERROR {meta['error']}", file=sys.stderr)
out.write_text(json.dumps({"input": str(in_dir), "videos": inventory}, indent=2))
print(f"[scan] inventory -> {out}", file=sys.stderr)
# ----------------------------- scenes -----------------------------
def cmd_scenes(args):
from scenedetect import open_video, SceneManager
from scenedetect.detectors import AdaptiveDetector
inv = json.loads(Path(args.inventory).read_text())
out_dir = Path(args.out_dir)
out_dir.mkdir(parents=True, exist_ok=True)
only = set(args.only.split(",")) if args.only else None
for v in inv["videos"]:
if "error" in v:
continue
path = Path(v["path"])
if only and path.name not in only:
continue
out_file = out_dir / (path.stem + ".scenes.json")
if out_file.exists() and not args.force:
continue
print(f"[scenes] {path.name} ...", file=sys.stderr, flush=True)
t0 = time.time()
try:
video = open_video(str(path))
sm = SceneManager()
sm.add_detector(AdaptiveDetector(min_scene_len=int(round(v.get("fps", 30) or 30) * 0.5)))
sm.detect_scenes(video, show_progress=False)
scenes = sm.get_scene_list()
entries = []
for s, e in scenes:
entries.append({
"start_frame": s.frame_num, "end_frame": e.frame_num,
"start_s": s.get_seconds(), "end_s": e.get_seconds(),
"duration_s": e.get_seconds() - s.get_seconds(),
})
# if no cuts found, treat the whole video as one scene
if not entries:
entries = [{
"start_frame": 0, "end_frame": v["frames"],
"start_s": 0.0, "end_s": v["duration_s"],
"duration_s": v["duration_s"],
}]
out_file.write_text(json.dumps({"video": str(path), "scenes": entries}, indent=2))
print(f" {len(entries)} scenes in {time.time()-t0:.1f}s -> {out_file.name}",
file=sys.stderr)
except Exception as e:
print(f" ERROR: {e}", file=sys.stderr)
# ----------------------------- stage -----------------------------
def cmd_stage(args):
inv = json.loads(Path(args.inventory).read_text())
scenes_dir = Path(args.scenes_dir)
queue = []
qid = 0
sample_every = 1.0 / args.sample_fps
for v in inv["videos"]:
if "error" in v:
continue
p = Path(v["path"])
sf = scenes_dir / (p.stem + ".scenes.json")
if not sf.exists():
print(f"[warn] no scenes file for {p.name}; skipping", file=sys.stderr)
continue
scenes = json.loads(sf.read_text()).get("scenes", [])
fps = v.get("fps", 30) or 30
for sc in scenes:
t = sc["start_s"]
while t < sc["end_s"] - 0.01:
fidx = int(round(t * fps))
if fidx >= v["frames"]:
break
queue.append({
"queue_id": f"q{qid:08d}",
"video_path": str(p),
"win_video_path": v["win_path"],
"frame_idx": fidx,
"time_s": t,
})
qid += 1
t += sample_every
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(queue, indent=2))
print(f"[stage] {len(queue)} sampled frames @ {args.sample_fps} fps -> {out}",
file=sys.stderr)
print(f"[stage] win path for worker: {wsl_to_win(str(out))}", file=sys.stderr)
# ----------------------------- merge + track -----------------------------
def cmd_merge(args):
"""Read worker output and group by video_path. Supports either JSONL (one record
per line, the new format) or legacy JSON (results.json with `results` list)."""
src_path = Path(args.results)
records = []
# try JSONL first (sister .jsonl file or .results passed directly)
jsonl_candidate = src_path.with_suffix(".jsonl")
if jsonl_candidate.exists():
with open(jsonl_candidate) as f:
for line in f:
line = line.strip()
if line:
records.append(json.loads(line))
elif src_path.suffix == ".jsonl":
with open(src_path) as f:
for line in f:
line = line.strip()
if line:
records.append(json.loads(line))
else:
# legacy: monolithic JSON
src = json.loads(src_path.read_text())
records = src.get("results", [])
by_video: dict[str, list] = {}
for r in records:
by_video.setdefault(r["video_path"], []).append(r)
for v in by_video:
by_video[v].sort(key=lambda x: x["frame_idx"])
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps({"by_video": by_video}, indent=2))
print(f"[merge] {sum(len(v) for v in by_video.values())} frames across {len(by_video)} videos "
f"-> {out}", file=sys.stderr)
def _iou(a, b):
ax1, ay1, ax2, ay2 = a
bx1, by1, bx2, by2 = b
ix1 = max(ax1, bx1); iy1 = max(ay1, by1)
ix2 = min(ax2, bx2); iy2 = min(ay2, by2)
iw = max(ix2 - ix1, 0); ih = max(iy2 - iy1, 0)
inter = iw * ih
ua = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
return inter / ua if ua > 0 else 0.0
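# e.g. _iou([0, 0, 10, 10], [5, 0, 15, 10]) = 50 / (100 + 100 - 50) = 1/3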
def cmd_track(args):
"""Stitch per-frame face detections into tracks within each scene of each video.
Track = list of (frame_idx, face_idx) where adjacent samples have IoU>=0.3 OR
cosine(emb)>=0.5. New face → new track. No cross-scene merging."""
fr = json.loads(Path(args.frames).read_text())
scenes_dir = Path(args.scenes_dir)
inv = json.loads(Path(args.inventory).read_text())
inv_by_path = {v["path"]: v for v in inv["videos"]}
all_video_tracks: dict[str, list] = {}
for video_path, frames in fr["by_video"].items():
v = inv_by_path.get(video_path, {})
sf = scenes_dir / (Path(video_path).stem + ".scenes.json")
scenes = json.loads(sf.read_text()).get("scenes", []) if sf.exists() else []
# group frames by scene
scene_for_frame = {}
for si, sc in enumerate(scenes):
for f in frames:
if f["frame_idx"] >= sc["start_frame"] and f["frame_idx"] < sc["end_frame"]:
scene_for_frame.setdefault(si, []).append(f)
video_tracks = []
for si, scene_frames in scene_for_frame.items():
scene_frames.sort(key=lambda x: x["frame_idx"])
# tracks = list of dict{ "members": [(frame_idx, face_idx, face_dict)], "last_bbox", "last_emb" }
tracks = []
for f in scene_frames:
claimed = set()
for face_idx, face in enumerate(f.get("faces", [])):
bbox = face["bbox"]
emb = np.array(face.get("embedding", []), dtype=np.float32) if face.get("embedding") else None
best_track = None
best_score = 0.0
for ti, tr in enumerate(tracks):
if ti in claimed:
continue
# staleness in TIME (sample period independent of source fps)
last_time = tr["members"][-1][3]
if f["time_s"] - last_time > 1.5: # stale if >1.5s gap (3 sample periods @ 2fps)
continue
score = _iou(tr["last_bbox"], bbox)
if emb is not None and tr.get("last_emb") is not None:
score = max(score, float(np.dot(tr["last_emb"], emb)))
if score > best_score:
best_score = score
best_track = ti
if best_track is not None and best_score >= min(TRACK_IOU_MIN, TRACK_EMB_MIN):
tr = tracks[best_track]
tr["members"].append((f["frame_idx"], face_idx, face, f["time_s"]))
tr["last_bbox"] = bbox
if emb is not None:
tr["last_emb"] = emb
claimed.add(best_track)
else:
tracks.append({
"members": [(f["frame_idx"], face_idx, face, f["time_s"])],
"last_bbox": bbox,
"last_emb": emb,
})
for tr in tracks:
if len(tr["members"]) < 2:
continue
video_tracks.append({
"scene_idx": si,
"members": [
{"frame_idx": m[0], "face_idx": m[1], "time_s": m[3], "face": m[2]}
for m in tr["members"]
],
})
all_video_tracks[video_path] = video_tracks
print(f"[track] {Path(video_path).name}: {sum(len(s) for s in scene_for_frame.values())} frames "
f"-> {len(video_tracks)} tracks across {len(scene_for_frame)} scenes",
file=sys.stderr)
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps({"by_video": all_video_tracks}, indent=2))
print(f"[track] -> {out}", file=sys.stderr)
# ----------------------------- score (quality gates) -----------------------------
def _track_passes(track, cfg):
"""Per-frame quality gating; return list of bool (does each member pass) +
aggregate stats. cfg: dict with yaw_max, pitch_max, face_min, det_min."""
passes = []
yaws, pitches, sizes, dets = [], [], [], []
for m in track["members"]:
f = m["face"]
yaw = abs(f.get("pose", [0, 0, 0])[1]) if f.get("pose") else 0
pitch = abs(f.get("pose", [0, 0, 0])[0]) if f.get("pose") else 0
size = f.get("face_short", 0)
det = f.get("det_score", 0)
ok = (yaw <= cfg["yaw_max"] and pitch <= cfg["pitch_max"]
and size >= cfg["face_min"] and det >= cfg["det_min"])
passes.append(ok)
yaws.append(yaw); pitches.append(pitch); sizes.append(size); dets.append(det)
return passes, {
"n": len(passes), "n_pass": sum(passes), "frac_pass": sum(passes) / max(1, len(passes)),
"yaw_med": float(np.median(yaws)) if yaws else None,
"pitch_med": float(np.median(pitches)) if pitches else None,
"size_med": float(np.median(sizes)) if sizes else None,
"det_med": float(np.median(dets)) if dets else None,
}
def _build_segments(track, cfg):
"""Return list of (start_s, end_s) accepted sub-segments of this track:
contiguous runs of passing frames meeting min/max duration. Pose-failure
spans <= cfg['bridge_s'] long get bridged across (handles momentary head
turns / detection misses)."""
passes, stats = _track_passes(track, cfg)
members = track["members"]
if not members:
return [], stats
# bridge gaps of failing frames (any width) up to cfg["bridge_s"] seconds
bridged = list(passes)
n = len(bridged)
i = 0
while i < n:
if bridged[i]:
i += 1
continue
# find run of consecutive False starting at i
j = i
while j < n and not bridged[j]:
j += 1
# bridge if surrounded by True on both sides AND time gap <= bridge_s
if i > 0 and j < n and bridged[i - 1] and bridged[j]:
t_left = members[i - 1]["time_s"]
t_right = members[j]["time_s"]
if t_right - t_left <= cfg["bridge_s"]:
for k in range(i, j):
bridged[k] = True
i = j
# find runs of True
runs = []
i = 0
while i < n:
if not bridged[i]:
i += 1; continue
j = i
while j + 1 < n and bridged[j + 1]:
j += 1
s = members[i]["time_s"]
# end is the time of the last passing sample plus one sample-period
e = members[j]["time_s"] + 1.0 / max(SAMPLE_FPS, 1e-3)
runs.append((s, e))
i = j + 1
return runs, stats
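# Bridging, worked through (illustrative; at SAMPLE_FPS=2 samples sit 0.5s apart):
#   passes = [T, T, F, F, T, T] at times [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
#   failing span covers t=1.0..1.5; gap = t_right - t_left = 2.0 - 0.5 = 1.5s <= bridge_s
#   -> bridged to all-True -> single run (0.0, 2.5 + 0.5) = (0.0, 3.0)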
def _merge_close_segments(segs_with_meta, merge_gap_s: float):
"""Merge segments within the same scene that are within merge_gap_s of each other.
segs_with_meta: list of dicts with start_s, end_s, scene_idx, track_idx, stats.
Returns list of merged dicts (one per merged group). Identity-tag and stats
aggregation happen later."""
by_scene: dict[int, list] = {}
for s in segs_with_meta:
by_scene.setdefault(s["scene_idx"], []).append(s)
merged_all = []
for scene_idx, segs in by_scene.items():
segs.sort(key=lambda x: x["start_s"])
cur = None
for s in segs:
if cur is None:
cur = {**s, "track_idxs": [s["track_idx"]], "member_count": s["stats"]["n"],
"pass_count": s["stats"]["n_pass"]}
continue
gap = s["start_s"] - cur["end_s"]
if gap <= merge_gap_s:
# merge
cur["end_s"] = max(cur["end_s"], s["end_s"])
cur["track_idxs"].append(s["track_idx"])
cur["member_count"] += s["stats"]["n"]
cur["pass_count"] += s["stats"]["n_pass"]
# take the better-quality stats for display
if s["stats"]["n_pass"] > cur["stats"]["n_pass"]:
cur["stats"] = s["stats"]
else:
merged_all.append(cur)
cur = {**s, "track_idxs": [s["track_idx"]], "member_count": s["stats"]["n"],
"pass_count": s["stats"]["n_pass"]}
if cur is not None:
merged_all.append(cur)
return merged_all
def _split_long_segments(segs_with_meta, min_s: float, max_s: float):
"""Apply min/max duration: drop too-short, split too-long evenly."""
out = []
for s in segs_with_meta:
dur = s["end_s"] - s["start_s"]
if dur < min_s:
continue
if dur <= max_s:
out.append(s)
continue
n = int(math.ceil(dur / max_s))
chunk = dur / n
base_start = s["start_s"]
for k in range(n):
piece = dict(s)
piece["start_s"] = base_start + k * chunk
piece["end_s"] = base_start + (k + 1) * chunk
out.append(piece)
return out
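# Example (comment-only; values invented): with min_s=2.0 and max_s=10.0,
# a 1.5 s segment is dropped outright, while a 25 s segment becomes
# ceil(25/10) = 3 equal pieces of ~8.33 s each. Splitting evenly keeps
# every piece under max_s and tiles the original span with no gaps.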
# identity tagging via cached arcface centroids
def load_caches_index():
rec_index = {}
alias_map = {}
for c in CACHES:
if not c.exists():
continue
d = np.load(c, allow_pickle=True)
emb = d["embeddings"]
meta = json.loads(str(d["meta"]))
face_records = [m for m in meta if not m.get("noface")]
if "path_aliases" in d.files:
paliases = json.loads(str(d["path_aliases"]))
for canon, alist in paliases.items():
alias_map.setdefault(canon, canon)
for a in alist:
alias_map[a] = canon
for i, rec in enumerate(face_records):
v = emb[i].astype(np.float32)
n = float(np.linalg.norm(v))
if n > 0:
v = v / n
rec_index[(rec["path"], tuple(int(x) for x in rec["bbox"]))] = v
alias_map.setdefault(rec["path"], rec["path"])
return rec_index, alias_map
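# The index keys embeddings by (canonical_path, bbox) so a manifest face can
# be matched to its cached vector even after source files moved. Sketch of a
# lookup (comment-only; paths and boxes invented):
#   canon = alias_map.get("/old/loc/a.jpg", "/old/loc/a.jpg")   # -> "/data/a.jpg"
#   vec = rec_index.get((canon, (10, 20, 110, 140)))            # unit-norm vector or None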
def load_faceset_centroids():
"""Return dict faceset_name -> normalized centroid embedding."""
rec_index, alias_map = load_caches_index()
centroids = {}
for fs_dir in sorted(FACESETS_ROOT.iterdir()):
if not fs_dir.is_dir() or fs_dir.name.startswith("_"):
continue
# exclude era splits to avoid double-tagging within a family
if re.match(r"^faceset_\d+_(?:\d{4}-\d{2,4}|\d{4}|undated)", fs_dir.name):
continue
mp = fs_dir / "manifest.json"
if not mp.exists():
continue
m = json.loads(mp.read_text())
vecs = []
for f in m.get("faces", []):
src = f.get("source"); bbox = f.get("bbox")
if not src or not bbox:
continue
canon = alias_map.get(src, src)
v = rec_index.get((canon, tuple(int(x) for x in bbox)))
if v is None and canon != src:
v = rec_index.get((src, tuple(int(x) for x in bbox)))
if v is not None:
vecs.append(v)
if len(vecs) < 3:
continue
c = np.stack(vecs).mean(axis=0)
n = float(np.linalg.norm(c))
if n > 0:
c = c / n
centroids[fs_dir.name] = c
return centroids
def _track_centroid(track):
embs = [m["face"].get("embedding") for m in track["members"] if m["face"].get("embedding")]
if not embs:
return None
arr = np.array(embs, dtype=np.float32)
c = arr.mean(axis=0)
n = float(np.linalg.norm(c))
return c / n if n > 0 else c
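# Both _track_centroid() and the faceset centroids above are L2-normalized,
# so the np.dot() in cmd_score below is exactly cosine similarity in [-1, 1].
# The tagging rule, in brief (names as used below):
#   sim = float(np.dot(faceset_centroid, union_centroid))
#   tag the segment with the best-matching faceset iff sim >= IDENTITY_TAG_THRESHOLD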
def cmd_score(args):
tr = json.loads(Path(args.tracks).read_text())
inv = json.loads(Path(args.inventory).read_text())
inv_by_path = {v["path"]: v for v in inv["videos"]}
cfg = {
"yaw_max": args.max_yaw, "pitch_max": args.max_pitch,
"face_min": args.min_face, "det_min": args.min_det,
"bridge_s": args.bridge_gap,
}
centroids = {}
if not args.no_identity:
print("[score] loading faceset centroids ...", file=sys.stderr)
t0 = time.time()
centroids = load_faceset_centroids()
print(f"[score] {len(centroids)} active faceset centroids loaded in {time.time()-t0:.1f}s",
file=sys.stderr)
n_total_tracks = 0
n_accepted_tracks = 0
# collect per-track candidate segments first; merging happens per-video below
per_video_candidates: dict[str, list] = {}
track_centroids_by_video: dict[str, dict] = {}
for video_path, tracks in tr["by_video"].items():
per_video_candidates.setdefault(video_path, [])
track_centroids_by_video.setdefault(video_path, {})
for ti, track in enumerate(tracks):
n_total_tracks += 1
runs, stats = _build_segments(track, cfg)
if stats["frac_pass"] < args.track_gate_frac:
continue
if not runs:
continue
n_accepted_tracks += 1
track_centroids_by_video[video_path][ti] = _track_centroid(track)
for (s, e) in runs:
per_video_candidates[video_path].append({
"video_path": video_path,
"track_idx": ti,
"scene_idx": track["scene_idx"],
"start_s": s,
"end_s": e,
"stats": stats,
})
plan = []
for video_path, segs in per_video_candidates.items():
if not segs:
continue
# merge across tracks within the same scene if gap <= merge_gap_s
merged = _merge_close_segments(segs, args.merge_gap)
# apply min/max duration (split long, drop short)
merged = _split_long_segments(merged, args.min_dur, args.max_dur)
for s in merged:
tag = None
tag_sim = None
# identity from union of contributing tracks' centroids
if centroids:
track_centroid_list = [
track_centroids_by_video[video_path].get(ti)
for ti in s.get("track_idxs", [s.get("track_idx")])
]
track_centroid_list = [c for c in track_centroid_list if c is not None]
if track_centroid_list:
union = np.stack(track_centroid_list).mean(axis=0)
nm = float(np.linalg.norm(union))
if nm > 0:
union = union / nm
sims = {name: float(np.dot(c, union)) for name, c in centroids.items()}
best = max(sims, key=sims.get)
if sims[best] >= IDENTITY_TAG_THRESHOLD:
tag = best; tag_sim = round(sims[best], 4)
plan.append({
"video_path": video_path,
"track_idxs": s.get("track_idxs", [s.get("track_idx")]),
"scene_idx": s["scene_idx"],
"start_s": round(s["start_s"], 3),
"end_s": round(s["end_s"], 3),
"duration_s": round(s["end_s"] - s["start_s"], 3),
"member_count": s.get("member_count", s["stats"]["n"]),
"pass_count": s.get("pass_count", s["stats"]["n_pass"]),
"stats": s["stats"],
"identity_tag": tag,
"identity_sim": tag_sim,
"uuid": uuid.uuid4().hex[:12],
})
plan.sort(key=lambda p: (p["video_path"], p["start_s"]))
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps({
"thresholds": {
"yaw_max": args.max_yaw, "pitch_max": args.max_pitch,
"face_min": args.min_face, "blur_min": QUALITY_BLUR_MIN,
"det_min": args.min_det, "track_gate_frac": args.track_gate_frac,
"bridge_s": args.bridge_gap, "merge_gap_s": args.merge_gap,
"min_dur_s": args.min_dur, "max_dur_s": args.max_dur,
"identity_tag_threshold": IDENTITY_TAG_THRESHOLD,
},
"totals": {
"tracks_total": n_total_tracks, "tracks_accepted": n_accepted_tracks,
"segments": len(plan),
},
"plan": plan,
}, indent=2))
print(f"[score] {n_accepted_tracks}/{n_total_tracks} tracks accepted -> {len(plan)} segments "
f"-> {out}", file=sys.stderr)
# ----------------------------- cut -----------------------------
def cmd_cut(args):
plan = json.loads(Path(args.plan).read_text())
out_dir = Path(args.output_dir)
out_dir.mkdir(parents=True, exist_ok=True)
if args.clean:
# remove only existing UUID-named clips + sidecars (12-char hex), keeping any other files
uuid_pat = re.compile(r"^[0-9a-f]{12}\.(mp4|json)$")
n_removed = 0
for child in out_dir.iterdir():
if child.is_file() and uuid_pat.match(child.name):
child.unlink()
n_removed += 1
elif child.is_dir() and re.match(r"^[A-Za-z0-9_.-]+$", child.name):
# subfolder of prior runs — clear UUID files inside, then remove if empty
for inner in child.iterdir():
if inner.is_file() and uuid_pat.match(inner.name):
inner.unlink()
n_removed += 1
try:
child.rmdir()
except OSError:
pass
if n_removed:
print(f"[clean] removed {n_removed} prior UUID clips/sidecars", file=sys.stderr)
n_done = 0
n_err = 0
sidecars = []
for seg in plan["plan"]:
sub = Path(seg["video_path"]).stem
seg_dir = out_dir / sub
seg_dir.mkdir(parents=True, exist_ok=True)
out_video = seg_dir / f"{seg['uuid']}.mp4"
if out_video.exists() and not args.force:
continue
s = seg["start_s"]; d = seg["duration_s"]
cmd = [
"ffmpeg", "-y", "-loglevel", "error",
"-ss", f"{s}",
"-i", seg["video_path"],
"-t", f"{d}",
"-c", "copy",
"-avoid_negative_ts", "make_zero",
str(out_video),
]
r = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
if r.returncode != 0 or not out_video.exists() or out_video.stat().st_size < 1024:
print(f"[cut-err] {seg['uuid']} {seg['video_path']}@{s}+{d}: {r.stderr.strip()[:200]}",
file=sys.stderr)
n_err += 1
if out_video.exists() and out_video.stat().st_size < 1024:
out_video.unlink()
continue
if args.write_sidecar:
sidecar = seg_dir / f"{seg['uuid']}.json"
sidecar.write_text(json.dumps({
"uuid": seg["uuid"],
"source_video": seg["video_path"],
"source_basename": Path(seg["video_path"]).name,
"start_s": s, "end_s": seg["end_s"], "duration_s": d,
"scene_idx": seg["scene_idx"],
"track_idxs": seg.get("track_idxs", [seg.get("track_idx")]),
"member_count": seg.get("member_count"),
"pass_count": seg.get("pass_count"),
"stats": seg["stats"],
"identity_tag": seg["identity_tag"],
"identity_sim": seg["identity_sim"],
"thresholds": plan["thresholds"],
}, indent=2))
sidecars.append(sidecar)
n_done += 1
print(f"[cut] {n_done} clips written, {n_err} errors -> {out_dir}", file=sys.stderr)
# ----------------------------- report -----------------------------
def cmd_report(args):
plan = json.loads(Path(args.plan).read_text())
out_dir = Path(args.out)
out_dir.mkdir(parents=True, exist_ok=True)
thumbs_dir = out_dir / "thumbs"
thumbs_dir.mkdir(exist_ok=True)
output_dir = Path(args.output_dir)
# group by video
by_video: dict[str, list] = {}
for seg in plan["plan"]:
by_video.setdefault(seg["video_path"], []).append(seg)
# generate a thumbnail ~0.1 s into each segment via ffmpeg
print(f"[report] generating thumbs for {len(plan['plan'])} segments", file=sys.stderr)
for seg in plan["plan"]:
thumb = thumbs_dir / f"{seg['uuid']}.jpg"
if thumb.exists():
continue
s = seg["start_s"] + 0.1
cmd = [
"ffmpeg", "-y", "-loglevel", "error",
"-ss", f"{s}",
"-i", seg["video_path"],
"-frames:v", "1",
"-vf", "scale=240:-1",
str(thumb),
]
subprocess.run(cmd, capture_output=True, timeout=30)
# render
rows = []
rows.append("<h1>Video target preprocessing &mdash; review</h1>")
t = plan["totals"]
th = plan["thresholds"]
rows.append(f"<p>Tracks accepted: {t['tracks_accepted']}/{t['tracks_total']}; "
f"segments emitted: {t['segments']}.<br>"
f"Thresholds: pose &le;{th['yaw_max']}&deg;yaw / {th['pitch_max']}&deg;pitch, "
f"face_short &ge;{th['face_min']}px, det &ge;{th['det_min']}, "
f"track-gate &ge;{int(100*th['track_gate_frac'])}%, "
f"duration {th['min_dur_s']}{th['max_dur_s']}s. "
f"Output dir: <code>{output_dir}</code></p>")
nav = " · ".join(f"<a href='#v{i}'>{Path(v).name}</a>"
for i, v in enumerate(by_video.keys()))
rows.append(f"<div class='nav'>{nav}</div>")
for vi, (video_path, segs) in enumerate(by_video.items()):
rows.append(f"<section id='v{vi}' class='vid'>")
rows.append(f"<h2>{Path(video_path).name} <small>({len(segs)} segments)</small></h2>")
rows.append("<div class='cells'>")
for seg in sorted(segs, key=lambda x: x["start_s"]):
stats = seg["stats"]
tag = seg["identity_tag"] or ""
tag_sim = seg["identity_sim"]
tag_html = (f"<span class='tag'>{tag} ({tag_sim:.2f})</span>" if tag else "<span class='tag none'>untagged</span>")
sub_name = Path(seg['video_path']).stem
rows.append(
f"<div class='cell'>"
f"<a href='{output_dir}/{sub_name}/{seg['uuid']}.mp4'><img src='thumbs/{seg['uuid']}.jpg' loading='lazy'></a>"
f"<div class='meta'>"
f"<code>{sub_name}/{seg['uuid']}.mp4</code><br>"
f"{seg['start_s']:.1f}s &rarr; {seg['end_s']:.1f}s ({seg['duration_s']:.1f}s)<br>"
f"yaw={stats['yaw_med']:.0f}&deg; size={stats['size_med']:.0f}px det={stats['det_med']:.2f}<br>"
f"pass {stats['n_pass']}/{stats['n']}<br>"
f"{tag_html}"
f"</div></div>"
)
rows.append("</div></section>")
html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Video targets review</title>
<style>
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
h1, h2 {{ margin-top: 1em; }} h2 {{ border-bottom: 1px solid #333; padding-bottom: 4px; }}
small {{ color:#999; font-weight:normal; }}
section.vid {{ background:#1a1a1a; border-radius:6px; padding:12px; margin:12px 0; }}
.cells {{ display:flex; flex-wrap:wrap; gap:8px; }}
.cell {{ background:#222; border-radius:4px; padding:6px; width:260px; font-size:11px; font-family:monospace; }}
.cell img {{ width:100%; height:auto; border-radius:3px; }}
.meta {{ padding-top:4px; line-height:1.4; }}
.tag {{ display:inline-block; padding:1px 6px; background:#5fa05f; color:#000; border-radius:2px; }}
.tag.none {{ background:#444; color:#aaa; }}
.nav {{ position:sticky; top:0; background:#111; padding:.5em 0; border-bottom:1px solid #333; font-size:12px; }}
a {{ color:#6cf; }}
code {{ background:#000; padding:1px 4px; border-radius:2px; }}
</style></head>
<body>
{''.join(rows)}
</body></html>"""
out_html = out_dir / "index.html"
out_html.write_text(html)
print(f"[report] -> {out_html}", file=sys.stderr)
# ----------------------------- main -----------------------------
def main():
ap = argparse.ArgumentParser()
sub = ap.add_subparsers(dest="cmd", required=True)
s = sub.add_parser("scan")
s.add_argument("--input", default=str(DEFAULT_INPUT))
s.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
s.add_argument("--recursive", action="store_true")
s.add_argument("--out", required=True)
s.set_defaults(func=cmd_scan)
sc = sub.add_parser("scenes")
sc.add_argument("--inventory", required=True)
sc.add_argument("--out-dir", required=True)
sc.add_argument("--only", default=None, help="comma-separated basenames to limit run")
sc.add_argument("--force", action="store_true")
sc.set_defaults(func=cmd_scenes)
st = sub.add_parser("stage")
st.add_argument("--inventory", required=True)
st.add_argument("--scenes-dir", required=True)
st.add_argument("--sample-fps", type=float, default=SAMPLE_FPS)
st.add_argument("--out", required=True)
st.set_defaults(func=cmd_stage)
m = sub.add_parser("merge")
m.add_argument("--results", required=True)
m.add_argument("--out", required=True)
m.set_defaults(func=cmd_merge)
tr = sub.add_parser("track")
tr.add_argument("--frames", required=True)
tr.add_argument("--scenes-dir", required=True)
tr.add_argument("--inventory", required=True)
tr.add_argument("--sample-fps", type=float, default=SAMPLE_FPS)
tr.add_argument("--out", required=True)
tr.set_defaults(func=cmd_track)
sc2 = sub.add_parser("score")
sc2.add_argument("--tracks", required=True)
sc2.add_argument("--inventory", required=True)
sc2.add_argument("--out", required=True)
sc2.add_argument("--no-identity", action="store_true")
sc2.add_argument("--max-yaw", type=float, default=QUALITY_YAW_MAX)
sc2.add_argument("--max-pitch", type=float, default=QUALITY_PITCH_MAX)
sc2.add_argument("--min-face", type=int, default=QUALITY_FACE_MIN)
sc2.add_argument("--min-det", type=float, default=QUALITY_DET_MIN)
sc2.add_argument("--track-gate-frac", type=float, default=TRACK_GATE_FRAC)
sc2.add_argument("--bridge-gap", type=float, default=SEGMENT_BRIDGE_S,
help="bridge within-track failure gaps up to this many seconds")
sc2.add_argument("--merge-gap", type=float, default=SEGMENT_MERGE_GAP_S,
help="merge across-track segments in same scene if within this gap")
sc2.add_argument("--min-dur", type=float, default=SEGMENT_MIN_S)
sc2.add_argument("--max-dur", type=float, default=SEGMENT_MAX_S)
sc2.set_defaults(func=cmd_score)
cu = sub.add_parser("cut")
cu.add_argument("--plan", required=True)
cu.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
cu.add_argument("--force", action="store_true")
cu.add_argument("--clean", action="store_true",
help="remove prior UUID-named clips before cutting (preserves non-UUID files)")
cu.add_argument("--write-sidecar", action="store_true",
help="emit <uuid>.json provenance sidecar alongside each clip (default off)")
cu.set_defaults(func=cmd_cut)
rp = sub.add_parser("report")
rp.add_argument("--plan", required=True)
rp.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
rp.add_argument("--out", required=True)
rp.set_defaults(func=cmd_report)
args = ap.parse_args()
args.func(args)
if __name__ == "__main__":
main()
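# Illustrative end-to-end chain (comment-only; every path is a placeholder
# and the intermediate filenames are assumptions, not fixed by this script):
#   python video_target_pipeline.py scan   --input IN_DIR --out work/inventory.json
#   python video_target_pipeline.py scenes --inventory work/inventory.json --out-dir work/scenes
#   python video_target_pipeline.py stage  --inventory work/inventory.json --scenes-dir work/scenes --out work/stage.json
#   # ...face detect/embed runs externally, producing a results file; then:
#   python video_target_pipeline.py merge  --results RESULTS --out work/frames.json
#   python video_target_pipeline.py track  --frames work/frames.json --scenes-dir work/scenes \
#     --inventory work/inventory.json --out work/tracks.json
#   python video_target_pipeline.py score  --tracks work/tracks.json --inventory work/inventory.json --out work/plan.json
#   python video_target_pipeline.py cut    --plan work/plan.json --output-dir OUT_DIR
#   python video_target_pipeline.py report --plan work/plan.json --output-dir OUT_DIR --out work/report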