# face-sets

Sort photos by similar face using InsightFace embeddings + agglomerative clustering, refine into per-identity sets, and export ready-to-drop bundles for face-swap tooling (roop-unleashed, etc.).

## Pipeline

`sort_faces.py` is a single-file CLI with seven subcommands:

| step | what it does |
|-------------|-------------------------------------------------------------------------------------------------------------|
| embed | Recursively scan a source tree, detect + embed every face, write `.npz` cache. Resumable; sha256-dedup. |
| cluster | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` with manifest. |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/`. |
| dedup | Post-hoc near-duplicate report: byte-identical + visual near-dupe groups → `<cache>.duplicates.json`. |
| extend | Fold new embeddings into an existing raw/refine output via nearest person-centroid without renumbering. |
| enrich | Re-detect each cached face to persist `landmark_2d_106`, `landmark_3d_68`, and pose (pitch/yaw/roll) into the cache. |
| export-swap | Per-identity export: tight outlier gate + visual-dupe collapse + composite quality rank + single-face PNG crops + `.fsz` bundles (top-N and full) ready for roop-unleashed. Optional singleton rescue into `_candidates/`. |

### Design principles

- **embed is resumable and incremental.** It loads any existing cache at the target path and only hashes / embeds files it has not seen. An atomic flush every 50 new files means a mid-run crash loses at most ~50 embeddings.
- **Byte-identical duplicates are sha256-grouped at listing time.** The canonical file is embedded once; other paths with the same hash become `aliases` on the cache's top-level `path_aliases` dict. Every alias is materialized by `cluster` / `refine` / `export-swap`, so each on-disk location is represented in the output.
- **`safe_dst_name` always flattens the absolute path** (sketched below). This keeps output filenames stable across runs even as `src_root` changes between embed / extend / export invocations.
- **Caches and outputs stay out of git** via `.gitignore`; defaults live under `work/`.
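
A minimal sketch of the flattening idea. This is a hypothetical shape — the real `safe_dst_name` lives in `sort_faces.py` and may differ in detail; the point is only that the whole absolute path is encoded into one filename:

```python
# Hypothetical sketch of the idea behind safe_dst_name (not the actual
# implementation): encode the absolute source path into a single
# filesystem-safe output filename so it stays unique and run-stable.
from pathlib import Path

def safe_dst_name(src_path: str) -> str:
    """Flatten an absolute path like /a/b/c.jpg into a_b_c.jpg."""
    p = Path(src_path).resolve()
    # Join all path components with underscores, dropping the root marker.
    return "_".join(part for part in p.parts if part != "/")
```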

## Typical end-to-end run

```bash
SRC=/mnt/x/src/nl
CACHE=work/cache/nl_full.npz
OUT=/mnt/e/temp_things/fcswp/nl_sorted

# 1. Embed (CPU; InsightFace buffalo_l). Resumable on re-run.
python sort_faces.py embed "$SRC" "$CACHE"

# 2. Raw clusters (one person_NNN/ per multi-face cluster).
python sort_faces.py cluster "$CACHE" "$OUT/raw_full"

# 3. Refined facesets (quality-gated per-identity sets).
python sort_faces.py refine "$CACHE" "$OUT/facesets_full"

# 4. Near-duplicate report (byte + visual).
python sort_faces.py dedup "$CACHE"

# 5. Enrich the cache with landmarks + pose (needed by export-swap).
python sort_faces.py enrich "$CACHE"

# 6. Export roop-unleashed-ready bundles.
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Merging a new source into an existing result

```bash
# Embed new source into the same cache (resume from existing embeddings + aliases).
python sort_faces.py embed /mnt/x/src/lzbkp_red "$CACHE"

# Fold new faces into raw_full + facesets_full without renumbering.
python sort_faces.py extend "$CACHE" "$OUT/raw_full" --refine-out "$OUT/facesets_full"

# Refresh the swap-ready export to reflect the merge.
python sort_faces.py enrich "$CACHE"
python sort_faces.py export-swap "$CACHE" \
    "$OUT/facesets_full/refine_manifest.json" "$OUT/facesets_swap_ready" \
    --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Importing hand-sorted folders as identities

When source folders are already hand-sorted by person (one folder per identity), the clustering path is the wrong tool — the identity is asserted, not inferred. The orchestration script `work/build_folders.py` covers this case:

- For each trusted folder, it filters cache records that fall under it, builds an identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so bystanders in group photos drop out, and writes a synthetic `refine_manifest.json` (the centroid construction is sketched after this list).
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) into every identity centroid within a tight cosine cutoff (default 0.45). A multi-identity photo lands in multiple facesets; `export-swap`'s per-bbox outlier filter ensures each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and merges the new entries into the canonical `facesets_swap_ready/manifest.json` (existing facesets are left untouched).
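
A minimal sketch of the two-pass centroid construction, assuming L2-normalized ArcFace embeddings as stored in the cache (the actual logic lives in `work/build_folders.py` and may differ):

```python
# Sketch: build a trusted-identity centroid with two-pass outlier rejection.
# Pass 1 drops faces beyond a loose cutoff, pass 2 re-centers and tightens,
# so bystanders in group photos fall out of the centroid.
import numpy as np

def trusted_centroid(embs: np.ndarray, cut1: float = 0.55, cut2: float = 0.45) -> np.ndarray:
    """embs: (n, 512) L2-normalized embeddings from one hand-sorted folder."""
    def recenter(e: np.ndarray) -> np.ndarray:
        c = e.mean(axis=0)
        return c / np.linalg.norm(c)

    c = recenter(embs)
    keep = 1.0 - embs @ c <= cut1      # cosine distance to first centroid
    c = recenter(embs[keep])
    keep = 1.0 - embs @ c <= cut2      # tighter second pass
    return recenter(embs[keep])
```
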
```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
    python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done

# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"

# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```

The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.) is the only thing to edit when adding more hand-sorted folders later.

### Splitting an identity by era (age sub-clustering)

Long-running source corpora produce identities that span 10+ years. The 2009 face and the 2024 face of the same person sit in the same cluster (correctly — same identity), but a single averaged embedding pulled from that cluster blurs across ages. For face-swap output that should target a specific period, the identity needs to be split by era *after* the identity is established.

`work/age_split_001.py` is a worked example for `faceset_001` and a template for any other identity. The pipeline is:

- **Probe first** with `work/check_faceset001_age.py` — report the intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and the EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/` (the manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance agglomerative, average linkage; see the sketch after this list).
- **Anchor-based fragment assignment** (not transitive merge — that caused year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments attach to the single nearest anchor only if both the centroid distance ≤ 0.40 AND the dominant EXIF year is within ±5 years. Fragments with no qualifying anchor remain standalone (and end up THIN-tagged downstream).
- **EXIF year per source path** with on-disk caching at `work/cache/age_split_exif.json` — the Windows-mount EXIF read is the slowest step, so re-runs after a parameter tweak are nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`, human-readable `<era>.txt` marker. Eras with < 20 face records also drop a `THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to `facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`), leaving only the substantive era buckets at the top level.
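
The sub-clustering call as described, sketched with scikit-learn (assuming a version ≥ 1.2 for the `metric` parameter; the in-repo implementation may differ):

```python
# Sketch: agglomerative sub-clustering on a precomputed cosine-distance
# matrix with average linkage, cutting the dendrogram at a distance threshold.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def subcluster(embs: np.ndarray, threshold: float = 0.35) -> np.ndarray:
    """embs: (n, 512) L2-normalized embeddings -> integer sub-cluster labels."""
    dist = 1.0 - embs @ embs.T              # pairwise cosine distance
    np.fill_diagonal(dist, 0.0)
    dist = np.clip(dist, 0.0, None)         # guard tiny negative float noise
    model = AgglomerativeClustering(
        n_clusters=None,
        metric="precomputed",
        linkage="average",
        distance_threshold=threshold,
    )
    return model.fit_predict(dist)
```
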
```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py

# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```

For the `faceset_001` run on the 5260-face `nl_full.npz`, this produced 6 substantive era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282) plus 68 thin/fragment buckets quarantined under `_thin/`.

### Discovering new identities in a mixed bucket

A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the hand-sorted case: identities have to be discovered, not asserted, but should not collide with already-known identities or scramble their numbering.

`work/cluster_osrc.py` is the worked example. The pipeline:

- **Filter cache to the source root**, including any byte-aliased path that resolves under it.
- **Drop already-covered faces** by comparing each candidate to the centroids of the existing canonical facesets at the `EXISTING_MATCH_THRESHOLD` (default 0.45 — the same cutoff as `build_folders.py`'s osrc routing; sketched after this list). These faces are already routed by `extend` / `build_folders.py` and shouldn't seed new facesets.
- **Cluster the unmatched** at cos-dist 0.55 (matches the `extend` default for the new-cluster phase).
- **Apply `refine`-equivalent gates** per cluster: `face_short`, `blur`, `det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for clusters of size ≥ 4. Keep clusters whose surviving unique-source-path count is ≥ `MIN_FACES`.
- **Number new facesets past the existing maximum** (`START_NNN`), so `faceset_001..NNN` are never disturbed.
- **Synthesize a refine manifest** and run `cmd_export_swap` against it, then move the resulting dirs into `facesets_swap_ready/` and append to the top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance marker.
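
A sketch of the already-covered test, assuming L2-normalized candidate embeddings and faceset centroids (the real code is in `work/cluster_osrc.py`):

```python
# Sketch: keep only candidates that match NO existing identity centroid
# within the cutoff; everything else is considered already routed.
import numpy as np

def uncovered_mask(cands: np.ndarray, centroids: np.ndarray, thr: float = 0.45) -> np.ndarray:
    """cands: (n, 512), centroids: (k, 512), both L2-normalized.
    Returns a boolean mask selecting candidates free to seed new facesets."""
    dists = 1.0 - cands @ centroids.T       # (n, k) cosine distances
    return dists.min(axis=1) > thr
```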

Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new source — the `cluster_osrc.py` step then operates against the canonical cache and doesn't need `raw_full/` for input:

```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
#    person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
    --refine-out "$OUT/facesets_full"

# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
#    without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run

# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```

For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by existing identities), this produced 6 new facesets (`faceset_020..025`, sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to export-swap's tighter `min_face_short=100` gate).

### Importing identities from a self-hosted Immich library

`work/immich_stage.py` + `work/embed_worker.py` + `work/cluster_immich.py` together import an Immich library at scale, with the embed step running on a Windows AMD GPU via DirectML and everything else on WSL. Three pieces:

1. **`work/immich_stage.py` (WSL)** — pages every IMAGE asset via `/search/metadata`, fetches each asset's `/faces?id=` to read Immich's own ML-driven bboxes, scales each bbox to original-image coordinates, and prefilters by `face_short ≥ 90`. For survivors it downloads the original, sha256-deduplicates against the canonical `nl_full.npz` and against same-run staged files, and saves to `/mnt/x/src/immich/<user>/<rel>`. Writes a `queue.json` that the embed worker consumes. 8 concurrent worker threads run the full per-asset I/O chain (`/faces` → filter → `/original`), so 8 workers ≈ 8× the serial throughput.
2. **`work/embed_worker.py` (Windows venv at `C:\face_embed_venv\`)** — loads `insightface.FaceAnalysis(buffalo_l)` with the `DmlExecutionProvider` and runs detection + landmarks + recognition over the queue. Produces a `.npz` cache that's bit-identical in schema to what `sort_faces.py:cmd_embed` writes, so the result is directly loadable by `load_cache()`. The cache already includes the post-`enrich` fields (`landmark_2d_106`, `landmark_3d_68`, `pose`) because FaceAnalysis returns them for free. AMD Vega gives ~7.5× real-pipeline speedup over CPU.
3. **`work/cluster_immich.py` (WSL)** — mirrors `cluster_osrc.py`'s shape but reads from `immich_<user>.npz`. Builds existing-identity centroids from every canonical `faceset_NNN/` in `facesets_swap_ready/` (skipping era splits and `_thin/`), drops immich faces matching at cos-dist ≤ 0.45, clusters the rest at 0.55, applies refine gates, numbers new facesets past the existing maximum, and feeds `cmd_export_swap` via a synthetic manifest.

`work/finalize_immich.sh <user>` chains queue → Windows embed → cache copy back → cluster_immich, with logging.

The Immich admin API key + base URL come from environment variables:

```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=...   # admin or per-user key
python work/immich_stage.py --user peter --workers 8
bash work/finalize_immich.sh peter
```

For the 2026-04-26 run against `https://fotos.computerliebe.org` (Immich v2.7.2), with the admin API key:

| step | result |
|------|--------|
| stage | 53,842 assets seen, **10,261 staged** (~10 GB), 978 byte-deduped against `nl_full.npz`, 2,976 internal byte-duplicates, 39K skipped no-face / no-big-face |
| Windows DML embed | 19,462 face records + 1 noface in **64.6 min** (2.6 img/s end-to-end) |
| matched existing identities | **8,103 of 19,480 (42%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+2,666), faceset_001 (+1,856), faceset_003 (+670) |
| new clusters | 2,534 at threshold 0.55 → 239 surviving refine gates → **185 emitted** as `faceset_026..264` (gaps where export-swap's tighter outlier filter dropped clusters below the export quality bar) |

A second 2026-04-26 run with **nic's per-user API key** confirmed the expected behavior: 25,777 of nic's IMAGE assets were enumerated (close to her `/server/statistics` count of 25,786; the difference of 9 ≈ the transient errors that didn't get marked seen), **7,834 staged** (30% face-bearing-with-big-face, denser than peter's 19%), 519 byte-deduped vs `nl_full.npz`, **0 internal byte-duplicates** (a cleaner library than peter's 2,976), 54 transient errors.

Embed + cluster on the nic queue:

| step | result |
|------|--------|
| Windows DML embed | 15,627 face records + 1 noface in **59 min** (2.2 img/s end-to-end), 7 load errors |
| matched existing identities | **6,770 of 15,627 (43%)** at cos-dist ≤ 0.45; biggest hits faceset_002 (+3,261), faceset_008 (+1,461), faceset_001 (+955), faceset_007 (+408) |
| new clusters | 3,787 at threshold 0.55 → 129 surviving refine gates → **95 emitted** as `faceset_265..NNN` (gaps where export-swap's 0.45 outlier gate dropped clusters below the export bar) |

Top-level `facesets_swap_ready/manifest.json` after both Immich runs: **311 substantive facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted + 6 era splits + 6 osrc-discovered + 185 peter-Immich + 95 nic-Immich) + 68 thin_eras under `_thin/`.

`work/immich_stage.py` carries a built-in **outage circuit breaker**: after 12 consecutive HTTP errors it probes Immich; if that probe also fails, the script exits cleanly with code 2, state preserved. This let the nic run survive a mid-stage Immich outage — the script paused, the operator confirmed connectivity was back, and the same command resumed from the saved `state.json` without re-fetching what was already done.

**Important caveats for Immich v2.7.2**:

- The `userIds` filter on `/search/metadata` is **silently ignored** when the API key is bound to a different user. The "import everything the API key can see" semantics are what you actually get; cross-user isolation is enforced server-side.
- `/server/statistics` reports counts that under-count what `/search/metadata` actually returns (e.g. external-library thumbnail dirs that got indexed because the import path included them). Don't trust the statistics number as a denominator.
- A meaningful fraction of `originalPath`-based assets are *Immich's own thumbnails* (`<library_root>/thumbs/.../-preview.jpeg`) — included if the external library's import path covers the thumbs directory and the exclusion patterns don't list `**/thumbs/**`. For our run, 5,563 of 10,261 staged were thumbnails. They embed and cluster fine, but the resulting faces are lower-resolution.

## Key defaults

`refine`:

| flag | default | meaning |
|-------------------------|--------:|---------|
| `--initial-threshold` | 0.55 | cosine distance for stage-1 clustering |
| `--merge-threshold` | 0.40 | centroid-level merge of over-split clusters |
| `--outlier-threshold` | 0.55 | drop a face if its cosine distance from the cluster centroid exceeds this (only for clusters ≥ 4) |
| `--min-faces` | 15 | minimum unique images per faceset |
| `--min-short` | 90 | minimum short-edge pixels of the face bbox |
| `--min-blur` | 40.0 | Laplacian-variance blur gate |
| `--min-det-score` | 0.6 | InsightFace detector score gate |
| `--mode` | copy | copy / move / symlink |

`export-swap`:

| flag | default | meaning |
|-------------------------------|--------:|---------|
| `--top-n` | 30 | size of the `<faceset>_topN.fsz` bundle |
| `--outlier-threshold` | 0.45 | tighter than refine; trims the cluster boundary before averaging |
| `--pad-ratio` | 0.5 | padding around the face bbox for the PNG crop |
| `--out-size` | 512 | PNG output is square `out_size × out_size` |
| `--min-face-short` | 100 | export gate; stricter than refine's 90 |
| `--candidates` | off | rescue `_singletons/` into `_candidates/` for manual review |
| `--candidate-match-threshold` | 0.55 | cos-dist cutoff for singleton → existing faceset |
| `--candidate-min-score` | 0.40 | composite-quality floor for candidates |

The composite quality score in `export-swap` is `0.30·frontality + 0.20·det_score + 0.20·landmark_symmetry + 0.15·face_size + 0.15·sharpness`, each component normalized to `[0, 1]`.
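
As a function (the five component scores are assumed to arrive already normalized to `[0, 1]`; how each is normalized is internal to `export-swap`):

```python
# The weighted sum from the text; all inputs are assumed pre-normalized.
def composite_quality(frontality: float, det_score: float,
                      landmark_symmetry: float, face_size: float,
                      sharpness: float) -> float:
    return (0.30 * frontality + 0.20 * det_score + 0.20 * landmark_symmetry
            + 0.15 * face_size + 0.15 * sharpness)
```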

## Post-export corpus maintenance

The `sort_faces.py` pipeline above produces `facesets_swap_ready/`. Four orchestration scripts under `work/` operate on that already-built corpus to clean it up over time, plus a fifth that preprocesses swap *targets*:

| script | purpose |
|--------|---------|
| `work/filter_occlusions.py` (+ Windows `work/clip_worker.py`) | Drop PNGs of masked / sunglassed faces using open_clip ViT-L-14/dfn2b_s39b zero-shot scoring. Image-level threshold 0.7; faceset-level quarantine at 40% domain dominance. WSL stages a queue, Windows DML scores, WSL applies. See `docs/analysis/clip-occlusion-filter.md`. |
| `work/consolidate_facesets.py` | Merge duplicate identities (centroid cosine sim ≥ 0.55, confident at ≥ 0.65; **complete-linkage** to defeat single-link chaining). Pulls embeddings from cache, no GPU. See `docs/analysis/identity-consolidation-and-age-extend.md`. |
| `work/age_extend_001.py` | Slot newly-added PNGs into existing era buckets of `faceset_001` (anchor cosine distance ≤ 0.40 AND year delta ≤ 5). Same anchor-fragment rule as `age_split_001.py`. |
| `work/dedup_optimize.py` (+ Windows `work/multiface_worker.py`) | (a) cross-family SHA256 byte-dedup, (b) within-faceset near-dup at cosine sim ≥ 0.95, (c) multi-face audit (re-detect via insightface, drop PNGs with face_count ≠ 1). The single-face invariant is load-bearing for roop. See `docs/analysis/dedup-and-roop-optimization.md`. |
| `work/video_target_pipeline.py` (+ Windows `work/video_face_worker.py` + `work/run_video_pipeline.sh` chain) | Target-side preprocessing: scan a folder of videos → PySceneDetect shot-cuts → 2 fps frame sampling → DML face detection + embedding → IoU+embedding tracking → quality-gated segments (yaw ≤ 75°, face ≥ 80 px, det ≥ 0.5, ≥ 70% pass-rate, 1–120 s duration, 2 s cross-track merge gap) → ffmpeg stream-copy into UUID-named clips. Output organized into per-source subfolders. Provenance sidecars are opt-in (`cut --write-sidecar` or `SIDECAR=yes` env var); the full plan is always retained in the per-batch `plan.json`. See `docs/analysis/video-target-preprocessing.md`. |

All four corpus-maintenance scripts operate idempotently and reversibly: dropped PNGs go to `<faceset>/faces/_dropped/`, quarantined whole facesets go to `facesets_swap_ready/_masked/` or `_merged/` (parallel to the existing `_thin/`). The master `manifest.json` partitions entries across `facesets[]`, `masked[]`, `thin_eras[]`, and `merged[]` arrays, plus per-run provenance blocks (`occlusion_filter_run`, `merge_run`, `age_extend_runs`, `dedup_runs`, `multiface_runs`).

## Downstream: roop-unleashed

The `.fsz` bundles emitted by `export-swap` drop straight into roop-unleashed's Face Swap tab. Each PNG inside is already a clean single-face crop — critical, because the roop-unleashed loader appends every face it re-detects in each PNG to the averaged identity embedding.

Highly recommended at swap time: enable **Select post-processing = GFPGAN** with the **Original/Enhanced image blend ratio = 0.85** (the 0.65 default is conservative). See `docs/analysis/facesets-downstream-refinement-evaluation.md` for the full evaluation.

## Layout

```
/opt/face-sets/
├─ README.md                    (this file)
├─ sort_faces.py                (the tool)
├─ docs/
│  └─ analysis/
│     └─ facesets-downstream-refinement-evaluation.md
└─ work/                        (gitignored except force-tracked .py / .sh)
   ├─ build_folders.py          (hand-sorted-folder orchestration)
   ├─ check_faceset001_age.py   (age-split readiness probe)
   ├─ age_split_001.py          (age-split orchestration; faceset_001)
   ├─ age_extend_001.py         (extends existing era buckets with new PNGs)
   ├─ cluster_osrc.py           (mixed-bucket identity discovery)
   ├─ immich_stage.py           (Immich library staging, parallel)
   ├─ embed_worker.py           (Windows DML embed worker; C:\face_embed_venv\)
   ├─ cluster_immich.py         (Immich identity discovery + export)
   ├─ finalize_immich.sh        (chains queue → embed → cluster)
   ├─ filter_occlusions.py      (CLIP zero-shot mask + sunglasses filter)
   ├─ clip_worker.py            (Windows DML CLIP worker; C:\clip_dml_venv\)
   ├─ consolidate_facesets.py   (duplicate-identity merger; complete-linkage)
   ├─ dedup_optimize.py         (byte + near-dup + multi-face audit driver)
   ├─ multiface_worker.py       (Windows DML multi-face audit worker)
   ├─ video_target_pipeline.py  (video → swappable segment cuts orchestration)
   ├─ video_face_worker.py      (Windows DML per-frame face worker; JSONL append-only)
   ├─ run_video_pipeline.sh     (generic chain driver: scenes → stage → worker → cut)
   ├─ status_video_pipeline.sh  (status helper for any video_pipeline log)
   ├─ synthetic_*_manifest.json (per-run synthetic refine manifests)
   ├─ immich/
   │  ├─ users.json             (label -> userId map; gitignored)
   │  └─ <user>/{queue,state,aliases}.json (per-user staging artifacts)
   ├─ cache/
   │  ├─ nl_full.npz            (canonical cache + duplicates.json)
   │  ├─ immich_<user>.npz      (per-user immich embeddings)
   │  └─ age_split_exif.json    (path → EXIF-year cache)
   └─ logs/
      └─ *.log                  (every long step writes here)
```
# Age-splitting faceset_001 into era-specific facesets

_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records). Source: `work/age_split_001.py` and `work/check_faceset001_age.py`._

## 1. Why split

`faceset_001` aggregates a single identity across roughly 20 years of source material. The averaged embedding consumed by roop-unleashed therefore mixes features from very different ages. For face-swap output that should target a specific period (e.g. "this person around 2011" or "this person around 2018–19"), the identity needs to be split *after* clustering — the cluster is correctly one identity, but the averaged embedding is the problem.

## 2. Evidence the identity is age-sortable

`work/check_faceset001_age.py` probes `faceset_001` (707 curated faces).

**Pairwise cos-distance histogram** (249,571 pairs):

| range | pairs |
|-------------|-------:|
| [0.0, 0.2) | 1,250 |
| [0.2, 0.3) | 11,277 |
| [0.3, 0.4) | 63,920 |
| [0.4, 0.5) | 92,555 |
| [0.5, 0.6) | 63,288 |
| [0.6, 0.7) | 16,048 |
| [0.7, 0.8) | 1,217 |
| [0.8, 1.0) | 16 |

Mean 0.453, median 0.452, max 0.842. The cluster is internally diffuse — wide enough to admit non-trivial sub-structure without crossing the inter-identity boundary (which sits well above 0.6 in this dataset).

**Sub-clusters at threshold 0.35** (precomputed cos-dist, average linkage): 156 sub-clusters, 10 with ≥ 10 faces, top-5 sizes [120, 105, 47, 40, 24]. The top sub-clusters align with distinct EXIF year medians (2011, 2019, 2018, 2011, 2010), so the split is meaningful.
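
A sketch of the probe's histogram computation, assuming L2-normalized embeddings (the real probe is `work/check_faceset001_age.py`):

```python
# Sketch: bucket all unique pairwise cosine distances of one faceset.
import numpy as np

def pair_histogram(embs: np.ndarray) -> None:
    """embs: (n, 512) L2-normalized embeddings of one curated faceset."""
    dist = 1.0 - embs @ embs.T
    iu = np.triu_indices(len(embs), k=1)    # unique pairs only
    pairs = dist[iu]
    edges = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
    counts, _ = np.histogram(pairs, bins=edges)
    for lo, hi, c in zip(edges, edges[1:], counts):
        print(f"[{lo:.1f}, {hi:.1f})  {c:>7d}")
    print(f"mean {pairs.mean():.3f}  median {np.median(pairs):.3f}  max {pairs.max():.3f}")
```
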
## 3. Pipeline

`work/age_split_001.py`:

1. **Seed centroid.** Load the 707 face keys from `facesets_swap_ready/faceset_001/manifest.json`; resolve to cache rows; normalize the mean embedding.
2. **Wide recovery.** Pull every face record under `/mnt/x/src/{nl,lzbkp_red}` from the cache with cos-dist ≤ 0.55 from the seed. The seed is curated and tight, so 0.55 is a safe outer envelope. Result: 1,501 faces from 4,756 candidates.
3. **Quality gate** (mirrors export-swap defaults): `face_short ≥ 100`, `blur ≥ 40.0`, `det_score ≥ 0.6`. Result: 892 → 856 after one re-centroid + tighten pass at 0.50 to absorb the recovery without drift.
4. **Sub-cluster** the survivors at cos-dist 0.35 (precomputed agglomerative, average linkage). 223 raw sub-clusters; top-10 sizes = [127, 97, 55, 42, 40, 25, 17, 14, 13, 11].
5. **EXIF year per source path.** Read `DateTimeOriginal` once per unique path; cache on disk at `work/cache/age_split_exif.json` so re-runs after parameter tweaks skip the slow Windows-mount EXIF read. 728 of 855 paths were dated.
6. **Anchor-based fragment assignment** (replaces a transitive union-find merge that caused observable year drift; sketched after this list):
   - sub-clusters with ≥ 20 faces are *anchors* (6 found: dom-years 2011, 2019, 2018, 2011, 2016, 2010);
   - smaller fragments attach to the single nearest anchor *only if* both `cent_dist ≤ 0.40` AND `|dom_year_anchor − dom_year_fragment| ≤ 5`;
   - anchors do not merge with each other (transitive merging produced anchor-to-anchor year drift across 2010 → 2014 → 2018 in earlier runs);
   - fragments with no qualifying anchor remain standalone.
7. **Per-era export.** Composite-quality rank, single-face square PNG crops (`pad_ratio=0.5`, `out_size=512`), top-N + `_all` `.fsz` bundles, per-era `manifest.json`, `<label>.txt` marker, `THIN.txt` for buckets < 20 faces.
8. **Top-level manifest merge.** New entries are appended to `facesets_swap_ready/manifest.json`. Operationally the THIN buckets are then moved into `_thin/` and partitioned into a `thin_eras` array (with `relpath: _thin/<name>`) so consumers reading `facesets` see only the substantive entries.
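
A sketch of the anchor rule from step 6, with assumed data shapes (the real code is `work/age_split_001.py`):

```python
# Sketch: a fragment joins its single nearest anchor only if BOTH the
# centroid-distance and dominant-year conditions hold; otherwise it stays
# standalone. Anchors never merge with each other.
import numpy as np

def assign_fragment(frag_centroid: np.ndarray, frag_year: int,
                    anchor_centroids: np.ndarray, anchor_years: list[int],
                    max_dist: float = 0.40, max_year_delta: int = 5) -> int | None:
    """Returns the index of the accepting anchor, or None (stay standalone)."""
    dists = 1.0 - anchor_centroids @ frag_centroid   # cosine distances, (k,)
    best = int(np.argmin(dists))
    if dists[best] <= max_dist and abs(anchor_years[best] - frag_year) <= max_year_delta:
        return best
    return None
```
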
## 4. Result

74 era buckets emitted; 6 substantive + 68 thin/standalone fragments.

| era | faces | dom year(s) |
|-----------------------|------:|-------------|
| `faceset_001_2010-13` | 282 | 2011 |
| `faceset_001_2018-20` | 129 | 2019 |
| `faceset_001_2014-17` | 125 | 2018 (anchor sub 15, dom_year 2018) |
| `faceset_001_2018-19` | 107 | 2018 |
| `faceset_001_2005-10` | 88 | 2010 |
| `faceset_001_2011` | 43 | 2011 |

Two distinct 2011 anchors and two 2018-area anchors persist by design — embedding-space distance separated them despite year overlap. Era-label collisions are disambiguated with `_v2` suffixes, but only when both anchors landed on the *same* literal label string (none of the substantive six did).

The 68 thin buckets are largely 1- or 2-face fragments with idiosyncratic embeddings; they are quarantined into `_thin/` rather than deleted because some are legitimate edge poses / lighting / age extremes that may be useful for narrow targeted swaps.

## 5. Re-running and applying to other identities

- **Re-run with different parameters**: just re-execute `age_split_001.py`. Embeddings are loaded from cache, EXIF is loaded from `age_split_exif.json`, and only the sub-cluster + export steps re-run. Total runtime ~2 min.
- **Apply to a different identity**: copy `age_split_001.py` to `age_split_NNN.py` and change `FS001`. The `SCAN_ROOTS`, `RECOVERY_THRESHOLD`, `TIGHTEN_THRESHOLD`, `SUBCLUSTER_THRESHOLD`, `ANCHOR_MIN_SIZE`, `FRAGMENT_CENTROID_MAX`, and `FRAGMENT_YEAR_MAX` defaults are tuned for `faceset_001`'s ~707-face curated cluster; smaller identities likely need a lower `ANCHOR_MIN_SIZE`.
- **Always quarantine THIN buckets** afterwards using the same partition pattern (move to `_thin/`, split the top-level manifest into `facesets` + `thin_eras`). The script appends THIN entries to the top-level manifest as if they were full facesets, so the cleanup is a separate step.

# CLIP zero-shot occlusion filter (masks + sunglasses)

_Run date: 2026-04-27. Driver scripts: `work/filter_occlusions.py`, `work/clip_worker.py`._

## 1. Why

`facesets_swap_ready/` ended the Immich import day with 311 substantive facesets and a long tail of identities whose clusters had latched onto *eyewear or mask appearance* instead of identity (covid-era shots, vacation photos with sunglasses dominating the frame). Two failure modes:

1. **Pollution of the averaged identity** — roop's `FaceSet.AverageEmbeddings()` averages every face in the .fsz. A faceset where 40% of images are sunglassed gives a biased centroid; the swap reproduces sunglass-shaped eye sockets.
2. **Whole-cluster identity drift** — clustering at the embedding level sometimes anchors on the eyewear silhouette rather than the face, producing clusters of "the same sunglasses across multiple people".

A targeted attribute scorer was the cleanest fix.

## 2. Model + prompts

**Model**: `open_clip` `ViT-L-14` / `dfn2b_s39b` (Apple Data Filtering Networks). Best public zero-shot at this size. Loads weights from HF Hub (~890 MB). Bit-identical scores between WSL CPU and Windows DML.

**Prompt design**: per-attribute ensembles of 5–6 positive + 5–6 negative prompts. Positive ensembles are mean-pooled and L2-normalized before softmax.

**Critical bug if forgotten**: CLIP cosine similarities are tiny (in the 0.2–0.3 range). Raw `softmax([sim_pos, sim_neg])` collapses to ~0.5/0.5 on every image. **Multiply by `model.logit_scale.exp()` (~100) before softmax.** Without that scale the entire scorer outputs a uniform 0.5.
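
The fix as code, sketched against open_clip's standard model attributes (`pos` / `neg` here are the mean-pooled, L2-normalized prompt-ensemble text embeddings; the real scorer is `work/clip_worker.py`):

```python
# Sketch: attribute probability for one image. Without the logit_scale
# multiplier (~100x), softmax over ~0.2-0.3 similarities is a uniform 0.5.
import torch

@torch.no_grad()
def attribute_prob(model, image_features: torch.Tensor,
                   pos: torch.Tensor, neg: torch.Tensor) -> float:
    """image_features: one L2-normalized image embedding -> P(attribute)."""
    sims = torch.stack([image_features @ pos, image_features @ neg])
    logits = model.logit_scale.exp() * sims   # the critical scaling step
    return torch.softmax(logits, dim=0)[0].item()
```
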
**Sunglasses prompt pitfall**: the first prompt set caught faces with sunglasses *pushed up on the forehead* with the same probability as faces with sunglasses *covering the eyes* — CLIP detects "presence of sunglasses in frame", not "eyes occluded". Fixed by putting the false positive into the *negative* class explicitly:

```
positive: "a face with dark sunglasses covering the eyes"
          "a portrait with the eyes hidden behind opaque sunglasses"
          ...
negative: "a face with sunglasses pushed up on the forehead, eyes visible below"
          "a face with sunglasses resting on top of the head, eyes visible"
          "a face wearing clear prescription eyeglasses with visible eyes"
          ...
```

Validation pair (faceset_005): sunglasses-on-eyes → 0.91, sunglasses-on-forehead → 0.39. A threshold of 0.7 cleanly separates them.

## 3. Architecture

```
┌──────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/filter_occlusions.py │
│  • stage:  walk facesets/, write queue.json  │
│  • merge:  ingest worker results             │
│  • report: HTML contact sheet                │
│  • apply:  prune + quarantine + re-zip       │
└────────────┬─────────────────────────────────┘
             │ queue.json (paths) via \\wsl.localhost\
             ▼
┌──────────────────────────────────────────────┐
│ Windows C:\clip_dml_venv\                    │
│  /opt/face-sets/work/clip_worker.py          │
│  Python 3.12 + torch 2.4.1 CPU               │
│   + torch-directml 0.2.5 + open_clip_torch   │
│  Reads PNGs from native E:\, writes scores   │
└──────────────────────────────────────────────┘
```

A separate Windows venv (not the existing `C:\face_embed_venv\`) is needed because `torch-directml` brings ~1.5 GB of wheels and version-pinned numpy/pillow that risk breaking the embed_worker venv's `onnxruntime-directml` + `insightface` stack.

## 4. DML throughput surprise

Measured on an AMD Radeon RX Vega:

| model | framework | throughput | speedup vs WSL CPU |
|------|-------|-----------:|-------------------:|
| ViT-L-14 (CLIP, this filter) | open_clip | **1.43 img/s** | **2.4×** |
| buffalo_l (insightface, embed_worker) | onnxruntime | 2.6 img/s | 7.5× |

Only 2.4× because `aten::_native_multi_head_attention` is not implemented in the directml plugin and falls back to CPU. The vision encoder runs on GPU and the attention runs on CPU, alternating per layer. A silenced UserWarning makes this near-invisible. Workable for a one-shot 73-min corpus run, but the embed_worker pattern (pure ONNX) remains the gold standard for DML.

## 5. Thresholds (validated 2026-04-27 on 6,318 PNGs)

| level | threshold | semantics |
|-------|----------:|-----------|
| image | P(positive) ≥ 0.7 | drop the PNG |
| faceset | ≥ 40% of images flagged for either attribute | quarantine the whole faceset to `_masked/` |
| min-survivors | < 5 surviving AND something pruned | quarantine to `_thin/` |

The "AND something pruned" guard is essential — without it, naturally small facesets (hand-sorted with ≤ 4 PNGs) would be incorrectly quarantined for being small even when they have zero occlusions.

## 6. Run results

| action | count | net effect |
|--------|------:|------------|
| keep | 209 | unchanged |
| prune | 46 | 183 PNGs dropped within survivors |
| quarantine_masked | 51 | whole faceset → `_masked/` (11 mask-driven, 40 sunglasses-driven) |
| quarantine_thin | 3 | survivors < 5 → `_thin/` |

Net: 311 active → 255 active after the filter run. 763 PNGs quarantined whole-faceset, 183 pruned within survivors. All dropped PNGs are preserved at `<faceset>/faces/_dropped/` for reversibility. The master manifest gained a `masked[]` array parallel to `thin_eras[]`, plus an `occlusion_filter_run` provenance block.

## 7. Known limitations

- **Per-faceset manifests are NOT updated by `apply`** — only the master manifest is. Each faceset's own `<faceset>/manifest.json` retains stale `faces[]` entries pointing at PNGs that moved into `_dropped/`. Harmless for `.fsz` consumers (the .fsz is re-zipped from current disk state), but downstream tools reading `faces[]` will see broken references. Discovered later by `age_extend_001.py`'s rebuild loop, which generated 42 missing-PNG warnings before being caught.

## 8. Re-running
```bash
# 1. Stage queue from current corpus state
python work/filter_occlusions.py stage --out work/clip_dml/queue.json

# 2. Score on Windows DML (resumable)
"/mnt/c/clip_dml_venv/Scripts/python.exe" work/clip_worker.py \
    work/clip_dml/queue.json work/clip_dml/scores.json --batch 8

# 3. Reshape into per-faceset format, then HTML for visual approval
python work/filter_occlusions.py merge \
    --scores work/clip_dml/scores.json --out work/occlusion_scores.json
python work/filter_occlusions.py report \
    --scores work/occlusion_scores.json --out work/occlusion_review

# 4. Apply (always dry-run first)
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json --dry-run
python work/filter_occlusions.py apply \
    --scores work/occlusion_scores.json --out-plan work/occlusion_apply_plan.json
```
# Corpus dedup + roop-unleashed optimization

_Run date: 2026-04-27. Driver scripts: `work/dedup_optimize.py`, `work/multiface_worker.py`._

After consolidation collapsed duplicate identities and age-extend slotted new PNGs into era buckets, the corpus still carried artifacts that hurt roop's averaged-embedding quality:

- **Burst-photo near-duplicates** within facesets, especially in immich-discovered identities where source libraries had many similar shots within seconds.
- **Cross-faceset byte-identical PNGs** that escaped consolidation's centroid-similarity matching when individual PNGs matched exactly but cluster centroids diverged.
- **Multi-face PNGs** that polluted identity averaging because the roop loader appends every detected face per PNG to the FaceSet (a load-bearing invariant — see § 3).

This pipeline runs three independent passes and an optional fourth, all moving dropped PNGs to `<faceset>/faces/_dropped/` for reversibility.

## 1. Cross-family byte-dedup

SHA256-hash every PNG in the active corpus (parallel I/O via `ThreadPoolExecutor(max_workers=16)`, ~17 s for 5,386 PNGs over the `/mnt/e/` Windows mount). Group by hash; for groups with members in multiple identity families, keep the higher-tier copy.

**Family detection**: the regex `^(faceset_\d+)(?:_.+)?$` captures the parent identity. A family includes the parent plus its era splits (e.g. `faceset_001` + `faceset_001_2010-13`); these are intentional duplications for the era .fsz files and are preserved.
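
A sketch of the pass, assuming each PNG sits at `<faceset>/faces/<name>.png` (the real driver is `work/dedup_optimize.py`):

```python
# Sketch: parallel sha256 hashing, grouping by digest, then flagging groups
# whose members span more than one identity family.
import hashlib, re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

FAMILY = re.compile(r"^(faceset_\d+)(?:_.+)?$")   # era splits fold into the parent

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def cross_family_groups(pngs: list[Path]) -> list[list[Path]]:
    with ThreadPoolExecutor(max_workers=16) as ex:    # hashing is I/O-bound here
        digests = list(ex.map(sha256_of, pngs))
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path, digest in zip(pngs, digests):
        by_hash[digest].append(path)

    def family(p: Path) -> str:
        # Assumes the layout <faceset>/faces/<png>.
        m = FAMILY.match(p.parent.parent.name)
        return m.group(1) if m else p.parent.parent.name

    return [g for g in by_hash.values() if len({family(p) for p in g}) > 1]
```
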
Run results: 20 cross-family hash groups → 24 PNGs dropped. Most cases were small immich identity-cluster errors that consolidation missed because individual PNG embeddings matched but the cluster mean did not.

## 2. Within-faceset near-dup at sim ≥ 0.95

Per-faceset pairwise cosine similarity on cached arcface embeddings. Connected components in the `sim ≥ 0.95` graph. Keep the highest `quality.composite` per component, drop the rest.

**Threshold rationale**: legitimate same-person-different-pose pairs land at 0.5–0.85; ≥ 0.95 means essentially the same shot (burst frames or recompressed dupes). Roop's `FaceSet.AverageEmbeddings()` averages all faces into `faces[0].embedding`; averaging near-identical embeddings ≈ averaging once. Removing them does not lose identity information; it removes a bias weight on the most-photographed moments.
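
A sketch of the component logic, assuming cached L2-normalized embeddings and per-PNG composite scores (the real driver is `work/dedup_optimize.py`):

```python
# Sketch: connected components in the sim >= 0.95 graph; keep the
# best-quality member of each component, drop the rest.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def near_dup_drops(embs: np.ndarray, quality: np.ndarray, thr: float = 0.95) -> list[int]:
    """embs: (n, 512) L2-normalized; quality: (n,) composite scores.
    Returns indices of PNGs to drop within one faceset."""
    sim = embs @ embs.T
    np.fill_diagonal(sim, 0.0)
    n_comp, labels = connected_components(csr_matrix(sim >= thr), directed=False)
    drops: list[int] = []
    for c in range(n_comp):
        members = np.flatnonzero(labels == c)
        if len(members) > 1:
            keep = members[np.argmax(quality[members])]
            drops.extend(int(m) for m in members if m != keep)
    return drops
```
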
Run results: 851 groups → **1,225 PNGs dropped** (23% of the corpus). Most-affected: `faceset_026` (−132 of 262), `faceset_027` (−107), `faceset_028` (−92), `faceset_030` (−92). All immich-discovered identities where the source library had burst sequences.

## 3. Multi-face audit (load-bearing roop invariant)

The roop loader at `roop/ui/tabs/faceswap_tab.py:661–691` runs `extract_face_images(filename, (False, 0))` on every PNG and **appends every detected face** to `face_set.faces`. A multi-face PNG therefore pollutes the averaged identity. The export-swap pipeline drops multi-face crops at creation, but post-pipeline operations (consolidation, age-extend) move PNGs across facesets without re-checking.

**This audit re-detects every PNG** with insightface FaceAnalysis and flags any with `face_count ≠ 1` (counting only faces with `det_score ≥ 0.5` and `face_short ≥ 40`); a sketch of the per-PNG check follows the list. Flagged cases:

- ≥ 2 faces → the loader would inject extra identities into the averaging
- 0 faces → insightface can't find a face on the cropped PNG; useless for roop, would silently fail
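
A sketch of the per-PNG check, using the standard insightface API (the real worker is `work/multiface_worker.py` and runs on `DmlExecutionProvider`):

```python
# Sketch: re-detect a cropped PNG and count qualifying faces; anything with
# face_count != 1 gets flagged for the apply_multiface step.
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

def face_count(png_path: str, min_det: float = 0.5, min_short: int = 40) -> int:
    img = cv2.imread(png_path)
    faces = app.get(img)

    def short_edge(f) -> float:
        x1, y1, x2, y2 = f.bbox
        return min(x2 - x1, y2 - y1)

    return sum(1 for f in faces
               if f.det_score >= min_det and short_edge(f) >= min_short)
```
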
Run results: 4,146 PNGs scored, 332 flagged (272 with 2 faces, 9 with 3, 2 with 4, **49 with 0**). 82 facesets affected.

## 4. DML throughput jump for face crops

The audit reuses the same insightface + onnxruntime-directml stack as `embed_worker.py` but achieves **~19 img/s** on AMD Vega vs embed_worker's 2.6 img/s — same model, same hardware. The difference is input size:

| stage | typical input | DML throughput |
|-------|--------------|---------------:|
| `embed_worker.py` (Immich import) | 1024–4000 px source | 2.6 img/s |
| `multiface_worker.py` (this audit) | 512×512 face crops | **19 img/s** |

Detection on small inputs is fast; recognition on aligned 112×112 inputs costs the same either way. Implication: **any pipeline operating on already-cropped face PNGs can rely on a roughly 7× higher DML throughput ceiling than full-resolution embedding**.

## 5. Architecture

```
┌────────────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/dedup_optimize.py          │
│  • analyze: hashes + within-faceset sim            │
│  • apply: move + re-zip (no GPU)                   │
│  • stage_multiface: write queue.json               │
│  • merge_multiface: ingest worker results          │
│  • apply_multiface: move + re-zip                  │
│  • report: HTML audit                              │
└────────────┬───────────────────────────────────────┘
             │ queue.json via \\wsl.localhost\
             ▼
┌────────────────────────────────────────────────────┐
│ Windows C:\face_embed_venv\                        │
│  /opt/face-sets/work/multiface_worker.py           │
│  insightface FaceAnalysis on DmlExecutionProvider  │
│  Reads PNGs from native E:\, writes face_count     │
└────────────────────────────────────────────────────┘
```

Reuses the existing `C:\face_embed_venv\` (no new venv needed — same insightface stack as `embed_worker.py`).

## 6. Final corpus state (2026-04-27 night)

| metric | start of day | after occlusion filter | after consolidation | after age-extend | after this dedup + multiface |
|--------|-------------:|----------------------:|-------------------:|-----------------:|----------------------------:|
| active facesets | 311 | 255 | 181 | 181 | **181** |
| active PNGs | ~6,440 | 5,386 | 5,386 | 5,400 | **3,849** |
| `_masked/` | 0 | 51 | 51 | 51 | 51 |
| `_thin/` | 68 | 71 | 71 | 71 | 71 |
| `_merged/` | 0 | 0 | 74 | 74 | 74 |

Net reduction at the end of the day: **2,591 PNGs and 130 facesets** removed or quarantined from the active pool. All are preserved on disk for reversibility (`<faceset>/faces/_dropped/` for prunes; `_masked/`, `_merged/`, `_thin/` for quarantines).

## 7. Re-running

Run after any new import / consolidation / extend:

```bash
# 1. Byte-dedup + within-faceset near-dup (CPU only)
python work/dedup_optimize.py analyze --out work/dedup_audit/dedup_plan.json
python work/dedup_optimize.py apply --plan work/dedup_audit/dedup_plan.json

# 2. Multi-face audit on Windows DML (resumable)
python work/dedup_optimize.py stage_multiface --out work/dedup_audit/multiface_queue.json
"/mnt/c/face_embed_venv/Scripts/python.exe" work/multiface_worker.py \
    work/dedup_audit/multiface_queue.json work/dedup_audit/multiface_results.json
python work/dedup_optimize.py merge_multiface \
    --results work/dedup_audit/multiface_results.json \
    --out work/dedup_audit/multiface_plan.json
python work/dedup_optimize.py apply_multiface \
    --plan work/dedup_audit/multiface_plan.json

# 3. HTML audit
python work/dedup_optimize.py report \
    --dedup work/dedup_audit/dedup_plan.json \
    --multiface work/dedup_audit/multiface_plan.json \
    --out work/dedup_audit
```
# Identity consolidation + age-bucket extension

_Run date: 2026-04-27. Driver scripts: `work/consolidate_facesets.py`, `work/age_extend_001.py`._

After the Immich peter + nic imports added 280 new facesets to a corpus that had ~25 canonical identities, many "new" identities were duplicates of existing household members at lower clustering confidence. Two cooperating passes clean this up: identity consolidation merges duplicates, then age-extend slots newly-merged PNGs into the existing era buckets of `faceset_001`.

## 1. Identity consolidation

### 1.1 Approach

For each active faceset, pull cached arcface embeddings from `work/cache/{nl_full,immich_peter,immich_nic}.npz` keyed by `(source, bbox)` from the per-faceset manifest's `faces[]`. Compute the L2-normalized centroid, then the pairwise cosine-similarity matrix.

**Tier-based primary selection** (the lowest tier number wins; size breaks ties):

| tier | sources | rationale |
|-----:|---------|-----------|
| 0 | `faceset_013..019` (hand-sorted) | user's curated labels |
| 1 | `faceset_001..012` (auto-clustered) | well-established household |
| 2 | `faceset_020..025` (osrc) | mixed-bucket discovery |
| 3 | `faceset_026..264` (immich peter) | speculative |
| 4 | `faceset_265+` (immich nic) | speculative |

**Era splits and quarantines are excluded** — `faceset_NNN_<era>`, `_masked/`, and `_thin/` are skipped during analysis.

### 1.2 Single-linkage chains catastrophically — complete-linkage required

The first attempt used connected components on edges ≥ 0.45 and produced a **60-faceset cluster** around `faceset_001` with a min within-group sim of **−0.16** (definitely-different people bridged via chains `A↔B↔C` where `A` and `C` are not similar). Bumping to edges ≥ 0.55 still chained (a group of 17 with min 0.20).

The real fix: `scipy.cluster.hierarchy.linkage(method='complete')` then `fcluster(Z, t=1-edge_threshold, criterion='distance')`. Complete linkage **guarantees** every within-group pair has sim ≥ the edge threshold. Without this guarantee the report is unusable and the apply step would produce identity-poisoned merges.
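
The fix as code, sketched with scipy (the real driver is `work/consolidate_facesets.py`):

```python
# Sketch: complete-linkage grouping of faceset centroids. scipy's linkage
# wants a condensed distance vector; cutting at t = 1 - edge guarantees
# every within-group pair has cosine sim >= edge.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def consolidate_groups(centroids: np.ndarray, edge: float = 0.55) -> np.ndarray:
    """centroids: (k, 512) L2-normalized faceset centroids -> group labels."""
    dist = 1.0 - centroids @ centroids.T
    np.fill_diagonal(dist, 0.0)
    dist = np.clip(dist, 0.0, None)          # guard tiny negative float noise
    Z = linkage(squareform(dist, checks=False), method="complete")
    return fcluster(Z, t=1.0 - edge, criterion="distance")
```
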
### 1.3 Thresholds + run results

`edge=0.55`, `confident=0.65` → 48 multi-faceset groups (29 confident, 19 uncertain). Max group size 7; all bilateral pairs or small triplets after complete linkage.

After applying all 48 (with `--include-uncertain` after visual approval):

- **74 facesets consumed** (some groups had multiple secondaries: `[10, 45, 135] → faceset_002`; `[113, 96, 178, 109, 110, 286] → faceset_095`; etc.)
- Active count: 255 → 181.
- Notable absorptions: `faceset_001` (peter) 707 → 753 PNGs (absorbing 7, 132, 151); `faceset_002` 209 → 247; `faceset_026` 60 → 262 (absorbing 168, 146, 325); `faceset_028` → 207.
- The master manifest gained a `merged[]` array (parallel to `thin_eras[]`); each entry has a `merged_into` field pointing at the primary.

### 1.4 Apply mechanics

Combine all PNGs from the primary + secondaries, re-rank by existing `quality.composite` descending (no re-enrich), renumber `0001..NNNN`, copy into a fresh staging dir, atomic swap. Move secondary directories to `_merged/<original_name>/` (preserved in full for reversibility). Re-zip `_topN.fsz` and `_all.fsz`.

The primary's existing per-PNG quality scores are reused — re-ranking does not require re-running `enrich`-equivalent landmarks/pose on the cropped PNGs. The primary's `_dropped/` (from the prior occlusion filter) is preserved through the merge.

## 2. Age extension of faceset_001 era buckets

### 2.1 Why a follow-on pass

Consolidation absorbed faceset_007/132/151 into faceset_001 (+46 PNGs). The original `age_split_001.py` had bucketed peter into 6 era anchors (`_2005-10`, `_2010-13`, `_2011`, `_2014-17`, `_2018-19`, `_2018-20`), but those new PNGs had never been seen by age_split. They sat in faceset_001's parent-only set, missing from every era .fsz.

### 2.2 Era-label pitfall

The 6 anchor era labels are NOT strict year ranges. They are `Counter(years).most_common(1)`-derived dom-years from the original sub-cluster:

| label | dom_year | actual span of members |
|-------|---------:|-----------------------:|
| `_2005-10` | 2010 | 2005–2010 |
| `_2010-13` | 2011 | **2007–2024** |
| `_2011` | 2011 | 2011 only |
| `_2014-17` | 2016 | 2005–2018 |
| `_2018-19` | 2018 | 2012–2020 |
| `_2018-20` | 2019 | 2014–2022 |

The clusters are *appearance-anchored*, not year-bounded. Year is a descriptive label. The assignment rule must use the dom-year, not the member span.

### 2.3 Algorithm

For each unbucketed face entry in `faceset_001`'s manifest (50 of 753):

1. Look up the embedding in the cache by `(source, bbox)`.
2. Look up the EXIF year via `work/cache/age_split_exif.json`; fetch on cache miss.
3. Find the single nearest era anchor by cosine distance to its centroid.
4. Accept iff `dist ≤ 0.40` AND `|year − anchor.dom_year| ≤ 5`. These thresholds match `age_split_001.py`'s anchor-fragment rule.
5. Anchors are NOT re-centered after absorption (this preserves age_split's drift-prevention guarantee).

### 2.4 Run results

50 unbucketed → 21 with an EXIF year → **14 accepted**:

| anchor | dom_year | added |
|--------|---------:|------:|
| `_2005-10` | 2010 | +2 |
| `_2010-13` | 2011 | +1 |
| `_2014-17` | 2016 | **+9** |
| `_2018-20` | 2019 | +2 |

29 PNGs were skipped for a missing EXIF year (mostly immich-stripped photos). 7 were dist/year-rejected (e.g. two PNGs from 2025 want `_2018-19`, but the year-delta of 7 > 5).

### 2.5 Reconciliation side effect

The apply rebuilds each affected era bucket's `faces/` from staging. This
incidentally reconciled the per-bucket manifests with disk after the prior
occlusion-filter run had left era manifests stale at 282/126/132 entries vs
~248/125/129 actual files (the occlusion filter only updates the master
manifest, never per-faceset manifests — see
`docs/analysis/clip-occlusion-filter.md` §7). 42 occlusion-dropped era PNGs
inside the old `faces/_dropped/` were removed during the rebuild. The
parent `faceset_001/faces/_dropped/` still has the corpus-level audit; all
source images are intact at `/mnt/x/src/`, so the era-level dropped PNGs
are regeneratable via `cmd_export_swap`.
## 3. Re-running

Always run both passes after any new identity import (Immich, osrc,
hand-sorted folder):

```bash
# 1. Find duplicate identities
python work/consolidate_facesets.py analyze \
    --out work/merge_review/candidates.json [--edge 0.55 --confident 0.65]
python work/consolidate_facesets.py report \
    --candidates work/merge_review/candidates.json --out work/merge_review
# inspect work/merge_review/index.html
python work/consolidate_facesets.py apply \
    --candidates work/merge_review/candidates.json [--include-uncertain]

# 2. Slot new faceset_001 PNGs into existing era buckets
python work/age_extend_001.py analyze --out work/age_extend/candidates.json
python work/age_extend_001.py report \
    --candidates work/age_extend/candidates.json --out work/age_extend
python work/age_extend_001.py apply --candidates work/age_extend/candidates.json
```

Both are idempotent. `consolidate_facesets` skips secondaries already in
`_merged/`; `age_extend_001` recomputes anchor centroids + dom-years fresh
on every run.
@@ -0,0 +1,279 @@
# Importing identities from a self-hosted Immich library

_Run date: 2026-04-26. Target: Immich v2.7.2 at `https://fotos.computerliebe.org`.
Driver scripts: `work/immich_stage.py`, `work/embed_worker.py`,
`work/cluster_immich.py`, `work/finalize_immich.sh`._

## 1. Why a split workflow

InsightFace `buffalo_l` on the WSL CPU runs the full detection + landmarks +
recognition stack at ~3–4 faces/second. Re-detecting all 79K Immich photos
would have taken ~10–28 days. The available AMD Radeon RX Vega is unusable
under WSL (no `/dev/dri/`, no ROCm), but **DirectML on Windows native**
runs the same models bit-identically and ~7.5× faster end-to-end. The
pipeline therefore splits:

- **WSL side** (`/opt/face-sets/`) — orchestration: API listing, download,
  sha256 dedup, file management, clustering, faceset emission.
- **Windows side** (`C:\face_embed_venv\`) — the embed step only. A fresh
  Python 3.12 (installed via `winget install Python.Python.3.12`) with
  `numpy`, `pillow`, `opencv-python-headless`, `onnxruntime-directml`,
  `insightface`. Models copied from `/home/peter/.insightface/models/buffalo_l/`
  to `C:\face_embed_venv\models\buffalo_l\`.

A 30-iteration synthetic benchmark on Vega:

| model | DML | CPU | speedup |
|-------------|----:|----:|--------:|
| `det_10g.onnx` (640×640) | 10.0 ms | 183.5 ms | 18.4× |
| `w600k_r50.onnx` (112×112) | 8.2 ms | 90.5 ms | 11.0× |

End-to-end FaceAnalysis on 5 real Immich-sourced images (excluding the
first-call DML JIT warmup): ~7.5× speedup post-warmup. Per-face cosine
similarity DML vs CPU was 1.0000 across all 8 detected faces — DML is
bit-identical to CPU for arcface inference.
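The Windows worker's model setup is the stock insightface entry point with
the provider list swapped; a minimal sketch (the image path is hypothetical):

```python
import cv2
from insightface.app import FaceAnalysis

# DML first, CPU as fallback; root points at the copied buffalo_l models,
# which insightface looks up under <root>/models/<name>/.
app = FaceAnalysis(
    name="buffalo_l",
    root=r"C:\face_embed_venv",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
app.prepare(ctx_id=0, det_size=(640, 640))

img = cv2.imread(r"X:\src\immich\peter\example.jpg")  # hypothetical path
for face in app.get(img):
    print(face.det_score, face.bbox, face.normed_embedding.shape)  # (512,)
```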
## 2. Architecture

```
┌──────────────────────────────────────────────┐
│ WSL /opt/face-sets/work/immich_stage.py      │
│ ┌───────────────────────────────────────────┐│
│ │ ThreadPoolExecutor.map(_fetch_for_asset,  ││
│ │                        list_assets(user)) ││
│ │  ─ /faces?id= (Immich, parallel x8)       ││
│ │  ─ filter face_short >= 90                ││
│ │  ─ /assets/.../original (parallel x8)     ││
│ └───────────────────────────────────────────┘│
│ consumer (main thread):                      │
│   sha256 → dedup vs nl_full.npz              │
│   save to /mnt/x/src/immich/<user>/<rel>/    │
│   append to queue.json                       │
└────────────────┬─────────────────────────────┘
                 │
                 ▼ queue.json (with WSL + Windows paths)
┌──────────────────────────────────────────────┐
│ Windows embed_worker.py (C:\face_embed_venv) │
│ insightface.FaceAnalysis(                    │
│     providers=[DmlExecutionProvider, ...])   │
│ per image: detection + landmarks + arcface   │
│ emit cache in sort_faces.py:cmd_embed        │
│ schema with embeddings + meta + processed    │
│ + path_aliases + schema=v2                   │
└────────────────┬─────────────────────────────┘
                 │
                 ▼ immich_<user>.npz
┌──────────────────────────────────────────────┐
│ WSL cluster_immich.py                        │
│   build centroids of canonical               │
│     faceset_NNN/ in facesets_swap_ready/     │
│   drop matches at cos-dist <= 0.45           │
│   cluster the rest at 0.55                   │
│   refine gates -> synthetic refine_manifest  │
│   cmd_export_swap -> facesets_swap_ready/    │
│   merge top-level manifest                   │
└──────────────────────────────────────────────┘
```
Cache artifacts stay separate (per the architecture choice on this run):
each user's results live in their own `immich_<user>.npz`. A future
one-shot merge can fold them into `nl_full.npz` if needed; the existing
`extend` command would do the right thing once schemas align.

## 3. Path mapping

`/mnt/x/` ↔ `X:\`. The cache stores the WSL form (matching `nl_full.npz`'s
existing convention); `wsl_to_win()` translates for the embed worker,
which runs natively on Windows.
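The translation fits in a couple of lines; a plausible sketch of
`wsl_to_win()` (the real helper may differ in edge-case handling):

```python
from pathlib import PureWindowsPath

def wsl_to_win(path: str) -> str:
    """Map a WSL mount path like /mnt/x/src/a.jpg to X:\\src\\a.jpg."""
    assert path.startswith("/mnt/") and len(path) > 6, path
    drive, rest = path[5], path[7:]   # '/mnt/x/...' -> 'x', '...'
    return str(PureWindowsPath(f"{drive.upper()}:\\", *rest.split("/")))
```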
`work/cluster_immich.py` always uses the canonical `facesets_swap_ready/`
view to build identity centroids — meaning the comparison is against the
*current* set of canonical facesets in the swap-ready directory (skipping
era splits and `_thin/`), not against the older `facesets_full/` snapshot.
## 4. Result of the 2026-04-26 run (peter / admin)

### 4a. Stage

```
total_assets_seen:         53842
staged_count:              10261  (~10 GB on /mnt/x/)
deduped_against_existing:    978  (sha256 in nl_full.npz already)
deduped_against_staged:     2976  (internal byte-dupes inside Immich)
skipped_no_big_face:        9539  (Immich detected only sub-90px faces)
skipped_no_faces:          29390  (Immich detected zero faces)
skipped_download_error:      698  (transient DNS / TLS, not seen-marked)
elapsed:                 ~70 min  (6.4 assets/s end-to-end at 8 workers)
```

The 698 transient errors are recoverable on a re-run because
`immich_stage.py` does not add them to the `seen` set. Each transient
asset would be retried.
### 4b. Embed (Windows DML)

```
queue:               10261 entries
new face records:    19462
new noface records:      1
load errors:           125  (likely HEIC / unreadable)
elapsed:            3878.0 s  (64.6 min, 2.6 img/s end-to-end)
```

The 2.6 img/s end-to-end includes the CIFS-share image load, image decode,
DML inference (~50 ms/face), and JSON / NPZ flushing. Pure DML inference
is faster; the rest of the pipeline dominates at scale.
### 4c. Cluster

```
existing canonical centroids: 25
faces already covered (cos-dist <= 0.45): 8103/19480 (42%)
    faceset_001: 1856
    faceset_002: 2666
    faceset_003:  670
    faceset_004:   48
    faceset_005:   40
    ... (smaller hits to the remaining 20)
unmatched faces to cluster: 11377
clusters at threshold 0.55: 2534  (top sizes [469, 444, 342, 338, 262, ...])
survived refine gates: 239
emitted as new facesets: 185  (54 dropped by export-swap's 0.45 outlier gate)
```

Top-level `facesets_swap_ready/manifest.json` after this run: **216
facesets** (up from 31; ~7× growth) + 68 thin_eras under `_thin/`.
## 4d. Result of the 2026-04-26..27 run (nic, with per-user API key)

After issuing nic a per-user API key, the same pipeline ran end-to-end
with no code changes (only the `IMMICH_API_KEY` env var changed). The
run survived one Immich outage mid-stage thanks to the circuit breaker
added in `work/immich_stage.py` (12 consecutive HTTP errors → probe →
exit 2 with state preserved → resume on same command).
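The breaker is just a consecutive-failure counter plus a probe; a minimal
sketch of the shape (the count and probe target match this run's settings,
the helper names are illustrative):

```python
import sys
import requests

MAX_CONSECUTIVE = 12

def guarded_fetch(urls: list[str], probe_url: str) -> None:
    consecutive = 0
    for url in urls:
        try:
            requests.get(url, timeout=30).raise_for_status()
            consecutive = 0                  # any success resets the breaker
        except requests.RequestException:
            consecutive += 1
            if consecutive >= MAX_CONSECUTIVE:
                try:
                    requests.get(probe_url, timeout=10).raise_for_status()
                except requests.RequestException:
                    # server down, not a flaky asset: stop with state intact
                    sys.exit(2)              # resume later, same command
                consecutive = 0              # probe passed: assets were flaky
```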
### Stage

```
total_assets_seen:         25777  (matches /server/statistics 25,786)
staged_count:               7834  (30% face-bearing-with-big-face; peter was 19%)
deduped_against_existing:    519  (sha256 in nl_full.npz already)
deduped_against_staged:        0  (nic's library has zero internal byte-dupes;
                                   peter had 2,976)
skipped_no_big_face:         725
skipped_no_faces:          16695
skipped_download_error:       54  (transient; not marked seen ->
                                   would be retried on resume)
elapsed:                 ~75 min  (wall, across two pause/resume sessions
                                   bracketing one Immich outage)
```
### Embed (Windows DML)

```
queue:               7834 entries
new face records:   15627
new noface records:     1
load errors:            7
elapsed:           3538.9 s  (59 min, 2.2 img/s end-to-end)
```
### Cluster

```
existing canonical centroids: 25
faces already covered (cos-dist <= 0.45): 6770/15627 (43%)
    faceset_002: 3261  (the dominant family identity)
    faceset_008: 1461  (cross-match to hand-sorted 'sab')
    faceset_001:  955
    faceset_007:  408  (cross-match to hand-sorted 's')
    faceset_006:  114
    ...
unmatched: 8857
clusters at threshold 0.55: 3787  (top sizes [165, 134, 106, 99, 92,
                                   67, 62, 61, 58, 53])
survived refine gates: 129
emitted as new facesets: 95  (faceset_265..NNN with gaps)
```

Top-level `facesets_swap_ready/manifest.json` after the nic run: **311
substantive facesets** + 68 thin_eras. Two-day cumulative growth:

| date | event | facesets total |
|------|-------|---------------:|
| 2026-04-25 | hand-sorted folder import | 19 |
| 2026-04-26 morning | osrc + age split + cleanup | 31 |
| 2026-04-26 afternoon | Immich peter run | 216 |
| 2026-04-27 (overnight) | Immich nic run | 311 |
## 5. Surprises and caveats

### 5a. `/search/metadata`'s `userIds` filter is silently ignored (Immich v2.7.2)

When the admin API key is used, passing `userIds=[<other-user-uuid>]`
returns admin's own assets, not the other user's. The filter is
silently dropped. Verified by sampling 200 returned items and
confirming `ownerId` was admin for all of them.

To process another user's library, **a separate API key issued by that
user is required** — the admin key cannot enumerate cross-user
libraries through any documented endpoint we tried. `/timeline/buckets`
with a `userId` query parameter returns
`Not found or no timeline.read access`.
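The verification was a one-off request loop; a sketch of the shape
(request and response field names as observed on this instance's API and
may drift across Immich versions):

```python
import os
import requests

url = os.environ["IMMICH_URL"] + "/api/search/metadata"
headers = {"x-api-key": os.environ["IMMICH_API_KEY"]}  # admin key here
body = {"userIds": ["<other-user-uuid>"], "size": 200}  # the ignored filter

items = requests.post(url, json=body, headers=headers, timeout=30) \
    .json()["assets"]["items"]
owners = {a["ownerId"] for a in items}
print(owners)  # expected: the other user's UUID; observed: only admin's
```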
### 5b. `/server/statistics` undercounts what the search returns

`/server/statistics` reported admin = 53,842 photos. Our
`/search/metadata` paginated through... **53,842** top-level. So the
header agrees with the body in this case. But `/server/statistics` does
NOT count items that live under external libraries' import paths —
yet `/search/metadata` does include them. For this Immich, two external
libraries (`/mnt/media/photos` and `/mnt/media/omv_photos`) are
configured, but `/libraries` reports `assetCount=0` for both. Yet 80% of
our staged paths come from those libraries' import paths. Don't trust
statistics-vs-search consistency.
### 5c. Indexed Immich thumbnails masquerading as assets

5,563 of our 10,261 staged paths are `<library>/thumbs/.../-preview.jpeg`
— Immich's own internally generated thumbnails got indexed because the
external library import path included the thumbs subdirectory and the
exclusion patterns didn't list `**/thumbs/**`. They embed and cluster
fine but produce lower-resolution face records. The fix on the Immich
side is adding `**/thumbs/**` to the exclusion patterns.
### 5d. Internal byte-duplicates (2,976)

Many Immich assets are byte-identical to other Immich assets — typically
because the same photo was uploaded both from a phone and from a
synced cloud folder. sha256 dedup catches all of these on the second
download (we still pay the bandwidth, but skip the disk write and the
embed work). Immich v2.7.2's own `assets/duplicates` endpoint could
catch these earlier, but it's not currently used.
## 6. Re-running and applying to other Immich instances

```bash
export IMMICH_URL=https://your-immich.example.com
export IMMICH_API_KEY=...   # admin or per-user key

# Optional: populate work/immich/users.json with label -> UUID map.

# 1. Stage (parallel /faces + downloads, resumable).
python work/immich_stage.py --user peter --workers 8

# 2. End-to-end finalize: copy queue to /mnt/c/, run Windows embed worker,
#    copy the cache back, run cluster_immich.py.
bash work/finalize_immich.sh peter
```

For a different Immich instance, the only configuration is the env vars
and the `users.json` sidecar. `cluster_immich.py`'s tunables (matching
threshold, clustering threshold, refine gates, MIN_FACES) are at the
top of the script.

To process a *second* user's library, issue a per-user API key in the
Immich admin UI for that user, set `IMMICH_API_KEY` to that key, and
re-run with their `--user <label>`. The admin key cannot impersonate
other users via the search API.
@@ -0,0 +1,119 @@
# Identity discovery in `/mnt/x/src/osrc`

_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records).
Driver script: `work/cluster_osrc.py`._

## 1. Source

`/mnt/x/src/osrc/` is a flat mixed-identity bucket: 213 files in the root + a
`psd/` subfolder with 41 PSD files + a single file in `[Originaldateien]/`.
File extensions are 171 jpg + 1 jpeg + 41 psd. PSDs are not embedded
(InsightFace's loader doesn't read PSD); the 41 PSDs were skipped, on the
working assumption that the same identities are also present in the
adjacent JPGs.

`nl_full.npz` already covered 160 of the 213 files (the remaining 53: 41
psd + 12 jpg). Of the 12 missing JPGs, 11 are byte-duplicates of `00843resc.jpg`
.. `00855resc.jpg` (same file sizes, paired by sha256) — already aliased
in the cache. Only 1 jpg (`19554226_..._n.jpg`) is genuinely uncovered.

The 160 covered files yielded **336 face records / 10 noface**, with 64
single-face / 35 two-face / 19 three-face / 24 four-face / 8 with 5–8
faces. Quality is good: median `face_short=116px`, `det_score=0.85`,
`blur=244`. The minimum (`face_short=40px`) will fail the 90px refine gate.
## 2. Coverage by existing identities

Computed cos-dist from each osrc face to the centroids of the canonical
`faceset_001..019` (built from each manifest's `(source, bbox)` keys).
Median nearest-cos-dist was 0.875 — i.e. the bulk of osrc is **not** the
existing 19 identities.

At cos-dist ≤ 0.45 (matching `build_folders.py`'s `OSRC_THRESHOLD`):

| existing identity | osrc faces matched |
|------------------|------------------:|
| faceset_002 | 7 |
| faceset_008 | 4 |
| faceset_015 | 3 |
| faceset_019 | 4 |

These 18 osrc faces are routed to existing identities by
`build_folders.py` and `extend`; they are excluded from the
identity-discovery step.
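With L2-normalized embeddings the coverage check is one matrix product; a
minimal sketch (array names hypothetical):

```python
import numpy as np

# F: (n_osrc_faces, 512) normalized osrc embeddings
# C: (19, 512) normalized centroids of faceset_001..019
def coverage(F: np.ndarray, C: np.ndarray, thresh: float = 0.45):
    dists = 1.0 - F @ C.T          # cosine distance, all pairs at once
    nearest = dists.min(axis=1)    # best existing identity per face
    covered = nearest <= thresh    # the 18 routed faces
    return covered, dists.argmin(axis=1)[covered]
```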
## 3. Pipeline

`work/cluster_osrc.py` mirrors `build_folders.py`'s structure (synthesize
a refine manifest, hand off to `cmd_export_swap`, relocate, merge the
top-level manifest) but discovers identities by clustering rather than
asserting them by folder.

1. Filter the cache to face records under `/mnt/x/src/osrc` (canonical or
   byte-aliased path).
2. Drop the 18 already-covered faces (cos-dist ≤ 0.45 to any existing
   identity centroid).
3. Cluster the remaining 318 faces among themselves at cos-dist 0.55
   (matches the `extend` default for new-cluster formation; see the
   sketch after this list).
4. For each cluster, apply `refine`-equivalent per-face gates
   (`face_short ≥ 90`, `blur ≥ 40`, `det_score ≥ 0.6`); for clusters of ≥ 4
   faces apply outlier rejection at cluster-centroid cos-dist 0.55. Keep
   clusters whose surviving unique-path count is ≥ 6 (the operator-chosen
   `MIN_FACES`, lower than the canonical 15 because osrc is small
   per-identity).
5. Number kept clusters `faceset_020+` (past the existing
   `facesets_swap_ready/` max of 019), ordered by size descending.
6. Synthesize a refine manifest and call `cmd_export_swap` on it. Move
   the emitted dirs into `facesets_swap_ready/`, drop an `osrc.txt`
   provenance marker, and append the new entries to the top-level
   `manifest.json` (without disturbing the existing `facesets` / `thin_eras`).
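Step 3 is the same agglomerative pass the other drivers use; a condensed
sketch (the linkage choice is an assumption here; the script's config block
is authoritative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_faces(E: np.ndarray, threshold: float = 0.55) -> np.ndarray:
    """E: (n, 512) L2-normalized embeddings -> integer cluster labels."""
    dists = 1.0 - E @ E.T            # pairwise cosine distance
    np.fill_diagonal(dists, 0.0)     # floor numerical noise
    model = AgglomerativeClustering(
        n_clusters=None,
        metric="precomputed",
        linkage="average",           # assumption; not stated in this doc
        distance_threshold=threshold,
    )
    return model.fit_predict(dists)
```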
## 4. Result (2026-04-26)

Phase 1 (clustering, before export-swap):

- 137 raw clusters at cos-dist 0.55; top sizes [37, 20, 12, 9, 7, 7, 6, 6, 6, 5].
- After the quality gate: 124 faces dropped (mostly `face_short < 90` from
  group-photo tertiary subjects).
- Outlier rejection: 0 dropped (clusters were tight).
- After `min_faces=6`: **7 candidate clusters kept** (sizes 6–28 unique
  source paths).

Phase 2 (`cmd_export_swap` with `min_face_short=100`,
`outlier_threshold=0.45`):

| name | input | outlier drop | exported PNGs |
|--------------|------:|-------------:|--------------:|
| faceset_020 | 71 | 42 | 26 |
| faceset_021 | 36 | 21 | 10 |
| faceset_022 | 15 | 7 | 8 |
| faceset_023 | 19 | 14 | 4 |
| faceset_024 | 6 | 0 | 6 |
| faceset_025 | 10 | 4 | 6 |
| faceset_026 | — | — | 0 (skipped: empty after filter) |

`faceset_026`'s 6 cluster faces all failed export-swap's tighter
`min_face_short=100` gate (vs. cluster's 90), so it is not emitted.
`faceset_023` is small (4 PNGs) but still usable as an averaged identity at
that size.

Top-level `facesets_swap_ready/manifest.json` now: **31 substantive
facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted + 6 era splits + 6
osrc-discovered) + **68 thin_eras** under `_thin/`.
## 5. Re-running and applying to other mixed buckets

- The cache holds the osrc embeddings; to re-run with different parameters,
  edit `cluster_osrc.py`'s config block and re-execute. Cluster discovery
  + export-swap takes a few minutes total.
- For a different mixed-bucket source, copy `cluster_osrc.py` to
  `cluster_<name>.py` and change `OSRC_DIR`, `OUT_TMP`, `SYNTH_MANIFEST`,
  `START_NNN`. The exclusion step compares against the *current* contents
  of `facesets_swap_ready/faceset_NNN/`, so it picks up everything emitted
  by previous discovery / split / hand-sorted runs.
- Lowering `MIN_FACES` from 6 to 4 would have admitted ~3 additional
  marginal clusters at this corpus size; the trade-off is a noisier
  identity average for small-N facesets.
- `extend` should be run before `cluster_osrc.py` so `raw_full/` and
  `facesets_full/` stay in sync — `cluster_osrc.py` itself only writes
  to `facesets_swap_ready/`.
@@ -0,0 +1,142 @@
# Video target preprocessing for roop-unleashed

_Initial design + first batch run: 2026-04-27. Driver scripts: `work/video_target_pipeline.py`, `work/video_face_worker.py`, `work/run_video_pipeline.sh`._

Companion to the face-set side of the project: instead of building per-identity .fsz bundles for the *source* of a swap, this pipeline preprocesses the *target* (videos to swap into). Given a folder of video files, it identifies "swappable" segments — continuous shots where a face is detectable, sufficiently visible, and roughly within inswapper_128's working envelope — and cuts them into UUID-named clips ready to feed into roop-unleashed.

## 1. Why build it

I checked the obvious open-source projects for an existing implementation:

- **FaceFusion** ([github.com/facefusion/facefusion](https://github.com/facefusion/facefusion)) — the CLI has `run`, `headless-run`, `batch-run`, `job-*`, `force-download`, `benchmark`. No scene-detection or clip-extraction subcommand. Its own guides recommend "split your video manually first."
- **roop-unleashed** at `/opt/roop-unleashed/roop/util_ffmpeg.py` — has `cut_video(start_frame, end_frame)` for a manual GUI trim, no detection-driven segmentation.
- **Deep-Live-Cam** ([github.com/hacksider/Deep-Live-Cam](https://github.com/hacksider/Deep-Live-Cam)) — real-time / single-shot, no batch preprocessing.
- **DeepFaceLab** — `extract_video.bat` dumps every frame between user-supplied trim points; no quality gating.

Closest prior art for the cut-detection pattern is the two-stage hybrid in [SportSBD MMSys'26](https://dl.acm.org/doi/10.1145/3793853.3799803) (a cheap detector for cuts, an accurate net for verification), but the actual implementation has to be ours.
## 2. Pipeline architecture

```
WSL /opt/face-sets/work/                     Windows C:\face_embed_venv\
─────────────────────────────────────        ─────────────────────────────
run_video_pipeline.sh (chain driver)
  │
  ├─ scan    (ffprobe metadata)
  ├─ scenes  (PySceneDetect AdaptiveDetector, CPU)
  ├─ stage   (sampled frame queue.json @ 2 fps)
  │                                            │
  │                                            ▼
  │                                          video_face_worker.py
  │                                            insightface FaceAnalysis
  │                                            on DmlExecutionProvider
  │                                            output: results.jsonl
  ├─ merge   (ingest results.jsonl)
  ├─ track   (IoU + embedding stitching, ~30 LOC)
  ├─ score   (track-level quality gate + cross-track merge)
  ├─ cut     (ffmpeg -c copy → per-source subfolders)
  └─ report  (HTML preview)

Output: <output_dir>/<source_video_stem>/<uuid>.mp4
                                        /<uuid>.json  (sidecar; opt-in via
                                                       --write-sidecar)
```

`run_video_pipeline.sh` is parameterized via env vars (`WORK`, `INPUT_DIR`, `OUTPUT_DIR`, `FILTER_FROM`, `SKIP_PATTERN`, `MAX_DUR`, `IDENTITY`, `SIDECAR`) so you can pin a particular batch without editing the script. Sidecars are off by default — the per-batch `plan.json` always carries the full provenance for every clip; the `<uuid>.json` files alongside the clips are redundant and only useful if you need each clip to be self-describing in isolation.
## 3. Quality signals (matched to inswapper_128's working envelope)

inswapper_128 is trained near-frontal at 128×128. The score gate uses defaults that admit side profiles (since rich face-sets can absorb non-frontal swap targets):

| signal | threshold | rationale |
|--------|----------:|-----------|
| `\|yaw\|` | ≤ 75° | covers full 3/4 + side profile |
| `\|pitch\|` | ≤ 45° | covers extreme up/down looks |
| `face_short` | ≥ 80 px | inswapper resamples to 128; ≥ 80 still produces clean output |
| `det_score` | ≥ 0.5 | matches buffalo_l's MIN_DET; lower = unreliable detection |
| track-gate | ≥ 70 % frames pass | binary track filter rather than per-frame |
| duration | 1 s ≤ dur ≤ 120 s | below 1 s = unusable slivers; above 120 s probably contains a missed micro-cut |

Plus two segment-merging knobs (see the sketch below):

- `--bridge-gap` (default 3 s) — within a single track, brief pose-failure gaps shorter than this get bridged so single bad frames don't fragment a good run.
- `--merge-gap` (default 2 s) — across tracks within the same scene, segments closer than this get fused (the cross-track merge fires when face detection briefly fails between adjacent good runs).

The defaults can be tightened (e.g. `--max-yaw 25` for portrait-only) or loosened (e.g. `--max-yaw 90 --merge-gap 5`) without re-running detection — `score` reads the existing `tracks.json`.
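Both knobs are plain interval arithmetic; a minimal sketch applied to one
timeline for brevity (in the pipeline, bridging is per-track and merging is
cross-track; the helper is illustrative, not the script's exact code):

```python
def build_segments(pass_times: list[float], bridge_gap: float = 3.0,
                   merge_gap: float = 2.0) -> list[tuple[float, float]]:
    """pass_times: sorted timestamps (s) of samples that passed the gates."""
    if not pass_times:
        return []
    # --bridge-gap: extend a run across short pose-failure gaps
    segs = [[pass_times[0], pass_times[0]]]
    for t in pass_times[1:]:
        if t - segs[-1][1] <= bridge_gap:
            segs[-1][1] = t
        else:
            segs.append([t, t])
    # --merge-gap: fuse adjacent runs separated by brief detection dropouts
    merged = [segs[0]]
    for s in segs[1:]:
        if s[0] - merged[-1][1] <= merge_gap:
            merged[-1][1] = s[1]
        else:
            merged.append(s)
    return [tuple(s) for s in merged]
```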
## 4. Performance + the JSONL append-only fix

This is where the engineering interest is. The first production run on 13 videos / 6.18 h of input went through three failure modes before settling at production speed:

| attempt | issue | rate observed |
|---|---|---:|
| 1. Original `cap.set(POS_FRAMES, N)` per sample | OpenCV seeks to the nearest keyframe + decodes forward at every sample. Cost grows with depth into the video; on a 60-min H.264 it falls off a cliff. | 1.4 fps → degrading |
| 2. Sequential `cap.grab()` from frame 0 | On resume, grab-walking from frame 0 to a deep target is unbounded. | 0.08 fps |
| 3. Hybrid: seek once per video + sequential within | Better in principle. But it hit a different bug: `flush()` was re-serializing the entire `results.json` (245 MB at this point) every 100 frames or 30 s. The save dominated wall-clock. | 0.5 fps |
| 4. **JSONL append-only** | One result per line. Each flush is O(new records), not O(total records). | **13.77 fps** smoke / 7.57 fps cumulative across the full batch |

Lesson: when the output is large, grows monotonically, and needs frequent checkpointing, *do not* re-serialize the whole structure on each flush. Append-only line-delimited JSON is the right tool. The legacy `results.json` is auto-converted to `.jsonl` on first load (a one-time migration), so resumes survive the format switch.
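The fix is small; a minimal sketch of the append-only checkpoint pattern
(names illustrative, not the worker's exact functions):

```python
import json
from pathlib import Path

RESULTS = Path("results.jsonl")

def flush(new_records: list[dict]) -> None:
    """O(len(new_records)) per checkpoint: append, never rewrite."""
    with RESULTS.open("a", encoding="utf-8") as f:
        for rec in new_records:
            f.write(json.dumps(rec) + "\n")

def load() -> list[dict]:
    """Resume by replaying the log; a torn final line is dropped, not fatal."""
    records = []
    if RESULTS.exists():
        for line in RESULTS.read_text(encoding="utf-8").splitlines():
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                break  # truncated tail from a mid-write crash
    return records
```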
## 5. Hardware decode/encode on AMD Vega + WSL

Skipped. Per [Microsoft's WSL D3D12 video acceleration post](https://devblogs.microsoft.com/commandline/d3d12-gpu-video-acceleration-in-the-windows-subsystem-for-linux-now-available/), VAAPI-via-Mesa-D3D12 exists but is fragile on older AMD. AMF on Windows would mean a Windows-side ffmpeg leg, doubling the boundary crossings. CPU software decode of 1280×720 H.264 in WSL ffmpeg is faster than realtime, and the bottleneck is buffalo_l detection on DML, not decode.

For cutting we use `-c copy` stream-copy — no re-encode, so hardware codecs are moot.
## 6. Full corpus run results

Three runs across the 61-video corpus at `/mnt/x/src/vd/`:

| | test (3 videos) | first batch (13 videos, 50–62) | rest (45 videos, 02–49 minus test) | **total** |
|---|---:|---:|---:|---:|
| input duration | 0.6 h | 6.18 h | 12.98 h | **19.76 h** |
| sampled frames @ 2 fps | 4,472 | 44,635 | 94,030 | 143,137 |
| tracks | 187 | 2,564 | 3,823 | 6,574 |
| accepted tracks | 94 (50 %) | 1,193 (47 %) | 1,905 (50 %) | 3,192 (49 %) |
| **emitted segments** | **83** | **600** | **1,301** | **1,984** |
| cross-track-merged segments | 14 | 254 | 382 | 650 |
| accepted content | 13 min | 239 min | 395 min | **647 min (= 10.78 h)** |
| acceptance rate by time | 36 % | 64.6 % | 50.7 % | **54.6 %** |
| output size | 0.135 GB | 3.63 GB | 4.84 GB | **8.6 GB** |

Phase timings (rest batch — the best representative, since it ran fully under JSONL append-only from a fresh start):

- scenes: 117 min (PySceneDetect, 45 × ~3 min/video)
- stage: instant
- worker: 100 min @ **15.78 fps** sustained (vs 7.5 fps for the first batch, which migrated mid-run)
- merge: 90 s
- track: 92 s
- score: 23 s
- cut (1,301 ffmpeg stream-copies): 30 min
- report (1,301 thumbs + HTML): 5.5 min
- **total wall-clock: 4h16m**

Across all three runs, **0 worker errors on 143,137 sampled frames**.
## 7. Re-running

```bash
# choose a per-batch workdir + log
WORK=/opt/face-sets/work/video_preprocess_<batch_name> \
FILTER_FROM=ct_src_00050.mp4 \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch_name>.log 2>&1 &

# check status anytime
bash work/status_video_pipeline.sh work/logs/video_run_<batch_name>.log
```

Skip patterns can exclude already-processed inputs (note that 5-digit numbers need full padding in the regex, e.g. `0005[0-9]`, not `005[0-9]`):

```bash
SKIP_PATTERN='^ct_src_(0001[015]|0005[0-9]|0006[0-2])\.mp4$' \
WORK=/opt/face-sets/work/video_preprocess_rest \
bash work/run_video_pipeline.sh > work/logs/video_run_rest.log 2>&1 &
```

To also emit per-clip provenance sidecars (off by default):

```bash
SIDECAR=yes \
WORK=/opt/face-sets/work/video_preprocess_<batch> \
bash work/run_video_pipeline.sh > work/logs/video_run_<batch>.log 2>&1 &
```

`scenes` outputs are cached in the batch's `WORK/scenes/` dir, so re-running the chain after an edit to the `score` step doesn't redo detection. The worker is also resumable per `queue_id` — if killed mid-flight, just relaunch.
@@ -0,0 +1,576 @@
"""Extend the existing 6 era buckets of faceset_001 by absorbing PNGs that
post-date the original age_split run (from consolidation merges, etc.).

Mirrors the anchor-fragment assignment logic in age_split_001.py:
- For each unbucketed face in faceset_001's manifest, find the nearest active
  era anchor by cosine distance to the anchor's centroid.
- Accept the assignment iff dist <= 0.40 AND |year_delta| <= 5
  (where year_delta = exif_year(face) - dom_year(anchor)).
- Undated PNGs are skipped (no assignment).
- Anchors are NOT re-centered after absorption (preserves the same drift
  guarantees as the original age_split).

CLI:
  python work/age_extend_001.py analyze --out work/age_extend/candidates.json
  python work/age_extend_001.py report --candidates ... --out work/age_extend
  python work/age_extend_001.py apply --candidates ... [--dry-run]
"""

from __future__ import annotations

import argparse
import json
import shutil
import sys
import time
from collections import Counter
from pathlib import Path

import numpy as np
from PIL import Image, ExifTags

ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
PARENT = "faceset_001"
ACTIVE_ERAS = [
    "faceset_001_2005-10",
    "faceset_001_2010-13",
    "faceset_001_2011",
    "faceset_001_2014-17",
    "faceset_001_2018-19",
    "faceset_001_2018-20",
]
CACHES = [
    Path("/opt/face-sets/work/cache/nl_full.npz"),
    Path("/opt/face-sets/work/cache/immich_peter.npz"),
    Path("/opt/face-sets/work/cache/immich_nic.npz"),
]
EXIF_CACHE = Path("/opt/face-sets/work/cache/age_split_exif.json")

# anchor-fragment thresholds (mirror age_split_001.py)
DIST_MAX = 0.40
YEAR_MAX = 5

# ----------------------------- caches -----------------------------

def load_caches():
    rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
    alias_map: dict[str, str] = {}
    for c in CACHES:
        if not c.exists():
            print(f"[warn] cache missing: {c}", file=sys.stderr)
            continue
        d = np.load(c, allow_pickle=True)
        emb = d["embeddings"]
        meta = json.loads(str(d["meta"]))
        face_records = [m for m in meta if not m.get("noface")]
        if len(face_records) != len(emb):
            raise SystemExit(f"meta/emb mismatch in {c}: {len(face_records)} vs {len(emb)}")
        if "path_aliases" in d.files:
            paliases = json.loads(str(d["path_aliases"]))
            for canon, alist in paliases.items():
                alias_map.setdefault(canon, canon)
                for a in alist:
                    alias_map[a] = canon
        for i, rec in enumerate(face_records):
            p = rec["path"]
            bbox = tuple(int(x) for x in rec["bbox"])
            v = emb[i].astype(np.float32)
            n = float(np.linalg.norm(v))
            if n > 0:
                v = v / n
            rec_index[(p, bbox)] = v
            alias_map.setdefault(p, p)
    print(f"[cache] indexed {len(rec_index)} face records, {len(alias_map)} aliases", file=sys.stderr)
    return rec_index, alias_map


def lookup_emb(rec_index, alias_map, src: str, bbox):
    bbox_t = tuple(int(x) for x in bbox)
    canon = alias_map.get(src, src)
    v = rec_index.get((canon, bbox_t))
    if v is None and canon != src:
        v = rec_index.get((src, bbox_t))
    return v

# ----------------------------- exif -----------------------------

def load_exif_cache():
    if not EXIF_CACHE.exists():
        return {}
    return json.loads(EXIF_CACHE.read_text())


def save_exif_cache(cache):
    tmp = EXIF_CACHE.with_suffix(".tmp.json")
    tmp.write_text(json.dumps(cache, indent=2))
    tmp.replace(EXIF_CACHE)


def exif_year(path: Path) -> int | None:
    try:
        with Image.open(path) as im:
            ex = im._getexif()
            if not ex:
                return None
            for tag_id, val in ex.items():
                tag = ExifTags.TAGS.get(tag_id, tag_id)
                if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
                    return int(val[:4])
    except Exception:
        return None
    return None


def get_year(src: str, exif_cache) -> int | None:
    """Return EXIF year for src, using cache. Mutates cache for new lookups."""
    if src in exif_cache:
        return exif_cache[src]
    p = Path(src)
    y = exif_year(p) if p.exists() else None
    exif_cache[src] = y
    return y

# ----------------------------- analyze -----------------------------

def cmd_analyze(args):
    rec_index, alias_map = load_caches()
    exif_cache = load_exif_cache()
    exif_cache_dirty = False

    parent_dir = ROOT / PARENT
    parent_manifest = json.loads((parent_dir / "manifest.json").read_text())
    parent_faces = parent_manifest.get("faces", [])
    print(f"[parent] {PARENT}: {len(parent_faces)} face entries", file=sys.stderr)

    # Build "in_bucket" set + each anchor's centroid + dom_year
    anchors = []
    in_bucket: set[tuple[str, tuple[int, int, int, int]]] = set()
    for era in ACTIVE_ERAS:
        ed = ROOT / era
        if not ed.is_dir():
            print(f"[warn] missing era bucket: {era}", file=sys.stderr)
            continue
        em = json.loads((ed / "manifest.json").read_text())
        emb_list = []
        years = []
        n_missing_emb = 0
        for f in em.get("faces", []):
            src = f.get("source")
            bbox = f.get("bbox")
            if not src or not bbox:
                continue
            key = (alias_map.get(src, src), tuple(int(x) for x in bbox))
            in_bucket.add(key)
            in_bucket.add((src, tuple(int(x) for x in bbox)))  # cover both alias and raw
            v = lookup_emb(rec_index, alias_map, src, bbox)
            if v is None:
                n_missing_emb += 1
            else:
                emb_list.append(v)
            if src not in exif_cache:
                exif_cache_dirty = True  # cache miss: get_year below inserts it
            y = get_year(src, exif_cache)
            if y is not None:
                years.append(y)
        if not emb_list:
            print(f"[warn] {era}: no embeddings found, skipping anchor", file=sys.stderr)
            continue
        arr = np.stack(emb_list).astype(np.float32)
        c = arr.mean(axis=0)
        n = float(np.linalg.norm(c))
        if n > 0:
            c = c / n
        dom_year = Counter(years).most_common(1)[0][0] if years else None
        anchors.append({
            "name": era, "centroid": c, "n_faces": len(em.get("faces", [])),
            "n_emb_used": len(emb_list), "n_emb_missing": n_missing_emb,
            "dom_year": dom_year,
            "year_min": min(years) if years else None,
            "year_max": max(years) if years else None,
        })
        print(f"[anchor] {era}: n={len(em.get('faces', []))} emb_used={len(emb_list)} "
              f"emb_miss={n_missing_emb} dom_year={dom_year} years=[{min(years) if years else '-'}..{max(years) if years else '-'}]",
              file=sys.stderr)

    # Find unbucketed faces in parent
    unbucketed = []
    for f in parent_faces:
        src = f.get("source")
        bbox = f.get("bbox")
        if not src or not bbox:
            continue
        bbox_t = tuple(int(x) for x in bbox)
        key1 = (alias_map.get(src, src), bbox_t)
        key2 = (src, bbox_t)
        if key1 in in_bucket or key2 in in_bucket:
            continue
        unbucketed.append(f)
    print(f"[parent] {len(unbucketed)} unbucketed face entries (in {PARENT} but no era bucket)", file=sys.stderr)

    # Score each unbucketed face against every anchor
    proposals = []
    skipped_no_emb = 0
    skipped_no_year = 0
    for f in unbucketed:
        src = f["source"]
        bbox = f["bbox"]
        v = lookup_emb(rec_index, alias_map, src, bbox)
        if v is None:
            skipped_no_emb += 1
            continue
        if src not in exif_cache:
            exif_cache_dirty = True  # cache miss: get_year below inserts it
        y = get_year(src, exif_cache)
        if y is None:
            skipped_no_year += 1
            continue
        # nearest anchor
        best = None  # (dist, idx)
        for i, a in enumerate(anchors):
            d = 1.0 - float(np.dot(a["centroid"], v))
            if best is None or d < best[0]:
                best = (d, i)
        if best is None:
            continue
        dist, bidx = best
        anchor = anchors[bidx]
        year_delta = abs(y - anchor["dom_year"]) if anchor["dom_year"] is not None else None
        accept = (dist <= DIST_MAX and year_delta is not None and year_delta <= YEAR_MAX)
        proposals.append({
            "png": f["png"],
            "source": src,
            "bbox": [int(x) for x in bbox],
            "year": y,
            "rank_in_parent": f.get("rank"),
            "quality_composite": f.get("quality", {}).get("composite"),
            "quality": f.get("quality", {}),
            "best_anchor": anchor["name"],
            "best_anchor_dom_year": anchor["dom_year"],
            "centroid_dist": round(dist, 4),
            "year_delta": year_delta,
            "accept": bool(accept),
            "all_anchor_dists": {
                a["name"]: round(1.0 - float(np.dot(a["centroid"], v)), 4) for a in anchors
            },
        })

    if exif_cache_dirty:
        save_exif_cache(exif_cache)
        print(f"[exif] cache flushed ({len(exif_cache)} entries total)", file=sys.stderr)

    # Summarize
    accepted = [p for p in proposals if p["accept"]]
    rejected = [p for p in proposals if not p["accept"]]
    by_anchor = Counter(p["best_anchor"] for p in accepted)
    print(f"[summary] unbucketed={len(unbucketed)} scored={len(proposals)} "
          f"accepted={len(accepted)} rejected={len(rejected)} "
          f"skipped(no_emb={skipped_no_emb}, no_year={skipped_no_year})", file=sys.stderr)
    for k, v in by_anchor.most_common():
        print(f"  {k}: +{v}", file=sys.stderr)

    out = {
        "thresholds": {"dist_max": DIST_MAX, "year_max": YEAR_MAX},
        "anchors": [
            {k: v for k, v in a.items() if k != "centroid"}
            for a in anchors
        ],
        "n_unbucketed": len(unbucketed),
        "skipped": {"no_emb": skipped_no_emb, "no_year": skipped_no_year},
        "proposals": sorted(proposals, key=lambda p: (not p["accept"], p["best_anchor"], -1 * (p["quality_composite"] or 0))),
        "by_anchor": dict(by_anchor),
    }
    op = Path(args.out)
    op.parent.mkdir(parents=True, exist_ok=True)
    op.write_text(json.dumps(out, indent=2))
    print(f"[done] {len(proposals)} proposals -> {op}", file=sys.stderr)

# ----------------------------- report -----------------------------

def cmd_report(args):
    cand = json.loads(Path(args.candidates).read_text())
    out_dir = Path(args.out)
    thumbs_dir = out_dir / "thumbs"
    thumbs_dir.mkdir(parents=True, exist_ok=True)
    THUMB = 140

    def make_thumb(png_relpath: str) -> str:
        # png_relpath looks like "faces/0042.png"
        src = ROOT / PARENT / png_relpath
        name = Path(png_relpath).stem
        dst = thumbs_dir / f"{name}.jpg"
        if not dst.exists():
            try:
                img = Image.open(src).convert("RGB")
                img.thumbnail((THUMB, THUMB), Image.LANCZOS)
                img.save(dst, "JPEG", quality=82)
            except Exception as e:
                print(f"[thumb-skip] {src}: {e}", file=sys.stderr)
                return ""
        return f"thumbs/{name}.jpg"

    # group accepted proposals by target anchor
    by_anchor: dict[str, list] = {}
    rejected = []
    for p in cand["proposals"]:
        if p["accept"]:
            by_anchor.setdefault(p["best_anchor"], []).append(p)
        else:
            rejected.append(p)

    rows = []
    rows.append("<h1>faceset_001 age extension — review</h1>")
    rows.append(f"<p>{cand['n_unbucketed']} unbucketed faces in {PARENT}; "
                f"{sum(len(v) for v in by_anchor.values())} accepted / {len(rejected)} rejected; "
                f"thresholds dist≤{cand['thresholds']['dist_max']} AND |year_delta|≤{cand['thresholds']['year_max']}.</p>")
    nav = " · ".join(f"<a href='#{a}'>{a} (+{len(by_anchor[a])})</a>" for a in by_anchor) + " · <a href='#rejected'>rejected</a>"
    rows.append(f"<div class='nav'>{nav}</div>")

    for anchor_name in ACTIVE_ERAS:
        if anchor_name not in by_anchor:
            continue
        items = by_anchor[anchor_name]
        anchor_meta = next((a for a in cand["anchors"] if a["name"] == anchor_name), {})
        rows.append(f"<section id='{anchor_name}' class='grp'>")
        rows.append(f"<h2>{anchor_name} <small>(dom_year={anchor_meta.get('dom_year')}; "
                    f"existing n={anchor_meta.get('n_faces')}; +{len(items)} new)</small></h2>")
        rows.append("<div class='cells'>")
        for p in sorted(items, key=lambda x: (x["centroid_dist"], -1 * (x["quality_composite"] or 0))):
            thumb = make_thumb(p["png"])
            cls = "hi" if p["centroid_dist"] <= 0.30 else "mid"
            rows.append(
                f"<div class='cell'>"
                f"<img src='{thumb}' loading='lazy' title='{p['png']}'>"
                f"<div class='meta'>{p['png']}<br>year {p['year']} (Δ{p['year_delta']})<br>"
                f"<span class='{cls}'>dist {p['centroid_dist']:.3f}</span></div>"
                f"</div>"
            )
        rows.append("</div></section>")

    if rejected:
        rows.append("<section id='rejected' class='grp rej'>")
        rows.append(f"<h2>rejected <small>({len(rejected)} faces don't fit any anchor)</small></h2>")
        rows.append("<div class='cells'>")
        for p in sorted(rejected, key=lambda x: x["centroid_dist"])[:200]:
            thumb = make_thumb(p["png"])
            why = []
            if p["centroid_dist"] > cand['thresholds']['dist_max']:
                why.append(f"dist {p['centroid_dist']:.2f}>{cand['thresholds']['dist_max']}")
            if p["year_delta"] is None or p["year_delta"] > cand['thresholds']['year_max']:
                why.append(f"yΔ{p['year_delta']}>{cand['thresholds']['year_max']}")
            rows.append(
                f"<div class='cell'>"
                f"<img src='{thumb}' loading='lazy'>"
                f"<div class='meta'>{p['png']}<br>year {p['year']} → best {p['best_anchor']}<br>"
                f"<span class='lo'>{'; '.join(why)}</span></div>"
                f"</div>"
            )
        if len(rejected) > 200:
            rows.append(f"<p>...{len(rejected)-200} more truncated.</p>")
        rows.append("</div></section>")

    html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>faceset_001 age extension</title>
<style>
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
h1 {{ margin-top:0; }} h2 {{ margin:0; }}
small {{ color:#999; font-weight:normal; }}
section.grp {{ background:#1a1a1a; border-radius:6px; padding:12px; margin:12px 0; }}
section.grp.rej {{ border-left:4px solid #ff5050; }}
.cells {{ display:flex; flex-wrap:wrap; gap:6px; }}
.cell {{ background:#222; border-radius:4px; padding:4px; width:160px; font-size:11px; font-family:monospace; text-align:center; }}
.cell img {{ height:140px; width:auto; border-radius:3px; }}
.meta {{ padding-top:4px; line-height:1.3; }}
.hi {{ color:#5fa05f; font-weight:bold; }}
.mid {{ color:#ffb050; }}
.lo {{ color:#ff5050; }}
.nav {{ position:sticky; top:0; background:#111; padding:.5em 0; border-bottom:1px solid #333; font-size:13px; }}
a {{ color:#6cf; }}
</style></head>
<body>
{''.join(rows)}
</body></html>"""
    out_html = out_dir / "index.html"
    out_html.write_text(html)
    print(f"[done] {out_html}", file=sys.stderr)

# ----------------------------- apply -----------------------------

def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
    import zipfile
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
        for i, p in enumerate(pngs):
            zf.write(p, arcname=f"{i:04d}.png")


def cmd_apply(args):
    cand = json.loads(Path(args.candidates).read_text())
    accepted = [p for p in cand["proposals"] if p["accept"]]
    if args.dry_run:
        by = Counter(p["best_anchor"] for p in accepted)
        print(f"=== dry-run: {len(accepted)} assignments across {len(by)} anchors ===")
        for k, v in by.most_common():
            print(f"  {k}: +{v}")
        return

    parent_dir = ROOT / PARENT
    master_path = ROOT / "manifest.json"
    master = json.loads(master_path.read_text())
    facesets_by_name = {f["name"]: f for f in master.get("facesets", [])}

    by_anchor: dict[str, list] = {}
    for p in accepted:
        by_anchor.setdefault(p["best_anchor"], []).append(p)

    total_added = 0
    for anchor_name, props in by_anchor.items():
        ed = ROOT / anchor_name
        em_path = ed / "manifest.json"
        em = json.loads(em_path.read_text())
        existing = list(em.get("faces", []))

        # gather new entries with their source PNG paths in faceset_001/faces/
        new_with_src = []
        for p in props:
            src_png = parent_dir / p["png"]
            if not src_png.exists():
                print(f"[warn] missing parent PNG {src_png}; skip", file=sys.stderr)
                continue
            face_entry = {
                "source": p["source"],
                "bbox": p["bbox"],
                "quality": p["quality"],
                "exif_year": p["year"],
                "centroid_dist_at_assign": p["centroid_dist"],
                "year_delta_at_assign": p["year_delta"],
                "extended_from_parent": True,
            }
            new_with_src.append((face_entry, src_png))

        # combine; rank by quality.composite desc (existing entries already have rank,
        # but we re-rank globally so new entries slot in by quality)
        combined: list[tuple[dict, Path | None]] = []
        for f in existing:
            combined.append((f, None))
        combined.extend(new_with_src)
        # `or 0` guards entries whose composite is null in older manifests
        combined.sort(key=lambda x: -(x[0].get("quality", {}).get("composite") or 0))

        # stage fresh
        staging = ed / "_faces_new"
        if staging.exists():
            shutil.rmtree(staging)
        staging.mkdir()
        new_face_entries = []
        for new_rank, (face, src_png_or_none) in enumerate(combined, start=1):
            new_name = f"{new_rank:04d}.png"
            if src_png_or_none is None:
                # existing entry: copy from current era bucket faces/
                old_name = Path(face["png"]).name
                src = ed / "faces" / old_name
                if not src.exists():
                    print(f"[warn] {anchor_name}: missing existing PNG {src}; skip", file=sys.stderr)
                    continue
                shutil.copy2(src, staging / new_name)
            else:
                shutil.copy2(src_png_or_none, staging / new_name)
            face = dict(face)
            face["rank"] = new_rank
            face["png"] = f"faces/{new_name}"
            new_face_entries.append(face)

        # swap dirs
        old_holding = ed / "_faces_old"
        if old_holding.exists():
            shutil.rmtree(old_holding)
        (ed / "faces").rename(old_holding)
        staging.rename(ed / "faces")
        shutil.rmtree(old_holding)

        # re-zip .fsz
        survivor_pngs = sorted((ed / "faces").glob("*.png"))
        top_n = em.get("top_n", 30)
        top_n_eff = min(top_n, len(survivor_pngs))
        for old in ed.glob("*.fsz"):
            old.unlink()
        top_fsz_name = f"{anchor_name}_top{top_n_eff}.fsz"
        all_fsz_name = f"{anchor_name}_all.fsz"
        _zip_png_list(survivor_pngs[:top_n_eff], ed / top_fsz_name)
        if len(survivor_pngs) > top_n_eff:
            _zip_png_list(survivor_pngs, ed / all_fsz_name)
            all_fsz_used = all_fsz_name
        else:
            all_fsz_used = None

        # update local + master manifests
        em["faces"] = new_face_entries
        em["exported"] = len(new_face_entries)
        em["fsz_top"] = top_fsz_name
        em["fsz_all"] = all_fsz_used
        em["top_n"] = top_n_eff
        em.setdefault("age_extend_history", []).append({
            "added": len(new_with_src),
            "thresholds": cand["thresholds"],
        })
        em_path.write_text(json.dumps(em, indent=2))

        if anchor_name in facesets_by_name:
            facesets_by_name[anchor_name]["exported"] = len(new_face_entries)
            facesets_by_name[anchor_name]["fsz_top"] = top_fsz_name
            facesets_by_name[anchor_name]["fsz_all"] = all_fsz_used
            facesets_by_name[anchor_name]["top_n"] = top_n_eff

        added_here = len(new_with_src)
        total_added += added_here
        print(f"[applied] {anchor_name}: +{added_here} (now {len(new_face_entries)} faces)", file=sys.stderr)

    # rewrite master with ordering preserved
    new_facesets = []
    for entry in master.get("facesets", []):
        new_facesets.append(facesets_by_name.get(entry["name"], entry))
    master["facesets"] = new_facesets
    master.setdefault("age_extend_runs", []).append({
        "parent": PARENT,
        "thresholds": cand["thresholds"],
        "anchors": list(by_anchor.keys()),
        "added_total": total_added,
    })
    tmp = master_path.with_suffix(".tmp.json")
    tmp.write_text(json.dumps(master, indent=2))
    tmp.replace(master_path)
    print(f"[done] +{total_added} faces across {len(by_anchor)} anchors", file=sys.stderr)

# ----------------------------- main -----------------------------

def main():
    ap = argparse.ArgumentParser()
    sub = ap.add_subparsers(dest="cmd", required=True)

    a = sub.add_parser("analyze")
    a.add_argument("--out", required=True)
    a.set_defaults(func=cmd_analyze)

    r = sub.add_parser("report")
    r.add_argument("--candidates", required=True)
    r.add_argument("--out", required=True)
    r.set_defaults(func=cmd_report)

    p = sub.add_parser("apply")
    p.add_argument("--candidates", required=True)
    p.add_argument("--dry-run", action="store_true")
    p.set_defaults(func=cmd_apply)

    args = ap.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()
@@ -0,0 +1,485 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Age-split person_001 into era-specific facesets.
|
||||
|
||||
Workflow:
|
||||
1. Seed a clean person_001 centroid from the existing curated 707-face
|
||||
`facesets_swap_ready/faceset_001/`.
|
||||
2. Wide-recovery scan: pull every face record under /mnt/x/src/{nl, lzbkp_red}
|
||||
from `nl_full.npz` with cos-dist <= 0.55 from the seed centroid.
|
||||
3. Apply export-swap-style per-face quality gates.
|
||||
4. One re-centroid + 0.50 tighten pass to absorb the recovery without drift.
|
||||
5. Agglomerative sub-clustering at cos-dist 0.35.
|
||||
6. Post-merge sub-clusters whose centroids <0.30 AND whose dominant EXIF
|
||||
years are within 2 years.
|
||||
7. Read EXIF DateTimeOriginal for each face's source path; era label =
|
||||
(p10 year, p90 year) over dated faces.
|
||||
8. Undated faces are assigned to the nearest era by embedding distance.
|
||||
9. For each era: composite-quality rank, single-face PNG crops, .fsz bundles
|
||||
(top-N and _all if era > top_n). `<era>_<range>.txt` marker file. Eras
|
||||
with <20 face records get a `THIN.txt` marker.
|
||||
10. Append era entries into the canonical
|
||||
`facesets_swap_ready/manifest.json` next to the existing 19.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import shutil
|
||||
import sys
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from PIL import Image, ExifTags, ImageOps
|
||||
|
||||
REPO = Path(__file__).resolve().parent.parent
|
||||
sys.path.insert(0, str(REPO))
|
||||
|
||||
from sort_faces import ( # noqa: E402
|
||||
QUALITY_WEIGHTS,
|
||||
_crop_face_square,
|
||||
_zip_png_list,
|
||||
compute_quality,
|
||||
load_cache,
|
||||
load_rgb_bgr,
|
||||
)
|
||||
|
||||
# ---- config -------------------------------------------------------------- #
|
||||
|
||||
CACHE = REPO / "work" / "cache" / "nl_full.npz"
|
||||
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
|
||||
FS001 = SWAP_READY / "faceset_001"
|
||||
|
||||
SCAN_ROOTS = [
|
||||
Path("/mnt/x/src/nl"),
|
||||
Path("/mnt/x/src/lzbkp_red"),
|
||||
]
|
||||
|
||||
# Recovery + identity refinement
|
||||
RECOVERY_THRESHOLD = 0.55 # initial centroid match
|
||||
TIGHTEN_THRESHOLD = 0.50 # post-recentroid drift trim
|
||||
# Quality gates (mirror export-swap defaults)
|
||||
MIN_FACE_SHORT = 100
|
||||
# Sub-cluster
|
||||
SUBCLUSTER_THRESHOLD = 0.35
|
||||
# Anchor-based fragment assignment (replaces transitive union-find merge):
|
||||
ANCHOR_MIN_SIZE = 20 # sub-cluster size to qualify as an era anchor
|
||||
FRAGMENT_CENTROID_MAX = 0.40 # small fragment may join an anchor only if cent_dist <=
|
||||
FRAGMENT_YEAR_MAX = 5 # AND |dom_year_anchor - dom_year_fragment| <=
|
||||
# Output
|
||||
TOP_N = 30
|
||||
PAD_RATIO = 0.5
|
||||
OUT_SIZE = 512
|
||||
THIN_THRESHOLD = 20
|
||||
|
||||
# EXIF cache (so re-runs skip the 30-min Windows-mount EXIF read)
|
||||
EXIF_CACHE = REPO / "work" / "cache" / "age_split_exif.json"
|
||||
|
||||
|
||||
# ---- helpers ------------------------------------------------------------- #
|
||||
|
||||
def _normalize(v: np.ndarray) -> np.ndarray:
|
||||
n = np.linalg.norm(v)
|
||||
return v / n if n > 0 else v
|
||||
|
||||
|
||||
def _under(roots: list[Path], p: str) -> bool:
|
||||
for r in roots:
|
||||
rs = str(r).rstrip("/") + "/"
|
||||
if p == str(r) or p.startswith(rs):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def _record_in_roots(rec: dict, roots: list[Path], path_aliases: dict) -> bool:
|
||||
if _under(roots, rec["path"]):
|
||||
return True
|
||||
for alias in path_aliases.get(rec["path"], []):
|
||||
if _under(roots, alias):
|
||||
return True
|
||||
return False

def exif_year(path: Path) -> int | None:
    try:
        with Image.open(path) as im:
            exif = im._getexif()
            if not exif:
                return None
            for tag_id, val in exif.items():
                tag = ExifTags.TAGS.get(tag_id, tag_id)
                if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
                    return int(val[:4])
    except Exception:
        return None
    return None

def label_for_era(years: list[int]) -> str:
    """Era label as a year-range string. Falls back to 'undated' if no years."""
    if not years:
        return "undated"
    ys = sorted(years)
    lo = ys[len(ys) // 10] if len(ys) >= 10 else ys[0]
    hi = ys[-(len(ys) // 10) - 1] if len(ys) >= 10 else ys[-1]
    if lo == hi:
        return str(lo)
    # Compact year range like 2011-13 if same century, else 2009-2024.
    if (lo // 100) == (hi // 100):
        return f"{lo}-{hi % 100:02d}"
    return f"{lo}-{hi}"
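
# Illustrative expectations (not executed; follow from the tail-trim above):
#   label_for_era([])                  -> "undated"
#   label_for_era([2013])              -> "2013"
#   label_for_era([2011]*5 + [2013]*5) -> "2011-13"   (>=10 years: ~p10/p90 trim)
#   label_for_era([1999]*5 + [2024]*5) -> "1999-2024" (century differs: no compaction)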

# ---- phase 1 + 2: seed centroid + recovery scan --------------------------- #

def main() -> None:
    if not FS001.exists():
        raise SystemExit(f"missing seed faceset: {FS001}")

    print("=== loading cache ===")
    emb, meta, _src, _proc, path_aliases = load_cache(CACHE)
    face_records = [m for m in meta if not m.get("noface")]
    if len(face_records) != len(emb):
        raise SystemExit(f"emb/meta mismatch: {len(face_records)} vs {len(emb)}")

    bbox_idx = {(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)}

    seed_manifest = json.loads((FS001 / "manifest.json").read_text())
    seed_face_keys = [(f["source"], tuple(f.get("bbox") or ())) for f in seed_manifest["faces"]]
    seed_indices = [bbox_idx[k] for k in seed_face_keys if k in bbox_idx]
    print(f"seed faces from faceset_001: {len(seed_indices)} (manifest had {len(seed_face_keys)})")

    seed_centroid = _normalize(emb[seed_indices].mean(axis=0))

    # Recovery: every face record under nl/ + lzbkp_red/ within RECOVERY_THRESHOLD.
    candidate_idxs = [
        i for i, rec in enumerate(face_records)
        if _record_in_roots(rec, SCAN_ROOTS, path_aliases)
    ]
    print(f"\ncandidates under {[str(r) for r in SCAN_ROOTS]}: {len(candidate_idxs)}")

    cand_emb = emb[candidate_idxs]
    cand_dists = 1.0 - cand_emb @ seed_centroid
    recovered_local = [k for k, d in enumerate(cand_dists) if d <= RECOVERY_THRESHOLD]
    recovered = [candidate_idxs[k] for k in recovered_local]
    print(f"recovered at cos-dist <= {RECOVERY_THRESHOLD}: {len(recovered)}")

    # Quality gate.
    qualified = []
    drop_size = drop_blur = drop_det = 0
    for i in recovered:
        r = face_records[i]
        if r.get("face_short", 0) < MIN_FACE_SHORT:
            drop_size += 1
            continue
        if r.get("blur", 0.0) < 40.0:
            drop_blur += 1
            continue
        if r.get("det_score", 0.0) < 0.6:
            drop_det += 1
            continue
        qualified.append(i)
    print(f"after quality gate: {len(qualified)} (drop size={drop_size} blur={drop_blur} det={drop_det})")

    # One tightening pass: re-centroid on qualified, drop anyone > TIGHTEN_THRESHOLD.
    qcent = _normalize(emb[qualified].mean(axis=0))
    qd = 1.0 - emb[qualified] @ qcent
    tight = [qualified[k] for k, d in enumerate(qd) if d <= TIGHTEN_THRESHOLD]
    print(f"after re-centroid tighten ({TIGHTEN_THRESHOLD}): {len(tight)}")
    # ---- phase 5: sub-cluster --------------------------------------------- #
    print("\n=== sub-clustering ===")
    from sklearn.cluster import AgglomerativeClustering

    E = emb[tight]
    sims = E @ E.T
    dists = 1.0 - sims
    # Floor numerical noise.
    np.fill_diagonal(dists, 0.0)
    dists = np.maximum(dists, 0.0)

    ac = AgglomerativeClustering(
        n_clusters=None,
        metric="precomputed",
        linkage="average",
        distance_threshold=SUBCLUSTER_THRESHOLD,
    )
    labels = ac.fit_predict(dists)
    sub_sizes = Counter(labels)
    print(f"raw sub-clusters: {len(sub_sizes)} (sizes: top10={sorted(sub_sizes.values(), reverse=True)[:10]})")

    # Per-cluster: indices, centroid, EXIF years.
    cluster_indices: dict[int, list[int]] = {}
    for k, lab in enumerate(labels):
        cluster_indices.setdefault(int(lab), []).append(tight[k])

    cluster_centroids: dict[int, np.ndarray] = {}
    for lab, idxs in cluster_indices.items():
        cluster_centroids[lab] = _normalize(emb[idxs].mean(axis=0))

    print("\n=== EXIF years (one read per source path; cached) ===")
    unique_paths = sorted({face_records[i]["path"] for i in tight})
    if EXIF_CACHE.exists():
        cached = json.loads(EXIF_CACHE.read_text())
    else:
        cached = {}
    path_year: dict[str, int | None] = {}
    new_reads = 0
    for p in unique_paths:
        if p in cached:
            path_year[p] = cached[p]
        else:
            y = exif_year(Path(p))
            path_year[p] = y
            cached[p] = y
            new_reads += 1
    EXIF_CACHE.parent.mkdir(parents=True, exist_ok=True)
    EXIF_CACHE.write_text(json.dumps(cached, indent=0))
    dated = sum(1 for v in path_year.values() if v is not None)
    print(f"  EXIF cache: {len(cached)} entries, {new_reads} new reads, "
          f"{dated}/{len(unique_paths)} dated")

    cluster_years: dict[int, list[int]] = {}
    cluster_dom_year: dict[int, int | None] = {}
    for lab, idxs in cluster_indices.items():
        ys = []
        for i in idxs:
            y = path_year.get(face_records[i]["path"])
            if y is not None:
                ys.append(y)
        cluster_years[lab] = ys
        cluster_dom_year[lab] = (Counter(ys).most_common(1)[0][0]) if ys else None
    # ---- phase 6: anchor-based fragment assignment ------------------------ #
    # Each sub-cluster of size >= ANCHOR_MIN_SIZE is an "era anchor". Smaller
    # fragments are assigned to the single nearest anchor IFF (centroid distance
    # <= FRAGMENT_CENTROID_MAX AND |dom_year delta| <= FRAGMENT_YEAR_MAX).
    # Anchors do NOT merge with each other; that prevents the transitive year
    # drift observed when union-find was used. Standalone fragments stay as
    # their own (likely THIN) eras.
    print("\n=== anchor-based assignment ===")
    anchors = [lab for lab, idxs in cluster_indices.items() if len(idxs) >= ANCHOR_MIN_SIZE]
    fragments = [lab for lab in cluster_indices if lab not in anchors]
    anchors.sort(key=lambda lab: -len(cluster_indices[lab]))
    print(f"anchors (size>={ANCHOR_MIN_SIZE}): {len(anchors)}; fragments: {len(fragments)}")
    for a in anchors:
        print(f"  anchor sub {a}: size={len(cluster_indices[a])} dom_year={cluster_dom_year[a]}")

    if anchors:
        a_cent = np.stack([cluster_centroids[a] for a in anchors])
        assignments: dict[int, int] = {a: a for a in anchors}  # anchor -> self
        unassigned: list[int] = []
        for f in fragments:
            f_cent = cluster_centroids[f]
            f_year = cluster_dom_year[f]
            # cosine distances to each anchor
            cd = 1.0 - a_cent @ f_cent
            # year distance (inf if either dom-year unknown)
            yd = []
            for a in anchors:
                ay = cluster_dom_year[a]
                if f_year is None or ay is None:
                    yd.append(float("inf"))
                else:
                    yd.append(abs(f_year - ay))
            yd = np.array(yd)
            ok = (cd <= FRAGMENT_CENTROID_MAX) & (yd <= FRAGMENT_YEAR_MAX)
            if not ok.any():
                unassigned.append(f)
                continue
            # Nearest qualifying anchor by centroid distance.
            cd_masked = np.where(ok, cd, np.inf)
            best = int(np.argmin(cd_masked))
            assignments[f] = anchors[best]
        print(f"  assigned fragments: {sum(1 for k, v in assignments.items() if k != v)}/{len(fragments)}; "
              f"unassigned (standalone): {len(unassigned)}")
    else:
        print("  no anchors; every sub-cluster stands alone")
        assignments = {lab: lab for lab in cluster_indices}
        unassigned = []

    merged: dict[int, list[int]] = {}
    for lab, idxs in cluster_indices.items():
        root = assignments.get(lab, lab)
        merged.setdefault(root, []).extend(idxs)

    merged_sizes = sorted(((r, len(v)) for r, v in merged.items()), key=lambda kv: -kv[1])
    print(f"era buckets: {len(merged)} (top10 sizes: {[s for _, s in merged_sizes[:10]]})")

    # Recompute centroid + dom-year for merged eras.
    era_indices: dict[int, list[int]] = merged
    era_centroids: dict[int, np.ndarray] = {}
    era_year_label: dict[int, str] = {}
    era_years_full: dict[int, list[int]] = {}
    for root, idxs in era_indices.items():
        era_centroids[root] = _normalize(emb[idxs].mean(axis=0))
        ys = []
        for i in idxs:
            y = path_year.get(face_records[i]["path"])
            if y is not None:
                ys.append(y)
        era_years_full[root] = ys
        era_year_label[root] = label_for_era(ys)

    # ---- phase 8: undated faces (no EXIF) --------------------------------- #
    # "Undated" means the path's EXIF year was None. Every undated face already
    # sits in some sub-cluster (placed by embedding alone), and era *labels*
    # come from dated faces only, so nothing moves here; just report the count.
    n_undated = sum(1 for i in tight if path_year.get(face_records[i]["path"]) is None)
    print(f"undated face records (no EXIF): {n_undated}/{len(tight)} (placed by embedding only)")
    # ---- phase 9: per-era export ------------------------------------------ #
    import cv2

    print("\n=== exporting era bundles ===")
    new_manifest_entries: list[dict] = []
    eras_sorted = sorted(era_indices.items(), key=lambda kv: -len(kv[1]))
    for root, idxs in eras_sorted:
        size = len(idxs)
        label = era_year_label[root]
        era_name = f"faceset_001_{label}"
        out_dir = SWAP_READY / era_name

        # Disambiguate same-label collisions (e.g. two distinct embedding eras both 2019).
        collision = 2
        while out_dir.exists():
            era_name = f"faceset_001_{label}_v{collision}"
            out_dir = SWAP_READY / era_name
            collision += 1

        faces_dir = out_dir / "faces"
        faces_dir.mkdir(parents=True, exist_ok=True)

        # Composite quality + rank.
        ranked = []
        for ci in idxs:
            rec = face_records[ci]
            q = compute_quality(rec)
            ranked.append({"cache_idx": ci, "rec": rec, "quality": q})

        # Dedup by source path within this era; keep the highest-quality face per path.
        seen_path: dict[str, dict] = {}
        for r in ranked:
            p = r["rec"]["path"]
            prev = seen_path.get(p)
            if prev is None or r["quality"]["composite"] > prev["quality"]["composite"]:
                seen_path[p] = r
        unique = sorted(seen_path.values(), key=lambda r: -r["quality"]["composite"])

        # Materialize crops.
        written: list[Path] = []
        face_entries: list[dict] = []
        for rank, r in enumerate(unique, start=1):
            rec = r["rec"]
            src = Path(rec["path"])
            if not src.exists():
                continue
            rgb, _ = load_rgb_bgr(src)
            if rgb is None:
                continue
            crop = _crop_face_square(rgb, rec["bbox"], PAD_RATIO, OUT_SIZE)
            png = faces_dir / f"{rank:04d}.png"
            cv2.imwrite(str(png), cv2.cvtColor(crop, cv2.COLOR_RGB2BGR))
            written.append(png)
            face_entries.append({
                "rank": rank,
                "png": f"faces/{rank:04d}.png",
                "source": rec["path"],
                "aliases": path_aliases.get(rec["path"], []),
                "bbox": rec["bbox"],
                "face_short": rec.get("face_short"),
                "det_score": rec.get("det_score"),
                "blur": rec.get("blur"),
                "pose": rec.get("pose"),
                "exif_year": path_year.get(rec["path"]),
                "quality": r["quality"],
            })

        if not written:
            print(f"[{era_name}] empty after materialization; skipping")
            shutil.rmtree(out_dir)
            continue

        # Bundle.
        top_n_eff = min(TOP_N, len(written))
        top_fsz = out_dir / f"{era_name}_top{top_n_eff}.fsz"
        _zip_png_list(written[:top_n_eff], top_fsz)
        all_fsz: Path | None = None
        if len(written) > top_n_eff:
            all_fsz = out_dir / f"{era_name}_all.fsz"
            _zip_png_list(written, all_fsz)

        # Per-era manifest.
        ys = era_years_full[root]
        year_summary = {
            "label": label,
            "year_count": len(ys),
            "year_min": min(ys) if ys else None,
            "year_max": max(ys) if ys else None,
            "year_dist": dict(Counter(ys).most_common()),
        }
        is_thin = size < THIN_THRESHOLD
        manifest = {
            "name": era_name,
            "parent_identity": "faceset_001",
            "era": year_summary,
            "input_face_records": size,
            "exported": len(written),
            "top_n": top_n_eff,
            "fsz_top": top_fsz.name,
            "fsz_all": all_fsz.name if all_fsz else None,
            "thin": is_thin,
            "quality_weights": QUALITY_WEIGHTS,
            "params": {
                "recovery_threshold": RECOVERY_THRESHOLD,
                "tighten_threshold": TIGHTEN_THRESHOLD,
                "subcluster_threshold": SUBCLUSTER_THRESHOLD,
                "anchor_min_size": ANCHOR_MIN_SIZE,
                "fragment_centroid_max": FRAGMENT_CENTROID_MAX,
                "fragment_year_max": FRAGMENT_YEAR_MAX,
                "min_face_short": MIN_FACE_SHORT,
            },
            "faces": face_entries,
        }
        (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))

        # Per-era marker file (always: <label>.txt for human reference).
        (out_dir / f"{label}.txt").write_text(
            f"{era_name}\n\nEra: {label}\n"
            f"Year span: {year_summary['year_min']}..{year_summary['year_max']} "
            f"({year_summary['year_count']} dated of {size} faces)\n"
            f"Sub-cluster size: {size} face records, {len(unique)} unique source paths, "
            f"{len(written)} exported PNGs.\n"
        )
        if is_thin:
            (out_dir / "THIN.txt").write_text(
                f"This era has only {size} face records (<{THIN_THRESHOLD}). "
                f"Averaged embedding may be dominated by single-photo idiosyncrasies.\n"
            )

        # Append to top-level manifest summary.
        new_manifest_entries.append({k: v for k, v in manifest.items() if k != "faces"})

        thin_tag = " THIN" if is_thin else ""
        print(
            f"[{era_name}] size={size} unique_paths={len(unique)} exported={len(written)} "
            f"top{top_n_eff}{thin_tag}"
        )

    # ---- merge into top-level manifest ------------------------------------ #
    top_path = SWAP_READY / "manifest.json"
    existing = json.loads(top_path.read_text()) if top_path.exists() else {"facesets": []}
    existing_names = {fs.get("name") for fs in existing.get("facesets", [])}
    appended = 0
    for entry in new_manifest_entries:
        if entry["name"] in existing_names:
            continue
        existing["facesets"].append(entry)
        appended += 1
    top_path.write_text(json.dumps(existing, indent=2))
    print(f"\nAppended {appended} era entries to {top_path}")
    print(f"Done. {len(new_manifest_entries)} era buckets emitted (faceset_001/ left untouched).")


if __name__ == "__main__":
    main()
@@ -0,0 +1,323 @@
#!/usr/bin/env python3
"""Build per-folder facesets from hand-sorted source directories.

Phase B + C of the folder-import workflow:
- Filter cache records into per-folder identity sets, run 2-pass centroid +
  outlier rejection so non-target faces in group photos drop out.
- Route every osrc face record to every trusted-folder identity within a tight
  cosine cutoff (multi-identity osrc photos land in multiple facesets;
  cmd_export_swap then filters per bbox so each faceset crops only the matching face).
- Synthesize a refine_manifest.json compatible with cmd_export_swap.
- Invoke cmd_export_swap to emit faceset_NNN/ dirs into a temp output dir.
- Rename .fsz bundles after the source folder, replace NAME.txt with foldername.txt,
  move dirs into the canonical facesets_swap_ready/, merge the top-level manifest
  preserving existing faceset_001..012 entries.
"""

from __future__ import annotations

import json
import shutil
import sys
from pathlib import Path

import numpy as np

REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))

from sort_faces import (  # noqa: E402
    cmd_export_swap,
    load_cache,
)

# ---- config -------------------------------------------------------------- #

CACHE = REPO / "work" / "cache" / "nl_full.npz"
OUT_FINAL = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
OUT_TMP = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready_new")
SYNTH_MANIFEST = REPO / "work" / "synthetic_refine_manifest.json"

# Trusted folders, in numbering order. faceset_NNN starts at 013.
TRUSTED: list[tuple[str, Path]] = [
    ("k", Path("/mnt/x/src/k")),
    ("m", Path("/mnt/x/src/m")),
    ("mi", Path("/mnt/x/src/mi")),
    ("mir", Path("/mnt/x/src/mir")),
    ("s", Path("/mnt/x/src/s")),
    ("sab", Path("/mnt/x/src/sab")),
    ("t", Path("/mnt/x/src/t")),
]
START_NNN = 13
OSRC_DIR = Path("/mnt/x/src/osrc")

# Centroid-build outlier passes (loose then tight).
PASS1_THRESHOLD = 0.55
PASS2_THRESHOLD = 0.45
# osrc routing cutoff (tight).
OSRC_THRESHOLD = 0.45

# export-swap params (defaults from sort_faces.py).
TOP_N = 30
EXPORT_OUTLIER_THRESHOLD = 0.45
PAD_RATIO = 0.5
OUT_SIZE = 512
MIN_FACE_SHORT = 100


# ---- helpers ------------------------------------------------------------- #

def _normalize_rows(mat: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(mat, axis=1, keepdims=True)
    n[n == 0] = 1.0
    return mat / n


def _centroid(vecs: np.ndarray) -> np.ndarray:
    c = vecs.mean(axis=0)
    n = np.linalg.norm(c)
    return c / n if n > 0 else c

def _under(folder: Path, p: str) -> bool:
    """True iff path string p lies under folder."""
    fs = str(folder).rstrip("/") + "/"
    return p == str(folder) or p.startswith(fs)


def _record_in_folder(rec: dict, folder: Path, path_aliases: dict[str, list[str]]) -> bool:
    if _under(folder, rec["path"]):
        return True
    for alias in path_aliases.get(rec["path"], []):
        if _under(folder, alias):
            return True
    return False


# ---- phase B: identity centroids + osrc routing --------------------------- #

def build_synthetic_manifest() -> tuple[dict, dict[str, np.ndarray], dict[str, dict]]:
    emb, meta, _src_root, _processed, path_aliases = load_cache(CACHE)
    # emb is aligned with the noface-filtered records (matching cmd_export_swap's
    # invariant). Use indices into face_records to access emb.
    face_records = [m for m in meta if not m.get("noface")]
    if len(face_records) != len(emb):
        raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")

    print(f"Loaded cache: {len(face_records)} face records.")

    # Per-folder identity centroids.
    centroids: dict[str, np.ndarray] = {}
    folder_paths: dict[str, set[str]] = {}
    folder_stats: dict[str, dict] = {}

    for label, folder in TRUSTED:
        idxs = [i for i, m in enumerate(face_records) if _record_in_folder(m, folder, path_aliases)]
        if not idxs:
            print(f"[{label}] no face records found under {folder}; skipping")
            continue

        vecs = emb[idxs]
        cent = _centroid(vecs)

        # Pass 1: drop loose outliers.
        d1 = 1.0 - vecs @ cent
        keep1 = [idxs[k] for k, dist in enumerate(d1) if dist <= PASS1_THRESHOLD]
        if not keep1:
            print(f"[{label}] every face was a pass-1 outlier; using all faces as-is")
            keep1 = idxs
        cent = _centroid(emb[keep1])

        # Pass 2: tight outlier rejection.
        d2 = 1.0 - emb[keep1] @ cent
        keep2 = [keep1[k] for k, dist in enumerate(d2) if dist <= PASS2_THRESHOLD]
        if not keep2:
            print(f"[{label}] every face was a pass-2 outlier; falling back to pass-1")
            keep2 = keep1
        cent = _centroid(emb[keep2])

        centroids[label] = cent
        # Use canonical path strings; export-swap will look up indices by path.
        folder_paths[label] = {face_records[i]["path"] for i in keep2}
        folder_stats[label] = {
            "folder": str(folder),
            "input_records": len(idxs),
            "after_pass1": len(keep1),
            "after_pass2": len(keep2),
            "unique_paths": len(folder_paths[label]),
        }
        print(
            f"[{label}] in={len(idxs)} pass1={len(keep1)} pass2={len(keep2)} "
            f"unique_paths={len(folder_paths[label])}"
        )

    # osrc routing: every osrc face -> every centroid within OSRC_THRESHOLD.
    osrc_idxs = [
        i for i, m in enumerate(face_records)
        if _record_in_folder(m, OSRC_DIR, path_aliases)
    ]
    print(f"\nosrc: {len(osrc_idxs)} face records to route")
    if osrc_idxs and centroids:
        labels = list(centroids.keys())
        cent_mat = np.stack([centroids[lab] for lab in labels])
        # Build sims: (n_osrc, n_labels).
        osrc_emb = emb[osrc_idxs]
        sims = osrc_emb @ cent_mat.T  # cosine similarity (vectors already normalized)
        dists = 1.0 - sims
        per_label_added: dict[str, int] = {lab: 0 for lab in labels}
        for row, ci in enumerate(osrc_idxs):
            p = face_records[ci]["path"]
            for col, lab in enumerate(labels):
                if dists[row, col] <= OSRC_THRESHOLD:
                    if p not in folder_paths[lab]:
                        folder_paths[lab].add(p)
                        per_label_added[lab] += 1
        for lab in labels:
            folder_stats[lab]["osrc_paths_added"] = per_label_added[lab]
            print(f"[{lab}] osrc faces routed: +{per_label_added[lab]} unique paths")

    # Build the synthetic refine_manifest.
    facesets: list[dict] = []
    for n, (label, _folder) in enumerate(TRUSTED, start=START_NNN):
        if label not in folder_paths:
            continue
        facesets.append({
            "name": f"faceset_{n:03d}",
            "label": label,
            "image_count": len(folder_paths[label]),
            "images": sorted(folder_paths[label]),
        })

    manifest = {
        "params": {
            "pass1_threshold": PASS1_THRESHOLD,
            "pass2_threshold": PASS2_THRESHOLD,
            "osrc_threshold": OSRC_THRESHOLD,
            "min_face_short": MIN_FACE_SHORT,
        },
        "facesets": facesets,
        "_per_folder_stats": folder_stats,
    }
    SYNTH_MANIFEST.write_text(json.dumps(manifest, indent=2))
    print(f"\nSynthetic manifest -> {SYNTH_MANIFEST}")
    return manifest, centroids, folder_stats


# ---- phase C: export + rename + merge ------------------------------------- #

def export_and_relocate(manifest: dict) -> None:
    if OUT_TMP.exists():
        shutil.rmtree(OUT_TMP)
    OUT_TMP.mkdir(parents=True)

    print(f"\nRunning cmd_export_swap -> {OUT_TMP}")
    cmd_export_swap(
        cache_path=CACHE,
        refine_manifest_path=SYNTH_MANIFEST,
        raw_manifest_path=None,
        out_dir=OUT_TMP,
        top_n=TOP_N,
        outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
        pad_ratio=PAD_RATIO,
        out_size=OUT_SIZE,
        include_candidates=False,
        candidate_match_threshold=0.55,
        candidate_min_score=0.40,
        min_face_short=MIN_FACE_SHORT,
    )

    # Map name -> label from the synthetic manifest.
    name_to_label = {fs["name"]: fs["label"] for fs in manifest["facesets"]}

    # Load the temp top-level manifest (export-swap just wrote it).
    new_top = json.loads((OUT_TMP / "manifest.json").read_text())
    new_entries = new_top.get("facesets", [])

    # Per-faceset rename + relocate.
    for fs_meta in new_entries:
        name = fs_meta["name"]
        label = name_to_label.get(name)
        src_dir = OUT_TMP / name
        if not src_dir.exists():
            print(f"[{name}] export dir missing; skipping")
            continue

        # Rename .fsz bundles to <label>_*.fsz; record updated names.
        renames = {}
        for fsz in sorted(src_dir.glob(f"{name}_top*.fsz")):
            new = src_dir / fsz.name.replace(name + "_", label + "_", 1)
            fsz.rename(new)
            renames[fsz.name] = new.name
        for fsz in sorted(src_dir.glob(f"{name}_all.fsz")):
            new = src_dir / fsz.name.replace(name + "_", label + "_", 1)
            fsz.rename(new)
            renames[fsz.name] = new.name

        # Replace the NAME.txt placeholder with <label>.txt.
        nametxt = src_dir / "NAME.txt"
        if nametxt.exists():
            nametxt.unlink()
        (src_dir / f"{label}.txt").write_text(
            f"{label}\n\nSource: /mnt/x/src/{label} (hand-sorted) + matched osrc faces.\n"
        )

        # Update fs_meta entry's fsz fields to point at the renamed files.
        for k in ("fsz_top", "fsz_all"):
            if fs_meta.get(k) and fs_meta[k] in renames:
                fs_meta[k] = renames[fs_meta[k]]
        fs_meta["label"] = label

        # Move the directory into the final output.
        dst_dir = OUT_FINAL / name
        if dst_dir.exists():
            print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
            continue
        shutil.move(str(src_dir), str(dst_dir))
        print(f"[{name}] -> {dst_dir} (label={label})")

    # Merge the top-level manifest, preserving existing faceset_001..012 entries.
    final_manifest_path = OUT_FINAL / "manifest.json"
    if final_manifest_path.exists():
        existing = json.loads(final_manifest_path.read_text())
    else:
        existing = {"facesets": []}

    existing_names = {fs["name"] for fs in existing.get("facesets", [])}
    appended = 0
    for entry in new_entries:
        if entry["name"] in existing_names:
            print(f"[manifest] {entry['name']} already in top-level manifest; not duplicating")
            continue
        existing["facesets"].append(entry)
        appended += 1

    # Carry over export-swap params if not already present.
    for k in ("quality_weights", "outlier_threshold", "top_n", "pad_ratio", "out_size"):
        if k not in existing and k in new_top:
            existing[k] = new_top[k]

    final_manifest_path.write_text(json.dumps(existing, indent=2))
    print(f"\nMerged manifest: appended {appended} entries -> {final_manifest_path}")

    # Clean up the temp dir if empty; otherwise leave its manifest.json for inspection.
    leftover = list(OUT_TMP.iterdir()) if OUT_TMP.exists() else []
    if not leftover:
        OUT_TMP.rmdir()


# ---- main ------------------------------------------------------------------ #

def main() -> None:
    manifest, _centroids, _stats = build_synthetic_manifest()
    if not manifest.get("facesets"):
        print("No facesets to build; nothing to do.")
        return
    export_and_relocate(manifest)
    print("\nDone.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""Probe faceset_001 for age-sortable sub-structure.

Three questions:
1. How spread is the embedding cloud? (intra-cluster pairwise distance histogram)
2. Does it split naturally into sub-clusters at a tight threshold?
3. Do the sub-clusters correspond to distinct time periods (EXIF DateTimeOriginal)?
"""

from __future__ import annotations

import json
import sys
from collections import Counter
from pathlib import Path

import numpy as np
from PIL import Image, ExifTags

REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import load_cache  # noqa: E402

CACHE = REPO / "work" / "cache" / "nl_full.npz"
FS001 = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready/faceset_001")

def exif_year(path: Path) -> int | None:
    try:
        with Image.open(path) as im:
            exif = im._getexif()
            if not exif:
                return None
            for tag_id, val in exif.items():
                tag = ExifTags.TAGS.get(tag_id, tag_id)
                if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
                    return int(val[:4])
    except Exception:
        return None
    return None


def main() -> None:
    manifest = json.loads((FS001 / "manifest.json").read_text())
    faces = manifest["faces"]
    paths = [Path(f["source"]) for f in faces]
    print(f"faceset_001 has {len(paths)} ranked faces in the swap-ready set")

    # Pull embeddings for these face records by (path, bbox).
    emb, meta, _src, _proc, _aliases = load_cache(CACHE)
    face_records = [m for m in meta if not m.get("noface")]
    if len(face_records) != len(emb):
        raise SystemExit("emb/meta mismatch")
    bbox_key = {}
    for i, m in enumerate(face_records):
        bbox_key[(m["path"], tuple(m.get("bbox") or ()))] = i

    selected = []
    missing = 0
    for f in faces:
        key = (f["source"], tuple(f.get("bbox") or ()))
        i = bbox_key.get(key)
        if i is None:
            missing += 1
            continue
        selected.append(i)
    print(f"matched {len(selected)} embeddings (missing {missing})")

    E = emb[selected]
    # All embeddings are L2-normalized -> cosine dist = 1 - dot.
    sims = E @ E.T
    dists = 1.0 - sims
    iu = np.triu_indices_from(dists, k=1)
    pw = dists[iu]
    print("\n-- intra-cluster pairwise cosine distance --")
    print(f"  n_pairs = {len(pw):,}")
    print(f"  mean = {pw.mean():.3f}")
    print(f"  median = {np.median(pw):.3f}")
    print(f"  p10/p25/p75/p90 = {np.percentile(pw, [10, 25, 75, 90])}")
    print(f"  max = {pw.max():.3f}")

    # Histogram bins around interesting thresholds.
    edges = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0, 1.4]
    hist, _ = np.histogram(pw, bins=edges)
    print("\n  histogram (cos-dist bin -> pair count):")
    for lo, hi, c in zip(edges[:-1], edges[1:], hist):
        bar = "#" * int(60 * c / max(hist.max(), 1))
        print(f"  [{lo:.1f},{hi:.1f}) {c:7d} {bar}")

    # Sub-cluster at several thresholds via agglomerative on the distance matrix.
    from sklearn.cluster import AgglomerativeClustering
    print("\n-- sub-clustering --")
    for thr in (0.30, 0.35, 0.40, 0.45, 0.50):
        ac = AgglomerativeClustering(
            n_clusters=None,
            metric="precomputed",
            linkage="average",
            distance_threshold=thr,
        )
        labels = ac.fit_predict(dists)
        sizes = Counter(labels)
        n = len(sizes)
        big = sum(1 for s in sizes.values() if s >= 10)
        top5 = sorted(sizes.values(), reverse=True)[:5]
        print(f"  threshold {thr:.2f}: {n} sub-clusters, {big} with >=10 images, top-5 sizes={top5}")

    # Pick the threshold that gives 2-5 substantial sub-clusters.
    target_thr = 0.35
    ac = AgglomerativeClustering(
        n_clusters=None, metric="precomputed", linkage="average",
        distance_threshold=target_thr,
    )
    labels = ac.fit_predict(dists)
    sizes = Counter(labels)
    big_labels = [lab for lab, s in sizes.most_common() if s >= 20]
    print(f"\n-- EXIF year analysis at threshold {target_thr} (sub-clusters with >=20 images) --")
    print(f"  {len(big_labels)} substantial sub-clusters")

    # Build label -> list of source paths.
    by_label: dict[int, list[Path]] = {}
    for ci, lab in zip(selected, labels):
        rec = face_records[ci]
        by_label.setdefault(int(lab), []).append(Path(rec["path"]))

    for lab in big_labels[:6]:
        paths_in = by_label[lab]
        years = []
        for p in paths_in:
            y = exif_year(p)
            if y is not None:
                years.append(y)
        n_paths = len(paths_in)
        n_years = len(years)
        if years:
            ys = np.array(years)
            ymin, ymax = int(ys.min()), int(ys.max())
            ymed = int(np.median(ys))
            yhist = Counter(years)
            top_years = ", ".join(f"{y}:{c}" for y, c in sorted(yhist.most_common(5)))
        else:
            ymin = ymax = ymed = None
            top_years = ""
        print(
            f"  cluster {lab}: {n_paths} faces, EXIF on {n_years}/{n_paths}, "
            f"year range {ymin}..{ymax} (median {ymed})"
        )
        print(f"    top years: {top_years}")


if __name__ == "__main__":
    main()
@@ -0,0 +1,221 @@
"""Windows / DirectML CLIP worker for occlusion scoring.

Reads a queue.json staged by /opt/face-sets/work/filter_occlusions.py (WSL side),
runs open_clip ViT-L-14 (dfn2b_s39b) on each PNG via torch-directml on the AMD
Vega, and writes a scores.json with mask + sunglasses softmax probabilities.

CLI:
    py -3.12 clip_worker.py <queue.json> <out_scores.json> [--limit N] [--batch 8]

queue.json shape: list of objects
    {"wsl_path": "...", "win_path": "E:\\...\\faceset_NNN\\faces\\NNNN.png",
     "faceset": "faceset_NNN", "file": "NNNN.png"}

scores.json shape:
    {"model": "ViT-L-14/dfn2b_s39b",
     "logit_scale": 100.0,
     "prompts": {...},
     "results": [{"wsl_path": "...", "faceset": "...", "file": "...",
                  "mask": float, "sunglasses": float}],
     "processed": [wsl_path, ...]}
"""

from __future__ import annotations

import argparse
import json
import os
import sys
import time
import warnings
from pathlib import Path

# DML emits a verbose UserWarning per attention call -- silence at import time.
warnings.filterwarnings("ignore", category=UserWarning)

import torch
import torch_directml
import open_clip
from PIL import Image

MODEL_NAME = "ViT-L-14"
PRETRAINED = "dfn2b_s39b"

# Kept in sync with /opt/face-sets/work/filter_occlusions.py PROMPTS.
PROMPTS = {
    "mask": {
        "pos": [
            "a photo of a person wearing a surgical face mask",
            "a photo of a person wearing an FFP2 respirator covering mouth and nose",
            "a photo of a person wearing a cloth face mask",
            "a face partially covered by a medical mask",
            "a person whose mouth and nose are hidden by a face mask",
        ],
        "neg": [
            "a photo of a person's face with mouth and nose clearly visible",
            "a clear, unobstructed photo of a face",
            "a photo of a face without any mask or covering",
            "a portrait of a person showing their full face",
            "a photo of a person with a beard and visible mouth",
        ],
    },
    "sunglasses": {
        "pos": [
            "a face with dark sunglasses covering the eyes",
            "a portrait with the eyes hidden behind opaque sunglasses",
            "a person wearing dark sunglasses over their eyes, eyes not visible",
            "a face where the eyes are completely concealed by tinted lenses",
            "a close-up portrait wearing aviator sunglasses on the eyes",
        ],
        "neg": [
            "a portrait with both eyes clearly visible and uncovered",
            "a face with sunglasses pushed up on the forehead, eyes visible below",
            "a face with sunglasses resting on top of the head, eyes visible",
            "a person with sunglasses hanging from their shirt, eyes visible",
            "a face wearing clear prescription eyeglasses with visible eyes",
            "a portrait with no eyewear and visible eyes",
        ],
    },
}

FLUSH_EVERY = 100

def load_existing(out_path: Path):
    if not out_path.exists():
        return None, set()
    try:
        d = json.loads(out_path.read_text())
        processed = set(d.get("processed", []))
        return d, processed
    except Exception as e:
        print(f"[warn] could not parse existing {out_path}: {e}; starting fresh", file=sys.stderr)
        return None, set()


def save_atomic(out_path: Path, data: dict):
    tmp = out_path.with_suffix(".tmp.json")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, out_path)
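
# Note: os.replace swaps the finished temp file into place in one rename, so a
# crash mid-write leaves the previous scores.json intact rather than truncated.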

@torch.no_grad()
def build_text_features(model, tokenizer, device):
    out = {}
    for attr, sides in PROMPTS.items():
        feats = {}
        for side in ("pos", "neg"):
            tokens = tokenizer(sides[side]).to(device)
            f = model.encode_text(tokens)
            f = f / f.norm(dim=-1, keepdim=True)
            mean = f.mean(dim=0)
            feats[side] = mean / mean.norm()
        out[attr] = (feats["pos"], feats["neg"])
    return out
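
# Scoring sketch: each side's prompts are encoded, L2-normalized, mean-pooled,
# and re-normalized into one pos and one neg text vector per attribute.
# Downstream, an image feature f is scored as
#     softmax([f @ pos, f @ neg] * logit_scale)[0]
# i.e. the probability mass on the "occluded" side of a pos-vs-neg contrast;
# ensembling several phrasings per side damps sensitivity to any one prompt.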

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("queue", type=Path)
    ap.add_argument("out", type=Path)
    ap.add_argument("--limit", type=int, default=None)
    ap.add_argument("--batch", type=int, default=8)
    args = ap.parse_args()

    queue = json.loads(args.queue.read_text())
    print(f"[queue] {len(queue)} entries from {args.queue}")

    args.out.parent.mkdir(parents=True, exist_ok=True)
    existing, processed = load_existing(args.out)
    if existing:
        print(f"[resume] {len(processed)} entries already scored")
        results = existing.get("results", [])
    else:
        results = []

    pending = [e for e in queue if e["wsl_path"] not in processed]
    if args.limit is not None:
        pending = pending[: args.limit]
    print(f"[pending] {len(pending)} entries to score")

    if not pending:
        print("[done] nothing to do")
        return

    device = torch_directml.device()
    print(f"[load] {MODEL_NAME}/{PRETRAINED} on {torch_directml.device_name(0)}")
    t0 = time.time()
    model, _, preprocess = open_clip.create_model_and_transforms(MODEL_NAME, pretrained=PRETRAINED)
    tokenizer = open_clip.get_tokenizer(MODEL_NAME)
    model = model.to(device).eval()
    logit_scale = float(model.logit_scale.exp().detach().cpu())
    print(f"[load] ready in {time.time()-t0:.1f}s logit_scale={logit_scale:.2f}")
    text_feats = build_text_features(model, tokenizer, device)

    def flush():
        save_atomic(args.out, {
            "model": f"{MODEL_NAME}/{PRETRAINED}",
            "logit_scale": logit_scale,
            "prompts": PROMPTS,
            "results": results,
            "processed": sorted(processed),
        })

    n_done_this_run = 0
    n_load_err = 0
    last_flush = time.time()
    t_start = time.time()

    for i in range(0, len(pending), args.batch):
        chunk = pending[i:i + args.batch]
        imgs = []
        keep = []
        for entry in chunk:
            try:
                img = Image.open(entry["win_path"]).convert("RGB")
                imgs.append(preprocess(img))
                keep.append(entry)
            except Exception as e:
                print(f"[skip] {entry['win_path']}: {e}", file=sys.stderr)
                n_load_err += 1
                processed.add(entry["wsl_path"])
        if not imgs:
            continue
        x = torch.stack(imgs).to(device)
        with torch.no_grad():
            feats = model.encode_image(x)
            feats = feats / feats.norm(dim=-1, keepdim=True)
            scores_per_attr = {}
            for attr, (pos, neg) in text_feats.items():
                sims = torch.stack([feats @ pos, feats @ neg], dim=1) * logit_scale
                probs = sims.softmax(dim=1)[:, 0].detach().cpu().tolist()
                scores_per_attr[attr] = probs
        for j, entry in enumerate(keep):
            results.append({
                "wsl_path": entry["wsl_path"],
                "faceset": entry["faceset"],
                "file": entry["file"],
                "mask": round(scores_per_attr["mask"][j], 4),
                "sunglasses": round(scores_per_attr["sunglasses"][j], 4),
            })
            processed.add(entry["wsl_path"])
            n_done_this_run += 1

        if (n_done_this_run % FLUSH_EVERY < args.batch) or (time.time() - last_flush) > 30.0:
            flush()
            last_flush = time.time()
            elapsed = time.time() - t_start
            rate = n_done_this_run / max(0.1, elapsed)
            eta_min = (len(pending) - n_done_this_run) / max(0.1, rate) / 60.0
            print(f"[score] {n_done_this_run}/{len(pending)} "
                  f"rate={rate:.2f} img/s eta={eta_min:.1f}min "
                  f"load_err={n_load_err}", flush=True)

    flush()
    elapsed = time.time() - t_start
    print(f"[done] {n_done_this_run} scored, {n_load_err} load errors, "
          f"{elapsed:.1f}s ({n_done_this_run/max(0.1, elapsed):.2f} img/s) -> {args.out}")


if __name__ == "__main__":
    main()
@@ -0,0 +1,340 @@
#!/usr/bin/env python3
"""Discover new identities in an Immich-sourced cache and emit them as facesets.

Mirrors `work/cluster_osrc.py`, but the source corpus is an arbitrary
Immich user's `immich_<user>.npz` cache produced by the Windows DML embed
worker. Existing identity centroids come from the union of every faceset
already in `facesets_swap_ready/` (faceset_001..NNN, both auto-clustered
and hand-sorted).

Pipeline:
1. Load immich_<user>.npz; restrict to face records (drop noface).
2. Build centroids of every existing canonical faceset in
   facesets_swap_ready/ (skip era splits and _thin/).
3. Drop immich faces whose nearest existing centroid is within
   EXISTING_MATCH_THRESHOLD; those are already covered by the canonical set.
4. Cluster the remaining among themselves at INITIAL_THRESHOLD.
5. Per cluster: refine-equivalent gates (face_short, blur, det_score),
   plus outlier rejection at OUTLIER_THRESHOLD for clusters of size >= 4.
6. Keep clusters whose surviving unique source-path count is >= MIN_FACES.
7. Number kept clusters past the existing facesets_swap_ready/ max.
8. Synthesize a refine_manifest, hand off to cmd_export_swap, move dirs into
   facesets_swap_ready/, drop a provenance marker, append to the top-level
   manifest.json (preserving facesets / thin_eras).
"""

from __future__ import annotations

import argparse
import json
import shutil
import sys
from pathlib import Path

import numpy as np

REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))

from sort_faces import (  # noqa: E402
    _cluster_embeddings,
    cmd_export_swap,
    load_cache,
)

# ---- config -------------------------------------------------------------- #

REPO_WORK = REPO / "work"
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")

EXISTING_MATCH_THRESHOLD = 0.45
INITIAL_THRESHOLD = 0.55

MIN_FACES = 6
MIN_SHORT = 90
MIN_BLUR = 40.0
MIN_DET_SCORE = 0.6
OUTLIER_THRESHOLD = 0.55

TOP_N = 30
EXPORT_OUTLIER_THRESHOLD = 0.45
PAD_RATIO = 0.5
OUT_SIZE = 512
EXPORT_MIN_FACE_SHORT = 100


# ---- helpers ------------------------------------------------------------- #

def _normalize(v: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def _existing_identity_centroids(
    nl_cache: Path,
) -> tuple[np.ndarray, list[str]]:
    """Build identity centroids from every canonical faceset_NNN/ in
    facesets_swap_ready/. Era-split sub-dirs (faceset_001_<era>) and the
    _thin/ quarantine are skipped. Each faceset's manifest.json provides
    (source, bbox) keys we use to look up rows in nl_full.npz."""
    emb, meta, _src, _proc, _aliases = load_cache(nl_cache)
    face_records = [m for m in meta if not m.get("noface")]
    if len(face_records) != len(emb):
        raise SystemExit(f"meta/embedding mismatch in {nl_cache}: {len(face_records)} vs {len(emb)}")
    bbox_idx = {(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)}

    centroids: list[np.ndarray] = []
    names: list[str] = []
    for d in sorted(SWAP_READY.iterdir()):
        if not d.is_dir():
            continue
        if d.name.startswith("_"):
            continue
        # Skip era-split sub-facesets (faceset_NNN_*).
        if d.name.startswith("faceset_") and "_" in d.name[len("faceset_"):]:
            continue
        man = d / "manifest.json"
        if not man.exists():
            continue
        try:
            entries = json.loads(man.read_text()).get("faces", [])
        except Exception:
            continue
        keys = [(f["source"], tuple(f.get("bbox") or ())) for f in entries]
        idxs = [bbox_idx[k] for k in keys if k in bbox_idx]
        if not idxs:
            continue
        centroids.append(_normalize(emb[idxs].mean(axis=0)))
        names.append(d.name)
    if not centroids:
        raise SystemExit("no canonical identity centroids could be built; check facesets_swap_ready/")
    return np.stack(centroids), names


def _next_faceset_number() -> int:
    nums = []
    for d in SWAP_READY.iterdir():
        if not d.is_dir() or not d.name.startswith("faceset_"):
            continue
        tail = d.name[len("faceset_"):]
        # Take only top-level numbered facesets (no era suffix).
        if "_" in tail:
            continue
        try:
            nums.append(int(tail))
        except ValueError:
            continue
    return (max(nums) + 1) if nums else 1
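
# Example: with faceset_001 .. faceset_019 plus era splits like
# faceset_001_2011-13 on disk, only the un-suffixed dirs count, so the next
# number is 20. An empty tree starts at 1.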

# ---- phase 1: discover ---------------------------------------------------- #

def discover_new_clusters(
    immich_cache: Path, nl_cache: Path, start_nnn: int, source_label: str
) -> tuple[dict, list[dict]]:
    print(f"loading immich cache: {immich_cache}")
    emb, meta, _src, _proc, _aliases = load_cache(immich_cache)
    face_records = [m for m in meta if not m.get("noface")]
    if len(face_records) != len(emb):
        raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")
    print(f"  {len(face_records)} face records, {sum(1 for m in meta if m.get('noface'))} noface")

    print(f"building existing-identity centroids from {SWAP_READY}")
    cents, cent_names = _existing_identity_centroids(nl_cache)
    print(f"  {len(cent_names)} canonical centroids")

    sims = emb @ cents.T
    nearest_d = 1.0 - sims.max(axis=1)
    nearest_id = sims.argmax(axis=1)
    covered = nearest_d <= EXISTING_MATCH_THRESHOLD
    print(f"\nfaces already covered (cos-dist <= {EXISTING_MATCH_THRESHOLD}): "
          f"{int(covered.sum())}/{len(emb)}")
    for j, name in enumerate(cent_names):
        c = int(((nearest_id == j) & covered).sum())
        if c:
            print(f"  -> {name}: {c}")

    new_idx = [i for i in range(len(emb)) if not covered[i]]
    print(f"\nunmatched immich faces to cluster: {len(new_idx)}")
    if len(new_idx) <= 1:
        labels = np.zeros(len(new_idx), dtype=int)
    else:
        labels = _cluster_embeddings(emb[new_idx], INITIAL_THRESHOLD)
    n_clusters = len(set(int(lab) for lab in labels))
    sizes = sorted([int((labels == lab).sum()) for lab in set(labels)], reverse=True)
    print(f"clusters at threshold {INITIAL_THRESHOLD}: {n_clusters} "
          f"top sizes: {sizes[:10]}")

    clusters: dict[int, list[int]] = {}
    for k, lab in enumerate(labels):
        clusters.setdefault(int(lab), []).append(new_idx[k])

    kept: list[dict] = []
    drop_quality_total = 0
    drop_outlier_total = 0
    for cid, idxs in clusters.items():
        good: list[int] = []
        for i in idxs:
            r = face_records[i]
            if r.get("face_short", 0) < MIN_SHORT:
                drop_quality_total += 1
                continue
            if r.get("blur", 0.0) < MIN_BLUR:
                drop_quality_total += 1
                continue
            if r.get("det_score", 0.0) < MIN_DET_SCORE:
                drop_quality_total += 1
                continue
            good.append(i)
        if not good:
            continue
        if len(good) >= 4:
            cent = _normalize(emb[good].mean(axis=0))
            d = 1.0 - emb[good] @ cent
            tight = [good[k] for k, dist in enumerate(d) if dist <= OUTLIER_THRESHOLD]
            drop_outlier_total += len(good) - len(tight)
            good = tight
            if not good:
                continue
        unique_paths = sorted({face_records[i]["path"] for i in good})
        if len(unique_paths) < MIN_FACES:
            continue
        kept.append({
            "indices": good,
            "unique_paths": unique_paths,
            "size_face": len(good),
            "size_paths": len(unique_paths),
        })

    kept.sort(key=lambda c: -c["size_paths"])
    print(f"\nafter quality+outlier+min_faces: {len(kept)} clusters kept "
          f"(dropped: quality={drop_quality_total} outlier={drop_outlier_total})")
    for rank, c in enumerate(kept, start=start_nnn):
        print(f"  faceset_{rank:03d}: faces={c['size_face']:3d} "
              f"unique_paths={c['size_paths']:3d}")

    facesets = [
        {
            "name": f"faceset_{rank:03d}",
            "image_count": c["size_paths"],
            "face_count": c["size_face"],
            "images": c["unique_paths"],
        }
        for rank, c in enumerate(kept, start=start_nnn)
    ]
    manifest = {
        "params": {
            "existing_match_threshold": EXISTING_MATCH_THRESHOLD,
            "initial_threshold": INITIAL_THRESHOLD,
            "outlier_threshold": OUTLIER_THRESHOLD,
            "min_faces": MIN_FACES,
            "min_short": MIN_SHORT,
            "min_blur": MIN_BLUR,
            "min_det_score": MIN_DET_SCORE,
            "source_label": source_label,
            "source_cache": str(immich_cache),
        },
        "facesets": facesets,
    }
    return manifest, kept


# ---- phase 2: export + relocate ------------------------------------------- #

def export_and_relocate(manifest: dict, immich_cache: Path, source_label: str) -> None:
    synth_path = REPO_WORK / f"synthetic_{source_label}_manifest.json"
    synth_path.write_text(json.dumps(manifest, indent=2))
    print(f"\nsynthetic manifest -> {synth_path}")

    out_tmp = SWAP_READY.parent / f"facesets_swap_ready_{source_label}_new"
    if out_tmp.exists():
        shutil.rmtree(out_tmp)
    out_tmp.mkdir(parents=True)

    print(f"running cmd_export_swap -> {out_tmp}")
    cmd_export_swap(
        cache_path=immich_cache,
        refine_manifest_path=synth_path,
        raw_manifest_path=None,
        out_dir=out_tmp,
        top_n=TOP_N,
        outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
        pad_ratio=PAD_RATIO,
        out_size=OUT_SIZE,
        include_candidates=False,
        candidate_match_threshold=0.55,
        candidate_min_score=0.40,
        min_face_short=EXPORT_MIN_FACE_SHORT,
    )

    new_top = json.loads((out_tmp / "manifest.json").read_text())
    new_entries = new_top.get("facesets", [])

    moved = 0
    for fs_meta in new_entries:
        name = fs_meta["name"]
        src_dir = out_tmp / name
        if not src_dir.exists():
            print(f"[{name}] export dir missing; skipping")
            continue
        dst_dir = SWAP_READY / name
        if dst_dir.exists():
            print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
            continue
        (src_dir / f"immich_{source_label}.txt").write_text(
            f"{name}\n\nSource: Immich user {source_label} cluster (auto-discovered).\n"
        )
        shutil.move(str(src_dir), str(dst_dir))
        moved += 1
        print(f"[{name}] -> {dst_dir}")

    final_manifest_path = SWAP_READY / "manifest.json"
    if final_manifest_path.exists():
        existing = json.loads(final_manifest_path.read_text())
    else:
        existing = {"facesets": []}
    existing.setdefault("facesets", [])
    existing_names = {fs["name"] for fs in existing["facesets"]}
    appended = 0
    for entry in new_entries:
        if entry["name"] in existing_names:
            print(f"[manifest] {entry['name']} already present; not duplicating")
            continue
        existing["facesets"].append(entry)
        appended += 1
    final_manifest_path.write_text(json.dumps(existing, indent=2))
    print(f"\nmerged manifest: appended {appended} entries -> {final_manifest_path}")
    print(f"moved {moved} faceset directories into {SWAP_READY}")
    if out_tmp.exists() and not list(out_tmp.iterdir()):
        out_tmp.rmdir()


# ---- main ------------------------------------------------------------------ #

def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("immich_cache", type=Path,
                   help="path to immich_<user>.npz produced by the embed worker")
    p.add_argument("--nl-cache", type=Path, default=REPO_WORK / "cache" / "nl_full.npz",
                   help="canonical cache for existing identity centroids")
    p.add_argument("--source-label", default=None,
                   help="short label used in marker filenames; default = stem of immich_cache")
    p.add_argument("--start-nnn", type=int, default=None,
                   help="first faceset number to assign; default = current max+1 in facesets_swap_ready/")
    p.add_argument("--dry-run", action="store_true")
    args = p.parse_args()

    label = args.source_label or args.immich_cache.stem.removeprefix("immich_") or args.immich_cache.stem
    start_nnn = args.start_nnn if args.start_nnn is not None else _next_faceset_number()
    print(f"source label: {label!r}; first faceset number: {start_nnn:03d}")

    manifest, kept = discover_new_clusters(args.immich_cache, args.nl_cache, start_nnn, label)
    if args.dry_run:
        print("\n--dry-run: stopping after cluster discovery (no exports written).")
        return
    if not manifest.get("facesets"):
        print("no new facesets to build.")
        return
    export_and_relocate(manifest, args.immich_cache, label)
    print("\nDone.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,352 @@
#!/usr/bin/env python3
"""Discover new identities in /mnt/x/src/osrc and emit them as facesets.

Workflow (mirrors the shape of build_folders.py, but identities are
discovered by clustering rather than asserted by folder):

1. Load cache; restrict to face records whose canonical or alias path
   lies under /mnt/x/src/osrc/.
2. Build centroids of the existing 19 canonical identities in
   facesets_swap_ready/faceset_001..019. Drop any osrc face whose
   nearest-existing-identity cos-dist <= EXISTING_MATCH_THRESHOLD;
   those are already covered by `extend` and shouldn't seed new
   facesets.
3. Cluster the remaining osrc faces among themselves at
   INITIAL_THRESHOLD (matches `extend`'s new_cluster_threshold default).
4. Per cluster, apply refine-equivalent gates: face_short >= MIN_SHORT,
   blur >= MIN_BLUR, det_score >= MIN_DET_SCORE; for clusters >= 4,
   drop faces with cos-dist > OUTLIER_THRESHOLD from the cluster
   centroid.
5. Keep clusters whose surviving unique source-path count is >= MIN_FACES.
6. Number kept clusters faceset_020, 021, ... (past the highest existing
   in facesets_swap_ready, which is 019). Order by descending size.
7. Synthesize a refine_manifest.json and call cmd_export_swap on it,
   emitting into a temp dir. Move new dirs into facesets_swap_ready/.
8. Append new entries to the top-level facesets_swap_ready/manifest.json
   (preserving existing facesets / thin_eras).
"""

from __future__ import annotations
|
||||
|
||||
import json
|
||||
import shutil
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
|
||||
REPO = Path(__file__).resolve().parent.parent
|
||||
sys.path.insert(0, str(REPO))
|
||||
|
||||
from sort_faces import ( # noqa: E402
|
||||
_cluster_embeddings,
|
||||
cmd_export_swap,
|
||||
load_cache,
|
||||
)
|
||||
|
||||


# ---- config -------------------------------------------------------------- #

CACHE = REPO / "work" / "cache" / "nl_full.npz"
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
OUT_TMP = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready_osrc_new")
SYNTH_MANIFEST = REPO / "work" / "synthetic_osrc_manifest.json"

OSRC_DIR = Path("/mnt/x/src/osrc")
START_NNN = 20  # facesets_swap_ready max is 019; pick up here.

# Existing-identity exclusion: drop osrc faces whose nearest existing
# identity centroid is within this cosine distance. 0.45 matches the
# build_folders.py OSRC_THRESHOLD: at this cutoff the face is already
# routed to an existing identity by extend / build_folders.py.
EXISTING_MATCH_THRESHOLD = 0.45

# Cluster the unmatched.
INITIAL_THRESHOLD = 0.55

# Refine-equivalent gates (per the user's request: drop min_faces to 6).
MIN_FACES = 6
MIN_SHORT = 90
MIN_BLUR = 40.0
MIN_DET_SCORE = 0.6
OUTLIER_THRESHOLD = 0.55  # only applied if cluster >= 4

# export-swap params (defaults from sort_faces.py).
TOP_N = 30
EXPORT_OUTLIER_THRESHOLD = 0.45
PAD_RATIO = 0.5
OUT_SIZE = 512
EXPORT_MIN_FACE_SHORT = 100


# ---- helpers ------------------------------------------------------------- #

def _normalize(v: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
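
# A minimal numeric check (illustrative doctest, not part of the pipeline):
# on L2-normalized vectors, cosine distance is just 1 - dot product.
#   >>> a = _normalize(np.array([3.0, 4.0]))
#   >>> b = _normalize(np.array([4.0, 3.0]))
#   >>> round(float(1.0 - a @ b), 4)
#   0.04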


def _under(folder: Path, p: str) -> bool:
    fs = str(folder).rstrip("/") + "/"
    return p == str(folder) or p.startswith(fs)
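
# The trailing "/" matters: it keeps sibling dirs with a shared prefix from
# matching (illustrative doctest with hypothetical paths):
#   >>> _under(Path("/mnt/x/src/osrc"), "/mnt/x/src/osrc/a/b.jpg")
#   True
#   >>> _under(Path("/mnt/x/src/osrc"), "/mnt/x/src/osrc_other/c.jpg")
#   False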


def _record_in_folder(rec: dict, folder: Path, path_aliases: dict[str, list[str]]) -> bool:
    if _under(folder, rec["path"]):
        return True
    for alias in path_aliases.get(rec["path"], []):
        if _under(folder, alias):
            return True
    return False


def _existing_identity_centroids(
    emb: np.ndarray, face_records: list[dict]
) -> tuple[np.ndarray, list[str]]:
    """Build a (n_identities, 512) matrix of L2-normalized centroids and a
    parallel name list, drawn from the canonical faceset_001..019 manifests
    in facesets_swap_ready/."""
    bbox_idx: dict[tuple[str, tuple], int] = {
        (m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)
    }
    centroids: list[np.ndarray] = []
    names: list[str] = []
    for n in range(1, 20):
        d = SWAP_READY / f"faceset_{n:03d}"
        man_path = d / "manifest.json"
        if not man_path.exists():
            continue
        man = json.loads(man_path.read_text())
        keys = [(f["source"], tuple(f.get("bbox") or ())) for f in man.get("faces", [])]
        idxs = [bbox_idx[k] for k in keys if k in bbox_idx]
        if not idxs:
            continue
        centroids.append(_normalize(emb[idxs].mean(axis=0)))
        names.append(d.name)
    return np.stack(centroids), names


# ---- phase 1: identify new osrc clusters --------------------------------- #

def discover_new_clusters() -> tuple[dict, list[dict]]:
    emb, meta, _src_root, _proc, path_aliases = load_cache(CACHE)
    face_records = [m for m in meta if not m.get("noface")]
    if len(face_records) != len(emb):
        raise SystemExit(f"meta/embedding mismatch: {len(face_records)} vs {len(emb)}")
    print(f"Cache: {len(face_records)} face records.")

    # Step 1: filter to osrc.
    osrc_idx = [
        i for i, m in enumerate(face_records)
        if _record_in_folder(m, OSRC_DIR, path_aliases)
    ]
    print(f"osrc face records: {len(osrc_idx)}")

    # Step 2: drop those already matching an existing identity.
    cents, cent_names = _existing_identity_centroids(emb, face_records)
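    # Shape sketch for the block below (assumes the cached embeddings are
    # L2-normalized, as the other loaders in this repo enforce explicitly):
    # osrc_emb is (m, 512), cents.T is (512, k), so sims is (m, k); each
    # row-max is the best cosine similarity, and 1 - max is the cosine
    # distance to the nearest existing identity.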
    osrc_emb = emb[osrc_idx]
    sims = osrc_emb @ cents.T
    nearest_d = 1.0 - sims.max(axis=1)
    nearest_id = sims.argmax(axis=1)
    covered_mask = nearest_d <= EXISTING_MATCH_THRESHOLD
    n_covered = int(covered_mask.sum())
    print(
        f"Already covered by existing 19 identities at cos-dist <= "
        f"{EXISTING_MATCH_THRESHOLD}: {n_covered}/{len(osrc_idx)}"
    )
    # Per-identity coverage breakdown (for logging only).
    for j, name in enumerate(cent_names):
        c = int(((nearest_id == j) & covered_mask).sum())
        if c:
            print(f"  -> {name}: {c}")

    new_idx = [osrc_idx[k] for k in range(len(osrc_idx)) if not covered_mask[k]]
    print(f"\nUnmatched osrc faces to cluster: {len(new_idx)}")

    # Step 3: cluster the unmatched among themselves.
    new_emb = emb[new_idx]
    if len(new_idx) <= 1:
        labels = np.zeros(len(new_idx), dtype=int)
    else:
        labels = _cluster_embeddings(new_emb, INITIAL_THRESHOLD)
    n_clusters = len(set(int(lab) for lab in labels))
    print(
        f"Initial clusters at threshold {INITIAL_THRESHOLD}: {n_clusters} "
        f"(top sizes: {sorted([int((labels == lab).sum()) for lab in set(labels)], reverse=True)[:10]})"
    )

    # Step 4 + 5: per-cluster refine gates + min_faces.
    clusters: dict[int, list[int]] = {}
    for k, lab in enumerate(labels):
        clusters.setdefault(int(lab), []).append(new_idx[k])

    kept_clusters: list[dict] = []
    drop_quality_total = 0
    drop_outlier_total = 0
    for cid, idxs in clusters.items():
        # Per-face quality gate.
        good: list[int] = []
        for i in idxs:
            r = face_records[i]
            if r.get("face_short", 0) < MIN_SHORT:
                drop_quality_total += 1
                continue
            if r.get("blur", 0.0) < MIN_BLUR:
                drop_quality_total += 1
                continue
            if r.get("det_score", 0.0) < MIN_DET_SCORE:
                drop_quality_total += 1
                continue
            good.append(i)
        if not good:
            continue

        # Outlier rejection (only if cluster >= 4).
        if len(good) >= 4:
            cent = _normalize(emb[good].mean(axis=0))
            d = 1.0 - emb[good] @ cent
            tight = [good[k] for k, dist in enumerate(d) if dist <= OUTLIER_THRESHOLD]
            drop_outlier_total += len(good) - len(tight)
            good = tight
            if not good:
                continue

        unique_paths = sorted({face_records[i]["path"] for i in good})
        if len(unique_paths) < MIN_FACES:
            continue

        kept_clusters.append({
            "indices": good,
            "unique_paths": unique_paths,
            "size_face": len(good),
            "size_paths": len(unique_paths),
        })

    kept_clusters.sort(key=lambda c: -c["size_paths"])
    print(
        f"\nAfter quality gate ({drop_quality_total} dropped) + outlier "
        f"rejection ({drop_outlier_total} dropped) + min_faces={MIN_FACES}: "
        f"{len(kept_clusters)} clusters kept"
    )
    for rank, c in enumerate(kept_clusters, start=START_NNN):
        print(
            f"  faceset_{rank:03d}: faces={c['size_face']:3d} "
            f"unique_paths={c['size_paths']:3d}"
        )

    # Build synthetic refine_manifest.json compatible with cmd_export_swap.
    facesets = [
        {
            "name": f"faceset_{rank:03d}",
            "image_count": c["size_paths"],
            "face_count": c["size_face"],
            "images": c["unique_paths"],
        }
        for rank, c in enumerate(kept_clusters, start=START_NNN)
    ]
    manifest = {
        "params": {
            "existing_match_threshold": EXISTING_MATCH_THRESHOLD,
            "initial_threshold": INITIAL_THRESHOLD,
            "outlier_threshold": OUTLIER_THRESHOLD,
            "min_faces": MIN_FACES,
            "min_short": MIN_SHORT,
            "min_blur": MIN_BLUR,
            "min_det_score": MIN_DET_SCORE,
            "source_root": str(OSRC_DIR),
        },
        "facesets": facesets,
    }
    SYNTH_MANIFEST.write_text(json.dumps(manifest, indent=2))
    print(f"\nSynthetic manifest -> {SYNTH_MANIFEST}")
    return manifest, kept_clusters


# ---- phase 2: export + relocate + merge top-level manifest --------------- #

def export_and_relocate(manifest: dict) -> None:
    if OUT_TMP.exists():
        shutil.rmtree(OUT_TMP)
    OUT_TMP.mkdir(parents=True)

    print(f"\nRunning cmd_export_swap -> {OUT_TMP}")
    cmd_export_swap(
        cache_path=CACHE,
        refine_manifest_path=SYNTH_MANIFEST,
        raw_manifest_path=None,
        out_dir=OUT_TMP,
        top_n=TOP_N,
        outlier_threshold=EXPORT_OUTLIER_THRESHOLD,
        pad_ratio=PAD_RATIO,
        out_size=OUT_SIZE,
        include_candidates=False,
        candidate_match_threshold=0.55,
        candidate_min_score=0.40,
        min_face_short=EXPORT_MIN_FACE_SHORT,
    )

    new_top = json.loads((OUT_TMP / "manifest.json").read_text())
    new_entries = new_top.get("facesets", [])

    moved = 0
    for fs_meta in new_entries:
        name = fs_meta["name"]
        src_dir = OUT_TMP / name
        if not src_dir.exists():
            print(f"[{name}] export dir missing; skipping")
            continue
        dst_dir = SWAP_READY / name
        if dst_dir.exists():
            print(f"[{name}] {dst_dir} already exists; refusing to overwrite")
            continue
        # Add a marker file so the source provenance is obvious.
        (src_dir / "osrc.txt").write_text(
            f"{name}\n\nSource: osrc cluster (auto-discovered, {OSRC_DIR}).\n"
        )
        shutil.move(str(src_dir), str(dst_dir))
        moved += 1
        print(f"[{name}] -> {dst_dir}")

    # Merge top-level manifest, preserving facesets / thin_eras / etc.
    final_manifest_path = SWAP_READY / "manifest.json"
    if final_manifest_path.exists():
        existing = json.loads(final_manifest_path.read_text())
    else:
        existing = {"facesets": []}
    existing.setdefault("facesets", [])

    existing_names = {fs["name"] for fs in existing["facesets"]}
    appended = 0
    for entry in new_entries:
        if entry["name"] in existing_names:
            print(f"[manifest] {entry['name']} already present; not duplicating")
            continue
        existing["facesets"].append(entry)
        appended += 1

    final_manifest_path.write_text(json.dumps(existing, indent=2))
    print(f"\nMerged manifest: appended {appended} entries -> {final_manifest_path}")
    print(f"Moved {moved} faceset directories into {SWAP_READY}")

    # Clean up temp dir if empty.
    if OUT_TMP.exists():
        leftover = list(OUT_TMP.iterdir())
        if not leftover:
            OUT_TMP.rmdir()


# ---- main ---------------------------------------------------------------- #

def main() -> None:
    dry = "--dry-run" in sys.argv
    manifest, kept = discover_new_clusters()
    if dry:
        print("\n--dry-run: stopping after cluster discovery (no exports written).")
        return
    if not manifest.get("facesets"):
        print("No new facesets to build; nothing to do.")
        return
    export_and_relocate(manifest)
    print("\nDone.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,634 @@
"""Consolidate facesets_swap_ready/ — find duplicate identities and merge.
|
||||
|
||||
Pipeline:
|
||||
1. analyze: pull arcface embeddings from work/cache/*.npz for every PNG in every
|
||||
active faceset (skipping _masked, _thin, era splits). Compute L2-normalized
|
||||
centroid per faceset. Build similarity graph at sim>=0.45, extract components.
|
||||
Pick primary per component by tier (hand-sorted > auto > osrc > immich) + size.
|
||||
2. report: HTML contact sheet at work/merge_review/index.html grouped by
|
||||
candidate cluster, with top-3 thumbs per faceset, all pairwise sims, and
|
||||
"merge X,Y -> Z" plan. Confident edges (sim>=0.65) are highlighted.
|
||||
3. apply: combine PNGs of secondaries into primary, re-rank by quality.composite
|
||||
descending, renumber 0001..NNNN, re-zip _topN.fsz + _all.fsz, move secondaries
|
||||
to facesets_swap_ready/_merged/<name>/, update master manifest with
|
||||
`merged[]` array + `merge_run` provenance block.
|
||||
|
||||
Embeddings come from caches (no GPU re-embed needed); the original clusterer used
|
||||
exactly these vectors so they are the right yardstick. Era splits are excluded
|
||||
entirely (intentional time-period segmentation, not a duplication).
|
||||
"""
|
||||
|
||||
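
# Usage sketch (subcommands and flags per main() below; the script filename
# is an assumption):
#   python consolidate_facesets.py analyze --out work/merge_review/candidates.json
#   python consolidate_facesets.py report --candidates work/merge_review/candidates.json --out work/merge_review
#   python consolidate_facesets.py apply --candidates work/merge_review/candidates.json --dry-run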

from __future__ import annotations

import argparse
import json
import re
import shutil
import sys
import time
from pathlib import Path

import numpy as np
from PIL import Image
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
CACHES = [
    Path("/opt/face-sets/work/cache/nl_full.npz"),
    Path("/opt/face-sets/work/cache/immich_peter.npz"),
    Path("/opt/face-sets/work/cache/immich_nic.npz"),
]

ERA_SPLIT_RE = re.compile(r"^faceset_\d+_(?:\d{4}-\d{2,4}|\d{4}|undated)$")


# ----------------------------- helpers -----------------------------

def load_caches():
    """Return (rec_index, alias_map). rec_index keyed by (path, bbox_tuple)
    -> embedding (np.float32, shape (512,), L2-normalized).
    alias_map maps every alias path -> canonical path."""
    rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
    alias_map: dict[str, str] = {}
    n_total = 0
    for c in CACHES:
        if not c.exists():
            print(f"[warn] cache missing: {c}", file=sys.stderr)
            continue
        d = np.load(c, allow_pickle=True)
        emb = d["embeddings"]
        meta = json.loads(str(d["meta"]))
        face_records = [m for m in meta if not m.get("noface")]
        if len(face_records) != len(emb):
            raise SystemExit(f"meta/emb mismatch in {c}: {len(face_records)} vs {len(emb)}")
        # path_aliases may be present
        if "path_aliases" in d.files:
            paliases = json.loads(str(d["path_aliases"]))
            for canon, alist in paliases.items():
                alias_map.setdefault(canon, canon)
                for a in alist:
                    alias_map[a] = canon
        for i, rec in enumerate(face_records):
            p = rec["path"]
            bbox = tuple(int(x) for x in rec["bbox"])
            v = emb[i].astype(np.float32)
            n = float(np.linalg.norm(v))
            if n > 0:
                v = v / n
            rec_index[(p, bbox)] = v
            alias_map.setdefault(p, p)
        print(f"[cache] {c.name}: +{len(face_records)} face records (running total {len(rec_index)})", file=sys.stderr)
        n_total += len(face_records)
    print(f"[cache] indexed {n_total} face records, {len(alias_map)} path aliases", file=sys.stderr)
    return rec_index, alias_map


def faceset_tier(name: str) -> int:
    """Lower number = higher priority for primary selection."""
    m = re.match(r"^faceset_0*(\d+)$", name)
    if not m:
        return 99  # unknown structure
    n = int(m.group(1))
    if 13 <= n <= 19:
        return 0  # hand-sorted
    if 1 <= n <= 12:
        return 1  # auto-clustered
    if 20 <= n <= 25:
        return 2  # osrc
    if 26 <= n <= 264:
        return 3  # immich peter
    if 265 <= n:
        return 4  # immich nic and beyond
    return 99
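
# Spot checks (illustrative doctest):
#   >>> faceset_tier("faceset_017")   # hand-sorted band
#   0
#   >>> faceset_tier("faceset_005")   # leading zeros stripped by the regex
#   1
#   >>> faceset_tier("faceset_300")   # immich-nic band
#   4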


def is_era_split(name: str) -> bool:
    return bool(ERA_SPLIT_RE.match(name))


def faceset_centroid(faceset_dir: Path, rec_index, alias_map):
    """Return (centroid, n_used, n_missing) where centroid is the L2-normalized
    mean of embeddings of the faces listed in the per-faceset manifest. Falls
    back to None if too few embeddings are found."""
    manifest = faceset_dir / "manifest.json"
    if not manifest.exists():
        return None, 0, 0
    m = json.loads(manifest.read_text())
    vecs = []
    n_missing = 0
    for f in m.get("faces", []):
        src = f.get("source")
        bbox = f.get("bbox")
        if src is None or bbox is None:
            n_missing += 1
            continue
        bbox_t = tuple(int(x) for x in bbox)
        canon = alias_map.get(src, src)
        v = rec_index.get((canon, bbox_t))
        if v is None and canon != src:
            v = rec_index.get((src, bbox_t))
        if v is None:
            n_missing += 1
            continue
        vecs.append(v)
    if len(vecs) < 3:
        return None, len(vecs), n_missing
    arr = np.stack(vecs).astype(np.float32)
    c = arr.mean(axis=0)
    n = float(np.linalg.norm(c))
    if n > 0:
        c = c / n
    return c, len(vecs), n_missing


def connected_components(adj: dict[int, set[int]]) -> list[list[int]]:
    seen: set[int] = set()
    comps = []
    for node in adj:
        if node in seen:
            continue
        stack = [node]
        comp = []
        while stack:
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            comp.append(x)
            for y in adj.get(x, set()):
                if y not in seen:
                    stack.append(y)
        comps.append(sorted(comp))
    return comps
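
# Tiny check (illustrative doctest): nodes without edges form singleton
# components.
#   >>> connected_components({0: {1}, 1: {0}, 2: set()})
#   [[0, 1], [2]]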


# ----------------------------- analyze -----------------------------

def cmd_analyze(args):
    rec_index, alias_map = load_caches()

    # collect active facesets
    active = []
    for d in sorted(ROOT.iterdir()):
        if not d.is_dir() or d.name.startswith("_"):
            continue
        if is_era_split(d.name):
            continue
        active.append(d)
    print(f"[scan] {len(active)} active facesets (era splits + _masked + _thin excluded)", file=sys.stderr)

    centroids: dict[str, np.ndarray] = {}
    sizes: dict[str, int] = {}
    skipped = []
    t0 = time.time()
    for fs in active:
        c, n_used, n_miss = faceset_centroid(fs, rec_index, alias_map)
        if c is None:
            skipped.append((fs.name, n_used, n_miss))
            continue
        centroids[fs.name] = c
        sizes[fs.name] = n_used
    print(f"[centroid] {len(centroids)} facesets centroided in {time.time()-t0:.1f}s; "
          f"{len(skipped)} skipped (too few embeddings)", file=sys.stderr)
    if skipped:
        for n, u, m in skipped[:10]:
            print(f"  skip {n}: used={u} missing={m}", file=sys.stderr)
        if len(skipped) > 10:
            print(f"  ... +{len(skipped)-10} more", file=sys.stderr)

    names = sorted(centroids.keys())
    if not names:
        raise SystemExit("no centroids built")

    # similarity matrix
    M = np.stack([centroids[n] for n in names]).astype(np.float32)  # (N, 512), normalized
    sim = M @ M.T  # (N, N) cosine since unit-normalized
    np.clip(sim, -1.0, 1.0, out=sim)

    edge_thr = args.edge
    confident_thr = args.confident

    # complete-linkage agglomerative clustering on cosine distance.
    # Cut at edge threshold: groups are guaranteed to have ALL pairs sim >= edge_thr.
    # This avoids the chaining problem of single-link / connected-components.
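    # Worked example of the cut: with three centroids where sim(a,b) = 0.8,
    # sim(b,c) = 0.8 but sim(a,c) = 0.2, single-link (or the
    # connected_components helper above) would chain all three into one
    # group; complete-link splits c off because the a-c pair fails the
    # sim >= edge_thr test.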
    n = len(names)
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    # symmetrize numerical noise
    dist = (dist + dist.T) / 2.0
    np.clip(dist, 0.0, 2.0, out=dist)
    cond = squareform(dist, checks=False)
    Z = linkage(cond, method="complete")
    cut_dist = 1.0 - edge_thr  # complete-link distance corresponds to (1 - min sim)
    labels = fcluster(Z, t=cut_dist, criterion="distance")  # 1-indexed cluster ids

    cluster_members: dict[int, list[int]] = {}
    for idx, lbl in enumerate(labels):
        cluster_members.setdefault(int(lbl), []).append(idx)
    comps = [sorted(idxs) for idxs in cluster_members.values() if len(idxs) > 1]

    n_pairs_in_groups = 0
    for c in comps:
        n_pairs_in_groups += len(c) * (len(c) - 1) // 2
    print(f"[graph] complete-linkage cut at sim>={edge_thr}: {len(comps)} multi-faceset groups "
          f"({n_pairs_in_groups} within-group pairs)", file=sys.stderr)

    # pick primary per group: lowest tier number, then largest size
    groups_out = []
    for comp in comps:
        members = [names[i] for i in comp]
        members_sorted = sorted(members, key=lambda x: (faceset_tier(x), -sizes.get(x, 0), x))
        primary = members_sorted[0]
        secondaries = members_sorted[1:]
        # gather pairwise sims within group
        pair_sims = []
        idx_of = {names[i]: i for i in comp}
        for a in members:
            for b in members:
                if a >= b:
                    continue
                pair_sims.append({"a": a, "b": b, "sim": round(float(sim[idx_of[a], idx_of[b]]), 4)})
        # confidence: minimum within-group sim (the weakest link)
        min_link = min(p["sim"] for p in pair_sims)
        max_link = max(p["sim"] for p in pair_sims)
        confidence = "confident" if min_link >= confident_thr else "uncertain"
        groups_out.append({
            "primary": primary,
            "secondaries": secondaries,
            "members": members_sorted,
            "tiers": {n: faceset_tier(n) for n in members},
            "sizes": {n: sizes.get(n, 0) for n in members},
            "pair_sims": pair_sims,
            "min_link": round(min_link, 4),
            "max_link": round(max_link, 4),
            "confidence": confidence,
        })
    # sort: confident first, then by max_link desc
    groups_out.sort(key=lambda g: (0 if g["confidence"] == "confident" else 1, -g["max_link"]))

    out = {
        "thresholds": {"edge": edge_thr, "confident": confident_thr},
        "n_active": len(active),
        "n_centroided": len(centroids),
        "n_skipped": len(skipped),
        "skipped_reasons": [{"name": n, "used": u, "missing": m} for n, u, m in skipped],
        "n_groups": len(groups_out),
        "n_facesets_in_groups": sum(len(g["members"]) for g in groups_out),
        "groups": groups_out,
    }
    op = Path(args.out)
    op.parent.mkdir(parents=True, exist_ok=True)
    op.write_text(json.dumps(out, indent=2))
    confident = sum(1 for g in groups_out if g["confidence"] == "confident")
    uncertain = sum(1 for g in groups_out if g["confidence"] == "uncertain")
    print(f"[done] {len(groups_out)} groups ({confident} confident, {uncertain} uncertain) -> {op}", file=sys.stderr)


# ----------------------------- report -----------------------------

def cmd_report(args):
    candidates = json.loads(Path(args.candidates).read_text())
    out_dir = Path(args.out)
    thumbs_dir = out_dir / "thumbs"
    thumbs_dir.mkdir(parents=True, exist_ok=True)

    THUMB = 140
    THUMBS_PER_FACESET = 4

    def make_thumb(faceset: str, fname: str) -> str:
        d = thumbs_dir / faceset
        d.mkdir(parents=True, exist_ok=True)
        dst = d / (Path(fname).stem + ".jpg")
        if not dst.exists():
            try:
                src = ROOT / faceset / "faces" / fname
                img = Image.open(src).convert("RGB")
                img.thumbnail((THUMB, THUMB), Image.LANCZOS)
                img.save(dst, "JPEG", quality=82)
            except Exception as e:
                print(f"[thumb-skip] {faceset}/{fname}: {e}", file=sys.stderr)
                return ""
        return f"thumbs/{faceset}/{Path(fname).stem}.jpg"

    rows = []
    for gi, g in enumerate(candidates["groups"]):
        primary = g["primary"]
        sec = g["secondaries"]
        conf_cls = "confident" if g["confidence"] == "confident" else "uncertain"
        rows.append(f"<section class='grp {conf_cls}' id='g{gi}'>")
        rows.append(f"<h2>group #{gi+1} <small>({g['confidence']}; min_sim={g['min_link']:.3f}, max_sim={g['max_link']:.3f})</small></h2>")
        rows.append(f"<div class='plan'>merge <b>{', '.join(sec)}</b> → <b>{primary}</b></div>")
        # member rows
        for name in g["members"]:
            tier = g["tiers"][name]
            sz = g["sizes"][name]
            tier_label = ["hand-sorted", "auto", "osrc", "immich-peter", "immich-nic", "?"][min(tier, 5)]
            badge = "PRIMARY" if name == primary else "secondary"
            rows.append("<div class='member'>")
            rows.append(f"<div class='label'><span class='badge {badge.lower()}'>{badge}</span> "
                        f"<b>{name}</b> <small>tier={tier_label} · n={sz}</small></div>")
            rows.append("<div class='thumbs'>")
            faces_dir = ROOT / name / "faces"
            files = sorted(faces_dir.glob("*.png"))[:THUMBS_PER_FACESET]
            for f in files:
                rel = make_thumb(name, f.name)
                if rel:
                    rows.append(f"<img src='{rel}' loading='lazy' title='{f.name}'>")
            rows.append("</div></div>")
        # pairwise sims
        rows.append("<table class='sims'><tr><th>a</th><th>b</th><th>sim</th></tr>")
        for ps in sorted(g["pair_sims"], key=lambda x: -x["sim"]):
            cls = "hi" if ps["sim"] >= candidates["thresholds"]["confident"] else "mid"
            rows.append(f"<tr><td>{ps['a']}</td><td>{ps['b']}</td><td class='{cls}'>{ps['sim']:.3f}</td></tr>")
        rows.append("</table>")
        rows.append("</section>")

    nav = " · ".join(f"<a href='#g{i}'>#{i+1}</a>" for i in range(len(candidates["groups"])))

    html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Faceset merge review</title>
<style>
body {{ font-family: system-ui, sans-serif; background: #111; color: #eee; padding: 1em; }}
h1 {{ margin-top: 0; }}
h2 {{ margin: 0; }}
small {{ color: #999; font-weight: normal; }}
section.grp {{ background: #1a1a1a; border-radius: 6px; padding: 12px; margin: 12px 0; }}
section.grp.confident {{ border-left: 4px solid #5fa05f; }}
section.grp.uncertain {{ border-left: 4px solid #ffb050; }}
.plan {{ margin: .5em 0; color: #6cf; }}
.member {{ margin: 8px 0; padding: 6px; background: #222; border-radius: 4px; }}
.label {{ font-family: monospace; font-size: 13px; }}
.badge {{ display: inline-block; padding: 0 6px; font-size: 10px; border-radius: 2px; }}
.badge.primary {{ background: #5fa05f; color: #000; font-weight: bold; }}
.badge.secondary {{ background: #444; color: #ccc; }}
.thumbs {{ display: flex; gap: 4px; margin-top: 4px; flex-wrap: wrap; }}
.thumbs img {{ height: 140px; width: auto; border-radius: 3px; }}
table.sims {{ font-family: monospace; font-size: 11px; margin-top: 6px; border-collapse: collapse; }}
table.sims td, table.sims th {{ padding: 1px 8px; border: 1px solid #333; text-align: left; }}
table.sims td.hi {{ color: #5fa05f; font-weight: bold; }}
table.sims td.mid {{ color: #ffb050; }}
.nav {{ position: sticky; top: 0; background: #111; padding: .5em 0; border-bottom: 1px solid #333; font-size: 12px; }}
a {{ color: #6cf; }}
</style></head>
<body>
<h1>Merge review — {len(candidates['groups'])} candidate groups
<small>(edge>={candidates['thresholds']['edge']}, confident>={candidates['thresholds']['confident']})</small></h1>
<p>{candidates['n_centroided']} of {candidates['n_active']} active facesets centroided
(skipped {candidates['n_skipped']} for too few cached embeddings).
Green = confident (min within-group sim >= {candidates['thresholds']['confident']}); orange = uncertain.</p>
<div class='nav'>{nav}</div>
{''.join(rows)}
</body></html>"""

    out_html = out_dir / "index.html"
    out_html.write_text(html)
    print(f"[done] {out_html}", file=sys.stderr)


# ----------------------------- apply -----------------------------

def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
    import zipfile
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
        for i, p in enumerate(pngs):
            zf.write(p, arcname=f"{i:04d}.png")


def cmd_apply(args):
    candidates = json.loads(Path(args.candidates).read_text())
    master_path = ROOT / "manifest.json"
    master = json.loads(master_path.read_text())
    by_name = {f["name"]: f for f in master.get("facesets", [])}

    # filter: skip "uncertain" groups unless --include-uncertain
    accepted = [g for g in candidates["groups"]
                if g["confidence"] == "confident" or args.include_uncertain]
    skipped_unc = [g for g in candidates["groups"]
                   if g["confidence"] == "uncertain" and not args.include_uncertain]
    # explicit --exclude / --only filters (group indices in the candidates file)
    if args.only:
        only = {int(s) for s in args.only.split(",")}
        accepted = [g for i, g in enumerate(candidates["groups"]) if i in only]
    if args.exclude:
        excl = {int(s) for s in args.exclude.split(",")}
        # index against the candidates file, not the already-filtered list,
        # so --exclude numbers match what analyze/report show
        accepted = [g for i, g in enumerate(candidates["groups"])
                    if g in accepted and i not in excl]

    print(f"[plan] {len(accepted)} groups will be merged "
          f"({len(skipped_unc)} uncertain skipped)", file=sys.stderr)

    if args.dry_run:
        for g in accepted:
            print(f"  merge {g['secondaries']} -> {g['primary']} "
                  f"({g['confidence']}, min_sim={g['min_link']:.3f})")
        return

    merged_dir = ROOT / "_merged"
    merged_dir.mkdir(exist_ok=True)
    new_facesets: list[dict] = []
    new_merged: list[dict] = list(master.get("merged", []))
    consumed_names: set[str] = set()
    primary_updates: dict[str, dict] = {}  # name -> new entry
    primary_absorbed: dict[str, list[dict]] = {}  # primary_name -> [secondary entries]

    for g in accepted:
        primary = g["primary"]
        if primary not in by_name:
            print(f"[warn] primary {primary} not in master; skipping group", file=sys.stderr)
            continue
        primary_dir = ROOT / primary
        if not primary_dir.is_dir():
            print(f"[warn] primary dir {primary_dir} missing; skipping group", file=sys.stderr)
            continue
        primary_faces = primary_dir / "faces"
        primary_manifest_path = primary_dir / "manifest.json"
        primary_manifest = json.loads(primary_manifest_path.read_text())

        # gather all face entries: primary + each secondary
        combined_faces: list[dict] = list(primary_manifest.get("faces", []))
        # tag provenance so the renumbering below can locate each source PNG
        for f in combined_faces:
            f.setdefault("origin_faceset", primary)

        for sec in g["secondaries"]:
            sec_dir = ROOT / sec
            if not sec_dir.is_dir():
                print(f"[warn] secondary {sec} missing; skipping", file=sys.stderr)
                continue
            sec_manifest_path = sec_dir / "manifest.json"
            sec_manifest = json.loads(sec_manifest_path.read_text()) if sec_manifest_path.exists() else {"faces": []}
            for f in sec_manifest.get("faces", []):
                f = dict(f)
                f["origin_faceset"] = sec
                combined_faces.append(f)

        # rank by quality.composite descending; ties broken by lower cosd_centroid
        def sort_key(f):
            q = f.get("quality", {}).get("composite", 0)
            d = f.get("cosd_centroid", 1.0)
            return (-q, d)
        combined_faces.sort(key=sort_key)

        # renumber and stage PNGs into a fresh staging dir, then atomically swap
        staging = primary_dir / "_faces_new"
        if staging.exists():
            shutil.rmtree(staging)
        staging.mkdir()
        new_face_entries = []
        for new_rank, f in enumerate(combined_faces, start=1):
            origin = f.pop("origin_faceset")
            old_png_rel = f["png"]  # e.g. "faces/0042.png"
            old_png_name = Path(old_png_rel).name
            origin_png = ROOT / origin / "faces" / old_png_name
            if not origin_png.exists():
                # could be in _dropped if occlusion-pruned; skip
                continue
            new_name = f"{new_rank:04d}.png"
            shutil.copy2(origin_png, staging / new_name)
            f = dict(f)
            f["rank"] = new_rank
            f["png"] = f"faces/{new_name}"
            f["origin_faceset"] = origin  # preserve provenance in manifest
            new_face_entries.append(f)

        # swap directories: primary/faces -> primary/_faces_old, staging -> primary/faces
        old_faces_holding = primary_dir / "_faces_old"
        if old_faces_holding.exists():
            shutil.rmtree(old_faces_holding)
        if primary_faces.exists():
            primary_faces.rename(old_faces_holding)
        staging.rename(primary_faces)
        # migrate _dropped/ from old holding (so occlusion-pruned PNGs remain accessible)
        old_dropped = old_faces_holding / "_dropped"
        if old_dropped.exists():
            (primary_faces / "_dropped").mkdir(exist_ok=True)
            for x in old_dropped.iterdir():
                shutil.move(str(x), str(primary_faces / "_dropped" / x.name))
        if old_faces_holding.exists():  # faces/ may not have existed above
            shutil.rmtree(old_faces_holding)

        # re-zip .fsz
        survivor_pngs = sorted(primary_faces.glob("*.png"))
        top_n = primary_manifest.get("top_n", 30)
        top_n_eff = min(top_n, len(survivor_pngs))
        # remove old .fsz files
        for old in primary_dir.glob("*.fsz"):
            old.unlink()
        top_fsz_name = f"{primary}_top{top_n_eff}.fsz"
        all_fsz_name = f"{primary}_all.fsz"
        _zip_png_list(survivor_pngs[:top_n_eff], primary_dir / top_fsz_name)
        if len(survivor_pngs) > top_n_eff:
            _zip_png_list(survivor_pngs, primary_dir / all_fsz_name)
            all_fsz_used = all_fsz_name
        else:
            all_fsz_used = None

        # update primary's local manifest
        primary_manifest["faces"] = new_face_entries
        primary_manifest["exported"] = len(new_face_entries)
        primary_manifest["fsz_top"] = top_fsz_name
        primary_manifest["fsz_all"] = all_fsz_used
        primary_manifest["top_n"] = top_n_eff
        primary_manifest.setdefault("merge_history", []).append({
            "absorbed": g["secondaries"],
            "min_link": g["min_link"],
            "max_link": g["max_link"],
            "confidence": g["confidence"],
        })
        primary_manifest_path.write_text(json.dumps(primary_manifest, indent=2))

        # move secondary directories into _merged/
        absorbed_master_entries: list[dict] = []
        for sec in g["secondaries"]:
            sec_dir = ROOT / sec
            target = merged_dir / sec
            if not sec_dir.is_dir():
                continue
            if target.exists():
                shutil.rmtree(sec_dir)  # already moved by a previous run; clean stub
            else:
                shutil.move(str(sec_dir), str(target))
            sec_master = dict(by_name.get(sec, {"name": sec}))
            sec_master["merged_into"] = primary
            sec_master["relpath"] = f"_merged/{sec}"
            sec_master["fsz_top"] = None
            sec_master["fsz_all"] = None
            absorbed_master_entries.append(sec_master)
            consumed_names.add(sec)

        new_merged.extend(absorbed_master_entries)

        # bump primary master entry
        prim_master = dict(by_name[primary])
        prim_master["exported"] = len(new_face_entries)
        prim_master["top_n"] = top_n_eff
        prim_master["fsz_top"] = top_fsz_name
        prim_master["fsz_all"] = all_fsz_used
        prim_master.setdefault("merge_history", []).append({
            "absorbed": g["secondaries"],
            "min_link": g["min_link"],
            "max_link": g["max_link"],
        })
        primary_updates[primary] = prim_master

        print(f"[merged] {g['secondaries']} -> {primary} "
              f"now {len(new_face_entries)} png", file=sys.stderr)

    # rebuild master facesets list
    for entry in master.get("facesets", []):
        nm = entry["name"]
        if nm in consumed_names:
            continue
        if nm in primary_updates:
            new_facesets.append(primary_updates[nm])
        else:
            new_facesets.append(entry)

    new_master = dict(master)
    new_master["facesets"] = new_facesets
    new_master["merged"] = new_merged
    new_master["merge_run"] = {
        "thresholds": candidates["thresholds"],
        "groups_applied": len(accepted),
        "facesets_consumed": len(consumed_names),
        "include_uncertain": bool(args.include_uncertain),
    }
    tmp = master_path.with_suffix(".tmp.json")
    tmp.write_text(json.dumps(new_master, indent=2))
    tmp.replace(master_path)
    print(f"[done] master manifest updated: {len(new_facesets)} active, "
          f"{len(new_merged)} merged, {len(consumed_names)} consumed in this run",
          file=sys.stderr)


# ----------------------------- main -----------------------------

def main():
    ap = argparse.ArgumentParser()
    sub = ap.add_subparsers(dest="cmd", required=True)

    a = sub.add_parser("analyze")
    a.add_argument("--out", required=True)
    a.add_argument("--edge", type=float, default=0.45, help="min cosine sim to draw an edge (default 0.45)")
    a.add_argument("--confident", type=float, default=0.65, help="min within-group sim to be confident (default 0.65)")
    a.set_defaults(func=cmd_analyze)

    r = sub.add_parser("report")
    r.add_argument("--candidates", required=True)
    r.add_argument("--out", required=True)
    r.set_defaults(func=cmd_report)

    p = sub.add_parser("apply")
    p.add_argument("--candidates", required=True)
    p.add_argument("--include-uncertain", action="store_true",
                   help="apply uncertain groups too (default: confident only)")
    p.add_argument("--only", default=None, help="comma-separated group indices to apply")
    p.add_argument("--exclude", default=None, help="comma-separated group indices to skip")
    p.add_argument("--dry-run", action="store_true")
    p.set_defaults(func=cmd_apply)

    args = ap.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()
@@ -0,0 +1,594 @@
"""Corpus-wide dedup + roop-unleashed optimization.
|
||||
|
||||
Two passes:
|
||||
1. Cross-family byte-identical PNG dedup (same SHA256 in two different identity
|
||||
families) — keep the higher-tier family copy. Era splits of the same parent
|
||||
identity (faceset_NNN_*) are intentional duplications and are NOT deduped
|
||||
within their family.
|
||||
2. Within-faceset near-duplicate dedup using cached arcface embeddings
|
||||
(cosine sim >= 0.95). Keep highest quality.composite, drop the rest.
|
||||
|
||||
Plus a Windows-DML multi-face audit (separate phase via clip_worker-style split):
|
||||
3. Re-detect each PNG with insightface; flag any with 0 or >1 detected faces.
|
||||
The roop loader appends every detected face per PNG, so multi-face crops
|
||||
pollute identity averaging.
|
||||
|
||||
All flagged PNGs are MOVED to <faceset>/faces/_dropped/ (reversible). Affected
|
||||
.fsz files are re-zipped, manifests updated.
|
||||
|
||||
CLI:
|
||||
analyze --out work/dedup_audit/dedup_plan.json
|
||||
apply --plan ... [--dry-run]
|
||||
stage_multiface --out work/dedup_audit/multiface_queue.json
|
||||
merge_multiface --results <worker_out> --out work/dedup_audit/multiface_plan.json
|
||||
apply_multiface --plan ... [--dry-run]
|
||||
report --dedup ... --multiface ... --out work/dedup_audit
|
||||
"""
|
||||

from __future__ import annotations

import argparse
import hashlib
import json
import re
import shutil
import sys
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import numpy as np

ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
WIN_ROOT = r"E:\temp_things\fcswp\nl_sorted\facesets_swap_ready"
CACHES = [
    Path("/opt/face-sets/work/cache/nl_full.npz"),
    Path("/opt/face-sets/work/cache/immich_peter.npz"),
    Path("/opt/face-sets/work/cache/immich_nic.npz"),
]

NEAR_DUP_THRESHOLD = 0.95
HASH_PARALLEL = 16


# ----------------------------- helpers -----------------------------

def faceset_tier(name: str) -> int:
    m = re.match(r"^faceset_0*(\d+)(?:_.+)?$", name)
    if not m:
        return 99
    n = int(m.group(1))
    if 13 <= n <= 19:
        return 0
    if 1 <= n <= 12:
        return 1
    if 20 <= n <= 25:
        return 2
    if 26 <= n <= 264:
        return 3
    if 265 <= n:
        return 4
    return 99


def faceset_family(name: str) -> str:
    """faceset_001_2010-13 → faceset_001; faceset_001 → faceset_001."""
    m = re.match(r"^(faceset_\d+)(?:_.+)?$", name)
    return m.group(1) if m else name


def wsl_to_win(p: str) -> str:
    s = str(p)
    if s.startswith("/mnt/"):
        return f"{s[5].upper()}:\\{s[7:].replace('/', chr(92))}"
    return s
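
# Illustrative check (doctest-style): /mnt/<drive>/ maps to <DRIVE>:\ and
# forward slashes become backslashes.
#   >>> wsl_to_win("/mnt/e/temp_things/x.png")
#   'E:\\temp_things\\x.png'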


def iter_active_facesets() -> list[Path]:
    out = []
    for d in sorted(ROOT.iterdir()):
        if d.is_dir() and not d.name.startswith("_"):
            out.append(d)
    return out


def sha256_file(p: Path) -> str:
    h = hashlib.sha256()
    with open(p, "rb") as f:
        while True:
            b = f.read(1 << 20)  # 1 MiB chunks keep memory flat on large PNGs
            if not b:
                break
            h.update(b)
    return h.hexdigest()


def load_caches():
    rec_index: dict[tuple[str, tuple[int, int, int, int]], np.ndarray] = {}
    alias_map: dict[str, str] = {}
    for c in CACHES:
        if not c.exists():
            continue
        d = np.load(c, allow_pickle=True)
        emb = d["embeddings"]
        meta = json.loads(str(d["meta"]))
        face_records = [m for m in meta if not m.get("noface")]
        if "path_aliases" in d.files:
            paliases = json.loads(str(d["path_aliases"]))
            for canon, alist in paliases.items():
                alias_map.setdefault(canon, canon)
                for a in alist:
                    alias_map[a] = canon
        for i, rec in enumerate(face_records):
            p = rec["path"]
            bbox = tuple(int(x) for x in rec["bbox"])
            v = emb[i].astype(np.float32)
            n = float(np.linalg.norm(v))
            if n > 0:
                v = v / n
            rec_index[(p, bbox)] = v
            alias_map.setdefault(p, p)
    return rec_index, alias_map


def lookup_emb(rec_index, alias_map, src: str, bbox):
    bbox_t = tuple(int(x) for x in bbox)
    canon = alias_map.get(src, src)
    v = rec_index.get((canon, bbox_t))
    if v is None and canon != src:
        v = rec_index.get((src, bbox_t))
    return v


# ----------------------------- analyze -----------------------------

def cmd_analyze(args):
    rec_index, alias_map = load_caches()
    facesets = iter_active_facesets()
    print(f"[scan] {len(facesets)} active facesets", file=sys.stderr)

    # Phase 1: walk every PNG, collect (faceset, file, src, bbox, quality, emb, sha256)
    all_pngs = []  # list of dicts
    t0 = time.time()
    for fs in facesets:
        manifest_path = fs / "manifest.json"
        if not manifest_path.exists():
            continue
        m = json.loads(manifest_path.read_text())
        for f in m.get("faces", []):
            png_rel = f.get("png")
            if not png_rel:
                continue
            disk_path = fs / png_rel
            if not disk_path.exists():
                continue
            all_pngs.append({
                "faceset": fs.name,
                "family": faceset_family(fs.name),
                "tier": faceset_tier(fs.name),
                "file": Path(png_rel).name,
                "rank": f.get("rank"),
                "source": f.get("source"),
                "bbox": f.get("bbox"),
                "quality": f.get("quality", {}).get("composite", 0),
                "disk_path": str(disk_path),
            })
    print(f"[scan] {len(all_pngs)} PNGs walked in {time.time()-t0:.1f}s", file=sys.stderr)

    # Phase 2: SHA256 hash each PNG (parallel I/O)
    t0 = time.time()

    def _hash_one(idx):
        all_pngs[idx]["sha256"] = sha256_file(Path(all_pngs[idx]["disk_path"]))

    with ThreadPoolExecutor(max_workers=HASH_PARALLEL) as ex:
        # exhaust the iterator to actually run
        for _ in ex.map(_hash_one, range(len(all_pngs)), chunksize=16):
            pass
    print(f"[hash] {len(all_pngs)} PNGs hashed in {time.time()-t0:.1f}s", file=sys.stderr)

    # Phase 3: cross-family byte-dedup
    by_sha: dict[str, list[int]] = {}
    for i, p in enumerate(all_pngs):
        by_sha.setdefault(p["sha256"], []).append(i)

    cross_family_groups = []
    byte_drops: set[int] = set()  # indices of PNGs to drop
    for sha, idxs in by_sha.items():
        if len(idxs) < 2:
            continue
        families = {all_pngs[i]["family"] for i in idxs}
        if len(families) < 2:
            continue  # all in same family — intentional era duplication
        # multiple families share this content → dedup keeping the best one
        cross_family_groups.append({"sha256": sha, "members": [
            {"faceset": all_pngs[i]["faceset"], "file": all_pngs[i]["file"],
             "tier": all_pngs[i]["tier"], "quality": all_pngs[i]["quality"],
             "rank": all_pngs[i]["rank"]} for i in idxs
        ]})
        # keeper rule: lowest tier number, then highest quality
        best = sorted(idxs, key=lambda i: (all_pngs[i]["tier"], -all_pngs[i]["quality"]))[0]
        for i in idxs:
            # NEVER drop within-family copies (preserve era duplication intentionally);
            # only drop indices whose family differs from the keeper's family
            if i != best and all_pngs[i]["family"] != all_pngs[best]["family"]:
                byte_drops.add(i)
    print(f"[byte] {len(cross_family_groups)} cross-family hash groups; "
          f"{len(byte_drops)} PNGs marked for byte-dedup drop", file=sys.stderr)

    # Phase 4: within-faceset near-dup (embedding sim >= threshold)
    by_faceset: dict[str, list[int]] = {}
    for i, p in enumerate(all_pngs):
        by_faceset.setdefault(p["faceset"], []).append(i)

    near_dup_groups = []
    near_drops: set[int] = set()
    miss_emb_total = 0
    t0 = time.time()
    for fs_name, idxs in by_faceset.items():
        if len(idxs) < 2:
            continue
        # gather embeddings
        embs = []
        kept_idxs = []
        for i in idxs:
            v = lookup_emb(rec_index, alias_map, all_pngs[i]["source"], all_pngs[i]["bbox"])
            if v is None:
                miss_emb_total += 1
                continue
            embs.append(v)
            kept_idxs.append(i)
        if len(kept_idxs) < 2:
            continue
        M = np.stack(embs).astype(np.float32)
        sim = M @ M.T
        np.fill_diagonal(sim, -1)  # ignore self
        # find connected components in the (sim >= threshold) graph
        adj = {k: set() for k in range(len(kept_idxs))}
        for a in range(len(kept_idxs)):
            # only check a < b to avoid double work
            hi = np.where(sim[a, a+1:] >= NEAR_DUP_THRESHOLD)[0]
            for off in hi:
                b = a + 1 + int(off)
                adj[a].add(b)
                adj[b].add(a)
        seen = set()
        for k in adj:
            if k in seen or not adj[k]:
                continue
            stack = [k]
            comp = []
            while stack:
                x = stack.pop()
                if x in seen:
                    continue
                seen.add(x)
                comp.append(x)
                for y in adj[x]:
                    if y not in seen:
                        stack.append(y)
            if len(comp) < 2:
                continue
            comp_idxs = [kept_idxs[c] for c in comp]
            # keeper: highest quality.composite, tie-break: lowest rank
            best = sorted(comp_idxs, key=lambda i: (-all_pngs[i]["quality"], all_pngs[i]["rank"] or 9999))[0]
            sims_in_group = []
            for ci in range(len(comp)):
                for cj in range(ci+1, len(comp)):
                    sims_in_group.append(float(sim[comp[ci], comp[cj]]))
            near_dup_groups.append({
                "faceset": fs_name,
                "members": [{"file": all_pngs[i]["file"], "rank": all_pngs[i]["rank"],
                             "quality": all_pngs[i]["quality"]} for i in comp_idxs],
                "keeper": all_pngs[best]["file"],
                "min_sim": min(sims_in_group) if sims_in_group else None,
                "max_sim": max(sims_in_group) if sims_in_group else None,
            })
            for i in comp_idxs:
                if i != best:
                    near_drops.add(i)
    print(f"[near] {len(near_dup_groups)} near-dup groups; "
          f"{len(near_drops)} PNGs marked for near-dup drop "
          f"(miss_emb={miss_emb_total}); {time.time()-t0:.1f}s", file=sys.stderr)

    # Combined drop set; for output, group by faceset
    all_drops = byte_drops | near_drops
    drops_by_faceset: dict[str, list] = {}
    for i in all_drops:
        p = all_pngs[i]
        reason = []
        if i in byte_drops:
            reason.append("byte_dup")
        if i in near_drops:
            reason.append("near_dup")
        drops_by_faceset.setdefault(p["faceset"], []).append({
            "file": p["file"], "rank": p["rank"], "reason": "+".join(reason),
            "sha256": p["sha256"], "quality": p["quality"],
        })

    plan = {
        "thresholds": {"near_dup_sim": NEAR_DUP_THRESHOLD},
        "totals": {
            "active_facesets": len(facesets),
            "active_pngs": len(all_pngs),
            "byte_dup_groups": len(cross_family_groups),
            "byte_dup_drops": len(byte_drops),
            "near_dup_groups": len(near_dup_groups),
            "near_dup_drops": len(near_drops),
            "all_drops": len(all_drops),
            "facesets_affected": len(drops_by_faceset),
        },
        "byte_dup_groups": cross_family_groups,
        "near_dup_groups": near_dup_groups,
        "drops_by_faceset": drops_by_faceset,
    }
    op = Path(args.out)
    op.parent.mkdir(parents=True, exist_ok=True)
    op.write_text(json.dumps(plan, indent=2))
    print(f"[done] plan -> {op}", file=sys.stderr)


# ----------------------------- apply -----------------------------

def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
    import zipfile
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
        for i, p in enumerate(pngs):
            zf.write(p, arcname=f"{i:04d}.png")


def _apply_drops_to_facesets(drops_by_faceset: dict[str, list], reason_label: str, master_path: Path):
    """Move flagged PNGs to <faceset>/faces/_dropped/, rebuild manifests + .fsz.
    drops_by_faceset values are lists of {"file": str, ...}.
    Returns total moved + counts per faceset."""
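    # Expected input shape (matches what cmd_analyze and cmd_merge_multiface
    # emit; the values here are illustrative):
    #   {"faceset_003": [{"file": "0042.png", "reason": "near_dup", ...}], ...}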
    master = json.loads(master_path.read_text())
    by_name = {f["name"]: f for f in master.get("facesets", [])}
    total_moved = 0
    per_faceset_counts = {}

    for fs_name, drops in drops_by_faceset.items():
        fs_dir = ROOT / fs_name
        if not fs_dir.is_dir():
            print(f"[warn] {fs_name}: dir missing, skip", file=sys.stderr)
            continue
        faces_dir = fs_dir / "faces"
        dropped_dir = faces_dir / "_dropped"
        dropped_dir.mkdir(exist_ok=True)
        drop_files = {d["file"] for d in drops}

        moved_here = 0
        for fname in sorted(drop_files):
            src = faces_dir / fname
            if not src.exists():
                continue
            shutil.move(str(src), str(dropped_dir / fname))
            moved_here += 1

        # rebuild manifest by filtering out dropped files
        manifest_path = fs_dir / "manifest.json"
        if manifest_path.exists():
            mm = json.loads(manifest_path.read_text())
            new_faces = [f for f in mm.get("faces", []) if Path(f.get("png", "")).name not in drop_files]
            mm["faces"] = new_faces
            mm["exported"] = len(new_faces)
            mm.setdefault(f"{reason_label}_history", []).append({"dropped": moved_here})

            # re-zip
            survivor_pngs = sorted(faces_dir.glob("*.png"))
            top_n = mm.get("top_n", 30)
            top_n_eff = min(top_n, len(survivor_pngs))
            for old in fs_dir.glob("*.fsz"):
                old.unlink()
            top_fsz_name = f"{fs_name}_top{top_n_eff}.fsz"
            all_fsz_name = f"{fs_name}_all.fsz"
            if top_n_eff > 0:
                _zip_png_list(survivor_pngs[:top_n_eff], fs_dir / top_fsz_name)
                mm["fsz_top"] = top_fsz_name
                mm["top_n"] = top_n_eff
            else:
                mm["fsz_top"] = None
                mm["top_n"] = 0
            if len(survivor_pngs) > top_n_eff:
                _zip_png_list(survivor_pngs, fs_dir / all_fsz_name)
                mm["fsz_all"] = all_fsz_name
            else:
                mm["fsz_all"] = None
            manifest_path.write_text(json.dumps(mm, indent=2))

            if fs_name in by_name:
                by_name[fs_name]["exported"] = len(new_faces)
                by_name[fs_name]["fsz_top"] = mm["fsz_top"]
                by_name[fs_name]["fsz_all"] = mm["fsz_all"]
                by_name[fs_name]["top_n"] = mm["top_n"]
                by_name[fs_name].setdefault(f"{reason_label}_dropped", 0)
                by_name[fs_name][f"{reason_label}_dropped"] += moved_here

        total_moved += moved_here
        per_faceset_counts[fs_name] = moved_here

    # rewrite master with same ordering
    new_facesets = [by_name.get(e["name"], e) for e in master.get("facesets", [])]
    master["facesets"] = new_facesets
    master.setdefault(f"{reason_label}_runs", []).append({
        "facesets_affected": len(per_faceset_counts),
        "pngs_moved": total_moved,
    })
    tmp = master_path.with_suffix(".tmp.json")
    tmp.write_text(json.dumps(master, indent=2))
    tmp.replace(master_path)
    return total_moved, per_faceset_counts


def cmd_apply(args):
    plan = json.loads(Path(args.plan).read_text())
    drops = plan["drops_by_faceset"]
    if args.dry_run:
        for fs, items in sorted(drops.items()):
            reasons = {}
            for it in items:
                reasons[it["reason"]] = reasons.get(it["reason"], 0) + 1
            print(f"  {fs}: {len(items)} dropped ({reasons})")
        print(f"=== total: {sum(len(v) for v in drops.values())} PNGs across {len(drops)} facesets ===")
        return
    master_path = ROOT / "manifest.json"
    total, _ = _apply_drops_to_facesets(drops, "dedup", master_path)
    print(f"[done] {total} PNGs moved to faces/_dropped/ across {len(drops)} facesets", file=sys.stderr)


# ----------------------------- multiface staging + apply -----------------------------

def cmd_stage_multiface(args):
    """Build queue.json of all currently-active PNGs in the corpus
    for the Windows DML multi-face audit worker."""
    queue = []
    for fs in iter_active_facesets():
        faces_dir = fs / "faces"
        if not faces_dir.is_dir():
            continue
        for p in sorted(faces_dir.glob("*.png")):
            queue.append({
                "wsl_path": str(p),
                "win_path": wsl_to_win(str(p)),
                "faceset": fs.name,
                "file": p.name,
            })
    op = Path(args.out)
    op.parent.mkdir(parents=True, exist_ok=True)
    op.write_text(json.dumps(queue, indent=2))
    print(f"[stage] {len(queue)} PNGs -> {op}", file=sys.stderr)


def cmd_merge_multiface(args):
    """Convert worker results.json into a drops_by_faceset plan."""
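    # Assumed worker output shape (inferred from the fields read below):
    #   {"results": [{"faceset": "faceset_001", "file": "0042.png",
    #                 "face_count": 2}, ...]}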
    src = json.loads(Path(args.results).read_text())
    drops_by_faceset: dict[str, list] = {}
    bad_count = 0
    for r in src.get("results", []):
        n_faces = r.get("face_count", -1)
        if n_faces == 1:
            continue
        bad_count += 1
        drops_by_faceset.setdefault(r["faceset"], []).append({
            "file": r["file"],
            "reason": f"multiface_{n_faces}",
            "face_count": n_faces,
        })
    plan = {
        "totals": {"bad_pngs": bad_count, "facesets_affected": len(drops_by_faceset),
                   "scored": len(src.get("results", []))},
        "drops_by_faceset": drops_by_faceset,
    }
    op = Path(args.out)
    op.parent.mkdir(parents=True, exist_ok=True)
    op.write_text(json.dumps(plan, indent=2))
    print(f"[merge] {bad_count} bad PNGs across {len(drops_by_faceset)} facesets -> {op}", file=sys.stderr)


def cmd_apply_multiface(args):
    plan = json.loads(Path(args.plan).read_text())
    drops = plan["drops_by_faceset"]
    if args.dry_run:
        for fs, items in sorted(drops.items()):
            print(f"  {fs}: {len(items)} bad PNG(s)")
        print(f"=== total: {sum(len(v) for v in drops.values())} ===")
        return
    master_path = ROOT / "manifest.json"
    total, _ = _apply_drops_to_facesets(drops, "multiface", master_path)
    print(f"[done] {total} PNGs moved to faces/_dropped/ across {len(drops)} facesets", file=sys.stderr)
# ----------------------------- report -----------------------------
|
||||
|
||||
def cmd_report(args):
|
||||
out_dir = Path(args.out)
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
sections = []
|
||||
if args.dedup:
|
||||
d = json.loads(Path(args.dedup).read_text())
|
||||
t = d["totals"]
|
||||
sections.append(f"<h2>Dedup</h2>")
|
||||
sections.append(
|
||||
f"<ul>"
|
||||
f"<li>Active facesets: {t['active_facesets']}, active PNGs: {t['active_pngs']}</li>"
|
||||
f"<li>Cross-family byte-dup groups: {t['byte_dup_groups']} → {t['byte_dup_drops']} PNGs dropped</li>"
|
||||
f"<li>Within-faceset near-dup groups (sim≥{d['thresholds']['near_dup_sim']}): {t['near_dup_groups']} → {t['near_dup_drops']} PNGs dropped</li>"
|
||||
f"<li><b>Total dedup drops: {t['all_drops']}</b> across {t['facesets_affected']} facesets</li>"
|
||||
f"</ul>"
|
||||
)
|
||||
# top-N affected facesets
|
||||
rows = sorted(d["drops_by_faceset"].items(), key=lambda x: -len(x[1]))[:25]
|
||||
sections.append("<h3>Top 25 most-affected facesets</h3><table><tr><th>faceset</th><th>dropped</th><th>reasons</th></tr>")
|
||||
for fs, items in rows:
|
||||
r = {}
|
||||
for it in items:
|
||||
r[it["reason"]] = r.get(it["reason"], 0) + 1
|
||||
sections.append(f"<tr><td>{fs}</td><td>{len(items)}</td><td>{r}</td></tr>")
|
||||
sections.append("</table>")
|
||||
|
||||
if args.multiface:
|
||||
m = json.loads(Path(args.multiface).read_text())
|
||||
t = m["totals"]
|
||||
sections.append("<h2>Multi-face audit</h2>")
|
||||
sections.append(
|
||||
f"<ul>"
|
||||
f"<li>PNGs scored: {t['scored']}</li>"
|
||||
f"<li>Bad PNGs (0 or >1 face): {t['bad_pngs']} across {t['facesets_affected']} facesets</li>"
|
||||
f"</ul>"
|
||||
)
|
||||
|
||||
html = f"""<!doctype html>
|
||||
<html><head><meta charset='utf-8'><title>Dedup + multi-face audit</title>
|
||||
<style>
|
||||
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
|
||||
h1, h2, h3 {{ margin-top:1em; }}
|
||||
table {{ border-collapse: collapse; font-family: monospace; font-size: 12px; }}
|
||||
table td, table th {{ padding: 2px 8px; border: 1px solid #333; }}
|
||||
ul li {{ margin: 4px 0; }}
|
||||
</style></head>
|
||||
<body>
|
||||
<h1>facesets_swap_ready dedup + roop optimization audit</h1>
|
||||
{''.join(sections)}
|
||||
</body></html>"""
|
||||
out_html = out_dir / "index.html"
|
||||
out_html.write_text(html)
|
||||
print(f"[done] {out_html}", file=sys.stderr)
|
||||


# ----------------------------- main -----------------------------


def main():
    ap = argparse.ArgumentParser()
    sub = ap.add_subparsers(dest="cmd", required=True)

    a = sub.add_parser("analyze")
    a.add_argument("--out", required=True)
    a.set_defaults(func=cmd_analyze)

    p = sub.add_parser("apply")
    p.add_argument("--plan", required=True)
    p.add_argument("--dry-run", action="store_true")
    p.set_defaults(func=cmd_apply)

    sm = sub.add_parser("stage_multiface")
    sm.add_argument("--out", required=True)
    sm.set_defaults(func=cmd_stage_multiface)

    mm = sub.add_parser("merge_multiface")
    mm.add_argument("--results", required=True)
    mm.add_argument("--out", required=True)
    mm.set_defaults(func=cmd_merge_multiface)

    am = sub.add_parser("apply_multiface")
    am.add_argument("--plan", required=True)
    am.add_argument("--dry-run", action="store_true")
    am.set_defaults(func=cmd_apply_multiface)

    r = sub.add_parser("report")
    r.add_argument("--dedup", default=None)
    r.add_argument("--multiface", default=None)
    r.add_argument("--out", required=True)
    r.set_defaults(func=cmd_report)

    args = ap.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()
Executable
+244
@@ -0,0 +1,244 @@
"""Windows / DirectML embed worker.

Reads a queue.json staged by /opt/face-sets/work/immich_stage.py (WSL side),
runs InsightFace's FaceAnalysis on each image with the DmlExecutionProvider
backed by the AMD Vega, and writes a cache file in the schema produced by
sort_faces.py:cmd_embed (so it can be merged into nl_full.npz).

CLI:
    py -3.12 embed_worker.py <queue.json> <out_cache.npz> [--limit N]

The queue.json entry shape (each item) is:
    {
      "asset_id": "...",
      "sha256": "...",
      "wsl_path": "/mnt/x/src/immich/<user>/<rel>",   # canonical path stored
      "win_path": "X:\\src\\immich\\<user>\\<rel>",   # what we read from
      "size_bytes": int,
      "width": int, "height": int,
      ...
    }

Per face record matches cmd_embed's schema:
    path, face_idx, det_score, bbox, face_short, face_area, blur, noface=False, hash
plus landmark_2d_106, landmark_3d_68, pose (FaceAnalysis returns these for
free and the existing cache already carries them after `enrich`).
"""

from __future__ import annotations

import argparse
import json
import os
import sys
import time
from pathlib import Path

import numpy as np
from PIL import Image, ImageOps
from insightface.app import FaceAnalysis

MODEL_ROOT = r"C:\face_embed_venv\models"
MIN_DET_SCORE = 0.5
MIN_FACE_PIX = 40
FLUSH_EVERY = 50


def load_rgb_bgr(path: Path):
    try:
        with Image.open(path) as im:
            im = ImageOps.exif_transpose(im)
            im = im.convert("RGB")
            rgb = np.array(im)
            bgr = rgb[:, :, ::-1].copy()
            return rgb, bgr
    except Exception as e:
        print(f"[warn] failed to load {path}: {e}", file=sys.stderr)
        return None, None


def laplacian_variance(gray: np.ndarray) -> float:
    g = gray.astype(np.float32)
    lap = (
        -4.0 * g[1:-1, 1:-1]
        + g[:-2, 1:-1] + g[2:, 1:-1]
        + g[1:-1, :-2] + g[1:-1, 2:]
    )
    return float(lap.var())
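

# The expression above is the variance of the 3x3 four-neighbour Laplacian
# over the image interior -- in spirit the same sharpness proxy as
# cv2.Laplacian(gray, cv2.CV_32F).var() with the default kernel, minus the
# border rows/cols, which spares the Windows side an OpenCV install.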


def save_cache(out_path: Path, emb_chunks: list, meta: list, processed: set, src_root: str):
    emb = np.concatenate(emb_chunks) if emb_chunks else np.zeros((0, 512), dtype=np.float32)
    tmp = out_path.with_suffix(".tmp.npz")
    np.savez(
        str(tmp),
        embeddings=emb,
        meta=json.dumps(meta),
        src_root=str(src_root),
        processed_paths=json.dumps(sorted(processed)),
        path_aliases=json.dumps({}),
        schema="v2",
    )
    os.replace(tmp, out_path)
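

# Crash-safety note: save_cache writes a sibling *.tmp.npz and then
# os.replace()s it over the real cache, so a resumed run never sees a
# half-written .npz; os.replace is atomic on POSIX and effectively atomic
# on NTFS as well, which is what makes the periodic mid-run flush safe.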


def load_cache_if_exists(out_path: Path):
    """Resume helper. Returns (emb_chunks, meta, processed_set)."""
    if not out_path.exists():
        return [], [], set()
    data = np.load(out_path, allow_pickle=True)
    emb = data["embeddings"]
    meta = json.loads(str(data["meta"]))
    processed = set(json.loads(str(data["processed_paths"])))
    return [emb] if len(emb) else [], list(meta), processed


def main():
    p = argparse.ArgumentParser()
    p.add_argument("queue", type=Path)
    p.add_argument("out", type=Path)
    p.add_argument("--limit", type=int, default=None)
    args = p.parse_args()

    queue = json.loads(args.queue.read_text())
    print(f"queue: {len(queue)} entries from {args.queue}")

    args.out.parent.mkdir(parents=True, exist_ok=True)
    emb_chunks, meta, processed = load_cache_if_exists(args.out)
    n_existing_records = len(meta)
    n_existing_emb = sum(e.shape[0] for e in emb_chunks)
    if n_existing_records:
        print(f"resume: {n_existing_records} existing meta records "
              f"({n_existing_emb} embeddings, {len(processed)} processed paths)")

    print("initializing FaceAnalysis with DmlExecutionProvider")
    app = FaceAnalysis(
        name="buffalo_l",
        root=MODEL_ROOT,
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],
    )
    app.prepare(ctx_id=0, det_size=(640, 640))

    src_root = "/mnt/x/src/immich"

    n_done = 0
    n_face_records_added = 0
    n_noface_added = 0
    n_skipped = 0
    n_load_err = 0
    t0 = time.perf_counter()
    last_flush = time.perf_counter()
    new_emb_chunks: list[np.ndarray] = []
    new_meta: list[dict] = []

    def flush():
        nonlocal new_emb_chunks, new_meta, last_flush
        if not new_emb_chunks and not new_meta:
            return
        if new_emb_chunks:
            emb_chunks.append(np.concatenate(new_emb_chunks))
            new_emb_chunks = []
        meta.extend(new_meta)
        new_meta = []
        save_cache(args.out, emb_chunks, meta, processed, src_root)
        last_flush = time.perf_counter()
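
    # Buffering note: per-face records land in new_emb_chunks / new_meta and
    # are folded into the long-lived lists only inside flush(), so the full
    # np.concatenate + cache rewrite happens once per FLUSH_EVERY images (or
    # per 30 s), never per detected face.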
    for entry in queue:
        if args.limit is not None and n_done >= args.limit:
            break
        wsl_path = entry["wsl_path"]
        win_path = entry["win_path"]
        sha = entry["sha256"]

        if wsl_path in processed:
            n_skipped += 1
            continue

        rgb, bgr = load_rgb_bgr(Path(win_path))
        if bgr is None:
            new_meta.append({
                "path": wsl_path, "face_idx": -1, "noface": True,
                "hash": sha, "error": "load",
            })
            processed.add(wsl_path)
            n_load_err += 1
            n_done += 1
            continue

        faces = app.get(bgr)
        kept_any = False
        for j, f in enumerate(faces):
            if float(f.det_score) < MIN_DET_SCORE:
                continue
            x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
            x1 = max(x1, 0); y1 = max(y1, 0)
            x2 = min(x2, rgb.shape[1]); y2 = min(y2, rgb.shape[0])
            w, h = x2 - x1, y2 - y1
            short = min(w, h)
            if short < MIN_FACE_PIX:
                continue
            crop = rgb[y1:y2, x1:x2]
            if crop.size == 0:
                continue
            gray = crop.mean(axis=2)
            blur = laplacian_variance(gray) if min(gray.shape) > 3 else 0.0

            emb = f.normed_embedding.astype(np.float32)
            new_emb_chunks.append(emb[None, :])
            rec = {
                "path": wsl_path,
                "face_idx": j,
                "det_score": float(f.det_score),
                "bbox": [x1, y1, x2, y2],
                "face_short": int(short),
                "face_area": int(w * h),
                "blur": blur,
                "noface": False,
                "hash": sha,
            }
            # Enrichment-equivalent fields (FaceAnalysis returns these for free)
            if hasattr(f, "landmark_2d_106") and f.landmark_2d_106 is not None:
                rec["landmark_2d_106"] = f.landmark_2d_106.astype(np.float32).tolist()
            if hasattr(f, "landmark_3d_68") and f.landmark_3d_68 is not None:
                rec["landmark_3d_68"] = f.landmark_3d_68.astype(np.float32).tolist()
            if hasattr(f, "pose") and f.pose is not None:
                rec["pose"] = [float(x) for x in f.pose]
            new_meta.append(rec)
            kept_any = True
            n_face_records_added += 1
        if not kept_any:
            new_meta.append({
                "path": wsl_path, "face_idx": -1, "noface": True, "hash": sha,
            })
            n_noface_added += 1

        processed.add(wsl_path)
        n_done += 1

        if (n_done % FLUSH_EVERY == 0) or (time.perf_counter() - last_flush) > 30.0:
            flush()
            elapsed = time.perf_counter() - t0
            rate = n_done / max(0.1, elapsed)
            print(
                f"[embed] done={n_done:5d}/{len(queue)} faces+={n_face_records_added:5d} "
                f"noface+={n_noface_added:4d} skipped={n_skipped:4d} "
                f"load_err={n_load_err:3d} rate={rate:.1f} img/s "
                f"({elapsed:.1f}s elapsed)"
            )

    flush()
    elapsed = time.perf_counter() - t0
    print()
    print("=== embed done ===")
    print(f" done: {n_done}")
    print(f" new face records: {n_face_records_added}")
    print(f" new noface records: {n_noface_added}")
    print(f" skipped (already done): {n_skipped}")
    print(f" load errors: {n_load_err}")
    print(f" elapsed: {elapsed:.1f}s ({n_done/max(0.1,elapsed):.1f} img/s)")
    print(f" cache: {args.out}")


if __name__ == "__main__":
    main()
@@ -0,0 +1,574 @@
"""CLIP zero-shot scoring for masks + sunglasses across facesets_swap_ready/.

Usage:
    # score one or more specific facesets (test mode)
    python work/filter_occlusions.py score --facesets faceset_001,faceset_050 \
        --out work/test_batch_occlusion/scores.json

    # score everything (full corpus)
    python work/filter_occlusions.py score --out work/occlusion_scores.json

    # render HTML contact sheet from a scores.json
    python work/filter_occlusions.py report --scores work/test_batch_occlusion/scores.json \
        --out work/test_batch_occlusion

Notes:
- `score` and `report` never modify facesets_swap_ready/; only `apply` does,
  and it writes an audit plan first and supports --dry-run.
- Model: open_clip ViT-L-14 / dfn2b_s39b (best public zero-shot at this size).
"""

from __future__ import annotations

import argparse
import json
import sys
import time
from pathlib import Path
from typing import Iterable

import torch
from PIL import Image
import open_clip

ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
WIN_ROOT = r"E:\temp_things\fcswp\nl_sorted\facesets_swap_ready"

MODEL_NAME = "ViT-L-14"
PRETRAINED = "dfn2b_s39b"


def wsl_to_win(wsl_path: str) -> str:
    """Translate a /mnt/e/... WSL path to E:\\... for the Windows worker."""
    s = str(wsl_path)
    if s.startswith("/mnt/"):
        drive = s[5]
        rest = s[7:].replace("/", "\\")
        return f"{drive.upper()}:\\{rest}"
    return s
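

# Minimal sanity check of the translation above (illustrative values, not
# from a real run):
#   wsl_to_win("/mnt/e/temp_things/fcswp/x.png") -> "E:\temp_things\fcswp\x.png"
#   wsl_to_win("some/other/path")                -> returned unchanged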

# Prompt ensembles. Each pair (positive, negative) becomes one binary classifier.
# We average text embeddings within each list, then softmax across the two means.
PROMPTS = {
    "mask": {
        "pos": [
            "a photo of a person wearing a surgical face mask",
            "a photo of a person wearing an FFP2 respirator covering mouth and nose",
            "a photo of a person wearing a cloth face mask",
            "a face partially covered by a medical mask",
            "a person whose mouth and nose are hidden by a face mask",
        ],
        "neg": [
            "a photo of a person's face with mouth and nose clearly visible",
            "a clear, unobstructed photo of a face",
            "a photo of a face without any mask or covering",
            "a portrait of a person showing their full face",
            "a photo of a person with a beard and visible mouth",  # avoid beard false positives
        ],
    },
    "sunglasses": {
        # We want to flag ONLY images where sunglasses occlude the eyes.
        # False positives to defeat: sunglasses pushed up on the head/forehead, hanging on a shirt collar.
        "pos": [
            "a face with dark sunglasses covering the eyes",
            "a portrait with the eyes hidden behind opaque sunglasses",
            "a person wearing dark sunglasses over their eyes, eyes not visible",
            "a face where the eyes are completely concealed by tinted lenses",
            "a close-up portrait wearing aviator sunglasses on the eyes",
        ],
        "neg": [
            "a portrait with both eyes clearly visible and uncovered",
            "a face with sunglasses pushed up on the forehead, eyes visible below",
            "a face with sunglasses resting on top of the head, eyes visible",
            "a person with sunglasses hanging from their shirt, eyes visible",
            "a face wearing clear prescription eyeglasses with visible eyes",
            "a portrait with no eyewear and visible eyes",
        ],
    },
}


def load_model(device: str = "cpu"):
    print(f"[clip] loading {MODEL_NAME} / {PRETRAINED} on {device} ...", file=sys.stderr)
    t0 = time.time()
    model, _, preprocess = open_clip.create_model_and_transforms(MODEL_NAME, pretrained=PRETRAINED)
    tokenizer = open_clip.get_tokenizer(MODEL_NAME)
    model = model.to(device).eval()
    logit_scale = float(model.logit_scale.exp().detach().cpu())
    print(f"[clip] ready in {time.time()-t0:.1f}s, logit_scale={logit_scale:.2f}", file=sys.stderr)
    return model, preprocess, tokenizer, logit_scale


@torch.no_grad()
def build_text_features(model, tokenizer, device: str):
    """Return dict {attr: (pos_mean_emb, neg_mean_emb)} on device, both L2-normalized."""
    out = {}
    for attr, sides in PROMPTS.items():
        feats = {}
        for side in ("pos", "neg"):
            tokens = tokenizer(sides[side]).to(device)
            f = model.encode_text(tokens)
            f = f / f.norm(dim=-1, keepdim=True)
            mean = f.mean(dim=0)
            feats[side] = mean / mean.norm()
        out[attr] = (feats["pos"], feats["neg"])
    return out
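

# Back-of-envelope for the two-way softmax in score_images below: with
# logit_scale ~ 100, P(pos) = sigmoid(logit_scale * (sim_pos - sim_neg)),
# so a cosine-similarity edge of just 0.02 toward the positive ensemble
# already yields sigmoid(2.0) ~ 0.88. The absolute similarities barely
# matter; only the pos/neg gap does.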


@torch.no_grad()
def score_images(model, preprocess, text_feats, logit_scale: float, paths: list[Path], device: str, batch: int = 16):
    """Yield (path, {attr: pos_prob}) per image. logit_scale is CLIP's learned temperature (~100)."""
    for i in range(0, len(paths), batch):
        chunk = paths[i:i + batch]
        imgs = []
        keep = []
        for p in chunk:
            try:
                img = Image.open(p).convert("RGB")
                imgs.append(preprocess(img))
                keep.append(p)
            except Exception as e:
                print(f"[skip] {p}: {e}", file=sys.stderr)
        if not imgs:
            continue
        x = torch.stack(imgs).to(device)
        feats = model.encode_image(x)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # (B, D)
        results = {}
        for attr, (pos, neg) in text_feats.items():
            sims = torch.stack([feats @ pos, feats @ neg], dim=1) * logit_scale  # (B, 2)
            probs = sims.softmax(dim=1)[:, 0].tolist()  # P(pos)
            results[attr] = probs
        for j, p in enumerate(keep):
            yield p, {attr: results[attr][j] for attr in text_feats}


def iter_facesets(root: Path, only: list[str] | None) -> Iterable[Path]:
    if only:
        for name in only:
            d = root / name
            if d.is_dir():
                yield d
            else:
                print(f"[warn] not a directory: {d}", file=sys.stderr)
        return
    for d in sorted(root.iterdir()):
        if d.is_dir() and not d.name.startswith("_"):
            yield d


def cmd_score(args):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess, tokenizer, logit_scale = load_model(device)
    text_feats = build_text_features(model, tokenizer, device)

    only = [s.strip() for s in args.facesets.split(",")] if args.facesets else None
    facesets = list(iter_facesets(ROOT, only))

    report = {
        "model": f"{MODEL_NAME}/{PRETRAINED}",
        "root": str(ROOT),
        "prompts": PROMPTS,
        "facesets": {},
    }
    total_imgs = 0
    t0 = time.time()
    for fs in facesets:
        faces = sorted((fs / "faces").glob("*.png")) if (fs / "faces").is_dir() else sorted(fs.glob("*.png"))
        if args.sample_per_faceset:
            # first N PNGs per faceset: cheap deterministic sample for test batches
            faces = faces[: args.sample_per_faceset]
        if not faces:
            continue
        print(f"[scan] {fs.name}: {len(faces)} png", file=sys.stderr)
        per_image = []
        for p, scores in score_images(model, preprocess, text_feats, logit_scale, faces, device):
            per_image.append({"file": p.name, "mask": round(scores["mask"], 4), "sunglasses": round(scores["sunglasses"], 4)})
            total_imgs += 1
        report["facesets"][fs.name] = per_image

    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(report, indent=2))
    dt = time.time() - t0
    print(f"[done] {total_imgs} images, {dt:.1f}s ({total_imgs/max(dt,1e-3):.2f} img/s) -> {out}", file=sys.stderr)


def cmd_report(args):
    """Render an HTML contact sheet from scores.json. Generates JPG thumbs."""
    scores = json.loads(Path(args.scores).read_text())
    out_dir = Path(args.out)
    thumbs_dir = out_dir / "thumbs"
    thumbs_dir.mkdir(parents=True, exist_ok=True)

    THUMB = 160
    rows_html = []

    def thumb_path(faceset: str, fname: str) -> Path:
        d = thumbs_dir / faceset
        d.mkdir(parents=True, exist_ok=True)
        return d / (Path(fname).stem + ".jpg")

    def make_thumb(src: Path, dst: Path):
        if dst.exists():
            return
        try:
            img = Image.open(src).convert("RGB")
            img.thumbnail((THUMB, THUMB), Image.LANCZOS)
            img.save(dst, "JPEG", quality=82)
        except Exception as e:
            print(f"[thumb-skip] {src}: {e}", file=sys.stderr)

    facesets = scores["facesets"]
    for faceset, items in facesets.items():
        # sort: high score first so borderline cases group at the boundary
        items_sorted = sorted(items, key=lambda x: max(x["mask"], x["sunglasses"]), reverse=True)
        # faceset summary
        n = len(items)
        n_mask = sum(1 for x in items if x["mask"] >= 0.7)
        n_sg = sum(1 for x in items if x["sunglasses"] >= 0.7)
        pct_mask = (100 * n_mask / n) if n else 0
        pct_sg = (100 * n_sg / n) if n else 0
        rows_html.append(f"<h2 id='{faceset}'>{faceset} <small>({n} imgs · mask≥0.7: {n_mask} ({pct_mask:.0f}%) · sunglasses≥0.7: {n_sg} ({pct_sg:.0f}%))</small></h2>")
        rows_html.append("<div class='grid'>")
        src_root = ROOT / faceset
        faces_root = (src_root / "faces") if (src_root / "faces").is_dir() else src_root
        for it in items_sorted:
            src = faces_root / it["file"]
            dst = thumb_path(faceset, it["file"])
            make_thumb(src, dst)
            rel = f"thumbs/{faceset}/{Path(it['file']).stem}.jpg"
            m, s = it["mask"], it["sunglasses"]
            cls_m = "hi" if m >= 0.7 else ("mid" if m >= 0.4 else "lo")
            cls_s = "hi" if s >= 0.7 else ("mid" if s >= 0.4 else "lo")
            rows_html.append(
                f"<div class='cell'>"
                f"<img src='{rel}' loading='lazy' title='{it['file']}'>"
                f"<div class='scores'><span class='{cls_m}'>M {m:.2f}</span> <span class='{cls_s}'>S {s:.2f}</span></div>"
                f"</div>"
            )
        rows_html.append("</div>")

    nav = " · ".join(f"<a href='#{f}'>{f}</a>" for f in facesets)

    html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Occlusion test batch</title>
<style>
body {{ font-family: system-ui, sans-serif; background: #111; color: #eee; padding: 1em; }}
h1 {{ margin-top: 0; }}
h2 {{ margin-top: 1.5em; border-bottom: 1px solid #333; padding-bottom: .25em; }}
small {{ color: #999; font-weight: normal; }}
.grid {{ display: grid; grid-template-columns: repeat(auto-fill, minmax(170px, 1fr)); gap: .5em; }}
.cell {{ background: #1c1c1c; padding: 4px; border-radius: 4px; text-align: center; }}
.cell img {{ max-width: 100%; height: auto; display: block; margin: 0 auto; }}
.scores {{ font-family: monospace; font-size: 11px; padding-top: 4px; }}
.hi {{ color: #ff5050; font-weight: bold; }}
.mid {{ color: #ffb050; }}
.lo {{ color: #5fa05f; }}
.nav {{ position: sticky; top: 0; background: #111; padding: .5em 0; border-bottom: 1px solid #333; }}
a {{ color: #6cf; }}
</style></head>
<body>
<h1>Occlusion scores — {scores['model']}</h1>
<p>Sorted within each faceset by max(mask, sunglasses) descending.
Color: <span class='hi'>≥0.70</span> · <span class='mid'>0.40–0.70</span> · <span class='lo'>&lt;0.40</span></p>
<div class='nav'>{nav}</div>
{''.join(rows_html)}
</body></html>"""

    out_html = out_dir / "index.html"
    out_html.write_text(html)
    print(f"[done] {out_html}", file=sys.stderr)


def _zip_png_list(pngs: list[Path], zip_path: Path) -> None:
    """Mirror of sort_faces.py:_zip_png_list. Renames PNGs to 0000.png, 0001.png, ..."""
    import zipfile
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=4) as zf:
        for i, p in enumerate(pngs):
            zf.write(p, arcname=f"{i:04d}.png")
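

# An .fsz, as produced here, is nothing more than a zip of sequentially
# renumbered PNGs. Hypothetical usage, mirroring what cmd_apply does below
# (`survivors` is an illustrative name):
#   _zip_png_list(survivors[:30], fs_dir / "faceset_001_top30.fsz")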


def cmd_apply(args):
    """Prune mask/sunglasses PNGs, quarantine occlusion-dominated facesets,
    re-zip .fsz, update top-level manifest. --dry-run prints the plan only."""
    import shutil

    threshold = args.threshold
    domain_pct = args.domain_pct
    min_survivors = args.min_survivors
    top_n_target = args.top_n

    scores = json.loads(Path(args.scores).read_text())
    master_path = ROOT / "manifest.json"
    master = json.loads(master_path.read_text())
    by_name = {f["name"]: f for f in master.get("facesets", [])}

    masked_dir = ROOT / "_masked"
    thin_dir = ROOT / "_thin"

    plan = []
    for faceset, items in scores["facesets"].items():
        if faceset not in by_name:
            print(f"[warn] {faceset} not in master manifest — skipping", file=sys.stderr)
            continue
        n = len(items)
        flagged_files = sorted(
            it["file"] for it in items
            if it["mask"] >= threshold or it["sunglasses"] >= threshold
        )
        flagged_set = set(flagged_files)
        survivors_items = [it for it in items if it["file"] not in flagged_set]
        # preserve quality order from filename (0001.png is highest-rank)
        survivors_files = sorted(it["file"] for it in survivors_items)

        n_mask = sum(1 for it in items if it["mask"] >= threshold)
        n_sg = sum(1 for it in items if it["sunglasses"] >= threshold)
        pct_mask = n_mask / n if n else 0
        pct_sg = n_sg / n if n else 0

        if pct_mask >= domain_pct:
            action, reason = "quarantine_masked", f"mask_pct={pct_mask:.0%}"
        elif pct_sg >= domain_pct:
            action, reason = "quarantine_masked", f"sunglasses_pct={pct_sg:.0%}"
        elif flagged_files and len(survivors_files) < min_survivors:
            # only quarantine-as-thin if pruning is the cause of the drop below threshold;
            # pre-existing small facesets without occlusions are left alone
            action, reason = "quarantine_thin", f"survivors={len(survivors_files)}<{min_survivors}"
        elif flagged_files:
            action, reason = "prune", f"drop {len(flagged_files)}"
        else:
            action, reason = "keep", "clean"

        plan.append({
            "faceset": faceset, "action": action, "reason": reason,
            "n": n, "n_mask": n_mask, "n_sg": n_sg,
            "n_dropped": len(flagged_files), "n_survivors": len(survivors_files),
            "dropped_files": flagged_files,
        })

    # Summary
    counts = {a: 0 for a in ("keep", "prune", "quarantine_masked", "quarantine_thin")}
    for p in plan:
        counts[p["action"]] += 1
    total_dropped_pngs = sum(p["n_dropped"] for p in plan if p["action"] == "prune")
    total_quarantined_pngs = sum(p["n"] for p in plan if p["action"].startswith("quarantine"))
    print(f"=== plan summary (threshold={threshold} domain_pct={domain_pct} min_survivors={min_survivors}) ===")
    for a, c in counts.items():
        print(f" {a}: {c}")
    print(f" PNGs to drop (prune): {total_dropped_pngs}")
    print(f" PNGs to quarantine (whole): {total_quarantined_pngs}")
    print(f" facesets in master: {len(master['facesets'])}")
    print(f" facesets scored: {len(plan)}")

    # Write plan for audit
    plan_path = Path(args.out_plan)
    plan_path.parent.mkdir(parents=True, exist_ok=True)
    plan_path.write_text(json.dumps({
        "thresholds": {"image": threshold, "domain_pct": domain_pct, "min_survivors": min_survivors},
        "counts": counts,
        "totals": {"dropped_pngs": total_dropped_pngs, "quarantined_pngs": total_quarantined_pngs},
        "plan": plan,
    }, indent=2))
    print(f" plan written to {plan_path}")

    if args.dry_run:
        # pretty list of quarantines
        for p in plan:
            if p["action"].startswith("quarantine"):
                print(f" [{p['action']:>18s}] {p['faceset']} ({p['reason']}, n={p['n']})")
        return

    # ----- destructive section -----
    masked_dir.mkdir(parents=True, exist_ok=True)
    thin_dir.mkdir(parents=True, exist_ok=True)

    new_facesets = []
    new_masked = list(master.get("masked", []))  # preserve any prior runs
    new_thin = list(master.get("thin_eras", []))

    # build a name -> existing-thin/masked entry index, to update relpath if we re-quarantine
    by_name_thin = {e["name"]: e for e in new_thin}
    by_name_masked = {e["name"]: e for e in new_masked}

    for p in plan:
        entry = dict(by_name[p["faceset"]])  # copy
        fs_dir = ROOT / p["faceset"]
        faces_dir = fs_dir / "faces"

        if p["action"] == "keep":
            new_facesets.append(entry)
            continue

        # prune dropped PNGs first (applies to both prune and quarantine_thin paths)
        if p["dropped_files"]:
            dropped_holding = faces_dir / "_dropped"
            dropped_holding.mkdir(exist_ok=True)
            for fname in p["dropped_files"]:
                src = faces_dir / fname
                if src.exists():
                    shutil.move(str(src), str(dropped_holding / fname))

        if p["action"].startswith("quarantine"):
            target_root = masked_dir if p["action"] == "quarantine_masked" else thin_dir
            target = target_root / p["faceset"]
            if target.exists():
                # idempotency: if a previous run already moved it, skip move
                pass
            else:
                shutil.move(str(fs_dir), str(target))
            entry["occlusion_filter"] = {
                "action": p["action"], "reason": p["reason"],
                "n_input": p["n"], "n_mask": p["n_mask"], "n_sg": p["n_sg"],
                "n_dropped": p["n_dropped"], "n_survivors": p["n_survivors"],
                "threshold": threshold, "domain_pct": domain_pct,
            }
            entry["relpath"] = f"{'_masked' if p['action']=='quarantine_masked' else '_thin'}/{p['faceset']}"
            entry["fsz_top"] = None
            entry["fsz_all"] = None
            if p["action"] == "quarantine_masked":
                entry["masked"] = True
                new_masked.append(entry)
            else:
                entry["thin"] = True
                new_thin.append(entry)
            continue

        # action == prune
        survivor_pngs = sorted(faces_dir.glob("*.png"))
        if not survivor_pngs:
            print(f"[warn] {p['faceset']}: no survivor PNGs after prune", file=sys.stderr)
            new_facesets.append(entry)
            continue

        # re-zip .fsz from survivors in quality order
        top_n_eff = min(top_n_target, len(survivor_pngs))
        top_fsz = fs_dir / f"{p['faceset']}_top{top_n_eff}.fsz"
        all_fsz = fs_dir / f"{p['faceset']}_all.fsz"
        # remove old .fsz files (they may have different top_n in name)
        for old in fs_dir.glob("*.fsz"):
            old.unlink()
        _zip_png_list(survivor_pngs[:top_n_eff], top_fsz)
        if len(survivor_pngs) > top_n_eff:
            _zip_png_list(survivor_pngs, all_fsz)
            entry["fsz_all"] = all_fsz.name
        else:
            entry["fsz_all"] = None
        entry["fsz_top"] = top_fsz.name
        entry["top_n"] = top_n_eff
        entry["exported"] = len(survivor_pngs)
        entry["dropped_occlusion"] = p["n_dropped"]
        entry["occlusion_filter"] = {
            "action": "prune", "n_input": p["n"], "n_mask": p["n_mask"],
            "n_sg": p["n_sg"], "n_dropped": p["n_dropped"], "n_survivors": p["n_survivors"],
            "threshold": threshold,
        }
        new_facesets.append(entry)

    # write updated master manifest
    new_master = dict(master)
    new_master["facesets"] = new_facesets
    new_master["masked"] = new_masked
    new_master["thin_eras"] = new_thin
    new_master["occlusion_filter_run"] = {
        "model": scores.get("model"),
        "threshold": threshold,
        "domain_pct": domain_pct,
        "min_survivors": min_survivors,
        "counts": counts,
        "totals": {"dropped_pngs": total_dropped_pngs, "quarantined_pngs": total_quarantined_pngs},
    }
    tmp = master_path.with_suffix(".tmp.json")
    tmp.write_text(json.dumps(new_master, indent=2))
    tmp.replace(master_path)
    print(f"[done] master manifest updated: {len(new_facesets)} active, "
          f"{len(new_masked)} masked, {len(new_thin)} thin")


def cmd_stage(args):
    """Walk facesets_swap_ready/ and write a queue.json for the Windows clip_worker."""
    only = [s.strip() for s in args.facesets.split(",")] if args.facesets else None
    queue = []
    for fs in iter_facesets(ROOT, only):
        faces = sorted((fs / "faces").glob("*.png")) if (fs / "faces").is_dir() else sorted(fs.glob("*.png"))
        for p in faces:
            queue.append({
                "wsl_path": str(p),
                "win_path": wsl_to_win(str(p)),
                "faceset": fs.name,
                "file": p.name,
            })
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(queue, indent=2))
    print(f"[stage] {len(queue)} png paths -> {out}", file=sys.stderr)
    print(f"[stage] win queue file: {wsl_to_win(str(out))}", file=sys.stderr)


def cmd_merge(args):
    """Ingest worker scores.json into the per-faceset shape that `report` reads."""
    src = json.loads(Path(args.scores).read_text())
    by_faceset: dict[str, list] = {}
    for r in src.get("results", []):
        by_faceset.setdefault(r["faceset"], []).append({
            "file": r["file"],
            "mask": r["mask"],
            "sunglasses": r["sunglasses"],
        })
    # stable ordering: faceset by name, files by name
    out_data = {
        "model": src.get("model", f"{MODEL_NAME}/{PRETRAINED}"),
        "root": str(ROOT),
        "prompts": src.get("prompts", PROMPTS),
        "facesets": {fs: sorted(items, key=lambda x: x["file"]) for fs, items in sorted(by_faceset.items())},
    }
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(out_data, indent=2))
    total = sum(len(v) for v in by_faceset.values())
    print(f"[merge] {total} scores across {len(by_faceset)} facesets -> {out}", file=sys.stderr)
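

# Shape contract, for reference (a sketch reconstructed from the readers and
# writers in this file, not a dumped sample): the worker emits a flat list
#   {"results": [{"faceset": "faceset_001", "file": "0001.png",
#                 "mask": 0.03, "sunglasses": 0.91}, ...]}
# and cmd_merge regroups it into the {"facesets": {name: [items]}} layout
# that cmd_report and cmd_apply consume.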


def main():
    ap = argparse.ArgumentParser()
    sub = ap.add_subparsers(dest="cmd", required=True)

    s = sub.add_parser("score", help="WSL CPU scoring (slow but no GPU dependency)")
    s.add_argument("--facesets", default=None, help="comma-separated faceset names; default = all")
    s.add_argument("--sample-per-faceset", type=int, default=0, help="cap PNGs per faceset (0 = all)")
    s.add_argument("--out", required=True)
    s.set_defaults(func=cmd_score)

    st = sub.add_parser("stage", help="Build queue.json for Windows clip_worker.py")
    st.add_argument("--facesets", default=None, help="comma-separated faceset names; default = all")
    st.add_argument("--out", required=True)
    st.set_defaults(func=cmd_stage)

    m = sub.add_parser("merge", help="Convert worker scores.json into per-faceset report format")
    m.add_argument("--scores", required=True, help="worker output (flat list of results)")
    m.add_argument("--out", required=True, help="output path for per-faceset format")
    m.set_defaults(func=cmd_merge)

    r = sub.add_parser("report", help="Render HTML contact sheet from a per-faceset scores.json")
    r.add_argument("--scores", required=True)
    r.add_argument("--out", required=True)
    r.set_defaults(func=cmd_report)

    a = sub.add_parser("apply", help="Prune flagged PNGs, quarantine dominated facesets, re-zip .fsz, update manifest")
    a.add_argument("--scores", required=True, help="per-faceset scores.json (output of `merge` or `score`)")
    a.add_argument("--out-plan", required=True, help="path to write the apply plan json (audit)")
    a.add_argument("--threshold", type=float, default=0.7, help="image-level drop threshold for mask/sunglasses (default 0.7)")
    a.add_argument("--domain-pct", type=float, default=0.40, help="faceset-level quarantine threshold (default 0.40)")
    a.add_argument("--min-survivors", type=int, default=5, help="quarantine to _thin if survivors below this (default 5)")
    a.add_argument("--top-n", type=int, default=30, help="top-N for re-zipped _topN.fsz (default 30)")
    a.add_argument("--dry-run", action="store_true", help="print plan only, no filesystem changes")
    a.set_defaults(func=cmd_apply)

    args = ap.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()
Executable
+50
@@ -0,0 +1,50 @@
#!/usr/bin/env bash
# Finalize an Immich user's stage:
#   1. Copy queue.json to /mnt/c so the Windows embed worker can read it
#   2. Run the embed worker on Windows (DML)
#   3. Copy the resulting cache back to /opt/face-sets/work/cache/
#   4. Run cluster_immich.py to discover + emit new facesets
#
# Usage: ./work/finalize_immich.sh <user-label>
set -euo pipefail

USER_LABEL="${1:?usage: $0 <user-label>}"

REPO="$(cd "$(dirname "$0")/.." && pwd)"
WSL_QUEUE="$REPO/work/immich/$USER_LABEL/queue.json"
WIN_QUEUE_DIR="/mnt/c/face_embed_venv/work/immich/$USER_LABEL"
WIN_QUEUE="$WIN_QUEUE_DIR/queue.json"
WIN_QUEUE_FOR_PS="C:\\face_embed_venv\\work\\immich\\$USER_LABEL\\queue.json"

WIN_CACHE_DIR="/mnt/c/face_embed_venv/work/cache"
WIN_CACHE="$WIN_CACHE_DIR/immich_${USER_LABEL}.npz"
WIN_CACHE_FOR_PS="C:\\face_embed_venv\\work\\cache\\immich_${USER_LABEL}.npz"
WSL_CACHE="$REPO/work/cache/immich_${USER_LABEL}.npz"

LOG="$REPO/work/logs/immich_finalize_${USER_LABEL}.log"

[ -f "$WSL_QUEUE" ] || { echo "missing queue: $WSL_QUEUE" >&2; exit 1; }

echo "=== finalize: $USER_LABEL ===" | tee -a "$LOG"
date | tee -a "$LOG"

mkdir -p "$WIN_QUEUE_DIR" "$WIN_CACHE_DIR" "$REPO/work/cache"

echo "[1/4] copying queue: $WSL_QUEUE -> $WIN_QUEUE" | tee -a "$LOG"
cp "$WSL_QUEUE" "$WIN_QUEUE"
echo " $(wc -c < "$WIN_QUEUE") bytes; $(/home/peter/face_sort_env/bin/python3 -c "import json,sys; print(len(json.load(open('$WIN_QUEUE'))))") entries"

echo "[2/4] running Windows DML embed worker" | tee -a "$LOG"
powershell.exe -NoProfile -Command "C:\\face_embed_venv\\Scripts\\python.exe C:\\face_embed_venv\\bench\\embed_worker.py '$WIN_QUEUE_FOR_PS' '$WIN_CACHE_FOR_PS'" 2>&1 | tee -a "$LOG"

[ -f "$WIN_CACHE" ] || { echo "embed produced no cache file at $WIN_CACHE" | tee -a "$LOG"; exit 1; }

echo "[3/4] copying cache back: $WIN_CACHE -> $WSL_CACHE" | tee -a "$LOG"
cp "$WIN_CACHE" "$WSL_CACHE"
echo " $(/home/peter/face_sort_env/bin/python3 -c "import sys,json; sys.path.insert(0,'$REPO'); from sort_faces import load_cache; e,m,_,_,_=load_cache('$WSL_CACHE'); print(f'{len(e)} embeddings, {sum(1 for x in m if x.get(\"noface\"))} noface, {sum(1 for x in m if not x.get(\"noface\"))} faces')")"

echo "[4/4] running cluster_immich.py" | tee -a "$LOG"
/home/peter/face_sort_env/bin/python3 "$REPO/work/cluster_immich.py" "$WSL_CACHE" 2>&1 | tee -a "$LOG"

echo "=== finalize done: $USER_LABEL ===" | tee -a "$LOG"
date | tee -a "$LOG"
@@ -0,0 +1,447 @@
#!/usr/bin/env python3
"""Stage Immich assets for embedding (WSL side of the split workflow).

For one Immich user:
  1. Page through `/search/metadata` listing every IMAGE asset the user owns.
  2. For each asset, fetch `/faces?id=` and decide if any detected face has a
     scaled short side >= MIN_FACE_SHORT on the original. Skip assets that
     don't.
  3. Download the original. Compute sha256.
  4. Dedup against (a) the existing canonical cache `nl_full.npz` and
     (b) sha256s already staged in this run / earlier runs. If duplicate,
     do NOT save to disk; record the alias.
  5. Save survivors to /mnt/x/src/immich/<user>/<rel> mirroring the structure
     after Immich's `/upload/library/<owner>/` prefix.
  6. Write a queue file with WSL + Windows paths so the Windows DML embed
     worker can find them.
  7. Persist staging state continuously so the run is resumable.

Output artifacts:
  work/immich/<user>/queue.json   - what the Windows worker should embed
  work/immich/<user>/state.json   - resume state
  work/immich/<user>/aliases.json - asset_id -> existing canonical path
                                    when sha256 matched something already
                                    in nl_full.npz
"""

from __future__ import annotations

import argparse
import hashlib
import json
import os
import sys
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))

from sort_faces import load_cache  # noqa: E402

# ---- config -------------------------------------------------------------- #

API = os.environ.get("IMMICH_URL", "").rstrip("/") + "/api" if os.environ.get("IMMICH_URL") else None
KEY = os.environ.get("IMMICH_API_KEY")
if not API or not KEY:
    raise SystemExit(
        "set IMMICH_URL and IMMICH_API_KEY env vars before running, e.g.\n"
        "  export IMMICH_URL=https://fotos.example.org\n"
        "  export IMMICH_API_KEY=...   # admin API key"
    )
HEADERS = {"x-api-key": KEY, "Accept": "application/json"}

# Short-label -> Immich userId. The user is responsible for filling this in for
# their own Immich instance. NOTE: as of Immich v2.7.2, /search/metadata's
# `userIds` filter is silently ignored when the API key is bound to a different
# user, so changing this label/UUID does not actually change which assets the
# API returns; we keep it here for naming output dirs and as future-proofing.
USERS_FILE = REPO / "work" / "immich" / "users.json"
USERS: dict[str, str] = {}
if USERS_FILE.exists():
    USERS = json.loads(USERS_FILE.read_text())

CACHE_PATH = REPO / "work" / "cache" / "nl_full.npz"  # for sha256 dedup
STAGE_DIR = REPO / "work" / "immich"
DEST_ROOT = Path("/mnt/x/src/immich")
WIN_DEST_ROOT = "X:\\src\\immich"  # equivalent on the Windows side

PAGE_SIZE = 1000
MIN_FACE_SHORT = 90   # match refine's gate
MIN_DET_SCORE = 0.5   # weaker than refine's 0.6, since Immich's score scale differs
HTTP_TIMEOUT = 60     # seconds, conservative for big originals
HTTP_RETRIES = 3
HTTP_BACKOFF = 2.0

# Circuit breaker: if this many consecutive workers fail with network errors,
# probe Immich; if probe also fails, exit cleanly with code 2 so the orchestrator
# can pause until the user says resume. State is preserved (resume-safe).
OUTAGE_FAIL_STREAK = 12
OUTAGE_PROBE_TIMEOUT = 8

# ---- helpers ------------------------------------------------------------- #

def http_get(url: str, accept_bytes: bool = False) -> bytes | dict:
    """GET with retries. Returns parsed JSON unless accept_bytes is True."""
    last_err = None
    for attempt in range(HTTP_RETRIES):
        try:
            req = urllib.request.Request(url, headers=HEADERS)
            with urllib.request.urlopen(req, timeout=HTTP_TIMEOUT) as resp:
                data = resp.read()
                return data if accept_bytes else json.loads(data)
        except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError) as e:
            last_err = e
            if attempt + 1 < HTTP_RETRIES:
                time.sleep(HTTP_BACKOFF * (attempt + 1))
    raise RuntimeError(f"GET {url} failed after {HTTP_RETRIES} attempts: {last_err}")


def probe_immich() -> bool:
    """Quick connectivity probe (no retry). Used by the circuit breaker."""
    try:
        req = urllib.request.Request(f"{API}/server/version", headers=HEADERS)
        urllib.request.urlopen(req, timeout=OUTAGE_PROBE_TIMEOUT).read()
        return True
    except Exception:
        return False


def http_post(url: str, payload: dict) -> dict:
    last_err = None
    body = json.dumps(payload).encode("utf-8")
    for attempt in range(HTTP_RETRIES):
        try:
            req = urllib.request.Request(
                url, data=body, headers={**HEADERS, "Content-Type": "application/json"}
            )
            with urllib.request.urlopen(req, timeout=HTTP_TIMEOUT) as resp:
                return json.loads(resp.read())
        except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError) as e:
            last_err = e
            if attempt + 1 < HTTP_RETRIES:
                time.sleep(HTTP_BACKOFF * (attempt + 1))
    raise RuntimeError(f"POST {url} failed after {HTTP_RETRIES} attempts: {last_err}")
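

# Retry schedule implied by the constants above: attempt 1 fails -> sleep
# 2.0 s, attempt 2 fails -> sleep 4.0 s, attempt 3 fails -> RuntimeError.
# Worst case a single request can tie up a worker for ~3 x 60 s of timeouts
# plus 6 s of backoff before the circuit breaker in stage() even sees it.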


def sha256_bytes(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()


def derive_relpath(original_path: str) -> str:
    """Return a relative subpath rooted at the user dir, mirroring Immich.

    /usr/src/app/upload/library/admin/2026/2026-02-18/foo.jpg
        -> 2026/2026-02-18/foo.jpg
    Anything that doesn't match the expected prefix falls back to the basename
    only.
    """
    marker = "/upload/library/"
    i = original_path.find(marker)
    if i < 0:
        return Path(original_path).name
    rest = original_path[i + len(marker):]
    parts = rest.split("/", 1)
    return parts[1] if len(parts) == 2 else parts[0]


def wsl_to_win(p: Path) -> str:
    """Convert /mnt/x/.. -> X:\\.. for the embed worker that runs on Windows."""
    s = str(p)
    if s.startswith("/mnt/"):
        drive = s[5]
        rest = s[6:].lstrip("/")
        return f"{drive.upper()}:\\{rest.replace('/', chr(92))}"
    if s.startswith("/opt/face-sets/"):
        # /opt/face-sets/work/... is on the WSL ext4 filesystem; reachable from
        # Windows as \\wsl$\Ubuntu\opt\face-sets\... (slower than C:). For our
        # use we keep all stage outputs under /mnt/x or /mnt/c so this branch
        # should not be hit, but fall back rather than fail.
        return f"\\\\wsl$\\Ubuntu\\opt\\face-sets\\{s[len('/opt/face-sets/'):].replace('/', chr(92))}"
    return s


def keep_asset(asset: dict, faces: list) -> tuple[bool, list[dict]]:
    """Return (keep, eligible_face_records). A face is 'eligible' iff its
    scaled-to-original short side >= MIN_FACE_SHORT and source-type is
    machine-learning."""
    W, H = asset.get("width"), asset.get("height")
    if not W or not H:
        return False, []
    eligible = []
    for f in faces:
        if f.get("sourceType") and f["sourceType"] != "machine-learning":
            continue
        iw = f.get("imageWidth") or W
        ih = f.get("imageHeight") or H
        sx = (W / iw) if iw else 1.0
        sy = (H / ih) if ih else 1.0
        bw = (f["boundingBoxX2"] - f["boundingBoxX1"]) * sx
        bh = (f["boundingBoxY2"] - f["boundingBoxY1"]) * sy
        if min(bw, bh) >= MIN_FACE_SHORT:
            eligible.append({
                "id": f["id"],
                "x1": int(round(f["boundingBoxX1"] * sx)),
                "y1": int(round(f["boundingBoxY1"] * sy)),
                "x2": int(round(f["boundingBoxX2"] * sx)),
                "y2": int(round(f["boundingBoxY2"] * sy)),
                "person": (f.get("person") or {}).get("name") or None,
            })
    return (len(eligible) > 0), eligible
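

# Worked example of the rescale above (illustrative numbers): Immich ran
# detection on a 1440x1080 preview of a 2880x2160 original, so sx = sy = 2.
# A preview-space box (100,100)-(200,220) scales to a 200x240 box on the
# original; min(200, 240) = 200 >= MIN_FACE_SHORT (90), so the asset is kept.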


# ---- main staging loop --------------------------------------------------- #

def list_assets(user_id: str):
    """Yield every IMAGE asset owned by user_id, paginated."""
    page = 1
    while True:
        resp = http_post(f"{API}/search/metadata", {
            "size": PAGE_SIZE,
            "type": "IMAGE",
            "page": page,
            "userIds": [user_id],
        })
        items = resp["assets"]["items"]
        if not items:
            return
        yield from items
        nxt = resp["assets"].get("nextPage")
        if not nxt:
            return
        page = int(nxt)


def stage(user_label: str, limit: int | None, workers: int) -> None:
    # USERS may legitimately be empty (see note above: the server ignores the
    # userIds filter for our key anyway), so fall back to the label itself.
    user_id = USERS.get(user_label, user_label)
    user_dir = STAGE_DIR / user_label
    user_dir.mkdir(parents=True, exist_ok=True)

    state_path = user_dir / "state.json"
    queue_path = user_dir / "queue.json"
    aliases_path = user_dir / "aliases.json"

    # ---- load existing state for resume ---- #
    state = {
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "user_label": user_label,
        "user_id": user_id,
        "seen_asset_ids": [],
        "staged_count": 0,
        "deduped_against_existing": 0,
        "deduped_against_staged": 0,
        "skipped_no_big_face": 0,
        "skipped_no_faces": 0,
        "skipped_download_error": 0,
        "total_assets_seen": 0,
    }
    queue: list[dict] = []
    aliases: dict[str, dict] = {}  # asset_id -> {"sha256", "canonical"}
    staged_hashes: set[str] = set()
    if state_path.exists():
        prior = json.loads(state_path.read_text())
        state.update(prior)
        state["resumed_at"] = time.strftime("%Y-%m-%dT%H:%M:%S")
    if queue_path.exists():
        queue = json.loads(queue_path.read_text())
        staged_hashes = {q["sha256"] for q in queue}
    if aliases_path.exists():
        aliases = json.loads(aliases_path.read_text())
    print(f"[resume] {len(state['seen_asset_ids'])} asset_ids already seen, "
          f"{len(queue)} in queue, {len(aliases)} aliased to existing cache")
    seen = set(state["seen_asset_ids"])

    # ---- startup connectivity probe ---- #
    if not probe_immich():
        print(f"[init] Immich probe failed at {API}/server/version -- exiting code 2")
        sys.exit(2)
    print("[init] Immich reachable")

    # ---- load existing canonical cache hashes (sha256) ---- #
    print(f"[init] loading existing cache hashes from {CACHE_PATH}")
    _emb, meta, _src, _proc, _aliases = load_cache(CACHE_PATH)
    canonical_by_hash: dict[str, str] = {}
    for m in meta:
        h = m.get("hash")
        if h:
            canonical_by_hash.setdefault(h, m["path"])
    print(f"[init] {len(canonical_by_hash)} unique sha256s in nl_full.npz")

    # ---- iterate assets ---- #
    # Each worker does the entire I/O chain for an asset: /faces -> filter ->
    # /original. That way 8 workers translate to ~8x parallelism end-to-end.
    # Main thread does sha256, dedup decisions, and writes (which are CPU/SMB
    # bound but cheap relative to two HTTPS round-trips per asset).
    # Worker result tuple:
    #   (asset, faces|None, blob|None, eligible|None, error|None)
    def _fetch_for_asset(asset: dict):
        if asset.get("type") != "IMAGE":
            return asset, None, None, None, "not_image"
        aid = asset["id"]
        if aid in seen:
            return asset, None, None, None, "already_seen"
        try:
            faces = http_get(f"{API}/faces?id={aid}")
        except Exception as e:
            return asset, None, None, None, f"faces_error:{e}"
        if not faces:
            return asset, [], None, [], "no_faces"
        keep, eligible = keep_asset(asset, faces)
        if not keep:
            return asset, faces, None, eligible, "no_big_face"
        try:
            blob = http_get(f"{API}/assets/{aid}/original", accept_bytes=True)
        except Exception as e:
            return asset, faces, None, eligible, f"download_error:{e}"
        return asset, faces, blob, eligible, None

    n = 0
    err_streak = 0
    last_flush = time.time()
    t0 = time.time()
    pool = ThreadPoolExecutor(max_workers=workers)
    try:
        for asset, faces, blob, eligible, err in pool.map(_fetch_for_asset, list_assets(user_id)):
            if asset.get("type") != "IMAGE":
                continue
            n += 1
            state["total_assets_seen"] = n
            if limit is not None and n > limit:
                print(f"[stop] hit --limit {limit}")
                break
            aid = asset["id"]

            # Already-seen / non-image: silently skip.
            if err == "already_seen":
                continue

            # Transient: count, but DON'T mark as seen so resume retries.
            if err and (err.startswith("faces_error") or err.startswith("download_error")):
                kind = err.split(":", 1)[0]
                detail = err.split(":", 1)[1][:160] if ":" in err else err
                print(f"[err] {kind} {aid}: {detail}")
                state["skipped_download_error"] += 1
                err_streak += 1
                # Circuit breaker: long streak -> probe; if down, save and exit.
                if err_streak >= OUTAGE_FAIL_STREAK:
                    print(f"[breaker] {err_streak} consecutive errors; probing Immich...")
                    if probe_immich():
                        print("[breaker] probe ok, treating as transient; continuing")
                        err_streak = 0
                    else:
                        print("[breaker] probe FAILED -- pausing run; resume with same command")
                        queue_path.write_text(json.dumps(queue, indent=2))
                        state_path.write_text(json.dumps(state, indent=2))
                        aliases_path.write_text(json.dumps(aliases, indent=2))
                        sys.exit(2)
                continue
            else:
                err_streak = 0

            # Permanent classifications -> seen.
            if err == "no_faces":
                state["skipped_no_faces"] += 1
                seen.add(aid); state["seen_asset_ids"] = sorted(seen)
                continue
            if err == "no_big_face":
                state["skipped_no_big_face"] += 1
                seen.add(aid); state["seen_asset_ids"] = sorted(seen)
                continue

            # Have faces + blob -> dedup + save.
            h = sha256_bytes(blob)
            if h in canonical_by_hash:
                aliases[aid] = {"sha256": h, "canonical": canonical_by_hash[h]}
                state["deduped_against_existing"] += 1
                seen.add(aid); state["seen_asset_ids"] = sorted(seen)
                continue
            if h in staged_hashes:
                state["deduped_against_staged"] += 1
                seen.add(aid); state["seen_asset_ids"] = sorted(seen)
                continue

            rel = derive_relpath(asset.get("originalPath", asset.get("originalFileName", aid)))
            wsl_path = DEST_ROOT / user_label / rel
            wsl_path.parent.mkdir(parents=True, exist_ok=True)
            wsl_path.write_bytes(blob)
            staged_hashes.add(h)

            queue.append({
                "asset_id": aid,
                "sha256": h,
                "wsl_path": str(wsl_path),
                "win_path": wsl_to_win(wsl_path),
                "size_bytes": len(blob),
                "width": asset.get("width"),
                "height": asset.get("height"),
                "originalPath": asset.get("originalPath"),
                "originalFileName": asset.get("originalFileName"),
                "localDateTime": asset.get("localDateTime"),
                "immich_eligible_faces": eligible,
            })
            state["staged_count"] += 1
            seen.add(aid)
            state["seen_asset_ids"] = sorted(seen)

            if time.time() - last_flush > 5.0 or len(queue) % 25 == 0:
                queue_path.write_text(json.dumps(queue, indent=2))
                state_path.write_text(json.dumps(state, indent=2))
                aliases_path.write_text(json.dumps(aliases, indent=2))
                last_flush = time.time()
                elapsed = time.time() - t0
                rate = state["total_assets_seen"] / max(0.1, elapsed)
                print(f"[stage] seen={state['total_assets_seen']:6d} "
                      f"staged={state['staged_count']:5d} "
                      f"dedup-existing={state['deduped_against_existing']:5d} "
                      f"dedup-staged={state['deduped_against_staged']:5d} "
                      f"no-big-face={state['skipped_no_big_face']:6d} "
                      f"no-faces={state['skipped_no_faces']:6d} "
                      f"errs={state['skipped_download_error']:3d} "
                      f"({rate:.1f} assets/s)")
    finally:
        pool.shutdown(wait=False, cancel_futures=True)

    # final flush
    queue_path.write_text(json.dumps(queue, indent=2))
    state_path.write_text(json.dumps(state, indent=2))
    aliases_path.write_text(json.dumps(aliases, indent=2))
    print()
    print(f"=== final state for user {user_label} ===")
    for k in [
        "total_assets_seen", "staged_count", "deduped_against_existing",
        "deduped_against_staged", "skipped_no_big_face", "skipped_no_faces",
        "skipped_download_error",
    ]:
        print(f" {k}: {state[k]}")
    total_bytes = sum(q["size_bytes"] for q in queue)
    print(f" staged bytes: {total_bytes/1e9:.2f} GB across {len(queue)} files")
    print(f" queue: {queue_path}")
    print(f" state: {state_path}")
    print(f" aliases: {aliases_path}")
|
||||
|
||||
# ---- cli ----------------------------------------------------------------- #
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser()
|
||||
if not USERS:
|
||||
p.add_argument("--user", required=True,
|
||||
help=f"label for output dir (USERS map empty; populate {USERS_FILE} to constrain)")
|
||||
else:
|
||||
p.add_argument("--user", choices=list(USERS.keys()), required=True)
|
||||
p.add_argument("--limit", type=int, default=None,
|
||||
help="stop after seeing N assets total (for testing)")
|
||||
p.add_argument("--workers", type=int, default=8,
|
||||
help="concurrent /faces fetches (default 8)")
|
||||
args = p.parse_args()
|
||||
stage(args.user, args.limit, args.workers)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,144 @@
"""Windows / DirectML multi-face audit worker.

For every PNG in queue.json, run insightface FaceAnalysis and record how many
faces were detected (filtering by det_score>=MIN_DET and face_short>=MIN_FACE_PIX).
Surfaces the load-bearing roop invariant: each .fsz PNG must hold exactly one
face, otherwise the loader's `extract_face_images` appends every detected face
into the FaceSet and pollutes the averaged identity embedding.

CLI:
    py -3.12 multiface_worker.py <queue.json> <out_results.json> [--limit N]
"""

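# Illustrative queue entry shape -- inferred from the fields this worker reads
# (wsl_path / win_path / faceset / file); the producer side may carry more keys:
#   {
#     "wsl_path": "/mnt/e/.../faceset_001/0001.png",   # resume key
#     "win_path": "E:\\...\\faceset_001\\0001.png",    # path PIL opens on Windows
#     "faceset": "faceset_001",
#     "file": "0001.png"
#   }
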
from __future__ import annotations

import argparse
import json
import os
import sys
import time
from pathlib import Path

import numpy as np
from PIL import Image, ImageOps
from insightface.app import FaceAnalysis

MODEL_ROOT = r"C:\face_embed_venv\models"
MIN_DET = 0.5
MIN_FACE_PIX = 40
FLUSH_EVERY = 200


def load_existing(out_path: Path):
    if not out_path.exists():
        return None, set()
    try:
        d = json.loads(out_path.read_text())
        processed = set(d.get("processed", []))
        return d, processed
    except Exception as e:
        print(f"[warn] could not parse {out_path}: {e}; starting fresh", file=sys.stderr)
        return None, set()


def save_atomic(out_path: Path, data: dict):
    tmp = out_path.with_suffix(".tmp.json")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, out_path)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("queue", type=Path)
    ap.add_argument("out", type=Path)
    ap.add_argument("--limit", type=int, default=None)
    args = ap.parse_args()

    queue = json.loads(args.queue.read_text())
    print(f"[queue] {len(queue)} entries from {args.queue}", flush=True)
    args.out.parent.mkdir(parents=True, exist_ok=True)
    existing, processed = load_existing(args.out)
    if existing:
        print(f"[resume] {len(processed)} already scored", flush=True)
        results = existing.get("results", [])
    else:
        results = []
    pending = [e for e in queue if e["wsl_path"] not in processed]
    if args.limit is not None:
        pending = pending[: args.limit]
    print(f"[pending] {len(pending)} entries", flush=True)
    if not pending:
        print("[done] nothing to do")
        return

    print("[load] FaceAnalysis with DmlExecutionProvider", flush=True)
    app = FaceAnalysis(
        name="buffalo_l",
        root=MODEL_ROOT,
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],
    )
    app.prepare(ctx_id=0, det_size=(640, 640))

    n_done = 0
    n_load_err = 0
    last_flush = time.time()
    t_start = time.time()

    def flush():
        save_atomic(args.out, {
            "results": results,
            "processed": sorted(processed),
        })

    for entry in pending:
        try:
            with Image.open(entry["win_path"]) as im:
                im = ImageOps.exif_transpose(im)
                im = im.convert("RGB")
                rgb = np.array(im)
                bgr = rgb[:, :, ::-1].copy()
        except Exception as e:
            n_load_err += 1
            results.append({
                "wsl_path": entry["wsl_path"], "faceset": entry["faceset"], "file": entry["file"],
                "face_count": -1, "error": "load",
            })
            processed.add(entry["wsl_path"])
            n_done += 1
            continue

        faces = app.get(bgr)
        kept = 0
        for f in faces:
            if float(f.det_score) < MIN_DET:
                continue
            x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
            short = min(max(x2 - x1, 0), max(y2 - y1, 0))
            if short < MIN_FACE_PIX:
                continue
            kept += 1

        results.append({
            "wsl_path": entry["wsl_path"], "faceset": entry["faceset"], "file": entry["file"],
            "face_count": kept,
        })
        processed.add(entry["wsl_path"])
        n_done += 1

        if (n_done % FLUSH_EVERY == 0) or (time.time() - last_flush) > 30.0:
            flush()
            last_flush = time.time()
            elapsed = time.time() - t_start
            rate = n_done / max(0.1, elapsed)
            eta = (len(pending) - n_done) / max(0.1, rate) / 60.0
            print(f"[scan] {n_done}/{len(pending)} rate={rate:.2f} img/s eta={eta:.1f}min "
                  f"load_err={n_load_err}", flush=True)

    flush()
    elapsed = time.time() - t_start
    print(f"[done] {n_done} scored, {n_load_err} load errors, {elapsed:.1f}s "
          f"({n_done/max(0.1,elapsed):.2f} img/s) -> {args.out}", flush=True)


if __name__ == "__main__":
    main()
Executable
+127
@@ -0,0 +1,127 @@
#!/bin/bash
# Generic chain driver for the video target preprocessing pipeline.
#
# Usage:
#   WORK=/path/to/workdir SKIP_PATTERN='ct_src_(0001[015]|005[0-9]|006[0-9])\.mp4' \
#     bash run_video_pipeline.sh > /opt/face-sets/work/logs/<name>.log 2>&1
#
# Required env vars:
#   WORK          per-batch workdir (will hold scenes/, queue.json, results.jsonl, plan.json, review/)
#
# Optional env vars:
#   INPUT_DIR     default /mnt/x/src/vd
#   OUTPUT_DIR    default /mnt/x/src/vd/ct
#   FILTER_FROM   basename floor; only files with name >= this go in (e.g. ct_src_00050.mp4)
#   SKIP_PATTERN  regex of basenames to exclude (Python `re` syntax). Applied AFTER FILTER_FROM.
#   MAX_DUR       score --max-dur (default 120)
#   IDENTITY      "yes" to enable identity tagging; default "no"
#   SIDECAR       "yes" to emit <uuid>.json provenance sidecars; default "no"

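# Example invocation (hypothetical batch name; log path follows the convention above):
#   WORK=/opt/face-sets/work/vd_batch02 FILTER_FROM=ct_src_00050.mp4 \
#     bash run_video_pipeline.sh > /opt/face-sets/work/logs/vd_batch02.log 2>&1
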
set -e

: ${WORK:?WORK env var must point at a workdir}
: ${INPUT_DIR:=/mnt/x/src/vd}
: ${OUTPUT_DIR:=/mnt/x/src/vd/ct}
: ${MAX_DUR:=120}
: ${IDENTITY:=no}
: ${SIDECAR:=no}

mkdir -p "$WORK" "$WORK/scenes"

PY_WSL=/home/peter/face_sort_env/bin/python
PY_WIN="/mnt/c/face_embed_venv/Scripts/python.exe"
PIPELINE=/opt/face-sets/work/video_target_pipeline.py
WORKER=/opt/face-sets/work/video_face_worker.py
INVENTORY_FULL=/opt/face-sets/work/video_preprocess/inventory_full.json

ts() { date +"%Y-%m-%d %H:%M:%S"; }
log() { echo "[$(ts)] [$PHASE] $*"; }

PHASE="setup"
log "STARTED — host=$(hostname) pid=$$ work=$WORK"
log "config: input=$INPUT_DIR output=$OUTPUT_DIR filter_from=${FILTER_FROM:-<none>} skip_pattern=${SKIP_PATTERN:-<none>} max_dur=$MAX_DUR identity=$IDENTITY sidecar=$SIDECAR"

PHASE="inventory"
log "building subset inventory"
T0=$(date +%s)
# rebuild full inventory if missing
if [ ! -f "$INVENTORY_FULL" ]; then
    log "(no full inventory cached — running fresh scan)"
    $PY_WSL $PIPELINE scan --input "$INPUT_DIR" --output-dir "$OUTPUT_DIR" --out "$INVENTORY_FULL"
fi
# NOTE: unquoted EOF, so the shell interpolates $INVENTORY_FULL / $WORK /
# $FILTER_FROM / $SKIP_PATTERN into the inline python before it runs
$PY_WSL <<EOF
import json, re
from pathlib import Path
inv = json.load(open('$INVENTORY_FULL'))
subset = list(inv['videos'])
filter_from = '${FILTER_FROM}'
skip_pat = '${SKIP_PATTERN}'
if filter_from:
    subset = [v for v in subset if Path(v['path']).name >= filter_from]
if skip_pat:
    pat = re.compile(skip_pat)
    subset = [v for v in subset if not pat.search(Path(v['path']).name)]
subset.sort(key=lambda v: v['path'])
inv['videos'] = subset
json.dump(inv, open('$WORK/inventory.json','w'), indent=2)
total_dur = sum(v.get('duration_s', 0) for v in inv['videos'] if 'error' not in v)
print(f' {len(inv["videos"])} videos, total {total_dur/3600:.2f}h input')
EOF
log "done in $(($(date +%s)-T0))s"

PHASE="scenes"
log "PySceneDetect AdaptiveDetector across all videos (cached entries skipped)"
T0=$(date +%s)
$PY_WSL $PIPELINE scenes --inventory "$WORK/inventory.json" --out-dir "$WORK/scenes"
log "done in $(($(date +%s)-T0))s"

PHASE="stage"
log "building frame queue @ 2 fps within scenes"
T0=$(date +%s)
$PY_WSL $PIPELINE stage --inventory "$WORK/inventory.json" --scenes-dir "$WORK/scenes" --out "$WORK/queue.json"
log "done in $(($(date +%s)-T0))s"

PHASE="worker"
log "Windows DML face detect+embed (resumable; the slow one)"
T0=$(date +%s)
$PY_WIN $WORKER "$WORK/queue.json" "$WORK/results.json"
log "done in $(($(date +%s)-T0))s"

PHASE="merge"
log "ingesting worker output (jsonl)"
T0=$(date +%s)
$PY_WSL $PIPELINE merge --results "$WORK/results.json" --out "$WORK/frames.json"
log "done in $(($(date +%s)-T0))s"

PHASE="track"
log "stitching detections into tracks"
T0=$(date +%s)
$PY_WSL $PIPELINE track --frames "$WORK/frames.json" --scenes-dir "$WORK/scenes" \
    --inventory "$WORK/inventory.json" --out "$WORK/tracks.json"
log "done in $(($(date +%s)-T0))s"

PHASE="score"
log "scoring with relaxed gates + max-dur=$MAX_DUR identity=$IDENTITY"
T0=$(date +%s)
ID_FLAG=""
if [ "$IDENTITY" != "yes" ]; then ID_FLAG="--no-identity"; fi
$PY_WSL $PIPELINE score --tracks "$WORK/tracks.json" --inventory "$WORK/inventory.json" \
    --out "$WORK/plan.json" --max-dur "$MAX_DUR" $ID_FLAG
log "done in $(($(date +%s)-T0))s"

PHASE="cut"
log "ffmpeg stream-copy into per-source subfolders (no --clean)"
T0=$(date +%s)
SIDECAR_FLAG=""
if [ "$SIDECAR" = "yes" ]; then SIDECAR_FLAG="--write-sidecar"; fi
$PY_WSL $PIPELINE cut --plan "$WORK/plan.json" --output-dir "$OUTPUT_DIR" $SIDECAR_FLAG
log "done in $(($(date +%s)-T0))s"

PHASE="report"
log "rendering HTML"
T0=$(date +%s)
$PY_WSL $PIPELINE report --plan "$WORK/plan.json" --output-dir "$OUTPUT_DIR" --out "$WORK/review"
log "done in $(($(date +%s)-T0))s"

PHASE="done"
log "PIPELINE COMPLETE — review at file://$WORK/review/index.html"
Executable
+32
@@ -0,0 +1,32 @@
#!/bin/bash
# Generic status helper for run_video_pipeline.sh.
# Usage: bash status_video_pipeline.sh <log_file>
# Defaults to /opt/face-sets/work/logs/video_run.log if no arg.

LOG="${1:-/opt/face-sets/work/logs/video_run.log}"

if [ ! -f "$LOG" ]; then
    echo "no log at $LOG yet"
    exit 0
fi

echo "=== last 8 log lines ==="
tail -8 "$LOG"
echo

# worker progress
last=$(grep -E "^\[scan\] [0-9]+/[0-9]+" "$LOG" | tail -1)
if [ -n "$last" ]; then
    echo "=== DML worker progress ==="
    echo " $last"
fi

# total elapsed
start_epoch=$(head -1 "$LOG" | sed 's/.*\[\(.*\)\].*\[setup\].*/\1/' | xargs -I{} date -d "{}" +%s 2>/dev/null)
now_epoch=$(date +%s)
if [ -n "$start_epoch" ]; then
    elapsed=$((now_epoch - start_epoch))
    h=$((elapsed / 3600))
    m=$(( (elapsed % 3600) / 60 ))
    echo " elapsed: ${h}h${m}m"
fi
@@ -0,0 +1,274 @@
"""Windows / DirectML video frame face worker.

Reads a queue.json from /opt/face-sets/work/video_target_pipeline.py:stage
(WSL side), each entry: {video_path, win_video_path, frame_idx, time_s,
queue_id}. Decodes frame N from the video, runs insightface FaceAnalysis,
emits per-face records (bbox, det_score, pose, embedding, face_short).

CLI:
    py -3.12 video_face_worker.py <queue.json> <out_results.json> [--limit N]

Resumable: existing entries in out_results.json with the same queue_id are
skipped.
"""

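# Illustrative shape of one output record (one JSON object per line in the
# sister .jsonl; pose/embedding only present when insightface supplies them):
#   {"queue_id": "q00000042", "video_path": "/mnt/x/src/vd/<name>.mp4",
#    "frame_idx": 120, "time_s": 4.0, "frame_w": 1920, "frame_h": 1080,
#    "faces": [{"bbox": [x1, y1, x2, y2], "det_score": 0.87,
#               "face_short": 212, "pose": [pitch, yaw, roll],
#               "embedding": [...512 floats...]}]}
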
from __future__ import annotations

import argparse
import json
import os
import sys
import time
from pathlib import Path

import numpy as np
import cv2
from insightface.app import FaceAnalysis

MODEL_ROOT = r"C:\face_embed_venv\models"
MIN_DET = 0.5
MIN_FACE_PIX = 40
FLUSH_EVERY = 100


def jsonl_path_for(out_path: Path) -> Path:
    """Sister JSONL file: one result-record per line, append-only."""
    return out_path.with_suffix(".jsonl")


def load_existing(out_path: Path):
    """Load existing results from .jsonl (preferred) or legacy .json (one-time conversion).
    Returns (records_list, processed_set)."""
    jsonl = jsonl_path_for(out_path)
    if jsonl.exists():
        records = []
        processed = set()
        with open(jsonl) as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue
                try:
                    r = json.loads(line)
                    records.append(r)
                    if r.get("queue_id"):
                        processed.add(r["queue_id"])
                except json.JSONDecodeError:
                    print(f"[warn] {jsonl}:{line_num} corrupt; skipping", file=sys.stderr)
        return records, processed
    # legacy JSON support: load once, convert to JSONL
    if out_path.exists():
        try:
            d = json.loads(out_path.read_text())
            records = d.get("results", [])
            processed = set(d.get("processed", []))
            print(f"[migrate] converting {len(records)} legacy JSON records to JSONL", file=sys.stderr)
            with open(jsonl, "w") as f:
                for r in records:
                    f.write(json.dumps(r) + "\n")
            return records, processed
        except Exception as e:
            print(f"[warn] could not parse {out_path}: {e}; starting fresh", file=sys.stderr)
    return [], set()


def append_records(out_path: Path, new_records: list):
    """Append-only write to the sister .jsonl file. No re-serialization of prior records."""
    if not new_records:
        return
    jsonl = jsonl_path_for(out_path)
    with open(jsonl, "a") as f:
        for r in new_records:
            f.write(json.dumps(r) + "\n")


def write_compat_summary(out_path: Path, total_records: int, processed: set):
    """Write a tiny JSON pointer file at the legacy out_path so older consumers
    still see *something*, but the canonical store is the .jsonl. Cheap."""
    summary = {
        "_format": "jsonl-pointer",
        "_jsonl": str(jsonl_path_for(out_path).name),
        "results_count": total_records,
        "processed_count": len(processed),
    }
    tmp = out_path.with_suffix(".tmp.json")
    tmp.write_text(json.dumps(summary, indent=2))
    os.replace(tmp, out_path)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("queue", type=Path)
    ap.add_argument("out", type=Path)
    ap.add_argument("--limit", type=int, default=None)
    args = ap.parse_args()

    queue = json.loads(args.queue.read_text())
    print(f"[queue] {len(queue)} entries from {args.queue}", flush=True)
    args.out.parent.mkdir(parents=True, exist_ok=True)

    results, processed = load_existing(args.out)
    if processed:
        print(f"[resume] {len(processed)} already scored", flush=True)

    pending = [e for e in queue if e["queue_id"] not in processed]
    if args.limit is not None:
        pending = pending[: args.limit]
    print(f"[pending] {len(pending)} entries", flush=True)
    if not pending:
        print("[done] nothing to do")
        return

    print("[load] FaceAnalysis with DmlExecutionProvider", flush=True)
    app = FaceAnalysis(
        name="buffalo_l",
        root=MODEL_ROOT,
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],
    )
    app.prepare(ctx_id=0, det_size=(640, 640))

    # group queue by video so we can keep one VideoCapture open and seek
    from collections import defaultdict
    by_video = defaultdict(list)
    for e in pending:
        by_video[e["win_video_path"]].append(e)

    n_done = 0
    n_load_err = 0
    last_flush = time.time()
    t_start = time.time()
    new_buffer: list = []

    def flush():
        # append-only: only NEW records since last flush get written. O(new_records),
        # not O(total_records). Was 11s/flush at 9k records; now <50ms.
        if new_buffer:
            append_records(args.out, new_buffer)
            new_buffer.clear()
        write_compat_summary(args.out, len(results), processed)

    for vidpath, entries in by_video.items():
        # entries are already sorted by frame_idx. Hybrid decode strategy:
        #   1. Seek ONCE to the first pending target (cheap keyframe-seek).
        #   2. Sequential cap.grab() between subsequent targets (decode without
        #      BGR conversion until we reach a target, then cap.retrieve()).
        # This avoids per-sample seek cost (the original pathology that
        # caused 1.4 fps deep in long videos) AND avoids grab-walking from
        # frame 0 on resume (the over-correction that gave 0.08 fps).
        entries.sort(key=lambda e: e["frame_idx"])
        cap = cv2.VideoCapture(vidpath)
        if not cap.isOpened():
            print(f"[err] cannot open {vidpath}", flush=True)
            for e in entries:
                rec = {
                    "queue_id": e["queue_id"], "video_path": e["video_path"],
                    "frame_idx": e["frame_idx"], "time_s": e["time_s"],
                    "faces": [], "error": "cap_open",
                }
                results.append(rec); new_buffer.append(rec)
                processed.add(e["queue_id"])
                n_done += 1
                n_load_err += 1
            continue
        first_target = entries[0]["frame_idx"]
        if first_target > 0:
            cap.set(cv2.CAP_PROP_POS_FRAMES, first_target)
            cur_frame_idx = first_target - 1
        else:
            cur_frame_idx = -1
        for e in entries:
            target = e["frame_idx"]
            if target < cur_frame_idx + 1:
                # backward jump (only triggers for unsorted entries — defensive)
                cap.set(cv2.CAP_PROP_POS_FRAMES, target)
                cur_frame_idx = target - 1
            # advance up to (but not including) target via grab()-only
            ran_out = False
            while cur_frame_idx + 1 < target:
                ok = cap.grab()
                if not ok:
                    ran_out = True
                    break
                cur_frame_idx += 1
            if not ran_out:
                ok = cap.grab()
                if not ok:
                    ran_out = True
                else:
                    cur_frame_idx = target
            if ran_out:
                rec = {
                    "queue_id": e["queue_id"], "video_path": e["video_path"],
                    "frame_idx": e["frame_idx"], "time_s": e["time_s"],
                    "faces": [], "error": "frame_read",
                }
                results.append(rec); new_buffer.append(rec)
                processed.add(e["queue_id"])
                n_done += 1
                n_load_err += 1
                continue
            ok, bgr = cap.retrieve()
            if not ok or bgr is None:
                rec = {
                    "queue_id": e["queue_id"], "video_path": e["video_path"],
                    "frame_idx": e["frame_idx"], "time_s": e["time_s"],
                    "faces": [], "error": "frame_read",
                }
                results.append(rec); new_buffer.append(rec)
                processed.add(e["queue_id"])
                n_done += 1
                n_load_err += 1
                continue

            faces = app.get(bgr)
            kept_faces = []
            H, W = bgr.shape[:2]
            for f in faces:
                if float(f.det_score) < MIN_DET:
                    continue
                x1, y1, x2, y2 = [int(round(v)) for v in f.bbox]
                x1 = max(x1, 0); y1 = max(y1, 0)
                x2 = min(x2, W); y2 = min(y2, H)
                w, h = x2 - x1, y2 - y1
                short = min(w, h)
                if short < MIN_FACE_PIX:
                    continue
                rec = {
                    "bbox": [x1, y1, x2, y2],
                    "det_score": float(f.det_score),
                    "face_short": int(short),
                }
                if hasattr(f, "pose") and f.pose is not None:
                    rec["pose"] = [float(x) for x in f.pose]  # pitch, yaw, roll
                if hasattr(f, "normed_embedding") and f.normed_embedding is not None:
                    rec["embedding"] = f.normed_embedding.astype(np.float32).tolist()
                kept_faces.append(rec)

            rec = {
                "queue_id": e["queue_id"], "video_path": e["video_path"],
                "frame_idx": e["frame_idx"], "time_s": e["time_s"],
                "frame_w": W, "frame_h": H,
                "faces": kept_faces,
            }
            results.append(rec); new_buffer.append(rec)
            processed.add(e["queue_id"])
            n_done += 1

            if (n_done % FLUSH_EVERY == 0) or (time.time() - last_flush) > 30.0:
                flush()
                last_flush = time.time()
                el = time.time() - t_start
                rate = n_done / max(0.1, el)
                eta = (len(pending) - n_done) / max(0.1, rate) / 60.0
                print(f"[scan] {n_done}/{len(pending)} rate={rate:.2f} fps eta={eta:.1f}min "
                      f"errs={n_load_err}", flush=True)
        cap.release()

    flush()
    el = time.time() - t_start
    print(f"[done] {n_done} scored, {n_load_err} errors, {el:.1f}s "
          f"({n_done/max(0.1,el):.2f} fps) -> {args.out}", flush=True)


if __name__ == "__main__":
    main()
@@ -0,0 +1,919 @@
"""Video target preprocessing pipeline for roop-unleashed.

Discovers video files in an input folder, runs scene-cut detection, samples
frames within each scene, runs face detection + embedding via Windows DML
worker, stitches per-frame detections into face tracks, applies quality
gates, cuts approved segments out with ffmpeg stream-copy, and writes a
report. Output clips have generic UUID names + a sidecar JSON with full
provenance.

Subcommands:
    scan     list input videos, run ffprobe, write per-video index
    scenes   PySceneDetect AdaptiveDetector per video; write scenes_<basename>.json
    stage    write frame queue.json (sampled @ 2 fps within scenes)
    merge    ingest worker results.json into per-video frame_results
    track    IoU+embedding stitching of per-frame detections into tracks
    score    track-level quality gating + segment plan
    cut      ffmpeg -c copy each accepted segment to <out_dir>/<uuid>.mp4
    report   HTML preview with thumbnails + identity tags
"""

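# Typical end-to-end order (mirrors run_video_pipeline.sh; flags elided here):
#   scan -> scenes -> stage -> (Windows side: video_face_worker.py on queue.json)
#   -> merge -> track -> score -> cut -> report
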
from __future__ import annotations

import argparse
import json
import math
import re
import shutil
import subprocess
import sys
import time
import uuid
from collections import defaultdict
from pathlib import Path

import numpy as np

DEFAULT_INPUT = Path("/mnt/x/src/vd")
DEFAULT_OUTPUT = Path("/mnt/x/src/vd/ct")
WORK_DIR = Path("/opt/face-sets/work/video_preprocess")

# defaults — first set was strict-portrait; second set loosened for side-profile + segment merging
SAMPLE_FPS = 2.0
QUALITY_YAW_MAX = 75.0      # was 25; allow full 3/4 + profile (face-sets handle it)
QUALITY_PITCH_MAX = 45.0    # was 30
QUALITY_FACE_MIN = 80       # was 96
QUALITY_BLUR_MIN = 50.0
QUALITY_DET_MIN = 0.5       # was 0.6
TRACK_GATE_FRAC = 0.7       # >=70% of frames in track must pass per-frame gates
SEGMENT_MIN_S = 1.0
SEGMENT_MAX_S = 30.0        # was 10
SEGMENT_BRIDGE_S = 3.0      # was 1.0 — within-track pose-failure bridging
SEGMENT_MERGE_GAP_S = 2.0   # NEW — across-track merge if same scene + within this gap
TRACK_IOU_MIN = 0.3
TRACK_EMB_MIN = 0.5

CACHES = [
    Path("/opt/face-sets/work/cache/nl_full.npz"),
    Path("/opt/face-sets/work/cache/immich_peter.npz"),
    Path("/opt/face-sets/work/cache/immich_nic.npz"),
]
FACESETS_ROOT = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
IDENTITY_TAG_THRESHOLD = 0.6  # cosine sim to faceset centroid


def wsl_to_win(p: str) -> str:
    s = str(p)
    if s.startswith("/mnt/"):
        return f"{s[5].upper()}:\\{s[7:].replace('/', chr(92))}"
    return s
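# e.g. wsl_to_win("/mnt/x/src/vd/a.mp4") -> r"X:\src\vd\a.mp4" ("a.mp4" is just
# an illustrative name); non-/mnt/ paths pass through unchanged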


# ----------------------------- ffprobe / scan -----------------------------

def ffprobe(video: Path) -> dict:
    cmd = [
        "ffprobe", "-v", "error", "-print_format", "json",
        "-show_format", "-show_streams", str(video),
    ]
    r = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    if r.returncode != 0:
        return {"error": r.stderr.strip()}
    return json.loads(r.stdout)


def parse_video_meta(probe: dict) -> dict:
    if "error" in probe:
        return {"error": probe["error"]}
    fmt = probe.get("format", {})
    duration = float(fmt.get("duration", 0))
    vstream = next((s for s in probe.get("streams", []) if s.get("codec_type") == "video"), None)
    if vstream is None:
        return {"error": "no video stream"}
    fps_str = vstream.get("avg_frame_rate", "0/1")
    try:
        num, den = (int(x) for x in fps_str.split("/"))
        fps = num / den if den else 0.0
    except Exception:
        fps = 0.0
    nb_frames = int(vstream.get("nb_frames", 0)) or int(round(duration * fps))
    return {
        "duration_s": duration,
        "fps": fps,
        "frames": nb_frames,
        "width": int(vstream.get("width", 0)),
        "height": int(vstream.get("height", 0)),
        "codec": vstream.get("codec_name"),
    }


def cmd_scan(args):
    in_dir = Path(args.input)
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    extensions = {".mp4", ".mov", ".mkv", ".m4v", ".avi", ".webm"}
    out_root = Path(args.output_dir).resolve()
    videos = []
    for p in sorted(in_dir.iterdir() if not args.recursive else in_dir.rglob("*")):
        if not p.is_file():
            continue
        if out_root in p.parents or p.resolve() == out_root:
            continue  # never include the output dir
        if p.suffix.lower() not in extensions:
            continue
        videos.append(p)
    print(f"[scan] {len(videos)} candidate videos", file=sys.stderr)
    inventory = []
    for p in videos:
        meta = parse_video_meta(ffprobe(p))
        meta["path"] = str(p)
        meta["win_path"] = wsl_to_win(str(p))
        meta["size"] = p.stat().st_size
        inventory.append(meta)
        if "error" not in meta:
            print(f"  {p.name}: {meta['duration_s']:.1f}s @ {meta['fps']:.1f}fps "
                  f"{meta['width']}x{meta['height']} {meta['codec']}", file=sys.stderr)
        else:
            print(f"  {p.name}: ERROR {meta['error']}", file=sys.stderr)
    out.write_text(json.dumps({"input": str(in_dir), "videos": inventory}, indent=2))
    print(f"[scan] inventory -> {out}", file=sys.stderr)


# ----------------------------- scenes -----------------------------

def cmd_scenes(args):
    from scenedetect import open_video, SceneManager
    from scenedetect.detectors import AdaptiveDetector
    inv = json.loads(Path(args.inventory).read_text())
    out_dir = Path(args.out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    only = set(args.only.split(",")) if args.only else None
    for v in inv["videos"]:
        if "error" in v:
            continue
        path = Path(v["path"])
        if only and path.name not in only:
            continue
        out_file = out_dir / (path.stem + ".scenes.json")
        if out_file.exists() and not args.force:
            continue
        print(f"[scenes] {path.name} ...", file=sys.stderr, flush=True)
        t0 = time.time()
        try:
            video = open_video(str(path))
            sm = SceneManager()
            sm.add_detector(AdaptiveDetector(min_scene_len=int(round(v.get("fps", 30) or 30) * 0.5)))
            sm.detect_scenes(video, show_progress=False)
            scenes = sm.get_scene_list()
            entries = []
            for s, e in scenes:
                entries.append({
                    "start_frame": s.frame_num, "end_frame": e.frame_num,
                    "start_s": s.get_seconds(), "end_s": e.get_seconds(),
                    "duration_s": e.get_seconds() - s.get_seconds(),
                })
            # if no cuts found, treat the whole video as one scene
            if not entries:
                entries = [{
                    "start_frame": 0, "end_frame": v["frames"],
                    "start_s": 0.0, "end_s": v["duration_s"],
                    "duration_s": v["duration_s"],
                }]
            out_file.write_text(json.dumps({"video": str(path), "scenes": entries}, indent=2))
            print(f"  {len(entries)} scenes in {time.time()-t0:.1f}s -> {out_file.name}",
                  file=sys.stderr)
        except Exception as e:
            print(f"  ERROR: {e}", file=sys.stderr)


# ----------------------------- stage -----------------------------

def cmd_stage(args):
    inv = json.loads(Path(args.inventory).read_text())
    scenes_dir = Path(args.scenes_dir)
    queue = []
    qid = 0
    sample_every = 1.0 / args.sample_fps
    for v in inv["videos"]:
        if "error" in v:
            continue
        p = Path(v["path"])
        sf = scenes_dir / (p.stem + ".scenes.json")
        if not sf.exists():
            print(f"[warn] no scenes file for {p.name}; skipping", file=sys.stderr)
            continue
        scenes = json.loads(sf.read_text()).get("scenes", [])
        fps = v.get("fps", 30) or 30
        for sc in scenes:
            t = sc["start_s"]
            while t < sc["end_s"] - 0.01:
                fidx = int(round(t * fps))
                if fidx >= v["frames"]:
                    break
                queue.append({
                    "queue_id": f"q{qid:08d}",
                    "video_path": str(p),
                    "win_video_path": v["win_path"],
                    "frame_idx": fidx,
                    "time_s": t,
                })
                qid += 1
                t += sample_every
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(queue, indent=2))
    print(f"[stage] {len(queue)} sampled frames @ {args.sample_fps} fps -> {out}",
          file=sys.stderr)
    print(f"[stage] win path for worker: {wsl_to_win(str(out))}", file=sys.stderr)


# ----------------------------- merge + track -----------------------------

def cmd_merge(args):
    """Read worker output and group by video_path. Supports either JSONL (one record
    per line, the new format) or legacy JSON (results.json with `results` list)."""
    src_path = Path(args.results)
    records = []
    # try JSONL first (sister .jsonl file, or a .jsonl path passed directly)
    jsonl_candidate = src_path.with_suffix(".jsonl")
    if jsonl_candidate.exists():
        with open(jsonl_candidate) as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    elif src_path.suffix == ".jsonl":
        with open(src_path) as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    else:
        # legacy: monolithic JSON
        src = json.loads(src_path.read_text())
        records = src.get("results", [])
    by_video: dict[str, list] = {}
    for r in records:
        by_video.setdefault(r["video_path"], []).append(r)
    for v in by_video:
        by_video[v].sort(key=lambda x: x["frame_idx"])
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"by_video": by_video}, indent=2))
    print(f"[merge] {sum(len(v) for v in by_video.values())} frames across {len(by_video)} videos "
          f"-> {out}", file=sys.stderr)


def _iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1 = max(ax1, bx1); iy1 = max(ay1, by1)
    ix2 = min(ax2, bx2); iy2 = min(ay2, by2)
    iw = max(ix2 - ix1, 0); ih = max(iy2 - iy1, 0)
    inter = iw * ih
    ua = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / ua if ua > 0 else 0.0
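# worked example: _iou([0, 0, 10, 10], [5, 0, 15, 10]) -> inter=50, union=150 -> ~0.333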


def cmd_track(args):
    """Stitch per-frame face detections into tracks within each scene of each video.
    Track = list of (frame_idx, face_idx) where adjacent samples have IoU>=0.3 OR
    cosine(emb)>=0.5. New face -> new track. No cross-scene merging."""
    fr = json.loads(Path(args.frames).read_text())
    scenes_dir = Path(args.scenes_dir)
    inv = json.loads(Path(args.inventory).read_text())
    inv_by_path = {v["path"]: v for v in inv["videos"]}

    all_video_tracks: dict[str, list] = {}
    for video_path, frames in fr["by_video"].items():
        v = inv_by_path.get(video_path, {})
        sf = scenes_dir / (Path(video_path).stem + ".scenes.json")
        scenes = json.loads(sf.read_text()).get("scenes", []) if sf.exists() else []
        # group frames by scene
        scene_for_frame = {}
        for si, sc in enumerate(scenes):
            for f in frames:
                if f["frame_idx"] >= sc["start_frame"] and f["frame_idx"] < sc["end_frame"]:
                    scene_for_frame.setdefault(si, []).append(f)
        video_tracks = []
        for si, scene_frames in scene_for_frame.items():
            scene_frames.sort(key=lambda x: x["frame_idx"])
            # tracks = list of dict{ "members": [(frame_idx, face_idx, face_dict, time_s)],
            #                        "last_bbox", "last_emb" }
            tracks = []
            for f in scene_frames:
                claimed = set()
                for face_idx, face in enumerate(f.get("faces", [])):
                    bbox = face["bbox"]
                    emb = np.array(face.get("embedding", []), dtype=np.float32) if face.get("embedding") else None
                    best_track = None
                    best_score = 0.0
                    for ti, tr in enumerate(tracks):
                        if ti in claimed:
                            continue
                        # staleness in TIME (sample period independent of source fps)
                        last_time = tr["members"][-1][3]
                        if f["time_s"] - last_time > 1.5:  # stale if >1.5s gap (3 sample periods @ 2fps)
                            continue
                        score = _iou(tr["last_bbox"], bbox)
                        if emb is not None and tr.get("last_emb") is not None:
                            score = max(score, float(np.dot(tr["last_emb"], emb)))
                        if score > best_score:
                            best_score = score
                            best_track = ti
                    if best_track is not None and best_score >= min(TRACK_IOU_MIN, TRACK_EMB_MIN):
                        tr = tracks[best_track]
                        tr["members"].append((f["frame_idx"], face_idx, face, f["time_s"]))
                        tr["last_bbox"] = bbox
                        if emb is not None:
                            tr["last_emb"] = emb
                        claimed.add(best_track)
                    else:
                        tracks.append({
                            "members": [(f["frame_idx"], face_idx, face, f["time_s"])],
                            "last_bbox": bbox,
                            "last_emb": emb,
                        })
            for tr in tracks:
                if len(tr["members"]) < 2:
                    continue
                video_tracks.append({
                    "scene_idx": si,
                    "members": [
                        {"frame_idx": m[0], "face_idx": m[1], "time_s": m[3], "face": m[2]}
                        for m in tr["members"]
                    ],
                })
        all_video_tracks[video_path] = video_tracks
        print(f"[track] {Path(video_path).name}: {sum(len(s) for s in scene_for_frame.values())} frames "
              f"-> {len(video_tracks)} tracks across {len(scene_for_frame)} scenes",
              file=sys.stderr)

    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"by_video": all_video_tracks}, indent=2))
    print(f"[track] -> {out}", file=sys.stderr)


# ----------------------------- score (quality gates) -----------------------------

def _track_passes(track, cfg):
    """Per-frame quality gating; return list of bool (does each member pass) +
    aggregate stats. cfg: dict with yaw_max, pitch_max, face_min, det_min."""
    passes = []
    yaws, pitches, sizes, dets = [], [], [], []
    for m in track["members"]:
        f = m["face"]
        yaw = abs(f.get("pose", [0, 0, 0])[1]) if f.get("pose") else 0
        pitch = abs(f.get("pose", [0, 0, 0])[0]) if f.get("pose") else 0
        size = f.get("face_short", 0)
        det = f.get("det_score", 0)
        ok = (yaw <= cfg["yaw_max"] and pitch <= cfg["pitch_max"]
              and size >= cfg["face_min"] and det >= cfg["det_min"])
        passes.append(ok)
        yaws.append(yaw); pitches.append(pitch); sizes.append(size); dets.append(det)
    return passes, {
        "n": len(passes), "n_pass": sum(passes), "frac_pass": sum(passes) / max(1, len(passes)),
        "yaw_med": float(np.median(yaws)) if yaws else None,
        "pitch_med": float(np.median(pitches)) if pitches else None,
        "size_med": float(np.median(sizes)) if sizes else None,
        "det_med": float(np.median(dets)) if dets else None,
    }


def _build_segments(track, cfg):
    """Return list of (start_s, end_s) accepted sub-segments of this track:
    contiguous runs of passing frames meeting min/max duration. Pose-failure
    spans <= cfg['bridge_s'] long get bridged across (handles momentary head
    turns / detection misses)."""
    passes, stats = _track_passes(track, cfg)
    members = track["members"]
    if not members:
        return [], stats
    # bridge gaps of failing frames (any width) up to cfg["bridge_s"] seconds
    bridged = list(passes)
    n = len(bridged)
    i = 0
    while i < n:
        if bridged[i]:
            i += 1
            continue
        # find run of consecutive False starting at i
        j = i
        while j < n and not bridged[j]:
            j += 1
        # bridge if surrounded by True on both sides AND time gap <= bridge_s
        if i > 0 and j < n and bridged[i - 1] and bridged[j]:
            t_left = members[i - 1]["time_s"]
            t_right = members[j]["time_s"]
            if t_right - t_left <= cfg["bridge_s"]:
                for k in range(i, j):
                    bridged[k] = True
        i = j
    # find runs of True
    runs = []
    i = 0
    while i < n:
        if not bridged[i]:
            i += 1; continue
        j = i
        while j + 1 < n and bridged[j + 1]:
            j += 1
        s = members[i]["time_s"]
        # end is the time of the last passing sample plus one sample-period
        e = members[j]["time_s"] + 1.0 / max(SAMPLE_FPS, 1e-3)
        runs.append((s, e))
        i = j + 1
    return runs, stats


def _merge_close_segments(segs_with_meta, merge_gap_s: float):
    """Merge segments within the same scene that are within merge_gap_s of each other.
    segs_with_meta: list of dicts with start_s, end_s, scene_idx, track_idx, stats.
    Returns list of merged dicts (one per merged group). Identity-tag and stats
    aggregation happen later."""
    by_scene: dict[int, list] = {}
    for s in segs_with_meta:
        by_scene.setdefault(s["scene_idx"], []).append(s)
    merged_all = []
    for scene_idx, segs in by_scene.items():
        segs.sort(key=lambda x: x["start_s"])
        cur = None
        for s in segs:
            if cur is None:
                cur = {**s, "track_idxs": [s["track_idx"]], "member_count": s["stats"]["n"],
                       "pass_count": s["stats"]["n_pass"]}
                continue
            gap = s["start_s"] - cur["end_s"]
            if gap <= merge_gap_s:
                # merge
                cur["end_s"] = max(cur["end_s"], s["end_s"])
                cur["track_idxs"].append(s["track_idx"])
                cur["member_count"] += s["stats"]["n"]
                cur["pass_count"] += s["stats"]["n_pass"]
                # take the better-quality stats for display
                if s["stats"]["n_pass"] > cur["stats"]["n_pass"]:
                    cur["stats"] = s["stats"]
            else:
                merged_all.append(cur)
                cur = {**s, "track_idxs": [s["track_idx"]], "member_count": s["stats"]["n"],
                       "pass_count": s["stats"]["n_pass"]}
        if cur is not None:
            merged_all.append(cur)
    return merged_all


def _split_long_segments(segs_with_meta, min_s: float, max_s: float):
    """Apply min/max duration: drop too-short, split too-long evenly."""
    out = []
    for s in segs_with_meta:
        dur = s["end_s"] - s["start_s"]
        if dur < min_s:
            continue
        if dur <= max_s:
            out.append(s)
            continue
        n = int(math.ceil(dur / max_s))
        chunk = dur / n
        base_start = s["start_s"]
        for k in range(n):
            piece = dict(s)
            piece["start_s"] = base_start + k * chunk
            piece["end_s"] = base_start + (k + 1) * chunk
            out.append(piece)
    return out


# identity tagging via cached arcface centroids
def load_caches_index():
    rec_index = {}
    alias_map = {}
    for c in CACHES:
        if not c.exists():
            continue
        d = np.load(c, allow_pickle=True)
        emb = d["embeddings"]
        meta = json.loads(str(d["meta"]))
        face_records = [m for m in meta if not m.get("noface")]
        if "path_aliases" in d.files:
            paliases = json.loads(str(d["path_aliases"]))
            for canon, alist in paliases.items():
                alias_map.setdefault(canon, canon)
                for a in alist:
                    alias_map[a] = canon
        for i, rec in enumerate(face_records):
            v = emb[i].astype(np.float32)
            n = float(np.linalg.norm(v))
            if n > 0:
                v = v / n
            rec_index[(rec["path"], tuple(int(x) for x in rec["bbox"]))] = v
            alias_map.setdefault(rec["path"], rec["path"])
    return rec_index, alias_map


def load_faceset_centroids():
    """Return dict faceset_name -> normalized centroid embedding."""
    rec_index, alias_map = load_caches_index()
    centroids = {}
    for fs_dir in sorted(FACESETS_ROOT.iterdir()):
        if not fs_dir.is_dir() or fs_dir.name.startswith("_"):
            continue
        # exclude era splits to avoid double-tagging within a family
        if re.match(r"^faceset_\d+_(?:\d{4}-\d{2,4}|\d{4}|undated)", fs_dir.name):
            continue
        mp = fs_dir / "manifest.json"
        if not mp.exists():
            continue
        m = json.loads(mp.read_text())
        vecs = []
        for f in m.get("faces", []):
            src = f.get("source"); bbox = f.get("bbox")
            if not src or not bbox:
                continue
            canon = alias_map.get(src, src)
            v = rec_index.get((canon, tuple(int(x) for x in bbox)))
            if v is None and canon != src:
                v = rec_index.get((src, tuple(int(x) for x in bbox)))
            if v is not None:
                vecs.append(v)
        if len(vecs) < 3:
            continue
        c = np.stack(vecs).mean(axis=0)
        n = float(np.linalg.norm(c))
        if n > 0:
            c = c / n
        centroids[fs_dir.name] = c
    return centroids


def _track_centroid(track):
    embs = [m["face"].get("embedding") for m in track["members"] if m["face"].get("embedding")]
    if not embs:
        return None
    arr = np.array(embs, dtype=np.float32)
    c = arr.mean(axis=0)
    n = float(np.linalg.norm(c))
    return c / n if n > 0 else c


def cmd_score(args):
    tr = json.loads(Path(args.tracks).read_text())
    inv = json.loads(Path(args.inventory).read_text())
    inv_by_path = {v["path"]: v for v in inv["videos"]}

    cfg = {
        "yaw_max": args.max_yaw, "pitch_max": args.max_pitch,
        "face_min": args.min_face, "det_min": args.min_det,
        "bridge_s": args.bridge_gap,
    }

    centroids = {}
    if not args.no_identity:
        print("[score] loading faceset centroids ...", file=sys.stderr)
        t0 = time.time()
        centroids = load_faceset_centroids()
        print(f"[score] {len(centroids)} active faceset centroids loaded in {time.time()-t0:.1f}s",
              file=sys.stderr)

    n_total_tracks = 0
    n_accepted_tracks = 0
    # collect per-track candidate segments first; merging happens per-video below
    per_video_candidates: dict[str, list] = {}
    track_centroids_by_video: dict[str, dict] = {}
    for video_path, tracks in tr["by_video"].items():
        per_video_candidates.setdefault(video_path, [])
        track_centroids_by_video.setdefault(video_path, {})
        for ti, track in enumerate(tracks):
            n_total_tracks += 1
            runs, stats = _build_segments(track, cfg)
            if stats["frac_pass"] < args.track_gate_frac:
                continue
            if not runs:
                continue
            n_accepted_tracks += 1
            track_centroids_by_video[video_path][ti] = _track_centroid(track)
            for (s, e) in runs:
                per_video_candidates[video_path].append({
                    "video_path": video_path,
                    "track_idx": ti,
                    "scene_idx": track["scene_idx"],
                    "start_s": s,
                    "end_s": e,
                    "stats": stats,
                })

    plan = []
    for video_path, segs in per_video_candidates.items():
        if not segs:
            continue
        # merge across tracks within the same scene if gap <= merge_gap_s
        merged = _merge_close_segments(segs, args.merge_gap)
        # apply min/max duration (split long, drop short)
        merged = _split_long_segments(merged, args.min_dur, args.max_dur)
        for s in merged:
            tag = None
            tag_sim = None
            # identity from union of contributing tracks' centroids
            if centroids:
                track_centroid_list = [
                    track_centroids_by_video[video_path].get(ti)
                    for ti in s.get("track_idxs", [s.get("track_idx")])
                ]
                track_centroid_list = [c for c in track_centroid_list if c is not None]
                if track_centroid_list:
                    union = np.stack(track_centroid_list).mean(axis=0)
                    nm = float(np.linalg.norm(union))
                    if nm > 0:
                        union = union / nm
                    sims = {name: float(np.dot(c, union)) for name, c in centroids.items()}
                    best = max(sims, key=sims.get)
                    if sims[best] >= IDENTITY_TAG_THRESHOLD:
                        tag = best; tag_sim = round(sims[best], 4)
            plan.append({
                "video_path": video_path,
                "track_idxs": s.get("track_idxs", [s.get("track_idx")]),
                "scene_idx": s["scene_idx"],
                "start_s": round(s["start_s"], 3),
                "end_s": round(s["end_s"], 3),
                "duration_s": round(s["end_s"] - s["start_s"], 3),
                "member_count": s.get("member_count", s["stats"]["n"]),
                "pass_count": s.get("pass_count", s["stats"]["n_pass"]),
                "stats": s["stats"],
                "identity_tag": tag,
                "identity_sim": tag_sim,
                "uuid": uuid.uuid4().hex[:12],
            })

    plan.sort(key=lambda p: (p["video_path"], p["start_s"]))
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({
        "thresholds": {
            "yaw_max": args.max_yaw, "pitch_max": args.max_pitch,
            "face_min": args.min_face, "blur_min": QUALITY_BLUR_MIN,
            "det_min": args.min_det, "track_gate_frac": args.track_gate_frac,
            "bridge_s": args.bridge_gap, "merge_gap_s": args.merge_gap,
            "min_dur_s": args.min_dur, "max_dur_s": args.max_dur,
            "identity_tag_threshold": IDENTITY_TAG_THRESHOLD,
        },
        "totals": {
            "tracks_total": n_total_tracks, "tracks_accepted": n_accepted_tracks,
            "segments": len(plan),
        },
        "plan": plan,
    }, indent=2))
    print(f"[score] {n_accepted_tracks}/{n_total_tracks} tracks accepted -> {len(plan)} segments "
          f"-> {out}", file=sys.stderr)


# ----------------------------- cut -----------------------------

def cmd_cut(args):
    plan = json.loads(Path(args.plan).read_text())
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    if args.clean:
        # remove only existing UUID-named clips + sidecars (12-char hex), keeping any other files
        import re as _re
        uuid_pat = _re.compile(r"^[0-9a-f]{12}\.(mp4|json)$")
        n_removed = 0
        for child in out_dir.iterdir():
            if child.is_file() and uuid_pat.match(child.name):
                child.unlink()
                n_removed += 1
            elif child.is_dir() and _re.match(r"^[A-Za-z0-9_.-]+$", child.name):
                # subfolder of prior runs — clear UUID files inside, then remove if empty
                for inner in child.iterdir():
                    if inner.is_file() and uuid_pat.match(inner.name):
                        inner.unlink()
                        n_removed += 1
                try:
                    child.rmdir()
                except OSError:
                    pass
        if n_removed:
            print(f"[clean] removed {n_removed} prior UUID clips/sidecars", file=sys.stderr)

    n_done = 0
    n_err = 0
    sidecars = []
    for seg in plan["plan"]:
        sub = Path(seg["video_path"]).stem
        seg_dir = out_dir / sub
        seg_dir.mkdir(parents=True, exist_ok=True)
        out_video = seg_dir / f"{seg['uuid']}.mp4"
        if out_video.exists() and not args.force:
            continue
        s = seg["start_s"]; d = seg["duration_s"]
        cmd = [
            "ffmpeg", "-y", "-loglevel", "error",
            "-ss", f"{s}",
            "-i", seg["video_path"],
            "-t", f"{d}",
            "-c", "copy",
            "-avoid_negative_ts", "make_zero",
            str(out_video),
        ]
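        # stream-copy caveat: with -ss before -i and -c copy, ffmpeg cuts on
        # keyframes, so a clip may actually begin slightly before start_s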
        r = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
        if r.returncode != 0 or not out_video.exists() or out_video.stat().st_size < 1024:
            print(f"[cut-err] {seg['uuid']} {seg['video_path']}@{s}+{d}: {r.stderr.strip()[:200]}",
                  file=sys.stderr)
            n_err += 1
            if out_video.exists() and out_video.stat().st_size < 1024:
                out_video.unlink()
            continue
        if args.write_sidecar:
            sidecar = seg_dir / f"{seg['uuid']}.json"
            sidecar.write_text(json.dumps({
                "uuid": seg["uuid"],
                "source_video": seg["video_path"],
                "source_basename": Path(seg["video_path"]).name,
                "start_s": s, "end_s": seg["end_s"], "duration_s": d,
                "scene_idx": seg["scene_idx"],
                "track_idxs": seg.get("track_idxs", [seg.get("track_idx")]),
                "member_count": seg.get("member_count"),
                "pass_count": seg.get("pass_count"),
                "stats": seg["stats"],
                "identity_tag": seg["identity_tag"],
                "identity_sim": seg["identity_sim"],
                "thresholds": plan["thresholds"],
            }, indent=2))
            sidecars.append(sidecar)
        n_done += 1
    print(f"[cut] {n_done} clips written, {n_err} errors -> {out_dir}", file=sys.stderr)


# ----------------------------- report -----------------------------

def cmd_report(args):
    plan = json.loads(Path(args.plan).read_text())
    out_dir = Path(args.out)
    out_dir.mkdir(parents=True, exist_ok=True)
    thumbs_dir = out_dir / "thumbs"
    thumbs_dir.mkdir(exist_ok=True)
    output_dir = Path(args.output_dir)

    # group by video
    by_video: dict[str, list] = {}
    for seg in plan["plan"]:
        by_video.setdefault(seg["video_path"], []).append(seg)

    # generate thumbs from each segment's first frame via ffmpeg
    print(f"[report] generating thumbs for {len(plan['plan'])} segments", file=sys.stderr)
    for seg in plan["plan"]:
        thumb = thumbs_dir / f"{seg['uuid']}.jpg"
        if thumb.exists():
            continue
        s = seg["start_s"] + 0.1
        cmd = [
            "ffmpeg", "-y", "-loglevel", "error",
            "-ss", f"{s}",
            "-i", seg["video_path"],
            "-frames:v", "1",
            "-vf", "scale=240:-1",
            str(thumb),
        ]
        subprocess.run(cmd, capture_output=True, timeout=30)

    # render
    rows = []
    rows.append("<h1>Video target preprocessing — review</h1>")
    t = plan["totals"]
    th = plan["thresholds"]
    rows.append(f"<p>Tracks accepted: {t['tracks_accepted']}/{t['tracks_total']}; "
                f"segments emitted: {t['segments']}.<br>"
                f"Thresholds: pose ≤{th['yaw_max']}°yaw / {th['pitch_max']}°pitch, "
                f"face_short ≥{th['face_min']}px, det ≥{th['det_min']}, "
                f"track-gate ≥{int(100*th['track_gate_frac'])}%, "
                f"duration {th['min_dur_s']}–{th['max_dur_s']}s. "
                f"Output dir: <code>{output_dir}</code></p>")
    nav = " · ".join(f"<a href='#v{i}'>{Path(v).name}</a>"
                     for i, v in enumerate(by_video.keys()))
    rows.append(f"<div class='nav'>{nav}</div>")
    for vi, (video_path, segs) in enumerate(by_video.items()):
        rows.append(f"<section id='v{vi}' class='vid'>")
        rows.append(f"<h2>{Path(video_path).name} <small>({len(segs)} segments)</small></h2>")
        rows.append("<div class='cells'>")
        for seg in sorted(segs, key=lambda x: x["start_s"]):
            stats = seg["stats"]
            tag = seg["identity_tag"] or ""
            tag_sim = seg["identity_sim"]
            tag_html = (f"<span class='tag'>{tag} ({tag_sim:.2f})</span>" if tag else "<span class='tag none'>untagged</span>")
            sub_name = Path(seg['video_path']).stem
            rows.append(
                f"<div class='cell'>"
                f"<a href='{output_dir}/{sub_name}/{seg['uuid']}.mp4'><img src='thumbs/{seg['uuid']}.jpg' loading='lazy'></a>"
                f"<div class='meta'>"
                f"<code>{sub_name}/{seg['uuid']}.mp4</code><br>"
                f"{seg['start_s']:.1f}s → {seg['end_s']:.1f}s ({seg['duration_s']:.1f}s)<br>"
                f"yaw={stats['yaw_med']:.0f}° size={stats['size_med']:.0f}px det={stats['det_med']:.2f}<br>"
                f"pass {stats['n_pass']}/{stats['n']}<br>"
                f"{tag_html}"
                f"</div></div>"
            )
        rows.append("</div></section>")
    html = f"""<!doctype html>
<html><head><meta charset='utf-8'><title>Video targets review</title>
<style>
body {{ font-family: system-ui, sans-serif; background:#111; color:#eee; padding:1em; }}
h1, h2 {{ margin-top: 1em; }} h2 {{ border-bottom: 1px solid #333; padding-bottom: 4px; }}
small {{ color:#999; font-weight:normal; }}
section.vid {{ background:#1a1a1a; border-radius:6px; padding:12px; margin:12px 0; }}
.cells {{ display:flex; flex-wrap:wrap; gap:8px; }}
.cell {{ background:#222; border-radius:4px; padding:6px; width:260px; font-size:11px; font-family:monospace; }}
.cell img {{ width:100%; height:auto; border-radius:3px; }}
.meta {{ padding-top:4px; line-height:1.4; }}
.tag {{ display:inline-block; padding:1px 6px; background:#5fa05f; color:#000; border-radius:2px; }}
.tag.none {{ background:#444; color:#aaa; }}
.nav {{ position:sticky; top:0; background:#111; padding:.5em 0; border-bottom:1px solid #333; font-size:12px; }}
a {{ color:#6cf; }}
code {{ background:#000; padding:1px 4px; border-radius:2px; }}
</style></head>
<body>
{''.join(rows)}
</body></html>"""
    out_html = out_dir / "index.html"
    out_html.write_text(html)
    print(f"[report] -> {out_html}", file=sys.stderr)


# ----------------------------- main -----------------------------

def main():
    ap = argparse.ArgumentParser()
    sub = ap.add_subparsers(dest="cmd", required=True)

    s = sub.add_parser("scan")
    s.add_argument("--input", default=str(DEFAULT_INPUT))
    s.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
    s.add_argument("--recursive", action="store_true")
    s.add_argument("--out", required=True)
    s.set_defaults(func=cmd_scan)

    sc = sub.add_parser("scenes")
    sc.add_argument("--inventory", required=True)
    sc.add_argument("--out-dir", required=True)
    sc.add_argument("--only", default=None, help="comma-separated basenames to limit run")
    sc.add_argument("--force", action="store_true")
    sc.set_defaults(func=cmd_scenes)

    st = sub.add_parser("stage")
    st.add_argument("--inventory", required=True)
    st.add_argument("--scenes-dir", required=True)
    st.add_argument("--sample-fps", type=float, default=SAMPLE_FPS)
    st.add_argument("--out", required=True)
    st.set_defaults(func=cmd_stage)

    m = sub.add_parser("merge")
    m.add_argument("--results", required=True)
    m.add_argument("--out", required=True)
    m.set_defaults(func=cmd_merge)

    tr = sub.add_parser("track")
    tr.add_argument("--frames", required=True)
    tr.add_argument("--scenes-dir", required=True)
    tr.add_argument("--inventory", required=True)
    tr.add_argument("--sample-fps", type=float, default=SAMPLE_FPS)
    tr.add_argument("--out", required=True)
    tr.set_defaults(func=cmd_track)

    sc2 = sub.add_parser("score")
    sc2.add_argument("--tracks", required=True)
    sc2.add_argument("--inventory", required=True)
    sc2.add_argument("--out", required=True)
    sc2.add_argument("--no-identity", action="store_true")
    sc2.add_argument("--max-yaw", type=float, default=QUALITY_YAW_MAX)
    sc2.add_argument("--max-pitch", type=float, default=QUALITY_PITCH_MAX)
    sc2.add_argument("--min-face", type=int, default=QUALITY_FACE_MIN)
    sc2.add_argument("--min-det", type=float, default=QUALITY_DET_MIN)
    sc2.add_argument("--track-gate-frac", type=float, default=TRACK_GATE_FRAC)
    sc2.add_argument("--bridge-gap", type=float, default=SEGMENT_BRIDGE_S,
                     help="bridge within-track failure gaps up to this many seconds")
    sc2.add_argument("--merge-gap", type=float, default=SEGMENT_MERGE_GAP_S,
                     help="merge across-track segments in same scene if within this gap")
    sc2.add_argument("--min-dur", type=float, default=SEGMENT_MIN_S)
    sc2.add_argument("--max-dur", type=float, default=SEGMENT_MAX_S)
    sc2.set_defaults(func=cmd_score)

    cu = sub.add_parser("cut")
    cu.add_argument("--plan", required=True)
    cu.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
    cu.add_argument("--force", action="store_true")
    cu.add_argument("--clean", action="store_true",
                    help="remove prior UUID-named clips before cutting (preserves non-UUID files)")
    cu.add_argument("--write-sidecar", action="store_true",
                    help="emit <uuid>.json provenance sidecar alongside each clip (default off)")
    cu.set_defaults(func=cmd_cut)

    rp = sub.add_parser("report")
    rp.add_argument("--plan", required=True)
    rp.add_argument("--output-dir", default=str(DEFAULT_OUTPUT))
    rp.add_argument("--out", required=True)
    rp.set_defaults(func=cmd_report)

    args = ap.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()