Add osrc identity-discovery pipeline + run analysis

work/cluster_osrc.py mirrors build_folders.py's shape (synthesize a refine_manifest, hand off to cmd_export_swap, relocate, merge top-level manifest) but discovers identities by clustering rather than asserting them by folder. It drops faces already covered by existing identity centroids, clusters the rest at 0.55, applies refine-equivalent gates with min_faces=6, and numbers new facesets past the existing maximum so faceset_001..NNN are never disturbed.

The 2026-04-26 run on /mnt/x/src/osrc produced faceset_020..025 (sizes 4-26 exported PNGs); the analysis writeup is in docs/analysis/. The README also notes the refine-renumbers caveat in passing: extend + orchestration script is the safe pattern; cmd_refine is for fresh clusters only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:40:19 +02:00
parent 1d82d71e68
commit 7ecbfae981
3 changed files with 524 additions and 0 deletions
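
The two safety properties named in the commit message (skip faces already covered by an existing identity, and number new facesets past the existing maximum) can be sketched as below. This is a minimal illustration, not the code in `work/cluster_osrc.py`: the helper names `cos_dist`, `drop_covered`, and `next_faceset_number` are hypothetical, the embeddings are toy 2-D vectors, and treating a face as covered when its distance is at or below the cutoff is an assumption.

```python
import math
import re

# Value quoted in the README text; the real script may define it differently.
EXISTING_MATCH_THRESHOLD = 0.45


def cos_dist(a, b):
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def drop_covered(embeddings, centroids, threshold=EXISTING_MATCH_THRESHOLD):
    """Return indices of faces farther than `threshold` from every centroid.

    Assumption: a face at or below the threshold counts as already covered.
    """
    return [
        i
        for i, emb in enumerate(embeddings)
        if all(cos_dist(emb, c) > threshold for c in centroids)
    ]


def next_faceset_number(existing_names):
    """Return one past the highest NNN among names shaped like faceset_NNN."""
    numbers = [
        int(m.group(1))
        for name in existing_names
        if (m := re.fullmatch(r"faceset_(\d+)", name))
    ]
    return max(numbers, default=0) + 1


# Toy data: two existing identity centroids and three candidate faces.
centroids = [[1.0, 0.0], [0.0, 1.0]]
faces = [[0.9, 0.1], [-1.0, 0.2], [0.1, -1.0]]

unmatched = drop_covered(faces, centroids)  # only these would be clustered
start = next_faceset_number(["faceset_001", "faceset_019", "_thin"])

print("unmatched face indices:", unmatched)
print("first new faceset: faceset_%03d" % start)
```

In this toy run the first face sits close to an existing centroid and is dropped, while the other two remain as clustering candidates. Non-matching directory names such as `_thin` are ignored when picking the next number.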
--- a/README.md
+++ b/README.md
@@ -153,6 +153,57 @@ For the `faceset_001` run on 5260-face `nl_full.npz`, this produced 6 substantiv
 era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
 plus 68 thin/fragment buckets quarantined under `_thin/`.

+### Discovering new identities in a mixed bucket
+
+A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the
+hand-sorted case: identities have to be discovered, not asserted, but should
+not collide with already-known identities or scramble their numbering.
+
+`work/cluster_osrc.py` is the worked example. The pipeline:
+
+- **Filter cache to the source root**, including any byte-aliased path that
+  resolves under it.
+- **Drop already-covered faces** by comparing each candidate against the
+  centroids of the existing canonical facesets, using
+  `EXISTING_MATCH_THRESHOLD` as the cutoff (default 0.45, the same value as
+  `build_folders.py`'s osrc routing). These faces are already routed by
+  `extend` / `build_folders.py` and shouldn't seed new facesets.
+- **Cluster the unmatched faces** at cos-dist 0.55 (matches the `extend`
+  default for the new-cluster phase).
+- **Apply `refine`-equivalent gates** per cluster: `face_short`, `blur`,
+  `det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for
+  clusters of size ≥ 4. Keep clusters whose surviving unique-source-path
+  count is ≥ `MIN_FACES`.
+- **Number new facesets past the existing maximum** (`START_NNN`), so
+  `faceset_001..NNN` are never disturbed.
+- **Synthesize a refine manifest** and run `cmd_export_swap` against it,
+  then move the resulting dirs into `facesets_swap_ready/` and append to the
+  top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance
+  marker.
+
+Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new
+source — the `cluster_osrc.py` step then operates against the canonical
+cache and doesn't need `raw_full/` for input:
+
+```bash
+# 1. Bring raw_full / facesets_full up to date (folds matches into existing
+#    person folders + facesets, creates new person_NNN+ for unmatched).
+python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
+  --refine-out "$OUT/facesets_full"
+
+# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
+#    without touching facesets_swap_ready/.
+python work/cluster_osrc.py --dry-run
+
+# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
+python work/cluster_osrc.py
+```
+
+For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by
+existing identities), this produced 6 new facesets (`faceset_020..025`,
+sizes 4–26 exported PNGs). A 7th candidate cluster lost all 6 of its faces
+to export-swap's tighter `min_face_short=100` gate.
+
 ## Key defaults

 `refine`:
@@ -201,7 +252,9 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with
   ├─ build_folders.py                           (hand-sorted-folder orchestration)
   ├─ check_faceset001_age.py                    (age-split readiness probe)
   ├─ age_split_001.py                           (age-split orchestration; faceset_001)
+   ├─ cluster_osrc.py                            (mixed-bucket identity discovery)
   ├─ synthetic_refine_manifest.json             (last build_folders.py output)
+   ├─ synthetic_osrc_manifest.json               (last cluster_osrc.py output)
   ├─ cache/
   │  ├─ nl_full.npz                             (canonical cache + duplicates.json)
   │  └─ age_split_exif.json                     (path → EXIF-year cache)