Add osrc identity-discovery pipeline + run analysis

work/cluster_osrc.py mirrors build_folders.py's shape (synthesize a
refine_manifest, hand off to cmd_export_swap, relocate, merge top-level
manifest) but discovers identities by clustering rather than asserting
them by folder. It drops faces already covered by existing identity
centroids, clusters the rest at cos-dist 0.55, applies refine-equivalent
gates with min_faces=6, and numbers new facesets past the existing maximum
so faceset_001..NNN are never disturbed.

The 2026-04-26 run on /mnt/x/src/osrc produced faceset_020..025 (sizes
4-26 exported PNGs); analysis writeup in docs/analysis/.

README also notes the refine-renumbers caveat in passing — extend +
orchestration script is the safe pattern; cmd_refine is for fresh
clusters only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:40:19 +02:00
parent 1d82d71e68
commit 7ecbfae981
3 changed files with 524 additions and 0 deletions


@@ -153,6 +153,57 @@ For the `faceset_001` run on 5260-face `nl_full.npz`, this produced 6 substantiv
era buckets (2005-10, 2010-13, 2011, 2014-17, 2018-19, 2018-20; sizes 43-282)
plus 68 thin/fragment buckets quarantined under `_thin/`.
### Discovering new identities in a mixed bucket
A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the
hand-sorted case: identities have to be discovered, not asserted, but should
not collide with already-known identities or scramble their numbering.
`work/cluster_osrc.py` is the worked example. The pipeline:
- **Filter cache to the source root**, including any byte-aliased path that
resolves under it.
- **Drop already-covered faces** by comparing each candidate to the centroids
of the existing canonical facesets at the `EXISTING_MATCH_THRESHOLD`
(default 0.45 — same cutoff as `build_folders.py`'s osrc routing). These
faces are already routed by `extend` / `build_folders.py` and shouldn't
seed new facesets.
- **Cluster the unmatched** at cos-dist 0.55 (matches the `extend` default
for the new-cluster phase).
- **Apply `refine`-equivalent gates** per cluster: `face_short`, `blur`,
`det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for
clusters of size ≥ 4. Keep clusters whose surviving unique-source-path
count is ≥ `MIN_FACES`.
- **Number new facesets past the existing maximum** (`START_NNN`), so
`faceset_001..NNN` are never disturbed.
- **Synthesize a refine manifest** and run `cmd_export_swap` against it,
then move the resulting dirs into `facesets_swap_ready/` and append to the
top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance
marker.
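
The "drop already-covered faces" step can be sketched as a nearest-centroid test. This is a minimal illustration, not the code in `work/cluster_osrc.py`: the helper names (`cos_dist`, `centroid`, `split_covered`), the `(path, embedding)` input shape, and the `<=` comparison are assumptions; only the 0.45 threshold comes from the text above. Embeddings are assumed to be non-zero vectors.

```python
import math

# Threshold named in the README; everything else here is hypothetical.
EXISTING_MATCH_THRESHOLD = 0.45


def cos_dist(a, b):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def split_covered(candidates, existing_centroids,
                  threshold=EXISTING_MATCH_THRESHOLD):
    """Partition candidate paths into (covered, unmatched).

    candidates: iterable of (path, embedding) pairs.
    existing_centroids: dict mapping faceset name -> centroid vector.
    A face is "covered" when its nearest existing centroid lies within
    the threshold; with no existing centroids, nothing is covered.
    """
    covered, unmatched = [], []
    for path, emb in candidates:
        nearest = min(
            (cos_dist(emb, c) for c in existing_centroids.values()),
            default=math.inf,
        )
        (covered if nearest <= threshold else unmatched).append(path)
    return covered, unmatched
```

Only the `unmatched` paths would go on to the clustering step; the covered ones are left to `extend` / `build_folders.py`.
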
Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new
source — the `cluster_osrc.py` step then operates against the canonical
cache and doesn't need `raw_full/` for input:
```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
# person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
--refine-out "$OUT/facesets_full"
# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
# without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run
# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```
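
The `faceset_NNN+` numbering in step 3 can be sketched as below. This assumes the starting number is derived by scanning existing directory names; the actual script exposes it as `START_NNN`, and how that value is set is not shown here. `next_faceset_number` and `assign_names` are hypothetical helpers.

```python
import re

# Matches canonical faceset directory names such as "faceset_019".
FACESET_RE = re.compile(r"^faceset_(\d+)$")


def next_faceset_number(existing_names):
    """Return one past the highest existing faceset number (1 if none exist)."""
    numbers = [
        int(m.group(1))
        for name in existing_names
        if (m := FACESET_RE.match(name))
    ]
    return max(numbers, default=0) + 1


def assign_names(existing_names, n_new, width=3):
    """Name n_new facesets sequentially after the existing maximum."""
    start = next_faceset_number(existing_names)
    return [f"faceset_{start + i:0{width}d}" for i in range(n_new)]
```

For example, with `faceset_001` and `faceset_019` already present, `assign_names(..., 2)` yields `faceset_020` and `faceset_021`, so no existing directory is renamed or overwritten.
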
For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by
existing identities), this produced 6 new facesets (`faceset_020..025`,
sizes 4-26 exported PNGs; the 7th candidate cluster lost all 6 faces to
export-swap's tighter `min_face_short=100` gate).
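
The loss of the 7th candidate can be illustrated with a short sketch of the short-side gate and the unique-source-path count. It is not the project's implementation: the `path` and `face_short` field names, the `>=` comparisons, and the looser discovery-stage threshold of 60 used in the usage note are assumptions; only `min_faces=6` and `min_face_short=100` come from the source. The `blur` and `det_score` gates are omitted for brevity.

```python
MIN_FACES = 6  # min_faces value stated in the commit message


def surviving_sources(faces, min_face_short):
    """Unique source paths with at least one face passing the short-side gate.

    faces: list of dicts with hypothetical keys "path" and "face_short".
    """
    return {f["path"] for f in faces if f["face_short"] >= min_face_short}


def keep_cluster(faces, min_face_short, min_faces=MIN_FACES):
    """Keep a cluster only if enough distinct source images survive the gate."""
    return len(surviving_sources(faces, min_face_short)) >= min_faces
```

A cluster of six faces with short sides between 80 and 95 px would pass `keep_cluster(faces, 60)` at discovery time, yet `surviving_sources(faces, 100)` is empty, so nothing is exported for it.
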
## Key defaults
`refine`:
@@ -201,7 +252,9 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with
├─ build_folders.py (hand-sorted-folder orchestration)
├─ check_faceset001_age.py (age-split readiness probe)
├─ age_split_001.py (age-split orchestration; faceset_001)
├─ cluster_osrc.py (mixed-bucket identity discovery)
├─ synthetic_refine_manifest.json (last build_folders.py output)
├─ synthetic_osrc_manifest.json (last cluster_osrc.py output)
├─ cache/
│ ├─ nl_full.npz (canonical cache + duplicates.json)
│ └─ age_split_exif.json (path → EXIF-year cache)