Add osrc identity-discovery pipeline + run analysis

work/cluster_osrc.py mirrors build_folders.py's shape (synthesize a refine_manifest, hand off to cmd_export_swap, relocate, merge top-level manifest) but discovers identities by clustering rather than asserting them by folder. It drops faces already covered by existing identity centroids, clusters the rest at 0.55, applies refine-equivalent gates with min_faces=6, and numbers new facesets past the existing maximum so faceset_001..NNN are never disturbed.

The 2026-04-26 run on /mnt/x/src/osrc produced faceset_020..025 (sizes 4-26 exported PNGs); the analysis writeup is in docs/analysis/. The README also notes the refine-renumbers caveat in passing: extend plus an orchestration script is the safe pattern; cmd_refine is for fresh clusters only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
53
README.md
@@ -153,6 +153,57 @@ For the `faceset_001` run on 5260-face `nl_full.npz`, this produced 6 substantiv
era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under `_thin/`.

### Discovering new identities in a mixed bucket

A flat folder of mixed-identity photos (e.g. `osrc/`) is the opposite of the
hand-sorted case: identities have to be discovered, not asserted, but should
not collide with already-known identities or scramble their numbering.

`work/cluster_osrc.py` is the worked example. The pipeline:

- **Filter cache to the source root**, including any byte-aliased path that
  resolves under it.
- **Drop already-covered faces** by comparing each candidate to the centroids
  of the existing canonical facesets at the `EXISTING_MATCH_THRESHOLD`
  (default 0.45, the same cutoff as `build_folders.py`'s osrc routing). These
  faces are already routed by `extend` / `build_folders.py` and shouldn't
  seed new facesets.
- **Cluster the unmatched** at cos-dist 0.55 (matches the `extend` default
  for the new-cluster phase).
- **Apply `refine`-equivalent gates** per cluster: `face_short`, `blur`,
  `det_score`, plus outlier rejection (cluster-centroid cos-dist > 0.55) for
  clusters of size ≥ 4. Keep clusters whose surviving unique-source-path
  count is ≥ `MIN_FACES`.
- **Number new facesets past the existing maximum** (`START_NNN`), so
  `faceset_001..NNN` are never disturbed.
- **Synthesize a refine manifest** and run `cmd_export_swap` against it,
  then move the resulting dirs into `facesets_swap_ready/` and append to the
  top-level `manifest.json`. Each new dir gets an `osrc.txt` provenance
  marker.

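The "drop already-covered faces" step can be illustrated with a minimal pure-Python sketch. It is not the code in `work/cluster_osrc.py`: the `split_covered` helper, the dict-based face records with an `emb` key, and the inclusive `<=` comparison are all assumptions made for illustration. Only the 0.45 threshold and the use of cosine distance to existing centroids come from the list above.

```python
import math

EXISTING_MATCH_THRESHOLD = 0.45  # cosine distance; value quoted in the README


def cos_dist(a, b):
    """Cosine distance between two non-zero vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_covered(faces, centroids, threshold=EXISTING_MATCH_THRESHOLD):
    """Split faces into (covered, unmatched) by distance to the nearest centroid.

    `faces` is a list of dicts with an "emb" key (hypothetical record shape).
    Whether the real comparison is inclusive is an assumption.
    """
    covered, unmatched = [], []
    for face in faces:
        nearest = min(
            (cos_dist(face["emb"], c) for c in centroids),
            default=float("inf"),  # no existing identities: nothing is covered
        )
        (covered if nearest <= threshold else unmatched).append(face)
    return covered, unmatched


# Toy 2-D embeddings; real face embeddings have far more dimensions.
centroids = [[1.0, 0.0]]
faces = [
    {"path": "a.jpg", "emb": [1.0, 0.1]},  # close to the known identity
    {"path": "b.jpg", "emb": [0.0, 1.0]},  # orthogonal: a candidate new identity
]
covered, unmatched = split_covered(faces, centroids)
```

Only the `unmatched` list would go on to the clustering step, which keeps faces of known identities from seeding duplicate facesets.
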
Always run `extend` first so `raw_full/` and `facesets_full/` reflect the new
source; the `cluster_osrc.py` step then operates against the canonical
cache and doesn't need `raw_full/` for input:

```bash
# 1. Bring raw_full / facesets_full up to date (folds matches into existing
#    person folders + facesets, creates new person_NNN+ for unmatched).
python sort_faces.py extend "$CACHE" "$OUT/raw_full" \
    --refine-out "$OUT/facesets_full"

# 2. Optional dry-run: report cluster sizes and per-faceset survivor counts
#    without touching facesets_swap_ready/.
python work/cluster_osrc.py --dry-run

# 3. Real run: emits facesets_swap_ready/faceset_NNN+ and merges the manifest.
python work/cluster_osrc.py
```

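The `faceset_NNN+` numbering in step 3 follows the "number past the existing maximum" rule. A minimal sketch of that rule is shown here, assuming three-digit zero-padded names; the helpers `next_faceset_number` and `assign_names` are hypothetical and are not taken from `cluster_osrc.py`, which exposes the start value as `START_NNN`.

```python
import re

FACESET_RE = re.compile(r"faceset_(\d+)")


def next_faceset_number(existing_names):
    """Return max existing faceset number + 1, or 1 when none exist."""
    numbers = [
        int(m.group(1))
        for name in existing_names
        if (m := FACESET_RE.fullmatch(name))  # ignores entries like "_thin"
    ]
    return max(numbers, default=0) + 1


def assign_names(existing_names, new_cluster_count):
    """Name new clusters contiguously, starting past the existing maximum."""
    start = next_faceset_number(existing_names)
    return [f"faceset_{n:03d}" for n in range(start, start + new_cluster_count)]


# Illustrative directory listing, not the real contents of facesets_swap_ready/.
existing = ["faceset_001", "faceset_007", "faceset_019", "_thin"]
new_names = assign_names(existing, 6)
```

Because the start value is derived from the maximum rather than from the count of existing directories, gaps in the existing numbering are never back-filled and no existing faceset is renamed.
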
For the 2026-04-26 run on 336 osrc face records (after dropping 18 covered by
existing identities), this produced 6 new facesets (`faceset_020..025`,
sizes 4–26 exported PNGs; the 7th candidate cluster lost all 6 faces to
export-swap's tighter `min_face_short=100` gate).

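The loss of the 7th cluster is a consequence of two gates applied in sequence: a cluster can clear the `MIN_FACES` check at the refine-equivalent stage and still export nothing. The sketch below illustrates that interaction under stated assumptions. `REFINE_MIN_FACE_SHORT = 60` is a made-up value (the real refine-stage threshold is not given here), the `Face` record is hypothetical, and the `blur`, `det_score` and outlier gates are left out for brevity. Only `min_faces=6` and `min_face_short=100` come from the text.

```python
from dataclasses import dataclass

MIN_FACES = 6                # from the commit message (min_faces=6)
REFINE_MIN_FACE_SHORT = 60   # HYPOTHETICAL refine-stage threshold
EXPORT_MIN_FACE_SHORT = 100  # export-swap gate quoted above


@dataclass(frozen=True)
class Face:
    source_path: str
    face_short: int  # assumed: shorter side of the face box, in pixels


def passes_refine(cluster):
    """Keep a cluster if enough unique source paths survive the size gate."""
    survivors = [f for f in cluster if f.face_short >= REFINE_MIN_FACE_SHORT]
    return len({f.source_path for f in survivors}) >= MIN_FACES


def exported(cluster):
    """Faces that would still be written under the tighter export gate."""
    return [f for f in cluster if f.face_short >= EXPORT_MIN_FACE_SHORT]


# Six small faces from six different photos: enough to keep the cluster,
# but every one is below the export threshold.
cluster = [Face(f"img_{i}.jpg", 70 + 5 * i) for i in range(6)]
kept = passes_refine(cluster)
written = exported(cluster)
```

With these inputs the cluster is kept but nothing is written, which is the same shape of outcome as the 7th candidate cluster in the run above. The `--dry-run` survivor counts are the place to spot such a cluster before a real run.
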
## Key defaults

`refine`:
@@ -201,7 +252,9 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with
├─ build_folders.py                  (hand-sorted-folder orchestration)
├─ check_faceset001_age.py           (age-split readiness probe)
├─ age_split_001.py                  (age-split orchestration; faceset_001)
├─ cluster_osrc.py                   (mixed-bucket identity discovery)
├─ synthetic_refine_manifest.json    (last build_folders.py output)
├─ synthetic_osrc_manifest.json      (last cluster_osrc.py output)
├─ cache/
│  ├─ nl_full.npz                    (canonical cache + duplicates.json)
│  └─ age_split_exif.json            (path → EXIF-year cache)