Identity discovery in /mnt/x/src/osrc
Run date: 2026-04-26. Cache: work/cache/nl_full.npz (5260 face records).
Driver script: work/cluster_osrc.py.
1. Source
/mnt/x/src/osrc/ is a flat mixed-identity bucket of 213 files in total,
spread across the root, a psd/ subfolder with 41 PSD files, and a single
file in [Originaldateien]/. By extension that is 171 jpg + 1 jpeg + 41 psd.
PSDs are not embedded (InsightFace's loader doesn't read PSD), so the 41
PSDs were skipped on the working assumption that the same identities are
also present in the adjacent JPGs.
nl_full.npz already covered 160 of the 213 files (the remaining 53: 41
psd + 12 jpg). Of the 12 missing JPGs, 11 are byte-duplicates of 00843resc.jpg
.. 00855resc.jpg (same file sizes, paired by sha256) — already aliased
in the cache. Only 1 jpg (19554226_..._n.jpg) is genuinely uncovered.
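The byte-duplicate check described above (same file size, then paired by sha256) can be sketched roughly as follows. This is an illustration only, not the code that produced the cache aliases; the helper names `sha256_of` and `find_byte_duplicates` are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_byte_duplicates(paths):
    """Group paths by file size, then by sha256; return groups of 2+ files."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[p.stat().st_size].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have a byte-duplicate
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[sha256_of(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Comparing sizes first avoids hashing files that cannot match, which matters little at 213 files but keeps the check cheap on larger buckets.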
The 160 covered files yielded 336 face records, with 10 files containing
no face: 64 single-face / 35 two-face / 19 three-face / 24 four-face / 8
with 5–8 faces. Quality is good overall: median face_short=116px,
det_score=0.85, blur=244. The smallest faces (minimum face_short=40px)
will fail the 90px refine gate.
2. Coverage by existing identities
Computed cos-dist from each osrc face to the centroids of the canonical
faceset_001..019 (built from each manifest's (source, bbox) keys).
Median nearest-cos-dist was 0.875 — i.e. the bulk of osrc is not the
existing 19 identities.
At cos-dist ≤ 0.45 (matching build_folders.py's OSRC_THRESHOLD):
| existing identity | osrc faces matched |
|---|---|
| faceset_002 | 7 |
| faceset_008 | 4 |
| faceset_015 | 3 |
| faceset_019 | 4 |
These 18 osrc faces are routed to existing identities by
build_folders.py and extend; they are excluded from the
identity-discovery step.
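The coverage test in this section can be sketched as a nearest-centroid lookup with a cos-dist cut-off. This is a minimal, dependency-free illustration that assumes embeddings are non-zero vectors held as plain lists; the real cache is an .npz file, so the actual script presumably works on arrays. `COVER_THRESHOLD`, `cos_dist` and `route_faces` are hypothetical names.

```python
import math

COVER_THRESHOLD = 0.45  # cos-dist cut-off quoted in this section


def cos_dist(a, b):
    """Cosine distance 1 - cos(a, b); assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def route_faces(face_embeddings, centroids, threshold=COVER_THRESHOLD):
    """Split faces into those covered by an existing identity and the rest.

    Returns (covered, uncovered): covered maps face index -> identity name,
    uncovered is the list of face indices left for identity discovery.
    """
    covered, uncovered = {}, []
    for idx, emb in enumerate(face_embeddings):
        name, dist = min(
            ((n, cos_dist(emb, c)) for n, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        if dist <= threshold:
            covered[idx] = name
        else:
            uncovered.append(idx)
    return covered, uncovered
```

With the numbers reported here, `covered` would hold the 18 matched faces and `uncovered` the remaining 318.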
3. Pipeline
work/cluster_osrc.py mirrors build_folders.py's structure (synthesize
a refine manifest, hand off to cmd_export_swap, relocate, merge
top-level manifest) but discovers identities by clustering rather than
asserting them by folder.
- Filter cache to face records under /mnt/x/src/osrc (canonical or
  byte-aliased path).
- Drop the 18 already-covered faces (cos-dist ≤ 0.45 to any existing
  identity centroid).
- Cluster the remaining 318 faces among themselves at cos-dist 0.55
  (matches the extend default for new-cluster formation).
- For each cluster, apply refine-equivalent per-face gates
  (face_short ≥ 90, blur ≥ 40, det_score ≥ 0.6); for clusters ≥ 4 faces
  apply outlier rejection at cluster-centroid cos-dist 0.55. Keep clusters
  whose surviving unique-path count is ≥ 6 (the operator-chosen MIN_FACES,
  lower than the canonical 15 because osrc is small per-identity).
- Number kept clusters faceset_020+ (past the existing
  facesets_swap_ready/ max of 019), ordered by size descending.
- Synthesize a refine manifest and call cmd_export_swap on it. Move the
  emitted dirs into facesets_swap_ready/, drop an osrc.txt provenance
  marker, and append the new entries to the top-level manifest.json
  (without disturbing existing facesets/thin_eras).
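The gating and numbering steps in the list above can be sketched as follows. This is an illustration under stated assumptions, not the contents of cluster_osrc.py: the `Face` record, the gate constant names other than MIN_FACES and START_NNN, and both functions are hypothetical; outlier rejection is omitted; and "size" is taken to mean the surviving unique-path count.

```python
from dataclasses import dataclass

# Gate values as quoted in the pipeline description; names are illustrative.
MIN_FACE_SHORT = 90
MIN_BLUR = 40
MIN_DET_SCORE = 0.6
MIN_FACES = 6    # minimum surviving unique source paths per cluster
START_NNN = 20   # first free faceset number


@dataclass(frozen=True)
class Face:
    """Hypothetical face record holding only the fields the gates need."""
    path: str
    face_short: float
    blur: float
    det_score: float


def passes_gates(face: Face) -> bool:
    """Per-face quality gate: all three thresholds must be met."""
    return (
        face.face_short >= MIN_FACE_SHORT
        and face.blur >= MIN_BLUR
        and face.det_score >= MIN_DET_SCORE
    )


def unique_paths(faces) -> int:
    """Number of distinct source files among the given faces."""
    return len({f.path for f in faces})


def select_and_number(clusters, min_faces=MIN_FACES, start=START_NNN):
    """Gate each cluster, keep the large ones, name them by descending size."""
    kept = []
    for faces in clusters:
        survivors = [f for f in faces if passes_gates(f)]
        if unique_paths(survivors) >= min_faces:
            kept.append(survivors)
    kept.sort(key=unique_paths, reverse=True)
    return {f"faceset_{start + i:03d}": fs for i, fs in enumerate(kept)}
```

Counting unique paths rather than raw faces means a cluster cannot reach the minimum by having several detections from the same image.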
4. Result (2026-04-26)
Phase 1 (clustering, before export-swap):
- 137 raw clusters at cos-dist 0.55; top sizes [37, 20, 12, 9, 7, 7, 6, 6, 6, 5].
- After quality gate: 124 faces dropped (mostly face_short < 90 from
  group-photo tertiary subjects).
- Outlier rejection: 0 dropped (clusters were tight).
- After min_faces=6: 7 candidate clusters kept (sizes 6–28 unique source
  paths).
Phase 2 (cmd_export_swap with min_face_short=100,
outlier_threshold=0.45):
| name | input | outlier drop | exported PNGs |
|---|---|---|---|
| faceset_020 | 71 | 42 | 26 |
| faceset_021 | 36 | 21 | 10 |
| faceset_022 | 15 | 7 | 8 |
| faceset_023 | 19 | 14 | 4 |
| faceset_024 | 6 | 0 | 6 |
| faceset_025 | 10 | 4 | 6 |
| faceset_026 | — | — | 0 (skipped: empty after filter) |
faceset_026's 6 cluster faces all failed export-swap's tighter
min_face_short=100 gate (vs. 90 at the clustering stage), so it is not
emitted.
faceset_023 is small (4 PNGs) but useful as an averaged identity at
that size.
Top-level facesets_swap_ready/manifest.json now: 31 substantive
facesets (12 auto-cluster nl/lzbkp + 7 hand-sorted + 6 era splits + 6
osrc-discovered) + 68 thin_eras under _thin/.
5. Re-running and applying to other mixed buckets
- The cache holds osrc embeddings; to re-run with different parameters,
  edit cluster_osrc.py's config block and re-execute. Cluster discovery
  plus export-swap takes a few minutes in total.
- For a different mixed-bucket source, copy cluster_osrc.py to
  cluster_<name>.py and change OSRC_DIR, OUT_TMP, SYNTH_MANIFEST,
  START_NNN. The exclusion step compares against the current contents of
  facesets_swap_ready/faceset_NNN/ so it picks up everything emitted by
  previous discovery / split / hand-sorted runs.
- Lowering MIN_FACES from 6 to 4 would have admitted ~3 additional
  marginal clusters at this corpus size; the trade-off is a noisier
  identity average for small-N facesets.
- extend should be run before cluster_osrc.py so raw_full/ and
  facesets_full/ stay in sync; cluster_osrc.py itself only writes to
  facesets_swap_ready/.
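When copying the script for another bucket, START_NNN has to land past the highest existing faceset number. One way to derive it instead of setting it by hand is sketched below, assuming facesets live in directories named faceset_NNN directly under facesets_swap_ready/. The function name `next_faceset_number` is hypothetical, and nothing here is confirmed to exist in cluster_osrc.py.

```python
import re
from pathlib import Path

# Matches directory names such as faceset_019; other entries are ignored.
FACESET_RE = re.compile(r"^faceset_(\d{3,})$")


def next_faceset_number(swap_ready: Path) -> int:
    """Return one past the highest faceset_NNN directory number, or 1 if none."""
    numbers = []
    for entry in swap_ready.iterdir():
        match = FACESET_RE.match(entry.name)
        if entry.is_dir() and match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=0) + 1
```

Usage would look like `START_NNN = next_faceset_number(Path("facesets_swap_ready"))`, which for an existing maximum of faceset_019 gives 20.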