face-sets/docs/analysis/osrc-identity-discovery.md
Peter 7ecbfae981 Add osrc identity-discovery pipeline + run analysis
work/cluster_osrc.py mirrors build_folders.py's shape (synthesize a
refine_manifest, hand off to cmd_export_swap, relocate, merge top-level
manifest) but discovers identities by clustering rather than asserting
them by folder. Drops faces already covered by existing identity
centroids, clusters the rest at 0.55, applies refine-equivalent gates
with min_faces=6, numbers new facesets past the existing maximum so
faceset_001..NNN are never disturbed.

The 2026-04-26 run on /mnt/x/src/osrc produced faceset_020..025 (sizes
4-26 exported PNGs); analysis writeup in docs/analysis/.

README also notes the refine-renumbers caveat in passing — extend +
orchestration script is the safe pattern; cmd_refine is for fresh
clusters only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:40:19 +02:00


Identity discovery in /mnt/x/src/osrc

Run date: 2026-04-26. Cache: work/cache/nl_full.npz (5260 face records). Driver script: work/cluster_osrc.py.

1. Source

/mnt/x/src/osrc/ is a flat mixed-identity bucket of 213 files in total: the files in the root, a psd/ subfolder with 41 PSD files, and a single file in [Originaldateien]/. By extension that is 171 jpg + 1 jpeg + 41 psd. The PSDs were not embedded (InsightFace's loader doesn't read PSD), so all 41 were skipped on the working assumption that the same identities are also present in the adjacent JPGs.

nl_full.npz already covered 160 of the 213 files (the remaining 53: 41 psd + 12 jpg). Of the 12 missing JPGs, 11 are byte-duplicates of 00843resc.jpg .. 00855resc.jpg (same file sizes, paired by sha256) — already aliased in the cache. Only 1 jpg (19554226_..._n.jpg) is genuinely uncovered.
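
The byte-duplicate check described above (same file size, then paired by sha256) can be sketched as follows. This is a minimal illustration, not the code that produced the numbers in this section; `sha256_of` and `byte_duplicate_groups` are hypothetical names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def byte_duplicate_groups(paths):
    """Return sorted lists of paths whose contents are byte-identical."""
    # Files of different sizes cannot be duplicates, so group by size first
    # and hash only the candidates that share a size.
    by_size = defaultdict(list)
    for p in paths:
        by_size[p.stat().st_size].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[sha256_of(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Each returned group holds one canonical path plus its aliases; how the cache records the alias is outside this sketch.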

The 160 covered files yielded 336 face records, with 10 files containing no face: 64 single-face / 35 two-face / 19 three-face / 24 four-face / 8 with 5-8 faces. Quality is good: median face_short=116px, det_score=0.85, blur=244. The minimum face_short is 40px, so the smallest faces will fail the 90px refine gate.

2. Coverage by existing identities

Computed cos-dist from each osrc face to the centroids of the canonical faceset_001..019 (built from each manifest's (source, bbox) keys). Median nearest-cos-dist was 0.875 — i.e. the bulk of osrc is not the existing 19 identities.

At cos-dist ≤ 0.45 (matching build_folders.py's OSRC_THRESHOLD):

| existing identity | osrc faces matched |
|---|---|
| faceset_002 | 7 |
| faceset_008 | 4 |
| faceset_015 | 3 |
| faceset_019 | 4 |

These 18 osrc faces are routed to existing identities by build_folders.py and extend; they are excluded from the identity-discovery step.
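
The coverage test can be illustrated with a small NumPy sketch. It assumes embeddings and centroids are plain float arrays and that cosine distance means 1 minus cosine similarity; `nearest_centroid` is a hypothetical helper, and the toy vectors are made up for illustration.

```python
import numpy as np

THRESHOLD = 0.45  # value quoted above for OSRC_THRESHOLD


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def nearest_centroid(embeddings, centroids):
    """Return (index, cosine distance) of the closest centroid for each face."""
    e = l2_normalize(np.asarray(embeddings, dtype=np.float64))
    c = l2_normalize(np.asarray(centroids, dtype=np.float64))
    dist = 1.0 - e @ c.T  # shape (n_faces, n_identities)
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(e)), idx]


# Toy data: two identity centroids and three faces.
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
faces = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]])

idx, dist = nearest_centroid(faces, centroids)
covered = dist <= THRESHOLD  # True means "route to an existing identity"
```

Faces with `covered` set to False are the ones that would go on to the discovery step.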

3. Pipeline

work/cluster_osrc.py mirrors build_folders.py's structure (synthesize a refine manifest, hand off to cmd_export_swap, relocate, merge top-level manifest) but discovers identities by clustering rather than asserting them by folder.

  1. Filter cache to face records under /mnt/x/src/osrc (canonical or byte-aliased path).
  2. Drop the 18 already-covered faces (cos-dist ≤ 0.45 to any existing identity centroid).
  3. Cluster the remaining 318 faces among themselves at cos-dist 0.55 (matches the extend default for new-cluster formation).
  4. For each cluster, apply refine-equivalent per-face gates (face_short ≥ 90, blur ≥ 40, det_score ≥ 0.6); for clusters ≥ 4 faces apply outlier rejection at cluster-centroid cos-dist 0.55. Keep clusters whose surviving unique-path count is ≥ 6 (the operator-chosen MIN_FACES, lower than the canonical 15 because osrc is small per-identity).
  5. Number kept clusters faceset_020+ (past the existing facesets_swap_ready/ max of 019) ordered by size descending.
  6. Synthesize a refine manifest and call cmd_export_swap on it. Move the emitted dirs into facesets_swap_ready/, drop an osrc.txt provenance marker, and append the new entries to the top-level manifest.json (without disturbing existing facesets / thin_eras).
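
Steps 4 and 5 can be sketched roughly as below. This is an illustration under stated assumptions, not the contents of cluster_osrc.py: the `Face` record is hypothetical, outlier rejection is omitted, and "size" for ordering is taken to mean the surviving unique-path count.

```python
from dataclasses import dataclass

# Thresholds quoted in steps 4 and 5 above.
MIN_FACE_SHORT = 90
MIN_BLUR = 40
MIN_DET_SCORE = 0.6
MIN_FACES = 6
START_NNN = 20


@dataclass(frozen=True)
class Face:  # hypothetical record; the real cache layout may differ
    path: str
    face_short: float
    blur: float
    det_score: float


def passes_gates(f: Face) -> bool:
    """Per-face quality gates."""
    return (f.face_short >= MIN_FACE_SHORT
            and f.blur >= MIN_BLUR
            and f.det_score >= MIN_DET_SCORE)


def unique_paths(faces) -> int:
    return len({f.path for f in faces})


def select_and_number(clusters):
    """Gate each cluster, keep the large-enough ones, number them by size."""
    kept = []
    for faces in clusters:
        survivors = [f for f in faces if passes_gates(f)]
        if unique_paths(survivors) >= MIN_FACES:
            kept.append(survivors)
    # The largest cluster receives the lowest new number.
    kept.sort(key=unique_paths, reverse=True)
    return {f"faceset_{START_NNN + i:03d}": faces for i, faces in enumerate(kept)}
```

Counting unique paths rather than faces means a photo contributing two faces to the same cluster only counts once toward MIN_FACES.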

4. Result (2026-04-26)

Phase 1 (clustering, before export-swap):

  • 137 raw clusters at cos-dist 0.55; top sizes [37, 20, 12, 9, 7, 7, 6, 6, 6, 5].
  • After quality gate: 124 faces dropped (mostly face_short < 90 from group-photo tertiary subjects).
  • Outlier rejection: 0 dropped (clusters were tight).
  • After min_faces=6: 7 candidate clusters kept (sizes 6-28 unique source paths).
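
This writeup does not say which clustering algorithm produced the raw clusters. Purely as an illustration of what "clustering at cos-dist 0.55" can mean, the sketch below links any two faces closer than the threshold and takes connected components (single-linkage); the real script may use a different method, and `threshold_clusters` is a hypothetical name.

```python
import numpy as np


def threshold_clusters(embeddings, max_dist=0.55):
    """Group row indices whose cosine distance chains stay within max_dist."""
    e = np.asarray(embeddings, dtype=np.float64)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    dist = 1.0 - e @ e.T

    parent = list(range(len(e)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every pair (i < j) that is within the distance threshold.
    close = np.triu(dist <= max_dist, k=1)
    for i, j in zip(*np.nonzero(close)):
        ri, rj = find(int(i)), find(int(j))
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for i in range(len(e)):
        groups.setdefault(find(i), []).append(i)
    # Largest clusters first, mirroring the "top sizes" listing above.
    return sorted(groups.values(), key=len, reverse=True)
```

Single-linkage tends to chain loosely related faces together, which is one reason a separate outlier-rejection pass against the cluster centroid is useful afterwards.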

Phase 2 (cmd_export_swap with min_face_short=100, outlier_threshold=0.45):

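| name | input | outlier drop | exported PNGs |
|---|---|---|---|
| faceset_020 | 71 | 42 | 26 |
| faceset_021 | 36 | 21 | 10 |
| faceset_022 | 15 | 7 | 8 |
| faceset_023 | 19 | 14 | 4 |
| faceset_024 | 6 | 0 | 6 |
| faceset_025 | 10 | 4 | 6 |
| faceset_026 | 0 (skipped: empty after filter) | - | - |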

faceset_026's 6 cluster faces all failed export-swap's tighter min_face_short=100 gate (vs. cluster's 90); it is not emitted. faceset_023 is small (4 PNGs) but useful as an averaged identity at that size.
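
The faceset_026 outcome follows from the two gates being applied in sequence with different thresholds. A toy example, with made-up face sizes chosen to sit between the two limits:

```python
CLUSTER_MIN_FACE_SHORT = 90   # gate applied while forming clusters
EXPORT_MIN_FACE_SHORT = 100   # tighter gate quoted for cmd_export_swap


def surviving(face_shorts, threshold):
    """Keep only faces whose short side meets the threshold."""
    return [s for s in face_shorts if s >= threshold]


# Hypothetical short-side sizes (px) for a six-face cluster.
cluster = [91, 93, 94, 96, 97, 99]

after_cluster_gate = surviving(cluster, CLUSTER_MIN_FACE_SHORT)
after_export_gate = surviving(after_cluster_gate, EXPORT_MIN_FACE_SHORT)
```

Here all six faces pass the first gate and none pass the second, so a cluster that met MIN_FACES still exports nothing.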

Top-level facesets_swap_ready/manifest.json now: 31 substantive facesets (12 auto-cluster nl/lzbkp + 7 hand-sorted + 6 era splits + 6 osrc-discovered) + 68 thin_eras under _thin/.

5. Re-running and applying to other mixed buckets

  • The cache holds osrc embeddings; to re-run with different parameters, edit cluster_osrc.py's config block and re-execute. Cluster discovery + export-swap takes a few minutes in total.
  • For a different mixed-bucket source, copy cluster_osrc.py to cluster_<name>.py and change OSRC_DIR, OUT_TMP, SYNTH_MANIFEST, START_NNN. The exclusion step compares against the current contents of facesets_swap_ready/faceset_NNN/ so it picks up everything emitted by previous discovery / split / hand-sorted runs.
  • Lowering MIN_FACES from 6 to 4 would have admitted ~3 additional marginal clusters at this corpus size; the trade-off is a noisier identity average for small-N facesets.
  • extend should be run before cluster_osrc.py so raw_full/ and facesets_full/ stay in sync — cluster_osrc.py itself only writes to facesets_swap_ready/.
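
When adapting the script to another bucket, START_NNN has to land past the highest existing faceset number. One way to derive it instead of hard-coding it is sketched below; `next_faceset_number` is a hypothetical helper, and the directory naming (`faceset_NNN` directly under facesets_swap_ready/) is taken from the description above.

```python
import re
from pathlib import Path

FACESET_RE = re.compile(r"^faceset_(\d{3,})$")


def next_faceset_number(swap_ready: Path) -> int:
    """Return one past the highest faceset_NNN directory number (1 if none)."""
    numbers = [
        int(m.group(1))
        for child in swap_ready.iterdir()
        # Ignore files and non-matching dirs such as _thin/.
        if child.is_dir() and (m := FACESET_RE.match(child.name))
    ]
    return max(numbers, default=0) + 1
```

For example, `START_NNN = next_faceset_number(Path("facesets_swap_ready"))` would pick up whatever earlier discovery, split, or hand-sorted runs have already emitted.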