face-sets/docs/analysis/osrc-identity-discovery.md
Peter 7ecbfae981 Add osrc identity-discovery pipeline + run analysis
work/cluster_osrc.py mirrors build_folders.py's shape (synthesize a
refine_manifest, hand off to cmd_export_swap, relocate, merge top-level
manifest) but discovers identities by clustering rather than asserting
them by folder. Drops faces already covered by existing identity
centroids, clusters the rest at 0.55, applies refine-equivalent gates
with min_faces=6, numbers new facesets past the existing maximum so
faceset_001..NNN are never disturbed.

The 2026-04-26 run on /mnt/x/src/osrc produced faceset_020..025 (sizes
4-26 exported PNGs); analysis writeup in docs/analysis/.

README also notes the refine-renumbers caveat in passing — extend +
orchestration script is the safe pattern; cmd_refine is for fresh
clusters only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:40:19 +02:00

# Identity discovery in `/mnt/x/src/osrc`
_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records).
Driver script: `work/cluster_osrc.py`._
## 1. Source
`/mnt/x/src/osrc/` is a flat mixed-identity bucket: 213 files in root + a
`psd/` subfolder with 41 PSD files + a single file in `[Originaldateien]/`.
File extensions are 171 jpg + 1 jpeg + 41 psd. PSDs are not embedded
(InsightFace's loader doesn't read PSD); the 41 PSDs were skipped, on the
working assumption that the same identities are also present in the
adjacent JPGs.

`nl_full.npz` already covered 160 of the 213 files (the remaining 53: 41
psd + 12 jpg). Of the 12 missing JPGs, 11 are byte-duplicates of `00843resc.jpg`
.. `00855resc.jpg` (same file sizes, paired by sha256) and are already
aliased in the cache. Only 1 jpg (`19554226_..._n.jpg`) is genuinely uncovered.

The 160 covered files yielded **336 face records / 10 noface**, with 64
single-face / 35 two-face / 19 three-face / 24 four-face / 8 with 5-8
faces. Quality is good: median `face_short=116px`, `det_score=0.85`,
`blur=244`. The smallest faces (min `face_short=40px`) will fail the 90px
refine gate.
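
The byte-duplicate check described above (same file size, then sha256) can be sketched as below. This is an illustration only, not the code that produced the cache aliases; `sha256_of` and `find_byte_duplicates` are hypothetical helper names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_byte_duplicates(paths):
    """Group paths with identical content; return only groups of 2+ files."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[p.stat().st_size].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a byte-duplicate
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[sha256_of(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Grouping by size first means only files that share a size are hashed at all, which keeps the check cheap on a bucket of a few hundred images.
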
## 2. Coverage by existing identities
Computed cos-dist from each osrc face to the centroids of the canonical
`faceset_001..019` (built from each manifest's `(source, bbox)` keys).
Median nearest-cos-dist was 0.875 — i.e. the bulk of osrc is **not** the
existing 19 identities.

At cos-dist ≤ 0.45 (matching `build_folders.py`'s `OSRC_THRESHOLD`):

| existing identity | osrc faces matched |
|------------------|------------------:|
| faceset_002 | 7 |
| faceset_008 | 4 |
| faceset_015 | 3 |
| faceset_019 | 4 |

These 18 osrc faces are routed to existing identities by
`build_folders.py` and `extend`; they are excluded from the
identity-discovery step.
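
A minimal sketch of the coverage check, assuming each centroid is the re-normalized mean of an identity's unit-length embeddings (the note does not say how the centroids are actually built). `l2_normalize`, `identity_centroids`, and `nearest_identity` are hypothetical names; only the 0.45 threshold comes from the text.

```python
import numpy as np

THRESHOLD = 0.45  # cos-dist cut-off quoted above (OSRC_THRESHOLD)


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norm, 1e-12, None)


def identity_centroids(embeddings_by_identity):
    """Return (names, centroids); centroid = normalized mean of unit vectors.

    Assumption: this averaging scheme is illustrative, not confirmed.
    """
    names = sorted(embeddings_by_identity)
    centroids = np.stack([
        l2_normalize(
            l2_normalize(
                np.asarray(embeddings_by_identity[n], dtype=np.float64)
            ).mean(axis=0)
        )
        for n in names
    ])
    return names, centroids


def nearest_identity(faces, centroids):
    """Return (index of nearest centroid, cos-dist to it) for each face."""
    dist = 1.0 - l2_normalize(np.asarray(faces, dtype=np.float64)) @ centroids.T
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(dist)), idx]
```

A face counts as already covered when its nearest-centroid distance is `<= THRESHOLD`; everything else goes on to the discovery step.
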
## 3. Pipeline
`work/cluster_osrc.py` mirrors `build_folders.py`'s structure (synthesize
a refine manifest, hand off to `cmd_export_swap`, relocate, merge
top-level manifest) but discovers identities by clustering rather than
asserting them by folder.
1. Filter cache to face records under `/mnt/x/src/osrc` (canonical or
byte-aliased path).
2. Drop the 18 already-covered faces (cos-dist ≤ 0.45 to any existing
identity centroid).
3. Cluster the remaining 318 faces among themselves at cos-dist 0.55
(matches the `extend` default for new-cluster formation).
4. For each cluster, apply `refine`-equivalent per-face gates
(`face_short ≥ 90`, `blur ≥ 40`, `det_score ≥ 0.6`); for clusters ≥ 4
faces apply outlier rejection at cluster-centroid cos-dist 0.55. Keep
clusters whose surviving unique-path count is ≥ 6 (the operator-
chosen `MIN_FACES`, lower than the canonical 15 because osrc is small
per-identity).
5. Number kept clusters `faceset_020+` (past the existing
`facesets_swap_ready/` max of 019) ordered by size descending.
6. Synthesize a refine manifest and call `cmd_export_swap` on it. Move
the emitted dirs into `facesets_swap_ready/`, drop an `osrc.txt`
provenance marker, and append the new entries to the top-level
`manifest.json` (without disturbing existing `facesets` / `thin_eras`).
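
Steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the code in `cluster_osrc.py`: the `Face` record, `passes_gates`, and `select_and_number` are hypothetical names, outlier rejection is left out, and "size" is assumed to mean the surviving unique-path count.

```python
from dataclasses import dataclass

MIN_FACE_SHORT = 90   # px, per-face gate from step 4
MIN_BLUR = 40
MIN_DET_SCORE = 0.6
MIN_FACES = 6         # minimum surviving unique source paths per cluster
START_NNN = 20        # first free faceset number (existing max is 019)


@dataclass(frozen=True)
class Face:
    # Hypothetical record; the real cache layout may differ.
    path: str
    face_short: float
    blur: float
    det_score: float


def passes_gates(face: Face) -> bool:
    """Apply the three per-face quality gates."""
    return (
        face.face_short >= MIN_FACE_SHORT
        and face.blur >= MIN_BLUR
        and face.det_score >= MIN_DET_SCORE
    )


def unique_paths(faces) -> int:
    return len({f.path for f in faces})


def select_and_number(clusters, start=START_NNN, min_faces=MIN_FACES):
    """Gate each cluster, drop small ones, number the rest largest-first."""
    kept = []
    for faces in clusters:
        survivors = [f for f in faces if passes_gates(f)]
        if unique_paths(survivors) >= min_faces:
            kept.append(survivors)
    # list.sort is stable, so equally sized clusters keep their input order.
    kept.sort(key=unique_paths, reverse=True)
    return {f"faceset_{start + i:03d}": fs for i, fs in enumerate(kept)}
```

Numbering from `start` upward is what keeps the existing `faceset_001..019` directories untouched.
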
## 4. Result (2026-04-26)
Phase 1 (clustering, before export-swap):
- 137 raw clusters at cos-dist 0.55; top sizes [37, 20, 12, 9, 7, 7, 6, 6, 6, 5].
- After quality gate: 124 faces dropped (mostly `face_short < 90` from
group-photo tertiary subjects).
- Outlier rejection: 0 dropped (clusters were tight).
- After `min_faces=6`: **7 candidate clusters kept** (sizes 6-28 unique
source paths).

Phase 2 (`cmd_export_swap` with `min_face_short=100`,
`outlier_threshold=0.45`):

| name | input | outlier drop | exported PNGs |
|--------------|------:|-------------:|--------------:|
| faceset_020 | 71 | 42 | 26 |
| faceset_021 | 36 | 21 | 10 |
| faceset_022 | 15 | 7 | 8 |
| faceset_023 | 19 | 14 | 4 |
| faceset_024 | 6 | 0 | 6 |
| faceset_025 | 10 | 4 | 6 |
| faceset_026 | — | — | 0 (skipped: empty after filter) |

`faceset_026`'s 6 cluster faces all failed export-swap's tighter
`min_face_short=100` gate (vs. the clustering gate of 90); it is not emitted.
`faceset_023` is small (4 PNGs) but still usable as an averaged identity
at that size.

Top-level `facesets_swap_ready/manifest.json` now: **31 substantive
facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted + 6 era splits + 6
osrc-discovered) + **68 thin_eras** under `_thin/`.
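
Appending to the top-level manifest without disturbing existing entries might look like the sketch below. The manifest layout is an assumption: `facesets` is taken to be a list of dicts with a `name` key, and `merge_facesets` is a hypothetical helper, not a function from the repository.

```python
import json
from pathlib import Path


def merge_facesets(manifest_path: Path, new_entries):
    """Append new faceset entries; leave every other key as it was."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

    existing = {e["name"] for e in manifest.get("facesets", [])}
    clashes = existing & {e["name"] for e in new_entries}
    if clashes:
        # Refuse to overwrite: existing facesets must never be disturbed.
        raise ValueError(f"faceset names already present: {sorted(clashes)}")

    manifest.setdefault("facesets", []).extend(new_entries)

    # Write to a sibling temp file, then swap it in, so a crash mid-write
    # cannot leave a truncated manifest.json behind.
    tmp = manifest_path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    tmp.replace(manifest_path)
    return manifest
```

Failing loudly on a name clash is a deliberate choice here: a silent overwrite would defeat the point of numbering new facesets past the existing maximum.
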
## 5. Re-running and applying to other mixed buckets
- The cache holds osrc embeddings; to re-run with different parameters,
edit `cluster_osrc.py`'s config block and re-execute. Cluster discovery
+ export-swap is a few minutes total.
- For a different mixed-bucket source, copy `cluster_osrc.py` to
`cluster_<name>.py` and change `OSRC_DIR`, `OUT_TMP`, `SYNTH_MANIFEST`,
`START_NNN`. The exclusion step compares against the *current* contents
of `facesets_swap_ready/faceset_NNN/` so it picks up everything emitted
by previous discovery / split / hand-sorted runs.
- Lowering `MIN_FACES` from 6 to 4 would have admitted ~3 additional
marginal clusters at this corpus size; the trade-off is a noisier
identity average for small-N facesets.
- `extend` should be run before `cluster_osrc.py` so `raw_full/` and
`facesets_full/` stay in sync — `cluster_osrc.py` itself only writes
to `facesets_swap_ready/`.
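
When copying the script for another bucket, `START_NNN` could be derived from the directory contents instead of set by hand. A hedged sketch, assuming emitted directories are always named `faceset_NNN`; `next_start_nnn` is a hypothetical helper, not part of `cluster_osrc.py`.

```python
import re
from pathlib import Path

FACESET_RE = re.compile(r"^faceset_(\d{3,})$")


def next_start_nnn(swap_ready: Path) -> int:
    """Return one past the highest existing faceset_NNN directory number."""
    numbers = [
        int(m.group(1))
        for p in swap_ready.iterdir()
        # Only real directories with an exact faceset_NNN name count;
        # `_thin/`, marker files, and stray names are ignored.
        if p.is_dir() and (m := FACESET_RE.match(p.name))
    ]
    return max(numbers, default=0) + 1
```

For the state described in this note (existing maximum 019), this would have returned 20.
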