work/cluster_osrc.py mirrors build_folders.py's shape (synthesize a refine_manifest, hand off to cmd_export_swap, relocate, merge top-level manifest) but discovers identities by clustering rather than asserting them by folder. Drops faces already covered by existing identity centroids, clusters the rest at 0.55, applies refine-equivalent gates with min_faces=6, numbers new facesets past the existing maximum so faceset_001..NNN are never disturbed. The 2026-04-26 run on /mnt/x/src/osrc produced faceset_020..025 (sizes 4-26 exported PNGs); analysis writeup in docs/analysis/. README also notes the refine-renumbers caveat in passing — extend + orchestration script is the safe pattern; cmd_refine is for fresh clusters only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Identity discovery in `/mnt/x/src/osrc`

_Run date: 2026-04-26. Cache: `work/cache/nl_full.npz` (5260 face records).
Driver script: `work/cluster_osrc.py`._

## 1. Source

`/mnt/x/src/osrc/` is a flat mixed-identity bucket: 213 files in root + a
`psd/` subfolder with 41 PSD files + a single file in `[Originaldateien]/`.
File extensions are 171 jpg + 1 jpeg + 41 psd. No embeddings were computed
for the PSDs (InsightFace's loader doesn't read PSD); the 41 PSDs were
skipped, on the working assumption that the same identities are also
present in the adjacent JPGs.

`nl_full.npz` already covered 160 of the 213 files; the remaining 53 are 41
psd + 12 jpg. Of the 12 missing JPGs, 11 are byte-duplicates of
`00843resc.jpg` .. `00855resc.jpg` (same file sizes, paired by sha256) and
are already aliased in the cache. Only 1 jpg (`19554226_..._n.jpg`) is
genuinely uncovered.

The 160 covered files yielded **336 face records / 10 noface**, with 64
single-face / 35 two-face / 19 three-face / 24 four-face / 8 with 5–8
faces. Quality is good overall: median `face_short=116px`,
`det_score=0.85`, `blur=244`. The smallest faces (minimum
`face_short=40px`) will fail the 90px refine gate.

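The byte-duplicate pairing mentioned above (equal file size first, then sha256) can be sketched as follows. This is a minimal illustration and not the code used for the run; `sha256_of` and `find_byte_duplicates` are hypothetical names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_byte_duplicates(paths):
    """Return groups of paths with identical content (each group has >= 2 paths)."""
    # Cheap pre-filter: only files of equal size can be byte-identical.
    by_size = defaultdict(list)
    for p in paths:
        by_size[p.stat().st_size].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Confirm equality with a content hash inside each size bucket.
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[sha256_of(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Each returned group could then be mapped to a single cached record, which is presumably what "aliased in the cache" refers to; the cache's actual alias format is not shown here.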
## 2. Coverage by existing identities

Computed cos-dist from each osrc face to the centroids of the canonical
`faceset_001..019` (built from each manifest's `(source, bbox)` keys).
Median nearest-cos-dist was 0.875, i.e. the bulk of osrc is **not** the
existing 19 identities.

At cos-dist ≤ 0.45 (matching `build_folders.py`'s `OSRC_THRESHOLD`):

| existing identity | osrc faces matched |
|------------------|------------------:|
| faceset_002 | 7 |
| faceset_008 | 4 |
| faceset_015 | 3 |
| faceset_019 | 4 |

These 18 osrc faces are routed to existing identities by
`build_folders.py` and `extend`; they are excluded from the
identity-discovery step.

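The coverage check described in this section can be sketched in plain Python. This is an illustration under assumptions: embeddings are plain lists of floats, a centroid is the component-wise mean, and the function names are hypothetical rather than taken from `build_folders.py` or `cluster_osrc.py`.

```python
import math


def cos_dist(a, b):
    """Cosine distance 1 - cos(a, b) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def centroid(vectors):
    """Component-wise mean of a non-empty list of embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def split_by_coverage(faces, centroids, threshold=0.45):
    """Route each face to its nearest identity if within threshold.

    faces: dict face_id -> embedding; centroids: dict name -> embedding.
    Returns (covered, uncovered); covered maps face_id -> identity name.
    """
    covered, uncovered = {}, []
    for face_id, emb in faces.items():
        name, dist = min(
            ((n, cos_dist(emb, c)) for n, c in centroids.items()),
            key=lambda item: item[1],
        )
        if dist <= threshold:
            covered[face_id] = name
        else:
            uncovered.append(face_id)
    return covered, uncovered
```

In the run reported here, the `covered` side would hold the 18 faces in the table and the `uncovered` side the remaining 318 faces.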
## 3. Pipeline

`work/cluster_osrc.py` mirrors `build_folders.py`'s structure (synthesize
a refine manifest, hand off to `cmd_export_swap`, relocate, merge
top-level manifest) but discovers identities by clustering rather than
asserting them by folder.

1. Filter cache to face records under `/mnt/x/src/osrc` (canonical or
   byte-aliased path).
2. Drop the 18 already-covered faces (cos-dist ≤ 0.45 to any existing
   identity centroid).
3. Cluster the remaining 318 faces among themselves at cos-dist 0.55
   (matches the `extend` default for new-cluster formation).
4. For each cluster, apply `refine`-equivalent per-face gates
   (`face_short ≥ 90`, `blur ≥ 40`, `det_score ≥ 0.6`); for clusters of
   ≥ 4 faces, apply outlier rejection at cluster-centroid cos-dist 0.55.
   Keep clusters whose surviving unique-path count is ≥ 6 (the
   operator-chosen `MIN_FACES`, lower than the canonical 15 because osrc
   has few images per identity).
5. Number kept clusters `faceset_020+` (past the existing
   `facesets_swap_ready/` max of 019), ordered by size descending.
6. Synthesize a refine manifest and call `cmd_export_swap` on it. Move
   the emitted dirs into `facesets_swap_ready/`, drop an `osrc.txt`
   provenance marker, and append the new entries to the top-level
   `manifest.json` (without disturbing existing `facesets` / `thin_eras`).
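
Step 3 does not name the clustering algorithm, so the sketch below is only an assumption: it uses single-linkage grouping (connected components over pairs within the threshold) implemented with union-find. The real script may use a different linkage, and `cluster_single_link` is a hypothetical name.

```python
import math


def cos_dist(a, b):
    """Cosine distance 1 - cos(a, b) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster_single_link(embeddings, threshold=0.55):
    """Group indices connected by chains of pairs with cos-dist <= threshold."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # O(n^2) pairwise scan; fine for a few hundred faces.
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cos_dist(embeddings[i], embeddings[j]) <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    clusters = {}
    for i in range(len(embeddings)):
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first; ties broken by lowest member index.
    return sorted(clusters.values(), key=lambda c: (-len(c), c[0]))
```

Single linkage can chain loosely related faces into one cluster, which is one reason a later outlier-rejection pass against the cluster centroid is useful.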
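Steps 4 and 5 (per-face gates, the unique-path minimum, and size-descending numbering) can be sketched as below. The `Face` record and function names are hypothetical, the thresholds are the ones quoted in the list above, and the outlier-rejection pass is deliberately left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Face:
    path: str          # source image path
    face_short: float  # shorter side of the face bbox, in px
    blur: float
    det_score: float


def passes_gates(face, min_short=90, min_blur=40, min_det=0.6):
    """Per-face quality gates with the thresholds quoted in step 4."""
    return (face.face_short >= min_short
            and face.blur >= min_blur
            and face.det_score >= min_det)


def unique_paths(faces):
    """Number of distinct source images contributing to a cluster."""
    return len({f.path for f in faces})


def select_and_number(clusters, min_faces=6, start=20):
    """Gate faces, keep clusters with enough unique paths, then name them."""
    kept = []
    for faces in clusters:
        survivors = [f for f in faces if passes_gates(f)]
        if unique_paths(survivors) >= min_faces:
            kept.append(survivors)
    # Size descending; list.sort is stable, so ties keep discovery order.
    kept.sort(key=lambda fs: -unique_paths(fs))
    return {f"faceset_{start + i:03d}": fs for i, fs in enumerate(kept)}
```

Starting the numbering at `start=20` is what keeps `faceset_001..019` untouched.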
## 4. Result (2026-04-26)

Phase 1 (clustering, before export-swap):

- 137 raw clusters at cos-dist 0.55; top sizes [37, 20, 12, 9, 7, 7, 6, 6, 6, 5].
- After quality gate: 124 faces dropped (mostly `face_short < 90` from
  group-photo tertiary subjects).
- Outlier rejection: 0 dropped (clusters were tight).
- After `min_faces=6`: **7 candidate clusters kept** (sizes 6–28 unique
  source paths).

Phase 2 (`cmd_export_swap` with `min_face_short=100`,
`outlier_threshold=0.45`):

| name | input | outlier drop | exported PNGs |
|--------------|------:|-------------:|--------------:|
| faceset_020 | 71 | 42 | 26 |
| faceset_021 | 36 | 21 | 10 |
| faceset_022 | 15 | 7 | 8 |
| faceset_023 | 19 | 14 | 4 |
| faceset_024 | 6 | 0 | 6 |
| faceset_025 | 10 | 4 | 6 |
| faceset_026 | n/a | n/a | 0 (skipped: empty after filter) |

`faceset_026`'s 6 cluster faces all failed export-swap's tighter
`min_face_short=100` gate (vs. the clustering gate of 90), so it is not
emitted. `faceset_023` is small (4 PNGs) but still useful as an averaged
identity at that size.

Top-level `facesets_swap_ready/manifest.json` now holds **31 substantive
facesets** (12 auto-cluster nl/lzbkp + 7 hand-sorted + 6 era splits + 6
osrc-discovered) + **68 thin_eras** under `_thin/`.

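Appending to the top-level manifest without disturbing existing entries could look like the sketch below. The manifest schema is an assumption: `facesets` is taken to be a list of objects with a `name` key, and `merge_manifest` is a hypothetical helper, not a function from the repository.

```python
import json
import os
import tempfile
from pathlib import Path


def merge_manifest(manifest_path, new_entries):
    """Append new faceset entries; existing entries and other keys stay as-is."""
    path = Path(manifest_path)
    manifest = json.loads(path.read_text(encoding="utf-8"))
    facesets = manifest.setdefault("facesets", [])
    known = {entry["name"] for entry in facesets}

    added = []
    for entry in new_entries:
        if entry["name"] in known:
            continue  # assumed policy: never overwrite an existing name
        facesets.append(entry)
        known.add(entry["name"])
        added.append(entry["name"])

    # Write to a temp file in the same directory, then swap it in, so a
    # crash mid-write cannot leave a truncated manifest behind.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2)
    os.replace(tmp, path)
    return added
```

Because only `facesets` is appended to, a `thin_eras` key round-trips unchanged.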
## 5. Re-running and applying to other mixed buckets

- The cache holds osrc embeddings; to re-run with different parameters,
  edit `cluster_osrc.py`'s config block and re-execute. Cluster discovery
  + export-swap takes a few minutes in total.
- For a different mixed-bucket source, copy `cluster_osrc.py` to
  `cluster_<name>.py` and change `OSRC_DIR`, `OUT_TMP`, `SYNTH_MANIFEST`,
  `START_NNN`. The exclusion step compares against the *current* contents
  of `facesets_swap_ready/faceset_NNN/`, so it picks up everything emitted
  by previous discovery / split / hand-sorted runs.
- Lowering `MIN_FACES` from 6 to 4 would have admitted ~3 additional
  marginal clusters at this corpus size; the trade-off is a noisier
  identity average for small-N facesets.
- `extend` should be run before `cluster_osrc.py` so `raw_full/` and
  `facesets_full/` stay in sync; `cluster_osrc.py` itself only writes
  to `facesets_swap_ready/`.
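
When copying the script for another bucket, `START_NNN` has to sit past the current maximum. A small helper along these lines could derive it from the directory instead of hard-coding it; `next_faceset_number` is a hypothetical name, and the sketch assumes facesets are directories named `faceset_NNN`.

```python
import re
from pathlib import Path

FACESET_RE = re.compile(r"^faceset_(\d{3,})$")


def next_faceset_number(swap_ready_dir):
    """Return one past the highest faceset_NNN directory number (1 if none)."""
    numbers = [
        int(m.group(1))
        for child in Path(swap_ready_dir).iterdir()
        # Only directories count; stray files and `_thin/` are ignored.
        if child.is_dir() and (m := FACESET_RE.match(child.name))
    ]
    return max(numbers, default=0) + 1
```

With `faceset_001..025` present after this run, the helper would return 26 for the next discovery pass.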