Document hand-sorted-folder import + age-split workflow
- README: document work/build_folders.py (hand-sorted folder identities) and the new age-split workflow for splitting a long-running identity into era-specific facesets after clustering.
- Force-track work/age_split_001.py and work/check_faceset001_age.py; these are the worked example + readiness probe for faceset_001 and the template for splitting any other identity by EXIF era.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
95
README.md
@@ -67,6 +67,92 @@ python sort_faces.py export-swap "$CACHE" \
  --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Importing hand-sorted folders as identities

When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool: the identity is asserted, not inferred. The
orchestration script `work/build_folders.py` covers this case:

- For each trusted folder, it filters cache records that fall under it, builds an
  identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so
  bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) to every
  identity whose centroid lies within a tight cosine cutoff (default 0.45). A
  multi-identity photo lands in multiple facesets; `export-swap`'s per-bbox outlier
  filter ensures each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the
  emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and
  merges the new entries into the canonical `facesets_swap_ready/manifest.json`
  (existing facesets are left untouched).
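
The two-pass outlier rejection in the first step can be sketched as follows. This is a minimal illustration, not the code in `work/build_folders.py`: it assumes each pass drops faces whose cosine distance to the current centroid exceeds the cutoff, then recomputes the centroid from the survivors. The helper names (`normalize`, `mean_direction`, `cos_dist`, `robust_centroid`) are hypothetical, and embeddings are plain Python lists to keep the sketch dependency-free.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_direction(unit_vecs):
    """Unit-length mean of a non-empty list of unit vectors."""
    dim = len(unit_vecs[0])
    return normalize([sum(v[i] for v in unit_vecs) for i in range(dim)])


def cos_dist(a, b):
    """Cosine distance between two unit vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def robust_centroid(embeddings, cutoffs=(0.55, 0.45)):
    """Assumed reading of 'two-pass outlier rejection (0.55 -> 0.45)'.

    Pass 1 rejects faces farther than 0.55 from the raw centroid;
    pass 2 rejects faces farther than 0.45 from the recomputed centroid.
    Returns (centroid, kept_unit_embeddings).
    """
    kept = [normalize(e) for e in embeddings]
    for cutoff in cutoffs:
        centroid = mean_direction(kept)
        survivors = [e for e in kept if cos_dist(e, centroid) <= cutoff]
        if survivors:  # never reject everything; keep the previous set instead
            kept = survivors
    return mean_direction(kept), kept
```

With a folder that is mostly one person plus an occasional bystander, the first pass is loose enough to survive a centroid that the bystander has pulled off-target, and the second pass tightens around the cleaned centroid.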

```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
  python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done

# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"

# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```

The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.)
is the only thing to edit when adding more hand-sorted folders later.
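
To illustrate what the routing cutoff controls, here is a hedged sketch of sending mixed-folder faces to every identity within the threshold. Only the name `OSRC_THRESHOLD` and its default of 0.45 come from the text above; `route_face`, `route_photo`, and the shape of `centroids` (a dict from folder label to embedding) are assumptions made for the example.

```python
import math

OSRC_THRESHOLD = 0.45  # default cosine cutoff quoted in the README text


def cos_dist(a, b):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def route_face(embedding, centroids, threshold=OSRC_THRESHOLD):
    """Return every identity label whose centroid is within the cutoff.

    A face may match zero, one, or several identities; no 'best match'
    is forced, which is what lets unknown faces fall through.
    """
    return sorted(
        label
        for label, centroid in centroids.items()
        if cos_dist(embedding, centroid) <= threshold
    )


def route_photo(face_embeddings, centroids, threshold=OSRC_THRESHOLD):
    """Union of identities matched by any face in one photo."""
    matched = set()
    for embedding in face_embeddings:
        matched.update(route_face(embedding, centroids, threshold))
    return matched
```

A photo containing two known people therefore appears under both identities, and a face that is not close to any centroid is routed nowhere.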

### Splitting an identity by era (age sub-clustering)

Long-running source corpora produce identities that span 10+ years. The 2009 face
and the 2024 face of the same person sit in the same cluster (correctly, since they
are the same identity), but a single averaged embedding pulled from that cluster
blurs across ages. For face-swap output that should target a specific period, the
identity needs to be split by era *after* the identity is established.

`work/age_split_001.py` is a worked example for `faceset_001` and a template for
any other identity. The pipeline is:

- **Probe first** with `work/check_faceset001_age.py`, which reports the
  intra-cluster pairwise cos-dist histogram, sub-cluster sizes at thresholds
  0.30..0.50, and the EXIF-year distribution per sub-cluster. If sub-clusters at
  0.35 align with distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/`
  (manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original
  source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one
  re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance
  agglomerative, average linkage).
- **Anchor-based fragment assignment** (not transitive merge, which caused
  year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments
  attach to the single nearest anchor only if both the centroid distance ≤ 0.40
  AND the dominant EXIF year is within ±5 years. Fragments with no qualifying
  anchor remain standalone (and end up THIN-tagged downstream).
- **EXIF year per source path** with on-disk caching at
  `work/cache/age_split_exif.json`. The Windows-mount EXIF read is the
  slowest step, so the cache makes re-runs after a parameter tweak nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face
  square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`,
  human-readable `<era>.txt` marker. Eras with fewer than 20 face records also
  drop a `THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to
  `facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be
  moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`),
  leaving only the substantive era buckets at the top level.
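
The anchor-based fragment assignment is the least obvious step above, so here is a rough sketch of one way to read it. It is not taken from `work/age_split_001.py`. The `SubCluster` record and the constant names are hypothetical, and two details are assumptions: the nearest anchor is picked first and then tested against both conditions, and a fragment without a dominant EXIF year stays standalone.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

ANCHOR_MIN_SIZE = 20      # sub-clusters at or above this size are anchors
MAX_CENTROID_DIST = 0.40  # fragment-to-anchor cosine distance limit
MAX_YEAR_GAP = 5          # dominant EXIF years must be within +/- 5


@dataclass
class SubCluster:
    name: str
    size: int
    centroid: List[float]
    year: Optional[int]  # dominant EXIF year, None if unknown


def cos_dist(a, b):
    """Cosine distance between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def assign_fragments(clusters):
    """Map each sub-cluster name to the anchor it joins, or to itself.

    Fragments attach only to anchors, never to other fragments, so there
    is no transitive chaining that could walk across years.
    """
    anchors = [c for c in clusters if c.size >= ANCHOR_MIN_SIZE]
    assignment = {}
    for c in clusters:
        target = c.name  # anchors and unmatched fragments map to themselves
        if c.size < ANCHOR_MIN_SIZE and anchors and c.year is not None:
            nearest = min(anchors, key=lambda a: cos_dist(c.centroid, a.centroid))
            close = cos_dist(c.centroid, nearest.centroid) <= MAX_CENTROID_DIST
            same_era = (
                nearest.year is not None
                and abs(c.year - nearest.year) <= MAX_YEAR_GAP
            )
            if close and same_era:
                target = nearest.name
        assignment[c.name] = target
    return assignment
```

Under this reading, a fragment that looks like the 2009 anchor but was shot in 2020 is left standalone rather than pulled into the wrong era.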

```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py

# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```
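
The reason re-runs are cheap is the path → EXIF-year cache. A minimal sketch of that caching pattern is shown below; it is an illustration under assumptions, not the script's actual code. The function names are hypothetical, the cache is assumed to be a flat JSON object keyed by source path, and `read_year` stands in for whatever slow EXIF reader is used (it should return an `int` year or `None`).

```python
import json
from pathlib import Path


def load_year_cache(cache_path):
    """Load the path -> year mapping, or start empty if no cache exists yet."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    return {}


def exif_years(paths, cache_path, read_year):
    """Return {path: year} for `paths`, calling `read_year` only on cache misses.

    `read_year` is a caller-supplied callable (hypothetical here) that does
    the slow EXIF read for one path and returns an int year or None.
    """
    cache = load_year_cache(cache_path)
    dirty = False
    for path in paths:
        key = str(path)
        if key not in cache:
            cache[key] = read_year(path)
            dirty = True
    if dirty:
        cache_file = Path(cache_path)
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(cache, indent=1), encoding="utf-8")
    return {str(path): cache[str(path)] for path in paths}
```

Because a `None` result is cached too, files without usable EXIF data are not re-read on every run; deleting the JSON file forces a full refresh.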

For the `faceset_001` run on the 5260-face `nl_full.npz` cache, this produced 6
substantive era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20;
sizes 43–282) plus 68 thin/fragment buckets quarantined under `_thin/`.
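
For readers unfamiliar with the sub-clustering step that yields these buckets, the following is a small, dependency-free sketch of average-linkage agglomerative clustering over a precomputed distance matrix with a distance cutoff. It is illustrative only: the function name is hypothetical, stopping when the closest pair is at or above the cutoff is an assumption, and the naive loop is far slower than a library routine.

```python
def average_linkage_clusters(dist, threshold=0.35):
    """Greedy average-linkage clustering on a square distance matrix.

    `dist[i][j]` is the precomputed cosine distance between faces i and j.
    Merging stops once the closest pair of clusters is at or above
    `threshold`. Returns lists of row indices, largest cluster first.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average of all pairwise distances between members of a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best_dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best_dist >= threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(clusters, key=len, reverse=True)
```

In practice a library implementation (for example scikit-learn's `AgglomerativeClustering` or SciPy's hierarchical clustering) would normally replace this loop; the sketch only shows what "average linkage at 0.35" means. Sub-clusters that come out with 20 or more members would then act as anchors in the fragment-assignment step.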

## Key defaults

`refine`:
@@ -111,9 +197,14 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with
 ├─ docs/
 │  └─ analysis/
 │     └─ facesets-downstream-refinement-evaluation.md
-└─ work/  (gitignored)
+└─ work/  (gitignored except force-tracked .py)
+   ├─ build_folders.py  (hand-sorted-folder orchestration)
+   ├─ check_faceset001_age.py  (age-split readiness probe)
+   ├─ age_split_001.py  (age-split orchestration; faceset_001)
+   ├─ synthetic_refine_manifest.json  (last build_folders.py output)
    ├─ cache/
-   │  └─ nl_full.npz  (canonical cache + duplicates.json)
+   │  ├─ nl_full.npz  (canonical cache + duplicates.json)
+   │  └─ age_split_exif.json  (path → EXIF-year cache)
    └─ logs/
       └─ *.log  (every long step writes here)
 ```