Rewrite pipeline: resumable embed, byte-dedup, extend, dedup report
- embed: sha256-based dedup at listing (embed each unique hash once, carry other paths as aliases via a top-level path_aliases dict); resumable from any existing cache; atomic incremental flush every 50 files; explicit skip-ext filtering; schema bumped with processed_paths + path_aliases.
- extend: new subcommand that merges new embeddings into an existing raw + facesets output without renumbering. Nearest person-centroid match above threshold, unmatched faces re-clustered into new person_NNN / _singletons. Optional --refine-out also extends facesets by centroid + quality gate.
- dedup: new subcommand producing byte-identical + visual near-duplicate groups as a JSON report.
- cluster/refine: fan every placement across canonical + aliases so each on-disk location gets represented.
- safe_dst_name now always flattens the absolute path so filenames stay stable across runs when src_root shifts (fixes duplicate-copy bug that surfaced during the lzbkp_red extend).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
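The "nearest person-centroid match above threshold" step that the commit message describes for `extend` could look roughly like the sketch below. This is not the code from `sort_faces.py`: the function names, the dict-based inputs, the pure-Python cosine similarity, and the threshold value are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_to_centroids(new_faces, centroids, threshold):
    """Assign each new face to its most similar existing person centroid.

    new_faces: {face_id: vector}; centroids: {person_label: vector}.
    Both layouts are hypothetical. Returns (assigned, unmatched), where
    unmatched holds the faces that would be re-clustered into new groups.
    """
    assigned, unmatched = {}, []
    for face_id, vec in new_faces.items():
        best_label, best_sim = None, -math.inf
        for label, centroid in centroids.items():
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_label is not None and best_sim > threshold:
            assigned[face_id] = best_label  # existing label is kept, no renumbering
        else:
            unmatched.append(face_id)
    return assigned, unmatched


# Toy 2-D vectors; real face embeddings have far more dimensions.
centroids = {"person_001": [1.0, 0.0], "person_002": [0.0, 1.0]}
new_faces = {"a.jpg#0": [0.9, 0.1], "b.jpg#0": [0.7, 0.7], "c.jpg#0": [-1.0, 0.0]}
assigned, unmatched = match_to_centroids(new_faces, centroids, threshold=0.8)
```

In this toy run only `a.jpg#0` clears the threshold; the other two faces would go on to the re-clustering step.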
18
README.md
@@ -4,27 +4,35 @@ Sort photos by similar face using InsightFace embeddings + agglomerative cluster

## Pipeline

-`sort_faces.py` is a single-file CLI with three subcommands:
+`sort_faces.py` is a single-file CLI with four subcommands:

| step | what it does |
|---------|------------------------------------------------------------------------------|
| embed | Recursively scan a source tree, detect + embed every face, write `.npz` cache |
| cluster | Raw agglomerative clustering of the cache into `person_NNN/` / `_singletons/` / `_noface/` |
| refine | Initial cluster → centroid merge → quality gate → outlier rejection → size filter → `faceset_NNN/` |
+| dedup | Post-hoc near-duplicate report: byte-identical groups + visual near-dupes (same face + same size within a tight cosine threshold) |

+`embed` is resumable and incremental: it loads any existing cache at the target path and only hashes/embeds files it hasn't processed before. A periodic flush (default every 50 new files) writes the cache atomically, so a mid-run crash loses at most a few dozen embeddings.
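A resumable run with atomic periodic flushes, as described in the added README line above, might be structured like this sketch. It is only an illustration: it stores the cache as JSON rather than `.npz` to stay within the standard library, and `load_cache`, `flush_atomic`, `embed_incremental`, and the `embeddings` key are hypothetical names (only `processed_paths` comes from the commit message). The write-to-temp-file-then-`os.replace` pattern is one common way to get an atomic write; the commit does not say which mechanism it uses.

```python
import json
import os
import tempfile


def load_cache(path):
    """Return the existing cache, or an empty one if no cache file exists yet."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    return {"processed_paths": [], "embeddings": {}}


def flush_atomic(cache, path):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(cache, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # readers see either the old or the new file
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def embed_incremental(files, cache_path, embed_fn, flush_every=50):
    """Embed only files not already in the cache, flushing every N new files."""
    cache = load_cache(cache_path)
    done = set(cache["processed_paths"])
    new_since_flush = 0
    for f in files:
        if f in done:
            continue  # processed in an earlier run
        cache["embeddings"][f] = embed_fn(f)
        cache["processed_paths"].append(f)
        done.add(f)
        new_since_flush += 1
        if new_since_flush >= flush_every:
            flush_atomic(cache, cache_path)
            new_since_flush = 0
    flush_atomic(cache, cache_path)
    return cache


# Demo: the second run only embeds the file that is new.
with tempfile.TemporaryDirectory() as tmpdir:
    cache_path = os.path.join(tmpdir, "cache.json")
    calls = []

    def fake_embed(p):
        calls.append(p)
        return [float(len(p))]

    embed_incremental(["a.jpg", "b.jpg"], cache_path, fake_embed, flush_every=1)
    embed_incremental(["a.jpg", "b.jpg", "c.jpg"], cache_path, fake_embed, flush_every=1)
    resumed = load_cache(cache_path)
```

With a flush interval of N, a crash between flushes loses at most the N - 1 embeddings computed since the last successful write.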
+
+Byte-identical duplicates are detected via sha256 during the listing phase. The canonical file is embedded once; other paths with the same hash are carried as `aliases` on the cache's top-level `path_aliases` dict. Every alias is materialized by `cluster`/`refine`, so each on-disk location ends up represented in the output.
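The listing-phase dedup in the added README line above can be sketched as follows. The helper names and the exact shape of `path_aliases` (canonical path mapped to a list of alias paths) are assumptions; the diff only states that aliases live in a top-level `path_aliases` dict. Picking the first path in sorted order as canonical is likewise an arbitrary choice for this example.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images are never read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def list_unique(paths):
    """Return (canonical_paths, path_aliases) for a list of file paths.

    The first path (in sorted order) seen for each hash becomes canonical;
    later paths with the same hash are recorded as its aliases.
    """
    canonical_by_hash = {}
    path_aliases = {}
    for p in sorted(paths):
        digest = sha256_of(p)
        canon = canonical_by_hash.setdefault(digest, p)
        if canon != p:
            path_aliases.setdefault(canon, []).append(p)
    return sorted(canonical_by_hash.values()), path_aliases


# Demo: a.jpg and b.jpg are byte-identical, c.jpg differs.
with tempfile.TemporaryDirectory() as tmpdir:
    contents = {"a.jpg": b"same bytes", "b.jpg": b"same bytes", "c.jpg": b"other"}
    paths = []
    for name, data in contents.items():
        full = os.path.join(tmpdir, name)
        with open(full, "wb") as fh:
            fh.write(data)
        paths.append(full)
    canonical, aliases = list_unique(paths)
    canonical_names = [os.path.basename(p) for p in canonical]
    alias_names = {
        os.path.basename(k): [os.path.basename(v) for v in vs]
        for k, vs in aliases.items()
    }
```

Only the canonical paths would then be passed to the embedder, while a later placement step can copy a result to the canonical path and to every alias recorded for it.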
+
Cache and outputs are kept out of the repo via `.gitignore`; defaults live under `work/`.

## Typical run

```bash
-# 1. Embed (CPU; InsightFace buffalo_l). Caches faces + metadata.
-python sort_faces.py embed "/mnt/x/src/nl/Neuer Ordner (2)/New Folder" work/cache/nl_all.npz
+# 1. Embed (CPU; InsightFace buffalo_l). Caches faces + metadata. Resumable.
+python sort_faces.py embed /mnt/x/src/nl work/cache/nl_full.npz

# 2. Raw clusters (every multi-face cluster -> a person_NNN/ folder).
-python sort_faces.py cluster work/cache/nl_all.npz /mnt/e/temp_things/fcswp/nl_sorted/raw
+python sort_faces.py cluster work/cache/nl_full.npz /mnt/e/temp_things/fcswp/nl_sorted/raw_full

# 3. Refined facesets (filters for faceset-ready quality).
-python sort_faces.py refine work/cache/nl_all.npz /mnt/e/temp_things/fcswp/nl_sorted/facesets
+python sort_faces.py refine work/cache/nl_full.npz /mnt/e/temp_things/fcswp/nl_sorted/facesets_full
+
+# 4. (Optional) report on byte-identical + visual near-duplicates.
+python sort_faces.py dedup work/cache/nl_full.npz
```

## Refine defaults