Document hand-sorted-folder import + age-split workflow
- README: document work/build_folders.py (hand-sorted folder identities) and the
  new age-split workflow for splitting a long-running identity into era-specific
  facesets after clustering.
- Force-track work/age_split_001.py and work/check_faceset001_age.py; these are
  the worked example + readiness probe for faceset_001 and the template for
  splitting any other identity by EXIF era.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.md | 95
@@ -67,6 +67,92 @@ python sort_faces.py export-swap "$CACHE" \
  --raw-manifest "$OUT/raw_full/manifest.json" --candidates
```

### Importing hand-sorted folders as identities

When source folders are already hand-sorted by person (one folder per identity), the
clustering path is the wrong tool — the identity is asserted, not inferred. The
orchestration script `work/build_folders.py` covers this case:

- For each trusted folder, it filters cache records that fall under it, builds an
  identity centroid via two-pass outlier rejection (cos-dist 0.55 → 0.45) so
  bystanders in group photos drop out, and writes a synthetic `refine_manifest.json`.
  A minimal sketch of the two-pass centroid follows this list.
- It then routes each face record from a *mixed* folder (e.g. `osrc/`) to every
  identity whose centroid is within a tight cosine cutoff (default 0.45). A
  multi-identity photo lands in multiple facesets; `export-swap`'s per-bbox outlier
  filter ensures each faceset crops only its matching face.
- Finally it invokes `cmd_export_swap` against the synthetic manifest, renames the
  emitted `.fsz` bundles after the source folder, drops a `<label>.txt` marker, and
  merges the new entries into the canonical `facesets_swap_ready/manifest.json`
  (existing facesets are left untouched).

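The centroid pass is easy to get wrong, so here is a minimal sketch of the
two-pass outlier rejection, assuming L2-normalized cache embeddings (the
function name is illustrative; `work/build_folders.py` is the authoritative
version):

```python
import numpy as np

def two_pass_centroid(emb: np.ndarray, first: float = 0.55,
                      second: float = 0.45) -> np.ndarray:
    """Identity centroid with two-pass outlier rejection.

    Pass 1 drops faces > `first` cos-dist from the raw mean (bystanders in
    group photos); pass 2 re-centroids on survivors and tightens to `second`.
    """
    def _norm(v: np.ndarray) -> np.ndarray:
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    c = _norm(emb.mean(axis=0))
    keep = (1.0 - emb @ c) <= first    # pass 1: drop gross outliers
    c = _norm(emb[keep].mean(axis=0))
    keep = (1.0 - emb @ c) <= second   # pass 2: tighten on the cleaner centroid
    return _norm(emb[keep].mean(axis=0))
```
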
```bash
# Embed each hand-sorted folder + the mixed bucket; cache deduplicates by sha256.
for d in k m mi mir s sab t osrc; do
  python sort_faces.py embed "/mnt/x/src/$d" "$CACHE"
done

# Bring landmarks/pose + visual-dupe report in sync with the new records.
python sort_faces.py enrich "$CACHE"
python sort_faces.py dedup "$CACHE"

# Build per-folder identities + osrc routing -> facesets_swap_ready/faceset_NNN/.
python work/build_folders.py
```

The script's config block (`TRUSTED`, `START_NNN`, `OSRC_THRESHOLD`, `TOP_N`, etc.)
is the only thing to edit when adding more hand-sorted folders later.

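A hypothetical example of such an edit (the identifiers are the ones named
above; the values shown are assumptions, since the config block itself is not
part of this commit):

```python
# work/build_folders.py config block (hypothetical: adding a new folder "z")
TRUSTED = ["k", "m", "mi", "mir", "s", "sab", "t", "z"]  # one identity per folder
START_NNN = 20          # first faceset_NNN slot for new identities (assumed value)
OSRC_THRESHOLD = 0.45   # mixed-folder routing cutoff (README default)
TOP_N = 30              # faces per top-N .fsz bundle (assumed value)
```
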
### Splitting an identity by era (age sub-clustering)

Long-running source corpora produce identities that span 10+ years. The 2009 face
and the 2024 face of the same person sit in the same cluster (correctly — same
identity), but a single averaged embedding pulled from that cluster blurs across
ages. For face-swap output that should target a specific period, the identity
needs to be split by era *after* the identity is established.

`work/age_split_001.py` is a worked example for `faceset_001` and a template for
any other identity. The pipeline is:

- **Probe first** with `work/check_faceset001_age.py` — report the intra-cluster
  pairwise cos-dist histogram, sub-cluster sizes at thresholds 0.30..0.50, and the
  EXIF-year distribution per sub-cluster. If sub-clusters at 0.35 align with
  distinct year ranges, the identity is age-sortable.
- **Seed centroid** from the curated `facesets_swap_ready/faceset_001/`
  (manifest provides face keys → cache rows).
- **Wide recovery** at cos-dist ≤ 0.55 against the seed under the original
  source roots, then quality-gate (`face_short`, `blur`, `det_score`) and one
  re-centroid + tighten pass at 0.50 to absorb new faces without drift.
- **Sub-cluster** the survivors at cos-dist 0.35 (precomputed-distance
  agglomerative, average linkage).
- **Anchor-based fragment assignment** (not transitive merge — that caused
  year-drift): sub-clusters with size ≥ 20 are *anchors*; smaller fragments
  attach to the single nearest anchor only if both the centroid distance ≤ 0.40
  AND the dominant EXIF year is within ±5 years. Fragments with no qualifying
  anchor remain standalone (and end up THIN-tagged downstream). See the sketch
  after this list.
- **EXIF year per source path** with on-disk caching at
  `work/cache/age_split_exif.json` — the Windows-mount EXIF read is the
  slowest step, so re-runs after a parameter tweak are nearly instant.
- **Per-era export** mirrors `export-swap`: composite-quality rank, single-face
  square PNG crops, top-N + `_all` `.fsz` bundles, per-era `manifest.json`,
  human-readable `<era>.txt` marker. Eras with < 20 face records also drop a
  `THIN.txt` marker so they can be quarantined.
- **Top-level manifest merge**: era buckets are appended to
  `facesets_swap_ready/manifest.json`. Operationally the THIN buckets should be
  moved into `_thin/` (and the manifest split into `facesets` + `thin_eras`),
  leaving only the substantive era buckets at the top level.

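The assignment rule is compact enough to state in code. This is a condensed
sketch of what `work/age_split_001.py` does (names simplified from the script):

```python
import numpy as np

def assign_fragments(anchors, fragments, centroid, dom_year,
                     cent_max=0.40, year_max=5):
    """Attach each small fragment to its nearest qualifying era anchor.

    `centroid[lab]` is an L2-normalized embedding centroid; `dom_year[lab]`
    is the sub-cluster's dominant EXIF year, or None if fully undated.
    """
    a_cent = np.stack([centroid[a] for a in anchors])
    assignments = {a: a for a in anchors}  # anchors never merge with each other
    for f in fragments:
        cd = 1.0 - a_cent @ centroid[f]    # cosine distance to every anchor
        yd = np.array([abs(dom_year[f] - dom_year[a])
                       if dom_year[f] is not None and dom_year[a] is not None
                       else np.inf
                       for a in anchors])
        ok = (cd <= cent_max) & (yd <= year_max)  # BOTH gates must pass
        if ok.any():
            assignments[f] = anchors[int(np.argmin(np.where(ok, cd, np.inf)))]
        # else: fragment stays standalone and is THIN-tagged downstream
    return assignments
```
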
```bash
# 1. Confirm the identity is age-sortable.
python work/check_faceset001_age.py

# 2. Split it. Re-runs are cheap thanks to the EXIF cache.
python work/age_split_001.py
```

For the `faceset_001` run on the 5260-face `nl_full.npz`, this produced 6 substantive
era buckets (2005–10, 2010–13, 2011, 2014–17, 2018–19, 2018–20; sizes 43–282)
plus 68 thin/fragment buckets quarantined under `_thin/`.

## Key defaults

`refine`:
@@ -111,9 +197,14 @@ Highly recommended at swap time: enable **Select post-processing = GFPGAN** with
├─ docs/
│   └─ analysis/
│       └─ facesets-downstream-refinement-evaluation.md
└─ work/                                (gitignored except force-tracked .py)
    ├─ build_folders.py                 (hand-sorted-folder orchestration)
    ├─ check_faceset001_age.py          (age-split readiness probe)
    ├─ age_split_001.py                 (age-split orchestration; faceset_001)
    ├─ synthetic_refine_manifest.json   (last build_folders.py output)
    ├─ cache/
    │   ├─ nl_full.npz                  (canonical cache + duplicates.json)
    │   └─ age_split_exif.json          (path → EXIF-year cache)
    └─ logs/
        └─ *.log                        (every long step writes here)
```

work/age_split_001.py | 485 (new file)
@@ -0,0 +1,485 @@
#!/usr/bin/env python3
"""Age-split person_001 into era-specific facesets.

Workflow:
 1. Seed a clean person_001 centroid from the existing curated 707-face
    `facesets_swap_ready/faceset_001/`.
 2. Wide-recovery scan: pull every face record under /mnt/x/src/{nl, lzbkp_red}
    from `nl_full.npz` with cos-dist <= 0.55 from the seed centroid.
 3. Apply export-swap-style per-face quality gates.
 4. One re-centroid + 0.50 tighten pass to absorb the recovery without drift.
 5. Agglomerative sub-clustering at cos-dist 0.35.
 6. Anchor-based fragment assignment: sub-clusters with >= 20 faces are era
    anchors; smaller fragments attach to the nearest anchor only if centroid
    cos-dist <= 0.40 AND the dominant EXIF years are within 5 years.
 7. Read EXIF DateTimeOriginal for each face's source path; era label =
    (p10 year, p90 year) over dated faces.
 8. Undated faces stay in whatever sub-cluster their embedding put them in;
    era labels are computed from dated faces only.
 9. For each era: composite-quality rank, single-face PNG crops, .fsz bundles
    (top-N, plus _all when the era has more than top_n faces), a `<label>.txt`
    marker file. Eras with <20 face records get a `THIN.txt` marker.
10. Append era entries into the canonical
    `facesets_swap_ready/manifest.json` next to the existing 19.
"""

from __future__ import annotations

import json
import shutil
import sys
from collections import Counter
from pathlib import Path

import numpy as np
from PIL import Image, ExifTags

REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))

from sort_faces import (  # noqa: E402
    QUALITY_WEIGHTS,
    _crop_face_square,
    _zip_png_list,
    compute_quality,
    load_cache,
    load_rgb_bgr,
)

# ---- config -------------------------------------------------------------- #

CACHE = REPO / "work" / "cache" / "nl_full.npz"
SWAP_READY = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready")
FS001 = SWAP_READY / "faceset_001"

SCAN_ROOTS = [
    Path("/mnt/x/src/nl"),
    Path("/mnt/x/src/lzbkp_red"),
]

# Recovery + identity refinement
RECOVERY_THRESHOLD = 0.55  # initial centroid match
TIGHTEN_THRESHOLD = 0.50   # post-recentroid drift trim
# Quality gates (mirror export-swap defaults)
MIN_FACE_SHORT = 100
# Sub-cluster
SUBCLUSTER_THRESHOLD = 0.35
# Anchor-based fragment assignment (replaces transitive union-find merge):
ANCHOR_MIN_SIZE = 20          # sub-cluster size to qualify as an era anchor
FRAGMENT_CENTROID_MAX = 0.40  # small fragment may join an anchor only if cent_dist <=
FRAGMENT_YEAR_MAX = 5         # AND |dom_year_anchor - dom_year_fragment| <=
# Output
TOP_N = 30
PAD_RATIO = 0.5
OUT_SIZE = 512
THIN_THRESHOLD = 20

# EXIF cache (so re-runs skip the 30-min Windows-mount EXIF read)
EXIF_CACHE = REPO / "work" / "cache" / "age_split_exif.json"


# ---- helpers ------------------------------------------------------------- #

def _normalize(v: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def _under(roots: list[Path], p: str) -> bool:
    for r in roots:
        rs = str(r).rstrip("/") + "/"
        if p == str(r) or p.startswith(rs):
            return True
    return False


def _record_in_roots(rec: dict, roots: list[Path], path_aliases: dict) -> bool:
    if _under(roots, rec["path"]):
        return True
    for alias in path_aliases.get(rec["path"], []):
        if _under(roots, alias):
            return True
    return False


def exif_year(path: Path) -> int | None:
    try:
        with Image.open(path) as im:
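            # NB: _getexif() is a private PIL API (stable in practice for
            # JPEG); Image.getexif() is the public alternative.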
            exif = im._getexif()
            if not exif:
                return None
            for tag_id, val in exif.items():
                tag = ExifTags.TAGS.get(tag_id, tag_id)
                if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
                    return int(val[:4])
    except Exception:
        return None
    return None


def label_for_era(years: list[int]) -> str:
    """Era label as a year-range string. Falls back to 'undated' if no years."""
    if not years:
        return "undated"
    ys = sorted(years)
    lo = ys[len(ys) // 10] if len(ys) >= 10 else ys[0]
    hi = ys[-(len(ys) // 10) - 1] if len(ys) >= 10 else ys[-1]
    if lo == hi:
        return str(lo)
    # Compact year range like 2011-13 if same century, else 1999-2003.
    if (lo // 100) == (hi // 100):
        return f"{lo}-{hi % 100:02d}"
    return f"{lo}-{hi}"


# ---- phase 1 + 2: seed centroid + recovery scan ------------------------- #

def main() -> None:
    if not FS001.exists():
        raise SystemExit(f"missing seed faceset: {FS001}")

    print("=== loading cache ===")
    emb, meta, _src, _proc, path_aliases = load_cache(CACHE)
    face_records = [m for m in meta if not m.get("noface")]
    if len(face_records) != len(emb):
        raise SystemExit(f"emb/meta mismatch: {len(face_records)} vs {len(emb)}")

    bbox_idx = {(m["path"], tuple(m.get("bbox") or ())): i for i, m in enumerate(face_records)}

    seed_manifest = json.loads((FS001 / "manifest.json").read_text())
    seed_face_keys = [(f["source"], tuple(f.get("bbox") or ())) for f in seed_manifest["faces"]]
    seed_indices = [bbox_idx[k] for k in seed_face_keys if k in bbox_idx]
    print(f"seed faces from faceset_001: {len(seed_indices)} (manifest had {len(seed_face_keys)})")

    seed_centroid = _normalize(emb[seed_indices].mean(axis=0))

    # Recovery: every face record under nl/ + lzbkp_red/ within RECOVERY_THRESHOLD.
    candidate_idxs = [
        i for i, rec in enumerate(face_records)
        if _record_in_roots(rec, SCAN_ROOTS, path_aliases)
    ]
    print(f"\ncandidates under {[str(r) for r in SCAN_ROOTS]}: {len(candidate_idxs)}")

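    # Embeddings in the cache are L2-normalized, so cosine distance reduces to
    # 1 - dot product (same convention as in check_faceset001_age.py).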
    cand_emb = emb[candidate_idxs]
    cand_dists = 1.0 - cand_emb @ seed_centroid
    recovered_local = [k for k, d in enumerate(cand_dists) if d <= RECOVERY_THRESHOLD]
    recovered = [candidate_idxs[k] for k in recovered_local]
    print(f"recovered at cos-dist <= {RECOVERY_THRESHOLD}: {len(recovered)}")

    # Quality gate.
    qualified = []
    drop_size = drop_blur = drop_det = 0
    for i in recovered:
        r = face_records[i]
        if r.get("face_short", 0) < MIN_FACE_SHORT:
            drop_size += 1
            continue
        if r.get("blur", 0.0) < 40.0:
            drop_blur += 1
            continue
        if r.get("det_score", 0.0) < 0.6:
            drop_det += 1
            continue
        qualified.append(i)
    print(f"after quality gate: {len(qualified)} (drop size={drop_size} blur={drop_blur} det={drop_det})")

    # One tightening pass: re-centroid on qualified, drop anyone > TIGHTEN_THRESHOLD.
    qcent = _normalize(emb[qualified].mean(axis=0))
    qd = 1.0 - emb[qualified] @ qcent
    tight = [qualified[k] for k, d in enumerate(qd) if d <= TIGHTEN_THRESHOLD]
    print(f"after re-centroid tighten ({TIGHTEN_THRESHOLD}): {len(tight)}")

    # ---- phase 5: sub-cluster -------------------------------------------- #
    print("\n=== sub-clustering ===")
    from sklearn.cluster import AgglomerativeClustering

    E = emb[tight]
    sims = E @ E.T
    dists = 1.0 - sims
    # Floor numerical noise.
    np.fill_diagonal(dists, 0.0)
    dists = np.maximum(dists, 0.0)

    ac = AgglomerativeClustering(
        n_clusters=None,
        metric="precomputed",
        linkage="average",
        distance_threshold=SUBCLUSTER_THRESHOLD,
    )
    labels = ac.fit_predict(dists)
    sub_sizes = Counter(labels)
    print(f"raw sub-clusters: {len(sub_sizes)} (sizes: top10={sorted(sub_sizes.values(), reverse=True)[:10]})")

    # Per-cluster: indices, centroid, EXIF years.
    cluster_indices: dict[int, list[int]] = {}
    for k, lab in enumerate(labels):
        cluster_indices.setdefault(int(lab), []).append(tight[k])

    cluster_centroids: dict[int, np.ndarray] = {}
    for lab, idxs in cluster_indices.items():
        cluster_centroids[lab] = _normalize(emb[idxs].mean(axis=0))

    print("\n=== EXIF years (one read per source path; cached) ===")
    unique_paths = sorted({face_records[i]["path"] for i in tight})
    if EXIF_CACHE.exists():
        cached = json.loads(EXIF_CACHE.read_text())
    else:
        cached = {}
    path_year: dict[str, int | None] = {}
    new_reads = 0
    for p in unique_paths:
        if p in cached:
            path_year[p] = cached[p]
        else:
            y = exif_year(Path(p))
            path_year[p] = y
            cached[p] = y
            new_reads += 1
    EXIF_CACHE.parent.mkdir(parents=True, exist_ok=True)
    EXIF_CACHE.write_text(json.dumps(cached, indent=0))
    dated = sum(1 for v in path_year.values() if v is not None)
    print(f"  EXIF cache: {len(cached)} entries, {new_reads} new reads, "
          f"{dated}/{len(unique_paths)} dated")

    cluster_years: dict[int, list[int]] = {}
    cluster_dom_year: dict[int, int | None] = {}
    for lab, idxs in cluster_indices.items():
        ys = []
        for i in idxs:
            y = path_year.get(face_records[i]["path"])
            if y is not None:
                ys.append(y)
        cluster_years[lab] = ys
        cluster_dom_year[lab] = (Counter(ys).most_common(1)[0][0]) if ys else None

    # ---- phase 6: anchor-based fragment assignment ----------------------- #
    # Each sub-cluster of size >= ANCHOR_MIN_SIZE is an "era anchor". Smaller
    # fragments are assigned to the single nearest anchor IFF (centroid distance
    # <= FRAGMENT_CENTROID_MAX AND |dom_year delta| <= FRAGMENT_YEAR_MAX).
    # Anchors do NOT merge with each other, which avoids the transitive year
    # drift observed when union-find was used. Standalone fragments stay as
    # their own (likely THIN) eras.
    print("\n=== anchor-based assignment ===")
    anchors = [lab for lab, idxs in cluster_indices.items() if len(idxs) >= ANCHOR_MIN_SIZE]
    fragments = [lab for lab in cluster_indices if lab not in anchors]
    anchors.sort(key=lambda l: -len(cluster_indices[l]))
    print(f"anchors (size>={ANCHOR_MIN_SIZE}): {len(anchors)}; fragments: {len(fragments)}")
    for a in anchors:
        print(f"  anchor sub {a}: size={len(cluster_indices[a])} dom_year={cluster_dom_year[a]}")

    if anchors:
        a_cent = np.stack([cluster_centroids[a] for a in anchors])
        assignments: dict[int, int] = {a: a for a in anchors}  # anchor -> self
        unassigned: list[int] = []
        for f in fragments:
            f_cent = cluster_centroids[f]
            f_year = cluster_dom_year[f]
            # cosine distances to each anchor
            cd = 1.0 - a_cent @ f_cent
            # year distance (inf if either dom-year unknown)
            yd = []
            for a in anchors:
                ay = cluster_dom_year[a]
                if f_year is None or ay is None:
                    yd.append(float("inf"))
                else:
                    yd.append(abs(f_year - ay))
            yd = np.array(yd)
            ok = (cd <= FRAGMENT_CENTROID_MAX) & (yd <= FRAGMENT_YEAR_MAX)
            if not ok.any():
                unassigned.append(f)
                continue
            # nearest qualifying anchor by centroid distance.
            cd_masked = np.where(ok, cd, np.inf)
            best = int(np.argmin(cd_masked))
            assignments[f] = anchors[best]
        print(f"  assigned fragments: {sum(1 for k, v in assignments.items() if k != v)}/{len(fragments)}; "
              f"unassigned (standalone): {len(unassigned)}")
    else:
        print("  no anchors; every sub-cluster stands alone")
        assignments = {lab: lab for lab in cluster_indices}
        unassigned = []

    merged: dict[int, list[int]] = {}
    for lab, idxs in cluster_indices.items():
        root = assignments.get(lab, lab)
        merged.setdefault(root, []).extend(idxs)

    merged_sizes = sorted(((r, len(v)) for r, v in merged.items()), key=lambda kv: -kv[1])
    print(f"era buckets: {len(merged)} (top10 sizes: {[s for _, s in merged_sizes[:10]]})")

    # Recompute centroid + dom-year for merged eras.
    era_indices: dict[int, list[int]] = merged
    era_centroids: dict[int, np.ndarray] = {}
    era_year_label: dict[int, str] = {}
    era_years_full: dict[int, list[int]] = {}
    for root, idxs in era_indices.items():
        era_centroids[root] = _normalize(emb[idxs].mean(axis=0))
        ys = []
        for i in idxs:
            y = path_year.get(face_records[i]["path"])
            if y is not None:
                ys.append(y)
        era_years_full[root] = ys
        era_year_label[root] = label_for_era(ys)

    # ---- phase 8: undated faces (no EXIF) --------------------------------- #
    # "Undated" = no EXIF year for the source path. Each undated face already
    # sits in a sub-cluster via its embedding, so no extra assignment is needed;
    # era *labels* are unaffected because they come from dated faces only.
    n_undated = sum(1 for i in tight if path_year.get(face_records[i]["path"]) is None)
    print(f"undated face records (no EXIF): {n_undated}/{len(tight)} (placed by embedding only)")

    # ---- phase 9: per-era export ------------------------------------------ #
    import cv2

    print("\n=== exporting era bundles ===")
    new_manifest_entries: list[dict] = []
    eras_sorted = sorted(era_indices.items(), key=lambda kv: -len(kv[1]))
    for root, idxs in eras_sorted:
        size = len(idxs)
        label = era_year_label[root]
        era_name = f"faceset_001_{label}"
        out_dir = SWAP_READY / era_name

        # Disambiguate same-label collisions (e.g. two distinct embedding eras both 2019).
        collision = 2
        while out_dir.exists():
            era_name = f"faceset_001_{label}_v{collision}"
            out_dir = SWAP_READY / era_name
            collision += 1

        faces_dir = out_dir / "faces"
        faces_dir.mkdir(parents=True, exist_ok=True)

        # Composite quality + rank.
        ranked = []
        for ci in idxs:
            rec = face_records[ci]
            q = compute_quality(rec)
            ranked.append({"cache_idx": ci, "rec": rec, "quality": q})

        # Dedup by source path within this era — keep highest-quality face per path.
        seen_path: dict[str, dict] = {}
        for r in ranked:
            p = r["rec"]["path"]
            prev = seen_path.get(p)
            if prev is None or r["quality"]["composite"] > prev["quality"]["composite"]:
                seen_path[p] = r
        unique = sorted(seen_path.values(), key=lambda r: -r["quality"]["composite"])

        # Materialize crops.
        written: list[Path] = []
        face_entries: list[dict] = []
        for rank, r in enumerate(unique, start=1):
            rec = r["rec"]
            src = Path(rec["path"])
            if not src.exists():
                continue
            rgb, _ = load_rgb_bgr(src)
            if rgb is None:
                continue
            crop = _crop_face_square(rgb, rec["bbox"], PAD_RATIO, OUT_SIZE)
            png = faces_dir / f"{rank:04d}.png"
            cv2.imwrite(str(png), cv2.cvtColor(crop, cv2.COLOR_RGB2BGR))
            written.append(png)
            face_entries.append({
                "rank": rank,
                "png": f"faces/{rank:04d}.png",
                "source": rec["path"],
                "aliases": path_aliases.get(rec["path"], []),
                "bbox": rec["bbox"],
                "face_short": rec.get("face_short"),
                "det_score": rec.get("det_score"),
                "blur": rec.get("blur"),
                "pose": rec.get("pose"),
                "exif_year": path_year.get(rec["path"]),
                "quality": r["quality"],
            })

        if not written:
            print(f"[{era_name}] empty after materialization; skipping")
            shutil.rmtree(out_dir)
            continue

        # Bundle.
        top_n_eff = min(TOP_N, len(written))
        top_fsz = out_dir / f"{era_name}_top{top_n_eff}.fsz"
        _zip_png_list(written[:top_n_eff], top_fsz)
        all_fsz: Path | None = None
        if len(written) > top_n_eff:
            all_fsz = out_dir / f"{era_name}_all.fsz"
            _zip_png_list(written, all_fsz)

        # Per-era manifest.
        ys = era_years_full[root]
        year_summary = {
            "label": label,
            "year_count": len(ys),
            "year_min": min(ys) if ys else None,
            "year_max": max(ys) if ys else None,
            "year_dist": dict(Counter(ys).most_common()),
        }
        is_thin = size < THIN_THRESHOLD
        manifest = {
            "name": era_name,
            "parent_identity": "faceset_001",
            "era": year_summary,
            "input_face_records": size,
            "exported": len(written),
            "top_n": top_n_eff,
            "fsz_top": top_fsz.name,
            "fsz_all": all_fsz.name if all_fsz else None,
            "thin": is_thin,
            "quality_weights": QUALITY_WEIGHTS,
            "params": {
                "recovery_threshold": RECOVERY_THRESHOLD,
                "tighten_threshold": TIGHTEN_THRESHOLD,
                "subcluster_threshold": SUBCLUSTER_THRESHOLD,
                "anchor_min_size": ANCHOR_MIN_SIZE,
                "fragment_centroid_max": FRAGMENT_CENTROID_MAX,
                "fragment_year_max": FRAGMENT_YEAR_MAX,
                "min_face_short": MIN_FACE_SHORT,
            },
            "faces": face_entries,
        }
        (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))

        # Per-era marker file (always: <label>.txt for human reference).
        (out_dir / f"{label}.txt").write_text(
            f"{era_name}\n\nEra: {label}\n"
            f"Year span: {year_summary['year_min']}..{year_summary['year_max']} "
            f"({year_summary['year_count']} dated of {size} faces)\n"
            f"Sub-cluster size: {size} face records, {len(unique)} unique source paths, "
            f"{len(written)} exported PNGs.\n"
        )
        if is_thin:
            (out_dir / "THIN.txt").write_text(
                f"This era has only {size} face records (<{THIN_THRESHOLD}). "
                f"Averaged embedding may be dominated by single-photo idiosyncrasies.\n"
            )

        # Append to top-level manifest summary.
        new_manifest_entries.append({k: v for k, v in manifest.items() if k != "faces"})

        thin_tag = " THIN" if is_thin else ""
        print(
            f"[{era_name}] size={size} unique_paths={len(unique)} exported={len(written)} "
            f"top{top_n_eff}{thin_tag}"
        )

    # ---- merge into top-level manifest ------------------------------------ #
    top_path = SWAP_READY / "manifest.json"
    existing = json.loads(top_path.read_text()) if top_path.exists() else {"facesets": []}
    existing_names = {fs.get("name") for fs in existing.get("facesets", [])}
    appended = 0
    for entry in new_manifest_entries:
        if entry["name"] in existing_names:
            continue
        existing["facesets"].append(entry)
        appended += 1
    top_path.write_text(json.dumps(existing, indent=2))
    print(f"\nAppended {appended} era entries to {top_path}")
    print(f"Done. {len(new_manifest_entries)} era buckets emitted (faceset_001/ left untouched).")


if __name__ == "__main__":
    main()

work/check_faceset001_age.py | 151 (new file)
@@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""Probe faceset_001 for age-sortable sub-structure.

Three questions:
1. How spread is the embedding cloud? (intra-cluster pairwise distance histogram)
2. Does it split naturally into sub-clusters at a tight threshold?
3. Do the sub-clusters correspond to distinct time periods (EXIF DateTimeOriginal)?
"""

from __future__ import annotations

import json
import sys
from collections import Counter
from pathlib import Path

import numpy as np
from PIL import Image, ExifTags

REPO = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO))
from sort_faces import load_cache  # noqa: E402

CACHE = REPO / "work" / "cache" / "nl_full.npz"
FS001 = Path("/mnt/e/temp_things/fcswp/nl_sorted/facesets_swap_ready/faceset_001")


def exif_year(path: Path) -> int | None:
    try:
        with Image.open(path) as im:
            exif = im._getexif()
            if not exif:
                return None
            for tag_id, val in exif.items():
                tag = ExifTags.TAGS.get(tag_id, tag_id)
                if tag == "DateTimeOriginal" and isinstance(val, str) and len(val) >= 4:
                    return int(val[:4])
    except Exception:
        return None
    return None


def main() -> None:
    manifest = json.loads((FS001 / "manifest.json").read_text())
    faces = manifest["faces"]
    paths = [Path(f["source"]) for f in faces]
    print(f"faceset_001 has {len(paths)} ranked faces in the swap-ready set")

    # Pull embeddings for these face records by (path, bbox).
    emb, meta, _src, _proc, _aliases = load_cache(CACHE)
    face_records = [m for m in meta if not m.get("noface")]
    if len(face_records) != len(emb):
        raise SystemExit("emb/meta mismatch")
    bbox_key = {}
    for i, m in enumerate(face_records):
        bbox_key[(m["path"], tuple(m.get("bbox") or ()))] = i

    selected = []
    missing = 0
    for f in faces:
        key = (f["source"], tuple(f.get("bbox") or ()))
        i = bbox_key.get(key)
        if i is None:
            missing += 1
            continue
        selected.append(i)
    print(f"matched {len(selected)} embeddings (missing {missing})")

    E = emb[selected]
    # All embeddings are L2-normalized -> cosine dist = 1 - dot.
    sims = E @ E.T
    dists = 1.0 - sims
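    # Upper triangle (k=1): each unordered pair once, diagonal excluded.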
    iu = np.triu_indices_from(dists, k=1)
    pw = dists[iu]
    print("\n-- intra-cluster pairwise cosine distance --")
    print(f"  n_pairs = {len(pw):,}")
    print(f"  mean    = {pw.mean():.3f}")
    print(f"  median  = {np.median(pw):.3f}")
    print(f"  p10/p25/p75/p90 = {np.percentile(pw, [10, 25, 75, 90])}")
    print(f"  max     = {pw.max():.3f}")

    # Histogram bins around interesting thresholds.
    edges = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0, 1.4]
    hist, _ = np.histogram(pw, bins=edges)
    print("\n  histogram (cos-dist bin -> pair count):")
    for lo, hi, c in zip(edges[:-1], edges[1:], hist):
        bar = "#" * int(60 * c / max(hist.max(), 1))
        print(f"  [{lo:.1f},{hi:.1f}) {c:7d} {bar}")

    # Sub-cluster at several thresholds via agglomerative on the distance matrix.
    from sklearn.cluster import AgglomerativeClustering
    print("\n-- sub-clustering --")
    for thr in (0.30, 0.35, 0.40, 0.45, 0.50):
        ac = AgglomerativeClustering(
            n_clusters=None,
            metric="precomputed",
            linkage="average",
            distance_threshold=thr,
        )
        labels = ac.fit_predict(dists)
        sizes = Counter(labels)
        n = len(sizes)
        big = sum(1 for s in sizes.values() if s >= 10)
        top5 = sorted(sizes.values(), reverse=True)[:5]
        print(f"  threshold {thr:.2f}: {n} sub-clusters, {big} with >=10 images, top-5 sizes={top5}")

    # Pick the threshold that gives 2-5 substantial sub-clusters.
    target_thr = 0.35
    ac = AgglomerativeClustering(
        n_clusters=None, metric="precomputed", linkage="average",
        distance_threshold=target_thr,
    )
    labels = ac.fit_predict(dists)
    sizes = Counter(labels)
    big_labels = [lab for lab, s in sizes.most_common() if s >= 20]
    print(f"\n-- EXIF year analysis at threshold {target_thr} (sub-clusters with >=20 images) --")
    print(f"  {len(big_labels)} substantial sub-clusters")

    # Build label -> list of source paths
    by_label: dict[int, list[Path]] = {}
    for ci, lab in zip(selected, labels):
        rec = face_records[ci]
        by_label.setdefault(int(lab), []).append(Path(rec["path"]))

    for lab in big_labels[:6]:
        paths_in = by_label[lab]
        years = []
        for p in paths_in:
            y = exif_year(p)
            if y is not None:
                years.append(y)
        n_paths = len(paths_in)
        n_years = len(years)
        if years:
            ys = np.array(years)
            ymin, ymax = int(ys.min()), int(ys.max())
            ymed = int(np.median(ys))
            yhist = Counter(years)
            top_years = ", ".join(f"{y}:{c}" for y, c in sorted(yhist.most_common(5)))
        else:
            ymin = ymax = ymed = None
            top_years = ""
        print(
            f"  cluster {lab}: {n_paths} faces, EXIF on {n_years}/{n_paths}, "
            f"year range {ymin}..{ymax} (median {ymed})"
        )
        print(f"    top years: {top_years}")


if __name__ == "__main__":
    main()