face-sets/docs/analysis/facesets-downstream-refinement-evaluation.md
Peter d53ab9fbfc Add enrich + export-swap pipeline for downstream face-swap ready output
- enrich: re-detects each cached face with buffalo_l (detection +
  landmark_2d_106 + landmark_3d_68, recognition module skipped for speed)
  and persists landmarks + pose into the cache so per-face frontality and
  landmark-symmetry quality signals become available.
- compute_quality: composite score combining det_score, face short-edge,
  blur, frontality (from pose pitch/yaw), and 2D-landmark symmetry with
  tunable weights. Default weighting 0.30/0.20/0.20/0.15/0.15.
- export-swap: builds facesets_swap_ready/ from an existing refine
  manifest. Per identity: tighter outlier gate (default 0.45), visual-
  near-dupe collapse (keep best representative per group), multi-face-
  per-source-image collapse (keep best bbox), rank by composite score,
  single-face-per-PNG crops at 512x512 with 0.5 bbox padding, ready-to-
  drop .fsz bundles (top-N + full), per-faceset manifest.json, NAME.txt
  placeholder for the operator. The multi-face-per-PNG collapse is the
  critical fix: roop-unleashed's .fsz loader appends every detected face
  in each PNG to the FaceSet, so any multi-face crop would contaminate
  the averaged embedding.
- Optional --candidates rescues raw_full singletons: matches against the
  final per-faceset centroids and routes to _candidates/to_<faceset>/
  for manual review; orphaned singletons that still cluster among
  themselves land in _candidates/new_<NNN>/.
- docs/analysis/: evaluation document captures the evidence, downstream
  requirements (FaceSet averaging, inswapper_128), opportunity matrix
  (R1-R14), and the recommended target state this export implements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 22:37:32 +02:00


Facesets → roop-unleashed: downstream refinement evaluation

Analysis date: 2026-04-23. Author: Peter (with Claude Code).

1. Scope

Objective. Evaluate how the existing face-clustering / person-sorted results in facesets_full/ can be refined so that the downstream project roop-unleashed produces the best practical face-swap results.

Repositories / folders inspected

  • /opt/face-sets/ — the upstream project (this repo); code, README.md, sort_faces.py, refine_manifest.json, duplicates.json.
  • /mnt/e/temp_things/fcswp/nl_sorted/facesets_full/ and .../raw_full/ — current output.
  • /opt/face-sets/work/cache/nl_full.npz — the underlying embedding cache used to produce the output.
  • /opt/roop-unleashed/ — the downstream consumer.
  • InsightFace 0.7.3 Face class (/home/peter/face_sort_env/lib/...insightface/app/common.py) to resolve an ambiguity about embedding averaging.

Agent usage. Subagents (Explore, Trend Researcher) were attempted but rejected by the operator. All investigation was done directly via Read, Grep, Bash, WebFetch, WebSearch. ~/.claude/agents/ was enumerated; no face-swap-specific agent exists.

Web research used. Targeted WebSearch + WebFetch against FaceSwapLab FAQ, FaceFusion docs, and the GitHub roop-unleashed discussion page for faceset creation. The original C0untFloyd/roop-unleashed GitHub repo has been disabled by GitHub Staff for ToS violation, so the code in /opt/roop-unleashed/ is the authoritative source for this analysis.

2. Evidence base

2.1 Files read in facesets / output

  • sort_faces.py (full) — current pipeline, esp. cmd_embed (embed + sha256 dedup + resume), cmd_cluster, cmd_refine (centroid-merge + quality gate + outlier rejection), cmd_extend (centroid-preserving merge), cmd_dedup (byte + visual).
  • refine_manifest.json at facesets_full/ — post-extend state; extended: true; 12 facesets, params {initial_threshold: 0.55, merge_threshold: 0.40, outlier_threshold: 0.55, min_faces: 15, min_short: 90, min_blur: 40.0, min_det_score: 0.6}.
  • nl_full.npz — 4756 face embeddings + 133 noface records across 2667 unique files; 113 byte-dupe alias paths; 103 byte-groups + 115 visual-dupe groups in nl_full.duplicates.json.

2.2 Files read in roop-unleashed

  • roop/FaceSet.py: the downstream identity container; AverageEmbeddings() at lines 15-20.
  • roop/face_util.py: get_face_analyser() builds InsightFace buffalo_l (lines 35-50); extract_face_images() at lines 72-144 implements the .fsz unpack + detect path.
  • roop/processors/FaceSwapInsightFace.py: the actual inswapper swap; Run() at lines 42-52 uses source_face.normed_embedding.
  • roop/core.py:178-179: identifies the swap model as inswapper_128.onnx (HuggingFace countfloyd/deepfake + Codeberg mirror).
  • roop/ProcessMgr.py:626-634: process_face confirms only face_datas[face_index].faces[0] is used per identity.
  • ui/tabs/facemgr_tab.py (full): how .fsz is created by users (cv2.imwrite PNGs → zip).
  • ui/tabs/faceswap_tab.py:651-710: how .fsz / image source is loaded into INPUT_FACESETS; AverageEmbeddings() is called iff len(faces) > 1 at line 690.
  • InsightFace common.py:Face: normed_embedding is a @property, so it does re-derive from self.embedding; averaging therefore does propagate to the swap (resolves an ambiguity).

2.3 External sources

3. Current upstream output assessment

3.1 Structure of facesets_full/

  • 12 faceset folders (faceset_001 to faceset_012) selected by the refine step (min_faces=15).
  • Each folder contains the full original images (jpg / jpeg / png) that contributed a face to that cluster, filename-flattened from the absolute path so each file is traceable to its on-disk source.
  • One refine_manifest.json at the root with per-faceset {face_count, image_count, alias_count, images[]}.
  • The manifest at the facesets_full/ root records extended=true (merged after the lzbkp_red run via cmd_extend).

Counts (manifest):

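| faceset | images | face records | aliases |
|---|---|---|---|
| faceset_001 | 771 | 1505 | 55 |
| faceset_002 | 238 | 543 | 6 |
| faceset_003 | 206 | 402 | 2 |
| faceset_004 | 103 | 273 | 2 |
| faceset_005 | 68 | 218 | 2 |
| faceset_006 | 51 | 153 | 1 |
| faceset_007 | 89 | 158 | 0 |
| faceset_008 | 44 | 131 | 1 |
| faceset_009 | 43 | 129 | 0 |
| faceset_010 | 25 | 73 | 0 |
| faceset_011 | 25 | 71 | 8 |
| faceset_012 | 17 | 55 | 0 |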

3.2 Observed strengths

  • Identity grouping is directionally correct. The top facesets are credibly large and coherent — the raw raw_full/person_001 is 2.3 GB; refine extracted a 557→771-image faceset on top of that, which is a significant and useful identity pool by any standard.
  • Quality gate is applied. min_short=90, min_blur=40, min_det_score=0.6 are enforced; low-resolution and out-of-focus faces are rejected.
  • Outlier rejection is applied. Faces with cosine distance > 0.55 from their cluster centroid are dropped (when cluster ≥ 4).
  • Aliasing preserves provenance. Every on-disk copy (byte-duplicates between iCloud / manual backups / etc.) is preserved in the folder, so the user can trace every file in a faceset back to its original location.
  • Quality metrics already captured per face. face_short, blur (Laplacian variance), det_score, bbox are persisted in the cache — available for any future ranking logic without re-embedding.
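
The outlier rule described above can be sketched as follows. This is a minimal illustration, not the code in cmd_refine: the helper names are hypothetical, and it assumes the centroid is the plain mean of the cluster's embeddings.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mean_vector(embeddings):
    """Component-wise mean, used here as the cluster centroid (assumption)."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]

def keep_inliers(embeddings, outlier_threshold=0.55, min_cluster=4):
    """Return indices of faces within outlier_threshold of the centroid.

    Clusters smaller than min_cluster are returned untouched, mirroring
    the "when cluster >= 4" condition in the text.
    """
    if len(embeddings) < min_cluster:
        return list(range(len(embeddings)))
    centre = mean_vector(embeddings)
    return [i for i, e in enumerate(embeddings)
            if cosine_distance(e, centre) <= outlier_threshold]

# Toy 2-D cluster: three similar vectors and one orthogonal stray.
cluster = [[1.0, 0.0], [0.96, 0.28], [0.96, -0.28], [0.0, 1.0]]
kept = keep_inliers(cluster)
print(kept)  # [0, 1, 2]
```

Real embeddings are 512-dimensional; the 2-D toy only makes the geometry easy to check by hand.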

3.3 Observed weaknesses

Evidence is from direct computation on the cache (nl_full.npz) + the manifests.

W1. face_records / image_count ratio ~2:1 in top facesets.

  • faceset_001: 1505 faces / 771 images = 1.95 faces per image.
  • faceset_002: 543 / 238 = 2.28.
  • faceset_003: 402 / 206 = 1.95.
  • A healthy one-identity set should be ~1:1 (one face per image).
  • Interpretation: many of these are multi-face photos (group / family shots) where multiple people's faces were placed into the same cluster, or the same image had multiple faces all passing the centroid gate for the same identity. Either way, the current facesets are contaminated with faces of other people from the same photo. This is the single biggest downstream risk — see §4.
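
The ratios quoted in W1 follow directly from the manifest counts. A short check, with the counts copied from the table in §3.1:

```python
# (images, face records) per faceset, copied from the manifest table.
counts = {
    "faceset_001": (771, 1505),
    "faceset_002": (238, 543),
    "faceset_003": (206, 402),
}

def faces_per_image(images, faces):
    """Average number of face records contributed by each source image."""
    return faces / images

ratios = {name: round(faces_per_image(*pair), 2) for name, pair in counts.items()}
print(ratios)  # {'faceset_001': 1.95, 'faceset_002': 2.28, 'faceset_003': 1.95}
```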

W2. Intra-faceset pairwise cosine distance is high.

  • Mean pairwise distance in faceset_001 = 0.835, p90 = 1.047, max = 1.242.
  • For reference: same-identity ArcFace cosine distance typically clusters in [0.2, 0.6]. Pairs > 1.0 (negative cosine similarity) cannot be the same person.
  • All 12 facesets have means in [0.82, 0.90] and p90 in [1.03, 1.07].
  • Interpretation: the clusters were built with linkage=average, threshold=0.55, which admits chain-effects — two points with direct distance > 1.0 can end up in the same cluster via intermediate points. Some of this spread is legitimate (the photo library spans 15+ years — same person at different ages and lighting), some is contamination from W1.
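
The W2 statistics can be reproduced along these lines. This is a sketch with hypothetical helper names, written in pure Python for clarity; on the real cache a vectorised NumPy version would be the practical choice. The p90 uses a simple nearest-rank rule, which is an assumption rather than the method used for the figures above.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity. Values above 1.0 mean negative similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def pairwise_stats(embeddings):
    """Mean, p90 (nearest rank) and max over all unordered pairs."""
    distances = sorted(cosine_distance(a, b)
                       for a, b in combinations(embeddings, 2))
    p90_index = min(len(distances) - 1, int(0.9 * len(distances)))
    return {
        "mean": sum(distances) / len(distances),
        "p90": distances[p90_index],
        "max": distances[-1],
    }

# Toy example: the third vector points away from the first,
# so that pair has a distance above 1.0.
toy = [[1.0, 0.0], [0.8, 0.6], [-0.6, 0.8]]
stats = pairwise_stats(toy)
print(stats)
```

In the toy set the pair distances are roughly 0.2, 1.0 and 1.6, so a single far-away member is enough to push the maximum well past 1.0.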

W3. Near-duplicates inflate the effective size.

  • nl_full.duplicates.json: 103 byte-identical groups (same file copied around) + 115 visual near-duplicate groups (cross-file cosine-distance ≤ 0.03 with matching bbox size — likely re-encodes / resizes).
  • faceset_001 alone carries 55 aliased paths.
  • Interpretation: multiple copies of the same photograph contribute the same embedding (or a near-identical one) to the cluster's average. This does not add identity information — at best neutral, at worst biases the average toward whatever pose/expression appears in the duplicate set.
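
One way to neutralise this bias is to keep a single representative per duplicate group. A minimal sketch, assuming each face already has a scalar quality score and that duplicate groups are available as lists of face ids; the function name and data shapes are hypothetical and do not reflect the actual layout of nl_full.duplicates.json.

```python
def collapse_duplicates(scores, groups):
    """Keep the best-scoring member of each duplicate group.

    scores: dict mapping face id -> quality score (higher is better).
    groups: iterable of lists of face ids that are duplicates of each other.
    Faces that belong to no group are always kept.
    """
    dropped = set()
    for group in groups:
        members = [face_id for face_id in group if face_id in scores]
        if len(members) < 2:
            continue  # nothing to collapse within this faceset
        best = max(members, key=lambda face_id: scores[face_id])
        dropped.update(m for m in members if m != best)
    return [face_id for face_id in scores if face_id not in dropped]

scores = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.2}
groups = [["a", "b"], ["c", "not_in_this_faceset"]]
kept = collapse_duplicates(scores, groups)
print(kept)  # ['a', 'c', 'd']
```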

W4. Blur / quality gate is lax.

  • Cache-wide blur (Laplacian variance) p10/p25/p50 = 19/32/60. Refine gate is 40, so roughly the bottom ~35% of faces drop on blur.
  • Per-faceset p10 blur is 36-90, so many included faces are visibly soft. For downstream swap this is acceptable (identity embedding tolerates modest softness) but tightening would improve the average.

W5. No pose / frontality filtering.

  • Neither detect-time nor refine uses landmarks / yaw / pitch. A strong profile shot with clear det_score + size still passes. ArcFace embeddings degrade for |yaw| > ~45°. The current set has no way to prefer frontal faces.

W6. 583 singletons + 133 noface drop to floor.

  • _singletons/ in raw_full has 583 face-records (some of which are from legitimate subjects that just didn't cluster). _noface/ has 133 files (hash-deduped images where detection failed). Some of these could belong to existing facesets with a looser centroid-match threshold.

W7. Embedding averaging quirk is latent but OK.

  • Investigated because FaceSet.AverageEmbeddings() at FaceSet.py:15 overwrites self.faces[0]["embedding"] while the swapper reads source_face.normed_embedding. Confirmed via InsightFace source that normed_embedding is a @property that re-normalizes from embedding. So averaging does take effect in the swap. No action needed; noted to avoid a future misdiagnosis.
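
The mechanism can be illustrated with a stand-in class. This is not the InsightFace or roop-unleashed source: FaceStandIn and average_embeddings are hypothetical names that only mimic the behaviour described above (embedding stored as an item, normed_embedding derived on access).

```python
import math

class FaceStandIn(dict):
    """Dict-like face record whose normed_embedding is computed on access."""

    @property
    def normed_embedding(self):
        embedding = self["embedding"]
        norm = math.sqrt(sum(x * x for x in embedding))
        return [x / norm for x in embedding]

def average_embeddings(faces):
    """Overwrite the first face's embedding with the mean of all embeddings."""
    dim = len(faces[0]["embedding"])
    mean = [sum(f["embedding"][i] for f in faces) / len(faces) for i in range(dim)]
    faces[0]["embedding"] = mean

faces = [FaceStandIn(embedding=[2.0, 0.0]), FaceStandIn(embedding=[0.0, 2.0])]
before = faces[0].normed_embedding
average_embeddings(faces)
after = faces[0].normed_embedding
print(before, after)
```

Because the normalised vector is derived at read time, the value read after averaging reflects the mean. If it were cached at construction, the averaging would be silently ignored, which is the misdiagnosis W7 rules out.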

3.4 Observed risks for downstream use

  1. Multi-face photos in a single-identity folder (W1) → when zipped into .fsz and loaded, roop-unleashed will detect and add ALL faces in each PNG to the FaceSet (faceswap_tab.py:678-687 loops every face returned by extract_face_images into the set). This is identity contamination by design of the loader. Highest-priority risk.
  2. High intra-faceset variance (W2, W5) → the averaged embedding becomes a diffuse "average face" rather than a crisp identity vector. Downstream swap will produce generic likenesses, with identity drift on hard frames.
  3. Near-dupes biasing the mean (W3) → identity average tilts toward over-represented poses (e.g., ten copies of one iPhone screenshot skew the mean).
  4. No per-face ranking — users have no signal on which images to include / exclude when hand-curating a subset, and no way to pick "best representative" images for thumbnails.

4. Downstream consumer requirements

4.1 What roop-unleashed expects

  • Input format: a .fsz file, which is a zip of .png files (one crop per reference face). Created by ui/tabs/facemgr_tab.py:on_update_clicked():
    filename = os.path.join(roop.globals.output_path, f"{index}.png")
    cv2.imwrite(filename, img)
    
    util.zip(imgnames, finalzip)   # imgnames → "faceset.fsz"
    
    Files inside are named 0.png, 1.png, … — only indices.
  • Load path (ui/tabs/faceswap_tab.py:672-691): unzip, iterate *.png, run extract_face_images(filename, (False, 0)) (note: extra_padding default -1.0 → plain bbox crop, no resize-to-512 dance). For every detected face in each PNG, append the InsightFace Face object (with its 512-dim embedding) to face_set.faces. If the resulting set has more than one face, call face_set.AverageEmbeddings().
  • Use at swap time (ProcessMgr.py:626-634 + processors/FaceSwapInsightFace.py:42-52): only face_set.faces[0] is used; its normed_embedding is fed to inswapper_128.onnx. The other faces in the set only exist to contribute to the averaged embedding.
  • Swap backend: inswapper_128.onnx (see roop/core.py:178). Internal face working resolution is 128×128 per the InsightFace blog and FaceSwapLab FAQ; identity is carried entirely in the 512-dim embedding.
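
Since the input format is just a zip of index-named PNGs, a bundle can be produced with the standard library alone. A minimal sketch: build_fsz is a hypothetical helper, not roop-unleashed's util.zip, and the placeholder bytes stand in for real crops.

```python
import tempfile
import zipfile
from pathlib import Path

def build_fsz(ranked_pngs, fsz_path):
    """Zip the given PNG files as 0.png, 1.png, ... in the order supplied."""
    # PNG data is already compressed, so storing without deflate is enough.
    with zipfile.ZipFile(fsz_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for index, png in enumerate(ranked_pngs):
            archive.write(png, arcname=f"{index}.png")
    return Path(fsz_path)

with tempfile.TemporaryDirectory() as tmp:
    crops = []
    for name in ("best.png", "second.png"):
        path = Path(tmp) / name
        path.write_bytes(b"placeholder bytes, not a real PNG")
        crops.append(path)
    fsz = build_fsz(crops, Path(tmp) / "faceset.fsz")
    with zipfile.ZipFile(fsz) as archive:
        names = archive.namelist()

print(names)  # ['0.png', '1.png']
```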

4.2 Practical requirements derived from the code

  1. One identity per .fsz. Anything else corrupts the averaged embedding.
  2. One face per PNG inside the .fsz. Any multi-face PNG → every face gets appended to the set, polluting the average. This is enforced only by the PNG's content, not by the loader.
  3. Faces must be detectable by InsightFace buffalo_l at det_size=(640,640) or (320,320). Extremely small or cut-off faces will fail detection and be silently skipped on load.
  4. Input resolution: there is no explicit requirement, but since inswapper works at 128×128 and InsightFace aligns on 5 landmarks, a face bbox with a short edge of at least ~100-150 px gives a reliable embedding. Below ~60 px, embedding quality drops measurably (literature). Our min_short=90 gate is close to the lower end of useful.
  5. Frontality helps. ArcFace embeddings are trained with some pose augmentation, so near-frontal (|yaw| ≤ 30°) is ideal; beyond ~45° the embedding starts to drift. Roop applies no compensation for this.
  6. Expression / lighting diversity is desirable but not required. FaceSwapLab explicitly supports "face blending" and notes it "improves the face's representative accuracy" — so a diverse set of the same identity is better than 100 near-duplicate frames.
  7. No metadata is consumed. roop-unleashed ignores everything outside the PNG bytes — filename, EXIF, sidecar JSON are not read.
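
Requirements 1 and 2 can be made concrete with a toy simulation of the load path. The detector is a stub returning hand-written 2-D vectors, and load_faceset only mimics the behaviour described in §4.1; it is not the roop-unleashed code.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def load_faceset(png_names, detect):
    """Append every face detected in every PNG, then average into faces[0]."""
    faces = []
    for name in png_names:
        faces.extend(detect(name))
    if len(faces) > 1:
        dim = len(faces[0])
        faces[0] = [sum(f[i] for f in faces) / len(faces) for i in range(dim)]
    return faces

TARGET = [1.0, 0.0]      # the identity we want
BYSTANDER = [0.0, 1.0]   # someone else in a group photo

# Stub detector: maps a file name to the embeddings "found" in it.
DETECTIONS = {
    "solo_a.png": [TARGET],
    "solo_b.png": [TARGET],
    "group.png": [TARGET, BYSTANDER],
}

clean = load_faceset(["solo_a.png", "solo_b.png"], DETECTIONS.__getitem__)
mixed = load_faceset(["solo_a.png", "group.png"], DETECTIONS.__getitem__)

clean_sim = cosine_similarity(clean[0], TARGET)
mixed_sim = cosine_similarity(mixed[0], TARGET)
print(round(clean_sim, 3), round(mixed_sim, 3))  # 1.0 0.894
```

A single bystander in one PNG already pulls the averaged vector away from the target identity, which is why the one-face-per-PNG rule has to be enforced by the exporter.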

4.3 Constraints and uncertainties

  • The roop-unleashed GitHub is unreachable (disabled), so the closest thing to community guidance is the in-repo CLAUDE.md and the code itself. Treat this code as authoritative.
  • Assumption: the user will either provide the whole facesets_full/faceset_NNN/ folder to roop-unleashed's Face Management tab (which accepts image files + a folder button, see faceswap_tab.py:644-647), OR pre-build .fsz files. Both paths run through the same loader; the multi-face-per-PNG issue applies equally.

5. Refinement opportunity matrix

Each opportunity is scored qualitatively. "Automation feasibility" distinguishes fully automated (A), semi-automated with heuristics that need operator review (S), and manual-only (M). "Best place" is where implementation should live.

| # | Opportunity | Problem addressed | Evidence | Expected downstream benefit | Automation | Risk / downside | Best place | Priority | Confidence |
|---|---|---|---|---|---|---|---|---|---|
| R1 | Pre-crop each faceset image to a single face (the identity's own face) before export | W1: multi-face photos pollute FaceSet on load | refine_manifest face/image ratio ~2:1 in top clusters; roop loader adds every detected face in a PNG (faceswap_tab.py:678-687) | Large. Cleans the single biggest identity-averaging contaminant | A (use the existing bbox per face record in the cache, crop the image array with padding, save to a new facesets_swap_ready/ mirror) | Must pick the correct face of multiple detected per image → use the bbox that the upstream cache already matched to this faceset | facesets | P0 | High |
| R2 | Split known multi-face photos so only the identity's own bbox is included, alternative to full image export | Same as R1, more conservative | Same as R1 | Same as R1 | A | | facesets | P0 | High |
| R3 | Identity tightening: re-run refine with stricter outlier threshold (e.g. outlier_threshold=0.45) | W2: intra-cluster spread too wide, chain effects from average-linkage | pairwise distance max > 1.2 in every faceset | Sharpens averaged embedding; removes obviously-wrong faces | A | Some legitimate same-person faces (age / lighting extremes) may be dropped | facesets | P0 | High |
| R4 | Drop visual near-duplicates from the set (keep the highest-quality representative per dupe group) | W3: duplicate images bias the average | duplicates.json has 115 visual groups (2-5 images each) across 4756 faces | Removes silent bias toward over-represented frames; shrinks set size for faster load | A | Deciding which copy to keep is a tiny judgement call (pick highest det_score × face_short × blur) | facesets | P1 | High |
| R5 | Per-face composite quality score (weighted det_score · blur · face_short · frontality) and ranked export / top-N subset | Need to give roop-unleashed a small, strong averaging pool rather than all 771 images | Cache already has det_score, blur, face_short; frontality = landmark symmetry, computable from landmark_2d_106 which InsightFace already provides but we don't store | Smaller .fsz files, better average embedding, faster UI | A for the score; S for the top-N choice (operator picks N per identity) | Frontality adds a small extra compute step; needs a re-pass over the cache or a re-embed storing landmarks | facesets | P1 | Medium |
| R6 | Produce .fsz directly (zip the cropped PNGs with integer filenames) as an export mode | Saves the operator the manual zipping step; guarantees filename correctness | facemgr_tab.py:242-255 is the reference implementation; trivially reproducible | Zero-friction import into roop-unleashed | A | | facesets | P1 | High |
| R7 | Pose / frontality filter at refine time using landmark_2d_106 landmark symmetry or yaw estimation from face.pose (if available) | W5: strong profile faces weaken the average | ArcFace literature; no measurement yet in our cache | Tighter identity average, especially for smaller facesets where one profile shot can dominate | A (compute from cached landmarks if we re-embed or store them; otherwise a one-off enrichment pass) | Landmarks not currently persisted in the cache; requires a small re-embed or enrichment command | facesets | P2 | Medium |
| R8 | Singleton rescue pass: re-classify _singletons/ against final faceset centroids with a looser threshold + quality gate | W6: some singletons are legit faceset members | 583 singletons with p50 face_short=149, p50 det_score=0.76; many look usable | Recovers lost identity examples; modest expansion of useful facesets | A | Some true singletons will be mis-assigned; threshold choice matters | facesets | P2 | Medium |
| R9 | Modern face-quality scorer (SDD-FIQA / CR-FIQA) to replace the det_score × blur heuristic | More robust quality ranking than hand-rolled heuristics | Literature; current heuristic is crude | Marginal improvement over R5 for the same goal | A but adds a new model dependency | Model weights to download, more CPU cost at ranking time | facesets | P3 | Medium |
| R10 | Person-label sidecars (e.g. faceset_001/_label.txt with an operator-provided name) | UX: the 12 facesets are anonymous; operator has to peek to find "mom" | No evidence; improvement to workflow | Operator quality-of-life; no effect on swap quality | M | | facesets | P3 | Low |
| R11 | Improve the multi-source-image selection UI in roop-unleashed (e.g. a "pick best 20 by quality" button on load) | Better use of large .fsz files | Not implemented downstream | Improvement happens at consumption time | A | Requires a roop-unleashed patch, and the upstream repo is disabled | roop-unleashed | P4 | Low |
| R12 | Face alignment / crop standardization (e.g. arcface-aligned 512×512 crops in the .fsz) | Some marginal consistency gain on detection | roop re-detects anyway on load (extract_face_images) so input alignment is discarded | Very small: roop's loader re-detects and re-aligns regardless | A | Extra compute for no practical gain | (do not do) | Not recommended | High |
| R13 | Increase resolution via upscaling of low-res crops | Make small faces "bigger" | Identity comes from the embedding, not the pixels | None: upscaling with GAN does not add identity info; inswapper reads 128×128 anyway | A | Can introduce synthetic artifacts | (do not do) | Not recommended | High |
| R14 | Destructive reorganization of facesets_full/ in place | Simpler final layout | Operator explicitly told us yesterday to preserve existing output | Marginal tidiness | M | Loses the current "full cluster" reference view, which has diagnostic value | (do not do without explicit go-ahead) | Not recommended by default | High |

6. Recommended target state

Define a new output view, facesets_swap_ready/, produced by a new subcommand (e.g. sort_faces.py export-swap). Original facesets_full/ stays intact. Per faceset:

facesets_swap_ready/
  faceset_001/
    manifest.json           # provenance + per-image score + rank
    previews/               # 4-image contact sheet thumbnail
      top_20_grid.jpg
    faces/                  # cropped-to-single-face PNGs named "000.png", "001.png", ...
      000.png               # highest-ranked face, single face per PNG, 512x512 padded/aligned
      001.png
      ...
    faceset.fsz             # zip of faces/*.png — drop-in for roop-unleashed
  faceset_002/
    ...

Key properties:

  1. One face per PNG — each PNG is a crop of a single face (R1/R2), padded to a consistent 512×512 with the identity's bbox centred. Roop-unleashed's loader will re-detect exactly one face per file.
  2. Ranked by composite quality: faces/000.png is the best representative; later indices are weaker. Operator can trivially truncate by dropping later files.
  3. Configurable top-N: default --top-n 30 per faceset with a --include-all flag for the current behaviour. 30 is conservative; FaceSwapLab's "face blending" tool (the most analogous public practitioner reference) shows that blending with diverse but consistent images materially helps; 20-40 is a common practitioner range.
  4. Near-duplicates dropped (R4) — one representative per visual-dupe group.
  5. Tighter outlier gate (R3) — outlier_threshold reduced from 0.55 to ~0.45 for this export, keeping the refine defaults on facesets_full/.
  6. Ready-to-ship .fsz (R6) in each folder.
  7. manifest.json per faceset — cites every source path and score. Lets the operator see why a face was kept (or dropped if we add a _rejected/ sibling).
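
The crop geometry behind property 1 can be sketched as below. The exact meaning of the padding factor is an assumption here (the fraction of the longest bbox edge added on each side), and padded_square_box is a hypothetical helper, not the export-swap implementation.

```python
def padded_square_box(bbox, image_size, padding=0.5):
    """Return an (x1, y1, x2, y2) crop box centred on bbox.

    bbox: (x1, y1, x2, y2) of the identity's face in pixels.
    image_size: (width, height) of the source image.
    padding: fraction of the longest bbox edge added on each side (assumption).
    The box is clamped to the image, so it can end up non-square at borders;
    a real exporter would then pad before resizing to 512x512.
    """
    x1, y1, x2, y2 = bbox
    width, height = image_size
    side = max(x2 - x1, y2 - y1) * (1.0 + 2.0 * padding)
    centre_x = (x1 + x2) / 2.0
    centre_y = (y1 + y2) / 2.0
    left = max(0, int(round(centre_x - side / 2.0)))
    top = max(0, int(round(centre_y - side / 2.0)))
    right = min(width, int(round(centre_x + side / 2.0)))
    bottom = min(height, int(round(centre_y + side / 2.0)))
    return left, top, right, bottom

inner = padded_square_box((100, 100, 200, 220), (1000, 800))
corner = padded_square_box((0, 0, 100, 100), (300, 300))
print(inner, corner)  # (30, 40, 270, 280) (0, 0, 150, 150)
```

Cropping from the cached bbox rather than re-detecting keeps the choice of face tied to the record that was clustered into this identity.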

This lets the operator test swap quality end-to-end without any roop-unleashed modification, and preserves full fallback to the raw / full results if anything needs re-examination.

7. Recommendations

7.1 Quick wins (high value, low effort)

  1. R1 — single-face crop export as part of export-swap. Uses bbox already in the cache; zero new models. Delivers the biggest likely swap-quality improvement.
  2. R4 — drop visual near-duplicates inside the export. Uses duplicates.json already produced by cmd_dedup. Smaller sets, cleaner averages.
  3. R5 — composite quality score + rank + top-N. Uses existing fields (det_score, blur, face_short). Deliver .fsz + faces/ sorted by descending score.
  4. R6 — .fsz bundle emission by simply zipping faces/*.png with integer names. Trivial given (1)-(3).

These four together give a clean, drop-in-usable export in one session of work.
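
A composite score with ranking and truncation could look like the sketch below. The weights are the 0.30/0.20/0.20/0.15/0.15 defaults quoted in the commit message at the top of this file; the normalisation constants (256 px, blur 200, 90°) and the field names yaw, pitch and symmetry are assumptions for illustration and presuppose that pose and landmark signals have been added to the cache.

```python
WEIGHTS = {
    "det_score": 0.30,
    "face_short": 0.20,
    "blur": 0.20,
    "frontality": 0.15,
    "symmetry": 0.15,
}

def clamp01(value):
    return max(0.0, min(1.0, value))

def composite_score(face, weights=WEIGHTS):
    """Weighted sum of per-signal scores, each mapped into [0, 1]."""
    parts = {
        "det_score": clamp01(face["det_score"]),
        "face_short": clamp01(face["face_short"] / 256.0),   # assumed saturation
        "blur": clamp01(face["blur"] / 200.0),               # assumed saturation
        "frontality": clamp01(1.0 - max(abs(face["yaw"]), abs(face["pitch"])) / 90.0),
        "symmetry": clamp01(face["symmetry"]),
    }
    return sum(weights[name] * parts[name] for name in weights)

def rank_top_n(faces, n=30):
    """Best-first ordering, truncated to the n highest-scoring faces."""
    return sorted(faces, key=composite_score, reverse=True)[:n]

faces = [
    {"id": "profile", "det_score": 0.7, "face_short": 128, "blur": 50,
     "yaw": 60.0, "pitch": 0.0, "symmetry": 0.5},
    {"id": "frontal", "det_score": 0.9, "face_short": 256, "blur": 200,
     "yaw": 0.0, "pitch": 0.0, "symmetry": 1.0},
]
ranked = rank_top_n(faces, n=1)
print(ranked[0]["id"], round(composite_score(ranked[0]), 3))  # frontal 0.97
```

Clamping each signal before weighting keeps one extreme value, such as a very high Laplacian variance, from dominating the ranking.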

7.2 Medium-effort improvements

  1. R3 — re-run refine with stricter outlier_threshold (e.g. 0.45) for the export path; keep facesets_full/ at 0.55 for reference. Requires a re-cluster over existing embeddings — fast (seconds), no re-embed.
  2. R7 — pose/frontality filter using landmarks. Requires either (a) a re-embed pass that persists landmark_2d_106, or (b) an enrichment pass that re-loads each image and computes yaw without redoing the full embed. Modest CPU cost; meaningful for small facesets.
  3. R8 — singleton rescue against final centroids. Low code cost; likely yields a handful of additional good images per identity.
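
The singleton rescue in item 3 amounts to a nearest-centroid lookup with a distance cut-off. A minimal sketch: rescue_singletons is a hypothetical name and the 0.5 threshold is a placeholder, not a tuned or recommended value.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rescue_singletons(singletons, centroids, threshold=0.5):
    """Map each singleton id to its nearest faceset, or None if too far.

    singletons: dict id -> embedding.
    centroids: dict faceset name -> centroid embedding.
    threshold: maximum cosine distance for a match (placeholder value).
    """
    routed = {}
    for face_id, embedding in singletons.items():
        name, distance = min(
            ((n, cosine_distance(embedding, c)) for n, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        routed[face_id] = name if distance <= threshold else None
    return routed

centroids = {"faceset_001": [1.0, 0.0], "faceset_002": [0.0, 1.0]}
singletons = {"s1": [0.9, 0.1], "s2": [-1.0, 0.0]}
routed = rescue_singletons(singletons, centroids)
print(routed)  # {'s1': 'faceset_001', 's2': None}
```

Routing matches to a review folder rather than merging them directly keeps mis-assignments away from the averaged embedding.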

7.3 Items requiring operator decision

  • Target top-N per faceset for the export (proposal: 30, override per run). Affects the average-embedding quality trade-off vs. UI load time.
  • Whether to name facesets (R10) by operator — purely workflow.
  • Whether _singletons/ should be retired or promoted to "uncertain identity" export with a lower-confidence tag.

7.4 Not recommended

  • R11: patching roop-unleashed itself. The upstream repo is disabled; touching it introduces fork-maintenance overhead for no proportional gain we can't already achieve upstream in facesets.
  • R12 / R13: pre-aligning or up-scaling source crops. Roop re-detects/aligns on load and inswapper caps at 128×128 internally; effort is wasted.
  • R14: destructive reorganization of facesets_full/. The operator already told us (yesterday) to preserve existing results; no new evidence supports re-opening that.

8. Open questions

  • OQ1. Is the operator willing to have the export step drop faces rather than just rank them? R5-top-N drops everything past rank N; if the operator prefers to keep the full set but marked, we should export ranked without truncation and let the user pick in the UI.
  • OQ2. How many .fsz files does the operator actually plan to use? If only 3-4 identities will be used in practice, R5 can stay conservative (N=50) without cost. If all 12 are routinely used, leaner is better (N=20).
  • OQ3. Should singletons (R8) be rescued into existing facesets or exported as their own "candidate_NNN/" bucket for manual triage? The safer default is a separate bucket; the operator may prefer direct merge.
  • OQ4. Is frontality-filtering (R7) worth a re-embed, or should we settle for a cheap "bbox aspect ratio" proxy? A proper yaw estimate needs landmarks; a crude proxy (bbox width/height ratio) is free but weaker.
  • OQ5. Is there appetite for adding a modern FIQA model (R9) as a drop-in dependency? It adds ~50 MB download and a small CPU cost per face; benefit over the current heuristic is real but modest.
  • OQ6. For the export, should the operator name (R10) be required before an .fsz is emitted (forces thought about which identity is which), or optional (pure convenience)?

End of evaluation. No code has been changed as part of this analysis.