
Peter d53ab9fbfc Add enrich + export-swap pipeline for downstream face-swap ready output

- enrich: re-detects each cached face with buffalo_l (detection +
  landmark_2d_106 + landmark_3d_68, recognition module skipped for speed)
  and persists landmarks + pose into the cache so per-face frontality and
  landmark-symmetry quality signals become available.
- compute_quality: composite score combining det_score, face short-edge,
  blur, frontality (from pose pitch/yaw), and 2D-landmark symmetry with
  tunable weights. Default weighting 0.30/0.20/0.20/0.15/0.15.
- export-swap: builds facesets_swap_ready/ from an existing refine
  manifest. Per identity: tighter outlier gate (default 0.45), visual-
  near-dupe collapse (keep best representative per group), multi-face-
  per-source-image collapse (keep best bbox), rank by composite score,
  single-face-per-PNG crops at 512x512 with 0.5 bbox padding, ready-to-
  drop .fsz bundles (top-N + full), per-faceset manifest.json, NAME.txt
  placeholder for the operator. The multi-face-per-PNG collapse is the
  critical fix: roop-unleashed's .fsz loader appends every detected face
  in each PNG to the FaceSet, so any multi-face crop would contaminate
  the averaged embedding.
- Optional --candidates rescues raw_full singletons: matches against the
  final per-faceset centroids and routes to _candidates/to_<faceset>/
  for manual review; orphaned singletons that still cluster among
  themselves land in _candidates/new_<NNN>/.
- docs/analysis/: evaluation document captures the evidence, downstream
  requirements (FaceSet averaging, inswapper_128), opportunity matrix
  (R1-R14), and the recommended target state this export implements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-23 22:37:32 +02:00


Facesets → roop-unleashed: downstream refinement evaluation

Analysis date: 2026-04-23. Author: Peter (with Claude Code).

1. Scope

Objective. Evaluate how the existing face-clustering / person-sorted results in facesets_full/ can be refined so that the downstream project roop-unleashed produces the best practical face-swap results.

Repositories / folders inspected

/opt/face-sets/ — the upstream project (this repo); code, README.md, sort_faces.py, refine_manifest.json, duplicates.json.
/mnt/e/temp_things/fcswp/nl_sorted/facesets_full/ and .../raw_full/ — current output.
/opt/face-sets/work/cache/nl_full.npz — the underlying embedding cache used to produce the output.
/opt/roop-unleashed/ — the downstream consumer.
InsightFace 0.7.3 Face class (/home/peter/face_sort_env/lib/...insightface/app/common.py) to resolve an ambiguity about embedding averaging.

Agent usage. Subagents (Explore, Trend Researcher) were attempted but rejected by the operator. All investigation was done directly via Read, Grep, Bash, WebFetch, WebSearch. ~/.claude/agents/ was enumerated; no face-swap-specific agent exists.

Web research used. Targeted WebSearch + WebFetch against FaceSwapLab FAQ, FaceFusion docs, and the GitHub roop-unleashed discussion page for faceset creation. The original C0untFloyd/roop-unleashed GitHub repo has been disabled by GitHub Staff for ToS violation, so the code in /opt/roop-unleashed/ is the authoritative source for this analysis.

2. Evidence base

2.1 Files read in `facesets` / output

sort_faces.py (full) — current pipeline, esp. cmd_embed (embed + sha256 dedup + resume), cmd_cluster, cmd_refine (centroid-merge + quality gate + outlier rejection), cmd_extend (centroid-preserving merge), cmd_dedup (byte + visual).
refine_manifest.json at facesets_full/ — post-extend state; extended: true; 12 facesets, params {initial_threshold: 0.55, merge_threshold: 0.40, outlier_threshold: 0.55, min_faces: 15, min_short: 90, min_blur: 40.0, min_det_score: 0.6}.
nl_full.npz — 4756 face embeddings + 133 noface records across 2667 unique files; 113 byte-dupe alias paths; 103 byte-groups + 115 visual-dupe groups in nl_full.duplicates.json.

2.2 Files read in `roop-unleashed`

roop/FaceSet.py — the downstream identity container; AverageEmbeddings() at lines 15–20.
roop/face_util.py — get_face_analyser() builds InsightFace buffalo_l (lines 35–50); extract_face_images() at lines 72–144 implements the .fsz unpack + detect path.
roop/processors/FaceSwapInsightFace.py — the actual inswapper swap; Run() at lines 42–52 uses source_face.normed_embedding.
roop/core.py:178–179 — identifies the swap model as inswapper_128.onnx (HuggingFace countfloyd/deepfake + Codeberg mirror).
roop/ProcessMgr.py:626–634 — process_face confirms only face_datas[face_index].faces[0] is used per identity.
ui/tabs/facemgr_tab.py (full) — how .fsz is created by users (cv2.imwrite PNGs → zip).
ui/tabs/faceswap_tab.py:651–710 — how .fsz / image source is loaded into INPUT_FACESETS; AverageEmbeddings() is called iff len(faces) > 1 at line 690.
Insightface common.py:Face — normed_embedding is a @property, so it does re-derive from self.embedding; averaging therefore does propagate to the swap (resolves an ambiguity).

2.3 External sources

FaceSwapLab FAQ — practitioner-level guidance on multi-image reference and the checkpoint builder.
FaceFusion face-swapper docs — model list including inswapper_128_fp16, hyperswap_1a_256, etc.
InsightFace blog: evolution of face swapping — inswapper internal face resolution is 128×128 RGB regardless of input.
DeepWiki: inswapper_128 — confirms encoder-decoder structure, identity taken from embedding, target appearance preserved.
SDD-FIQA CVPR 2021 — unsupervised face quality metric; a modern alternative to det_score + blur.

3. Current upstream output assessment

3.1 Structure of `facesets_full/`

12 faceset folders (faceset_001 … faceset_012) selected by the refine step (min_faces=15).
Each folder contains the full original images (jpg / jpeg / png) that contributed a face to that cluster, filename-flattened from the absolute path so each file is traceable to its on-disk source.
One refine_manifest.json at the root with per-faceset {face_count, image_count, alias_count, images[]}.
refine_manifest.json records extended: true (the set was merged after the lzbkp_red run via cmd_extend).

Counts (manifest):

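| faceset | images | face records | aliases |
|---|---|---|---|
| faceset_001 | 771 | 1505 | 55 |
| faceset_002 | 238 | 543 | 6 |
| faceset_003 | 206 | 402 | 2 |
| faceset_004 | 103 | 273 | 2 |
| faceset_005 | 68 | 218 | 2 |
| faceset_006 | 51 | 153 | 1 |
| faceset_007 | 89 | 158 | 0 |
| faceset_008 | 44 | 131 | 1 |
| faceset_009 | 43 | 129 | 0 |
| faceset_010 | 25 | 73 | 0 |
| faceset_011 | 25 | 71 | 8 |
| faceset_012 | 17 | 55 | 0 |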

3.2 Observed strengths

Identity grouping is directionally correct. The top facesets are credibly large and coherent — the raw raw_full/person_001 is 2.3 GB; refine extracted a 557→771-image faceset on top of that, which is a significant and useful identity pool by any standard.
Quality gate is applied. min_short=90, min_blur=40, min_det_score=0.6 are enforced; low-resolution and out-of-focus faces are rejected.
Outlier rejection is applied. Faces with cosine distance > 0.55 from their cluster centroid are dropped (when cluster ≥ 4).
Aliasing preserves provenance. Every on-disk copy (byte-duplicates between iCloud / manual backups / etc.) is preserved in the folder, so the user can trace every file in a faceset back to its original location.
Quality metrics already captured per face. face_short, blur (Laplacian variance), det_score, bbox are persisted in the cache — available for any future ranking logic without re-embedding.

3.3 Observed weaknesses

Evidence is from direct computation on the cache (nl_full.npz) + the manifests.

W1. face_records / image_count ratio ~2:1 in top facesets.

faceset_001: 1505 faces / 771 images = 1.95 faces per image.
faceset_002: 543 / 238 = 2.28.
faceset_003: 402 / 206 = 1.95.
A healthy one-identity set should be ~1:1 (one face per image).
Interpretation: many of these are multi-face photos (group / family shots) where multiple people's faces were placed into the same cluster, or the same image had multiple faces all passing the centroid gate for the same identity. Either way, the current facesets are contaminated with faces of other people from the same photo. This is the single biggest downstream risk — see §4.
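
The ratios quoted for W1 follow directly from the manifest counts in §3.1. A minimal sketch of that arithmetic, using only the three rows cited above (the 1.2 cut-off for flagging a set is an illustrative assumption, not a value from the pipeline):

```python
# Counts copied from the manifest table in section 3.1 (top three facesets).
counts = {
    "faceset_001": {"images": 771, "face_records": 1505},
    "faceset_002": {"images": 238, "face_records": 543},
    "faceset_003": {"images": 206, "face_records": 402},
}


def faces_per_image(entry):
    """Face records divided by images; a clean single-identity set sits near 1.0."""
    return entry["face_records"] / entry["images"]


ratios = {name: round(faces_per_image(e), 2) for name, e in counts.items()}

for name, ratio in ratios.items():
    # 1.2 is an arbitrary illustrative threshold, not a pipeline parameter.
    flag = "suspect" if ratio > 1.2 else "ok"
    print(f"{name}: {ratio:.2f} faces/image ({flag})")
```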

W2. Intra-faceset pairwise cosine distance is high.

Mean pairwise distance in faceset_001 = 0.835, p90 = 1.047, max = 1.242.
For reference: same-identity ArcFace cosine distance typically clusters in [0.2, 0.6]. Pairs > 1.0 (negative cosine similarity) are very unlikely to be the same person.
All 12 facesets have means in [0.82, 0.90] and p90 in [1.03, 1.07].
Interpretation: the clusters were built with linkage=average, threshold=0.55, which admits chain-effects — two points with direct distance > 1.0 can end up in the same cluster via intermediate points. Some of this spread is legitimate (the photo library spans 15+ years — same person at different ages and lighting), some is contamination from W1.
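
The spread statistics in W2 (mean, p90, max of pairwise cosine distance) can be recomputed along these lines. This is a pure-Python sketch on toy 2-D vectors, not the code that produced the figures above; the function names and the nearest-rank percentile are assumptions made for illustration, and real 512-dim embeddings would normally be handled with NumPy.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_stats(embeddings):
    """Mean, nearest-rank p90 and max over all unordered pairs."""
    dists = sorted(cosine_distance(a, b) for a, b in combinations(embeddings, 2))
    p90_index = min(len(dists) - 1, int(0.9 * len(dists)))
    return {
        "mean": sum(dists) / len(dists),
        "p90": dists[p90_index],
        "max": dists[-1],
    }


# Toy vectors only: two orthogonal directions and one opposite direction.
toy = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
stats = pairwise_stats(toy)
print(stats)
```

A faceset whose max or p90 exceeds 1.0 contains at least some pairs with negative cosine similarity, which is the signal W2 relies on.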

W3. Near-duplicates inflate the effective size.

nl_full.duplicates.json: 103 byte-identical groups (same file copied around) + 115 visual near-duplicate groups (cross-file cosine-distance ≤ 0.03 with matching bbox size — likely re-encodes / resizes).
faceset_001 alone carries 55 aliased paths.
Interpretation: multiple copies of the same photograph contribute the same embedding (or a near-identical one) to the cluster's average. This does not add identity information — at best neutral, at worst biases the average toward whatever pose/expression appears in the duplicate set.
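
One way to neutralise the bias described in W3 is to keep a single representative per duplicate group. The sketch below assumes each face record is a dict with a hypothetical `id` key plus the cached `det_score`, `face_short` and `blur` fields, and that duplicate groups are lists of ids; the actual layout of nl_full.duplicates.json is not shown in this document and may differ.

```python
def collapse_duplicates(faces, dupe_groups):
    """Keep one face per duplicate group: the member with the highest quality product."""

    def quality(face):
        # Simple product heuristic: det_score x face_short x blur.
        return face["det_score"] * face["face_short"] * face["blur"]

    by_id = {face["id"]: face for face in faces}
    dropped = set()
    for group in dupe_groups:
        members = [by_id[i] for i in group if i in by_id]
        if len(members) < 2:
            continue  # nothing to collapse
        best = max(members, key=quality)
        dropped.update(m["id"] for m in members if m["id"] != best["id"])
    return [face for face in faces if face["id"] not in dropped]


# Toy records; ids and values are made up for illustration.
faces = [
    {"id": "a", "det_score": 0.9, "face_short": 200, "blur": 80.0},
    {"id": "b", "det_score": 0.8, "face_short": 200, "blur": 60.0},
    {"id": "c", "det_score": 0.7, "face_short": 120, "blur": 50.0},
]
kept = collapse_duplicates(faces, dupe_groups=[["a", "b"]])
kept_ids = [face["id"] for face in kept]
print(kept_ids)
```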

W4. Blur / quality gate is lax.

Cache-wide blur (Laplacian variance) p10/p25/p50 = 19/32/60. Refine gate is 40, so roughly the bottom ~35% of faces drop on blur.
Per-faceset p10 blur is 36–90 — many included faces are visibly soft. For downstream swap this is acceptable (identity embedding tolerates modest softness) but tightening would improve the average.

W5. No pose / frontality filtering.

Neither detect-time nor refine uses landmarks / yaw / pitch. A strong profile shot with clear det_score + size still passes. ArcFace embeddings degrade for |yaw| > ~45°. The current set has no way to prefer frontal faces.

W6. 583 singletons + 133 noface records are left unused.

_singletons/ in raw_full has 583 face-records (some of which are from legitimate subjects that just didn't cluster). _noface/ has 133 files (hash-deduped images where detection failed). Some of these could belong to existing facesets with a looser centroid-match threshold.

W7. Embedding averaging quirk is latent but OK.

Investigated because FaceSet.AverageEmbeddings() at FaceSet.py:15 overwrites self.faces[0]["embedding"] while the swapper reads source_face.normed_embedding. Confirmed via InsightFace source that normed_embedding is a @property that re-normalizes from embedding. So averaging does take effect in the swap. No action needed; noted to avoid a future misdiagnosis.
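
The mechanism behind W7 can be illustrated with a toy stand-in. `FaceLike` below is a hypothetical class written for this example, not the InsightFace `Face` implementation; it only mimics the relevant behaviour, namely a dict-like object whose `normed_embedding` is computed from `embedding` on every access.

```python
import math


class FaceLike(dict):
    """Toy stand-in for a face record whose normalised embedding is derived on access."""

    @property
    def normed_embedding(self):
        emb = self["embedding"]
        norm = math.sqrt(sum(x * x for x in emb))
        return [x / norm for x in emb]


face = FaceLike(embedding=[3.0, 4.0])
before = face.normed_embedding

# Overwriting the raw embedding (as an averaging step would) changes what
# the property returns, because nothing is cached.
face["embedding"] = [0.0, 2.0]
after = face.normed_embedding

print(before, after)
```

If `normed_embedding` were a stored attribute instead of a property, the second read would still return the pre-averaging value, which is the failure mode this check rules out.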

3.4 Observed risks for downstream use

Multi-face photos in a single-identity folder (W1) → when zipped into .fsz and loaded, roop-unleashed will detect and add ALL faces in each PNG to the FaceSet (faceswap_tab.py:678–687 loops every face returned by extract_face_images into the set). This is identity contamination by design of the loader. Highest-priority risk.
High intra-faceset variance (W2, W5) → the averaged embedding becomes a diffuse "average face" rather than a crisp identity vector. Downstream swap will produce generic likenesses, with identity drift on hard frames.
Near-dupes biasing the mean (W3) → identity average tilts toward over-represented poses (e.g., ten copies of one iPhone screenshot skew the mean).
No per-face ranking — users have no signal on which images to include / exclude when hand-curating a subset, and no way to pick "best representative" images for thumbnails.

4. Downstream consumer requirements

4.1 What `roop-unleashed` expects

Input format: a .fsz file, which is a zip of .png files (one crop per reference face). Created by ui/tabs/facemgr_tab.py:on_update_clicked():
```
filename = os.path.join(roop.globals.output_path, f"{index}.png")
cv2.imwrite(filename, img)
…
util.zip(imgnames, finalzip)   # imgnames → "faceset.fsz"
```
Files inside are named 0.png, 1.png, … — only indices.
Load path (ui/tabs/faceswap_tab.py:672–691): unzip, iterate *.png, run extract_face_images(filename, (False, 0)) (note: extra_padding default -1.0 → plain bbox crop, no resize-to-512 dance). For every detected face in each PNG, append the InsightFace Face object (with its 512-dim embedding) to face_set.faces. If the resulting set has more than one face, call face_set.AverageEmbeddings().
Use at swap time (ProcessMgr.py:626–634 + processors/FaceSwapInsightFace.py:42–52): only face_set.faces[0] is used; its normed_embedding is fed to inswapper_128.onnx. The other faces in the set only exist to contribute to the averaged embedding.
Swap backend: inswapper_128.onnx (see roop/core.py:178). Internal face working resolution is 128×128 per the InsightFace blog and FaceSwapLab FAQ; identity is carried entirely in the 512-dim embedding.
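
The bundle format described above (a zip of PNGs named by index) can be reproduced with the standard library alone. This is a minimal sketch under that description; `build_fsz` is a hypothetical helper name, and the demo writes placeholder bytes rather than real PNG data purely to show the archive layout.

```python
import tempfile
import zipfile
from pathlib import Path


def build_fsz(png_paths, out_path):
    """Zip the given files into out_path, renaming them 0.png, 1.png, ... in order."""
    out_path = Path(out_path)
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for index, png in enumerate(png_paths):
            zf.write(png, arcname=f"{index}.png")
    return out_path


# Demo with placeholder files (not valid PNGs; only the naming is illustrated).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    sources = []
    for stem in ("best_face", "second_face"):
        path = tmp_dir / f"{stem}.png"
        path.write_bytes(b"placeholder")
        sources.append(path)
    bundle = build_fsz(sources, tmp_dir / "faceset.fsz")
    with zipfile.ZipFile(bundle) as zf:
        names = zf.namelist()

print(names)
```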

4.2 Practical requirements derived from the code

One identity per .fsz. Anything else corrupts the averaged embedding.
One face per PNG inside the .fsz. Any multi-face PNG → every face gets appended to the set, polluting the average. This is enforced only by the PNG's content, not by the loader.
Faces must be detectable by InsightFace buffalo_l at det_size=(640,640) or (320,320). Extremely small or cut-off faces will fail detection and be silently skipped on load.
Input resolution: there is no explicit requirement, but since inswapper works at 128×128 and InsightFace aligns on 5 landmarks, a face bbox with a short edge of at least ~100–150 px gives a reliable embedding. Below ~60 px, embedding quality drops measurably (literature). Our min_short=90 gate is close to the lower end of useful.
Frontality helps. ArcFace embeddings are trained with some pose augmentation, but near-frontal faces (|yaw| ≤ 30°) still give the most reliable embeddings; beyond ~45° the embedding starts to drift. Roop applies no compensation for this.
Expression / lighting diversity is desirable but not required. FaceSwapLab explicitly supports "face blending" and notes it "improves the face's representative accuracy" — so a diverse set of the same identity is better than 100 near-duplicate frames.
No metadata is consumed. roop-unleashed ignores everything outside the PNG bytes — filename, EXIF, sidecar JSON are not read.
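
Why a single stray face matters can be seen on toy vectors. The sketch below assumes a plain arithmetic mean as a stand-in for what `AverageEmbeddings()` does (its exact arithmetic is not quoted in this document) and uses 2-D vectors instead of 512-dim embeddings.

```python
import math


def mean_embedding(embeddings):
    """Component-wise arithmetic mean of equally long vectors."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


identity = [1.0, 0.0]
clean_set = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
# One bystander face from a group photo slips into the set.
contaminated_set = clean_set + [[0.0, 1.0]]

clean_sim = cosine_similarity(mean_embedding(clean_set), identity)
contaminated_sim = cosine_similarity(mean_embedding(contaminated_set), identity)

print(f"clean: {clean_sim:.3f}  contaminated: {contaminated_sim:.3f}")
```

The contaminated mean points away from the true identity direction even though three of the four inputs are correct, which is the effect the one-face-per-PNG requirement is meant to prevent.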

4.3 Constraints and uncertainties

The roop-unleashed GitHub is unreachable (disabled), so the closest thing to community guidance is the in-repo CLAUDE.md and the code itself. Treat this code as authoritative.
Assumption: the user will either provide the whole facesets_full/faceset_NNN/ folder to roop-unleashed's Face Management tab (which accepts image files + a folder button — faceswap_tab.py:644–647), OR pre-build .fsz files. Both paths run through the same loader; the multi-face-per-PNG issue applies equally.

5. Refinement opportunity matrix

Each opportunity is scored qualitatively. "Automation feasibility" distinguishes fully automated (A), semi-automated with heuristics that need operator review (S), and manual-only (M). "Best place" is where implementation should live.

| # | Opportunity | Problem addressed | Evidence | Expected downstream benefit | Automation | Risk / downside | Best place | Priority | Confidence |
|---|---|---|---|---|---|---|---|---|---|
| R1 | Pre-crop each faceset image to a single face (the identity's own face) before export | W1 — multi-face photos pollute FaceSet on load | refine_manifest face/image ratio ~2:1 in top clusters; roop loader adds every detected face in a PNG (`faceswap_tab.py:678–687`) | Large. Cleans the single biggest identity-averaging contaminant | A (use the existing bbox per face record in the cache, crop the decoded image array with padding, save to a new `facesets_swap_ready/` mirror) | Must pick the correct face of multiple detected per image → use the bbox that the upstream cache already matched to this faceset | `facesets` | P0 | High |
| R2 | Split known multi-face photos so only the identity's own bbox is included, alternative to full image export | Same as R1, more conservative | Same as R1 | Same as R1 | A | — | `facesets` | P0 | High |
| R3 | Identity tightening — re-run refine with stricter outlier threshold (e.g. outlier_threshold=0.45) | W2 — intra-cluster spread too wide, chain effects from average-linkage | pairwise distance max > 1.2 in every faceset | Sharpens averaged embedding; removes obviously-wrong faces | A | Some legitimate same-person faces (age / lighting extremes) may be dropped | `facesets` | P0 | High |
| R4 | Drop visual near-duplicates from the set (keep the highest-quality representative per dupe group) | W3 — duplicate images bias the average | `duplicates.json` has 115 visual groups (2–5 images each) across 4756 faces | Removes silent bias toward over-represented frames; shrinks set size for faster load | A | Deciding which copy to keep is a tiny judgement call (pick highest det_score × face_short × blur) | `facesets` | P1 | High |
| R5 | Per-face composite quality score (weighted `det_score · blur · face_short · frontality`) and ranked export / top-N subset | Need to give roop-unleashed a small, strong averaging pool rather than all 771 images | Cache already has det_score, blur, face_short; frontality = landmark symmetry, computable from `landmark_2d_106` which InsightFace already provides but we don't store | Smaller `.fsz` files, better average embedding, faster UI | A for the score; S for the top-N choice (operator picks N per identity) | Frontality adds a small extra compute step; needs a re-pass over the cache or a re-embed storing landmarks | `facesets` | P1 | Medium |
| R6 | Produce `.fsz` directly (zip the cropped PNGs with integer filenames) as an export mode | Saves the operator the manual zipping step; guarantees filename correctness | `facemgr_tab.py:242–255` is the reference implementation; trivially reproducible | Zero-friction import into roop-unleashed | A | — | `facesets` | P1 | High |
| R7 | Pose / frontality filter at refine time using `landmark_2d_106` landmark symmetry or yaw estimation from `face.pose` (if available) | W5 — strong profile faces weaken the average | ArcFace literature; no measurement yet in our cache | Tighter identity average, especially for smaller facesets where one profile shot can dominate | A (compute from cached landmarks if we re-embed or store them; otherwise a one-off enrichment pass) | Landmarks not currently persisted in the cache; requires a small re-embed or enrichment command | `facesets` | P2 | Medium |
| R8 | Singleton rescue pass — re-classify `_singletons/` against final faceset centroids with a looser threshold + quality gate | W6 — some singletons are legit faceset members | 583 singletons with p50 face_short=149, p50 det_score=0.76 — many look usable | Recovers lost identity examples; modest expansion of useful facesets | A | Some true singletons will be mis-assigned; threshold choice matters | `facesets` | P2 | Medium |
| R9 | Modern face-quality scorer (SDD-FIQA / CR-FIQA) to replace the `det_score × blur` heuristic | More robust quality ranking than hand-rolled heuristics | Literature; current heuristic is crude | Marginal improvement over R5 for the same goal | A but adds a new model dependency | Model weights to download, more CPU cost at ranking time | `facesets` | P3 | Medium |
| R10 | Person-label sidecars (e.g. `faceset_001/_label.txt` with an operator-provided name) | UX — the 12 facesets are anonymous; operator has to peek to find "mom" | No evidence; improvement to workflow | Operator-quality-of-life; no effect on swap quality | M | — | `facesets` | P3 | Low |
| R11 | Source-image selection UI improvements in roop-unleashed (e.g. a "pick best 20 by quality" button on load) | Better use of large `.fsz` files | Not implemented downstream | Improvement happens at consumption time | A | Requires a roop-unleashed patch, and the upstream repo is disabled | `roop-unleashed` | P4 | Low |
| R12 | Face alignment / crop standardization (e.g. arcface-aligned 512×512 crops in the `.fsz`) | Some marginal consistency gain on detection | roop re-detects anyway on load (`extract_face_images`) so input alignment is discarded | Very small — roop's loader re-detects and re-aligns regardless | A | Extra compute for no practical gain | — (do not do) | Not recommended | High |
| R13 | Increase resolution via upscaling of low-res crops | Make small faces "bigger" | Identity comes from the embedding, not the pixels | None — upscaling with GAN does not add identity info; inswapper reads 128×128 anyway | A | Can introduce synthetic artifacts | — (do not do) | Not recommended | High |
| R14 | Destructive reorganization of `facesets_full/` in place | Simpler final layout | Operator explicitly told us yesterday to preserve existing output | Marginal tidiness | M | Loses the current "full cluster" reference view, which has diagnostic value | — (do not do without explicit go-ahead) | Not recommended by default | High |

6. Recommended target state

Define a new output view, facesets_swap_ready/, produced by a new subcommand (e.g. sort_faces.py export-swap). Original facesets_full/ stays intact. Per faceset:

facesets_swap_ready/
  faceset_001/
    manifest.json           # provenance + per-image score + rank
    previews/               # 4-image contact sheet thumbnail
      top_20_grid.jpg
    faces/                  # cropped-to-single-face PNGs named "000.png", "001.png", ...
      000.png               # highest-ranked face, single face per PNG, 512x512 padded/aligned
      001.png
      ...
    faceset.fsz             # zip of faces/*.png — drop-in for roop-unleashed
  faceset_002/
    ...

Key properties:

One face per PNG — each PNG is a crop of a single face (R1/R2), padded to a consistent 512×512 with the identity's bbox centred. Roop-unleashed's loader will re-detect exactly one face per file.
Ranked by composite quality — faces/000.png is the best representative; later indices are weaker. Operator can trivially truncate by dropping later files.
Configurable top-N — default --top-n 30 per faceset with a --include-all flag for the current behaviour. 30 is conservative; FaceSwapLab's "face blending" tool (the most analogous public practitioner reference) shows that blending with diverse but consistent images materially helps; 20–40 is a common practitioner range.
Near-duplicates dropped (R4) — one representative per visual-dupe group.
Tighter outlier gate (R3) — outlier_threshold reduced from 0.55 to ~0.45 for this export, keeping the refine defaults on facesets_full/.
Ready-to-ship .fsz (R6) in each folder.
manifest.json per faceset — cites every source path and score. Lets the operator see why a face was kept (or dropped if we add a _rejected/ sibling).

This lets the operator test swap quality end-to-end without any roop-unleashed modification, and preserves full fallback to the raw / full results if anything needs re-examination.
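
The single-face crop in the target state needs a crop window around the identity's bbox. A minimal sketch of one way to derive it, assuming "0.5 padding" means the longer bbox edge is grown by 50% and the window is clamped to the image; `padded_square_box` is a hypothetical helper, and the resize to 512×512 and the PNG write are left to whatever imaging library the export uses.

```python
def padded_square_box(bbox, image_size, padding=0.5):
    """Return an integer (left, top, right, bottom) crop window centred on bbox.

    The window side is the longer bbox edge times (1 + padding). The result is
    clamped to the image, so it can stop being square near the borders.
    """
    x1, y1, x2, y2 = bbox
    width, height = image_size
    side = max(x2 - x1, y2 - y1) * (1.0 + padding)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    left = max(0, int(round(cx - side / 2.0)))
    top = max(0, int(round(cy - side / 2.0)))
    right = min(width, int(round(cx + side / 2.0)))
    bottom = min(height, int(round(cy + side / 2.0)))
    return left, top, right, bottom


# A 200 px face well inside a 1000x800 image, and one touching the top-left corner.
centre_box = padded_square_box((100, 100, 300, 300), (1000, 800))
corner_box = padded_square_box((0, 0, 100, 100), (200, 200))
print(centre_box, corner_box)
```

With an image held as a height × width × channels array, the window would be applied by slicing rows `top:bottom` and columns `left:right`.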

7. Recommended next steps

7.1 Quick wins (high value, low effort)

R1 — single-face crop export as part of export-swap. Uses bbox already in the cache; zero new models. Delivers the biggest likely swap-quality improvement.
R4 — drop visual near-duplicates inside the export. Uses duplicates.json already produced by cmd_dedup. Smaller sets, cleaner averages.
R5 — composite quality score + rank + top-N. Uses existing fields (det_score, blur, face_short). Deliver .fsz + faces/ sorted by descending score.
R6 — .fsz bundle emission by simply zipping faces/*.png with integer names. Trivial given R1, R4, and R5.

These four together give a clean, drop-in-usable export in one session of work.
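
The ranking step in R5 could look roughly like the sketch below. It uses only the three fields already in the cache; the weights and the saturation caps (`short_cap`, `blur_cap`) are illustrative assumptions, not the values the export uses, and the record layout with a `name` key is hypothetical.

```python
def composite_score(face, weights=(0.4, 0.3, 0.3), short_cap=256.0, blur_cap=200.0):
    """Weighted sum of det_score and capped, normalised face_short and blur."""
    w_det, w_short, w_blur = weights
    short_term = min(face["face_short"] / short_cap, 1.0)  # saturates at short_cap px
    blur_term = min(face["blur"] / blur_cap, 1.0)          # saturates at blur_cap
    return w_det * face["det_score"] + w_short * short_term + w_blur * blur_term


def top_n(faces, n=30):
    """Faces sorted by descending composite score, truncated to n entries."""
    return sorted(faces, key=composite_score, reverse=True)[:n]


# Toy records; names and values are made up for illustration.
faces = [
    {"name": "soft_small", "det_score": 0.7, "face_short": 90, "blur": 40.0},
    {"name": "sharp_large", "det_score": 0.9, "face_short": 256, "blur": 200.0},
    {"name": "medium", "det_score": 0.8, "face_short": 150, "blur": 100.0},
]
ranked = [face["name"] for face in top_n(faces, n=2)]
print(ranked)
```

Capping the size and blur terms keeps one very large or very sharp face from dominating the score; whether that is desirable is a tuning decision for the operator.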

7.2 Medium-effort improvements

R3 — re-run refine with stricter outlier_threshold (e.g. 0.45) for the export path; keep facesets_full/ at 0.55 for reference. Requires a re-cluster over existing embeddings — fast (seconds), no re-embed.
R7 — pose/frontality filter using landmarks. Requires either (a) a re-embed pass that persists landmark_2d_106, or (b) an enrichment pass that re-loads each image and computes yaw without redoing the full embed. Modest CPU cost; meaningful for small facesets.
R8 — singleton rescue against final centroids. Low code cost; likely yields a handful of additional good images per identity.
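
The singleton rescue in R8 amounts to a nearest-centroid lookup with a distance cap. A minimal sketch on toy vectors, where `route_singleton` and the 0.5 cap are hypothetical choices for illustration; a match would be a candidate for operator review, not an automatic merge.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def route_singleton(embedding, centroids, max_distance=0.5):
    """Name of the nearest centroid, or None when none lies within max_distance."""
    best_name, best_dist = None, float("inf")
    for name, centroid in centroids.items():
        dist = cosine_distance(embedding, centroid)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None


# Toy 2-D centroids standing in for per-faceset mean embeddings.
centroids = {"faceset_001": [1.0, 0.0], "faceset_002": [0.0, 1.0]}
near = route_singleton([0.9, 0.1], centroids)
far = route_singleton([-1.0, 0.0], centroids)
print(near, far)
```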

7.3 Items requiring operator decision

Target top-N per faceset for the export (proposal: 30, override per run). Affects the average-embedding quality trade-off vs. UI load time.
Whether to name facesets (R10) by operator — purely workflow.
Whether _singletons/ should be retired or promoted to "uncertain identity" export with a lower-confidence tag.

7.4 Not recommended

R11 — patching roop-unleashed itself. The upstream repo is disabled, so any patch means maintaining a fork, and comparable gains are achievable upstream in facesets.
R12 / R13 — pre-aligning or up-scaling source crops. Roop re-detects/aligns on load and inswapper caps at 128×128 internally; effort is wasted.
R14 — destructive reorganization of facesets_full/. The operator already told us (yesterday) to preserve existing results; no new evidence supports re-opening that.

8. Open questions

OQ1. Is the operator willing to have the export step drop faces rather than just rank them? R5-top-N drops everything past rank N; if the operator prefers to keep the full set but marked, we should export ranked without truncation and let the user pick in the UI.
OQ2. How many .fsz files does the operator actually plan to use? If only 3–4 identities will be used in practice, R5 can stay conservative (N=50) without cost. If all 12 are routinely used, leaner is better (N=20).
OQ3. Should singletons (R8) be rescued into existing facesets or exported as their own "candidate_NNN/" bucket for manual triage? The safer default is a separate bucket; the operator may prefer direct merge.
OQ4. Is frontality-filtering (R7) worth a re-embed, or should we settle for a cheap "bbox aspect ratio" proxy? A proper yaw estimate needs landmarks; a crude proxy (bbox width/height ratio) is free but weaker.
OQ5. Is there appetite for adding a modern FIQA model (R9) as a drop-in dependency? It adds ~50 MB download and a small CPU cost per face; benefit over the current heuristic is real but modest.
OQ6. For the export, should the operator name (R10) be required before an .fsz is emitted (forces thought about which identity is which), or optional (pure convenience)?

End of evaluation. No code has been changed as part of this analysis.
