- enrich: re-detects each cached face with buffalo_l (detection + landmark_2d_106 + landmark_3d_68, recognition module skipped for speed) and persists landmarks + pose into the cache so per-face frontality and landmark-symmetry quality signals become available.
- compute_quality: composite score combining det_score, face short-edge, blur, frontality (from pose pitch/yaw), and 2D-landmark symmetry with tunable weights. Default weighting 0.30/0.20/0.20/0.15/0.15.
- export-swap: builds facesets_swap_ready/ from an existing refine manifest. Per identity: tighter outlier gate (default 0.45), visual near-dupe collapse (keep best representative per group), multi-face-per-source-image collapse (keep best bbox), rank by composite score, single-face-per-PNG crops at 512x512 with 0.5 bbox padding, ready-to-drop .fsz bundles (top-N + full), per-faceset manifest.json, NAME.txt placeholder for the operator. The multi-face-per-PNG collapse is the critical fix: roop-unleashed's .fsz loader appends every detected face in each PNG to the FaceSet, so any multi-face crop would contaminate the averaged embedding.
- Optional --candidates rescues raw_full singletons: matches against the final per-faceset centroids and routes to _candidates/to_<faceset>/ for manual review; orphaned singletons that still cluster among themselves land in _candidates/new_<NNN>/.
- docs/analysis/: evaluation document captures the evidence, downstream requirements (FaceSet averaging, inswapper_128), opportunity matrix (R1-R14), and the recommended target state this export implements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Facesets → roop-unleashed: downstream refinement evaluation
Analysis date: 2026-04-23. Author: Peter (with Claude Code).
1. Scope
Objective. Evaluate how the existing face-clustering / person-sorted results in facesets_full/ can be refined so that the downstream project roop-unleashed produces the best practical face-swap results.
Repositories / folders inspected
- /opt/face-sets/: the upstream project (this repo); code, README.md, sort_faces.py, refine_manifest.json, duplicates.json.
- /mnt/e/temp_things/fcswp/nl_sorted/facesets_full/ and .../raw_full/: current output.
- /opt/face-sets/work/cache/nl_full.npz: the underlying embedding cache used to produce the output.
- /opt/roop-unleashed/: the downstream consumer.
- InsightFace 0.7.3 Face class (/home/peter/face_sort_env/lib/...insightface/app/common.py) to resolve an ambiguity about embedding averaging.
Agent usage. Subagents (Explore, Trend Researcher) were attempted but rejected by the operator. All investigation was done directly via Read, Grep, Bash, WebFetch, WebSearch. ~/.claude/agents/ was enumerated; no face-swap-specific agent exists.
Web research used. Targeted WebSearch + WebFetch against FaceSwapLab FAQ, FaceFusion docs, and the GitHub roop-unleashed discussion page for faceset creation. The original C0untFloyd/roop-unleashed GitHub repo has been disabled by GitHub Staff for ToS violation, so the code in /opt/roop-unleashed/ is the authoritative source for this analysis.
2. Evidence base
2.1 Files read in facesets / output
- sort_faces.py (full): current pipeline, esp. cmd_embed (embed + sha256 dedup + resume), cmd_cluster, cmd_refine (centroid-merge + quality gate + outlier rejection), cmd_extend (centroid-preserving merge), cmd_dedup (byte + visual).
- refine_manifest.json at facesets_full/: post-extend state; extended: true; 12 facesets, params {initial_threshold: 0.55, merge_threshold: 0.40, outlier_threshold: 0.55, min_faces: 15, min_short: 90, min_blur: 40.0, min_det_score: 0.6}.
- nl_full.npz: 4756 face embeddings + 133 noface records across 2667 unique files; 113 byte-dupe alias paths; 103 byte-groups + 115 visual-dupe groups in nl_full.duplicates.json.
2.2 Files read in roop-unleashed
- roop/FaceSet.py: the downstream identity container; AverageEmbeddings() at lines 15–20.
- roop/face_util.py: get_face_analyser() builds InsightFace buffalo_l (lines 35–50); extract_face_images() at lines 72–144 implements the .fsz unpack + detect path.
- roop/processors/FaceSwapInsightFace.py: the actual inswapper swap; Run() at lines 42–52 uses source_face.normed_embedding.
- roop/core.py:178–179: identifies the swap model as inswapper_128.onnx (HuggingFace countfloyd/deepfake + Codeberg mirror).
- roop/ProcessMgr.py:626–634: process_face confirms only face_datas[face_index].faces[0] is used per identity.
- ui/tabs/facemgr_tab.py (full): how .fsz is created by users (cv2.imwrite PNGs → zip).
- ui/tabs/faceswap_tab.py:651–710: how .fsz / image source is loaded into INPUT_FACESETS; AverageEmbeddings() is called iff len(faces) > 1 at line 690.
- InsightFace common.py:Face: normed_embedding is a @property, so it does re-derive from self.embedding; averaging therefore does propagate to the swap (resolves an ambiguity).
2.3 External sources
- FaceSwapLab FAQ: practitioner-level guidance on multi-image reference and the checkpoint builder.
- FaceFusion face-swapper docs: model list including inswapper_128_fp16, hyperswap_1a_256, etc.
- InsightFace blog (evolution of face swapping): inswapper internal face resolution is 128×128 RGB regardless of input.
- DeepWiki (inswapper_128): confirms encoder-decoder structure, identity taken from embedding, target appearance preserved.
- SDD-FIQA (CVPR 2021): unsupervised face quality metric; a modern alternative to det_score + blur.
3. Current upstream output assessment
3.1 Structure of facesets_full/
- 12 faceset folders (faceset_001 … faceset_012) selected by the refine step (min_faces=15).
- Each folder contains the full original images (jpg / jpeg / png) that contributed a face to that cluster, filename-flattened from the absolute path so each file is traceable to its on-disk source.
- One refine_manifest.json at the root with per-faceset {face_count, image_count, alias_count, images[]}.
- facesets_full/ has extended=true (merged after the lzbkp_red run via cmd_extend).
Counts (manifest):
| faceset | images | face records | aliases |
|---|---|---|---|
| faceset_001 | 771 | 1505 | 55 |
| faceset_002 | 238 | 543 | 6 |
| faceset_003 | 206 | 402 | 2 |
| faceset_004 | 103 | 273 | 2 |
| faceset_005 | 68 | 218 | 2 |
| faceset_006 | 51 | 153 | 1 |
| faceset_007 | 89 | 158 | 0 |
| faceset_008 | 44 | 131 | 1 |
| faceset_009 | 43 | 129 | 0 |
| faceset_010 | 25 | 73 | 0 |
| faceset_011 | 25 | 71 | 8 |
| faceset_012 | 17 | 55 | 0 |
3.2 Observed strengths
- Identity grouping is directionally correct. The top facesets are credibly large and coherent: the raw raw_full/person_001 is 2.3 GB; refine extracted a 557→771-image faceset on top of that, which is a significant and useful identity pool by any standard.
- Quality gate is applied. min_short=90, min_blur=40, min_det_score=0.6 are enforced; low-resolution and out-of-focus faces are rejected.
- Outlier rejection is applied. Faces with cosine distance > 0.55 from their cluster centroid are dropped (when cluster ≥ 4).
- Aliasing preserves provenance. Every on-disk copy (byte-duplicates between iCloud / manual backups / etc.) is preserved in the folder, so the user can trace every file in a faceset back to its original location.
- Quality metrics already captured per face. face_short, blur (Laplacian variance), det_score, bbox are persisted in the cache, so they are available for any future ranking logic without re-embedding.
3.3 Observed weaknesses
Evidence is from direct computation on the cache (nl_full.npz) + the manifests.
W1. face_records / image_count ratio ~2:1 in top facesets.
- faceset_001: 1505 faces / 771 images = 1.95 faces per image.
- faceset_002: 543 / 238 = 2.28.
- faceset_003: 402 / 206 = 1.95.
- A healthy one-identity set should be ~1:1 (one face per image).
- Interpretation: many of these are multi-face photos (group / family shots) where multiple people's faces were placed into the same cluster, or the same image had multiple faces all passing the centroid gate for the same identity. Either way, the current facesets are contaminated with faces of other people from the same photo. This is the single biggest downstream risk — see §4.
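The W1 ratios can be reproduced directly from the manifest counts. A minimal sketch, with the numbers copied from the table in §3.1; the dict and helper names are illustrative and not part of sort_faces.py:

```python
# Counts copied from the manifest table in section 3.1: (images, face records).
MANIFEST_COUNTS = {
    "faceset_001": (771, 1505),
    "faceset_002": (238, 543),
    "faceset_003": (206, 402),
}

def faces_per_image(images: int, face_records: int) -> float:
    """Average number of face records contributed by each image."""
    return face_records / images

for name, (images, face_records) in MANIFEST_COUNTS.items():
    print(f"{name}: {faces_per_image(images, face_records):.2f} faces per image")
```

A value near 1.0 would indicate one face per image; values near 2.0 are what flags the multi-face contamination discussed here.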
W2. Intra-faceset pairwise cosine distance is high.
- Mean pairwise distance in faceset_001 = 0.835, p90 = 1.047, max = 1.242.
- For reference: same-identity ArcFace cosine distance typically clusters in [0.2, 0.6]. Pairs > 1.0 (negative cosine similarity) cannot be the same person.
- All 12 facesets have means in [0.82, 0.90] and p90 in [1.03, 1.07].
- Interpretation: the clusters were built with linkage=average, threshold=0.55. Average linkage bounds only the mean distance between the clusters being merged, not the distance of any individual pair, so two points with direct distance > 1.0 can end up in the same cluster via intermediate points. Some of this spread is legitimate (the photo library spans 15+ years, so the same person appears at different ages and in different lighting), some is contamination from W1.
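The W2 statistics (mean, p90, max of pairwise cosine distance) can be sketched as follows. This is an illustrative pure-Python version on toy 2-D vectors, not the code that produced the numbers above; the nearest-rank percentile is an assumption about how p90 is taken:

```python
import math
from itertools import combinations
from statistics import mean

def cosine_distance(a, b):
    """1 - cosine similarity; values above 1.0 mean negative similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def pairwise_stats(embeddings):
    """Mean, nearest-rank p90, and max over all unordered pairs."""
    dists = sorted(cosine_distance(a, b) for a, b in combinations(embeddings, 2))
    p90_index = min(len(dists) - 1, int(0.9 * len(dists)))
    return {"mean": mean(dists), "p90": dists[p90_index], "max": dists[-1]}

# Toy data: the first and third vectors point in opposite directions.
toy = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
stats = pairwise_stats(toy)
print(stats)
```

On real 512-dim embeddings the all-pairs loop is quadratic in the number of faces, so a vectorised implementation would be preferable for the larger facesets.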
W3. Near-duplicates inflate the effective size.
- nl_full.duplicates.json: 103 byte-identical groups (same file copied around) + 115 visual near-duplicate groups (cross-file cosine-distance ≤ 0.03 with matching bbox size, likely re-encodes / resizes).
- faceset_001 alone carries 55 aliased paths.
- Interpretation: multiple copies of the same photograph contribute the same embedding (or a near-identical one) to the cluster's average. This does not add identity information — at best neutral, at worst biases the average toward whatever pose/expression appears in the duplicate set.
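One way to neutralise this bias is to keep a single representative per duplicate group. A minimal sketch, assuming each face record exposes det_score, face_short and blur, and that duplicate groups are lists of face ids; the record layout and the product-based quality key are illustrative, not the actual duplicates.json schema:

```python
def collapse_duplicates(faces, groups):
    """Keep the best face of every duplicate group; faces outside any group are kept.

    faces:  dict mapping face id -> {"det_score", "face_short", "blur"}
    groups: list of lists of face ids considered duplicates of each other
    """
    def quality(face_id):
        record = faces[face_id]
        return record["det_score"] * record["face_short"] * record["blur"]

    dropped = set()
    for group in groups:
        members = [face_id for face_id in group if face_id in faces]
        if len(members) < 2:
            continue  # nothing to collapse
        best = max(members, key=quality)
        dropped.update(m for m in members if m != best)
    return [face_id for face_id in faces if face_id not in dropped]

faces = {
    "a": {"det_score": 0.9, "face_short": 200, "blur": 100.0},
    "b": {"det_score": 0.8, "face_short": 150, "blur": 80.0},
    "c": {"det_score": 0.7, "face_short": 120, "blur": 50.0},
}
kept = collapse_duplicates(faces, groups=[["a", "b"]])
print(kept)
```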
W4. Blur / quality gate is lax.
- Cache-wide blur (Laplacian variance) p10/p25/p50 = 19/32/60. Refine gate is 40, so roughly the bottom ~35% of faces drop on blur.
- Per-faceset p10 blur is 36–90, so many included faces are visibly soft. For downstream swap this is acceptable (identity embedding tolerates modest softness) but tightening would improve the average.
W5. No pose / frontality filtering.
- Neither detect-time nor refine uses landmarks / yaw / pitch. A strong profile shot with clear det_score + size still passes. ArcFace embeddings degrade for |yaw| > ~45°. The current set has no way to prefer frontal faces.
W6. 583 singletons + 133 noface drop to floor.
- _singletons/ in raw_full has 583 face-records (some of which are from legitimate subjects that just didn't cluster).
- _noface/ has 133 files (hash-deduped images where detection failed). Some of these could belong to existing facesets with a looser centroid-match threshold.
W7. Embedding averaging quirk is latent but OK.
- Investigated because FaceSet.AverageEmbeddings() at FaceSet.py:15 overwrites self.faces[0]["embedding"] while the swapper reads source_face.normed_embedding. Confirmed via InsightFace source that normed_embedding is a @property that re-normalizes from embedding. So averaging does take effect in the swap. No action needed; noted to avoid a future misdiagnosis.
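The behaviour behind W7 can be illustrated with a small stand-in class. This is a hypothetical sketch that mimics the property-based design described above; it is not the InsightFace source:

```python
import math

class FaceStandIn(dict):
    """Hypothetical stand-in: a dict-like record whose normed_embedding is derived on access."""

    @property
    def normed_embedding(self):
        embedding = self["embedding"]
        norm = math.sqrt(sum(x * x for x in embedding))
        return [x / norm for x in embedding]

face = FaceStandIn(embedding=[3.0, 4.0])
before = face.normed_embedding

# An averaging step that overwrites the stored embedding...
face["embedding"] = [0.0, 2.0]

# ...is visible on the next read, because nothing was cached.
after = face.normed_embedding
print(before, after)
```

Had normed_embedding been a value stored once at detection time, the overwrite would have had no effect on the swap, which is the misdiagnosis this note guards against.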
3.4 Observed risks for downstream use
- Multi-face photos in a single-identity folder (W1) → when zipped into .fsz and loaded, roop-unleashed will detect and add ALL faces in each PNG to the FaceSet (faceswap_tab.py:678–687 loops every face returned by extract_face_images into the set). This is identity contamination by design of the loader. Highest-priority risk.
- High intra-faceset variance (W2, W5) → the averaged embedding becomes a diffuse "average face" rather than a crisp identity vector. Downstream swap will produce generic likenesses, with identity drift on hard frames.
- Near-dupes biasing the mean (W3) → identity average tilts toward over-represented poses (e.g., ten copies of one iPhone screenshot skew the mean).
- No per-face ranking — users have no signal on which images to include / exclude when hand-curating a subset, and no way to pick "best representative" images for thumbnails.
4. Downstream consumer requirements
4.1 What roop-unleashed expects
- Input format: a .fsz file, which is a zip of .png files (one crop per reference face). Created by ui/tabs/facemgr_tab.py:on_update_clicked():

      filename = os.path.join(roop.globals.output_path, f"{index}.png")
      cv2.imwrite(filename, img)
      …
      util.zip(imgnames, finalzip)  # imgnames → "faceset.fsz"

  Files inside are named 0.png, 1.png, … (only indices).
- Load path (ui/tabs/faceswap_tab.py:672–691): unzip, iterate *.png, run extract_face_images(filename, (False, 0)) (note: extra_padding default -1.0 → plain bbox crop, no resize-to-512 dance). For every detected face in each PNG, append the InsightFace Face object (with its 512-dim embedding) to face_set.faces. If the resulting set has more than one face, call face_set.AverageEmbeddings().
- Use at swap time (ProcessMgr.py:626–634 + processors/FaceSwapInsightFace.py:42–52): only face_set.faces[0] is used; its normed_embedding is fed to inswapper_128.onnx. The other faces in the set only exist to contribute to the averaged embedding.
- Swap backend: inswapper_128.onnx (see roop/core.py:178). Internal face working resolution is 128×128 per the InsightFace blog and FaceSwapLab FAQ; identity is carried entirely in the 512-dim embedding.
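The container format described above (a zip of PNGs named by index) can be reproduced with the standard library. A minimal sketch; write_fsz is a hypothetical helper, not a function from roop-unleashed or sort_faces.py, and the demo uses placeholder bytes where real input would be single-face PNG crops:

```python
import tempfile
import zipfile
from pathlib import Path

def write_fsz(png_paths, fsz_path):
    """Zip the given PNG files, in order, under the names 0.png, 1.png, ..."""
    names = []
    with zipfile.ZipFile(fsz_path, "w") as archive:
        for index, png_path in enumerate(png_paths):
            arcname = f"{index}.png"
            archive.write(png_path, arcname=arcname)
            names.append(arcname)
    return names

# Demo with placeholder bytes standing in for real PNG crops.
with tempfile.TemporaryDirectory() as tmp:
    crops = []
    for rank in range(3):
        crop = Path(tmp) / f"crop_{rank:03d}.png"
        crop.write_bytes(b"placeholder")
        crops.append(crop)
    fsz_path = Path(tmp) / "faceset.fsz"
    written = write_fsz(crops, fsz_path)
    with zipfile.ZipFile(fsz_path) as archive:
        stored = archive.namelist()

print(stored)
```

Passing the crops in ranked order means the archive index doubles as the quality rank, so truncating to a top-N subset is just a slice of the input list.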
4.2 Practical requirements derived from the code
- One identity per .fsz. Anything else corrupts the averaged embedding.
- One face per PNG inside the .fsz. Any multi-face PNG → every face gets appended to the set, polluting the average. This is enforced only by the PNG's content, not by the loader.
- Faces must be detectable by InsightFace buffalo_l at det_size=(640,640) or (320,320). Extremely small or cut-off faces will fail detection and be silently skipped on load.
- Input resolution: there is no explicit requirement, but since inswapper works at 128×128 and InsightFace aligns on 5 landmarks, a face bbox with a short edge of at least ~100–150 px gives a reliable embedding. Below ~60 px, embedding quality drops measurably (literature). Our min_short=90 gate is close to the lower end of the useful range.
- Frontality helps. ArcFace embeddings are trained with some pose augmentation, so near-frontal (|yaw| ≤ 30°) is ideal; beyond ~45° the embedding starts to drift. Roop applies no compensation for this.
- Expression / lighting diversity is desirable but not required. FaceSwapLab explicitly supports "face blending" and notes it "improves the face's representative accuracy", so a diverse set of the same identity is better than 100 near-duplicate frames.
- No metadata is consumed. roop-unleashed ignores everything outside the PNG bytes: filename, EXIF, and sidecar JSON are not read.
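The first two requirements can be made concrete with a toy calculation. The sketch below assumes the average is a plain unweighted mean of the embeddings (an assumption about AverageEmbeddings(), which this document only describes as averaging) and uses 2-D vectors in place of 512-dim embeddings:

```python
import math

def mean_vector(vectors):
    """Component-wise unweighted mean."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

identity = [1.0, 0.0]    # toy embedding of the intended person
bystander = [0.0, 1.0]   # toy embedding of someone else in the same photos

clean = mean_vector([identity] * 4)
polluted = mean_vector([identity] * 4 + [bystander] * 2)

clean_sim = cosine_similarity(clean, identity)
polluted_sim = cosine_similarity(polluted, identity)
print(f"clean={clean_sim:.3f} polluted={polluted_sim:.3f}")
```

Two stray faces among six are enough to pull the averaged vector visibly away from the intended identity, which is why the one-face-per-PNG rule matters more than any quality threshold.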
4.3 Constraints and uncertainties
- The roop-unleashed GitHub repo is unreachable (disabled), so the closest thing to community guidance is the in-repo CLAUDE.md and the code itself. Treat this code as authoritative.
- Assumption: the user will either provide the whole facesets_full/faceset_NNN/ folder to roop-unleashed's Face Management tab (which accepts image files + a folder button; see faceswap_tab.py:644–647), OR pre-build .fsz files. Both paths run through the same loader; the multi-face-per-PNG issue applies equally.
5. Refinement opportunity matrix
Each opportunity is scored qualitatively. "Automation feasibility" distinguishes fully automated (A), semi-automated with heuristics that need operator review (S), and manual-only (M). "Best place" is where implementation should live.
| # | Opportunity | Problem addressed | Evidence | Expected downstream benefit | Automation | Risk / downside | Best place | Priority | Confidence |
|---|---|---|---|---|---|---|---|---|---|
| R1 | Pre-crop each faceset image to a single face (the identity's own face) before export | W1: multi-face photos pollute FaceSet on load | refine_manifest face/image ratio ~2:1 in top clusters; roop loader adds every detected face in a PNG (faceswap_tab.py:678–687) | Large. Cleans the single biggest identity-averaging contaminant | A (use the existing bbox per face record in the cache, crop with padding via array slicing, save to a new facesets_swap_ready/ mirror) | Must pick the correct face of multiple detected per image → use the bbox that the upstream cache already matched to this faceset | facesets | P0 | High |
| R2 | Split known multi-face photos so only the identity's own bbox is included, alternative to full image export | Same as R1, more conservative | Same as R1 | Same as R1 | A | n/a | facesets | P0 | High |
| R3 | Identity tightening: re-run refine with stricter outlier threshold (e.g. outlier_threshold=0.45) | W2: intra-cluster spread too wide, loose merges under average linkage | pairwise distance max > 1.2 in every faceset | Sharpens averaged embedding; removes obviously-wrong faces | A | Some legitimate same-person faces (age / lighting extremes) may be dropped | facesets | P0 | High |
| R4 | Drop visual near-duplicates from the set (keep the highest-quality representative per dupe group) | W3: duplicate images bias the average | duplicates.json has 115 visual groups (2–5 images each) across 4756 faces | Removes silent bias toward over-represented frames; shrinks set size for faster load | A | Deciding which copy to keep is a tiny judgement call (pick highest det_score × face_short × blur) | facesets | P1 | High |
| R5 | Per-face composite quality score (weighted det_score · blur · face_short · frontality) and ranked export / top-N subset | Need to give roop-unleashed a small, strong averaging pool rather than all 771 images | Cache already has det_score, blur, face_short; frontality = landmark symmetry, computable from landmark_2d_106 which InsightFace already provides but we don't store | Smaller .fsz files, better average embedding, faster UI | A for the score; S for the top-N choice (operator picks N per identity) | Frontality adds a small extra compute step; needs a re-pass over the cache or a re-embed storing landmarks | facesets | P1 | Medium |
| R6 | Produce .fsz directly (zip the cropped PNGs with integer filenames) as an export mode | Saves the operator the manual zipping step; guarantees filename correctness | facemgr_tab.py:242–255 is the reference implementation; trivially reproducible | Zero-friction import into roop-unleashed | A | n/a | facesets | P1 | High |
| R7 | Pose / frontality filter at refine time using landmark_2d_106 landmark symmetry or yaw estimation from face.pose (if available) | W5: strong profile faces weaken the average | ArcFace literature; no measurement yet in our cache | Tighter identity average, especially for smaller facesets where one profile shot can dominate | A (compute from cached landmarks if we re-embed or store them; otherwise a one-off enrichment pass) | Landmarks not currently persisted in the cache; requires a small re-embed or enrichment command | facesets | P2 | Medium |
| R8 | Singleton rescue pass: re-classify _singletons/ against final faceset centroids with a looser threshold + quality gate | W6: some singletons are legit faceset members | 583 singletons with p50 face_short=149, p50 det_score=0.76; many look usable | Recovers lost identity examples; modest expansion of useful facesets | A | Some true singletons will be mis-assigned; threshold choice matters | facesets | P2 | Medium |
| R9 | Modern face-quality scorer (SDD-FIQA / CR-FIQA) to replace the det_score × blur heuristic | More robust quality ranking than hand-rolled heuristics | Literature; current heuristic is crude | Marginal improvement over R5 for the same goal | A but adds a new model dependency | Model weights to download, more CPU cost at ranking time | facesets | P3 | Medium |
| R10 | Person-label sidecars (e.g. faceset_001/_label.txt with an operator-provided name) | UX: the 12 facesets are anonymous; operator has to peek to find "mom" | No evidence; improvement to workflow | Operator-quality-of-life; no effect on swap quality | M | n/a | facesets | P3 | Low |
| R11 | Improve the multi-source-image selection UI in roop-unleashed (e.g. a "pick best 20 by quality" button on load) | Better use of large .fsz files | Not implemented downstream | Improvement happens at consumption time | A | Requires patching roop-unleashed, whose upstream repo is disabled | roop-unleashed | P4 | Low |
| R12 | Face alignment / crop standardization (e.g. arcface-aligned 512×512 crops in the .fsz) | Some marginal consistency gain on detection | roop re-detects anyway on load (extract_face_images) so input alignment is discarded | Very small: roop's loader re-detects and re-aligns regardless | A | Extra compute for no practical gain | n/a (do not do) | Not recommended | High |
| R13 | Increase resolution via upscaling of low-res crops | Make small faces "bigger" | Identity comes from the embedding, not the pixels | None: upscaling with GAN does not add identity info; inswapper reads 128×128 anyway | A | Can introduce synthetic artifacts | n/a (do not do) | Not recommended | High |
| R14 | Destructive reorganization of facesets_full/ in place | Simpler final layout | Operator explicitly told us yesterday to preserve existing output | Marginal tidiness | M | Loses the current "full cluster" reference view, which has diagnostic value | n/a (do not do without explicit go-ahead) | Not recommended by default | High |
6. Recommended target state
Define a new output view, facesets_swap_ready/, produced by a new subcommand (e.g. sort_faces.py export-swap). Original facesets_full/ stays intact. Per faceset:
    facesets_swap_ready/
      faceset_001/
        manifest.json        # provenance + per-image score + rank
        previews/            # 4-image contact sheet thumbnail
          top_20_grid.jpg
        faces/               # cropped-to-single-face PNGs named "000.png", "001.png", ...
          000.png            # highest-ranked face, single face per PNG, 512x512 padded/aligned
          001.png
          ...
        faceset.fsz          # zip of faces/*.png, drop-in for roop-unleashed
      faceset_002/
        ...
Key properties:
- One face per PNG: each PNG is a crop of a single face (R1/R2), padded to a consistent 512×512 with the identity's bbox centred. Roop-unleashed's loader will re-detect exactly one face per file.
- Ranked by composite quality: faces/000.png is the best representative; later indices are weaker. Operator can trivially truncate by dropping later files.
- Configurable top-N: default --top-n 30 per faceset with a --include-all flag for the current behaviour. 30 is conservative; FaceSwapLab's "face blending" tool (the most analogous public practitioner reference) shows that blending with diverse but consistent images materially helps; 20–40 is a common practitioner range.
- Near-duplicates dropped (R4): one representative per visual-dupe group.
- Tighter outlier gate (R3): outlier_threshold reduced from 0.55 to ~0.45 for this export, keeping the refine defaults on facesets_full/.
- Ready-to-ship .fsz (R6) in each folder.
- manifest.json per faceset: cites every source path and score. Lets the operator see why a face was kept (or dropped if we add a _rejected/ sibling).
This lets the operator test swap quality end-to-end without any roop-unleashed modification, and preserves full fallback to the raw / full results if anything needs re-examination.
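The single-face crop at the heart of this layout comes down to computing a padded rectangle around the cached bbox. A minimal sketch of that geometry; the helper name is hypothetical, and reading "0.5 padding" as half the longer bbox edge added on every side is an assumption:

```python
def padded_square_crop(bbox, image_size, padding=0.5):
    """Return (left, top, right, bottom) of a square crop centred on the bbox.

    bbox:       (x1, y1, x2, y2) in pixels
    image_size: (width, height) in pixels
    padding:    fraction of the longer bbox edge added on each side (assumed meaning)
    The result is clamped to the image, so it can be non-square near borders.
    """
    x1, y1, x2, y2 = bbox
    width, height = image_size
    side = max(x2 - x1, y2 - y1) * (1.0 + 2.0 * padding)
    centre_x = (x1 + x2) / 2.0
    centre_y = (y1 + y2) / 2.0
    left = max(0, int(round(centre_x - side / 2.0)))
    top = max(0, int(round(centre_y - side / 2.0)))
    right = min(width, int(round(centre_x + side / 2.0)))
    bottom = min(height, int(round(centre_y + side / 2.0)))
    return left, top, right, bottom

interior = padded_square_crop((400, 300, 600, 500), (1920, 1080))
corner = padded_square_crop((0, 0, 100, 100), (1000, 1000))
print(interior, corner)
```

The returned rectangle would then be sliced out of the image array and resized or padded to 512×512. A generous padding increases the chance that a neighbouring face lands inside the crop, so a re-detection check on the exported PNG is a sensible safeguard.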
7. Recommended next steps
7.1 Quick wins (high value, low effort)
- R1: single-face crop export as part of export-swap. Uses bbox already in the cache; zero new models. Delivers the biggest likely swap-quality improvement.
- R4: drop visual near-duplicates inside the export. Uses duplicates.json already produced by cmd_dedup. Smaller sets, cleaner averages.
- R5: composite quality score + rank + top-N. Uses existing fields (det_score, blur, face_short). Deliver .fsz + faces/ sorted by descending score.
- R6: .fsz bundle emission by simply zipping faces/*.png with integer names. Trivial given R1, R4, and R5.
These four together give a clean, drop-in-usable export in one session of work.
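The scoring and ranking step (R5) can be sketched as a weighted sum of normalised signals. The weights follow the 0.30/0.20/0.20/0.15/0.15 default, assumed to apply in the order det_score, face short-edge, blur, frontality, symmetry; the saturation caps and the neutral 0.5 fallback for missing landmark signals are hypothetical choices for illustration:

```python
WEIGHTS = {
    "det_score": 0.30,
    "face_short": 0.20,
    "blur": 0.20,
    "frontality": 0.15,
    "symmetry": 0.15,
}
FACE_SHORT_CAP = 256.0  # hypothetical: short edge (px) at which the signal saturates
BLUR_CAP = 200.0        # hypothetical: Laplacian variance at which the signal saturates

def clamp01(value):
    return max(0.0, min(1.0, value))

def composite_score(face):
    """Weighted sum of per-face signals, each normalised to [0, 1]."""
    signals = {
        "det_score": clamp01(face["det_score"]),
        "face_short": clamp01(face["face_short"] / FACE_SHORT_CAP),
        "blur": clamp01(face["blur"] / BLUR_CAP),
        # Landmark-derived signals may be absent before enrichment; use a neutral value.
        "frontality": clamp01(face.get("frontality", 0.5)),
        "symmetry": clamp01(face.get("symmetry", 0.5)),
    }
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

def rank_top_n(faces, top_n=30):
    """Best-first ordering, truncated to top_n."""
    return sorted(faces, key=composite_score, reverse=True)[:top_n]

soft = {"name": "soft", "det_score": 0.7, "face_short": 100, "blur": 45.0}
sharp = {"name": "sharp", "det_score": 0.9, "face_short": 220, "blur": 150.0}
ranked = rank_top_n([soft, sharp], top_n=1)
print(ranked[0]["name"], round(composite_score(sharp), 3))
```

Keeping the weights in a single dict makes them easy to expose as tunable parameters, and the ranked order maps directly onto the integer file names in the bundle.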
7.2 Medium-effort improvements
- R3: re-run refine with stricter outlier_threshold (e.g. 0.45) for the export path; keep facesets_full/ at 0.55 for reference. Requires a re-cluster over existing embeddings, which is fast (seconds) and needs no re-embed.
- R7: pose/frontality filter using landmarks. Requires either (a) a re-embed pass that persists landmark_2d_106, or (b) an enrichment pass that re-loads each image and computes yaw without redoing the full embed. Modest CPU cost; meaningful for small facesets.
- R8: singleton rescue against final centroids. Low code cost; likely yields a handful of additional good images per identity.
7.3 Items requiring operator decision
- Target top-N per faceset for the export (proposal: 30, override per run). Affects the average-embedding quality trade-off vs. UI load time.
- Whether to name facesets (R10) by operator — purely workflow.
- Whether _singletons/ should be retired or promoted to "uncertain identity" export with a lower-confidence tag.
7.4 Not recommended
- R11: patching roop-unleashed itself. The upstream repo is disabled; touching it introduces fork-maintenance overhead for no gain that we can't already achieve upstream in facesets.
- R12 / R13: pre-aligning or up-scaling source crops. Roop re-detects/aligns on load and inswapper caps at 128×128 internally; effort is wasted.
- R14: destructive reorganization of facesets_full/. The operator already told us (yesterday) to preserve existing results; no new evidence supports re-opening that.
8. Open questions
- OQ1. Is the operator willing to have the export step drop faces rather than just rank them? R5-top-N drops everything past rank N; if the operator prefers to keep the full set but marked, we should export ranked without truncation and let the user pick in the UI.
- OQ2. How many .fsz files does the operator actually plan to use? If only 3–4 identities will be used in practice, R5 can stay conservative (N=50) without cost. If all 12 are routinely used, leaner is better (N=20).
- OQ3. Should singletons (R8) be rescued into existing facesets or exported as their own "candidate_NNN/" bucket for manual triage? The safer default is a separate bucket; the operator may prefer direct merge.
- OQ4. Is frontality-filtering (R7) worth a re-embed, or should we settle for a cheap "bbox aspect ratio" proxy? A proper yaw estimate needs landmarks; a crude proxy (bbox width/height ratio) is free but weaker.
- OQ5. Is there appetite for adding a modern FIQA model (R9) as a drop-in dependency? It adds ~50 MB download and a small CPU cost per face; benefit over the current heuristic is real but modest.
- OQ6. For the export, should the operator name (R10) be required before an .fsz is emitted (forces thought about which identity is which), or optional (pure convenience)?
End of evaluation. No code has been changed as part of this analysis.