CRAFT: Continuous Reasoning and Agentic Feedback Tuning
Technical Report
TL;DR
CRAFT adds "thinking" to text-to-image generation and image editing without retraining.
It decomposes prompts into explicit visual checks, verifies outputs with a VLM, and edits only the failing parts. This improves compositional accuracy and text rendering with modest overhead.
How CRAFT works
CRAFT adds a reasoning loop on top of existing models (a minimal sketch follows the list):
- Decompose the prompt into structured visual questions.
- Generate an image.
- Verify each constraint with a VLM (Yes/No).
- Refine the prompt only where constraints fail.
- Stop early once all constraints are satisfied; otherwise continue with another round of prompt refinement or iterative image editing.
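A minimal sketch of this loop, assuming hypothetical wrappers `llm`, `t2i`, `vlm`, and `editor` around the language model, generator, judge, and editing model (the names and method signatures below are illustrative, not part of a released CRAFT API):

```python
from dataclasses import dataclass

@dataclass
class Check:
    question: str       # e.g. "Is the cobra made of salad?"
    passed: bool = False

def craft_loop(prompt: str, llm, t2i, vlm, editor=None, max_iters: int = 3):
    """Sketch of a CRAFT-style loop: decompose -> generate -> verify -> refine/edit."""
    checks = [Check(q) for q in llm.decompose(prompt)]    # structured visual questions
    working_prompt = prompt
    image = t2i.generate(working_prompt)                  # initial generation

    for _ in range(max_iters):
        for c in checks:
            c.passed = vlm.ask_yes_no(image, c.question)  # Yes/No verification per constraint
        failing = [c.question for c in checks if not c.passed]
        if not failing:                                   # early stop: every constraint satisfied
            break
        # Refine only the failing parts: rewrite the prompt around the failed questions,
        # then either regenerate or apply a targeted edit to the current image.
        working_prompt = llm.refine(working_prompt, failing)
        image = editor.edit(image, working_prompt) if editor else t2i.generate(working_prompt)
    return image
```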
High-level overview
Detailed architecture
Models tested
We evaluate across multiple backbones and settings:
- FLUX-Schnell
- FLUX-Dev
- Qwen-Image
- Z-Image-Turbo
- FLUX-2 Pro
We use a single GPT model (gpt-5-nano-2025-08-07) for both the LLM and VLM components. Image editing uses Gemini 2.5 Flash and Qwen-Image Edit.
CRAFT is model-agnostic: the same pipeline works with any open-source or API-based T2I generator and any image-editing model. You can swap components to build a custom pipeline for your constraints and budget.
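As an illustration of this swappability, a pipeline can be described by a small configuration; the preset names below are hypothetical, and only the model names come from this report:

```python
# Hypothetical presets: any T2I backbone, VLM judge, and editor can be plugged in.
PIPELINES = {
    "fast_open": {
        "t2i": "FLUX-Schnell",               # few-step backbone, lowest latency
        "judge": "gpt-5-nano-2025-08-07",    # LLM decomposition + VLM verification
        "editor": "Qwen-Image Edit",
        "max_iters": 3,
    },
    "high_quality_api": {
        "t2i": "FLUX-2 Pro",
        "judge": "gpt-5-nano-2025-08-07",
        "editor": "Gemini 2.5 Flash",
        "max_iters": 3,
    },
}
```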
Iteration budget
- In practice, we run 3 iterations of generation/editing, which is typically enough for strong results.
- Editing is applied when constraints fail, using the same loop.
Metrics
- VQA (DVQ): fraction of decomposed visual questions the VLM answers Yes, i.e., the share of satisfied constraints.
- DSG: compositional prompt–image consistency.
- Auto SxS: automatic preference via side-by-side VLM judging.
- Auto SxS Adv: automatic preference via side-by-side VLM judging with advantage.
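For concreteness, the VQA score is simply the fraction of Yes verdicts over the decomposed questions. A sketch (the `vqa_score` helper and the example answers are illustrative, not the evaluation code behind the tables):

```python
def vqa_score(answers: dict[str, bool]) -> float:
    """Fraction of visual constraints the VLM judged satisfied (answered Yes)."""
    if not answers:
        return 0.0
    return sum(answers.values()) / len(answers)

# Example: 3 of 4 decomposed questions answered Yes -> VQA = 0.75
answers = {"q1": True, "q2": True, "q3": True, "q4": False}
print(vqa_score(answers))  # 0.75
```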
Evaluation datasets
- DSG-1K: 1000+ compositional prompts (Davidsonian Scene Graph benchmark).
- Parti-Prompt: 1000+ long-form prompts (PartiPrompts, P2).
Results
Model mapping (M1–M5)
- M1: FLUX-Schnell
- M2: FLUX-Dev
- M3: Qwen-Image
- M4: Z-Image-Turbo
- M5: FLUX-2 Pro
DSG-1K win rates
Parti-Prompt win rates
Qualitative examples
Long prompt example
Prompt (long_prompt1): Studio shot of modern cosmetic bottles on textured marble and velvet background, soft golden lighting, luxury minimalist style, 8K, hyperrealistic. Text elements: large header 'LUMINOUS SKIN' in a thin geometric sans-serif font (like Futura PT Light) in the top left corner, a central mathematical fraction 'V/C = infinity' in an elegant serif font (like Garamond) slightly off-center to the right, the caption 'where Value meets Clarity' in a thin italic serif font beneath the fraction, and a tagline 'The future of radiance is here.' in a light sans-serif font (like Montserrat Light) in the bottom right corner. Color palette: cream, charcoal, dark khaki, gold accents.
Additional examples
Prompt: A giant cobra snake made from salad.

Prompt: a toaster shaking hands with a microwave.

Prompt: a view of the Earth from the moon.

Prompt: A woman holding a racquet on top of a tennis court.

Cost and latency
- 1 iteration (generation or editing) takes ~30 seconds on average.
- 3 iterations are typically enough to reach strong quality.
- Hunyuan Image 3.0 averages ~60 seconds per image and yields lower compositional quality in our evaluations.
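At the 3-iteration budget, worst-case CRAFT overhead is therefore roughly 3 × ~30 s ≈ 90 s per image; when constraints are satisfied earlier, the loop stops early and the cost drops toward a single ~30 s pass.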
Tables (quality)
DSG-1K (VQA, DSG, Auto SxS)
| Backbone | VQA (Base) | VQA (CRAFT) | DSG (Base) | DSG (CRAFT) | Auto SxS (Base) | Auto SxS (CRAFT) | Auto SxS Adv (Base) | Auto SxS Adv (CRAFT) |
|---|---|---|---|---|---|---|---|---|
| FLUX-Schnell | 0.78 | 0.86 | 0.786 | 0.857 | 0.21 | 0.744 | 0.26 | 0.74 |
| FLUX-Dev | 0.77 | 0.87 | 0.765 | 0.841 | 0.34 | 0.65 | 0.288 | 0.671 |
| Qwen-Image | 0.864 | 0.946 | 0.858 | 0.932 | 0.36 | 0.59 | 0.33 | 0.58 |
| Z-Image-Turbo | 0.875 | 0.915 | 0.864 | 0.900 | 0.27 | 0.67 | 0.328 | 0.63 |
| FLUX-2 Pro | 0.88 | 0.925 | 0.87 | 0.91 | 0.40 | 0.52 | 0.47 | 0.50 |
Parti-Prompt (VQA, DSG, Auto SxS)
| Backbone | VQA (Base) | VQA (CRAFT) | DSG (Base) | DSG (CRAFT) | Auto SxS (Base) | Auto SxS (CRAFT) | Auto SxS Adv (Base) | Auto SxS Adv (CRAFT) |
|---|---|---|---|---|---|---|---|---|
| FLUX-Schnell | 0.79 | 0.88 | 0.743 | 0.857 | 0.19 | 0.756 | 0.252 | 0.748 |
| FLUX-Dev | 0.822 | 0.911 | 0.815 | 0.895 | 0.382 | 0.618 | 0.309 | 0.628 |
| Qwen-Image | 0.89 | 0.94 | 0.896 | 0.939 | 0.445 | 0.555 | 0.41 | 0.53 |
| Z-Image-Turbo | 0.90 | 0.94 | 0.89 | 0.92 | 0.268 | 0.679 | 0.328 | 0.668 |
| FLUX-2 Pro | 0.90 | 0.92 | 0.87 | 0.90 | 0.42 | 0.47 | 0.47 | 0.52 |
Comparison with prompt-optimization methods (Maestro)
We report DSGScore only (no ranks). Our CRAFT results use a GPT-based VLM judge (gpt-5-nano-2025-08-07), while Maestro reports results judged with Gemini 2.0 Flash, so absolute scores are not strictly comparable across judges.
| Method | P2-hard DSGScore ↑ | DSG-1K DSGScore ↑ |
|---|---|---|
| Original | 0.826 | 0.772 |
| Rewrite | 0.855 | 0.815 |
| Promptist | 0.873 | 0.849 |
| LM-BBO | 0.859 | 0.806 |
| OPT2I | 0.90 | 0.838 |
| Maestro | 0.92 | 0.882 |
| CRAFT (Ours) | 0.90 | 0.91 |
Limitations
- Quality depends on the VLM judge; errors can propagate into refinements.
- Very abstract prompts may not map cleanly to explicit constraints.
- Iterative loops add time and API cost, though overhead is small relative to high-end models.