CRAFT: Continuous Reasoning and Agentic Feedback Tuning
Technical Report
TL;DR
CRAFT adds "thinking" to text-to-image generation and image editing without retraining.
It decomposes prompts into explicit visual checks, verifies outputs with a VLM, and edits only the failing parts. This improves compositional accuracy and text rendering with modest overhead.
How CRAFT works
CRAFT adds a reasoning loop on top of existing models (a minimal sketch follows the list):
- Decompose the prompt into structured visual questions.
- Generate an image.
- Verify each constraint with a VLM (Yes/No).
- Refine the prompt only where constraints fail.
- Stop early once all constraints are satisfied; otherwise continue with another round of prompt refinement or iterative image editing.
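A minimal sketch of this loop, assuming hypothetical wrappers `llm`, `t2i`, `vlm`, and `editor` around the language model, generator, judge, and editing model (the names and method signatures below are illustrative, not part of a released CRAFT API):

```python
from dataclasses import dataclass

@dataclass
class Check:
    question: str       # e.g. "Is the cobra made of salad?"
    passed: bool = False

def craft_loop(prompt: str, llm, t2i, vlm, editor=None, max_iters: int = 3):
    """Sketch of a CRAFT-style loop: decompose -> generate -> verify -> refine/edit."""
    checks = [Check(q) for q in llm.decompose(prompt)]    # structured visual questions
    working_prompt = prompt
    image = t2i.generate(working_prompt)                  # initial generation

    for _ in range(max_iters):
        for c in checks:
            c.passed = vlm.ask_yes_no(image, c.question)  # Yes/No verification per constraint
        failing = [c.question for c in checks if not c.passed]
        if not failing:                                   # early stop: every constraint satisfied
            break
        # Refine only the failing parts: rewrite the prompt around the failed questions,
        # then either regenerate or apply a targeted edit to the current image.
        working_prompt = llm.refine(working_prompt, failing)
        image = editor.edit(image, working_prompt) if editor else t2i.generate(working_prompt)
    return image
```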
High-level overview
Detailed architecture
Models tested
We evaluate across multiple backbones and settings:
- FLUX-Schnell
- FLUX-Dev
- Qwen-Image
- Z-Image-Turbo
- FLUX-2 Pro
We use a single GPT model (gpt-5-nano-2025-08-07) for both the LLM and VLM components. Image editing uses Gemini 2.5 Flash and Qwen-Image Edit.
CRAFT is model-agnostic: the same pipeline works with any open-source or API-based T2I generator and any image-editing model. You can swap components to build a custom pipeline for your constraints and budget.
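As an illustration of this swappability, a pipeline can be described by a small configuration; the preset names below are hypothetical, and only the model names come from this report:

```python
# Hypothetical presets: any T2I backbone, VLM judge, and editor can be plugged in.
PIPELINES = {
    "fast_open": {
        "t2i": "FLUX-Schnell",               # few-step backbone, lowest latency
        "judge": "gpt-5-nano-2025-08-07",    # LLM decomposition + VLM verification
        "editor": "Qwen-Image Edit",
        "max_iters": 3,
    },
    "high_quality_api": {
        "t2i": "FLUX-2 Pro",
        "judge": "gpt-5-nano-2025-08-07",
        "editor": "Gemini 2.5 Flash",
        "max_iters": 3,
    },
}
```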
Iteration budget
- In practice, we run 3 iterations of generation/editing, which is typically enough for strong results.
- Editing is applied when constraints fail, using the same loop.
Metrics
- VQA (DVQ): fraction of decomposed visual questions the VLM answers Yes, i.e., the share of satisfied constraints.
- DSG: compositional prompt–image consistency.
- Auto SxS: automatic preference via side-by-side VLM judging.
- Auto SxS Adv: automatic preference via side-by-side VLM judging with advantage.
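For concreteness, the VQA score is simply the fraction of Yes verdicts over the decomposed questions. A sketch (the `vqa_score` helper and the example answers are illustrative, not the evaluation code behind the tables):

```python
def vqa_score(answers: dict[str, bool]) -> float:
    """Fraction of visual constraints the VLM judged satisfied (answered Yes)."""
    if not answers:
        return 0.0
    return sum(answers.values()) / len(answers)

# Example: 3 of 4 decomposed questions answered Yes -> VQA = 0.75
answers = {"q1": True, "q2": True, "q3": True, "q4": False}
print(vqa_score(answers))  # 0.75
```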
Evaluation datasets
- DSG-1K: 1000+ compositional prompts (Davidsonian Scene Graph benchmark).
- Parti-Prompt: 1000+ long-form prompts (PartiPrompts, P2).
Results
Model mapping (M1–M5)
- M1: FLUX-Schnell
- M2: FLUX-Dev
- M3: Qwen-Image
- M4: Z-Image-Turbo
- M5: FLUX-2 Pro
DSG-1K win rates
Parti-Prompt win rates
Qualitative examples
Long prompt example
Prompt (long_prompt1): Studio shot of modern cosmetic bottles on textured marble and velvet background, soft golden lighting, luxury minimalist style, 8K, hyperrealistic. Text elements: large header 'LUMINOUS SKIN' in a thin geometric sans-serif font (like Futura PT Light) in the top left corner, a central mathematical fraction 'V/C = infinity' in an elegant serif font (like Garamond) slightly off-center to the right, the caption 'where Value meets Clarity' in a thin italic serif font beneath the fraction, and a tagline 'The future of radiance is here.' in a light sans-serif font (like Montserrat Light) in the bottom right corner. Color palette: cream, charcoal, dark khaki, gold accents.
Additional examples
Prompt: A giant cobra snake made from salad.

Prompt: a toaster shaking hands with a microwave.

Prompt: a view of the Earth from the moon.

Prompt: A woman holding a racquet on top of a tennis court.

Cost and latency
- 1 iteration (generation or editing) takes ~30 seconds on average.
- 3 iterations are typically enough to reach strong quality.
- Hunyuan Image 3.0 averages ~60 seconds per image and yields lower compositional quality in our evaluations.
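At the 3-iteration budget, worst-case CRAFT overhead is therefore roughly 3 × ~30 s ≈ 90 s per image; when constraints are satisfied earlier, the loop stops early and the cost drops toward a single ~30 s pass.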
Tables (quality)
DSG-1K (VQA, DSG, Auto SxS)
| Backbone | VQA (Base) | VQA (CRAFT) | DSG (Base) | DSG (CRAFT) | Auto SxS (Base) | Auto SxS (CRAFT) | Auto SxS Adv (Base) | Auto SxS Adv (CRAFT) |
|---|---|---|---|---|---|---|---|---|
| FLUX-Schnell | 0.78 | 0.86 | 0.786 | 0.857 | 0.21 | 0.744 | 0.26 | 0.74 |
| FLUX-Dev | 0.77 | 0.87 | 0.765 | 0.841 | 0.34 | 0.65 | 0.288 | 0.671 |
| Qwen-Image | 0.864 | 0.946 | 0.858 | 0.932 | 0.36 | 0.59 | 0.33 | 0.58 |
| Z-Image-Turbo | 0.875 | 0.915 | 0.864 | 0.900 | 0.27 | 0.67 | 0.328 | 0.63 |
| FLUX-2 Pro | 0.88 | 0.925 | 0.87 | 0.91 | 0.40 | 0.52 | 0.47 | 0.50 |
Parti-Prompt (VQA, DSG, Auto SxS)
| Backbone | VQA (Base) | VQA (CRAFT) | DSG (Base) | DSG (CRAFT) | Auto SxS (Base) | Auto SxS (CRAFT) | Auto SxS Adv (Base) | Auto SxS Adv (CRAFT) |
|---|---|---|---|---|---|---|---|---|
| FLUX-Schnell | 0.79 | 0.88 | 0.743 | 0.857 | 0.19 | 0.756 | 0.252 | 0.748 |
| FLUX-Dev | 0.822 | 0.911 | 0.815 | 0.895 | 0.382 | 0.618 | 0.309 | 0.628 |
| Qwen-Image | 0.89 | 0.94 | 0.896 | 0.939 | 0.445 | 0.555 | 0.41 | 0.53 |
| Z-Image-Turbo | 0.90 | 0.94 | 0.89 | 0.92 | 0.268 | 0.679 | 0.328 | 0.668 |
| FLUX-2 Pro | 0.90 | 0.92 | 0.87 | 0.90 | 0.42 | 0.47 | 0.47 | 0.52 |
Comparison with prompt-optimization methods (Maestro)
We report DSGScore only (no ranks). Our CRAFT results use a GPT-based VLM judge (gpt-5-nano-2025-08-07), while Maestro reports results judged with Gemini 2.0 Flash, so absolute scores are not strictly comparable across judges.
| Method | P2-hard DSGScore ↑ | DSG-1K DSGScore ↑ |
|---|---|---|
| Original | 0.826 | 0.772 |
| Rewrite | 0.855 | 0.815 |
| Promptist | 0.873 | 0.849 |
| LM-BBO | 0.859 | 0.806 |
| OPT2I | 0.90 | 0.838 |
| Maestro | 0.92 | 0.882 |
| CRAFT (Ours) | 0.90 | 0.91 |
Limitations
- Quality depends on the VLM judge; errors can propagate into refinements.
- Very abstract prompts may not map cleanly to explicit constraints.
- Iterative loops add time and API cost, though overhead is small relative to high-end models.