CRAFT: Continuous Reasoning and Agentic Feedback Tuning

Community Article Published February 5, 2026

Technical Report

TL;DR

CRAFT adds thinking to text-to-image generation and image editing without retraining.

It decomposes prompts into explicit visual checks, verifies outputs with a VLM, and edits only the failing parts. This yields better compositional accuracy and text rendering with modest overhead.

[Figure: hf_combined_thinking_pipeline_horizontal]

How CRAFT works

CRAFT adds a reasoning loop on top of existing models:

  1. Decompose the prompt into structured visual questions.
  2. Generate an image.
  3. Verify each constraint with a VLM (Yes/No).
  4. Refine the prompt only where constraints fail.
  5. Stop early once all constraints are satisfied; otherwise, repeat via iterative image editing.
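The loop above can be sketched in a few lines of Python. This is a minimal sketch with toy stand-ins for the generator, VLM judge, and editor; all function names here are illustrative, not the report's actual API:

```python
def craft(prompt, decompose, generate, verify, refine, edit, max_iters=3):
    """CRAFT loop: decompose -> generate -> verify -> refine/edit, with early stopping."""
    checks = decompose(prompt)                # structured Yes/No visual questions
    image = generate(prompt)
    for _ in range(max_iters):
        failed = [c for c in checks if not verify(image, c)]  # VLM Yes/No per check
        if not failed:                        # early stop: all constraints satisfied
            break
        image = edit(image, refine(prompt, failed))  # edit only the failing parts
    return image

# Toy stand-ins: an "image" is modeled as the set of constraints it satisfies.
decompose = lambda p: p.split(" and ")
generate = lambda p: {p.split(" and ")[0]}   # initial image gets one constraint right
verify = lambda img, c: c in img
refine = lambda p, failed: failed
edit = lambda img, fixes: img | set(fixes)   # each edit fixes the listed checks

result = craft("red cube and blue sphere and wooden table",
               decompose, generate, verify, refine, edit)
print(sorted(result))  # ['blue sphere', 'red cube', 'wooden table']
```

With real components, `verify` would be a VLM call returning Yes/No and `edit` an image-editing model, but the control flow is exactly this.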

High-level overview

[Figure: fig_overview]

Detailed architecture

Thinking mode

Models tested

We evaluate across multiple backbones and settings:

  • FLUX-Schnell
  • FLUX-Dev
  • Qwen-Image
  • Z-Image-Turbo
  • FLUX-2 Pro

We use ChatGPT for both the LLM and VLM components (gpt-5-nano-2025-08-07). Image editing uses Gemini 2.5 Flash and Qwen-Image Edit.

CRAFT is model-agnostic: the same pipeline works with any open-source or API-based T2I generator and any image-editing model. You can swap components to build a custom pipeline for your constraints and budget.
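That modularity can be expressed as a plain container of callables. The field names and defaults below are illustrative, not the report's actual interface:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class CraftPipeline:
    """Swap any T2I generator, VLM judge, and image editor into the same loop."""
    generator: Callable[[str], Any]     # e.g. FLUX-Dev, Qwen-Image, or an API call
    judge: Callable[[Any, str], bool]   # e.g. a GPT- or Gemini-based VLM (Yes/No)
    editor: Callable[[Any, str], Any]   # e.g. Qwen-Image Edit or Gemini 2.5 Flash
    max_iters: int = 3                  # the report's default iteration budget

# A cheaper configuration might cut the budget and use lighter components:
cheap = CraftPipeline(generator=lambda p: p,
                      judge=lambda img, c: True,
                      editor=lambda img, p: img,
                      max_iters=2)
```

Swapping a component means replacing one callable; the loop itself never changes.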

Iteration budget

  • In practice, we run 3 iterations for generation/editing, which is typically enough for strong results.
  • Editing is applied when constraints fail, using the same loop.

Metrics

  • VQA (DVQ): fraction of satisfied visual constraints.
  • DSG: compositional prompt–image consistency.
  • Auto SxS: automatic preference via side-by-side VLM judging.
  • Auto SxS Adv: automatic preference via side-by-side VLM judging with advantage.
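As a sketch of how these scores are computed (inferred from the descriptions above; splitting ties evenly in Auto SxS is an assumption, not stated in the report):

```python
def vqa_score(answers):
    """Fraction of visual constraints the VLM judged satisfied (Yes=True)."""
    return sum(answers) / len(answers)

def auto_sxs(preferences):
    """Win rate from side-by-side judgments: 'A', 'B', or 'tie' (ties split evenly)."""
    wins = sum(1.0 if p == "A" else 0.5 if p == "tie" else 0.0 for p in preferences)
    return wins / len(preferences)

print(vqa_score([True, True, False, True]))  # 0.75
print(auto_sxs(["A", "B", "A", "tie"]))      # 0.625
```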

Evaluation datasets

  • DSG-1K: 1000+ compositional prompts.
  • Parti-Prompt: 1000+ long-form prompts.

Results

Model mapping (M1–M5)

  • M1: FLUX-Schnell
  • M2: FLUX-Dev
  • M3: Qwen-Image
  • M4: Z-Image-Turbo
  • M5: FLUX-2 Pro

DSG-1K win rates

[Figure: fig_dsg1k_wins]

Parti-Prompt win rates

[Figure: fig_parti_wins]

Qualitative examples

Long prompt example

Prompt (long_prompt1): Studio shot of modern cosmetic bottles on textured marble and velvet background, soft golden lighting, luxury minimalist style, 8K, hyperrealistic. Text elements: large header 'LUMINOUS SKIN' in a thin geometric sans-serif font (like Futura PT Light) in the top left corner, a central mathematical fraction 'V/C = infinity' in an elegant serif font (like Garamond) slightly off-center to the right, the caption 'where Value meets Clarity' in a thin italic serif font beneath the fraction, and a tagline 'The future of radiance is here.' in a light sans-serif font (like Montserrat Light) in the bottom right corner. Color palette: cream, charcoal, dark khaki, gold accents.

[Figure: fig_long_prompt]

Additional examples

Prompt: A giant cobra snake made from salad. [Figure: example_qwen_1]

Prompt: a toaster shaking hands with a microwave. [Figure: example_qwen_2]

Prompt: a view of the Earth from the moon. [Figure: example_qwen_3]

Prompt: A woman holding a racquet on top of a tennis court. [Figure: example_fluxdev_1]

Cost and latency

  • 1 iteration (generation or editing) takes ~30 seconds on average.
  • 3 iterations are typically enough to reach strong quality.
  • Hunyuan Image 3.0 averages ~60 seconds per image and yields lower compositional quality in our evaluations.

Tables (quality)

DSG-1K (VQA, DSG, Auto SxS)

| Backbone | VQA (Base) | VQA (CRAFT) | DSG (Base) | DSG (CRAFT) | Auto SxS (Base) | Auto SxS (CRAFT) | Auto SxS Adv (Base) | Auto SxS Adv (CRAFT) |
|---|---|---|---|---|---|---|---|---|
| FLUX-Schnell | 0.78 | 0.86 | 0.786 | 0.857 | 0.21 | 0.744 | 0.26 | 0.74 |
| FLUX-Dev | 0.77 | 0.87 | 0.765 | 0.841 | 0.34 | 0.65 | 0.288 | 0.671 |
| Qwen-Image | 0.864 | 0.946 | 0.858 | 0.932 | 0.36 | 0.59 | 0.33 | 0.58 |
| Z-Image-Turbo | 0.875 | 0.915 | 0.864 | 0.900 | 0.27 | 0.67 | 0.328 | 0.63 |
| FLUX-2 Pro | 0.88 | 0.925 | 0.87 | 0.91 | 0.40 | 0.52 | 0.47 | 0.50 |
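For reference, the per-backbone VQA gains on DSG-1K work out as follows (numbers copied from the table above):

```python
vqa = {  # (Base, CRAFT) VQA scores from the DSG-1K table
    "FLUX-Schnell": (0.78, 0.86),
    "FLUX-Dev": (0.77, 0.87),
    "Qwen-Image": (0.864, 0.946),
    "Z-Image-Turbo": (0.875, 0.915),
    "FLUX-2 Pro": (0.88, 0.925),
}
deltas = {m: round(craft - base, 3) for m, (base, craft) in vqa.items()}
print(deltas)  # gains range from 0.04 (Z-Image-Turbo) to 0.10 (FLUX-Dev)
```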

Parti-Prompt (VQA, DSG, Auto SxS)

| Backbone | VQA (Base) | VQA (CRAFT) | DSG (Base) | DSG (CRAFT) | Auto SxS (Base) | Auto SxS (CRAFT) | Auto SxS Adv (Base) | Auto SxS Adv (CRAFT) |
|---|---|---|---|---|---|---|---|---|
| FLUX-Schnell | 0.79 | 0.88 | 0.743 | 0.857 | 0.19 | 0.756 | 0.252 | 0.748 |
| FLUX-Dev | 0.822 | 0.911 | 0.815 | 0.895 | 0.382 | 0.618 | 0.309 | 0.628 |
| Qwen-Image | 0.89 | 0.94 | 0.896 | 0.939 | 0.445 | 0.555 | 0.41 | 0.53 |
| Z-Image-Turbo | 0.90 | 0.94 | 0.89 | 0.92 | 0.268 | 0.679 | 0.328 | 0.668 |
| FLUX-2 Pro | 0.90 | 0.92 | 0.87 | 0.90 | 0.42 | 0.47 | 0.47 | 0.52 |

Comparison with prompt-optimization methods (Maestro)

We report DSGScore only (no ranks). Our CRAFT results use a GPT-based VLM judge (gpt-5-nano-2025-08-07), while Maestro reports with Gemini 2.0 Flash.

| Method | P2-hard DSGScore ↑ | DSG-1K DSGScore ↑ |
|---|---|---|
| Original | 0.826 | 0.772 |
| Rewrite | 0.855 | 0.815 |
| Promptist | 0.873 | 0.849 |
| LM-BBO | 0.859 | 0.806 |
| OPT2I | 0.90 | 0.838 |
| Maestro | 0.92 | 0.882 |
| CRAFT (Ours) | 0.90 | 0.91 |

Limitations

  • Quality depends on the VLM judge; errors can propagate into refinements.
  • Very abstract prompts may not map cleanly to explicit constraints.
  • Iterative loops add time and API cost, though overhead is small relative to high-end models.

