CreativityBench

Evaluating Agent Creative Reasoning via
Affordance-Based Tool Repurposing

1University of Illinois Urbana-Champaign, 2Amazon, 3Columbia University
*Equal contribution

Overview


We introduce CreativityBench, a comprehensive benchmark for evaluating creative intelligence in large language models (LLMs) through the lens of affordance-based creative tool use. While recent progress in LLMs has demonstrated strong analytical reasoning and practical task execution, their ability to generate novel yet physically grounded solutions under constraints remains largely underexplored.

CreativityBench addresses this gap by focusing on a core aspect of human creativity: the ability to repurpose objects based on their underlying affordances, reasoning about how an object’s parts, attributes, and physical properties enable unconventional but plausible uses. Rather than relying on canonical tool usage, the benchmark evaluates whether models can identify non-obvious, constraint-satisfying solutions grounded in physical reality.

Figure 1. Creative Tool Use As Affordance-Based Tool Repurposing. Under constraints, a model solves a task by identifying and using the affordances of an alternative object.

CreativityBench is more than a benchmark: it is a structured and scalable evaluation framework for creative reasoning, designed to uncover fundamental limitations in current models and to guide future progress toward more flexible, adaptive intelligence.



Affordance Knowledge Base

Figure 2. Annotation Pipeline For Affordance Knowledge Base Construction.

We construct a large-scale, structured affordance knowledge base (KB) containing 4K entities and 150K+ affordance annotations, explicitly linking objects to their parts, attributes, and actionable uses. This KB enables grounded reasoning by anchoring model decisions in fine-grained physical properties, rather than surface-level semantic associations.
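A structured KB of this kind can be pictured as entity-indexed affordance records. The sketch below is only illustrative: the field names (`entity`, `part`, `attributes`, `action`) and the in-memory layout are assumptions for exposition, not the paper's actual data format.

```python
from dataclasses import dataclass

# Hypothetical schema for one affordance annotation; the field names
# are illustrative, not the benchmark's actual format.
@dataclass
class Affordance:
    entity: str        # the object, e.g. "wire hanger"
    part: str          # the part that carries the affordance
    attributes: list   # physical properties enabling the use
    action: str        # the actionable (possibly unconventional) use

# A tiny in-memory KB: entity -> list of affordance annotations.
kb: dict[str, list[Affordance]] = {}

def add_affordance(a: Affordance) -> None:
    kb.setdefault(a.entity, []).append(a)

def uses_for(entity: str) -> list[str]:
    """All annotated actionable uses for an entity."""
    return [a.action for a in kb.get(entity, [])]

add_affordance(Affordance("wire hanger", "hook",
                          ["rigid", "bendable"], "retrieve item from drain"))
add_affordance(Affordance("wire hanger", "bar",
                          ["straight", "thin"], "stir paint"))
```

Anchoring each use to a specific part and its physical attributes is what lets downstream reasoning go beyond surface-level semantic association.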



Physically Grounded Creative Tasks

Figure 3. CreativityBench Task Sampling Pipeline.

Using our affordance KB, we generate 14K diverse tasks that require models to identify non-obvious yet physically plausible solutions under constraints. Each task is constructed to go beyond object-level plausibility, requiring models to reason about how affordances emerge from structure and how they can be recombined to achieve a goal.
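One way to picture this sampling step: draw an affordance record from the KB, withhold the canonical tool, and pose the goal as a constrained repurposing question. The record format and prompt template below are assumptions for illustration, not the paper's pipeline.

```python
import random

# Illustrative affordance records: (entity, part, attribute, action).
KB = [
    ("credit card", "edge", "thin and rigid", "scrape ice"),
    ("book", "body", "flat and heavy", "press flowers"),
    ("shoelace", "body", "strong and flexible", "tie objects together"),
]

def sample_task(rng: random.Random):
    """Sample one constrained tool-repurposing task from the KB.

    The canonical tool for the goal is withheld, so a solver must
    find an alternative entity whose affordances satisfy the goal.
    """
    entity, part, attribute, action = rng.choice(KB)
    prompt = (f"You need to {action}, but the usual tool is unavailable. "
              f"Which everyday object could you use instead, and why?")
    gold = {"entity": entity, "part": part, "attribute": attribute}
    return prompt, gold

prompt, gold = sample_task(random.Random(0))
```

Because the gold answer records the part and attribute, not just the entity, a solution can be scored on whether its reasoning is physically grounded rather than merely object-level plausible.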


Leaderboard

From the metric perspective, current models are better at choosing plausible tools at the entity level but struggle to produce fully correct, physically grounded tool use. Model reasoning is often plausible but fails to account for all conditions required for successful execution. From the model perspective, there is a clear split between creative affordance discovery and grounded reasoning quality. Improvements from model scaling alone quickly saturate, indicating that stronger performance requires advances in grounded affordance reasoning rather than size alone.






Attribution Analysis

In this section, we conduct an attribution-based error analysis to identify the primary failure modes in creative tool use. For each model, we randomly sample 10% of its failure cases, resulting in a representative set of failed episodes. We then use Gemini-3.1-Flash-Lite as a categorization model to analyze why each prediction is considered inferior to the gold solution. We identify four major categories of errors: physical invalidity, practical infeasibility, risk or constraint mismatch, and comparative inferiority. Each category corresponds to a distinct stage in the affordance-based reasoning process. This categorization enables us to disentangle failures due to incorrect physical reasoning, impractical procedures, constraint violations, and preference-based comparisons, providing a more fine-grained understanding of where current models break down in affordance-based creative reasoning.
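The analysis loop described above (sample 10% of failures, have a judge model assign one of four categories, tally the counts) can be sketched as follows. The judge call is stubbed with a precomputed label; the episode format and function names are assumptions, not the paper's implementation.

```python
import random

# The four error categories from the attribution analysis.
CATEGORIES = ("physical invalidity", "practical infeasibility",
              "risk or constraint mismatch", "comparative inferiority")

def categorize(episode: dict) -> str:
    # Stand-in for the LLM judge (Gemini-3.1-Flash-Lite in the paper);
    # here we simply read a precomputed label for illustration.
    return episode["label"]

def attribution_analysis(failures: list, frac: float = 0.10, seed: int = 0):
    """Sample a fraction of failure cases and tally error categories."""
    rng = random.Random(seed)
    k = max(1, int(len(failures) * frac))
    sample = rng.sample(failures, k)
    counts = {c: 0 for c in CATEGORIES}
    for ep in sample:
        counts[categorize(ep)] += 1
    return counts

failures = [{"label": CATEGORIES[i % 4]} for i in range(40)]
counts = attribution_analysis(failures)
```

Tallying per category, rather than reporting a single accuracy drop, is what lets failures at different stages of affordance-based reasoning be disentangled.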

(Figure panels: Primary Failure Category; Frequency of Categories Appearing as Contributing Factors.)
Figure 8. Attribution Analysis. Most failures stem from weak physical grounding: models often over-imagine what objects can do or choose tools whose real properties do not support the intended use. Practicality and safety also contribute, while far fewer errors are merely suboptimal alternatives rather than genuinely mismatched solutions.

BibTeX

        @article{qian2026creativitybench,
          title={CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing},
          author={Qian, Cheng and Ha, Hyeonjeong and Liu, Jiayu and Kim, Jeonghwan and Liu, Jiateng and Li, Bingxuan and Tiwari, Aditi and Dalal, Dwip and Wang, Zhenhailong and Chen, Xiusi and Namazifar, Mahdi and Li, Yunzhu and Ji, Heng},
          journal={arXiv preprint arXiv:2605.02910},
          year={2026}
        }