Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Abstract

Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.

🧩 Guess the Answer!

Try out our visual puzzle annotator to explore the dataset and test your own reasoning skills! Click the button below, submit your answer, and move on to the next puzzle. You can download a JSON file of your answers.


Dataset & Evaluation Metrics

Our Dataset

Comparison Table Placeholder

We construct a human-crafted probe dataset of 432 rebus puzzles, challenging VLMs to reason effectively about visual-linguistic information. Each puzzle is annotated with its ground-truth solution and categorized into eleven cognitive skill types. This fine-grained labeling allows us to probe VLMs’ strengths and weaknesses across varied symbolic and visual reasoning tasks.
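To make the annotation format concrete, a single entry might look like the Python sketch below; the field names (id, image, answers, skills) are our own illustration, not the released dataset schema.

# Hypothetical structure of one annotated puzzle; field names are
# illustrative and may differ from the released dataset.
example_puzzle = {
    "id": "rebus_0001",
    "image": "images/rebus_0001.png",      # rendered rebus image
    "answers": ["head over heels"],        # one or more accepted solutions
    "skills": ["SPR", "SS"],               # subset of the eleven skill categories
}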

Evaluation Metrics

  • Naive Matching: Compares the predicted answer to the ground truth by exact string match after lowercasing and trimming. Supports multiple acceptable answers and is straightforward but rigid.
  • LLM-Based Semantic Evaluation: Uses GPT-4o to judge correctness by understanding the answer's meaning, allowing for minor spelling differences and paraphrases and overcoming the limits of naive matching (a minimal sketch of both metrics follows this list).
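The sketch below illustrates both metrics in Python; the judge prompt wording and the helper names (naive_match, llm_judge) are our own illustration rather than the exact evaluation code, and the GPT-4o call uses the standard OpenAI chat-completions client.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def _norm(s: str) -> str:
    return s.strip().lower()


def naive_match(prediction: str, gold_answers: list[str]) -> bool:
    """Exact string match after lowercasing and trimming; any gold answer counts."""
    return _norm(prediction) in {_norm(a) for a in gold_answers}


JUDGE_PROMPT = (
    "You are grading an answer to a rebus puzzle.\n"
    "Accepted answer(s): {gold}\nModel answer: {pred}\n"
    "Reply 'yes' if the model answer means the same thing (minor spelling or "
    "phrasing differences are acceptable), otherwise reply 'no'."
)


def llm_judge(prediction: str, gold_answers: list[str]) -> bool:
    """Semantic correctness judged by GPT-4o (illustrative prompt, not the paper's)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            gold="; ".join(gold_answers), pred=prediction)}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")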

Results: Analysis across Diverse Methods

Overall Results

Overall Results Figure Placeholder

Models show a wide range of performance

While closed-source reasoning models such as o3, o4-mini, and gemini-2.5-pro perform relatively well, especially compared to non-expert, non-native-English-speaking human solvers, open-source reasoning models as well as non-reasoning models struggle to solve these tasks. Compared to expert human solvers, however, a significant accuracy gap remains.

Per-skill model performance

Per-Skill Performance Table Placeholder

Skill-specific results are given in the table above (descriptions of the individual skills can be found in the paper). In general, models are competent at solving SS/SPR and QMR puzzles, while they struggle with AN/VMCR.

  • Gap between SS and PW: This gap suggests that VLMs are better at learning and applying the direct, rule-based symbolic transformations that SS requires.
  • Gap between SPR and LWM: This gap implies a disconnect between understanding spatial configurations and handling symbolic logic and complex linguistic manipulation.
  • Text recognition: The results imply that models can read text even when its orientation is unusual, but they do not always make effective use of cues from font style or size when deriving the solution.

In-Context Learning (ICL)

ICL Results Figure Placeholder

Do models truly understand the concept of a “rebus” puzzle from the prompt alone?

To reduce prompt-specific bias, we frame the task as in-context learning using a single example with an image, answer, and reasoning. The results show that performance remains largely unchanged, indicating limitations come from the VLM itself rather than the prompt. While stronger models see little benefit, smaller models like Qwen2.5-VL gain slightly from a worked reasoning example.
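Concretely, the one-shot setup can be framed as a short conversation containing a worked example followed by the target puzzle. The sketch below shows this framing with OpenAI-style multimodal messages; the image paths and the reasoning string are hypothetical, not the exact prompts used in the paper.

import base64


def image_part(path: str) -> dict:
    """Package a local image as an OpenAI-style image content part."""
    with open(path, "rb") as f:
        url = "data:image/png;base64," + base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": url}}


# One worked example (puzzle image + answer + reasoning), then the target puzzle.
icl_messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Solve this rebus puzzle."},
        image_part("examples/head_over_heels.png"),   # hypothetical example image
    ]},
    {"role": "assistant",
     "content": "The word HEAD is placed above the word HEELS, so the answer "
                "is 'head over heels'."},
    {"role": "user", "content": [
        {"type": "text", "text": "Now solve this rebus puzzle."},
        image_part("puzzles/target.png"),              # hypothetical target image
    ]},
]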

Skill Guided Prompting

Skill Guided Prompting Results Figure Placeholder

Do models lack the capability to understand and apply the cognitive thinking skills required for each puzzle?

Incorporating these skill-related hints into the prompt generally leads to only minor improvements in model performance, suggesting an awareness-versus-execution gap: telling a model which skills a puzzle requires does not mean it can execute them.
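As a rough illustration of skill-guided prompting, the sketch below prepends short hints derived from a puzzle's annotated skill labels to the base instruction; the hint wording and the SKILL_HINTS mapping are hypothetical, not the prompts used in the paper.

# Hypothetical one-line hints keyed by skill code; the wording of the
# prompts used in the paper may differ.
SKILL_HINTS = {
    "SPR": "pay attention to the spatial arrangement of the elements",
    "AN":  "consider what is missing or crossed out, not only what is shown",
}


def skill_guided_prompt(skills: list[str]) -> str:
    """Prepend the puzzle's annotated skill hints to the base instruction."""
    hints = "; ".join(SKILL_HINTS[s] for s in skills if s in SKILL_HINTS)
    return f"Solve this rebus puzzle. Hints: {hints}."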

Caption-Only Experiments

Caption-Only Results Figure Placeholder

Vision? Or Language?

Caption-only experiments reveal that strong reasoning models perform notably worse without direct visual input, highlighting the importance of visual perception quality. We hypothesize this drop occurs because reasoning models lack iterative visual examination during decoding, suggesting future work should explore how such processes impact overall VLM performance.
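A caption-only run can be approximated in two stages: a VLM first describes the puzzle image, and a second, text-only call must then solve the puzzle from that caption alone. The sketch below uses the OpenAI API as a stand-in for the models compared in the paper; the prompts and the choice of o3-mini for the text-only stage are illustrative.

import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set


def caption_only_answer(image_path: str) -> str:
    with open(image_path, "rb") as f:
        url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

    # Stage 1: a VLM describes the puzzle image without trying to solve it.
    caption = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": "Describe every visual element of this image in detail. "
                     "Do not try to solve anything."},
            {"type": "image_url", "image_url": {"url": url}},
        ]}],
    ).choices[0].message.content

    # Stage 2: a text-only call must solve the rebus from the caption alone.
    return client.chat.completions.create(
        model="o3-mini",  # any text-only reasoning model could stand in here
        messages=[{"role": "user", "content":
                   "The following is a description of a rebus puzzle:\n"
                   f"{caption}\n"
                   "What phrase does it encode? Answer with the phrase only."}],
    ).choices[0].message.content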

Image Retrieval

Retrieval Metrics Table Placeholder

How do visual contrastive models perform in retrieving the correct answers?

To probe the underlying visual encoders, we evaluate multiple contrastive VLMs for their ability to retrieve correct answers based solely on image-text similarity. MobileCLIP and SigLIP variants show competitive results despite differing architecture sizes and patch resolutions, suggesting that finer-grained tokenization and explicit visual objectives benefit symbolic puzzle matching.
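As a concrete example of this retrieval setup, the sketch below scores candidate answer strings against a puzzle image by image-text similarity, using a standard CLIP checkpoint from Hugging Face transformers as a stand-in for the MobileCLIP and SigLIP variants evaluated in the paper.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def rank_answers(image_path: str, candidate_answers: list[str]) -> list[tuple[str, float]]:
    """Rank candidate answer strings by similarity to the puzzle image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=candidate_answers, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]   # one score per candidate text
    scores = logits.softmax(dim=-1).tolist()           # relative scores over candidates
    return sorted(zip(candidate_answers, scores), key=lambda x: -x[1])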

Iterative Refinement

Iterative Refinement Results Figure Placeholder

Can models refine the solution given the opportunity to try again?

Allowing models to retry reveals significant potential for iterative refinement and self-correction, with accuracy improving over successive attempts. However, models still hit a performance ceiling after a number of attempts.
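The refinement loop itself is simple: after each wrong guess the model is told its answer was incorrect and asked to try again, up to a fixed attempt budget. The sketch below is a minimal version using the OpenAI client with naive string matching as the oracle; the prompts and feedback message are our own illustration, not the exact protocol used in the paper.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set


def solve_with_retries(puzzle_messages: list[dict], gold_answers: list[str],
                       max_attempts: int = 5) -> tuple[str, int]:
    """Re-query after each wrong guess, up to a fixed attempt budget."""
    history = list(puzzle_messages)   # initial prompt (text + puzzle image)
    answer = ""
    for attempt in range(1, max_attempts + 1):
        answer = client.chat.completions.create(
            model="gpt-4o", messages=history,
        ).choices[0].message.content
        # Oracle feedback: naive string match against the accepted answers.
        if answer.strip().lower() in {g.strip().lower() for g in gold_answers}:
            return answer, attempt
        history += [
            {"role": "assistant", "content": answer},
            {"role": "user",
             "content": "That is not correct. Reconsider the puzzle and give a different answer."},
        ]
    return answer, max_attempts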

BibTeX

@inproceedings{lee2025puzzled,
  title     = {Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint},
  author    = {Heekyung Lee and Jiaxin Ge and Tsung-Han Wu and Minwoo Kang and Trevor Darrell and David M. Chan},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2505.23759}
}