Multimodal Reasoning in Math Geometry
Advanced research on multimodal AI systems for solving complex geometry problems by combining visual and textual reasoning capabilities.
Work Responsibilities
- Conducted an extensive survey of multimodal math/geometry research and defined the problem of geometric reasoning in VLMs, highlighting common perception–reasoning failures.
- Co-developed a planner–VQA–reasoner pipeline that actively queries diagrams and separates perception from reasoning, enabling more accurate geometric fact extraction and logical consistency.
- Demonstrated that a GPT-4o-based pipeline outperformed GPT-4V in description faithfulness (+16%) and reasoning coherence (+54%), showing the effectiveness of targeted querying over end-to-end VLMs.
- Performed systematic experiments and error analyses, diagnosing whether errors arose from misperception or faulty reasoning, and providing guidance for future model design.
- Set up the development workflow with Git/GitHub and Vercel, leveraging a Vite + Node.js build environment for streamlined releases.
Abstract
We study whether vision–language models can truly reason about geometry rather than merely label diagrams. On a MathVerse subset of multiple-choice problems, eight baselines—ranging from text-only LLMs to state-of-the-art VLMs—still struggle with fine-grained spatial relations.
We introduce an iterative pipeline that alternates planning questions, VQA fact extraction, and belief-state reasoning. Without additional training, the full GPT-4o variant doubles baseline description quality and raises accuracy to 40%, while an open-weights Qwen version reaches 34%.
Results suggest that targeted visual querying, not larger language models alone, is the critical driver of geometric reasoning performance.
Introduction & Problem Definition
Multimodal technologies have demonstrated remarkable potential in tackling complex real-world tasks, leveraging their powerful capability to integrate visual and textual information. Our team investigated whether rapidly evolving vision-language models (VLMs) could yield breakthrough results when applied to geometric mathematical problems.
Unlike typical text-based reasoning problems, geometry requires precise perception and reasoning about detailed structural and spatial relationships within images—such as points, lines, surfaces, geometric shape combinations, segment lengths, angle measurements, parallel or perpendicular relationships, shapes' containment and partitioning, overall symmetry, and spatial layouts.
Our investigation revealed that even state-of-the-art models like GPT-4V and LLaVA-7B exhibit significant limitations in specific geometric image reasoning tasks, particularly in handling precise spatial details. Current datasets commonly provide geometric images, question texts, choices, and answers, but lack detailed, explicit, and structured geometric image descriptions.
Research Hypotheses
Caption Benefit Hypothesis
Providing detailed and explicit image descriptions can significantly enhance model accuracy on geometric problems compared to using only raw images or text.
Task-Focused Caption Hypothesis
Descriptions specifically tailored to focus on visual features directly related to geometric problem-solving will more effectively improve model reasoning capabilities compared to general image descriptions.
Proposed Model
Our proposed model extends the classic perception-reasoning split into an iterative three-component loop: Planner → VQA → Belief-state reasoning. The system can actively request missing geometric facts before committing to a final answer.
Pipeline Components
Planner LM
Generates follow-up queries based on the question and previous Q/A log
VQA Model
Answers queries about the diagram, extracting visual facts
Belief-state Builder
Maintains a structured transcript of all Q/A pairs for reasoning
Reasoner LM
Produces final answer with chain-of-thought reasoning
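The loop formed by these components can be sketched as follows. This is a minimal illustration, not the project's actual implementation: `planner`, `vqa`, and `reasoner` are assumed callables wrapping the underlying models, and the stopping signal (the planner returning `None`) is a simplifying assumption.

```python
from dataclasses import dataclass, field

@dataclass
class BeliefState:
    """Structured transcript of Q/A pairs extracted from the diagram."""
    facts: list[tuple[str, str]] = field(default_factory=list)

    def add(self, question: str, answer: str) -> None:
        self.facts.append((question, answer))

    def transcript(self) -> str:
        return "\n".join(f"Q: {q}\nA: {a}" for q, a in self.facts)

def solve(problem: str, diagram, planner, vqa, reasoner, max_rounds: int = 5) -> str:
    """Iterate Planner -> VQA -> belief state, then reason over the transcript."""
    state = BeliefState()
    for _ in range(max_rounds):
        query = planner(problem, state.transcript())  # ask for the next missing fact
        if query is None:  # planner signals it has enough information
            break
        state.add(query, vqa(diagram, query))  # ground the query in the image
    return reasoner(problem, state.transcript())  # final chain-of-thought answer
```

Separating the perception step (`vqa`) from the reasoning step (`reasoner`) is what lets the system attribute errors to misperception versus faulty logic, as described in the error analyses above.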
Model Configurations Tested
| Pipeline | Planner | VQA | Reasoner |
|---|---|---|---|
| P-1 | Mathstral-7B-v0.1 | InternLM-XC2-VL-7B | Mathstral-7B-v0.1 |
| P-2 | Qwen-Chat-7B | Qwen-VL-Chat | Qwen-Chat-7B |
| P-3 | GPT-3.5-Turbo | GPT-4o-VQA | GPT-3.5-Turbo |
| P-4 | GPT-4o (text) | GPT-4o (vision) | GPT-4o (text) |
Results
Key Findings
GPT-4o Pipeline Accuracy
Full GPT-4o pipeline achieved 40% accuracy with 4.00 CoT reasoning score and 2.86 description score
Open-Weights Performance
Qwen→Qwen pipeline reached 34% accuracy, strongest open-weights alternative
Description Quality Improvement
Iterative pipeline doubled baseline description quality scores
Baseline Comparison
Proposed Models
Analysis & Limitations
Key Insights
- Targeted visual querying, not larger language models alone, is the critical driver of geometric reasoning performance
- The iterative pipeline successfully reduces uncertainty by actively requesting missing information
- Chain-of-thought coherence scores correlate strongly with final accuracy
Identified Limitations
Meaningful Queries, Poor VQA Answers
The planner generates relevant questions, but VQA modules often provide incorrect responses, especially for fine-grained spatial cues.
Hallucinated Geometric Relationships
VQA models frequently conflate inscribed with central angles, mistake perpendicular distances for radii, and misidentify baselines.
No Contradiction Resolution
The system lacks mechanisms to detect or resolve inconsistent facts, allowing contradictions to accumulate.
Dataset Gaps
Ground-truth annotations often omit implicit geometric constraints, limiting model training effectiveness.
Future Work
Memory-Aware Retrieval
Build a structured memory module to store Q-A pairs, enabling retrieval, updates, and contradiction resolution before reasoning.
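One possible shape for such a module, sketched under assumptions (the class name, the exact-match keying, and the flag-and-re-query policy are all hypothetical):

```python
class FactMemory:
    """Keyed Q-A store with a simple contradiction check applied
    before facts reach the reasoner."""

    def __init__(self) -> None:
        self.store: dict[str, str] = {}
        self.conflicts: list[tuple[str, str, str]] = []

    def update(self, question: str, answer: str) -> None:
        key = question.strip().lower()
        old = self.store.get(key)
        if old is not None and old != answer:
            # Flag the inconsistency instead of silently overwriting,
            # so the pipeline can re-query the diagram.
            self.conflicts.append((question, old, answer))
        self.store[key] = answer

    def retrieve(self, keyword: str) -> list[tuple[str, str]]:
        """Return stored Q-A pairs whose question mentions the keyword."""
        kw = keyword.lower()
        return [(q, a) for q, a in self.store.items() if kw in q]
```

A production version would likely key facts by geometric entity (angle, segment) rather than by the literal question string, but even this minimal store addresses the "no contradiction resolution" limitation identified above.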
Adaptive Query Stopping
Monitor VQA uncertainty to halt question generation when marginal information gain is low, reducing inference costs.
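A simple stopping rule along these lines could combine answer-distribution entropy with recent marginal gain; the thresholds and the two-signal design here are illustrative assumptions, not tuned values from the project:

```python
import math

def should_stop(answer_probs: list[float], history_gains: list[float],
                entropy_floor: float = 0.3, gain_floor: float = 0.05) -> bool:
    """Halt query generation when the VQA answer distribution is already
    confident (low entropy) or the last round added little information
    (low marginal gain)."""
    entropy = -sum(p * math.log(p) for p in answer_probs if p > 0)
    recent_gain = history_gains[-1] if history_gains else float("inf")
    return entropy < entropy_floor or recent_gain < gain_floor
```

For example, a near-peaked distribution like `[0.97, 0.01, 0.01, 0.01]` triggers a stop, while a uniform distribution with high recent gain keeps the loop running, which is exactly the cost-saving behavior the future-work item targets.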
VQA Fine-tuning
Fine-tune VQA models on re-annotated datasets with explicit tags for angles, lengths, and spatial relationships.
Prompt Refinement Loop
Use LLMs to rewrite queries for clarity and geometric specificity before VQA processing.
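A minimal sketch of that loop, assuming a `rewrite_llm` callable (hypothetical, not part of the project's codebase) and a fixed-point stopping criterion:

```python
def refine_query(raw_query: str, rewrite_llm, max_attempts: int = 2) -> str:
    """Iteratively rewrite a planner query for geometric specificity
    before it is sent to the VQA model."""
    query = raw_query
    for _ in range(max_attempts):
        rewritten = rewrite_llm(
            "Rewrite this question so it names concrete geometric "
            f"entities (points, segments, angles) explicitly:\n{query}"
        )
        if rewritten == query:  # fixed point: nothing more to sharpen
            break
        query = rewritten
    return query
```

Capping the number of rewrite attempts keeps the extra LLM calls bounded, so the refinement step does not undo the inference savings sought by adaptive query stopping.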