2025

Multimodal Reasoning in Math Geometry

Advanced research on multimodal AI systems for solving complex geometry problems by combining visual and textual reasoning capabilities.

Role

Graduate Researcher / MLE Researcher

Skills & Tools

AI/ML, Multimodal Machine Learning, NLP, VLM, Geometry Reasoning

Work Responsibilities

Graduate Researcher / MLE Researcher
  • Conducted an extensive survey of multimodal math/geometry research and defined the problem of geometric reasoning in VLMs, highlighting common perception–reasoning failures.
  • Co-developed a planner–VQA–reasoner pipeline that actively queries diagrams and separates perception from reasoning, enabling more accurate geometric fact extraction and logical consistency.
  • Demonstrated that a GPT-4o + pipeline setup outperformed GPT-4V in description faithfulness (+16%) and reasoning coherence (+54%), showing the effectiveness of targeted querying compared to end-to-end VLMs.
  • Performed systematic experiments and error analyses, diagnosing whether errors arose from misperception or faulty reasoning, and providing guidance for future model design.
  • Set up the development workflow with Git/GitHub and Vercel, leveraging a Vite + Node.js build environment for streamlined releases.

Abstract

We study whether vision–language models can truly reason about geometry rather than merely label diagrams. On a MathVerse subset of multiple-choice problems, eight baselines—ranging from text-only LLMs to state-of-the-art VLMs—still struggle with fine-grained spatial relations.

We introduce an iterative pipeline that alternates planning questions, VQA fact extraction, and belief-state reasoning. Without additional training, the full GPT-4o variant doubles baseline description quality and raises accuracy to 40%, while an open-weights Qwen version reaches 34%.

Results suggest that targeted visual querying, not larger language models alone, is the critical driver of geometric reasoning performance.

Introduction & Problem Definition

Multimodal technologies have demonstrated remarkable potential in tackling complex real-world tasks, leveraging their powerful capability to integrate visual and textual information. Our team investigated whether rapidly evolving vision-language models (VLMs) could yield breakthrough results when applied to geometric mathematical problems.

Unlike typical text-based reasoning problems, geometry requires precise perception and reasoning about detailed structural and spatial relationships within images—such as points, lines, surfaces, geometric shape combinations, segment lengths, angle measurements, parallel or perpendicular relationships, shapes' containment and partitioning, overall symmetry, and spatial layouts.

Our investigation revealed that even state-of-the-art models like GPT-4V and LLaVA-7B exhibit significant limitations in specific geometric image reasoning tasks, particularly in handling precise spatial details. Current datasets commonly provide geometric images, question texts, choices, and answers, but lack detailed, explicit, and structured geometric image descriptions.

Research Hypotheses

Caption Benefit Hypothesis

Providing detailed and explicit image descriptions can significantly enhance model accuracy on geometric problems compared to using only raw images or text.

Task-Focused Caption Hypothesis

Descriptions specifically tailored to focus on visual features directly related to geometric problem-solving will more effectively improve model reasoning capabilities compared to general image descriptions.

Proposed Model

Our proposed model extends the classic perception-reasoning split into an iterative three-component loop: Planner → VQA → Belief-state reasoning. The system can actively request missing geometric facts before committing to a final answer.

Pipeline Components

1. Planner LM: generates follow-up queries based on the question and the previous Q/A log.

2. VQA Model: answers queries about the diagram, extracting visual facts.

3. Belief-state Builder: maintains a structured transcript of all Q/A pairs for reasoning.

4. Reasoner LM: produces the final answer with chain-of-thought reasoning.
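The four components above can be sketched as a single control loop. This is a minimal illustrative sketch, not the project's actual code: the `planner`, `vqa`, and `reasoner` callables are hypothetical stand-ins for the LM/VLM backends, and the toy implementations exist only to show the control flow.

```python
def run_pipeline(image, question, planner, vqa, reasoner, max_queries=8):
    """Alternate planning and VQA, accumulating a belief state of Q/A facts."""
    belief_state = []  # structured transcript of (query, answer) pairs
    for _ in range(max_queries):
        query = planner(question, belief_state)
        if query is None:  # planner judges the fact log sufficient
            break
        answer = vqa(image, query)
        belief_state.append((query, answer))
    return reasoner(question, belief_state), belief_state

# Toy backends (assumptions, for illustration only).
def toy_planner(question, belief_state):
    wanted = ["What is the measure of angle ABC?",
              "Are lines AB and CD parallel?"]
    asked = {q for q, _ in belief_state}
    return next((q for q in wanted if q not in asked), None)

def toy_vqa(image, query):
    return {"What is the measure of angle ABC?": "40 degrees",
            "Are lines AB and CD parallel?": "yes"}[query]

def toy_reasoner(question, belief_state):
    return "B"  # a real reasoner would run chain-of-thought over the transcript

answer, facts = run_pipeline(None, "Find angle ABC.",
                             toy_planner, toy_vqa, toy_reasoner)
```

The key design point is that querying stops when the planner returns `None`, so the reasoner only ever sees facts the planner explicitly asked for.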

Model Configurations Tested

Pipeline | Planner           | VQA                | Reasoner
P-1      | Mathstral-7B-v0.1 | InternLM-XC2-VL-7B | Mathstral-7B-v0.1
P-2      | Qwen-Chat-7B      | Qwen-VL-Chat       | Qwen-Chat-7B
P-3      | GPT-3.5-Turbo     | GPT-4o-VQA         | GPT-3.5-Turbo
P-4      | GPT-4o (text)     | GPT-4o (vision)    | GPT-4o (text)

Results

Key Findings

  • GPT-4o Pipeline Accuracy (40%): the full GPT-4o pipeline achieved 40% accuracy with a 4.00 CoT reasoning score and a 2.86 description score.
  • Open-Weights Performance (34%): the Qwen→Qwen pipeline reached 34% accuracy, the strongest open-weights alternative.
  • Description Quality Improvement: the iterative pipeline doubled baseline description quality scores.

Baseline Comparison

  • GPT-4o (text-only): 58%
  • Qwen2.5-VL: 52%
  • GPT-4V: 44%
  • GPT-3.5-turbo: 22%

Proposed Models

  • GPT-4o Pipeline: 40%
  • Qwen Pipeline: 34%
  • GPT-3.5 + GPT-4o VQA: 16%
  • Mathstral Pipeline: 10%

Analysis & Limitations

Key Insights

  • Targeted visual querying, not larger language models alone, is the critical driver of geometric reasoning performance
  • The iterative pipeline successfully reduces uncertainty by actively requesting missing information
  • Chain-of-thought coherence scores correlate strongly with final accuracy

Identified Limitations

Meaningful Queries, Poor VQA Answers

The planner generates relevant questions, but VQA modules often provide incorrect responses, especially for fine-grained spatial cues.

Hallucinated Geometric Relationships

VQA models frequently conflate inscribed with central angles, mistake perpendicular distances for radii, and misidentify baselines.

No Contradiction Resolution

The system lacks mechanisms to detect or resolve inconsistent facts, allowing contradictions to accumulate.

Dataset Gaps

Ground-truth annotations often omit implicit geometric constraints, limiting model training effectiveness.

Future Work

Memory-Aware Retrieval

Build a structured memory module to store Q-A pairs, enabling retrieval, updates, and contradiction resolution before reasoning.
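One way such a memory module might look, sketched under stated assumptions: the `FactMemory` class, its query normalization, and its conflict log are all hypothetical design choices, not part of the existing system.

```python
class FactMemory:
    """Store Q/A facts keyed by a normalized query; flag contradictions."""

    def __init__(self):
        self.facts = {}           # normalized query -> latest answer
        self.contradictions = []  # (query, old_answer, new_answer)

    @staticmethod
    def _normalize(query):
        # crude normalization (assumption): lowercase, drop trailing "?"
        return " ".join(query.lower().strip(" ?").split())

    def add(self, query, answer):
        key = self._normalize(query)
        old = self.facts.get(key)
        if old is not None and old != answer:
            # record the conflict instead of silently overwriting
            self.contradictions.append((key, old, answer))
        self.facts[key] = answer

    def retrieve(self, query):
        return self.facts.get(self._normalize(query))
```

Logging conflicts rather than discarding them would let a later reasoning step decide which of two inconsistent VQA answers to trust, addressing the "no contradiction resolution" limitation above.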

Adaptive Query Stopping

Monitor VQA uncertainty to halt question generation when marginal information gain is low, reducing inference costs.
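A simple version of this stopping rule, assuming the VQA backend exposes an answer probability distribution (an assumption; the threshold and window size below are also arbitrary illustrative values), is to halt once recent answers have consistently low entropy:

```python
import math

def answer_entropy(probs):
    """Shannon entropy (bits) of a VQA answer distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def should_stop(recent_entropies, threshold=0.5, window=2):
    """Stop querying once the last `window` answers were all low-entropy."""
    if len(recent_entropies) < window:
        return False
    return all(h < threshold for h in recent_entropies[-window:])
```

Low entropy here stands in for low marginal information gain: if the VQA model keeps answering with near-certainty, additional queries are unlikely to change the belief state.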

VQA Fine-tuning

Fine-tune VQA models on re-annotated datasets with explicit tags for angles, lengths, and spatial relationships.

Prompt Refinement Loop

Use LLMs to rewrite queries for clarity and geometric specificity before VQA processing.
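A sketch of what this refinement step could look like: `rewrite_llm` is a hypothetical callable standing in for whatever LLM backend is used, and the prompt wording is an assumption.

```python
# Prompt template (assumed wording) for rewriting planner queries.
REFINE_PROMPT = (
    "Rewrite the following question about a geometry diagram so it names "
    "specific points, segments, and angles, and asks for exactly one fact:\n{q}"
)

def refine_query(query, rewrite_llm):
    """Rewrite a planner query for geometric specificity before VQA."""
    refined = rewrite_llm(REFINE_PROMPT.format(q=query)).strip()
    # fall back to the original query if the rewrite comes back empty
    return refined or query
```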