TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

1Beihang University, 2Beijing Academy of Artificial Intelligence, 3Peking University
Equal Contribution   † Project Leader   ✉ Equal Advising  
TIGeR teaser images

TIGeR equips vision-language models with tool-use capabilities to perform accurate geometric reasoning for robotics.

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative assessments and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric information from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation.

We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations.

To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, complete with tool-invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), TIGeR achieves state-of-the-art performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.

🛠️ Tool Lib

Visual Perception Tools

  • Get Camera Intrinsics
  • Get Camera Extrinsics
  • Get Pixel Depth
  • Get Object Segmentation

Geometric Computation Tools

  • Transform 2D Bounding Box to 3D Bounding Box
  • Transform 3D Position to 2D Position (both transforms are sketched below)
  • Generate and run code based on requirements
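
The pinhole geometry behind the two transform tools above can be sketched as follows. This is a minimal illustration assuming a standard pinhole camera model; the function names and numbers are illustrative, not TIGeR's actual tool API.

import numpy as np

def pixel_to_camera_point(u, v, depth_m, K):
    """Back-project pixel (u, v) with metric depth into a 3D point in the camera frame."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def camera_point_to_pixel(p_cam, K):
    """Project a 3D camera-frame point back onto the image plane."""
    u = K[0, 0] * p_cam[0] / p_cam[2] + K[0, 2]
    v = K[1, 1] * p_cam[1] / p_cam[2] + K[1, 2]
    return np.array([u, v])

# Illustrative intrinsics for a 640x480 image (not taken from any TIGeR scene).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
p_cam = pixel_to_camera_point(400, 260, depth_m=1.25, K=K)  # ~[0.167, 0.042, 1.250]
u_v = camera_point_to_pixel(p_cam, K)                       # back to ~[400, 260]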

🗃️ Data Preparation: TIGeR-300K

TIGeR dataset visualization 1

Part I: Template-based QA Pairs

Starting from CA-1M, every 20th frame is cleaned and semantically relabeled with GroundingDINO, RAM, and Florence-2, and only high-IoU boxes are kept. Camera intrinsics, extrinsics, and depth maps are then combined with 3D scene graphs to instantiate modular templates that vary single-/multi-view images, object properties, inter-object relations, and output formats, producing 274K QA pairs complete with tool-invocation sequences and intermediate numerical computations.
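
As a rough illustration, one template-instantiated record pairs a question with the tool calls and intermediate values needed to answer it. The field names, tool names, and numbers below are hypothetical, not the actual TIGeR-300K schema.

# Hypothetical sketch of a single template-instantiated record.
qa_pair = {
    "question": "How far apart are the mug and the kettle, in meters?",
    "tool_calls": [
        {"tool": "get_camera_intrinsics", "args": {"view": 0}},
        {"tool": "get_pixel_depth", "args": {"view": 0, "pixel": [412, 233]}},
        {"tool": "get_pixel_depth", "args": {"view": 0, "pixel": [198, 305]}},
    ],
    "intermediate": {
        "mug_xyz_m": [0.19, -0.04, 0.86],
        "kettle_xyz_m": [-0.12, 0.02, 1.10],
    },
    "answer": "The two objects are roughly 0.40 m apart.",
}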

TIGeR dataset visualization 2

Part II: LLM-rewritten QA Pairs

GPT-4o filters SSR-CoT for spatial-reasoning questions and rewrites each chain-of-thought into a tool-integrated narrative with explicit placeholders. MoGe-2, GeoCalib, SAM2, and π3 are then invoked to return metric depth, camera poses, segmentation masks, and gravity vectors, and their values are inserted into the corresponding placeholders, producing 35K diverse, adaptive examples with flexible, open-ended tool-call sequences.
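
Conceptually, this step fills explicit placeholders in the rewritten narrative with the numbers returned by the perception models. The snippet below is only a schematic sketch; the placeholder names and values are made up for illustration.

import re

# Hypothetical rewritten chain-of-thought with explicit placeholders.
cot = ("The cup lies at a metric depth of {DEPTH_CUP} m; given the estimated "
       "gravity direction {GRAVITY_VEC}, it sits about {HEIGHT_CUP} m above the table.")

# Values as a metric-depth / calibration / segmentation stack might return them
# (illustrative numbers only).
tool_outputs = {
    "DEPTH_CUP": "0.87",
    "GRAVITY_VEC": "[0.01, -0.99, 0.12]",
    "HEIGHT_CUP": "0.15",
}

filled = re.sub(r"\{(\w+)\}", lambda m: tool_outputs[m.group(1)], cot)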

TIGeR dataset detailed visualization

Details of Data Generation

🤖 Learning to Invoke Tools: Two-Stage Training

TIGeR training pipeline

We build on GLM-4.1V-Thinking and introduce tool-integrated geometric reasoning for robotics via a two-stage training pipeline: 1) Supervised Fine-Tuning (SFT) and 2) Reinforcement Fine-Tuning (RFT). SFT imparts basic tool-use reasoning capabilities, while RFT refines them through reward signals focused on geometric computation accuracy and effective tool use.
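
As a rough sketch of what such a reward could look like, the function below combines a metric-accuracy term with a well-formed tool-use bonus. The tolerance, weights, and validity check are assumptions for illustration, not the paper's exact reward design.

def geometric_reward(pred_m, gt_m, tool_calls, tol_m=0.02, w_acc=0.8, w_tool=0.2):
    """Toy reward: metric accuracy within a tolerance plus a well-formed tool-use bonus."""
    err = abs(pred_m - gt_m)
    # Accuracy term: full credit within tol_m, otherwise decays with the error.
    acc = 1.0 if err <= tol_m else max(0.0, 1.0 - err)
    # Tool-use term: non-empty, syntactically valid tool-call sequence.
    valid = bool(tool_calls) and all("tool" in c and "args" in c for c in tool_calls)
    return w_acc * acc + w_tool * (1.0 if valid else 0.0)

# Example: a 1.5 cm error with two well-formed tool calls yields reward 1.0.
reward = geometric_reward(0.415, 0.400, [
    {"tool": "get_pixel_depth", "args": {"pixel": [412, 233]}},
    {"tool": "get_camera_intrinsics", "args": {}},
])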

📊 Evaluation on Spatial Understanding Benchmarks

TIGeR benchmark evaluation results

Performance comparison on spatial understanding benchmarks across different models. Since these benchmarks lack ground-truth geometric annotations (e.g., camera intrinsics, extrinsics, and depth), we leverage visual foundation models to extract such information and inject approximate geometric priors into Tool-Integrated Reasoning at inference.

🦾 Real-world Experiments

🔗 BibTeX

@misc{2510.07181,
  Author = {Yi Han and Cheng Chi and Enshen Zhou and Shanyu Rong and Jingkun An and Pengwei Wang and Zhongyuan Wang and Lu Sheng and Shanghang Zhang},
  Title = {TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics},
  Year = {2025},
  Eprint = {arXiv:2510.07181},
}