Visual Perception Tools
- Get Camera Intrinsics
- Get Camera Extrinsics
- Get Pixel Depth
- Get Object Segmentation
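Taken together, these tools give the model what it needs to turn a pixel query into a metric 3D quantity. A minimal sketch of how they compose, assuming a pinhole camera model and illustrative function names (not the released tool API):

```python
import numpy as np

def backproject_pixel(u, v, depth, K, T_world_cam):
    """Lift a pixel with known metric depth to a 3D point in the world frame.

    K           : 3x3 camera intrinsics (from the intrinsics tool)
    T_world_cam : 4x4 camera-to-world extrinsics (from the extrinsics tool)
    depth       : metric depth in meters at pixel (u, v) (from the depth tool)
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole back-projection into the camera frame.
    p_cam = np.array([(u - cx) * depth / fx,
                      (v - cy) * depth / fy,
                      depth,
                      1.0])
    # Rigid transform into the world frame.
    return (T_world_cam @ p_cam)[:3]

# Made-up calibration values, for illustration only.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
print(backproject_pixel(400, 260, 1.25, K, np.eye(4)))  # 3D point in meters
```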
Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative assessments and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric information from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation.
We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations.
To support this paradigm, we introduce the TIGeR dataset, a comprehensive tool-invocation-oriented collection covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, complete with tool-invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), TIGeR achieves state-of-the-art performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
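The released records are not reproduced here, but a hypothetical layout helps make the annotation format concrete; every field name and value below is invented for illustration:

```python
# Hypothetical tool-annotated QA record; field names and values are invented.
example_record = {
    "question": "How far apart are the mug and the bowl?",
    "images": ["frame_000120.png"],
    "tool_calls": [
        {"tool": "get_camera_intrinsics", "args": {"view": 0}},
        {"tool": "get_pixel_depth", "args": {"u": 412, "v": 256}},
        {"tool": "get_pixel_depth", "args": {"u": 180, "v": 301}},
    ],
    "intermediate": {"p_mug": [0.21, -0.04, 0.87], "p_bowl": [-0.10, 0.02, 0.91]},
    "answer": "about 0.32 m",
}
```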
Part I: Template-based QA Pairs
Starting from CA-1M, every 20th frame is cleaned and semantically relabeled with GroundingDINO, RAM, and Florence-2, and only high-IoU boxes are kept. Camera intrinsics, extrinsics, and depth maps are then combined with 3D scene graphs to instantiate modular templates that vary single-/multi-view images, object properties, inter-object relations, and output formats, producing 274K QA pairs complete with tool-invocation sequences and intermediate numerical computations.
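As one example of the filtering step above, keeping only high-IoU boxes can be sketched as follows; the 0.5 threshold and the (x1, y1, x2, y2) box format are assumptions, not the exact values used:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def keep_consistent_boxes(detector_boxes, reference_boxes, thresh=0.5):
    """Keep detector boxes that agree (IoU >= thresh) with reference annotations."""
    return [d for d in detector_boxes
            if any(iou(d, r) >= thresh for r in reference_boxes)]
```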
Part II: LLM-rewritten QA Pairs
GPT-4o filters SSR-CoT for spatial-reasoning questions and rewrites each chain-of-thought into a tool-integrated narrative with explicit placeholders. MoGe-2, GeoCalib, SAM2, and π3 are then invoked to return metric depth, camera poses, segmentation masks, and gravity vectors, whose values are inserted into the corresponding placeholders, producing 35K diverse, adaptive examples with flexible, open-ended tool-call sequences.
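A toy illustration of the placeholder-filling step, with invented slot names and values standing in for the numbers returned by the perception models:

```python
# A rewritten chain-of-thought carries named slots; values returned by the
# perception models (depth, pose, masks, gravity) are substituted in.
cot_template = (
    "The cup sits at pixel (412, 256). Calling the depth tool gives "
    "{depth_cup:.2f} m; with the estimated intrinsics the cup lies "
    "{dist_cup:.2f} m from the camera in the gravity-aligned frame."
)

tool_outputs = {"depth_cup": 0.83, "dist_cup": 0.91}  # e.g. from MoGe-2 / GeoCalib
filled_cot = cot_template.format(**tool_outputs)
print(filled_cot)
```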
Details of Data Generation
We build on GLM-4.1V-Thinking and introduce tool-integrated geometric reasoning for robotics via a two-stage training pipeline: 1) Supervised Fine-Tuning (SFT) and 2) Reinforcement Fine-Tuning (RFT). SFT imparts basic tool-use reasoning capabilities, while RFT refines them through reward signals focused on geometric computation accuracy and effective tool use.
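The exact reward design is not spelled out here, but a minimal sketch of a reward that couples geometric accuracy with valid tool use might look like this (the weights and tolerance are assumptions):

```python
import math

def rft_reward(pred_value, gt_value, tool_calls_ok, tol=0.02, w_acc=0.8, w_tool=0.2):
    """Combine geometric accuracy with a tool-use validity bonus.

    pred_value / gt_value : scalar geometric answers (e.g. a distance in meters)
    tool_calls_ok         : True if every emitted tool call parsed and executed
    tol                   : error scale (2 cm) controlling how fast reward decays
    """
    acc = math.exp(-abs(pred_value - gt_value) / tol)   # 1.0 at zero error
    return w_acc * acc + w_tool * (1.0 if tool_calls_ok else 0.0)

print(rft_reward(0.315, 0.32, tool_calls_ok=True))
```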
Performance comparison on spatial understanding benchmarks across different models. Since these benchmarks lack ground-truth geometric annotations (e.g., camera intrinsics, extrinsics, and depth), we leverage visual foundation models to extract such information and inject approximate geometric priors into Tool-Integrated Reasoning at inference.
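Concretely, these priors can be exposed to the model through the same tool interface. A minimal sketch, assuming the foundation-model outputs have already been converted to a depth map and an intrinsics matrix (hypothetical wrapper, not the evaluation code):

```python
import numpy as np

def build_tool_context(depth_map, K):
    """Wrap foundation-model estimates as the perception tools the VLM expects.

    depth_map : HxW array of approximate metric depth (e.g. from MoGe-2)
    K         : 3x3 approximate intrinsics (e.g. from GeoCalib)
    """
    return {
        "get_pixel_depth": lambda u, v: float(depth_map[v, u]),
        "get_camera_intrinsics": lambda: K,
    }

# Dummy stand-ins for the foundation-model estimates.
tools = build_tool_context(np.full((480, 640), 1.0), np.eye(3))
print(tools["get_pixel_depth"](320, 240))
```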
@misc{2510.07181,
  title         = {TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics},
  author        = {Yi Han and Cheng Chi and Enshen Zhou and Shanyu Rong and Jingkun An and Pengwei Wang and Zhongyuan Wang and Lu Sheng and Shanghang Zhang},
  year          = {2025},
  eprint        = {2510.07181},
  archivePrefix = {arXiv},
}