Visual Perception Tools
- Get Camera Intrinsics
- Get Camera Extrinsics
- Get Pixel Depth
- Get Object Segmentation
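Taken together, these tools give the model what it needs to turn a pixel query into a metric 3D quantity. A minimal sketch of how they compose, assuming a pinhole camera model and illustrative function names (not the released tool API):

```python
import numpy as np

def backproject_pixel(u, v, depth, K, T_world_cam):
    """Lift a pixel with known metric depth to a 3D point in the world frame.

    K           : 3x3 camera intrinsics (from the intrinsics tool)
    T_world_cam : 4x4 camera-to-world extrinsics (from the extrinsics tool)
    depth       : metric depth in meters at pixel (u, v) (from the depth tool)
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole back-projection into the camera frame.
    p_cam = np.array([(u - cx) * depth / fx,
                      (v - cy) * depth / fy,
                      depth,
                      1.0])
    # Rigid transform into the world frame.
    return (T_world_cam @ p_cam)[:3]

# Made-up calibration values, for illustration only.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
print(backproject_pixel(400, 260, 1.25, K, np.eye(4)))  # 3D point in meters
```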
Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative assessments and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric information from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation.
We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations.
To support this paradigm, we introduce the TIGeR dataset, a comprehensive tool-invocation-oriented collection covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, complete with tool-invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), TIGeR achieves state-of-the-art performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
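The released records are not reproduced here, but a hypothetical layout helps make the annotation format concrete; every field name and value below is invented for illustration:

```python
# Hypothetical tool-annotated QA record; field names and values are invented.
example_record = {
    "question": "How far apart are the mug and the bowl?",
    "images": ["frame_000120.png"],
    "tool_calls": [
        {"tool": "get_camera_intrinsics", "args": {"view": 0}},
        {"tool": "get_pixel_depth", "args": {"u": 412, "v": 256}},
        {"tool": "get_pixel_depth", "args": {"u": 180, "v": 301}},
    ],
    "intermediate": {"p_mug": [0.21, -0.04, 0.87], "p_bowl": [-0.10, 0.02, 0.91]},
    "answer": "about 0.32 m",
}
```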
Part I: Template-based QA Pairs
Starting from CA-1M, every 20th frame is cleaned and semantically relabeled with GroundingDINO, RAM, and Florence-2, and only high-IoU boxes are kept. Camera intrinsics, extrinsics, and depth maps are then combined with 3D scene graphs to instantiate modular templates that vary single-/multi-view images, object properties, inter-object relations, and output formats, producing 274K QA pairs complete with tool-invocation sequences and intermediate numerical computations.
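As one example of the filtering step above, keeping only high-IoU boxes can be sketched as follows; the 0.5 threshold and the (x1, y1, x2, y2) box format are assumptions, not the exact values used:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def keep_consistent_boxes(detector_boxes, reference_boxes, thresh=0.5):
    """Keep detector boxes that agree (IoU >= thresh) with reference annotations."""
    return [d for d in detector_boxes
            if any(iou(d, r) >= thresh for r in reference_boxes)]
```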
Part II: LLM-rewritten QA Pairs
GPT-4o filters SSR-CoT for spatial-reasoning questions and rewrites each chain-of-thought into a tool-integrated narrative with explicit placeholders. MoGe-2, GeoCalib, SAM2, and π3 are then invoked to return metric depth, camera poses, segmentation masks, and gravity vectors, whose values are inserted into the corresponding placeholders, producing 35K diverse, adaptive examples with flexible, open-ended tool-call sequences.
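A toy illustration of the placeholder-filling step, with invented slot names and values standing in for the numbers returned by the perception models:

```python
# A rewritten chain-of-thought carries named slots; values returned by the
# perception models (depth, pose, masks, gravity) are substituted in.
cot_template = (
    "The cup sits at pixel (412, 256). Calling the depth tool gives "
    "{depth_cup:.2f} m; with the estimated intrinsics the cup lies "
    "{dist_cup:.2f} m from the camera in the gravity-aligned frame."
)

tool_outputs = {"depth_cup": 0.83, "dist_cup": 0.91}  # e.g. from MoGe-2 / GeoCalib
filled_cot = cot_template.format(**tool_outputs)
print(filled_cot)
```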
Details of Data Generation
We build on GLM-4.1V-Thinking and introduce tool-integrated geometric reasoning for robotics via a two-stage training pipeline: 1) Supervised Fine-Tuning (SFT) and 2) Reinforcement Fine-Tuning (RFT). SFT imparts basic tool-use reasoning capabilities, while RFT refines them through reward signals focused on geometric computation accuracy and effective tool use.
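The exact reward design is not spelled out here, but a minimal sketch of a reward that couples geometric accuracy with valid tool use might look like this (the weights and tolerance are assumptions):

```python
import math

def rft_reward(pred_value, gt_value, tool_calls_ok, tol=0.02, w_acc=0.8, w_tool=0.2):
    """Combine geometric accuracy with a tool-use validity bonus.

    pred_value / gt_value : scalar geometric answers (e.g. a distance in meters)
    tool_calls_ok         : True if every emitted tool call parsed and executed
    tol                   : error scale (2 cm) controlling how fast reward decays
    """
    acc = math.exp(-abs(pred_value - gt_value) / tol)   # 1.0 at zero error
    return w_acc * acc + w_tool * (1.0 if tool_calls_ok else 0.0)

print(rft_reward(0.315, 0.32, tool_calls_ok=True))
```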
Performance comparison on spatial understanding benchmarks across different models. Since these benchmarks lack ground-truth geometric annotations (e.g., camera intrinsics, extrinsics, and depth), we leverage visual foundation models to extract such information and inject approximate geometric priors into Tool-Integrated Reasoning at inference.
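Concretely, these priors can be exposed to the model through the same tool interface. A minimal sketch, assuming the foundation-model outputs have already been converted to a depth map and an intrinsics matrix (hypothetical wrapper, not the evaluation code):

```python
import numpy as np

def build_tool_context(depth_map, K):
    """Wrap foundation-model estimates as the perception tools the VLM expects.

    depth_map : HxW array of approximate metric depth (e.g. from MoGe-2)
    K         : 3x3 approximate intrinsics (e.g. from GeoCalib)
    """
    return {
        "get_pixel_depth": lambda u, v: float(depth_map[v, u]),
        "get_camera_intrinsics": lambda: K,
    }

# Dummy stand-ins for the foundation-model estimates.
tools = build_tool_context(np.full((480, 640), 1.0), np.eye(3))
print(tools["get_pixel_depth"](320, 240))
```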
@misc{2510.07181,
  title         = {TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics},
  author        = {Yi Han and Cheng Chi and Enshen Zhou and Shanyu Rong and Jingkun An and Pengwei Wang and Zhongyuan Wang and Lu Sheng and Shanghang Zhang},
  year          = {2025},
  eprint        = {2510.07181},
  archivePrefix = {arXiv},
}