Abstract
Framework

The ThermoAct framework proposed in this study consists of a Vision-Language Model (VLM) that performs reasoning and planning based on user commands and environmental information, and a Vision-Language-Action (VLA) module that executes robot control commands according to the resulting plan. The VLM takes visual inputs, including thermal data, together with a natural language instruction, and generates a situation-specific plan decomposed into low-level sub-tasks. The VLA module then controls the robot in real time, conditioned on each sub-task of the decomposed plan and the corresponding visual inputs.
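To make the division of labor concrete, the following minimal Python sketch shows one way such a planner/executor split could be wired together. The Observation, VLMPlanner, VLAPolicy, and run_episode names, along with the per-sub-task step budget, are illustrative assumptions and not the actual ThermoAct implementation.

from dataclasses import dataclass
from typing import Callable, List, Protocol

import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray      # H x W x 3 color image
    thermal: np.ndarray  # H x W temperature map (e.g., degrees Celsius)


class VLMPlanner(Protocol):
    def plan(self, obs: Observation, instruction: str) -> List[str]:
        """Decompose the instruction into an ordered list of sub-task strings."""
        ...


class VLAPolicy(Protocol):
    def act(self, obs: Observation, sub_task: str) -> np.ndarray:
        """Return one low-level action (e.g., an end-effector delta) for the current step."""
        ...


def run_episode(planner: VLMPlanner,
                policy: VLAPolicy,
                observe: Callable[[], Observation],
                send_action: Callable[[np.ndarray], None],
                instruction: str,
                steps_per_subtask: int = 200) -> None:
    """Plan once over RGB + thermal inputs, then execute each sub-task in closed loop."""
    sub_tasks = planner.plan(observe(), instruction)
    for sub_task in sub_tasks:
        for _ in range(steps_per_subtask):
            obs = observe()                        # fresh observation at every control step
            send_action(policy.act(obs, sub_task)) # real-time low-level control by the VLA

The fixed step budget per sub-task is only a placeholder; in practice the executor would more likely rely on a success or timeout signal to advance to the next sub-task.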
Experiment
All tasks were designed to require the successful completion of not only temperature-aware sub-tasks but also everyday sub-tasks (e.g., clearing unused cables, placing an apple on a plate). The sub-tasks decomposed by the VLM Planner follow a standardized format, aligning with protocols used in recent VLM-based robotic planning.
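As an illustration of what such a standardized sub-task format might look like, the snippet below encodes a decomposed plan as a list of JSON records. The field names (skill, object, thermal_condition, destination) and their values are assumptions made for this sketch and are not taken from the paper.

import json

# Hypothetical sub-task records emitted by the VLM Planner; fields are illustrative only.
plan = [
    {"skill": "pick", "object": "soda_can",
     "thermal_condition": "coldest", "destination": "user_hand"},
    {"skill": "place", "object": "apple",
     "thermal_condition": None, "destination": "plate"},
]

print(json.dumps(plan, indent=2))  # serialized plan handed to the VLA module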
Tasks 1 to 3 were designed to evaluate whether the robot can act more intelligently by utilizing thermal information in everyday scenarios, such as handing over a cup of warm water or fetching a cold can of soda.
Tasks 4 and 5 were designed to verify the utility of thermal information in safety-related situations, such as picking up an overheated battery, turning off a hot hair straightener, and organizing the nearby power strip.
Figure 3: Task design and VLM planning protocol.