Google DeepMind has introduced two new artificial intelligence (AI) models in its Gemini Robotics family, aimed at enhancing the capabilities of general-purpose robots. The models, named Gemini Robotics-ER 1.5 and Gemini Robotics 1.5, are designed to operate together to improve reasoning, vision, and action in real-world environments.
Two-Model System for Planning and Execution
According to a blog post from DeepMind, Gemini Robotics-ER 1.5 serves as the planner or orchestrator, while Gemini Robotics 1.5 is responsible for executing tasks based on natural language instructions. The two-model system is intended to address limitations seen in earlier AI models, where a single system both planned and performed actions, often leading to errors or delays in execution.
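The division of labour can be pictured as a simple control loop. The sketch below is purely illustrative: the class names, the Step type, and the get_frame callback are hypothetical stand-ins rather than anything from DeepMind's API, and it only shows how a planner that emits natural-language sub-tasks could hand them, one at a time, to a separate execution model.

```python
# Illustrative sketch of the two-model split; none of these names come from DeepMind.
from dataclasses import dataclass


@dataclass
class Step:
    instruction: str  # a natural-language sub-task, e.g. "pick up the green bottle"


class Planner:
    """Stands in for the orchestrator role (Gemini Robotics-ER 1.5)."""

    def plan(self, goal: str, frame: bytes) -> list[Step]:
        # In the real system this would be a vision-language model call that can
        # also consult external tools before emitting an ordered list of sub-tasks.
        return [Step("inspect the scene"), Step(f"carry out: {goal}")]


class Executor:
    """Stands in for the action model (Gemini Robotics 1.5)."""

    def act(self, step: Step, frame: bytes) -> None:
        # In the real system this would translate the instruction plus current
        # camera input into motor commands on the robot.
        print(f"executing: {step.instruction}")


def run_task(goal: str, planner: Planner, executor: Executor, get_frame) -> None:
    # Planning and acting are handled by separate models, so an individual step
    # can be retried or re-planned without re-deriving the whole sequence.
    for step in planner.plan(goal, get_frame()):
        executor.act(step, get_frame())


run_task("sort the items on the table", Planner(), Executor(), get_frame=lambda: b"")
```

Keeping the two roles behind separate interfaces is the point of the design: the planner reasons over the whole task, while the executor only ever sees the current sub-task and the current camera frame.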
Gemini Robotics-ER 1.5: The Planner
The ER 1.5 model functions as a vision-language model (VLM) capable of advanced reasoning and tool integration. It can create multi-step plans for a given task and is reported to perform strongly on spatial understanding benchmarks. The model can also access external tools, such as Google Search, to gather information for decision-making in physical environments.
Gemini Robotics 1.5: Task Execution
Once a plan is formulated, Gemini Robotics 1.5, a vision-language-action (VLA) model, translates instructions and visual input into motor commands, enabling the robot to carry out the task. The model assesses the most efficient path to complete an action and executes it, while also offering explanations of its decision-making in natural language.
Handling Complex Multi-Step Tasks
The system is designed to allow robots to handle complex, multi-step commands in a seamless process. For example, a robot could sort items into compost, recycling, and trash bins by first consulting local recycling guidelines online, then analysing the objects, planning the sorting sequence, and executing the actions.
DeepMind states that the AI models are adaptable to robots of various shapes and sizes due to their spatial awareness and flexible design. At present, the orchestrator model, Gemini Robotics-ER 1.5, is accessible to developers via the Gemini API in Google AI Studio, while the VLA model is limited to select partners.
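For developers trying the orchestrator model, a call through the Gemini API follows the same pattern as other Gemini models. The snippet below is a minimal sketch assuming the google-genai Python SDK and a preview model identifier of "gemini-robotics-er-1.5-preview"; the exact model ID, and whether Google Search can be enabled as a tool for it, should be verified against the Gemini API documentation. The search tool here simply mirrors the planner's tool access described above.

```python
# Minimal sketch, assuming the google-genai Python SDK and a preview model ID of
# "gemini-robotics-er-1.5-preview"; verify both against the Gemini API docs.
from google import genai
from google.genai import types

client = genai.Client()  # reads the GOOGLE_API_KEY environment variable

with open("workbench.jpg", "rb") as f:
    frame = f.read()  # a single camera frame of the scene to be planned over

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",
    contents=[
        types.Part.from_bytes(data=frame, mime_type="image/jpeg"),
        "Plan the steps to sort the items on the bench into compost, recycling "
        "and trash, checking local recycling guidelines where needed.",
    ],
    config=types.GenerateContentConfig(
        # Google Search exposed as a tool, mirroring the planner's tool access.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)  # the multi-step plan returned by the planner model
```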
This development marks a step toward integrating generative AI into robotics, replacing traditional control interfaces with natural-language instruction while attempting to separate planning from execution to reduce errors.