We’ve all seen movies with the stereotypical AI robot that looks and talks like a human but is out to get someone or cause damage. It’s not a new concept in the realm of science fiction. But the reality looks very different. Advanced robots like the ones shown in movies decades ago still aren’t possible, and there are many reasons for that. The field of robotics is one with many limitations.
While we do have pre-programmed robots that follow predefined paths, like those that assemble cars or vacuum our floors, we want robots that can help us with anything: general-purpose robots. Hopefully, they will be kinder than their cinematic counterparts. Building general-purpose robots comes with real challenges, and Google has developed a transformer model to bridge those gaps.
Challenges With General Purpose Robotics
One of the problems with creating these kinds of robots is that you have to explicitly train a computer vision system to recognize each kind of object and scenario. Then, you have to provide it with a precise list of instructions to execute when that object or situation is recognized. This is a time-consuming and ultimately infeasible process, given the randomness of real life. There will always be unforeseen circumstances.
For example, if you’re building a robot that picks up trash, you don’t want it to pick up food items that aren’t trash. Yet it’s very difficult for a robot to distinguish a full bag of chips, which is not trash, from a half-empty bag that needs to be thrown away. Discerning between the two requires a degree of reasoning, a trait that humans have but robots don’t. Even if you account for that specific scenario and explicitly train the robot to deal with half-empty bags of chips, it might still fail to tell a full bag from a half-empty one.
The technical challenge here is that high-level reasoning models like ChatGPT, which can understand the difference between a full bag of chips and an empty one that has become trash, aren’t aligned with the low-level software that drives a robot’s physical actions.
To bridge this gap, Google recently developed a novel kind of AI model that brings these functionalities together. Called RT-2, or Robotics Transformer 2, it is a first-of-its-kind model trained on text and images from the web as well as real robot motion data, and it can directly output robotic actions. Google’s innovation has implications for any field that relies on robotics, including healthcare, logistics, manufacturing, and more.
Google’s RT-2 Model
In short, Google built an AI model that translates high-level reasoning into low-level, machine-executable instructions (move this joint 30 degrees, change the position of this from X to Y, and so on). They achieved this by pretraining RT-2 on web data covering a vast range of situations a robot might encounter, allowing the robot to draw on that knowledge when something novel arises. This makes RT-2 extremely powerful at handling unforeseen situations, making robots that run on this model much more useful as all-purpose machines.
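To make the idea of “low-level, machine-executable instructions” concrete, here is a minimal Python sketch of how a model’s discrete action tokens could be decoded into joint commands. The token scheme, function names, and value ranges are illustrative assumptions, not Google’s actual RT-2 interface.

```python
# Hypothetical sketch: decoding a VLA model's discrete action tokens into
# low-level joint commands. The 0-255 token range and +/-30 degree mapping
# are assumptions for illustration, not RT-2's real action encoding.
from dataclasses import dataclass


@dataclass
class ArmCommand:
    joint: int            # which joint to move
    delta_degrees: float  # relative rotation to apply


def decode_action_tokens(tokens: list[int]) -> list[ArmCommand]:
    """Map each integer token (0-255) to a small joint rotation."""
    commands = []
    for joint, token in enumerate(tokens):
        # Scale token 0..255 to a rotation in [-30, +30] degrees (illustrative).
        delta = (token / 255.0) * 60.0 - 30.0
        commands.append(ArmCommand(joint=joint, delta_degrees=delta))
    return commands


if __name__ == "__main__":
    # Pretend the model emitted these tokens for "pick up the half-empty bag".
    fake_tokens = [200, 40, 128]
    for cmd in decode_action_tokens(fake_tokens):
        print(f"joint {cmd.joint}: rotate {cmd.delta_degrees:+.1f} degrees")
```

The point of the sketch is simply that once actions are expressed as tokens, the model’s output can be mapped mechanically onto motor commands, with no separate hand-written controller per scenario.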
RT-2 is built on previous work from Google and others. Google calls RT-2 the first-ever VLA, or vision-language-action, model, indicating that it can translate easily between visuals, language, and robotic action. VLA models are built on top of VLMs, an earlier class of vision-language models: general models trained on web-scale data sets to translate between images and language. In essence, RT-2 combines VLM pretraining with robot data that allows it to directly control a robot.
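The core trick is that robot actions can be written out as tokens in the same output space the model already uses for language. Below is a hedged sketch of what a single training example might look like under that assumption; the discretization scheme and field names are hypothetical, not the paper’s actual format.

```python
# Illustrative only: a VLA model treats an action the way a captioning model
# treats a caption -- as a string of tokens to predict from an image + prompt.
def action_to_text(dx: float, dy: float, dz: float, gripper: int) -> str:
    """Encode an end-effector move as a plain string of discretized tokens."""
    def bin256(v: float, lo: float = -1.0, hi: float = 1.0) -> int:
        # Clamp and discretize the value into 256 bins (exact scheme varies).
        v = max(lo, min(hi, v))
        return round((v - lo) / (hi - lo) * 255)

    return f"{bin256(dx)} {bin256(dy)} {bin256(dz)} {gripper}"


# A training example pairs an image and instruction with the action string,
# so the same transformer that learned from web data can learn to "speak" actions.
example = {
    "image": "camera_frame.png",  # hypothetical file path
    "instruction": "pick up the empty chip bag",
    "target_text": action_to_text(0.12, -0.05, 0.30, gripper=1),
}
print(example["target_text"])
```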
It’s also worth noting that RT-2 and VLAs are transformer models, the class of machine learning models that also includes ChatGPT and most LLMs. Transformers are great at transferring “learned concepts” from their training data to unforeseen scenarios. This is why ChatGPT is still great at answering specific questions it hasn’t seen before in its training: it’s able to generalize.
While RT-2 is still a research project and not a product, there’s a clear pathway for this technology to impact markets. It could reduce the failure rate and increase the flexibility of robots deployed in healthcare, manufacturing, and other commercial environments. It could eventually power autonomous robots like Boston Dynamics’ Spot, which is already used at accident scenes and by the military, as well as drones used across various industries. It may play a role in autonomous vehicles. And, of course, it lays the foundation for robot companions that can help us with everyday tasks.
I’m excited to see how innovation in AI and transformer models will continue to trickle down into other fields, including robotics. RT-2 is no doubt a huge step closer to that sci-fi vision of robots that are actually useful day-to-day.