Google's PaLM-SayCan: A Revolutionary Step in Robotics and AI
Introduction to Google’s Innovative Robotics
Google is embarking on an exciting journey, blending artificial intelligence (AI) with robotics in unprecedented ways.
Historically, AI was seldom associated with humanoid robots, despite popular belief. However, in recent years, major tech players have heavily invested in AI-driven robots. Unlike simpler devices like Roomba, which serve specific functions, these companies are focusing on creating more humanoid machines.
Boston Dynamics, a pioneer in robotics, introduced an advanced version of Atlas in 2021, showcasing considerable motor skills, including the ability to perform somersaults. Agility Robotics, backed by Amazon, has developed Digit, a versatile robot capable of performing warehouse tasks efficiently. Samsung's Bot Handy aims to tackle household chores requiring intricate manual dexterity. Xiaomi is also venturing into this space with CyberOne, a conversational robot reminiscent of Tesla's Optimus, which is itself set to debut soon at Tesla AI Day.
The growing trend among leading tech firms to develop humanoid robots is noteworthy. There are compelling reasons to design robots with human-like features, considering that our world is tailored to our dimensions and movements. These initiatives reflect the industry’s ambition to create robots that can help eliminate hazardous, monotonous tasks or assist us in our daily lives.
However, this discussion is not solely about humanoid robots. It also addresses an innovative robotic approach that has not been widely adopted yet. I am referring to the fusion of advanced AI systems—especially language models—with full-bodied robots capable of navigating their surroundings. This connection between cognitive understanding and physical action represents a promising frontier.
The Challenge of Integrating AI and Robotics
Many AI companies shy away from robotics (for example, OpenAI disbanded its robotics division last year), while robotics firms often limit their robots to basic tasks or environments. One major reason for this discrepancy is Moravec's paradox: sensorimotor and perceptual tasks that feel effortless to us, such as picking up an object, are surprisingly difficult for robots, while seemingly complex cognitive challenges, like playing strategic board games, are comparatively easy.
To humans, it seems intuitive that mastering calculus is more complex than catching a ball. However, calculus is a recent development in our evolutionary history, and we have not yet fully adapted to it. As Marvin Minsky, a foundational figure in AI, noted, "We tend to recognize simple processes that fail more than we notice complex processes that succeed seamlessly." In summary, creating robots that can adeptly move and interact with their environments remains a significant challenge, with limited advancements made over the past decades.
Google, however, is striving to break this cycle. Partnering with Everyday Robots, the tech giant has introduced PaLM-SayCan (PSC), a system that pairs its PaLM language model with a mobile manipulator robot, giving it capabilities that surpass those of its predecessors.
My interest in Google's approach stems from my belief that the integration of virtual AI systems with real-world robotics is the logical progression for both fields. While some researchers advocate for scaling AI to achieve human-level intelligence, I argue that grounding AI in real-world experiences is crucial for overcoming limitations and enhancing capabilities. This grounding is essential for reasoning and comprehension, which require tacit knowledge gained through interaction with the world.
(Note: For further exploration of this topic, I recommend my earlier post, "Artificial Intelligence and Robotics Will Inevitably Merge.")
PaLM-SayCan: A New Era of Robotics
Google's PSC exemplifies the company's recognition that merging AI with robotics is a vital path forward. Rather than abandoning pure AI, Google is revitalizing the potential of AI-robotics integration to develop more competent intelligent systems. This concept is akin to training multimodal models, which are increasingly regarded as the next logical step in deep learning. Just as AIs capable of "seeing" and "reading" outperform those limited to a single mode of information, robots that can act in addition to perceiving will perform better in our physical world.
Let’s delve into what makes Google’s PSC unique and how it adeptly combines the strengths of large language models with a physical robot's dexterity and action capabilities.
Understanding PaLM-SayCan’s Functionality
At its core, PSC integrates PaLM’s expertise in natural language processing (similar to GPT-3 and LaMDA) with the robot's capacity to interact with its surroundings. PaLM serves as the bridge between human commands and robotic execution.
In technical terms, PaLM empowers the robot to undertake intricate tasks. For instance, a simple request like "bring me a snack" encompasses numerous actions and requires interpretation, as the specifics of "which snack" are often left unstated.
PaLM enhances the robot's task execution by converting natural language commands into structured tasks and breaking them down into actionable steps. While robots like Atlas and Digit excel at straightforward tasks, they struggle with complex, multi-step commands without explicit programming. PSC, on the other hand, is designed for such challenges.
In turn, the robot supplies PaLM with situational context about its environment and capabilities, informing the language model of feasible actions based on real-world conditions.
PaLM defines what is useful, while the robot identifies what is possible. This dual input is key to Google's innovative approach, positioning the company at the forefront of this integration, despite PSC still being a research prototype compared to fully developed products like Atlas and Digit.
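To make this interplay concrete, here is a minimal Python sketch of the scoring idea, not Google's actual code: every candidate skill receives a usefulness score from the language model and a feasibility score from the robot's value function, and the product of the two decides what happens next. The skill names and numbers below are invented purely for illustration.

```python
# Minimal sketch of SayCan-style skill selection (illustrative, not Google's code).
# The usefulness and feasibility numbers below are made up.

# Low-level skills the robot knows how to execute.
SKILLS = ["find an apple", "find a water bottle", "go to the store", "pick up an empty glass"]

def llm_usefulness(instruction: str, skill: str) -> float:
    """Stand-in for PaLM: how useful is this skill as the next step?
    In the real system this comes from the language model's likelihood of the
    skill text given the instruction and the steps taken so far."""
    fake_scores = {
        "find an apple": 0.40,
        "find a water bottle": 0.35,
        "go to the store": 0.20,         # plausible in language terms...
        "pick up an empty glass": 0.05,  # ...or simply not what was asked for
    }
    return fake_scores[skill]

def affordance(skill: str) -> float:
    """Stand-in for the robot's value function: probability that the skill
    can actually be completed from the current state."""
    fake_probs = {
        "find an apple": 0.9,
        "find a water bottle": 0.9,
        "go to the store": 0.0,          # the robot cannot leave the building
        "pick up an empty glass": 0.8,
    }
    return fake_probs[skill]

def choose_next_skill(instruction: str) -> str:
    """Combine 'useful' (language model) with 'possible' (affordances)."""
    scored = {s: llm_usefulness(instruction, s) * affordance(s) for s in SKILLS}
    return max(scored, key=scored.get)

print(choose_next_skill("I just worked out; bring me a snack and a drink"))
# -> "find an apple"
```

Multiplying the two scores is the key move: a skill that is useful but impossible (going to the store) or possible but useless (grabbing an empty glass) loses out to one that is both.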
Exploring the Capabilities of PaLM-SayCan
To illustrate PSC’s functionality, consider an example from Google researchers’ experiments, as detailed in their paper "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." A typical human request might be: "I just worked out; please bring me a snack and a drink to recover."
While this task is straightforward for a human, a conventional robot would struggle to interpret it. PSC utilizes PaLM's language capabilities to refine this request into a structured, high-level task that can be divided into manageable steps. For example, PaLM might deduce that it should "bring the person an apple and a water bottle."
PaLM's role is to mediate between the nuances of human language and the rigid commands that a robot can process. After defining a task, PaLM generates a sequence of steps for the robot to follow. However, being a virtual AI, it lacks direct world knowledge and may propose suboptimal solutions that do not account for the environment.
This is where the robot’s affordances come into play. Trained to understand its physical limitations, the robot collaborates with PaLM by prioritizing actions that are feasible within its current context. While PaLM may emphasize useful actions, the robot focuses on what is achievable, allowing PSC to determine the most effective course of action.
Returning to the snack example, after concluding that it should bring an apple and a water bottle, PaLM might suggest going to the store to buy an apple. However, the robot would dismiss this option if it cannot navigate the stairs. Conversely, the robot might suggest fetching an empty glass, which PaLM would reject as unhelpful since the person requested water, not just a glass. By balancing the useful and possible actions, PSC ultimately decides to retrieve the apple and water from the kitchen. This process iterates, guiding PSC closer to completing the task with each step.
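The iteration described above can be sketched as a simple greedy loop: score every skill against the instruction plus the steps chosen so far, execute the winner, and repeat until a "done" skill wins. The scorer and skill list below are toy stand-ins for PaLM and the robot's value functions, again purely for illustration.

```python
from typing import Callable, List

def plan(instruction: str,
         score_skill: Callable[[str, str], float],
         skills: List[str],
         max_steps: int = 10) -> List[str]:
    """Greedily build a plan one skill at a time: after each step the chosen
    skill is appended to the context, so the next choice is scored against
    the full history, until the 'done' skill scores highest."""
    steps: List[str] = []
    for _ in range(max_steps):
        context = instruction + " | steps so far: " + ", ".join(steps)
        best = max(skills, key=lambda s: score_skill(context, s))
        if best == "done":
            break
        steps.append(best)
    return steps

# Toy demo: a hard-coded scorer that walks through the snack-and-drink example.
ORDER = ["find an apple", "pick up the apple", "find a water bottle",
         "pick up the water bottle", "bring them to the person", "done"]

def toy_scorer(context: str, skill: str) -> float:
    # Prefer the first skill in the canonical order that hasn't been done yet.
    for s in ORDER:
        if s not in context:
            return 1.0 if skill == s else 0.0
    return 1.0 if skill == "done" else 0.0

print(plan("Bring me a snack and a drink", toy_scorer, ORDER))
# -> ['find an apple', 'pick up the apple', 'find a water bottle',
#     'pick up the water bottle', 'bring them to the person']
```

In the real system each chosen step is also physically executed before the next one is scored, which is exactly where a dropped apple or a blocked hallway can derail the plan, as discussed below.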
In trials against two alternatives across 101 instruction tasks, researchers found that PSC, utilizing PaLM with affordance grounding, succeeded in choosing the correct sequence of actions 84% of the time and executing them correctly 74% of the time, halving the error rate compared to systems like FLAN and those without robotic grounding.
These results indicate a promising advancement in merging cutting-edge language AI with robotics, enhancing our ability to comprehend and navigate the world.
Limitations of PaLM-SayCan: The Challenge of Evolution
Despite its achievements, PSC faces significant challenges. While peripheral systems such as speech recognition and vision sensors are crucial for the robot's functionality, their effectiveness can vary due to external factors, like changes in lighting impacting object detection.
A notable issue is language models’ lack of true understanding. For example, while PaLM may accurately interpret a request for a snack, it operates on a superficial level without grasping the deeper meaning or context. It essentially functions as an advanced autocomplete tool, predicting the next word in a sequence without real comprehension of intentions or implications.
Moreover, if the robot missteps during task execution—say, it drops the apple—does PSC possess a mechanism to reassess and adjust its actions? The current answer is no. Experiments have been conducted in highly controlled lab settings, and should PSC operate in the real world, it would face a multitude of unpredictable variables, from moving objects to environmental irregularities.
Although PSC serves as a proof of concept, moving from a controlled environment to practical application involves complexities beyond mere quantitative adjustments.
In addition to these challenges, PaLM, like other language models, risks perpetuating biases learned during training. Recent studies suggest that biases can extend beyond language to influence robotic behavior, resulting in unintended, biased actions. This presents a complex problem since biases in action are often more subtle and difficult to identify.
Finally, Google emphasizes safety and responsible AI development. While PSC includes mechanisms to mitigate unsafe or biased actions, these issues are widespread, and there is no universal solution. Although PSC represents a pioneering effort in the realm of AI-powered robotics, it still grapples with these persistent challenges.
To further explore these advancements, watch the following videos:
The ROBOT that listens to you! PaLM-SayCan - YouTube.
Grounding language in robotic affordances - YouTube.
Stay connected with The Algorithmic Bridge, a newsletter dedicated to exploring the intersection of algorithms and everyday life. You can also support my work on Medium directly and enjoy unlimited access by becoming a member through my referral link here! :)