AI's Initial Steps Towards Embodied Intelligence
The Convergence of Two Realms
We are currently witnessing a remarkable advancement in robotics, driven by two significant developments: Figure AI's conversational robot and Google's versatile agent, SIMA. However, it’s important to clarify that these innovations do not equate to Artificial General Intelligence (AGI), despite some extravagant claims circulating. Additionally, I won't indulge in any sensationalized narratives about robots turning against humanity.
Nevertheless, the individual advancements of these technologies are quite compelling, and their potential interplay marks what I believe to be the first steps towards embodied intelligence for AI. Many of these observations have been shared in my weekly newsletter, TheTechOasis. If you're interested in staying informed about the dynamic landscape of AI and feeling motivated to engage with the future, consider subscribing.
Understanding Figure AI
So, what exactly is Figure AI? This robotics firm is on a mission to develop robots aimed at "eliminating the necessity for hazardous and undesirable jobs, enabling future generations to lead happier, more purposeful lives," according to their CEO.
While many companies boast similar ambitious goals, the backing of high-profile investors—including OpenAI, NVIDIA, Jeff Bezos, and Intel, who collectively invested $675 million at a valuation of $2.6 billion—signals that Figure AI is onto something significant.
The buzz surrounding this company can largely be attributed to a remarkable demonstration video showcasing a robot that engages with humans while executing tasks with impressive agility. At its core, Figure AI's robot utilizes GPT-4V, OpenAI's advanced Multimodal Large Language Model (MLLM).
In essence, this robot represents the first instance of an 'embodied ChatGPT,' signifying that LLMs can now act in a physical context as well. A noteworthy moment in the demonstration occurs when a human asks the robot to perform a new action while it is still explaining the reasoning behind its previous one. This is not merely a simple query-and-response exchange; it highlights the model's ability to manage multiple tasks simultaneously.
It appears that Figure AI has fine-tuned GPT-4 to generate both text and action outputs—the former being transformed into speech through a vocoder and the latter into actuator movements. Although the specifics of Figure AI's robot mechanisms remain largely unknown, we can draw parallels to DeepMind’s RT-2 model, which has demonstrated how to train LLMs to produce robot actions or text outputs.
Given that RT-2 was trained using models that do not approach the capabilities of GPT-4, one could argue that the model behind Figure AI's robots is the most sophisticated vision-language-action framework to date.
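While Figure AI has not published how its model is wired, the RT-2 paper offers a useful mental model: robot actions are discretized and written out as ordinary tokens, so the same autoregressive decoder can emit either words or movements. The sketch below illustrates only that idea; the bin count, value ranges, and token names are assumptions of mine, not anything Figure AI or DeepMind has disclosed.

```python
import numpy as np

# Hypothetical RT-2-style action tokenization: continuous robot actions are
# discretized into a fixed number of bins so a language model can emit them
# as ordinary tokens. Bin count and action range are illustrative assumptions.
N_BINS = 256                          # assumed number of discretization bins
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized actuator range

def action_to_tokens(action: np.ndarray) -> list[str]:
    """Map a continuous action vector (e.g., a 7-DoF arm command) to tokens."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    bins = np.round(
        (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (N_BINS - 1)
    ).astype(int)
    return [f"<act_{b}>" for b in bins]

def tokens_to_action(tokens: list[str]) -> np.ndarray:
    """Invert the mapping: decode action tokens back into actuator commands."""
    bins = np.array([int(t[len("<act_"):-1]) for t in tokens])
    return ACTION_LOW + bins / (N_BINS - 1) * (ACTION_HIGH - ACTION_LOW)

# The decoder's output stream can then interleave speech text and action tokens;
# a simple router sends words to the vocoder and action tokens to the actuators.
generated = ["I", "will", "hand", "you", "the", "apple", "."] + \
    action_to_tokens(np.array([0.1, -0.4, 0.7, 0.0, 0.2, -0.9, 1.0]))
speech = [t for t in generated if not t.startswith("<act_")]
actions = tokens_to_action([t for t in generated if t.startswith("<act_")])
print(" ".join(speech))
print(actions)
```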
However, we should maintain a realistic perspective: the demonstration was likely highly controlled and rehearsed, suggesting that the robot is still in the early stages of developing general-purpose capabilities. Thus, to envision the next phase in the evolution of these robots, we can look to Google's latest release.
SIMA: The Emergence of a True Generalist Agent?
Google's Scalable Instructable Multiworld Agent (SIMA) project aspires to develop AI systems capable of understanding and executing any language instruction within a 3D environment. This initiative tackles a key challenge in general AI advancement by integrating language comprehension with perception and embodied actions across diverse virtual realms, such as research settings and commercial video games.
In simpler terms, SIMA interprets language commands and translates them into keyboard and mouse actions within 3D spaces. Importantly, these agents strive to perform tasks in a manner akin to human actions based solely on natural language instructions from users, like "collect resources and construct a house."
This means the model operates under the same constraints as humans, utilizing identical inputs when engaging in these environments—without access to underlying APIs or other advantages. Therefore, its ability to fulfill requests relies on accurately predicting the keyboard and mouse actions a human would take to complete those tasks.
SIMA comprises several components:
- Encoders: a text encoder that converts language instructions into embeddings the model can work with, an image encoder inspired by the recent SPARC method, and a video encoder.
- Multi-modal Transformer + Transformer XL: two transformer architectures that handle cross-attention between modalities, with the latter maintaining a memory of previous states so the agent can work out the current one.
- The policy: the head that turns the processed state into the keyboard and mouse actions the agent executes, covering a repertoire of roughly 600 skills (a toy sketch of the full pipeline follows this list).
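To make the pipeline more tangible, here is a minimal PyTorch sketch of how these pieces could fit together. The layer sizes, the plain linear projections standing in for the pretrained encoders, the GRU standing in for Transformer-XL's memory, and the size of the discrete action set are all assumptions for illustration; DeepMind has not released SIMA's implementation.

```python
import torch
import torch.nn as nn

class SimaLikeAgent(nn.Module):
    """Toy sketch of a SIMA-style pipeline: encode text/image/video, fuse them
    with a transformer, keep a memory of past states, and score keyboard/mouse
    actions. All sizes, and the GRU used in place of Transformer-XL, are
    assumptions made purely for illustration."""

    def __init__(self, d_model=256, n_actions=64):
        super().__init__()
        # Stand-ins for the pretrained text, image (SPARC-style) and video encoders.
        self.text_proj = nn.Linear(512, d_model)
        self.image_proj = nn.Linear(768, d_model)
        self.video_proj = nn.Linear(1024, d_model)
        # Multimodal transformer mixing the three streams via attention.
        fusion_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        # Simple recurrent memory standing in for Transformer-XL's state tracking.
        self.memory = nn.GRU(d_model, d_model, batch_first=True)
        # Policy head: scores over a discrete keyboard/mouse action set (size assumed).
        self.policy_head = nn.Linear(d_model, n_actions)

    def forward(self, text_emb, image_emb, video_emb, hidden=None):
        # Each input: (batch, seq, feature_dim) coming from its respective encoder.
        tokens = torch.cat([
            self.text_proj(text_emb),
            self.image_proj(image_emb),
            self.video_proj(video_emb),
        ], dim=1)
        fused = self.fusion(tokens)                  # cross-modal mixing
        state = fused.mean(dim=1, keepdim=True)      # pooled current state
        state, hidden = self.memory(state, hidden)   # temporal memory
        logits = self.policy_head(state.squeeze(1))  # action scores
        return logits, hidden

agent = SimaLikeAgent()
logits, h = agent(torch.randn(1, 8, 512), torch.randn(1, 16, 768), torch.randn(1, 4, 1024))
print(logits.shape)  # torch.Size([1, 64])
```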
Breaking Down the Processing of Information
As with most cutting-edge models today, the initial phase involves "encoding" the inputs. This means transforming the input data (text and video) into vector embeddings—a sequence of numbers—using their respective encoders.
Why is this transformation necessary? It allows each element to be represented in a dense vector that captures the semantic essence of the concept. Essentially, semantically similar concepts will produce similar vectors, placing them closer together in the vector space.
This transformation also converts the process of "understanding our world," which humans do unconsciously, into a mathematical exercise suitable for computers, as concepts are now represented numerically. The model then calculates the similarity between these vectors (the distance between them) to determine which concepts are related.
For instance, this enables the model to recognize that the terms 'dog' and 'cat' refer to similar concepts with shared attributes (mammals, domesticated, four legs, etc.).
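Here is a tiny numerical example of that similarity computation, using made-up four-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

# Toy "embeddings"; the numbers are invented purely to show the idea.
embeddings = {
    "dog": np.array([0.9, 0.8, 0.1, 0.0]),
    "cat": np.array([0.8, 0.9, 0.2, 0.0]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    """Similarity as the cosine of the angle between two vectors: 1 = same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["dog"], embeddings["cat"]))  # high (~0.99)
print(cosine_similarity(embeddings["dog"], embeddings["car"]))  # low  (~0.12)
```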
It is crucial to grasp the significance of this 'similarity' principle, as it forms the foundation of everything that follows. If you understand this concept of representing something as "a sum of its similars," you've encountered the core of the attention mechanism that powers ChatGPT and other leading LLMs.
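For the curious, this is what "a sum of its similars" looks like in code: a bare-bones scaled dot-product attention, where each output is a weighted combination of value vectors and the weights come straight from similarity scores.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted sum of the rows of V, where the weights
    reflect how similar (dot product) the query is to each key."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                   # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                                        # "sum of its similars"

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 querying tokens
K = rng.normal(size=(5, 8))   # 5 tokens to attend over
V = rng.normal(size=(5, 8))   # their value vectors
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```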
In multi-modal contexts like this, where vectors originate from different data types, the similarity principle is equally vital. For example, both the word 'dog' and an 'image of a dog' should have corresponding vectors to indicate they refer to the same concept.
Given that text and images differ structurally, distinct encoders are required. This poses a challenge, as separate encoders must be trained to ensure they yield similar embeddings for analogous concepts. To address this, SIMA employs a SPARC image encoder.
What is SPARC? It represents a recent breakthrough in image encoding, trained similarly to conventional encoders through contrastive learning, but excelling at capturing fine-grained details.
Contrastive learning is a prevalent training method for models handling text and images. It pushes similar concepts closer while distancing dissimilar ones. By analyzing millions of images with their textual descriptions, it learns to identify what an image depicts.
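In code, the standard CLIP-style version of this objective looks roughly like the sketch below; the temperature value and embedding sizes are conventional choices of mine, not SPARC-specific details.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style contrastive objective: matching image/caption pairs (the
    diagonal of the similarity matrix) are pulled together, every other
    pairing in the batch is pushed apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # batch x batch similarities
    targets = torch.arange(len(logits))             # i-th image matches i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy batch of 4 image/caption embedding pairs (random stand-ins for encoder outputs).
loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss.item())
```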
However, traditional image encoders often fail to capture intricate details. While they may identify the general content of an image, they overlook smaller elements that, while not essential for conveying the overarching semantics, hold significance in various contexts.
SPARC introduces an innovative twist: when given an image-text pair, it divides the image into patches and assigns tokens from the text description to each patch. For example, if a patch covers a dog's body, it will be linked to the word "dog." This is repeated for every patch, with weighted assignments for patches that cover several concepts at once.
Once each patch is associated with specific concepts, the patch representations are aggregated according to those weights and compared to the corresponding word embedding using the similarity principle. This localized approach enables the model to determine both which concepts appear in the image and where they are located.
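As a rough illustration of that grouping step, the sketch below scores every patch against every caption token, uses those scores as weights to build one patch-based embedding per token, and compares it back to the token itself. The real SPARC objective adds sparsification and a global contrastive term, so treat this as a simplified picture with made-up shapes.

```python
import torch
import torch.nn.functional as F

def sparc_style_alignment(patch_emb, token_emb):
    """Simplified sketch of SPARC-like fine-grained alignment:
    1) score how much each image patch relates to each caption token,
    2) use those scores as weights to build one patch-based embedding per token,
    3) compare that grouped embedding back to the token embedding itself."""
    patch_emb = F.normalize(patch_emb, dim=-1)   # (num_patches, dim)
    token_emb = F.normalize(token_emb, dim=-1)   # (num_tokens, dim)
    sim = token_emb @ patch_emb.T                # (num_tokens, num_patches)
    weights = sim.softmax(dim=-1)                # each token weights the patches
    grouped = weights @ patch_emb                # language-grouped patch embeddings
    grouped = F.normalize(grouped, dim=-1)
    # Per-token similarity: high when the patches a token points at really encode it.
    return (grouped * token_emb).sum(dim=-1)     # (num_tokens,)

# Toy example: 16 patches and 5 caption tokens, both already embedded (randomly here).
scores = sparc_style_alignment(torch.randn(16, 64), torch.randn(5, 64))
print(scores)
```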
This specificity is crucial for SIMA, as the 3D agent must recognize individual objects in its environment to interact with them effectively as part of the tasks requested.
The video encoder is included to provide temporal awareness, which text and image encoders lack. This allows the model to consider both the current and previous states of the environment when determining the next action.
For instance, while lighting a match may seem appropriate as a next step, it would be ill-advised if the last action was to cover the floor with gasoline.
Choosing the Optimal Policy
With the gathered information, SIMA utilizes a series of transformer models to process representations generated by the various encoders. Instead of predicting words like an LLM would, this model produces the policy dictating the keyboard and mouse actions the agent will execute.
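DeepMind does not spell out SIMA's exact action space, so the following is only a guessed illustration of what "predicting keyboard and mouse actions" could look like: a factored head that picks a key (or none) and a binned mouse movement at each control step.

```python
import torch
import torch.nn as nn

class KeyboardMouseHead(nn.Module):
    """Hypothetical factored action head: one classifier picks which key to press
    (or none), two others pick binned mouse movements. The key list, bin count,
    and the factorization itself are assumptions, not SIMA's published design."""

    KEYS = ["none", "w", "a", "s", "d", "space", "left_click", "right_click"]

    def __init__(self, d_model=256, mouse_bins=11):
        super().__init__()
        self.key_head = nn.Linear(d_model, len(self.KEYS))
        self.mouse_x_head = nn.Linear(d_model, mouse_bins)  # binned horizontal movement
        self.mouse_y_head = nn.Linear(d_model, mouse_bins)  # binned vertical movement

    def forward(self, state):
        return self.key_head(state), self.mouse_x_head(state), self.mouse_y_head(state)

head = KeyboardMouseHead()
key_logits, mx_logits, my_logits = head(torch.randn(1, 256))
action = {
    "key": KeyboardMouseHead.KEYS[key_logits.argmax(-1).item()],
    "mouse_dx": mx_logits.argmax(-1).item() - 5,  # center bin (5) = no movement
    "mouse_dy": my_logits.argmax(-1).item() - 5,
}
print(action)
```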
One might wonder why a unique set of Transformers was chosen as the model's 'brain' instead of simply using Gemini, Google's LLM. Budget considerations likely played a role, as the researchers acknowledge in their technical report that the next logical progression for SIMA would be to integrate Gemini.
This is fascinating, as even without the most advanced 'brain' available, the results obtained are noteworthy.
A True Generalist
The ultimate aim was to create an agent proficient in any game it encountered, even those it had never played before. After training, the SIMA agent could perform up to 600 distinct basic tasks, categorized into various domains like navigation, animal interaction, and food collection.
SIMA can be seen executing some of these tasks in DeepMind's demonstration videos. Notably, SIMA achieved impressive outcomes: despite being trained across multiple games simultaneously, it generally outperformed agents specialized in single games within those specific environments.
More remarkably, the agent also performed strongly zero-shot, in games held out from its training entirely, in many cases surpassing the specialized agents.
This suggests that the model effectively transfers knowledge gained from one game to another. In simpler terms, it means that the skills learned in some games can be successfully applied to others, like mastering keyboard controls.
The quality of these skills is so high that the generalist agent often outperformed specialized ones, indicating that this generalist approach cultivates superior competencies applicable across diverse environments.
Overall, these impressive results gain significance when considered alongside the advancements from Figure AI.
A Promising Week for Robotics
The field of AI robotics is advancing rapidly. On one hand, Figure AI demonstrates progress in creating humanoid robots capable of performing a growing array of manual tasks. On the other hand, SIMA reveals that we are beginning to see the emergence of true generalist agents within 3D environments.
What becomes evident is the potential for collaboration between these developments. While we may not yet be ready to deploy these agents in real-world scenarios, the convergence of these two domains represents a natural progression; SIMA serves as a training ground, while Figure AI robots embody the capabilities of generalist agents.
With other companies also exploring the concept of embodied intelligence, it is clear that the race is on. Many industry leaders believe that the technologies we currently possess are sufficiently advanced to tackle the next major challenge: integrating AI into everyday life.
As a final note, if you found this article engaging, I share similar insights in a more accessible format on my LinkedIn. Feel free to connect with me on X as well. I look forward to engaging with you.