Building the general-purpose robotic brain

By Skild AI Team · 29 July 2025

Modern AI is confined to the digital world. Current physical AI systems are tightly coupled with specific robot designs and narrow tasks. At Skild AI, we are building towards AGI for the real world, unconstrained by robot type or task – a single, omni-bodied brain. While we push ahead, we want to share our vision, technical achievements and breakthroughs with the world.

What are we building?

Omni-bodied Intelligence: towards any task, any robot, one brain.
One of the key defining features of Skild’s robotics foundation model is its ability to generalize across tasks and hardware. Rather than focusing on specific hardware or a single form factor, we train models that work across different morphologies (including human data, since humans are also a form of robot!), vastly expanding the available training set. This allows us to build a truly omni-bodied foundation model that works on quadrupeds, humanoids, table-top arms, mobile manipulators, and more. Working with diverse morphologies not only unlocks more training data, it also provides robustness to changes or failures in hardware.

Our foundation model, the Skild Brain, follows a hierarchical architecture: (1) a low-frequency, high-level manipulation and navigation policy that provides inputs to (2) a high-frequency, low-level action policy. The low-level policy translates high-level commands into the precise joint angles and motor torques that drive the robot’s body. Trained on diverse data, tasks, and hardware, our omni-bodied brain is then fine-tuned and distilled to meet deployment needs.
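To make the hierarchy concrete, below is a minimal sketch of such a two-level control loop. This is purely illustrative: the class names, update rates, command shapes, and joint counts are our own assumptions, not the Skild Brain’s actual interfaces.

```python
# Minimal sketch of a two-level control hierarchy (illustrative only).
# Class names, rates, and shapes are assumptions, not Skild AI's actual code.
import numpy as np

class HighLevelPolicy:
    """Low-frequency policy: maps observations (images, proprioception, a task
    prompt, ...) to an abstract command such as a base velocity or end-effector target."""
    def __init__(self, rate_hz: float = 5.0):
        self.rate_hz = rate_hz

    def act(self, observation: dict) -> np.ndarray:
        # Placeholder: a real system would run a learned network here.
        return np.zeros(6)  # e.g. desired [vx, vy, vyaw, ee_x, ee_y, ee_z]

class LowLevelPolicy:
    """High-frequency policy: turns the high-level command plus proprioception
    into joint-level targets or torques for a specific robot body."""
    def __init__(self, num_joints: int, rate_hz: float = 200.0):
        self.num_joints = num_joints
        self.rate_hz = rate_hz

    def act(self, command: np.ndarray, proprio: np.ndarray) -> np.ndarray:
        # Placeholder: a real controller would be a learned, morphology-aware network.
        return np.zeros(self.num_joints)

# The low-level policy takes many steps for every high-level update.
high = HighLevelPolicy()
low = LowLevelPolicy(num_joints=12)  # e.g. a quadruped with 12 actuated joints
steps_per_command = int(low.rate_hz / high.rate_hz)

command = high.act({"image": None, "proprio": np.zeros(12)})
for _ in range(steps_per_command):
    torques = low.act(command, proprio=np.zeros(12))
    # here the torques would be sent to the robot's actuators
```

A common motivation for this kind of split is that the slow, perception-heavy reasoning and the fast, body-specific control can run at different rates and be adapted separately for each robot body.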

A true Robotics Foundation Model: more than just a VLM in disguise
One of the biggest challenges in building a robotics foundation model is the lack of large-scale robotics data. To make matters worse, collecting real-world data on hardware is slow and prohibitively expensive. Many researchers and competitors therefore side-step the problem: start with an existing vision-and-language model (VLM) and sprinkle in less than 1% real-world robot data to build a “robotics foundation model”. But is this a true robotics foundation model? Does it have information about actions? No. VLMs carry a lot of semantic information but, like a Potemkin village, they lack the true substance of grounded, actionable information. That is why most “robotics foundation models” showcase semantic generalization in pick-and-place-style tasks but lack true physical common sense.

How does one obtain the scale of action data needed to build a true robotics foundation model? Over the past decade, our team has tackled this challenge head-on (see our teaser video for our team’s past research highlights). In previous work, our team members have not only pioneered scalable real-world data collection strategies such as self-supervised robots and imitation learning [1, 2, 3, 4] but also explored alternatives such as internet videos [5, 6] and large-scale simulation [7, 8, 9, 10]. Over this decade, one thing has become crystal clear: scale does not mean millions or billions of examples; achieving scale requires trillions of examples, and there is no way real-world data alone can reach this scale in the near future¹. At Skild AI, we tackle this challenge by using large-scale simulation and internet video data to pre-train our omni-bodied brain. We then post-train this foundation model on targeted real-world data to deliver working solutions to our customers.
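As a rough sketch of that pre-train/post-train recipe, the following is hypothetical: the mixture weights, step counts, and stub classes are assumptions made for illustration, not our actual training code.

```python
# Hypothetical sketch of pre-training on scalable data (simulation + internet
# video) followed by post-training on a small amount of targeted real-world data.
import random

PRETRAIN_MIXTURE = {
    "simulation_rollouts": 0.7,  # large-scale simulation across many morphologies
    "internet_videos": 0.3,      # human videos as a prior over tasks and motion
}

class StubDataset:
    """Stand-in for a data source; a real one would yield (observation, action) batches."""
    def sample_batch(self):
        return []

class StubModel:
    """Stand-in for the foundation model; a real one would take a gradient step."""
    def update(self, batch):
        pass

def pretrain(model, datasets, mixture=PRETRAIN_MIXTURE, steps=1000):
    """Pre-train on cheap-to-scale sources, sampled according to the mixture."""
    names, weights = zip(*mixture.items())
    for _ in range(steps):
        source = random.choices(names, weights=weights)[0]
        model.update(datasets[source].sample_batch())
    return model

def posttrain(model, real_dataset, steps=100):
    """Fine-tune on a comparatively small amount of targeted real-world robot data."""
    for _ in range(steps):
        model.update(real_dataset.sample_batch())
    return model

brain = posttrain(
    pretrain(StubModel(), {name: StubDataset() for name in PRETRAIN_MIXTURE}),
    real_dataset=StubDataset(),
)
```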

How does it look? An evolution
The Skild Brain’s development has been characterized by the emergence of new behaviors, as well as improvements in robustness and generalization, achieved by training on an ever-growing set of data. Before demonstrating its latest capabilities, we would like to showcase the evolution of our omni-bodied brain and how it has iteratively improved in generalization, robustness, and adaptability over time. Today we are releasing our results from 2024, showcasing some of our early scalability and robustness milestones.

Over the next month, we’ll dive into the capabilities previewed in the teaser and show how robust our foundation model has become through continued training and algorithmic innovation, as well as the emergent behaviors the model has developed.

Who built it?
A passionate group of scientists & engineers driven by the goal of omni-bodied intelligence. We have been researching AI and robotics for more than a decade. Our team includes pioneers of self-supervised learning, curiosity-driven exploration, end-to-end sim2real for visual locomotion, dexterous manipulation, learning from human videos, robot parkour, and many more. Many of these works have won awards at top-tier AI and robotics conferences. Our team has also built production-ready systems at Anduril, Tesla, Nvidia, Meta, Kitty Hawk, Google, Everyday Robots, and Amazon.

We are looking for exceptional and passionate people to join our team. Please apply via our careers page at https://www.skild.ai/career.

¹ A simple back-of-the-envelope calculation shows that even if the whole population of Earth collected data, it would take years to reach 100 trillion trajectories.
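To see why, here is one hypothetical version of that arithmetic; the per-person collection rate is our assumption, chosen only to illustrate the order of magnitude.

```python
# Hypothetical back-of-the-envelope check (assumed numbers, illustration only).
population = 8e9                      # roughly everyone on Earth
trajectories_per_person_per_day = 10  # assumed per-person collection rate
target = 100e12                       # 100 trillion trajectories

days_needed = target / (population * trajectories_per_person_per_day)
print(f"{days_needed / 365:.1f} years")  # ~3.4 years under these assumptions
```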

References

[1] Pinto, Lerrel, and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." 2016 IEEE international conference on robotics and automation (ICRA). IEEE, 2016.

[2] Gandhi, Dhiraj, Lerrel Pinto, and Abhinav Gupta. "Learning to fly by crashing." 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017.

[3] Pathak, Deepak, et al. "Zero-shot visual imitation." Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2018.

[4] Pathak, Deepak, et al. "Curiosity-driven exploration by self-supervised prediction." International conference on machine learning. PMLR, 2017.

[5] Bahl, Shikhar, Abhinav Gupta, and Deepak Pathak. "Human-to-Robot Imitation in the Wild." Robotics: Science and Systems (RSS), 2022.

[6] Bahl, Shikhar, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. "Affordances from Human Videos as a Versatile Representation for Robotics." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[7] Agarwal, Ananye, et al. "Legged locomotion in challenging terrains using egocentric vision." Conference on robot learning. PMLR, 2022.

[8] Agarwal, Ananye, et al. "Dexterous functional grasping." Conference on robot learning. PMLR, 2023.

[9] Uppal, Shagun, et al. "Spin: Simultaneous perception interaction and navigation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.