By Skild AI Team • Jan 12, 2026

In the race to build truly general-purpose robotics, the industry has hit a familiar wall: the data bottleneck. While Large Language Models (LLMs) flourished by scraping the vast expanse of the internet, robotics has struggled to find its "Common Crawl." The prevailing strategy has been to scale teleoperation—human operators manually guiding robots to collect precise motor-torque sequences.
However, at Skild AI, we believe teleoperation alone cannot bridge the gap to foundation-model scale. It faces two fatal challenges: it is slow and costly to scale, and every trajectory it produces is tied to the specific robot body that recorded it.
Learning by Watching
If we look at biological intelligence, the solution is hiding in plain sight. Humans don't learn to make tea by being told the exact newton-meters of torque to apply to a kettle. We learn through observation. We already possess a foundation of kinematics and dynamics that lets us watch a visual demonstration, internalize the intent, and map those actions onto our own bodies.

The raw material for this kind of learning, human video, is already abundant. From first-person "egocentric" headcam footage to the millions of instructional videos on YouTube, the "internet-scale" dataset for robotics already exists. It just isn't "robot-native."
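To make the "not robot-native" point concrete, here is a minimal sketch of how raw human video might be packaged into fixed-length windows a video model could consume. The VideoClip container, its fields, and the window length are illustrative assumptions, not Skild AI's pipeline.

```python
# A minimal sketch (not Skild AI's pipeline) of turning a raw human video
# clip into fixed-length training windows a video model could consume.
# VideoClip, its fields, and the window length are illustrative assumptions.

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class VideoClip:
    frames: np.ndarray   # (T, H, W, 3) uint8 RGB frames
    description: str     # e.g. "pour water from the kettle into the cup"


def make_observation_windows(clip: VideoClip, window: int = 8) -> List[np.ndarray]:
    """Slice a clip into non-overlapping fixed-length frame windows."""
    num_frames = clip.frames.shape[0]
    return [clip.frames[t:t + window]
            for t in range(0, num_frames - window + 1, window)]


# Usage: a synthetic 32-frame clip yields four 8-frame windows.
clip = VideoClip(frames=np.zeros((32, 224, 224, 3), dtype=np.uint8),
                 description="pour water from the kettle into the cup")
windows = make_observation_windows(clip)
print(len(windows), windows[0].shape)   # -> 4 (8, 224, 224, 3)
```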
The Embodiment Gap
The reason the industry hasn't shifted entirely to video data is that it's technically daunting. Unlike text, video data for robotics suffers from an embodiment gap: the demonstrations are performed by human bodies rather than robot hardware, and the raw footage contains none of the motor commands a robot can execute directly.
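One toy illustration of why the gap matters (a sketch, not Skild AI's method): even a simple step like mapping a human wrist trajectory estimated from video into a robot's workspace requires explicit retargeting. The reach values and base offset below are made-up assumptions.

```python
# A toy sketch of the kind of retargeting the embodiment gap forces on us:
# a human wrist trajectory estimated from video has to be rescaled and
# re-anchored before it means anything in a robot's workspace. The reach
# values and base offset are made-up assumptions, not Skild AI parameters.

import numpy as np


def retarget_wrist_to_robot(wrist_xyz: np.ndarray,
                            human_reach: float = 0.7,
                            robot_reach: float = 0.9,
                            robot_base_offset=(0.1, 0.0, 0.2)) -> np.ndarray:
    """Scale a human wrist trajectory (T, 3) into robot end-effector targets."""
    scale = robot_reach / human_reach
    return wrist_xyz * scale + np.asarray(robot_base_offset)


# Usage: a straight 10-step human reach becomes a scaled robot end-effector path.
human_traj = np.linspace([0.0, 0.0, 0.0], [0.5, 0.1, 0.3], num=10)
robot_traj = retarget_wrist_to_robot(human_traj)
print(robot_traj.shape)   # -> (10, 3)
```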
The Skild AI Solution: Omni-bodied Learning
Bridging embodiment is a core capability of our model, enabling new skills to be learned directly from video demonstrations. After pretraining, our model picks up a new skill primarily by watching videos, with less than an hour of robot data. We believe this fundamentally changes the robotics data bottleneck, making scalable task learning for foundation models feasible.
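A minimal sketch, assuming a generic PyTorch setup, of the two-stage recipe described above: a video-pretrained backbone is frozen and a small action head is finetuned on a brief robot dataset. The module names, sizes, and the MSE behavior-cloning loss are illustrative assumptions, not Skild AI's implementation.

```python
# A minimal sketch, assuming a generic PyTorch setup, of the two-stage recipe
# above: freeze a video-pretrained backbone and finetune a small action head
# on a brief (<1 hr) robot dataset. Module names, sizes, and the MSE
# behavior-cloning loss are illustrative assumptions, not Skild AI's code.

import torch
import torch.nn as nn


class VideoBackbone(nn.Module):
    """Stand-in for an encoder whose weights come from human-video pretraining."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.net(frames)


class ActionHead(nn.Module):
    """Small head mapping visual features to robot actions (e.g. joint deltas)."""
    def __init__(self, feat_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.head = nn.Linear(feat_dim, action_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)


backbone = VideoBackbone()              # pretend these weights were pretrained on video
for p in backbone.parameters():
    p.requires_grad = False             # keep the video-learned representation fixed

head = ActionHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-4)

# One finetuning step on a tiny batch of (camera frames, expert action) robot data.
frames = torch.randn(4, 3, 64, 64)      # placeholder robot camera observations
actions = torch.randn(4, 7)             # placeholder target actions from robot data

opt.zero_grad()
loss = nn.functional.mse_loss(head(backbone(frames)), actions)
loss.backward()
opt.step()
print(float(loss))
```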