A new generation of vision-language-action models is bringing humanoid robots close to learning tasks from a few minutes of demonstration, and the warehouses are already ordering by the thousand.
If you spent the last two years skeptical of humanoid robot demos, this is the year that skepticism becomes hard to defend. Vision-language-action (VLA) foundation models have made the leap from research curiosities to production tools, and Figure, 1X, Apptronik, and Unitree are all reporting order backlogs measured in the tens of thousands of units.
The technical unlock is simple in retrospect. Rather than hand-engineering controllers for each task, the new generation of robots runs a single large transformer that takes camera feeds and a natural-language goal as input, and produces joint torques as output. Training is done on a mix of internet video, teleoperation data, and simulation rollouts — the same recipe that produced LLMs, applied to bodies.
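To make that input/output contract concrete, here is a minimal sketch of the interface such a model exposes to the rest of the robot: camera frames and a text goal go in, one torque value per joint comes out. The names `Observation`, `VLAPolicy`, `ZeroTorquePolicy`, and `run_episode` are hypothetical, chosen for illustration, and the placeholder policy stands in for the transformer rather than reproducing any vendor's model.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


@dataclass
class Observation:
    """Input for one control step: camera frames plus the language goal."""
    camera_frames: Sequence[Sequence[float]]  # one flattened image per camera (assumed layout)
    goal: str                                 # natural-language instruction


class VLAPolicy(Protocol):
    """Anything that maps an observation to one torque command per joint."""

    def act(self, obs: Observation) -> List[float]:
        ...


class ZeroTorquePolicy:
    """Placeholder for the large transformer: always commands zero torque."""

    def __init__(self, num_joints: int) -> None:
        self.num_joints = num_joints

    def act(self, obs: Observation) -> List[float]:
        # A real model would encode the frames and the goal here.
        return [0.0] * self.num_joints


def run_episode(policy: VLAPolicy, observations: Sequence[Observation]) -> List[List[float]]:
    """Feed each observation to the policy and collect the torque commands."""
    return [policy.act(obs) for obs in observations]
```

The point of the sketch is that the task lives in the `goal` string and the learned weights, not in a task-specific controller: swapping "fold the towel" for "stack the boxes" changes the input, not the code.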
What they can actually do today
Folding laundry, palletizing irregularly shaped boxes, restocking shelves, and basic kitchen prep are now reliable enough for paying customers. Tasks requiring sub-millimeter precision (electronics assembly, surgical assist) remain out of reach. The sweet spot is exactly the kind of mid-skilled manual work that has been hardest to automate with traditional robotic arms.
The economics
A Figure 03 unit currently lists at $42,000 with a 5-year service contract. Amortized over the contract term at roughly one full-time shift (about 2,080 hours a year), that works out to roughly $4/hour, well below the US federal minimum wage of $7.25/hour and competitive with low-cost manufacturing labor. The robots are not yet capable enough to displace skilled trades, but for warehouse and fulfillment work, the calculus has tipped.
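The amortized figure above is sensitive to how many hours the robot actually works, so it is worth running the arithmetic. The sketch below divides the list price by total operating hours under two utilization scenarios. The hour counts (2,080 for a single 40-hour-a-week shift, 8,760 for round-the-clock operation) are illustrative assumptions, not vendor figures, and the calculation ignores energy, downtime, financing, and any cost not covered by the list price.

```python
def amortized_hourly_cost(unit_price: float, term_years: float, hours_per_year: float) -> float:
    """Spread the purchase price evenly over every operating hour in the term."""
    return unit_price / (term_years * hours_per_year)


UNIT_PRICE = 42_000  # list price quoted in the article, USD
TERM_YEARS = 5       # service contract length quoted in the article

# Assumed utilization scenarios (not from the vendor).
SINGLE_SHIFT_HOURS = 40 * 52        # 2,080 hours a year
AROUND_THE_CLOCK_HOURS = 24 * 365   # 8,760 hours a year

single_shift = amortized_hourly_cost(UNIT_PRICE, TERM_YEARS, SINGLE_SHIFT_HOURS)
around_the_clock = amortized_hourly_cost(UNIT_PRICE, TERM_YEARS, AROUND_THE_CLOCK_HOURS)

print(f"Single shift:     ${single_shift:.2f}/hour")      # $4.04/hour
print(f"Around the clock: ${around_the_clock:.2f}/hour")  # $0.96/hour
```

The roughly $4/hour figure matches the single-shift case. Higher utilization pushes the number down toward $1/hour, while any downtime or excluded cost pushes it up.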
What to watch
The next inflection will be policy capture: a single VLA model that, given a few minutes of human demonstration, generalizes to a new task without further training. Several labs claim early results. If that capability lands cleanly, the bottleneck shifts from R&D to manufacturing — and the question becomes how fast the factories can scale.