Embodied AI 101

Contact-Grounded Policy: Dexterous Visuotactile Policy with Generative Contact Grounding

Shaoqing Tan — Mon, 06 Jul 2026 14:07:09 -0700

Learns dexterous manipulation policies that explicitly ground actions in generative contact predictions from visuotactile observations, improving robustness on contact-rich tasks.

Freeform Preference Learning (FPL) for Robotic Manipulation

Shaoqing Tan — Mon, 06 Jul 2026 05:09:07 -0700

Introduces multi-axis preference supervision to learn dense, language-conditioned rewards across speed/precision/subtask axes without segmentation; enables compositional generalization and better long-horizon credit assignment than single-reward baselines.

Orca: The World is in Your Mind

Shaoqing Tan — Sun, 05 Jul 2026 14:09:11 -0700

Proposes a general world foundation model leveraging Next-State-Prediction to jointly generate text, images, and embodied actions within a unified framework.

Qwen-RobotNav: A Scalable Unified Navigation Model for Agentic Robotics

Shaoqing Tan — Sun, 05 Jul 2026 14:08:08 -0700

A unified 2B–8B parameter model for robot navigation tasks (VLN, ObjectNav, tracking, autonomous driving) via a configurable observation protocol, with demonstrated zero-shot deployment on real quadruped robots using agentic planners.

Scaling Robot Skills from Cheap Human Videos

Shaoqing Tan — Wed, 01 Jul 2026 14:08:01 -0700

Replaces noisy 6-DoF hand poses with relative wrist translation as a shared action space between humans and bimanual robots, enabling scalable skill acquisition from inexpensive video data. This approach outperforms full-pose baselines for robot skill learning.
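
A minimal sketch of what a relative wrist translation action could look like, assuming the action is simply the frame-to-frame change in 3D wrist position. The function name and the sample track are hypothetical, and the paper's exact definition (reference frame, normalization, time step) is not given here.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def relative_wrist_translations(positions: List[Vec3]) -> List[Vec3]:
    """Turn absolute wrist positions into frame-to-frame translation deltas."""
    return [
        (b[0] - a[0], b[1] - a[1], b[2] - a[2])
        for a, b in zip(positions, positions[1:])
    ]


# Hypothetical wrist track in metres; it could come from a human video
# or from a robot end effector, which is what makes the space shared.
track = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.25, 0.0)]
deltas = relative_wrist_translations(track)
print(deltas)
```

Because only position differences are kept, the representation ignores hand orientation entirely, which is the part of a 6-DoF pose that the summary describes as noisy.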

ABC: An Open Behavior Cloning Stack for Bimanual Manipulation

Shaoqing Tan — Wed, 01 Jul 2026 05:16:54 -0700

Large-scale open-source framework for real-world robotic manipulation using behavior cloning, including the ABC-130K dataset with 3,500 hours and 130K+ episodes across 195 tasks. Provides hardware setups, simulators, and training recipes for Diffusion Transformers and Vision-Language-Action models.

Qwen-RobotManip: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Shaoqing Tan — Wed, 01 Jul 2026 05:12:07 -0700

Presents alignment techniques to scale robotic manipulation foundation models, building on the Qwen model family for dexterous robot control.

ASPIRE: Automated Skill Discovery for Robotics

Shaoqing Tan — Tue, 30 Jun 2026 14:11:15 -0700

Introduces the first automated system that continuously discovers, evolves, and accumulates reusable sensorimotor skills via evolutionary search over control programs, enabling compounding multi-task, sim-to-real, and cross-embodiment transfer without retraining end-to-end policies.

SERF: 4D Latent Mapping for Long-Horizon Mobile Manipulation

Shaoqing Tan — Tue, 30 Jun 2026 05:48:09 -0700

Embeds both the robot and environment into a shared 4D latent space augmented with forward-kinematics robot points, enabling a vision-language-action model to handle dynamic scenes and long-horizon memory. Outperforms image-only VLA baselines on the BEHAVIOR-1K benchmark for mobile manipulation.

ViserDex: Visual Sim-to-Real for Robust Dexterous In-Hand Reorientation

Shaoqing Tan — Tue, 30 Jun 2026 05:13:30 -0700

A single-camera sim-to-real framework that uses physically consistent 3D Gaussian Splatting augmentations to achieve zero-shot transfer of dexterous in-hand reorientation policies to an Allegro hand. The approach trains entirely on consumer hardware while maintaining high fidelity to real-world dynamics.

DexSkin: A High-Coverage, Conformable "Electronic Skin" for Robot Fingers

Shaoqing Tan — Tue, 30 Jun 2026 03:12:49 -0700

Introduces a high-coverage, conformable robotic skin hardware system designed to improve data collection and policy learning for contact-rich, dexterous manipulation tasks. The system provides rich tactile sensing coverage to enable more capable robot manipulation policies.

EBench: A Diagnostic Benchmark for Generalist Manipulation Policies

Shaoqing Tan — Tue, 30 Jun 2026 03:11:09 -0700

A CAT-scan style diagnostic benchmark for robot foundation models that evaluates policies such as π0, π0.5, and Qwen-RobotManip beyond single success rates. The benchmark is designed to distinguish genuine generalization from overfitting to demonstrations in generalist manipulation policies.

VITRA: A Foundation for Dexterous VLA via Human Video Pretraining

Shaoqing Tan — Mon, 29 Jun 2026 14:13:41 -0700

A scalable VLA pretraining pipeline that converts unstructured egocentric human videos into robot training data, trains a dexterous hand VLA, and fine-tunes on robot data, achieving strong zero-shot generalization and real-robot dexterous manipulation.

DexWM: A Dexterous Manipulation World Model from Human Videos

Shaoqing Tan — Mon, 29 Jun 2026 14:08:49 -0700

A dexterous manipulation world model pretrained on 829 hours of EgoDex human data and DROID robot data using conditioned diffusion transformers, enabling open-loop rollouts and sim-to-real transfer with minimal robot fine-tuning.

PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

Shaoqing Tan — Mon, 29 Jun 2026 05:14:05 -0700

Introduces PoLAR, a method that factorizes latent action representations into extent and mode components to improve robot policy learning efficiency and generalization.

Continual Robot Policy Learning via Variational Neural Dynamics

Shaoqing Tan — Mon, 29 Jun 2026 05:13:48 -0700

Proposes a variational neural dynamics framework for continual robot policy learning, enabling robots to acquire new skills without forgetting previously learned ones.

PhysisForcing: Physics-Reinforced World Models for Robotic Manipulation

Shaoqing Tan — Mon, 29 Jun 2026 03:24:42 -0700

Plug-and-play training framework that enforces physical plausibility in robotic video generation models, achieving SOTA on R-Bench, PAI-Bench, and EZS-Bench. Lifts WorldArena success rate from 16% to 24% with zero extra inference cost.

Translation as a Bridging Action

Shaoqing Tan — Mon, 29 Jun 2026 03:11:02 -0700

Replaces noisy 6DoF hand poses with relative wrist translation as a shared action space between cheap human videos and bimanual robots. Scales data-efficiently and outperforms full-pose baselines on manipulation tasks.

Play2Perfect: Dexterous Play Pretraining for Precise Assembly

Shaoqing Tan — Sun, 28 Jun 2026 14:13:36 -0700

Pre-trains a dexterous hand via unstructured 'play' interactions with objects, then fine-tunes for precise assembly tasks including 0.5 mm clearance insertions and furniture screwing, achieving 33x better sample efficiency than RL from scratch.

Dexora: Open-Source VLA for High-DoF Bimanual Dexterity

Shaoqing Tan — Sun, 28 Jun 2026 14:11:01 -0700

First open-source Vision-Language-Action (VLA) model for dual-arm, dual-hand 36-DoF dexterous manipulation, trained on 100K simulated and 10K real trajectories with strong cross-embodiment transfer capabilities.

WorldVLA: Towards Autoregressive Action World Model

Shaoqing Tan — Sun, 28 Jun 2026 05:09:08 -0700

Unifies VLA and world-modeling in a single autoregressive transformer that predicts both future images and actions. Outperforms separate VLA or world models on LIBERO simulation benchmarks.

HumDex: Humanoid Dexterous Manipulation Made Easy

Shaoqing Tan — Sun, 28 Jun 2026 05:08:49 -0700

HumDex targets humanoid dexterous manipulation, aiming to simplify the development of dexterous manipulation capabilities for humanoid robots.

ForceBand: Learning Forceful Manipulation with sEMG

Shaoqing Tan — Sat, 27 Jun 2026 14:11:53 -0700

Presents an open-source, low-cost sEMG wristband framework that extracts force signals from human muscle activity in videos, enabling zero-shot human-to-robot transfer of forceful manipulation policies across any robot, camera, or environment.

In-Context World Modeling for Robotic Control

Shaoqing Tan — Sat, 27 Jun 2026 14:08:16 -0700

Introduces ICWM, a method that learns world dynamics from just seconds of a robot's self-generated interaction data, enabling zero-shot adaptation to unseen cameras and new robot morphologies without any fine-tuning.

WOLF-VLA: Vision-Language-Action for Humanoid Walking

Shaoqing Tan — Sat, 27 Jun 2026 05:15:30 -0700

Introduces a framework integrating vision-language-action models for whole-body humanoid locomotion, addressing optimal control and learning for complex bipedal behaviors. Combines VLA learning with locomotion-specific control for humanoid robots.

Motion-Focused Latent Action for Cross-Embodiment VLA from Human Videos

Shaoqing Tan — Sat, 27 Jun 2026 05:14:57 -0700

Proposes a motion-focused latent action representation for cross-embodiment vision-language-action policies learned from human videos. Accepted to IROS 2026.

ManiFlow: Manipulation via Rectified Flow

Shaoqing Tan — Sat, 27 Jun 2026 03:11:30 -0700

ManiFlow is a visuomotor imitation learning policy using consistency flow matching with a DiT-X architecture that generates high-quality actions in 1–2 steps. It works across single-arm, bimanual, and humanoid platforms using RGB or point cloud inputs.
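
To illustrate why a flow-based policy can get away with 1–2 sampling steps, here is a generic Euler sampler for a flow-matching velocity field. This is a sketch of plain flow-matching sampling, not ManiFlow's consistency training or its DiT-X network: `toy_velocity` and `TARGET` are hypothetical stand-ins for a learned, observation-conditioned model.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector], x0: Vector, steps: int
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical "action" the toy flow should reach at t=1.
TARGET = [0.5, -0.25]


def toy_velocity(x: Vector, t: float) -> Vector:
    """Straight-line flow toward TARGET; a stand-in for a learned network."""
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]


one_step = euler_sample(toy_velocity, [0.0, 0.0], steps=1)
two_step = euler_sample(toy_velocity, [0.0, 0.0], steps=2)
print(one_step, two_step)
```

When the flow is a straight line, as in this toy case, a single Euler step already lands on the target. Learned flows are only approximately straight, so they may need a second step.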

RL-100: Toward Highly Reliable Real-World Robot Reinforcement Learning

Shaoqing Tan — Sat, 27 Jun 2026 03:10:48 -0700

RL-100 demonstrates highly reliable real-world RL manipulation, succeeding in 900 of 900 trials across 7 tasks, including up to 250 consecutive trials without failure. It is also robust to disturbances and adapts in zero- and few-shot settings.

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Shaoqing Tan — Fri, 26 Jun 2026 14:10:00 -0700

A video-diffusion world model trained on over 1 million manipulation episodes (3,000 hours) that includes an action model and neural simulator for closed-loop robotic manipulation control, with all code and models open-sourced.

Bi-HIL: Bilateral Control-Based Multimodal Hierarchical Imitation Learning for Long-Horizon Contact-Rich Manipulation

Shaoqing Tan — Fri, 26 Jun 2026 05:12:44 -0700

Proposes a hierarchical imitation learning framework using bilateral control, subtask-level progress tracking, and keyframe memory to handle long-horizon, contact-rich manipulation tasks.

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Shaoqing Tan — Fri, 26 Jun 2026 05:08:58 -0700

Uses reinforcement learning to improve process reasoning capabilities in robotic manipulation policies, shifting the model from passive observation to active critique.

ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

Shaoqing Tan — Fri, 26 Jun 2026 03:12:15 -0700

ROVE leverages reinforcement learning to enable humanoid robots to benefit from human interventions during manipulation tasks.

ConstrainedMimic: Safe Humanoid Robot Motion Tracking

Shaoqing Tan — Fri, 26 Jun 2026 03:11:55 -0700

A control framework for safe humanoid robot motion tracking using RL policies with real-time constraint enforcement via kinematics, dynamics, and control barrier functions.
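
As a rough picture of constraint enforcement with a control barrier function, here is a one-dimensional sketch. It assumes a single-integrator system (x' = u) with one upper bound on position, and the names `cbf_filter`, `x_max`, and `alpha` are illustrative. The framework itself works with full humanoid kinematics and dynamics, which this toy does not capture.

```python
def cbf_filter(u_nominal: float, x: float, x_max: float, alpha: float = 2.0) -> float:
    """Minimally modify u_nominal so that h(x) = x_max - x stays non-negative.

    For x' = u the barrier condition h' >= -alpha * h reduces to
    u <= alpha * (x_max - x), so the filter is a simple clamp.
    """
    h = x_max - x
    return min(u_nominal, alpha * h)


# Simulate a policy that always pushes toward the limit at full speed.
x, dt = 0.0, 0.01
for _ in range(1000):
    u = cbf_filter(u_nominal=1.0, x=x, x_max=1.0)
    x += dt * u
print(x)
```

Far from the limit the nominal command passes through unchanged. Near the limit the allowed velocity shrinks in proportion to the remaining margin, so the state approaches the bound without crossing it.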

REAL: Robust Extreme Agility via Spatio-Temporal Policy Learning and Physics-Guided Filtering

Shaoqing Tan — Thu, 25 Jun 2026 14:13:34 -0700

Introduces spatio-temporal policy learning combined with physics-guided filtering to achieve robust and extremely agile robot control.

HiFlow: Tokenization-Free Scale-Wise Autoregressive Policy Learning via Flow Matching

Shaoqing Tan — Thu, 25 Jun 2026 14:08:48 -0700

Introduces a tokenization-free autoregressive policy learning framework using flow matching across scales for robotic control.

Reactive Diffusion Policy: Slow-Fast Visual-Tactile Learning for Contact-Rich Manipulation

Shaoqing Tan — Thu, 25 Jun 2026 05:15:32 -0700

Introduces a slow-fast imitation learning framework combining diffusion-based planning with reactive tactile/force feedback for contact-rich manipulation tasks. Also includes TactAR, an AR-based teleoperation system with tactile sensing.

SARM2 + SPIRAL: Multi-Task Reward Models and RL Refinement for Long-Horizon Dexterous Manipulation

Shaoqing Tan — Thu, 25 Jun 2026 05:12:28 -0700

Combines scalable autonomous reward modeling with RL-based refinement to improve vision-language-action policies on long-horizon dexterous manipulation tasks via autonomous rollouts. Demonstrates significant gains over imitation learning baselines.

Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm VLA Systems

Shaoqing Tan — Thu, 25 Jun 2026 03:34:15 -0700

Introduces coordination-aware structured action modeling for dual-arm robotic systems within a VLA framework. Addresses the unique challenges of bimanual manipulation through specialized action representations.

ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation

Shaoqing Tan — Thu, 25 Jun 2026 03:31:11 -0700

Proposes interleaved vision and language reasoning for robotic manipulation within a VLA framework. Aims to improve instruction following and task performance through integrated multimodal reasoning.

Playful Agentic Robot Learning

Shaoqing Tan — Thu, 25 Jun 2026 03:22:48 -0700

Combines self-directed play with Code-as-Policy to acquire reusable skills for downstream manipulation tasks.

Learning Unified Force and Position Control for Legged Loco-Manipulation

Shaoqing Tan — Wed, 24 Jun 2026 14:14:01 -0700

A unified RL policy for quadrupeds and humanoids that jointly handles force and position control without force sensors, enabling compliant behaviors, force-aware imitation learning, and contact-rich tasks.

Robots that Collaborate: Sequential Asymmetric Imitation for Learning Coupled Robot Policies

Shaoqing Tan — Wed, 24 Jun 2026 14:11:02 -0700

Explores imitation learning approaches for multi-robot systems, focusing on policy coupling through sequential asymmetric imitation to enable collaborative robot behaviors.

AstraBrain-WBC 0.5: A Humanoid Robot Cerebellum Foundation Model

Shaoqing Tan — Wed, 24 Jun 2026 05:17:07 -0700

A humanoid robot 'cerebellum' foundation model trained on 20,000 hours of human motion data that demonstrates scaling laws for robot motion control and enables zero-shot execution of unseen motions on real humanoids.

SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping

Shaoqing Tan — Wed, 24 Jun 2026 05:14:53 -0700

Combines the Spring-Loaded Inverted Pendulum (SLIP) model with reinforcement learning to achieve agile jumping behaviors in robotic systems.
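
For reference, the stance-phase dynamics of the standard planar SLIP model (a point mass on a massless linear spring leg) can be sketched as below. The parameter values are arbitrary examples, and how the paper couples this model with the RL policy is not described here.

```python
import math
from typing import Tuple


def slip_stance_acceleration(
    x: float,
    y: float,
    mass: float,
    stiffness: float,
    rest_length: float,
    gravity: float = 9.81,
) -> Tuple[float, float]:
    """Acceleration of the point mass during stance.

    (x, y) is the mass position relative to the foot contact point.
    The spring pushes along the leg with force k * (l0 - l).
    """
    length = math.hypot(x, y)
    spring_force = stiffness * (rest_length - length)
    ax = spring_force * (x / length) / mass
    ay = spring_force * (y / length) / mass - gravity
    return ax, ay


# Leg vertical and at rest length: the spring is slack, only gravity acts.
print(slip_stance_acceleration(0.0, 1.0, mass=1.0, stiffness=100.0, rest_length=1.0))
```

During flight the mass follows a ballistic trajectory, so a jump is a sequence of stance and flight phases switched at touchdown and liftoff.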

DataClaw0: Agentic Tailoring for Raw Multimodal Streams

Shaoqing Tan — Wed, 24 Jun 2026 03:20:11 -0700

A 9B model that filters noise from videos, GUI, and embodied data streams, reorganizing them into dense supervision via factual anchors and semantic synthesis; trained with SFT + GRPO across five domains with benchmarks.

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Shaoqing Tan — Wed, 24 Jun 2026 03:16:42 -0700

Converts 6K+ hours of mixed human/robot egocentric video into robot pseudo-actions via camera-space alignment and reliability-aware loss, achieving 72.8% on RoboCasa and 91.1% on RoboTwin.

VERA: Video-to-Action World Model Policy

Shaoqing Tan — Wed, 24 Jun 2026 00:03:04 -0700

A 14B-parameter video world model that converts predicted visual futures into embodiment-agnostic actions via Jacobian inverse-dynamics, enabling zero-shot cross-robot transfer across a Panda arm and 16-DoF hand with open-sourced weights and training code.

GEN-1: Scaled Dexterous Manipulation Foundation Model

Shaoqing Tan — Wed, 24 Jun 2026 00:01:21 -0700

A dexterous manipulation foundation model trained on 500k hours of real-world bimanual data that handles deformable objects such as cardboard folding and screw packing, featuring online retry and adaptation capabilities.

Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation

Shaoqing Tan — Mon, 22 Jun 2026 05:16:39 -0700

Develops an SE(3)-equivariant flow-based visuomotor policy leveraging spherical harmonics for efficient and geometrically consistent robot manipulation.

Cortical Policy: A Dual-Stream View Transformer for Robotic Manipulation

Shaoqing Tan — Mon, 22 Jun 2026 05:10:43 -0700

Introduces a dual-stream transformer architecture inspired by cortical visual processing for learning robotic manipulation policies.

VisualClaw: A Self-Evolving Wearable Vision Agent

Shaoqing Tan — Mon, 22 Jun 2026 03:05:59 -0700

An edge-filtered video streaming agent that evolves skills from memory and runs on smart glasses, reducing API costs by 98%, accompanied by the VisualClawArena benchmark dataset.

Kairos: A Native World Model Stack for Physical AI

Shaoqing Tan — Sun, 21 Jun 2026 14:23:15 -0700

A 4B unified architecture for world understanding, generation, and action with hybrid linear attention enabling real-time edge inference across embodiments, outperforming 14B models on embodied benchmarks.

DragMesh-2: A Contact-Driven Framework for Dexterous Hand–Object Interaction

Shaoqing Tan — Sun, 21 Jun 2026 14:12:42 -0700

A framework that trains a 51-DoF dexterous hand to open drawers and doors using only physical contact without requiring tactile sensors.

Guava: A Universal Harness for Robot Manipulation

Shaoqing Tan — Sun, 21 Jun 2026 05:10:01 -0700

A 4B open-source VLA-style model trained on fewer than 2K simulation trajectories that matches closed frontier systems on real-world manipulation tasks with zero-shot generalization to novel objects and long-horizon behaviors, including failure-recovery demonstrations.

Geometric Action Model for Robot Policies

Shaoqing Tan — Sun, 21 Jun 2026 05:09:28 -0700

A new geometric action model for robot manipulation policies that focuses on structured action representations to improve policy learning and generalization.

ENPIRE: Physical AutoResearch with a Fleet of 8 Robots

Shaoqing Tan — Sun, 21 Jun 2026 03:14:25 -0700

ENPIRE demonstrates fully autonomous physical AutoResearch where Codex agents control a fleet of 8 robots overnight, self-improving through real hardware rollouts on tasks like zip-tie tying and GPU installation, while discovering physical scaling laws with built-in safety harnesses and frozen reward classifiers derived from demonstrations.

MolmoAct2: An Open Foundation Model for Real-World Robotics

Shaoqing Tan — Sun, 21 Jun 2026 00:43:08 -0700

An open VLA-style robotics foundation model featuring open weights, open dataset, open action tokenizer, and a depth-reasoning variant; designed to enable community experiments on real robots for manipulation and generalist policies.

Hy-Embodied-0.5-VLA: A Massive Bimanual Teleoperation Dataset for Vision-Language-Action

Shaoqing Tan — Mon, 15 Jun 2026 05:27:45 -0700

Releases a massive bimanual robot manipulation dataset with 2,163 hours and 250K+ episodes across 70+ tasks, along with a compatible VLA model for multi-view egocentric teleoperation. The dataset and model are fully compatible with LeRobot v3.0.

Q-Guided Flow: Test-Time Gradient Guidance of Flow Policies

Shaoqing Tan — Sun, 14 Jun 2026 14:23:14 -0700

New framework for guided flow-matching policies that improves long-horizon robotic control and sample efficiency.

Flow Reversal Steering: Guiding Diffusion-Based Robot Policies with High-Level Reasoning

Shaoqing Tan — Sun, 14 Jun 2026 14:13:15 -0700

Introduces flow reversal steering to guide diffusion-based vision-language-action models with high-level VLM reasoning and enables RL directly in the diffusion noise space.

Test-Time Compute Scaling for Robot Policies (DIRECT)

Shaoqing Tan — Sun, 14 Jun 2026 05:26:08 -0700

Larger models + more thinking + more context improve performance on some prompts but not others; a learned router enables better performance/latency trade-offs.

LabVLA: Bringing Vision-Language-Action to the Chemistry Lab

Shaoqing Tan — Sun, 14 Jun 2026 05:16:11 -0700

RoboGenesis generates 10K+ lab scenes across 16 robot embodiments; LabVLA (Qwen3-VL + DiT flow-matching) achieves 71.1% success on LabUtopia and transfers to real Franka arms.

Humanoid-GPT: A Foundation Model for Zero-Shot Humanoid Control

Shaoqing Tan — Sat, 13 Jun 2026 14:17:57 -0700

GPT-style Transformer pretrained on 2 billion motion frames that achieves agile, generalist zero-shot control on a real Unitree G1 humanoid for tasks like soccer, dancing, and digging. Requires no fine-tuning or task-specific adaptation.

CHORUS: Decentralized Multi-Robot Collaboration with a Single Shared VLA Model

Shaoqing Tan — Sat, 13 Jun 2026 14:07:47 -0700

Finetunes a single Vision-Language-Action (VLA) foundation model so that any robot in a team can control any other. Outperforms both per-robot specialists and a monolithic centralized policy while scaling to large teams.

RISE: Self-Improving Robot Policy with Compositional World Model

Shaoqing Tan — Sat, 13 Jun 2026 05:18:47 -0700

Trains a compositional world model on real robot data to enable closed-loop policy improvement via future prediction and progress evaluation, bypassing both risky real-world RL and traditional sim-to-real gaps.

EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

Shaoqing Tan — Fri, 12 Jun 2026 14:34:55 -0700

Open-source 3B unified embodied foundation model trained on 1.5M interleaved vision-text-action samples for perception, planning, and acting.

Robix: A Unified Model for Robot Interaction, Reasoning and Planning

Shaoqing Tan — Fri, 12 Jun 2026 14:15:36 -0700

A single vision-language model that unifies reasoning, task planning, and human-robot interaction for complex instructions and long-horizon tasks.

Robotic World Model: Learning to Simulate for Robust Robot Control

Shaoqing Tan — Fri, 12 Jun 2026 05:07:03 -0700

Presents a neural network-based world model for model-based reinforcement learning in robotics, focusing on sim-to-real transfer for quadrupedal and humanoid robots. Enables robust policy optimization through learned environment simulation.

AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

Shaoqing Tan — Wed, 10 Jun 2026 14:21:45 -0700

Embodied egocentric simulation framework that controls first-person worlds with 3D human motion and customizes evolving scenes via pose-anchored views.

ArtiFixer: Few-Step Diffusion for 3D Scene Reconstruction

Shaoqing Tan — Wed, 10 Jun 2026 14:10:57 -0700

Few-step auto-regressive diffusion model that converts broken 3D reconstructions into fully realized scenes, outperforming prior methods by 1-3 dB PSNR.
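
To put the dB figures in context, PSNR is 10·log10(MAX² / MSE). The sketch below computes it for flat lists of pixel values in [0, 1]; the helper name and sample values are illustrative, and the evaluation protocol used in the paper is not specified here.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
print(psnr([0.0, 0.0], [0.1, 0.1]))
```

Since the scale is logarithmic, a 3 dB gain corresponds to roughly halving the mean squared error (10^-0.3 ≈ 0.5), and a 1 dB gain to a reduction of about 21%.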

Deployment-Time Memorization in Foundation-Model Agents

Shaoqing Tan — Wed, 10 Jun 2026 05:17:10 -0700

Examines memorization phenomena that occur during deployment of foundation model agents in practical applications.

Adversarial Machine Learning: Taxonomy, Threat Models, and Mitigation Strategies in Deep Neural Networks

Shaoqing Tan — Tue, 09 Jun 2026 14:08:24 -0700

Focuses on security aspects of deep neural networks, providing taxonomy and mitigation strategies for adversarial attacks.

SoCRATES: Evaluating LLM Mediators in Conflict Scenarios

Shaoqing Tan — Tue, 09 Jun 2026 05:38:49 -0700

First comprehensive framework for evaluating LLM mediators in real-time, emotional, socio-cognitive scenarios.

Unembedding Matrix as a Feature Lens: Unlocking Better Text Embeddings

Shaoqing Tan — Tue, 09 Jun 2026 05:32:33 -0700

Improves embedding quality without extra training by using the unembedding matrix as a feature lens for text embeddings.

LeanMarathon: Autonomous Formalization of Math Proofs on Erdős Problems

Shaoqing Tan — Mon, 08 Jun 2026 14:38:40 -0700

Presents an autonomous system for formalizing mathematical proofs, specifically targeting Erdős problems. Demonstrates automated proof formalization capabilities in the Lean theorem prover.

Deep Research Agents: Survey and Roadmap for Autonomous AI Research

Shaoqing Tan — Mon, 08 Jun 2026 14:26:32 -0700

Provides a comprehensive taxonomy, benchmarks, and future directions for autonomous AI research agents. Examines systematic approaches to developing agents capable of conducting independent research.

Cosmos 3: Omnimodal World Models for Physical AI

Shaoqing Tan — Sun, 07 Jun 2026 14:08:40 -0700

Omnimodal world models explicitly designed for Physical AI and robotics applications. Enables improved simulation and control for robotic systems through multimodal understanding.

Humanoid-GPT: GPT-Style Transformer for Zero-Shot Dynamic Humanoid Control

Shaoqing Tan — Sun, 07 Jun 2026 05:11:47 -0700

GPT-style Transformer trained on 2 billion motion frames enabling zero-shot dynamic humanoid control for tasks like soccer, dancing, and digging on real Unitree G1 robots without fine-tuning. Breaks the agility-generalization trade-off in humanoid robotics.

Bending Paper, Shaping Dexterity: The Robotic Origami Challenge

Shaoqing Tan — Fri, 05 Jun 2026 14:20:26 -0700

New IROS benchmark providing 500+ teleoperation episodes and physically accurate simulation assets for training policies that outperform human origami experts.

GraspGen-X: A Foundation Model for Zero-Shot 6-DoF Grasping

Shaoqing Tan — Fri, 05 Jun 2026 14:10:23 -0700

First foundation model for zero-shot grasping trained on billions of simulated grasps, enabling generalized manipulation without task-specific training.

When Does Deep RL Beat Calibrated Baselines?

Shaoqing Tan — Thu, 04 Jun 2026 05:24:54 -0700

Benchmark study examining when deep reinforcement learning outperforms calibrated baseline methods in adaptive resource control tasks.

Training Deep Networks as Random Effects: An Optimization–Inference Duality

Shaoqing Tan — Thu, 04 Jun 2026 05:10:17 -0700

Explores training dynamics of deep neural networks through a statistical lens, examining the duality between optimization and inference perspectives.

Generative Depth Supervision for Embodied Vision-Language Models

Shaoqing Tan — Tue, 02 Jun 2026 05:07:28 -0700

Vision-language model that adds generative depth prediction during pre-training for physical grounding; achieves SOTA on embodied benchmarks and transfers directly to real-robot tasks.

PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation

Shaoqing Tan — Mon, 01 Jun 2026 05:12:12 -0700

Presents a 3D point-cloud-based world model trained on mixed real/sim data that enables zero-shot grasping and articulated object handling on real robots by explicitly modeling spatial structure.

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Shaoqing Tan — Sun, 31 May 2026 16:41:21 -0700

NVIDIA Research presents LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). Unlike prior VLMs that serialize bounding boxes into sequential coordinate tokens, PBD treats each box as an atomic unit and predicts all of its coordinates in a single forward pass. This preserves intra-box geometric coherence while delivering 2.5x faster decoding throughput. The model supports diverse localization tasks, including document understanding, GUI grounding, dense object detection, and OCR localization. It is built on the Moon-ViT vision encoder and the Qwen2.5 language decoder, and is trained on LocateAnything-Data, which contains 138M language queries and 785M bounding boxes. It achieves state-of-the-art results on the LVIS, M6Doc, and ScreenSpot-Pro benchmarks. Models and a demo are available on HuggingFace.

LT2: Linear-Time Looped Transformers

Shaoqing Tan — Sun, 31 May 2026 05:21:02 -0700

Replaces quadratic softmax attention in looped architectures with linear/sparse mechanisms for iterative memory refinement, achieving parity with standard looped transformers at much lower cost.

One Learning Rate Doesn't Fit All: Layerwise Spectral Scheduling for Transformers

Shaoqing Tan — Sun, 31 May 2026 05:07:55 -0700

Shows that modern transformers are highly heterogeneous across layers and proposes layerwise learning rates based on weight spectrum shape, yielding up to 1.5× training speedup on LLaMA/GPT-style models.

SimToolReal: Procedural Tool Generation and a Universal Objective for Zero-Shot Tool Manipulation

Shaoqing Tan — Sat, 30 May 2026 14:19:08 -0700

Trains generalist policies in simulation on procedurally generated tools to move objects, enabling real-world tool use across varied shapes/sizes.

Robometer and the Future of Robotic Reward Modeling

Shaoqing Tan — Sat, 30 May 2026 14:07:00 -0700

New framework for scalable robotic reward modeling using trajectory comparisons to train general-purpose reward models.

Qwen-VLA: A Generalist Vision–Language–Action Robot Model

Shaoqing Tan — Fri, 29 May 2026 14:15:49 -0700

A single generalist VLA built on Qwen3.5-4B + 1.15B DiT flow-matching action decoder that unifies manipulation, navigation, and trajectory prediction across 11 embodiments via text-described embodiment prompts. Trained in four stages and outperforms task-specific specialists on real ALOHA and sim benchmarks without per-task fine-tuning.

EXPO-FT: Sample-Efficient Reinforcement Learning Fine-Tuning for Vision-Language-Action Models

Shaoqing Tan — Fri, 29 May 2026 05:13:54 -0700

Extends the EXPO method with real-world RL post-training for VLAs using image observations, action chunking, DAgger, and on-the-fly Q-value maximization. Achieves 30/30 success on 8 challenging manipulation tasks with only ~19 min of RL data on average.

RoboMeter: Learning Dense Rewards from Successes and Failures

Shaoqing Tan — Thu, 28 May 2026 22:00:45 -0700

RoboMeter trains dense reward models from both successful and failed robot trajectories, solving a key gap in prior methods that only learn from expert demos.

MobileGym: A Controllable, Parallel Sandbox for Mobile GUI Agents

Shaoqing Tan — Wed, 27 May 2026 05:34:07 -0700

Browser-hosted mobile environment with JSON state, deterministic judges, and 256 parallel rollouts. Reports +40.7 real-device points after GRPO training on 416 tasks for GUI agent development.

ANY2ANY: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Shaoqing Tan — Wed, 27 May 2026 05:19:30 -0700

Introduces a method to transfer a Unitree G1 foundation policy (Gear-Sonic) to LimX Oli/Luna humanoids using only 1% of the original compute/data. Achieves fast convergence and strong tracking performance for humanoid whole-body control.

TriSplat: Feed-Forward 3D Reconstruction with Triangulated Meshes

Shaoqing Tan — Tue, 26 May 2026 14:31:44 -0700

Outputs physics-engine-compatible triangle meshes directly from sparse, unposed images without Gaussian splatting or post-processing.

MIKASA-Robo-VLA: A Memory-Intensive Benchmark for Vision-Language-Action Robotics

Shaoqing Tan — Tue, 26 May 2026 14:11:34 -0700

Releases a benchmark suite for systematically evaluating memory in Vision-Language-Action policies on tabletop manipulation tasks.

PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation

Shaoqing Tan — Mon, 25 May 2026 14:12:29 -0700

Introduces large-scale 3D world models pretrained on diverse real-world video to enable robust robotic manipulation policies that generalize beyond simulation.

Bimanual Pegboard Manipulation: A Benchmark for Vision-Language-Action Models

Shaoqing Tan — Sun, 24 May 2026 14:14:31 -0700

New LeRobot-based bimanual pegboard manipulation dataset with 52 episodes, 30K frames, 3 camera views, and 14-DoF arms for VLA evaluation. Provides a standardized benchmark for assessing vision-language-action models.

FutureSim: Replaying Real-World Events to Evaluate AI Forecasting Agents

Shaoqing Tan — Sun, 24 May 2026 05:31:47 -0700

A benchmark designed to test AI models' capabilities in making accurate 3-month future predictions.

AgentFloor: A Benchmark for Long-Horizon Agent Planning

Shaoqing Tan — Sun, 24 May 2026 05:18:22 -0700

A 30-task benchmark for evaluating long-horizon planning capabilities across 16 different AI models.

AlexNet: The Deep Convolutional Network That Transformed Vision

Shaoqing Tan — Sat, 23 May 2026 14:21:12 -0700

The AlexNet paper, which sparked the modern deep learning revolution with deep convolutional neural networks.

A Few Useful Things to Know About Machine Learning

Shaoqing Tan — Sat, 23 May 2026 14:13:24 -0700

Practical insights into ML pitfalls and best practices for machine learning practitioners.

SimToolReal: A Universal Dexterous Tool-Use Policy

Shaoqing Tan — Sat, 23 May 2026 05:26:37 -0700

Introduces an object-centric sim-to-real policy that enables zero-shot dexterous tool use on physical robots without task-specific fine-tuning. Leverages simulation data for robust real-world transfer.

Mimic-Video: Learning Physics Priors from Web-Scale Video for Robot Dexterity

Shaoqing Tan — Sat, 23 May 2026 05:13:14 -0700

Pretrains robot policies on large-scale web video to acquire dynamics and physics understanding instead of static images or VLMs. Yields faster training, better generalization, and superior dexterous manipulation results in real-world tasks.

Deep Residual Learning for Image Recognition (ResNet)

Shaoqing Tan — Sat, 23 May 2026 02:03:20 -0700

Introduced residual connections (ResNet) enabling training of very deep networks, still widely used in modern architectures.
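The residual idea can be sketched in a few lines: a block outputs its input plus a learned correction, so a block whose correction is zero behaves as the identity. This is a minimal pure-Python illustration, not the paper's architecture; `residual_block`, `zero_layer`, and `scale_layer` are hypothetical names chosen for this example.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Return x + f(x): the layer f only has to learn a residual on top of the identity."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("f must preserve the input dimension for the skip connection")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_layer(x: Vector) -> Vector:
    """Hypothetical layer that has learned nothing: outputs all zeros."""
    return [0.0 for _ in x]


def scale_layer(x: Vector) -> Vector:
    """Hypothetical layer that adds a small correction proportional to the input."""
    return [0.1 * v for v in x]


if __name__ == "__main__":
    print(residual_block([1.0, 2.0], zero_layer))   # input passes through unchanged
    print(residual_block([1.0, 2.0], scale_layer))  # input plus a small correction
```

Because the skip path carries the input forward untouched, stacking many such blocks does not force every layer to re-learn the identity mapping.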

Attention Is All You Need – The Transformer Revolution

Shaoqing Tan — Sat, 23 May 2026 01:49:20 -0700

Introduced the Transformer architecture based purely on attention mechanisms, becoming the foundation of nearly all modern large language models.
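For readers who want the core operation in concrete form, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, for a single head with no masking or batching. It uses plain Python lists for clarity rather than speed; the function names are illustrative only.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value rows.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out


if __name__ == "__main__":
    # A zero query scores every key equally, so the output is the mean of the values.
    print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```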

NVIDIA Cosmos: World Foundation Models for Physical AI

Shaoqing Tan — Wed, 20 May 2026 05:19:16 -0700

World foundation models for video and physics prediction. Generated outputs carry SynthID watermarking, adopted in collaboration with Google DeepMind as a responsible-AI measure.

LATENT: Teaching a Humanoid to Play Tennis from Imperfect Data

Shaoqing Tan — Tue, 19 May 2026 14:11:09 -0700

Introduces a three-stage pipeline that extracts a latent action space from noisy, low-quality human motion capture, then trains a high-level RL policy in simulation to compose and execute dynamic whole-body tennis skills. Achieves volleys at human-level performance on a humanoid robot.

CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

Shaoqing Tan — Tue, 19 May 2026 05:26:22 -0700

Closed-loop framework coupling Vision-Language Models with Video Generation Models at step-level granularity. Mitigates long-horizon drift and mid-clip errors in goal-directed video reasoning for robotic planning.

World Action Models: The Next Frontier in Embodied AI

Shaoqing Tan — Tue, 19 May 2026 05:10:48 -0700

First systematic survey defining World Action Models (WAMs) as embodied foundation models that jointly predict future states and generate actions. Covers architectures, data ecosystems, and evaluation protocols.

Training a Whole-Body Control Foundation Model

Shaoqing Tan — Mon, 18 May 2026 14:26:59 -0700

Describes end-to-end learning of a foundation model for adaptive whole-body humanoid control via massive simulation variation. Combines proprioceptive perception and policy adaptation across embodiments.

DexJoCo: A Unified Benchmark for Task-Oriented Dexterous Manipulation

Shaoqing Tan — Mon, 18 May 2026 14:11:24 -0700

Releases an open-source MuJoCo-based benchmark with 11 dexterous tasks, low-cost teleoperation hardware, and 1.1K human demonstrations. Designed to evaluate and train modern VLA/robotic policies.

MMSkills: Building Multimodal Skill Libraries for Visual Agents

Shaoqing Tan — Mon, 18 May 2026 05:29:29 -0700

Skill library, demonstrations, and dataset for multi-modal robotic skill learning and manipulation tasks.

PhysBrain 1.0 VLA (TwinBrainVLA): Dual-Brain Vision-Language-Action with Physics-Grounded Learning

Shaoqing Tan — Mon, 18 May 2026 05:16:17 -0700

Introduces dual-brain fusion Vision-Language-Action model with LangForce physics-grounded training methodology.

MolmoAct2-LIBERO: An Open Vision-Language-Action Model for Robotics

Shaoqing Tan — Sun, 17 May 2026 14:24:27 -0700

Vision-Language-Action (VLA) model fine-tuned on the merged LIBERO robotics dataset (1,693 episodes, 273k+ frames) achieving 98.25% success rate on manipulation tasks. Released with both checkpoint and dataset for VLA finetuning.

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Diffusion Transformers

Shaoqing Tan — Sun, 17 May 2026 14:12:10 -0700

A 2.6B-parameter open-source world model that generates coherent 720p, minute-long videos with precise 6-DoF camera control on a single GPU using a Hybrid Linear Diffusion Transformer + Gated DeltaNet for long-context efficiency. Targets controllable physics simulation.

WildClawBench: A Real-World, Long-Horizon Benchmark for AI Agents

Shaoqing Tan — Sun, 17 May 2026 05:24:48 -0700

New benchmark and dataset for robotic manipulation in unconstrained 'wild' environments. Includes standardized containers, leaderboards, and evaluation protocols for cross-embodiment policies.

MCP-Cosmos: Bring Your Own World Model

Shaoqing Tan — Sun, 17 May 2026 05:15:11 -0700

Introduces a latent-space world model framework that lets agents simulate state transitions and iteratively refine plans before real-world execution. Evaluated on 20+ MCP-Bench tasks with measurable gains in tool-use success.

OpenAI o1: Teaching LLMs to Think Slow and Deep

Shaoqing Tan — Sat, 16 May 2026 18:41:07 -0700

Details OpenAI's reasoning-focused o1 model and its 'long thought' approach using test-time compute scaling. Explores how extended reasoning during inference can improve model performance on complex tasks.

The Llama 3 Herd of Models

Shaoqing Tan — Sat, 16 May 2026 18:32:22 -0700

Comprehensive technical report on the Llama 3 family, covering architecture, training at scale, multimodal extensions, and real-world impact. Details the development of Meta's flagship open-source language model series.

LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data

Shaoqing Tan — Sat, 16 May 2026 18:03:21 -0700

Introduces a three-stage pipeline that extracts a latent action space from low-quality human tennis demonstrations, then trains a high-level policy in simulation via reinforcement learning. Enables dynamic whole-body humanoid tennis play with back-and-forth volleys at human level.

AnyFlow: Any-Step Video Diffusion for Predictive World Modeling

Shaoqing Tan — Thu, 14 May 2026 16:13:25 -0700

First any-step video diffusion framework using flow maps, allowing a single model to adapt to arbitrary inference budgets for scalable high-quality video generation relevant to predictive world modeling.

Robotics: The Endgame

Shaoqing Tan — Thu, 14 May 2026 16:02:02 -0700

Technical roadmap mirroring LLM scaling: critiques VLAs, advocates video world models as second pretraining phase, introduces World Action Models (WAM), manipulation data flywheels, EgoScale with new Dexterity Scaling Law, and DreamDojo end-to-end neural physics engine for sim RL.

Claw-Eval: Toward Trustworthy and Transparent Evaluation of Autonomous Agents

Shaoqing Tan — Wed, 08 Apr 2026 07:19:18 -0700

Benchmark with 2,159 rubric items across 300 tasks using trajectory-aware grading and 3-trial Pass^3 scoring to mitigate luck. Evaluates agent reliability in real-world robotics settings.

LIBERO-Para: Paraphrase Robustness in Robotic Manipulation

Shaoqing Tan — Wed, 08 Apr 2026 07:18:01 -0700

Reveals paraphrase fragility in VLAs causing 22-52% success drops due to task misidentification. Introduces PRIDE metric weighting success by paraphrase difficulty on LIBERO benchmark manipulation tasks.

YOR: Your Own Mobile Manipulator for Generalizable Robotics

Shaoqing Tan — Tue, 07 Apr 2026 07:41:37 -0700

Low-cost mobile manipulator design and training strategies for broad generalization in real-world tasks.

EgoSim: Egocentric World Simulator for Embodied Interaction Generation

Shaoqing Tan — Tue, 07 Apr 2026 07:29:11 -0700

Closed-loop egocentric video simulator maintaining persistent 3D scene state for consistent interactions, enabling cross-embodiment transfer from human videos to robotic manipulation.

Accelerating Video World Models: From Generative Videos to Real-Time Simulators

Shaoqing Tan — Mon, 06 Apr 2026 22:17:58 -0700

Comprehensive survey taxonomizing efficient architectures/algorithms for video world models as simulators, targeting compute bottlenecks in embodied AI, autonomous driving, and games with techniques like short-window attention for real-time long-horizon prediction.

From Tokens to Thoughts: Continuous Latent Reasoning in Large Models and Robot Control

Shaoqing Tan — Mon, 06 Apr 2026 22:14:05 -0700

Curated collection of 100+ works surveying the shift from discrete tokens to continuous latent spaces in LLMs/VLMs/VLAs for improved reasoning, with relevance to robotics action modeling.

CaP-X: Coding Agents for Physical eXecution

Shaoqing Tan — Mon, 06 Apr 2026 07:11:45 -0700

CaP-X is an open-source agentic robotics framework where LLMs/VLMs generate code to call perception and control APIs for execution across diverse simulated and real robots in CaP-Gym's 187 manipulation tasks. The framework includes CaP-Bench for evaluating frontier models and CaP-RL, which boosts a 7B model's success from 20% to 72% with minimal sim-to-real gap.

DoRA: Weight-Decomposed Low-Rank Adaptation

Shaoqing Tan — Sun, 05 Apr 2026 22:30:22 -0700

An upgrade over LoRA for parameter-efficient fine-tuning, enabling better performance in LLMs by decomposing weights into magnitude and direction components.
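The magnitude/direction split can be illustrated on a tiny matrix. The sketch below follows the column-wise formulation, W = m * V / ||V||_c, where m holds one magnitude per column and the direction is the column-normalized matrix; `delta` stands in for the low-rank update BA. It assumes every column has a nonzero norm, and the helper names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def column_norms(w: Matrix) -> List[float]:
    """L2 norm of each column of w."""
    rows, cols = len(w), len(w[0])
    return [math.sqrt(sum(w[i][j] ** 2 for i in range(rows))) for j in range(cols)]


def decompose(w: Matrix) -> Tuple[List[float], Matrix]:
    """Split w into per-column magnitudes and a unit-norm direction matrix."""
    m = column_norms(w)  # assumes no all-zero column
    direction = [[w[i][j] / m[j] for j in range(len(m))] for i in range(len(w))]
    return m, direction


def dora_update(w0: Matrix, delta: Matrix, m: List[float]) -> Matrix:
    """Adapted weight: m * (w0 + delta) / ||w0 + delta||_c, column by column."""
    v = [[w0[i][j] + delta[i][j] for j in range(len(m))] for i in range(len(w0))]
    norms = column_norms(v)
    return [[m[j] * v[i][j] / norms[j] for j in range(len(m))] for i in range(len(v))]


if __name__ == "__main__":
    w0 = [[3.0, 0.0], [4.0, 2.0]]
    m, direction = decompose(w0)
    print(m)          # per-column magnitudes
    print(direction)  # columns rescaled to unit length
    zero = [[0.0, 0.0], [0.0, 0.0]]
    print(dora_update(w0, zero, m))  # a zero update reproduces w0
```

In this formulation the magnitudes and the directional update are trained separately, which is the difference from plain LoRA that the blurb refers to.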

AI Model Collapse: What Happens When AI Trains on Its Own Outputs

Shaoqing Tan — Sun, 05 Apr 2026 22:15:47 -0700

Seminal work showing how training on AI-generated data leads to 'model collapse' in neural networks, with urgent implications for future scaling.

PhAIL: Benchmarking Vision-Language-Action Models on Real-World Bin-Picking

Shaoqing Tan — Sun, 05 Apr 2026 07:19:40 -0700

Real-world hardware evaluation of VLAs on blind bin-to-bin picking, reaching a maximum of 64 picks/hour across hundreds of runs, with full videos and data that expose gaps in production-scale robotic manipulation reliability.

Co-training Large Behavior Models: Data Modalities and Training Strategies for Robot Manipulation

Shaoqing Tan — Sat, 04 Apr 2026 22:42:13 -0700

Comprehensive evaluation of 89 policies showing optimal co-training practices mixing real robot data with sim/egocentric human videos to boost diversity and performance in large robotics foundation models.

HyDRA: Hybrid Memory for Dynamic Video World Models

Shaoqing Tan — Sat, 04 Apr 2026 22:31:30 -0700

Novel memory system preserving dynamic object identity and motion continuity across occlusions in video world models, addressing frozen/vanishing issues for improved predictive physics in embodied AI.

WildWorld: Dynamic World Modeling with Actions and Explicit State

Shaoqing Tan — Sat, 04 Apr 2026 07:29:53 -0700

Massive dataset enabling dynamic world models with explicit states and actions, supporting predictive modeling for cross-embodiment robotic control.

Omni-WorldBench: Evaluating Interactive 4D World Models

Shaoqing Tan — Sat, 04 Apr 2026 07:18:25 -0700

New benchmark assessing world models on interaction tasks, pushing predictive physics and video modeling towards robotics applications with action-conditioned evaluation.

SIMART: From Static Meshes to Sim-Ready Articulated Models

Shaoqing Tan — Fri, 03 Apr 2026 22:37:59 -0700

Unified MLLM framework with Sparse 3D VQ-VAE (70% token reduction) for part-level mesh decomposition and kinematic chain prediction, enabling physics-based robotic simulation from monolithic assets.

EgoSim: An Egocentric World Simulator for Embodied Interaction

Shaoqing Tan — Fri, 03 Apr 2026 22:23:00 -0700

Closed-loop egocentric simulator persistently updating 3D scene state to generate spatially consistent interaction videos for continuous simulation, enabling cross-embodiment transfer from human videos to robotic manipulation tasks.

Digit's New Motor Cortex: Sim-to-Real RL for Whole-Body Control

Shaoqing Tan — Fri, 03 Apr 2026 07:13:57 -0700

AI-trained capabilities for new whole-body motions using mocap/teleop data and sim-to-real reinforcement learning, deployable overnight on hardware.

EgoNav: Diffusion-Based Humanoid Navigation from Human Egocentric Video

Shaoqing Tan — Thu, 02 Apr 2026 22:32:11 -0700

Diffusion-based humanoid navigation trained solely on 5 hours of human egocentric video data, enabling zero-shot deployment on Unitree G1 for complex behaviors like handling glass walls, crowds, and dynamic obstacles via 360° visual memory and hybrid trajectory sampling; upcoming release of dataset, models, and code.

CaP-X: A Code-as-Policy Framework for Robot Manipulation

Shaoqing Tan — Thu, 02 Apr 2026 22:19:27 -0700

Comprehensive open-source agentic robotics framework treating VLMs/LLMs as code-generating APIs for perception (SAM3, Molmo) and control (IK, grasping), with CaP-Gym benchmark of 187 diverse manipulation tasks (tabletop, bimanual, mobile; sim/real) and CaP-Bench evaluating 12 frontier models; demonstrates rapid RL gains (7B model from 20% to 72% success) with strong sim-to-real transfer.

Embodied Intelligence Breakthrough: Generalist AI’s GEN-1 Robots

Shaoqing Tan — Thu, 02 Apr 2026 12:58:30 -0700

We've created GEN-1, our latest milestone in scaling robot learning. We believe it to be the first general-purpose AI model that crosses a new performance threshold: mastery of simple physical tasks. It improves average success rates to 99% on tasks where previous models achieve 64%, completes tasks roughly 3x faster than state of the art, and requires only 1 hour of robot data for each of these results. GEN-1 unlocks commercial viability across a broad range of applications—and while it cannot solve all tasks today, it is a significant step towards our mission of creating generalist intelligence for the physical world.

CaP-X: LMs' First Physical Exam

Shaoqing Tan — Thu, 02 Apr 2026 12:43:57 -0700

Evaluates frontier language models as coding agents for physical execution: models generate code that calls perception and control APIs to drive simulated and real robots across CaP-Gym's manipulation tasks.

AI Model Collapse: The Danger of Training on AI-Generated Data

Shaoqing Tan — Tue, 31 Mar 2026 07:36:21 -0700

Demonstrated that LLMs trained recursively on AI-generated data suffer model collapse, a degenerative process where they lose grasp of true data distributions. Sparked critical debates on data provenance and the importance of preserving human-generated training data.
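A toy analogue of recursive training, far simpler than an LLM, is to fit a Gaussian to a few samples, draw the next generation's data from that fit, and repeat. The sketch below is only an illustration of the mechanism under that assumption, not the paper's experiment; the generation count, sample size, and seed are arbitrary choices.

```python
import random
import statistics
from typing import List


def simulate_collapse(generations: int = 200, sample_size: int = 10, seed: int = 0) -> List[float]:
    """Refit a Gaussian to its own samples each generation; return the fitted std per generation."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: the "true" data distribution
    history = [std]
    for _ in range(generations):
        # Train the next model only on data generated by the current one.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)  # maximum-likelihood estimate, biased low
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    print(f"initial std: {history[0]:.3g}, final std: {history[-1]:.3g}")
```

Each refit slightly underestimates the spread and adds sampling noise, so the fitted distribution tends to narrow over generations and lose the tails of the original data.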

High-Level Automated Reasoning with Qwen2.5-7B

Shaoqing Tan — Tue, 31 Mar 2026 07:35:15 -0700

Qwen2.5-7B achieved 79.6% on the MATH benchmark, surpassing GPT-4o, by combining atomic reasoning actions with Monte Carlo Tree Search. Demonstrated that strategic reasoning architectures can enable smaller models to outperform much larger ones.

Co-Training Large Behavior Models: Multimodal Data for Robot Manipulation

Shaoqing Tan — Mon, 30 Mar 2026 22:19:22 -0700

Explores data modalities and co-training strategies to enhance large behavior models (foundation models) for improved performance in robot manipulation tasks, supporting end-to-end learning and cross-embodiment generalization.

HyDRA: Hybrid Memory for Dynamic Video World Models

Shaoqing Tan — Sun, 29 Mar 2026 22:20:50 -0700

Memory architecture preserving identity and motion continuity for out-of-view dynamic subjects, addressing frozen/vanishing issues in video world models.

DexWM: Leveraging Human Videos for Dexterous Robot World Models

Shaoqing Tan — Sun, 29 Mar 2026 22:18:43 -0700

Dataset of robot trajectories designed for training world models to learn dexterous hand-object interactions directly from human videos.

World Models in Robotics

Shaoqing Tan — Sun, 29 Mar 2026 07:14:29 -0700

Technical survey categorizing world models into action-conditioned, video-inverse dynamics, and joint world-action models (WAMs), discussing their generalization, video data leverage, and trends for closing the robotics data gap.

SIMART: Decomposing Monolithic Meshes into Sim-Ready Articulated Assets

Shaoqing Tan — Sat, 28 Mar 2026 07:21:21 -0700

Unified MLLM framework with Sparse 3D VQ-VAE that reduces tokens by 70% for efficient part-level decomposition and kinematic prediction in physics-based robotic simulations.

LeWorldModel: A Stable JEPA World Model from Pixels

Shaoqing Tan — Fri, 27 Mar 2026 22:16:24 -0700

Stable end-to-end JEPA world model trained directly from pixels using a simple MSE prediction loss and SIGReg anti-collapse regularization. With 15M parameters, it enables efficient latent planning in under 1 second, shows emergent spatial structure, and outperforms prior methods.

World Models for Robots: The Next Big Leap?

Shaoqing Tan — Fri, 27 Mar 2026 07:39:49 -0700

Technical overview defining world models in robotics, their potential to solve diverse problems via video prediction, and key enablers like scale.

Harnessing Long-Running AI in Embodied Systems

Shaoqing Tan — Fri, 27 Mar 2026 00:11:43 -0700

As AI moves from quick Q&A to marathon tasks, designers grapple with continuity. This episode explores how Anthropic's harness design principles translate to embodied AI: robots that need to maintain context across long-running missions.

HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

Shaoqing Tan — Wed, 25 Mar 2026 22:29:11 -0700

Whole-Body Mobile Manipulation Interface (HoMMI) that learns bimanual and whole-body manipulation, long-horizon navigation, and active perception directly from egocentric human demonstrations without teleoperation.

TurboQuant: Redefining AI Efficiency with Extreme Compression

Shaoqing Tan — Wed, 25 Mar 2026 17:52:48 -0700

This episode explores TurboQuant, a revolutionary set of quantization algorithms from Google Research that redefines AI efficiency through extreme compression.

We dive deep into how TurboQuant addresses one of AI's most pressing challenges: the memory bottleneck created by high-dimensional vectors in key-value caches. The research introduces theoretically grounded quantization methods that enable massive compression for large language models and vector search engines without sacrificing performance.

Key topics covered:

The theoretical foundations of TurboQuant's quantization algorithms
How extreme compression works for LLMs and vector search engines
Impact on high-dimensional vectors and key-value cache memory bottlenecks
Performance metrics and comparisons with existing methods
Practical implications for AI deployment and efficiency

Links:
Paper: https://arxiv.org/pdf/2504.19874
Blog: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/

DexWM: Learning Dexterous Object Manipulation from Human Videos

Shaoqing Tan — Wed, 25 Mar 2026 07:19:46 -0700

Dataset of robot trajectories designed for training world models that learn dexterous hand-object interactions from human videos, released on Hugging Face.

FlashAttention-3: Fast & Accurate Attention with Asynchrony & Low-Precision

Shaoqing Tan — Tue, 24 Mar 2026 22:54:40 -0700

Major efficiency leap for Transformer attention mechanisms, enabling faster training/inference on long sequences with low-precision compute.

When AI Trains on Its Own Output: The Model Collapse Problem

Shaoqing Tan — Tue, 24 Mar 2026 22:39:06 -0700

Warns of "model collapse" in LLMs trained on synthetic data from prior models, urging preservation of human-generated data. One of 2024's most influential papers.

MolmoBot: A Vision-Language Model for Zero-Shot Robot Manipulation

Shaoqing Tan — Tue, 24 Mar 2026 07:22:35 -0700

Vision-language model (VLM) for zero-shot robot manipulation, trained entirely in simulation without real-world data; achieves 79.2% success rate on real-world tabletop tasks, outperforming π₀.₅ baseline at 39.2%.

LeWorldModel: Stable End-to-End JEPA from Pixels

Shaoqing Tan — Tue, 24 Mar 2026 01:12:52 -0700

A stable end-to-end Joint Embedding Predictive Architecture (JEPA) trained directly from pixels that enables robust world modeling for embodied AI systems.

EgoVerse: An Egocentric Data Ecosystem for Scaling Robot Learning

Shaoqing Tan — Mon, 23 Mar 2026 22:18:26 -0700

Ecosystem with over 1300 hours of egocentric human video data spanning 240 scenes and 2000+ tasks, designed for scalable robot policy training via behavior cloning; includes cloud infrastructure, data viewer, and human-to-robot transfer algorithms to enable cross-embodiment learning without teleoperation.

HSImul3R: Physics-Driven Reconstruction of Human–Scene Interactions

Shaoqing Tan — Mon, 23 Mar 2026 22:15:11 -0700

Physics-in-the-loop bi-directional optimization pipeline reconstructing stable, simulation-ready 3D human-scene interactions from casual videos, deployable directly to humanoid robots for world modeling and manipulation.

MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation

Shaoqing Tan — Mon, 23 Mar 2026 11:48:08 -0700

Open-source suite of large-scale simulation environments and benchmarks designed for advancing end-to-end learning in robot navigation and manipulation across multiple embodiments.

DreamZero: World Action Models Are Zero-Shot Policies

Shaoqing Tan — Mon, 23 Mar 2026 11:36:32 -0700

Introduces World Action Models (WAMs), a family of 14B-parameter autoregressive diffusion models that jointly predict video and robotic actions to enable zero-shot generalization across manipulation tasks, outperforming fine-tuned Vision-Language-Action models on benchmarks like MolmoSpaces and RoboArena.

Kinema4D: A 4D Generative Simulator for Embodied AI

Shaoqing Tan — Sun, 22 Mar 2026 19:16:00 -0700

An action-conditioned 4D generative robotic simulator that disentangles precise kinematic control from environmental dynamics, facilitating physically-plausible simulations of complex robot-world interactions for training and world modeling.

VEGA-3D: Teaching multimodal LLMs spatial reasoning through video generation

Shaoqing Tan — Sun, 22 Mar 2026 19:02:22 -0700

A plug-and-play framework extracts implicit 3D priors from video diffusion models to enhance multimodal LLMs with spatial reasoning capabilities, enabling improved geometric scene understanding and embodied decision-making without explicit 3D supervision.