AIDASLab/Awesome-VLA-Data-Collection-Synthesis-Curation


Awesome VLA Data Collection, Synthesis, and Curation

A curated list of data-centric methods for Vision-Language-Action models and robot foundation models: collection, simulation, digital twins, cross-embodiment augmentation, neural trajectories, RL-generated rollouts, cleaning, preprocessing, annotation, and benchmarks.



📑 Full Table of Contents

📚 Surveys & Resources

🗄️ Robot Data Substrates

🤖 Data Engines

🧹 Curation, Cleaning & Preprocessing

🧾 Meta



🧾 Artifact Legend

| Tag | Meaning |
| --- | --- |
| Data | Dataset or generated trajectories are released. |
| Code | Collection, conversion, generation, training, or evaluation code is released. |
| Method | Pipeline is described, but artifacts may be private or unclear. |
| Gated | Access or license restrictions apply. |
| Weights | Model checkpoints are released. |
| Benchmark | Evaluation tasks, metrics, or benchmark data are released. |

🏷️ Embodiment Tags

single-arm · bimanual · mobile-manipulator · humanoid · dexterous-hand · cross-embodiment · robot-free · simulation · real-world · latent-action · world-model · RL-generated


📚 Surveys & Resources

📝 Survey & Perspective Papers

| Paper | Year | Focus |
| --- | --- | --- |
| Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines | 2026.04 | Data-centric VLA survey. |
| 3D Gaussian Splatting in Robotics: A Survey | 2024.10 | Survey of 3DGS scene representations, rendering, and robotics applications. |
| Vision Language Action Models in Robotic Manipulation: A Systematic Review | 2025.07 | Systematic review of VLA models, datasets, and simulation platforms for manipulation. |
| Vision-Language-Action Models for Robotics: A Survey | 2025 | Broad VLA survey. |
| Survey of Vision-Language-Action Models for Embodied Manipulation | 2025.08 | VLA methods and embodied manipulation. |
| What Matters in Building Vision-Language-Action Models for Generalist Robots | 2025 | Empirical study on VLA data and design choices. |
| A Tutorial Note on Collecting Simulated Data for Vision-Language-Action Models | 2025.08 | Simulated VLA data collection. |

🧭 Related Repos

Repository Focus
Open X-Embodiment Multi-embodiment robot data.
VLA Datasets, Benchmarks, and Data Engines Data-centric VLA datasets, benchmarks, and data-engine taxonomy.
UMI Robot Dataset Community UMI-style datasets.

🛠️ Tutorials & Tooling

| Resource | Focus |
| --- | --- |
| LeRobot Documentation | Robot datasets, policies, and tooling. |
| OpenVLA | OXE preprocessing and VLA fine-tuning. |
| VLA Foundry | Unified LLM/VLM/VLA training and robot data preprocessing. |


🗄️ Robot Data Substrates

🧺 Open Robot Data Corpora

| Work | Year | Embodiment | Artifact | Notes |
| --- | --- | --- | --- | --- |
| Open X-Embodiment / RT-X | 2023.10 | cross-embodiment | Data | Multi-robot dataset for cross-embodiment learning. |
| BridgeData V2 | 2023.08 | single-arm | Data | Real-world WidowX manipulation data. |
| DROID | 2024.03 | single-arm | Data, Code | In-the-wild robot manipulation demonstrations. |
| LeRobot | 2026.02 | cross-embodiment | Data, Code | Open robot data and policy ecosystem. |
| RoboMIND | 2025 | cross-embodiment | Data | Real-world manipulation trajectories across robots and tasks. |
| RoboMIND 2.0 | 2025.12 | bimanual, mobile-manipulator, cross-embodiment | Data, Benchmark | 310K+ real dual-arm trajectories across six embodiments, plus tactile/mobile manipulation and simulated trajectories. |
| AgiBot World | 2025 | mobile, humanoid, bimanual | Data | Large-scale real robot corpus. |
| AgiBot World Colosseo | 2025.03 | bimanual, dexterous, cross-embodiment | Data, Code, Weights | Large-scale manipulation platform with 1M+ trajectories and GO-1 policy artifacts. |
| Galaxea Open-World Dataset | 2025 | mobile dual-arm | Gated, Data | Open-world behavior data. |
| Action100M | 2026.01 | human / internet video | Data | O(100M) dense open-vocabulary action annotations from instructional videos for world-model and embodied pretraining. |
| OpenEAI Dataset | 2026 | cross-embodiment | Data | Unified VLA pretraining corpus aggregating Open X-Embodiment, UMI Community, and related embodied data sources. |
| RH20T | 2023 | single-arm | Data | Robot-human multimodal manipulation. |
| CLAMP | 2025.05 | tactile, robot-free | Data, Code | In-the-wild multimodal haptic dataset collected with a low-cost open-source device. |
| Kaiwu | 2025.03 | human, robot, multimodal | Data | Synchronized multimodal manipulation data with pressure, audio, mocap, gaze, EMG, and fine-grained labels. |
| RoboSet / RoboHive | 2023 | arms, hands | Data, Code | Real and simulated robot learning data. |
| RoboNet | 2020 | multi-robot | Data, Code | Shared multi-robot experience dataset. |

🦾 Humanoid & Dexterous Corpora

| Work | Year | Embodiment | Artifact | Notes |
| --- | --- | --- | --- | --- |
| Humanoid Everyday | 2025 | humanoid | Data | Everyday humanoid manipulation data. |
| Unitree UnifoLM-WBT Dataset | 2025 | humanoid | Data | Whole-body teleoperation data. |
| Fourier ActionNet | 2026 | humanoid, bimanual | Data | Dexterous bimanual humanoid dataset. |
| DexCap | 2024.03 | dexterous-hand | Code, Method | Portable hand mocap and retargeting. |
| DexUMI | 2025.05 | dexterous-hand | Data, Method | Robot-free dexterous collection. |
| XL-VLA | 2026 | dexterous-hand | Data, Code | Cross-hand teleoperation data and latent action representation. |
| RealSource-World | 2025 | humanoid, arms | Data | Real robot data release. |

🧪 Benchmarks & Evaluation Suites

| Benchmark | Year | Setting | Artifact | Notes |
| --- | --- | --- | --- | --- |
| LIBERO | 2023 | single-arm, simulation | Data, Code, Benchmark | Lifelong language-conditioned manipulation benchmark. |
| RLBench | 2019 | single-arm, simulation | Data, Code, Benchmark | Large-scale vision-guided manipulation suite built on CoppeliaSim/PyRep. |
| CALVIN | 2021 | single-arm, simulation | Data, Code, Benchmark | Long-horizon language-conditioned manipulation benchmark. |
| SimplerEnv | 2024 | single-arm, simulation | Code, Benchmark | Sim-real evaluation framework for Google Robot and WidowX-style policies. |
| ManiSkill2 | 2023 | arms, mobile, dexterous, simulation | Data, Code, Benchmark | Unified benchmark for generalizable manipulation skills. |
| RoboCasa | 2024 | household manipulation, simulation | Data, Code, Benchmark | Photorealistic household manipulation benchmark built on robosuite. |
| RoboTwin 2.0 | 2025 | bimanual, simulation | Data, Code, Benchmark | Dual-arm benchmark and scalable data generator with domain randomization. |
| VLABench | 2025 | long-horizon manipulation, simulation | Data, Code, Benchmark | Language-conditioned manipulation benchmark for long-horizon reasoning. |
| Kinetix | 2024 | 2D physics, RL | Code, Benchmark | Open-ended physics-control benchmark; adjacent to robot policy evaluation. |
| RoboTwin 1.0 | 2024 | bimanual, simulation | Data, Code, Benchmark | Early RoboTwin dual-arm benchmark with generative digital twins. |
| LIBERO-Plus | 2026 | single-arm, robustness | Data, Benchmark | Robustness extension of LIBERO. |
| LIBERO-Pro | 2025 | single-arm, robustness | Code, Benchmark | LIBERO extension for evaluation beyond memorization. |
| RoboArena | 2025 | real-world, distributed | Benchmark | Community-run real-world benchmark for generalist robot policies. |
| MIKASA-Robo | 2025 | tabletop manipulation, memory | Code, Benchmark | Memory-intensive tabletop manipulation benchmark. |
| RoboCerebra | 2025 | long-horizon manipulation | Data, Benchmark | Long-horizon benchmark for planning, reflection, and memory. |
| LIBERO-Mem | 2026 | single-arm, memory | Benchmark | Non-Markovian object-centric memory benchmark. |
| RoboChallenge | 2025 | real-world manipulation | Benchmark | Real-robot evaluation benchmark for embodied policies. |
| RoboMME | 2026 | memory-augmented manipulation | Data, Code, Benchmark | Benchmark for memory-augmented robotic manipulation. |


🤖 Data Engines

🎮 Real-World Capture & Robot-Free Collection

| Work | Year | Embodiment | Artifact | Core Mechanism |
| --- | --- | --- | --- | --- |
| ALOHA | 2023 | bimanual | Code, Data | Low-cost bimanual teleoperation. |
| Mobile ALOHA | 2024.01 | mobile bimanual | Code, Data | Mobile bimanual teleoperation. |
| ALOHA 2 | 2024 | bimanual | Code | Improved low-cost bimanual teleop hardware. |
| UMI | 2024.02 | robot-free, single-arm | Code, Data | Handheld gripper for robot-free demonstrations. |
| UMI Data Community | 2025 | robot-free | Data | UMI-style dataset hub. |
| FastUMI | 2024.09 | robot-free | Data, Code | Scalable UMI-style collection. |
| UMI-3D | 2026.04 | robot-free | Method | UMI with 3D spatial perception. |
| DexUMI | 2025.05 | dexterous-hand | Data, Method | Human hand collection with robot-hand conversion. |
| DexCap | 2024.03 | dexterous-hand | Code, Method | Hand mocap and retargeting. |
| DEX-Mouse | 2026.04 | dexterous-hand | Method | Low-cost force-feedback dexterous teleop. |
| SABER | 2025 | ego, hands, humanoid | Data, Method | Human behavior capture to embodied action supervision. |
| Lucid-XR | 2026.04 | dexterous, robot-free, simulation | Method | XR-headset physics simulation, human-to-robot retargeting, and physics-guided video generation for synthetic manipulation data. |
| GELLO | 2023.09 | single-arm, bimanual | Code, Method | Low-cost kinematically matched teleoperation devices for high-quality demonstrations. |
| RoboWheel | 2025.12 | cross-embodiment | Method | Human-object interaction reconstruction for robot learning. |
| MV-UMI | 2026 | cross-embodiment | Method | Multi-view UMI-style interface for cross-embodiment learning. |
| Touch in the Wild | 2025.07 | robot-free, tactile | Method | Portable visuo-tactile gripper for fine-grained in-the-wild demonstrations. |

🏗️ Simulation-Based Demonstration Generation

| Work | Year | Embodiment | Artifact | Core Mechanism |
| --- | --- | --- | --- | --- |
| MimicGen | 2023.10 | single-arm | Data, Code | Object-centric trajectory retargeting. |
| SkillMimicGen / SkillGen | 2024.10 | single-arm | Method | Skill segmentation and motion-planner stitching. |
| DexMimicGen | 2024.10 | bimanual, dexterous | Data, Code | Bimanual dexterous demo generation. |
| SoftMimicGen | 2026.03 | deformable, bimanual, humanoid | Method | MimicGen-style automated data generation for rope, towel, tissue, stuffed-animal, surgical, and humanoid deformable manipulation tasks. |
| DiffGen | 2024.05 | simulation, manipulation | Method | Differentiable physics, differentiable rendering, and VLM objectives for automatic robot demonstration generation. |
| IntervenGen | 2024.05 | simulation, real-world | Method | Expands a small number of human corrective interventions into synthetic recovery demonstrations for robust imitation learning. |
| CyberDemo | 2024.02 | dexterous-hand, simulation | Data, Method | Simulated human demonstrations with domain and task augmentation for real-world dexterous manipulation transfer. |
| RoboGen | 2023.11 | manipulation, locomotion | Code, Method | Generative simulation for tasks and supervision. |
| GenSim2 | 2024.10 | simulation, articulated objects | Code, Method | Multimodal LLM task generation with planning and RL solvers that generate demonstrations for articulated manipulation. |
| Scaling Up and Distilling Down | 2023.07 | simulation, language-conditioned | Code, Data | LLM-guided planners generate diverse language-labeled robot trajectories that are distilled into a multitask visuomotor policy. |
| RoboCasa | 2024.06 | household manipulation | Data, Code | Generated household tasks and demos. |
| RoboCasa365 | 2026.03 | household, mobile manipulation | Data, Method | Large-scale household task suite. |
| RoboTwin | 2025.04 | bimanual | Data, Code | Dual-arm digital-twin-style benchmark. |
| RoboTwin 2.0 | 2025.06 | bimanual, cross-embodiment | Data, Code | Scalable dual-arm generator. |
| GenSim | 2023.10 | simulation, manipulation | Code, Method | LLM-generated simulation tasks, curricula, and expert demonstrations. |
| Video2Policy | 2025.02 | simulation, human / internet video | Method | Reconstructs manipulation tasks from internet RGB videos and trains RL policies with LLM-generated rewards. |
| Learning Interactive Real-World Simulators | 2023.10 | simulation, world-model | Method | Learns interactive simulators from real-world data for zero-shot policy transfer. |
| Scaling Robot Learning with Semantically Imagined Experience | 2023.02 | simulation, single-arm | Method | Semantically imagined experience for scaling robot learning with generated variations. |
| MolmoB0T / MolmoBot-Engine | 2025 | single-arm, mobile manipulator | Data, Code | Procedural simulation engine. |
| AGT-World | 2026.02 | simulation, long-horizon | Method | Affordance-graph task worlds with self-evolving VLM/geometric feedback for task and policy generation. |
| Generative Simulation for pHRI | 2026.04 | physical HRI, simulation | Method | Text2sim2real pipeline for synthesizing assistive pHRI scenes, soft-body humans, and robot motion trajectories. |
| CP-Gen | 2025.08 | single-arm | Method | Constraint-preserving keypoint generation. |
| DynaMimicGen | 2025.11 | dynamic manipulation | Method | Data generation for dynamic tasks. |
| MoMaGen | 2025.10 | bimanual mobile | Method | Reachability and visibility constrained generation. |
| HumanoidGen | 2025.07 | humanoid, dexterous | Data, Code | LLM/MCTS-generated humanoid dexterous demos. |

🧱 3D Reconstruction & Digital-Twin Generation

| Work | Year | Embodiment | Artifact | Core Mechanism |
| --- | --- | --- | --- | --- |
| DemoGen | 2025.02 | arm, bimanual, dexterous | Code, Method | 3D point-cloud scene and trajectory synthesis. |
| RoboSplat | 2025.04 | 3DGS, manipulation | Method | Edits 3D Gaussians to generate one-shot visual demonstrations across object pose/type, camera, lighting, scene appearance, and embodiment changes. |
| RoboGSim | 2024.11 | 3DGS, simulation | Code, Method | Real2Sim2Real robotic simulator combining 3DGS reconstruction with physics-engine interaction for synthetic trajectories, scenes, objects, and viewpoints. |
| SplatSim | 2024.09 | 3DGS, simulation | Method | Uses Gaussian splats as photoreal rendering primitives for zero-shot sim-to-real RGB manipulation policy training. |
| NeRF-Aug | 2025.09 | NeRF, manipulation | Method | Uses object-level neural radiance fields to synthesize photorealistic, 3D-consistent demonstrations for novel objects without collecting new human demos. |
| DISCOVERSE | 2025.07 | 3DGS, simulation | Code, Method | Open-source Real2Sim2Real simulator using 3DGS for high-fidelity rendering and MuJoCo for physics. |
| RoboSimGS | 2025.10 | 3DGS, simulation | Method | Hybrid 3DGS-plus-mesh framework where 3DGS captures photoreal appearance and mesh primitives provide physically interactive manipulation assets. |
| A High-Fidelity Digital Twin for Robotic Manipulation Based on 3DGS | 2026.01 | 3DGS, digital twin | Method | Converts sparse RGB 3DGS reconstructions into semantically labeled, collision-ready digital twins integrated with Unity, ROS2, and MoveIt. |
| EmbodiedGen / RoboSplatter | 2025.06 | 3DGS, embodied simulation | Code, Method | Generative 3D world engine with RoboSplatter for high-fidelity 3DGS rendering inside embodied simulation pipelines. |
| GaussGym | 2025.10 | 3DGS, locomotion | Code, Data | Real-to-sim framework that plugs 3DGS rendering into vectorized physics simulators for high-throughput pixel-based robot learning. |
| Real2Render2Real | 2025.05 | robot arms | Method | Real scan and human video to rendered robot demonstrations. |
| EMMA | 2025.09 | real-world, single-arm | Method | Generative visual transfer via multi-view consistent embodied manipulation video editing. |
| ERMV | 2025.07 | multi-view, VLA | Method | Edits 4D robotic multi-view image sequences with geometric, temporal, and robot-state consistency for VLA data augmentation. |
| AOMGen | 2025.12 | articulated objects, simulation | Method | Photoreal, physics-consistent demonstration generation for articulated objects from a single scan, demonstration, and digital assets. |

🔁 Cross-Embodiment Augmentation & Retargeting

| Work | Year | Embodiment | Artifact | Core Mechanism |
| --- | --- | --- | --- | --- |
| MIRAGE | 2024.02 | cross-embodiment arms | Code, Method | Robot masking and visual inpainting. |
| RoVi-Aug | 2024.09 | cross-embodiment arms | Code, Method | Robot appearance and viewpoint augmentation. |
| RoboEngine | 2025.03 | cross-scene robot data | Method | Robot segmentation and background generation. |
| H2R | 2025.05 | human video to robot | Method | Human hand keypoints to rendered robot motions. |
| From Generated Human Videos to Physically Plausible Robot Trajectories | 2025.12 | humanoid | Method, Benchmark | Converts generated human videos into physically plausible humanoid trajectories through reconstruction, retargeting, and robust policy execution. |
| X-Humanoid | 2025.12 | humanoid | Data, Method | Robotizes human videos into humanoid videos at scale using paired synthetic data and video-to-video generation. |
| OXE-AugE | 2025.12 | cross-embodiment | Data, Method | Robot augmentation over Open X-Embodiment. |
| CEI: Cross-Embodiment Interface | 2026.01 | arms, grippers, hands | Method | Observation and action synthesis across embodiments. |
| R2RGen | 2025.10 | real robot point clouds | Method | Real-to-real 3D pointcloud/action augmentation. |
| XL-VLA | 2026 | dexterous hands | Code, Data | Cross-hand latent action representation. |
| Being-H0 | 2025.07 | human hands, dexterous | Method | Human-video physical instruction tuning and motion tokenization. |
| Being-H0.5 | 2026.01 | cross-embodiment | Method, Weights | Unified action space for embodiment transfer. |
| OmniRetarget | 2025.09 | humanoid | Method | Whole-body humanoid motion retargeting. |
| SPIDER | 2026 | dexterous, humanoid | Method | Physics-based retargeting for hands and humanoids. |
| Perceptive Humanoid Parkour | 2026.02 | humanoid | Method | Retargets and composes atomic human parkour skills with motion matching, then distills them into a depth-based real-robot policy. |

🧠 Neural Trajectory Synthesis & Labeling

| Work | Year | Embodiment | Artifact | Core Mechanism |
| --- | --- | --- | --- | --- |
| DreamGen | 2025.05 | arms, humanoid | Method | Video world model to action-labeled neural trajectories. |
| RLDX-1 | 2026.05 | dexterous, humanoid | Code, Weights, Method | Video generation, motion filtering, and inverse dynamics labels. |
| RoboCurate | 2026.02 | humanoid, dexterous | Method | Action-verified neural trajectory filtering. |
| EVA | 2026.03 | bimanual, arms | Method | Inverse-dynamics reward for executable videos. |
| TC-IDM | 2026.01 | generated video, 3D motion | Method | Tool-centric inverse dynamics model that converts world-model video plans into 6-DoF end-effector motions and controls. |
| NovaFlow | 2025.10 | generated video, 3D flow | Method | Distills generated videos into 3D actionable object flow and realizes it with grasp proposals, trajectory optimization, and particle dynamics. |
| TraceGen / TraceForge | 2025.11 | cross-embodiment | Method | 3D trace-space world modeling and trace data pipeline. |

🎯 RL & Expert-Policy Rollouts

| Work | Year | Embodiment | Artifact | Core Mechanism |
| --- | --- | --- | --- | --- |
| RLDG | 2024.12 | real robot, single-arm | Code, Method | Trains task-specific RL specialists, rolls them out to generate high-quality data, and distills the data into generalist robot policies. |
| RDGen | 2026.05 | sim-real, single-arm | Method | Trains RL policies as structured trajectory generators, transfers them to real robots, and harvests successful rollouts as demonstrations for downstream VLA training. |
| PLD / Deployment-Aligned Data | 2026.05 | sim-real, VLA post-training | Method | Trains lightweight residual RL specialists, collects recovery trajectories from base-policy states, and distills them back into the VLA. |
| RL-Driven Data Generation for Dexterous Grasping | 2025.04 | dexterous-hand, simulation | Method | Uses a residual RL teacher skill to generate contact-rich grasping trajectories across object geometries for vision-action policy training. |
| Beyond Human Demonstrations | 2025.09 | simulation, arms | Method | Diffusion RL experts generate VLA training data. |
| Discover, Learn, and Reinforce | 2025.11 | simulation, arms | Method | Diverse RL rollouts for VLA pretraining. |
| OmniReset | 2025 | arms, dexterous | Code, Method | Diverse resets, RL experts, and RGB distillation. |


🧹 Curation, Cleaning & Preprocessing

🧼 Cleaning & Standardization

| Work | Year | Artifact | Pipeline |
| --- | --- | --- | --- |
| Robust Learning from Demonstrations with Mixed Qualities Using Leveraged Gaussian Processes | 2019 | Method | Learning from demonstrations (LfD) of mixed quality; autonomous driving. |
| ABot-M0 / UniACT | 2026.02 | Method | Cleans, standardizes, and balances heterogeneous public datasets into UniACT. |
| VLA Foundry | 2026.04 | Code, Method | WebDataset sharding, normalization, action chunking, SE(3) actions, and multi-dataset stats. |
| HoloBrain-0 / RoboOrchard | 2026.02 | Code, Method | Full-stack VLA infrastructure for data curation, training, and deployment. |
| Green-VLA | 2026.02 | Method | Temporal alignment, quality filtering, and embodiment-aware action interfaces. |
| LingBot-VLA | 2026.01 | Code, Weights, Benchmark | Efficient VLA post-training stack and GM-100 benchmark; raw pretraining data not confirmed open. |
| OpenVLA Data Tooling | 2024 | Code | OXE mixture conversion and fine-tuning utilities. |
| LeRobot Dataset Format | 2024 | Code | Standardized robot dataset format and tooling. |
| Hybrid-VLA Data Pipeline | 2025 | Code | RLDS conversion, action tokenization, and multimodal collation. |
| Qwen-VLA | 2026.05 | Method | Large-scale joint pretraining recipe across robot trajectories, egocentric demonstrations, synthetic simulation, VLN, trajectory supervision, and VLM data. |

🏷️ Annotation & Relabeling

| Work | Year | Artifact | Pipeline |
| --- | --- | --- | --- |
| ShareRobot | 2025 | Data, Code | Task planning, affordance, trajectory, and QA annotations. |
| Robo2VLM | 2025.05 | Data, Code | VQA generation from pose, gripper, force, and trajectory signals. |
| CAST | 2025.08 | Method | Counterfactual language/action relabeling that augments existing robot datasets with synthetic trajectory-instruction pairs. |
| Q-DIG | 2026.03 | Method | Quality-diversity prompt generation for VLA red-teaming; generated adversarial instructions are used to augment fine-tuning data. |
| CLIP-RT | 2024.11 | Method | Language-supervised data collection with stochastic trajectory augmentation and heuristic labeling for language-conditioned robot policies. |
| PixelVLA / Pixel-160K | 2025.11 | Method, Data | Automated pixel-level annotation from robot data. |
| RoboVQA | 2023.11 | Data, Method | Long-horizon robotics VQA and reasoning data. |
| RoboAfford++ | 2025 | Method | Generative affordance and spatial reasoning annotations. |
| Being-H0 | 2025.07 | Method | Human video, mocap, and VR data curation for dexterous VLA pretraining. |
| RLDX-1 | 2026.05 | Code, Weights, Method | Inverse dynamics labels for synthetic rare manipulation scenarios. |
| RoboInter | 2026 | Method | Intermediate representations and VQA/VLA-oriented robotic annotations. |

🧩 Task Curation & Dataset Design

| Work | Year | Artifact | Pipeline |
| --- | --- | --- | --- |
| RoboGene | 2026.02 | Data, Method | Agentic generation of diverse, feasible manipulation tasks. |
| RoboGene Dataset | 2026.02 | Data | Released RoboGene task-generation dataset. |
| LIBERO | 2023 | Data, Code | Procedural language-conditioned task suites. |
| RoboCasa | 2024 | Data, Code | Generated household tasks and environments. |
| RoboCasa365 | 2026.03 | Data, Method | Large-scale household task and demonstration suite. |
| MolmoBot-Engine | 2025 | Data, Code | Procedural task and scene generation. |
| ABot-N0 Data Engine | 2026.02 | Method | Expert trajectories and reasoning samples for embodied navigation. |
| RESample | 2025.10 | Method | Offline-RL critic and exploratory rollout sampling generate OOD/recovery data to improve VLA robustness. |


🤝 Contributing

PRs are welcome. You can also send papers, corrections, artifact updates, or taxonomy suggestions to jake630@snu.ac.kr.
