Karl Pertsch

I am a postdoc at UC Berkeley and Stanford University, where I work with Sergey Levine and Chelsea Finn on training robot foundation models. I'm also a member of the technical staff at Physical Intelligence.

I completed my PhD at the University of Southern California (USC), working with Joseph Lim. During my PhD, I was fortunate to intern at Meta AI and spend time as a student researcher at Google Brain with Karol Hausman. Before my PhD, I spent one year as a Fulbright Scholar at the University of Pennsylvania, working with Kostas Daniilidis.

Email / Twitter / Google Scholar / CV / LinkedIn

Research

I'm interested in machine learning, reinforcement learning and robotics. At the moment, I am working on training foundation models for robotics. Towards this goal, I focus on three key challenges: (1) building diverse robot datasets, (2) training large-scale robot policies on this data, and (3) developing approaches for scalably evaluating robot foundation models.

	FAST: Efficient Action Tokenization for Vision-Language-Action Models Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, Sergey Levine ArXiv, 2025 paper / website / code We release FAST, a new action tokenization method for vision-language-action models. FAST is a simple, efficient, and scalable method for tokenizing actions into a compact, discrete representation. With FAST, we can train VLAs 5x faster and build the first VLAs that work zero-shot in new environments.
	Re-Mix: Optimizing Data Mixtures for Large Scale Imitation Learning Joey Hejna, Chethan Bhateja, Yichen Jian, Karl Pertsch, Dorsa Sadigh Conference on Robot Learning (CoRL), 2024 paper / code We develop a scalable approach for optimizing data mixtures for large-scale robot imitation learning, using group distributionally robust optimization. Our approach generates dataset weights for the RT-X data mixture that outperform weights tuned by human experts.
	Robotic Control via Embodied Chain-of-Thought Reasoning Michal Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, Sergey Levine Conference on Robot Learning (CoRL), 2024 project page / paper / code / models We propose embodied chain-of-thought learning for vision-language-action models (VLAs). By training VLAs to "look and think" before acting, i.e., to predict intermediate "grounded reasoning steps" such as subtasks and object bounding boxes, we enable substantially improved generalization. Our approach increases the performance of OpenVLA on challenging generalization evaluations by 30% without any additional robot data.
	OpenVLA: An Open-Source Vision-Language-Action Model Moo Jin Kim, Karl Pertsch*, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn Conference on Robot Learning (CoRL), 2024 project page / paper / code / models We introduce OpenVLA, a 7B-parameter open-source vision-language-action model (VLA), pretrained on 970k robot episodes from the Open X-Embodiment dataset. OpenVLA sets a new state of the art for generalist robot manipulation policies. It supports controlling multiple robots out of the box and can be quickly adapted to new robot setups via parameter-efficient fine-tuning. OpenVLA models, code, and training data are fully open-source.
	Evaluating Real-World Robot Manipulation Policies in Simulation Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, ..., Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, Ted Xiao Conference on Robot Learning (CoRL), 2024 project page / paper / code We introduce SIMPLER, a collection of simulated environments for manipulation policy evaluation on common real robot setups. We demonstrate strong correlation between policy performance in SIMPLER environments and in the real world through paired sim-and-real evaluations of open-source manipulation policies.
	DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset Alexander Khazatsky, Karl Pertsch*, Suraj Nair, ..., Thomas Kollar, Sergey Levine, Chelsea Finn Robotics: Science and Systems (RSS), 2024 project page / paper / dataset visualizer We introduce DROID, the most diverse robot manipulation dataset to date. It contains 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open-source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.
	Octo: An Open-Source Generalist Robot Policy Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, ..., Dorsa Sadigh, Chelsea Finn, Sergey Levine Robotics: Science and Systems (RSS), 2024 project page / tech report / code We introduce Octo, an open-source generalist policy, trained on 800k robot trajectories. Octo is a large, transformer-based diffusion policy that supports flexible task specification, observation and action spaces. It can control a diverse range of robots out of the box and supports efficient finetuning to new robot configurations. We release pre-trained checkpoints and our full training + finetuning pipelines.
	Open X-Embodiment: Robotic Learning Datasets and RT-X Models Open X-Embodiment Collaboration (Project co-leads: Quan Vuong, Karl Pertsch) International Conference on Robotics and Automation (ICRA), 2023 (Best Conference Paper Award) project page / arXiv / dataset We introduce the Open X-Embodiment Dataset, the largest robot learning dataset to date with 1M+ real robot trajectories, spanning 22 robot embodiments. We train large, transformer-based policies on the dataset (RT-1-X, RT-2-X) and show that co-training with our diverse dataset substantially improves performance.
	Cross-Domain Transfer via Semantic Skill Imitation Karl Pertsch, Ruta Desai, Vikash Kumar, Franziska Meier, Joseph J. Lim, Dhruv Batra, Akshara Rai Conference on Robot Learning (CoRL), 2022 project page / arXiv / code We learn a semantic skill policy that enables cross-domain imitation: from robot to robot between different environments and even from human video to robot. We show that we can learn long-horizon robotic manipulation tasks in a simulated kitchen environment using only three minutes of human video, recorded in my kitchen with a GoPro strapped to my head.
	Assisted Teleoperation for Scalable Robot Data Collection Shivin Dass, Karl Pertsch**, Hejia Zhang, Youngwoon Lee, Joseph J. Lim, Stefanos Nikolaidis project page / arXiv / code We enable scalable robot data collection by assisting human teleoperators with a learned policy. Our approach estimates its uncertainty over future actions to determine when to request user input. In real-world user studies, we demonstrate that our system enables more efficient teleoperation with reduced mental load, scaling to up to four robots in parallel.
	Task-Induced Representation Learning Jun Yamada, Karl Pertsch, Anisha Gunjal, Joseph J. Lim International Conference on Learning Representations (ICLR), 2022 project page / arXiv / code We evaluate the effectiveness of representation learning approaches on visually complex environments with substantial distractors. We compare common unsupervised representation learning approaches to task-induced representations, which leverage task information from prior tasks to learn which parts of the scene are important to model and which can be ignored.
	Skill-based Meta-Reinforcement Learning Taewook Nam, Shao-Hua Sun, Karl Pertsch, Sung Ju Hwang, Joseph J. Lim International Conference on Learning Representations (ICLR), 2022 project page / arXiv / code We perform meta-RL on top of skills extracted from large task-agnostic offline datasets. By combining meta-training tasks with offline data we can meta-learn policies that can quickly learn new long-horizon, sparse reward tasks.
	Demonstration-Guided Reinforcement Learning with Learned Skills Karl Pertsch, Youngwoon Lee, Yue Wu, Joseph J. Lim Conference on Robot Learning (CoRL), 2021 project page / arXiv / code We follow long-horizon demonstrations by imitating the demonstrated skills instead of the primitive actions. By using skills learned from large, task-agnostic experience datasets for imitation, our approach SkiLD can seamlessly integrate task-agnostic data & demonstrations via a skill-based learning framework.
	Accelerating Reinforcement Learning with Learned Skill Priors Karl Pertsch, Youngwoon Lee, Joseph J. Lim Conference on Robot Learning (CoRL), 2020 (Plenary Talk, top 4%) Workshop on Robot Learning @ NeurIPS, 2020 (Best Paper Runner-up Award) Deep RL Workshop @ NeurIPS, 2020 (Oral) project page / arXiv / code We jointly learn an embedding space of skills and a prior over skills. This skill prior tells us when to use which skill and guides learning on new tasks for effective skill transfer from large offline datasets.
	Motion Planner Augmented Reinforcement Learning for Robot Manipulation in Obstructed Environments Jun Yamada, Youngwoon Lee, Gautam Salhotra, Karl Pertsch, Max Pflueger, Gaurav S. Sukhatme, Joseph J. Lim, Peter Englert Conference on Robot Learning (CoRL), 2020 project page / arXiv / code Our approach augments model-free RL agents with motion planning capabilities, enabling them to solve long-horizon manipulation tasks in cluttered environments.
	Long-Horizon Visual Planning with Goal-Conditioned Hierarchical Predictors Karl Pertsch, Oleh Rybkin, Frederik Ebert, Chelsea Finn, Dinesh Jayaraman, Sergey Levine Conference on Neural Information Processing Systems (NeurIPS), 2020 project page / arXiv / video / code We propose a hierarchical prediction model that predicts sequences by recursive infilling. We use this model to devise a hierarchical planning approach that scales visual MPC to long-horizon tasks with hundreds of time steps.
	Keyframing the Future: Keyframe Discovery for Visual Prediction and Planning Karl Pertsch, Oleh Rybkin, Jingyun Yang, Shenghao Zhou, Kosta Derpanis, Joseph Lim, Kostas Daniilidis, Andrew Jaegle Conference on Learning for Dynamics and Control, 2020 project page / arXiv / video / poster We propose a keyframe-based video prediction model that discovers the moments of interesting change in the data, the keyframes, without supervision. We show that using the predicted keyframes as subgoals for planning improves performance on a simulated pushing task.
	Learning what you can do before doing anything Oleh Rybkin, Karl Pertsch*, Kosta Derpanis, Kostas Daniilidis, Andrew Jaegle International Conference on Learning Representations (ICLR), 2019 project page / arXiv / poster We learn an agent's action space from pure visual observations along with a predictive model. The learned model can then be used to perform model predictive control, requiring orders of magnitude fewer action-annotated videos.
	iPose: Instance-Aware 6D Pose Estimation of Partly Occluded Objects Omid Hosseini Jafari, Siva Karthik Mustikovela, Karl Pertsch, Eric Brachmann, Carsten Rother Asian Conference on Computer Vision (ACCV), 2018 We combine CNN-based regression of dense on-object surface labels with RANSAC-based pose fitting for accurate 6DoF pose estimation of texture-less objects under heavy occlusion.

I borrowed this website layout from here!