| 
          
            | 
                Karl Pertsch
               I am a postdoc at UC Berkeley and Stanford University, where I work with Sergey Levine and Chelsea Finn on training robot foundation models.
              I'm also a member of the technical staff at Physical Intelligence. I completed my PhD at the University of Southern California (USC), working with Joseph Lim. During my PhD, I was fortunate to intern at Meta AI and spend time as a student researcher at Google Brain with Karol Hausman. Before my PhD, I spent one year as a Fulbright Scholar at the University of Pennsylvania, working with Kostas Daniilidis.
               
                Email  / 
                 Twitter   / 
                Google Scholar  / 
                CV  / 
                 LinkedIn 
               |   |  
          
            | Research 
                I'm interested in machine learning, reinforcement learning and robotics. At the moment, I am working on training foundation models for robotics. 
                Towards this goal, I focus on three key challenges: 
                (1) building diverse robot datasets, 
                (2) training large-scale robot policies on this data, 
                and (3) developing approaches for scalably evaluating robot foundation models.
               |  
        
          |  | 
           
            
            FAST: Efficient Action Tokenization for Vision-Language-Action Models
            
            Karl Pertsch, 
            Kyle Stachowicz, 
            Brian Ichter, 
            Danny Driess, 
            Suraj Nair, 
            Quan Vuong, 
            Oier Mees, 
            Chelsea Finn,
            Sergey Levine
 ArXiv, 2025
 paper  / 
            website  / 
            code
 
 We release FAST, a new action tokenization method for vision-language-action models. FAST is a simple, efficient, and scalable method for tokenizing actions into a compact, discrete representation.
            With FAST, we can train VLAs 5x faster and build the first VLAs that work zero-shot in new environments.  
           |  
          |  | 
           
            
            Re-Mix: Optimizing Data Mixtures for Large Scale Imitation Learning
            
            Joey Hejna, 
            Chethan Bhateja, 
            Yichen Jian, 
            Karl Pertsch, 
            Dorsa Sadigh
 Conference on Robot Learning (CoRL), 2024
 paper  / 
            code
 
 We develop a scalable approach for optimizing data mixtures for large-scale robot imitation learning, using group distributionally robust optimization. Our approach generates dataset weights for the RT-X data mixture that 
            outperform weights tuned by human experts.  
           |  
          |  | 
           
            
            Robotic Control via Embodied Chain-of-Thought Reasoning
            
            Michal Zawalski*, 
            William Chen*, 
            Karl Pertsch, 
            Oier Mees,
            Chelsea Finn,
 Sergey Levine
 Conference on Robot Learning (CoRL), 2024
 project page / paper  / 
            code / models
 
 We propose embodied chain-of-thought learning for vision-language-action models (VLAs). By training VLAs to "look and think" before acting, i.e., to predict intermediate "grounded reasoning steps" such as subtasks and object bounding boxes, we can substantially improve generalization.
            Our approach increases the performance of OpenVLA on challenging generalization evaluations by 30% without any additional robot data.   
           |  
          |  | 
           
            
            OpenVLA: An Open-Source Vision-Language-Action Model
            
            Moo Jin Kim*, Karl Pertsch*, Siddharth Karamcheti*,
            Ted Xiao,
            Ashwin Balakrishna,
            Suraj Nair,
            Rafael Rafailov,
            Ethan Foster,
            Grace Lam,
            Pannag Sanketi,
            Quan Vuong,
            Thomas Kollar,
            Benjamin Burchfiel,
            Russ Tedrake,
            Dorsa Sadigh,
            Sergey Levine,
            Percy Liang,
            Chelsea Finn
 Conference on Robot Learning (CoRL), 2024
 project page / paper  / 
            code / models
 
 We introduce OpenVLA, a 7B-parameter open-source vision-language-action model (VLA), pretrained on 970k robot episodes from the Open X-Embodiment dataset. OpenVLA sets a new state of the art for generalist robot manipulation policies. It supports controlling multiple robots out of the box and can be quickly adapted to new robot setups via parameter-efficient fine-tuning. OpenVLA models, code, and training data are fully open-source. 
           |  
          |  | 
           
            
            Evaluating Real-World Robot Manipulation Policies in Simulation
            
            Xuanlin Li*, Kyle Hsu*, Jiayuan Gu*, Karl Pertsch, Oier Mees, ..., Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, Ted Xiao
 Conference on Robot Learning (CoRL), 2024
 project page / paper  / 
            code
 
  We introduce SIMPLER, a collection of simulated environments for manipulation policy evaluation on common real robot setups. We demonstrate strong correlation between policy performance in SIMPLER environments and in the real world through paired sim-and-real evaluations of open-source manipulation policies. 
           |  
          |  | 
           
            
            DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
            
            Alexander Khazatsky*, Karl Pertsch*, Suraj Nair, ..., Thomas Kollar, Sergey Levine, Chelsea Finn
 Robotics: Science and Systems (RSS), 2024
 project page / paper  / 
            dataset visualizer
 
  We introduce DROID, the most diverse robot manipulation dataset to date. It contains 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup. 
           |  
          |  | 
           
            
            Octo: An Open-Source Generalist Robot Policy
            
            Dibya Ghosh*, Homer Walke*, Karl Pertsch*, Kevin Black*, Oier Mees*, ..., Dorsa Sadigh, Chelsea Finn, Sergey Levine
 Robotics: Science and Systems (RSS), 2024
 project page / tech report  / 
            code
 
  We introduce Octo, an open-source generalist policy, trained on 800k robot trajectories. Octo is a large, transformer-based diffusion policy that supports flexible task specification, observation and action spaces. It can control a diverse range of robots out of the box and supports efficient finetuning to new robot configurations. We release pre-trained checkpoints and our full training + finetuning pipelines. 
           |  
          |  | 
           
            
            Open X-Embodiment: Robotic Learning Datasets and RT-X Models
            
            Open X-Embodiment Collaboration
 (Project co-leads: Quan Vuong, Karl Pertsch)
 International Conference on Robotics and Automation (ICRA), 2024 (Best Conference Paper Award)
 project page / arXiv  / 
            dataset
 
  We introduce the Open X-Embodiment Dataset, the largest robot learning dataset to date with 1M+ real robot trajectories, spanning 22 robot embodiments. We train large, transformer-based policies on the dataset (RT-1-X, RT-2-X) and show that co-training with our diverse dataset substantially improves performance. 
           |  
          |  | 
           
            
            Cross-Domain Transfer via Semantic Skill Imitation
            
            Karl Pertsch, Ruta Desai, Vikash Kumar, Franziska Meier, Joseph J. Lim, Dhruv Batra, Akshara Rai
 Conference on Robot Learning (CoRL), 2022
 project page / arXiv  / 
            code
 
  We learn a semantic skill policy that enables cross-domain imitation: from robot to robot between different environments and even from human video to robot. We show that we can learn long-horizon robotic manipulation tasks in a simulated kitchen environment using only three minutes of human video, recorded in my kitchen with a GoPro strapped to my head. 
           |  
          |  | 
           
            
            Assisted Teleoperation for Scalable Robot Data Collection
            
            Shivin Dass*, Karl Pertsch*, Hejia Zhang, Youngwoon Lee, Joseph J. Lim, Stefanos Nikolaidis
 project page / arXiv  / 
            code
 
  We enable scalable robot data collection by assisting human teleoperators with a learned policy. Our approach estimates its uncertainty over future actions to determine when to request user input. In real-world user studies, we demonstrate that our system enables more efficient teleoperation with reduced mental load, and lets a single operator control up to four robots in parallel. 
           |  
          |  | 
           
            
            Task-Induced Representation Learning
            
            Jun Yamada, Karl Pertsch, Anisha Gunjal, Joseph J. Lim
 International Conference on Learning Representations (ICLR), 2022
 project page / arXiv  / 
            code
 
  We evaluate the effectiveness of representation learning approaches in visually complex environments with substantial distractors. We compare common unsupervised representation learning approaches to task-induced representations, which leverage task information from prior tasks to learn which parts of the scene are important to model and which can be ignored. 
           |  
          |  | 
           
            
            Skill-based Meta-Reinforcement Learning
            
            Taewook Nam, Shao-Hua Sun, Karl Pertsch, Sung Ju Hwang, Joseph J. Lim
 International Conference on Learning Representations (ICLR), 2022
 project page / arXiv  / 
            code
 
  We perform meta-RL on top of skills extracted from large task-agnostic offline datasets. By combining meta-training tasks with offline data, we can meta-learn policies that quickly learn new long-horizon, sparse-reward tasks. 
           |  
          |  | 
           
            
            Demonstration-Guided Reinforcement Learning with Learned Skills
            
            Karl Pertsch, Youngwoon Lee, Yue Wu, Joseph J. Lim
 Conference on Robot Learning (CoRL), 2021
 project page / arXiv  / 
            code
 
  We follow long-horizon demonstrations by imitating the demonstrated skills instead of the primitive actions. By using skills learned from large, task-agnostic experience datasets for imitation, our approach SkiLD can seamlessly integrate task-agnostic data & demonstrations via a skill-based learning framework. 
           |  
          |  | 
           
            
            Accelerating Reinforcement Learning with Learned Skill Priors
            
            Karl Pertsch, Youngwoon Lee, Joseph J. Lim
 Conference on Robot Learning (CoRL), 2020 (Plenary Talk, top 4%)
 Workshop on Robot Learning @ NeurIPS, 2020 (Best Paper Runner-up Award)
 Deep RL Workshop @ NeurIPS, 2020 (Oral)
 project page / arXiv  / 
            code
 
  We jointly learn an embedding space of skills and a prior over skills. This skill prior tells us when to use which skill and guides learning on new tasks for effective skill transfer from large offline datasets. 
           |  
          |  | 
           
            
            Motion Planner Augmented Reinforcement Learning for Robot Manipulation in Obstructed Environments
            
            Jun Yamada*, Youngwoon Lee*, Gautam Salhotra, Karl Pertsch, Max Pflueger, Gaurav S. Sukhatme, Joseph J. Lim, Peter Englert
 Conference on Robot Learning (CoRL), 2020
 project page / arXiv  / 
            code
 
  Our approach augments model-free RL agents with motion planning capabilities, enabling them to solve long-horizon manipulation tasks in cluttered environments. 
           |  
          |  | 
           
            
            Long-Horizon Visual Planning with Goal-Conditioned Hierarchical Predictors
            
            Karl Pertsch*, Oleh Rybkin*, Frederik Ebert, Chelsea Finn, Dinesh Jayaraman, Sergey Levine
 Conference on Neural Information Processing Systems (NeurIPS), 2020
 project page / arXiv  /
            video / code
 
  We propose a hierarchical prediction model that predicts sequences by recursive infilling. We use this model to devise a hierarchical planning approach that scales visual MPC to long-horizon tasks with hundreds of time steps. 
           |  
          |  | 
           
            
            Keyframing the Future: Keyframe Discovery for Visual Prediction and Planning
            
            Karl Pertsch*, Oleh Rybkin*, Jingyun Yang, Shenghao Zhou, Kosta Derpanis, Joseph Lim, Kostas Daniilidis, Andrew Jaegle
 Conference on Learning for Dynamics and Control, 2020
 project page / arXiv  /
            video / poster
 
  We propose a keyframe-based video prediction model that discovers, without supervision, the moments of interesting change in the data: the keyframes. We show that using the predicted keyframes as subgoals for planning improves performance on a simulated pushing task. 
           |  
	        |  | 
	         
	          
	          Learning what you can do before doing anything
	          
	          Oleh Rybkin*, Karl Pertsch*,  Kosta Derpanis, Kostas Daniilidis, Andrew Jaegle
 International Conference on Learning Representations (ICLR), 2019
 project page / arXiv  /
	          poster
 
  We learn an agent's action space from pure visual observations along with a predictive model. The learned model can then be used for model predictive control, requiring orders of magnitude fewer action-annotated videos. 
	         |  
            |   | 
                
                  iPose: Instance-Aware 6D Pose Estimation of Partly Occluded Objects
                
                Omid Hosseini Jafari*, Siva Karthik Mustikovela*, Karl Pertsch,  Eric Brachmann, Carsten Rother
 Asian Conference on Computer Vision (ACCV), 2018
 
 We combine CNN-based regression of dense on-object surface labels with RANSAC-based pose fitting for accurate 6DoF pose estimation of texture-less objects under heavy occlusion. |  
	      
	        | 
 
	          
	          I borrowed this website layout from here!
		    
	         |  |