OpenVLA: An Open-Source Vision-Language-Action Model


Getting Started | Pretrained VLAs | Installation | Fine-Tuning OpenVLA via LoRA | Fully Fine-Tuning OpenVLA | Training VLAs from Scratch | Evaluating OpenVLA | Project Website


Getting Started

To get started with loading and running OpenVLA models for inference, we provide a lightweight interface that leverages HuggingFace transformers AutoClasses, with minimal dependencies.

For example, to load openvla-7b for zero-shot instruction following in the BridgeData V2 environments with a WidowX robot:

# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", 
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16, 
    low_cpu_mem_usage=True, 
    trust_remote_code=True
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = get_from_camera(...)
prompt = "In: What action should the robot take to {<INSTRUCTION>}?\nOut:"

# Predict Action (7-DoF; un-normalize for BridgeData V2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute...
robot.act(action, ...)
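
For closed-loop control, you would typically repeat this predict-act cycle every timestep. A minimal sketch, reusing the processor, vla, and prompt from above (get_from_camera, task_done, and robot.act are placeholders for your own camera and robot interfaces):

# Closed-loop control sketch; `predict_action` returns the un-normalized 7-DoF
# action as a NumPy array: [dx, dy, dz, droll, dpitch, dyaw, gripper].
while not task_done():                # hypothetical termination check
    image = get_from_camera(...)      # grab the latest camera frame
    inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
    action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    robot.act(action, ...)            # send the action to your controller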

We also provide an example script for fine-tuning OpenVLA models on new tasks and embodiments; this script supports multiple fine-tuning modes, including (quantized) low-rank adaptation (LoRA) via HuggingFace's PEFT library.
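
Under the hood, LoRA fine-tuning wraps the model with a PEFT adapter. A minimal sketch of the idea, assuming the vla model loaded above (the hyperparameters here are illustrative, not necessarily the script's exact defaults):

from peft import LoraConfig, get_peft_model

# Illustrative LoRA configuration; the fine-tuning script exposes these as flags.
lora_cfg = LoraConfig(
    r=32,                         # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",  # adapt every linear layer
    init_lora_weights="gaussian",
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()  # only a small fraction of weights are trainable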

For deployment, we provide a lightweight script for serving OpenVLA models over a REST API. This makes it easy to integrate OpenVLA models into existing robot control stacks and removes the need for powerful on-device compute.
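
On the client side, querying such a server can be as simple as one HTTP POST. A hypothetical sketch (the endpoint route, port, and payload schema below are assumptions; check the deployment script for the actual interface):

import numpy as np
import requests
import json_numpy

json_numpy.patch()  # lets `requests` JSON-encode numpy arrays transparently

image = np.zeros((256, 256, 3), dtype=np.uint8)  # stand-in for a real camera frame
action = requests.post(
    "http://0.0.0.0:8000/act",  # assumed host/port/route
    json={"image": image, "instruction": "put the eggplant in the pot"},
).json()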


Installation

Note: These installation instructions are for full-scale pretraining (and distributed fine-tuning); if you just want to run inference with OpenVLA models (or perform lightweight fine-tuning), see the instructions above!

This repository was built using Python 3.10, but should be backwards compatible with any Python >= 3.8. We require PyTorch 2.2.* -- installation instructions can be found here. The latest version of this repository was developed and thoroughly tested with:

  • PyTorch 2.2.0, torchvision 0.17.0, transformers 4.40.1, tokenizers 0.19.1, timm 0.9.10, and flash-attn 2.5.5

[5/21/24] Note: Following reported regressions and breaking changes in later versions of transformers, timm, and tokenizers we explicitly pin the above versions of the dependencies. We are working on implementing thorough tests, and plan on relaxing these constraints as soon as we can.
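
After installing, you can verify the pins with a quick sanity check; a sketch using importlib.metadata:

import importlib.metadata as md

# Pinned versions from the note above; keys are the PyPI distribution names.
pins = {
    "torch": "2.2.0",
    "torchvision": "0.17.0",
    "transformers": "4.40.1",
    "tokenizers": "0.19.1",
    "timm": "0.9.10",
    "flash-attn": "2.5.5",
}
for dist, expected in pins.items():
    installed = md.version(dist)
    # `startswith` tolerates local version suffixes such as "2.2.0+cu121"
    status = "OK" if installed.startswith(expected) else f"MISMATCH (expected {expected})"
    print(f"{dist}=={installed}  {status}")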

Use the setup commands below to get started:

# Create and activate conda environment
conda create -n openvla python=3.10 -y
conda activate openvla

# Install PyTorch. Below is a sample command to do this, but you should check the following link
# to find installation instructions that are specific to your compute platform:
# https://pytorch.org/get-started/locally/
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia -y  # UPDATE ME!

# Install the openvla package from third_party
cd ./third_party/openvla
pip install -e .

# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
#   =>> If you run into difficulty, try `pip cache remove flash_attn` first
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation
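
To confirm the flash-attn build succeeded and is visible to transformers, a small check (is_flash_attn_2_available is part of transformers.utils in the pinned version):

import torch
from transformers.utils import is_flash_attn_2_available

# FlashAttention-2 only runs on CUDA GPUs; verify both the GPU and the package.
assert torch.cuda.is_available(), "FlashAttention-2 requires a CUDA-capable GPU"
print("flash-attn 2 available:", is_flash_attn_2_available())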

If you run into any problems during the installation process, please file a GitHub Issue.

Note: See vla-scripts/ for full training and verification scripts for OpenVLA models. The scripts/ directory is mostly a holdover from the original (base) prismatic-vlms repository and supports training and evaluating visually-conditioned language models. While you can use this repo to train both VLMs and VLAs, note that generating language (via scripts/generate.py) with existing OpenVLA models will not work: we train current OpenVLA models to generate actions, and actions alone.

LIBERO Simulation Benchmark Evaluations

LIBERO Setup

Clone and install the LIBERO repo:

git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .

Additionally, install other required packages:

cd openvla
pip install -r experiments/robot/libero/libero_requirements.txt

To download the modified versions of the LIBERO datasets that we used in our fine-tuning experiments, run the command below. This will download the LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-10 datasets in RLDS data format (~10 GB total). You can use these to fine-tune OpenVLA or train other methods. This step is optional since we provide pretrained OpenVLA checkpoints below. (Also, you can find the script we used to generate the modified datasets in raw HDF5 format here and the code we used to convert these datasets to the RLDS format here.)

git clone git@hf.co:datasets/openvla/modified_libero_rlds
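
Once downloaded, you can inspect an RLDS dataset with tensorflow_datasets. A sketch; the task-suite folder and version subdirectory below are assumptions, so point the builder at whichever directory actually exists inside your modified_libero_rlds clone:

import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("modified_libero_rlds/libero_spatial_no_noops/1.0.0")
ds = builder.as_dataset(split="train")

# Each RLDS episode contains a nested `steps` dataset of observations and actions.
episode = next(iter(ds))
for step in episode["steps"].take(1):
    print(step.keys())  # e.g., observation, action, and language-instruction fields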

Launching LIBERO Evaluations

We fine-tuned OpenVLA via LoRA (r=32) on four LIBERO task suites independently: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-10 (also called LIBERO-Long). The four checkpoints are available on Hugging Face:

  • openvla/openvla-7b-finetuned-libero-spatial
  • openvla/openvla-7b-finetuned-libero-object
  • openvla/openvla-7b-finetuned-libero-goal
  • openvla/openvla-7b-finetuned-libero-10

To start evaluation with one of these checkpoints, run one of the commands below. Each will automatically download the appropriate checkpoint listed above.

# Launch LIBERO-Spatial evals
python experiments/robot/libero/run_libero_eval.py \
  --model_family openvla \
  --pretrained_checkpoint openvla/openvla-7b-finetuned-libero-spatial \
  --task_suite_name libero_spatial \
  --center_crop True

# Launch LIBERO-Object evals
python experiments/robot/libero/run_libero_eval.py \
  --model_family openvla \
  --pretrained_checkpoint openvla/openvla-7b-finetuned-libero-object \
  --task_suite_name libero_object \
  --center_crop True

# Launch LIBERO-Goal evals
python experiments/robot/libero/run_libero_eval.py \
  --model_family openvla \
  --pretrained_checkpoint openvla/openvla-7b-finetuned-libero-goal \
  --task_suite_name libero_goal \
  --center_crop True

# Launch LIBERO-10 (LIBERO-Long) evals
python experiments/robot/libero/run_libero_eval.py \
  --model_family openvla \
  --pretrained_checkpoint openvla/openvla-7b-finetuned-libero-10 \
  --task_suite_name libero_10 \
  --center_crop True

Notes:

  • The evaluation script will run 500 trials by default (10 tasks x 50 episodes each). You can modify the number of trials per task by setting --num_trials_per_task. You can also change the random seed via --seed.
  • NOTE: Setting --center_crop True is important because we fine-tuned OpenVLA with random crop augmentations (we took a random crop with 90% area in every training sample, so at test time we simply take the center 90% crop; see the sketch after these notes).
  • The evaluation script logs results locally. You can also log results in Weights & Biases by setting --use_wandb True and specifying --wandb_project <PROJECT> and --wandb_entity <ENTITY>.
  • The results reported in our paper were obtained using Python 3.10.13, PyTorch 2.2.0, transformers 4.40.1, and flash-attn 2.5.5 on an NVIDIA A100 GPU, averaged over three random seeds. Please stick to these package versions. Note that results may vary slightly if you use a different GPU for evaluation due to GPU nondeterminism in large models (though we have tested that results were consistent across different machines with A100 GPUs).
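
For reference, the preprocessing implied by --center_crop True takes a center crop covering 90% of the image area (i.e., sqrt(0.9) of each side length) and resizes it back to the original resolution. The evaluation script handles this automatically; this minimal PIL sketch is only meant to make the transformation concrete:

import math
from PIL import Image

def center_crop_90(img: Image.Image) -> Image.Image:
    w, h = img.size
    scale = math.sqrt(0.9)                  # keep 90% of the image area
    cw, ch = int(w * scale), int(h * scale)
    left, top = (w - cw) // 2, (h - ch) // 2
    return img.crop((left, top, left + cw, top + ch)).resize((w, h))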

Please file a GitHub Issue if you run into any problems.


Repository Structure

High-level overview of repository/project file-tree:

  • prismatic - Package source; provides core utilities for model loading, training, data preprocessing, etc.
  • vla-scripts/ - Core scripts for training, fine-tuning, and deploying VLAs.
  • experiments/ - Code for evaluating OpenVLA policies in robot environments.
  • LICENSE - All code is made available under the MIT License; happy hacking!
  • Makefile - Top-level Makefile (by default, supports linting - checking & auto-fix); extend as needed.
  • pyproject.toml - Full project configuration details (including dependencies), as well as tool configurations.
  • README.md - You are here!

VLA Performance Troubleshooting

In this section we cover best practices for debugging poor VLA performance after fine-tuning on your target domain robot dataset.

Note: OpenVLA typically requires fine-tuning on a small demonstration dataset (~100 demos) from your target domain robot. Out-of-the-box, it only works well on domains from the training dataset.

Sanity checks:

  • Replay the actions from a demonstration in your fine-tuning dataset and make sure that the robot can execute the task successfully (this verifies that your data collection pipeline is correct).
  • Once you have fine-tuned a model, load it in your inference pipeline (as if you were running it to control the robot), but feed images from the fine-tuning dataset into the model (pretending they come from the robot) and verify that you can reproduce the token accuracies / L1 errors from training (this verifies that your inference pipeline is correct); see the sketch after this list.
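
A minimal sketch of sanity check (2), reusing the torch, processor, and vla setup from Getting Started; load_finetuning_samples and the unnorm_key value are placeholders for your own data pipeline:

import numpy as np

l1_errors = []
for image, instruction, gt_action in load_finetuning_samples(...):  # hypothetical loader
    prompt = f"In: What action should the robot take to {instruction}?\nOut:"
    inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
    pred = vla.predict_action(**inputs, unnorm_key="<your_dataset>", do_sample=False)
    l1_errors.append(np.abs(np.asarray(pred) - np.asarray(gt_action)))

# These should roughly match the L1 errors you saw during fine-tuning.
print("Mean L1 error per action dimension:", np.mean(l1_errors, axis=0))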

Best practices for fine-tuning data collection: If your setup passed the above two sanity checks, the issue may not be in model training, but in the data you fine-tuned the model with. Some best practices for data collection:

  • Collect at a control frequency of around 5-10 Hz. OpenVLA is not trained with action chunking, and empirically the model struggles with high-frequency data. If your robot setup uses a high-frequency controller (e.g., 50 Hz), consider downsampling your actions to 5 Hz (see the sketch after this list). Verify first that your robot can still solve the task when using 5 Hz actions (i.e., repeat sanity check (1) above with 5 Hz actions).
  • Avoid pauses / small actions during data collection. Because OpenVLA is trained without action chunking, the model can be sensitive to idle actions in the fine-tuning data. If your data contains steps in which the robot barely moves, the model may "get stuck" in these steps at inference time. Try to collect fine-tuning demonstrations with continuous, slow movement.
  • Ensure sufficient data coverage. If you plan to test the model under variation, e.g., different initial object positions, make sure your fine-tuning data contains comparable diversity, e.g., demonstrations collected from diverse initial conditions.
  • Use consistent task strategies during data collection. This is not a hard constraint, but it may make your life easier. Try to demonstrate tasks in consistent ways, e.g., approach objects from the same side and perform sub-steps in the same order, even when they could be performed in arbitrary sequences. Consistency gives you a less multi-modal fine-tuning dataset, which makes the modeling problem easier.
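
As an illustration of the downsampling suggestion in the first bullet, here is a rough sketch for relative (delta) end-effector actions. Summing rotation deltas assumes small per-step rotations, and the function name and action layout are assumptions about your data, so treat this as a starting point rather than a drop-in solution:

import numpy as np

def downsample_actions(actions_50hz: np.ndarray, factor: int = 10) -> np.ndarray:
    """Downsample (T, 7) delta actions [dx, dy, dz, droll, dpitch, dyaw, gripper]
    from 50 Hz to 5 Hz by accumulating each window of `factor` steps."""
    T = (len(actions_50hz) // factor) * factor
    windows = actions_50hz[:T].reshape(-1, factor, 7)
    pose = windows[:, :, :6].sum(axis=1)  # sum pose deltas (small-angle assumption)
    grip = windows[:, -1, 6:]             # keep the last gripper command per window
    return np.concatenate([pose, grip], axis=-1)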

Citation

If you find our code or models useful in your work, please cite our paper:

@article{kim24openvla,
    title={OpenVLA: An Open-Source Vision-Language-Action Model},
    author={{Moo Jin} Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn},
    journal={arXiv preprint arXiv:2406.09246},
    year={2024}
}