Skip to content

EvolvingLMMs-Lab/LLaVA-OneVision-2

Repository files navigation

LLaVA-OneVision-2

Fully Open Framework for Democratized Multimodal Training

👉 WeChat Group 👈

🤗 2 Models · Datasets · Technical Report · HomePage · Codec Playground · Discord

🤗 1.5 Models · Datasets · Technical Report · NeMo · Discord


NEWS

Contents

Introduction

LLaVA-OneVision-2 is the next-generation release of the LLaVA-OneVision family — a fully open 8B multimodal model that unifies image, long-form video, and spatial understanding under a single architecture, with the entire pipeline (data, encoders, training, checkpoints, logs) released end-to-end.

🎬 Codec-Aligned Vision Encoders

Beyond uniform patchification. OneVision-Encoder and OneVision-Encoder-Lang are HEVC-style vision transformers that add a codec-stream input mode alongside image and uniform-frame video — selecting only motion- and residual-rich patches and sampling dense frames sparsely instead of sparse frames densely. The result is dramatically longer temporal coverage under the same token budget, where prior ViT backbones simply run out of context.

🧊 One Model, Every Modality

Most open multimodal models still live in a 2D, single-image world. LLaVA-OneVision-2-8B-Instruct breaks out of it — one model, native resolution, no task-specific adapters, no hidden tricks.

  • Long video — multi-frame reasoning with efficient codec-aligned inference
  • 3D-aware spatial reasoning — depth, layout, object relations
  • Documents, OCR, charts — structured visual inputs at native resolution

New open-source SOTA across a broad suite of multimodal benchmarks.

🚀 Fully Open, Reproducible from Day One

Four datasets ship with the LLaVA-OneVision family — two new for 2, two carried forward from 1.5:

  • LLaVA-OneVision-2-VideoCaption — extremely dense video captions
  • LLaVA-OneVision-2-Spatial — 3D-aware spatial reasoning
  • LLaVA-OneVision-1.5-Mid-Training-85M — 85M concept-balanced mid-training corpus
  • LLaVA-OneVision-1.5-Instruct — full instruction-tuning mixture

And unlike most "open" releases, everything ships alongside them: encoder weights, training code, configs, and full training logs. Reproducible end to end.

Evaluation Results

LLaVA-OneVision-2 Benchmark Comparison

Codec-aligned sampling compared with uniform frame sampling across video benchmarks

Method

Codec-Style Patch Selection

Codec-Style Patch Selection: same 54-token budget, 3× more temporal range than uniform sampling

Standard video pipelines uniformly sample a handful of frames and process every patch — most of it static background. We borrow from HEVC: keep I-frames dense, keep only motion- and residual-rich patches from P-frames. Same 54-token budget, 18 frames instead of 6 — 3× the temporal range, no extra LLM context, no input-type adapters.

One Encoder, Every Modality

Multi-modal vision input: image, uniform frames, or codec-aligned tokens all feed the same OneVision-Encoder with shared (t, h, w) positions

Most multimodal stacks ship a different tokenizer per input type — one path for images, another for video, a third for multi-image. We don't. Image, uniform frames, and codec-aligned tokens all flow into the same OneVision-Encoder under a shared (t, h, w) position scheme. No task-specific tokenizers, no per-modality routing.

Four-Stage Training Curriculum

We train LLaVA-OneVision-2 in four compact stages:

  1. Bootstrap video ability from LLaVA-OneVision-1.5 with short 30s video captions.
  2. Instruction tune with large multimodal instruction data and 30–180s video captions.
  3. Extend to long videos with 10–15 min captions and public video instruction data.
  4. Refine codec, spatial, and tracking skills with denser long-video sampling, point tracking, and 4M spatial samples.

The curriculum mixes LLaVA-OneVision-1.5 data, FineVision, and new in-house video caption/spatial datasets; we do not synthesize any video instruction data.

Models

Model HF Link Training Log
LLaVA-OneVision-2-8B-Instruct HF / 8B-Instruct Coming soon
LLaVA-OneVision-2-4B-Instruct Coming soon Coming soon
LLaVA-OneVision-1.5-4B-Instruct HF / 4B-Instruct TensorBoard
LLaVA-OneVision-1.5-8B-Instruct HF / 8B-Instruct TensorBoard
OneVision-Encoder HF / OneVision-Encoder Coming soon
OneVision-Encoder-Lang HF / OneVision-Encoder-Lang Coming soon

Datasets

Description Link Status
LLaVA-OneVision-2-VideoCaption HF / VideoCaption Available
LLaVA-OneVision-2-Spatial HF / Spatial Available
LLaVA-OneVision-1.5-Mid-Training-85M HF / Mid-Training 85M Available
LLaVA-OneVision-1.5-Instruct HF / Instruct-Data Available

LLaVA-OneVision Data Distribution

Quick Start (4B, single node)

🚨 Strongly recommended: don't read the rest of this section. Just hand the prompt below to any code agent (Claude Code, opencode, Cursor, Codex, etc.) from a working directory on your 8×GPU node, and it will reproduce the full SFT run end-to-end — clone, download the 340 GB bundle, load the docker image, and launch detached training — without you reading a single step.

Follow these instructions to start training: https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2#quick-start-4b-single-node

Prefer to do it by hand? Keep reading.

Audience: code agent or human reproducing LLaVA-OneVision-2-4B SFT on a single 8×GPU node. Target script: examples/llava_onevision2/quick_start_4b/quick_start.sh Bundle: lmms-lab-encoder/ov2_quickstart on Hugging Face — contains the prebuilt docker image, the Megatron tp1pp1 checkpoint, the HF tokenizer/config, and the packed Energon dataset (~340 GB total, mostly the packed dataset).

🤖 Agent fast path — one block, idempotent (click to expand)

For code agents (Claude Code / opencode / etc.): copy-paste this entire block from any working directory and run it. It self-bootstraps — clones the repo if missing, downloads the 340 GB bundle if missing, loads the docker image if missing, then launches training. Every step is guarded by an existence check, so re-runs are safe. Human readers should follow the numbered sections (§0–§5) below instead.

# Run from any working directory. Requires: 8 GPUs, Docker + NVIDIA Container Toolkit,
# git, huggingface-cli (only used if the bundle is missing).
set -euo pipefail

REPO_URL="https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2.git"
REPO_DIR="LLaVA-OneVision-2"
BUNDLE_DIR="./ov2_quickstart"
IMAGE="llava_megatron:26.05"
CONTAINER_NAME="ov2_quickstart_4b"

# 0) Repo — clone if the target script isn't reachable from $(pwd).
#    If you're already inside the repo, this is a no-op and we stay put.
if [ -f "./examples/llava_onevision2/quick_start_4b/quick_start.sh" ]; then
    echo "[skip] already inside repo root at $(pwd)"
elif [ -f "./${REPO_DIR}/examples/llava_onevision2/quick_start_4b/quick_start.sh" ]; then
    echo "[skip] repo already cloned at ./${REPO_DIR}"
    cd "${REPO_DIR}"
else
    git clone --depth 1 "${REPO_URL}" "${REPO_DIR}"
    cd "${REPO_DIR}"
fi
OUTPUT_DIR="$(pwd)/output/quick_start_4b"

# 1) Bundle (~340 GB) — skip if the three required subdirs already exist.
if [ -d "${BUNDLE_DIR}/packed_mixed_sft_cap_v30s/node_a/webdataset" ] \
    && [ -d "${BUNDLE_DIR}/ov_encoder_p14m22_qwen3_mcore_tp1pp1/release/mp_rank_00" ] \
    && [ -d "${BUNDLE_DIR}/ov_encoder_p14m22_qwen3_hf" ]; then
    echo "[skip] bundle already present at ${BUNDLE_DIR}"
else
    huggingface-cli download --repo-type dataset --resume-download \
        --local-dir "${BUNDLE_DIR}" \
        lmms-lab-encoder/ov2_quickstart
fi

# 2) Docker image — skip if already loaded.
if [ -n "$(docker images -q "${IMAGE}" 2>/dev/null)" ]; then
    echo "[skip] docker image ${IMAGE} already loaded"
elif [ -f "${BUNDLE_DIR}/llava_megatron.26.05.tar" ]; then
    docker load -i "${BUNDLE_DIR}/llava_megatron.26.05.tar"
else
    echo "ERROR: ${IMAGE} not loaded and tarball missing at ${BUNDLE_DIR}/llava_megatron.26.05.tar" >&2
    exit 1
fi

# 3) Clean up any prior container with the same name, then launch detached.
docker rm -f "${CONTAINER_NAME}" 2>/dev/null || true
mkdir -p "${OUTPUT_DIR}"
docker run -d \
    --gpus all \
    --ipc host --net host --privileged --cap-add IPC_LOCK \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v "$(pwd)":/workspace/LLaVA-OneVision-2 \
    -e OUTPUT_DIR=/workspace/LLaVA-OneVision-2/output/quick_start_4b \
    -w /workspace/LLaVA-OneVision-2 \
    --name "${CONTAINER_NAME}" \
    "${IMAGE}" \
    bash -lc "bash examples/llava_onevision2/quick_start_4b/quick_start.sh"

echo "Training launched. Follow with: docker logs -f ${CONTAINER_NAME}"

After this block returns, training is running detached. Tail progress with docker logs -f ov2_quickstart_4b. Checkpoints + tensorboard land under ./output/quick_start_4b/quick_start/.

The numbered sections below (§0 Prerequisites through §5 Run the quickstart training) are the human walkthrough — same steps, broken out with explanations and alternatives (Option B paths, interactive launch, hyperparameter overrides).

0. Prerequisites

8 × A800 (or equivalent), Docker with NVIDIA Container Toolkit.

1. Layout

Repo is mounted at /workspace/LLaVA-OneVision-2 inside the container. The HF bundle goes into <REPO_ROOT>/ov2_quickstart/. Launch all commands from the repo root — the training script uses relative paths.

Full directory tree
<REPO_ROOT>/
├── examples/llava_onevision2/quick_start_4b/quick_start.sh   # entry point
├── run_docker_local.sh                                       # docker wrapper
└── ov2_quickstart/                                           # ← from HF
    ├── llava_megatron.26.05.tar                              # 24 GB prebuilt image
    ├── ov_encoder_p14m22_qwen3_hf/                           # HF tokenizer + config (--hf-tokenizer-path)
    ├── ov_encoder_p14m22_qwen3_mcore_tp1pp1/                 # Megatron mcore checkpoint (--load)
    │   ├── latest_checkpointed_iteration.txt
    │   └── release/mp_rank_00/
    └── packed_mixed_sft_cap_v30s/                            # Energon packed dataset (~308 GB)
        ├── dataset.yaml                                      # references node_{a..d}/webdataset
        └── node_{a,b,c,d}/webdataset/

2. Get the dataset + checkpoint

The dataset (packed_mixed_sft_cap_v30s/, ~308 GB) must come from the HF bundle — it's pre-packed Energon shards specific to this recipe. The checkpoint has two paths: download the ready-to-train Megatron tp1pp1 (recommended) or build it yourself from the standalone ViT + LLM.

Option A — Download the full bundle (recommended)

cd <REPO_ROOT>
huggingface-cli download --repo-type dataset --resume-download \
    --local-dir ./ov2_quickstart \
    lmms-lab-encoder/ov2_quickstart

~340 GB. Resumable.

Option B — Merge the checkpoint yourself

Use when you want to swap the ViT or LLM, or reproduce the encoder pipeline from scratch. Still requires Option A for the dataset.

You'll need the standalone components from HF:

Inside the container (set up in step 3), run HF merge → Megatron conversion in two stages:

# Stage 1: merge ViT + LLM + processor → unified HF checkpoint
PYTHONPATH=transformers_impl:. python -m merge_ov2 merge \
    --variant dense \
    --vit lmms-lab-encoder/onevision-encoder-large-lang-tf57 \
    --llm Qwen/Qwen3-4B-Instruct-2507 \
    --processor lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct \
    --out ./ov2_quickstart/ov_encoder_p14m22_qwen3_hf \
    --target-dtype bf16 \
    --vit-validator-strategy layerwise

# Stage 2: HF → Megatron-Core (TP=1, PP=1)
bash examples/llava_onevision2/convert/convert_4b_p14m2_hf_to_mcore.sh \
    ./ov2_quickstart/ov_encoder_p14m22_qwen3_hf \
    ./ov2_quickstart/ov_encoder_p14m22_qwen3_mcore_tp1pp1 \
    1 1

After Stage 2 succeeds, ./ov2_quickstart/ov_encoder_p14m22_qwen3_mcore_tp1pp1/ will match the layout that Option A would have downloaded (a release/mp_rank_00/ directory plus latest_checkpointed_iteration.txt).

Full merge reference: .opencode/skills/merge-ov2/SKILL.md (variant matrix, validators, common failure modes). For other TP/PP layouts (e.g. PP=4 with the ViT on stage 0), pass 2 4 0,12,12,12 to the conversion script.

You still need to download the dataset from Option A:

huggingface-cli download --repo-type dataset --resume-download \
    --local-dir ./ov2_quickstart --local-dir-use-symlinks False \
    --include 'packed_mixed_sft_cap_v30s/*' \
    lmms-lab-encoder/ov2_quickstart

3. Get the docker image

Option A — Load the prebuilt image (recommended)

docker load -i ov2_quickstart/llava_megatron.26.05.tar
docker images | grep llava_megatron
# llava_megatron   26.05   <id>   ...   ~30GB

~1 minute. Skips the long base-image pull + dependency install.

Option B — Build from source

Use when the prebuilt tar is unavailable, or you've modified Dockerfile / requirements.txt.
cd <REPO_ROOT>
docker build -t llava_megatron:26.05 .

~30 min on a warm pip cache. Base image is nvcr.io/nvidia/pytorch:25.04-py3; Python deps come from requirements.txt.

4. Launch the container

Use the repo's wrapper. It mounts the host repo at /workspace/LLaVA-OneVision-2, opens all GPUs, sets NCCL env vars for IB, and drops you into a bash shell.

cd <REPO_ROOT>
bash run_docker_local.sh

You should now be inside the container at:

root@<host>:/workspace/LLaVA-OneVision-2#
Agent-mode: detached launch (no interactive shell, runs training directly)

When you don't want an interactive shell — e.g. a code agent kicking off training and walking away — run the container detached and pass the training script as the entrypoint. Minimal, portable form (no site-specific NCCL tuning, no extra bind-mounts):

cd <REPO_ROOT>
mkdir -p ./output/quick_start_4b

docker run -d \
    --gpus all \
    --ipc host --net host --privileged --cap-add IPC_LOCK \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v "$(pwd)":/workspace/LLaVA-OneVision-2 \
    -e OUTPUT_DIR=/workspace/LLaVA-OneVision-2/output/quick_start_4b \
    -w /workspace/LLaVA-OneVision-2 \
    --name ov2_quickstart_4b \
    llava_megatron:26.05 \
    bash -lc "bash examples/llava_onevision2/quick_start_4b/quick_start.sh"

# Follow progress:
docker logs -f ov2_quickstart_4b

Notes:

  • --rm is intentionally omitted so the container survives a crash for postmortem (docker logs ov2_quickstart_4b).
  • The bundle lives under <REPO_ROOT>/ov2_quickstart/, which is already inside the mounted repo — no extra -v needed.
  • If your cluster needs IB / NCCL tuning, append -e NCCL_*=... flags; the defaults in run_docker_local.sh are site-specific and not required here.

5. Run the quickstart training

Inside the container, from /workspace/LLaVA-OneVision-2:

# Optional: pick an output dir on a disk with ≥ 100 GB free for checkpoints + tensorboard.
export OUTPUT_DIR=/workspace/LLaVA-OneVision-2/output/quick_start_4b

bash examples/llava_onevision2/quick_start_4b/quick_start.sh

That's it. Defaults: 8 GPUs, TP=1, PP=1, SEQ_LEN=10192, MBS=1, GBS=16, 1 epoch over 219,907 packed bins. Logs + checkpoints land under ${OUTPUT_DIR}/quick_start/.

What the script does, hyperparameter overrides, and env knobs

The script will:

  1. Set the two mandatory packing gates (OFFLINE_PACKING_BMR=1, OFFLINE_PACKED_DATA=1) — see .opencode/skills/offline-packing-env-vars/SKILL.md for why both are required.
  2. Compute NSTEP = ceil(219907 × EPOCHS / GBS) from the verified bin count of the four shards (54480 + 54854 + 54785 + 54788 = 219907).
  3. torchrun --nproc_per_node=8 --nnodes=1 against aiak_training_llm/train.py with the llava-onevision2-4b-p14m2 model.
  4. Stream stdout/stderr to ${OUTPUT_DIR}/quick_start/run_<timestamp>_tp1_pp1_seqlen10192_mbs1_gbs16_<NSTEP>steps.log (and to your terminal via tee).
  5. Save checkpoints to ${OUTPUT_DIR}/quick_start/ every 2000 iters; tensorboard events to ${OUTPUT_DIR}/quick_start/tensorboard/.

Positional args:

Arg Position Default Override example
TP $1 1 bash quick_start.sh 2
PP $2 1 bash quick_start.sh 1 2
SEQ_LEN $3 10192 bash quick_start.sh 1 1 8192
MBS $4 1 must stay 1 (packing gate requires it)
GBS $5 16 bash quick_start.sh 1 1 10192 1 32
EPOCHS $6 1 bash quick_start.sh 1 1 10192 1 16 2

Env knobs:

Paths (point at the bundle):

  • DATA_PATH./ov2_quickstart/packed_mixed_sft_cap_v30s/dataset.yaml (Energon Metadataset)
  • TOKENIZER_PATH./ov2_quickstart/ov_encoder_p14m22_qwen3_hf (HF tokenizer + config)
  • CHECKPOINT_PATH./ov2_quickstart/ov_encoder_p14m22_qwen3_mcore_tp1pp1 (Megatron tp1pp1 start)
  • OUTPUT_DIR./output/quick_start_4b (checkpoints + tensorboard + dataloader state)

Distributed (single-node defaults are fine):

  • GPUS_PER_NODE=8 — must equal visible GPUs
  • MASTER_ADDR=127.0.0.1, MASTER_PORT=26000

Optional logging:

  • WANDB_API_KEY — if set, also logs to W&B (WANDB_PROJECT / WANDB_NAME honored)

Contributors

Thanks so much to all of our amazing contributors!

Contributors ranked by commit count

Citation

If you find LLaVA-OneVision-2 useful in your research, please consider to cite the following related papers:

@inproceedings{LLaVA-OneVision-2,
  title={LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence},
  author={An, Xiang and Xie, Yin and Tang, Feilong and Yan, Yunyao and Tan, Huajie and Zhu, Didi and Chen, Changrui and Zhao, Xiuwei and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Zhang, Kaichen and Zhang, Wenkang and Cheng, Zheng and Zhang, Nansen and Wu, Chunsheng and Ge, Chunjiang and Ran, Zimin and Song, Dehua and Li, Chunyuan and Feng, Shikun and Hu, Ming and Chen, Zhangquan and Niu, Junbo and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
  booktitle={arXiv},
  year={2026}
}

@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arXiv},
  year={2025}
 }

@article{tang2026onevisionencoder,
  title={OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
  author={Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
  journal={arXiv preprint arXiv:2602.08683},
  year={2026}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}

Acknowledgement

We extend our sincere gratitude to the LoongForge team from Baidu AI Cloud for providing the exceptional AIAK-based training framework. The outstanding capabilities of AIAK-Training-LLM and AIAK-Megatron have significantly accelerated our training process with remarkable efficiency. We are especially grateful for their citation of LLaVA-OneVision-2 and deeply appreciate their recognition and support.

We acknowledge the support of Synvo AI for contributing to the partial data annotation in this work, and also thank the maintainers and contributors of the following open-source projects, whose work greatly inspired and supported our research:

  • LLaVA: Large Language-and-Vision Assistant — LLaVA
  • LLaVA-NeXT: Next-generation multi-modal assistant — LLaVA-NeXT
  • lmms-eval: A standardized evaluation framework for Large Multimodal Models — lmms-eval
  • Megatron-LM: Efficient, scalable training for large language models — Megatron-LM
  • Qwen2.5-VL: Strong vision-language foundation model — Qwen2.5-VL
  • InternVL: Open-source large-scale vision-language foundation model — InternVL
  • Qwen3: Next-generation Qwen LLM — Qwen
  • MetaCLIP: Scalable contrastive pretraining — MetaCLIP
  • FineVision: Open Data Is All You Need — FineVision