Feature Request: Add Video Modality Support (Qwen2.5-VL) via llama-mtmd-cli #17660

@deepshnv

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Add native video modality support to llama.cpp's multimodal pipeline (mtmd). Currently, llama.cpp supports vision primarily for static images (e.g., LLaVA, Qwen2-VL) via CLIP-based projectors. However, it lacks a built-in mechanism to process video inputs, a capability that is becoming standard in state-of-the-art Vision Language Models.

The recently released Qwen2.5-VL model is explicitly architected for video understanding. As noted in its technical report, it uses dynamic-resolution processing and absolute time encoding to handle videos of varying durations, from seconds to hours, enabling precise event localization. Python ecosystems (Hugging Face Transformers, vLLM, etc.) handle this by accepting a video path and sampling parameters, then injecting special tokens (e.g., <|vision_start|><|video_pad|><|vision_end|>) that wrap the temporal sequence.
Furthermore, this architecture serves as the foundation for Nvidia's Cosmos Reason1, a 7B-parameter model fine-tuned for physical AI, robotics, and chain-of-thought reasoning.

This feature request aims to replicate this video modality pipeline in llama.cpp, allowing users to pass video files directly to the inference engine, enabling tasks like video captioning and temporal reasoning on edge devices.

Motivation

With the release of high-performance open-weights models like Qwen2.5-VL, video understanding is moving from specialized proprietary APIs to local inference. Currently, Python-based frameworks (like vLLM and Transformers) support these video modalities out of the box.

Adding this support to llama.cpp is crucial for:

  1. Feature Parity: Bridging the gap between C++ inference and Python research codebases (vLLM/Transformers).
  2. Edge Application: Enabling local, privacy-preserving video analysis on consumer hardware (e.g., CPU, Apple Silicon, consumer GPUs) without heavy Python dependencies.
  3. Specialized AI Support: Explicitly supporting Nvidia Cosmos Reason1, enabling developers to leverage its "Physical World" reasoning capabilities locally. This allows for embodied agent applications where the model reasons about physics and time from video input without relying on cloud APIs.

Possible Implementation

I have a working implementation locally that aligns with the Qwen2.5-VL technical report and mirrors the logic found in vLLM's implementation.

The implementation integrates into tools/mtmd and clip.cpp with the following key components:

1. Frame Extraction (External FFMPEG)

As per the Coding Guidelines, we avoid adding large third-party dependencies to the codebase, so the implementation does not link against libavcodec/libavformat directly. Instead, it relies on the user having the ffmpeg CLI tool installed and available in their system PATH.

  • Logic: The system checks for ffmpeg, creates a temporary directory, extracts frames at a specified FPS (1.0 by default, user-configurable) via a command-line call, and loads the resulting frames as a batch; a sketch of this flow follows this list.
  • Benefit: Keeps the build system simple and dependency-free while leveraging the industry standard for media processing.
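
A minimal sketch of the extraction step, assuming the ffmpeg CLI is on PATH; the function name and frame-naming pattern are illustrative, not part of the actual mtmd API:

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Extract frames from `video_path` into `out_dir` at `fps` frames per second.
// Returns true on success; frames land as out_dir/frame_00001.png, ...
static bool extract_video_frames(const std::string & video_path,
                                 const std::string & out_dir,
                                 double fps) {
    char cmd[1024];
    // -vf fps=N resamples the stream to N frames per second; the zero-padded
    // pattern keeps the extracted frames in temporal order when sorted by name
    snprintf(cmd, sizeof(cmd),
             "ffmpeg -hide_banner -loglevel error -i \"%s\" -vf fps=%.3f \"%s/frame_%%05d.png\"",
             video_path.c_str(), fps, out_dir.c_str());
    return std::system(cmd) == 0;
}
```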

2. Visual Patch Extraction & Temporal Merging

The implementation handles the Qwen2.5-VL specific "super-frame" logic:

  • Temporal Patching: It merges consecutive frames (Temporal Patch Size = 2) into a single "super-frame" with 6 channels (2 frames × 3 RGB channels).
  • Conv3D Decomposition: In clip.cpp, the Conv3D operation (2×14×14) is mathematically decomposed into two separate Conv2D operations on the temporal slices, which are then summed (see the sketch below). This avoids implementing a full 3D convolution operator, which the ggml backend does not currently support, while maintaining mathematical equivalence.
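
A hedged sketch of this decomposition in ggml, assuming the Conv3D weight has been pre-split into its two temporal slices at load time (tensor and function names are illustrative):

```cpp
#include "ggml.h"

// patch_w0/patch_w1: the two 14x14 temporal slices of the 2x14x14 Conv3D weight
// frame0/frame1:     the two 3-channel frames that form one "super-frame"
static struct ggml_tensor * patch_embed_superframe(
        struct ggml_context * ctx,
        struct ggml_tensor  * patch_w0,
        struct ggml_tensor  * patch_w1,
        struct ggml_tensor  * frame0,
        struct ggml_tensor  * frame1) {
    // stride 14 in both dims, no padding, no dilation: one embedding per patch
    struct ggml_tensor * e0 = ggml_conv_2d(ctx, patch_w0, frame0, 14, 14, 0, 0, 1, 1);
    struct ggml_tensor * e1 = ggml_conv_2d(ctx, patch_w1, frame1, 14, 14, 0, 0, 1, 1);
    // summing the per-slice results reproduces the original 3D convolution
    // with temporal kernel size 2 and temporal stride 2 exactly
    return ggml_add(ctx, e0, e1);
}
```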

3. 3D M-RoPE & Absolute Time Encoding

Qwen2.5-VL uses a 3D Rotary Positional Embedding to encode time alongside spatial dimensions:

  • Structure: [text_pos, temporal_idx, height_pos, width_pos]
  • Temporal Index: Calculated as round(frame_idx * seconds_per_grid * tokens_per_second); a short sketch follows this list.
  • Fallback: The system detects if the input is a video frame or static image and gracefully degrades to standard 2D M-RoPE for regular images.
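
A minimal sketch of the position computation; the helper name and parameters mirror the formula above and are not an existing API:

```cpp
#include <cmath>
#include <cstdint>

// temporal index shared by all patches of one temporal grid of a video;
// for a static image frame_idx is always 0, collapsing to the 2D case
static int32_t video_temporal_index(int frame_idx, double seconds_per_grid, double tokens_per_second) {
    return (int32_t) std::llround(frame_idx * seconds_per_grid * tokens_per_second);
}
```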

4. Nvidia Cosmos Reason1 Specifics

To support Cosmos Reason1, I have added specific handling in arg.cpp and mtmd-cli.cpp:

  • Custom Chat Template: Added a cosmos template that handles the model's specific Chain-of-Thought requirements. It formats prompts with specific system instructions (<|im_start|>system...) and supports a "reasoning" mode that parses <think> blocks.
  • Marker Mapping: The implementation correctly maps the distinct vision markers used by the Cosmos finetunes (an example rendering follows this list):
    • Video: <|vision_start|><|video_pad|><|vision_end|>
    • Image: <|vision_start|><|image_pad|><|vision_end|>
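
For illustration, a video prompt rendered by this template might look roughly like the following; the system instructions are elided, and the surrounding structure is an assumption based on the ChatML-style markers described above, not taken from the model card:

```
<|im_start|>system
...Cosmos-specific system instructions...
<|im_end|>
<|im_start|>user
<|vision_start|><|video_pad|><|vision_end|>Describe the events in this clip.<|im_end|>
<|im_start|>assistant
<think>
```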

5. Pipeline Integration

  • Vision Transformer: In clip.cpp, new functions such as build_qwen25_vl() and clip_image_batch_encode_video() handle the "super-frame" batches, applying RMSNorm and Window Attention where appropriate.
  • Vision-Language Merger: The spatial merging (2×2 pooling) and projection adapter layers are applied to the output tokens before concatenation with the text prompt.
  • CLI Arguments: New arguments added to common for ease of use: --video <path>, --video-fps <N>, -vmaxf <N> (max frames), --ts-per-grid (seconds per temporal grid for 3D M-RoPE), --chat-template cosmos, etc. An example invocation follows below.
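
A hypothetical invocation combining the proposed flags; model and file paths, as well as the flag values, are placeholders:

```sh
llama-mtmd-cli -m cosmos-reason1-7b.gguf --mmproj mmproj-cosmos.gguf \
    --video clip.mp4 --video-fps 1.0 -vmaxf 64 --ts-per-grid 2.0 \
    --chat-template cosmos \
    -p "Describe the events in this video in temporal order."
```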
