Feature Request: Add Video Modality Support (Qwen2.5-VL) via llama-mtmd-cli #17660

@deepshnv

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Add native video modality support to llama.cpp's multimodal pipeline (mtmd). Currently, llama.cpp supports vision primarily for static images (e.g., LLaVA, Qwen2-VL) via CLIP-based projectors. However, it lacks a built-in mechanism to process video inputs, a capability that is becoming standard in state-of-the-art Vision Language Models.

The recently released Qwen2.5-VL model is explicitly architected for video understanding. As noted in its technical report, it uses dynamic-resolution processing and absolute time encoding to handle videos of varying durations, from seconds to hours, enabling precise event localization. Python ecosystems (Hugging Face Transformers, vLLM, etc.) handle this by accepting a video path and sampling parameters, then injecting special tokens (e.g., <|vision_start|><|video_pad|><|vision_end|>) that wrap the temporal sequence.
Furthermore, this architecture serves as the foundation for Nvidia's Cosmos Reason1, a 7B-parameter model fine-tuned for physical AI, robotics, and chain-of-thought reasoning.

This feature request aims to replicate this video modality pipeline in llama.cpp, allowing users to pass video files directly to the inference engine, enabling tasks like video captioning and temporal reasoning on edge devices.

Motivation

With the release of high-performance open-weights models like Qwen2.5-VL, video understanding is moving from specialized proprietary APIs to local inference. Currently, Python-based frameworks (like vLLM and Transformers) support these video modalities out of the box.

Adding this support to llama.cpp is crucial for:

  1. Feature Parity: Bridging the gap between C++ inference and Python research codebases (vLLM/Transformers).
  2. Edge Application: Enabling local, privacy-preserving video analysis on consumer hardware (e.g., CPU, Apple Silicon, consumer GPUs) without heavy Python dependencies.
  3. Specialized AI Support: Explicitly supporting Nvidia Cosmos Reason1, enabling developers to leverage its "Physical World" reasoning capabilities locally. This allows for embodied agent applications where the model reasons about physics and time from video input without relying on cloud APIs.

Possible Implementation

I have a working implementation locally that aligns with the Qwen2.5-VL technical report and mirrors the logic found in vLLM's implementation.

The implementation integrates into tools/mtmd and clip.cpp with the following key components:

1. Frame Extraction (External FFMPEG)

As per the Coding Guidelines, we avoid adding large third-party dependencies to the codebase, so the implementation does not link against libavcodec/libavformat directly. Instead, it relies on the user having the ffmpeg CLI tool installed and available in their system PATH.

  • Logic: The system checks for ffmpeg, creates a temporary directory, extracts frames at a specified FPS (1.0 by default, user-configurable) via a command-line call, and loads the resulting frames as a batch; a sketch of this flow follows this list.
  • Benefit: Keeps the build system simple and dependency-free while leveraging the industry standard for media processing.
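
A minimal sketch of the extraction step, assuming the ffmpeg CLI is on PATH; the function name and frame-naming pattern are illustrative, not part of the actual mtmd API:

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Extract frames from `video_path` into `out_dir` at `fps` frames per second.
// Returns true on success; frames land as out_dir/frame_00001.png, ...
static bool extract_video_frames(const std::string & video_path,
                                 const std::string & out_dir,
                                 double fps) {
    char cmd[1024];
    // -vf fps=N resamples the stream to N frames per second; the zero-padded
    // pattern keeps the extracted frames in temporal order when sorted by name
    snprintf(cmd, sizeof(cmd),
             "ffmpeg -hide_banner -loglevel error -i \"%s\" -vf fps=%.3f \"%s/frame_%%05d.png\"",
             video_path.c_str(), fps, out_dir.c_str());
    return std::system(cmd) == 0;
}
```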

2. Visual Patch Extraction & Temporal Merging

The implementation handles the Qwen2.5-VL specific "super-frame" logic:

  • Temporal Patching: It merges consecutive frames (Temporal Patch Size = 2) into a single "super-frame" with 6 channels (2 frames × 3 RGB channels).
  • Conv3D Decomposition: In clip.cpp, the Conv3D operation (2×14×14) is mathematically decomposed into two separate Conv2D operations on the temporal slices, which are then summed (see the sketch below). This avoids implementing a full 3D convolution operator, which the ggml backend does not currently support, while maintaining mathematical equivalence.
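
A hedged sketch of this decomposition in ggml, assuming the Conv3D weight has been pre-split into its two temporal slices at load time (tensor and function names are illustrative):

```cpp
#include "ggml.h"

// patch_w0/patch_w1: the two 14x14 temporal slices of the 2x14x14 Conv3D weight
// frame0/frame1:     the two 3-channel frames that form one "super-frame"
static struct ggml_tensor * patch_embed_superframe(
        struct ggml_context * ctx,
        struct ggml_tensor  * patch_w0,
        struct ggml_tensor  * patch_w1,
        struct ggml_tensor  * frame0,
        struct ggml_tensor  * frame1) {
    // stride 14 in both dims, no padding, no dilation: one embedding per patch
    struct ggml_tensor * e0 = ggml_conv_2d(ctx, patch_w0, frame0, 14, 14, 0, 0, 1, 1);
    struct ggml_tensor * e1 = ggml_conv_2d(ctx, patch_w1, frame1, 14, 14, 0, 0, 1, 1);
    // summing the per-slice results reproduces the original 3D convolution
    // with temporal kernel size 2 and temporal stride 2 exactly
    return ggml_add(ctx, e0, e1);
}
```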

3. 3D M-RoPE & Absolute Time Encoding

Qwen2.5-VL uses a 3D Rotary Positional Embedding to encode time alongside spatial dimensions:

  • Structure: [text_pos, temporal_idx, height_pos, width_pos]
  • Temporal Index: Calculated as round(frame_idx * seconds_per_grid * tokens_per_second); a short sketch follows this list.
  • Fallback: The system detects if the input is a video frame or static image and gracefully degrades to standard 2D M-RoPE for regular images.
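
A minimal sketch of the position computation; the helper name and parameters mirror the formula above and are not an existing API:

```cpp
#include <cmath>
#include <cstdint>

// temporal index shared by all patches of one temporal grid of a video;
// for a static image frame_idx is always 0, collapsing to the 2D case
static int32_t video_temporal_index(int frame_idx, double seconds_per_grid, double tokens_per_second) {
    return (int32_t) std::llround(frame_idx * seconds_per_grid * tokens_per_second);
}
```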

4. Nvidia Cosmos Reason1 Specifics

To support Cosmos Reason1, I have added specific handling in arg.cpp and mtmd-cli.cpp:

  • Custom Chat Template: Added a cosmos template that handles the model's specific Chain-of-Thought requirements. It formats prompts with specific system instructions (<|im_start|>system...) and supports a "reasoning" mode that parses <think> blocks.
  • Marker Mapping: The implementation correctly maps the distinct vision markers used by the Cosmos finetunes (an example rendering follows this list):
    • Video: <|vision_start|><|video_pad|><|vision_end|>
    • Image: <|vision_start|><|image_pad|><|vision_end|>
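
For illustration, a video prompt rendered by this template might look roughly like the following; the system instructions are elided, and the surrounding structure is an assumption based on the ChatML-style markers described above, not taken from the model card:

```
<|im_start|>system
...Cosmos-specific system instructions...
<|im_end|>
<|im_start|>user
<|vision_start|><|video_pad|><|vision_end|>Describe the events in this clip.<|im_end|>
<|im_start|>assistant
<think>
```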

5. Pipeline Integration

  • Vision Transformer: In clip.cpp, new functions such as build_qwen25_vl() and clip_image_batch_encode_video() handle the "super-frame" batches, applying RMSNorm and Window Attention where appropriate.
  • Vision-Language Merger: The spatial merging (2×2 pooling) and projection adapter layers are applied to the output tokens before concatenation with the text prompt.
  • CLI Arguments: New arguments added to common for ease of use: --video <path>, --video-fps <N>, -vmaxf <N> (max frames), --ts-per-grid (seconds per temporal grid for 3D M-RoPE), --chat-template cosmos, etc. An example invocation follows below.
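
A hypothetical invocation combining the proposed flags; model and file paths, as well as the flag values, are placeholders:

```sh
llama-mtmd-cli -m cosmos-reason1-7b.gguf --mmproj mmproj-cosmos.gguf \
    --video clip.mp4 --video-fps 1.0 -vmaxf 64 --ts-per-grid 2.0 \
    --chat-template cosmos \
    -p "Describe the events in this video in temporal order."
```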
