
Conversation


@sfallah sfallah commented Nov 20, 2025

Feature Request: #16676

Make sure to read the contributing guidelines before submitting a PR

GGUF Models

sabafallah/DeepSeek-OCR-GGUF

deepseek-ocr-f32.gguf

mmproj-deepseek-ocr-f32.gguf

Running the Model

Build llama.cpp (Mac)

cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --config Release

Running llama-mtmd-cli

build/bin/llama-mtmd-cli \
-m gguf_models/deepseek-ai/deepseek-ocr-f32.gguf \
--mmproj gguf_models/deepseek-ai/mmproj-deepseek-ocr-f32.gguf \
--image tmp/mtmd_test_data/Deepseek-OCR-2510.18234v1_page1.png \
-p "<|grounding|>Convert the document to markdown." \
--chat-template deepseek
@github-actions github-actions bot added model Model specific examples python python script changes labels Nov 20, 2025
@sfallah sfallah marked this pull request as draft November 20, 2025 09:12
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Dec 2, 2025
common/arg.cpp Outdated
Comment on lines 1835 to 1837
"- auto (default): automatically select resolution\n"
"- tiny, small, base, large: native resolution\n"
"- gundam, gundam-master: dynamic resolution",

@ngxson ngxson Dec 2, 2025


IMO these modes can look quite confusing for end-users.

I've already seen your logic where you calculate the area to automatically determine the best resolution; it looks good enough.

So I think it's better to remove the argument and make everything automatic.


OK, I'll remove it later.

res_imgs->grid_y = 1;
}
else {
GGML_ABORT("DeepSeek-OCR: Gundam/Gundam-Master haven't been tested yet.\n");

@bluebread bluebread Dec 3, 2025


@ngxson I've encountered an issue with batching images. In order to handle images much larger than 1280x1280, DeepSeek-OCR crops them into 640x640 (gundam) or 1024x1024 (gundam master) subimages as local views. However, the current framework doesn't support batching multiple images. Technically, it shouldn't be too difficult to add batch support, but I'm concerned about introducing new bugs and affecting other models. Do you have any suggestions?


@ngxson ngxson Dec 3, 2025


IIRC it should be the same logic as llava-uhd or minicpm-v, where the image is cropped into smaller sub-images.

Batching is not yet supported, but do all sub-images need to be in the same batch?

Otherwise, what we can do is extend clip_image_f32 to include a notion of "nz":

struct clip_image_f32 {
    int nx;
    int ny;
    int nz; // can be > 1 for deepseek ocr

    std::vector<float> buf;
};

And a memory layout corresponding to what you need in the cgraph (to avoid another ggml_permute, for example).

@bluebread

@sfallah Could you please mark this PR as ready for review and update the llama-mtmd-cli command for testing DeepSeek-OCR (since I removed the --dsocr-mode argument)? Also, I ran the CI locally and it failed on "27 - test-thread-safety". I suspect this failure is unrelated to the changes in this PR. Here is the log: ci.txt

@sfallah sfallah marked this pull request as ready for review December 3, 2025 16:53