mtmd: Add DeepSeekOCR Support #17400
base: master
Conversation
init commit
mtmd: fix vision model processing
testing Vision model loading
mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)
…ut in deepseek2 model
debug: correct token order
Add native resolution support
- changes concern PR #4
mtmd: quick fix token order
# Conflicts: convert_hf_to_gguf.py, src/llama-model.cpp, src/models/deepseek2.cpp
…rol & all native resolution modes work
First DeepSeek-OCR working implementation
# Conflicts: convert_hf_to_gguf.py, tools/mtmd/clip.h, tools/mtmd/mtmd.cpp
common/arg.cpp (outdated)
```cpp
"- auto (default): automatically select resolution\n"
"- tiny, small, base, large: native resolution\n"
"- gundam, gundam-master: dynamic resolution",
```
IMO these modes can look quite confusing to end users.
I've already seen your logic that calculates the image area to automatically determine the best resolution, and it looks good enough.
So I think it would be better to remove the argument and make everything automatic.
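For illustration, a rough sketch of what such area-based selection might look like; the mode names come from the help text above, but the function name and thresholds here are hypothetical, not the values used in this PR:

```cpp
#include <cstdint>

// Hypothetical sketch of area-based mode selection (not the PR's code);
// the thresholds are illustrative only.
enum dsocr_mode { DSOCR_TINY, DSOCR_SMALL, DSOCR_BASE, DSOCR_LARGE, DSOCR_GUNDAM };

static dsocr_mode dsocr_select_mode(int nx, int ny) {
    const int64_t area = (int64_t) nx * ny;
    if (area <=  512 *  512) return DSOCR_TINY;
    if (area <=  640 *  640) return DSOCR_SMALL;
    if (area <= 1024 * 1024) return DSOCR_BASE;
    if (area <= 1280 * 1280) return DSOCR_LARGE;
    return DSOCR_GUNDAM; // very large images fall back to dynamic (cropped) resolution
}
```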
OK, I'll remove it later.
tools/mtmd/clip.cpp (outdated)
```cpp
    res_imgs->grid_y = 1;
}
else {
    GGML_ABORT("DeepSeek-OCR: Gundam/Gundam-Master haven't been tested yet.\n");
```
@ngxson I've encountered an issue with batching images. To handle images much larger than 1280x1280, DeepSeek-OCR crops them into 640x640 (Gundam) or 1024x1024 (Gundam-Master) sub-images as local views. However, the current framework doesn't support batching multiple images. Technically, it shouldn't be too difficult to add batch support, but I'm concerned about introducing new bugs that affect other models. Do you have any suggestions?
IIRC it should be the same logic as llava-uhd or minicpm-v, where the image is cropped into smaller sub-images.
Batching is not yet supported, but do all sub-images need to be in the same batch?
Otherwise, what we can do is extend clip_image_f32 to include a notion of "nz":
```cpp
struct clip_image_f32 {
    int nx;
    int ny;
    int nz; // can be > 1 for deepseek ocr
    std::vector<float> buf;
};
```
And a memory layout corresponding to what you need in the cgraph (to avoid another ggml_permute, for example).
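As a rough illustration of that layout idea (not code from this PR; it assumes the nz sub-images are stored back-to-back in buf, each as nx*ny*3 floats, and that a ggml_context `ctx` already exists):

```cpp
// Sketch only: with sub-images stored contiguously, the whole stack can
// be exposed to the cgraph as a single 4D tensor, avoiding an extra
// ggml_permute.
struct ggml_tensor * inp = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, nx, ny, 3, nz);
// inp->data can then be filled directly from clip_image_f32::buf.
```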
@sfallah Could you please mark this PR as ready for review and update the llama-mtmd-cli command for testing DeepSeek-OCR (because I removed the --dsocr-mode argument)? Also, I ran the CI locally and it failed on "27 - test-thread-safety"; I suspect this failure is unrelated to the changes made in this PR. Here is the log: ci.txt
Feature Request: #16676
Make sure to read the contributing guidelines before submitting a PR
GGUF Models
sabafallah/DeepSeek-OCR-GGUF
- deepseek-ocr-f32.gguf
- mmproj-deepseek-ocr-f32.gguf
Running the Model
Build llama.cpp (Mac)
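The build commands themselves were not preserved in this extract; a standard CMake build on macOS (where the Metal backend is enabled by default) would look like:

```sh
cmake -B build
cmake --build build --config Release -j
```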
Running llama-mtmd-cli
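Likewise, the exact command was not preserved; a typical llama-mtmd-cli invocation with the GGUF files listed above (file paths and prompt are illustrative) would be:

```sh
./build/bin/llama-mtmd-cli \
    -m deepseek-ocr-f32.gguf \
    --mmproj mmproj-deepseek-ocr-f32.gguf \
    --image page.png \
    -p "Convert the document to markdown."
```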