57 commits
43a130b
mtmd: llama.cpp DeepSeekOCR support
sfallah Nov 14, 2025
b6b9f02
loading sam tensors
sfallah Nov 14, 2025
85c7cda
mtmd: fix vision model processing
bluebread Nov 15, 2025
578c8d7
Merge pull request #1 from bluebread/sf/deepseek-ocr
sfallah Nov 15, 2025
2aab52e
deepseek-ocr clip-vit model impl
sfallah Nov 15, 2025
eab28ed
mtmd: add DeepSeek-OCR LM support with standard attention
bluebread Nov 15, 2025
7630587
mtmd: successfully runs DeepSeek-OCR LM in llama-cli
bluebread Nov 16, 2025
2de3436
mtmd: Fix RoPE type for DeepSeek-OCR LM.
bluebread Nov 17, 2025
e8b2610
Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…
bluebread Nov 17, 2025
97e0907
loading LM
sfallah Nov 17, 2025
13dc6fb
Merge branch 'sf/deepseek-ocr' into sf/deepseek-ocr
sfallah Nov 17, 2025
b32bb5e
Merge pull request #2 from bluebread/sf/deepseek-ocr
sfallah Nov 17, 2025
790bbb9
sam warmup working
sfallah Nov 17, 2025
cec9a5c
sam erroneous return corrected
sfallah Nov 17, 2025
8b3d319
clip-vit: corrected cls_embd concat
sfallah Nov 17, 2025
1e08157
clip-vit: model convert qkv_proj split
sfallah Nov 17, 2025
331cea8
corrected combining of image encoders' results
sfallah Nov 18, 2025
6c0715b
fix: update callback for ffn_moe_weighted and add callback for attn_o…
bluebread Nov 18, 2025
a65ddf5
Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…
bluebread Nov 18, 2025
63a042f
concat image_newline and image_seperator tokens
sfallah Nov 18, 2025
89afda8
visual_model warmup (technically) works
sfallah Nov 18, 2025
88032f4
window partitioning using standard ggml ops
sfallah Nov 20, 2025
1268dc3
Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…
bluebread Nov 20, 2025
68b206b
sam implementation without using CPU only ops
sfallah Nov 21, 2025
8bce66d
clip: fixed warnings
bluebread Nov 21, 2025
5e6cf3c
Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…
bluebread Nov 21, 2025
7e9fbec
mtmd: fix get_rel_pos
bluebread Nov 21, 2025
0f5587d
Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…
bluebread Nov 21, 2025
7b8d735
mtmd: fixed the wrong scaler for get_rel_pos
bluebread Nov 21, 2025
86f111f
image encoding technically works but the output can't be checked sing…
sfallah Nov 21, 2025
effe669
mtmd: minor changed
bluebread Nov 22, 2025
f8f66a1
Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…
bluebread Nov 22, 2025
3fcfc3a
Merge pull request #3 from bluebread/sf/deepseek-ocr
sfallah Nov 22, 2025
ee8a148
mtmd: add native resolution support
bluebread Nov 22, 2025
4cfa15f
- image encoding debugged
sfallah Nov 22, 2025
3f71188
mtmd: correct token order
bluebread Nov 23, 2025
a594990
Merge pull request #5 from bluebread/dsocr-debug
sfallah Nov 23, 2025
6dfda99
Merge branch 'sf/deepseek-ocr' into sf/deepseek-ocr
sfallah Nov 23, 2025
7941f5d
Merge pull request #4 from bluebread/sf/deepseek-ocr
sfallah Nov 23, 2025
206f8ab
- dynamic resizing
sfallah Nov 23, 2025
40e7e6e
mtmd: quick fix token order
bluebread Nov 24, 2025
81533e4
mtmd: fix danling pointer
bluebread Nov 24, 2025
8810940
Merge pull request #6 from bluebread/sf/deepseek-ocr
sfallah Nov 24, 2025
a488b49
mtmd: SAM numerically works
bluebread Nov 29, 2025
ccb2f23
mtmd: debug CLIP-L (vit_pre_ln)
bluebread Nov 29, 2025
841a4a8
mtmd: debug CLIP-L & first working DeepSeek-OCR model
bluebread Nov 29, 2025
ed3b7f1
Merge remote-tracking branch 'sfallah/master' into sf/deepseek-ocr
sfallah Nov 30, 2025
5543094
Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…
bluebread Nov 30, 2025
c5f4c64
mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution cont…
bluebread Nov 30, 2025
95239f9
mtmd: simplify SAM patch embedding
bluebread Dec 1, 2025
6b0e7cd
Merge pull request #7 from bluebread/sf/deepseek-ocr
sfallah Dec 2, 2025
6634166
Merge branch 'master' into sf/deepseek-ocr
sfallah Dec 2, 2025
c914e05
mtmd: adapt Pillow image resizing function
bluebread Dec 3, 2025
e20857b
mtmd: simplify DeepSeek-OCR dynamic resolution preprocessing
bluebread Dec 3, 2025
43dfc0c
Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…
bluebread Dec 3, 2025
b696c54
mtmd: remove --dsocr-mode argument
bluebread Dec 3, 2025
b26b507
mtmd: refactor code & remove unused helper functions
bluebread Dec 3, 2025
15 changes: 15 additions & 0 deletions common/arg.cpp
@@ -1829,6 +1829,21 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.image_max_tokens = value;
}
).set_examples(mmproj_examples).set_env("LLAMA_ARG_IMAGE_MAX_TOKENS"));
add_opt(common_arg(
{"--dsocr-mode"}, "MODE",
"DeepSeek-OCR resolution mode, one of:\n"
"- auto (default): automatically select resolution\n"
"- tiny, small, base, large: native resolution\n"
"- gundam, gundam-master: dynamic resolution",
@ngxson (Collaborator) commented on Dec 2, 2025:

IMO these modes can look quite confusing for end-users.

I've already seen your logic where you calculate the area to automatically determine the best resolution; it looks good enough.

So I think it's better to remove the argument and make everything automatic.

ok. I'll remove it later.

[](common_params & params, const std::string & value) {
if (value == "auto" || value == "tiny" || value == "small" || value == "base" ||
value == "large" || value == "gundam" || value == "gundam-master") {
params.dsocr_mode = value;
} else {
throw std::invalid_argument("invalid value");
}
}
).set_examples(mmproj_examples).set_env("LLAMA_ARG_DSOCR_MODE"));
if (llama_supports_rpc()) {
add_opt(common_arg(
{"--rpc"}, "SERVERS",
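For context on the review thread above: with the argument removed, the resolution mode is picked automatically from the input image size. Below is a minimal illustrative Python sketch of that idea — the mode names follow the help text, but the base resolutions, cutoffs, and function name are assumptions for illustration, not the values or code used in this PR.

# Illustrative sketch only: pick a DeepSeek-OCR resolution mode from the image area.
# The per-mode base sizes and cutoffs below are assumed, not taken from the PR.

MODE_BASE_SIZE = {
    "tiny": 512,
    "small": 640,
    "base": 1024,
    "large": 1280,
}

def select_dsocr_mode(width: int, height: int) -> str:
    """Pick the smallest native mode whose base resolution covers the image;
    fall back to dynamic ("gundam") tiling for very large inputs."""
    area = width * height
    for mode, base in MODE_BASE_SIZE.items():
        if area <= base * base:
            return mode
    return "gundam"  # dynamic resolution: split the image into crops plus a global view

print(select_dsocr_mode(640, 480))    # -> "small" under these assumed cutoffs
print(select_dsocr_mode(3000, 2000))  # -> "gundam"
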
1 change: 1 addition & 0 deletions common/common.h
@@ -432,6 +432,7 @@ struct common_params {
std::vector<std::string> image; // path to image file(s)
int image_min_tokens = -1;
int image_max_tokens = -1;
std::string dsocr_mode = "auto"; // DeepSeek-OCR resolution mode: auto, tiny, small, base, large, gundam, gundam-master

// finetune
struct lr_opt lr;
147 changes: 136 additions & 11 deletions convert_hf_to_gguf.py
@@ -697,6 +697,9 @@ def load_hparams(dir_model: Path, is_mistral_format: bool):
if "thinker_config" in config:
# rename for Qwen2.5-Omni
config["text_config"] = config["thinker_config"]["text_config"]
if "language_config" in config:
# rename for DeepSeekOCR
config["text_config"] = config["language_config"]
return config

@classmethod
@@ -1531,7 +1534,7 @@ class MmprojModel(ModelBase):
preprocessor_config: dict[str, Any]
global_config: dict[str, Any]

n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth"]
n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "layers"]

has_vision_encoder: bool = True # by default
has_audio_encoder: bool = False
@@ -1577,6 +1580,14 @@ def __init__(self, *args, **kwargs):
# TODO @ngxson : this is a hack to support both vision and audio encoders
have_multiple_encoders = self.has_audio_encoder and self.has_vision_encoder
self.block_count = 128 if have_multiple_encoders else self.find_hparam(self.n_block_keys, True)
# FIXME: DeepseekOCRVisionModel specific hack
if self.block_count is None:
if isinstance(self, DeepseekOCRVisionModel):
clip_block_count = self.hparams['layers']
if clip_block_count is not None:
self.block_count = clip_block_count
if self.block_count is None:
raise KeyError(f"could not find block count using any of: {self.n_block_keys}")
self.tensor_map = gguf.get_tensor_name_map(gguf.MODEL_ARCH.MMPROJ, self.block_count)

# load preprocessor config
@@ -5990,6 +6001,99 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter

return [] # skip other tensors

@ModelBase.register("DeepseekOCRForCausalLM")
class DeepseekOCRVisionModel(MmprojModel):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)

proc_fname = self.dir_model / "processor_config.json"

if proc_fname.is_file():
with open(proc_fname, "r") as f:
self.preprocessor_config = json.load(f)


def set_gguf_parameters(self):
super().set_gguf_parameters()
hparams = self.hparams
self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.DEEPSEEKOCR)
# default values below are taken from HF transformers code
self.gguf_writer.add_vision_attention_layernorm_eps(hparams.get("layer_norm_eps", 1e-6))
self.gguf_writer.add_vision_use_gelu(True)
# calculate proj_scale_factor (used by tinygemma3 test model)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
n_per_side = int(image_seq_length ** 0.5)
image_size = self.hparams["image_size"]
patch_size = self.hparams["patch_size"]
proj_scale_factor = (image_size // patch_size) // n_per_side
if proj_scale_factor > 0 and proj_scale_factor != 4:
# we only need to write this if it's not the default value
# in this case, we are converting a test model
self.gguf_writer.add_vision_projector_scale_factor(proj_scale_factor)

# SAM configuration
sam_hparams = hparams['sam']
self.gguf_writer.add_vision_sam_layers_count(sam_hparams['layers'])
self.gguf_writer.add_vision_sam_embedding_length(sam_hparams['width'])

def get_vision_config(self) -> dict[str, Any]:
vision_config: dict[str, Any] | None = self.global_config.get("vision_config")

if not vision_config:
raise ValueError("DeepseekOCR model requires 'vision_config' in the model configuration, but it was not found")

vision_config['sam'] = vision_config['width']['sam_vit_b']
vision_config.update(vision_config['width']['clip-l-14-224'])
vision_config['hidden_size'] = vision_config['width']
vision_config['num_heads'] = vision_config['heads']
vision_config['intermediate_size'] = vision_config['heads'] * 4

return vision_config
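
To make the remapping above easier to follow, here is a standalone toy example of the nesting it expects — the numeric values are placeholders rather than the real DeepSeek-OCR configuration; only the structure (a "width" dict keyed by encoder name) mirrors the code:

# Toy stand-in for the HF "vision_config"; numbers are placeholders.
vision_config = {
    "image_size": 224,
    "patch_size": 14,
    "width": {
        "sam_vit_b":     {"layers": 12, "width": 768},
        "clip-l-14-224": {"layers": 24, "width": 1024, "heads": 16},
    },
}

vision_config["sam"] = vision_config["width"]["sam_vit_b"]      # keep the SAM sub-config
vision_config.update(vision_config["width"]["clip-l-14-224"])   # hoist CLIP-L keys to the top level
vision_config["hidden_size"] = vision_config["width"]           # now the CLIP-L width (an int)
vision_config["num_heads"] = vision_config["heads"]
vision_config["intermediate_size"] = vision_config["heads"] * 4 # as written in the converter

print(vision_config["hidden_size"], vision_config["num_heads"]) # 1024 16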


def tensor_force_quant(self, name, new_name, bid, n_dims):
# TODO: increase numerical stability. Maybe delete later.
return gguf.GGMLQuantizationType.F32
# related to https://github.com/ggml-org/llama.cpp/issues/13025
# if "input_projection" in name:
# return gguf.GGMLQuantizationType.F16
# if ".embeddings." in name:
# return gguf.GGMLQuantizationType.F32
# return super().tensor_force_quant(name, new_name, bid, n_dims)

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# Only process vision-related tensors, skip language model tensors
# Vision components: sam_model, vision_model, projector, image_newline, view_seperator
# Language model components to skip: lm_head, embed_tokens, layers, norm
if name.startswith(("lm_head.", "model.embed_tokens.", "model.layers.", "model.norm.")):
return []

if ".attn.rel_pos_h" in name or ".attn.rel_pos_w" in name:
return [(self.map_tensor_name(name, try_suffixes=("",)), data_torch)]

if name.startswith("model.vision_model.transformer.layers."):
# process visual tensors
# split QKV tensors if needed
if ".qkv_proj." in name:
if data_torch.ndim == 2: # weight
c3, _ = data_torch.shape
else: # bias
c3 = data_torch.shape[0]
assert c3 % 3 == 0
c = c3 // 3
wq = data_torch[:c]
wk = data_torch[c: c * 2]
wv = data_torch[c * 2:]
return [
(self.map_tensor_name(name.replace("qkv", "q")), wq),
(self.map_tensor_name(name.replace("qkv", "k")), wk),
(self.map_tensor_name(name.replace("qkv", "v")), wv),
]
else:
return [(self.map_tensor_name(name), data_torch)]

return [(self.map_tensor_name(name), data_torch)]
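
Two numeric details of DeepseekOCRVisionModel above can be sanity-checked in isolation. The snippet below mirrors the proj_scale_factor arithmetic and the fused-QKV split from modify_tensors; the hyperparameter values and tensor shapes are illustrative assumptions, not the real DeepSeek-OCR ones.

import torch

# proj_scale_factor arithmetic with assumed, Gemma-3-like values
image_seq_length = 256                                  # tokens per image after pooling
image_size, patch_size = 896, 14
n_per_side = int(image_seq_length ** 0.5)               # 16
proj_scale_factor = (image_size // patch_size) // n_per_side
print(proj_scale_factor)                                # 64 // 16 = 4 -> the default, nothing is written

# fused QKV split, same slicing as modify_tensors() above, on a toy weight
n_embd = 8
qkv_w = torch.arange(3 * n_embd * n_embd, dtype=torch.float32).reshape(3 * n_embd, n_embd)
c3 = qkv_w.shape[0]
assert c3 % 3 == 0
c = c3 // 3
wq, wk, wv = qkv_w[:c], qkv_w[c:2 * c], qkv_w[2 * c:]
# only the output rows are split; the input width stays intact
assert wq.shape == wk.shape == wv.shape == (n_embd, n_embd)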


@ModelBase.register("Gemma3nForConditionalGeneration")
class Gemma3NModel(Gemma3Model):
@@ -7159,6 +7263,7 @@ def prepare_tensors(self):
@ModelBase.register(
"DeepseekV2ForCausalLM",
"DeepseekV3ForCausalLM",
"DeepseekOCRForCausalLM",
"KimiVLForConditionalGeneration",
)
class DeepseekV2Model(TextModel):
@@ -7219,31 +7324,49 @@ def set_vocab(self):
raise NotImplementedError(f"Deepseek pre-tokenizer {tokpre!r} is not supported yet!")

def set_gguf_parameters(self):
is_ocr = (self.hparams["num_hidden_layers"] == 12)

# note: deepseek2 using MLA converts into MQA (ie: GQA with 1 group)
self.hparams["num_key_value_heads"] = 1
if is_ocr:
self.hparams['rope_theta'] = self.hparams.get('rope_theta', 10000.0)
self.hparams['rms_norm_eps'] = self.hparams.get('rms_norm_eps', 1e-6)
else:
# note: deepseek2 using MLA converts into MQA (ie: GQA with 1 group)
self.hparams["num_key_value_heads"] = 1

super().set_gguf_parameters()
hparams = self.hparams
kv_lora_rank = hparams["q_lora_rank"] if hparams["q_lora_rank"] is not None else 512
routed_scaling_factor = hparams.get("routed_scaling_factor", 1.0)
norm_topk_prob = hparams.get("norm_topk_prob", False)
scoring_func = hparams.get("scoring_func", "softmax")

self.gguf_writer.add_leading_dense_block_count(hparams["first_k_dense_replace"])
self.gguf_writer.add_vocab_size(hparams["vocab_size"])
if "q_lora_rank" in hparams and hparams["q_lora_rank"] is not None:
self.gguf_writer.add_q_lora_rank(hparams["q_lora_rank"])
self.gguf_writer.add_kv_lora_rank(hparams["kv_lora_rank"])
if "kv_lora_rank" in hparams and hparams["kv_lora_rank"] is not None:
self.gguf_writer.add_kv_lora_rank(kv_lora_rank)

# note: deepseek2 using MLA converts into MQA with larger heads, then decompresses to MHA
self.gguf_writer.add_key_length(hparams["kv_lora_rank"] + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length(hparams["kv_lora_rank"])
self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length_mla(hparams["v_head_dim"])
if not is_ocr:
self.gguf_writer.add_key_length(kv_lora_rank + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length(kv_lora_rank)
self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length_mla(hparams["v_head_dim"])
self.gguf_writer.add_rope_dimension_count(hparams["qk_rope_head_dim"])

self.gguf_writer.add_expert_feed_forward_length(hparams["moe_intermediate_size"])
self.gguf_writer.add_expert_count(hparams["n_routed_experts"])
self.gguf_writer.add_expert_shared_count(hparams["n_shared_experts"])
self.gguf_writer.add_expert_weights_scale(hparams["routed_scaling_factor"])
self.gguf_writer.add_expert_weights_norm(hparams["norm_topk_prob"])
self.gguf_writer.add_expert_weights_scale(routed_scaling_factor)
self.gguf_writer.add_expert_weights_norm(norm_topk_prob)

if scoring_func == "sigmoid":
self.gguf_writer.add_expert_gating_func(gguf.ExpertGatingFuncType.SIGMOID)
elif scoring_func == "softmax":
self.gguf_writer.add_expert_gating_func(gguf.ExpertGatingFuncType.SOFTMAX)
else:
raise ValueError(f"Unsupported scoring_func value: {scoring_func}")
self.gguf_writer.add_rope_dimension_count(hparams["qk_rope_head_dim"])

rope_scaling = self.hparams.get("rope_scaling") or {}
@@ -7252,12 +7375,14 @@ def set_gguf_parameters(self):
self.gguf_writer.add_rope_scaling_factor(rope_scaling["factor"])
self.gguf_writer.add_rope_scaling_orig_ctx_len(rope_scaling["original_max_position_embeddings"])
self.gguf_writer.add_rope_scaling_yarn_log_mul(0.1 * rope_scaling["mscale_all_dim"])
self.gguf_writer.add_layer_norm_rms_eps(self.hparams.get("rms_norm_eps", 1e-6))

_experts: list[dict[str, Tensor]] | None = None

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# skip vision tensors and remove "language_model." for Kimi-VL
if "vision_tower" in name or "multi_modal_projector" in name:
if "vision_" in name or "multi_modal_projector" in name \
or "image_newline" in name or "model.projector" in name or "sam_model" in name or "view_seperator" in name:
return []

if name.startswith("language_model."):
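For the non-OCR branch of DeepseekV2Model.set_gguf_parameters above, the MLA-to-MQA cache sizes are simple sums of the LoRA and RoPE dimensions. A hedged worked example, assuming DeepSeek-V2-style values (check the actual model's config.json):

kv_lora_rank     = 512    # assumed DeepSeek-V2-style values
qk_rope_head_dim = 64
qk_nope_head_dim = 128
v_head_dim       = 128

key_length       = kv_lora_rank + qk_rope_head_dim       # 576: compressed KV plus RoPE part
value_length     = kv_lora_rank                           # 512
key_length_mla   = qk_nope_head_dim + qk_rope_head_dim    # 192: per-head K after decompression
value_length_mla = v_head_dim                             # 128
print(key_length, value_length, key_length_mla, value_length_mla)
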
18 changes: 9 additions & 9 deletions examples/eval-callback/eval-callback.cpp
@@ -74,19 +74,19 @@ static void ggml_print_tensor(uint8_t * data, ggml_type type, const int64_t * ne
}
}
for (int64_t i3 = 0; i3 < ne[3]; i3++) {
LOG(" [\n");
LOG(" [\n");
for (int64_t i2 = 0; i2 < ne[2]; i2++) {
if (i2 == n && ne[2] > 2*n) {
LOG(" ..., \n");
LOG(" ..., \n");
i2 = ne[2] - n;
}
LOG(" [\n");
LOG(" [\n");
for (int64_t i1 = 0; i1 < ne[1]; i1++) {
if (i1 == n && ne[1] > 2*n) {
LOG(" ..., \n");
LOG(" ..., \n");
i1 = ne[1] - n;
}
LOG(" [");
LOG(" [");
for (int64_t i0 = 0; i0 < ne[0]; i0++) {
if (i0 == n && ne[0] > 2*n) {
LOG("..., ");
@@ -98,10 +98,10 @@ static bool ggml_debug(struct ggml_tensor * t, bool ask, void * user_data) {
}
LOG("],\n");
}
LOG(" ],\n");
LOG(" ],\n");
}
LOG(" ]\n");
LOG(" sum = %f\n", sum);
LOG(" ]\n");
LOG(" sum = %f\n", sum);
}

// TODO: make this abort configurable/optional?
@@ -136,7 +136,7 @@ static bool ggml_debug(struct ggml_tensor * t, bool ask, void * user_data) {
snprintf(src1_str, sizeof(src1_str), "%s{%s}", src1->name, ggml_ne_string(src1).c_str());
}

LOG("%s: %24s = (%s) %10s(%s{%s}, %s}) = {%s}\n", __func__,
LOG("%s: %16s = (%s) %10s(%s{%s}, %s}) = {%s}\n", __func__,
t->name, ggml_type_name(t->type), ggml_op_desc(t),
src0->name, ggml_ne_string(src0).c_str(),
src1 ? src1_str : "",
2 changes: 2 additions & 0 deletions ggml/src/ggml-cuda/upscale.cu
@@ -289,5 +289,7 @@ void ggml_cuda_op_upscale(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
upscale_f32_bicubic_cuda(src0_d, dst_d, src0->nb[0], src0->nb[1], src0->nb[2], src0->nb[3],
src0->ne[0], src0->ne[1], dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3],
sf0, sf1, sf2, sf3, pixel_offset, stream);
} else {
GGML_ABORT("fatal error");
}
}
1 change: 1 addition & 0 deletions ggml/src/ggml.c
@@ -5206,6 +5206,7 @@ struct ggml_tensor * ggml_flash_attn_ext(
GGML_ASSERT(q->ne[3] == v->ne[3]);

if (mask) {
GGML_ASSERT(mask->type == GGML_TYPE_F16);
GGML_ASSERT(ggml_is_contiguous(mask));
GGML_ASSERT(mask->ne[1] >= GGML_PAD(q->ne[1], GGML_KQ_MASK_PAD) &&
"the Flash-Attention kernel requires the mask to be padded to GGML_KQ_MASK_PAD and at least n_queries big");