@stevescot

Fix DeepSeek V3.2 Inference on Apple Silicon (Metal)

Description:
This PR addresses multiple runtime crashes and assertion failures encountered when running the experimental DeepSeek V3.2 model on Apple Silicon (M3 Ultra).

Changes:

llama-model-loader.cpp: Fixed a crash during tensor loading where tensor->src[0] was dereferenced before it was initialized. Rewrote the loop so that tensors are named and allocated before their sources are accessed.
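
The actual loader diff is not reproduced here; the following is a standalone illustration of the failure and fix pattern only, where `tensor_t` and all the names are placeholders rather than llama.cpp symbols:

```cpp
#include <cstdio>
#include <vector>

// Placeholder struct mimicking the shape of the problem: a node carries
// source pointers that remain unset until the loader assigns them.
struct tensor_t {
    const char * name;
    tensor_t *   src[2] = { nullptr, nullptr };
};

int main() {
    std::vector<tensor_t> tensors = { { "blk.0.attn_q" }, { "blk.0.attn_out" } };

    for (auto & t : tensors) {
        // Buggy pattern: reading t.src[0]->name here dereferences a null
        // pointer, because src[0] has not been initialized yet.
        // Fixed pattern: guard every access to a source that may be unset.
        if (t.src[0] != nullptr) {
            std::printf("%s <- %s\n", t.name, t.src[0]->name);
        } else {
            std::printf("%s: source not yet allocated, skipping\n", t.name);
        }
    }
    return 0;
}
```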
llama-sparse-mla-fwd.cpp: Disabled LLAMA_SPARSE_MLA_FUSED_DECODE by default. The fused decode kernel is not yet implemented for Metal, so enabling it triggers assertion failures.
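
The PR does not show how this flag is wired up; assuming it is a compile-time macro, a minimal sketch of defaulting such an opt-in switch to off could look like this (only the macro name comes from the PR, the rest is an assumption):

```cpp
// Sketch only: defaulting the switch to off so the fused decode path
// (which has no Metal kernel yet) must be explicitly opted into.
#ifndef LLAMA_SPARSE_MLA_FUSED_DECODE
#define LLAMA_SPARSE_MLA_FUSED_DECODE 0
#endif

static bool sparse_mla_fused_decode_enabled() {
#if LLAMA_SPARSE_MLA_FUSED_DECODE
    return true;   // explicit opt-in, e.g. -DLLAMA_SPARSE_MLA_FUSED_DECODE=1
#else
    return false;  // take the unfused decode path, which Metal supports
#endif
}
```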
llama-sparse-topk.cpp: Added a CPU fallback for SPARSE_TOPK_RADIX.
Issue: The Metal backend does not implement the SPARSE_TOPK_RADIX operator, so the graph falls back to the CPU backend, which explicitly asserts false for this operator and crashes.
Fix: Wrapped the call in an #ifdef __APPLE__ block to force the use of the CPU-based sparse_attn_topk::topk_radix_indices function instead of the graph operator, bypassing the missing backend support while preserving CUDA behavior on other platforms.
Result:
The model loads and runs inference successfully on an M3 Ultra, with ~12 t/s prompt processing and ~8 t/s generation.