Releases: modelscope/dash-infer
v2.2.5
What's Changed
- Support single-node prefill-decode (P-D) disaggregation for CUDA.
- Support the Qwen3 model. Currently only dense models are supported; MoE model support is WIP.
- Support expert parallelism (EP) in the MoE op; currently only BF16 and FP16 are supported.
- CPU support does not work in this release; use a v2.1.x version for CPU support.
In Detail
- readme: add citation, update subproject description by @leefige in #70
- fix cugraph dnn moe kernel bug by @laiwenzh in #71
- support MOE EP by @yjc9696 in #73
- support disaggregated prefilling by @laiwenzh in #78
- Fix prompt-related issues by @lddfym in #77
- Build: only build flash attention kernel once. by @kzjeef in #80
- support Qwen3 (Dense) by @yjc9696 in #81
- model: fix cpu compiler error by @kzjeef in #82
- Fix cpu compile by @kzjeef in #83
- Create build-check.yml by @kzjeef in #84
- ci: only trigger cuda release for current version. by @kzjeef in #85
Full Changelog: v2.1.0...v3.0.0-rc1
Known Issue
- CPU support does not work in this release; use a v2.1.x version for CPU support.
v2.1.0
What's Changed
- [JSON mode]: FormatEnforcer use cudaMallocHost for scores buffer by @WangNorthSea in #56
- [A16W8 & A8W8]: further optimization for the Ampere A16W8 fused GEMM kernel; fix LoRA doc by @wyajieha in #58
- [Multimodal]: Support LLM quantization with GPTQ and AXWY by @x574chen in #60
- [PKG]: Reduce package size by only compiling flash-attn src with hdim128 by @laiwenzh in #62
- [MOE]: add high performance moe kernel; fix a16w8 compile bug for sm<80 by @laiwenzh in #67
Full Changelog: v2.0.0...v2.1.0
v2.0.0
What's Changed
- engine: stop and release the model when the engine is released, and remove a deprecated lock
- sampling: heavily refactor generate_op, removing its dependency on global tensors
- prefix cache: fix several bugs, improve evict performance
- json mode: update the lmfe-cpp patch, add process_logits, sampling with top_k and top_p
- span-attention: move span_attn decoderReshape to init
- lora: add docs, fix typos
- ubuntu: add an Ubuntu dockerfile, fix install dir error
- bugfix: fix multi-batch repetition-penalty bug
Full Changelog: v1.3.0...v2.0.0
v2.0.0-rc3
Some bug fixes:
- fix UUID crash issue
- update LoRA implementation
- set page size by parameter
- delete deprecated files
v2.0.0-rc2
release script: reduce python wheel size (#46)
v1.3.0
Highlight
- Support Baichuan-7B and Baichuan2-7B & 13B by @WangNorthSea in #38
Full Changelog: v1.2.1...v1.3.0
v1.2.1
v1.2.0
Expand context length to 32K and support flash attention on the Intel AVX-512 platform.
- remove currently unsupported cache mode
- examples: update the Qwen prompt template, add a print function to examples
- support glm-4-9b-chat
- change to size_t to avoid overflow when the sequence is long
- update README since we now support 32K context length
- add flash attention on the Intel AVX-512 platform