1 file changed: +14 −1 lines changed
 config.arch_compat_overrides()
 config.no_graphs = True
 model = ExLlamaV2(config)
-model.load_tp(progress = True)
+
+# Load the model in tensor-parallel mode. With no gpu_split specified, the model will attempt to split across
+# all visible devices according to the currently available VRAM on each. expect_cache_tokens is necessary for
+# balancing the split, in case the GPUs are of uneven sizes, or if the number of GPUs doesn't divide the number
+# of KV heads in the model
+#
+# The cache type for a TP model is always ExLlamaV2Cache_TP and should be allocated after the model. To use a
+# quantized cache, add a `base = ExLlamaV2Cache_Q6` etc. argument to the cache constructor. It's advisable
+# to also add `expect_cache_base = ExLlamaV2Cache_Q6` to load_tp() as well so the size can be correctly
+# accounted for when splitting the model.
+
+model.load_tp(progress = True, expect_cache_tokens = 16384)
 cache = ExLlamaV2Cache_TP(model, max_seq_len = 16384)

+# After loading the model, all other functions should work the same
+
 print("Loading tokenizer...")
 tokenizer = ExLlamaV2Tokenizer(config)
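
The quantized-cache variant mentioned in the new comment isn't exercised by the changed code itself. Below is a minimal sketch of how it could look, assuming ExLlamaV2Cache_Q6 is exported by exllamav2 alongside the other classes used in this example, and with a hypothetical model path standing in for a real one:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_TP, ExLlamaV2Cache_Q6

config = ExLlamaV2Config("/path/to/model")  # hypothetical model directory
config.arch_compat_overrides()
config.no_graphs = True
model = ExLlamaV2(config)

# Declare the expected cache size and quantized cache type up front so the
# tensor-parallel split can account for the cache when balancing across GPUs
model.load_tp(progress = True, expect_cache_tokens = 16384, expect_cache_base = ExLlamaV2Cache_Q6)

# Allocate the TP cache after the model, passing the same quantized base type
cache = ExLlamaV2Cache_TP(model, max_seq_len = 16384, base = ExLlamaV2Cache_Q6)

Passing the cache type at load time matters because a quantized cache takes less VRAM per token than the default FP16 cache, so the per-device headroom reserved for it during the split is different.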