convert-hf-to-ggml.py CUDA out of memory #35

@Alex20129

Description

I've tried to convert the model from HF to GGML format:

python3 convert-hf-to-ggml.py ../starcoderbase_int8

and got an error:

Loading model:  ../starcoderbase_int8
Loading checkpoint shards:   ...
Traceback (most recent call last):
  File "/home/alex/starcoder/starcoder.cpp/convert-hf-to-ggml.py", line 58, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2901, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3258, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 725, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(model, param_name, param_device, value=param, fp16_statistics=fp16_statistics)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/bitsandbytes.py", line 109, in set_module_quantized_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 10.90 GiB total capacity; 9.21 GiB already allocated; 568.69 MiB free; 9.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
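
For completeness, my understanding of the allocator hint at the end of that traceback is roughly the following (the 128 value is just an example I picked; I doubt it would help anyway, since the whole model simply doesn't fit on an ~11 GiB card):

# Rough sketch of the allocator setting suggested by the error message.
# It has to be in the environment before torch initializes CUDA;
# 128 MiB is an arbitrary example value, not a recommendation.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the variable is set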

Next, I tried to force it to run on the CPU:

export CUDA_VISIBLE_DEVICES=""
python3 convert-hf-to-ggml.py ../starcoderbase_int8

Then, I got this:

Loading model:  ../starcoderbase_int8
Traceback (most recent call last):
  File "/home/alex/starcoder/starcoder.cpp/convert-hf-to-ggml.py", line 58, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2370, in from_pretrained
    raise RuntimeError("No GPU found. A GPU is needed for quantization.")
RuntimeError: No GPU found. A GPU is needed for quantization.

For me, the main reason to go with the GGML implementation is that I can't fit the model in my GPU. I thought I could perform both the conversion and the inference using only the CPU and system RAM. Am I doing something specific wrong, or did I get it wrong in general?
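
For reference, here is roughly what I expected the load step to look like on CPU only. This is just a sketch, assuming the converter is pointed at the original, non-quantized checkpoint instead of the int8 one (the ../starcoderbase path is hypothetical):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical path to the original (non-int8) StarCoder checkpoint.
model_name = "../starcoderbase"

config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)

# Same call convert-hf-to-ggml.py makes, but against a checkpoint without a
# bitsandbytes int8 quantization config, so no GPU should be required and
# the weights should stay in system RAM.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

My guess is that the "A GPU is needed for quantization" error is triggered because the config.json inside ../starcoderbase_int8 carries a bitsandbytes quantization config, but I may be wrong about that.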
