Revamp int4 MPS support in torchao #3446

@jerryzh168

Description

Request from @SunMarc:

Hugging Face Transformers would like to get better support for macOS users through torchao. We have some prototype/experimental APIs and kernels, but there are a few issues we need to resolve to make them useful for Hugging Face Transformers users:

* installation -> requires users to check out v0.13 and build the library from source for macOS
* API -> doesn't work out of the box with our integration, but we can write our own implementation that uses UIntxWeightOnlyQuantizedLinear directly. This will enable us to quantize on the fly.
* no bias support -> right now, this implementation only works with models that don't have a bias. It would be nice to remove this restriction to enable more models (e.g. Qwen); one possible workaround is sketched after this list.
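
One possible workaround for the bias restriction, until it is lifted upstream, is to keep the original floating-point bias outside the bias-free quantized linear and add it back after the matmul. A minimal sketch (the BiasedQuantizedLinear wrapper below is a hypothetical illustration, not a torchao API):

import torch
import torch.nn as nn

class BiasedQuantizedLinear(nn.Module):
    # Hypothetical wrapper: run a bias-free quantized linear
    # (e.g. one built on UIntxWeightOnlyQuantizedLinear) and add
    # the original floating-point bias back afterwards.
    def __init__(self, quantized_linear: nn.Module, bias: torch.Tensor):
        super().__init__()
        self.quantized_linear = quantized_linear
        self.register_buffer("bias", bias.detach().clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.quantized_linear(x) + self.bias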

For installation:

USE_CPP=1 BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 TORCHAO_BUILD_MPS_OPS=1 pip install -e . --no-build-isolation --no-cache-dir

gives:

  File "/Users/marcsun/Desktop/HF/transformers/src/transformers/cli/serve.py", line 1839, in _load_model_and_data_processor
    model = quantizer.quantize(model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/marcsun/Desktop/HF/ao/torchao/experimental/quant_api.py", line 183, in quantize
    _replace_linear_with_quantized_linear_mps(
  File "/Users/marcsun/Desktop/HF/ao/torchao/experimental/quant_api.py", line 125, in _replace_linear_with_quantized_linear_mps
    _replace_linear_with_quantized_linear_mps(child, kwargs)
  File "/Users/marcsun/Desktop/HF/ao/torchao/experimental/quant_api.py", line 125, in _replace_linear_with_quantized_linear_mps
    _replace_linear_with_quantized_linear_mps(child, kwargs)
  File "/Users/marcsun/Desktop/HF/ao/torchao/experimental/quant_api.py", line 125, in _replace_linear_with_quantized_linear_mps
    _replace_linear_with_quantized_linear_mps(child, kwargs)
  [Previous line repeated 1 more time]
  File "/Users/marcsun/Desktop/HF/ao/torchao/experimental/quant_api.py", line 129, in _replace_linear_with_quantized_linear_mps
    pack_weight_op=getattr(torch.ops.torchao, f"_pack_weight_{nbit}bit"),
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/marcsun/miniconda3/envs/test/lib/python3.11/site-packages/torch/_ops.py", line 1365, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'torchao' object has no attribute '_pack_weight_6bit'
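
The AttributeError means no op named _pack_weight_6bit is registered under torch.ops.torchao, i.e. either the experimental MPS ops were not built for that bit width or the compiled library was not loaded. A quick diagnostic (assuming only the _pack_weight_{nbit}bit naming visible in the traceback above) is to check which bit widths are actually registered before running the quantizer:

import torch
import torchao  # importing torchao should load the compiled experimental ops, if built

# hasattr returns False instead of raising when an op is missing from
# the torch.ops.torchao namespace, so this lists the available bit widths.
for nbit in range(1, 8):
    name = f"_pack_weight_{nbit}bit"
    print(name, hasattr(torch.ops.torchao, name))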

Targeted to be fixed in 0.16 (~end of January 2026).

Metadata

Labels

0.16, integration (issues related to integrations with other libraries, like huggingface, vllm, sglang, gemlite, etc.), multibackend, quantize_ (the quantize_ API), triaged
