1. **Hardware-Friendly**: Uses power-of-2 scaling factors for efficient hardware implementation
2. **Fine-Grained Quantization**: Per-block scaling (block size = 32) provides better accuracy than per-tensor or per-channel methods (see the sketch after this list)
3. **Zero-Point Free**: No zero-point overhead, simplifying computation
4. **Memory Efficient**: Significantly reduces model size while maintaining competitive accuracy
5. **Energy Efficient**: Lower energy consumption for multiply-accumulate operations compared to traditional data types
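As a minimal illustration of the per-block, power-of-two scaling described above (block size 32, no zero point), the sketch below scales one block of weights. It is conceptual only: the function name and `elem_max` value are illustrative, and the exact FP4/FP8 element encodings and rounding rules of the MX specification are omitted.

```python
import numpy as np


def mx_scale_block(block: np.ndarray, elem_max: float = 6.0) -> tuple[float, np.ndarray]:
    """Toy per-block MX-style scaling: one power-of-two scale shared by 32 values.

    There is no zero point; `elem_max` is the largest magnitude representable by
    the element format (6.0 for FP4 E2M1). Element rounding/encoding is omitted.
    """
    amax = float(np.max(np.abs(block)))
    if amax == 0.0:
        return 1.0, np.zeros_like(block)
    # Choose a power-of-two scale so that block / scale fits within +/- elem_max.
    scale = float(2.0 ** np.ceil(np.log2(amax / elem_max)))
    return scale, block / scale


# One block of 32 weights shares a single power-of-two scale.
rng = np.random.default_rng(0)
block = rng.standard_normal(32).astype(np.float32)
scale, scaled = mx_scale_block(block)
print(scale, float(np.max(np.abs(scaled))))  # scale is 2^k; scaled values lie within +/- 6.0
```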
## Mixed Precision (MXFP4 + MXFP8)
To achieve optimal compression ratios with acceptable accuracy, we integrate AutoRound's automatic mixed-precision algorithm. The mixed-precision approach combines MXFP4 and MXFP8 formats, quantizing each layer of the model according to its sensitivity to quantization.
### Benefits of Mixed Precision
- **Better Accuracy-Compression Trade-off**: Sensitive layers use MXFP8 (higher precision) while less sensitive layers use MXFP4 (higher compression), optimizing overall model performance.
- **Flexible Configuration**: Users can customize the precision assignment strategy based on their specific accuracy and compression requirements.
- **Automatic Layer Selection**: The AutoRound algorithm automatically identifies which layers should use which precision level, reducing manual tuning effort (a conceptual sketch follows this list).
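The selection idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not AutoRound's actual search (which targets an average bit-width via `target_bits`, described below); `sensitivity_scores`, the layer names, and the threshold are hypothetical.

```python
def assign_precisions(sensitivity_scores: dict[str, float], threshold: float) -> dict[str, str]:
    """Keep quantization-sensitive layers in MXFP8, push the rest down to MXFP4."""
    return {
        name: "MXFP8" if score > threshold else "MXFP4"
        for name, score in sensitivity_scores.items()
    }


# Hypothetical per-layer sensitivity scores (e.g., output error when the layer is held at MXFP4).
scores = {"layers.0.self_attn.q_proj": 0.9, "layers.0.mlp.down_proj": 0.2}
print(assign_precisions(scores, threshold=0.5))
# {'layers.0.self_attn.q_proj': 'MXFP8', 'layers.0.mlp.down_proj': 'MXFP4'}
```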
### Target Bits Configuration
To achieve optimal compression ratios in mixed-precision quantization, we provide the `target_bits` parameter for automated precision configuration.
- **Single target bit**: Passing a single float automatically generates an optimal quantization recipe that achieves the target average bit-width.
- **Multiple target bits**: Passing multiple floats generates one recipe per target bit-width, letting you compare trade-offs between model size and accuracy.
**Note**: For MX data types, `target_bits` ranges from 4.25 to 8.25 because of the per-block scale-bits overhead.
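As a hedged sketch of the single-target-bit case: only `target_bits` is taken from this document, and any additional keyword arguments your neural-compressor version requires (scheme selection, calibration settings) are intentionally omitted.

```python
from neural_compressor.torch.quantization import AutoRoundConfig

# Ask AutoRound to search for a mixed MXFP4/MXFP8 recipe whose average
# bit-width is roughly 6 bits (valid MX range: 4.25-8.25, scale bits included).
config = AutoRoundConfig(target_bits=6.0)
```

The multi-target case is covered by the AutoTune example in the next section.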
### Usage Example
#### AutoTune with Multiple Target Bits
To automatically find the best configuration across multiple target bits:
```python
from neural_compressor.torch.quantization import AutoRoundConfig, autotune, TuningConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
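
# --- The remainder of this example is a hedged sketch: the keyword arguments
# --- passed to AutoRoundConfig/TuningConfig/autotune below are assumptions
# --- based on this document and may differ across neural-compressor releases.

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model id
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)  # typically used to build calibration data

# One candidate recipe per target average bit-width (MX range: 4.25-8.25).
config_set = [AutoRoundConfig(target_bits=bits) for bits in (5.0, 6.0, 7.8)]


def eval_fn(qmodel) -> float:
    # Placeholder: plug in your own evaluation (e.g. lm-eval accuracy or
    # negative perplexity); a higher return value should mean a better model.
    return 0.0


# autotune tries each candidate config and returns the quantized model that
# satisfies the tuning criteria according to eval_fn.
tuning_config = TuningConfig(config_set=config_set)
best_model = autotune(model, tuning_config, eval_fn=eval_fn)
```
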
## Examples

- **Multimodal Models**: [Llama-4-Scout-17B-16E-Instruct with MXFP4](/examples/pytorch/multimodal-modeling/quantization/auto_round/llama4)
- **Language Models**: [Llama3 series with MXFP4/MXFP8 and Mixed Precision](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3)
  - Llama 3.1 8B: MXFP8, MXFP4, and mixed precision (`target_bits=7.8`)
  - Llama 3.3 70B: MXFP8, MXFP4, and mixed precision (`target_bits=5.8`)
## Best Practices and Tips
### Choosing the Right Data Type
| Data Type | Compression | Accuracy | Use Case | Export Format |
|---|---|---|---|---|
| **MXFP8** | Moderate (8-bit) | High | Production models where accuracy is critical | `auto_round` |
| **MXFP4** | High (4-bit) | Moderate | Aggressive compression with acceptable accuracy loss | `auto_round` |
| **MXFP4 + MXFP8 Mix** | Configurable (4.25-8.25 bits) | High | Best balance between compression and accuracy | `auto_round` |
### Common Issues and Solutions
**Issue**: Out of Memory (OOM) during quantization
- **Solution**: Use `low_gpu_mem_usage=True`, enable `enable_torch_compile`, reduce `nsamples`, or use a smaller `seqlen` (see the sketch below)
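A memory-conscious setup might look like the sketch below; the options are the ones named above, and treating them all as `AutoRoundConfig` keyword arguments (and the specific values chosen) is an assumption that may vary by neural-compressor version.

```python
from neural_compressor.torch.quantization import AutoRoundConfig

# Reduce peak GPU memory during AutoRound tuning (values are illustrative).
config = AutoRoundConfig(
    low_gpu_mem_usage=True,     # offload intermediates to save GPU memory
    enable_torch_compile=True,  # compile the tuning loop
    nsamples=128,               # calibration samples; lower further if OOM persists
    seqlen=512,                 # shorter calibration sequences
)
```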
**Issue**: Accuracy drop is too large
- **Solution**: Increase `iters`, use more `nsamples`, or try mixed precision with a higher `target_bits`
**Issue**: Quantization is too slow
- **Solution**: Reduce `iters` (or set it to 0 to fall back to RTN), decrease `nsamples`, or enable `enable_torch_compile`
**Issue**: Model loading fails after quantization
- **Solution**: Refer to [auto_round/llama3/inference](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md#inference)