LLM Technical Note: Picking the Right Quantization Method for Local Inference
Last updated: Nov 7, 2023
Thanks to innovative quantization strategies and the commendable work of people like TheBloke, running powerful models locally on devices such as my M1 Max is now possible. However, a first visit to a Hugging Face model page can be overwhelming: which variant do you choose among "Q4_K_M", "Q5_K_S", "Q8_0", and the rest? This guide aims to clear it up quickly.
GGML / GGUF vs. GPTQ
The first decision one has to make when choosing a quantized model is the model type. In general:
- GGUF (the successor format to GGML) models are optimized for CPU inference and are the go-to choice for Mac users, since llama.cpp can also offload layers to Apple Silicon GPUs via Metal.
- GPTQ models are optimized for GPU inference, ideal when the whole model fits into VRAM.
GGUF models can be run with llama.cpp or GUIs built on it, such as LM Studio; GPTQ models need a GPU backend such as AutoGPTQ or ExLlama, both available through text-generation-webui.
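If you prefer scripting over a GUI, llama.cpp's Python bindings work well on both CPU and Apple Silicon. Here is a minimal sketch, assuming `llama-cpp-python` and `huggingface_hub` are installed; the repo and file names below are just examples:

```python
from huggingface_hub import hf_hub_download  # pip install huggingface_hub
from llama_cpp import Llama                  # pip install llama-cpp-python

# Fetch one quantized file from a GGUF repo (example repo/file names).
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-13B-chat-GGUF",
    filename="llama-2-13b-chat.Q4_K_M.gguf",
)

# n_gpu_layers=-1 offloads every layer to the GPU (Metal on an M1 Max);
# set it to 0 for pure CPU inference.
llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1)

out = llm("Q: What does Q4_K_M mean? A:", max_tokens=64)
print(out["choices"][0]["text"])
```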
Quantization Methods at a Glance
Quantized files are usually named according to the following convention:
`<model_name>.Q<quantization_bits>_<variant>.gguf`
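In case it helps, here is a quick sketch of pulling those two pieces out of a filename; the regex is my own approximation of the common naming, not an official spec:

```python
import re

# Matches names like "llama-2-13b-chat.Q4_K_M.gguf" -> (4, "K_M").
# This mirrors the usual TheBloke-style naming; it is not a formal spec.
QUANT_RE = re.compile(r"[._-]Q(?P<bits>\d)_(?P<variant>[A-Z0-9_]+)\.gguf$",
                      re.IGNORECASE)

def parse_quant(filename: str) -> tuple[int, str] | None:
    match = QUANT_RE.search(filename)
    if match is None:
        return None  # e.g. an F16 file, or a scheme this pattern doesn't cover
    return int(match.group("bits")), match.group("variant")

print(parse_quant("llama-2-13b-chat.Q4_K_M.gguf"))  # (4, 'K_M')
print(parse_quant("llama-2-13b-chat.Q6_K.gguf"))    # (6, 'K')
```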
At a high level:
| Variant | Description | Key Differences |
|---|---|---|
| _0 | Legacy quant method. | Uniform precision across all tensors; each block stores a single scale. |
| _1 | Legacy quant method, slightly larger. | Uniform precision, but each block also stores an offset, trading a little extra size for slightly better quality. |
| _K | Designation for the k-quants. | Improved size/quality tradeoff compared to the legacy methods. |
| _K_L | Large size k-quant. | Uses higher precision types for more of the tensors, for the best quality at that bit width. |
| _K_M | Medium size k-quant. | Uses higher precision for half of the attention.wv and feed_forward.w2 tensors (Q6_K in the case of Q4_K_M), and the base precision (Q4_K) for the rest. |
| _K_S | Small size k-quant. | Uses the base precision (e.g. Q4_K) for all tensors. |
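Since the mixes use different precisions for different tensors, the most useful single number is the effective bits per weight, which also lets you estimate file sizes. A small sketch; the bits-per-weight figures are approximations I have seen quoted for llama.cpp quants, not exact values:

```python
# Approximate effective bits per weight for some common llama.cpp quants.
# Ballpark figures only; exact file sizes vary slightly per model.
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def file_size_gb(n_params: float, quant: str) -> float:
    """Estimate the GGUF file size in GB: params * bits-per-weight / 8."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# A 13B model at Q4_K_M comes out around 7.8 GB, close to the actual
# file sizes you see on the Hugging Face hub.
print(f"{file_size_gb(13e9, 'Q4_K_M'):.1f} GB")
```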
Here is a summary of how these models compare in practice, according to my own limited testing and what I have read on the web:
| Quant Method | Bits | Size (% of Q8_0 file) | Quality Impact on a 13B Model |
|---|---|---|---|
| Q2_K | 2 | ~39% | Significant loss - not recommended |
| Q3_K_S | 3 | ~40% | Very high loss |
| Q3_K_M | 3 | ~45% | Very high loss |
| Q3_K_L | 3 | ~49% | Substantial loss |
| Q4_0 | 4 | ~53% | Very high loss (legacy; prefer Q3_K_M) |
| Q4_K_S | 4 | ~53% | Noticeably more loss than Q4_K_M |
| Q4_K_M | 4 | ~56% | Balanced - recommended |
| Q5_0 | 5 | ~64% | Balanced (legacy; prefer Q4_K_M) |
| Q5_K_S | 5 | ~64% | Low loss - recommended |
| Q5_K_M | 5 | ~66% | Very low loss - recommended |
| Q6_K | 6 | ~76% | Extremely low loss |
| Q8_0 | 8 | 100% | Extremely low loss - not recommended (little gain over Q6_K for the extra size) |
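One practical way to read this table: take the highest-quality quant whose file, plus headroom for the KV cache and the rest of the system, fits in your RAM. A sketch of that rule of thumb, reusing file_size_gb from above; the preference order and 70% headroom factor are my own heuristics:

```python
# Ordered from highest to lowest quality, following the table above.
PREFERENCE = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q4_0"]

def pick_quant(n_params: float, ram_gb: float,
               headroom: float = 0.7) -> str | None:
    """Return the highest-quality quant whose estimated file size fits in
    headroom * ram_gb, leaving room for the KV cache and the OS."""
    budget = ram_gb * headroom
    for quant in PREFERENCE:
        if file_size_gb(n_params, quant) <= budget:
            return quant
    return None  # nothing fits; consider a smaller model

# On a 16 GB machine, a 13B model lands on Q6_K (~10.7 GB estimated).
print(pick_quant(13e9, 16))
```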
Final Recommendations
| Model Size | Best for Quality | Best Trade-off | Avoid |
|---|---|---|---|
| 7B, 13B | Q5_K_S, Q5_K_M | Q4_K_M | Q2_K, Q4_0 |