LLM Technical Note: Picking the Right Quantization Method for Local Inference
Last updated: Nov 7, 2023
Thanks to innovative quantization strategies and the commendable work of people like TheBloke, running powerful models locally on devices such as my M1 Max is now possible. However, a first visit to a Hugging Face model page can be overwhelming: which variant do you choose among "Q4_K_M", "Q5_K_S", "Q8_0", and the rest? This guide aims to clear it up quickly.
GGML / GGUF vs. GPTQ
The first decision one has to make when choosing a quantized model is the model type. In general:
- GGUF (the successor format to GGML) models are optimized for CPU inference and are the go-to choice for Mac users, since llama.cpp can also offload layers to Apple Silicon GPUs via Metal.
- GPTQ models are optimized for GPU inference, ideal when the whole model fits into VRAM.
GGUF models can be run with llama.cpp or GUIs built on it, such as LM Studio; GPTQ models need a GPU backend such as AutoGPTQ or ExLlama, both available through text-generation-webui.
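If you prefer scripting over a GUI, llama.cpp's Python bindings work well on both CPU and Apple Silicon. Here is a minimal sketch, assuming `llama-cpp-python` and `huggingface_hub` are installed; the repo and file names below are just examples:

```python
from huggingface_hub import hf_hub_download  # pip install huggingface_hub
from llama_cpp import Llama                  # pip install llama-cpp-python

# Fetch one quantized file from a GGUF repo (example repo/file names).
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-13B-chat-GGUF",
    filename="llama-2-13b-chat.Q4_K_M.gguf",
)

# n_gpu_layers=-1 offloads every layer to the GPU (Metal on an M1 Max);
# set it to 0 for pure CPU inference.
llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1)

out = llm("Q: What does Q4_K_M mean? A:", max_tokens=64)
print(out["choices"][0]["text"])
```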
Quantization Methods at a Glance
Quantized files are usually named according to the following convention:
`<model_name>.Q<quantization_bits>_<variant>.gguf`
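In case it helps, here is a quick sketch of pulling those two pieces out of a filename; the regex is my own approximation of the common naming, not an official spec:

```python
import re

# Matches names like "llama-2-13b-chat.Q4_K_M.gguf" -> (4, "K_M").
# This mirrors the usual TheBloke-style naming; it is not a formal spec.
QUANT_RE = re.compile(r"[._-]Q(?P<bits>\d)_(?P<variant>[A-Z0-9_]+)\.gguf$",
                      re.IGNORECASE)

def parse_quant(filename: str) -> tuple[int, str] | None:
    match = QUANT_RE.search(filename)
    if match is None:
        return None  # e.g. an F16 file, or a scheme this pattern doesn't cover
    return int(match.group("bits")), match.group("variant")

print(parse_quant("llama-2-13b-chat.Q4_K_M.gguf"))  # (4, 'K_M')
print(parse_quant("llama-2-13b-chat.Q6_K.gguf"))    # (6, 'K')
```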
At a high level:
| Variant | Description | Key Differences |
|---|---|---|
| _0 | Legacy quant method. | Uniform precision across all tensors; each block stores a single scale. |
| _1 | Legacy quant method, slightly larger. | Uniform precision, but each block also stores an offset, trading a little extra size for slightly better quality. |
| _K | Designation for the k-quants. | Improved size/quality tradeoff compared to the legacy methods. |
| _K_L | Large size k-quant. | Uses higher precision types for more of the tensors, for the best quality at that bit width. |
| _K_M | Medium size k-quant. | Uses higher precision for half of the attention.wv and feed_forward.w2 tensors (Q6_K in the case of Q4_K_M), and the base precision (Q4_K) for the rest. |
| _K_S | Small size k-quant. | Uses the base precision (e.g. Q4_K) for all tensors. |
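Since the mixes use different precisions for different tensors, the most useful single number is the effective bits per weight, which also lets you estimate file sizes. A small sketch; the bits-per-weight figures are approximations I have seen quoted for llama.cpp quants, not exact values:

```python
# Approximate effective bits per weight for some common llama.cpp quants.
# Ballpark figures only; exact file sizes vary slightly per model.
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def file_size_gb(n_params: float, quant: str) -> float:
    """Estimate the GGUF file size in GB: params * bits-per-weight / 8."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# A 13B model at Q4_K_M comes out around 7.8 GB, close to the actual
# file sizes you see on the Hugging Face hub.
print(f"{file_size_gb(13e9, 'Q4_K_M'):.1f} GB")
```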
Here is a summary of how these models compare in practice, according to my own limited testing and what I have read on the web:
| Quant Method | Bits | Size (% of Q8_0 file) | Quality Impact on a 13B Model |
|---|---|---|---|
| Q2_K | 2 | ~39% | Significant loss - not recommended |
| Q3_K_S | 3 | ~40% | Very high loss |
| Q3_K_M | 3 | ~45% | Very high loss |
| Q3_K_L | 3 | ~49% | Substantial loss |
| Q4_0 | 4 | ~53% | Very high loss (legacy; prefer Q3_K_M) |
| Q4_K_S | 4 | ~53% | Noticeably more loss than Q4_K_M |
| Q4_K_M | 4 | ~56% | Balanced - recommended |
| Q5_0 | 5 | ~64% | Balanced (legacy; prefer Q4_K_M) |
| Q5_K_S | 5 | ~64% | Low loss - recommended |
| Q5_K_M | 5 | ~66% | Very low loss - recommended |
| Q6_K | 6 | ~76% | Extremely low loss |
| Q8_0 | 8 | 100% | Extremely low loss - not recommended (little gain over Q6_K for the extra size) |
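One practical way to read this table: take the highest-quality quant whose file, plus headroom for the KV cache and the rest of the system, fits in your RAM. A sketch of that rule of thumb, reusing file_size_gb from above; the preference order and 70% headroom factor are my own heuristics:

```python
# Ordered from highest to lowest quality, following the table above.
PREFERENCE = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q4_0"]

def pick_quant(n_params: float, ram_gb: float,
               headroom: float = 0.7) -> str | None:
    """Return the highest-quality quant whose estimated file size fits in
    headroom * ram_gb, leaving room for the KV cache and the OS."""
    budget = ram_gb * headroom
    for quant in PREFERENCE:
        if file_size_gb(n_params, quant) <= budget:
            return quant
    return None  # nothing fits; consider a smaller model

# On a 16 GB machine, a 13B model lands on Q6_K (~10.7 GB estimated).
print(pick_quant(13e9, 16))
```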
Final Recommendations
| Model Size | Best for Quality | Best Trade-off | Avoid |
|---|---|---|---|
| 7B, 13B | Q5_K_S, Q5_K_M | Q4_K_M | Q2_K, Q4_0 |