Performance benchmark of Mistral AI using llama.cpp
The llama.cpp library comes with a benchmarking tool, llama-bench.
I tested both the MacBook Pro M1 with 16 GB of unified memory and the Tesla V100S from OVHCloud (t2-le-45).
The post will be updated as more tests are done.
Procedure to run inference benchmark with llama.cpp
This guide covers only macOS and Linux.
Clone and compile llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git llama-cpp-for-benchmark
cd llama-cpp-for-benchmark
If you have an NVIDIA GPU and CUDA is already installed:
# I'm assuming you already have CUDA properly installed and working
make LLAMA_CUBLAS=1
If you have an Apple Silicon machine:
make
If you have an OpenCL-compatible GPU and are running Linux:
# install CLBlast - see https://github.com/ggerganov/llama.cpp#installing-clblast
sudo apt-get install libclblast-dev libclblast1
make LLAMA_CLBLAST=1
If you don't have any usable GPU (older models without support):
make
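Whichever build variant you used, the two binaries used in the rest of this guide (main and llama-bench) should now sit at the root of the repository; a quick sanity check:
# Check that the inference example and the benchmarking tool were built
ls -l ./main ./llama-bench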
Download the model
- If you have less than 8 GB of RAM, running the benchmark is not recommended
- If you have between 8 GB and 16 GB of RAM, download the Q4_K_M quantized version
- If you have at least 16 GB of RAM, download the Q8_0 quantized version
- If you have at least 24 GB of RAM, you can download the official release (fp16)
Store the downloaded model in ./llama-cpp-for-benchmark/models/mistral/
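For example, to grab the Q4_K_M file (a minimal sketch: the repository name TheBloke/Mistral-7B-v0.1-GGUF and the file name mistral-7b-v0.1.Q4_K_M.gguf are assumptions, use whichever GGUF source you prefer):
# Hypothetical example: download a pre-quantized GGUF file from the Hugging Face Hub
pip install huggingface_hub
mkdir -p ./models/mistral/
huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q4_K_M.gguf --local-dir ./models/mistral/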
Using the weights of the original release (.pth fp16 => GGUF)
Follow the steps below only if you want to test the official release (fp16).
# Create a virtual env to store required dependencies
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Download the official release from HF (it will be converted to GGUF below)
pip install huggingface_hub
huggingface-cli login
mkdir ./models/mistral/
huggingface-cli download --local-dir ./models/mistral/ mistralai/Mistral-7B-v0.1
# Convert the pytorch model to GGUF format
./convert.py ./models/mistral/
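If you'd rather produce the quantized variants yourself from these weights, the quantize tool built alongside llama.cpp can do it (a sketch, assuming convert.py wrote ggml-model-f16.gguf into the model directory; adjust the file name to whatever it actually produced):
# Create Q4_K_M and Q8_0 versions from the fp16 GGUF file
# ggml-model-f16.gguf is the assumed output name of convert.py
./quantize ./models/mistral/ggml-model-f16.gguf ./models/mistral/mistral-7b-q4_k_m.gguf Q4_K_M
./quantize ./models/mistral/ggml-model-f16.gguf ./models/mistral/mistral-7b-q8_0.gguf Q8_0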
Run a test to check that inference actually works
./main -t 4 -ngl 33 -m ./models/mistral/<name-of-the-model.gguf> -c 1024 --color -n 128 -p "Hello"
Run benchmarks
# Quick single-thread CPU benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1 -ngl 0 --output csv > bench-results.csv
# Quick GPU benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1 -ngl 33 --output csv > bench-results.csv
# 5-10 min benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1,2,4 -ngl 0,10,20,33 --output csv > bench-results.csv
# Complete benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1,2,4,8 -ngl 0,10,20,33 --output csv > bench-results.csv
Notes:
- ngl is set to 33 to offload every layer to the GPU
- if you don't have a usable GPU, set -ngl 0
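To eyeball the results directly in the terminal, the CSV output can be pretty-printed with standard tools (nothing llama.cpp-specific here):
# Align the comma-separated columns for easier reading
column -s, -t < bench-results.csv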
Apple Silicon M1
Highlights
The first key result from this benchmark is that GPU offloading yields a very interesting tradeoff:
- Setting the thread count to 1
- Setting the number of GPU layers to 30
yields:
model | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|
Mistral Q4_K | 7.24 B | Metal | 30 | 1 | pp 128 | 112.31 ± 0.10 |
Mistral Q4_K | 7.24 B | Metal | 30 | 1 | tg 32 | 12.62 ± 0.02 |
So you'll get most of the performance out of an M1 MacBook by offloading everything to the integrated GPU, while leaving room for any other compute workload.
Note that this result is due to the nature of the workload involved when running LLM inference: it is mostly about memory bandwidth and less about raw compute performance.
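As a rough back-of-the-envelope check (assuming the base M1's ~68 GB/s memory bandwidth and a Q4_K_M file of roughly 4.4 GB), generating one token requires streaming essentially the whole weight file from memory, which puts a hard ceiling on throughput regardless of compute:
# Rough upper bound on generation speed: memory bandwidth / model size
# ~68 GB/s divided by ~4.4 GB read per token gives ~15 tokens/s,
# in the same ballpark as the ~12-13 t/s measured below
echo "scale=1; 68 / 4.4" | bc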
This configuration is interesting because:
- You'll draw less power (compared to running only on the CPU)
- You'll keep room for CPU-hungry tasks.
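To reproduce just this highlighted configuration, llama-bench can be pinned to a single thread count and layer count; the -p and -n values below match the pp 128 and tg 32 tests reported above:
# Benchmark only the 1-thread / 30-GPU-layer configuration
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1 -ngl 30 -p 128 -n 32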
4-bit quantized model (Q4_K_M)
Build ref: db3abcc (1354)
Prompt processing
model | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|
Mistral Q4_K | 7.24 B | Metal | 30 | 4 | pp 128 | 112.31 ± 0.10 |
Mistral Q4_K | 7.24 B | Metal | 10 | 4 | pp 128 | 112.23 ± 0.09 |
Mistral Q4_K | 7.24 B | Metal | 20 | 2 | pp 128 | 112.23 ± 0.05 |
Mistral Q4_K | 7.24 B | Metal | 30 | 2 | pp 128 | 112.21 ± 0.05 |
Mistral Q4_K | 7.24 B | Metal | 20 | 4 | pp 128 | 112.19 ± 0.04 |
Mistral Q4_K | 7.24 B | Metal | 10 | 2 | pp 128 | 112.15 ± 0.42 |
Mistral Q4_K | 7.24 B | Metal | 30 | 1 | pp 128 | 112.11 ± 0.06 |
Mistral Q4_K | 7.24 B | Metal | 20 | 1 | pp 128 | 111.97 ± 0.15 |
Mistral Q4_K | 7.24 B | Metal | 10 | 1 | pp 128 | 111.82 ± 0.40 |
Mistral Q4_K | 7.24 B | Metal | 30 | 1 | pp 1024 | 98.77 ± 0.72 |
Mistral Q4_K | 7.24 B | Metal | 30 | 4 | pp 1024 | 98.58 ± 0.23 |
Mistral Q4_K | 7.24 B | Metal | 30 | 2 | pp 1024 | 98.37 ± 0.49 |
Mistral Q4_K | 7.24 B | Metal | 10 | 2 | pp 1024 | 99.28 ± 0.56 |
Mistral Q4_K | 7.24 B | Metal | 10 | 4 | pp 1024 | 98.23 ± 0.99 |
Mistral Q4_K | 7.24 B | Metal | 20 | 2 | pp 1024 | 98.12 ± 0.08 |
Mistral Q4_K | 7.24 B | Metal | 20 | 1 | pp 1024 | 98.10 ± 0.53 |
Mistral Q4_K | 7.24 B | Metal | 20 | 4 | pp 1024 | 98.03 ± 0.36 |
Mistral Q4_K | 7.24 B | Metal | 10 | 1 | pp 1024 | 97.80 ± 0.53 |
Mistral Q4_K | 7.24 B | Metal | 0 | 1 | pp 1024 | 29.94 ± 0.15 |
Mistral Q4_K | 7.24 B | Metal | 0 | 2 | pp 1024 | 26.98 ± 0.31 |
Mistral Q4_K | 7.24 B | Metal | 0 | 4 | pp 1024 | 25.70 ± 0.23 |
Mistral Q4_K | 7.24 B | Metal | 0 | 1 | pp 128 | 17.69 ± 0.23 |
Mistral Q4_K | 7.24 B | Metal | 0 | 2 | pp 128 | 14.37 ± 0.45 |
Mistral Q4_K | 7.24 B | Metal | 0 | 4 | pp 128 | 12.77 ± 0.73 |
Text generation
model | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|
Mistral Q4_K | 7.24 B | Metal | 30 | 4 | tg 32 | 12.91 ± 0.02 |
Mistral Q4_K | 7.24 B | Metal | 10 | 4 | tg 32 | 12.76 ± 0.07 |
Mistral Q4_K | 7.24 B | Metal | 20 | 4 | tg 32 | 12.87 ± 0.01 |
Mistral Q4_K | 7.24 B | Metal | 20 | 2 | tg 32 | 12.83 ± 0.03 |
Mistral Q4_K | 7.24 B | Metal | 30 | 2 | tg 32 | 12.79 ± 0.01 |
Mistral Q4_K | 7.24 B | Metal | 20 | 1 | tg 32 | 12.67 ± 0.01 |
Mistral Q4_K | 7.24 B | Metal | 30 | 1 | tg 32 | 12.62 ± 0.02 |
Mistral Q4_K | 7.24 B | Metal | 10 | 1 | tg 32 | 12.46 ± 0.12 |
Mistral Q4_K | 7.24 B | Metal | 10 | 2 | tg 32 | 12.37 ± 0.15 |
Mistral Q4_K | 7.24 B | Metal | 0 | 4 | tg 32 | 11.51 ± 0.20 |
Mistral Q4_K | 7.24 B | Metal | 0 | 2 | tg 32 | 8.66 ± 0.12 |
Mistral Q4_K | 7.24 B | Metal | 0 | 1 | tg 32 | 4.71 ± 0.16 |
Apple Silicon M1 Max
8-bit quantized model (Q8_0)
Build ref: f5f9121 (1359)
Prompt processing
Text generation
4-bit quantized model (Q4_K_M)
Build ref: f5f9121 (1359)
Prompt processing
Text generation
Subscribe to get notified when new results are published.
Looking to suggest modifications or share your results?
You can reach me on Twitter / X