Performance benchmark of Mistral AI using llama.cpp
The llama.cpp library comes with a benchmarking tool, llama-bench.
I tested both the MacBook Pro M1 with 16 GB of unified memory and the Tesla V100S from OVHCloud (t2-le-45).
The post will be updated as more tests are done.
Procedure to run inference benchmark with llama.cpp
This guide covers only macOS and Linux.
Clone and compile llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git llama-cpp-for-benchmark
cd llama-cpp-for-benchmark
If you have an NVIDIA GPU and CUDA is already installed:
# I'm assuming you already have CUDA properly installed and working
make LLAMA_CUBLAS=1
If you have an Apple Silicon machine:
make
If you have an OpenCL-compatible GPU and are running Linux:
# install CLBlast - see https://github.com/ggerganov/llama.cpp#installing-clblast
sudo apt-get install libclblast-dev libclblast1
make LLAMA_CLBLAST=1
If you don't have any usable GPU (older models without support):
make
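Whichever build variant you used, the two binaries used in the rest of this guide (main and llama-bench) should now sit at the root of the repository; a quick sanity check:
# Check that the inference example and the benchmarking tool were built
ls -l ./main ./llama-bench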
Download the model
- If you have less than 8 GB of RAM, running the benchmark is not recommended
- If you have between 8 GB and 16 GB of RAM, download the Q4_K_M quantized version
- If you have at least 16 GB of RAM, download the Q8_0 quantized version
- If you have at least 24 GB of RAM, you can download the official release (fp16)
Store the downloaded model in ./llama-cpp-for-benchmark/models/mistral/
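For example, to grab the Q4_K_M file (a minimal sketch: the repository name TheBloke/Mistral-7B-v0.1-GGUF and the file name mistral-7b-v0.1.Q4_K_M.gguf are assumptions, use whichever GGUF source you prefer):
# Hypothetical example: download a pre-quantized GGUF file from the Hugging Face Hub
pip install huggingface_hub
mkdir -p ./models/mistral/
huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q4_K_M.gguf --local-dir ./models/mistral/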
Using the weights of the original release (.pth fp16 => GGUF)
Follow the steps below only if you want to test the official release (fp16).
# Create a virtual env to store required dependencies
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Download the official release from HF (it will be converted to GGUF below)
pip install huggingface_hub
huggingface-cli login
mkdir ./models/mistral/
huggingface-cli download --local-dir ./models/mistral/ mistralai/Mistral-7B-v0.1
# Convert the pytorch model to GGUF format
./convert.py ./models/mistral/
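If you'd rather produce the quantized variants yourself from these weights, the quantize tool built alongside llama.cpp can do it (a sketch, assuming convert.py wrote ggml-model-f16.gguf into the model directory; adjust the file name to whatever it actually produced):
# Create Q4_K_M and Q8_0 versions from the fp16 GGUF file
# ggml-model-f16.gguf is the assumed output name of convert.py
./quantize ./models/mistral/ggml-model-f16.gguf ./models/mistral/mistral-7b-q4_k_m.gguf Q4_K_M
./quantize ./models/mistral/ggml-model-f16.gguf ./models/mistral/mistral-7b-q8_0.gguf Q8_0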
Run a test to check that inference actually works
./main -t 4 -ngl 33 -m ./models/mistral/<name-of-the-model.gguf> -c 1024 --color -n 128 -p "Hello"
Run benchmarks
# Quick single-thread CPU benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1 -ngl 0 --output csv > bench-results.csv
# Quick GPU benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1 -ngl 33 --output csv > bench-results.csv
# 5-10 min benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1,2,4 -ngl 0,10,20,33 --output csv > bench-results.csv
# Complete benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1,2,4,8 -ngl 0,10,20,33 --output csv > bench-results.csv
Notes:
- ngl is set to 33 to offload every layer to the GPU
- if you don't have a usable GPU, set -ngl 0
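To eyeball the results directly in the terminal, the CSV output can be pretty-printed with standard tools (nothing llama.cpp-specific here):
# Align the comma-separated columns for easier reading
column -s, -t < bench-results.csv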
Apple Silicon M1
Highlights
The first key result from this benchmark is that GPU offloading yields a very interesting tradeoff:
- Setting the thread count to 1
- Setting the number of GPU layers to 30
yields:
model | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|
Mistral Q4_K | 7.24 B | Metal | 30 | 1 | pp 128 | 112.31 ± 0.10 |
Mistral Q4_K | 7.24 B | Metal | 30 | 1 | tg 32 | 12.62 ± 0.02 |
So you'll get most of the performance out of an M1 MacBook by offloading everything to the integrated GPU, while leaving room for any other compute workload.
Note that this result is due to the nature of the workload involved when running LLM inference: it is mostly about memory bandwidth and less about raw compute performance.
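As a rough back-of-the-envelope check (assuming the base M1's ~68 GB/s memory bandwidth and a Q4_K_M file of roughly 4.4 GB), generating one token requires streaming essentially the whole weight file from memory, which puts a hard ceiling on throughput regardless of compute:
# Rough upper bound on generation speed: memory bandwidth / model size
# ~68 GB/s divided by ~4.4 GB read per token gives ~15 tokens/s,
# in the same ballpark as the ~12-13 t/s measured below
echo "scale=1; 68 / 4.4" | bc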
This configuration is interesting because:
- You'll draw less power (compared to running only on the CPU)
- You'll keep room for CPU-hungry tasks.
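To reproduce just this highlighted configuration, llama-bench can be pinned to a single thread count and layer count; the -p and -n values below match the pp 128 and tg 32 tests reported above:
# Benchmark only the 1-thread / 30-GPU-layer configuration
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1 -ngl 30 -p 128 -n 32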
4-bit quantized model (Q4_K_M)
Build ref: db3abcc (1354)
Prompt processing
model | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|
Mistral Q4_K | 7.24 B | Metal | 30 | 4 | pp 128 | 112.31 ± 0.10 |
Mistral Q4_K | 7.24 B | Metal | 10 | 4 | pp 128 | 112.23 ± 0.09 |
Mistral Q4_K | 7.24 B | Metal | 20 | 2 | pp 128 | 112.23 ± 0.05 |
Mistral Q4_K | 7.24 B | Metal | 30 | 2 | pp 128 | 112.21 ± 0.05 |
Mistral Q4_K | 7.24 B | Metal | 20 | 4 | pp 128 | 112.19 ± 0.04 |
Mistral Q4_K | 7.24 B | Metal | 10 | 2 | pp 128 | 112.15 ± 0.42 |
Mistral Q4_K | 7.24 B | Metal | 30 | 1 | pp 128 | 112.11 ± 0.06 |
Mistral Q4_K | 7.24 B | Metal | 20 | 1 | pp 128 | 111.97 ± 0.15 |
Mistral Q4_K | 7.24 B | Metal | 10 | 1 | pp 128 | 111.82 ± 0.40 |
Mistral Q4_K | 7.24 B | Metal | 30 | 1 | pp 1024 | 98.77 ± 0.72 |
Mistral Q4_K | 7.24 B | Metal | 30 | 4 | pp 1024 | 98.58 ± 0.23 |
Mistral Q4_K | 7.24 B | Metal | 30 | 2 | pp 1024 | 98.37 ± 0.49 |
Mistral Q4_K | 7.24 B | Metal | 10 | 2 | pp 1024 | 99.28 ± 0.56 |
Mistral Q4_K | 7.24 B | Metal | 10 | 4 | pp 1024 | 98.23 ± 0.99 |
Mistral Q4_K | 7.24 B | Metal | 20 | 2 | pp 1024 | 98.12 ± 0.08 |
Mistral Q4_K | 7.24 B | Metal | 20 | 1 | pp 1024 | 98.10 ± 0.53 |
Mistral Q4_K | 7.24 B | Metal | 20 | 4 | pp 1024 | 98.03 ± 0.36 |
Mistral Q4_K | 7.24 B | Metal | 10 | 1 | pp 1024 | 97.80 ± 0.53 |
Mistral Q4_K | 7.24 B | Metal | 0 | 1 | pp 1024 | 29.94 ± 0.15 |
Mistral Q4_K | 7.24 B | Metal | 0 | 2 | pp 1024 | 26.98 ± 0.31 |
Mistral Q4_K | 7.24 B | Metal | 0 | 4 | pp 1024 | 25.70 ± 0.23 |
Mistral Q4_K | 7.24 B | Metal | 0 | 1 | pp 128 | 17.69 ± 0.23 |
Mistral Q4_K | 7.24 B | Metal | 0 | 2 | pp 128 | 14.37 ± 0.45 |
Mistral Q4_K | 7.24 B | Metal | 0 | 4 | pp 128 | 12.77 ± 0.73 |
Text generation
model | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|
Mistral Q4_K | 7.24 B | Metal | 30 | 4 | tg 32 | 12.91 ± 0.02 |
Mistral Q4_K | 7.24 B | Metal | 10 | 4 | tg 32 | 12.76 ± 0.07 |
Mistral Q4_K | 7.24 B | Metal | 20 | 4 | tg 32 | 12.87 ± 0.01 |
Mistral Q4_K | 7.24 B | Metal | 20 | 2 | tg 32 | 12.83 ± 0.03 |
Mistral Q4_K | 7.24 B | Metal | 30 | 2 | tg 32 | 12.79 ± 0.01 |
Mistral Q4_K | 7.24 B | Metal | 20 | 1 | tg 32 | 12.67 ± 0.01 |
Mistral Q4_K | 7.24 B | Metal | 30 | 1 | tg 32 | 12.62 ± 0.02 |
Mistral Q4_K | 7.24 B | Metal | 10 | 1 | tg 32 | 12.46 ± 0.12 |
Mistral Q4_K | 7.24 B | Metal | 10 | 2 | tg 32 | 12.37 ± 0.15 |
Mistral Q4_K | 7.24 B | Metal | 0 | 4 | tg 32 | 11.51 ± 0.20 |
Mistral Q4_K | 7.24 B | Metal | 0 | 2 | tg 32 | 8.66 ± 0.12 |
Mistral Q4_K | 7.24 B | Metal | 0 | 1 | tg 32 | 4.71 ± 0.16 |
Apple Silicon M1 Max
8-bit quantized model (Q8_0)
Build ref: f5f9121 (1359)
Prompt processing
Text generation
4-bit quantized model (Q4_K_M)
Build ref: f5f9121 (1359)
Prompt processing
Text generation
Subscribe to get notified when new results are published.
Looking to suggest modifications or share your results?
You can reach me on Twitter / X