
Performance benchmark of Mistral AI using llama.cpp

The llama.cpp library comes with a benchmarking tool.
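
Once llama.cpp is compiled (see the steps below), the benchmarking tool is available as the llama-bench binary. In the result tables further down, "pp N" denotes prompt processing of N tokens and "tg N" denotes text generation of N tokens. The available options can be listed with:

shell
./llama-bench --help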

I tested both the MacBook Pro M1 with 16 GB of unified memory and the Tesla V100S from OVHCloud (t2-le-45).

The post will be updated as more tests are done.

Procedure to run inference benchmark with llama.cpp

This guide covers only macOS and Linux.

Clone and compile llama.cpp

shell
git clone https://github.com/ggerganov/llama.cpp.git llama-cpp-for-benchmark
cd llama-cpp-for-benchmark
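
The results further down reference specific build refs (for example db3abcc and f5f9121). If you want to benchmark the exact same revision rather than the current master, you can optionally check it out first (a sketch; skip this step to test the latest code):

shell
# optional: pin the checkout to the build ref used in the published M1 results
git checkout db3abcc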

If you have an Nvidia GPU and CUDA is already installed:

shell
# I'm assuming you already have CUDA properly installed and working
make LLAMA_CUBLAS=1
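
Before building, it can be worth confirming that the CUDA toolkit and the driver are both visible (a quick sanity check, not part of the original procedure):

shell
nvcc --version   # CUDA compiler toolchain
nvidia-smi       # driver version and visible GPUs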

If you have an Apple Silicon machine:

shell
make
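
On Apple Silicon, recent builds enable the Metal backend by default. If you want to double-check that you are building natively rather than under Rosetta (an optional check on my part):

shell
uname -m   # should print arm64 on Apple Silicon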

If you have an OpenCL-compatible GPU and are running Linux:

shell
# install CLBlast - see https://github.com/ggerganov/llama.cpp#installing-clblast
sudo apt-get install libclblast-dev libclblast1
make LLAMA_CLBLAST=1
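
To check which OpenCL platforms and devices are actually visible before benchmarking, the clinfo utility can help (an optional check; the package name below assumes Debian/Ubuntu):

shell
sudo apt-get install clinfo
clinfo | grep -i "device name"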

If you don't have any usable GPU (older models without support):

shell
make

Download the model

Store the downloaded version in ./llama-cpp-for-benchmark/models/mistral/

Using the weights of the original release (.pth fp16 => GGUF)

Follow the steps below only if you want to test the official release (fp16).

shell
# Create a virtual env to store required dependencies
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Download the official release from HF (the conversion to GGUF happens below)
pip install huggingface_hub
huggingface-cli login
mkdir ./models/mistral/
huggingface-cli download --local-dir ./models/mistral/ mistralai/Mistral-7B-v0.1

# Convert the pytorch model to GGUF format
./convert.py ./models/mistral/
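
The benchmarks below use Q4_K_M and Q8_0 quantizations, while the conversion above produces an fp16 GGUF. A quantized file can be produced with the quantize tool built alongside the other binaries (a sketch; the exact fp16 filename written by convert.py may differ on your machine):

shell
# quantize the fp16 GGUF to the 4-bit Q4_K_M variant used in the benchmarks
./quantize ./models/mistral/ggml-model-f16.gguf ./models/mistral/mistral-7b-q4_k_m.gguf Q4_K_M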

Run a quick test to check that inference actually works

shell
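# -t: CPU threads, -ngl: layers offloaded to the GPU, -c: context size, -n: tokens to generate, -p: prompt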
./main -t 4 -ngl 33 -m ./models/mistral/<name-of-the-model.gguf> -c 1024 --color -n 128 -p "Hello"

Run benchmarks

shell
# Quick single-thread CPU benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1 -ngl 0 --output csv > bench-results.csv

# Quick GPU benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1 -ngl 33 --output csv > bench-results.csv

# 5-10 min benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1,2,4 -ngl 0,10,20,33 --output csv > bench-results.csv

# Complete benchmark
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1,2,4,8 -ngl 0,10,20,33 --output csv > bench-results.csv

Notes:

  • ngl is set to 33 to offload every layer to the GPU
  • if you don't have a usable GPU, set -ngl 0
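
To get a quick look at the results in the terminal, the CSV output can be pretty-printed with standard Unix tools (a minimal sketch; adjust the filename if you changed it):

shell
column -s, -t < bench-results.csv | less -S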

Apple Silicon M1

Highlights

The first key result to take from this benchmark is that GPU offloading yields a very interesting trade-off:

  • Setting the thread count to 1
  • Setting the number of GPU layers (ngl) to 30

yields

| model | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| Mistral Q4_K | 7.24 B | Metal | 30 | 1 | pp 128 | 112.31 ± 0.10 |
| Mistral Q4_K | 7.24 B | Metal | 30 | 1 | tg 32 | 12.62 ± 0.02 |

So you'll get most of the performance of an M1 MacBook by offloading the model layers to the integrated GPU, leaving the CPU free for other compute workloads.

Note that this result comes from the nature of LLM inference as a workload: it is dominated by memory bandwidth rather than by raw compute performance.
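
As a rough sanity check (the figures here are approximations on my part, not measurements from this benchmark): the base M1 has about 68 GB/s of unified memory bandwidth, the Q4_K_M weights are roughly 4.4 GB, and each generated token has to read essentially all of the weights once, so text generation is bounded by roughly 68 / 4.4 ≈ 15 tokens per second. The ~12.6-12.9 t/s measured below is in that ballpark.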

This configuration is interesting because:

  • You'll draw less power (compared to running only on the CPU)
  • You'll keep room for CPU-hungry tasks (a command to reproduce this run is sketched just below).
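
To reproduce this highlighted configuration on your own machine (a sketch reusing the placeholder model path from above; the output filename is arbitrary):

shell
./llama-bench -m ./models/mistral/<name-of-the-model.gguf> -t 1 -ngl 30 --output csv > bench-highlight.csv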

4-bit quantized model (Q4_K_M)

Build ref: db3abcc (1354)

Prompt processing

| model | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| Mistral Q4_K | 7.24 B | Metal | 30 | 4 | pp 128 | 112.31 ± 0.10 |
| Mistral Q4_K | 7.24 B | Metal | 10 | 4 | pp 128 | 112.23 ± 0.09 |
| Mistral Q4_K | 7.24 B | Metal | 20 | 2 | pp 128 | 112.23 ± 0.05 |
| Mistral Q4_K | 7.24 B | Metal | 30 | 2 | pp 128 | 112.21 ± 0.05 |
| Mistral Q4_K | 7.24 B | Metal | 20 | 4 | pp 128 | 112.19 ± 0.04 |
| Mistral Q4_K | 7.24 B | Metal | 10 | 2 | pp 128 | 112.15 ± 0.42 |
| Mistral Q4_K | 7.24 B | Metal | 30 | 1 | pp 128 | 112.11 ± 0.06 |
| Mistral Q4_K | 7.24 B | Metal | 20 | 1 | pp 128 | 111.97 ± 0.15 |
| Mistral Q4_K | 7.24 B | Metal | 10 | 1 | pp 128 | 111.82 ± 0.40 |
| Mistral Q4_K | 7.24 B | Metal | 30 | 1 | pp 1024 | 98.77 ± 0.72 |
| Mistral Q4_K | 7.24 B | Metal | 30 | 4 | pp 1024 | 98.58 ± 0.23 |
| Mistral Q4_K | 7.24 B | Metal | 30 | 2 | pp 1024 | 98.37 ± 0.49 |
| Mistral Q4_K | 7.24 B | Metal | 10 | 2 | pp 1024 | 99.28 ± 0.56 |
| Mistral Q4_K | 7.24 B | Metal | 10 | 4 | pp 1024 | 98.23 ± 0.99 |
| Mistral Q4_K | 7.24 B | Metal | 20 | 2 | pp 1024 | 98.12 ± 0.08 |
| Mistral Q4_K | 7.24 B | Metal | 20 | 1 | pp 1024 | 98.10 ± 0.53 |
| Mistral Q4_K | 7.24 B | Metal | 20 | 4 | pp 1024 | 98.03 ± 0.36 |
| Mistral Q4_K | 7.24 B | Metal | 10 | 1 | pp 1024 | 97.80 ± 0.53 |
| Mistral Q4_K | 7.24 B | Metal | 0 | 1 | pp 1024 | 29.94 ± 0.15 |
| Mistral Q4_K | 7.24 B | Metal | 0 | 2 | pp 1024 | 26.98 ± 0.31 |
| Mistral Q4_K | 7.24 B | Metal | 0 | 4 | pp 1024 | 25.70 ± 0.23 |
| Mistral Q4_K | 7.24 B | Metal | 0 | 1 | pp 128 | 17.69 ± 0.23 |
| Mistral Q4_K | 7.24 B | Metal | 0 | 2 | pp 128 | 14.37 ± 0.45 |
| Mistral Q4_K | 7.24 B | Metal | 0 | 4 | pp 128 | 12.77 ± 0.73 |

Text generation

| model | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| Mistral Q4_K | 7.24 B | Metal | 30 | 4 | tg 32 | 12.91 ± 0.02 |
| Mistral Q4_K | 7.24 B | Metal | 10 | 4 | tg 32 | 12.76 ± 0.07 |
| Mistral Q4_K | 7.24 B | Metal | 20 | 4 | tg 32 | 12.87 ± 0.01 |
| Mistral Q4_K | 7.24 B | Metal | 20 | 2 | tg 32 | 12.83 ± 0.03 |
| Mistral Q4_K | 7.24 B | Metal | 30 | 2 | tg 32 | 12.79 ± 0.01 |
| Mistral Q4_K | 7.24 B | Metal | 20 | 1 | tg 32 | 12.67 ± 0.01 |
| Mistral Q4_K | 7.24 B | Metal | 30 | 1 | tg 32 | 12.62 ± 0.02 |
| Mistral Q4_K | 7.24 B | Metal | 10 | 1 | tg 32 | 12.46 ± 0.12 |
| Mistral Q4_K | 7.24 B | Metal | 10 | 2 | tg 32 | 12.37 ± 0.15 |
| Mistral Q4_K | 7.24 B | Metal | 0 | 4 | tg 32 | 11.51 ± 0.20 |
| Mistral Q4_K | 7.24 B | Metal | 0 | 2 | tg 32 | 8.66 ± 0.12 |
| Mistral Q4_K | 7.24 B | Metal | 0 | 1 | tg 32 | 4.71 ± 0.16 |

Apple Silicon M1 Max

8-bit quantized model (Q8_0)

Build ref: f5f9121 (1359)

Prompt processing

Text generation

4-bit quantized model (Q4_K_M)

Build ref: f5f9121 (1359)

Prompt processing

Text generation


Subscribe to get notified when new results are published.

Looking to suggest modifications or share your results?

You can reach me on Twitter / X
