Running inference on Mistral AI's first released model, Mistral 7B, with llama.cpp
Mistral AI released their first large language model, Mistral 7B v0.1, by sharing a magnet link on Twitter on September 27th, 2023.
You can download the model weights with a BitTorrent client such as Transmission, or get them directly from Hugging Face:
- Recommended for consumer hardware: Quantized versions from The Bloke (see the download sketch after this list)
- If you have a dedicated GPU: Official Mistral AI repository
- Magnet link (useful to reduce bandwidth on the official Mistral AI repository):
magnet:?xt=urn:btih:208b101a0f51514ecf285885a8b0f6fb1a1e4d7d&dn=mistral-7B-v0.1&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=https%3A%2F%2Ftracker1.520.jp%3A443%2Fannounce
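For the first option (quantized GGUF files from TheBloke), one way to fetch a file is with the huggingface_hub package. This is just a sketch: the repository id and filename below follow TheBloke's usual naming scheme and should be checked against the repository page for the exact quantization you want.
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Repo id and filename are assumptions based on TheBloke's naming convention;
# pick the quantization you need (Q4_K_M, Q8_0, ...) from the repository page.
local_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-v0.1-GGUF",
    filename="mistral-7b-v0.1.Q4_K_M.gguf",
)
print(local_path)  # path of the downloaded file in the local Hugging Face cache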
Our goal: run Mistral AI model inference on a laptop
In this article, we focus on running the Mistral AI model on consumer hardware (I'll use a MacBook Pro M1 with 16 GB of RAM). You'll need at least 8 GB of RAM - below that, your computer will swap and may prematurely wear your SSD.
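To see why 8 GB is a reasonable floor, here is a rough back-of-envelope estimate; the ~4.5 bits per weight figure for a 4-bit quantization (weights plus scales) is an approximation, not an exact number.
# Rough memory estimate for a 4-bit quantized 7B model (approximation only)
n_params = 7.24e9          # Mistral 7B has roughly 7.24 billion parameters
bits_per_weight = 4.5      # ~4 bits per weight plus quantization scales/metadata
weights_gb = n_params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB for the weights alone")  # ~4.1 GB
# Add the KV cache, the OS and your other applications, and 8 GB becomes a comfortable minimum.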
For cloud deployments (e.g. with vLLM), you can find the reference documentation here: https://docs.mistral.ai
Quick note on LLaMA.cpp
The llama.cpp library is used to run inference on LLaMA models in plain C/C++ without dependencies.
The main benefit of this library is the ability to run inference on consumer hardware while remaining quite fast (the code is well optimized for Apple Silicon machines).
It has support for various backends such as CUDA, Metal and OpenCL.
The code is available on GitHub.
Converting Mistral AI consolidated.00.pth or pytorch_model-00001-of-00002.bin files to GGUF
This section covers the conversion and quantization process. It is not needed when you download the weights from The Bloke's repository.
When you download the weights from the torrent link or from the official repository, you'll end up with .pth or .bin files. These are weights intended to be used with the PyTorch library. We will convert them to the GGUF format required by the llama.cpp library.
➜ mistral-7B-v0.1 ls -lh
total 64604824
-rw-r--r--@ 1 pas staff 10K 27 sep 09:27 RELEASE
-rw-r--r-- 1 pas staff 13G 27 sep 09:35 consolidated.00.pth
-rw-r--r-- 1 pas staff 202B 27 sep 09:27 params.json
-rw-r--r-- 1 pas staff 482K 27 sep 09:27 tokenizer.model
To perform the conversion and then the quantization, we'll follow the procedure from llama.cpp, which I reproduce here with the Mistral AI model:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Then install the dependencies for the Python scripts used to convert the PyTorch weights to GGUF:
# install Python env + dependencies
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -r requirements.txt
# Convert to GGUF FP16 format
python3 convert.py /path/to/mistral-7B-v0.1/
# Then we quantize the model weights to 4 bits (using the q4_0 method).
# This step is useful on low-memory machines: it degrades the accuracy of the
# weights so that everything can fit into RAM.
./quantize /path/to/mistral-7B-v0.1/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0
Inference with llama.cpp and the quantized Mistral AI model
You can use the main executable to run inference.
./main -m /path/to/mistral-7B-v0.1/ggml-model-q4_0.gguf --color \
-c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i --n-predict -2 \
-p "This is what Elon Musk told Mark Zuckerberg:"
This is what Elon Musk told Mark Zuckerberg: “I think the way forward
is to be much more open about what you’re doing, and then let people
evaluate that.”
Elon Musk has a message for Mark Zuckerberg: be more open.
Musk made his comments during an interview with CBS News host
John Dickerson on Sunday night’s episode of “60 Minutes” in which
the Tesla CEO discussed how Facebook should handle its role as social
media companies influence politics worldwide.
Musk has been critical of Zuckerberg’s handling of Facebook’s fake news
problem and suggested that the company’s leadership be more transparent
with its users about how it handles content on the site.
“I think the way forward is to be much more open about what you’re
doing, and then let people evaluate that,” Musk said. “You know,
there’s not really any harm in being more open, but I do think
there needs to...
The llama.cpp library ships with a web server and a ton of features; take a look at the README and the examples folder in the GitHub repository.
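For instance, the bundled server example exposes an HTTP completion endpoint. Here is a minimal client sketch, assuming the server was started with the model loaded and listens on the default 127.0.0.1:8080 with the /completion route documented in the server README (field names may differ between versions).
# Minimal client for the llama.cpp server example (defaults assumed: 127.0.0.1:8080)
import json
import urllib.request

payload = {
    "prompt": "Write a haiku about quantization:",
    "n_predict": 64,      # number of tokens to generate
    "temperature": 0.7,
}
request = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())

print(result["content"])  # the generated text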
Building LLM applications with Mistral AI, llama-cpp-python and grammar constraints
You can use several libraries on top of llama.cpp to build your applications. Since I work in Python, I'll use the Python bindings library llama-cpp-python.
I wrote a high-level library on top of llama-cpp-python to perform tasks such as parsing, summarization, argumentation analysis, etc. It can also be used with the OpenAI API with the same function signatures.
I'll show an example of how to use PyLLMCore to interact with the model locally and produce structured output. To understand how it is done, you can read this introduction.
The PyLLMCore library leverages the built-in grammar capabilities of llama.cpp and implements a mechanism to convert data classes to a JSON schema.
We can invoke the following:
from dataclasses import dataclass
from llm_core.parsers import LLaMACPPParser
@dataclass
class Book:
    title: str
    summary: str
    author: str
    published_year: int
text = """Foundation is a science fiction novel by American writer
Isaac Asimov. It is the first published in his Foundation Trilogy (later
expanded into the Foundation series). Foundation is a cycle of five
interrelated short stories, first published as a single book by Gnome Press
in 1951. Collectively they tell the early story of the Foundation,
an institute founded by psychohistorian Hari Seldon to preserve the best
of galactic civilization after the collapse of the Galactic Empire.
"""
# Be sure to have mistral-7b-instruct-v0.1.Q4_K_M.gguf in ~/.cache/py-llm-core/models/
# You can download from https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
model = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
with LLaMACPPParser(Book, model=model) as parser:
    book = parser.parse(text)
print(book)
Book(
title='Foundation',
summary="""Foundation is a science fiction novel by American writer
Isaac Asimov. It is the first published in his Foundation Trilogy
(later expanded into the Foundation series). Foundation is a
cycle of five interrelated short stories, first published as a
single book by Gnome Press in 1951. Collectively they tell the
early story of the Foundation, an institute founded by
psychohistorian Hari Seldon to preserve the best of galactic
civilization after the collapse of the Galactic Empire.""",
author='Isaac Asimov',
published_year=1951
)
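To give an idea of what the grammar constraint is built from, here is roughly the kind of JSON schema a dataclass like Book maps to (a hand-written illustration, not the exact output of PyLLMCore):
# Illustrative only: an approximation of the schema derived from the Book dataclass
book_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "summary": {"type": "string"},
        "author": {"type": "string"},
        "published_year": {"type": "integer"},
    },
    "required": ["title", "summary", "author", "published_year"],
}
# llama.cpp can turn such a schema into a GBNF grammar that constrains decoding,
# which is what guarantees well-formed, parseable JSON output.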
Here is another example from the PyLLMCore documentation:
from typing import List
from dataclasses import dataclass
# LLaMACPPAssistant is needed to instantiate Mistral Instruct
from llm_core.assistants import LLaMACPPAssistant
# Make sure that ~/.cache/py-llm-core/models contains the following file
# You can download from https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
model = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
@dataclass
class RecipeStep:
    step_title: str
    step_instructions: str
@dataclass
class Recipe:
    system_prompt = "You are a world-class chef"
    prompt = "Write a detailed step-by-step recipe to make {dish}"

    title: str
    steps: List[RecipeStep]
    ingredients: List[str]
class Chef:
    def generate_recipe(self, dish):
        with LLaMACPPAssistant(Recipe, model=model) as assistant:
            recipe = assistant.process(dish=dish)
            return recipe
chef = Chef()
recipe = chef.generate_recipe("Boeuf bourguignon")
print(recipe)
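The result is a plain Recipe dataclass instance, so you can use it like any other Python object:
# recipe is a regular dataclass instance: its fields are directly accessible
print(recipe.title)
for ingredient in recipe.ingredients:
    print(f"- {ingredient}")
for step in recipe.steps:
    print(f"{step.step_title}: {step.step_instructions}")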
Inference with the Python library llama-cpp-python
Under the hood, PyLLMCore uses llama-cpp-python; here is an example of how to generate completions:
from llama_cpp import Llama
llm = Llama(
    'mistral-7b-instruct-v0.1.Q4_K_M.gguf',
    verbose=False,
    n_ctx=4000,
    n_threads=4
)
llm.create_completion('Write a step-by-step recipe to make a pizza')
{
"id": "cmpl-a06474ea-d8b0-4ce4-a6e2-5c8b07e04d55",
"object": "text_completion",
"created": 1696770275,
"model": "/Users/pas/.cache/py-llm-core/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
"choices": [
{
"text": " crust:\n\nIngredients:\n- 3 cups all-purpose flour\n- 1 package active dry yeast\n- 2 tablespoons olive oil\n- 2 teaspoons salt\n- 1 cup warm water\n\nInstructions:\n\n1. Start by combining the dry ingredients in a large mixing bowl. Add in the flour, yeast and salt. Mix everything together until it is well combined.\n\n2. Next, add in the olive oil and warm water to the mixture. Use a wooden spoon to stir everything together until it forms a soft dough. \n",
"index": 0,
"logprobs": None,
"finish_reason": "length",
}
],
"usage": {
"prompt_tokens": 13,
"completion_tokens": 128,
"total_tokens": 141,
},
}
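llama-cpp-python also exposes a chat-style API. A minimal sketch, assuming your version provides create_chat_completion; the chat_format argument is available in recent releases and is an assumption here (otherwise you have to apply the [INST] ... [/INST] template yourself):
from llama_cpp import Llama

# chat_format="mistral-instruct" is an assumption: check that your llama-cpp-python
# version registers this chat format, or apply the instruct template manually.
llm = Llama(
    'mistral-7b-instruct-v0.1.Q4_K_M.gguf',
    n_ctx=4000,
    verbose=False,
    chat_format="mistral-instruct",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Write a step-by-step recipe to make a pizza"},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])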
How fast is the Mistral AI Instruct model?
I ran some basic benchmarks with the help of other contributors, and found the model to be quite fast:
| Machine | Tokens/sec | Cost / h | Cost / 1k tok | Comments |
|---|---|---|---|---|
| OVHCloud T2-45 | 93.22 | 1.650 € | 0.005 € | Tesla V100 - Q4_K_M |
| OVHCloud T2-45 | 70.5 | 1.650 € | 0.007 € | Tesla V100 - Q8_0 |
| Apple M1 Max | 49.1 | --- € | --- € | Q4_K_M - GPU |
| RTX 3070 Ti | 40.4 | --- € | --- € | Q4_K_M - GPU |
| Apple M1 Max | 32.5 | --- € | --- € | Q8_0 - GPU |
| MacBook Pro M1 | 13.3 | 0.110 € | 0.002 € | Q4_K_M - GPU |
| MacBook Pro M1 | 12.38 | 0.110 € | 0.002 € | Q4_K_M CPU-only |
| OVHCloud C2-60 | 10.34 | 0.749 € | 0.020 € | Q4_K_M CPU - 16 c. |
| OVHCloud D2-8 | 3.54 | 0.036 € | 0.003 € | Q4_K_M CPU - 4 c. |
| OVHCloud C2-7 | 2.85 | 0.098 € | 0.010 € | Q4_K_M CPU - 2 c. |
| Intel NUC N3050 | 0.15 | 0.040 € | 0.074 € | Q4_K_M OpenCL |
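The cost per 1k tokens column follows directly from the hourly price and the measured throughput:
# Cost per 1k tokens = (hourly price / 3600 seconds) / tokens per second * 1000
def cost_per_1k_tokens(cost_per_hour_eur, tokens_per_sec):
    return cost_per_hour_eur / 3600 / tokens_per_sec * 1000

print(round(cost_per_1k_tokens(1.650, 93.22), 3))  # OVHCloud T2-45, Q4_K_M -> 0.005
print(round(cost_per_1k_tokens(0.749, 10.34), 3))  # OVHCloud C2-60, 16 cores -> 0.02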
Next steps
You might be interested in reading Mistral AI Instruct v0.1 model evaluation in real world use cases.