
Running inference with Mistral AI's first released model, Mistral 7B, using llama.cpp

Mistral AI released their first large language model, Mistral 7B v0.1, by sharing a magnet link on Twitter on September 27th, 2023.

You can download the model weights using a BitTorrent client like Transmission, or get them directly from Hugging Face.

magnet:?xt=urn:btih:208b101a0f51514ecf285885a8b0f6fb1a1e4d7d&dn=mistral-7B-v0.1&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=https%3A%2F%2Ftracker1.520.jp%3A443%2Fannounce
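
If you prefer Hugging Face over BitTorrent, the huggingface_hub library can fetch the whole repository. A minimal sketch (the repo id mistralai/Mistral-7B-v0.1 is my assumption, double-check it on the hub):

python
# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Download the weights and tokenizer into the local Hugging Face cache
# and return the path of the local directory.
local_dir = snapshot_download(repo_id="mistralai/Mistral-7B-v0.1")
print(local_dir)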

Our goal: run Mistral AI model inference on a laptop

In this article, we focus on running the Mistral AI model on consumer hardware (I'll use a MacBook Pro M1 with 16 GB of RAM). You'll need at least 8 GB of RAM - below that your computer will swap and may prematurely wear your SSD.

For cloud deployments (for example with vLLM), you can find the reference documentation here: https://docs.mistral.ai

Quick note on llama.cpp

The llama.cpp library is used to run inference on LLaMA models in plain C/C++ without dependencies.

The main benefit of this library is the ability to run inference on consumer hardware while remaining quite fast (the code is well optimized for Apple Silicon machines).

It has support for various backends such as CUDA, Metal and OpenCL.

The code is available on GitHub.

Converting Mistral AI consolidated.00.pth or pytorch_model-00001-of-00002.bin files to GGUF

This section covers the conversion and quantization process. It is not needed when you download pre-quantized weights from TheBloke's repository on Hugging Face.

When you download the weights from the torrent link or from the official repository, you'll end up with .pth or .bin files. These weights are intended to be used with the PyTorch library. We will convert them to the GGUF format expected by the llama.cpp library.

shell
  mistral-7B-v0.1 ls -lh

total 64604824
-rw-r--r--@ 1 pas  staff    10K 27 sep 09:27 RELEASE
-rw-r--r--  1 pas  staff    13G 27 sep 09:35 consolidated.00.pth
-rw-r--r--  1 pas  staff   202B 27 sep 09:27 params.json
-rw-r--r--  1 pas  staff   482K 27 sep 09:27 tokenizer.model

To perform the conversion and then the quantization, we'll follow the procedure from llama.cpp, which I reproduce here with the Mistral AI model:

shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Then install the dependencies for the Python scripts that convert the PyTorch weights to GGUF:

shell
# install Python env + dependencies
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -r requirements.txt

# Convert to GGUF FP16 format
python3 convert.py /path/to/mistral-7B-v0.1/

# Quantize the model weights to 4 bits (using the q4_0 method).
# This step is useful on low-memory machines: it slightly degrades the
# accuracy of the weights so that everything fits into RAM.
./quantize /path/to/mistral-7B-v0.1/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0
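
To see why 4-bit quantization matters on a 16 GB laptop, here is a rough back-of-the-envelope estimate (a sketch only; the exact figures depend on the quantization scheme and on llama.cpp's runtime overhead):

python
# Approximate weight sizes for a ~7.2B parameter model
n_params = 7.24e9

# FP16: 2 bytes per weight
fp16_gb = n_params * 2 / 1024**3

# q4_0: 4-bit weights stored in blocks of 32 with one FP16 scale per block,
# i.e. roughly 4.5 bits per weight
q4_0_gb = n_params * 4.5 / 8 / 1024**3

print(f"FP16 : ~{fp16_gb:.1f} GB")  # ~13.5 GB, matches the 13G .pth file
print(f"q4_0 : ~{q4_0_gb:.1f} GB")  # ~3.8 GB, fits comfortably in 8-16 GB of RAM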

Inference with llama.cpp and the quantized Mistral AI model

You can use the main executable to run inference.

shell
./main -m /path/to/mistral-7B-v0.1/ggml-model-q4_0.gguf --color \
-c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i \
-p "This is what Elon Musk told Mark Zuckerberg:"
txt
This is what Elon Musk told Mark Zuckerberg: “I think the way forward 
is to be much more open about what you’re doing, and then let people
evaluate that.”

Elon Musk has a message for Mark Zuckerberg: be more open.

Musk made his comments during an interview with CBS News host
John Dickerson on Sunday night’s episode of “60 Minutes” in which
the Tesla CEO discussed how Facebook should handle its role as social
media companies influence politics worldwide.

Musk has been critical of Zuckerberg’s handling of Facebook’s fake news
problem and suggested that the company’s leadership be more transparent
with its users about how it handles content on the site.

“I think the way forward is to be much more open about what you’re
doing, and then let people evaluate that,” Musk said. “You know,
there’s not really any harm in being more open, but I do think
there needs to...

The llama.cpp library also ships with a web server and a ton of features; take a look at the README and the examples folder in the GitHub repo.
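
For example, once the bundled server is running (something along the lines of ./server -m /path/to/ggml-model-q4_0.gguf -c 2048, check the README for the exact flags), you can query its completion endpoint over HTTP. A minimal sketch, assuming the default address 127.0.0.1:8080 and the /completion route:

python
import json
import urllib.request

# Prompt and sampling settings sent to the llama.cpp server
payload = {
    "prompt": "Building a website can be done in 10 simple steps:",
    "n_predict": 128,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# The server answers with a JSON document containing the generated text
with urllib.request.urlopen(req) as response:
    result = json.loads(response.read())

print(result["content"])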

Building LLM applications with Mistral AI, llama-cpp-python and grammar constraints

You can use several libraries on top of llama.cpp to build your applications. As I use Python, I'll rely on the Python bindings library llama-cpp-python.

I wrote a high-level library on top of llama-cpp-python to perform tasks such as parsing, summarization, argumentation analysis, etc. It can also be used with the OpenAI API with the same function signatures.

I'll show an example of how to use PyLLMCore to interact with the model locally and produce structured output. To understand how it is done, you can read this introduction.

The PyLLMCore library leverages the built-in grammar capabilities of llama.cpp and implements a mechanism to convert data classes to a JSON schema.
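
To give a rough idea of what that conversion involves (this is an illustrative sketch of the general idea, not PyLLMCore's actual implementation), a data class can be mapped to a JSON schema like this:

python
from dataclasses import dataclass, fields

# Naive mapping from Python types to JSON schema types (illustration only)
TYPE_MAP = {str: "string", int: "integer", float: "number", bool: "boolean"}

def to_json_schema(datacls):
    return {
        "type": "object",
        "properties": {
            f.name: {"type": TYPE_MAP.get(f.type, "string")}
            for f in fields(datacls)
        },
        "required": [f.name for f in fields(datacls)],
    }

@dataclass
class Book:
    title: str
    summary: str
    author: str
    published_year: int

print(to_json_schema(Book))

A schema like this can then drive llama.cpp's grammar-based sampling, so the model can only emit JSON that matches the target data class.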

We can invoke the following:

python
from dataclasses import dataclass
from llm_core.parsers import LLaMACPPParser

@dataclass
class Book:
    title: str
    summary: str
    author: str
    published_year: int

text = """Foundation is a science fiction novel by American writer
Isaac Asimov. It is the first published in his Foundation Trilogy (later
expanded into the Foundation series). Foundation is a cycle of five
interrelated short stories, first published as a single book by Gnome Press
in 1951. Collectively they tell the early story of the Foundation,
an institute founded by psychohistorian Hari Seldon to preserve the best
of galactic civilization after the collapse of the Galactic Empire.
"""


# Be sure to have mistral-7b-instruct-v0.1.Q4_K_M.gguf in ~/.cache/py-llm-core/models/
# You can download from https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf

model = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
with LLaMACPPParser(Book, model=model) as parser:
    book = parser.parse(text)
    print(book)
python
Book(
    title='Foundation',
    summary="""Foundation is a science fiction novel by American writer
        Isaac Asimov. It is the first published in his Foundation Trilogy
        (later expanded into the Foundation series). Foundation is a
        cycle of five interrelated short stories, first published as a
        single book by Gnome Press in 1951. Collectively they tell the
        early story of the Foundation, an institute founded by 
        psychohistorian Hari Seldon to preserve the best of galactic
        civilization after the collapse of the Galactic Empire.""",
    author='Isaac Asimov',
    published_year=1951
)

Here is another example from the PyLLMCore documentation:

python
from typing import List
from dataclasses import dataclass

# LLaMACPPAssistant is needed to instantiate Mistral Instruct
from llm_core.assistants import LLaMACPPAssistant

# Make sure that ~/.cache/py-llm-core/models contains the following file
# You can download from https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
model = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"


@dataclass
class RecipeStep:
    step_title: str
    step_instructions: str


@dataclass
class Recipe:
    system_prompt = "You are a world-class chef"
    prompt = "Write a detailed step-by-step recipe to make {dish}"

    title: str
    steps: List[RecipeStep]
    ingredients: List[str]


class Chef:
    def generate_recipe(self, dish):
        with LLaMACPPAssistant(Recipe, model=model) as assistant:
            recipe = assistant.process(dish=dish)
            return recipe

chef = Chef()
recipe = chef.generate_recipe("Boeuf bourguignon")
print(recipe)

Inference with the Python library llama-cpp-python

Under the hood, PyLLMCore uses llama-cpp-python; here is an example of how to generate completions:

python
from llama_cpp import Llama

llm = Llama(
    'mistral-7b-instruct-v0.1.Q4_K_M.gguf',
    verbose=False,
    n_ctx=4000,
    n_threads=4
)
llm.create_completion('Write a step-by-step recipe to make a pizza')
python
{
    "id": "cmpl-a06474ea-d8b0-4ce4-a6e2-5c8b07e04d55",
    "object": "text_completion",
    "created": 1696770275,
    "model": "/Users/pas/.cache/py-llm-core/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    "choices": [
        {
            "text": " crust:\n\nIngredients:\n- 3 cups all-purpose flour\n- 1 package active dry yeast\n- 2 tablespoons olive oil\n- 2 teaspoons salt\n- 1 cup warm water\n\nInstructions:\n\n1. Start by combining the dry ingredients in a large mixing bowl. Add in the flour, yeast and salt. Mix everything together until it is well combined.\n\n2. Next, add in the olive oil and warm water to the mixture. Use a wooden spoon to stir everything together until it forms a soft dough. \n",
            "index": 0,
            "logprobs": None,
            "finish_reason": "length",
        }
    ],
    "usage": {
        "prompt_tokens": 13,
        "completion_tokens": 128,
        "total_tokens": 141,
    },
}
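
Since llama-cpp-python also exposes llama.cpp's grammar support, you can constrain the output directly. A minimal sketch, assuming a llama_cpp version that ships the LlamaGrammar class (the GBNF grammar below simply forces a yes/no answer):

python
from llama_cpp import Llama, LlamaGrammar

# GBNF grammar that only allows the model to answer "yes" or "no"
grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

llm = Llama(
    'mistral-7b-instruct-v0.1.Q4_K_M.gguf',
    verbose=False,
    n_ctx=4000,
    n_threads=4
)

result = llm.create_completion(
    'Is Paris the capital of France? Answer with yes or no: ',
    grammar=grammar,
    max_tokens=4,
)
print(result["choices"][0]["text"])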

How fast is the Mistral AI Instruct model?

I ran some basic benchmarks with the help of other contributors and found the model to be quite fast:

Machine            Tokens/sec   Cost / h   Cost / 1k tok   Comments
OVHCloud T2-45     93.22        1.650 €    0.005 €         Tesla V100 - Q4_K_M
OVHCloud T2-45     70.5         1.650 €    0.007 €         Tesla V100 - Q8_0
Apple M1 Max       49.1         --- €      --- €           Q4_K_M - GPU
RTX 3070 Ti        40.4         --- €      --- €           Q4_K_M - GPU
Apple M1 Max       32.5         --- €      --- €           Q8_0 - GPU
MacBook Pro M1     13.3         0.110 €    0.002 €         Q4_K_M - GPU
MacBook Pro M1     12.38        0.110 €    0.002 €         Q4_K_M - CPU only
OVHCloud C2-60     10.34        0.749 €    0.020 €         Q4_K_M - CPU, 16 cores
OVHCloud D2-8      3.54         0.036 €    0.003 €         Q4_K_M - CPU, 4 cores
OVHCloud C2-7      2.85         0.098 €    0.010 €         Q4_K_M - CPU, 2 cores
Intel NUC N3050    0.15         0.040 €    0.074 €         Q4_K_M - OpenCL
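
The cost per 1k tokens column follows directly from the hourly price and the measured throughput:

python
# Cost per 1k tokens = hourly price / (tokens generated per hour / 1000)
def cost_per_1k_tokens(cost_per_hour_eur, tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour_eur / (tokens_per_hour / 1000)

# OVHCloud T2-45 (Tesla V100, Q4_K_M)
print(round(cost_per_1k_tokens(1.650, 93.22), 3))  # 0.005
# MacBook Pro M1 (Q4_K_M, GPU)
print(round(cost_per_1k_tokens(0.110, 13.3), 3))   # 0.002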

Next steps

You might be interested in reading Mistral AI Instruct v0.1 model evaluation in real world use cases.
