
How to use the Mistral AI Instruct model to generate JSON output (aka JSON Mode)

Mistral AI released a new language model with great capabilities. However, as with the LLaMA models, it does not come with an equivalent of OpenAI's Functions feature.

In this article, we'll see how to leverage the llama.cpp library to get the Mistral-7B-Instruct-v0.1 model to parse and extract information in a structured way.

Leveraging LLaMA.cpp for inference on GPU-less hardware

As presented in a previous article, the llama.cpp library can run inference when no GPU is available, while still taking advantage of other optimizations.

The code is available on GitHub.

For the next parts of the tutorial, I suggest you download pre-processed models (GGUF format) here.

Use JSON Mode with Mistral AI model

To produce structured content, the best way is to first create a JSON Schema (as with OpenAI Functions) and then run inference constrained by that schema.

Here's sample code that generates the schema from a dataclass using PyLLMCore:

>>> from llm_core.schema import to_json_schema
>>> from dataclasses import dataclass

>>> @dataclass
... class Book:
...     title: str
...     summary: str
...     author: str
...     published_year: int
...
>>> schema = to_json_schema(Book)
>>> schema
{
  'type': 'object',
  'properties': {
    'title': {'type': 'string'},
    'summary': {'type': 'string'},
    'author': {'type': 'string'},
    'published_year': {'type': 'integer'},
  },
  'required': ['title', 'summary', 'author', 'published_year'],
}
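
To make the mapping from dataclass to schema less magical, here is a minimal sketch using only the standard library. It is not PyLLMCore's implementation: `sketch_json_schema` and `type_to_schema` are hypothetical names, and the sketch only handles flat dataclasses with primitive fields or lists of them.

```python
from dataclasses import dataclass, fields
from typing import List, get_args, get_origin

# Python type -> JSON Schema type name
PRIMITIVES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def type_to_schema(tp):
    """Translate one type annotation into a JSON Schema fragment."""
    if tp in PRIMITIVES:
        return {"type": PRIMITIVES[tp]}
    if get_origin(tp) is list:  # e.g. List[str]
        (item_type,) = get_args(tp)
        return {"type": "array", "items": type_to_schema(item_type)}
    raise TypeError(f"unsupported annotation: {tp!r}")


def sketch_json_schema(cls):
    """Hypothetical stand-in for to_json_schema, for flat dataclasses only."""
    properties = {f.name: type_to_schema(f.type) for f in fields(cls)}
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),  # every field is treated as required
    }


@dataclass
class Book:
    title: str
    summary: str
    author: str
    published_year: int


book_schema = sketch_json_schema(Book)
```

The field names end up as property names in the schema, which is why descriptive names matter later on: they are the only hints the model gets about what to put in each slot.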

Now that we have the schema, we'll convert that into grammar rules.

Generate grammar rules from a JSON Schema specification

The team behind llama.cpp did a great job and provided a Python function that generates a Backus-Naur Form grammar (with some adaptations).

The code is available here.

I won't reproduce the code here; instead, integrate it directly into your project. Then you can use the following wrapper (on top of SchemaConverter):

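def generate_grammar(schema):
    # SchemaConverter comes from the llama.cpp script mentioned above.
    # The empty dict is the property-order mapping (no custom order here).
    converter = SchemaConverter({})
    # An empty name makes the visited schema the root rule of the grammar.
    converter.visit(schema, '')
    return converter.format_grammar()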

Let's try calling generate_grammar(schema):

space ::= " "?
string ::=  "\"" (
        [^"\\] |
        "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
      )* "\"" space
integer ::= ("-"? ([0-9] | [1-9] [0-9]*)) space
root ::= "{" space "\"author\"" space ":" space string "," space "\"published_year\"" space ":" space integer "," space "\"summary\"" space ":" space string "," space "\"title\"" space ":" space string "}" space

OK, I don't know BNF (nor GBNF), but you get the idea. By the way, if you feed this into GPT-3.5 and ask for a valid example, it will comply gracefully.
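
To see where the root rule comes from, here is a small sketch that rebuilds it for a flat schema. It is a simplified illustration, not the llama.cpp converter: `gbnf_literal` and `root_rule` are hypothetical helpers, and the sketch assumes every property has a primitive type whose grammar rule carries the same name (`string`, `integer`).

```python
import json


def gbnf_literal(text):
    # JSON-encode the key (title -> "title"), then escape it so it can sit
    # inside a double-quoted grammar literal.
    encoded = json.dumps(text)
    escaped = encoded.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def root_rule(schema):
    # One `"key" space ":" space type` member per property, sorted by name
    # to match the alphabetical order seen in the generated grammar.
    members = [
        f'{gbnf_literal(name)} space ":" space {prop["type"]}'
        for name, prop in sorted(schema["properties"].items())
    ]
    body = ' "," space '.join(members)
    return f'root ::= "{{" space {body} "}}" space'


flat_schema = {
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
    }
}
rule = root_rule(flat_schema)
```

Each member fixes the key and leaves only the value free, so the model can choose what to write but not the shape of the object.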

Use grammar rules from JSON Schema to produce structured output

The llama.cpp library also comes with a web server you can use locally:

./server -t 4 -ngl 16 -m mistral-instruct-7B-v0.1-q4_0.gguf -c 4096

Some explanations:

  • -t 4 means I'll use the 4 performance cores of Apple Silicon M1
  • -ngl 16 means I'll offload 16 layers onto the integrated GPU of the M1
  • mistral-instruct-7B-v0.1-q4_0.gguf is a quantized version of the instruct model. This version only requires 4 GB of RAM
  • -c 4096 means I'll use a 4096 context window (the sliding window attention is yet to be implemented in llama.cpp)
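
The 4 GB figure can be checked with a rough back-of-the-envelope calculation. The sketch below assumes about 7.24 billion parameters for Mistral 7B and the q4_0 layout of 32 four-bit weights plus one 16-bit scale per block. It ignores the context cache and runtime overhead, and assumes every tensor uses q4_0, so treat the result as an approximation.

```python
def q4_0_size_gb(n_params):
    """Approximate size in GB of n_params weights stored as q4_0."""
    # Each block holds 32 weights at 4 bits plus one 16-bit scale factor.
    bits_per_weight = (32 * 4 + 16) / 32  # 4.5 bits per weight
    return n_params * bits_per_weight / 8 / 1e9


MISTRAL_7B_PARAMS = 7.24e9  # assumption: approximate parameter count
estimate = q4_0_size_gb(MISTRAL_7B_PARAMS)
print(f"{estimate:.2f} GB")
```

Under these assumptions the weights alone come to a little over 4 GB, which is consistent with the figure quoted above.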

The server listens on port 8080.

After a lot of testing and prompting, I figured out that you can parse and perform simple analysis with no prompt whatsoever: using a very descriptive schema just works!

A simple (maybe naive) parser can be implemented as follows:

import json

import requests


def parse(text, schema):
    url = "http://localhost:8080/completion"
    headers = {"Content-Type": "application/json"}
    grammar = generate_grammar(schema)

    # No instructions: only the text, wrapped in Mistral's [INST] tags
    prompt = f"""<s>[INST]{text}[/INST]"""

    data = {
        "prompt": prompt,
        "n_predict": 512,
        "temperature": 0.1,
        "grammar": grammar,
    }
    response = requests.post(url, headers=headers, json=data)
    return json.loads(response.json()["content"])
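
One caveat: the grammar constrains which tokens can be sampled, but generation still stops after n_predict tokens, so a long answer can be cut off in the middle of the object. A minimal defensive sketch follows; `decode_completion` is a hypothetical helper that only relies on the "content" field used above.

```python
import json


def decode_completion(payload):
    """Parse the "content" field of a completion response into a dict."""
    content = payload.get("content", "")
    try:
        return json.loads(content)
    except json.JSONDecodeError as exc:
        # Show the tail of the output: a dangling string or missing brace
        # usually means the token limit was reached.
        raise ValueError(
            "completion is not valid JSON; it may have been cut off "
            f"by the n_predict limit: {content[-80:]!r}"
        ) from exc


complete = decode_completion({"content": '{"title": "Foundation"}'})
```

If this error shows up, raising n_predict or asking for shorter fields in the schema are the first things to try.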

We'll test with the reference example of the PyLLMCore library:


text = """Foundation is a science fiction novel by American writer
Isaac Asimov. It is the first published in his Foundation Trilogy (later
expanded into the Foundation series). Foundation is a cycle of five
interrelated short stories, first published as a single book by Gnome Press
in 1951. Collectively they tell the early story of the Foundation,
an institute founded by psychohistorian Hari Seldon to preserve the best
of galactic civilization after the collapse of the Galactic Empire.
"""

@dataclass
class Book:
    title: str
    summary: str
    author: str
    published_year: int

    @classmethod
    def schema(cls):
        return to_json_schema(cls)

data = parse(text, Book.schema())

The parsed result:

{
    "title": "Foundation (novel)",
    "author": "Isaac Asimov",
    "summary": "Foundation is a science fiction novel by Isaac Asimov that
      tells the early story of the Foundation, an institute founded by
      psychohistorian Hari Seldon to preserve the best of galactic
      civilization after the collapse of the Galactic Empire. The novel
      is a cycle of five interrelated short stories that collectively tell
      the story of the Foundation and its efforts to maintain order
      in the galaxy.",
    "published_year": 1951,
}

A more complex topic, outside common knowledge

We'll try something harder that is not common knowledge:

I'll feed Mistral Instruct the abstract of the paper on Chain of Density prompting:

from typing import List

@dataclass
class Publication:
    title: str
    main_topic: str
    summary_in_50_words: str
    categories: List[str]

    @classmethod
    def schema(cls):
        return to_json_schema(cls)

text = """

Selecting the “right” amount of information to include in a summary is
a difficult task. A good summary should be detailed and entity-centric
without being overly dense and hard to follow.

To better understand this tradeoff, we solicit increasingly dense GPT-4
summaries with what we refer to as a “Chain of Density” (CoD) prompt.

Specifically, GPT-4 generates an initial entity-sparse summary before
iteratively incorporating missing salient entities without increasing the
length. Summaries generated by CoD are more abstractive, exhibit more
fusion, and have less of a lead bias than GPT-4 summaries generated by
a vanilla prompt.

We conduct a human preference study on 100 CNN DailyMail articles and
find that that humans prefer GPT-4 summaries that are more dense than
those generated by a vanilla prompt and almost as dense as human
written summaries.

Qualitative analysis supports the notion that there exists a tradeoff between
informativeness and readability. 500 annotated CoD summaries, as well as
an extra 5,000 unannotated summaries, are freely available on HuggingFace.
"""

data = parse(text, Publication.schema())


Note that:

  • I did not provide any prompt
  • I did not provide a title
  • I did not provide instructions to summarize or to categorize

Yet, here are the results:

{
    "title": "Tradeoff between information richness and readability
      in summaries using GPT-4",
    "categories": [
      "Artificial Intelligence",
      "Natural Language Processing"
    ],
    "main_topic": "Summarization",
    "summary_in_50_words": "This study investigates the tradeoff
      between information richness and readability in summaries.
      GPT-4 generates increasingly dense summaries using a Chain
      of Density (CoD) prompt, which results in more abstractive,
      fused, and less biased summaries than vanilla prompts.
      Human preference studies show that people prefer more dense
      summaries than vanilla prompts but not as dense as
      human-written summaries.",
}

I find that remarkable as this is only a 7B model running on my laptop and consuming just about 4 GB of RAM.
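
Since the schema was derived from a dataclass, the parsed dict can be loaded straight back into it. Here is a minimal sketch: `load_into` is a hypothetical helper, and the `parsed` dict is an abbreviated stand-in for what parse() returned above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Publication:
    title: str
    main_topic: str
    summary_in_50_words: str
    categories: List[str]


def load_into(cls, data):
    """Build a dataclass instance from a parsed dict, ignoring unknown keys."""
    known = {f.name for f in fields(cls)}
    return cls(**{key: value for key, value in data.items() if key in known})


# Abbreviated stand-in for the dict returned by parse()
parsed = {
    "title": "Tradeoff between information richness and readability",
    "main_topic": "Summarization",
    "summary_in_50_words": "This study investigates the tradeoff...",
    "categories": ["Artificial Intelligence", "Natural Language Processing"],
}
publication = load_into(Publication, parsed)
```

A missing field raises a TypeError from the dataclass constructor, which is a cheap way to notice when the output does not match the class.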

Next steps

I am astounded by the Mistral Instruct model's performance. I am grateful to all the llama.cpp contributors for the work they did so we can run these models on consumer hardware. I encourage you to take a look and share your experiments.

You might be interested in reading Mistral AI Instruct v0.1 model evaluation in real world use cases.



Advanced Stack