
How to generate a synthetic dataset using a large language model

To build and fine-tune small language models like text classifiers or fill-mask models, synthetic dataset generation is a method that yields great results, thanks to the latest releases of high-quality LLMs (Mistral AI, Llama 2).

We'll explore this method using the Mistral AI Instruct model, as it is quite capable (and fast) for this kind of use case.

What is synthetic dataset generation?

Synthetic dataset generation is the process of creating artificial data that mimics the characteristics of real-world data. Synthetic data can be generated to reflect a variety of scenarios that may not be easily available in real-world data.

Use case: Map user actions to the CRUD model and entities

In this article, we want to map a user query to the underlying operation to perform in a CRUD (Create, Read, Update, Delete) database application.

We could use a large language model directly to perform this task, as shown below using PyLLMCore and Mistral AI Instruct:

python
In [2]: ask('Cancel all my meetings for the week')
Out[2]: UserQuery(operation=<CRUDOperation.DELETE: 4>, target=<TargetItem.MEETING: 4>)

In [3]: ask('What is the agenda ?')
Out[3]: UserQuery(operation=<CRUDOperation.READ: 2>, target=<TargetItem.MEETING: 4>)

In [4]: ask('Schedule meeting for next monday')
Out[4]: UserQuery(operation=<CRUDOperation.CREATE: 1>, target=<TargetItem.MEETING: 4>)

In [5]: ask('When is my next meeting ?')
Out[5]: UserQuery(operation=<CRUDOperation.READ: 2>, target=<TargetItem.MEETING: 4>)
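
For reference, here is a minimal sketch of what the `ask` helper above could look like. It follows the same PyLLMCore pattern used later in this article; the `TargetItem` members other than MEETING (visible in the output above) are placeholders:

python
from dataclasses import dataclass
from enum import Enum

from llm_core.assistants import LLaMACPPAssistant


class CRUDOperation(Enum):
    CREATE = 1
    READ = 2
    UPDATE = 3
    DELETE = 4


class TargetItem(Enum):
    # Only MEETING = 4 is visible in the output above;
    # the other members are placeholders.
    CALENDAR = 1
    TASK = 2
    REMINDER = 3
    MEETING = 4


@dataclass
class UserQuery:
    system_prompt = "You are a helpful assistant."
    prompt = """
    Analyze the user's query and convert their intent to:
    - an operation (among CRUD)
    - a target item

    Query: {prompt}
    """
    operation: CRUDOperation
    target: TargetItem


def ask(prompt):
    # The assistant fills the dataclass fields from the model's
    # structured output.
    with LLaMACPPAssistant(UserQuery, model="mistral") as assistant:
        return assistant.process(prompt=prompt)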

But the main issue is the computing power required to run this mapping. A smaller model suited to text classification would be much faster and cheaper.

For example, fine-tuning the distilbert-base-uncased model (only 67M parameters vs 7B for Mistral AI) would yield significant performance and cost improvements.

But to do so, we need to generate user queries and label them with the intended operation.
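
Concretely, the goal is a labelled dataset that pairs each query with its operation, along the lines of (labels taken from the outputs above):

txt
Cancel all my meetings for the week -> DELETE
What is the agenda ?                -> READ
Schedule meeting for next monday    -> CREATE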

Let's do that.

Generate synthetic user queries in a specific domain

In this example, we will narrow our focus to a fictional application for managing one's calendar.

We'll use PyLLMCore to generate our synthetic dataset.

The code snippets are truncated for brevity. For complete working code, refer to the documentation of PyLLMCore - Synthetic dataset generation.

We start by generating user queries for our domain:

python
from dataclasses import dataclass
from typing import List

from llm_core.assistants import LLaMACPPAssistant


@dataclass
class UserQueryGenerator:
    system_prompt = "You are a helpful assistant."
    prompt = """
    # Goals

    We are developing a new business calendar software that is able
    to understand plain english.
    
    # Examples
    
    Cancel all my meetings of the week
    What is my next meeting ?
    What is on the agenda for the meeting at 1 pm ?
    {queries}
    
    # Todo

    Write {queries_count} new examples of what a user could have asked.
    
    """
    user_queries: List[str]

    @classmethod
    def generate(cls, queries_count=10, existing_queries=()):
        with LLaMACPPAssistant(cls, model="mistral") as assistant:
            existing_queries_str = '\n'.join(existing_queries)
            batch = assistant.process(
                queries_count=queries_count,
                queries=existing_queries_str
            )
            return batch.user_queries
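
The `existing_queries` parameter lets previously generated queries be fed back into the prompt so the model produces fresh examples instead of duplicates. A minimal driver loop could look like this (the batch count and size are arbitrary choices):

python
# Accumulate queries batch by batch, feeding prior results back
# into the prompt to steer the model away from duplicates.
queries = []
for _ in range(5):
    queries += UserQueryGenerator.generate(
        queries_count=10,
        existing_queries=queries,
    )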
Here is a sample of the generated queries:

txt
What are my meetings for tomorrow?

Can you remind me to call John at 3 pm today?

Show me the schedule for the next week.

What is the time of my meeting with Sarah?

Add a new meeting with Michael at 2 pm on Friday.

Remove all meetings from my calendar.

What is the agenda for the meeting at 10 am?

Can you reschedule my meeting with David to 4 pm?

Show me the schedule for today and tomorrow.

Now that we have some user queries, we can label them.

Automatic data labelling using an LLM

To label the queries, we will map each user query to two Enum classes:

python
from enum import Enum


class Item(Enum):
    CALENDAR = 1
    EVENT = 2
    TASK = 3
    REMINDER = 4
    INVITEE = 5

class CRUDOperation(Enum):
    CREATE = 1
    READ = 2
    UPDATE = 3
    DELETE = 4

We write the following (expensive and slow) classifier:

python
@dataclass
class UserQueryClassification:
    system_prompt = "You are a helpful assistant."
    prompt = """
    Analyze the user's query and convert their intent to:
    - an operation (among CRUD)
    - a target item

    Query: {prompt}
    """
    operation: CRUDOperation
    item: Item

    @classmethod
    def ask(cls, prompt):
        with LLaMACPPAssistant(cls, model="mistral") as assistant:
            user_query = assistant.process(prompt=prompt)
            return user_query
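
To build the full dataset, we can run this classifier over every generated query and persist the results. A minimal sketch, assuming the `queries` list built earlier (the JSONL file name and record layout are arbitrary choices):

python
import json

# Label each generated query and store one JSON record per line
# (JSONL), a convenient format for fine-tuning pipelines.
with open("dataset.jsonl", "w") as f:
    for query in queries:
        result = UserQueryClassification.ask(query)
        record = {
            "text": query,
            "operation": result.operation.name,
            "item": result.item.name,
        }
        f.write(json.dumps(record) + "\n")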

When we apply this classifier on our generated queries, we obtain:

txt
What are my meetings for tomorrow?
operation:READ
item:CALENDAR

Can you remind me to call John at 3 pm today?
operation:CREATE
item:TASK

Show me the schedule for the next week.
operation:READ
item:CALENDAR

What is the time of my meeting with Sarah?
operation:READ
item:TASK

Add a new meeting with Michael at 2 pm on Friday.
operation:CREATE
item:CALENDAR

Remove all meetings from my calendar.
operation:DELETE
item:CALENDAR

What is the agenda for the meeting at 10 am?
operation:READ
item:TASK

Can you reschedule my meeting with David to 4 pm?
operation:UPDATE
item:CALENDAR

Show me the schedule for today and tomorrow.
operation:READ
item:TASK

We can see from this output that the labelling can yield unwanted results.

The following example is not satisfactory:

txt
Can you remind me to call John at 3 pm today?
operation:CREATE
item:TASK

There is in fact an ambiguity in the prompting: what are we looking to achieve here?

Is it "Create a task to call John at 3 pm" or "Create a reminder to call John at 3 pm"?

To produce the best results, you have two choices:

  • authorize multiple labels (see the sketch below)
  • generate labels that are exclusive (orthogonal)
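
For the first option, here is a sketch of a multi-label variant, reusing the imports and enums defined earlier. This variant is an illustration, assuming PyLLMCore handles `List` fields of enums like its other list fields:

python
@dataclass
class MultiLabelClassification:
    system_prompt = "You are a helpful assistant."
    prompt = """
    Analyze the user's query and convert their intent to:
    - an operation (among CRUD)
    - every target item involved

    Query: {prompt}
    """
    operation: CRUDOperation
    items: List[Item]

    @classmethod
    def ask(cls, prompt):
        with LLaMACPPAssistant(cls, model="mistral") as assistant:
            return assistant.process(prompt=prompt)

With such a schema, the ambiguous query above could legitimately be labelled with both TASK and REMINDER.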

It depends on the downstream application and what you need to do. The main benefit of synthetic dataset generation is that you can leverage language models to both produce and label data with rich classifications.

As the overall costs are now a fraction of what they used to be, you can afford to produce a hyper-specialized dataset.

Next steps

There are two steps we'll cover in upcoming articles:

  • Synthetic dataset quality verification
  • Fine-tuning distilbert-base-uncased (preview sketch below)
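
As a preview of the second step, here is a minimal fine-tuning sketch with Hugging Face transformers. The training arguments are illustrative, and we classify on the operation label only to keep the sketch short:

python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Load the JSONL dataset produced earlier and turn the operation
# names into integer class labels.
dataset = load_dataset("json", data_files="dataset.jsonl")["train"]
labels = ["CREATE", "READ", "UPDATE", "DELETE"]
dataset = dataset.map(lambda r: {"label": labels.index(r["operation"])})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = dataset.map(
    lambda r: tokenizer(r["text"], truncation=True, padding="max_length"),
    batched=True,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels)
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=dataset,
)
trainer.train()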

Stay tuned!