
Build expert AI models through distillation and parameter pruning

The general availability of high-quality models enables the creation of smaller expert models.

The distillation technique is not new (G. Hinton et al., 2015) and has led to the popular DistilBERT series of models.

At its core, distillation is a knowledge transfer between models with one goal in mind: reducing the cost (parameter count) of inference. Performance stays close to the teacher's (less than a 5% decrease).
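
To make this concrete, here is a minimal sketch of the classic distillation objective in PyTorch: the student is trained on a blend of the usual cross-entropy loss on hard labels and a KL term that pulls its softened output distribution toward the teacher's. The temperature and weighting values are illustrative, not tuned recommendations.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hinton-style distillation: mix hard-label cross-entropy with a
    soft-target KL term computed at a higher temperature."""
    # Softened teacher and student distributions
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence, rescaled by T^2 as in Hinton et al. (2015)
    kd = F.kl_div(soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2

    # Standard supervised loss on the ground-truth labels
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1 - alpha) * ce
```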

One recent game-changer is the general availability of much larger and more capable models. These larger models can act as teachers by:

  • generating high-quality synthetic datasets (or even simulated environments), as sketched below
  • transferring their parameters (when weights are released)
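
As a sketch of the first point, a capable instruction-tuned model can be prompted to produce the raw material of a synthetic dataset that the smaller student is later fine-tuned on. The model name and prompt below are placeholders, not part of any specific recipe:

```python
from transformers import pipeline

# Any sufficiently capable instruction-tuned model can play the teacher;
# the model id below is only a placeholder example.
teacher = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

prompt = (
    "Generate 5 short user requests about calendar management, "
    "one per line, phrased in natural language."
)

# The generated lines become candidate inputs of a synthetic dataset.
output = teacher(prompt, max_new_tokens=200, do_sample=True, temperature=0.8)
synthetic_examples = output[0]["generated_text"].splitlines()
```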

More recently, M. Xia et al. (2023) released 1.3B and 2.7B parameter LLMs (Sheared LLaMA) built with structured pruning and continued pre-training.
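
Sheared LLaMA's recipe is more involved, but the core idea of structured pruning (removing whole rows, heads, or layers rather than individual weights) can be illustrated with PyTorch's built-in pruning utilities. This is only an illustration of the concept, not the method from the paper:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy layer standing in for one projection inside a transformer block.
layer = nn.Linear(in_features=768, out_features=768)

# Zero out 30% of entire output rows (dim=0), ranked by their L2 norm:
# structured pruning, as opposed to unstructured weight-by-weight pruning.
prune.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)

# Fold the pruning mask into the weight tensor permanently.
prune.remove(layer, "weight")
```

Note that this only zeroes the pruned rows; real size and speed gains require actually removing them from the architecture and, as the paper shows, continued pre-training to recover quality.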

Smaller AI models are the key to maintaining privacy

Using a smaller AI model can be beneficial for several reasons:

  1. Reduced computational resources: Smaller models require fewer computational resources, making them more efficient for deployment on devices with limited processing power or in resource-constrained environments.

  2. Faster inference: Smaller models typically have lower inference times, allowing for faster predictions and real-time applications.

  3. Reduced energy consumption: Smaller models require less energy to perform inference, making them more energy-efficient.

  4. Improved privacy: Smaller models can be deployed locally, keeping sensitive data on the device; local deployment is effectively mandatory when privacy must be preserved.

Use cases for expert models: inference performance

From my point of view, the most important use case for an expert model is inference performance. For example, processing text with a modern LLM comes with latency measured in seconds and requires expensive GPUs.

When inference must run in under 50 ms without specialized hardware, you have no other choice.

As a rule of thumb, we can estimate the inference time and memory footprint from the parameter count:

Staying below 100 M parameters lets you achieve < 10 ms on modern hardware and < 20 ms on older or embedded CPUs. But you would still need between 100 MB and 200 MB of RAM (int8 / fp16).
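
The RAM figure is simple arithmetic: parameter count multiplied by bytes per weight. A quick sanity check:

```python
# Back-of-the-envelope memory estimate: parameters x bytes per parameter.
params = 100_000_000                     # 100 M parameters
bytes_per_param = {"int8": 1, "fp16": 2}

for dtype, size in bytes_per_param.items():
    print(f"{dtype}: ~{params * size / 1e6:.0f} MB")
# int8: ~100 MB
# fp16: ~200 MB
```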

How to create and train an expert model?

I recently started building a practical example in the article series From synthetic dataset to an expert model.

I am exploring the steps required to map natural-language user queries to CRUD (Create, Read, Update, Delete) operations (e.g. how can we translate "Cancel my meetings for the week" into an API call?).
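
To illustrate the target of that mapping, here is a hypothetical structured output an expert model could be trained to emit. The field names are made up for this example and are not the schema used in the article series:

```python
from dataclasses import dataclass, field

# Hypothetical target structure; field names are illustrative only.
@dataclass
class CrudCall:
    operation: str                        # "create", "read", "update" or "delete"
    resource: str                         # e.g. "meeting"
    filters: dict = field(default_factory=dict)

# Expected output for "Cancel my meetings for the week"
expected = CrudCall(
    operation="delete",
    resource="meeting",
    filters={"range": "current_week"},
)
```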

Using an LLM directly to make API calls was my starting point, until I figured out that the latency and costs would not be acceptable.

