Use NuExtract to parse unstructured text locally in less than 5 min

Setting Up Your Environment

To parse text and extract structured data with the NuExtract model, you first need to set up your environment: create a virtual environment, install the required libraries, and download the model.

Step 1: Install PyLLMCore

PyLLMCore is an open-source Python library (MIT license) that interfaces with various Large Language Models (LLMs) and provides a simple API for parsing and extracting data. To install PyLLMCore, run the following commands in your terminal:

shell
# You need Python 3.10
python3 -m venv venv
source venv/bin/activate

# Important: Some versions need to be pinned
pip3 install py-llm-core==2.8.15
pip3 install mistralai==0.4.2
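
Before moving on, it can help to confirm that the virtual environment matches the pins above. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `installed_version` are made up for this example, and it assumes that "Python 3.10" means 3.10 or newer.

```python
import sys
from importlib import metadata
from typing import Optional

# Versions pinned in the install commands above.
PINNED = {"py-llm-core": "2.8.15", "mistralai": "0.4.2"}


def python_is_supported(version_info=sys.version_info) -> bool:
    """Assumption: "Python 3.10" means 3.10 or newer."""
    return tuple(version_info[:2]) >= (3, 10)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python OK:", python_is_supported())
for package, expected in PINNED.items():
    found = installed_version(package)
    status = "OK" if found == expected else f"expected {expected}"
    print(f"{package}: {found} ({status})")
```

If a package shows `None`, the install most likely ran outside the activated virtual environment.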

Step 2: Download the NuExtract Model

NuExtract is a specialized LLM designed for structured extraction tasks. You can download the NuExtract model from Hugging Face. For this example, we'll use the NuExtract-tiny model. Run the following commands to download and store the model:

shell
mkdir -p ~/.cache/py-llm-core/models
cd ~/.cache/py-llm-core/models
# Quote the URL so the shell does not treat "?" as a glob character
wget -O nuextract-tiny "https://huggingface.co/advanced-stack/NuExtract-tiny-GGUF/resolve/main/nuextract-tiny-f16.gguf?download=true"
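
An interrupted or redirected download can leave an HTML error page or a truncated file under the model name, which tends to surface later as a confusing load error. As a quick sanity check, the sketch below tests whether the file starts with the four-byte `GGUF` magic that GGUF files begin with. The helper name `looks_like_gguf` is hypothetical, and the check says nothing about whether the rest of the file is complete.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four bytes


def looks_like_gguf(path: Path) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        return handle.read(4) == GGUF_MAGIC


# Same location as in the download commands above.
model_path = Path.home() / ".cache" / "py-llm-core" / "models" / "nuextract-tiny"
if looks_like_gguf(model_path):
    print(f"{model_path} looks like a GGUF file")
else:
    print(f"{model_path} is missing or is not a GGUF file")
```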

Step 3: Import Required Libraries

With the model downloaded, you can now import the necessary libraries in your Python script:

python
from dataclasses import dataclass, field
from llm_core.parsers import NuExtractParser

Parsing and Extracting Structured Data

To parse and extract structured data from text using the NuExtract model, you need to define a data class that represents the structure of the data you want to extract. PyLLMCore will use this data class to generate a JSON schema for the extraction process.

Step 4: Define the Data Class

For this example, let's create a Product data class with fields for the product name, price, and description:

python
from dataclasses import dataclass

@dataclass
class Product:
    name: str = ""
    price: str = ""
    description: str = ""
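
The empty-string defaults matter: NuExtract models are prompted with a JSON template whose values are left blank for the model to fill in. The sketch below shows how such a template can be derived from the data class with the standard library alone. It illustrates the idea and is not necessarily how PyLLMCore builds its prompt internally.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Product:
    name: str = ""
    price: str = ""
    description: str = ""


# An instance built from defaults serializes to an "empty" JSON template.
template = json.dumps(asdict(Product()), indent=4)
print(template)
```

Each key in the printed JSON corresponds to one field the model is asked to extract.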

Step 5: Load and Parse the Text

Next, define the text you want to parse. For this example, we'll use a sample product description:

python
text = """
Introducing the new SuperWidget! This innovative gadget is priced at .99 and offers a range of features to make your life easier. With its sleek design and user-friendly interface, the SuperWidget is perfect for tech enthusiasts and casual users alike.
"""

Step 6: Parse the Text with NuExtract

Finally, you can use the NuExtractParser class to parse the text and extract the structured data:

python
from llm_core.parsers import NuExtractParser

with NuExtractParser(Product) as parser:
    product = parser.parse(text)
    print(product)

The output will be an instance of the Product data class with the extracted data:

python
Product(
    name='SuperWidget',
    price='.99',
    description='This innovative gadget offers a range of features to make your life easier. With its sleek design and user-friendly interface, the SuperWidget is perfect for tech enthusiasts and casual users alike.'
)
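
Because `Product` stores the price as a string, downstream code usually has to convert it before doing arithmetic. Here is a minimal sketch of one way to do that; `parse_price` is a hypothetical helper, and it assumes a dot is the decimal separator and that commas only ever appear as thousands separators.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(raw: str) -> Optional[Decimal]:
    """Convert an extracted price string to a Decimal, or None if not numeric."""
    # Keep digits and dots only: drops currency symbols, spaces and commas.
    cleaned = re.sub(r"[^\d.]", "", raw)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_price(".99"))
print(parse_price("no price given"))
```

Returning `None` instead of raising keeps a single unparseable value from aborting a batch of extractions.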

Parsing Multiple Products

To demonstrate the power of NuExtract, let's parse a text containing multiple products. We'll wrap the Product data class in a new container class that holds a list of products, then parse a longer text.

Step 7: Define the Extended Data Class

We'll create a Catalog data class that contains a list of Product instances:

python
@dataclass
class Catalog:
    products: list[Product] = field(default_factory=lambda: [Product()])
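
Two details of this definition are worth spelling out. Data classes reject a plain list as a default value, so `default_factory` is required; and seeding the list with one empty `Product` means the default instance serializes to a template that shows the shape of a list item. The sketch below demonstrates both points with the standard library. Reading the single empty item as a hint for the model is an interpretation, not something the library documentation is quoted on here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Product:
    name: str = ""
    price: str = ""
    description: str = ""


@dataclass
class Catalog:
    products: list[Product] = field(default_factory=lambda: [Product()])


# The default instance carries one empty product, so the nested shape is visible.
print(json.dumps(asdict(Catalog()), indent=4))

# default_factory runs once per instance, so instances do not share a list.
first, second = Catalog(), Catalog()
first.products.append(Product(name="extra"))
print(len(first.products), len(second.products))
```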

Step 8: Load and Parse the Extended Text

Now, let's load a text containing multiple product descriptions:

python
text = """
Introducing the new SuperWidget! This innovative gadget is priced at .99 and offers a range of features to make your life easier. With its sleek design and user-friendly interface, the SuperWidget is perfect for tech enthusiasts and casual users alike.

Check out the MegaGizmo! At just 2.59, this device packs a punch with its powerful performance and compact size. Ideal for on-the-go professionals and students, the MegaGizmo is a must-have.

Don't miss the UltraTool! Priced at 3.49, this versatile tool is perfect for DIY enthusiasts and professionals. With its robust build and multiple functionalities, the UltraTool is your go-to solution for any task.
"""

Step 9: Parse the Extended Text with NuExtract

Finally, use the NuExtractParser class to parse the text and extract the structured data:

python
with NuExtractParser(Catalog) as parser:
    catalog = parser.parse(text)
    for product in catalog.products:
        print(product)

The loop prints one Product instance for each product found in the text:

python
Product(
    name='SuperWidget',
    price='.99',
    description='This innovative gadget offers a range of features to make your life easier. With its sleek design and user-friendly interface, the SuperWidget is perfect for tech enthusiasts and casual users alike.'
)
Product(
    name='MegaGizmo',
    price='2.59',
    description='This device packs a punch with its powerful performance and compact size. Ideal for on-the-go professionals and students, the MegaGizmo is a must-have.'
)
Product(
    name='UltraTool',
    price='3.49',
    description='This versatile tool is perfect for DIY enthusiasts and professionals. With its robust build and multiple functionalities, the UltraTool is your go-to solution for any task.'
)
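
Since the result is made of plain data classes, it can be turned back into JSON for storage or for passing to another service. The sketch below builds a `Catalog` by hand as a stand-in for the parser output (descriptions shortened for brevity) and serializes it with the standard library.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Product:
    name: str = ""
    price: str = ""
    description: str = ""


@dataclass
class Catalog:
    products: list[Product] = field(default_factory=lambda: [Product()])


# Stand-in for the object returned by parser.parse(text).
catalog = Catalog(
    products=[
        Product(name="SuperWidget", price=".99", description="This innovative gadget..."),
        Product(name="MegaGizmo", price="2.59", description="This device packs a punch..."),
        Product(name="UltraTool", price="3.49", description="This versatile tool..."),
    ]
)

# asdict converts nested data classes recursively into dicts and lists.
as_json = json.dumps(asdict(catalog), indent=2)
print(as_json)
```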

Conclusion

With the NuExtract model, you can parse unstructured text and extract structured data entirely on your own machine, because the model is small enough to run locally. Special thanks to NuMind for releasing the NuExtract model series, making structured extraction more accessible and efficient.

Advanced Stack