Use NuExtract to parse unstructured text locally in less than 5 min
Setting Up Your Environment
Before you can extract structured data from text with the NuExtract model, you need to set up your environment: install the required libraries and download the model file.
Step 1: Install PyLLMCore
PyLLMCore is an open-source Python library (MIT license) that interfaces with various Large Language Models (LLMs) and provides a simple API for parsing and extracting data. To install PyLLMCore in a fresh virtual environment, run the following commands in your terminal:
# You need Python 3.10
python3 -m venv venv
source venv/bin/activate
# Important: Some versions need to be pinned
pip3 install py-llm-core==2.8.15
pip3 install mistralai==0.4.2
Step 2: Download the NuExtract Model
NuExtract is a specialized LLM designed for structured extraction tasks. You can download the NuExtract model from Hugging Face. For this example, we'll use the NuExtract-tiny model. Run the following commands to download and store the model:
mkdir -p ~/.cache/py-llm-core/models
cd ~/.cache/py-llm-core/models
# Quote the URL so the shell does not treat "?" as a glob character
wget -O nuextract-tiny "https://huggingface.co/advanced-stack/NuExtract-tiny-GGUF/resolve/main/nuextract-tiny-f16.gguf?download=true"
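Before moving on, it can be worth confirming that the download succeeded. The sketch below assumes the file was saved to the path used above and relies on GGUF files starting with the four ASCII bytes GGUF. The helper name looks_like_gguf is hypothetical and is not part of PyLLMCore.

```python
from pathlib import Path

# Path used by the wget command above.
MODEL_PATH = Path.home() / ".cache" / "py-llm-core" / "models" / "nuextract-tiny"


def looks_like_gguf(path: Path) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        return handle.read(4) == b"GGUF"


if __name__ == "__main__":
    if looks_like_gguf(MODEL_PATH):
        print(f"Model found at {MODEL_PATH}")
    else:
        print(f"No valid GGUF file at {MODEL_PATH}; try downloading it again")
```

If the check fails, the most likely cause is an interrupted download or an HTML error page saved in place of the model.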
Step 3: Import Required Libraries
With the model downloaded, you can now import the necessary libraries in your Python script:
from dataclasses import dataclass, field
from llm_core.parsers import NuExtractParser
Parsing and Extracting Structured Data
To parse and extract structured data from text using the NuExtract model, you need to define a data class that represents the structure of the data you want to extract. PyLLMCore will use this data class to generate a JSON schema for the extraction process.
Step 4: Define the Data Class
For this example, let's create a Product data class with fields for the product name, price, and description:
from dataclasses import dataclass

@dataclass
class Product:
    name: str = ""
    price: str = ""
    description: str = ""
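To make the idea of turning a data class into a schema more concrete, here is a minimal sketch that derives a flat, JSON-Schema-like dictionary from the field annotations. It is an illustration only: the JSON_TYPES mapping and the simple_schema helper are hypothetical, and PyLLMCore's own schema generation may work differently.

```python
from dataclasses import dataclass, fields


@dataclass
class Product:
    name: str = ""
    price: str = ""
    description: str = ""


# Hypothetical mapping from Python types to JSON Schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def simple_schema(cls) -> dict:
    """Build a flat JSON-Schema-like dict from a dataclass with basic field types."""
    return {
        "type": "object",
        "properties": {f.name: {"type": JSON_TYPES[f.type]} for f in fields(cls)},
    }


print(simple_schema(Product))
```

The sketch only handles the basic scalar types listed in JSON_TYPES; nested data classes and lists would need extra handling.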
Step 5: Load and Parse the Text
Next, you need to load the text you want to parse and extract data from. For this example, we'll use a sample product description:
text = """
Introducing the new SuperWidget! This innovative gadget is priced at .99 and offers a range of features to make your life easier. With its sleek design and user-friendly interface, the SuperWidget is perfect for tech enthusiasts and casual users alike.
"""
Step 6: Parse the Text with NuExtract
Finally, you can use the NuExtractParser class to parse the text and extract the structured data:
from llm_core.parsers import NuExtractParser

with NuExtractParser(Product) as parser:
    product = parser.parse(text)
    print(product)
The output will be an instance of the Product data class with the extracted data:
Product(
    name='SuperWidget',
    price='.99',
    description='This innovative gadget offers a range of features to make your life easier. With its sleek design and user-friendly interface, the SuperWidget is perfect for tech enthusiasts and casual users alike.'
)
Parsing Multiple Products
NuExtract can also handle a text that describes several products. Instead of changing the Product data class, we'll wrap it in a new data class that holds a list of products, then parse a longer text.
Step 7: Define the Extended Data Class
We'll create a Catalog data class that contains a list of Product instances:
@dataclass
class Catalog:
    products: list[Product] = field(default_factory=lambda: [Product()])
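The default_factory above means that an empty Catalog already contains one blank Product, so the full nested shape is visible without any data. NuExtract models are prompted with a JSON template of empty values, and serializing an empty instance yields exactly that kind of structure. Whether PyLLMCore builds its template this way internally is an assumption; the sketch below only shows what the defaults produce.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Product:
    name: str = ""
    price: str = ""
    description: str = ""


@dataclass
class Catalog:
    products: list[Product] = field(default_factory=lambda: [Product()])


# An empty Catalog already has the full nested shape, with blank values.
template = asdict(Catalog())
print(json.dumps(template, indent=4))
```

Without the single blank Product in the default list, the serialized form would be an empty list and the fields of each product would not appear.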
Step 8: Load and Parse the Extended Text
Now, let's load a text containing multiple product descriptions:
text = """
Introducing the new SuperWidget! This innovative gadget is priced at .99 and offers a range of features to make your life easier. With its sleek design and user-friendly interface, the SuperWidget is perfect for tech enthusiasts and casual users alike.
Check out the MegaGizmo! At just 2.59, this device packs a punch with its powerful performance and compact size. Ideal for on-the-go professionals and students, the MegaGizmo is a must-have.
Don't miss the UltraTool! Priced at 3.49, this versatile tool is perfect for DIY enthusiasts and professionals. With its robust build and multiple functionalities, the UltraTool is your go-to solution for any task.
"""
Step 9: Parse the Extended Text with NuExtract
Finally, use the NuExtractParser class to parse the text and extract the structured data:
with NuExtractParser(Catalog) as parser:
    catalog = parser.parse(text)
    for product in catalog.products:
        print(product)
The loop prints one Product instance per extracted product:
Product(
    name='SuperWidget',
    price='.99',
    description='This innovative gadget offers a range of features to make your life easier. With its sleek design and user-friendly interface, the SuperWidget is perfect for tech enthusiasts and casual users alike.'
)
Product(
    name='MegaGizmo',
    price='2.59',
    description='This device packs a punch with its powerful performance and compact size. Ideal for on-the-go professionals and students, the MegaGizmo is a must-have.'
)
Product(
    name='UltraTool',
    price='3.49',
    description='This versatile tool is perfect for DIY enthusiasts and professionals. With its robust build and multiple functionalities, the UltraTool is your go-to solution for any task.'
)
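When extracting several items at once, a quick sanity check can help catch entries that do not correspond to anything in the input. The sketch below is one possible check, not a PyLLMCore feature: the ungrounded_names helper is hypothetical and simply reports extracted names that never appear in the source text.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str = ""
    price: str = ""
    description: str = ""


def ungrounded_names(products: list[Product], source_text: str) -> list[str]:
    """Return extracted product names that do not appear in the source text."""
    # Case-insensitive substring match. Empty names are not flagged, since an
    # empty string is contained in any text.
    haystack = source_text.lower()
    return [p.name for p in products if p.name.lower() not in haystack]
```

For example, calling ungrounded_names(catalog.products, text) after parsing would return an empty list when every extracted name can be found in the input.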
Conclusion
Because the NuExtract-tiny model is small enough to run on your own machine, you can extract structured data from unstructured text locally with only a few lines of Python. Special thanks to NuMind for releasing the NuExtract model series, making structured extraction more accessible and efficient.