Skip to content

How to parse unstructured text using PyLLMCore ?

Official repository

Overview

We'll discover how to use PyLLMCore, a Python library that interfaces with Large Language Models API. The goal will be to parse unstructured text from B2B websites and extract valuable information like the company's name, activity domain, market, etc.

Step 1: Install PyLLMCore

First, we need to install the PyLLMCore library. You can do this by running the following command in your terminal:

shell
pip install py-llm-core

Next, you need to add your OpenAI API key to the environment.

shell
export OPENAI_API_KEY=sk-<replace with your actual api key>

Step 2: Import Required Libraries

Now that we have installed PyLLMCore, we need to import the required libraries. In your Python script, add the following lines:

python
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass
from llm_core.parsers import OpenAIParser
from typing import List

Step 3: Define the Data Class

Next, we need to define a data class that will hold the parsed data.

PyLLMCore will internally convert the dataclass into a JSON Schema to use the Function feature from OpenAI models.

For this tutorial, we will create a Company data class with fields for the company name, activity domain, market, headline, sub-headline, and call-to-actions.

Note that there is no prompt here

python
@dataclass
class Company:
    name: str
    activity_domain: str
    market: str
    headline: str
    call_to_actions: List[str]

Step 4: Load the Website HTML

We will be parsing the website https://vercel.com. To load the website HTML, we will use the requests library:

python
response = requests.get('https://vercel.com')
response.raise_for_status()
html = response.text

Step 5: Extract Text with BeautifulSoup

With the website HTML loaded, we can now extract the text. We will use BeautifulSoup to parse the HTML and extract the text:

python
# Create a BeautifulSoup object and specify the parser
soup = BeautifulSoup(html, 'html.parser')

# Use the get_text() method to extract the text
text = soup.get_text()

Step 6: Parse the Text with PyLLMCore

Finally, we can parse the extracted text with PyLLMCore.

We will use the OpenAIParser class and the gpt-3.5-turbo-16k model to parse the extracted text. The gpt-3.5-turbo-16k model is chosen here for its window size of 16 000 tokens.

python
import codecs
import llm_core.tokenizers

# Check that the content we are about to process fits the window size
assert len(codecs.encode(text, 'gpt-3.5-turbo-16k')) < 16_000

# Create an instance of OpenAIParser with the Company data class and the gpt-3.5-turbo-16k model
with OpenAIParser(Company, model='gpt-3.5-turbo-16k') as parser:
    # Use the parse() method to parse the text
    company = parser.parse(text)
    # Print the parsed data
    print(company)

The previous code prints:

python
Company(
    name='Vercel',
    activity_domain='Frontend Development',
    market='Cloud Services',
    headline='Develop. Preview. Ship.',
    call_to_actions=[
        'Start Deploying',
        'Get a Demo',
        'Join us for the Live Keynote'
    ]
)

As we have a dataclass we can easily convert the instance into a dict :

python
from dataclasses import asdict

print(asdict(company))

{
    'name': 'Vercel',
    'activity_domain': 'Frontend Development',
    'market': 'Cloud Services',
    'headline': 'Develop. Preview. Ship.',
    'call_to_actions': [
        'Start Deploying',
        'Get a Demo',
        'Join us for the Live Keynote'
    ]
}

And that's it!

In a next article, we will explain how to perform tasks with actual prompts.

Subscribe to get notified when a new tutorial is available.

Advanced Stack