How to parse unstructured text using PyLLMCore?
Official repository
Overview
We'll discover how to use PyLLMCore, a Python library that interfaces with Large Language Model APIs. The goal is to parse unstructured text from B2B websites and extract valuable information such as the company's name, activity domain, market, and more.
Step 1: Install PyLLMCore
First, we need to install the PyLLMCore library. You can do this by running the following command in your terminal:
pip install py-llm-core
Next, you need to add your OpenAI API key to the environment.
export OPENAI_API_KEY=sk-<replace with your actual api key>
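If exporting the variable in your shell isn't convenient (e.g. in a notebook), you can also set it from Python before calling the library. This is a minimal sketch; the placeholder key is just that, a placeholder:

```python
import os

# PyLLMCore reads OPENAI_API_KEY from the environment.
# setdefault() will not overwrite a key already exported in your shell.
os.environ.setdefault("OPENAI_API_KEY", "sk-<replace with your actual api key>")

# Quick sanity check before making any API calls
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
```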
Step 2: Import Required Libraries
Now that we have installed PyLLMCore, we need to import the required libraries. In your Python script, add the following lines:
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass
from llm_core.parsers import OpenAIParser
from typing import List
Step 3: Define the Data Class
Next, we need to define a data class that will hold the parsed data. PyLLMCore will internally convert the dataclass into a JSON schema to use the function calling feature of OpenAI models.

For this tutorial, we will create a Company data class with fields for the company name, activity domain, market, headline, and call-to-actions.

Note that there is no prompt here.
@dataclass
class Company:
    name: str
    activity_domain: str
    market: str
    headline: str
    call_to_actions: List[str]
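To get an intuition for what PyLLMCore derives from this dataclass, here is a rough, hand-rolled sketch of the kind of JSON schema a dataclass like Company maps to. The `sketch_schema` helper below is purely illustrative (it is not part of PyLLMCore), and the exact schema the library generates may differ:

```python
from dataclasses import dataclass, fields
from typing import List

@dataclass
class Company:
    name: str
    activity_domain: str
    market: str
    headline: str
    call_to_actions: List[str]

# Simplified mapping from Python annotations to JSON Schema types
TYPE_MAP = {str: "string", int: "integer", float: "number", bool: "boolean"}

def sketch_schema(cls):
    props = {}
    for f in fields(cls):
        if f.type == List[str]:
            props[f.name] = {"type": "array", "items": {"type": "string"}}
        else:
            props[f.name] = {"type": TYPE_MAP.get(f.type, "string")}
    return {
        "type": "object",
        "properties": props,
        "required": [f.name for f in fields(cls)],
    }

print(sketch_schema(Company))
```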
Step 4: Load the Website HTML
We will be parsing the website https://vercel.com. To load the website HTML, we will use the requests library:
response = requests.get('https://vercel.com')
response.raise_for_status()
html = response.text
Step 5: Extract Text with BeautifulSoup
With the website HTML loaded, we can now extract the text. We will use BeautifulSoup to parse the HTML and extract the text:
# Create a BeautifulSoup object and specify the parser
soup = BeautifulSoup(html, 'html.parser')
# Use the get_text() method to extract the text
text = soup.get_text()
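Note that get_text() often returns long runs of blank lines and repeated whitespace, which needlessly consumes tokens. As an optional cleanup step (not part of the original tutorial), you can collapse the whitespace before sending the text to the model; a minimal stdlib sketch:

```python
import re

def squeeze_whitespace(text: str) -> str:
    # Collapse internal runs of whitespace on each line, then drop empty lines
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

raw = "Develop.   Preview.\n\n\n  Ship.\n"
print(squeeze_whitespace(raw))  # "Develop. Preview.\nShip."
```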
Step 6: Parse the Text with PyLLMCore
Finally, we can parse the extracted text with PyLLMCore. We will use the OpenAIParser class with the gpt-3.5-turbo-16k model, chosen here for its context window of 16,000 tokens.
import codecs
# Importing llm_core.tokenizers registers model names as codecs,
# so codecs.encode() below returns the tokenized text
import llm_core.tokenizers

# Check that the content we are about to process fits the context window
assert len(codecs.encode(text, 'gpt-3.5-turbo-16k')) < 16_000

# Create an instance of OpenAIParser with the Company data class and the gpt-3.5-turbo-16k model
with OpenAIParser(Company, model='gpt-3.5-turbo-16k') as parser:
    # Use the parse() method to parse the text
    company = parser.parse(text)

# Print the parsed data
print(company)
The previous code prints:
Company(
    name='Vercel',
    activity_domain='Frontend Development',
    market='Cloud Services',
    headline='Develop. Preview. Ship.',
    call_to_actions=[
        'Start Deploying',
        'Get a Demo',
        'Join us for the Live Keynote'
    ]
)
Since we have a dataclass, we can easily convert the instance into a dict:
from dataclasses import asdict
print(asdict(company))
{
    'name': 'Vercel',
    'activity_domain': 'Frontend Development',
    'market': 'Cloud Services',
    'headline': 'Develop. Preview. Ship.',
    'call_to_actions': [
        'Start Deploying',
        'Get a Demo',
        'Join us for the Live Keynote'
    ]
}
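Because asdict() returns plain Python types, serializing the result, e.g. to store it or pass it to another service, takes one more step with the standard library. The sketch below reuses the values from the example output above to stay self-contained:

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Company:
    name: str
    activity_domain: str
    market: str
    headline: str
    call_to_actions: List[str]

company = Company(
    name='Vercel',
    activity_domain='Frontend Development',
    market='Cloud Services',
    headline='Develop. Preview. Ship.',
    call_to_actions=['Start Deploying', 'Get a Demo'],
)

# asdict() recursively converts the dataclass to dicts/lists,
# which json.dumps() can serialize directly
payload = json.dumps(asdict(company), indent=2)
print(payload)
```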
And that's it!
In a future article, we will explain how to perform tasks using actual prompts.
Subscribe to get notified when a new tutorial is available.