How to parse unstructured text using PyLLMCore?
Official repository
Overview
We'll discover how to use PyLLMCore, a Python library that interfaces with Large Language Model APIs. The goal is to parse unstructured text from B2B websites and extract valuable information such as the company's name, activity domain, market, and more.
Step 1: Install PyLLMCore
First, we need to install the PyLLMCore library. You can do this by running the following command in your terminal:
pip install py-llm-core
Next, you need to add your OpenAI API key to the environment.
export OPENAI_API_KEY=sk-<replace with your actual api key>
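If exporting the variable in your shell isn't convenient (e.g. in a notebook), you can also set it from Python before calling the library. This is a minimal sketch; the placeholder key is just that, a placeholder:

```python
import os

# PyLLMCore reads OPENAI_API_KEY from the environment.
# setdefault() will not overwrite a key already exported in your shell.
os.environ.setdefault("OPENAI_API_KEY", "sk-<replace with your actual api key>")

# Quick sanity check before making any API calls
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
```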
Step 2: Import Required Libraries
Now that we have installed PyLLMCore, we need to import the required libraries. In your Python script, add the following lines:
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass
from llm_core.parsers import OpenAIParser
from typing import List
Step 3: Define the Data Class
Next, we need to define a data class that will hold the parsed data. PyLLMCore will internally convert the dataclass into a JSON schema to use the function calling feature of OpenAI models.

For this tutorial, we will create a Company data class with fields for the company name, activity domain, market, headline, and call-to-actions.

Note that there is no prompt here.
@dataclass
class Company:
    name: str
    activity_domain: str
    market: str
    headline: str
    call_to_actions: List[str]
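To get an intuition for what PyLLMCore derives from this dataclass, here is a rough, hand-rolled sketch of the kind of JSON schema a dataclass like Company maps to. The `sketch_schema` helper below is purely illustrative (it is not part of PyLLMCore), and the exact schema the library generates may differ:

```python
from dataclasses import dataclass, fields
from typing import List

@dataclass
class Company:
    name: str
    activity_domain: str
    market: str
    headline: str
    call_to_actions: List[str]

# Simplified mapping from Python annotations to JSON Schema types
TYPE_MAP = {str: "string", int: "integer", float: "number", bool: "boolean"}

def sketch_schema(cls):
    props = {}
    for f in fields(cls):
        if f.type == List[str]:
            props[f.name] = {"type": "array", "items": {"type": "string"}}
        else:
            props[f.name] = {"type": TYPE_MAP.get(f.type, "string")}
    return {
        "type": "object",
        "properties": props,
        "required": [f.name for f in fields(cls)],
    }

print(sketch_schema(Company))
```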
Step 4: Load the Website HTML
We will be parsing the website https://vercel.com. To load the website HTML, we will use the requests library:
response = requests.get('https://vercel.com')
response.raise_for_status()
html = response.text
Step 5: Extract Text with BeautifulSoup
With the website HTML loaded, we can now extract the text. We will use BeautifulSoup to parse the HTML and extract the text:
# Create a BeautifulSoup object and specify the parser
soup = BeautifulSoup(html, 'html.parser')
# Use the get_text() method to extract the text
text = soup.get_text()
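Note that get_text() often returns long runs of blank lines and repeated whitespace, which needlessly consumes tokens. As an optional cleanup step (not part of the original tutorial), you can collapse the whitespace before sending the text to the model; a minimal stdlib sketch:

```python
import re

def squeeze_whitespace(text: str) -> str:
    # Collapse internal runs of whitespace on each line, then drop empty lines
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

raw = "Develop.   Preview.\n\n\n  Ship.\n"
print(squeeze_whitespace(raw))  # "Develop. Preview.\nShip."
```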
Step 6: Parse the Text with PyLLMCore
Finally, we can parse the extracted text with PyLLMCore. We will use the OpenAIParser class with the gpt-3.5-turbo-16k model, chosen here for its context window of 16,000 tokens.
import codecs
# Importing llm_core.tokenizers registers model names as codecs,
# so codecs.encode() below returns the tokenized text
import llm_core.tokenizers

# Check that the content we are about to process fits the context window
assert len(codecs.encode(text, 'gpt-3.5-turbo-16k')) < 16_000

# Create an instance of OpenAIParser with the Company data class and the gpt-3.5-turbo-16k model
with OpenAIParser(Company, model='gpt-3.5-turbo-16k') as parser:
    # Use the parse() method to parse the text
    company = parser.parse(text)

# Print the parsed data
print(company)
The previous code prints:
Company(
    name='Vercel',
    activity_domain='Frontend Development',
    market='Cloud Services',
    headline='Develop. Preview. Ship.',
    call_to_actions=[
        'Start Deploying',
        'Get a Demo',
        'Join us for the Live Keynote'
    ]
)
Since we have a dataclass, we can easily convert the instance into a dict:
from dataclasses import asdict
print(asdict(company))
{
    'name': 'Vercel',
    'activity_domain': 'Frontend Development',
    'market': 'Cloud Services',
    'headline': 'Develop. Preview. Ship.',
    'call_to_actions': [
        'Start Deploying',
        'Get a Demo',
        'Join us for the Live Keynote'
    ]
}
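Because asdict() returns plain Python types, serializing the result, e.g. to store it or pass it to another service, takes one more step with the standard library. The sketch below reuses the values from the example output above to stay self-contained:

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Company:
    name: str
    activity_domain: str
    market: str
    headline: str
    call_to_actions: List[str]

company = Company(
    name='Vercel',
    activity_domain='Frontend Development',
    market='Cloud Services',
    headline='Develop. Preview. Ship.',
    call_to_actions=['Start Deploying', 'Get a Demo'],
)

# asdict() recursively converts the dataclass to dicts/lists,
# which json.dumps() can serialize directly
payload = json.dumps(asdict(company), indent=2)
print(payload)
```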
And that's it!
In a future article, we will explain how to perform tasks using actual prompts.
Subscribe to get notified when a new tutorial is available.