
How to summarize a text using Chain of Density and GPT-4

This article explains, step by step, how to summarize a text with the Chain of Density prompt using Python and GPT-4.

What is Chain of Density prompting?

Chain of Density (CoD) prompting generates summaries that start out entity-sparse and become increasingly entity-dense. Each summary is produced by iteratively incorporating missing salient entities from the source text without increasing the length.

The CoD prompt enables the generation of summaries with varying levels of information density.

CoD summaries exhibit more fusion and less lead bias than GPT-4 summaries produced with a vanilla prompt. The research showed that human preferences favor the denser summaries. CoD summaries are also more abstractive.
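The iteration can be pictured as a loop. The sketch below is conceptual only: `find_missing_entities` is a hypothetical stand-in for work the model performs in-context, within a single request.

```python
# Conceptual sketch of Chain of Density (not the real prompt mechanics).
# find_missing_entities is a trivial stand-in for in-context model work.

def find_missing_entities(article_entities, summary_entities):
    # Pick up to 3 salient entities not yet covered by the summary
    return sorted(article_entities - summary_entities)[:3]

def chain_of_density(article_entities, iterations=5):
    covered = set()
    steps = []
    for _ in range(iterations):
        missing = find_missing_entities(article_entities, covered)
        covered.update(missing)  # a real model rewrites the summary here
        steps.append({
            "missing_entities": missing,
            "denser_summary": f"summary covering {sorted(covered)}",
        })
    return steps

steps = chain_of_density({"Yann LeCun", "JEPA", "world models"})
print(len(steps))  # 5
```

Each step adds entities while (in the real prompt) keeping the summary length constant; only the entity bookkeeping is shown here.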

Implement Chain of Density with minimum dependencies

To implement this technique in Python, we aim for a simple approach with minimal dependencies.

For this tutorial, we'll apply this technique to a fairly complex topic: A Path Towards Autonomous Machine Intelligence, a position paper by LeCun (2022).

We chose this paper for the following reasons:

  • The paper topic is fairly complex
  • The paper is filled with rich information
  • The length (60 pages / ~38k tokens) makes it hard to process in a single shot
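To gauge the size before calling any API, a character-based rule of thumb works: English text averages roughly 4 characters per token with OpenAI tokenizers. This is an approximation; use a tokenizer library such as tiktoken for exact counts.

```python
# Back-of-the-envelope token estimate: ~4 characters per token in English.
# This is a rule of thumb only; use tiktoken for an exact count.

def estimate_tokens(text):
    return len(text) // 4

# A 60-page paper at roughly 2,500 characters per page:
print(estimate_tokens("x" * (60 * 2500)))  # 37500
```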

Prerequisites

We'll use the PyLLMCore library as a thin wrapper on top of OpenAI GPT models.

The example from the documentation of PyLLMCore will serve as a starting point.

Create a new virtual environment and install the library (you'll need an OpenAI API key).

shell
pip install py-llm-core
pip install PyPDF2
export OPENAI_API_KEY=sk-<replace with your actual api key>

Overview

We'll cover the following steps:

  1. Convert the PDF into a text string
  2. Create chunks that GPT-4 can digest
  3. Perform the CoD

Converting the PDF file into text

python
import PyPDF2

# Open the PDF file
with open('path_to_your_pdf.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Extract the text from the PDF
    pages = []
    for page in pdf_reader.pages:
        pages.append(page.extract_text())

    text = ''.join(pages)

Clean up Unicode characters from the PDF

python
import unicodedata


def cleanup_unicode(text):
    # NFKC folds ligatures, full-width forms and other compatibility
    # characters that PDF extraction often produces; normalizing the
    # whole string also composes combining character sequences
    return unicodedata.normalize("NFKC", text)


text = cleanup_unicode(text)
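To see what NFKC normalization actually does to typical PDF artifacts (ligatures, full-width characters, non-breaking spaces), here is a small standalone check:

```python
import unicodedata

# PDF extractors often emit compatibility characters; NFKC folds
# them back to their plain equivalents.
print(unicodedata.normalize("NFKC", "ﬁnal"))       # final (fi ligature)
print(unicodedata.normalize("NFKC", "ＧＰＴ－４"))   # GPT-4 (full-width forms)
print(unicodedata.normalize("NFKC", "a\u00a0b"))   # a b   (non-breaking space)
```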

Split the text to fit the context window of GPT-4

We'll use the GPT-4 model with a context window of 8,000 tokens, which means we'll summarize approximately the first 10 pages. The whole document is 38,533 tokens long.

python
from llm_core.splitters import TokenSplitter


# Take the first 6,000-token chunk (leaving room for the completion)
splitter = TokenSplitter(chunk_size=6_000, chunk_overlap=0)

first_10_pages = next(splitter.chunkify(text))

print(first_10_pages)

Implement Chain of Density prompting with GPT-4

With the help of OpenAI's function calling feature and the PyLLMCore library, we create dataclasses to act as receptacles for our data.

Dataclasses are created with type annotations to help GPT models infer the data types.

python
from typing import List
from dataclasses import dataclass
from llm_core.assistants import OpenAIAssistant


@dataclass
class DenseSummary:
    denser_summary: str
    missing_entities: List[str]


@dataclass
class DenserSummaryCollection:
  system_prompt = """
  You are an expert in writing rich and dense summaries in broad domains.
  """

  prompt = """
  Article:
  
  {article}

  ----

  You will generate increasingly concise, entity-dense summaries of the above
  Article.

  Repeat the following 2 steps 5 times.

  - Step 1: Identify 1-3 informative Entities from the Article
  which are missing from the previously generated summary and are the most
  relevant.

  - Step 2: Write a new, denser summary of identical length which covers
  every entity and detail from the previous summary plus the missing entities

  A Missing Entity is:

  - Relevant: to the main story
  - Specific: descriptive yet concise (5 words or fewer)
  - Novel: not in the previous summary
  - Faithful: present in the Article
  - Anywhere: located anywhere in the Article

  Guidelines:
  - The first summary should be long (4-5 sentences, approx. 80 words) yet
  highly non-specific, containing little information beyond the entities
  marked as missing.

  - Use overly verbose language and fillers (e.g. "this article discusses") to
  reach approx. 80 words.

  - Make every word count: re-write the previous summary to improve flow and
  make space for additional entities.

  - Make space with fusion, compression, and removal of uninformative phrases
  like "the article discusses"

  - The summaries should become highly dense and concise yet self-contained,
  e.g., easily understood without the Article.

  - Missing entities can appear anywhere in the new summary.

  - Never drop entities from the previous summary. If space cannot be made,
  add fewer new entities.

  > Remember to use the exact same number of words for each summary.
  Answer in JSON.

  > The JSON in `summaries_per_step` should be a list (length 5) of
  dictionaries whose keys are "missing_entities" and "denser_summary".

  """

  summaries: List[DenseSummary]


  @classmethod
  def summarize(cls, article):
      with OpenAIAssistant(cls, model='gpt-4') as assistant:
          return assistant.process(article=article)

This code defines two Python dataclasses, DenseSummary and DenserSummaryCollection, to structure the data for a task that involves generating increasingly concise, entity-dense summaries of a given article.

Here's a breakdown of the code:

  • DenseSummary: stores one denser summary together with the entities added at that step
  • DenserSummaryCollection: stores the collection of DenseSummary objects, one per iteration
  • summarize: uses an OpenAIAssistant, initialized with the class itself, to process the article and generate the summaries. It returns a populated DenserSummaryCollection instance.

Now we can print the results for each iteration. Note that the iterations happen in-context: these are not separate requests.

python
>>> summary_collection = DenserSummaryCollection.summarize(first_10_pages)
>>> print(len(summary_collection.summaries))
5
>>> print(summary_collection.summaries[0].missing_entities)
['Yann LeCun', 'autonomous intelligent agents', 'self-supervised learning']

>>> print(summary_collection.summaries[0].denser_summary)
This article discusses a position paper by Yann LeCun, which proposes an
architecture for autonomous intelligent agents.

The paper combines concepts such as configurable predictive world models, 
behavior driven through intrinsic motivation, and hierarchical joint embedding
architectures trained with self-supervised learning.

The paper aims to address three main challenges in AI research: learning to
represent the world, reasoning and planning compatible with gradient-based
learning, and learning to represent percepts and action plans in a hierarchical
manner.


>>> print(summary_collection.summaries[1].missing_entities)
['JEPA', 'Hierarchical JEPA', 'non-generative architectures']

>>> print(summary_collection.summaries[1].denser_summary)
Yann LeCun's position paper proposes an architecture for autonomous
intelligent agents, combining concepts like configurable predictive world
models and behavior driven by intrinsic motivation.

It addresses three AI research challenges: learning world representation,
reasoning and planning compatible with gradient-based learning, and
hierarchical representation of percepts and action plans.

The paper also introduces JEPA and Hierarchical JEPA,
non-generative architectures for predictive world models.
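The collection can also be walked programmatically. Here is a minimal illustration with hand-built data (in practice the objects come back populated from `DenserSummaryCollection.summarize`):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DenseSummary:
    denser_summary: str
    missing_entities: List[str]

# Hand-built stand-ins for two CoD steps
summaries = [
    DenseSummary("This article discusses a position paper ...", ["Yann LeCun"]),
    DenseSummary("Yann LeCun's position paper proposes ...", ["JEPA"]),
]

for step, s in enumerate(summaries):
    words = len(s.denser_summary.split())
    print(f"Step {step}: added {s.missing_entities}, {words} words")
```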

Further notes: CoD performance with GPT-3.5 vs. GPT-4

Several tests with the OpenAI models gpt-3.5-turbo, gpt-3.5-turbo-16k and gpt-4 led to interesting results:

The GPT-3.5 models produce a good first summary but quickly lose track of entities. Furthermore, they clearly lack the ability to identify what's missing. In further tests, I realized that GPT-3.5 models are unable to compute intersections between sets.
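For context, the bookkeeping that CoD demands is essentially set arithmetic over entities: trivial in code, but apparently hard for GPT-3.5 to perform in-context (my observation from these tests, not a benchmarked claim).

```python
# The "identify missing entities" step, expressed as plain set arithmetic:
previous_summary_entities = {"Yann LeCun", "autonomous agents"}
article_entities = {"Yann LeCun", "autonomous agents", "JEPA", "world models"}

missing = article_entities - previous_summary_entities
print(sorted(missing))  # ['JEPA', 'world models']
```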

I wouldn't recommend using GPT-3.5 models.

On the contrary, the GPT-4 model performs very well with this chain of density prompting technique.

Conclusion and next steps

The GPT-4 Summarization with Chain of Density Prompting paper is quite recent; still, working out a simple implementation only took a few lines of code.

You might be interested in reading Mistral AI Instruct v0.1 model evaluation in real world use cases.
