How to reduce hallucinations using Chain Of Verification in Large Language Models
What is Chain of Verification (CoVe)?
Chain of Verification is a technique described in the paper Chain-of-Verification Reduces Hallucination in Large Language Models by Shehzaad Dhuliawala et al. (2023). It combines prompting with consistency checks performed by the LLM itself.
This study looks at how to stop large language models from creating believable but incorrect information, a problem known as hallucination. The researchers created a method called Chain-of-Verification (CoVe) where the model first makes a draft response, then creates questions to check the facts in its draft, answers these questions without bias, and finally produces a final, verified response. The study found that CoVe reduced hallucinations in various tasks.
However, there is a key assumption from the study team I wanted to mention:
"The language model, when suitably prompted, can both generate and execute a plan of how to verify itself in order to check its own work, and finally incorporate this analysis into an improved response."
We'll see how to implement this technique using PyLLMCore.
EDIT(2023-10-02): Updated guide
When no content is available to ground an answer, asking the LLM is very likely to produce hallucinations.
Overview of the chain process
We'll cover the following steps:
- Generate a baseline: This will be the first draft
- Plan verifications: Write questions related to the baseline
- Fact-checking: Question answering with original content
- Final response: Write a verified response (a self-verified response, we won't use any external tools or internet access)
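Before diving into the implementation, the four steps above can be sketched as a plain pipeline. This is a minimal, model-agnostic sketch: `call_llm` and `chain_of_verification` are hypothetical names standing in for any completion function, not part of PyLLMCore.

```python
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here.
    return f"[model output for: {prompt[:40]}...]"


def chain_of_verification(question: str, content: str) -> str:
    # 1. Generate a baseline: the first draft answer
    baseline = call_llm(f"Content: {content}\nAnswer: {question}")
    # 2. Plan verifications: derive questions from the draft
    verifications = call_llm(f"Write questions checking the facts in: {baseline}")
    # 3. Fact-checking: answer the questions against the original content
    factored_qa = call_llm(f"Content: {content}\nAnswer: {verifications}")
    # 4. Final response: revise the draft using the checked facts
    return call_llm(
        f"First source: {content}\nSecond source: {factored_qa}\nAnswer: {question}"
    )
```

The sections below implement each of these steps with real prompts and structured outputs.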
Use case: Focus on cases where hallucinations are frequent
To illustrate the technique, we will focus on cases where hallucinations are the most frequent:
- Lack of context: when the model is given incomplete or ambiguous information, it may fill in the gaps with incorrect or unrelated information
- Complex queries: when asked to generate complex or highly specific responses, the model may generate information that seems plausible but is not accurate
- Long conversations: The longer the conversation or text generation, the higher the chance of the model generating incorrect or unrelated information
We will focus on a task known to be difficult for the GPT-3.5 model: answering questions when information is missing (see Summarizing a text using the Chain of Density Prompting).
We'll use fresh content that is not in the training set: the following article from Willy C. Shih and Ali Shakouri: How Smaller Manufacturers Can Upgrade Their Tech (not reproduced here for obvious copyright reasons)
Let's get insights from Willy C. Shih and Ali Shakouri.
Prerequisites
Create a new virtual environment and install the library (An OpenAI API key is required).
pip install py-llm-core
export OPENAI_API_KEY=sk-<replace with your actual api key>
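As a quick sanity check before running the examples, you can fail fast when the key is not configured. This `require_api_key` helper is hypothetical (not part of py-llm-core), shown only to make the prerequisite explicit:

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key, or raise if it is not configured."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before running the examples.")
    return key
```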
Implementing the baseline generation
from typing import List
from dataclasses import dataclass

from llm_core.assistants import OpenAIAssistant


@dataclass
class ContentAnalysisBaseline:
    """
    This will be the structure to generate and store the baseline answers
    """
    system_prompt = """You are a knowledgeable and helpful assistant"""
    prompt = """
    Content: {content}
    ----
    Using only the previous Content, answer the following:
    {question}
    """
    answer: str

    @classmethod
    def ask(cls, question, content, model="gpt-3.5-turbo"):
        with OpenAIAssistant(cls, model) as assistant:
            return assistant.process(question=question, content=content)
First we check the content length:
>>> from llm_core.splitters import TokenSplitter
>>> splitter = TokenSplitter()
>>> splitter.compute_token_count(content)
2088
Good, the gpt-3.5-turbo model has a context window of 4,096 tokens, so we can proceed.
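If TokenSplitter is not at hand, a rough rule of thumb (about 4 characters per token for English text) gives a quick estimate. These helpers are illustrative approximations, not the real tokenizer:

```python
def approximate_token_count(text: str) -> int:
    # ~4 characters per token is a common heuristic for English text.
    return max(1, len(text) // 4)


def fits_context(text: str, context_window: int = 4096, reserved: int = 1024) -> bool:
    # Keep headroom for the prompt template and the model's completion.
    return approximate_token_count(text) <= context_window - reserved
```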
question = "How automation can be implemented for small manufacturers ?"
baseline = ContentAnalysisBaseline.ask(question, content)
print(baseline.answer)
Here's the output:
Automation can be implemented for small manufacturers by starting
with small projects and learning from them. SMEs can seek help from
organizations such as manufacturing extension partnerships (MEPs)
to get guidance and support.
They can also consider implementing simple automation tools like
collaborative robots (cobots) that can work alongside human workers.
By automating manual and repetitive tasks, SMEs can improve product
quality, reduce variability, and free up workers to focus on more
productive and desirable work.
It's important for SMEs to stay up to date on advances in production
tools and methods and leverage assistance programs from universities
and outside resources to learn and adopt new technologies.
Implementing verifications by generating questions
@dataclass
class BaselineVerification:
    """
    This will be the structure to generate and store the verifications
    """
    system_prompt = """You are a knowledgeable and helpful assistant"""
    prompt = """
    Content: {content}
    --
    Only using the previous content, provide a set of insightful questions
    """
    questions: List[str]

    @classmethod
    def control(cls, content, model="gpt-3.5-turbo"):
        with OpenAIAssistant(cls, model) as assistant:
            return assistant.process(content=content)
Make the LLM ask verification questions:
verification = BaselineVerification.control(baseline.answer)
print('\n\n'.join(verification.questions))
Here's the list of questions generated:
- What are some examples of small automation projects
that SMEs can start with?
- How can SMEs seek help from manufacturing extension
partnerships (MEPs) ?
- What are collaborative robots (cobots) and how can
they be used in small manufacturing?
- What are the benefits of automating manual and repetitive
tasks for SMEs?
- How can SMEs stay up to date on advances in production
tools and methods?
- What assistance programs are available for SMEs to learn
and adopt new technologies?
Implementing the Factor + Revise approach
In this step, we ask the LLM to answer each verification question individually, using the original content.
@dataclass
class SelfVerification:
    prompt = """
    Content: {content}
    --
    Using the Content, provide an answer to the following:
    {question}
    """
    answer: str

    @classmethod
    def ask(cls, question, content, model="gpt-3.5-turbo"):
        with OpenAIAssistant(cls, model) as assistant:
            return assistant.process(question=question, content=content)
Answer each verification question against the original content:
template = """
Q: {}
A: {}
"""
# This is a dictionary containing the question/answer as a key/value
factored_questions = {}
for question in verification.questions:
factored_questions[question] = SelfVerification.ask(question, content).answer
factored_qa = '\n'.join(
(template.format(k, v) for k, v in factored_questions.items())
)
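To see what this flattening produces, here is a toy example with a single made-up Q/A pair (the data is hypothetical, for illustration only):

```python
template = """
Q: {}
A: {}
"""

# Hypothetical question/answer pair, for illustration only
toy_questions = {
    "What is a cobot?": "A robot designed to work alongside human workers.",
}

toy_qa = '\n'.join(template.format(k, v) for k, v in toy_questions.items())
```

The resulting string interleaves each question with its answer, ready to be passed to the revision prompt as a second source.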
We can inspect the questions / answers produced (truncated):
> Q: What are some examples of small automation projects
that SMEs can start with?
> A: Some examples of small automation projects that SMEs
can start with include implementing collaborative robots
(cobots) for tasks like polishing or assembly, installing
power-consumption sensors to monitor usage patterns, and
using sound sensors and machine learning to detect quality
issues during operations.
> Q: How can manufacturing extension partnerships (MEPs) help
SMEs in implementing automation?
> A: Manufacturing extension partnerships (MEPs) can help SMEs
in implementing automation by providing assistance, guidance,
and resources.
They can help SMEs identify automation opportunities,
assess the feasibility and benefits of automation, and develop
implementation plans.
MEPs can also provide training and education on automation
technologies and help SMEs access funding or grants for
automation projects.
Additionally, MEPs can connect SMEs with experts, consultants,
and technology providers who can support them in implementing
automation solutions.
Overall, MEPs play a crucial role in helping SMEs navigate
the complexities of automation and ensure successful implementation.
continued...
Now, we proceed to revise the baseline:
@dataclass
class RevisedBaseline:
    system_prompt = """You are a knowledgeable and helpful assistant"""
    prompt = """
    First source: {content}
    ----
    Second source: {factored_qa}
    ----
    Based on facts that are consistent between First source and Second source,
    provide an answer to the following question:
    {question}
    """
    answer: str

    @classmethod
    def ask(cls, question, content, factored_qa, model="gpt-3.5-turbo"):
        with OpenAIAssistant(cls, model) as assistant:
            return assistant.process(
                question=question, content=content, factored_qa=factored_qa
            )
We test with the original question:
question = "How automation can be implemented for small manufacturers ?"
answer = RevisedBaseline.ask(question, content, factored_qa).answer
print(answer)
Here's the result:
Automation can be implemented for small manufacturers through
various methods.
They can start with small automation projects such as implementing
collaborative robots (cobots) for tasks like polishing or assembly,
installing power-consumption sensors to monitor usage patterns, and
using sound sensors and machine learning to detect quality issues
during operations.
Manufacturing extension partnerships (MEPs) can provide assistance,
guidance, and resources to help SMEs identify automation opportunities,
assess feasibility, develop implementation plans, and access funding
or grants.
Staying up to date on advances in production tools and methods,
leveraging assistance programs from universities and outside resources,
and building peer relationships can also support the implementation of
automation for small manufacturers.
Comparing the baseline answer and the revised one side by side, we can see a quality improvement: more facts are factored into the revised answer.
Wrap-up
To test the complete chain, here is the wrapper code:
class COVQuestionAnswering:
    @classmethod
    def ask(cls, question, content, model="gpt-3.5-turbo"):
        baseline = ContentAnalysisBaseline.ask(question, content)
        verification = BaselineVerification.control(baseline.answer)

        factored_questions = {}

        # Use a distinct loop variable so the original `question` is not
        # overwritten before the final revision step
        for verification_question in verification.questions:
            factored_questions[verification_question] = SelfVerification.ask(
                verification_question, content
            ).answer

        factored_qa = '\n'.join(
            template.format(k, v) for k, v in factored_questions.items()
        )

        answer = RevisedBaseline.ask(question, content, factored_qa).answer
        return answer
answer = COVQuestionAnswering.ask(
    'What are the risks for SME regarding their tech ?',
    content
)

print(answer)
The potential disadvantages for SMEs that have not kept up with changing
technologies include inefficiency and lower productivity, limited
competitiveness, higher costs, difficulty in meeting customer demands,
and limited growth opportunities.
Without adopting new technologies, SMEs may struggle to compete,
experience higher costs, and be unable to meet evolving customer expectations.
They may also miss out on growth opportunities and be less efficient
compared to competitors who have embraced technological advancements.
Conclusion and next steps
I am amazed by the quality of the outputs from gpt-3.5-turbo using this chain when answering questions on information-rich, dense content. I had almost abandoned gpt-3.5 because of its limited reasoning capabilities.
A topic I didn't cover here (maybe for another post) is the cost analysis of this approach versus using more capable models (or even less capable ones).
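As a starting point for that analysis, note that one CoVe run as implemented above makes a number of API calls that grows linearly with the number of verification questions, and each call resends the full content. A rough sketch (the function name is hypothetical):

```python
def cove_call_count(n_verification_questions: int) -> int:
    # 1 baseline draft + 1 question-generation call
    # + one fact-checking call per verification question
    # + 1 final revision call
    return 3 + n_verification_questions
```

For instance, the six questions generated earlier imply nine calls in total, compared to a single call for a plain question-answering prompt.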