How to reduce hallucinations using Chain Of Verification in Large Language Models
What is Chain of Verification (CoVe)?
Chain of Verification is a technique described in the paper Chain-of-Verification Reduces Hallucination in Large Language Models by Shehzaad Dhuliawala et al. (2023). It combines prompting with consistency checks performed by the LLM itself.
This study looks at how to stop large language models from creating believable but incorrect information, a problem known as hallucination. The researchers created a method called Chain-of-Verification (CoVe) where the model first makes a draft response, then creates questions to check the facts in its draft, answers these questions without bias, and finally produces a final, verified response. The study found that CoVe reduced hallucinations in various tasks.
However, there is a key assumption from the study team I wanted to mention:
"The language model, when suitably prompted, can both generate and execute a plan of how to verify itself in order to check its own work, and finally incorporate this analysis into an improved response."
We'll see how to implement this technique using PyLLMCore.
EDIT(2023-10-02): Updated guide
When no content is available to ground an answer, asking the LLM is very likely to produce hallucinations.
Overview of the chain process
We'll cover the following steps:
- Generate a baseline: This will be the first draft
- Plan verifications: Write questions related to the baseline
- Fact-checking: Question answering with original content
- Final response: Write a verified response (a self-verified response, we won't use any external tools or internet access)
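Before diving into the implementation, the four steps above can be sketched as a plain pipeline. This is a minimal, model-agnostic sketch: `call_llm` and `chain_of_verification` are hypothetical names standing in for any completion function, not part of PyLLMCore.

```python
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here.
    return f"[model output for: {prompt[:40]}...]"


def chain_of_verification(question: str, content: str) -> str:
    # 1. Generate a baseline: the first draft answer
    baseline = call_llm(f"Content: {content}\nAnswer: {question}")
    # 2. Plan verifications: derive questions from the draft
    verifications = call_llm(f"Write questions checking the facts in: {baseline}")
    # 3. Fact-checking: answer the questions against the original content
    factored_qa = call_llm(f"Content: {content}\nAnswer: {verifications}")
    # 4. Final response: revise the draft using the checked facts
    return call_llm(
        f"First source: {content}\nSecond source: {factored_qa}\nAnswer: {question}"
    )
```

The sections below implement each of these steps with real prompts and structured outputs.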
Use case: Focus on cases where hallucinations are frequent
To illustrate the technique, we will focus on cases where hallucinations are the most frequent:
- Lack of context: when the model is given incomplete or ambiguous information, it may fill in the gaps with incorrect or unrelated information
- Complex queries: when asked to generate complex or highly specific responses, the model may generate information that seems plausible but is not accurate
- Long conversations: The longer the conversation or text generation, the higher the chance of the model generating incorrect or unrelated information
We will focus on a task known to be difficult for the GPT-3.5 model: answering questions when information is missing (see Summarizing a text using the Chain of Density Prompting).
We'll use fresh content that is not in the training set: the following article from Willy C. Shih and Ali Shakouri: How Smaller Manufacturers Can Upgrade Their Tech (not reproduced here for obvious copyright reasons)
Let's get insights from Willy C. Shih and Ali Shakouri.
Prerequisites
Create a new virtual environment and install the library (An OpenAI API key is required).
pip install py-llm-core
export OPENAI_API_KEY=sk-<replace with your actual api key>
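As a quick sanity check before running the examples, you can fail fast when the key is not configured. This `require_api_key` helper is hypothetical (not part of py-llm-core), shown only to make the prerequisite explicit:

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key, or raise if it is not configured."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before running the examples.")
    return key
```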
Implementing the baseline generation
from typing import List
from dataclasses import dataclass

from llm_core.assistants import OpenAIAssistant


@dataclass
class ContentAnalysisBaseline:
    """
    This will be the structure to generate and store the baseline answers
    """
    system_prompt = """You are a knowledgeable and helpful assistant"""
    prompt = """
    Content: {content}
    ----
    Using only the previous Content, answer the following:
    {question}
    """
    answer: str

    @classmethod
    def ask(cls, question, content, model="gpt-3.5-turbo"):
        with OpenAIAssistant(cls, model) as assistant:
            return assistant.process(question=question, content=content)
First we check the content length:
>>> from llm_core.splitters import TokenSplitter
>>> splitter = TokenSplitter()
>>> splitter.compute_token_count(content)
2088
Good, the gpt-3.5-turbo model has a context window of 4,096 tokens, so we can proceed.
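If TokenSplitter is not at hand, a rough rule of thumb (about 4 characters per token for English text) gives a quick estimate. These helpers are illustrative approximations, not the real tokenizer:

```python
def approximate_token_count(text: str) -> int:
    # ~4 characters per token is a common heuristic for English text.
    return max(1, len(text) // 4)


def fits_context(text: str, context_window: int = 4096, reserved: int = 1024) -> bool:
    # Keep headroom for the prompt template and the model's completion.
    return approximate_token_count(text) <= context_window - reserved
```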
question = "How automation can be implemented for small manufacturers ?"
baseline = ContentAnalysisBaseline.ask(question, content)
print(baseline.answer)
Here's the output:
Automation can be implemented for small manufacturers by starting
with small projects and learning from them. SMEs can seek help from
organizations such as manufacturing extension partnerships (MEPs)
to get guidance and support.
They can also consider implementing simple automation tools like
collaborative robots (cobots) that can work alongside human workers.
By automating manual and repetitive tasks, SMEs can improve product
quality, reduce variability, and free up workers to focus on more
productive and desirable work.
It's important for SMEs to stay up to date on advances in production
tools and methods and leverage assistance programs from universities
and outside resources to learn and adopt new technologies.
Implementing verifications by generating questions
@dataclass
class BaselineVerification:
    """
    This will be the structure to generate and store the verifications
    """
    system_prompt = """You are a knowledgeable and helpful assistant"""
    prompt = """
    Content: {content}
    --
    Only using the previous content, provide a set of insightful questions
    """
    questions: List[str]

    @classmethod
    def control(cls, content, model="gpt-3.5-turbo"):
        with OpenAIAssistant(cls, model) as assistant:
            return assistant.process(content=content)
Make the LLM ask verification questions:
verification = BaselineVerification.control(baseline.answer)
print('\n\n'.join(verification.questions))
Here's the list of questions generated:
- What are some examples of small automation projects
that SMEs can start with?
- How can SMEs seek help from manufacturing extension
partnerships (MEPs) ?
- What are collaborative robots (cobots) and how can
they be used in small manufacturing?
- What are the benefits of automating manual and repetitive
tasks for SMEs?
- How can SMEs stay up to date on advances in production
tools and methods?
- What assistance programs are available for SMEs to learn
and adopt new technologies?
Implementing the Factor + Revise approach
In this step, we ask the LLM to answer each verification question individually, using the original content.
@dataclass
class SelfVerification:
    prompt = """
    Content: {content}
    --
    Using the Content, provide an answer to the following:
    {question}
    """
    answer: str

    @classmethod
    def ask(cls, question, content, model="gpt-3.5-turbo"):
        with OpenAIAssistant(cls, model) as assistant:
            return assistant.process(question=question, content=content)
Answer each verification question against the original content:
template = """
Q: {}
A: {}
"""
# This is a dictionary containing the question/answer as a key/value
factored_questions = {}
for question in verification.questions:
factored_questions[question] = SelfVerification.ask(question, content).answer
factored_qa = '\n'.join(
(template.format(k, v) for k, v in factored_questions.items())
)
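To see what this flattening produces, here is a toy example with a single made-up Q/A pair (the data is hypothetical, for illustration only):

```python
template = """
Q: {}
A: {}
"""

# Hypothetical question/answer pair, for illustration only
toy_questions = {
    "What is a cobot?": "A robot designed to work alongside human workers.",
}

toy_qa = '\n'.join(template.format(k, v) for k, v in toy_questions.items())
```

The resulting string interleaves each question with its answer, ready to be passed to the revision prompt as a second source.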
We can inspect the questions / answers produced (truncated):
> Q: What are some examples of small automation projects
that SMEs can start with?
> A: Some examples of small automation projects that SMEs
can start with include implementing collaborative robots
(cobots) for tasks like polishing or assembly, installing
power-consumption sensors to monitor usage patterns, and
using sound sensors and machine learning to detect quality
issues during operations.
> Q: How can manufacturing extension partnerships (MEPs) help
SMEs in implementing automation?
> A: Manufacturing extension partnerships (MEPs) can help SMEs
in implementing automation by providing assistance, guidance,
and resources.
They can help SMEs identify automation opportunities,
assess the feasibility and benefits of automation, and develop
implementation plans.
MEPs can also provide training and education on automation
technologies and help SMEs access funding or grants for
automation projects.
Additionally, MEPs can connect SMEs with experts, consultants,
and technology providers who can support them in implementing
automation solutions.
Overall, MEPs play a crucial role in helping SMEs navigate
the complexities of automation and ensure successful implementation.
continued...
Now, we proceed to revise the baseline:
@dataclass
class RevisedBaseline:
    system_prompt = """You are a knowledgeable and helpful assistant"""
    prompt = """
    First source: {content}
    ----
    Second source: {factored_qa}
    ----
    Based on facts that are consistent between First source and Second source,
    provide an answer to the following question:
    {question}
    """
    answer: str

    @classmethod
    def ask(cls, question, content, factored_qa, model="gpt-3.5-turbo"):
        with OpenAIAssistant(cls, model) as assistant:
            return assistant.process(
                question=question, content=content, factored_qa=factored_qa
            )
We test with the original question:
question = "How automation can be implemented for small manufacturers ?"
answer = RevisedBaseline.ask(question, content, factored_qa).answer
print(answer)
Here's the result:
Automation can be implemented for small manufacturers through
various methods.
They can start with small automation projects such as implementing
collaborative robots (cobots) for tasks like polishing or assembly,
installing power-consumption sensors to monitor usage patterns, and
using sound sensors and machine learning to detect quality issues
during operations.
Manufacturing extension partnerships (MEPs) can provide assistance,
guidance, and resources to help SMEs identify automation opportunities,
assess feasibility, develop implementation plans, and access funding
or grants.
Staying up to date on advances in production tools and methods,
leveraging assistance programs from universities and outside resources,
and building peer relationships can also support the implementation of
automation for small manufacturers.
Comparing the baseline answer and the revised one side by side, we can see a quality improvement: more facts are factored into the revised answer.
Wrap-up
To test the complete chain, here is the wrapper code:
class COVQuestionAnswering:
    @classmethod
    def ask(cls, question, content, model="gpt-3.5-turbo"):
        baseline = ContentAnalysisBaseline.ask(question, content)
        verification = BaselineVerification.control(baseline.answer)

        factored_questions = {}

        # Use a distinct loop variable so the original `question` is not
        # overwritten before the final revision step
        for verification_question in verification.questions:
            factored_questions[verification_question] = SelfVerification.ask(
                verification_question, content
            ).answer

        factored_qa = '\n'.join(
            template.format(k, v) for k, v in factored_questions.items()
        )

        answer = RevisedBaseline.ask(question, content, factored_qa).answer
        return answer
answer = COVQuestionAnswering.ask(
    'What are the risks for SME regarding their tech ?',
    content
)

print(answer)
The potential disadvantages for SMEs that have not kept up with changing
technologies include inefficiency and lower productivity, limited
competitiveness, higher costs, difficulty in meeting customer demands,
and limited growth opportunities.
Without adopting new technologies, SMEs may struggle to compete,
experience higher costs, and be unable to meet evolving customer expectations.
They may also miss out on growth opportunities and be less efficient
compared to competitors who have embraced technological advancements.
Conclusion and next steps
I am amazed by the quality of the outputs from gpt-3.5-turbo using this chain when answering questions on information-rich, dense content. I had almost abandoned gpt-3.5 because of its limited reasoning capabilities.
A topic I didn't cover here (maybe for another post) is the cost analysis of this approach versus using more capable models (or even less capable ones).
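As a starting point for that analysis, note that one CoVe run as implemented above makes a number of API calls that grows linearly with the number of verification questions, and each call resends the full content. A rough sketch (the function name is hypothetical):

```python
def cove_call_count(n_verification_questions: int) -> int:
    # 1 baseline draft + 1 question-generation call
    # + one fact-checking call per verification question
    # + 1 final revision call
    return 3 + n_verification_questions
```

For instance, the six questions generated earlier imply nine calls in total, compared to a single call for a plain question-answering prompt.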