Factuality in LLMs: Key Metrics, Challenges & Improvement Strategies
Anjali Chaudhary
8 min read
- LLM training and enhancement

As large language models (LLMs) become integral to business workflows across industries, ensuring their factual accuracy is crucial. LLMs often generate content for applications in fields like healthcare, finance, and law, where misinformation can lead to serious consequences. However, achieving reliable factual outputs from LLMs is challenging: training datasets are vast and imperfect, models hallucinate, and verifying the accuracy of responses in real time is difficult.
What is factuality in LLMs?
Factuality in LLMs refers to their ability to generate content that aligns with accurate, verifiable information based on trustworthy sources such as encyclopedias, textbooks, or reputable knowledge databases.
Why LLM factuality matters
Factuality is essential for maintaining the integrity of LLMs across various fields, from general knowledge to domain-specific applications like healthcare or law. For example, a physician relying on an LLM for medical advice risks making decisions that could endanger a patient's health if the model generates false information. Similarly, businesses could make costly strategic mistakes if they base decisions on inaccurate insights produced by an LLM.
Factual errors can result in not only operational risks but also legal and reputational damage. For example, an Australian mayor considered taking legal action after ChatGPT falsely accused him of bribery. This incident highlights how misinformation generated by LLMs can lead to defamation and damage a person’s reputation.
As LLMs are further integrated into critical systems like autonomous vehicles, ensuring factual accuracy becomes even more important, as a single error could lead to disastrous outcomes.
Factuality vs. hallucinations
Factual errors involve incorrect or misleading real-world data, whereas hallucinations involve fabricated content that is not grounded in any factual basis. Hallucinations often occur when the model tries to fill in gaps or when it encounters topics outside its domain of knowledge. For example, if an LLM is asked about the biography of a historical figure like Albert Einstein:
- Factuality example: An LLM might state that Einstein was awarded the Nobel Prize in Physics in 1932 (it was actually 1921), a factual error due to incorrect data.
- Hallucination example: In another scenario, the LLM might claim that Einstein was also a talented painter and sculptor, a completely fabricated statement with no real basis.
While both issues can undermine trust in LLM-generated content, they are distinct challenges and addressing them requires different approaches.
LLM factuality evaluation metrics
Evaluating the factual accuracy of LLMs requires a set of tailored metrics that help identify factual errors, measure the reliability of outputs, and guide improvements to enhance accuracy. Below are some commonly used LLM factuality evaluation metrics:
- Exact Match (EM): It measures how often an LLM-generated response exactly matches a reference answer. This metric is particularly useful in tasks like question-answering and machine translation, where precise answers are expected. Because EM is a strict comparison against the reference, it can be too rigid in scenarios where close approximations or paraphrased answers are acceptable (see the sketch after this list).
- Perplexity: It quantifies how well a model predicts a sample, typically a piece of text. A lower perplexity score indicates better performance in predicting the next word in a sequence. While useful for quantitative assessment, it doesn't capture qualitative aspects such as coherence or relevance and is often paired with other metrics for a more comprehensive evaluation.
- Human evaluation: Human reviewers can judge the nuance, context, and real-world relevance of the model’s output, identifying errors that automated metrics might overlook. Human assessments are often used alongside automated metrics for a comprehensive evaluation of LLM factuality.
- TruthfulQA: It is designed to evaluate how well an LLM avoids generating misleading or incorrect answers to general knowledge questions. It focuses on identifying common misconceptions and tests whether the model's responses align with verifiable facts. This benchmark is particularly useful in open-ended tasks where factual consistency is crucial.
- FactScore: It assesses the factual precision of LLM outputs by breaking down content into atomic facts and checking their correctness, allowing for fine-grained analysis. FactScore is commonly used to assess long-form text, such as summaries or biographies, where individual factual details matter.
- Precision, Recall, and F1 Score:
a. Precision: It evaluates the proportion of correct facts out of the total facts generated by the model. Higher precision means fewer irrelevant or false facts.
b. Recall: It measures the proportion of relevant facts captured by the model out of the total possible correct facts. A high recall score indicates that the model covers the necessary information.
c. F1 Score: It balances precision and recall, providing a harmonic mean. It is particularly valuable in situations where both false positives (incorrect facts) and false negatives (missed facts) are equally important.
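To make these metrics concrete, here is a minimal Python sketch of how Exact Match, perplexity, and fact-level precision, recall, and F1 might be computed. The normalization rule, the toy data, and the assumption that outputs have already been split into atomic facts are illustrative simplifications, not any standard library's API.

```python
import math

def _normalize(s: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings still match."""
    return " ".join(s.lower().split())

def exact_match(prediction: str, reference: str) -> bool:
    """Strict match against the reference after light normalization."""
    return _normalize(prediction) == _normalize(reference)

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities: exp of the negative mean."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def fact_prf1(generated_facts: set[str], reference_facts: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 treating each atomic fact as a single item."""
    correct = len(generated_facts & reference_facts)
    precision = correct / len(generated_facts) if generated_facts else 0.0
    recall = correct / len(reference_facts) if reference_facts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

if __name__ == "__main__":
    print(exact_match("1921", " 1921 "))                    # True
    print(round(perplexity([-1.2, -0.4, -2.0]), 2))          # lower is better
    generated = {"born in 1879", "won the nobel prize in 1921", "was a sculptor"}  # last fact is false
    reference = {"born in 1879", "won the nobel prize in 1921", "developed relativity"}
    print(fact_prf1(generated, reference))                   # precision = recall = f1 ≈ 0.67
```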
Causes of factual errors in LLMs
Factual errors in LLMs arise from various underlying causes related to their training, architecture, and operational environment. Below are some common sources of factual inaccuracies:
- Inaccurate or outdated training data: LLMs are trained on vast datasets scraped from the web, which may include inaccurate, outdated, or incomplete information. When these sources are used during LLM training, they can generate content based on misinformation or outdated facts.
Moreover, even when the dataset contains accurate information, the model’s ability to retain and prioritize this knowledge is limited, which can lead to the model referencing outdated data instead of more current facts.
- Ambiguity and lack of specificity: In many cases, factual errors occur when the model is prompted with ambiguous or poorly phrased queries. LLMs may interpret such prompts in multiple ways, leading to inaccurate or incomplete responses.
- Limitations in retrieval and knowledge retention: LLMs are not inherently connected to real-time knowledge sources unless augmented with retrieval mechanisms like Retrieval-Augmented Generation (RAG). As a result, such models rely solely on their pre-trained knowledge and may provide inaccurate information on topics that require up-to-date or specific data (a minimal RAG sketch follows this list).
- Overgeneralization: When the model encounters unfamiliar concepts, it might generate responses based on the closest related patterns, even if those patterns don't accurately represent the specific facts needed. This overgeneralization can result in factually incorrect statements, especially when dealing with niche or domain-specific information.
- Errors in knowledge integration: LLMs integrate knowledge from various sources during training. When these sources offer conflicting information, the model may struggle to reconcile differences, leading to errors, especially when generating complex or multi-layered facts.
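Because the retrieval limitation above is typically addressed with Retrieval-Augmented Generation, the following minimal sketch illustrates the core idea: retrieve the passages most relevant to a query and prepend them to the prompt so the model answers from grounded, current context. The `embed` and `generate` callables and the in-memory document list are placeholders for whichever embedding model, LLM endpoint, and vector store you actually use.

```python
# Minimal, framework-agnostic sketch of Retrieval-Augmented Generation (RAG).
from typing import Callable

def retrieve(query: str, documents: list[str],
             embed: Callable[[str], list[float]], k: int = 3) -> list[str]:
    """Rank documents by cosine similarity to the query embedding and return the top k."""
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(embed(d), query_vec), reverse=True)
    return ranked[:k]

def rag_answer(query: str, documents: list[str], embed, generate) -> str:
    """Ground the answer in retrieved passages instead of parametric memory alone."""
    context = "\n".join(retrieve(query, documents, embed))
    prompt = (
        "Answer using only the context below. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)  # e.g., a call to your LLM provider's completion endpoint
```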
Strategies to improve factuality in LLMs
Some common techniques to improve the factual accuracy of LLMs include:
- Pre-training data improvements: The quality of the pre-training data directly impacts factual accuracy. Given the massive scale of these datasets, manual filtering and curation are impractical, so automated filtering methods can be used to prioritize reliable sources.
Models like RETRO (Retrieval-Enhanced Transformer) improve accuracy by retrieving relevant information from a vast database of billions of tokens during training. RETRO has shown better factual accuracy and fewer hallucinations compared to models like GPT. However, its performance can be affected if the retrieval database contains inaccurate, biased, or outdated information.
- Supervised fine-tuning: By fine-tuning LLMs on curated, high-quality datasets that come from verified sources like encyclopedias, scientific literature, and peer-reviewed journals, the models can prioritize accurate and verifiable information. Incorporating fact-checking data into the training phase helps teach the model to generate trustworthy responses.
- Human-in-the-loop systems: Integrating human evaluators into the LLM workflow provides a strong layer of factual accuracy, especially for critical applications. Human reviewers can flag inaccuracies, identify patterns of hallucinations, and feed that feedback back into the model for further fine-tuning.
- Prompt engineering: By optimizing the structure of prompts, users can guide the model to focus on generating reliable information. Techniques such as adding context, providing explicit instructions to verify facts, or asking the model to “think step-by-step” (as in chain-of-thought reasoning) can significantly reduce the likelihood of factual inaccuracies.
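As an illustration of these prompt-engineering techniques, the template below combines added context, an explicit instruction to verify facts, and step-by-step reasoning. The exact wording is an example, not a prescribed format.

```python
# Illustrative prompt template: added context, an explicit fact-verification
# instruction, and chain-of-thought phrasing. The wording is an example only.

FACT_FOCUSED_PROMPT = """You are a careful assistant. Use only the context provided.

Context:
{context}

Question: {question}

Instructions:
1. Think step by step and list the facts from the context that support your answer.
2. If a required fact is missing from the context, say "I cannot verify this" instead of guessing.
3. Give the final answer in one sentence, citing the supporting fact.
"""

prompt = FACT_FOCUSED_PROMPT.format(
    context="Albert Einstein was awarded the 1921 Nobel Prize in Physics.",
    question="When did Einstein win the Nobel Prize?",
)
```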
Recent studies in LLM factuality
Researchers are developing new methods to improve the accuracy and trustworthiness of LLM outputs, particularly in real-world applications where factual consistency is critical. Below are some of the key recent developments in this space.
FACTSCORE
FACTSCORE is an advanced evaluation framework that assesses the factual precision of long-form text generated by LLMs. It breaks generated content into “atomic facts” and evaluates each one against a reliable knowledge source like Wikipedia.
Unlike traditional methods that rely on binary judgments, FACTSCORE allows for more nuanced, fine-grained analysis by evaluating the percentage of atomic facts that are accurate. It was used to evaluate models like GPT-4, ChatGPT, and public models such as Vicuna and Alpaca.
Although GPT-4 and ChatGPT achieved higher factual precision than the public models, they still produced factual inaccuracies.
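The sketch below captures the core FACTSCORE computation as described above: the score is the fraction of atomic facts that the knowledge source supports. Both `extract_atomic_facts` and `is_supported` are placeholders; the published framework uses an LLM to split text into atomic facts and retrieval over Wikipedia for the support check.

```python
# Sketch of the FACTSCORE idea: split a generation into atomic facts and score
# the fraction supported by a knowledge source. The two callables are placeholders.
from typing import Callable

def factscore(generation: str,
              extract_atomic_facts: Callable[[str], list[str]],
              is_supported: Callable[[str], bool]) -> float:
    facts = extract_atomic_facts(generation)
    if not facts:
        return 0.0
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)  # e.g., 0.75 means 3 of 4 atomic facts were verified
```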
SelfCheckGPT
SelfCheckGPT detects hallucinations in responses generated by LLMs, particularly in a zero-resource setting. Unlike traditional methods that rely on external databases or internal probability distributions, SelfCheckGPT uses a sampling-based approach to measure the consistency of multiple generated responses.
The core idea is that if an LLM "knows" a fact, its responses will be similar across samples, whereas hallucinated information will vary between outputs.
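The following sketch illustrates that sampling-based idea. The paper proposes several scoring variants (BERTScore, question answering, NLI, and LLM prompting); simple word overlap is used here purely as a stand-in for those scorers.

```python
# Sketch of the SelfCheckGPT sampling idea: regenerate the answer several times at
# temperature > 0 and flag sentences of the main response that the samples do not
# support. Word overlap is only a stand-in for the paper's scoring functions.

def overlap(sentence: str, sample: str) -> float:
    """Fraction of the sentence's words that also appear in the sampled response."""
    words = set(sentence.lower().split())
    return len(words & set(sample.lower().split())) / len(words) if words else 0.0

def hallucination_scores(response_sentences: list[str], samples: list[str]) -> list[float]:
    """Higher score = less supported by the sampled regenerations = more likely hallucinated."""
    if not samples:
        raise ValueError("Need at least one sampled regeneration")
    return [1.0 - max(overlap(s, sample) for sample in samples) for s in response_sentences]

# Usage: score each sentence of the original answer against N stochastic samples;
# sentences with scores near 1.0 deserve verification before the output is trusted.
```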
OpenFactCheck
OpenFactCheck is an advanced framework for assessing LLMs’ factual accuracy in a unified, customizable manner. It addresses two major challenges in LLM factuality evaluation: assessing free-form responses and the inconsistency in evaluation benchmarks across various studies. The key components of this framework include:
- CUSTCHECKER allows users to customize automatic fact-checking systems to verify the accuracy of both human-written and LLM-generated content, with modules tailored to specific domain requirements.
- LLMEVAL applies seven different benchmarks to assess LLM factuality and generates detailed reports that highlight weaknesses and suggest improvement strategies.
- CHECKEREVAL ranks fact-checking systems based on accuracy, speed, and cost, helping to develop more efficient fact-checkers.
Conclusion
Factuality is critical for building reliable, trustworthy LLMs, particularly in high-stakes industries like healthcare, law, and finance. As LLMs continue to advance, maintaining factual accuracy becomes increasingly important to prevent misinformation and ensure safe usage in real-world applications. By using robust evaluation metrics, refining training methods, and incorporating strategies like retrieval-augmented generation, LLMs can be made more factually accurate and reliable.
At Turing, we offer advanced LLM factuality services, including fact verification, bias and misinformation detection, and source credibility assessment, ensuring your model consistently delivers truthful and credible information. Leading LLM companies and research organizations, including OpenAI, Google, Meta, and Anthropic, have trusted us to accelerate their AGI training and deployment.
Want to accelerate your business with AI?
Talk to one of our solutions architects and get a complimentary GenAI advisory session.
Author
Anjali Chaudhary
Anjali is an engineer-turned-writer, editor, and team lead with extensive experience in writing blogs, guest posts, website content, social media content, and more.