Corrective Retrieval Augmented Generation — Why RAGs are not enough?!!

Arion Das


ISSUES WITH RAG

Although RAGs are an essential complement to LLMs, they rely heavily on the relevance of the retrieved documents. This raises concerns over the model’s behavior when the retrieval is wrong.

Since a static and limited corpus can only return suboptimal documents, CRAG utilizes large-scale web searches to augment the retrieval results.

LLMs manifest hallucinations because the parametric knowledge they encapsulate cannot, on its own, guarantee factually accurate responses. At the same time, the heavy reliance of generation on the retrieved knowledge raises significant concerns about the model’s behavior and performance in scenarios where retrieval fails or returns inaccurate results.

POOR RETRIEVAL

The figure shows how low-quality retrievers are prone to introducing a substantial amount of irrelevant information, hindering the models from acquiring accurate knowledge and potentially misleading them, which results in issues such as hallucinations.

Current methods mostly treat complete documents as reference knowledge during both retrieval and utilization. But a considerable portion of the text within these retrieved documents is often non-essential for generation and should not be given equal weight in RAG.

RELATED WORK

HALLUCINATIONS OF LLMs

Although LLMs have shown impressive abilities to understand instructions and generate fluent language texts, one of their most severe issues is hallucinations.

Unregulated large-scale collection of training data, a low proportion of high-quality samples, imperfect allocation of data across the input space, and many other practical factors can all affect LLMs and contribute to these hallucinations.

RETRIEVAL-AUGMENTED GENERATION

RAG is a useful method to address the issues above. It enhances the generated response with the retrieved documents. It usually provides an extra knowledge source from a specific corpus, like Wikipedia or relevant documents, which greatly improves the performance of language models in knowledge-intensive tasks. The relevant knowledge is then used to generate a response for the input query.

Despite this, the method overlooks a crucial failure mode: what happens when retrieval goes wrong? If the retrieved documents are irrelevant, the retrieval system can itself become the cause of the factual errors that LMs make.

ADVANCED RAG

Many advanced approaches have been developed to overcome the shortcomings of the standard RAG framework.

  1. Self-RAG: Selectively retrieves knowledge and introduces a critic model that decides whether retrieval is needed at all, since retrieval is sometimes unnecessary.
  2. SAIL: Inserts retrieved documents before the instructions and tunes LMs on instruction-tuning data. Perplexity AI works in a similar way.
  3. Toolformer: Pre-trained to call external APIs such as a Wikipedia API.
  4. Ret-ChatGPT and Ret-LLaMA-chat: Use the same augmentation technique as RAG but are trained on private data.

Even after so much work in the field of RAG, there is still huge scope for improvement. This paper attempts to explore and design corrective strategies for RAG to improve the robustness of its generation.

TASK FORMULATION

INPUT

  1. A large collection of knowledge documents: C = {d1, d2, d3, …, dN}
  2. Input Query: X

RETRIEVER

The entire framework is divided into a retriever, R, and a generator, G. The retriever retrieves top k documents: D = {dr1, dr2,…, drk}, which are relevant to the input X from the corpus, C. Based on the input X and the retrieved results D, the generator G is responsible for generating the output Y.

We can formulate the framework as:

P(Y|X) = P(D|X) P(Y|D, X)

Any unsuccessful retrieval can result in an unsatisfactory response, regardless of the impressive abilities of the generator. This is exactly the focus of this paper: to improve the robustness of generation.
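
To make the retriever/generator split concrete, here is a minimal Python sketch of the standard RAG flow formulated above. The toy corpus, the word-overlap ranking, and the string-template generator are stand-ins assumed purely for illustration; a real system would use a learned retriever and an LLM.

```python
# Minimal sketch of the R (retriever) / G (generator) decomposition above.
# The corpus, the overlap-based scoring, and the generator are toy stand-ins.

def retrieve(corpus: list[str], x: str, k: int = 3) -> list[str]:
    """R: return the top-k documents from C, ranked by naive word overlap with x."""
    query_terms = set(x.lower().split())
    scored = [(len(query_terms & set(d.lower().split())), d) for d in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

def generate(x: str, documents: list[str]) -> str:
    """G: produce Y given X and the retrieved D. A real system would call an LLM here."""
    context = "\n".join(documents)
    return f"Answer to '{x}', grounded in:\n{context}"

corpus = [
    "Henry Feilden was a Conservative Party politician.",
    "The Eiffel Tower is located in Paris.",
    "CRAG adds a retrieval evaluator on top of standard RAG.",
]
docs = retrieve(corpus, "What party did Henry Feilden belong to?")
print(generate("What party did Henry Feilden belong to?", docs))
```

The rest of the post is about what to do when `retrieve` returns the wrong documents.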

CRAG


To address the above issues, this paper studies scenarios where the retriever returns inaccurate results.

A method named Corrective Retrieval Augmented Generation, or CRAG, is proposed to self-correct the results of the retriever when required.

At its core is a lightweight retrieval evaluator designed to assess the overall quality of the retrieved documents for a query. It contributes to generation by reviewing and evaluating the relevance and reliability of the retrieved documents.

The relevance score is quantified into three confidence degrees, each of which triggers a corresponding action: {Correct, Incorrect, Ambiguous}. The three actions are described in detail under Action Trigger below.


RETRIEVAL EVALUATOR

Checking whether retrieved documents are accurate or not before using them is essential, as irrelevant or misleading messages can have a significant impact. The accuracy of the retrieval evaluator plays a crucial part in shaping the overall performance of the system.

T5-large is adopted to initialize the retrieval evaluator, and existing datasets can be used to fine-tune it.

For every question, 10 documents are retrieved. The question is concatenated with each individual document as the input, and the evaluator predicts a relevance score for each question-document pair. Based on these relevance scores, a final judgment is made and associated with one of the action triggers described below. CRAG demonstrates the advantage of being lightweight when compared to Self-RAG, which fine-tunes LLaMA-2 (7B).
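
As a rough illustration of this scoring step, the sketch below scores each (question, document) pair individually and collects the per-document scores. The `score_pair` heuristic is only a placeholder for the fine-tuned T5-large evaluator; just the shape of the computation follows the description above.

```python
# Sketch of the evaluator step: score every (question, document) pair, then
# hand the list of scores to the action trigger. `score_pair` is a toy
# placeholder for the fine-tuned T5-large evaluator described in the post.

def score_pair(question: str, document: str) -> float:
    """Toy relevance score in [-1, 1]; in CRAG this comes from T5-large run on
    the concatenated (question, document) input."""
    overlap = len(set(question.lower().split()) & set(document.lower().split()))
    return min(1.0, overlap / 5) * 2 - 1  # crude heuristic, not the real model

def evaluate_retrieval(question: str, documents: list[str]) -> list[float]:
    # Each of the (up to 10) retrieved documents is scored individually.
    return [score_pair(question, d) for d in documents]
```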

ACTION TRIGGER

Based on the aforementioned confidence score for each retrieved document, three types of actions are designed and triggered accordingly:

Correct

The retrieved documents will be refined into more precise knowledge strips. This refinement operation involves knowledge decomposition, filtering, and recomposition.

Incorrect

The retrieved documents will be discarded. Instead, web searches will be regarded as complementary resources for corrections.

Ambiguous

When the evaluator cannot confidently make a Correct or Incorrect judgment, a soft and balanced action called ‘Ambiguous’ is triggered, which combines both of the above strategies. After the retrieval results have been optimized, an arbitrary generative model can be adopted.
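
As a minimal sketch of how the per-document scores could be turned into one of these actions, the snippet below takes the maximum score and compares it against two thresholds. Both the max-aggregation and the threshold values are assumptions made here for illustration; the paper tunes its decision thresholds empirically.

```python
# Sketch of the action trigger: map the evaluator scores to CORRECT,
# INCORRECT, or AMBIGUOUS. The aggregation (max) and the two thresholds are
# illustrative assumptions, not the paper's tuned values.

UPPER, LOWER = 0.5, -0.9  # assumed threshold values

def trigger_action(scores: list[float]) -> str:
    top = max(scores, default=-1.0)  # no documents -> treat as irrelevant
    if top > UPPER:       # at least one document looks confidently relevant
        return "CORRECT"
    if top < LOWER:       # even the best document looks irrelevant
        return "INCORRECT"
    return "AMBIGUOUS"    # neither judgment can be made confidently
```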

KNOWLEDGE REFINEMENT

A “decompose-then-recompose” knowledge refinement method is used to extract the most critical knowledge strips from the retrieved documents. Each document is segmented into fine-grained knowledge strips, a relevance score is calculated for each strip to filter out the irrelevant ones, and the relevant strips are recomposed via concatenation.
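
The sketch below mirrors this decompose-then-recompose idea, reusing the `score_pair` placeholder from the evaluator sketch. Sentence-level segmentation and the keep-threshold of 0.0 are assumptions for illustration; the paper’s strip granularity and filtering rule may differ.

```python
import re

# Sketch of decompose-then-recompose refinement: split documents into strips,
# keep only strips scored as relevant, and concatenate the survivors.

def refine(question: str, documents: list[str], keep_threshold: float = 0.0) -> str:
    strips: list[str] = []
    for doc in documents:
        # Decompose: naive sentence splitting stands in for fine-grained strips.
        strips.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip())
    # Filter: reuse the toy pair-wise relevance scorer from the evaluator sketch.
    kept = [s for s in strips if score_pair(question, s) > keep_threshold]
    # Recompose: concatenate the relevant strips into internal knowledge.
    return " ".join(kept)
```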

WEB SEARCH

This is the part that is really unique about this paper. When the retrieved results are all assumed to be irrelevant, using external knowledge is extremely important, as the static corpora will be providing either suboptimal or incorrect results.

The inputs are rewritten into queries composed of keywords by ChatGPT to mimic the daily usage of search engines. A public and accessible commercial web search API is adopted to generate a series of URL links for every query.

The URL links are used to navigate to the web pages and transcribe their content, and the same knowledge refinement method is then employed to derive the relevant external knowledge from the web.
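
A hedged sketch of this correction path is shown below. `rewrite_query` and `web_search` are hypothetical stand-ins for the ChatGPT-based keyword rewriter and the commercial search API; no real API is called here.

```python
# Sketch of the web-search path: rewrite the question into keywords, "search",
# and refine the fetched text with the same decompose-then-recompose step.
# Both helpers are hypothetical stand-ins, not real API integrations.

def rewrite_query(question: str) -> str:
    """Stand-in for the ChatGPT-based rewriter that extracts search keywords."""
    stopwords = {"what", "who", "is", "the", "of", "a", "an", "did", "to"}
    return " ".join(w for w in question.lower().split() if w not in stopwords)

def web_search(keyword_query: str) -> list[str]:
    """Hypothetical search call; a real system would hit a search API, follow
    the returned URLs, and transcribe the page content."""
    return [f"(page text retrieved for: {keyword_query})"]

def external_knowledge(question: str) -> str:
    pages = web_search(rewrite_query(question))
    return refine(question, pages)  # same refinement as for internal knowledge
```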

Algorithm 1: CRAG Inference

Requirements: E (Retrieval Evaluator), W (Query Rewriter), G (Generator)
Input: x (Input question), D = {d1, d2, ..., dk} (Retrieved documents)
Output: y (Generated response)

1   score_i = E evaluates the relevance of each pair (x, d_i), d_i ∈ D
2   Confidence = Calculate and give a final judgment based on {score_1, score_2, ..., score_k}
    // Confidence has 3 optional values: [CORRECT], [INCORRECT] or [AMBIGUOUS]
3   if Confidence == [CORRECT] then
4       Internal_Knowledge = Knowledge_Refine(x, D)
5       k = Internal_Knowledge
6   else if Confidence == [INCORRECT] then
7       External_Knowledge = Web_Search(W rewrites x for searching)
8       k = External_Knowledge
9   else if Confidence == [AMBIGUOUS] then
10      Internal_Knowledge = Knowledge_Refine(x, D)
11      External_Knowledge = Web_Search(W rewrites x for searching)
12      k = Internal_Knowledge + External_Knowledge
13  end
14  G predicts y given x and k

This algorithm goes through the entire workflow in a really simple way. Let us walk through the algorithm step by step for a clearer understanding:

Requirements

  1. Retrieval Evaluator (E): It evaluates the content retrieved from the knowledge base.
  2. Query Rewriter (W): It rewrites the user’s query for an optimized web search.
  3. Generator (G): It generates the final response from the question and the selected knowledge.

Input

  1. Input Question (x)
  2. Retrieved Documents (D): {d1, d2, d3, …, dk}

Output

  1. Generated Response (y)

Steps

  1. E evaluates the relevance of each pair (x, di), where di belongs to D.
  2. A final confidence judgment is computed from the relevance scores and used to choose one of the three action triggers: ‘correct’, ‘incorrect’, or ‘ambiguous’.

3–5. If the ‘correct’ action is triggered, knowledge from the documents/corpora is used to generate the response.

6–8. If the ‘incorrect’ action is triggered, an optimized web search is performed to bring in external knowledge for providing a response.

9–12. When confidence is ‘ambiguous’, a mixture of internal & external knowledge is used to provide a response.
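
Putting the pieces together, the following sketch mirrors Algorithm 1 using the toy components from the earlier sketches (`evaluate_retrieval`, `trigger_action`, `refine`, `external_knowledge`, and the stub `generate`). It is only a skeleton of the control flow, not the paper’s implementation.

```python
# End-to-end sketch of Algorithm 1, wiring together the toy helpers defined in
# the earlier sketches. Step numbers refer to the pseudocode above.

def crag_inference(x: str, retrieved_docs: list[str]) -> str:
    scores = evaluate_retrieval(x, retrieved_docs)              # step 1
    confidence = trigger_action(scores)                         # step 2

    if confidence == "CORRECT":                                 # steps 3-5
        knowledge = refine(x, retrieved_docs)
    elif confidence == "INCORRECT":                             # steps 6-8
        knowledge = external_knowledge(x)
    else:                                                       # steps 9-12
        knowledge = refine(x, retrieved_docs) + " " + external_knowledge(x)

    return generate(x, [knowledge])                             # step 14
```

With the toy scorer and thresholds above, calling `crag_inference("What party did Henry Feilden belong to?", docs)` on the earlier toy corpus lands in the Ambiguous branch and combines both knowledge sources.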

EXPERIMENTS

TASKS, DATASETS & METRICS

The datasets used for evaluating CRAG were:

  1. PopQA: short-form generation
  2. Biography: long-form generation
  3. PubHealth: true/false questions
  4. ARC-Challenge: multiple-choice questions

Accuracy was adopted as the evaluation metric for PopQA, PubHealth, and ARC-Challenge, while FactScore was adopted for Biography. (Refer to the paper for more details.)

BASELINES

Some public LLMs were used: LLaMA2-7B and 13B, the instruction-tuned Alpaca-7B and 13B, and CoVE 65B. Proprietary LLMs such as LLaMA2-chat-13B and ChatGPT were also included.

Standard RAG

The standard RAG setting is evaluated, where a language model (LM) generates output given the query prepended with the top retrieved documents, using the same retriever as CRAG. Several public instruction-tuned LLMs were evaluated here: LLaMA2-7B and 13B, Alpaca-7B and 13B, and LLaMA2-7B instruction-tuned as in Self-RAG.

Advanced RAG

The advanced RAG baselines are the approaches already described in the Related Work section above: Self-RAG, SAIL, Toolformer, and Ret-ChatGPT / Ret-LLaMA-chat.


RESULTS

The main results, including a comparison of the retrieval evaluator and ChatGPT on the retrieval results of the PopQA dataset, can be found in the paper.

ABLATION STUDY

Impact of each triggered action

Evaluations on the PopQA dataset were conducted to demonstrate the performance change in terms of accuracy.

When the action Correct or Incorrect was removed, it was merged with Ambiguous so that the proportion that originally triggered Correct or Incorrect would trigger Ambiguous.

On the other hand, when the action Ambiguous was removed, there was only one threshold against which all input queries clearly triggered Correct or Incorrect.

It can be seen that there was a performance drop no matter which action was removed, clearly showing that each action contributed to improving the robustness of generation.

Impact of each knowledge utilization operation

Evaluations on the PopQA dataset in terms of accuracy were conducted by individually removing the knowledge utilization operations of document refinement, search query rewriting, and external knowledge selection.

Removing document refinement denotes that the original retrieved documents were directly fed to the following generator.

Removing search query rewriting denotes that questions were not rewritten into queries consisting of keywords during knowledge searching.

Removing knowledge selection denotes that all searched content on web pages was regarded as external knowledge without selection.

Removing any of these operations degraded the performance of the final system, revealing that each knowledge utilization operation contributed to a better use of the retrieved knowledge.

Accuracy of the Retrieval Evaluator

RETRIEVAL EVALUATOR EVALUATION ON PopQA

The quality of the retrieval evaluator significantly determines the performance of the entire system. Its accuracy was therefore compared with that of an LLM, ChatGPT, on the PopQA dataset.

ChatGPT was prompted in three settings: plain ChatGPT, ChatGPT-CoT, and ChatGPT-few-shot. The lightweight T5-based retrieval evaluator significantly outperformed ChatGPT in all settings. For more details on the results, refer to the appendix of the paper.

LIMITATIONS & CONCLUSION

This paper studies the situations in which RAG-based approaches go wrong and bring inaccurate or misleading knowledge to generative LMs. CRAG proposes a lightweight retrieval evaluator to estimate retrieval quality and trigger one of three knowledge retrieval actions. Through web search and optimized knowledge utilization, CRAG improves the ability of automatic self-correction and the efficient utilization of retrieved documents.

Although the primary focus is improving the RAG framework from a corrective perspective, detecting and correcting the wrong knowledge more accurately and effectively still requires further study. The potential bias introduced by web searches is also a concern.

The quality of internet sources can vary significantly, and incorporating such data without enough consideration may introduce noise or misleading information to the generated outputs. A more stable and reliable method of retrieval augmentation is definitely viable and in the making!!

Here is an implementation of a CRAG application using LangChain, Ollama, & Mistral: https://www.youtube.com/watch?v=E2shqsYwxck&t=205s

Let’s connect and build a project together: 🐈‍⬛
