Data Science, ML and Analytics Engineering

Retrieval-Augmented Generation (RAG): Recent Research and Challenges

In today’s AI-driven world, Retrieval-Augmented Generation (RAG) is becoming an increasingly significant approach that combines the capabilities of information retrieval with the generative abilities of large language models (LLMs). This overcomes a number of limitations faced by traditional LLMs and provides more accurate and fact-based answers.

What is RAG?

RAG is not a single technology, but an entire umbrella of different components, designs, and domain-specific adaptations. A typical RAG system includes:

  1. A data ingestion component: where data is processed, embedded, and stored as context documents in a vector database
  2. A retrieval component: where context documents are retrieved and ranked for relevance to the query
  3. A query component: where the user's query and the retrieved context documents are combined into a prompt and sent to the LLM
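
To make these three components concrete, here is a minimal, self-contained sketch of such a pipeline. The embedding model and the in-memory "vector store" are illustrative assumptions, not a reference implementation.

# Minimal RAG sketch: ingestion -> retrieval -> query assembly.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "ACME's revenue increased by 3% compared to the previous quarter.",
    "The company reported an 8% rise in Q3 2024, driven by the Asian market.",
    "Analysts attribute the growth to strategic marketing and product localization.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# 1. Ingestion: embed the documents and keep them in an in-memory "vector store".
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    # 2. Retrieval: rank documents by cosine similarity to the query.
    query_vector = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

def build_prompt(query: str) -> str:
    # 3. Query component: combine the retrieved context with the user's question.
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("What drove the company's growth in Q3 2024?"))  # this prompt goes to the LLM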

Methodology for evaluating RAG systems

Recent studies, including work by Simon and colleagues at the University of Leipzig, propose a methodological framework for evaluating RAG systems. Their approach includes:

  • Independent variables: RAG architecture (embedding models, search methods, filters) and context sources
  • Dependent variables: RAG performance metrics
  • Evaluation: comparison with a baseline model (usually vanilla LLM) and use of benchmarks
  • Improvements: error analysis and systematic improvements

One of the most important aspects of developing a successful Retrieval-Augmented Generation (RAG) system is to evaluate it using objective metrics. Without proper performance measurement, it is impossible to determine if the system is actually working as intended and where improvements are needed.

Four key metrics for evaluating RAG systems

There are four key metrics that should be used to evaluate your RAG pipelines comprehensively:

  1. Information Accuracy (Faithfulness)

The faithfulness metric measures how closely the generated response matches the provided context. Essentially, it indicates whether the model is inventing facts that are not present in the retrieved documents.

  • It is scored on a scale of 0 to 1
  • Higher values indicate better factual accuracy
  • A low score may signal that the model is hallucinating
  2. Answer Relevance

This metric assesses how directly and adequately the generated answer answers the user’s original question. Even if the answer is factually accurate, it may be useless if it does not answer the exact question that was asked.

  • Uses the average cosine similarity between the original question and questions generated from the answer (a rough sketch of this idea follows this list)
  • Higher scores indicate better relevance
  • Helps identify instances where the system deviates from the topic of the question
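
As a rough illustration of this idea (not the exact Ragas implementation), answer relevance can be approximated by embedding the original question and the questions generated from the answer, then averaging their cosine similarities. The embedding model below is an assumption.

# Rough sketch of the answer-relevance idea; question generation is stubbed out.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

original_question = "What drove the company's growth in Q3 2024?"
# In a real pipeline these questions would be generated from the answer by an LLM.
questions_from_answer = [
    "What caused the 8% increase in Q3 2024?",
    "Which strategies supported the company's growth?",
]

q_vec = encoder.encode([original_question], normalize_embeddings=True)[0]
gen_vecs = encoder.encode(questions_from_answer, normalize_embeddings=True)
answer_relevance = float(np.mean(gen_vecs @ q_vec))  # average cosine similarity
print(answer_relevance)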

  3. Context Precision

This metric evaluates how effectively the information extraction component ranks relevant documents higher than irrelevant ones. High Context Precision means that the system retrieves exactly the documents that are needed for the answer.

  • The range of values is from 0 to 1
  • Higher values indicate better ranking of relevant documents
  • Critical for the efficient use of the limited context window of the model
  4. Context Recall (completeness)

The completeness metric measures how well the retrieved context covers the reference answer, i.e. whether it contains all the information needed to produce a complete and correct response.

  • It is scored from 0 to 1
  • Higher values indicate better coverage of the reference (ground-truth) answer
  • Helps identify instances where the system is missing critical information

Automating evaluation with the Ragas library

The Ragas library greatly simplifies the process of evaluating RAG systems against the above metrics. With it, developers can create a dataset containing questions, answers, and relevant contexts, and then evaluate them on multiple metrics simultaneously.

This evaluation approach not only identifies current RAG system deficiencies, but also tracks progress as improvements are made to various components of the pipelines, from vector search to prompts for the generative model.

Regular evaluation on these metrics should be an integral part of the development cycle of any RAG system, especially when deployed in mission-critical domains where accuracy and reliability of information are paramount.

Example code:

# Evaluating a single sample with Ragas' AspectCritic metric.
from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AspectCritic

test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
}

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
metric = AspectCritic(name="summary_accuracy", llm=evaluator_llm, definition="Verify if the summary is accurate.")

# Run inside an async context (e.g. a notebook) or wrap the call in asyncio.run().
await metric.single_turn_ascore(SingleTurnSample(**test_data))
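
For the four metrics discussed above, a batch evaluation might look roughly like the following; the exact imports and dataset column names depend on the Ragas version, so treat this as a sketch rather than the canonical API.

# Sketch of a multi-metric Ragas evaluation; column names vary between Ragas versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

data = {
    "question": ["What drove the company's growth in Q3 2024?"],
    "answer": ["Growth was driven by strong performance in the Asian market."],
    "contexts": [["The company reported an 8% rise in Q3 2024, driven by the Asian market."]],
    "ground_truth": ["The 8% rise in Q3 2024 was driven by the Asian market."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)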

Repository

Key RAG Challenges

Modern RAG systems face several key challenges:

  1. Search relevance: Ensuring that the retrieved information is truly relevant to the query
  2. Information Integration: Effectively combining retrieved information from different sources
  3. Source quality: Identifying and utilizing reliable data sources
  4. Effective Prompting: Developing optimal prompts for the LLM using the extracted context

Recent research

Contextual Retrieval: a revolutionary improvement in RAG

AI models often need access to specific knowledge to work effectively with users. For example, support chatbots need to know about a company’s products, and legal assistants need to know about a set of case law. The traditional approach to augmenting model knowledge is Retrieval-Augmented Generation (RAG), which extracts relevant information from a knowledge base and adds it to the user’s query.

However, traditional RAG has a significant drawback: context is often lost during information processing, leading to failures in retrieving relevant data.

Anthropic has presented a method that significantly improves the extraction phase in RAG systems. The method is called Contextual Retrieval and utilizes two sub-techniques:

  1. Contextual Embeddings
  2. Contextual BM25

These techniques help preserve context when processing document fragments, which is critical for accurate information extraction.


How does Contextual Retrieval work?

In traditional RAG, documents are broken into small fragments, which are then converted into vector embeddings or indexed for keyword search. The problem arises when the individual fragments don’t contain enough context.

For example, if you have a financial report with the phrase, “The company’s revenue grew 3% from the previous quarter,” without additional context, it’s unclear which company is being talked about and for what period.

Contextual Retrieval solves this problem by adding a short explanatory context to each fragment before creating embeddings or BM25 indexes:

Original fragment = "The company's revenue increased by 3% compared to the previous quarter."

Contextualized fragment = "This fragment is from an SEC report on ACME company's performance for Q2 2023; revenue for the previous quarter was $314 million. The company's revenue increased by 3% compared to the previous quarter."

Experiments have shown that:

  • Contextual embeddings reduced the frequency of failed retrievals of top-20 fragments by 35% (from 5.7% to 3.7%)
  • The combination of contextual embeddings and contextual BM25 reduced the failure rate by 49% (from 5.7% to 2.9%)
  • Adding a step of re-ranking the results reduced the failure rate by 67% (from 5.7% to 1.9%)

Anthropic uses Claude to automate the process of adding context. A special prompt instructs the model to provide a short, fragment-specific context explaining its place in the overall document. The result is a short contextual text (50-100 tokens) that is added to the snippet before embeddings and BM25 indexes are created.
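
A sketch of how this context generation could be scripted with the Anthropic Python SDK is shown below; the prompt wording and the model name are assumptions for illustration, not Anthropic's exact implementation.

# Sketch: asking Claude for a short context for each chunk before indexing.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize(chunk: str, full_document: str) -> str:
    prompt = (
        f"<document>\n{full_document}\n</document>\n"
        f"Here is a chunk from the document:\n<chunk>\n{chunk}\n</chunk>\n"
        "Write a short context (50-100 tokens) situating this chunk within the "
        "document to improve search retrieval. Answer with the context only."
    )
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # illustrative model choice
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    context = response.content[0].text
    # The contextualized text is what gets embedded and BM25-indexed.
    return f"{context} {chunk}"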

Anthropic recommends:

  1. For small knowledge bases (less than 200,000 tokens, approximately 500 pages) – just include the entire knowledge base in the prompt. With the recently released prompt caching for Claude, this approach is much faster and more cost-effective.
  2. For larger knowledge bases – use contextual extraction, possibly combined with re-ranking for best results.

LongRAG

One interesting piece of recent research is LongRAG, a RAG paradigm for question answering over long contexts. LongRAG addresses two major problems of traditional RAG systems: the loss of global structure when documents are split into chunks, and the difficulty of identifying factual details among large amounts of noise.

LongRAG includes four key components: a hybrid retriever, an LLM-enhanced information extractor, a CoT-guided filter, and an LLM-enhanced generator. The system works from two perspectives: extracting global information to understand structure and context, and extracting factual details for accurate answers.

Experiments on several multi-hop question-answering datasets showed that LongRAG significantly outperforms long-context LLMs (by 6.94%), advanced RAG systems (by 6.16%), and conventional vanilla RAG (by 17.25%).


Source

ChunkRAG

Conventional RAG systems often extract redundant or weakly relevant information, resulting in factual inaccuracies and hallucinations in the generated answers. Existing document retrieval approaches lack sufficient granularity to effectively filter irrelevant content.

Researchers from Algoverse AI Research presented ChunkRAG, a new method that improves the information selection process through semantic segmentation and relevance scoring of text fragments. Instead of working with whole documents, the system divides them into semantically coherent fragments of variable length and evaluates the relevance of each fragment to the query.

ChunkRAG works in two main steps:

  1. Semantic segmentation: the document is split into sentences, which are grouped into semantically coherent blocks based on the cosine similarity of their vector representations. If the similarity between consecutive sentences falls below a threshold (θ = 0.8), a new chunk is started (a simplified sketch follows this list).
  2. Hybrid search and advanced filtering: The system uses a combination of BM25 and LLM-based search, followed by a multi-level process to evaluate the relevance of each fragment. Fragments exceeding a certain similarity threshold (> 0.9) are removed as redundant.
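
The segmentation step can be sketched as follows; this is a simplified illustration of the idea, not the authors' code, and the embedding model is an assumption.

# Simplified semantic segmentation: start a new chunk when the similarity
# between consecutive sentences drops below the threshold (theta = 0.8).
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_chunks(sentences: list[str], theta: float = 0.8) -> list[list[str]]:
    vectors = encoder.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(vectors[i] @ vectors[i - 1])  # cosine similarity
        if similarity < theta:
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks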

In experiments on PopQA, PubHealth, and Biography datasets, ChunkRAG performed better than state-of-the-art RAG methods. The system achieved 64.9% accuracy on PopQA, 77.3% on PubHealth, and a FactScore of 86.4% on Biography.

The improvements are particularly notable in tasks requiring high factual accuracy, where even small irrelevant segments can cause the model to hallucinate.

Despite the impressive results, ChunkRAG has some limitations, including high computational cost due to multi-level LLM-based scoring and slower processing time. Future research could focus on optimizing performance and investigating domain-specific thresholds for different types of tasks.

HtmlRAG

Traditionally, RAG systems use plain text as the format for extracted knowledge. However, researchers from China have proposed an innovative approach: using HTML instead of plain text.


Why HTML is better than plain text for RAG systems:

  1. Preserves structural information of documents (headers, tables)
  2. Preserves semantic elements (such as <code>, <a>, and other tags)
  3. Most LLMs can already understand HTML markup

However, using HTML in RAG also introduces some problems:

  1. Excessive size. HTML documents from web pages contain more than 80K tokens on average, and 90% of those tokens are CSS styles, JavaScript, and other elements not related to the main content.
  2. Noise in the data. Additional HTML elements can create noise that reduces the quality of LLM generation.

To address these problems, the authors of the study proposed the HtmlRAG method, which includes:

  1. HTML Cleaning: a module that strips semantically irrelevant information from the HTML (a sketch follows this list)
  2. Two-stage block-based HTML pruning:
    • Pruning based on text embeddings
    • Generative pruning based on smaller blocks
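
The HTML cleaning step can be illustrated with a short sketch; this is not the authors' implementation, just an example of stripping semantically irrelevant markup with BeautifulSoup.

# Sketch of HTML cleaning: drop scripts, styles, and comments, keep the tag structure.
from bs4 import BeautifulSoup, Comment

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Remove elements unrelated to the main content.
    for tag in soup(["script", "style", "noscript", "iframe"]):
        tag.decompose()
    # Remove HTML comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Drop attributes (classes, inline styles) while keeping structural tags.
    for tag in soup.find_all(True):
        tag.attrs = {}
    return str(soup)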

Experimental results on six question-answering datasets confirm the superiority of HTML over plain text in RAG systems.

FastRAG: an approach for dealing with semi-structured data

Researchers at NIST presented an approach called FastRAG, specifically designed for semi-structured data such as network logs and configurations.

Key benefits of FastRAG:

  1. Processing efficiency: FastRAG processes the raw data without having to send all the pieces through the LLM, which significantly reduces processing time and cost.
  2. Use of schemas and scripts: The system applies schemas and scripts to structure the data, making the information retrieval process more efficient.
  3. Integration of text and knowledge graph search: FastRAG integrates text search with knowledge graph queries, which improves the accuracy of information retrieval with context.
  4. Impressive results: Compared to existing methods such as GraphRAG, FastRAG shows improvements of up to 90% in time and up to 85% in processing cost.

Astute RAG

Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive tasks with large language models (LLMs). However, a recent study, “Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models,” revealed a serious problem: imperfect retrieval can significantly reduce the performance of RAG systems. Researchers from Google Cloud AI Research and the University of Southern California found that approximately 70% of retrieved passages in real-world settings do not contain direct answers to the questions, resulting in poor performance of RAG-augmented LLMs.


The central problem turned out to be the conflict between the model's internal knowledge and external information. The analysis showed that in about 19.2% of cases there is a knowledge conflict, where the answer is correct either with RAG or without it. In these conflict cases, the model's internal knowledge turns out to be correct 47.4% of the time, while the external knowledge is correct 52.6% of the time.

To overcome these problems, the authors proposed a new approach, Astute RAG, which:

  1. Adaptively extracts important information from the LLM’s internal knowledge
  2. Iteratively combines internal and external knowledge in a source-aware manner
  3. Forms a final answer based on the reliability of the information
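
A very rough sketch of these three steps is given below; the llm() helper and the prompts are hypothetical placeholders for any chat-model call, not the paper's implementation.

# Hypothetical sketch of the Astute RAG flow; llm() stands in for any chat-model call.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred chat model here")

def astute_rag(question: str, retrieved_passages: list[str], iterations: int = 2) -> str:
    # 1. Adaptively elicit relevant knowledge from the model's own parameters.
    internal = llm(f"From your own knowledge, write passages relevant to: {question}")

    sources = [("internal", internal)] + [("external", p) for p in retrieved_passages]
    for _ in range(iterations):
        # 2. Iteratively consolidate internal and external knowledge,
        #    keeping track of sources and flagging contradictions.
        listing = "\n".join(f"[{origin}] {text}" for origin, text in sources)
        consolidated = llm(
            "Consolidate the passages below, noting their sources and any conflicts:\n" + listing
        )
        sources = [("consolidated", consolidated)]

    # 3. Form the final answer based on the reliability of the consolidated information.
    return llm(f"Using the most reliable information below, answer: {question}\n{sources[0][1]}")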

Experiments with Gemini and Claude models demonstrated the superiority of Astute RAG over other RAG methods designed to improve robustness to retrieval errors. Notably, Astute RAG is the only approach that matches or exceeds the performance of the LLM without RAG even in worst-case scenarios.

This research is particularly important for practical applications of RAG systems where search quality can vary. The ability to effectively combine the model's internal knowledge with external information makes AI systems more robust and resilient when solving complex problems that require extensive, relevant knowledge.

Source

Rankify

Recently, researchers from the University of Innsbruck introduced a new tool called Rankify, a comprehensive Python framework that brings retrieval, re-ranking, and retrieval-augmented generation together in a single pipeline.

Key features of Rankify:

  1. Modular architecture: Rankify offers a single interface for different RAG components, making it easy to experiment with different methods.
  2. Support for a variety of search methods: The tool integrates both sparse (BM25) and dense neural search methods (DPR, ANCE, BGE, Contriever, ColBERT).
  3. Advanced re-ranking capabilities: Rankify supports 24 re-ranking models with 41 sub-methods, including pointwise, pairwise, and listwise ranking.
  4. Pre-built datasets: The framework provides 40 pre-processed datasets for testing and comparing different RAG approaches.
  5. Evaluation Tools: Built-in metrics for evaluating search quality, reranking, and response generation.

Examples of retriever initialization:

# The import path below follows recent Rankify releases; adjust it to your installed version.
from rankify.retrievers.retriever import Retriever

# BM25 retrieval on Wikipedia and MS MARCO
bm25_retriever_wiki = Retriever(method="bm25", n_docs=5, index_type="wiki")
bm25_retriever_msmarco = Retriever(method="bm25", n_docs=5, index_type="msmarco")

# DPR (multi-encoder) retrieval on Wikipedia and MS MARCO
dpr_multi_retriever_wiki = Retriever(method="dpr", model="dpr-multi", n_docs=5, index_type="wiki")
dpr_multi_retriever_msmarco = Retriever(method="dpr", model="dpr-multi", n_docs=5, index_type="msmarco")

# DPR (single-encoder) retrieval on Wikipedia and MS MARCO
dpr_single_retriever_wiki = Retriever(method="dpr", model="dpr-single", n_docs=5, index_type="wiki")
dpr_single_retriever_msmarco = Retriever(method="dpr", model="dpr-single", n_docs=5, index_type="msmarco")

# ANCE retrieval on Wikipedia and MS MARCO
ance_retriever_wiki = Retriever(method="ance", model="ance-multi", n_docs=5, index_type="wiki")
ance_retriever_msmarco = Retriever(method="ance", model="ance-multi", n_docs=5, index_type="msmarco")

# Contriever retrieval on Wikipedia and MS MARCO
contriever_retriever_wiki = Retriever(method="contriever", model="facebook/contriever-msmarco", n_docs=5, index_type="wiki")
contriever_retriever_msmarco = Retriever(method="contriever", model="facebook/contriever-msmarco", n_docs=5, index_type="msmarco")

# ColBERT retrieval on Wikipedia and MS MARCO
colbert_retriever_wiki = Retriever(method="colbert", model="colbert-ir/colbertv2.0", n_docs=5, index_type="wiki")
colbert_retriever_msmarco = Retriever(method="colbert", model="colbert-ir/colbertv2.0", n_docs=5, index_type="msmarco")

# BGE retrieval on Wikipedia and MS MARCO
bge_retriever_wiki = Retriever(method="bge", model="BAAI/bge-large-en-v1.5", n_docs=5, index_type="wiki")
bge_retriever_msmarco = Retriever(method="bge", model="BAAI/bge-large-en-v1.5", n_docs=5, index_type="msmarco")

# HyDE retrieval on Wikipedia and MS MARCO (requires an OpenAI API key)
hyde_retriever_wiki = Retriever(method="hyde", n_docs=5, index_type="wiki", api_key=OPENAI_API_KEY)
hyde_retriever_msmarco = Retriever(method="hyde", n_docs=5, index_type="msmarco", api_key=OPENAI_API_KEY)

Researchers have extensively tested Rankify on a variety of datasets. Particularly interesting are the results comparing different search models and their impact on response generation accuracy:

  • For the Natural Questions (NQ) dataset, DPR combined with LLaMA V3 8B performed best, achieving an accuracy of 28.08%
  • For TriviaQA, BM25 with the Gemma 2 9B model was the most efficient (57.55%)
  • On the WebQuestions set, DPR with LLaMA V3 8B was again in the lead (19.83%)

This demonstrates that the choice of the optimal search and generation strategy depends on the specific task and dataset.

Repository

From Native RAG to Agent-based RAG

In today’s world of artificial intelligence, Retrieval-Augmented Generation (RAG) technology has become one of the key approaches to information processing. However, classical RAG is being replaced by a new paradigm, Agentic RAG, which represents a significant step forward in intelligent document processing.


Native RAG, which is widely used today, is a sequential process involving information extraction, information ranking, synthesis and response generation. This approach combines the extraction and generation techniques to provide accurate and contextually relevant answers.

Agent-based RAG takes this concept to a new level. It is an advanced approach that uses autonomous agents to process and analyze multiple documents in a coordinated manner, making it possible to compare documents, generate specialized summaries, and contrast summaries drawn from different sources.

The agent-based RAG architecture is built on two main types of components:

  1. Document Agents: Each document is assigned a dedicated agent capable of answering questions and producing summaries within its document. These agents act as experts on specific content.
  2. Meta-Agent: A top-level agent that coordinates the work of all document agents, orchestrating their interactions and integrating their outputs to form a cohesive and comprehensive response.
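
A compact sketch of this two-level architecture is shown below; the class names, prompts, and the llm() helper are assumptions for illustration, with any chat model standing behind the llm() call.

# Hypothetical sketch of agentic RAG: one agent per document plus a coordinating meta-agent.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat model here")

class DocumentAgent:
    """Answers questions and produces summaries for a single document."""
    def __init__(self, name: str, document: str):
        self.name = name
        self.document = document

    def answer(self, question: str) -> str:
        return llm(f"Using only this document:\n{self.document}\n\nAnswer: {question}")

class MetaAgent:
    """Coordinates the document agents and integrates their outputs."""
    def __init__(self, agents: list[DocumentAgent]):
        self.agents = agents

    def answer(self, question: str) -> str:
        partial_answers = [f"{agent.name}: {agent.answer(question)}" for agent in self.agents]
        return llm(
            "Combine the per-document answers below into one coherent response:\n"
            + "\n".join(partial_answers)
        )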

Agent-based RAG has several significant advantages over traditional approaches:

  • Autonomy: Agents act independently to extract, process, and generate information, which greatly increases the scalability of the system.
  • Adaptability: The system is able to adjust strategies based on new data and changing contexts, making it more flexible in real-world usage scenarios.
  • Proactivity: Agents can anticipate needs and take proactive actions to achieve goals, which improves the overall system performance.

Agent-based RAG is particularly useful in scenarios that require careful and nuanced information processing and decision making. These may include:

  • Analyzing large volumes of scientific publications
  • Legal research and precedent analysis
  • Financial analysis and forecasting
  • Medical diagnostics based on heterogeneous data sources

Conclusions

RAG represents an important approach in modern AI, combining information retrieval with the generative power of large language models. The main takeaways from this overview are:

  1. RAG is not a stand-alone technology but a comprehensive architecture with three key components: data ingestion, retrieval, and a query component that interacts with the LLM.
  2. Evaluating RAG systems requires an integrated approach using four key metrics: faithfulness, answer relevance, context precision, and context recall.
  3. Recent research on RAGs includes innovative approaches:
    • Contextual Retrieval (from Anthropic)
    • LongRAG for dealing with long contexts
    • ChunkRAG to improve the relevance of extracted fragments
    • HtmlRAG for preserving structural information
    • FastRAG for efficient handling of semi-structured data
    • Astute RAG to address the problem of knowledge conflicts
  4. Agent-based RAG represents the next evolutionary step in the development of this technology, utilizing autonomous agents to process information from multiple documents in a coordinated manner.
  5. The main challenges of RAG systems include search relevance, integration of information from different sources, source quality, and efficient prompting.

These improvements enable RAGs to provide more accurate, fact-based answers, overcoming the limitations of traditional language models.


If you liked this article, subscribe to my Telegram channel: https://t.me/renat_alimbekov

