Retrieval-Augmented Generation (RAG) has emerged as a powerful technique for building sophisticated question-answering systems that leverage vast, domain-specific knowledge bases. However, developing and deploying these systems in a real-world enterprise context presents significant challenges. The performance of a RAG pipeline is highly sensitive to a wide array of components and parameters—including the choice of the base Large Language Model (LLM), the retrieval algorithm (e.g., dense, sparse, or hybrid), and data chunking strategies.
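To make the size of this configuration space concrete, the tunable components above can be sketched as a small, self-contained toy pipeline. All names here (`RagConfig`, `retrieve`, the scoring functions) are hypothetical illustrations, not part of any particular framework; the "dense" score is a stand-in for a real embedding similarity.

```python
from dataclasses import dataclass

@dataclass
class RagConfig:
    """Hypothetical knobs a RAG pipeline exposes; real systems have many more."""
    chunk_size: int = 200           # characters per chunk
    chunk_overlap: int = 50         # characters shared between adjacent chunks
    retrieval_mode: str = "hybrid"  # "sparse", "dense", or "hybrid"
    hybrid_alpha: float = 0.5       # weight of the dense score in hybrid mode

def chunk(text: str, cfg: RagConfig) -> list[str]:
    """Fixed-size chunking with overlap, one of many possible strategies."""
    step = cfg.chunk_size - cfg.chunk_overlap
    return [text[i:i + cfg.chunk_size] for i in range(0, len(text), step)]

def sparse_score(query: str, doc: str) -> float:
    """Toy lexical score: fraction of query terms present in the chunk."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def dense_score(query: str, doc: str) -> float:
    """Stand-in for embedding similarity: character-bigram Jaccard overlap."""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    q, d = grams(query.lower()), grams(doc.lower())
    return len(q & d) / len(q | d) if q | d else 0.0

def retrieve(query: str, chunks: list[str], cfg: RagConfig, k: int = 2) -> list[str]:
    """Score chunks under the configured retrieval mode and return the top k."""
    def score(c: str) -> float:
        if cfg.retrieval_mode == "sparse":
            return sparse_score(query, c)
        if cfg.retrieval_mode == "dense":
            return dense_score(query, c)
        return ((1 - cfg.hybrid_alpha) * sparse_score(query, c)
                + cfg.hybrid_alpha * dense_score(query, c))
    return sorted(chunks, key=score, reverse=True)[:k]
```

Even this toy version exposes four interacting parameters; a production pipeline adds the base LLM, embedding model, reranker, and prompt template on top, which is precisely why manual tuning does not scale.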
Optimizing these components is complex and resource-intensive. Furthermore, evaluating a RAG system's factual accuracy and reliability is a major bottleneck. Traditional evaluation methods often rely on small, manually curated test sets or require extensive "human-in-the-loop" validation, both of which are slow, expensive, and difficult to scale. This creates a critical gap: organizations need a robust, automated, and objective way to measure and improve the performance of their RAG applications before and during deployment.
This project will apply a cutting-edge methodology for the automated evaluation and optimization of RAG systems [https://arxiv.org/abs/2405.13622]. Instead of relying on manual evaluation, the system automatically generates a comprehensive, task-specific "exam" directly from an organization's own knowledge corpus.
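As a rough illustration of the exam-generation step (not the paper's actual implementation), each corpus chunk can be turned into a multiple-choice item by an LLM-driven item writer. The LLM call is stubbed out below with a trivial placeholder, and the `ExamItem` structure and function names are our own hypothetical sketch.

```python
from dataclasses import dataclass

@dataclass
class ExamItem:
    question: str
    choices: list[str]   # candidate answers, exactly one correct
    answer_index: int    # index of the correct choice
    source_chunk: str    # the corpus passage the item was generated from

def generate_item(chunk: str) -> ExamItem:
    """Stub for the LLM item writer. A real implementation would prompt an LLM
    to produce a question, the correct answer, and plausible distractors
    grounded in `chunk`; here we fabricate a trivial placeholder item."""
    return ExamItem(
        question=f"Which statement is supported by the passage beginning '{chunk[:40]}'?",
        choices=[chunk, "An unrelated claim", "Another unrelated claim"],
        answer_index=0,
        source_chunk=chunk,
    )

def build_exam(corpus_chunks: list[str]) -> list[ExamItem]:
    """Generate one multiple-choice item per chunk to form the synthetic exam."""
    return [generate_item(c) for c in corpus_chunks]
```

Because every item records its `source_chunk`, each exam question remains traceable to the passage it tests, which keeps the benchmark grounded in the organization's own corpus.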
This synthetic, multiple-choice exam will serve as a standardized benchmark to rigorously and objectively score the performance of any RAG pipeline. By treating the RAG system as an "exam-taker," its ability to retrieve and reason over the source information can be effectively quantified.
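Once the exam exists, scoring reduces to a plain accuracy computation. The sketch below assumes a hypothetical interface in which any pipeline under test is a callable mapping a question and its choices to a chosen index; the keyword-overlap `toy_rag_answer` is only a baseline stand-in for a real retrieve-then-generate pipeline.

```python
def score_pipeline(exam, rag_answer):
    """Fraction of exam items the pipeline answers correctly.

    `exam` is a list of (question, choices, answer_index) tuples and
    `rag_answer` is any callable (question, choices) -> chosen index;
    both interfaces are hypothetical, for illustration only."""
    correct = sum(1 for q, choices, ans in exam if rag_answer(q, choices) == ans)
    return correct / len(exam) if exam else 0.0

def toy_rag_answer(question: str, choices: list[str]) -> int:
    """Baseline 'exam-taker': pick the choice sharing the most words with the
    question. A real pipeline would retrieve context and query an LLM."""
    q = set(question.lower().split())
    return max(range(len(choices)),
               key=lambda i: len(q & set(choices[i].lower().split())))
```

Because `score_pipeline` is agnostic to what `rag_answer` does internally, the same exam can rank arbitrary pipeline variants (different LLMs, retrievers, or chunkers) on a single objective scale.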
This project will deliver a scalable and cost-effective framework for continuously evaluating and improving enterprise RAG systems. The primary outcome will be an optimized, high-performing RAG application for our chosen use case, backed by quantitative data. By automating the tedious evaluation process, this project will empower our teams to make rapid, data-driven decisions, significantly accelerating development cycles and increasing the reliability and factual accuracy of our AI-powered information systems.