This project is a framework for running experiments on the Chunking and Retrieval components of a Retrieval
Augmented Generation (RAG) system. The framework supports Chunking Methods from LlamaIndex
and LangChain, single-vector Embedding Models and Rerankers from the Huggingface Massive Text Embedding Benchmark (MTEB)
leaderboard, and the multi-vector Embedding Model ColBERT through
RAGatouille. The retrieved documents are evaluated with the
MRR, NDCG@k, Recall@k, Precision@k, MAP@k and Hit-Rate metrics provided by TorchMetrics.
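As a minimal illustration (not the framework's own evaluation code), these metrics can be computed with TorchMetrics' retrieval module roughly as follows; the scores, labels and grouping indexes below are made up:

```python
import torch
from torchmetrics.retrieval import (
    RetrievalMRR, RetrievalNormalizedDCG, RetrievalRecall,
    RetrievalPrecision, RetrievalMAP, RetrievalHitRate,
)

# Relevance scores and binary labels for the retrieved chunks of two questions;
# `indexes` groups each prediction by the question it belongs to.
preds = torch.tensor([0.9, 0.3, 0.1, 0.8, 0.2, 0.7])
target = torch.tensor([1, 0, 0, 0, 0, 1])
indexes = torch.tensor([0, 0, 0, 1, 1, 1])

for metric in (RetrievalMRR(), RetrievalNormalizedDCG(top_k=3),
               RetrievalRecall(top_k=3), RetrievalPrecision(top_k=3),
               RetrievalMAP(top_k=3), RetrievalHitRate(top_k=3)):
    print(metric.__class__.__name__, metric(preds, target, indexes=indexes))
```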
The combinations of Chunking Methods, Embedding Models and Rerankers are evaluated on Information Retrieval (IR) datasets, specifically
Question-Answering (QA) datasets. These datasets contain <question, context, answer>
triples, which makes them suitable for evaluating end-to-end RAG systems. For the evaluation of the Retrieval component alone, the <question, context>
pairs are sufficient.
The following datasets are currently supported:
These datasets are also available through Huggingface Question Answering Datasets. The framework is easily extendable with additional QA datasets by following the output conventions of the preprocessors.
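For illustration only, the <question, context> pairs needed for Retrieval evaluation can be obtained from a Huggingface QA dataset along these lines (SQuAD is used here purely as an example and is not necessarily one of the supported datasets; the preprocessors' actual output conventions may differ):

```python
from datasets import load_dataset

# Load a QA dataset and keep only the <question, context> pairs.
squad = load_dataset("squad", split="validation")
pairs = [(row["question"], row["context"]) for row in squad.select(range(3))]
for question, context in pairs:
    print(question, "->", context[:80], "...")
```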
For document chunking, the following methods are supported:
- SentenceSplitter
- SemanticSplitterNodeParser
- TokenTextSplitter
- SentenceWindowNodeParser
- LangchainNodeParser
The LangchainNodeParser supports RecursiveCharacterTextSplitter from LangChain. Examples can be found here.
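A short sketch of both kinds of splitters is shown below; the import paths are assumed for recent llama-index and langchain versions and may differ:

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter, LangchainNodeParser
from langchain.text_splitter import RecursiveCharacterTextSplitter

doc = Document(text="Retrieval Augmented Generation combines a retriever with a generator. " * 20)

# Native LlamaIndex chunking.
sentence_nodes = SentenceSplitter(chunk_size=128, chunk_overlap=16).get_nodes_from_documents([doc])

# LangChain's RecursiveCharacterTextSplitter wrapped for LlamaIndex.
recursive_nodes = LangchainNodeParser(
    RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
).get_nodes_from_documents([doc])

print(len(sentence_nodes), len(recursive_nodes))
```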
The experiments are configuration driven with Hydra. The main settings in the configuration file are the following:
config.yaml
Experiments can be run based on either the "config.yaml" or the "pipeline.yaml" file. If the pipeline is empty or
everything in it is commented out, the main config is used. There are two types of experiments: Huggingface Embedding
Model based and ColBERT based. The mlflow.experiment_name setting defines which experiment type will be run;
its value can be Retrieval or RetrievalColBERT, which select the corresponding underlying ExperimentRunner classes.
```yaml
defaults:
  - _self_
  - pipeline: pipeline

mlflow:
  experiment_name: Retrieval
  run_name: BAAI/bge-small-en-v1.5
  description: test run
  tags:
    experiment: Retrieval
    dataset: QUAC
    model: BAAI/bge-small-en-v1.5
```
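As a hypothetical sketch of how this dispatch could look (the runner class names and the config path are placeholders, not the project's actual code):

```python
import hydra
from omegaconf import DictConfig


class RetrievalExperimentRunner:
    """Placeholder for the Huggingface Embedding Model based runner."""

    def __init__(self, cfg: DictConfig):
        self.cfg = cfg

    def run(self) -> None:
        print("running", self.cfg.mlflow.experiment_name)


class ColBERTExperimentRunner(RetrievalExperimentRunner):
    """Placeholder for the ColBERT based runner."""


@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # mlflow.experiment_name decides which runner is instantiated.
    if cfg.mlflow.experiment_name == "RetrievalColBERT":
        runner = ColBERTExperimentRunner(cfg)
    else:
        runner = RetrievalExperimentRunner(cfg)
    runner.run()


if __name__ == "__main__":
    main()
```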
Paths to the datasets, the saved question embeddings, the MLflow experiments and the ColBERT index are configured below. Currently, only the question embeddings are saved, because they are independent of the chunking method and its parameters. The context embeddings could be cached and re-used as well by deriving a directory name from a hash of the chunking method name and its parameters (a sketch of this idea follows the path listing). The default ColBERT saving path is overridden by the colbert_index path below, but it is not used for re-loading: the index is overwritten by each new ColBERT experiment.
```yaml
datasets: ./datasets/
embeddings: ./artifacts/embeddings/
experiments: ./artifacts/experiments/
colbert_index: ./artifacts/colbert_index
```
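A sketch of the context-embedding caching idea mentioned above (not implemented in the framework; the helper name is made up):

```python
import hashlib
import json
from pathlib import Path


def chunker_cache_dir(root: str, name: str, params: dict) -> Path:
    """Derive a cache directory for context embeddings from the chunking
    method and its parameters, so identical settings re-use the same folder."""
    key = json.dumps({"name": name, "params": params}, sort_keys=True)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(root) / f"{name}-{digest}"


print(chunker_cache_dir("./artifacts/embeddings/", "SentenceSplitter",
                        {"chunk_size": 128, "chunk_overlap": 16}))
```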
There are four main sections related to Huggingface Embedding Model based retrieval and reranking.
The chunker section defines the chunking method to be used and its parameters. Note that if the pipeline mode is used to run the experiments, everything below the chunker section, up to the model section, should be commented out so that the configuration file is saved properly to MLflow.
The model section defines the Huggingface Embedding Model and its parameters. The normalize parameter should be set to true.
Reranking after Retrieval is turned on with the reranking flag. If it is enabled, the reranker section defines the model to be used and its parameters. It is important to note that, on Windows OS, using multiple workers for data loading resulted in bugs and wrong outputs.
The evaluation section defines the top k-s for which the Retrieval metrics will be computed.
```yaml
retrieval:
  # comment everything below the chunker until the model if the pipeline is used
  chunker:
    name: SentenceSplitter
    params:
      chunk_size: 128
      chunk_overlap: 16
  model:
    model_name: BAAI/bge-small-en-v1.5
    embed_batch_size: 32
    normalize: true
  reranking: false
  reranker:
    model_name: cross-encoder/ms-marco-TinyBERT-L-2-v2
    predict:
      batch_size: 32
      show_progress_bar: true
      #num_workers: 6 DON'T use it, it is buggy, same scores are returned!
  evaluation:
    top_ks:
      - 1
      - 3
      - 5
      - 7
      - 10
```
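For orientation, the model and reranker settings above roughly correspond to the following library calls (assumed import paths; the framework's internal wiring may differ):

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from sentence_transformers import CrossEncoder

# Embedding model settings from the `model` section.
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    embed_batch_size=32,
    normalize=True,
)
query_vec = embed_model.get_query_embedding("What is chunking in RAG?")

# Cross-encoder reranker settings from the `reranker` section.
reranker = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")
scores = reranker.predict(
    [("What is chunking in RAG?", "Chunking splits documents into passages."),
     ("What is chunking in RAG?", "The weather is nice today.")],
    batch_size=32,
    show_progress_bar=True,
)
print(len(query_vec), scores)
```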
For ColBERT based Retrieval, the following section defines the supported parameters. By default,
RAGatouille and ColBERT support the LlamaIndex SentenceSplitter chunking method. The max_document_length
parameter defines the maximum chunk size, and the question_batch_size parameter defines the batch size used for Retrieval evaluation.
```yaml
colbert_retrieval:
  model_name: colbert-ir/colbertv2.0
  max_document_length: 128
  question_batch_size: 5000
  evaluation:
    top_ks:
      - 1
      - 3
      - 5
      - 7
      - 10
```
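The corresponding RAGatouille calls look roughly like this (a hedged sketch; the index name and documents are made up):

```python
from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Index the chunks; RAGatouille splits documents up to max_document_length tokens.
rag.index(
    collection=["Chunking splits documents into passages.",
                "ColBERT performs late interaction over token embeddings."],
    index_name="example_index",
    max_document_length=128,
    split_documents=True,
)
results = rag.search(query="What does chunking do?", k=2)
print(results)
```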
Installation is complicated due to RAGatouille, ColBERT and Faiss-GPU (the latter is a must for optimized runtime).
The following steps were tested on Ubuntu 22.04.2 LTS. The easiest way to install the proper Nvidia driver and CUDA version is to follow the steps listed in this blog. For convenience, the steps are listed and extended below.
- sudo ubuntu-drivers devices
- sudo apt install nvidia-driver-470
- sudo reboot now
- nvidia-smi
- sudo apt install gcc
- gcc -v
- Download and install CUDA toolkit 11.8 from runfile (local) with these options: --toolkit --silent --override
- sudo reboot
- nano ~/.bashrc
- export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
- export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
- source ~/.bashrc
- nvcc -V
conda create -n chunking-research python=3.10.13
conda activate chunking-research
pip install -r requirements.txt
Unfortunately, RAGatouille's LlamaIndex dependency is inconsistent, so LlamaIndex should be removed and re-installed.
pip uninstall llama-index
pip install llama-index --upgrade --no-cache-dir --force-reinstall
Faiss-CPU should also be removed and Faiss-GPU installed with conda, as described here.
pip uninstall faiss-cpu
conda install -c conda-forge faiss-gpu
Start the MLflow tracking server, then run the experiments:
mlflow server --backend-store-uri artifacts/experiments/
python experiments.py
or without saving the results to MLflow
python retrieval.py