This project is a framework for running experiments on the Chunking and Retrieval components of a Retrieval
Augmented Generation (RAG) system. The framework supports Chunking Methods from LlamaIndex
and LangChain, single-vector Embedding Models and Rerankers from the Huggingface Massive Text Embedding Benchmark (MTEB)
leaderboard, and the multi-vector Embedding Model ColBERT through
RAGatouille. The retrieved documents are evaluated with the
MRR, NDCG@k, Recall@k, Precision@k, MAP@k and Hit-Rate metrics provided by TorchMetrics.
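As a minimal illustration (not the framework's own evaluation code), these metrics can be computed with TorchMetrics' retrieval module roughly as follows; the scores, labels and grouping indexes below are made up:

```python
import torch
from torchmetrics.retrieval import (
    RetrievalMRR, RetrievalNormalizedDCG, RetrievalRecall,
    RetrievalPrecision, RetrievalMAP, RetrievalHitRate,
)

# Relevance scores and binary labels for the retrieved chunks of two questions;
# `indexes` groups each prediction by the question it belongs to.
preds = torch.tensor([0.9, 0.3, 0.1, 0.8, 0.2, 0.7])
target = torch.tensor([1, 0, 0, 0, 0, 1])
indexes = torch.tensor([0, 0, 0, 1, 1, 1])

for metric in (RetrievalMRR(), RetrievalNormalizedDCG(top_k=3),
               RetrievalRecall(top_k=3), RetrievalPrecision(top_k=3),
               RetrievalMAP(top_k=3), RetrievalHitRate(top_k=3)):
    print(metric.__class__.__name__, metric(preds, target, indexes=indexes))
```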
The combinations of Chunking Methods, Embedding Models and Rerankers are evaluated on Information Retrieval (IR) datasets, specifically
Question-Answering (QA) datasets. These datasets contain <question, context, answer>
triples, which makes them suitable for evaluating end-to-end RAG systems. For the evaluation of the Retrieval component alone, the <question, context>
pairs are sufficient.
The following datasets are currently supported:
These datasets are also available through Huggingface Question Answering Datasets. The framework is easily extendable with additional QA datasets by following the output conventions of the preprocessors.
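For illustration only, the <question, context> pairs needed for Retrieval evaluation can be obtained from a Huggingface QA dataset along these lines (SQuAD is used here purely as an example and is not necessarily one of the supported datasets; the preprocessors' actual output conventions may differ):

```python
from datasets import load_dataset

# Load a QA dataset and keep only the <question, context> pairs.
squad = load_dataset("squad", split="validation")
pairs = [(row["question"], row["context"]) for row in squad.select(range(3))]
for question, context in pairs:
    print(question, "->", context[:80], "...")
```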
For document chunking, the following methods are supported:
- SentenceSplitter
- SemanticSplitterNodeParser
- TokenTextSplitter
- SentenceWindowNodeParser
- LangchainNodeParser
The LangchainNodeParser supports RecursiveCharacterTextSplitter from LangChain. Examples can be found here.
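A short sketch of both kinds of splitters is shown below; the import paths are assumed for recent llama-index and langchain versions and may differ:

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter, LangchainNodeParser
from langchain.text_splitter import RecursiveCharacterTextSplitter

doc = Document(text="Retrieval Augmented Generation combines a retriever with a generator. " * 20)

# Native LlamaIndex chunking.
sentence_nodes = SentenceSplitter(chunk_size=128, chunk_overlap=16).get_nodes_from_documents([doc])

# LangChain's RecursiveCharacterTextSplitter wrapped for LlamaIndex.
recursive_nodes = LangchainNodeParser(
    RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
).get_nodes_from_documents([doc])

print(len(sentence_nodes), len(recursive_nodes))
```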
The experiments are configuration driven with Hydra. The main settings in the configuration file are the following:
config.yaml
Experiments can be run based on either the "config.yaml" or the "pipeline.yaml" file. If the pipeline is empty or
everything in it is commented out, the main config is used. There are two types of experiments: Huggingface Embedding
Model based and ColBERT based. The mlflow.experiment_name setting defines which experiment type will be run;
its value can be Retrieval or RetrievalColBERT, which select the corresponding underlying ExperimentRunner classes.
```yaml
defaults:
  - _self_
  - pipeline: pipeline

mlflow:
  experiment_name: Retrieval
  run_name: BAAI/bge-small-en-v1.5
  description: test run
  tags:
    experiment: Retrieval
    dataset: QUAC
    model: BAAI/bge-small-en-v1.5
```
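As a hypothetical sketch of how this dispatch could look (the runner class names and the config path are placeholders, not the project's actual code):

```python
import hydra
from omegaconf import DictConfig


class RetrievalExperimentRunner:
    """Placeholder for the Huggingface Embedding Model based runner."""

    def __init__(self, cfg: DictConfig):
        self.cfg = cfg

    def run(self) -> None:
        print("running", self.cfg.mlflow.experiment_name)


class ColBERTExperimentRunner(RetrievalExperimentRunner):
    """Placeholder for the ColBERT based runner."""


@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # mlflow.experiment_name decides which runner is instantiated.
    if cfg.mlflow.experiment_name == "RetrievalColBERT":
        runner = ColBERTExperimentRunner(cfg)
    else:
        runner = RetrievalExperimentRunner(cfg)
    runner.run()


if __name__ == "__main__":
    main()
```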
Paths to the datasets, the saved question embeddings, the MLflow experiments and the ColBERT index are configured below. Currently, only the question embeddings are saved, because they are independent of the chunking method and its parameters. The context embeddings could be cached and re-used as well by deriving a directory name from a hash of the chunking method name and its parameters (a sketch of this idea follows the path listing). The default ColBERT saving path is overridden by the colbert_index path below, but it is not used for re-loading: the index is overwritten by each new ColBERT experiment.
```yaml
datasets: ./datasets/
embeddings: ./artifacts/embeddings/
experiments: ./artifacts/experiments/
colbert_index: ./artifacts/colbert_index
```
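A sketch of the context-embedding caching idea mentioned above (not implemented in the framework; the helper name is made up):

```python
import hashlib
import json
from pathlib import Path


def chunker_cache_dir(root: str, name: str, params: dict) -> Path:
    """Derive a cache directory for context embeddings from the chunking
    method and its parameters, so identical settings re-use the same folder."""
    key = json.dumps({"name": name, "params": params}, sort_keys=True)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(root) / f"{name}-{digest}"


print(chunker_cache_dir("./artifacts/embeddings/", "SentenceSplitter",
                        {"chunk_size": 128, "chunk_overlap": 16}))
```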
There are four main sections related to Huggingface Embedding Model based retrieval and reranking.
The chunker section defines the chunking method to be used and its parameters. Note that if the pipeline mode is used to run the experiments, everything below the chunker section, up to the model section, should be commented out so that the configuration file is saved properly to MLflow.
The model section defines the Huggingface Embedding Model and its parameters. The normalize parameter should be set to true.
Reranking after Retrieval is turned on with the reranking flag. If it is enabled, the reranker section defines the model to be used and its parameters. It is important to note that, on Windows OS, using multiple workers for data loading resulted in bugs and wrong outputs.
The evaluation section defines the top k-s for which the Retrieval metrics will be computed.
```yaml
retrieval:
  # comment everything below the chunker until the model if the pipeline is used
  chunker:
    name: SentenceSplitter
    params:
      chunk_size: 128
      chunk_overlap: 16
  model:
    model_name: BAAI/bge-small-en-v1.5
    embed_batch_size: 32
    normalize: true
  reranking: false
  reranker:
    model_name: cross-encoder/ms-marco-TinyBERT-L-2-v2
    predict:
      batch_size: 32
      show_progress_bar: true
      #num_workers: 6 DON'T use it, it is buggy, same scores are returned!
  evaluation:
    top_ks:
      - 1
      - 3
      - 5
      - 7
      - 10
```
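For orientation, the model and reranker settings above roughly correspond to the following library calls (assumed import paths; the framework's internal wiring may differ):

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from sentence_transformers import CrossEncoder

# Embedding model settings from the `model` section.
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    embed_batch_size=32,
    normalize=True,
)
query_vec = embed_model.get_query_embedding("What is chunking in RAG?")

# Cross-encoder reranker settings from the `reranker` section.
reranker = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")
scores = reranker.predict(
    [("What is chunking in RAG?", "Chunking splits documents into passages."),
     ("What is chunking in RAG?", "The weather is nice today.")],
    batch_size=32,
    show_progress_bar=True,
)
print(len(query_vec), scores)
```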
For ColBERT based Retrieval, the following section defines the supported parameters. By default,
RAGatouille and ColBERT support the LlamaIndex SentenceSplitter chunking method. The max_document_length
parameter defines the maximum chunk size, and the question_batch_size parameter defines the batch size used for Retrieval evaluation.
```yaml
colbert_retrieval:
  model_name: colbert-ir/colbertv2.0
  max_document_length: 128
  question_batch_size: 5000
  evaluation:
    top_ks:
      - 1
      - 3
      - 5
      - 7
      - 10
```
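The corresponding RAGatouille calls look roughly like this (a hedged sketch; the index name and documents are made up):

```python
from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Index the chunks; RAGatouille splits documents up to max_document_length tokens.
rag.index(
    collection=["Chunking splits documents into passages.",
                "ColBERT performs late interaction over token embeddings."],
    index_name="example_index",
    max_document_length=128,
    split_documents=True,
)
results = rag.search(query="What does chunking do?", k=2)
print(results)
```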
Installation is complicated due to RAGatouille, ColBERT and Faiss-GPU (the latter is a must for optimized runtime).
The following steps were tested on Ubuntu 22.04.2 LTS. The easiest way to install the proper Nvidia driver and CUDA version is to follow the steps listed in this blog. For convenience, the steps are listed and extended below.
- sudo ubuntu-drivers devices
- sudo apt install nvidia-driver-470
- sudo reboot now
- nvidia-smi
- sudo apt install gcc
- gcc -v
- Download and install CUDA toolkit 11.8 from runfile (local) with these options: --toolkit --silent --override
- sudo reboot
- nano ~/.bashrc
- export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
- export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
- source ~/.bashrc
- nvcc -V
conda create -n chunking-research python=3.10.13
conda activate chunking-research
pip install -r requirements.txt
Unfortunately, RAGatouille's LlamaIndex dependency is inconsistent, so LlamaIndex should be removed and re-installed.
pip uninstall llama-index
pip install llama-index --upgrade --no-cache-dir --force-reinstall
Faiss-CPU should also be removed and Faiss-GPU installed with conda, as described here.
pip uninstall faiss-cpu
conda install -c conda-forge faiss-gpu
Start the MLflow tracking server, then run the experiments:
mlflow server --backend-store-uri artifacts/experiments/
python experiments.py
or without saving the results to MLflow
python retrieval.py