
Code-LLM

The main purpose of this project is to let you finetune your own Code Large Language Model to act as your own coding assistant, much like GitHub Copilot. Its main feature is that you can do this on a single consumer-grade GPU. This is made possible by Unsloth, an LLM finetuning library that focuses on efficiency without degrading accuracy. For context, I implemented and ran this project on my personal machine, which has an RTX 3080 Ti with 12GB of VRAM. You can still make it work on a GPU with as little as 4GB of VRAM, provided you use a model with fewer parameters or lower the batch size.
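As a rough illustration of why this fits in limited VRAM, the sketch below loads a small code model in 4-bit with Unsloth. This is not the project's train.py; the model name and values are example assumptions you would adjust to your own hardware.

# Minimal sketch: load a small model in 4-bit so finetuning fits on a consumer-grade GPU.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-Coder-1.5B",  # example: pick a smaller model for low-VRAM GPUs
    max_seq_length=2048,                   # shorter sequences also reduce memory use
    load_in_4bit=True,                     # 4-bit quantization cuts VRAM requirements substantially
)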

Demo

I've created a TabAutocomplete model by finetuning the Qwen2.5-Coder:1.5B model on the hf-stack-v1 dataset, specifically the Transformers section of the dataset. I then hooked the model up to VSCode using the Continue.dev extension. Shown below are snippets of the difference between using this finetuned model and GitHub Copilot.

[Screenshots: side-by-side comparison of the finetuned model and GitHub Copilot on the automodel and lora_config completions]

The first pair of images shows only a subtle difference, as both can be valid code segments. In the second pair, however, you can clearly see that the Copilot sample hallucinates a lot of its suggestions, while the finetuned model's suggestion is much more concise and, most importantly, correct.
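If you want to carve out a similar subset of the data yourself, the sketch below shows one way to do it with the datasets library. The dataset id and column name here are assumptions and may differ from what the training script actually expects.

from datasets import load_dataset

# Assumed dataset id and column name; adjust to whatever your settings point at.
dataset = load_dataset("smangrul/hf-stack-v1", split="train")
transformers_only = dataset.filter(lambda row: "transformers" in row["repo_id"])
print(f"Kept {len(transformers_only)} files from the Transformers repository")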

Training your own

To try training your own model, follow the setup below:

Setup

Python dependencies

This repository uses uv as its Python dependency management tool. Install uv with:

curl -LsSf https://astral.sh/uv/install.sh | sh

Initialize the virtual environment and activate it:

uv venv
source .venv/bin/activate

Install dependencies with:

uv sync

This project also requires flash-attn and unsloth to be installed. To install flash-attn, run:

uv pip install flash-attn --no-build-isolation

Make sure you have run uv sync first, as flash-attn requires torch to be installed beforehand.

Then, to install unsloth, run the command below to get the optimal install command for your setup:

wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

After running this command, you will get something like:

pip install --upgrade pip && pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"

You just need to prefix the pip commands with uv so they run through uv, this project's dependency management tool. The command above will then look something like:

uv pip install --upgrade pip && uv pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"

Install llama.cpp

llama.cpp is a submodule of this repository. If you cloned with the --recursive flag, you should already have the llama.cpp directory. If not, run git submodule update --init.

Go to the llama.cpp directory and run the following, depending on your platform. For CUDA-enabled devices:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

On macOS, Metal is enabled by default, which makes the computation run on the GPU. To disable the Metal build at compile time, use the -DGGML_METAL=OFF cmake option.

For unsloth to find the llama.cpp binaries, specifically llama-quantize, move or copy the llama-quantize binary to the llama.cpp directory root.

Inside the llama.cpp directory, run:

cp build/bin/llama-quantize .

Install ollama

To install ollama, run:

curl -fsSL https://ollama.com/install.sh | sh

Training

Run the training script with:

uv run train.py

The parameters for the training run can be modified in the settings file. For more information on the parameters, check out the file itself; it contains useful comments describing what each parameter does.
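As a rough, hypothetical illustration of the kinds of parameters involved (not the actual contents of this repo's settings file), here is how a LoRA finetuning run is typically wired up with Unsloth and trl's SFTTrainer. Every value below is an example to adjust, and the exact keyword arguments can vary between trl versions.

from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Base model and LoRA adapter, as in the earlier sketch.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-Coder-1.5B", max_seq_length=2048, load_in_4bit=True
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=16, lora_dropout=0)

dataset = load_dataset("smangrul/hf-stack-v1", split="train")  # assumed dataset id

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="content",        # assumed name of the text column
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,   # lower this on smaller GPUs
        gradient_accumulation_steps=4,   # keeps the effective batch size up
        learning_rate=2e-4,
        max_steps=500,
        fp16=True,                       # or bf16=True on GPUs that support it
        logging_steps=10,
        output_dir="runs/example",       # hypothetical output directory
    ),
)
trainer.train()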

After training, the trained adapter is saved, along with the merged base-plus-adapter model in GGUF format. You can then use this model in ollama for inference. To import the trained model into ollama, run:

ollama create -f <path/to/trained/Modelfile> <your choice of name>

For example, if I do a training run and have it saved under runs/run4, I would run:

ollama create -f runs/run4/final/Modelfile sample_model

Inference

For inference, you can try it out straight using ollama with:

ollama run <your model name>

Following the earlier example, that would be:

ollama run sample_model

From here, you can interact with your model directly using ollama's interface or via its OpenAI-compatible API. For more information, visit the ollama docs.
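For example, here is a minimal sketch of calling the model through ollama's OpenAI-compatible endpoint, assuming ollama is serving locally on its default port and the model was named sample_model as above:

from openai import OpenAI

# ollama exposes an OpenAI-compatible API under /v1; the api_key is required by
# the client but ignored by ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="sample_model",
    messages=[{"role": "user", "content": "Write a snippet that loads a model with AutoModel."}],
)
print(response.choices[0].message.content)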

To hook your trained model into VSCode, use the Continue.dev extension. Below is a sample configuration you can use for tabAutocomplete; make sure to change the values to match your run.

"tabAutocompleteModel": {
    "title": "Tab Autocomplete Model",
    "provider": "ollama",
    "model": "hf-stack-v1:latest"
},
"tabAutocompleteOptions": {
    "debounceDelay": 500,
    "maxPromptTokens": 2000,
    "disableInFiles": ["*.md"]
}

Citations

This project is mostly inspired by and based off of this guide from Hugging Face. I've also followed this Unsloth guide for using unsloth with ollama.