A visual speech recognition (VSR) tool that reads your lips in real time and types whatever you silently mouth. Runs fully locally.
Relies on a model trained on the Lip Reading Sentences 3 dataset as part of the Auto-AVSR project.
Watch a demo of Chaplin here.
- Clone the repository, and `cd` into it:

  ```bash
  git clone https://github.com/amanvirparhar/chaplin
  cd chaplin
  ```
- Download the required model components: `LRS3_V_WER19.1` and `lm_en_subword`.
- Unzip both folders, and place them in their respective directories:

  ```
  chaplin/
  ├── benchmarks/
  │   ├── LRS3/
  │   │   ├── language_models/
  │   │   │   ├── lm_en_subword/
  │   │   ├── models/
  │   │   │   ├── LRS3_V_WER19.1/
  ├── ...
  ```
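  A minimal sketch of that placement, assuming both archives were downloaded to the repo root; the zip filenames here are hypothetical, so adjust them to match your downloads:

  ```bash
  # Hypothetical archive names; adjust to whatever the downloads are actually called
  unzip LRS3_V_WER19.1.zip -d benchmarks/LRS3/models/
  unzip lm_en_subword.zip -d benchmarks/LRS3/language_models/
  ```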
- Install and run `ollama`, and pull the `llama3.2` model (see the command below).
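  The pull step uses ollama's standard CLI:

  ```bash
  # Fetch the llama3.2 model so it is available locally
  ollama pull llama3.2
  ```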
- Install `uv` (one common method is shown below).
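  A typical route, assuming you want uv's standalone installer (installing via pip also works):

  ```bash
  # Official standalone installer from the uv docs
  curl -LsSf https://astral.sh/uv/install.sh | sh
  # or, alternatively:
  # pip install uv
  ```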
- Run the following command:

  ```bash
  sudo uv run --with-requirements requirements.txt --python 3.12 main.py config_filename=./configs/LRS3_V_WER19.1.ini detector=mediapipe
  ```
- Once the camera feed is displayed, you can start "recording" by pressing the `option` key (Mac) or the `alt` key (Windows/Linux), and begin mouthing words.
- To stop recording, press the `option` key (Mac) or the `alt` key (Windows/Linux) again. You should see the recognized text typed out wherever your cursor is.
- To exit gracefully, focus on the window displaying the camera feed and press `q`.