Managing invoices is a critical yet often cumbersome task for businesses of all sizes. The sheer volume of data, coupled with the need for accuracy and efficiency, can make invoice processing a significant challenge. This repository provides a solution that uses a Streamlit application and Anthropic models on Amazon Bedrock to streamline and automate the process.
This project demonstrates how to process PDF invoices stored in an Amazon S3 bucket using Amazon Bedrock. Amazon Bedrock is a fully managed service for building generative AI applications that gives access to a range of LLMs. In this project, we extract the invoice data, summarize each invoice, and store the results in a JSON file. Alternatively, you can store this JSON and its key-value pairs in your operational databases as required.
This application uses the Chat with document feature of Amazon Bedrock Knowledge Bases with the Anthropic Claude 3 Sonnet LLM to extract information from PDF invoices. It also provides a Streamlit app that displays the invoices and the extracted information side by side for easier review.
- Python 3.7 or later on your local machine
- AWS CLI installed and configured with appropriate credentials
  - Set the region where you would like to run this invoice processor by following the Set up AWS Credentials and Region for Development documentation.
  Note: The region must have Amazon Bedrock and the Anthropic Claude 3 Sonnet model available. You can check model availability by region in the Amazon Bedrock documentation.
- Access to the Anthropic Claude 3 Sonnet foundation model on Amazon Bedrock in the chosen region
- Invoices that you want to process
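To check the region requirement above programmatically, something along these lines may help. This is a minimal sketch and not part of the repository: the function name `find_claude_sonnet_models` is hypothetical, and it assumes a boto3 `bedrock` client is passed in. As far as we know, `list_foundation_models` reports which models a region offers; it does not confirm that model access has been granted to your account.

```python
def find_claude_sonnet_models(bedrock_client):
    """Return the IDs of Anthropic Claude 3 Sonnet models offered in the client's region.

    `bedrock_client` is assumed to be a boto3 "bedrock" client (hypothetical helper).
    """
    response = bedrock_client.list_foundation_models(byProvider="Anthropic")
    return [
        summary["modelId"]
        for summary in response.get("modelSummaries", [])
        if "claude-3-sonnet" in summary["modelId"]
    ]


# Usage (requires boto3 and configured AWS credentials; the region is a placeholder):
#   import boto3
#   client = boto3.client("bedrock", region_name="us-east-1")
#   print(find_claude_sonnet_models(client) or "Claude 3 Sonnet is not offered here")
```

An empty list suggests choosing a different region before continuing with the setup.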
- Clone the GitHub repository:
  git clone https://github.com/aws-samples/genai-invoice-processor.git
- Navigate to the project directory:
  cd </path/to/your/folder>/genai-invoice-processor
- Upgrade pip:
  python3 -m pip install --upgrade pip
- (Optional) Create a virtual environment to isolate dependencies:
  python3 -m venv venv
  Activate the virtual environment:
  Mac/Linux:
  source venv/bin/activate
  Windows:
  venv\Scripts\activate
- Install the necessary Python packages:
  pip install -r requirements.txt
- Update the region in the config.yaml file to the same region you set for your AWS CLI, where Amazon Bedrock and the Anthropic Claude 3 Sonnet model are available.
- Create an S3 bucket:
  aws s3 mb s3://<your-bucket-name> --region <your-region>
  - Replace <your-bucket-name> with the desired name of your S3 bucket.
  - Replace <your-region> with the AWS region set for your AWS CLI and in config.yaml, such as us-east-1.
- Copy your invoices from your local computer to the S3 bucket created in the previous step using the first AWS CLI command below. To upload them into a folder within the bucket instead, use the second command.
  aws s3 cp </path/to/your/local/folder/with/invoices> s3://<your-bucket-name>/ --recursive
  aws s3 cp </path/to/your/local/folder/with/invoices> s3://<your-bucket-name>/<folder>/ --recursive
- Validate the upload:
  aws s3 ls s3://<your-bucket-name>/
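The same validation can be done from Python, which is roughly what a processing script needs to do before it downloads anything. The sketch below is an illustration under assumptions, not the repository's code: `list_pdf_keys` is a hypothetical helper, and it expects a boto3 S3 client to be passed in.

```python
def list_pdf_keys(s3_client, bucket_name, prefix=""):
    """Return the keys of all PDF objects in the bucket, optionally under a prefix.

    `s3_client` is assumed to be a boto3 "s3" client (hypothetical helper).
    """
    # list_objects_v2 returns at most 1,000 keys per call, so use a paginator.
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(".pdf"):
                keys.append(obj["Key"])
    return keys


# Usage (requires boto3 and configured AWS credentials; names are placeholders):
#   import boto3
#   print(list_pdf_keys(boto3.client("s3"), "<your-bucket-name>", "<folder>/"))
```

Passing an empty prefix lists every PDF in the bucket, and non-PDF objects are skipped.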
This project uses a config.yaml file for configuration. Before running the application, review and update this file as needed:
- The file contains settings for the AWS region and the Bedrock model ID.
- The default model is Claude 3 Sonnet. You can find the model IDs at https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html
- It also specifies the output file path and local download folder for invoices.
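A quick sanity check of the configuration before a run could look like the sketch below. The section and key names (`aws.region_name`, `aws.model_id`, `processing.output_file`, `processing.local_download_folder`) are assumptions made for illustration, so compare them with the actual config.yaml in the repository. It also assumes PyYAML is installed.

```python
# Hypothetical layout: adjust the sections and keys to match the real config.yaml.
REQUIRED_KEYS = {
    "aws": ("region_name", "model_id"),
    "processing": ("output_file", "local_download_folder"),
}


def validate_config(config):
    """Return the missing or empty settings, each written as 'section.key'."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section) or {}
        for key in keys:
            if not values.get(key):
                missing.append(f"{section}.{key}")
    return missing


def load_config(path="config.yaml"):
    """Read the YAML file into a dict (needs PyYAML, imported only when called)."""
    import yaml

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle) or {}


# Usage:
#   problems = validate_config(load_config())
#   if problems:
#       raise SystemExit(f"config.yaml is missing: {', '.join(problems)}")
```

Failing early on a missing region or model ID is cheaper than discovering it partway through a batch of invoices.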
In this step, we process the invoices in the S3 bucket and store the model output in the processed_invoice_output.json file. Processing each invoice involves three steps:
- Extracting data from the invoice in key-value format.
- Extracting only the key information that our stakeholders require.
- Summarizing the invoice.
You can review the prompts in the invoices_processor.py file, and you can use a different LLM for each of these three steps.
To process invoices stored in an S3 bucket, run the following command:
python invoices_processor.py --bucket_name='<your-bucket-name>' --prefix='<your-folder>'
Note: The --prefix argument is optional. If omitted, the script processes all PDFs in the bucket.
Examples:
python invoices_processor.py --bucket_name='gen-ai-demo-bucket'
python invoices_processor.py --bucket_name='gen-ai-demo-bucket' --prefix='invoice'
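For readers curious how a required bucket name and an optional prefix are typically declared, here is a minimal argparse sketch. It mirrors the two flags shown above, but it is an assumption about the interface, not a copy of the parser in invoices_processor.py; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with a required --bucket_name and an optional --prefix."""
    parser = argparse.ArgumentParser(
        description="Process PDF invoices stored in an S3 bucket."
    )
    parser.add_argument(
        "--bucket_name", required=True, help="S3 bucket that holds the invoices"
    )
    # An empty prefix matches every key, so all PDFs in the bucket are processed.
    parser.add_argument(
        "--prefix", default="", help="Optional folder (key prefix) inside the bucket"
    )
    return parser


# Usage:
#   args = build_parser().parse_args()
#   print(args.bucket_name, args.prefix)
```

Defaulting the prefix to an empty string means the listing code needs no special case when the flag is omitted.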
After the job completes successfully, you should see an invoices folder in your local file system containing all the invoices. You will also see a processed_invoice_output.json file with all the metadata extracted by Amazon Bedrock Knowledge Bases using the Claude 3 Sonnet model.
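Each of the three processing steps listed earlier amounts to one Chat with document request against a single invoice in S3. The sketch below shows how such a request might be built with the boto3 `bedrock-agent-runtime` client and its `retrieve_and_generate` call. The helper names are hypothetical, the prompt text is left to the caller, and the prompts and error handling actually used live in invoices_processor.py.

```python
def build_chat_with_document_request(prompt, bucket_name, key, model_id, region):
    """Build keyword arguments for retrieve_and_generate on one S3 document."""
    return {
        "input": {"text": prompt},
        "retrieveAndGenerateConfiguration": {
            # EXTERNAL_SOURCES means no knowledge base needs to be created first.
            "type": "EXTERNAL_SOURCES",
            "externalSourcesConfiguration": {
                "modelArn": f"arn:aws:bedrock:{region}::foundation-model/{model_id}",
                "sources": [
                    {
                        "sourceType": "S3",
                        "s3Location": {"uri": f"s3://{bucket_name}/{key}"},
                    }
                ],
            },
        },
    }


def ask_about_invoice(agent_runtime_client, prompt, bucket_name, key, model_id, region):
    """Send one prompt about one invoice and return the model's text answer."""
    request = build_chat_with_document_request(prompt, bucket_name, key, model_id, region)
    response = agent_runtime_client.retrieve_and_generate(**request)
    return response["output"]["text"]


# Usage (requires boto3 and configured AWS credentials; values are placeholders):
#   import boto3
#   client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
#   print(ask_about_invoice(client, "Summarize this invoice.", "<your-bucket-name>",
#                           "<folder>/invoice.pdf",
#                           "anthropic.claude-3-sonnet-20240229-v1:0", "us-east-1"))
```

Because the model ID is a parameter, a different LLM can be used for each step by calling the helper with a different `model_id`.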
To review the processed invoice data, you can run the Streamlit app with the following command:
streamlit run review-invoice-data.py
or
python -m streamlit run review-invoice-data.py
The Streamlit app will open in your default web browser, allowing you to view and interact with the processed invoice data.
- invoices_processor.py: The main script for processing invoices stored in an S3 bucket.
- review-invoice-data.py: The Streamlit app for reviewing the processed invoice data.
- requirements.txt: List of required Python packages.
- README.md: This file, containing project documentation.
- config.yaml: Configuration for the AWS region, Bedrock model, and local folder and file structure used by both scripts.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.