Updates the readme to show how to pip install and run a transform. #928

Status: Open. 28 commits to be merged into base branch `dev`.

Commits
021c8f5 fix lib doc .py links and update resize readme (daw3rd, Aug 13, 2024)
666e558 Merge branch 'dev' into Readme-Changes (daw3rd, Sep 18, 2024)
591f3d8 reorder some instructions in RELEASE.md (daw3rd, Sep 18, 2024)
0b4c712 Merge branch 'dev' into Readme-Changes (daw3rd, Sep 23, 2024)
31e8354 updated doc on exception processing by the runtime (daw3rd, Sep 23, 2024)
ebbc0a1 updated release notes and release process doc (daw3rd, Sep 25, 2024)
96122eb Merge branch 'dev' into Readme-Changes (daw3rd, Sep 25, 2024)
9671d6f Merge branch 'dev' into Readme-Changes (daw3rd, Sep 25, 2024)
698edbe cleanups in the release documentation (daw3rd, Sep 26, 2024)
81f0b35 cleanups in the release documentation (daw3rd, Sep 26, 2024)
148fde8 Merge branch 'dev' into Readme-Changes (daw3rd, Oct 1, 2024)
61dc844 remove duplicated table of transforms (daw3rd, Oct 1, 2024)
88fc03a center columns in module table of readme (daw3rd, Oct 1, 2024)
474ab8d Merge branch 'dev' into Readme-Changes (daw3rd, Dec 12, 2024)
60171f8 readme changes for simplified start example (daw3rd, Jan 6, 2025)
355ab20 notebook readme (daw3rd, Jan 6, 2025)
b9cd435 Merge branch 'dev' into readme-david (daw3rd, Jan 6, 2025)
289fbba add pip install/python to show running transform from cli in top leve… (daw3rd, Jan 8, 2025)
60a15f3 add terminology to readme and tune cli python run (daw3rd, Jan 8, 2025)
ce2ab62 use wget to get data and reorder Getting Started sections (daw3rd, Jan 9, 2025)
ac355d9 improved wget urls (daw3rd, Jan 9, 2025)
4c63be0 simplify first/readme notebook and setup (daw3rd, Jan 9, 2025)
4a99d86 fix google colab link for new notebook - temporarily for testing (daw3rd, Jan 9, 2025)
6588e9f restore collab notebook link to be to dev branch (daw3rd, Jan 9, 2025)
d2a1279 change readme to only install pdf2parquet to workaround fasttext inst… (daw3rd, Jan 9, 2025)
bc3d99f Merge branch 'dev' into readme-david (daw3rd, Jan 24, 2025)
ce00d2c Update/merge CLI transform invocation in the README (daw3rd, Jan 24, 2025)
2676cf0 delete unneeded notebook and readme (daw3rd, Jan 24, 2025)

27 changes: 27 additions & 0 deletions README.md
@@ -73,6 +73,32 @@ pip install 'data-prep-toolkit-transforms[pdf2parquet]'
```
For additional guidance on creating the virtual environment for installing the data prep kit, click [here](doc/quick-start/quick-start.md#conda).

### Run a transform at the command line
Here we run the `pdf2parquet` transform on sample input data to
import PDF content into rows of a Parquet file.
First, we download some data for the transform to run on.
```shell
wget -P input https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/pdf2parquet/test-data/input/archive1.zip
wget -P input https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/pdf2parquet/test-data/input/redp5110-ch1.pdf
```
```shell
% ls input
archive1.zip redp5110-ch1.pdf
```

Next, we run `pdf2parquet` on the data in the `input` folder.
```shell
python -m dpk_pdf2parquet.transform_python \
--data_local_config "{ 'input_folder': 'input', 'output_folder': 'output'}" \
--data_files_to_use "['.pdf', '.zip']"
```
Parquet files are generated in the designated `output` folder:
```shell
% ls output
archive1.parquet metadata.json redp5110-ch1.parquet
```
All transforms are runnable from the command line in the manner above.
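
When the same invocation has to be generated from a script, the quoting is easier to get right if the arguments are built programmatically. The sketch below is a minimal illustration, not part of the toolkit: `build_command` is a hypothetical helper, and it assumes the two option values are read as Python literals, as the single-quoted dictionary in the command above suggests.

```python
import ast
import shlex
import sys


def build_command(module, input_folder, output_folder, extensions):
    """Build the argv list for one transform run (hypothetical helper, not a toolkit API)."""
    local_config = {"input_folder": input_folder, "output_folder": output_folder}
    return [
        sys.executable, "-m", module,
        # repr() yields the single-quoted literal form used in the shell command above.
        "--data_local_config", repr(local_config),
        "--data_files_to_use", repr(extensions),
    ]


command = build_command("dpk_pdf2parquet.transform_python", "input", "output", [".pdf", ".zip"])

# Assumption: the values are parsed as Python literals, so they should round-trip.
assert ast.literal_eval(command[4]) == {"input_folder": "input", "output_folder": "output"}

# Print a copy-pasteable shell line; subprocess.run(command, check=True) would execute it.
print(shlex.join(command))
```

Passing the list form to `subprocess.run` avoids a second layer of shell quoting around the dictionary and list values.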

### Run your first data prep pipeline

Now that you have run a single transform, the next step is to explore how to put these transforms
@@ -81,6 +107,7 @@ a RAG application.
This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of
how to build an end to end data prep pipeline for fine tuning for code LLMs.
You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).
Pipelines can also be defined using multiple transform invocations from the command line.
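
One way to read "multiple transform invocations" is that each transform's output folder becomes the next transform's input folder. The sketch below plans such a chain under that assumption. `plan_pipeline` is a hypothetical helper, and the second module name is a placeholder rather than a real transform; only `dpk_pdf2parquet.transform_python` and the two `--data_*` options come from this README.

```python
import sys
from pathlib import Path


def plan_pipeline(stages, first_input, work_dir):
    """Return one argv list per stage; each stage reads the previous stage's output folder."""
    commands = []
    current_input = Path(first_input)
    for index, (module, extra_args) in enumerate(stages, start=1):
        output = Path(work_dir) / f"stage{index}"
        config = {"input_folder": str(current_input), "output_folder": str(output)}
        commands.append(
            [sys.executable, "-m", module, "--data_local_config", repr(config), *extra_args]
        )
        current_input = output  # the next stage consumes what this stage wrote
    return commands


stages = [
    ("dpk_pdf2parquet.transform_python", ["--data_files_to_use", "['.pdf', '.zip']"]),
    # Placeholder module name (assumption): substitute an installed transform here.
    ("dpk_next_transform.transform_python", []),
]

for command in plan_pipeline(stages, "input", "output"):
    # To execute a stage for real, pass the list to subprocess.run(command, check=True).
    print(" ".join(command))
```

Planning the commands before running them keeps the folder wiring visible and makes it easy to rerun a single stage.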

### Current list of transforms
The matrix below shows the combination of modules and supported runtimes. All the modules can be accessed [here](transforms) and can be combined to form data processing pipelines, as shown in the [examples](examples/notebooks) folder.