[SYSTEMDS-3548] Optimize IO path Python interface for SystemDS #2189

Nakroma · 2025-01-24T16:43:53Z

Student project SYSTEMDS-3548 and follow-up to #2154

Contributions/discussion:

I followed up on both suggestions from @Baunsgaard in the first PR and tested chunking into smaller parts as well as fusing operations into fewer Java calls. I was unable to get any improvements that were reproducible on the larger datasets, although I don't have much experience with Py4J, so there might still be some potential here. I did add some of the supporting code for it (fusing convert, setting only chunks of a FrameBlock, etc.), so at least some of that work contributes to the project.
As it turns out, anything involving the Java gateway is very costly, so for example even a simple if var == jvm.gate.sds.ValueType.String comparison has a big overhead. I was able to shave off another constant amount of time by reducing accesses like that to a minimum, see the first graph below.
For cases where cols > rows, the current column-wise processing is very slow, so I added row-wise processing for that case to speed it up (see second graph below, tested on 1k rows x 10k cols). Note that it currently only does this in the edge case where all columns have the same data type, because in testing, serializing a row with differently typed columns was very costly. I wasn't able to spend much time on this as the deadline is approaching, so I think there is a lot of potential in finding an efficient serialization that would also work for mixed columns. I'd also expect the time in the best case to match the rows > cols case, so there is probably more optimization potential in the current row-wise processing as well.
Small note: I switched how I compare times. Before I was averaging runs, but now I take the min as suggested by the timeit docs, so times might differ slightly from the first PR.
Fixed a regression from my first PR where exceptions in the threaded function calls wouldn't propagate properly.
Fixed a small bug in the perftests to be able to read multi-file data (since that's what datagen generates for larger datasets).

This commit optimizes how the pandas_to_frame_block function accesses Java types. It also fixes a small regression, where exceptions from the parallelization threads weren't propagating exceptions properly.
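
As a rough illustration of the regression mentioned here, the sketch below shows one common way to make worker exceptions surface in the caller using the standard library's `concurrent.futures`. This is not the code from the commit; `convert_column` is a hypothetical worker that fails for one column on purpose.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_column(index):
    """Hypothetical per-column worker; fails on column 2 to show propagation."""
    if index == 2:
        raise ValueError(f"cannot convert column {index}")
    return index * 10


def convert_all(n_cols):
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(convert_column, i) for i in range(n_cols)]
        # Future.result() re-raises any exception raised inside the worker thread.
        return [f.result() for f in futures]


try:
    convert_all(4)
except ValueError as err:
    print(f"propagated: {err}")
```

If the futures were submitted and never asked for their result, the exception would stay stored on the future and the caller would not see it.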

IO datagen splits large datasets into multiple files (for example 100k_1k). This commit makes load_pandas.py and load_numpy.py able to read those.
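
A minimal sketch of reading such a split dataset, under the assumption that it is a directory of headerless CSV part files that should be concatenated in sorted name order. The actual datagen layout and the logic in load_pandas.py and load_numpy.py may differ; `read_rows` and the part file names are hypothetical, and only the standard library is used.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(path):
    """Read rows from one CSV file, or from every file in a directory (sorted by name)."""
    path = Path(path)
    if path.is_dir():
        parts = sorted(p for p in path.iterdir() if p.is_file())
    else:
        parts = [path]
    rows = []
    for part in parts:
        with part.open(newline="") as fh:
            rows.extend(csv.reader(fh))
    return rows


# Build a tiny two-part dataset in a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "part-0").write_text("1,2\n3,4\n")
    (data_dir / "part-1").write_text("5,6\n")
    rows = read_rows(data_dir)

print(rows)
```

The same function also accepts a single file path, so callers would not need to know whether a dataset was split.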

This commit adds basic row-wise processing in the case of cols > rows. It also adds some other small, unused utility methods.
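
The selection rule described in the PR text (row-wise only when cols > rows and all columns share one data type) can be sketched as follows. `choose_strategy` and the type labels are hypothetical and only mirror the description; they are not taken from the actual converter code.

```python
def choose_strategy(n_rows, n_cols, col_types):
    """Pick row-wise processing only for wide frames with a single column type."""
    uniform = len(set(col_types)) == 1
    if n_cols > n_rows and uniform:
        return "row-wise"
    return "column-wise"


wide_uniform = choose_strategy(1_000, 10_000, ["FP64"] * 10_000)
wide_mixed = choose_strategy(1_000, 10_000, ["FP64", "STRING"] * 5_000)
tall_uniform = choose_strategy(10_000, 1_000, ["FP64"] * 1_000)
print(wide_uniform, wide_mixed, tall_uniform)
```

Falling back to column-wise processing for mixed types reflects the observation in the description that serializing rows with differently typed columns was costly.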

Baunsgaard

LGTM, thanks for the edits.

The only thing missing is documentation on the newly defined methods. If you could add it, then I will merge this!

src/main/java/org/apache/sysds/runtime/frame/data/FrameBlock.java

This commit adjusts the Py4J converter tests to reflect the code changes in the main class.

codecov · 2025-01-28T19:09:56Z

Codecov Report

Attention: Patch coverage is 84.31373% with 8 lines in your changes missing coverage. Please review.

Project coverage is 72.27%. Comparing base (f8522a7) to head (cac793e).
Report is 8 commits behind head on main.

Files with missing lines	Patch %	Lines
...rg/apache/sysds/runtime/frame/data/FrameBlock.java	63.63%	5 Missing and 3 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2189      +/-   ##
============================================
+ Coverage     71.88%   72.27%   +0.38%     
- Complexity    44701    44981     +280     
============================================
  Files          1449     1452       +3     
  Lines        169182   169330     +148     
  Branches      32980    33032      +52     
============================================
+ Hits         121617   122383     +766     
+ Misses        38237    37637     -600     
+ Partials       9328     9310      -18


Baunsgaard · 2025-01-28T22:50:53Z

GitHub Actions is bringing up a valid point: tests are missing.

We need to add a few test files (with tests).

In Python:

Add files to src/main/python/tests/iotests/???.py for end-to-end tests.

In Java, please extend:

src/test/java/org/apache/sysds/test/component/frame/array/Py4jConverterUtilsTest.java
And add a src/test/java/org/apache/sysds/test/component/matrix/???.java if you find it appropriate.

This commit adds missing tests for added code in SYSTEMDS-3548. This includes the FrameBlock and Py4jConverterUtils functions, as well as python pandas to systemds io e2e tests.

Nakroma · 2025-01-29T16:36:11Z

Added those. I've added the FrameBlock tests to /frame/FrameGetSetTest.java and the Python tests to the already existing pandas setup, since that made the most sense to me. Let me know if that's okay; if not, I can split them up into their own files.

Nakroma added 4 commits January 19, 2025 15:36

[SYSTEMDS-3548] Optimize python dataframe transfer column processing

7298c84

[SYSTEMDS-3548] Fix perftests not working with large, split-up datasets

0d90e4f

[SYSTEMDS-3548] Add pandas to FrameBlock row-wise parallel processing

3eee164

Merge branch 'apache:main' into main

013c248

Nakroma marked this pull request as ready for review January 24, 2025 16:46

Baunsgaard reviewed Jan 24, 2025

src/main/java/org/apache/sysds/runtime/frame/data/FrameBlock.java (outdated)

Nakroma added 4 commits January 27, 2025 16:49

[SYSTEMDS-3548] Black format converters.py

f708016

[SYSTEMDS-3548] Add javadocs

b95bc9f

[SYSTEMDS-3548] Adjust Py4jConverterUtilsTest

e5f7fd0

Merge remote-tracking branch 'origin/main'

9481847

[SYSTEMDS-3548] Extend test cases

669cd14

Nakroma added 2 commits January 29, 2025 17:38

[SYSTEMDS-3548] Fix pandas io test (rows have to be >4)

cac793e

[SYSTEMDS-3548] Black format test_io_pandas_systemds.py

8dfd4f4
