Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYSTEMDS-3548] Optimize IO path Python interface for SystemDS #2189

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

Nakroma
Copy link
Contributor

@Nakroma Nakroma commented Jan 24, 2025

Student project SYSTEMDS-3548 and follow-up to #2154

Contributions/discussion:

  • I did some follow-up to both suggestions from @Baunsgaard in the first PR and did testing with both chunking into smaller parts and fusing operations into fewer java calls. I was unable to get any real improvements that were replicable over the larger datasets, altough I don't have a ton of experience with Py4J, so this might still have some potential. I added some of the adjacent code for it though (fusing convert, setting only chunks of a FrameBlock etc.), so at least some of the work I did there contributes to the project.
  • As it turns out, anything involving the java gateway is super costly, so for example even simply doing a if var == jvm.gate.sds.ValueType.String comparison has a big overhead. I was able to optimize another constant time by reducing stuff like that to a minimum, see the first graph below.
  • For cases where cols > rows the current column-wise processing is very slow, so I added row-wise processing for that case to speed it up (see second graph below, tested on 1k rows x 10k cols). Note here that it currently only does that for edge cases where all columns have the same data type. This is because when testing, serializing over a row with different columns was very costly. I wasn't able to spend a lot of time on this, as the deadline is approaching, so I think there is a lot of potential here to find an efficient way to serialize to be able to also use it for mixed columns. I'd also expect in the most optimal case for the time to be the same as as the rows > cols case, so I think there is probably also more optimization potential in the current row-wise processing.
  • Small note that I switched how I compared times, before I was averaging runs but now I take the min as suggested by the timeit docs, so times might be slightly different from the first PR.
  • Fixed a regression from my first PR where exceptions in the threaded function calls wouldn't propagate properly.
  • Fixed a small bug in the perftests to be able to read multi-file data (since that's what datagen generates for larger datasets)

load_pandas_10k_1k_dense
load_pandas_1k_10k_dense

This commit optimizes how the pandas_to_frame_block function accesses Java types. It also fixes a small regression, where exceptions from the parallelization threads weren't propagating exceptions properly.
IO datagen splits large datasets into multiple files (for example 100k_1k). This commit makes load_pandas.py and load_numpy.py able to read those.
This commit adds basic row-wise processing in the case of cols > rows. It also adds some other small, unused utility methods.
@Nakroma Nakroma marked this pull request as ready for review January 24, 2025 16:46
Copy link
Contributor

@Baunsgaard Baunsgaard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the edits,

The only elements missing are documentation on the methods defined. If you could add these, then i will merge it!

Copy link

codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 84.31373% with 8 lines in your changes missing coverage. Please review.

Project coverage is 72.27%. Comparing base (f8522a7) to head (cac793e).
Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
...rg/apache/sysds/runtime/frame/data/FrameBlock.java 63.63% 5 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2189      +/-   ##
============================================
+ Coverage     71.88%   72.27%   +0.38%     
- Complexity    44701    44981     +280     
============================================
  Files          1449     1452       +3     
  Lines        169182   169330     +148     
  Branches      32980    33032      +52     
============================================
+ Hits         121617   122383     +766     
+ Misses        38237    37637     -600     
+ Partials       9328     9310      -18     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Baunsgaard
Copy link
Contributor

Github actions are bringing up a valid point, testing is missing.

We need to add a few test files (with tests).

  1. in Python
  • Add files to src/main/python/tests/iotests/???.py for end-to-end tests.
  1. in Java, Please extend:
  • src/test/java/org/apache/sysds/test/component/frame/array/Py4jConverterUtilsTest.java
  • And add a src/test/java/org/apache/sysds/test/component/matrix/???.java if you find it appropriate.

This commit adds missing tests for added code in SYSTEMDS-3548. This includes the FrameBlock and Py4jConverterUtils functions, as well as python pandas to systemds io e2e tests.
@Nakroma
Copy link
Contributor Author

Nakroma commented Jan 29, 2025

Added those. I've added the FrameBlock tests to /frame/FrameGetSetTest.java and the python tests to the already existing pandas setup since that made the most sense to me, let me know if thats okey if not I can split that up into their own files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

2 participants