[SYSTEMDS-3548] Optimize IO path Python interface for SystemDS #2189
This commit optimizes how the pandas_to_frame_block function accesses Java types. It also fixes a small regression where exceptions raised in the parallelization threads weren't propagated properly.
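As a rough illustration of the exception-propagation issue, here is a minimal sketch using only the standard library. It is not the code from this PR: convert_column and convert_all are hypothetical names, and the worker only stands in for the real per-column conversion. The point is that calling result() on each future re-raises whatever the worker thread raised.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_column(name, values):
    # Hypothetical per-column worker; rejects values it cannot convert.
    if any(isinstance(v, complex) for v in values):
        raise TypeError(f"unsupported value in column {name!r}")
    return name, [str(v) for v in values]


def convert_all(columns, max_workers=4):
    """Convert columns in parallel and re-raise the first worker error."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(convert_column, n, v) for n, v in columns.items()]
        # future.result() re-raises any exception raised inside the worker,
        # so a failure in a thread is not silently dropped.
        return dict(f.result() for f in futures)
```

If the futures were submitted but never queried with result(), a failing worker would go unnoticed, which is the kind of behavior a regression like this can cause.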
IO datagen splits large datasets into multiple files (for example 100k_1k). This commit makes load_pandas.py and load_numpy.py able to read those.
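A minimal sketch of reading such a split dataset, under loud assumptions: it treats a split dataset as a directory of CSV part files whose sorted names give the row order. The actual on-disk layout produced by the IO datagen is not described here, and read_rows is a hypothetical helper, not a function from load_pandas.py or load_numpy.py.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Read rows from one CSV file, or from every part file in a directory."""
    path = Path(path)
    # Assumption: a split dataset is a directory of CSV part files whose
    # sorted names give the original row order.
    if path.is_dir():
        parts = sorted(p for p in path.iterdir() if p.is_file())
    else:
        parts = [path]
    rows = []
    for part in parts:
        with part.open(newline="") as handle:
            rows.extend(csv.reader(handle))
    return rows
```

The same entry point then works for both a single file and a directory such as 100k_1k, so callers do not need to know whether the dataset was split.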
This commit adds basic row-wise processing in the case of cols > rows. It also adds some other small, unused utility methods.
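The dispatch between the two processing modes could look roughly like the sketch below. The function names are hypothetical and the real conversion goes through Py4J rather than plain lists; the condition (more columns than rows, and a single shared data type) follows the PR description.

```python
def choose_strategy(n_rows, n_cols, dtypes):
    """Return "row" only when cols > rows and all columns share one dtype."""
    if n_cols > n_rows and len(set(dtypes)) == 1:
        return "row"
    return "column"


def to_rows(columns):
    # Transpose column-major data (a list of columns) into a list of rows.
    return [list(row) for row in zip(*columns)]
```

For example, choose_strategy(1000, 10000, ["float64"] * 10000) selects the row-wise path, while any frame with mixed column types falls back to column-wise processing.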
LGTM, thanks for the edits.
The only thing missing is documentation on the newly defined methods. If you could add that, I will merge it!
src/main/java/org/apache/sysds/runtime/frame/data/FrameBlock.java
This commit adjusts the Py4J converter tests to reflect the code changes in the main class.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2189      +/-   ##
============================================
+ Coverage     71.88%   72.27%   +0.38%
- Complexity    44701    44981     +280
============================================
  Files          1449     1452       +3
  Lines        169182   169330     +148
  Branches      32980    33032      +52
============================================
+ Hits         121617   122383     +766
+ Misses        38237    37637     -600
+ Partials       9328     9310      -18
The GitHub Actions checks raise a valid point: tests are missing. We need to add a few test files (with tests).
This commit adds the missing tests for the code added in SYSTEMDS-3548. This includes the FrameBlock and Py4jConverterUtils functions, as well as end-to-end IO tests from Python pandas to SystemDS.
Added those. I've added the FrameBlock tests to |
Student project SYSTEMDS-3548 and follow-up to #2154.
Contributions/discussion:
- The if var == jvm.gate.sds.ValueType.String comparison has a big overhead. I was able to optimize away another constant-time cost by reducing such comparisons to a minimum; see the first graph below.
- cols > rows: the current column-wise processing is very slow, so I added row-wise processing for that case to speed it up (see the second graph below, tested on 1k rows x 10k cols). Note that it currently only does this for edge cases where all columns have the same data type, because in testing, serializing a row with different column types was very costly. I wasn't able to spend much time on this as the deadline is approaching, so I think there is a lot of potential to find an efficient serialization that also works for mixed columns. In the best case I'd also expect the time to be the same as in the rows > cols case, so there is probably more optimization potential in the current row-wise processing as well.
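To illustrate why reducing those comparisons helps, here is a small self-contained sketch. CountingEnum is a made-up stand-in for a gateway-backed enum whose attribute lookups are expensive, and STRING is a hypothetical member name; this does not use Py4J and is not the PR's code. It only shows the general technique of hoisting the lookup out of the loop.

```python
class CountingEnum:
    """Stand-in for a proxied enum; counts how often a member is looked up."""

    def __init__(self):
        self.lookups = 0

    def __getattr__(self, name):
        # Called only for names not found on the instance, i.e. enum members.
        self.lookups += 1
        return f"ValueType.{name}"


def classify_naive(enum, types):
    # Looks up enum.STRING once per element.
    return [t == enum.STRING for t in types]


def classify_hoisted(enum, types):
    # Looks up enum.STRING once, then compares against the local variable.
    string_type = enum.STRING
    return [t == string_type for t in types]
```

Both functions return the same result, but the hoisted version performs a single lookup regardless of the number of elements, which matters when each lookup is a costly round trip.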