Speed up parallel from_group_dataframe #2645

Open
tRosenflanz opened this issue Jan 19, 2025 · 0 comments

Labels
triage Issue waiting for triaging

Comments

tRosenflanz (Contributor) commented Jan 19, 2025

Is your feature request related to a current problem? Please describe.
Parallel timeseries.from_group_dataframe currently passes a sub_df to the workers per group, which can be slow when there are a lot of groups to process due to the per-task parallelization overhead.

Describe proposed solution
Instead of processing each group individually, split the initial dataframe into n_jobs chunks and process each of those chunks sequentially (i.e. with n_jobs=1) inside its worker. This way each worker gets many groups at once, and the parallelization overhead is paid once per chunk instead of once per group.

Describe potential alternatives
A mix of the two approaches could work as well, e.g. falling back to chunked processing only when the number of groups is large; a rough sketch follows.
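
One way such a mix could look, as a rough sketch rather than darts API (the helper name, its signature and the threshold value are illustrative assumptions; ts, data_df, grouper and val are as in the stub under "Additional context"):

from multiprocessing import cpu_count

import numpy as np
from joblib import Parallel, delayed


def from_group_dataframe_mixed(data_df, grouper, val, group_threshold=1_000):
    # hypothetical helper: with few groups, keep the existing per-group parallel
    # path; with many groups, split the group keys into one chunk per worker and
    # let each worker process its chunk sequentially
    group_keys = data_df[grouper].drop_duplicates()
    if len(group_keys) <= group_threshold:
        return ts.TimeSeries.from_group_dataframe(
            data_df,
            group_cols=grouper,
            value_cols=val,
            time_col="date",
            n_jobs=-1,
        )
    jobs = [
        delayed(ts.TimeSeries.from_group_dataframe)(
            data_df.merge(chunk),
            group_cols=grouper,
            value_cols=val,
            time_col="date",
            n_jobs=1,
        )
        for chunk in np.array_split(group_keys, cpu_count())
    ]
    # flatten the per-chunk lists into a single list of TimeSeries
    return sum(Parallel(n_jobs=-1)(jobs), start=[])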

Additional context
Stub of logic to compare the results:

baseline call:

ts.TimeSeries.from_group_dataframe(
    data_df,
    group_cols=grouper,
    value_cols=val,
    time_col="date",
    n_jobs=-1,
)

A (potentially) significantly faster implementation to compare:

from multiprocessing import cpu_count

import numpy as np
from joblib import Parallel, delayed

# ts, data_df, grouper and val as in the baseline call above

def process_group(chunk_df):
    # each worker processes its whole chunk of groups sequentially
    return ts.TimeSeries.from_group_dataframe(
        chunk_df,
        group_cols=grouper,
        value_cols=val,
        time_col="date",
        n_jobs=1,
    )

n_chunks = cpu_count()
# unique group keys, split into one chunk of keys per worker
group_keys = data_df[grouper].drop_duplicates()
list_df = np.array_split(group_keys, n_chunks)
jobs = []
for chunk in list_df:
    # restrict the original dataframe to the groups in this chunk
    chunk_df = data_df.merge(chunk)
    jobs.append(delayed(process_group)(chunk_df))
ret_lst = Parallel(n_jobs=-1)(jobs)
# flatten the per-chunk lists of TimeSeries into a single list
all_series = sum(ret_lst, start=[])
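
To check that the two paths agree, one could compare the group keys recovered from each result. This is a sketch under a couple of assumptions: grouper is a list of column names, and the group columns are kept as static covariates on each TimeSeries (the default behavior of from_group_dataframe).

baseline = ts.TimeSeries.from_group_dataframe(
    data_df,
    group_cols=grouper,
    value_cols=val,
    time_col="date",
    n_jobs=-1,
)

# both paths should yield one TimeSeries per group
assert len(baseline) == len(all_series)

def group_key(series):
    # identify a series by its group values, stored as static covariates
    return tuple(series.static_covariates[grouper].iloc[0])

assert sorted(map(group_key, baseline)) == sorted(map(group_key, all_series))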

On my dataset the latter code is about 4x faster for a dataframe with 30k groups.

I am not certain this is worth putting into the library, but thought it might be worth looking into.

tRosenflanz added the triage (Issue waiting for triaging) label on Jan 19, 2025