Is your feature request related to a current problem? Please describe.
The parallel path of timeseries.from_group_dataframe currently passes a sub_df to the workers for each individual group, which can be slow when there are a lot of groups to process because of the per-task parallelization overhead.
Describe proposed solution
Instead of processing each group individually, split the initial dataframe into n_jobs chunks and process each of those chunks sequentially (i.e. with n_jobs=1). This way each worker gets many groups at once and can process a large number of groups in a single task.
Describe potential alternatives
A mix of the two approaches could work as well.
Additional context
Stub of logic to compare the results:
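baseline call (a minimal sketch of the direct call being compared against, assuming the same df, grouper and val variables as in the faster snippet below; baseline_series is just an illustrative name):

import darts as ts

# Sketch of the assumed baseline: let from_group_dataframe parallelize per group itself.
baseline_series = ts.TimeSeries.from_group_dataframe(
    df,
    group_cols=grouper,
    value_cols=val,
    time_col="date",
    n_jobs=-1,
)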
A (potentially) significantly faster implementation to compare:
import numpy as np
import darts as ts
from joblib import Parallel, delayed
from multiprocessing import cpu_count

# df, grouper, val, covariates and key come from the surrounding script.

def process_group(data_df):
    # Build the series for every group in this chunk sequentially (n_jobs=1),
    # so parallelism only happens at the chunk level.
    return ts.TimeSeries.from_group_dataframe(
        data_df,
        group_cols=grouper,
        value_cols=val,
        time_col="date",
        n_jobs=1,
    )

n_chunks = cpu_count()
sub_df = df[grouper].drop_duplicates()
# make a list of dataframes that correspond to each chunk of group keys
list_df = np.array_split(sub_df, n_chunks)

jobs = []
for chunk in list_df:
    # create a sub-chunk of the original dataframe via an inner join on the group keys
    chunk_df = df.merge(chunk)
    jobs.append(delayed(process_group)(chunk_df))

# each worker returns a list of TimeSeries; flatten them into a single list
retLst = Parallel(n_jobs=-1)(jobs)
covariates[key] = sum(retLst, start=[])
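To compare the two results, a quick sanity check could be as simple as confirming both approaches return one TimeSeries per group (baseline_series is the illustrative name from the baseline sketch above):

# Hypothetical sanity check: both approaches should yield one TimeSeries per group.
chunked_series = covariates[key]
assert len(chunked_series) == len(baseline_series)
assert len(chunked_series) == len(df[grouper].drop_duplicates())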
On my dataset the latter code is about 4x faster for a dataframe with 30k groups.
I am not certain this is worth putting into the library, but thought it might be worth looking into.