[SPARK-50881] Use cached schema instead of deep copying in dataframe.py #49734

garlandz-db · 2025-01-30T13:56:27Z

What changes were proposed in this pull request?

schema property returns a deepcopy everytime to ensure completeness. However this creates a performance degradation for internal use in dataframe.py. we make the following changes:

columns returns a copy of the array of names. This is the same as classic
all uses of schema in dataframe.py now calls the cached schema, avoiding a deepcopy

Why are the changes needed?

this does not scale well when these methods are called thousands of times like columns method in pivot

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing tests

benchmarking does show improvement in performance approximately 1/3 times faster.

import cProfile, pstats
import copy
cProfile.run("""
x = pd.DataFrame(zip(np.random.rand(1000000), np.random.randint(1, 3000, 10000000), list(range(1000)) * 100000), columns=['x', 'y', 'z'])
df = spark.createDataFrame(x)
schema = df.schema
for i in range(1_000_000):
  [name for name in schema.names]
""")
p = pstats.Stats("profile_results")
p.sort_stats("cumtime").print_stats(.1)

         17000003 function calls in 8.886 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.931    0.931    8.886    8.886 <string>:1(<module>)
  1000000    0.391    0.000    0.391    0.000 <string>:3(<listcomp>)
  1000000    0.933    0.000    5.516    0.000 DatasetInfo.py:22(gather_imported_dataframes)
  1000000    0.948    0.000    6.669    0.000 DatasetInfo.py:75(_maybe_handle_dataframe_assignment)
  1000000    0.895    0.000    7.564    0.000 DatasetInfo.py:90(__setitem__)
  3000000    2.853    0.000    4.583    0.000 utils.py:54(retrieve_imported_type)
        1    0.000    0.000    8.886    8.886 {built-in method builtins.exec}
  3000000    0.667    0.000    0.667    0.000 {built-in method builtins.getattr}
  1000000    0.204    0.000    0.204    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  3000000    0.473    0.000    0.473    0.000 {method 'get' of 'dict' objects}
  3000000    0.590    0.000    0.590    0.000 {method 'rsplit' of 'str' objects}


Thu Jan 16 20:13:47 2025    profile_results

         3 function calls in 0.000 seconds

vs

         55000003 function calls (50000003 primitive calls) in 23.181 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.987    0.987   23.181   23.181 <string>:1(<module>)
  1000000    1.060    0.000    5.750    0.000 DatasetInfo.py:22(gather_imported_dataframes)
  1000000    0.956    0.000    6.907    0.000 DatasetInfo.py:75(_maybe_handle_dataframe_assignment)
  1000000    0.930    0.000    7.837    0.000 DatasetInfo.py:90(__setitem__)
6000000/1000000    7.420    0.000   14.357    0.000 copy.py:128(deepcopy)
  5000000    0.494    0.000    0.494    0.000 copy.py:182(_deepcopy_atomic)
  1000000    2.734    0.000   11.015    0.000 copy.py:201(_deepcopy_list)
  1000000    0.951    0.000    1.160    0.000 copy.py:243(_keep_alive)
  3000000    2.946    0.000    4.690    0.000 utils.py:54(retrieve_imported_type)
        1    0.000    0.000   23.181   23.181 {built-in method builtins.exec}
  3000000    0.686    0.000    0.686    0.000 {built-in method builtins.getattr}
  9000000    0.976    0.000    0.976    0.000 {built-in method builtins.id}
  1000000    0.201    0.000    0.201    0.000 {built-in method builtins.isinstance}
  5000000    0.560    0.000    0.560    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 15000000    1.673    0.000    1.673    0.000 {method 'get' of 'dict' objects}
  3000000    0.607    0.000    0.607    0.000 {method 'rsplit' of 'str' objects}


Thu Jan 16 20:13:47 2025    profile_results

         3 function calls in 0.000 seconds

Was this patch authored or co-authored using generative AI tooling?

Use cached schema instead of deep copying to retrieve names

878f29c

github-actions bot added SQL PYTHON CONNECT labels Jan 30, 2025

garlandz-db changed the title ~~[SPARK-50881] Use cached schema instead of deep copying to retrieve names of columns~~ [SPARK-50881] Use cached schema instead of deep copying in dataframe.py Jan 31, 2025

garlandz-db closed this Jan 31, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-50881] Use cached schema instead of deep copying in dataframe.py #49734

[SPARK-50881] Use cached schema instead of deep copying in dataframe.py #49734

garlandz-db commented Jan 30, 2025 •

edited

Loading

[SPARK-50881] Use cached schema instead of deep copying in dataframe.py #49734

[SPARK-50881] Use cached schema instead of deep copying in dataframe.py #49734

Conversation

garlandz-db commented Jan 30, 2025 • edited Loading

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

garlandz-db commented Jan 30, 2025 •

edited

Loading