Fix zero percentage replacement in get_binned_data Function and update parameter name #1278

boemer00 · 2024-09-02T20:40:14Z

This pull request introduces changes that address the issue: The fixed value for feel_zeroes in get_binned_data may lead to deviation in some case. #334

Dynamic Fill Value Calculation: The fill value used to replace zero percentages is now calculated dynamically based on the actual data, rather than using a fixed value.
Ensuring Correct Fill Value: The fill value is guaranteed to be smaller than the minimum non-zero percentage in both the reference and current datasets. This adjustment ensures that the data distribution remains accurate.
Maintaining Data Distribution: These changes help maintain the correct distribution of data, which is crucial for accurate statistical tests, including Kullback-Leibler divergence drift score calculations.
Parameter Name Update: The parameter name has been updated from "feel_zeroes" to "fill_zeroes" for clarity and consistency.

Hope it helps
boemer00

boemer00 · 2024-09-05T15:39:23Z

Enhance `get_binned_data` Function with Flexible Fill Methods and Dynamic Scaling

This pull request introduces enhancements to my previous PR regarding the get_binned_data function. These changes address issues with handling very small percentages in data drift calculations while providing some flexibility and maintaining its original purpose.

Changes Made:

New Parameters:
- fill_method (str, default='auto'): Allows users to select the method for calculating the fill value for zero percentages.
- dynamic_scale (bool, default=False): Enables dynamic scaling of the fill value based on the data distribution.
Fill Methods: Three fill methods are now supported:
- auto: Calculates a fill value that is the minimum of 1/10 and 1/2 of the smallest non-zero percentage.
- min: Uses the minimum non-zero percentage as the fill value.
- mean: Uses the average of the minimum non-zero percentages from reference and current data.

Dynamic Scaling: When enabled, applies an additional scaling factor to the fill value:

scale_factor = min(min_non_zero_ref, min_non_zero_cur) / max(min_non_zero_ref, min_non_zero_cur)

This is beneficial in cases where there is a significant difference between the minimum non-zero percentages in the reference and current data.

Normalisation: After applying the fill value, the percentages are normalised to ensure they always sum to 1:

reference_percents = reference_percents / np.sum(reference_percents)
current_percents = current_percents / np.sum(current_percents)

Error Handling: Added a ValueError for invalid fill_method values to prevent unexpected behaviour.

Lastly, I am happy to write a test for the new parameters but would appreciate direction on where this should be implemented.

Thanks

boemer00 added 2 commits September 2, 2024 21:24

fixed min non-zero issue v1

63de8d5

name correction feel -> fill

a9bac3f

boemer00 mentioned this pull request Sep 3, 2024

The fixed value for feel_zeroes in get_binned_data may lead to deviation in some case. #334

Open

updated function with new parameters

bd0b12c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix zero percentage replacement in get_binned_data Function and update parameter name #1278

Fix zero percentage replacement in get_binned_data Function and update parameter name #1278

boemer00 commented Sep 2, 2024

boemer00 commented Sep 5, 2024

Fix zero percentage replacement in get_binned_data Function and update parameter name #1278

Are you sure you want to change the base?

Fix zero percentage replacement in get_binned_data Function and update parameter name #1278

Conversation

boemer00 commented Sep 2, 2024

boemer00 commented Sep 5, 2024

Enhance get_binned_data Function with Flexible Fill Methods and Dynamic Scaling

Changes Made:

Enhance `get_binned_data` Function with Flexible Fill Methods and Dynamic Scaling