Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix zero percentage replacement in get_binned_data Function and update parameter name #1278

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

boemer00
Copy link

@boemer00 boemer00 commented Sep 2, 2024

This pull request introduces changes that address the issue: The fixed value for feel_zeroes in get_binned_data may lead to deviation in some case. #334

  • Dynamic Fill Value Calculation: The fill value used to replace zero percentages is now calculated dynamically based on the actual data, rather than using a fixed value.

  • Ensuring Correct Fill Value: The fill value is guaranteed to be smaller than the minimum non-zero percentage in both the reference and current datasets. This adjustment ensures that the data distribution remains accurate.

  • Maintaining Data Distribution: These changes help maintain the correct distribution of data, which is crucial for accurate statistical tests, including Kullback-Leibler divergence drift score calculations.

  • Parameter Name Update: The parameter name has been updated from "feel_zeroes" to "fill_zeroes" for clarity and consistency.

Hope it helps
boemer00

@boemer00
Copy link
Author

boemer00 commented Sep 5, 2024

Enhance get_binned_data Function with Flexible Fill Methods and Dynamic Scaling

This pull request introduces enhancements to my previous PR regarding the get_binned_data function. These changes address issues with handling very small percentages in data drift calculations while providing some flexibility and maintaining its original purpose.

Changes Made:

  1. New Parameters:

    • fill_method (str, default='auto'): Allows users to select the method for calculating the fill value for zero percentages.
    • dynamic_scale (bool, default=False): Enables dynamic scaling of the fill value based on the data distribution.
  2. Fill Methods: Three fill methods are now supported:

    • auto: Calculates a fill value that is the minimum of 1/10 and 1/2 of the smallest non-zero percentage.
    • min: Uses the minimum non-zero percentage as the fill value.
    • mean: Uses the average of the minimum non-zero percentages from reference and current data.
  3. Dynamic Scaling: When enabled, applies an additional scaling factor to the fill value:

    scale_factor = min(min_non_zero_ref, min_non_zero_cur) / max(min_non_zero_ref, min_non_zero_cur)

This is beneficial in cases where there is a significant difference between the minimum non-zero percentages in the reference and current data.

  1. Normalisation: After applying the fill value, the percentages are normalised to ensure they always sum to 1:

reference_percents = reference_percents / np.sum(reference_percents)
current_percents = current_percents / np.sum(current_percents)

  1. Error Handling: Added a ValueError for invalid fill_method values to prevent unexpected behaviour.

Lastly, I am happy to write a test for the new parameters but would appreciate direction on where this should be implemented.

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant