
build_best versus build_merged: what is the difference if no testset is given? #39

JanoschMenke opened this issue Jan 11, 2025 · 1 comment

@JanoschMenke
I just wanted to ask what the difference between build_best and build_merged is when there is no testset specified, such as here:

# Imports assumed from QSARtuna's optunaz package (module paths may differ by version).
from optunaz.config.optconfig import OptimizationConfig
from optunaz.datareader import Dataset
from optunaz.utils.preprocessing.splitter import Random

config = OptimizationConfig(
    data=Dataset(
        input_column="canonical",
        response_column="molwt",
        training_dataset_file="tests/data/DRD2/subset-50/train.csv",
        split_strategy=Random(),
    ),
)

Based on the results I get, I assume that build_merged is trained on the complete training_dataset supplied, while build_best is trained only on a subset, presumably generated by the split_strategy. But when I am using a 10-fold cross-validation, which of the 10 folds is it split by?

# Build (re-Train) and save the best model.
build_best(buildconfig, "target/best.pkl")

# Build (Train) and save the model on the merged train+test data.
build_merged(buildconfig, "target/merged.pkl")
@lewismervin1 (Collaborator) commented Feb 7, 2025

Hi @JanoschMenke. First, I want to clarify what the split_strategy (belonging to the Dataset class) and the cv_split_strategy (supplied via the Settings class to the OptimizationConfig) do, and how they behave as two different split implementations, because that might be the source of the misunderstanding, given your example.

1.) The cv_split_strategy is simply the strategy used by the hyper-parameter cross-validation to obtain the performance of the different parameter trials (on your train set). It defaults to Random (i.e. the same split type as in your example, though you are supplying it to split_strategy, not cv_split_strategy).

2.) The split_strategy, which is Random() in your example, will split your train set into one train set and one (outer) test set that is not used in the optimisation, to avoid leakage. Any performance will be reported on the data reserved by this "Random" test set.
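To make the two levels of splitting concrete, here is a conceptual sketch. The functions outer_split and cv_folds are hypothetical stand-ins for what split_strategy and cv_split_strategy do, not QSARtuna functions; the key point is that the CV folds are drawn only from the outer train set, and the outer test set never enters the optimisation.

```python
import random

def outer_split(indices, test_fraction=0.2, seed=0):
    """Conceptual stand-in for split_strategy=Random(): one outer
    train/test split; the test part never enters optimisation."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    return idx[n_test:], idx[:n_test]  # (train, test)

def cv_folds(train_indices, k=10, seed=0):
    """Conceptual stand-in for cv_split_strategy: k folds drawn only
    from the outer *train* set, used to score hyper-parameter trials."""
    idx = list(train_indices)
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

data = range(50)
train, test = outer_split(data)
folds = cv_folds(train, k=10)

# The CV folds partition the outer train set; the outer test set is disjoint.
assert sorted(i for fold in folds for i in fold) == sorted(train)
assert not set(test) & set(train)
```

So for your 10-fold question: the hyper-parameter search scores each trial across all 10 inner folds of the train set; it is not split by any single one of them.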

More specifically, if you perform build_best on your example above, your data will actually be split into train & test sets. You will train/build a model, with the parameters from the highest-performing hyper-parameter trial, on your train data (and not on your random held-out (test) split). Your model is finally evaluated on the test data.

In comparison, if you were to perform build_merged on your config, you will obtain a model trained on the Random train/test splits re-combined (with the caveat that no performance evaluation is possible, since no data remains held out).
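As a toy illustration of the difference (using a hypothetical mean-predictor, not QSARtuna's actual models): build_best fits on the train part only and is scored on the held-out part, while build_merged fits on everything and has nothing left to score against.

```python
# Hypothetical mean-predictor to make the two build modes concrete.
def fit_mean(values):
    return sum(values) / len(values)

all_y = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y, test_y = all_y[:4], all_y[4:]   # stand-in for the Random split

# build_best: fit on the train part only, evaluate on the held-out test part.
best_model = fit_mean(train_y)
test_mse = sum((y - best_model) ** 2 for y in test_y) / len(test_y)

# build_merged: fit on train + test re-combined; no held-out data to score.
merged_model = fit_mean(train_y + test_y)
```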

You may (optionally) provide a test_dataset_file to the Dataset class. This will be an additional holdout test dataset used to evaluate an optimized model outside of the Optuna hyper-parameter search, similar to point 2.) above. It can be used when split_strategy is set to NoSplitting. It can be left empty (the default behaviour) when a splitting strategy is set to split the data. It is even possible to supply both split_strategy and test_dataset_file (which may actually be what you have been doing); in this case the two test sets, from your file and from your split_strategy, are combined, and performance is evaluated (with each point weighted equally) in a single metric calculation across both test sets (i.e. the metrics are not calculated separately per set and then aggregated).
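Assuming "single metric calculation" means the points of both test sets are pooled and scored together, the distinction from per-set averaging can be sketched like this (toy numbers, hypothetical constant predictions):

```python
def mse(y_true, y_pred):
    """Mean squared error over a single set of points."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Two hypothetical test sets of different sizes, with constant predictions.
split_true, split_pred = [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]
file_true, file_pred = [3.0], [0.0]

# Single metric over the pooled points (each point weighted equally):
pooled = mse(split_true + file_true, split_pred + file_pred)

# Versus computing a metric per set and averaging the two afterwards:
averaged = (mse(split_true, split_pred) + mse(file_true, file_pred)) / 2

# The two disagree when the test sets differ in size, which is why the
# single pooled calculation matters.
```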

If you provide neither split_strategy nor test_dataset_file, then there is no difference between build_best and build_merged. This would recreate the scenario I think you are referring to in your post, i.e. when there is neither a test set nor a splitting strategy specified (and hence no testing at all).

I hope this clarifies things? Please do let me know if not.
