
build_best versus build_merged: what is the difference if no testset is given? #39

JanoschMenke opened this issue Jan 11, 2025 · 1 comment

@JanoschMenke
I just wanted to ask what the difference between build_best and build_merged is when there is no testset specified, such as here:

# Imports assumed from QSARtuna's optunaz package (module paths may differ by version).
from optunaz.config.optconfig import OptimizationConfig
from optunaz.datareader import Dataset
from optunaz.utils.preprocessing.splitter import Random

config = OptimizationConfig(
    data=Dataset(
        input_column="canonical",
        response_column="molwt",
        training_dataset_file="tests/data/DRD2/subset-50/train.csv",
        split_strategy=Random(),
    ),
)

Based on the results I get, I assume that build_merged is trained on the complete training_dataset supplied, while build_best is trained only on a subset, presumably generated by the split_strategy. But when I am using a 10-fold cross-validation, which of the 10 folds is it split by?

# Build (re-Train) and save the best model.
build_best(buildconfig, "target/best.pkl")

# Build (Train) and save the model on the merged train+test data.
build_merged(buildconfig, "target/merged.pkl")
@lewismervin1 (Collaborator) commented Feb 7, 2025

Hi @JanoschMenke. First, I want to clarify what the split_strategy (belonging to the Dataset class) and the cv_split_strategy (supplied via the Settings class to the OptimizationConfig) do, and how they behave as two different split implementations, because that might be the source of the misunderstanding, given your example.

1.) The cv_split_strategy is simply the strategy used by the hyper-parameter cross-validation to obtain the performance of the different parameter trials (on your train set). It defaults to Random (i.e. the same split type as in your example, though you are supplying it to split_strategy, not cv_split_strategy).

2.) The split_strategy, which is Random() in your example, will split your train set into one train set and one (outer) test set that is not used in the optimisation, to avoid leakage. Any performance will be reported on the data reserved by this "Random" test set.
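To make the two levels of splitting concrete, here is a conceptual sketch. The functions outer_split and cv_folds are hypothetical stand-ins for what split_strategy and cv_split_strategy do, not QSARtuna functions; the key point is that the CV folds are drawn only from the outer train set, and the outer test set never enters the optimisation.

```python
import random

def outer_split(indices, test_fraction=0.2, seed=0):
    """Conceptual stand-in for split_strategy=Random(): one outer
    train/test split; the test part never enters optimisation."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    return idx[n_test:], idx[:n_test]  # (train, test)

def cv_folds(train_indices, k=10, seed=0):
    """Conceptual stand-in for cv_split_strategy: k folds drawn only
    from the outer *train* set, used to score hyper-parameter trials."""
    idx = list(train_indices)
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

data = range(50)
train, test = outer_split(data)
folds = cv_folds(train, k=10)

# The CV folds partition the outer train set; the outer test set is disjoint.
assert sorted(i for fold in folds for i in fold) == sorted(train)
assert not set(test) & set(train)
```

So for your 10-fold question: the hyper-parameter search scores each trial across all 10 inner folds of the train set; it is not split by any single one of them.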

More specifically, if you perform build_best on your example above, your data will actually be split into train & test sets. You will train/build a model, with the parameters from the highest-performing hyper-parameter trial, on your train data (and not on your random held-out (test) split). Your model is finally evaluated on the test data.

In comparison, if you were to perform build_merged on your config, you will obtain a model trained on the Random train/test splits re-combined (with the caveat that no performance evaluation is possible, since no data remains held out).
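As a toy illustration of the difference (using a hypothetical mean-predictor, not QSARtuna's actual models): build_best fits on the train part only and is scored on the held-out part, while build_merged fits on everything and has nothing left to score against.

```python
# Hypothetical mean-predictor to make the two build modes concrete.
def fit_mean(values):
    return sum(values) / len(values)

all_y = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y, test_y = all_y[:4], all_y[4:]   # stand-in for the Random split

# build_best: fit on the train part only, evaluate on the held-out test part.
best_model = fit_mean(train_y)
test_mse = sum((y - best_model) ** 2 for y in test_y) / len(test_y)

# build_merged: fit on train + test re-combined; no held-out data to score.
merged_model = fit_mean(train_y + test_y)
```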

You may (optionally) provide a test_dataset_file to the Dataset class. This will be an additional holdout test dataset used to evaluate an optimized model outside of the Optuna hyper-parameter search, similar to point 2.) above. It can be used when split_strategy is set to NoSplitting. It can be left empty (the default behaviour) when a splitting strategy is set to split the data. It is even possible to supply both split_strategy and test_dataset_file (which may actually be what you have been doing); in this case the two test sets, from your file and from your split_strategy, are combined, and performance is evaluated (with each point weighted equally) in a single metric calculation across both test sets (i.e. the metrics are not calculated separately per set and then aggregated).
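Assuming "single metric calculation" means the points of both test sets are pooled and scored together, the distinction from per-set averaging can be sketched like this (toy numbers, hypothetical constant predictions):

```python
def mse(y_true, y_pred):
    """Mean squared error over a single set of points."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Two hypothetical test sets of different sizes, with constant predictions.
split_true, split_pred = [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]
file_true, file_pred = [3.0], [0.0]

# Single metric over the pooled points (each point weighted equally):
pooled = mse(split_true + file_true, split_pred + file_pred)

# Versus computing a metric per set and averaging the two afterwards:
averaged = (mse(split_true, split_pred) + mse(file_true, file_pred)) / 2

# The two disagree when the test sets differ in size, which is why the
# single pooled calculation matters.
```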

If you provide neither split_strategy nor test_dataset_file, then there is no difference between build_best and build_merged. This would recreate the scenario I think you are referring to in your post, i.e. when there is neither a test set nor a splitting strategy specified (and hence no testing at all).

I hope this clarifies things? Please do let me know if not.
