Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor adjustments to usage of alarm(2) timer in tests #5292

Merged
merged 1 commit into from
Feb 5, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion test/h5test.h
Original file line number Diff line number Diff line change
Expand Up @@ -204,7 +204,7 @@ H5TEST_DLLVAR MPI_Info h5_io_info_g; /* MPI INFO object for IO */

/* Macros for the different TestExpress levels for expediting tests */
#define H5_TEST_EXPRESS_EXHAUSTIVE 0 /** Exhaustive run; tests should take as long as necessary */
#define H5_TEST_EXPRESS_FULL 1 /** Full run; tests should take no more than 30 minutes */
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a documentation cleanup. The intention for a "full" run may have been 30 minutes at some point, but the current timeout value for both ctest and for the alarm timer is 20 minutes by default, so a "full" run will always be kneecapped at 20 minutes anyway, unless the values are overridden.

#define H5_TEST_EXPRESS_FULL 1 /** Full run; tests should take no more than 20 minutes */
#define H5_TEST_EXPRESS_QUICK 2 /** Quick run; tests should take no more than 10 minutes */
#define H5_TEST_EXPRESS_SMOKE_TEST 3 /** Smoke test; tests should take no more than 1 minute */

Expand Down
62 changes: 36 additions & 26 deletions test/testframe.c
Original file line number Diff line number Diff line change
Expand Up @@ -158,6 +158,12 @@ TestInit(const char *ProgName, void (*TestPrivateUsage)(FILE *stream),
/* Initialize value for TestExpress functionality */
h5_get_testexpress();

/* Enable alarm timer for test program once TestExpress setting
* has been determined
*/
if (TestAlarmOn() < 0)
MESSAGE(5, ("Couldn't enable test alarm timer\n"));

/* Record the program name and private routines if provided. */
TestProgName = ProgName;
if (NULL != TestPrivateUsage)
Expand Down Expand Up @@ -461,10 +467,6 @@ PerformTests(void)
MESSAGE(2, ("Testing -- %s (%s) \n", TestArray[Loop].Description, TestArray[Loop].Name));
MESSAGE(5, ("===============================================\n"));

if (TestAlarmOn() < 0)
MESSAGE(5, ("Couldn't enable test alarm timer for test -- %s (%s) \n",
Copy link
Collaborator Author

@jhendersonHDF jhendersonHDF Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alarm timer set for tests has always (?) been per-subtest rather than per-test program, presumably to catch a particular hanging sub-test while allowing the rest of the tests to run. While this may be helpful for Autotools, where there's no real built-in mechanism for restricting the runtime of tests in general, it effectively goes against CMake's built-in timeout value in ctest. If the alarm timer is set to its default 20 minutes, the test program is going to be killed by ctest before a hanging test even hits this timer's timeout value in the first place. Adding to that, the HPC systems that get tested on often have a ~30 minute limit on jobs in the job queue where testing is usually run, meaning that an Autotools test run would get killed shortly after a single hanging test anyway.

This changes the logic so that the alarm timer is per-test program and is turned on almost immediately on test program start (in TestInit()) so that there's at least a reasonable ability of reacting to testing timeouts in the future. This won't allow for continuing to run tests beyond one that hangs, but we really need functionality beyond what alarm(2) offers to do that nicely in CMake, unless we simply disable testing timeouts there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the alarm timer is set to its default 20 minutes, the test program is going to be killed by ctest before a hanging test even hits this timer's timeout value in the first place.

Looking at the cmake configuration, it seems like the cmake timeouts are derived from the alarm timer value. Is this in reference to CMake starting a count to the same timeout value slightly earlier than the alarm (in the current per-test implementation) and always pre-empting it?

This won't allow for continuing to run tests beyond one that hangs, but we really need functionality beyond what alarm(2) offers to do that nicely in CMake, unless we simply disable testing timeouts there.

The ctest timeout documentation says that moving on to the next test after a timeout is already how it behaves. Are you talking about problems with continuing beyond hangs in entire test programs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the cmake configuration, it seems like the cmake timeouts are derived from the alarm timer value.

There may be some historical precedent for this, but the ctest timer and alarm timer are separate from each other. It's generally likely that, since the ctest timer will be active before the program runs, the ctest timer will expire before the alarm timer since they currently have the same timeout value in seconds.

Are you talking about problems with continuing beyond hangs in entire test programs?

Yes this is referring to continuing to run other subtests in a test program.

TestArray[Loop].Description, TestArray[Loop].Name));

if (TestArray[Loop].TestSetupFunc)
TestArray[Loop].TestSetupFunc(TestArray[Loop].TestParameters);

Expand All @@ -473,8 +475,6 @@ PerformTests(void)
if (TestArray[Loop].TestCleanupFunc)
TestArray[Loop].TestCleanupFunc(TestArray[Loop].TestParameters);

TestAlarmOff();

TestArray[Loop].TestNumErrors = TestNumErrs_g - old_num_errs;

MESSAGE(5, ("===============================================\n"));
Expand Down Expand Up @@ -573,6 +573,8 @@ TestShutdown(void)

free(TestArray);

TestAlarmOff();

return SUCCEED;
}

Expand Down Expand Up @@ -829,30 +831,38 @@ SetTestMaxNumThreads(int max_num_threads)
herr_t
TestAlarmOn(void)
{
/* A TestExpress setting of H5_TEST_EXPRESS_EXHAUSTIVE should allow
* tests to run for as long as necessary, so avoid enabling an
* alarm-style timer here that would, by default, kill the test.
*/
if (GetTestExpress() == H5_TEST_EXPRESS_EXHAUSTIVE)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's generally handy to be able to perform manual runs of long-running tests like multi-threaded ones, while still allowing them to be restricted by the alarm(2) timer for CI. This just avoids setting any timer at all if TestExpress is set to 0. While this allows a test to run forever if it hangs, there's no particular way of predicting how long of a timer to set to catch hangs in these cases.

return SUCCEED;
#ifdef H5_HAVE_ALARM
char *env_val = getenv("HDF5_ALARM_SECONDS"); /* Alarm environment */
unsigned long alarm_sec = H5_ALARM_SEC; /* Number of seconds before alarm goes off */
else {
char *env_val = getenv("HDF5_ALARM_SECONDS"); /* Alarm environment */
unsigned long alarm_sec = H5_ALARM_SEC; /* Number of seconds before alarm goes off */

/* Get the alarm value from the environment variable, if set */
if (env_val != NULL) {
errno = 0;
alarm_sec = strtoul(env_val, NULL, 10);
if (errno != 0) {
if (TestFrameworkProcessID_g == 0)
fprintf(stderr, "%s: error while parsing value (%s) specified for alarm timeout\n", __func__,
env_val);
return FAIL;
}
else if (alarm_sec > (unsigned long)UINT_MAX) {
if (TestFrameworkProcessID_g == 0)
fprintf(stderr, "%s: value (%lu) specified for alarm timeout too large\n", __func__,
alarm_sec);
return FAIL;
/* Get the alarm value from the environment variable, if set */
if (env_val != NULL) {
errno = 0;
alarm_sec = strtoul(env_val, NULL, 10);
if (errno != 0) {
if (TestFrameworkProcessID_g == 0)
fprintf(stderr, "%s: error while parsing value (%s) specified for alarm timeout\n",
__func__, env_val);
return FAIL;
}
else if (alarm_sec > (unsigned long)UINT_MAX) {
if (TestFrameworkProcessID_g == 0)
fprintf(stderr, "%s: value (%lu) specified for alarm timeout too large\n", __func__,
alarm_sec);
return FAIL;
}
}
}

/* Set the number of seconds before alarm goes off */
alarm((unsigned)alarm_sec);
/* Set the number of seconds before alarm goes off */
alarm((unsigned)alarm_sec);
}
#endif

return SUCCEED;
Expand Down
Loading