action #30183
closed [sle][functional][qa_automation] test incompletes in execute_test_run because of sum of timeouts > MAX_JOB_TIME
100%
Description
Observation
openQA test in scenario sle-15-Installer-DVD-s390x-fs_stress@s390x-kvm-sle12 fails in execute_test_run as incomplete for an unknown reason. The bigger problem, though, is that the test API commands are called with a timeout equal to MAX_JOB_TIME (falling back to 7200). This does not make sense: if such a command times out, the job is guaranteed to end as incomplete, because MAX_JOB_TIME is hit before the internal timeout is.
Suggestion
Please make sure the sum of internal timeouts does not exceed MAX_JOB_TIME. Never assign the value of MAX_JOB_TIME to a test API timeout.
Further details
This is considered "urgent" because incomplete jobs cause a lot of extra effort in test review.
Always latest result in this scenario: latest
Updated by yosun about 7 years ago
- Status changed from New to Feedback
The failure reason is that all test runs on s390x-kvm-sle12 fail, the same as https://openqa.suse.de/tests/1377895#settings
As this test also runs on zkvm on s390, I think I can remove the test run on s390x-kvm-sle12. Judging from the name "-sle12", maybe it's not suitable for sle15 tests?
For the MAX_JOB_TIME issue: if some of our tests need more than 2 hours, which parameter should we use? I remember that using MAX_JOB_TIME to set more than 2 hours was OK.
Updated by mgriessmeier about 7 years ago
- Status changed from Feedback to In Progress
yosun wrote:
> The failure reason is that all test runs on s390x-kvm-sle12 fail, the same as https://openqa.suse.de/tests/1377895#settings
This one is failing for a completely different reason.
> As this test also runs on zkvm on s390, I think I can remove the test run on s390x-kvm-sle12. Judging from the name "-sle12", maybe it's not suitable for sle15 tests?
The "-sle12" refers to the OS on the hypervisor. Of course it can run sle15 VMs; that's what we want to test here.
> For the MAX_JOB_TIME issue: if some of our tests need more than 2 hours, which parameter should we use? I remember that using MAX_JOB_TIME to set more than 2 hours was OK.
But apparently this was not set in openQA.
Updated by okurz about 7 years ago
Just setting MAX_JOB_TIME to a different value will not prevent the problem. The internal timeout is set to MAX_JOB_TIME, so once the time taken by any other testapi calls is added, the total always exceeds MAX_JOB_TIME. To give you a simple example:
sleep 1;
assert_screen 'foo', get_var('MAX_JOB_TIME');
If the needle 'foo' cannot be found, the job will be incomplete rather than failed: after the one-second sleep, assert_screen would still be waiting when MAX_JOB_TIME is reached. So never use MAX_JOB_TIME as the value for any internal timeout.
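To make the arithmetic of this example concrete, here is a minimal Perl sketch that adds up worst-case step durations and compares the total with MAX_JOB_TIME. The helper `fits_in_job_time` is hypothetical (it is not part of openQA's testapi), and each step is assumed to take at most its timeout; real jobs also spend time outside test API calls, so they need headroom on top of this sum.

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# Hypothetical helper (not part of openQA's testapi): returns true if the
# worst-case durations of the planned steps, in seconds, fit into MAX_JOB_TIME.
sub fits_in_job_time {
    my ($max_job_time, @worst_case) = @_;
    return sum0(@worst_case) <= $max_job_time;
}

my $max_job_time = 7200;    # the fallback value mentioned in this ticket

# The example above: `sleep 1;` followed by an assert_screen that may wait
# the full MAX_JOB_TIME before giving up.
my @steps = (1, $max_job_time);
printf "worst case %d s vs. MAX_JOB_TIME %d s: %s\n",
    sum0(@steps), $max_job_time,
    fits_in_job_time($max_job_time, @steps) ? "fits" : "job incompletes";
```

Running it prints `worst case 7201 s vs. MAX_JOB_TIME 7200 s: job incompletes`, which is exactly the one second that turns a failed job into an incomplete one.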
Updated by yosun almost 7 years ago
- Status changed from In Progress to Feedback
Checked with the latest build: all stress tests passed, with test times as follows:
s390x: all finish in around 1 hour
https://openqa.suse.de/tests/1385355
https://openqa.suse.de/tests/1385304
https://openqa.suse.de/tests/1385362
https://openqa.suse.de/tests/1385305
https://openqa.suse.de/tests/1385356
https://openqa.suse.de/tests/1385310
I personally think this ticket does not match the main cause of the failure in https://openqa.suse.de/tests/1377890/modules/execute_test_run/steps/1
I have changed MAX_JOB_TIME to 7200 for all stress tests.
From https://openqa.suse.de/tests/1377890/file/video.ogv we can see this is a real bug: while running the third sub-testcase the system hangs, so the log tarball was not uploaded. But it was not reproduced in the following builds. I suggest closing this ticket, because all stress tests can finish within 2 hours without bugs.
Updated by yosun almost 7 years ago
I see what you mean in #3.
We could introduce a separate MAX_TEST_TIME for the real test timeout and use it instead of MAX_JOB_TIME in our tests. But that adds one more parameter to each test case, and it's really hard to say exactly how many seconds "MAX_JOB_TIME - MAX_TEST_TIME" should be. Most of the time is spent in the real test: the tests that would need MAX_TEST_TIME always run for hours (the default timeout is fine for the other tests), so the "prepare stage" that accounts for "MAX_JOB_TIME - MAX_TEST_TIME" can be ignored :)
I suggest we keep using MAX_JOB_TIME, as it is easy to configure and understand.
Updated by okurz almost 7 years ago
I suggest keeping it simple: deduct 20 minutes in the test code and bump MAX_JOB_TIME to 9600. So:
- MAX_JOB_TIME=9600 in test suites
- change test code to set
my $timeout = get_var('MAX_JOB_TIME', 9600) - 1200;
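The suggestion above could look roughly like this in a test module. This is a sketch under assumptions: `get_var` here is a local stand-in that mimics testapi's `get_var($name, $default)` so the snippet runs on its own, the job settings in `%vars` are made up, and `test_timeout` is a hypothetical helper that additionally refuses a MAX_JOB_TIME that leaves no room for the 1200 s (20 minute) margin.

```perl
use strict;
use warnings;

# Stand-in for openQA's testapi get_var(): returns the job variable if it is
# set, otherwise the supplied default. %vars holds hypothetical job settings.
my %vars = (MAX_JOB_TIME => 9600);
sub get_var {
    my ($name, $default) = @_;
    return $vars{$name} // $default;
}

# Margin for everything outside the long-running test (preparation, log
# upload, ...), taken from the 20 minutes suggested in this ticket.
my $margin = 1200;

# Hypothetical helper: derive the internal timeout from MAX_JOB_TIME and
# refuse settings that leave no room for the margin.
sub test_timeout {
    my $max_job_time = get_var('MAX_JOB_TIME', 9600);
    die "MAX_JOB_TIME ($max_job_time) must exceed the ${margin}s margin\n"
      if $max_job_time <= $margin;
    return $max_job_time - $margin;
}

my $timeout = test_timeout();
print "internal timeout: $timeout s\n";    # 9600 - 1200 = 8400
```

With MAX_JOB_TIME=9600 the internal timeout comes out as 8400 s, leaving 20 minutes for everything else the job does. Failing early on a too-small MAX_JOB_TIME turns a misconfiguration into an explicit error instead of a zero or negative timeout.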
Updated by yosun almost 7 years ago
- Status changed from Feedback to In Progress
- % Done changed from 0 to 90
Updated by yosun almost 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 90 to 100
Runs ok. Marked as resolved.