action #30183

closed

[sle][functional][qa_automation] test incompletes in execute_test_run because of sum of timeouts > MAX_JOB_TIME

Added by okurz over 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Bugs in existing tests
Start date: 2018-01-11
Due date: 2018-01-30
% Done: 100%
Estimated time:
Difficulty:

Description

Observation

The openQA test in scenario sle-15-Installer-DVD-s390x-fs_stress@s390x-kvm-sle12 fails in execute_test_run and ends up incomplete for an unknown reason. The bigger problem, though, is that the test API commands are called with a timeout equal to MAX_JOB_TIME (falling back to 7200). This does not make sense: if such a command times out, the job is guaranteed to end up incomplete, because MAX_JOB_TIME is hit before the internal timeout is.

Suggestion

Please make sure the sum of internal timeouts does not exceed MAX_JOB_TIME. Never assign the value of MAX_JOB_TIME to a test API timeout.
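One way to read this suggestion is as a budget check: add up the timeouts a test module may wait for and compare the sum with MAX_JOB_TIME. A minimal sketch in plain Perl follows. The helper check_timeout_budget and the example numbers are hypothetical and not part of the openQA test API. The check is deliberately strict (sum must stay below the limit), since a sum equal to MAX_JOB_TIME leaves no time for anything else in the job.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the openQA test API: returns true
# only if all internal timeouts together stay below the job limit.
sub check_timeout_budget {
    my ($max_job_time, @timeouts) = @_;
    my $total = 0;
    $total += $_ for @timeouts;
    return $total < $max_job_time;
}

# Assumed example values in seconds; 7200 is the fallback named above.
my $max_job_time = 7200;
print check_timeout_budget($max_job_time, 600, 6000) ? "ok\n" : "too long\n";
print check_timeout_budget($max_job_time, 600, 7200) ? "ok\n" : "too long\n";
```

The first call stays within the budget, the second one does not because one of its timeouts is already as large as MAX_JOB_TIME.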

Further details

This is considered "urgent" because incomplete jobs cause a lot of effort in test review.

Always latest result in this scenario: latest

Actions #1

Updated by yosun over 6 years ago

  • Status changed from New to Feedback

The reason for the failure is that all test runs on s390x-kvm-sle12 fail, the same as in https://openqa.suse.de/tests/1377895#settings
As this test also runs in zkvm on s390, I think I can remove the test run on s390x-kvm-sle12. Judging from the name, maybe "-sle12" is not suitable for sle15 tests?

For the MAX_JOB_TIME issue: if some of our tests need more than 2 hours, which parameter should we use? I remember that using MAX_JOB_TIME to allow more than 2 hours was OK.

Actions #2

Updated by mgriessmeier over 6 years ago

  • Status changed from Feedback to In Progress

yosun wrote:

The reason for the failure is that all test runs on s390x-kvm-sle12 fail, the same as in https://openqa.suse.de/tests/1377895#settings

this one is failing for a completely different reason

As this test also runs in zkvm on s390, I think I can remove the test run on s390x-kvm-sle12. Judging from the name, maybe "-sle12" is not suitable for sle15 tests?

the "-sle12" is referring to the OS on the hypervisor, of course it can run sle15 vms, that's what we want to test here

For the MAX_JOB_TIME issue: if some of our tests need more than 2 hours, which parameter should we use? I remember that using MAX_JOB_TIME to allow more than 2 hours was OK.

but apparently this was not set in openQA

Actions #3

Updated by okurz over 6 years ago

Just setting MAX_JOB_TIME to a different value will not prevent the problem. The internal timeout is set to MAX_JOB_TIME; add the time spent in other testapi calls and the total always exceeds MAX_JOB_TIME. To give you a simple example:

sleep 1;
assert_screen 'foo', get_var('MAX_JOB_TIME');

If the needle 'foo' cannot be found, the job will be incomplete, not failed, because sleep and assert_screen together would wait one second longer than MAX_JOB_TIME. So: do not ever use MAX_JOB_TIME as the value for any internal timeout.
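The arithmetic behind this example can be reproduced outside openQA. The sketch below is plain Perl, not a test module. get_var is replaced by a local stand-in backed by a hash, and the value 7200 is the fallback mentioned in the description. Both are assumptions made only so the snippet runs on its own.

```perl
use strict;
use warnings;

# Stand-in for the job settings; in a real test get_var comes from testapi.
my %settings = (MAX_JOB_TIME => 7200);
sub get_var {
    my ($name, $default) = @_;
    return $settings{$name} // $default;
}

my $elapsed_before   = 1;                          # the "sleep 1" above
my $internal_timeout = get_var('MAX_JOB_TIME');    # the value to avoid
my $earliest_failure = $elapsed_before + $internal_timeout;

# The job is aborted at MAX_JOB_TIME, before assert_screen could fail.
printf "assert_screen would give up after %d s, job limit is %d s\n",
    $earliest_failure, get_var('MAX_JOB_TIME');
print "job ends incomplete instead of failed\n"
    if $earliest_failure > get_var('MAX_JOB_TIME');
```

Any time spent before the call, however small, pushes the earliest possible test failure past the job limit.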

Actions #4

Updated by okurz about 6 years ago

  • Due date set to 2018-01-30

Actions #5

Updated by yosun about 6 years ago

  • Status changed from In Progress to Feedback

Checked with the latest build: all stress tests passed. Test times are as follows:
s390x: all finished in around 1 hour
https://openqa.suse.de/tests/1385355
https://openqa.suse.de/tests/1385304
https://openqa.suse.de/tests/1385362
https://openqa.suse.de/tests/1385305
https://openqa.suse.de/tests/1385356
https://openqa.suse.de/tests/1385310

I personally think this ticket does not match the main cause of the failure in https://openqa.suse.de/tests/1377890/modules/execute_test_run/steps/1
I have changed MAX_JOB_TIME of all stress tests to 7200.
From https://openqa.suse.de/tests/1377890/file/video.ogv we can see that this is a real bug: the system hung while running the third sub-testcase, which is why the log tarball was not uploaded. But it was not reproduced in the following builds. I suggest closing this ticket, because all stress tests can finish within 2 hours without bugs.

Actions #6

Updated by yosun about 6 years ago

I got what you mean in #3.
We could introduce a separate timeout for the real test, MAX_TEST_TIME, to replace MAX_JOB_TIME in our tests. But that would add one more parameter to each testcase, and it is really hard to say exactly how many seconds "MAX_JOB_TIME - MAX_TEST_TIME" should be. Most of the time is spent in the real test: tests that need to set MAX_TEST_TIME always run for hours (the default timeout is perfect for other tests), so the "MAX_JOB_TIME - MAX_TEST_TIME" taken by the "prepare stage" can be ignored :)
I suggest keeping MAX_JOB_TIME in use for easy configuration and understanding.

Actions #7

Updated by okurz about 6 years ago

I suggest keeping it simple: deduct 20 minutes in the test code and bump MAX_JOB_TIME to 9600. So:

  • MAX_JOB_TIME=9600 in test suites
  • change test code to set my $timeout = get_var('MAX_JOB_TIME', 9600) - 1200;
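The two points above could be wrapped in a small helper, sketched here in plain Perl. internal_timeout is a hypothetical name, and get_var is replaced by a local stand-in so the snippet runs outside openQA. Only the values 9600 and 1200 come from the suggestion. The guard against a non-positive result is an addition for illustration.

```perl
use strict;
use warnings;

# Stand-in for testapi's get_var so the sketch runs outside openQA.
my %settings = (MAX_JOB_TIME => 9600);
sub get_var {
    my ($name, $default) = @_;
    return $settings{$name} // $default;
}

# Hypothetical helper: internal timeout = job limit minus a safety margin.
sub internal_timeout {
    my ($margin) = @_;
    $margin //= 1200;    # 20 minutes, as suggested above
    my $timeout = get_var('MAX_JOB_TIME', 9600) - $margin;
    die "MAX_JOB_TIME too small for a margin of $margin s\n" if $timeout <= 0;
    return $timeout;
}

print internal_timeout(), "\n";    # 9600 - 1200 = 8400
```

With this shape the margin stays in one place in the test code, so the test suites only need to set MAX_JOB_TIME.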

Actions #8

Updated by okurz about 6 years ago

  • Target version set to Milestone 14

Actions #9

Updated by yosun about 6 years ago

  • Status changed from Feedback to In Progress
  • % Done changed from 0 to 90

Actions #10

Updated by yosun about 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

Runs ok. Marked as resolved.
