action #30183: [sle][functional][qa_automation]test incompletes in execute_test_run because of sum of timeouts > MAX_JOB_TIME - openQA Tests - openSUSE Project Management Tool


action #30183

closed

[sle][functional][qa_automation]test incompletes in execute_test_run because of sum of timeouts > MAX_JOB_TIME

Added by okurz over 6 years ago. Updated over 6 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

yosun

Category:

Bugs in existing tests

Target version:

openQA Project - Milestone 14

Start date:

2018-01-11

Due date:

2018-01-30

% Done:

100%

Estimated time:

Difficulty:

Description

Observation¶

The openQA test in scenario sle-15-Installer-DVD-s390x-fs_stress@s390x-kvm-sle12 ends up incomplete in
execute_test_run
for an unknown reason. The bigger problem, though, is that the test API commands are called with a timeout equal to MAX_JOB_TIME (fallback: 7200). This does not make sense: if such a call times out, the job is guaranteed to end up incomplete, because MAX_JOB_TIME is hit before the internal timeout.

Suggestion¶

Please make sure the sum of internal timeouts does not exceed MAX_JOB_TIME. Never assign the value of MAX_JOB_TIME to a test API timeout.

Further details¶

This is considered "urgent" because incomplete jobs cause a lot of effort in test review.

Always latest result in this scenario: latest

History

Updated by yosun over 6 years ago

Status changed from New to Feedback

The reason for the failure is that all test runs on s390x-kvm-sle12 fail, the same as in https://openqa.suse.de/tests/1377895#settings
As this test also runs on zkvm on s390, I think I can remove the test run on s390x-kvm-sle12. Judging from the name "-sle12", maybe it is not suitable for sle15 tests?

Regarding the MAX_JOB_TIME issue: if some of our tests need more than 2 hours, which parameter should we use? As I remember, using MAX_JOB_TIME to allow more than 2 hours was OK.


Updated by mgriessmeier over 6 years ago

Status changed from Feedback to In Progress

yosun wrote:

The reason for the failure is that all test runs on s390x-kvm-sle12 fail, the same as in https://openqa.suse.de/tests/1377895#settings

this one is failing for a completely different reason

As this test also runs on zkvm on s390, I think I can remove the test run on s390x-kvm-sle12. Judging from the name "-sle12", maybe it is not suitable for sle15 tests?

the "-sle12" is referring to the OS on the hypervisor, of course it can run sle15 vms, that's what we want to test here

Regarding the MAX_JOB_TIME issue: if some of our tests need more than 2 hours, which parameter should we use? As I remember, using MAX_JOB_TIME to allow more than 2 hours was OK.

but apparently this was not set in openQA


Updated by okurz over 6 years ago

Just setting MAX_JOB_TIME to a different value will not prevent the problem. The internal timeout is set to MAX_JOB_TIME, so together with the timeouts of the other testapi calls the sum always exceeds MAX_JOB_TIME. To give you a simple example:

sleep 1;
assert_screen 'foo', get_var('MAX_JOB_TIME');

If the needle 'foo' cannot be found, the job will be incomplete, not failed, because after the one-second sleep assert_screen would keep waiting until one second past MAX_JOB_TIME. So: never use MAX_JOB_TIME as the value for any internal timeout.
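The arithmetic behind this example can be checked with a small standalone sketch. It only sums the waits a test plans to make and compares them with MAX_JOB_TIME. The helper name `timeout_slack` and the 600-second reserve are assumptions made for illustration; neither is part of testapi, and the script does not call openQA at all.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of testapi): seconds of slack left if
# every planned wait runs into its full timeout. A negative result means
# MAX_JOB_TIME is hit first and the job ends up incomplete.
sub timeout_slack {
    my ($max_job_time, @timeouts) = @_;
    my $sum = 0;
    $sum += $_ for @timeouts;
    return $max_job_time - $sum;
}

my $max_job_time = 7200;    # fallback value mentioned in this ticket

# Pattern from the example above: "sleep 1", then a wait of MAX_JOB_TIME.
my $bad = timeout_slack($max_job_time, 1, $max_job_time);

# Assumed alternative: keep a 600 s reserve (illustrative number only).
my $good = timeout_slack($max_job_time, 1, $max_job_time - 600);

printf "slack with MAX_JOB_TIME as timeout: %d s\n", $bad;     # -1
printf "slack with a reserved margin:       %d s\n", $good;    # 599
```

With a positive slack the internal timeout fires first, so the test module can fail normally instead of the whole job being cut off.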


Updated by okurz over 6 years ago

Due date set to 2018-01-30


Updated by yosun over 6 years ago

Status changed from In Progress to Feedback

Checked with the latest build: all stress tests passed. The test times are as follows:
On s390x all finished in around 1 hour:
https://openqa.suse.de/tests/1385355
https://openqa.suse.de/tests/1385304
https://openqa.suse.de/tests/1385362
https://openqa.suse.de/tests/1385305
https://openqa.suse.de/tests/1385356
https://openqa.suse.de/tests/1385310

Personally, I think this ticket does not match the main cause of the failure https://openqa.suse.de/tests/1377890/modules/execute_test_run/steps/1
I have changed MAX_JOB_TIME to 7200 for all stress tests.
From https://openqa.suse.de/tests/1377890/file/video.ogv we can see that this is a real bug: the system hung while running the third sub-testcase, which is why the log tarball was not uploaded. But it was not reproduced in the following builds. I suggest closing this ticket, because all stress tests can finish within 2 hours without bugs.


Updated by yosun over 6 years ago

I got what you meant in #3.
We could introduce a MAX_TEST_TIME for the real test timeout to replace MAX_JOB_TIME in our tests. But that adds one more parameter to each testcase, and it is really hard to say exactly how many seconds "MAX_JOB_TIME - MAX_TEST_TIME" should be... Most of the time is spent in the real test, and the tests that would need MAX_TEST_TIME always run for hours (the default timeout is perfect for other tests), so the time the "prepare stage" takes, i.e. "MAX_JOB_TIME - MAX_TEST_TIME", can be ignored :)
I suggest keeping MAX_JOB_TIME in use for easy configuration and understanding.
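One possible middle ground, sketched here purely as an illustration, keeps MAX_JOB_TIME as the only configured parameter and derives the internal timeout from it. The helper `remaining_test_timeout`, the 420-second preparation time and the 300-second reserve are all assumptions. How a real test would measure the time already spent in the job is not covered by this ticket.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of testapi): derive the internal timeout
# from MAX_JOB_TIME, the time already spent and a safety reserve, so that
# no separate MAX_TEST_TIME setting is needed.
sub remaining_test_timeout {
    my ($max_job_time, $elapsed, $reserve) = @_;
    my $timeout = $max_job_time - $elapsed - $reserve;
    die "no time budget left for the test step\n" if $timeout <= 0;
    return $timeout;
}

my $max_job_time = 7200;    # value used for the stress tests in this ticket
my $elapsed      = 420;     # assumed duration of the "prepare stage"
my $reserve      = 300;     # assumed margin for log upload and cleanup

my $timeout = remaining_test_timeout($max_job_time, $elapsed, $reserve);
print "internal timeout: $timeout s\n";    # 6480

# In a test module this value, rather than MAX_JOB_TIME itself, would be
# the number passed as the timeout argument of the long-running wait.
```

The reserve does not need to be exact. It only has to be large enough that the internal timeout fires before MAX_JOB_TIME does.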
