action #125033
closed [security][maint][12sp2][12sp3][12sp4] test fails in aa_autodep
100%
Description
Observation
openQA test in scenario sle-12-SP3-Server-DVD-Updates-x86_64-mau-apparmor@64bit fails in aa_autodep
Test suite description
Testsuite maintained at https://gitlab.suse.de/qe-security/osd-sle15-security.
Reproducible
Fails since (at least) Build 20230223-1
Expected result
Last good: 20230222-1 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by emiler about 1 year ago
This is weird, but I've re-run this in my local instance and it passed...
http://emiler-openqa.qe.suse.de/tests/95#
The error seems like it's just a timeout, so an infrastructure problem?
Updated by emiler about 1 year ago
- Status changed from New to In Progress
- Assignee set to emiler
Updated by pstivanin about 1 year ago
happened also on 12sp2: https://openqa.suse.de/tests/10603080
Updated by pstivanin about 1 year ago
new failure on 12sp3: https://openqa.suse.de/tests/10608748
Updated by emiler about 1 year ago
- % Done changed from 0 to 80
Should be fixed by this PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/16525
Updated by emiler about 1 year ago
I am also experimenting with setting TIMEOUT_SCALE instead.
Updated by emiler about 1 year ago
- % Done changed from 80 to 100
I've spoken with Josef Pupava and he says that TIMEOUT_SCALE should never be used in production, only while debugging (when you don't want to deal with several timeouts by hand). It is also ignored by some timeouts entirely; not all cases honour this variable. So the original PR is a much better solution, in his own words.
PR merged, waiting for a successful run before closing.
Updated by dzedro about 1 year ago
Yep, I'm not a fan of using TIMEOUT_SCALE as the solution.
The exception is slower architectures like aarch64, where it's faster and more convenient.
But timeouts have always caused failures and always will, because from the beginning the timeouts were pretty strict.
Sometimes things are slow or get slower because some functionality got extended, the worker has a load peak, the infra has a hiccup, etc.
A timeout means something different in e.g. assert_script_run than in assert_screen or check_screen.
IMO in assert_script_run it should be as high as possible, because it's always better to get an error message from the command than a timeout. The timeout is a kind of safety net in case something abnormal happens, to avoid a stuck assert_script_run, e.g. on an infinite loop.
Needle timeouts should be as low as possible, but not too low. With assert_screen, if the needle does not match in time or is not present, the test will fail. check_screen waits up to the whole timeout and returns the match, if any, but does not fail. The two have different uses.
TIMEOUT_SCALE will just multiply "all" these timeouts, which have different behavior. 🤷
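To make the difference concrete, here is a toy model in Python of the two needle-timeout behaviours described above and of a global multiplier applied to both. It is only a sketch: `toy_check_screen`, `toy_assert_screen` and the `scale` parameter are made-up names for illustration, not the os-autoinst test API, and the clock is simulated rather than real.

```python
from typing import Callable, Optional


def poll(matches: Callable[[float], bool], timeout: float, step: float = 1.0) -> Optional[float]:
    """Return the simulated time of the first match, or None if the timeout expires."""
    t = 0.0
    while t <= timeout:
        if matches(t):
            return t
        t += step
    return None


def toy_check_screen(matches: Callable[[float], bool], timeout: float, scale: float = 1.0) -> bool:
    """Soft check: waits up to the (scaled) timeout and reports the result, never fails."""
    return poll(matches, timeout * scale) is not None


def toy_assert_screen(matches: Callable[[float], bool], timeout: float, scale: float = 1.0) -> float:
    """Hard check: a missing match within the (scaled) timeout fails the test."""
    matched_at = poll(matches, timeout * scale)
    if matched_at is None:
        raise TimeoutError(f"no match within {timeout * scale:.0f}s")
    return matched_at


def slow_needle(t: float) -> bool:
    """Hypothetical screen that only shows the expected content after 40 simulated seconds."""
    return t >= 40


if __name__ == "__main__":
    print(toy_check_screen(slow_needle, timeout=30))           # soft check, no failure
    print(toy_check_screen(slow_needle, timeout=30, scale=2))  # same check, doubled timeout
    try:
        toy_assert_screen(slow_needle, timeout=30)
    except TimeoutError as err:
        print("test would fail:", err)
```

In this model one multiplier stretches both checks equally, although a longer soft check only costs time on a miss while a longer hard check changes whether the test fails. That is the mismatch the comment above points at.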
Updated by emiler about 1 year ago
- Status changed from In Progress to Resolved
https://openqa.suse.de/tests/10611337
Re-run on 12-SP3 passed this time. Closing.
Updated by pstivanin about 1 year ago
- Status changed from Resolved to Feedback
still failing: https://openqa.suse.de/tests/10614289
Updated by emiler about 1 year ago
https://openqa.suse.de/tests/10615012
Ok, weird. A re-run of the same test passed again, so the timeout is perhaps still not enough? I don't want to believe that this will hang for over 5 minutes.
Updated by pstivanin about 1 year ago
I think it'd be better to set the RETRY value to 3 in this case (via test suite json).
Updated by emiler about 1 year ago
That could work.
Related PR: https://gitlab.suse.de/qe-security/osd-sle15-security/-/merge_requests/57
Test run passed: http://emiler-openqa.qe.suse.de/tests/120
Updated by emiler about 1 year ago
- Status changed from Feedback to Resolved
Tests passed several times now:
Closing again.
Updated by mgrifalconi about 1 year ago
Hello, tests are still failing: https://openqa.suse.de/tests/10689040#next_previous
Updated by tjyrinki_suse about 1 year ago
- Status changed from Resolved to Workable
Reopening, it has failed 7 out of 10 times recently: https://openqa.suse.de/tests/10690470#next_previous , which is a bit high / cumbersome.
Updated by emiler about 1 year ago
We can either increase the timeout, set a timeout multiplier in the testsuite itself (I'd rather not), or, as Paolo suggested:
maybe we can try with more resources? qemucpu=host, qemucpus=2, qemuram=2048 ?
I'll take a look at this on Thursday.
Updated by emiler about 1 year ago
- Subject changed from [security][maint][12sp3][12sp4] test fails in aa_autodep to [security][maint][12sp2][12sp3][12sp4] test fails in aa_autodep
Updated by emiler about 1 year ago
New PR adding more resources to the test run: https://gitlab.suse.de/qe-security/osd-sle15-security/-/merge_requests/73
I'll wait for new schedules, to see if it fails again, before closing.
Updated by emiler about 1 year ago
All three versions passed today:
- https://openqa.suse.de/tests/10710759
- https://openqa.suse.de/tests/10710811
- https://openqa.suse.de/tests/10710892
Though I am still going to wait until Monday to double-check.
Updated by emiler about 1 year ago
- Status changed from In Progress to Resolved
Updated by emiler 10 months ago
- Related to action #131462: [security][12-sp2] test fails in aa_autodep added