action #125033

[security][maint][12sp2][12sp3][12sp4] test fails in aa_autodep

Added by pstivanin 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Bugs in existing tests
Target version: -
Start date: 2023-02-24
Due date:
% Done: 100%
Estimated time:
Difficulty:
Tags:

Description

Observation

openQA test in scenario sle-12-SP3-Server-DVD-Updates-x86_64-mau-apparmor@64bit fails in aa_autodep

Test suite description

Testsuite maintained at https://gitlab.suse.de/qe-security/osd-sle15-security.

Reproducible

Fails since (at least) Build 20230223-1

Expected result

Last good: 20230222-1 (or more recent)

Further details

Always latest result in this scenario: latest

History

#1 Updated by emiler 3 months ago

This is weird, but I've re-run this in my local instance and it passed...
http://emiler-openqa.qe.suse.de/tests/95#
The error seems like it's just a timeout, so an infrastructure problem?

#2 Updated by emiler 3 months ago

  • Status changed from New to In Progress
  • Assignee set to emiler

#3 Updated by pstivanin 3 months ago

happened also on 12sp2: https://openqa.suse.de/tests/10603080

#4 Updated by pstivanin 3 months ago

#5 Updated by emiler 3 months ago

  • % Done changed from 0 to 80

#6 Updated by emiler 3 months ago

I am also experimenting with setting TIMEOUT_SCALE instead.

#7 Updated by emiler 3 months ago

  • % Done changed from 80 to 100

I've spoken with Josef Pupava and he says that TIMEOUT_SCALE should never be used in production, only while debugging (when you don't want to deal with several timeouts by hand). Some timeouts also ignore it entirely; not all cases honour this variable. So the original PR is a much better solution, in his own words.
PR merged, waiting for a successful run before closing.

#8 Updated by dzedro 3 months ago

Yep, I'm not a fan of using TIMEOUT_SCALE as the solution.
Exceptions are slower architectures like aarch64, where it's faster and more convenient.
But timeouts are, and always will be, causing failures, because the timeouts have been pretty strict since the beginning.
Sometimes things are slower, or get slower, because some functionality got extended, a worker has a load peak, the infra has a hiccup, etc.
The timeout means different things in e.g. assert_script_run and assert_screen or check_screen.
IMO in assert_script_run it should be as high as possible, because it's always better to get an error message from the command than a timeout. There the timeout is a kind of safety net against something abnormal, like an infinite loop, leaving assert_script_run stuck.
Needle timeouts should be as low as possible, but not too low. With assert_screen, if the needle does not match in time or is not present, the test fails. check_screen waits up to the whole timeout and returns the match result, but does not fail the test. The two have different uses.
TIMEOUT_SCALE will just multiply "all" these timeouts, despite their different behavior. 🤷
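
To make the point above concrete, here is a toy model in Python. It is not the openQA API: the function names, the timeout names and the values in seconds are all assumptions chosen for illustration. It only shows that one global multiplier has a different effect on an assert-style wait than on a check-style wait.

```python
import math

# Illustrative timeouts in seconds; names and values are assumptions.
TIMEOUTS = {"script_run": 90, "needle_assert": 30, "needle_check": 30}

def scale_timeouts(timeouts, scale):
    """Multiply every timeout by the same factor, as a global multiplier would."""
    return {name: seconds * scale for name, seconds in timeouts.items()}

def assert_style(duration, timeout):
    """Fail hard when the awaited event takes longer than the timeout."""
    if duration > timeout:
        raise TimeoutError(f"no result within {timeout}s")
    return duration  # seconds spent waiting

def check_style(duration, timeout):
    """Never fail: report whether it matched and how long was spent waiting."""
    matched = duration <= timeout
    return matched, (duration if matched else timeout)

scaled = scale_timeouts(TIMEOUTS, 3)

# A slow command needing 200s would time out at 90s but passes at 270s.
print(assert_style(200, scaled["script_run"]))

# A needle that never appears does not fail the check either way,
# but the time wasted on the miss triples.
print(check_style(math.inf, TIMEOUTS["needle_check"]))
print(check_style(math.inf, scaled["needle_check"]))
```

In this model the multiplier rescues the slow command, but it also makes every check-style miss three times as expensive, which is the side effect described above.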

#9 Updated by emiler 3 months ago

  • Status changed from In Progress to Resolved

https://openqa.suse.de/tests/10611337
Re-run on 12-SP3 passed this time. Closing.

#10 Updated by pstivanin 3 months ago

  • Status changed from Resolved to Feedback

#11 Updated by emiler 3 months ago

https://openqa.suse.de/tests/10615012
Ok, weird. A re-run of the same test passed again, so the timeout is perhaps still not enough? I don't want to believe that this will hang for over 5 minutes.

#12 Updated by pstivanin 3 months ago

I think it'd be better to set the RETRY value to 3 in this case (via test suite json).

#14 Updated by emiler 3 months ago

  • Status changed from Feedback to Resolved

#15 Updated by mgrifalconi 3 months ago

#16 Updated by tjyrinki_suse 3 months ago

  • Status changed from Resolved to Workable

Reopening, it has failed 7 out of 10 times recently: https://openqa.suse.de/tests/10690470#next_previous , which is a bit high / cumbersome.
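
As rough arithmetic on the numbers in this thread, the Python sketch below estimates how often a job would eventually pass with retries. It rests on two assumptions: each attempt fails independently with probability 0.7 (7 out of 10), which infrastructure load may well violate, and RETRY=3 means up to three additional attempts after the first. The helper name is hypothetical.

```python
def pass_probability(failure_rate, retries):
    """Chance that at least one of (retries + 1) independent attempts passes."""
    attempts = retries + 1
    return 1 - failure_rate ** attempts

# Compare no retries up to three retries at a 70% per-attempt failure rate.
for retries in range(4):
    print(retries, round(pass_probability(0.7, retries), 3))
```

Under these assumptions three retries lift the pass rate from about 30% to about 76%, so retries alone would hide the flakiness without removing it.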

#17 Updated by emiler 3 months ago

  • Status changed from Workable to In Progress

#18 Updated by emiler 3 months ago

We can either increase the timeout, set a timeout multiplier in the testsuite itself (I'd rather not), or, as Paolo suggested:

maybe we can try with more resources? qemucpu=host, qemucpus=2, qemuram=2048?

I'll take a look at this on Thursday.
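
For reference, a minimal Python sketch of what the suggested resource bump amounts to when layered on top of existing settings. The baseline values are hypothetical, since the real test suite settings are not shown in this ticket, and writing the names in upper case is an assumption about how they appear as job settings.

```python
# Hypothetical baseline; the real test suite settings are not shown here.
base_settings = {"QEMUCPUS": "1", "QEMURAM": "1024"}

# Values quoted from the suggestion above, written as upper-case names.
resource_overrides = {"QEMUCPU": "host", "QEMUCPUS": "2", "QEMURAM": "2048"}

def apply_overrides(settings, overrides):
    """Return a new settings dict in which the overrides win over the baseline."""
    return {**settings, **overrides}

def render(settings):
    """Render settings as sorted KEY=VALUE lines for easy review."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))

print(render(apply_overrides(base_settings, resource_overrides)))
```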

#19 Updated by emiler 3 months ago

  • Subject changed from [security][maint][12sp3][12sp4] test fails in aa_autodep to [security][maint][12sp2][12sp3][12sp4] test fails in aa_autodep

#20 Updated by emiler 3 months ago

New PR adding more resources to the test run: https://gitlab.suse.de/qe-security/osd-sle15-security/-/merge_requests/73
I'll wait for new schedules, to see if it fails again, before closing.

#21 Updated by emiler 3 months ago

All three versions passed today:

Though I am still going to wait until Monday to double-check.
