action #98541
closed
[qe-core][kernel] Steps in case of s390 failures
0%
Description
From time to time there are failures on s390 workers that need action to fix. They are usually handled in the eng-testing channel; sometimes they are fixed quickly, but sometimes it takes longer (for example, waiting for a more experienced person to return from vacation). It is also sometimes unclear who is responsible for the health of the s390 workers in the openQA pool.
What should the best process be, and who is primarily responsible for the s390 workers?
Updated by MDoucha over 3 years ago
The most common s390x worker error is failure to execute define_and_start() in bootloader_zkvm. But this failure has multiple different causes:
- Memory allocation issue: https://openqa.suse.de/tests/6044126#step/bootloader_zkvm/28
- macvtap address collision: https://openqa.suse.de/tests/6926261#step/bootloader_zkvm/28
- Netlink connection error: https://openqa.suse.de/tests/7085075#step/bootloader_zkvm/28
Some happen randomly due to worker overload, others are the result of manual misconfiguration and persist on one or more worker slots until manually fixed.
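As a rough illustration of how these three causes could be told apart when triaging a failed bootloader_zkvm step, here is a minimal Python sketch. The macvtap pattern is taken from the auto_review expression in related ticket #97532. The memory and netlink patterns, the function name classify_failure and the sample log lines are assumptions made for illustration only, since the exact error wording for those two cases is not quoted in this ticket.

```python
import re

# Maps a human-readable cause to a regex searched in the job's error output.
# Only the macvtap pattern comes from ticket #97532 (auto_review expression).
FAILURE_PATTERNS = {
    "macvtap address collision": re.compile(
        r"error: Cannot set interface flags on 'macvtap.*': Address already in use"
    ),
    # Hypothetical wording: adjust to the real message seen in the job log.
    "memory allocation issue": re.compile(r"cannot allocate memory", re.IGNORECASE),
    # Hypothetical wording: adjust to the real message seen in the job log.
    "netlink connection error": re.compile(r"netlink", re.IGNORECASE),
}


def classify_failure(log_text: str) -> str:
    """Return the first matching cause name, or 'unknown' if nothing matches."""
    for cause, pattern in FAILURE_PATTERNS.items():
        if pattern.search(log_text):
            return cause
    return "unknown"


if __name__ == "__main__":
    sample = "error: Cannot set interface flags on 'macvtap3': Address already in use"
    print(classify_failure(sample))
```

A classification like this would only be a first hint: as noted above, some causes are transient overload and others persist on a worker slot until someone fixes the configuration by hand.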
Updated by szarate over 3 years ago
- Related to action #97532: [qe-core][sporadic] s390x jobs are failing to boot auto_review:"error: Cannot set interface flags on 'macvtap.*': Address already in use":retry added
Updated by szarate over 3 years ago
Hi Petr, in any case, if you're struggling to figure out the root cause of those problems, you can ping me directly or mention the issue in the qe-core/eng-testing channels.
As I mentioned during the call, I suspect that the memory one (if it happens again, let me know) could be related to too many jobs running on the same machine.
Updated by okurz over 3 years ago
- Project changed from 175 to openQA Tests (public)
- Subject changed from Steps in case of s390 failures to [qe-core][kernel] Steps in case of s390 failures
- Category set to Bugs in existing tests
Discussed in the weekly QE sync on 2021-09-15. @szarate already linked the important related ticket #97532. The test modules mentioned above list mgriessmeier as maintainer, so I added him as a watcher on the ticket. He might be able to help. If not, I see the responsibility for these s390x particularities with the QE Core team. For issues that do not look specific to the test code of os-autoinst-distri-opensuse, the tools team is responsible. All tools team members are expected to be responsive in chat (https://progress.opensuse.org/projects/qa/wiki#Common-tasks-for-team-members), e.g. #eng-testing in the internal chat, so questions can be raised there. With this I think we can move the ticket out of "qam-qasle-collaboration" into the "openQA Tests" project with the corresponding keywords.
Updated by tjyrinki_suse almost 3 years ago
- Related to action #105049: [qe-core] System cannot boot after installation in s390x in multiple test suites added
Updated by slo-gin over 2 years ago
This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.
Updated by okurz about 2 years ago
- Tags changed from s390, openQA, infrastructure to s390, openQA, infra