[qe-core][kernel] Steps in case of s390 failures
From time to time there are failures on s390 workers which need action to fix. This is usually handled in the eng-testing channel; sometimes it is fixed quickly, but sometimes it takes more time (waiting for a more experienced person to return from vacation). Sometimes it is not clear who is responsible for the health of s390 workers in the openQA pool.
What should be the best process, and who is primarily responsible for s390 workers?
#1 Updated by MDoucha about 1 month ago
The most common s390x worker error is failure to execute bootloader_zkvm. But this failure has multiple different causes:
- Memory allocation issue: https://openqa.suse.de/tests/6044126#step/bootloader_zkvm/28
- macvtap address collision: https://openqa.suse.de/tests/6926261#step/bootloader_zkvm/28
- Netlink connection error: https://openqa.suse.de/tests/7085075#step/bootloader_zkvm/28
Some happen randomly due to worker overload, others are the result of manual misconfiguration and persist on one or more worker slots until manually fixed.
#4 Updated by szarate about 1 month ago
Hi Petr, in any case, if you're struggling to figure out the root cause of those problems, you can ping me directly or mention the issue in the qe-core/eng-testing channels, as I mentioned during the call.
I suspect that the memory one (if it happens again lmk) could be related to too many jobs running on the same machine.
#5 Updated by okurz about 1 month ago
- Project changed from qam-qasle-collaboration to openQA Tests
- Subject changed from Steps in case of s390 failures to [qe-core][kernel] Steps in case of s390 failures
- Category set to Bugs in existing tests
Discussed in the weekly QE sync on 2021-09-15. szarate already linked the important related ticket #97532. The test modules mentioned above list mgriessmeier as maintainer, hence I added him as a watcher to the ticket. He might be able to help. If not, then I see the responsibility for these s390x particularities with the QE Core team. In case of issues which do not look specific to the test code of os-autoinst-distri-opensuse, the tools team is responsible. All tools team members are expected to be responsive in chat (https://progress.opensuse.org/projects/qa/wiki#Common-tasks-for-team-members), e.g. in #eng-testing of the internal chat, so questions can be raised there. With this I think we can move the ticket out of "qam-qasle-collaboration" into the "openQA Tests" project with the appropriate keywords.