action #90974
coordination #39719 (closed): [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #62420: [epic] Distinguish all types of incompletes
Make it obvious if qemu gets terminated unexpectedly due to out-of-memory
0%
Description
Motivation
qemu can need a lot of memory, and how much depends on how openQA users configure their test jobs. This can lead to out-of-memory (OOM) conditions, and we should feed this situation back to the test reviewers. #90161 is a recent example: jobs failed on malbec.arch due to OOM, but the feedback was suboptimal. The corresponding openQA test, https://openqa.suse.de/tests/5674784, was incomplete with reason "Reason: backend died: QEMU exited unexpectedly, see log for details" and was labeled by auto-review with #71188, without specifically pointing to an OOM condition.
Acceptance criteria
- AC1: If qemu dies because it was killed due to OOM, this is obvious from the incomplete reason
Suggestions
- As far as okurz could find out, the best way to detect OOM is to check the system logs, e.g. with
dmesg | grep 'Out of memory: Killed process'
which also reveals the PID of the killed process. One could then check that PID against the PID of the qemu process that the qemu backend monitors and feed that information back as the incomplete reason.
- Ensure that these conditions are no longer linked to #71188
- Crosscheck what other reasons could explain #71188, or close it as well if it is very likely that only OOM would explain such problems
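The PID cross-check from the first suggestion could be sketched roughly as follows. This is only an illustration in Python, not the openQA backend code: the function names and the wording of the returned reason are hypothetical, and it assumes the kernel log contains the "Out of memory: Killed process <PID> (<name>)" message that the grep above looks for.

```python
import re
import subprocess

# Matches the kernel message the ticket greps for; captures PID and command name.
OOM_KILL_RE = re.compile(r"Out of memory: Killed process (\d+) \(([^)]*)\)")


def find_oom_killed(log_text):
    """Return {pid: command_name} for every OOM kill reported in log_text."""
    return {int(m.group(1)): m.group(2) for m in OOM_KILL_RE.finditer(log_text)}


def incomplete_reason(qemu_pid, log_text):
    """Return a more specific reason if qemu_pid was OOM-killed, else None.

    The wording of the returned string is hypothetical, not openQA's actual text.
    """
    if qemu_pid in find_oom_killed(log_text):
        return (
            f"backend died: QEMU (PID {qemu_pid}) was killed by the kernel "
            "due to out-of-memory"
        )
    return None


def read_kernel_log():
    """Read the kernel ring buffer via dmesg; this may need elevated privileges."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

A caller that knows the monitored qemu PID could run `incomplete_reason(qemu_pid, read_kernel_log())` once qemu has exited unexpectedly and fall back to the generic reason when the result is `None`. Matching on the PID rather than on the process name avoids blaming qemu when some other process on the worker was the OOM victim.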
Further references:
- https://stackoverflow.com/questions/6132333/how-to-detect-out-of-memory-segfaults
- https://unix.stackexchange.com/questions/128642/debug-out-of-memory-with-var-log-messages
- It would also be possible to change the memory overcommit setting, i.e. to disable or allow memory overcommit completely, see https://www.eurovps.com/faq/how-to-troubleshoot-high-memory-usage-in-linux/
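For the overcommit point, a small sketch of how the current setting on a worker could be inspected. It assumes a Linux host that exposes `/proc/sys/vm/overcommit_memory`; the helper names are made up for illustration, and the descriptions paraphrase the kernel's three documented modes.

```python
from pathlib import Path

# The three modes of vm.overcommit_memory as documented for the Linux kernel.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "never overcommit beyond the configured commit limit",
}


def describe_overcommit(raw):
    """Translate the raw file content (e.g. '0\\n') into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def current_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Read and describe the overcommit mode of the running system."""
    return describe_overcommit(Path(path).read_text())
```

Changing the mode is a system-wide decision (for example via `sysctl vm.overcommit_memory`), so it would affect every process on the worker, not only qemu.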