action #90974
coordination #39719 (closed): [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #62420: [epic] Distinguish all types of incompletes
Make it obvious if qemu gets terminated unexpectedly due to out-of-memory
Description
Motivation
qemu can need a lot of memory and is influenced by how openQA users configure their test jobs. This can lead to "out-of-memory" conditions, and we should feed this situation back to the test reviewers. #90161 is a recent example where jobs failed on malbec.arch due to OOM but the feedback was suboptimal: the corresponding openQA test https://openqa.suse.de/tests/5674784 was incomplete with "Reason: backend died: QEMU exited unexpectedly, see log for details" and auto-review labeled it with #71188, but nothing specifically pointed to an OOM condition.
Acceptance criteria
- AC1: if qemu is killed due to OOM, this should be obvious from the incomplete reason
Suggestions
- So far the best way okurz could find to detect OOM is to check the system logs, e.g. with
dmesg | grep 'Out of memory: Killed process'
which would also reveal the PID of the killed process. Then one could check that PID against the PID of the qemu process that the qemu backend monitors and feed that information back as the incomplete reason (see the sketch at the end of this section).
- Ensure that these conditions are not linked to #71188 anymore
- Crosscheck what other reasons could explain #71188, or close that ticket as well if OOM is very likely the only explanation for such problems
further references:
- https://stackoverflow.com/questions/6132333/how-to-detect-out-of-memory-segfaults
- https://unix.stackexchange.com/questions/128642/debug-out-of-memory-with-var-log-messages
- It would also be possible to change if we want to completely disable or allow memory overcommit, see https://www.eurovps.com/faq/how-to-troubleshoot-high-memory-usage-in-linux/
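A minimal sketch (not an actual implementation; the PID and log line are hypothetical placeholders) of how the backend could match a kernel OOM-killer message against the PID of the monitored qemu process:
# Sketch only: PID and log line are hypothetical placeholders
use strict;
use warnings;

my $qemu_pid = 49233;    # assumed: the PID of the qemu process monitored by the backend
my $line     = 'Out of memory: Killed process 49233 (qemu-system-ppc) total-vm:39447680kB';
if ($line =~ /Out of memory: Killed process (\d+) \(([^)]+)\)/ && $1 == $qemu_pid) {
    print "QEMU process $1 ($2) was terminated by the kernel OOM killer\n";
    # here the backend could set a more specific incomplete reason
}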
Updated by okurz over 3 years ago
- Copied from action #90161: [Alerting] malbec: Memory usage alert triggered briefly and turned OK within the next minute added
Updated by mkittler over 3 years ago
A complete "Out of memory: Killed process" log message from production looks like this:
Mar 16 08:30:13 malbec kernel: Out of memory: Killed process 49233 (qemu-system-ppc) total-vm:39447680kB, anon-rss:33653376kB, file-rss:2176kB, shmem-r
So parsing the PID should be easy. (This message is actually from the journal and not from dmesg, so when using dmesg the formatting might differ.) Matching the PID is of course not that easy because the worker only knows the PID of isotovideo and the process group ID, but not the PIDs of the sub-processes (which would be most relevant here). As the list of involved PIDs is only required when everything has terminated, I suppose the worker could record all PIDs when receiving SIGCHLD (see the sketch below). That's maybe implemented best within Mojo::RWP.
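To illustrate the idea (a generic sketch, explicitly not Mojo::RWP code), recording the PIDs of all reaped children via a SIGCHLD handler could look roughly like this:
# Generic sketch of the idea, not Mojo::RWP itself
use POSIX ':sys_wait_h';

my @reaped_pids;
$SIG{CHLD} = sub {
    # reap every child that has exited, without blocking, and remember its PID
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        push @reaped_pids, $pid;
    }
};
# after everything has terminated, @reaped_pids could be compared with the PID
# mentioned in the kernel OOM-killer message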
I'd also like to note that this approach for determining an OOM error/crash would only help if the kernel's OOM killer actually killed the process (and not a user-space OOM killer). That usually only happens when it is already too late; at least in my experience a system usually does not operate efficiently anymore at this point. Additionally, an application might also just exit with an error message (or simply crash) if the OS declines to give it more memory. In case of a crash it might be hard to determine whether it was caused by a failed memory allocation. When disabling memory over-commitment (one of the suggestions in the ticket) this case would of course become more common and grepping for messages from the kernel OOM killer wouldn't be very helpful. So the different suggestions won't play well together.
As an alternative to disabling over-commitment, it would be quite simple to restrict the memory usage of a service via CGroups by extending the systemd service file:
[Service]
MemoryMax=1.5G
There are actually many more settings: https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemoryMin=bytes,%20MemoryLow=bytes
But I suppose this has the same catch in that it breaks the approach of checking for kernel OOM-killer messages.
It would also be possible to utilize the CGroup created via Mojo::RWP for this. I'm not sure whether the CGroup creation of Mojo::RWP still works, but this way the worker could monitor the memory usage of the isotovideo process tree itself, kill processes as needed and of course record that it did so (see the sketch below).
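For illustration only, assuming cgroup v2 and a purely hypothetical cgroup path (the path actually used by the worker or Mojo::RWP would differ), reading the current memory usage of such a cgroup could look like this:
# Sketch assuming cgroup v2; the cgroup path is a hypothetical placeholder
my $cgroup = '/sys/fs/cgroup/machine.slice/openqa-worker-example';
open(my $fh, '<', "$cgroup/memory.current") or die "cannot read memory usage: $!";
chomp(my $bytes = <$fh>);
printf "isotovideo process tree currently uses %.2f GiB\n", $bytes / 2**30;
# the worker could compare $bytes against a limit, kill the process tree if needed
# and record that fact for the incomplete reason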
Otherwise we could also install a generic user-space OOM killer like earlyoom or systemd-oomd and check its logs explicitly.
Updated by okurz over 3 years ago
It seems like you are looking for a generic "OOM problem handler". I suggested the "qemu backend" because I consider this a rather specific problem: the qemu backend spawns the qemu process that is considered the culprit here, so don't aim for "isotovideo should monitor grand-children".
Updated by mkittler over 3 years ago
Ok, so we can also just consider qemu. Unfortunately I don't know it well enough to say it will never fork, so I don't know whether we can ignore grandchildren.
Updated by Xiaojing_liu over 3 years ago
- Status changed from Workable to In Progress
Updated by openqa_review over 3 years ago
- Due date set to 2021-05-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 3 years ago
- Status changed from In Progress to Feedback
Updated by livdywan over 3 years ago
See also https://github.com/os-autoinst/openQA/pull/3903
Seems to be pending non-leaky access to dmesg via the AppArmor profile, which is being discussed on GitHub.
Updated by livdywan over 3 years ago
The PR was merged. Verification is being discussed on GitHub. Not sure if this will need more time or might be resolved now.
Updated by Xiaojing_liu over 3 years ago
- Status changed from Feedback to Resolved
PRs have been merged. I would like to set this to Resolved. We can reopen it if we find any issue.
Updated by okurz over 3 years ago
- Status changed from Resolved to Feedback
Can you please provide a verification from production to ensure that the real reason is forwarded and also to be sure that AppArmor is configured accordingly? Maybe scheduling a test job with an unreasonably high QEMURAM setting does the job?
Updated by okurz over 3 years ago
- Status changed from Feedback to Workable
- Assignee deleted (Xiaojing_liu)
- Priority changed from Normal to High
Triggered a test job
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/1770110 _GROUP=0 WORKER_CLASS=imagetester QEMURAM=$((128*1024)) TEST=openqa_okurz_poo90974_detect_oom BUILD=poo90947
Created job #1771649: openqa-Tumbleweed-dev-x86_64-Build:TW.8296-openqa_install+publish@64bit-2G -> https://openqa.opensuse.org/t1771649
which incompleted with Can't exec "/usr/lib/os-autoinst/check_qemu_oom": Permission denied at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 436.
Updated by livdywan over 3 years ago
- Due date deleted (2021-05-28)
Seems like the AppArmor changes aren't complete or correct after all. /usr/lib/os-autoinst/check_qemu_oom was added in the recently merged fix.
Updated by Xiaojing_liu over 3 years ago
This is my fault. It seems my tests missed something that I still don't know about. Sorry for that.
Updated by Xiaojing_liu over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to Xiaojing_liu
Updated by Xiaojing_liu over 3 years ago
- Assignee deleted (Xiaojing_liu)
Xiaojing_liu wrote:
pr was merged
I did the test on o3 and the "Permission denied" error didn't happen. Example: https://openqa.opensuse.org/tests/1776523
[2021-06-10T05:53:09.923 UTC] [debug] Waiting for 0 attempts
[2021-06-10T05:53:10.055 UTC] [debug] Waiting for 1 attempts
[2021-06-10T05:53:10.055 UTC] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
QEMU terminated before QMP connection could be established at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 453.
[2021-06-10T05:53:10.056 UTC] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2021-06-10T05:53:10.056 UTC] [debug] Passing remaining frames to the video encoder
[image2pipe @ 0x5649c401d9c0] Could not find codec parameters for stream 0 (Video: ppm, none): unspecified size
Consider increasing the value for the 'analyzeduration' and 'probesize' options
Input #0, image2pipe, from 'pipe:':
Duration: N/A, bitrate: N/A
Stream #0:0: Video: ppm, none, 24 tbr, 24 tbn, 24 tbc
Output #0, webm, to '/var/lib/openqa/pool/2/video.webm':
Output file #0 does not contain any stream
[2021-06-10T05:53:10.060 UTC] [debug] Waiting for video encoder to finalize the video
[2021-06-10T05:53:10.061 UTC] [debug] The built-in video encoder (pid 5530) terminated
[2021-06-10T05:53:10.061 UTC] [debug] The external video encoder (pid 5529) terminated
[2021-06-10T05:53:10.063 UTC] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-06-10T05:53:10.063 UTC] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-06-10T05:53:10.063 UTC] [debug] QEMU: qemu-system-x86_64: cannot set up guest memory 'pc.ram': Cannot allocate memory
[2021-06-10T05:53:10.064 UTC] [debug] sending magic and exit
[2021-06-10T05:53:10.065 UTC] [debug] received magic close
[2021-06-10T05:53:10.072 UTC] [debug] backend process exited: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 126.
Updated by okurz over 3 years ago
- Status changed from Feedback to Resolved
With ssh osd "sudo -u geekotest psql --command=\"select id,t_finished from jobs where reason ~ 'backend died: QEMU exited unexpectedly' order by t_finished;\" openqa"
I found 87 jobs but none matching the above condition. Assuming this is fixed anyway.
Updated by Xiaojing_liu about 3 years ago
- Estimated time changed from 30.00 h to 40.00 h