qemu can need a lot of memory, and how much is influenced by how openQA users configure their test jobs. This can lead to "out-of-memory" conditions, and we should feed this situation back to the test reviewers. #90161 is a recent example where jobs failed on malbec.arch due to OOM but the feedback was suboptimal: the corresponding openQA test https://openqa.suse.de/tests/5674784 was incomplete with "Reason: backend died: QEMU exited unexpectedly, see log for details" and was auto-review-labeled with #71188, which does not specifically point to an OOM condition.
- AC1: if qemu dies because it was killed by the OOM killer, this should be obvious from the incomplete reason
- So far the best way okurz could find to detect OOM is to check the system logs, e.g. with dmesg | grep 'Out of memory: Killed process', which would also reveal the PID of the killed process. Then one could check that PID against the PID of the qemu process that the qemu backend monitors and feed that information back as the incomplete reason.
- Ensure that these conditions are not linked anymore to #71188
- Crosscheck what other reasons could explain #71188 or close that as well if it's very likely only OOM that would explain such problems
- We could also decide whether to completely disable memory overcommit or keep allowing it, see https://www.eurovps.com/faq/how-to-troubleshoot-high-memory-usage-in-linux/
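For illustration, memory overcommit behaviour is controlled via sysctl. A hypothetical drop-in could look like the following; the file name and the ratio are made up for illustration, not a recommendation:

```
# /etc/sysctl.d/90-overcommit.conf (hypothetical file name)
# 2 = never overcommit: allocations beyond the commit limit fail
vm.overcommit_memory = 2
# with mode 2, the commit limit is swap + this percentage of RAM
vm.overcommit_ratio = 80
```

Such a drop-in would be applied with `sysctl --system` or on the next boot.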
The "Out of memory: Killed process" log message from production looks like this:
Mar 16 08:30:13 malbec kernel: Out of memory: Killed process 49233 (qemu-system-ppc) total-vm:39447680kB, anon-rss:33653376kB, file-rss:2176kB, shmem-r
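Extracting the PID from such a line could look like the following sketch. The log line is the production example from above; the regex is an assumption about the message format and has not been checked against all kernel versions:

```python
import re

# Example journal line from production (trailing fields omitted)
line = ("Mar 16 08:30:13 malbec kernel: Out of memory: Killed process 49233 "
        "(qemu-system-ppc) total-vm:39447680kB, anon-rss:33653376kB, file-rss:2176kB")

# Extract PID and process name of the victim the kernel OOM killer chose
match = re.search(r"Out of memory: Killed process (\d+) \(([^)]+)\)", line)
if match:
    killed_pid, name = int(match.group(1)), match.group(2)
    print(killed_pid, name)  # → 49233 qemu-system-ppc
```

The extracted PID could then be compared against the PIDs the worker knows about.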
So parsing the PID should be easy. (This message is actually from the journal and not from dmesg, so when using dmesg the formatting might differ.) Matching the PID is of course not that easy because the worker only knows the PID of isotovideo and the process group ID, but not the PIDs of the sub-processes (which would be most relevant here). As the list of involved PIDs is only required after everything has terminated, I suppose the worker could record all PIDs when receiving SIGCHLD. That is maybe best implemented within Mojo::RWP.
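The idea of recording child PIDs on SIGCHLD can be sketched like this; Python is used here purely for illustration, the real implementation would live in Perl/Mojo::RWP:

```python
import os
import signal
import time

reaped_pids = []

def record_child(signum, frame):
    # Reap all terminated children without blocking and record their PIDs.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none have exited yet
        reaped_pids.append(pid)

signal.signal(signal.SIGCHLD, record_child)

child = os.fork()
if child == 0:
    os._exit(0)  # child terminates immediately

# Wait (with a timeout) until the SIGCHLD handler has recorded the child.
deadline = time.monotonic() + 5
while child not in reaped_pids and time.monotonic() < deadline:
    time.sleep(0.01)

print(child in reaped_pids)
```

With such a record the worker could later match the OOM killer's victim PID against the PIDs it reaped.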
I'd also like to note that this approach for determining an OOM error/crash only helps if the kernel's OOM killer actually killed the process (and not a user-space OOM killer). That usually only happens when it is already too late; at least in my experience a system usually does not operate efficiently anymore at this point. Additionally, an application might also just exit with an error message (or just crash) if the OS declines to give it more memory. In case of a crash it might be hard to determine whether it was because of a failed memory allocation. When disabling memory over-commitment (one of the suggestions in the ticket), this case would of course become more common and grepping for messages from the kernel OOM killer wouldn't be very helpful. So the different suggestions won't play well together.
As an alternative to disabling over-commitment, it would be quite simple to restrict the memory usage of a service via CGroups by extending the systemd service file:
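A hypothetical drop-in for a worker unit could look like this; the unit name and the limits are made up for illustration:

```
# /etc/systemd/system/openqa-worker@.service.d/30-memory-limit.conf (hypothetical)
[Service]
# hard limit: processes in the unit's cgroup get OOM-killed beyond this
MemoryMax=32G
# soft limit: memory is reclaimed aggressively above this threshold
MemoryHigh=30G
```

After `systemctl daemon-reload` the limits apply to newly started service instances.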
There are actually many more settings: https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemoryMin=bytes,%20MemoryLow=bytes
But I suppose this has the same catch: checking for kernel OOM killer messages would break.
It would also be possible to utilize the CGroup created via Mojo::RWP for this. I'm not sure whether the CGroup creation of Mojo::RWP still works, but this way the worker could monitor the memory usage of the isotovideo process tree itself, kill processes as needed and of course keep a record of it.
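Monitoring the memory usage of such a CGroup boils down to reading a single file. A minimal sketch for cgroup v2 follows; the path is hypothetical, and the demo fakes the cgroup directory in a temporary directory so it runs anywhere:

```python
import tempfile
from pathlib import Path

def cgroup_memory_usage(cgroup_dir: str) -> int:
    """Return the current memory usage in bytes of a cgroup v2 directory,
    e.g. the CGroup the worker created for the isotovideo process tree."""
    return int(Path(cgroup_dir, "memory.current").read_text())

# Demo with a faked cgroup directory; a real path would be something like
# /sys/fs/cgroup/<some-slice>/<some-scope> on a cgroup v2 system.
with tempfile.TemporaryDirectory() as fake_cgroup:
    Path(fake_cgroup, "memory.current").write_text("123456\n")
    print(cgroup_memory_usage(fake_cgroup))  # → 123456
```

The worker could poll this value and act (log, kill, annotate the incomplete reason) when it approaches a configured limit.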
Otherwise we could also install a generic user-space OOM killer like earlyoom or systemd-oomd and check its logs explicitly.
Seems like you are looking for a generic "OOM problem handler". I suggested "qemu backend" because I consider this a rather specific problem. So the qemu backend spawns the qemu process that is considered the culprit here. So don't look for "isotovideo should monitor grand-children".
#8 Updated by okurz about 1 month ago
- Status changed from In Progress to Feedback
- Status changed from Resolved to Feedback
can you please provide a verification from production to ensure that the real reason is forwarded and also to be sure that apparmor is configured accordingly? Maybe scheduling a test job with unreasonably high QEMURAM selection does the job?
- Status changed from Feedback to Workable
- Assignee deleted
- Priority changed from Normal to High
Triggered a test job
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/1770110 _GROUP=0 WORKER_CLASS=imagetester QEMURAM=$((128*1024)) TEST=openqa_okurz_poo90974_detect_oom BUILD=poo90947
Created job #1771649: openqa-Tumbleweed-dev-x86_64-Build:TW.8296-openqa_install+publish@64bit-2G -> https://openqa.opensuse.org/t1771649
which incompleted with
Can't exec "/usr/lib/os-autoinst/check_qemu_oom": Permission denied at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 436.
#18 Updated by Xiaojing_liu 6 days ago
- Assignee deleted
This is my fault. It seems my tests missed something that I still don't know about. Sorry for that.
The PR was merged.
Did the test on o3; the "Permission denied" error didn't happen anymore. Example: https://openqa.opensuse.org/tests/1776523
[2021-06-10T05:53:09.923 UTC] [debug] Waiting for 0 attempts
[2021-06-10T05:53:10.055 UTC] [debug] Waiting for 1 attempts
[2021-06-10T05:53:10.055 UTC] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
 QEMU terminated before QMP connection could be established at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 453.
[2021-06-10T05:53:10.056 UTC] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2021-06-10T05:53:10.056 UTC] [debug] Passing remaining frames to the video encoder
[image2pipe @ 0x5649c401d9c0] Could not find codec parameters for stream 0 (Video: ppm, none): unspecified size
Consider increasing the value for the 'analyzeduration' and 'probesize' options
Input #0, image2pipe, from 'pipe:':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: ppm, none, 24 tbr, 24 tbn, 24 tbc
Output #0, webm, to '/var/lib/openqa/pool/2/video.webm':
Output file #0 does not contain any stream
[2021-06-10T05:53:10.060 UTC] [debug] Waiting for video encoder to finalize the video
[2021-06-10T05:53:10.061 UTC] [debug] The built-in video encoder (pid 5530) terminated
[2021-06-10T05:53:10.061 UTC] [debug] The external video encoder (pid 5529) terminated
[2021-06-10T05:53:10.063 UTC] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-06-10T05:53:10.063 UTC] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-06-10T05:53:10.063 UTC] [debug] QEMU: qemu-system-x86_64: cannot set up guest memory 'pc.ram': Cannot allocate memory
[2021-06-10T05:53:10.064 UTC] [debug] sending magic and exit
[2021-06-10T05:53:10.065 UTC] [debug] received magic close
[2021-06-10T05:53:10.072 UTC] [debug] backend process exited: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 126.