action #81382
open[y][qe-yast][qe-core] OOM detection flawed
0%
Description
Observation
openQA test in scenario opensuse-Tumbleweed-JeOS-for-kvm-and-xen-x86_64-jeos-extra@64bit_virtio-2G fails in pcre.
JeOS just seems to be the scenario exposing the issue, but I doubt it is limited to JeOS.
So far, I figured out this sequence of events:
- clamav module installs, runs, tests clamd. In this module already, clamd seems to run OOM, but it is not detected; test continues
- evolution_prep creates a snapshot/anchor
- journalctl test module vacuums and rotates the log (i.e. OOM messages are no longer in the current journal)
- firewalld test module fails, no OOM reported, as the journal was rotated; lastgood loaded
- tests continue until rails, which is another (known) module failure. As we loaded a lastgood state from before the journal rotation, we have the OOM marker in the journal again, and it is reported.
- From here on, all subsequent modules fail on OOM marker (It is not clear why the OOM checker even runs on a successful test though - after the rails test, all subsequent tests fail)
Actual issues:
- The OOM should have been detected in the clamav test already; according to the journal, the OOM was there before the eicar test.
- The OOM of the clamav test should not bleed into the other tests and disrupt them.
- There is special code to add a swap file for clamav in case of JeOS, but it does not seem to work to the extent expected.
Test suite description
Same as jeos, plus some more tests.
Reproducible
Fails since (at least) Build 20190311
Expected result
Last good: (unknown) (or more recent)
Further details
Always latest result in this scenario: latest
Updated by okurz over 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: jeos-extra@64bit_virtio-2G
https://openqa.opensuse.org/tests/1583587
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released"
- The label in the openQA scenario is removed
Updated by okurz about 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: jeos-extra@64bit_virtio-2G
https://openqa.opensuse.org/tests/1602022
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released"
- The label in the openQA scenario is removed
Updated by mloviska about 3 years ago
- Priority changed from Normal to Urgent
@jlausuch, should we take care of it?
Updated by szarate about 3 years ago
- Subject changed from OOM detection flawed to [qe-core] OOM detection flawed
Updated by tjyrinki_suse almost 3 years ago
- Status changed from New to Resolved
The problem of all subsequent tests failing has been fixed since February.
Updated by favogt over 2 years ago
- Status changed from Resolved to Workable
The all subsequent tests failing has been fixed since February.
It's not been fixed - it just worked because clamav was lucky enough not to run OOM.
Meanwhile the OOM is back, and all modules fail.
Updated by favogt over 2 years ago
First I tried to debug the issue locally, by cloning the job using a custom SCHEDULE, to avoid waiting for most of the extra tests. That didn't work, because the array of "serial failures" os-autoinst checks for is filled by main.pm and is test specific, but the way a custom SCHEDULE works means that for those tests it's not set. So the local test runs all worked.
After chasing that red herring, this is what I think happened in the failing test runs:
For some reason (it doesn't happen in local runs or when trying to reproduce it manually), the "Out of memory" message is only written to the kmsg buffer and not to the consoles (no kernel message in serial0.txt, only in the output of dmesg). So openQA doesn't notice it and the test run continues.
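To make that symptom easier to check, here is a minimal Python sketch that lists marker lines present in dmesg output but absent from the serial log. The function names and the timestamp format are assumptions for illustration; this is not part of os-autoinst or the test distribution.

```python
import re

# Assumed dmesg timestamp prefix such as "[  123.456789] "
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamp(line):
    """Drop a leading kernel timestamp and surrounding whitespace."""
    return TIMESTAMP.sub("", line.strip())


def missing_from_serial(dmesg_text, serial_text, marker="Out of memory"):
    """Return marker lines found in dmesg output that never reached the serial log."""
    serial_lines = {strip_timestamp(line) for line in serial_text.splitlines()}
    missing = []
    for line in dmesg_text.splitlines():
        message = strip_timestamp(line)
        if marker in message and message not in serial_lines:
            missing.append(message)
    return missing
```

Given the text of `dmesg` and the contents of serial0.txt, a non-empty result would indicate the situation described above: the kernel logged the OOM, but the console never showed it.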
The rails module fails naturally, but the post_fail_hook looks for the OOM message and prints it to the serial console. This is however not skipped until the next module finishes, at which point openQA notices the OOM message and mistakenly attributes it to the current test module! This causes this module to fail, so post_fail_hook is executed again, the OOM message is written to the serial console again, and the cycle repeats. This is fixed with https://github.com/os-autoinst/os-autoinst/pull/1842.
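The cycle described above can be modelled with a small Python simulation. This is only a sketch under simplifying assumptions (one serial check per module, a single read offset, hypothetical module names after rails); it is not how os-autoinst or the linked pull request is actually implemented.

```python
import re
from dataclasses import dataclass, field

OOM_PATTERN = re.compile(r"Out of memory: Killed process")


@dataclass
class SerialLog:
    """Append-only serial log with an offset marking what was already checked."""
    lines: list = field(default_factory=list)
    offset: int = 0

    def write(self, line):
        self.lines.append(line)

    def check_new_output(self):
        """Report whether unchecked lines contain the OOM marker, then advance."""
        new_lines = self.lines[self.offset:]
        self.offset = len(self.lines)
        return any(OOM_PATTERN.search(line) for line in new_lines)

    def skip_to_end(self):
        """Mark everything written so far as already checked."""
        self.offset = len(self.lines)


def run_modules(modules, skip_hook_output):
    """modules is a list of (name, fails_on_its_own); returns the failed module names."""
    serial = SerialLog()
    failed = []
    for name, fails_on_its_own in modules:
        serial.write(f"running {name}")
        oom_seen = serial.check_new_output()
        if fails_on_its_own or oom_seen:
            failed.append(name)
            # Stand-in for post_fail_hook echoing the OOM message from dmesg.
            serial.write("Out of memory: Killed process 28048 (clamd)")
            if skip_hook_output:
                serial.skip_to_end()
    return failed


# module_a and module_b are hypothetical modules scheduled after rails.
modules = [("clamav", False), ("rails", True), ("module_a", False), ("module_b", False)]
print(run_modules(modules, skip_hook_output=False))  # ['rails', 'module_a', 'module_b']
print(run_modules(modules, skip_hook_output=True))   # ['rails']
```

Without skipping the hook output, every module after rails inherits the failure. With the skip, only the module that genuinely failed is reported.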
Unless the OOM during clamav is reliably caught before the clamav module finishes, it'll unfortunately be a false positive and openQA will misattribute later failures to OOM.
So I'd leave this ticket open for investigating and hopefully fixing the root cause, which is the missing OOM message from the kernel on the serial console.
Updated by tjyrinki_suse over 2 years ago
- Subject changed from [qe-core] OOM detection flawed to [kernel] OOM detection flawed
- Priority changed from Urgent to High
Thanks Fabian for trying to fix this. It sounds like this could be more for QE Kernel regarding how to get the OOM messages to the serial console. Lowering priority as we have survived so far and we have certain alerts if eg Urgent tickets aren't handled within a certain timeframe.
Updated by favogt over 2 years ago
tjyrinki_suse wrote:
Thanks Fabian for trying to fix this. It sounds like this could be more for QE Kernel regarding how to get the OOM messages to the serial console. Lowering priority as we have survived so far and we have certain alerts if eg Urgent tickets aren't handled within a certain timeframe.
The clamav OOM disappeared randomly again recently, so "High" should be fine, at least until the issue reappears.
https://github.com/os-autoinst/os-autoinst/pull/1842 hasn't been merged yet.
Updated by mloviska over 2 years ago
Another OOM detected in rails:
Out of memory: Killed process 28048 (clamd) total-vm:1328812kB, anon-rss:800056kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:2448kB oom_score_adj:0
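For triage it can help to pull the fields out of such a line, for example to confirm that the killed process is clamd rather than something started by the failing module. The following Python sketch assumes the message layout shown above; `parse_oom_line` is a hypothetical helper, not an existing openQA function.

```python
import re

# Matches the fixed part of the kernel message quoted above.
OOM_LINE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)(?P<rest>.*)"
)
# Matches key:value pairs such as "anon-rss:800056kB" or "oom_score_adj:0".
FIELD = re.compile(r"(?P<key>[\w-]+):(?P<value>-?\d+)(?:kB)?")


def parse_oom_line(line):
    """Return a dict with pid, comm and numeric fields, or None if the line does not match."""
    match = OOM_LINE.search(line)
    if match is None:
        return None
    info = {"pid": int(match.group("pid")), "comm": match.group("comm")}
    for item in FIELD.finditer(match.group("rest")):
        info[item.group("key")] = int(item.group("value"))
    return info


line = (
    "Out of memory: Killed process 28048 (clamd) total-vm:1328812kB, "
    "anon-rss:800056kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:2448kB oom_score_adj:0"
)
info = parse_oom_line(line)
print(info["comm"], info["anon-rss"])  # clamd 800056
```

Memory values are returned as plain integers in kB, as printed by the kernel.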
Updated by okurz over 2 years ago
This ticket was set to "High" priority but was not updated within the SLO period for "High" tickets (30 days) as described on https://progress.opensuse.org/projects/openqatests/wiki/Wiki#SLOs-service-level-objectives. Please consider picking up this ticket within the next 30 days or just set the ticket to the next lower priority of "Normal" (SLO: updated within 365 days).
Updated by pcervinka about 2 years ago
- Project changed from openQA Tests to 178
- Subject changed from [kernel] OOM detection flawed to OOM detection flawed
- Category deleted (Bugs in existing tests)
- Priority changed from High to Normal
- Target version set to 643
Updated by pcervinka about 2 years ago
- Project changed from 178 to openQA Project
- Target version deleted (643)
Checked with the kernel QE guys @rpalethorpe and @MDoucha: they recommend adding ignore_loglevel to the kernel boot parameters, as we do for LTP. The option should be added by the team responsible for the particular test flow, probably for JeOS only.
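As an illustration of that recommendation, the Python sketch below appends a parameter to a kernel command line string only if it is not already present. `add_kernel_param` is a hypothetical helper; how the tests actually edit the boot parameters is not covered in this ticket, and the sketch assumes a command line without quoted values containing spaces.

```python
def add_kernel_param(cmdline, param="ignore_loglevel"):
    """Return cmdline with param appended, unless it is already there."""
    params = cmdline.split()
    if param not in params:
        params.append(param)
    return " ".join(params)


# The root device and console values here are made-up examples.
print(add_kernel_param("root=/dev/vda2 console=ttyS0 quiet"))
# root=/dev/vda2 console=ttyS0 quiet ignore_loglevel
```

With ignore_loglevel set, the kernel prints all messages to the console regardless of the console log level, which is what the OOM check on the serial console needs.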
Updated by okurz about 2 years ago
- Project changed from openQA Project to openQA Tests
- Subject changed from OOM detection flawed to [y][qe-yast][qe-core] OOM detection flawed
- Category set to Bugs in existing tests
Still within "openQA Tests", as this is about os-autoinst-distri-opensuse, so either "QE-YaST" or "QE-Core".
Updated by slo-gin about 1 year ago
This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.
Updated by slo-gin about 1 month ago
This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.