action #81382

open

[y][qe-yast][qe-core] OOM detection flawed

Added by dimstar over 3 years ago. Updated about 1 month ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Category:
Bugs in existing tests
Target version:
-
Start date:
2020-12-28
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario opensuse-Tumbleweed-JeOS-for-kvm-and-xen-x86_64-jeos-extra@64bit_virtio-2G fails in pcre

JeOS just seems to be the scenario exposing the issue, but I doubt it is limited to it.

So far, I figured out this sequence of events:

  • The clamav module installs, runs and tests clamd. Already in this module, clamd seems to run out of memory, but this is not detected; the test continues.
  • evolution_prep creates a snapshot/anchor.
  • The journalctl test module vacuums and rotates the log, i.e. the OOM messages are no longer in the current journal (see the sketch after this list).
  • The firewalld test module fails; no OOM is reported, as the journal was rotated; lastgood is loaded.
  • Tests continue until rails, which is another (known) module failure. As we loaded a lastgood state from before the journal rotation, the OOM marker is back in the journal and gets reported.
  • From here on, all subsequent modules fail on the OOM marker (it is not clear why the OOM checker even runs on a successful test, though; after the rails test, all subsequent tests fail).
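
For illustration, this is roughly how the OOM marker can be checked in the journal and how the rotate plus vacuum hides it from later checks; a minimal sketch with assumed commands, not the actual test code:

    # Minimal sketch: check the kernel journal for an OOM marker before and
    # after rotation. After "journalctl --rotate" plus a vacuum the marker is
    # gone from the current journal, which is why the later firewalld failure
    # no longer reports it.
    use testapi;

    sub journal_has_oom {
        # grep -q exits non-zero when nothing matches, so exit code 0 means "marker present"
        return script_run("journalctl -k | grep -q 'Out of memory'") == 0;
    }

    record_info('OOM before', journal_has_oom() ? 'marker present' : 'no marker');
    assert_script_run('journalctl --rotate');
    assert_script_run('journalctl --vacuum-time=1s');
    record_info('OOM after', journal_has_oom() ? 'marker present' : 'no marker');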

Actual issues:

  • The OOM issue should have been detected in the clamav test already; according to the journal, the OOM happened before the eicar test.
  • The OOM from the clamav test should not bleed into the other tests and disrupt them.
  • There is special code to add a swap file for clamav in the case of JeOS; that does not seem to work to the extent expected (see the sketch below).
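
For reference, a generic sketch of what "add a swap file" typically boils down to on JeOS; the path, size and the testapi calls here are illustrative assumptions, not the actual code of the clamav module:

    # Generic sketch: create and enable a temporary swap file so that clamd
    # does not trigger the OOM killer on a 2G JeOS image. Path and size are
    # made up for the example.
    use testapi;

    assert_script_run('fallocate -l 1G /var/lib/swapfile');
    assert_script_run('chmod 600 /var/lib/swapfile');
    assert_script_run('mkswap /var/lib/swapfile');
    assert_script_run('swapon /var/lib/swapfile');
    assert_script_run('swapon --show');    # confirm the swap space is active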

Test suite description

Same as jeos, plus some more tests.

Reproducible

Fails since (at least) Build 20190311

Expected result

Last good: (unknown) (or more recent)

Further details

Always latest result in this scenario: latest

Actions #1

Updated by okurz over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-extra@64bit_virtio-2G
https://openqa.opensuse.org/tests/1583587

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #2

Updated by okurz about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-extra@64bit_virtio-2G
https://openqa.opensuse.org/tests/1602022

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #3

Updated by mloviska about 3 years ago

  • Priority changed from Normal to Urgent

@jlausuch, should we take care of it?

Actions #4

Updated by szarate about 3 years ago

  • Subject changed from OOM detection flawed to [qe-core] OOM detection flawed
Actions #5

Updated by tjyrinki_suse almost 3 years ago

  • Status changed from New to Resolved

The issue of all subsequent tests failing has been fixed since February.

Actions #6

Updated by favogt over 2 years ago

  • Status changed from Resolved to Workable

tjyrinki_suse wrote:

The issue of all subsequent tests failing has been fixed since February.

It has not been fixed; it just worked because clamav was lucky enough not to run out of memory.
Meanwhile the OOM is back, and all modules fail.

Actions #7

Updated by favogt over 2 years ago

First I tried to debug the issue locally by cloning the job with a custom SCHEDULE, to avoid waiting for most of the extra tests. That didn't work: the array of "serial failures" that os-autoinst checks for is filled by main.pm and is test specific, but with a custom SCHEDULE it is not set for those tests. So all the local test runs passed.
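
As far as I understand it, the registration happens roughly like this in main.pm (the exact data structure and method name are from memory and may differ), and with a custom SCHEDULE this code path is not taken, so the pattern is never registered:

    # Rough sketch of how main.pm hands test-specific serial failure patterns
    # to os-autoinst; keys and method name are assumptions from memory.
    my $serial_failures = [
        {
            type    => 'hard',
            message => 'Out of memory condition detected',
            pattern => qr/Out of memory: Killed process/,
        },
    ];
    $testapi::distri->set_expected_serial_failures($serial_failures);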

After chasing that red herring, this is what I think happened in the failing test runs:
For some reason (it happens neither in local runs nor when trying to reproduce it manually), the Out of memory message is only written to the kmsg buffer and not to the consoles (no kernel message in serial0.txt, it only shows up in the output of dmesg). So openQA doesn't notice it and the test run continues.
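
One way to check whether this is simply a console loglevel issue (a debugging sketch, not something the test currently does): the OOM kill line is logged at KERN_ERR (level 3), so it only reaches the serial console if the console loglevel is above 3.

    # Debugging sketch: inspect and raise the console loglevel on the SUT.
    use testapi;

    # the first field of kernel.printk is the current console loglevel
    my $printk = script_output('cat /proc/sys/kernel/printk');
    record_info('printk', "console / default / minimum / boot-default loglevel:\n$printk");
    # log every kernel message to the console for the next reproduction attempt
    script_run('dmesg --console-level debug');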

The rails module fails naturally, but the post_fail_hook looks for the OOM message and prints it to the serial console. This is, however, not picked up until the next module finishes, at which point openQA notices the OOM message and mistakenly attributes it to the current test module! That module then fails as well, its post_fail_hook is executed, the OOM message is written to the serial console again, and the cycle repeats. This is fixed with https://github.com/os-autoinst/os-autoinst/pull/1842.
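
The mechanism described above boils down to something like this sketch (not the exact base-class code): the OOM line only reaches serial0.txt once a module has already failed for another reason, and os-autoinst only matches the serial output against the failure patterns when a module ends, so the echoed line is attributed to whichever module finishes next.

    # Sketch of the described post_fail_hook behaviour (not the actual code):
    # on failure, grep dmesg for an OOM kill and echo it to the serial console.
    use testapi;

    sub post_fail_hook {
        my ($self) = @_;
        select_console('root-console');
        script_run("dmesg | grep 'Out of memory' > /dev/$testapi::serialdev");
    }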

Unless the OOM during clamav is reliably caught before the clamav module finishes, it'll unfortunately be a false positive and openQA will misattribute later failures to OOM.
So I'd leave this ticket open for investigating and hopefully fixing the root cause, which is the missing OOM message from the kernel on the serial console.

Actions #8

Updated by tjyrinki_suse over 2 years ago

  • Subject changed from [qe-core] OOM detection flawed to [kernel] OOM detection flawed
  • Priority changed from Urgent to High

Thanks, Fabian, for trying to fix this. It sounds like this could be more of a topic for QE Kernel, regarding how to get the OOM messages onto the serial console. Lowering the priority, as we have survived so far and we have alerts if, e.g., Urgent tickets aren't handled within a certain timeframe.

Actions #9

Updated by favogt over 2 years ago

tjyrinki_suse wrote:

Thanks, Fabian, for trying to fix this. It sounds like this could be more of a topic for QE Kernel, regarding how to get the OOM messages onto the serial console. Lowering the priority, as we have survived so far and we have alerts if, e.g., Urgent tickets aren't handled within a certain timeframe.

The clamav OOM disappeared randomly again recently, so "High" should be fine, at least until the issue reappears.

https://github.com/os-autoinst/os-autoinst/pull/1842 hasn't been merged yet.

Actions #10

Updated by mloviska over 2 years ago

Another OOM detected in rails:

    Out of memory: Killed process 28048 (clamd) total-vm:1328812kB, anon-rss:800056kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:2448kB oom_score_adj:0
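
For matching such lines independently of PID and victim process name, a pattern along these lines would do (illustrative plain Perl, not code from the test repository):

    # Illustrative regex for kernel OOM kill lines like the one above.
    my $line = 'Out of memory: Killed process 28048 (clamd) total-vm:1328812kB, anon-rss:800056kB';
    my $oom_pattern = qr/Out of memory: Killed process \d+ \((?<victim>\S+)\)/;
    print "OOM victim: $+{victim}\n" if $line =~ $oom_pattern;    # prints "OOM victim: clamd"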

Actions #11

Updated by okurz over 2 years ago

This ticket was set to "High" priority but was not updated within the SLO period for "High" tickets (30 days) as described on https://progress.opensuse.org/projects/openqatests/wiki/Wiki#SLOs-service-level-objectives. Please consider picking up this ticket within the next 30 days or just set the ticket to the next lower priority of "Normal" (SLO: updated within 365 days).

Actions #12

Updated by pcervinka about 2 years ago

  • Project changed from openQA Tests to 178
  • Subject changed from [kernel] OOM detection flawed to OOM detection flawed
  • Category deleted (Bugs in existing tests)
  • Priority changed from High to Normal
  • Target version set to 643
Actions #13

Updated by pcervinka about 2 years ago

  • Project changed from 178 to openQA Project
  • Target version deleted (643)

Checked with the kernel QE guys @rpalethorpe and @MDoucha; the recommendation is to add ignore_loglevel to the kernel boot parameters, as we do it for LTP. The option should be added by the team responsible for the particular test flow, probably for JeOS only.
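
A sketch of how ignore_loglevel could be added persistently on the SUT (the LTP flow may do this differently; the grub paths are openSUSE defaults, the rest is an assumption):

    # Sketch: append ignore_loglevel to the kernel command line so every
    # kernel message, including the OOM kill, is printed to the (serial)
    # console regardless of loglevel. Takes effect after the next reboot.
    use testapi;

    assert_script_run(q{sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&ignore_loglevel /' /etc/default/grub});
    assert_script_run('grub2-mkconfig -o /boot/grub2/grub.cfg');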

Actions #14

Updated by okurz about 2 years ago

  • Project changed from openQA Project to openQA Tests
  • Subject changed from OOM detection flawed to [y][qe-yast][qe-core] OOM detection flawed
  • Category set to Bugs in existing tests

Still within "openQA Tests", as this is about os-autoinst-distri-opensuse, so either "QE-YaST" or "QE-Core".

Actions #15

Updated by slo-gin about 1 year ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.

Actions #16

Updated by slo-gin about 1 month ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.
