action #81382

open

[y][qe-yast][qe-core] OOM detection flawed

Added by dimstar over 3 years ago. Updated about 1 month ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Category:
Bugs in existing tests
Target version:
-
Start date:
2020-12-28
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario opensuse-Tumbleweed-JeOS-for-kvm-and-xen-x86_64-jeos-extra@64bit_virtio-2G fails in pcre

JeOS just seems to be the scenario exposing the issue, but I doubt it is limited to it.

So far, I figured out this sequence of events:

  • The clamav module installs, runs and tests clamd. Already in this module, clamd seems to run out of memory, but this is not detected; the test continues.
  • evolution_prep creates a snapshot/anchor.
  • The journalctl test module vacuums and rotates the log, i.e. the OOM messages are no longer in the current journal (see the sketch after this list).
  • The firewalld test module fails; no OOM is reported, as the journal was rotated; lastgood is loaded.
  • Tests continue until rails, which is another (known) module failure. As we loaded a lastgood state from before the journal rotation, the OOM marker is back in the journal and gets reported.
  • From here on, all subsequent modules fail on the OOM marker (it is not clear why the OOM checker even runs on a successful test, though; after the rails test, all subsequent tests fail).
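
For illustration, this is roughly how the OOM marker can be checked in the journal and how the rotate plus vacuum hides it from later checks; a minimal sketch with assumed commands, not the actual test code:

    # Minimal sketch: check the kernel journal for an OOM marker before and
    # after rotation. After "journalctl --rotate" plus a vacuum the marker is
    # gone from the current journal, which is why the later firewalld failure
    # no longer reports it.
    use testapi;

    sub journal_has_oom {
        # grep -q exits non-zero when nothing matches, so exit code 0 means "marker present"
        return script_run("journalctl -k | grep -q 'Out of memory'") == 0;
    }

    record_info('OOM before', journal_has_oom() ? 'marker present' : 'no marker');
    assert_script_run('journalctl --rotate');
    assert_script_run('journalctl --vacuum-time=1s');
    record_info('OOM after', journal_has_oom() ? 'marker present' : 'no marker');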

Actual issues:

  • The OOM issue should have been detected in the clamav test already; according to the journal, the OOM happened before the eicar test.
  • The OOM from the clamav test should not bleed into the other tests and disrupt them.
  • There is special code to add a swap file for clamav in the case of JeOS; that does not seem to work to the extent expected (see the sketch below).
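
For reference, a generic sketch of what "add a swap file" typically boils down to on JeOS; the path, size and the testapi calls here are illustrative assumptions, not the actual code of the clamav module:

    # Generic sketch: create and enable a temporary swap file so that clamd
    # does not trigger the OOM killer on a 2G JeOS image. Path and size are
    # made up for the example.
    use testapi;

    assert_script_run('fallocate -l 1G /var/lib/swapfile');
    assert_script_run('chmod 600 /var/lib/swapfile');
    assert_script_run('mkswap /var/lib/swapfile');
    assert_script_run('swapon /var/lib/swapfile');
    assert_script_run('swapon --show');    # confirm the swap space is active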

Test suite description

Same as jeos, plus some more tests.

Reproducible

Fails since (at least) Build 20190311

Expected result

Last good: (unknown) (or more recent)

Further details

Always latest result in this scenario: latest

Actions #1

Updated by okurz over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-extra@64bit_virtio-2G
https://openqa.opensuse.org/tests/1583587

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #2

Updated by okurz about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-extra@64bit_virtio-2G
https://openqa.opensuse.org/tests/1602022

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #3

Updated by mloviska about 3 years ago

  • Priority changed from Normal to Urgent

@jlausuch, should we take care of it?

Actions #4

Updated by szarate about 3 years ago

  • Subject changed from OOM detection flawed to [qe-core] OOM detection flawed
Actions #5

Updated by tjyrinki_suse almost 3 years ago

  • Status changed from New to Resolved

The issue of all subsequent tests failing has been fixed since February.

Actions #6

Updated by favogt over 2 years ago

  • Status changed from Resolved to Workable

tjyrinki_suse wrote:

The issue of all subsequent tests failing has been fixed since February.

It has not been fixed; it just worked because clamav was lucky enough not to run out of memory.
Meanwhile the OOM is back, and all modules fail.

Actions #7

Updated by favogt over 2 years ago

First I tried to debug the issue locally by cloning the job with a custom SCHEDULE, to avoid waiting for most of the extra tests. That didn't work: the array of "serial failures" that os-autoinst checks for is filled by main.pm and is test specific, but with a custom SCHEDULE it is not set for those tests. So all the local test runs passed.
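
As far as I understand it, the registration happens roughly like this in main.pm (the exact data structure and method name are from memory and may differ), and with a custom SCHEDULE this code path is not taken, so the pattern is never registered:

    # Rough sketch of how main.pm hands test-specific serial failure patterns
    # to os-autoinst; keys and method name are assumptions from memory.
    my $serial_failures = [
        {
            type    => 'hard',
            message => 'Out of memory condition detected',
            pattern => qr/Out of memory: Killed process/,
        },
    ];
    $testapi::distri->set_expected_serial_failures($serial_failures);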

After chasing that red herring, this is what I think happened in the failing test runs:
For some reason (it happens neither in local runs nor when trying to reproduce it manually), the Out of memory message is only written to the kmsg buffer and not to the consoles (no kernel message in serial0.txt, it only shows up in the output of dmesg). So openQA doesn't notice it and the test run continues.
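
One way to check whether this is simply a console loglevel issue (a debugging sketch, not something the test currently does): the OOM kill line is logged at KERN_ERR (level 3), so it only reaches the serial console if the console loglevel is above 3.

    # Debugging sketch: inspect and raise the console loglevel on the SUT.
    use testapi;

    # the first field of kernel.printk is the current console loglevel
    my $printk = script_output('cat /proc/sys/kernel/printk');
    record_info('printk', "console / default / minimum / boot-default loglevel:\n$printk");
    # log every kernel message to the console for the next reproduction attempt
    script_run('dmesg --console-level debug');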

The rails module fails naturally, but the post_fail_hook looks for the OOM message and prints it to the serial console. This is, however, not picked up until the next module finishes, at which point openQA notices the OOM message and mistakenly attributes it to the current test module! That module then fails as well, its post_fail_hook is executed, the OOM message is written to the serial console again, and the cycle repeats. This is fixed with https://github.com/os-autoinst/os-autoinst/pull/1842.
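
The mechanism described above boils down to something like this sketch (not the exact base-class code): the OOM line only reaches serial0.txt once a module has already failed for another reason, and os-autoinst only matches the serial output against the failure patterns when a module ends, so the echoed line is attributed to whichever module finishes next.

    # Sketch of the described post_fail_hook behaviour (not the actual code):
    # on failure, grep dmesg for an OOM kill and echo it to the serial console.
    use testapi;

    sub post_fail_hook {
        my ($self) = @_;
        select_console('root-console');
        script_run("dmesg | grep 'Out of memory' > /dev/$testapi::serialdev");
    }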

Unless the OOM during clamav is reliably caught before the clamav module finishes, it'll unfortunately be a false positive and openQA will misattribute later failures to OOM.
So I'd leave this ticket open for investigating and hopefully fixing the root cause, which is the missing OOM message from the kernel on the serial console.

Actions #8

Updated by tjyrinki_suse over 2 years ago

  • Subject changed from [qe-core] OOM detection flawed to [kernel] OOM detection flawed
  • Priority changed from Urgent to High

Thanks, Fabian, for trying to fix this. It sounds like this could be more of a topic for QE Kernel, regarding how to get the OOM messages onto the serial console. Lowering the priority, as we have survived so far and we have alerts if, e.g., Urgent tickets aren't handled within a certain timeframe.

Actions #9

Updated by favogt over 2 years ago

tjyrinki_suse wrote:

Thanks, Fabian, for trying to fix this. It sounds like this could be more of a topic for QE Kernel, regarding how to get the OOM messages onto the serial console. Lowering the priority, as we have survived so far and we have alerts if, e.g., Urgent tickets aren't handled within a certain timeframe.

The clamav OOM disappeared randomly again recently, so "High" should be fine, at least until the issue reappears.

https://github.com/os-autoinst/os-autoinst/pull/1842 hasn't been merged yet.

Actions #10

Updated by mloviska over 2 years ago

Another OOM detected in rails:

    Out of memory: Killed process 28048 (clamd) total-vm:1328812kB, anon-rss:800056kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:2448kB oom_score_adj:0
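
For matching such lines independently of PID and victim process name, a pattern along these lines would do (illustrative plain Perl, not code from the test repository):

    # Illustrative regex for kernel OOM kill lines like the one above.
    my $line = 'Out of memory: Killed process 28048 (clamd) total-vm:1328812kB, anon-rss:800056kB';
    my $oom_pattern = qr/Out of memory: Killed process \d+ \((?<victim>\S+)\)/;
    print "OOM victim: $+{victim}\n" if $line =~ $oom_pattern;    # prints "OOM victim: clamd"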

Actions #11

Updated by okurz over 2 years ago

This ticket was set to "High" priority but was not updated within the SLO period for "High" tickets (30 days) as described on https://progress.opensuse.org/projects/openqatests/wiki/Wiki#SLOs-service-level-objectives. Please consider picking up this ticket within the next 30 days or just set the ticket to the next lower priority of "Normal" (SLO: updated within 365 days).

Actions #12

Updated by pcervinka about 2 years ago

  • Project changed from openQA Tests to 178
  • Subject changed from [kernel] OOM detection flawed to OOM detection flawed
  • Category deleted (Bugs in existing tests)
  • Priority changed from High to Normal
  • Target version set to 643
Actions #13

Updated by pcervinka about 2 years ago

  • Project changed from 178 to openQA Project
  • Target version deleted (643)

Checked with the kernel QE guys @rpalethorpe and @MDoucha; the recommendation is to add ignore_loglevel to the kernel boot parameters, as we do it for LTP. The option should be added by the team responsible for the particular test flow, probably for JeOS only.
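
A sketch of how ignore_loglevel could be added persistently on the SUT (the LTP flow may do this differently; the grub paths are openSUSE defaults, the rest is an assumption):

    # Sketch: append ignore_loglevel to the kernel command line so every
    # kernel message, including the OOM kill, is printed to the (serial)
    # console regardless of loglevel. Takes effect after the next reboot.
    use testapi;

    assert_script_run(q{sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&ignore_loglevel /' /etc/default/grub});
    assert_script_run('grub2-mkconfig -o /boot/grub2/grub.cfg');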

Actions #14

Updated by okurz about 2 years ago

  • Project changed from openQA Project to openQA Tests
  • Subject changed from OOM detection flawed to [y][qe-yast][qe-core] OOM detection flawed
  • Category set to Bugs in existing tests

Still within "openQA Tests", as this is about os-autoinst-distri-opensuse, so either "QE-YaST" or "QE-Core".

Actions #15

Updated by slo-gin about 1 year ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.

Actions #16

Updated by slo-gin about 1 month ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.
