action #33376

closed

[sle][functional][ppc64le][easy][u] test fails in kdump_and_crash - kdumptool gets killed by OOM

Added by nicksinger about 6 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
SUSE QA - Milestone 22
Start date:
2018-03-16
Due date:
% Done:

0%

Estimated time:
Difficulty:
easy

Description

Observation

openQA test in scenario sle-15-Installer-DVD-ppc64le-toolchain_zypper@ppc64le fails in
kdump_and_crash because the OOM killer terminates kdumptool while it is running.

Reproducible

Fails from time to time since (at least) Build 489.2

Suggestions for fix

  • Find out what a reasonable RAM size is for the tools in question (kdumptool, yast2 kdump, …) and increase the RAM assigned to that job (on ppc64le)

Further details

I decided against a product bug here since it makes sense that a tool like kdump fills up memory pretty quickly (AFAIK it loads a second kernel and keeps stack traces if the first one crashes). From the previous results we can see that this is more of a sporadic issue, so IMHO a small increase of RAM is a reasonable product change and we need to adapt our environment to it.


Files

kdump.png (63.6 KB) - sysrq-trigger - zluo, 2019-01-10 14:43

Related issues 2 (0 open, 2 closed)

Related to openQA Tests - action #33199: [sle][functional][s390x][zkvm][u][hard] test fails in kdump_and_crash - system does not shutdown or reboot? what is happening? better output needed? (Resolved, mgriessmeier, 2018-03-13 to 2018-04-10)

Copied to openQA Tests - action #47960: [sle][functional][u] kdump_and_crash - detect error detection and only apply workaround then (Resolved, zluo, 2018-03-16)

Actions #1

Updated by okurz about 6 years ago

The machine is already configured for 4GB of RAM.

I think there are already product bugs: http://fastzilla.suse.de/?q=kdump+OOM+openQA returns some. I thought there would be more specific ones. Maybe http://fastzilla.suse.de/?q=kdump+openQA is a better search.

bsc#1039527 "[Build 0387] kdump endup with "Out of memory" error" looks like the best candidate. At least https://bugzilla.suse.com/show_bug.cgi?id=1075937 looks related. https://bugzilla.suse.com/show_bug.cgi?id=998544 is an older one about general instability of the test/component. But now I found https://bugzilla.suse.com/show_bug.cgi?id=1075945 which describes what looks like the same problem. The bug is marked as RESOLVED FIXED but not verified. Maybe the submission does not fix it or never made it into SLE15.

https://bugzilla.suse.com/show_bug.cgi?id=1070397 is also about OOM in kdump. https://bugzilla.suse.com/show_bug.cgi?id=986196 is also about memory and kdump, getting old but still open.

Btw, I do not agree with the suggestion in general. It might be necessary here but we should be careful not to hide product regressions.

Actions #2

Updated by JERiveraMoya about 6 years ago

  • Status changed from Workable to In Progress
  • Assignee set to JERiveraMoya

The minimum RAM requirement for SLE12 is 1024 MB. According to Petr Tesarik, who mentioned in one of the bugs that the requirement increased 3x, that would be 3072 MB. In addition, the SUSE docs for SLE12 say "the actual memory requirement in production depends on the system's workload"; in this case, installing development tools, it seems more than reasonable that 1024 MB might sometimes not be enough, which would explain why the failure is sporadic, since every build ships different versions of the development tools compared to previous builds. I am going to try to confirm this issue with Petr, and especially where the 3x figure comes from and whether it is documented.

Actions #3

Updated by JERiveraMoya about 6 years ago

  • Status changed from In Progress to Feedback
Actions #4

Updated by JERiveraMoya about 6 years ago

According to Petr, he has been able to save dumps on VMs with only 2 GB of RAM and approx. 130 MB reserved for kdump. The 3x refers to user-space runtime requirements, not to the total RAM requirement. I cross-checked other scenarios, and even with 4 modules installed, as in this case, it is not necessary to have more than 2 GB or 4 GB. Increasing the memory here seems to just hide some error on ppc.

Actions #5

Updated by okurz about 6 years ago

Very good evaluation. Thank you.

Actions #6

Updated by okurz about 6 years ago

It could be that there actually are too many unexpected applications running in the background in the test, depending on what the previous test modules did within the openQA jobs that failed. Our post_fail_hook in many cases includes collecting a process table. https://openqa.suse.de/tests/1549130#step/kdump_and_crash/43 shows another problem on top: the post_fail_hook does not execute successfully because the log console tty, tty5, is not found. If you have another example of a failing job, maybe you can gather the data from there.
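
A minimal sketch of what such a hook could look like, assuming the standard os-autoinst test API (select_console, script_run, upload_logs); the console name and file paths are only illustrative:

sub post_fail_hook {
    my ($self) = @_;
    # switch to a text console to run diagnostic commands
    select_console 'root-console';
    # collect the process table and memory situation for later inspection
    script_run 'ps axf > /tmp/ps_axf.txt; free -m > /tmp/free.txt';
    upload_logs '/tmp/ps_axf.txt';
    upload_logs '/tmp/free.txt';
}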

This ticket should be related to other kdump tests and mloviska is also working on one. Of course it might be different issues but why not work together on one first and then pick up the next one?

Actions #7

Updated by JERiveraMoya about 6 years ago

We are working together on both issues. We think that the post_fail_hook does not work because of the moment when it happens, during the creation of the dump, so it makes sense that we cannot collect logs: at that point the dump is being written, which is critical for the system. No more jobs with useful info were found, but what is clear is that even when just checking the status of the service, the memory issue can already be seen there, before the dump. Added a couple of memory checks with this pull request:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4709 so we can see where a potential leak could be.
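
For illustration, such a check could look roughly like this (a sketch assuming the standard os-autoinst test API; not necessarily identical to what the PR does):

# make the memory situation visible in the job video/serial log
script_run 'free -m';
# MemAvailable is reported in kB in /proc/meminfo; convert to MB
my $avail_mb = script_output "awk '/MemAvailable/ {print int(\$2/1024)}' /proc/meminfo";
record_info('MemAvailable', "$avail_mb MB available before triggering the crash");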

Actions #8

Updated by mgriessmeier about 6 years ago

  • Due date changed from 2018-03-27 to 2018-04-10

PR is waiting for merge

Actions #9

Updated by riafarov about 6 years ago

  • Related to action #33199: [sle][functional][s390x][zkvm][u][hard] test fails in kdump_and_crash - system does not shutdown or reboot? what is happening? better output needed? added
Actions #10

Updated by JERiveraMoya about 6 years ago

  • Status changed from Feedback to In Progress

Checked the worker that was failing; it is not always the same one.
Verification runs in OSD in the Development group:
https://openqa.suse.de/tests/1574534
https://openqa.suse.de/tests/1574540
https://openqa.suse.de/tests/1574541 (only this one failing)
https://openqa.suse.de/tests/1574542
https://openqa.suse.de/tests/1574543
Memory used is in the 225-233 MB range in the tests preceding kdump, so that looks ok, but after running 5 jobs to gather statistics I was able to reproduce the failure once.
Next, I need to narrow the issue down within the kdump_and_crash test.
In the meantime I am collaborating with @zluo to get a ppc remote worker to use for local verification.

Actions #11

Updated by JERiveraMoya about 6 years ago

In fact I've just realized that the available memory is displayed in the test in YaST: https://openqa.suse.de/tests/1574541#step/kdump_and_crash/24, but that is the system memory; my best guess is that the memory we need to increase is the one for kdump on the previous screen, as a workaround, but only for ppc, and probably also open a bug.
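
A possible shape for such a workaround, assuming the YaST kdump command-line module and its alloc_mem option are available on the tested product (the value and the arch guard are illustrative only):

if (check_var('ARCH', 'ppc64le')) {
    # reserve more memory for the crash kernel; takes effect after the next reboot
    assert_script_run 'yast kdump startup enable alloc_mem=512';
}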

Actions #12

Updated by okurz about 6 years ago

  • Subject changed from [sle][functional][ppc64le][easy][fast][u] test fails in kdump_and_crash - kdumptool gets killed by OOM to [sle][functional][ppc64le][easy][u] test fails in kdump_and_crash - kdumptool gets killed by OOM

now planned for S14

Actions #13

Updated by cwh about 6 years ago

  • Difficulty set to easy
Actions #14

Updated by JERiveraMoya about 6 years ago

[WIP] PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4758
256MB didn't work, trying with 512MB.

Actions #15

Updated by JERiveraMoya about 6 years ago

It didn't work and I can still reproduce the memory issue. I found in https://www.suse.com/support/kb/doc/?id=3374462 that "For the PPC64 architecture: crashkernel=128M@32M"; I could try adding that to the boot options.

Actions #16

Updated by JERiveraMoya about 6 years ago

Boot params are volatile after reboot, so it didn't help.
After trying several approaches, I started fresh by reproducing it with the latest build. The latest run shows that the dumps are created and the memory issue is present when checking the status of the service, but the tool that is failing is the one for reading the dump (not the one creating it): the crash utility. Trying to reproduce it with my remote worker, but there it is failing in the creation of the dump. A possible explanation, based on the free -m output, is that the machine does not have enough physical RAM (it is almost fully consuming its 5 GB); could it be the same cause on the osd worker?
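
For reference, one way to make the crashkernel parameter from the TID above survive reboots would be to write it into the GRUB defaults and regenerate the config; a minimal sketch assuming a plain grub2 setup and the standard test API:

# prepend crashkernel=128M@32M to the kernel command line in /etc/default/grub
assert_script_run q{sed -i 's/GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="crashkernel=128M@32M /' /etc/default/grub};
assert_script_run 'grub2-mkconfig -o /boot/grub2/grub.cfg';
assert_script_run 'grep -q crashkernel /boot/grub2/grub.cfg';
# a reboot of the first kernel is still required before the reservation applies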

Actions #17

Updated by JERiveraMoya about 6 years ago

  • Status changed from In Progress to Feedback

Set to feedback as the remote worker does not have enough memory, which could be related to the problem with the worker in the prod infrastructure.

Actions #18

Updated by riafarov about 6 years ago

Not sure if it's in feedback, or how we should proceed here. Let's discuss.

Actions #19

Updated by mgriessmeier about 6 years ago

  • Due date changed from 2018-04-10 to 2018-04-24
Actions #20

Updated by JERiveraMoya about 6 years ago

  • Status changed from Feedback to Blocked

I cannot get the logs from our shared worker ps64vt1069 as it seems to be affected by #30595.
The host is running out of memory; it creates coredumps every time, which stops any further testing.
We cannot get logs from the job in osd as the failure is sporadic and the hook does not work while the dump is being written, so I guess we are blocked here by #30595.

Actions #21

Updated by JERiveraMoya about 6 years ago

  • Status changed from Blocked to Feedback

I bypassed the problem with the dumps produced on the host by using less memory, QEMURAM=3072 (our shared worker only has 5 GB and we cannot run a job like this with 4 GB as it is now), and, due to a bug, the correct number of CPUs for the moment, QEMUCPUS=1.
Finally I ended up with the problem that the kdump cannot be performed: http://dhcp254.suse.cz/tests/1117#step/kdump_and_crash/53 but it looks specific to this worker, so it doesn't help to investigate the sporadic issue in osd.
Running statistics in osd changing these two parameters: https://openqa.suse.de/tests/overview?distri=sle&version=15&build=581.1%40tests_RAM_jeriveramoya&groupid=96 in order to provide more info for the bug that I just filed. Also running the same statistics with the current parameters to compare both: https://openqa.suse.de/tests/overview?distri=sle&version=15&build=581.1%40tests_RAM_4GB_jeriveramoya&groupid=96

Actions #22

Updated by JERiveraMoya about 6 years ago

  • Status changed from Feedback to Blocked
Actions #23

Updated by mgriessmeier about 6 years ago

  • Due date changed from 2018-04-24 to 2018-05-08
  • Target version changed from Milestone 15 to Milestone 16

blocked by https://bugzilla.suse.com/show_bug.cgi?id=1090659
Try to find a workaround for it, e.g. by changing test variables

Actions #25

Updated by JERiveraMoya almost 6 years ago

  • Status changed from Blocked to Feedback
Actions #26

Updated by mgriessmeier almost 6 years ago

  • Due date changed from 2018-05-08 to 2018-05-22
Actions #27

Updated by JERiveraMoya almost 6 years ago

PRs to fix typo 1 and typo 2.

Actions #28

Updated by JERiveraMoya almost 6 years ago

  • Status changed from Feedback to In Progress
Actions #30

Updated by JERiveraMoya almost 6 years ago

The same error is displayed in https://openqa.suse.de/tests/1684402#step/kdump_and_crash/56 from the previous round of statistics, and it has now finally been provided to the developer.

Actions #31

Updated by JERiveraMoya almost 6 years ago

Provided more stats to the bug; it seems that there are two issues: (1) the crash utility failing (1-2 out of 5 runs) because the kernel file is corrupted, and (2) an OOM that is much easier to reproduce (4-5 out of 5 runs).

Actions #32

Updated by JERiveraMoya almost 6 years ago

  • Status changed from In Progress to Blocked

The failure is still sporadic. Hard to say how to continue here to create the soft-failure; the two last failed jobs in OSD show the two types of problems, one related to a corrupted dump and the other one where the dump cannot even be triggered.

Actions #33

Updated by mgriessmeier almost 6 years ago

  • Due date changed from 2018-05-22 to 2018-06-05
Actions #34

Updated by mgriessmeier almost 6 years ago

  • Due date changed from 2018-06-05 to 2018-06-19
  • Status changed from Blocked to Workable
  • Target version changed from Milestone 16 to Milestone 17

not sure if this is still blocked, setting to workable, moving and needs to be revisited for next planning
@Joaquin: Can you clarify the state here?

Actions #35

Updated by JERiveraMoya almost 6 years ago

  • Assignee deleted (JERiveraMoya)

The most likely cause is an OOM condition in the kdump kernel for both failures seen, and I don't find anything that could be useful as a soft-failure. Perhaps someone else wants to give it a try here?

Actions #36

Updated by mgriessmeier almost 6 years ago

  • Due date deleted (2018-06-19)
Actions #37

Updated by okurz almost 6 years ago

  • Target version changed from Milestone 17 to Milestone 21+
Actions #38

Updated by okurz almost 6 years ago

  • Target version changed from Milestone 21+ to Milestone 21+
Actions #39

Updated by zluo over 5 years ago

  • Status changed from Workable to In Progress
  • Assignee set to zluo

take over and checking...

Actions #40

Updated by zluo over 5 years ago

http://e13.suse.de/tests/10952#step/kdump_and_crash/33 shows a problem with the string "Enterprise" being broken... quite strange

Actions #41

Updated by zluo over 5 years ago

http://openqa.suse.de/tests/2295805/file/serial0.txt shows:

sysrq: SysRq : Trigger a crash
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048

Later I see some call traces. I need to understand how these relate.

Actions #42

Updated by zluo over 5 years ago

http://e13.suse.de/tests/10979#step/kdump_and_crash/29 shows a problem with the core dump flooding the output on the console.

Actions #43

Updated by szarate over 5 years ago

The last one is about yast coredumping... which is a different issue... in previous screens nokogiri failed to install... just not sure if it's blocked

Actions #44

Updated by zluo over 5 years ago

Actions #45

Updated by zluo over 5 years ago

  • Status changed from In Progress to Blocked

the recent test runs on osd show exactly the issue reported in: https://bugzilla.suse.com/show_bug.cgi?id=1112406

Actions #46

Updated by zluo over 5 years ago

one more issue on my remote worker:

ps64vt1069:~ #
Message from syslogd@ps64vt1069 at Dec 3 09:12:16 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#6 stuck for 65s! [systemd-coredum:6597]

Message from syslogd@ps64vt1069 at Dec 3 09:13:40 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#4 stuck for 42s! [swapper/4:0]

this is a known issue on SLES 12 SP3.

Actions #48

Updated by okurz over 5 years ago

  • Status changed from Blocked to Workable
  • Target version changed from Milestone 21+ to Milestone 22

The bug is still open. But #44576#note-12 sounds related; maybe we can work around the product issues with the right crash dump kernel settings on the command line and set a record_soft_failure when the issue is detected.
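
A rough sketch of that approach, with an illustrative grep pattern and one of the bug numbers mentioned later in this ticket; it assumes the OOM message is still visible in the journal of the running system:

# only soft-fail (and apply any workaround) when the OOM symptom is actually present
my $ret = script_run 'journalctl --no-pager | grep -qiE "out of memory|oom-killer"';
if (defined($ret) && $ret == 0) {
    record_soft_failure 'bsc#1120566 - kdumptool killed by OOM on ppc64le';
}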

Actions #49

Updated by zluo over 5 years ago

  • Status changed from Workable to In Progress

check this again

Actions #50

Updated by zluo over 5 years ago

atm the test failed because of a send_key issue or yast2_console not finishing: http://f40.suse.de/tests/112#step/kdump_and_crash/27

This now also happens on osd. We need to fix this problem first.

sub activate_kdump {
    # activate kdump
    type_string "echo \"remove potential harmful nokogiri package boo#1047449\"\n";
    zypper_call('rm -y ruby2.1-rubygem-nokogiri', exitcode => [0, 104]);
    script_run 'yast2 kdump', 0;
    my @tags = qw(yast2-kdump-disabled yast2-kdump-enabled yast2-kdump-restart-info yast2-missing_package yast2_console-finished);
    do {
        assert_screen \@tags, 300;
        # enable kdump if it is not already
        wait_screen_change { send_key 'alt-u' } if match_has_tag('yast2-kdump-disabled');
        wait_screen_change { send_key 'alt-o' } if match_has_tag('yast2-kdump-enabled');
        wait_screen_change { send_key 'alt-o' } if match_has_tag('yast2-kdump-restart-info');
        wait_screen_change { send_key 'alt-i' } if match_has_tag('yast2-missing_package');
    } until (match_has_tag('yast2_console-finished'));
}
Actions #51

Updated by zluo over 5 years ago

needle issue fixed, but the test ended up incomplete, no screenshots available: http://f40.suse.de/tests/123

increased QEMURAM to 6GB, trying again...

Actions #52

Updated by okurz over 5 years ago

I don't think you can solve "[2019-01-09T07:34:09.420 EST] [debug] QEMU: qemu-system-ppc64: cannot set up guest memory 'ppc_spapr.ram': Cannot allocate memory" by increasing the memory on the VM side. I guess the hypervisor can not provide the memory. Btw, please see #33376#note-48 as it could be that we need to adjust the kernel command line parameters for the right memory that is available to the crash handling kernel to prevent an OOM within the kdump process.

Actions #53

Updated by zluo over 5 years ago

@okurz will talk to kgw about this

fixing the needle match issue first; needle PR:
https://gitlab.suse.de/openqa/os-autoinst-needles-sles/merge_requests/1041

Actions #54

Updated by zluo over 5 years ago

http://f40.suse.de/tests/145

increased kdump memory to 320 MB, but it got stuck later at:
echo c > /proc/sysrq-trigger

--
[2019-01-10T07:43:23.962 EST] [debug] /var/lib/openqa/cache/f40.suse.de/tests/sle/tests/console/kdump_and_crash.pm:44 called opensusebasetest::wait_boot
[2019-01-10T07:43:23.962 EST] [debug] <<< testapi::check_screen(mustmatch=[
'grub2',
'bootloader',
'inst-bootmenu'
], timeout=100)
[2019-01-10T07:43:24.666 EST] [debug] load of /var/lib/openqa/cache/f40.suse.de/tests/sle/products/sle/needles/inst-bootmenu-20160309.png took 0.13 seconds
[2019-01-10T07:43:24.683 EST] [debug] WARNING: check_asserted_screen took 0.71 seconds for 30 candidate needles - make your needles more specific
[2019-01-10T07:43:24.683 EST] [debug] no match: 99.9s, best candidate: bootloader-ofw-grub2-leanos-bsc1055166-20170823 (0.00)
[2019-01-10T07:43:24.971 EST] [debug] no change: 98.9s
[2019-01-10T07:43:25.971 EST] [debug] no change: 97.9s
[2019-01-10T07:43:26.972 EST] [debug] no change: 96.9s
[2019-01-10T07:43:29.942 EST] [debug] WARNING: check_asserted_screen took 1.97 seconds for 30 candidate needles - make your needles more specific
[2019-01-10T07:43:30.557 EST] [debug] no match: 95.9s, best candidate: bootloader-ofw-grub2-leanos-bsc1055166-20170823 (0.00)
[2019-01-10T07:43:30.558 EST] [debug] considering VNC stalled, no update for 6.60 seconds

--

the problem is also that there are no logs available because of (a qemu crash?)...

Actions #55

Updated by zluo over 5 years ago

increased to 640 MB
but http://f40.suse.de/tests/148#step/kdump_and_crash/38 shows known issue with BUG: soft lockup

Actions #56

Updated by zluo over 5 years ago

Actions #57

Updated by zluo over 5 years ago

atm I am facing a problem with a qemu crash. And it is not possible atm to test on osd with a new ASSET containing my code changes:

https://openqa.suse.de/tests/2367754#step/kdump_and_crash/48

Actions #58

Updated by zluo over 5 years ago

Changing the kdump memory itself works well on my pkvm remote worker,
but sysrq-trigger crashes the qemu instance every time.
We cannot test directly on osd via a new ASSET URL because of the shared worker
configuration on osd.

So we need to merge this change now and see whether it works or not.

PR:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6531

Actions #59

Updated by zluo over 5 years ago

  • Status changed from In Progress to Feedback

PR got updated now, set it as feedback

Actions #60

Updated by zluo about 5 years ago

  • Status changed from Feedback to In Progress

checking related bugs and, if found, using record_soft_failure for the change of kdump memory size.

Actions #61

Updated by zluo about 5 years ago

found bsc#957053 and bsc#1120566 which are related to this kdump issue. This happens on aarch64 as well, so add the change for aarch64 too.
Updating the PR now.

Actions #62

Updated by zluo about 5 years ago

need to check https://openqa.suse.de/tests/2439423 to see whether the PR works or not.

Actions #63

Updated by zluo about 5 years ago

we are currently experiencing an issue with SCC ...

https://openqa.suse.de/tests/2439423#step/install/8

Actions #65

Updated by zluo about 5 years ago

https://openqa.suse.de/tests/2441710 shows expected test results and PR works.

Actions #66

Updated by zluo about 5 years ago

  • Status changed from In Progress to Resolved

set as resolved for now.

Actions #67

Updated by zluo about 5 years ago

  • Copied to action #47960: [sle][functional][u] kdump_and_crash - detect error detection and only apply workaround then added