action #50345

closed

[functional][u] Usage of kernel parameter kernel.softlockup_panic

Added by SLindoMansilla almost 5 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Enhancement to existing tests
Target version: SUSE QA - Milestone 30
Start date: 2019-04-12
Due date: 2019-05-21
% Done: 0%
Estimated time: 42.00 h
Difficulty: easy

Description

Motivation

We have recently been getting a lot of failures with the error message "workqueue lockup" (see bsc#1126782). This is a symptom, and we need to find the different causes for it.

There is a suggestion to enable kernel panic debugging: https://bugzilla.suse.com/show_bug.cgi?id=1126782#c12

We need to determine whether we can use this parameter in general without affecting the tests, and how to use it properly, or whether we need to manually reproduce the issue with this kernel parameter.


Related issues 2 (0 open, 2 closed)

Related to openQA Tests - action #47849: [functional][u] test if system_workarounds is still required as bsc#1105302 is fixed (Resolved, mgriessmeier, 2019-02-13)

Blocks openQA Tests - action #66607: [functional][u] Execute "SysRq t" when workqueue lockup is detected and publish kernel logs (Resolved, dheidler)

Actions #1

Updated by SLindoMansilla almost 5 years ago

To be discussed in the next grooming meeting.

Actions #3

Updated by SLindoMansilla almost 5 years ago

  • Status changed from New to Workable
  • Assignee set to szarate
Actions #4

Updated by szarate almost 5 years ago

  • Status changed from Workable to In Progress

Trying it out here: https://openqa.suse.de/tests/overview?build=rogue_workqueue_bsc1126782_investigation&distri=sle&version=15-SP1. It doesn't seem like it's going to be a problem for the tests; however, it's better to use kernel.softlockup_panic=1 after GMC is out on SLE15. The next step will be enabling it for SLE12 SP5, first via EXTRABOOTPARAMS and then via test code once we're sure enough that we want this.

Actions #5

Updated by szarate almost 5 years ago

So far it doesn't look like it's going to hurt. I will do more tests tomorrow and see.

Actions #6

Updated by szarate almost 5 years ago

  • Due date set to 2019-05-21
  • Status changed from In Progress to Feedback
  • Priority changed from High to Normal
  • Target version set to Milestone 25

For the time being, I've created a new test suite for SLE12-SP5: create_hdd_minimal_base+sdk+softlockup. It's in the development job group, so it's a matter of waiting for the next build and seeing how it goes. I did trigger a few runs and none of them seemed to fail with this.

I'll look a bit later next week to see how to manually trigger softlockups and see what the behaviour is...

The plan for this ticket is simply, once the HDDs are created, to schedule some 100 runs of modules that triggered the problem in the past, see if it can be reproduced, and then move on.

But for now, it looks like we can get the parameter in shape to put in the test code.

Setting due date so that I get a reminder to look at this next week.

Actions #7

Updated by SLindoMansilla almost 5 years ago

waiting for next build

Actions #8

Updated by szarate almost 5 years ago

So according to create_hdd_minimal_base+sdk+softlockup in the last run (https://openqa.suse.de/tests/overview?distri=sle&version=12-SP5&build=0170&groupid=132), enabling the parameter does not hurt.

Actions #9

Updated by szarate almost 5 years ago

  • Status changed from Feedback to In Progress

Next up: trigger a softlockup and see what happens :)

Potential problem with this approach is: https://www.suse.com/support/kb/doc/?id=7023049

Actions #10

Updated by szarate almost 5 years ago

So it definitely does produce a panic, which is what we want. However, the job gets stuck (possibly detecting the serial error would suffice to get the machine to shut down). My main concern is that if we follow this path, there might be a lot of tests that would fail, depending on how often these softlockups end up happening...

I used this kernel module: https://github.com/foursixnine/softlockup_test


susetest login: [ 1095.966023] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [softlockup_thre:15996]
[ 1095.966836] Kernel panic - not syncing: softlockup: hung tasks
[ 1095.967377] CPU: 0 PID: 15996 Comm: softlockup_thre Tainted: G           OEL    5.1.4-1-default #1 openSUSE Tumbleweed (unreleased)
[ 1095.968465] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
[ 1095.969471] Call Trace:
[ 1095.969703]  <IRQ>
[ 1095.969896]  dump_stack+0x85/0xc0
[ 1095.970207]  panic+0xf6/0x2a1
[ 1095.970485]  ? ret_from_fork+0x11/0x50
[ 1095.970834]  ? ret_from_fork+0x45/0x50
[ 1095.971184]  watchdog_timer_fn.cold.6+0x16/0x1e
[ 1095.971603]  ? softlockup_fn+0x40/0x40
[ 1095.971993]  __hrtimer_run_queues+0xf0/0x260
[ 1095.972401]  hrtimer_interrupt+0x100/0x220
[ 1095.972784]  smp_apic_timer_interrupt+0x6a/0x140
[ 1095.973211]  apic_timer_interrupt+0xf/0x20
[ 1095.973590]  </IRQ>
[ 1095.973791] RIP: 0010:console_unlock+0x41f/0x550
[ 1095.974218] Code: fc ff ff c7 05 0a d0 8b 01 00 00 00 00 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f e9 8b f7 ff ff e8 26 35 09 00 4c 89 ff 57 9d <0f> 1f 44 00 00 e9 25 ff ff ff 65 8b 05 60 b8 f1 54 89 c0 48 0f a3
[ 1095.975939] RSP: 0018:ffffab1e0037be28 EFLAGS: 00000247 ORIG_RAX: ffffffffffffff13
[ 1095.976632] RAX: 0000000000000000 RBX: ffffffffac35de60 RCX: ffffffffac262c48
[ 1095.977286] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 0000000000000247
[ 1095.977939] RBP: 0000000004d171eb R08: 0000000004d171eb R09: 0000000000000001
[ 1095.978593] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000004d171eb
[ 1095.979248] R13: 0000000004d171eb R14: 0000000000000007 R15: 0000000000000247
[ 1095.979916]  vprintk_emit+0x1ab/0x250
[ 1095.980261]  ? 0xffffffffc0a3b000
[ 1095.980571]  printk+0x48/0x4a
[ 1095.980851]  task+0x62/0x6b [softlockup_test]
[ 1095.981257]  kthread+0x116/0x130
[ 1095.981559]  ? kthread_bind+0x30/0x30
[ 1095.981903]  ret_from_fork+0x3a/0x50
[ 1095.982382] Kernel Offset: 0x2a000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1095.983389] Rebooting in 90 seconds..
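
A minimal sketch of how serial output like the above could be checked for the soft lockup and the resulting panic. The helper name find_softlockup and the shape of its return value are hypothetical and only for illustration; the regular expressions are written against the two message lines shown in this log and are not taken from any existing test code.

```perl
use strict;
use warnings;

# Hypothetical helper: scan captured serial output for a soft lockup report
# and the panic that follows it. Returns a hash reference with the CPU,
# stuck time, task name, PID and a panic flag, or undef if no soft lockup
# was reported.
sub find_softlockup {
    my ($serial_output) = @_;
    my %found;
    for my $line (split /\n/, $serial_output) {
        if ($line =~ /BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]/) {
            @found{qw(cpu seconds task pid)} = ($1, $2, $3, $4);
        }
        elsif ($line =~ /Kernel panic - not syncing: softlockup: hung tasks/) {
            $found{panic} = 1;
        }
    }
    return unless exists $found{cpu};
    $found{panic} //= 0;
    return \%found;
}

# Two lines copied from the log above
my $log = <<'EOT';
susetest login: [ 1095.966023] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [softlockup_thre:15996]
[ 1095.966836] Kernel panic - not syncing: softlockup: hung tasks
EOT

my $hit = find_softlockup($log);
printf "CPU %d stuck for %ds in %s (pid %d), panic=%d\n",
    @{$hit}{qw(cpu seconds task pid panic)} if $hit;
```

A check along these lines would let a job fail early with a clear reason instead of hanging until the "Rebooting in 90 seconds" timer expires.
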
Actions #11

Updated by szarate almost 5 years ago

  • Status changed from In Progress to Feedback
Actions #12

Updated by szarate almost 5 years ago

  • Status changed from Feedback to In Progress

After talking with Takashi for a brief moment, he suggested enabling the parameter along with kdump, so I guess that the workflow would be something like:

Enable kdump as early as possible in the test if it's available (could start with extratests), and enable the kernel parameter. I'm guessing that I'd have to look for a certain timeout and, on top of that, figure out how to get more information as requested in https://bugzilla.suse.com/show_bug.cgi?id=1130701

Actions #13

Updated by mgriessmeier over 4 years ago

  • Target version changed from Milestone 25 to Milestone 26
Actions #14

Updated by szarate over 4 years ago

  • Target version changed from Milestone 26 to Milestone 25

So as agreed in an offline meeting, I'm creating a new test module to enable boot parameters via sysctl, and on the other hand creating a ticket so that boot_to_desktop is able to enable EXTRABOOTPARAMS if they are set, which it currently doesn't.
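
As a rough sketch of the sysctl side of this, assuming the module only needs to set values at runtime with `sysctl -w`: the helper names sysctl_commands and current_softlockup_panic are hypothetical and are not part of the actual test module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a hash of sysctl keys and values into the
# commands a test module could run after boot. `sysctl -w key=value`
# sets a value at runtime. Keys are sorted for a stable order.
sub sysctl_commands {
    my (%settings) = @_;
    return map { "sysctl -w $_=$settings{$_}" } sort keys %settings;
}

# Read the current value directly from procfs. Returns undef when the
# file cannot be opened, for example when not running on Linux.
sub current_softlockup_panic {
    my $path = '/proc/sys/kernel/softlockup_panic';
    open(my $fh, '<', $path) or return;
    my $value = <$fh>;
    close $fh;
    chomp $value if defined $value;
    return $value;
}

print "$_\n" for sysctl_commands('kernel.softlockup_panic' => 1);
my $now = current_softlockup_panic();
print "current value: ", (defined $now ? $now : 'unavailable'), "\n";
```

Values set this way do not survive a reboot, which is one reason the boot parameter route via EXTRABOOTPARAMS is still of interest.
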

Actions #15

Updated by szarate over 4 years ago

  • Status changed from In Progress to Workable
Actions #16

Updated by mgriessmeier over 4 years ago

  • Target version changed from Milestone 25 to Milestone 27
Actions #17

Updated by SLindoMansilla over 4 years ago

  • Blocks action #47849: [functional][u] test if system_workarounds is still required as bsc#1105302 is fixed added
Actions #18

Updated by SLindoMansilla over 4 years ago

  • Blocks deleted (action #47849: [functional][u] test if system_workarounds is still required as bsc#1105302 is fixed)
Actions #19

Updated by SLindoMansilla over 4 years ago

  • Related to action #47849: [functional][u] test if system_workarounds is still required as bsc#1105302 is fixed added
Actions #20

Updated by szarate over 4 years ago

I'm unassigning myself from this for now; however, I'd appreciate it if anybody wants to pick this up...

Ideally, for at least one scenario, the early_test_setup module from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/7771 would be enabled right after the system has booted, on all of our architectures, by duplicating at least one of the extra_tests scenarios (the big ones) and letting it run for two or three builds to see that nothing breaks too badly. Or simply merge my PR and see what breaks :D

Ideally that module could be extended further ahead in the future.

Actions #21

Updated by szarate over 4 years ago

  • Assignee deleted (szarate)
Actions #22

Updated by SLindoMansilla over 4 years ago

  • Estimated time set to 42.00 h
Actions #23

Updated by mgriessmeier over 4 years ago

  • Target version changed from Milestone 27 to Milestone 28
Actions #24

Updated by mgriessmeier about 4 years ago

  • Target version changed from Milestone 28 to Milestone 30

needs to be discussed offline

Actions #25

Updated by SLindoMansilla almost 4 years ago

  • Priority changed from Normal to High
Actions #26

Updated by SLindoMansilla almost 4 years ago

  • Blocks action #66607: [functional][u] Execute "SysRq t" when workqueue lockup is detected and publish kernel logs added
Actions #27

Updated by szarate almost 4 years ago

  • Category changed from Spike/Research to Enhancement to existing tests
Actions #28

Updated by szarate almost 4 years ago

szarate wrote:

I'm unassigning myself from this for now; however, I'd appreciate it if anybody wants to pick this up...

Ideally, for at least one scenario, the early_test_setup module from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/7771 would be enabled right after the system has booted, on all of our architectures, by duplicating at least one of the extra_tests scenarios (the big ones) and letting it run for two or three builds to see that nothing breaks too badly. Or simply merge my PR and see what breaks :D

Ideally that module could be extended further ahead in the future.

This can be easily implemented by adding a new branch in bootloader_setup::specific_bootmenu_params:

    # Kernel softlockup panic should be enabled, unless explicitly disabled
    # See bsc#1126782
    if (!get_var("SOFTLOCKUP_PANIC_DISABLED", 0)) {
        push @params, "kernel.softlockup_panic=1";
    }

In order to verify that it properly works:

A create_hdd job should be triggered, and a corresponding extra_test job should follow
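
To try the logic of that branch outside of openQA, a self-contained sketch could look like the following. The %vars hash, the get_var stub and build_boot_params are stand-ins made up for illustration; in the test distribution get_var comes from the test API, and the real specific_bootmenu_params does considerably more than this.

```perl
use strict;
use warnings;

# Stand-in for the openQA test variables. This stub only mimics the
# "value or default" behaviour that the branch relies on.
my %vars;

sub get_var {
    my ($name, $default) = @_;
    return $vars{$name} // $default;
}

# Hypothetical, reduced parameter builder containing only the new branch.
sub build_boot_params {
    my @params;

    # Kernel softlockup panic should be enabled, unless explicitly disabled
    # See bsc#1126782
    if (!get_var("SOFTLOCKUP_PANIC_DISABLED", 0)) {
        push @params, "kernel.softlockup_panic=1";
    }
    return join ' ', @params;
}

print "default:  '", build_boot_params(), "'\n";

# Opting out by setting the variable removes the parameter again
$vars{SOFTLOCKUP_PANIC_DISABLED} = 1;
print "disabled: '", build_boot_params(), "'\n";
```

The opt-out variable keeps the parameter on by default while still allowing individual scenarios to switch it off if soft lockups turn out to be too frequent there.
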

Actions #29

Updated by szarate almost 4 years ago

  • Difficulty set to easy
Actions #30

Updated by szarate almost 4 years ago

  • Status changed from Workable to In Progress

Well I'm picking this one, gonna see if I can piggyback triggering tests from GH automatically

Actions #31

Updated by zluo almost 4 years ago

  • Status changed from In Progress to Workable

Unassigned but in progress. Changing it now to Workable.

Actions #32

Updated by szarate almost 4 years ago

  • Status changed from Workable to In Progress
  • Assignee set to szarate

Better?

Actions #33

Updated by szarate over 3 years ago

  • Status changed from In Progress to Resolved

The PR was finally merged a few days ago; no failing jobs so far.
