action #39497

closed

[sle][functional][u] send magic-sysrq-w to find out what is blocking the system

Added by mloviska over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
SUSE QA - Milestone 19
Start date:
2018-08-09
Due date:
2018-10-09
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-12-SP4-Server-DVD-ppc64le-cryptlvm_minimal_x@ppc64le fails in yast2_lan

Reproducible

Fails since (at least) Build 0328 (current job)

Expected result

Last good: 0327 (or more recent)

Acceptance criteria

  • AC1: We can see whether any task is blocking the system even when we cannot log in to the system in post_fail_hook

Suggestions

  • See what we already did in https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4843/files#diff-141f4b5a48eaecb0c631a0de23e41a51R1135 to collect the "blocked tasks" from the system when we cannot have a logged-in console (anymore)
  • Send the magic sysrq sequence to the system if the post_fail_hook fails to log in as reported above. That probably means that we need to call select_console('log-console', await_console => 0) in the post_fail_hook and check manually whether we reach the expected login or logged-in prompt or whether we are stuck and need to send magic-sysrq-w. Or we might need a "post_fail_hook for the post_fail_hook"
  • Make sure the output of magic sysrq is available in text form, not just in a screenshot, so that everybody can read it and we can also forward the text to external references, e.g. bug reports
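
The fallback described in the suggestions above could be sketched roughly as follows. This is only an illustration of the control flow in Python, not the openQA test API: try_login, send_sysrq_w and read_serial are hypothetical callables standing in for whatever the test distribution actually provides.

```python
from typing import Callable, Optional


def collect_blocked_tasks(
    try_login: Callable[[], bool],
    send_sysrq_w: Callable[[], None],
    read_serial: Callable[[], str],
) -> Optional[str]:
    """Return serial text captured after sysrq-w, or None if login worked.

    All three callables are hypothetical placeholders for illustration.
    """
    if try_login():
        # Login succeeded, so the normal log collection can proceed.
        return None
    # Login failed: ask the kernel to dump blocked tasks, then read the
    # result as text from the serial port instead of relying on a screenshot.
    send_sysrq_w()
    return read_serial()
```

Returning the text rather than a screenshot keeps the output forwardable to bug reports, as the last suggestion asks.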

Further details

Always latest result in this scenario: latest


Related issues: 2 (0 open, 2 closed)

Related to QA - coordination #32734: [functional][epic][u][new test] OOM handling (Rejected, 2018-03-03)

Has duplicate openQA Tests - action #39779: [sle][functional][y] test fails in yast2_lan - timeout and yast2 lan died (Rejected, okurz, 2018-08-15)
Actions #1

Updated by okurz over 5 years ago

Actions #2

Updated by okurz over 5 years ago

  • Subject changed from [sle][functional][y] test fails in yast2_lan - clear screen took longer than expected to [sle][functional][y] test fails in yast2_lan - clear screen took longer than expected, post_fail_hook fails to login -> send magic-sysrq-w to find out what is blocking the system
  • Description updated (diff)
  • Due date set to 2018-08-28
  • Status changed from New to Workable
  • Target version set to Milestone 18

Another interesting symptom here is that the post_fail_hook failed to log in on https://openqa.suse.de/tests/1908646#step/yast2_lan/23, so there must be something running in the background that hampers performance. https://openqa.suse.de/tests/1908646/file/serial0.txt unfortunately does not show anything, and without the post_fail_hook we do not know what it is. I suspect the btrfs maintenance jobs in the background, even though https://openqa.suse.de/tests/1908646#step/force_scheduled_tasks/12 looks like they have finished. https://openqa.suse.de/tests/1908646/file/textinfo-info.txt also does not list them.

We should send magic-sysrq-w to the system to see what is going on when we cannot log in.

Actions #3

Updated by okurz over 5 years ago

  • Description updated (diff)
Actions #4

Updated by mloviska over 5 years ago

  • Status changed from Workable to In Progress
  • Assignee set to mloviska
Actions #5

Updated by okurz over 5 years ago

I cross-checked the current behaviour of our systems and my suspicions were confirmed: magic-sysrq output does not seem to show up on the current console tty, see https://bugzilla.opensuse.org/show_bug.cgi?id=1104792

It could be that we can still see the output on the serial port and can use "wait_serial".

Actions #6

Updated by mloviska over 5 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (mloviska)
Actions #7

Updated by JERiveraMoya over 5 years ago

Last failed job provides more logs: https://openqa.suse.de/tests/1948704#downloads

Actions #8

Updated by JERiveraMoya over 5 years ago

  • Has duplicate action #39779: [sle][functional][y] test fails in yast2_lan - timeout and yast2 lan died added
Actions #9

Updated by JERiveraMoya over 5 years ago

  • Status changed from Workable to Blocked

I found in the logs:

2018-08-15 05:35:12 <3> susetest(15112) [agent-modules] ModulesConf.cc(getTimeStamp):282 Failed to stat /etc/modprobe.d/50-yast.conf: No such file or directory
2018-08-15 05:35:12 <1> susetest(15112) [agent-modules] ModulesConf.cc(writeFile):560 Modules not modified, not writing
2018-08-15 05:35:12 <3> susetest(15112) [agent-modules] ModulesConf.cc(getTimeStamp):282 Failed to stat /etc/modprobe.d/50-yast.conf: No such file or directory
2018-08-15 05:35:12 <3> susetest(15112) [agent-modules] ModulesConf.cc(~ModulesConf):103 Can't write configuration file in destructor.
2018-08-15 05:35:12 <1> susetest(15112) [Y2Ruby] binary/YRuby.cc(~YRuby):117 Shutting down ruby interpreter.

For that reason I opened a new bug

Actions #10

Updated by JERiveraMoya over 5 years ago

  • Assignee set to JERiveraMoya
Actions #11

Updated by JERiveraMoya over 5 years ago

  • Status changed from Blocked to In Progress

Trying to reproduce it with our ppc shared worker.

Actions #12

Updated by JERiveraMoya over 5 years ago

  • Status changed from In Progress to Blocked

I always got stuck after this step with the shared worker using MAKETESTSNAPSHOTS=1. Also, the error is sporadic, so I cannot reproduce it locally and cannot provide more info to the bug about it.

Actions #13

Updated by riafarov over 5 years ago

  • Due date changed from 2018-08-28 to 2018-09-11
  • Target version changed from Milestone 18 to Milestone 19

Let's discuss what we can do here.

Actions #14

Updated by JERiveraMoya over 5 years ago

  • Status changed from Blocked to Resolved

Judging from recent jobs and from some stats I generated (job run 10 times in OSD), it is not happening anymore. Reported to the bug and closing this story.

Actions #15

Updated by okurz over 5 years ago

  • Due date changed from 2018-09-11 to 2018-09-25
  • Status changed from Resolved to In Progress

Hi, I did not see any changes that led to the resolution of the ticket. Did one of you create a pull request to introduce the "magic-sysrq-w"? As this is the only ticket that recommends doing so, we should not close the ticket until this is either put in place or referenced in another ticket.

Actions #16

Updated by JERiveraMoya over 5 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (JERiveraMoya)

I now better understand your initial intention (after reading the parent epic for the [u] team), which is to introduce that feature anyway, even though the problem is no longer reproducible.

Actions #17

Updated by JERiveraMoya over 5 years ago

  • Subject changed from [sle][functional][y] test fails in yast2_lan - clear screen took longer than expected, post_fail_hook fails to login -> send magic-sysrq-w to find out what is blocking the system to [sle][functional][u] send magic-sysrq-w to find out what is blocking the system
Actions #18

Updated by okurz over 5 years ago

yes, thanks for updating the subject line, makes sense

Actions #19

Updated by dheidler over 5 years ago

  • Assignee set to dheidler
Actions #20

Updated by dheidler over 5 years ago

  • Assignee deleted (dheidler)

From man 5 proc:

/proc/sys/kernel/sysrq
    This file controls the functions allowed to be invoked by the SysRq key. By default, the file
    contains 1 meaning that every possible SysRq request is allowed (in older kernel versions, SysRq was
    disabled by default, and you were required to specifically enable it at run-time, but this is not the
    case any more). Possible values in this file are:

          0    Disable sysrq completely

          1    Enable all functions of sysrq

          > 1  Bit mask of allowed sysrq functions, as follows:
                 2  Enable control of console logging level
                 4  Enable control of keyboard (SAK, unraw)
                 8  Enable debugging dumps of processes etc.
                16  Enable sync command
                32  Enable remount read-only
                64  Enable signaling of processes (term, kill, oom-kill)
               128  Allow reboot/poweroff
               256  Allow nicing of all real-time tasks

.

cat /proc/sys/kernel/sysrq
184

.

python3
Python 3.6.5 (default, Mar 31 2018, 19:45:04) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> bin(184)
'0b10111000'

So by default we have set:

  • Enable debugging dumps of processes etc.
  • Enable sync command
  • Enable remount read-only
  • Allow reboot/poweroff
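
The same decoding can be done for any value of /proc/sys/kernel/sysrq with a few lines of Python. This is a minimal sketch that only uses the bit meanings quoted from man 5 proc above; the names SYSRQ_BITS and decode_sysrq are made up for illustration.

```python
from typing import List

# Bit meanings as quoted from man 5 proc above.
SYSRQ_BITS = {
    2: "Enable control of console logging level",
    4: "Enable control of keyboard (SAK, unraw)",
    8: "Enable debugging dumps of processes etc.",
    16: "Enable sync command",
    32: "Enable remount read-only",
    64: "Enable signaling of processes (term, kill, oom-kill)",
    128: "Allow reboot/poweroff",
    256: "Allow nicing of all real-time tasks",
}


def decode_sysrq(value: int) -> List[str]:
    """List the functions allowed by a /proc/sys/kernel/sysrq value."""
    if value == 0:
        return ["Disable sysrq completely"]
    if value == 1:
        return ["Enable all functions of sysrq"]
    # Any other value is a bit mask; report each bit that is set.
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


for name in decode_sysrq(184):
    print(name)
# Enable debugging dumps of processes etc.
# Enable sync command
# Enable remount read-only
# Allow reboot/poweroff
```

As far as I know, sysrq-w belongs to the "debugging dumps" group (bit 8), so it should be permitted with the default value of 184.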
Actions #21

Updated by dheidler over 5 years ago

  • Assignee set to dheidler
Actions #22

Updated by dheidler over 5 years ago

send_key 'alt-sysrq-w';

generates the following output on serial tty, so sending the key combination via VNC to QEMU seems to work fine:

[  229.970197] sysrq: SysRq : Show Blocked State
[  229.970667]   task                        PC stack   pid father
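
To get this output in text form, the captured serial log could be searched for the "Show Blocked State" header. The following Python sketch assumes the serial log is already available as a string; blocked_state_lines is a hypothetical helper name and not part of openQA.

```python
from typing import List, Optional

# Header text the kernel prints for sysrq-w, taken from the output above.
HEADER = "SysRq : Show Blocked State"


def blocked_state_lines(serial_text: str) -> Optional[List[str]]:
    """Return the lines following the sysrq-w header, or None if absent."""
    lines = serial_text.splitlines()
    for index, line in enumerate(lines):
        if HEADER in line:
            # Everything after the header belongs to the dump (or to
            # later kernel messages; this sketch does not separate them).
            return lines[index + 1:]
    return None


sample = (
    "[  229.970197] sysrq: SysRq : Show Blocked State\n"
    "[  229.970667]   task                        PC stack   pid father\n"
)
print(blocked_state_lines(sample))
```

With the sample above, only the column header line is returned, which would indicate that no task was blocked at that moment.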
Actions #23

Updated by dheidler over 5 years ago

  • Status changed from Workable to In Progress
Actions #25

Updated by okurz over 5 years ago

The PR is merged and I have seen it in action, at least to the extent that the request is sent. I have not yet found an incident with real "blocked tasks".

Actions #26

Updated by dheidler over 5 years ago

  • Status changed from In Progress to Resolved

@okurz: do you have a link for me?

Actions #27

Updated by okurz over 5 years ago

Well, basically every failed job should have the title line showing up, right? E.g. https://openqa.suse.de/tests/2068578#step/gnome_control_center/11

Actions #28

Updated by okurz over 5 years ago

  • Status changed from Resolved to In Progress

ok, not done here. The "send_key" call fails on the virtio terminal: https://openqa.opensuse.org/tests/757816 see details in logs:

[2018-09-19T12:26:51.0494 CEST] [debug] <<< testapi::send_key(key='alt-sysrq-w', do_wait=0)
DIE Virtio terminal does not support send_key. Use type_string (possibly with an
ANSI/XTERM escape sequence), or switch to a console which sends key presses, not
terminal codes.
Actions #30

Updated by nicksinger over 5 years ago

It also gets sent in bootloader_hyperv: https://openqa.suse.de/tests/2084550#step/bootloader_hyperv/42
Not sure if this is covered by your last PR mentioned in #c29

Actions #31

Updated by SLindoMansilla over 5 years ago

  • Due date changed from 2018-09-25 to 2018-10-09
  • Status changed from In Progress to Feedback

Moving to sprint 27. Waiting for feedback on a PR.

Actions #32

Updated by dheidler over 5 years ago

  • Status changed from Feedback to Resolved

PR got merged.
