action #39497
closed [sle][functional][u] send magic-sysrq-w to find out what is blocking the system
Description
Observation
openQA test in scenario sle-12-SP4-Server-DVD-ppc64le-cryptlvm_minimal_x@ppc64le fails in yast2_lan
Reproducible
Fails since (at least) Build 0328 (current job)
Expected result
Last good: 0327 (or more recent)
Acceptance criteria
- AC1: We can see whether any task is blocking the system even when we cannot log in to the system in the post_fail_hook
Suggestions
- See what we did already in https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4843/files#diff-141f4b5a48eaecb0c631a0de23e41a51R1135 to collect the "blocked tasks" from the system when we no longer have a logged-in console
- Send the magic sysrq sequence to the system if the post_fail_hook fails to log in as reported above. That probably means we need to call select_console('log-console', await_console => 0) in the post_fail_hook and check manually whether we reach the expected login or logged-in prompt, or whether we are stuck and need to send magic-sysrq-w. Or we might need a "post_fail_hook for the post_fail_hook"
- Make sure the output of magic sysrq is available in text form, not just in a screenshot, so that everybody can read it and we can also forward the text to external references, e.g. bug reports
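To satisfy the "text form" point, the sysrq-w report also needs to be extracted from the captured serial log. A minimal sketch of such an extraction (the helper name and the timestamp-stripping regex are my own, assuming dmesg-style "[  229.970197] " prefixes on the serial output):

```python
import re

# Hypothetical helper: pull the magic-sysrq-w "Show Blocked State" report
# out of a captured serial log, stripped of timestamps, so the plain text
# can be attached to bug reports. Assumes dmesg-style timestamp prefixes.
def extract_blocked_state(serial_log):
    report = []
    capturing = False
    for line in serial_log.splitlines():
        if "SysRq : Show Blocked State" in line:
            capturing = True
        if capturing:
            # drop the "[  229.970197] " prefix, keep the rest verbatim
            report.append(re.sub(r"^\[\s*[\d.]+\]\s", "", line))
    return "\n".join(report)

log = ("[  229.970197] sysrq: SysRq : Show Blocked State\n"
       "[  229.970667]   task                        PC stack   pid father")
print(extract_blocked_state(log))
```

Everything before the header line is discarded, so the helper returns an empty string when no sysrq-w report was captured at all.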
Further details
Always latest result in this scenario: latest
Updated by okurz over 5 years ago
- Related to coordination #32734: [functional][epic][u][new test] OOM handling added
Updated by okurz over 5 years ago
- Subject changed from [sle][functional][y] test fails in yast2_lan - clear screen took longer than expected to [sle][functional][y] test fails in yast2_lan - clear screen took longer than expected, post_fail_hook fails to login -> send magic-sysrq-w to find out what is blocking the system
- Description updated (diff)
- Due date set to 2018-08-28
- Status changed from New to Workable
- Target version set to Milestone 18
Also here an interesting symptom is that the post_fail_hook failed to log in on https://openqa.suse.de/tests/1908646#step/yast2_lan/23 so there must be something running in the background that hampers performance. https://openqa.suse.de/tests/1908646/file/serial0.txt unfortunately does not show anything and without the post_fail_hook we do not know. I suspect the btrfs maintenance jobs in the background even though https://openqa.suse.de/tests/1908646#step/force_scheduled_tasks/12 looks like they have finished. https://openqa.suse.de/tests/1908646/file/textinfo-info.txt also does not list them.
We should send magic-sysrq-w to the system to see what's going on when we cannot log in.
Updated by mloviska over 5 years ago
- Status changed from Workable to In Progress
- Assignee set to mloviska
Updated by okurz over 5 years ago
I crosschecked current behaviour of our systems and my suspicions were confirmed: https://bugzilla.opensuse.org/show_bug.cgi?id=1104792 magic-sysrq output does not seem to show up on the current console tty
It could be that we can still see the output on the serial port and can use "wait_serial".
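If the output does reach the serial port, a wait_serial-style check boils down to matching the kernel's header line. A small sketch of that match; the exact wording is an assumption based on the output observed on 4.x kernels and may vary between kernel versions:

```python
import re

# Pattern a wait_serial-style check could watch for after sending sysrq-w.
# The exact kernel wording is an assumption and may differ per version.
SYSRQ_W_HEADER = re.compile(r"SysRq\s*:\s*Show Blocked State")

def saw_sysrq_output(serial_text):
    """Return True if the sysrq-w header appeared in the serial capture."""
    return bool(SYSRQ_W_HEADER.search(serial_text))

print(saw_sysrq_output("[  229.970197] sysrq: SysRq : Show Blocked State"))
```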
Updated by mloviska over 5 years ago
- Status changed from In Progress to Workable
- Assignee deleted (mloviska)
Updated by JERiveraMoya over 5 years ago
Last failed job provides more logs: https://openqa.suse.de/tests/1948704#downloads
Updated by JERiveraMoya over 5 years ago
- Has duplicate action #39779: [sle][functional][y] test fails in yast2_lan - timeout and yast2 lan died added
Updated by JERiveraMoya over 5 years ago
- Status changed from Workable to Blocked
I found in the logs:
2018-08-15 05:35:12 <3> susetest(15112) [agent-modules] ModulesConf.cc(getTimeStamp):282 Failed to stat /etc/modprobe.d/50-yast.conf: No such file or directory
2018-08-15 05:35:12 <1> susetest(15112) [agent-modules] ModulesConf.cc(writeFile):560 Modules not modified, not writing
2018-08-15 05:35:12 <3> susetest(15112) [agent-modules] ModulesConf.cc(getTimeStamp):282 Failed to stat /etc/modprobe.d/50-yast.conf: No such file or directory
2018-08-15 05:35:12 <3> susetest(15112) [agent-modules] ModulesConf.cc(~ModulesConf):103 Can't write configuration file in destructor.
2018-08-15 05:35:12 <1> susetest(15112) [Y2Ruby] binary/YRuby.cc(~YRuby):117 Shutting down ruby interpreter.
For that reason I opened a new bug.
Updated by JERiveraMoya over 5 years ago
- Status changed from Blocked to In Progress
Trying to reproduce it with our ppc shared worker.
Updated by JERiveraMoya over 5 years ago
- Status changed from In Progress to Blocked
I always got stuck after this step with the shared worker using MAKETESTSNAPSHOTS=1. The error is also sporadic, so I cannot reproduce it locally and provide more info to the bug about this sporadic issue.
Updated by riafarov over 5 years ago
- Due date changed from 2018-08-28 to 2018-09-11
- Target version changed from Milestone 18 to Milestone 19
Let's discuss what we can do here.
Updated by JERiveraMoya over 5 years ago
- Status changed from Blocked to Resolved
It is not happening anymore after checking recent jobs and also after generating some stats (running the job 10 times on OSD). Reported this to the bug and closing this story.
Updated by okurz over 5 years ago
- Due date changed from 2018-09-11 to 2018-09-25
- Status changed from Resolved to In Progress
Hi, I did not see any changes that led to the resolution of this ticket. Did one of you create a pull request to introduce the "magic-sysrq-w"? As this is the only ticket that recommends doing so, we should not close it until this is either put in place or referenced in another ticket.
Updated by JERiveraMoya over 5 years ago
- Status changed from In Progress to Workable
- Assignee deleted (JERiveraMoya)
I now understand your initial intention better (reading the parent epic for the [u] team): to introduce that feature anyway, even though the problem is no longer reproducible.
Updated by JERiveraMoya over 5 years ago
- Subject changed from [sle][functional][y] test fails in yast2_lan - clear screen took longer than expected, post_fail_hook fails to login -> send magic-sysrq-w to find out what is blocking the system to [sle][functional][u] send magic-sysrq-w to find out what is blocking the system
Updated by okurz over 5 years ago
yes, thanks for updating the subject line, makes sense
Updated by dheidler over 5 years ago
- Assignee deleted (dheidler)
From man 5 proc:
/proc/sys/kernel/sysrq
       This file controls the functions allowed to be invoked by the SysRq key. By default, the file contains 1, meaning that every possible SysRq request is allowed (in older kernel versions, SysRq was disabled by default, and you were required to specifically enable it at run-time, but this is not the case any more). Possible values in this file are:
          0 - Disable sysrq completely
          1 - Enable all functions of sysrq
         >1 - Bit mask of allowed sysrq functions, as follows:
                2 - Enable control of console logging level
                4 - Enable control of keyboard (SAK, unraw)
                8 - Enable debugging dumps of processes etc.
               16 - Enable sync command
               32 - Enable remount read-only
               64 - Enable signaling of processes (term, kill, oom-kill)
              128 - Allow reboot/poweroff
              256 - Allow nicing of all real-time tasks
$ cat /proc/sys/kernel/sysrq
184
python3
Python 3.6.5 (default, Mar 31 2018, 19:45:04) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> bin(184)
'0b10111000'
So by default we have set:
- Enable debugging dumps of processes etc.
- Enable sync command
- Enable remount read-only
- Allow reboot/poweroff
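That decoding can be done mechanically. A small sketch (the function name and label strings are my own) mapping the /proc/sys/kernel/sysrq value to the capabilities listed in man 5 proc:

```python
# Decode the /proc/sys/kernel/sysrq bitmask into the capability names
# from man 5 proc. Values 0 and 1 are special-cased as documented there.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signaling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all real-time tasks",
}

def decode_sysrq(value):
    if value == 0:
        return []                       # sysrq disabled completely
    if value == 1:
        return list(SYSRQ_BITS.values())  # all functions enabled
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]

print(decode_sysrq(184))
```

For the default value 184 (0b10111000) this yields exactly the four items listed above, which confirms the manual decoding.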
Updated by dheidler over 5 years ago
send_key 'alt-sysrq-w';
generates the following output on the serial tty, so sending the key combination via VNC to QEMU seems to work fine:
[ 229.970197] sysrq: SysRq : Show Blocked State
[ 229.970667] task PC stack pid father
Updated by dheidler over 5 years ago
- Status changed from Workable to In Progress
Updated by dheidler over 5 years ago
Updated by okurz over 5 years ago
PR is merged and I have seen it in action, at least in that the request is called. I have not found an incident with real "blocked tasks" yet.
Updated by dheidler over 5 years ago
- Status changed from In Progress to Resolved
@okurz: do you have a link for me?
Updated by okurz over 5 years ago
Well, basically every failed job should have the title line showing up, right? E.g. https://openqa.suse.de/tests/2068578#step/gnome_control_center/11
Updated by okurz over 5 years ago
- Status changed from Resolved to In Progress
ok, not done here. The "send_key" call fails on the virtio terminal: https://openqa.opensuse.org/tests/757816 see details in logs:
[2018-09-19T12:26:51.0494 CEST] [debug] <<< testapi::send_key(key='alt-sysrq-w', do_wait=0)
DIE Virtio terminal does not support send_key. Use type_string (possibly with an
ANSI/XTERM escape sequence), or switch to a console which sends key presses, not
terminal codes.
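A possible workaround, sketched below with a hypothetical helper: on consoles that only accept typed text, type a command that writes "w" to /proc/sysrq-trigger (this requires root and the "debugging dumps" bit enabled in /proc/sys/kernel/sysrq) instead of sending the key combination. The console names used here are illustrative assumptions, not the exact os-autoinst identifiers:

```python
# Hypothetical helper: pick how to trigger sysrq-w depending on console type.
# VNC-backed consoles accept real key events; virtio/serial terminals only
# accept typed text, so there we would type a command that writes to
# /proc/sysrq-trigger (needs root and sysrq bit 8 enabled) instead.
def sysrq_w_action(console):
    if console in ("sut", "x11"):  # illustrative VNC-backed console names
        return ("send_key", "alt-sysrq-w")
    # virtio/serial terminal: type a procfs-based trigger command instead
    return ("type_string", "echo w > /proc/sysrq-trigger\n")

print(sysrq_w_action("root-virtio-terminal"))
```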
Updated by dheidler over 5 years ago
Updated by nicksinger over 5 years ago
Also gets sent in bootloader_hyperv: https://openqa.suse.de/tests/2084550#step/bootloader_hyperv/42
Not sure if this is covered by your last PR mentioned in #c29
Updated by SLindoMansilla over 5 years ago
- Due date changed from 2018-09-25 to 2018-10-09
- Status changed from In Progress to Feedback
Moving to sprint 27. Waiting for feedback on a PR.