Project

General

Profile

Actions

action #26886

closed

[tools][s390x-kvm] investigate and improve 'assert_shutdown' function in testapi

Added by mgriessmeier over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
-
Start date:
2017-10-19
Due date:
% Done:

0%

Estimated time:

Description

[Observation]

In testapi.pm, 1481 sub assert_shutdown behaves weird on s390x-kvm and can probably also cause race-conditions on other architectures
On s390x-kvm we see following behaviour:

  • After typing 'poweroff', we execute query_isotovideo('backend_is_shutdown') once, which then establishes a ssh connection to our hypervisor (s390p8) and run some virsh command to find out the state of the vm
  • This is done once during the shutdown and returns 0 which is fine because the machine is not shut down yet, therefore the while-loop
  • however, after the system has "reached target shutdown", it is stuck and there is no additional iteration in this loop done (it should actually check once every second), it times out after two hours

s390x-kvm: https://openqa.suse.de/tests/1224632

Weird thing is, that it works on zkvm, which is basically identically to s390x-kvm...

zkvm: https://openqa.suse.de/tests/1224633

After some debugging together with riafarov, we decided to do a simple workaround in the tests and leave this backend improvement to the tools team
If you have any further questions on this, let us know


Related issues 1 (0 open1 closed)

Related to openQA Tests - action #25638: [sles][functional][s390x] test fails in shutdown: VNC stall detected, needs to be investigatedResolvedmgriessmeier2017-09-282017-10-25

Actions
Actions #1

Updated by mgriessmeier over 6 years ago

  • Blocks action #25638: [sles][functional][s390x] test fails in shutdown: VNC stall detected, needs to be investigated added
Actions #2

Updated by okurz over 6 years ago

  • Blocks deleted (action #25638: [sles][functional][s390x] test fails in shutdown: VNC stall detected, needs to be investigated)
Actions #3

Updated by okurz over 6 years ago

  • Related to action #25638: [sles][functional][s390x] test fails in shutdown: VNC stall detected, needs to be investigated added
Actions #4

Updated by okurz over 6 years ago

  • Project changed from 46 to openQA Project
  • Category set to 132

Still valid but #25638 is not blocked anymore as we implemented a workaround there by just not calling assert_shutdown on s390x-kvm

Actions #5

Updated by okurz over 6 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

So we actually saw the problem again on z/VM as well, e.g. see https://openqa.suse.de/tests/1236007#live even though we skip over the call of assert_shutdown which apparently was not the problem. While this job was stuck but not yet incompleted with the help of coolo I could debug. I logged into openqaworker2, saw the worker instance 4 running and the according processes:

_openqa+  5641  0.1  0.0 267912 191416 ?       Ss   Oct19   9:15 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
_openqa+  5195  0.2  0.0 237244 150428 ?       S    14:39   0:14  \_ perl /usr/bin/isotovideo -d
_openqa+  5197  0.2  0.0 124404 54948 ?        S    14:39   0:11      \_ /usr/bin/isotovideo: comman
_openqa+  5198  0.1  0.0 267928 108228 ?       S    14:39   0:09      \_ /usr/bin/isotovideo: autote
_openqa+  5200 97.3  0.0 937516 185852 ?       Rl   14:39  86:26      \_ /usr/bin/isotovideo: backen
_openqa+  5201  0.9  0.0 161216 66288 ?        SN   14:39   0:50          \_ /usr/lib/os-autoinst/videoencoder /var/lib/openqa/pool/4/video.ogv
_openqa+  5202  0.0  0.0 375304 36932 ?        Sl   14:39   0:00          \_ Xvnc -depth 16 -inetd -SecurityTypes None -ac :43801
_openqa+  5222  0.0  0.0  58964 14452 ?        S    14:39   0:01          \_ /usr/bin/x3270 -display :43801 -script -charset us -xrm x3270.visualBell:true -xr
_openqa+  5513  0.0  0.0 376460 38052 ?        Sl   14:40   0:02          \_ Xvnc -depth 16 -inetd -SecurityTypes None -ac :33075
_openqa+  6996  0.0  0.0 376424 39192 ?        Sl   14:50   0:01          \_ Xvnc -depth 16 -inetd -SecurityTypes None -ac :51313

so the backend is still there, the X server is running - as well as icewm, not shown here - but no xterm-console processes anymore. Attaching with strace to isotovideo: autote[st] and isotovideo -d revealed that one process has dispatched a command to isotovideo, isotovideo is polling the event loop but nothing suspicious. What can be seen in the process table is that isotovideo: backen is hogging the CPU but not 100% -> busy-loop but not blocked by kernel commands. strace -p 5200 revealed:

[pid  5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
[pid  5200] select(16, [6 13], [6 13], NULL, {0, 206501}) = 3 (in [6 13], out [13], left {0, 206500})
[pid  5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
[pid  5200] select(16, [6 13], [6 13], NULL, {0, 206436}) = 3 (in [6 13], out [13], left {0, 206435})
[pid  5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
[pid  5200] select(16, [6 13], [6 13], NULL, {0, 206372}) = 3 (in [6 13], out [13], left {0, 206371})
[pid  5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
…

which coolo recognized as one of the ssh sockets still active but the remote end died -> missing command to kill_ssh in the test when the machine is already shut down.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3802

Actions #6

Updated by okurz over 6 years ago

  • Status changed from In Progress to Resolved

PR merged, verified in production: https://openqa.suse.de/tests/1236004

Actions

Also available in: Atom PDF