action #26886
closed[tools][s390x-kvm] investigate and improve 'assert_shutdown' function in testapi
0%
Description
[Observation]
In testapi.pm, 1481 sub assert_shutdown
behaves weird on s390x-kvm and can probably also cause race-conditions on other architectures
On s390x-kvm we see following behaviour:
- After typing 'poweroff', we execute
query_isotovideo('backend_is_shutdown')
once, which then establishes a ssh connection to our hypervisor (s390p8) and run some virsh command to find out the state of the vm - This is done once during the shutdown and returns 0 which is fine because the machine is not shut down yet, therefore the while-loop
- however, after the system has "reached target shutdown", it is stuck and there is no additional iteration in this loop done (it should actually check once every second), it times out after two hours
s390x-kvm: https://openqa.suse.de/tests/1224632
Weird thing is, that it works on zkvm, which is basically identically to s390x-kvm...
zkvm: https://openqa.suse.de/tests/1224633
After some debugging together with riafarov, we decided to do a simple workaround in the tests and leave this backend improvement to the tools team
If you have any further questions on this, let us know
Updated by mgriessmeier over 6 years ago
- Blocks action #25638: [sles][functional][s390x] test fails in shutdown: VNC stall detected, needs to be investigated added
Updated by okurz over 6 years ago
- Blocks deleted (action #25638: [sles][functional][s390x] test fails in shutdown: VNC stall detected, needs to be investigated)
Updated by okurz over 6 years ago
- Related to action #25638: [sles][functional][s390x] test fails in shutdown: VNC stall detected, needs to be investigated added
Updated by okurz over 6 years ago
- Project changed from 46 to openQA Project
- Category set to 132
Still valid but #25638 is not blocked anymore as we implemented a workaround there by just not calling assert_shutdown
on s390x-kvm
Updated by okurz over 6 years ago
- Status changed from New to In Progress
- Assignee set to okurz
So we actually saw the problem again on z/VM as well, e.g. see https://openqa.suse.de/tests/1236007#live even though we skip over the call of assert_shutdown
which apparently was not the problem. While this job was stuck but not yet incompleted with the help of coolo I could debug. I logged into openqaworker2, saw the worker instance 4 running and the according processes:
_openqa+ 5641 0.1 0.0 267912 191416 ? Ss Oct19 9:15 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
_openqa+ 5195 0.2 0.0 237244 150428 ? S 14:39 0:14 \_ perl /usr/bin/isotovideo -d
_openqa+ 5197 0.2 0.0 124404 54948 ? S 14:39 0:11 \_ /usr/bin/isotovideo: comman
_openqa+ 5198 0.1 0.0 267928 108228 ? S 14:39 0:09 \_ /usr/bin/isotovideo: autote
_openqa+ 5200 97.3 0.0 937516 185852 ? Rl 14:39 86:26 \_ /usr/bin/isotovideo: backen
_openqa+ 5201 0.9 0.0 161216 66288 ? SN 14:39 0:50 \_ /usr/lib/os-autoinst/videoencoder /var/lib/openqa/pool/4/video.ogv
_openqa+ 5202 0.0 0.0 375304 36932 ? Sl 14:39 0:00 \_ Xvnc -depth 16 -inetd -SecurityTypes None -ac :43801
_openqa+ 5222 0.0 0.0 58964 14452 ? S 14:39 0:01 \_ /usr/bin/x3270 -display :43801 -script -charset us -xrm x3270.visualBell:true -xr
_openqa+ 5513 0.0 0.0 376460 38052 ? Sl 14:40 0:02 \_ Xvnc -depth 16 -inetd -SecurityTypes None -ac :33075
_openqa+ 6996 0.0 0.0 376424 39192 ? Sl 14:50 0:01 \_ Xvnc -depth 16 -inetd -SecurityTypes None -ac :51313
so the backend is still there, the X server is running - as well as icewm, not shown here - but no xterm-console processes anymore. Attaching with strace to isotovideo: autote[st]
and isotovideo -d
revealed that one process has dispatched a command to isotovideo, isotovideo is polling the event loop but nothing suspicious. What can be seen in the process table is that isotovideo: backen
is hogging the CPU but not 100% -> busy-loop but not blocked by kernel commands. strace -p 5200
revealed:
[pid 5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
[pid 5200] select(16, [6 13], [6 13], NULL, {0, 206501}) = 3 (in [6 13], out [13], left {0, 206500})
[pid 5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
[pid 5200] select(16, [6 13], [6 13], NULL, {0, 206436}) = 3 (in [6 13], out [13], left {0, 206435})
[pid 5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
[pid 5200] select(16, [6 13], [6 13], NULL, {0, 206372}) = 3 (in [6 13], out [13], left {0, 206371})
[pid 5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
…
which coolo recognized as one of the ssh sockets still active but the remote end died -> missing command to kill_ssh
in the test when the machine is already shut down.
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3802
Updated by okurz over 6 years ago
- Status changed from In Progress to Resolved
PR merged, verified in production: https://openqa.suse.de/tests/1236004