So we actually saw the problem again on z/VM as well, e.g. see https://openqa.suse.de/tests/1236007#live even though we skip over the call of assert_shutdown
which apparently was not the problem. While this job was stuck but not yet incompleted with the help of coolo I could debug. I logged into openqaworker2, saw the worker instance 4 running and the according processes:
_openqa+ 5641 0.1 0.0 267912 191416 ? Ss Oct19 9:15 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
_openqa+ 5195 0.2 0.0 237244 150428 ? S 14:39 0:14 \_ perl /usr/bin/isotovideo -d
_openqa+ 5197 0.2 0.0 124404 54948 ? S 14:39 0:11 \_ /usr/bin/isotovideo: comman
_openqa+ 5198 0.1 0.0 267928 108228 ? S 14:39 0:09 \_ /usr/bin/isotovideo: autote
_openqa+ 5200 97.3 0.0 937516 185852 ? Rl 14:39 86:26 \_ /usr/bin/isotovideo: backen
_openqa+ 5201 0.9 0.0 161216 66288 ? SN 14:39 0:50 \_ /usr/lib/os-autoinst/videoencoder /var/lib/openqa/pool/4/video.ogv
_openqa+ 5202 0.0 0.0 375304 36932 ? Sl 14:39 0:00 \_ Xvnc -depth 16 -inetd -SecurityTypes None -ac :43801
_openqa+ 5222 0.0 0.0 58964 14452 ? S 14:39 0:01 \_ /usr/bin/x3270 -display :43801 -script -charset us -xrm x3270.visualBell:true -xr
_openqa+ 5513 0.0 0.0 376460 38052 ? Sl 14:40 0:02 \_ Xvnc -depth 16 -inetd -SecurityTypes None -ac :33075
_openqa+ 6996 0.0 0.0 376424 39192 ? Sl 14:50 0:01 \_ Xvnc -depth 16 -inetd -SecurityTypes None -ac :51313
so the backend is still there, the X server is running - as well as icewm, not shown here - but no xterm-console processes anymore. Attaching with strace to isotovideo: autote[st]
and isotovideo -d
revealed that one process has dispatched a command to isotovideo, isotovideo is polling the event loop but nothing suspicious. What can be seen in the process table is that isotovideo: backen
is hogging the CPU but not 100% -> busy-loop but not blocked by kernel commands. strace -p 5200
revealed:
[pid 5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
[pid 5200] select(16, [6 13], [6 13], NULL, {0, 206501}) = 3 (in [6 13], out [13], left {0, 206500})
[pid 5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
[pid 5200] select(16, [6 13], [6 13], NULL, {0, 206436}) = 3 (in [6 13], out [13], left {0, 206435})
[pid 5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
[pid 5200] select(16, [6 13], [6 13], NULL, {0, 206372}) = 3 (in [6 13], out [13], left {0, 206371})
[pid 5200] recvfrom(13, "", 16384, MSG_NOSIGNAL, NULL, NULL) = 0
…
which coolo recognized as one of the ssh sockets still active but the remote end died -> missing command to kill_ssh
in the test when the machine is already shut down.
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3802