action #71644
closedjob stuck in "uploading", can not be cancelled, after what should have incompleted early
Description
Observation¶
Anna Minou @punkioudi asked about https://openqa.suse.de/tests/4716001 which at this time showed a retrigger button but no cancel button. I checked the job and found in
https://openqa.suse.de/tests/4716001#live
ERROR: no tests loaded at /usr/lib/os-autoinst/autotest.pm line 340.
[2020-09-22T12:18:41.304 CEST] [debug] stopping autotest process 1782
still at 12:45 CEST so that should have stopped much sooner already.
On the according worker machine grenache-1.qa I see
okurz@grenache-1:~> sudo systemctl status openqa-worker@15
● openqa-worker@15.service - openQA Worker #15
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2020-09-16 09:04:47 CEST; 6 days ago
Process: 595861 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/15 (code=exited, status=0/SUCCESS)
Main PID: 595902 (worker)
Tasks: 1
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker@15.service
└─595902 /usr/bin/perl /usr/share/openqa/script/worker --instance 15
Sep 22 12:18:42 grenache-1 worker[595902]: [info] [pid:595902] Command server likely not reachable at all
Sep 22 12:18:43 grenache-1 worker[595902]: [info] [pid:595902] Isotovideo exit status: 1
Sep 22 12:18:43 grenache-1 worker[595902]: [info] [pid:595902] +++ worker notes +++
Sep 22 12:18:43 grenache-1 worker[595902]: [info] [pid:595902] End time: 2020-09-22 10:18:43
Sep 22 12:18:43 grenache-1 worker[595902]: [info] [pid:595902] Result: no tests scheduled
Sep 22 12:18:43 grenache-1 worker[595902]: [info] [pid:1819] Uploading vars.json
Sep 22 12:18:43 grenache-1 worker[595902]: [info] [pid:1819] Uploading autoinst-log.txt
Sep 22 12:18:43 grenache-1 worker[595902]: [info] [pid:1819] Uploading worker-log.txt
Sep 22 12:18:43 grenache-1 worker[595902]: [info] [pid:1819] Uploading serial0.txt
Sep 22 12:18:43 grenache-1 worker[595902]: [info] [pid:1819] Uploading video_time.vtt
okurz@grenache-1:~> sudo strace -f -y -p 595902
strace: Process 595902 attached
restart_syscall(<... resuming interrupted poll ...>
) = 0
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI|POLLOUT}], 2, 28753) = 1 ([{fd=5, revents=POLLOUT}])
write(5<socket:[92162047]>, "\201\376\25\352\0\216AP{\254\"%r\374$>t\32165b\373(\17h\3412$\"\264c?"..., 5618) = 5618
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 28752) = 1 ([{fd=5, revents=POLLIN}])
read(5<socket:[92162047]>, "\201\37{\"population\":40,\"type\":\"info\""..., 131072) = 33
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 28737) = 0 (Timeout)
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI|POLLOUT}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 46053) = 1 ([{fd=4, revents=POLLOUT}])
write(4<socket:[68562927]>, "\201\376\25\352\0\"\350\230{\0\213\355rP\215\366t}\237\375bW\201\307hM\233\354\"\30\312\367"..., 5618) = 5618
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 46052) = 1 ([{fd=4, revents=POLLIN}])
read(4<socket:[68562927]>, "\201 {\"population\":341,\"type\":\"info"..., 131072) = 34
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 46009) = 0 (Timeout)
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI|POLLOUT}], 2, 53904) = 1 ([{fd=5, revents=POLLOUT}])
write(5<socket:[92162047]>, "\201\376\25\352\0>\17U{\34l rLj;tax0bKf\nhQ|!\"\4-:"..., 5618) = 5618
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 53904) = 1 ([{fd=5, revents=POLLIN}])
read(5<socket:[92162047]>, "\201\37{\"population\":40,\"type\":\"info\""..., 131072) = 33
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 53889) = 0 (Timeout)
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI|POLLOUT}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 20873) = 1 ([{fd=4, revents=POLLOUT}])
write(4<socket:[68562927]>, "\201\376\25\352\0j\272({H\331]r\30\337Ft5\315Mb\37\323wh\5\311\\\"P\230G"..., 5618) = 5618
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 20872) = 1 ([{fd=4, revents=POLLIN}])
read(4<socket:[68562927]>, "\201 {\"population\":341,\"type\":\"info"..., 131072) = 34
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 20845) = 0 (Timeout)
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI|POLLOUT}], 2, 79108) = 1 ([{fd=5, revents=POLLOUT}])
write(5<socket:[92162047]>, "\201\376\25\352\0[\177\10{y\34}r)\32ft\4\10mb.\26Wh4\f|\"a]g"..., 5618) = 5618
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 74830) = 1 ([{fd=5, revents=POLLIN}])
read(5<socket:[92162047]>, "\201\37{\"population\":40,\"type\":\"info\""..., 131072) = 33
poll([{fd=4<socket:[68562927]>, events=POLLIN|POLLPRI}, {fd=5<socket:[92162047]>, events=POLLIN|POLLPRI}], 2, 74815
Interestingly restarting the systemd service for the worker caused the job to go non-existant.
Problem¶
As we do not have any more information than the above and not being aware of any previous instance of the problem I immediately close the ticket as "Resolved"
Workaround¶
Restart the according openQA worker service
Updated by mkittler about 4 years ago
… which at this time showed a retrigger button but no cancel button
PR https://github.com/os-autoinst/openQA/pull/3409 should fix that by showing the cancel button for all jobs in non-final states.