action #112874
opencoordination #128366: [epic] further improvement after we did ensure all our database tables accomodate enough data
coordination #112961: [epic] Followup to "openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend"
Jobs stuck in assigned, worker reports to be "currently stopping" for > 21h
Description
Observation
https://openqa.suse.de/tests currently shows many "assigned" jobs whose "Started" date is "not yet". For example, grenache-1:37 has not completed any job for 3 days. It claims to be working on https://openqa.suse.de/tests/8998400, which is "assigned" and was created "about 21 hours ago" at 2022-06-21 14:30:54Z, but apparently has not progressed since then. I checked the journal of that worker on grenache-1.qa and found:
● openqa-worker-auto-restart@37.service - openQA Worker #37
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
└─30-openqa-max-inactive-caching-downloads.conf
Active: active (running) since Mon 2022-06-20 07:20:21 CEST; 2 days ago
Main PID: 194990 (worker)
Tasks: 1
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@37.service
└─194990 /usr/bin/perl /usr/share/openqa/script/worker --instance 37
Jun 22 13:03:39 grenache-1 worker[194990]: [info] [pid:194990] Registered and connected via websockets with openQA host baremetal-support.qa>
Jun 22 13:04:39 grenache-1 worker[194990]: [warn] [pid:194990] Websocket connection to http://baremetal-support.qa.suse.de/api/v1/ws/32 fini>
Jun 22 13:04:49 grenache-1 worker[194990]: [debug] [pid:194990] Refusing to grab job from openqa.suse.de: currently stopping
The message "currently stopping" comes from https://github.com/os-autoinst/openQA/blob/681de1b665b56ce154b1232c6e934edbbad6eb66/lib/OpenQA/Worker/CommandHandler.pm#L152 and should mean that the worker is stopping, so the job would be expected to be rejected shortly (or handled in some similar way). However, that does not seem to happen here.
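To see how often a worker refuses jobs this way, the journal can be scanned for the debug line quoted above. The following is only a minimal sketch: the function name `find_stopping_refusals` and the regular expression are assumptions built from the single log line in this ticket, not part of openQA, and other openQA versions may format the line differently.

```python
import re

# Pattern modelled on the one debug line quoted in this ticket (an assumption,
# not an official log format). It captures timestamp, host, PID and web UI host.
STOPPING_RE = re.compile(
    r"^(?P<timestamp>\w{3} +\d+ [\d:]{8}) (?P<host>\S+) worker\[(?P<pid>\d+)\]: "
    r"\[debug\] \[pid:\d+\] Refusing to grab job from (?P<webui>\S+): currently stopping"
)


def find_stopping_refusals(journal_lines):
    """Return one dict per journal line where a worker refused a job while stopping."""
    hits = []
    for line in journal_lines:
        match = STOPPING_RE.match(line.strip())
        if match:
            hits.append(match.groupdict())
    return hits


# Two lines copied from the journal excerpt above (the first one is truncated there).
sample = [
    "Jun 22 13:03:39 grenache-1 worker[194990]: [info] [pid:194990] Registered and connected via websockets with openQA host baremetal-support.qa>",
    "Jun 22 13:04:49 grenache-1 worker[194990]: [debug] [pid:194990] Refusing to grab job from openqa.suse.de: currently stopping",
]

for hit in find_stopping_refusals(sample):
    print(hit["host"], hit["pid"], hit["webui"])
```

The lines could come from saved journal output read as text. A worker that keeps logging this message for hours without exiting is a likely candidate for the problem described here.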
As a mitigation I forcefully restarted grenache-1:37 (and others), so https://openqa.suse.de/tests/8998400 went back to "scheduled" and the worker picked up a new job, in this case https://openqa.suse.de/tests/8998387.
Workaround
Forcefully restart the affected workers
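
Finding the affected workers comes down to listing jobs that have stayed in "assigned" for too long. The following is a rough sketch of that check. The record layout (`id`, `state`, `worker`, `created`), the function name `stuck_assigned_jobs` and the one-hour threshold are all assumptions for illustration, not the openQA API or schema. The first record uses the values from this ticket, the second record is made up, and `now` is only approximated from "about 21 hours ago".

```python
from datetime import datetime, timedelta, timezone


def stuck_assigned_jobs(jobs, now, max_age=timedelta(hours=1)):
    """Return jobs still in state "assigned" that were created more than max_age ago."""
    return [
        job
        for job in jobs
        if job["state"] == "assigned" and now - job["created"] > max_age
    ]


jobs = [
    # Values taken from this ticket.
    {
        "id": 8998400,
        "state": "assigned",
        "worker": "grenache-1:37",
        "created": datetime(2022, 6, 21, 14, 30, 54, tzinfo=timezone.utc),
    },
    # Hypothetical healthy job, included only for contrast.
    {
        "id": 1,
        "state": "running",
        "worker": "example-host:1",
        "created": datetime(2022, 6, 22, 11, 0, 0, tzinfo=timezone.utc),
    },
]

# Approximate time of the observation (assumed).
now = datetime(2022, 6, 22, 11, 30, 0, tzinfo=timezone.utc)

for job in stuck_assigned_jobs(jobs, now):
    print(job["id"], job["worker"], now - job["created"])
```

The workers named in the result are the ones to restart. The threshold should be chosen well above the time a healthy worker needs to accept an assigned job.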