action #112874

open

coordination #128366: [epic] further improvement after we did ensure all our database tables accommodate enough data

coordination #112961: [epic] Followup to "openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend"

Jobs stuck in assigned, worker reports to be "currently stopping" for > 21h

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2022-06-22
Due date:
% Done: 0%
Estimated time:
Description

Observation

https://openqa.suse.de/tests currently shows many "assigned" jobs whose "Started" date is "not yet". For example, grenache-1:37 has not completed any job for 3 days. It claims to be working on https://openqa.suse.de/tests/8998400, which is "assigned" and was created "about 21 hours ago" at 2022-06-21 14:30:54Z, but the job has apparently not progressed since then. I checked the journal of that worker on grenache-1.qa and found:

● openqa-worker-auto-restart@37.service - openQA Worker #37
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─30-openqa-max-inactive-caching-downloads.conf
     Active: active (running) since Mon 2022-06-20 07:20:21 CEST; 2 days ago
   Main PID: 194990 (worker)
      Tasks: 1
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@37.service
             └─194990 /usr/bin/perl /usr/share/openqa/script/worker --instance 37

Jun 22 13:03:39 grenache-1 worker[194990]: [info] [pid:194990] Registered and connected via websockets with openQA host baremetal-support.qa>
Jun 22 13:04:39 grenache-1 worker[194990]: [warn] [pid:194990] Websocket connection to http://baremetal-support.qa.suse.de/api/v1/ws/32 fini>
Jun 22 13:04:49 grenache-1 worker[194990]: [debug] [pid:194990] Refusing to grab job from openqa.suse.de: currently stopping

The message "currently stopping" comes from https://github.com/os-autoinst/openQA/blob/681de1b665b56ce154b1232c6e934edbbad6eb66/lib/OpenQA/Worker/CommandHandler.pm#L152 and is meant to say that the worker is stopping, so the job should presumably be rejected soon afterwards. That does not seem to happen here.
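To spot other affected workers, one could scan journal output for the same refusal message. The following is only a minimal sketch: the regular expression is derived solely from the single debug line quoted above, the function name `find_stopping_refusals` is hypothetical, and how the journal lines are collected is left open.

```python
import re

# Pattern built from the one debug line quoted in this ticket; other
# openQA versions may word or format the message differently (assumption).
REFUSAL = re.compile(
    r"worker\[(?P<pid>\d+)\]: \[debug\] \[pid:\d+\] "
    r"Refusing to grab job from (?P<host>\S+): currently stopping"
)


def find_stopping_refusals(journal_lines):
    """Return (pid, host) for each line where a worker refuses a job while stopping."""
    hits = []
    for line in journal_lines:
        match = REFUSAL.search(line)
        if match:
            hits.append((match.group("pid"), match.group("host")))
    return hits


# Sample input copied from the journal excerpt in this ticket.
sample = [
    "Jun 22 13:03:39 grenache-1 worker[194990]: [info] [pid:194990] "
    "Registered and connected via websockets with openQA host baremetal-support.qa>",
    "Jun 22 13:04:49 grenache-1 worker[194990]: [debug] [pid:194990] "
    "Refusing to grab job from openqa.suse.de: currently stopping",
]
print(find_stopping_refusals(sample))
```

A worker that keeps logging this refusal while its job stays "assigned" would be a candidate for the behaviour described here.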

As a mitigation I forcefully restarted grenache-1:37 (and others). https://openqa.suse.de/tests/8998400 then went back to "scheduled" and the worker picked up a new job, in this case https://openqa.suse.de/tests/8998387.

Workaround

Forcefully restart the affected workers
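
The ticket does not record the exact command used for the restart. As a rough sketch, assuming a plain `systemctl restart` of the unit shown in the status output above is sufficient, the commands for a set of worker instances could be generated as follows. The helper name `restart_command` is hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex


def restart_command(instance):
    """Build the systemctl call for one worker instance (not executed here)."""
    # Unit name pattern taken from the systemctl status output in this ticket.
    unit = f"openqa-worker-auto-restart@{int(instance)}.service"
    return ["systemctl", "restart", unit]


# Instance 37 is the worker slot discussed in this ticket; extend as needed.
for instance in [37]:
    print(shlex.join(restart_command(instance)))
```

If a worker does not react to a regular restart, a stronger measure may be needed; that is outside what this sketch covers.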

Actions #1

Updated by okurz over 2 years ago

  • Target version changed from Ready to future

Actions #3

Updated by okurz over 2 years ago

  • Parent task changed from #112718 to #112961