action #93138

open

openqa-reload-worker-auto-restart and openqa-worker-auto-restart fail if numofworkers is reduced

Added by nicksinger over 3 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 2021-05-26
Due date:
% Done: 0%
Estimated time:

Description

With https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/320/diffs#0a0ec388f05cea3ca9a77dd911a33555ec192597_38_38 @mgriessmeier reduced numofworkers for openqaworker2. Soon after, we received an alert that several services on this machine were failing. On closer investigation, I see the following services in a failed state:

● openqa-reload-worker-auto-restart@36.service         loaded failed failed    Restarts openqa-worker-auto-restart@36.service as soon as possible without interrupting jobs
● openqa-reload-worker-auto-restart@37.service         loaded failed failed    Restarts openqa-worker-auto-restart@37.service as soon as possible without interrupting jobs
● openqa-reload-worker-auto-restart@38.service         loaded failed failed    Restarts openqa-worker-auto-restart@38.service as soon as possible without interrupting jobs
● openqa-worker-auto-restart@37.service                loaded failed failed    openQA Worker #37

(Note: I don't understand why there is no openqa-worker-auto-restart@36.service or openqa-worker-auto-restart@38.service in this list. Is this a setup failure?)

These appear to be the services for the worker instances that were removed when numofworkers was reduced.
Example output for one of these services:

openqaworker2:~ # systemctl status openqa-reload-worker-auto-restart@36.service
● openqa-reload-worker-auto-restart@36.service - Restarts openqa-worker-auto-restart@36.service as soon as possible without interrupting jobs
   Loaded: loaded (/usr/lib/systemd/system/openqa-reload-worker-auto-restart@.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2021-05-26 13:44:13 CEST; 49min ago
  Process: 21839 ExecStart=/usr/bin/systemctl reload openqa-worker-auto-restart@36.service (code=exited, status=1/FAILURE)
 Main PID: 21839 (code=exited, status=1/FAILURE)

May 26 13:44:12 openqaworker2 systemd[1]: Starting Restarts openqa-worker-auto-restart@36.service as soon as possible without interrupting jobs...
May 26 13:44:13 openqaworker2 systemctl[21839]: Job for openqa-worker-auto-restart@36.service canceled.
May 26 13:44:13 openqaworker2 systemd[1]: openqa-reload-worker-auto-restart@36.service: Main process exited, code=exited, status=1/FAILURE
May 26 13:44:13 openqaworker2 systemd[1]: Failed to start Restarts openqa-worker-auto-restart@36.service as soon as possible without interrupting jobs.
May 26 13:44:13 openqaworker2 systemd[1]: openqa-reload-worker-auto-restart@36.service: Unit entered failed state.
May 26 13:44:13 openqaworker2 systemd[1]: openqa-reload-worker-auto-restart@36.service: Failed with result 'exit-code'.

and

openqaworker2:~ # systemctl status openqa-worker-auto-restart@37.service
● openqa-worker-auto-restart@37.service - openQA Worker #37
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
           └─20-nvme-autoformat.conf
   Active: failed (Result: timeout) since Wed 2021-05-26 13:45:43 CEST; 48min ago
  Process: 18003 ExecStart=/usr/share/openqa/script/worker --instance 37 (code=killed, signal=KILL)
 Main PID: 18003 (code=killed, signal=KILL)

May 26 13:45:43 openqaworker2 worker[18003]: [debug] [pid:22105] Optimizing /var/lib/openqa/pool/37/testresults/ImageMagick-116.png
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: State 'stop-sigterm' timed out. Killing.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Killing process 18003 (worker) with signal SIGKILL.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Killing process 22105 (worker) with signal SIGKILL.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Killing process 23223 (optipng) with signal SIGKILL.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Main process exited, code=killed, status=9/KILL
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Killing process 22105 (worker) with signal SIGKILL.
May 26 13:45:43 openqaworker2 systemd[1]: Stopped openQA Worker #37.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Unit entered failed state.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Failed with result 'timeout'.

These services should handle a reduction of numofworkers more gracefully.
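
As a possible manual workaround until this is handled properly, the surplus instances could be stopped and their failed state cleared by hand. The following is only a rough sketch: the helper names `stale_worker_units` and `cleanup_stale_workers` are hypothetical, and the old and new numofworkers values (38 and 35) are assumptions derived from the failed instance numbers above, not taken from the merge request. Note that `systemctl stop` on a worker that is still running a job may hit the same stop timeout seen for instance 37.

```shell
#!/bin/sh
# Hypothetical helper: print the unit names of worker instances that
# become surplus when numofworkers drops from $1 (old) to $2 (new).
stale_worker_units() {
    old=$1
    new=$2
    i=$((new + 1))
    while [ "$i" -le "$old" ]; do
        printf '%s\n' "openqa-worker-auto-restart@$i.service"
        printf '%s\n' "openqa-reload-worker-auto-restart@$i.service"
        i=$((i + 1))
    done
}

# Hypothetical cleanup: stop each surplus unit and clear its failed state.
# DRY_RUN defaults to 1, so nothing is changed unless DRY_RUN=0 is set.
cleanup_stale_workers() {
    for unit in $(stale_worker_units "$1" "$2"); do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "would run: systemctl stop $unit; systemctl reset-failed $unit"
        else
            systemctl stop "$unit"
            # reset-failed returns non-zero if the unit is not loaded; ignore that
            systemctl reset-failed "$unit" || true
        fi
    done
}

# Assumed values: numofworkers reduced from 38 to 35 (instances 36..38 surplus)
cleanup_stale_workers 38 35
```

Running this as-is only prints the commands; setting `DRY_RUN=0` would execute them. This does not fix the underlying problem, it only tidies up the units after the fact.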


Related issues 1 (1 open, 0 closed)

Related to openQA Project (public) - action #62441: openqa-worker systemd service can timeout when stopping (New, 2020-01-21)
