action #96683: Reducing the number of worker slots leads to failing systemd units
Status: open
Description
Reducing the number of worker slots leaves behind failed systemd units. For example, the change https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/339/diffs?commit_id=9e6db9d883e73a699e2a4f195c73e1c346fdbb42 left openqa-reload-worker-auto-restart@20.service
and all other now-disabled openqa-reload-worker-auto-restart@….service
units on the affected hosts in a failed state:
martchus@openqaworker6:~> sudo systemctl status openqa-reload-worker-auto-restart@20.service
● openqa-reload-worker-auto-restart@20.service - Restarts openqa-worker-auto-restart@20.service as soon as possible without interrupting jobs
Loaded: loaded (/usr/lib/systemd/system/openqa-reload-worker-auto-restart@.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2021-08-09 13:02:20 CEST; 3h 30min ago
Process: 31682 ExecStart=/usr/bin/systemctl reload openqa-worker-auto-restart@20.service (code=exited, status=1/FAILURE)
Main PID: 31682 (code=exited, status=1/FAILURE)
Aug 09 13:02:19 openqaworker6 systemd[1]: Starting Restarts openqa-worker-auto-restart@20.service as soon as possible without interrupting jobs...
Aug 09 13:02:20 openqaworker6 systemctl[31682]: Job for openqa-worker-auto-restart@20.service canceled.
Aug 09 13:02:20 openqaworker6 systemd[1]: openqa-reload-worker-auto-restart@20.service: Main process exited, code=exited, status=1/FAILURE
Aug 09 13:02:20 openqaworker6 systemd[1]: Failed to start Restarts openqa-worker-auto-restart@20.service as soon as possible without interrupting jobs.
Aug 09 13:02:20 openqaworker6 systemd[1]: openqa-reload-worker-auto-restart@20.service: Unit entered failed state.
Aug 09 13:02:20 openqaworker6 systemd[1]: openqa-reload-worker-auto-restart@20.service: Failed with result 'exit-code'.
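The failed units themselves can be cleared manually via systemctl's reset-failed command; a minimal sketch, assuming all reload units on the affected host should be cleared:

sudo systemctl reset-failed 'openqa-reload-worker-auto-restart@*.service'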
This service is only supposed to reload the actual worker service to apply configuration changes. The reload fails because the actual worker service is inactive (which is of course intended):
martchus@openqaworker6:~> sudo systemctl status openqa-worker-auto-restart@20.service
● openqa-worker-auto-restart@20.service - openQA Worker #20
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
└─20-nvme-autoformat.conf
Active: inactive (dead)
Aug 05 08:11:11 openqaworker6 worker[7202]: - pool directory: /var/lib/openqa/pool/20
Aug 05 08:11:11 openqaworker6 worker[7202]: [info] [pid:7202] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Aug 05 08:11:11 openqaworker6 worker[7202]: [info] [pid:7202] Project dir for host openqa.suse.de is /var/lib/openqa/share
Aug 05 08:11:11 openqaworker6 worker[7202]: [info] [pid:7202] Registering with openQA openqa.suse.de
Aug 05 08:11:12 openqaworker6 worker[7202]: [info] [pid:7202] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1194
Aug 05 08:11:12 openqaworker6 worker[7202]: [info] [pid:7202] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1194
Aug 09 13:02:20 openqaworker6 worker[7202]: [info] [pid:7202] Received signal TERM
Aug 09 13:02:20 openqaworker6 worker[7202]: [debug] [pid:7202] Informing openqa.suse.de that we are going offline
Aug 09 13:02:20 openqaworker6 systemd[1]: Stopping openQA Worker #20...
Aug 09 13:02:20 openqaworker6 systemd[1]: Stopped openQA Worker #20.
I assume the problem is that openqa-reload-worker-auto-restart@20.path
would have needed to be stopped as well, which it wasn't:
martchus@openqaworker6:~> sudo systemctl status openqa-reload-worker-auto-restart@20.path
● openqa-reload-worker-auto-restart@20.path
Loaded: loaded (/usr/lib/systemd/system/openqa-reload-worker-auto-restart@.path; static; vendor preset: disabled)
Active: active (waiting) since Wed 2021-08-04 15:06:03 CEST; 5 days ago
Aug 04 15:06:03 openqaworker6 systemd[1]: Started openqa-reload-worker-auto-restart@20.path.
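If that assumption holds, stopping the corresponding .path unit together with the worker service when a slot is removed should avoid the failed state; a minimal sketch for slot 20 (not something our salt states currently do):

sudo systemctl stop openqa-worker-auto-restart@20.service openqa-reload-worker-auto-restart@20.path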
Updated by mkittler over 3 years ago
- Subject changed from [alert] Reducing the number of worker slots leads to failing systemd units to Reducing the number of worker slots leads to failing systemd units
- Status changed from In Progress to New
- Assignee deleted (mkittler)
I've just merged the revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/341 and reset the failed units, so the current alert is dealt with. However, the problem would of course happen again any time we reduce the number of worker slots. I'll leave this ticket open for improving our salt states in that regard.
Of course it is not that important, as one can simply reset the failed units (a one-time action after reducing the number of slots).
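A possible direction for the salt state improvement: have the highstate explicitly stop the .path unit of any slot that is no longer configured. A hypothetical sketch, assuming a Jinja variable slot is derived from the pillar's worker configuration (the actual state layout in our salt repo may differ):

# Hypothetical: stop the reload .path unit of a decommissioned worker slot
# so it no longer triggers a failing reload of the (now inactive) worker service.
stop_reload_path_slot_{{ slot }}:
  service.dead:
    - name: openqa-reload-worker-auto-restart@{{ slot }}.path
    - enable: False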