action #96683: Reducing the number of worker slots leads to failing systemd units
Status: open
Description
Reducing the number of worker slots leaves behind failed systemd units. For example, the change https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/339/diffs?commit_id=9e6db9d883e73a699e2a4f195c73e1c346fdbb42 left openqa-reload-worker-auto-restart@20.service
and all other now-disabled openqa-reload-worker-auto-restart@….service
units on the affected hosts in a failed state:
martchus@openqaworker6:~> sudo systemctl status openqa-reload-worker-auto-restart@20.service
● openqa-reload-worker-auto-restart@20.service - Restarts openqa-worker-auto-restart@20.service as soon as possible without interrupting jobs
Loaded: loaded (/usr/lib/systemd/system/openqa-reload-worker-auto-restart@.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2021-08-09 13:02:20 CEST; 3h 30min ago
Process: 31682 ExecStart=/usr/bin/systemctl reload openqa-worker-auto-restart@20.service (code=exited, status=1/FAILURE)
Main PID: 31682 (code=exited, status=1/FAILURE)
Aug 09 13:02:19 openqaworker6 systemd[1]: Starting Restarts openqa-worker-auto-restart@20.service as soon as possible without interrupting jobs...
Aug 09 13:02:20 openqaworker6 systemctl[31682]: Job for openqa-worker-auto-restart@20.service canceled.
Aug 09 13:02:20 openqaworker6 systemd[1]: openqa-reload-worker-auto-restart@20.service: Main process exited, code=exited, status=1/FAILURE
Aug 09 13:02:20 openqaworker6 systemd[1]: Failed to start Restarts openqa-worker-auto-restart@20.service as soon as possible without interrupting jobs.
Aug 09 13:02:20 openqaworker6 systemd[1]: openqa-reload-worker-auto-restart@20.service: Unit entered failed state.
Aug 09 13:02:20 openqaworker6 systemd[1]: openqa-reload-worker-auto-restart@20.service: Failed with result 'exit-code'.
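The failed units themselves can be cleared manually via systemctl's reset-failed command; a minimal sketch, assuming all reload units on the affected host should be cleared:

sudo systemctl reset-failed 'openqa-reload-worker-auto-restart@*.service'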
This service is only supposed to reload the actual worker service to apply configuration changes. The reload fails because the actual worker service is inactive (which is of course intended):
martchus@openqaworker6:~> sudo systemctl status openqa-worker-auto-restart@20.service
● openqa-worker-auto-restart@20.service - openQA Worker #20
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
└─20-nvme-autoformat.conf
Active: inactive (dead)
Aug 05 08:11:11 openqaworker6 worker[7202]: - pool directory: /var/lib/openqa/pool/20
Aug 05 08:11:11 openqaworker6 worker[7202]: [info] [pid:7202] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Aug 05 08:11:11 openqaworker6 worker[7202]: [info] [pid:7202] Project dir for host openqa.suse.de is /var/lib/openqa/share
Aug 05 08:11:11 openqaworker6 worker[7202]: [info] [pid:7202] Registering with openQA openqa.suse.de
Aug 05 08:11:12 openqaworker6 worker[7202]: [info] [pid:7202] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1194
Aug 05 08:11:12 openqaworker6 worker[7202]: [info] [pid:7202] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1194
Aug 09 13:02:20 openqaworker6 worker[7202]: [info] [pid:7202] Received signal TERM
Aug 09 13:02:20 openqaworker6 worker[7202]: [debug] [pid:7202] Informing openqa.suse.de that we are going offline
Aug 09 13:02:20 openqaworker6 systemd[1]: Stopping openQA Worker #20...
Aug 09 13:02:20 openqaworker6 systemd[1]: Stopped openQA Worker #20.
I assume the problem is that openqa-reload-worker-auto-restart@20.path
would have needed to be stopped as well, which it wasn't:
martchus@openqaworker6:~> sudo systemctl status openqa-reload-worker-auto-restart@20.path
● openqa-reload-worker-auto-restart@20.path
Loaded: loaded (/usr/lib/systemd/system/openqa-reload-worker-auto-restart@.path; static; vendor preset: disabled)
Active: active (waiting) since Wed 2021-08-04 15:06:03 CEST; 5 days ago
Aug 04 15:06:03 openqaworker6 systemd[1]: Started openqa-reload-worker-auto-restart@20.path.
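If that assumption holds, stopping the corresponding .path unit together with the worker service when a slot is removed should avoid the failed state; a minimal sketch for slot 20 (not something our salt states currently do):

sudo systemctl stop openqa-worker-auto-restart@20.service openqa-reload-worker-auto-restart@20.path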
Updated by mkittler over 3 years ago
- Subject changed from [alert] Reducing the number of worker slots leads to failing systemd units to Reducing the number of worker slots leads to failing systemd units
- Status changed from In Progress to New
- Assignee deleted (mkittler)
I've just merged the revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/341 and reset the failed units, so the current alert is dealt with. However, the problem would of course happen again any time we reduce the number of worker slots. I'll leave this ticket open for improving our salt states in that regard.
Of course it is not that important, as one can simply reset the failed units (a one-time action after reducing the number of slots).
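A possible direction for the salt state improvement: have the highstate explicitly stop the .path unit of any slot that is no longer configured. A hypothetical sketch, assuming a Jinja variable slot is derived from the pillar's worker configuration (the actual state layout in our salt repo may differ):

# Hypothetical: stop the reload .path unit of a decommissioned worker slot
# so it no longer triggers a failing reload of the (now inactive) worker service.
stop_reload_path_slot_{{ slot }}:
  service.dead:
    - name: openqa-reload-worker-auto-restart@{{ slot }}.path
    - enable: False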