action #96683
[open] Reducing the number of worker slots leads to failing systemd units
Start date: 2021-08-09
Due date:
% Done: 0%
Estimated time:
Description
Reducing the number of worker slots leaves failed systemd units behind. For example, after the change https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/339/diffs?commit_id=9e6db9d883e73a699e2a4f195c73e1c346fdbb42, openqa-reload-worker-auto-restart@20.service and all other disabled openqa-reload-worker-auto-restart@….service units on the affected hosts ended up in the failed state:
martchus@openqaworker6:~> sudo systemctl status openqa-reload-worker-auto-restart@20.service
● openqa-reload-worker-auto-restart@20.service - Restarts openqa-worker-auto-restart@20.service as soon as possible without interrupting jobs
Loaded: loaded (/usr/lib/systemd/system/openqa-reload-worker-auto-restart@.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2021-08-09 13:02:20 CEST; 3h 30min ago
Process: 31682 ExecStart=/usr/bin/systemctl reload openqa-worker-auto-restart@20.service (code=exited, status=1/FAILURE)
Main PID: 31682 (code=exited, status=1/FAILURE)
Aug 09 13:02:19 openqaworker6 systemd[1]: Starting Restarts openqa-worker-auto-restart@20.service as soon as possible without interrupting jobs...
Aug 09 13:02:20 openqaworker6 systemctl[31682]: Job for openqa-worker-auto-restart@20.service canceled.
Aug 09 13:02:20 openqaworker6 systemd[1]: openqa-reload-worker-auto-restart@20.service: Main process exited, code=exited, status=1/FAILURE
Aug 09 13:02:20 openqaworker6 systemd[1]: Failed to start Restarts openqa-worker-auto-restart@20.service as soon as possible without interrupting jobs.
Aug 09 13:02:20 openqaworker6 systemd[1]: openqa-reload-worker-auto-restart@20.service: Unit entered failed state.
Aug 09 13:02:20 openqaworker6 systemd[1]: openqa-reload-worker-auto-restart@20.service: Failed with result 'exit-code'.
This service is only supposed to reload the actual worker service so that configuration changes are applied. The reload fails because the actual worker service is inactive (which is of course intended, since the slot was removed):
martchus@openqaworker6:~> sudo systemctl status openqa-worker-auto-restart@20.service
● openqa-worker-auto-restart@20.service - openQA Worker #20
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
└─20-nvme-autoformat.conf
Active: inactive (dead)
Aug 05 08:11:11 openqaworker6 worker[7202]: - pool directory: /var/lib/openqa/pool/20
Aug 05 08:11:11 openqaworker6 worker[7202]: [info] [pid:7202] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Aug 05 08:11:11 openqaworker6 worker[7202]: [info] [pid:7202] Project dir for host openqa.suse.de is /var/lib/openqa/share
Aug 05 08:11:11 openqaworker6 worker[7202]: [info] [pid:7202] Registering with openQA openqa.suse.de
Aug 05 08:11:12 openqaworker6 worker[7202]: [info] [pid:7202] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1194
Aug 05 08:11:12 openqaworker6 worker[7202]: [info] [pid:7202] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1194
Aug 09 13:02:20 openqaworker6 worker[7202]: [info] [pid:7202] Received signal TERM
Aug 09 13:02:20 openqaworker6 worker[7202]: [debug] [pid:7202] Informing openqa.suse.de that we are going offline
Aug 09 13:02:20 openqaworker6 systemd[1]: Stopping openQA Worker #20...
Aug 09 13:02:20 openqaworker6 systemd[1]: Stopped openQA Worker #20.
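One way to avoid the failure described above would be to reload the worker unit only while it is active. The following is a minimal sketch, not the shipped unit's behaviour: `reload_if_active` and the `SYSTEMCTL` override are hypothetical names introduced here for illustration. It would also not cover the case where the worker is stopped while the reload job is already queued, which the "Job ... canceled" message above suggests may have happened here.

```shell
#!/bin/sh
# Hypothetical helper, not part of openQA: reload a unit only if it is active.
# SYSTEMCTL can be overridden (e.g. with a stub) for dry runs and testing.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

reload_if_active() {
    unit=$1
    # "is-active --quiet" exits 0 only if the unit is currently active
    if $SYSTEMCTL is-active --quiet "$unit"; then
        $SYSTEMCTL reload "$unit"
    else
        # An inactive worker has nothing to reload, so do not fail
        echo "skipping reload: $unit is not active"
    fi
}
```

For the slot shown above, this would be called as `reload_if_active openqa-worker-auto-restart@20.service`.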
I assume the problem is that openqa-reload-worker-auto-restart@20.path should have been stopped as well, but it is still active:
martchus@openqaworker6:~> sudo systemctl status openqa-reload-worker-auto-restart@20.path
● openqa-reload-worker-auto-restart@20.path
Loaded: loaded (/usr/lib/systemd/system/openqa-reload-worker-auto-restart@.path; static; vendor preset: disabled)
Active: active (waiting) since Wed 2021-08-04 15:06:03 CEST; 5 days ago
Aug 04 15:06:03 openqaworker6 systemd[1]: Started openqa-reload-worker-auto-restart@20.path.
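If the assumption above holds, a manual cleanup on an affected host could stop the leftover path units and clear the failed state of the corresponding reload services. This is only a sketch under that assumption: `cleanup_removed_slots`, its two parameters, and the `SYSTEMCTL` override are hypothetical and not taken from the salt states.

```shell
#!/bin/sh
# Hypothetical cleanup helper: for every worker slot removed by a reduction
# from old_count to new_count, stop the path unit and reset the failed
# reload service. SYSTEMCTL can be overridden for a dry run.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

cleanup_removed_slots() {
    new_count=$1   # number of slots that remain configured
    old_count=$2   # number of slots before the reduction
    slot=$((new_count + 1))
    while [ "$slot" -le "$old_count" ]; do
        # Stop watching for configuration changes for this removed slot
        $SYSTEMCTL stop "openqa-reload-worker-auto-restart@${slot}.path"
        # Clear the failed state left behind by the earlier reload attempt
        $SYSTEMCTL reset-failed "openqa-reload-worker-auto-restart@${slot}.service"
        slot=$((slot + 1))
    done
}
```

Running `SYSTEMCTL=echo cleanup_removed_slots 18 20` prints the commands instead of executing them; the slot counts 18 and 20 are made-up example values, so the real old and new counts for each host would need to be taken from the pillar change.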