action #93138
openqa-reload-worker-auto-restart and openqa-worker-auto-restart fail if numofworkers is reduced
Start date: 2021-05-26
Due date:
% Done: 0%
Estimated time:
Description
With https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/320/diffs#0a0ec388f05cea3ca9a77dd911a33555ec192597_38_38 @mgriessmeier reduced the numofworkers for openqaworker2. Soon after, we received an alert that several services on this machine were failing. On closer investigation, I see the following services failing:
● openqa-reload-worker-auto-restart@36.service loaded failed failed Restarts openqa-worker-auto-restart@36.service as soon as possible without interrupting jobs
● openqa-reload-worker-auto-restart@37.service loaded failed failed Restarts openqa-worker-auto-restart@37.service as soon as possible without interrupting jobs
● openqa-reload-worker-auto-restart@38.service loaded failed failed Restarts openqa-worker-auto-restart@38.service as soon as possible without interrupting jobs
● openqa-worker-auto-restart@37.service loaded failed failed openQA Worker #37
(Note: I don't understand why there is neither an openqa-worker-auto-restart@36.service nor an openqa-worker-auto-restart@38.service. Is this a setup failure?)
These seem to be the services of the worker instances that were dropped when numofworkers was reduced.
Example output for one of these services:
openqaworker2:~ # systemctl status openqa-reload-worker-auto-restart@36.service
● openqa-reload-worker-auto-restart@36.service - Restarts openqa-worker-auto-restart@36.service as soon as possible without interrupting jobs
Loaded: loaded (/usr/lib/systemd/system/openqa-reload-worker-auto-restart@.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2021-05-26 13:44:13 CEST; 49min ago
Process: 21839 ExecStart=/usr/bin/systemctl reload openqa-worker-auto-restart@36.service (code=exited, status=1/FAILURE)
Main PID: 21839 (code=exited, status=1/FAILURE)
May 26 13:44:12 openqaworker2 systemd[1]: Starting Restarts openqa-worker-auto-restart@36.service as soon as possible without interrupting jobs...
May 26 13:44:13 openqaworker2 systemctl[21839]: Job for openqa-worker-auto-restart@36.service canceled.
May 26 13:44:13 openqaworker2 systemd[1]: openqa-reload-worker-auto-restart@36.service: Main process exited, code=exited, status=1/FAILURE
May 26 13:44:13 openqaworker2 systemd[1]: Failed to start Restarts openqa-worker-auto-restart@36.service as soon as possible without interrupting jobs.
May 26 13:44:13 openqaworker2 systemd[1]: openqa-reload-worker-auto-restart@36.service: Unit entered failed state.
May 26 13:44:13 openqaworker2 systemd[1]: openqa-reload-worker-auto-restart@36.service: Failed with result 'exit-code'.
and
openqaworker2:~ # systemctl status openqa-worker-auto-restart@37.service
● openqa-worker-auto-restart@37.service - openQA Worker #37
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
└─20-nvme-autoformat.conf
Active: failed (Result: timeout) since Wed 2021-05-26 13:45:43 CEST; 48min ago
Process: 18003 ExecStart=/usr/share/openqa/script/worker --instance 37 (code=killed, signal=KILL)
Main PID: 18003 (code=killed, signal=KILL)
May 26 13:45:43 openqaworker2 worker[18003]: [debug] [pid:22105] Optimizing /var/lib/openqa/pool/37/testresults/ImageMagick-116.png
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: State 'stop-sigterm' timed out. Killing.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Killing process 18003 (worker) with signal SIGKILL.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Killing process 22105 (worker) with signal SIGKILL.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Killing process 23223 (optipng) with signal SIGKILL.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Main process exited, code=killed, status=9/KILL
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Killing process 22105 (worker) with signal SIGKILL.
May 26 13:45:43 openqaworker2 systemd[1]: Stopped openQA Worker #37.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Unit entered failed state.
May 26 13:45:43 openqaworker2 systemd[1]: openqa-worker-auto-restart@37.service: Failed with result 'timeout'.
These services should handle a reduction of numofworkers more gracefully.
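As a possible manual workaround until this is handled properly, the leftover instances could be stopped and their failed state cleared. The following is only a sketch: the function names stale_worker_units and cleanup_stale_workers are hypothetical, and the old and new worker counts must be supplied by hand because this ticket does not state them. It runs in dry-run mode and only prints the systemctl commands instead of executing them.

```shell
#!/bin/sh
# Sketch only: helper names and arguments are assumptions, not part of openQA.

# Print the unit names of all worker instances above the new numofworkers.
# $1 = new numofworkers, $2 = previous numofworkers (both supplied by hand).
stale_worker_units() {
    new_count=$1
    old_count=$2
    i=$((new_count + 1))
    while [ "$i" -le "$old_count" ]; do
        printf '%s\n' \
            "openqa-reload-worker-auto-restart@$i.service" \
            "openqa-worker-auto-restart@$i.service"
        i=$((i + 1))
    done
}

# Dry run: print the commands that would stop each stale unit and
# clear its failed state, without executing anything.
cleanup_stale_workers() {
    for unit in $(stale_worker_units "$1" "$2"); do
        echo "systemctl stop $unit"
        echo "systemctl reset-failed $unit"
    done
}
```

For example, `cleanup_stale_workers 35 38` prints the commands for instances 36 to 38. The counts 35 and 38 are assumed here only because those are the instances failing above. After reviewing the output, it can be piped to `sh` on the affected worker host. Note that stopping a worker this way does not wait for a running job to finish.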