action #177351
closed
s390x workers services failed to load properly
Description
Some info taken from the slack thread https://suse.slack.com/archives/C02CANHLANP/p1739768187218899
mgriessmeier@worker31:~> systemctl status openqa-worker@6
○ openqa-worker@6.service - openQA Worker #6
Loaded: error (Reason: Unit openqa-worker@6.service failed to load properly, please adjust/correct and reload service manager: File exists)
Active: inactive (dead)
mgriessmeier@worker31:~> systemctl status openqa-worker@{1,2,3,4,5,6,7,8,9,10}
○ openqa-worker@1.service - openQA Worker #1
Loaded: error (Reason: Unit openqa-worker@1.service failed to load properly, please adjust/correct and reload service manager: File exists)
Active: inactive (dead)
○ openqa-worker@2.service - openQA Worker #2
Loaded: error (Reason: Unit openqa-worker@2.service failed to load properly, please adjust/correct and reload service manager: File exists)
Active: inactive (dead)
○ openqa-worker@3.service - openQA Worker #3
Loaded: error (Reason: Unit openqa-worker@3.service failed to load properly, please adjust/correct and reload service manager: File exists)
Active: inactive (dead)
○ openqa-worker@4.service - openQA Worker #4
Loaded: error (Reason: Unit openqa-worker@4.service failed to load properly, please adjust/correct and reload service manager: File exists)
Active: inactive (dead)
○ openqa-worker@5.service - openQA Worker #5
Loaded: error (Reason: Unit openqa-worker@5.service failed to load properly, please adjust/correct and reload service manager: File exists)
Active: inactive (dead)
○ openqa-worker@6.service - openQA Worker #6
Loaded: error (Reason: Unit openqa-worker@6.service failed to load properly, please adjust/correct and reload service manager: File exists)
Active: inactive (dead)
Last log lines of openqa-worker@6:
Feb 16 03:31:29 worker31 systemd[1]: openqa-worker@6.service: State 'stop-sigterm' timed out. Killing.
Feb 16 03:31:29 worker31 systemd[1]: openqa-worker@6.service: Killing process 100046 (worker) with signal SIGKILL.
Feb 16 03:31:29 worker31 systemd[1]: openqa-worker@6.service: Main process exited, code=killed, status=9/KILL
Feb 16 03:31:29 worker31 systemd[1]: openqa-worker@6.service: Failed with result 'timeout'.
Feb 16 03:31:29 worker31 systemd[1]: Stopped openQA Worker #6.
Feb 16 03:31:29 worker31 systemd[1]: openqa-worker@6.service: Consumed 7h 58min 43.276s CPU time.
Not sure if it is connected, but I also saw:
Feb 16 02:44:15 worker31 salt-minion[117170]: "The x509 modules are deprecated. Please migrate to the replacement "
Feb 16 02:44:15 worker31 salt-minion[117170]: [WARNING ] remote: "worker37" found in workerconf.sls but not in salt mine, host currently offline?
Feb 16 02:44:15 worker31 salt-minion[117170]: [WARNING ] remote: "worker38" found in workerconf.sls but not in salt mine, host currently offline?
Feb 16 03:30:00 worker31 salt-minion[117170]: /usr/lib/python3.6/site-packages/salt/states/x509.py:214: DeprecationWarning: The x509 modules are deprecated. Please migrate to the replacement modules (x509_v2). They are the default from Salt 3008 (Argon) onwards.
Feb 16 03:30:00 worker31 salt-minion[117170]: "The x509 modules are deprecated. Please migrate to the replacement "
Feb 16 03:30:00 worker31 salt-minion[117170]: /usr/lib/python3.6/site-packages/salt/transport/zeromq.py:706: UserWarning: Unregistering FD 19 after closing socket. This could result in unregistering handlers for the wrong socket. Please use stream.close() instead of closing the socket directly.
Feb 16 03:30:00 worker31 salt-minion[117170]: self._monitor_stream.close()
Feb 16 03:30:00 worker31 salt-minion[117170]: [WARNING ] Minion received a SIGTERM. Exiting.
Feb 16 03:30:00 worker31 salt-minion[117170]: The Salt Minion is shutdown. Minion received a SIGTERM. Exited.
Feb 16 03:30:00 worker31 systemd[1]: Stopping The Salt Minion...
Feb 16 03:30:03 worker31 systemd[1]: salt-minion.service: Deactivated successfully.
Feb 16 03:30:03 worker31 systemd[1]: Stopped The Salt Minion.
Feb 16 03:30:03 worker31 systemd[1]: salt-minion.service: Consumed 24min 1.154s CPU time.
On salt-minion:
Active: active (running) since Sun 2025-02-16 03:36:43 UTC; 1 day 5h ago
The last commit on salt-pillars-openqa is https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/33abe48f2119e535bf405e5cd4616478e11f71dc but it also doesn't seem relevant.
Updated by gpathak 14 days ago · Edited
- Priority changed from Urgent to Normal
The openqa-worker-auto-restart@{1..63}.service units were masked, which caused the openqa-worker@{1..63}.service units to not restart automatically.
Lowering the priority since the worker slots are back online and have started running jobs.
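For anyone hitting the same symptom, here is a minimal sketch of how masked slots could be found and unmasked. The helper name `masked_units` is made up for illustration, and the usage in the comments assumes 63 slots (as in this comment), GNU `xargs`, and sudo rights on the worker host. It relies on `systemctl is-enabled` printing `masked` for a masked unit.

```shell
# Read "UNIT STATE" pairs on stdin and print only the units whose state is "masked".
# (masked_units is a hypothetical helper name, not part of openQA or systemd.)
masked_units() {
    awk '$2 == "masked" { print $1 }'
}

# Possible usage on a worker host (assumption: 63 slots and sudo rights):
#   for i in $(seq 1 63); do
#       u="openqa-worker-auto-restart@$i.service"
#       echo "$u $(systemctl is-enabled "$u" 2>/dev/null)"
#   done | masked_units | xargs -r sudo systemctl unmask
```

Unmasking alone does not start anything; whether the slots then need to be enabled and started by hand or come back through salt depends on the host setup.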
Updated by mkittler 14 days ago · Edited
Note that the openqa-worker-auto-restart@{1..63}.service slots are also supposed to be enabled and running. The openqa-worker@{1..63}.service services, on the other hand, are not directly used. Check out https://gitlab.suse.de/openqa/salt-states-openqa#remarks-about-the-systemd-units-used-to-start-workers for details. (And yes, I suppose you can check/enable/start openqa-worker@{1..63}.service as well because of the symlink /etc/systemd/system/openqa-worker@.service.)
Were all openqa-worker-auto-restart@{1..63}.service units from 1 to 63 really masked? This would be quite a mistake, but maybe someone (outside the tools team) did this masking while trying to fix the actual issue?
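To answer that question on a host, a rough sketch like the following could summarize how many slots are in each state. The helper name `count_states` is invented for this example, and the slot count of 63 in the commented usage is an assumption taken from the unit range quoted here.

```shell
# Read "UNIT STATE" pairs on stdin and print how many units are in each state,
# one "STATE COUNT" line per state, sorted by state name.
# (count_states is a hypothetical helper name.)
count_states() {
    awk '{ count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Possible usage on a worker host (assumption: 63 slots):
#   for i in $(seq 1 63); do
#       u="openqa-worker-auto-restart@$i.service"
#       echo "$u $(systemctl is-enabled "$u" 2>/dev/null)"
#   done | count_states
#
# The symlink mentioned in this comment can be inspected with:
#   readlink -f /etc/systemd/system/openqa-worker@.service
```

Swapping `is-enabled` for `is-active` in the loop gives the same kind of summary for the running state of the slots.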
Updated by gpathak 14 days ago · Edited
mkittler wrote in #note-6:
Note that the openqa-worker-auto-restart@{1..63}.service slots are also supposed to be enabled and running. The openqa-worker@{1..63}.service services, on the other hand, are not directly used. Check out https://gitlab.suse.de/openqa/salt-states-openqa#remarks-about-the-systemd-units-used-to-start-workers for details. (And yes, I suppose you can check/enable/start openqa-worker@{1..63}.service as well because of the symlink /etc/systemd/system/openqa-worker@.service.)
Were all openqa-worker-auto-restart@{1..63}.service units from 1 to 63 really masked? This would be quite a mistake, but maybe someone (outside the tools team) did this masking while trying to fix the actual issue?
Maybe I did it as part of #160095 and later forgot to unmask the workers.