action #133352
closed
Activating systemd target openqa-worker.target when openqa-worker-auto-restart@ is already used causes havoc size:M
Description
Observation
On the o3 worker openqaworker4 I recently called systemctl start openqa-worker.target. Soon afterwards this repeatedly caused many incomplete openQA jobs with messages like "Reason: abandoned: associated worker openqaworker4:10 re-connected but abandoned the job", for example https://openqa.opensuse.org/tests/3455867. The cause is that conflicting systemd services were started: openqa-worker, which is a link to openqa-worker-plain, and openqa-worker-auto-restart.
Acceptance criteria
- AC1: Starting openqa-worker.target does not cause conflicts with already existing openqa-worker-auto-restart
Suggestions
- Look into how the openqa-worker.target is generated by a script systemd/systemd-openqa-generator in openQA repo
- Maybe this can be solved by "documentation" that we need to update the symlink or something
- It is OK if services fail, which prevents the admin from doing stupid things, but we should prevent the situation where the services actually start, pick up jobs and incomplete them
Updated by okurz about 1 year ago
- Subject changed from Activating systemd target openqa-worker.target when openqa-worker-auto-restart@ is already used causes havoc to Activating systemd target openqa-worker.target when openqa-worker-auto-restart@ is already used causes havoc size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by jbaier_cz about 1 year ago
- Status changed from Workable to In Progress
I looked around on openqaworker4 to see the current setup. If I understood correctly, the problem is that both openqa-worker-plain@.service and openqa-worker-auto-restart@.service are PartOf=openqa-worker.target. The easy workaround would be to mask the other (unused) service, which, if I interpret the history correctly, was done on that particular worker; we can document this as the solution. However, I believe we can also improve this by introducing Conflicts= in our unit files. According to the documentation:
If a unit has a Conflicts= setting on another unit, starting the former will stop the latter and vice versa.
This should at least prevent both services from running simultaneously. In our case we are starting both services at the same time, so the following applies:
If unit A that conflicts with unit B is scheduled to be started at the same time as B, the transaction will either fail (in case both are required parts of the transaction) or be modified to be fixed (in case one or both jobs are not a required part of the transaction). In the latter case, the job that is not required will be removed, or in case both are not required, the unit that conflicts will be started and the unit that is conflicted is stopped.
I will need to test the behavior and find out where to put the Conflicts= for the best result.
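To make the Conflicts= idea concrete, here is a minimal sketch that reads the setting from unit-file text. The two unit fragments are hypothetical and trimmed to the [Unit] section; only PartOf=openqa-worker.target and the unit names come from this ticket, and the real files in the openQA repo contain more settings. The helper name unit_conflicts is also just an illustration.

```python
import configparser

# Hypothetical, trimmed fragments (assumption: the real unit files have more settings).
AUTO_RESTART_UNIT = """\
[Unit]
PartOf=openqa-worker.target
Conflicts=openqa-worker-plain@%i.service
"""

PLAIN_UNIT_WITHOUT_CONFLICTS = """\
[Unit]
PartOf=openqa-worker.target
"""


def unit_conflicts(unit_text):
    """Return the unit names listed in Conflicts= of the [Unit] section."""
    # interpolation=None keeps systemd specifiers such as %i untouched;
    # strict=False tolerates repeated keys, which unit files allow.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(unit_text)
    return parser.get("Unit", "Conflicts", fallback="").split()


if __name__ == "__main__":
    print(unit_conflicts(AUTO_RESTART_UNIT))
    print(unit_conflicts(PLAIN_UNIT_WITHOUT_CONFLICTS))
```

A check like this could flag a unit that is part of the target but declares no conflict with its counterpart. It does not model how systemd resolves the transaction, which still needs testing on a real worker.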
Updated by jbaier_cz about 1 year ago
- Related to action #109734: Better way to prevent conflicts between openqa-worker@ and openqa-worker-auto-restart@ variants size:M added
Updated by openqa_review about 1 year ago
- Due date set to 2023-09-20
Setting due date based on mean cycle time of SUSE QE Tools
Updated by jbaier_cz about 1 year ago
There is supposed to be a Conflicts= in our unit files; it was just wrongly added after a non-existent Wants= line. https://github.com/os-autoinst/openQA/pull/5295 should help.
Updated by jbaier_cz about 1 year ago
Just for the record, the Wants= line was removed in https://github.com/os-autoinst/openQA/pull/4577
Updated by jbaier_cz about 1 year ago
- Status changed from In Progress to Feedback
Updated by okurz about 1 year ago
PR merged. https://github.com/os-autoinst/openQA/pull/5298 is meant to prevent failed replacements from going unnoticed next time.
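The failure mode discussed here, a line that is meant to be inserted after an anchor that no longer exists, can be illustrated with a small sketch. It assumes a line-based insertion anchored on a prefix such as Wants=; the ticket does not show how openQA actually performs the replacement, and the function name insert_after and the sample unit text are hypothetical.

```python
def insert_after(text, anchor_prefix, new_line):
    """Insert new_line after the first line starting with anchor_prefix.

    Raises ValueError instead of silently returning the text unchanged,
    so a missing anchor cannot go unnoticed.
    """
    lines = text.splitlines()
    for idx, line in enumerate(lines):
        if line.startswith(anchor_prefix):
            lines.insert(idx + 1, new_line)
            return "\n".join(lines) + "\n"
    raise ValueError(f"anchor {anchor_prefix!r} not found; nothing was inserted")


if __name__ == "__main__":
    # Hypothetical unit text without a Wants= line.
    unit = "[Unit]\nPartOf=openqa-worker.target\n"
    conflicts = "Conflicts=openqa-worker-plain@%i.service"

    # Anchoring on a line that exists works.
    print(insert_after(unit, "PartOf=", conflicts))

    # Anchoring on the removed Wants= line fails loudly.
    try:
        insert_after(unit, "Wants=", conflicts)
    except ValueError as err:
        print(f"error: {err}")
```

Failing on a missing anchor turns a silent no-op into a visible build or test error, which is the general idea behind catching failed replacements early.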
Updated by jbaier_cz about 1 year ago
A different approach in https://github.com/os-autoinst/openQA/pull/5300
Updated by okurz about 1 year ago
https://github.com/os-autoinst/openQA/pull/5300 merged. I suggest you wait for that to be deployed to e.g. o3 workers and then test by starting the worker target.
Updated by jbaier_cz about 1 year ago
- Status changed from Feedback to Resolved
I did some tests. Manual activation of openqa-worker-auto-restart@X.service now stops the conflicting openqa-worker-plain@X.service before starting, and vice versa. Activating the worker target started the missing services (via the openqa-worker@X.service symlink) but did not start the conflicting one.