action #133352

closed

Activating systemd target openqa-worker.target when openqa-worker-auto-restart@ is already used causes havoc size:M

Added by okurz 9 months ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2023-07-26
Due date:
% Done:

0%

Estimated time:

Description

Observation

On the o3 worker openqaworker4 I recently called systemctl start openqa-worker.target. Soon afterwards this repeatedly caused many incomplete openQA jobs with messages like "Reason: abandoned: associated worker openqaworker4:10 re-connected but abandoned the job", for example https://openqa.opensuse.org/tests/3455867 . The cause is that conflicting systemd services start: openqa-worker, which is a link to openqa-worker-plain, and openqa-worker-auto-restart.

Acceptance criteria

  • AC1: Starting openqa-worker.target does not cause conflicts with already existing openqa-worker-auto-restart

Suggestions

  • Look into how the openqa-worker.target is generated by a script systemd/systemd-openqa-generator in openQA repo
  • Maybe this can be solved by "documentation" that we need to update the symlink or something
  • It is ok if services fail, preventing the admin from doing stupid things, but we should prevent the situation where the services actually start, pick up jobs and incomplete them

Related issues 1 (0 open, 1 closed)

Related to openQA Project - action #109734: Better way to prevent conflicts between openqa-worker@ and openqa-worker-auto-restart@ variants size:M (Resolved, jbaier_cz, 2022-04-09)

Actions #1

Updated by okurz 9 months ago

  • Subject changed from Activating systemd target openqa-worker.target when openqa-worker-auto-restart@ is already used causes havoc to Activating systemd target openqa-worker.target when openqa-worker-auto-restart@ is already used causes havoc size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by jbaier_cz 8 months ago

  • Assignee set to jbaier_cz
Actions #3

Updated by jbaier_cz 8 months ago

  • Status changed from Workable to In Progress

I looked around on openqaworker4 to see the current setup. If I understood correctly, the problem is that both openqa-worker-plain@.service and openqa-worker-auto-restart@.service are PartOf=openqa-worker.target. The easy workaround is to mask the other (unused) service. If I interpret the history correctly, that was done on that particular worker, so we can document this as the solution. However, I believe we can also improve on this by introducing Conflicts= in our unit files. According to the documentation:

If a unit has a Conflicts= setting on another unit, starting the former will stop the latter and vice versa.

This should at least prevent both services from running simultaneously. In our case we are starting both services, so the following applies:

If unit A that conflicts with unit B is scheduled to be started at the same time as B, the transaction will either fail (in case both are required parts of the transaction) or be modified to be fixed (in case one or both jobs are not a required part of the transaction). In the latter case, the job that is not required will be removed, or in case both are not required, the unit that conflicts will be started and the unit that is conflicted is stopped.

I will need to test the behavior and find out where to put the Conflicts= for the best result.
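To make the idea concrete, here is a rough sketch of how one could check whether a unit file declares Conflicts= on its counterpart. The unit contents below are hypothetical and not copied from the openQA repository, and the helper names unit_directives and declares_conflict are made up for illustration. According to the documentation quoted above, a conflict declared on one side also applies in the other direction, so finding it in one of the two files should be enough.

```python
def unit_directives(unit_text, section="Unit"):
    """Collect the directives of one section of a systemd unit file."""
    directives = {}
    current = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]  # entering a new section
            continue
        if current == section and "=" in line:
            key, value = line.split("=", 1)
            # Values are space-separated lists of unit names.
            directives.setdefault(key.strip(), []).extend(value.split())
    return directives


def declares_conflict(unit_text, other_unit):
    """True if the [Unit] section lists other_unit in Conflicts=."""
    return other_unit in unit_directives(unit_text).get("Conflicts", [])


# Hypothetical unit contents, for illustration only.
PLAIN = """\
[Unit]
Description=openQA Worker #%i
PartOf=openqa-worker.target
Conflicts=openqa-worker-auto-restart@%i.service
"""

AUTO_RESTART = """\
[Unit]
Description=openQA Worker #%i (auto-restart)
PartOf=openqa-worker.target
"""

print(declares_conflict(PLAIN, "openqa-worker-auto-restart@%i.service"))
print(declares_conflict(AUTO_RESTART, "openqa-worker-plain@%i.service"))
```

This only inspects the text of the unit files; what systemd does at runtime still needs to be tested on a worker.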

Actions #4

Updated by jbaier_cz 8 months ago

  • Related to action #109734: Better way to prevent conflicts between openqa-worker@ and openqa-worker-auto-restart@ variants size:M added
Actions #5

Updated by openqa_review 8 months ago

  • Due date set to 2023-09-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by jbaier_cz 8 months ago

There is supposed to be Conflicts= in our unit files; we just add it in the wrong place, after a Wants= line that does not exist. https://github.com/os-autoinst/openQA/pull/5295 should help.

Actions #7

Updated by jbaier_cz 8 months ago

Just for the record, the Wants= line was removed in https://github.com/os-autoinst/openQA/pull/4577

Actions #8

Updated by jbaier_cz 8 months ago

  • Status changed from In Progress to Feedback
Actions #9

Updated by okurz 8 months ago

PR merged. https://github.com/os-autoinst/openQA/pull/5298 should prevent failed replacements from going unnoticed next time.
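As an illustration of the general idea (not necessarily how the PR implements it), a replacement step can refuse to continue when the line it wants to anchor on is missing, instead of silently leaving the file unchanged. The function name insert_after_marker and the unit text are hypothetical.

```python
def insert_after_marker(unit_text, marker, new_line):
    """Insert new_line after the first line starting with marker.

    Raises ValueError when the marker is absent, so a stale marker
    (such as a removed Wants= line) cannot go unnoticed.
    """
    lines = unit_text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith(marker):
            lines.insert(index + 1, new_line)
            return "\n".join(lines) + "\n"
    raise ValueError(f"marker {marker!r} not found; nothing inserted")


# Hypothetical unit text without a Wants= line.
UNIT = """\
[Unit]
Description=openQA Worker #%i
PartOf=openqa-worker.target
"""

CONFLICT = "Conflicts=openqa-worker-auto-restart@%i.service"

try:
    insert_after_marker(UNIT, "Wants=", CONFLICT)
except ValueError as error:
    print(f"replacement failed loudly: {error}")

# Anchoring on a line that does exist works as expected.
print(insert_after_marker(UNIT, "PartOf=", CONFLICT))
```

A plain search-and-replace that matches nothing usually still exits successfully, which is how a missing Conflicts= line can slip through a build.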

Actions #10

Updated by jbaier_cz 8 months ago

Actions #11

Updated by okurz 8 months ago

https://github.com/os-autoinst/openQA/pull/5300 merged. I suggest you wait for that to be deployed to e.g. the o3 workers and then test by starting the worker target.

Actions #12

Updated by jbaier_cz 8 months ago

  • Status changed from Feedback to Resolved

I did some tests. Manually activating openqa-worker-auto-restart@X.service now stops the conflicting openqa-worker-plain@X.service before starting, and vice versa. Activating the worker target started the missing services (via the openqa-worker@X.service symlink) but did not start the conflicting one.

Actions #13

Updated by jbaier_cz 8 months ago

  • Due date deleted (2023-09-20)