action #80986

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs

terminate worker process after executing all currently assigned jobs based on config/env variable

Added by okurz over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-12-11
Due date:
% Done:

0%

Estimated time:

Description

Motivation

See #80908

Acceptance criteria

  • DONE AC1: the worker process terminates after all currently assigned jobs are done based on config/env variable and the systemd service restarts the process
  • DONE AC2: the worker can still be stopped on shutdown or manual systemctl stop openqa-worker@$id
  • DONE AC3: the worker service by default is still kept running if no additional config variable is set
  • DONE AC4: the worker service by default is still restarted on package upgrades if no additional config variable is set
  • DONE AC5: deploy updated system configuration on OSD and o3

Suggestions

  • Add something like stop() if $ENV{OPENQA_WORKER_TERMINATE_AFTER_JOBS_DONE};
  • Then in https://github.com/os-autoinst/openQA/blob/master/systemd/openqa-worker%40.service#L15 set the env variable OPENQA_WORKER_TERMINATE_AFTER_JOBS_DONE and configure the service to restart not only on failure (or change worker code to actually stop with failure to trigger a restart)
  • Maybe the forced restart of the systemd service on package upgrade can actually be prevented in that mode as well, e.g. override ExecStop or something. Maybe needs to be extended to block normal terminate requests if the above variable is set
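As a rough illustration of the second suggestion, the following sketch writes a systemd drop-in that sets the variable and makes the service restart after a clean exit as well. The helper name `write_dropin`, the file name `terminate-after-jobs.conf` and the optional base directory argument are made up for this example. Whether the worker honors `OPENQA_WORKER_TERMINATE_AFTER_JOBS_DONE` is an assumption taken from the suggestion, not a confirmed implementation detail.

```shell
#!/bin/sh
# Hypothetical helper (not part of openQA): write a drop-in for one worker
# slot. The second argument allows writing to a scratch directory instead of
# the real override location /etc/systemd/system.
write_dropin() {
    slot="$1"
    base_dir="${2:-/etc/systemd/system}"
    dropin_dir="$base_dir/openqa-worker@$slot.service.d"
    mkdir -p "$dropin_dir" || return 1
    # Variable name as suggested in the ticket (assumed to be read by the
    # worker). Restart=always also restarts the unit after a clean exit.
    cat > "$dropin_dir/terminate-after-jobs.conf" <<'EOF'
[Service]
Environment=OPENQA_WORKER_TERMINATE_AFTER_JOBS_DONE=1
Restart=always
EOF
    # Print the path of the file that was written.
    printf '%s\n' "$dropin_dir/terminate-after-jobs.conf"
}
```

For example, `write_dropin 1 /tmp/units` writes the file below `/tmp/units` for inspection. On a real host a `systemctl daemon-reload` would be needed before the drop-in takes effect.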

Related issues 1 (0 open, 1 closed)

Blocks openQA Project - action #80910: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs (Resolved, mkittler, 2020-12-09)

Actions #1

Updated by okurz over 3 years ago

  • Description updated (diff)
Actions #2

Updated by mkittler over 3 years ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler over 3 years ago

  • Status changed from Workable to In Progress
Actions #4

Updated by openqa_review over 3 years ago

  • Due date set to 2020-12-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by mkittler over 3 years ago

The PR has been merged. However, the new service file hasn't been enabled in any production instance yet. The o3 workers are rebooted every night anyways so they are likely not of interest here. So I suppose I'll have to come up with a salt change for OSD and maybe I'll test the new service on one worker first.

Actions #6

Updated by livdywan over 3 years ago

mkittler wrote:

The PR has been merged. However, the new service file hasn't been enabled in any production instance yet. The o3 workers are rebooted every night anyways so they are likely not of interest here. So I suppose I'll have to come up with a salt change for OSD and maybe I'll test the new service on one worker first.

Technically that should be an infra ticket, but I think we assumed we want this on production here anyway... although I could offer looking into the salt part, if it helps, I'm getting used to adding stuff there anyways.

Actions #7

Updated by okurz over 3 years ago

Keep in mind that this change so far would restart the worker service to re-read the config. What it does not do is prevent jobs from being aborted on package upgrade. That is left for the parent epic.

Actions #8

Updated by okurz over 3 years ago

okurz wrote:

Keep in mind that this change so far would restart the worker service to re-read the config. What it does not do is prevent jobs from being aborted on package upgrade. That is left for the parent epic.

Correcting myself: the last part is part of this ticket, see AC4.

Actions #9

Updated by mkittler over 3 years ago

About AC4: The restarting is apparently achieved on package-level via the %service_del_postun macro. I don't intend to change this and will rely on the DISABLE_RESTART_ON_UPDATE setting in /etc/sysconfig/services on systems where we want to avoid interrupting running jobs when deploying updates.
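To check whether a host has opted out of restarts on update, something like the following sketch could be used. The function name `restart_on_update_disabled` is hypothetical, and treating the value `yes` as "restarts disabled" is an assumption about how the setting is evaluated.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if DISABLE_RESTART_ON_UPDATE is set to "yes"
# in the given sysconfig file (default: /etc/sysconfig/services).
# Pass an absolute path, because "." searches PATH for names without a slash.
restart_on_update_disabled() {
    conf="${1:-/etc/sysconfig/services}"
    [ -r "$conf" ] || return 1
    # Source the file in a subshell so nothing leaks into the caller.
    value=$(
        unset DISABLE_RESTART_ON_UPDATE
        . "$conf" >/dev/null 2>&1
        printf '%s' "${DISABLE_RESTART_ON_UPDATE:-}"
    )
    # Assumption: "yes" is the value that disables restarts.
    [ "$value" = "yes" ]
}
```

Usage could look like `restart_on_update_disabled && echo "services are not restarted on update"`.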

It looks like none of the ACs are about enabling this in production. Should I still do this as part of this ticket?

By the way: It makes no sense to do this before the Christmas break so I was focusing on further upstream changes anyways, see https://github.com/os-autoinst/openQA/pull/3641#issuecomment-748147996. This further PR is about #80910#note-7 but it takes a different approach than the ticket description suggests. However, in my opinion the "soft-restart" approach I took in that PR fits better with this ticket as it also terminates the worker and relies on the already introduced systemd service file to automatically start the service again. It means that in any case not only the configuration but also the worker code is reloaded.

Actions #10

Updated by okurz over 3 years ago

mkittler wrote:

About AC4: The restarting is apparently achieved on package-level via the %service_del_postun macro. I don't intend to change this and will rely on the DISABLE_RESTART_ON_UPDATE setting in /etc/sysconfig/services on systems where we want to avoid interrupting running jobs when deploying updates.

I consider /etc/sysconfig/services a bit too broad and would offer a solution to our users based on either configuration or selection of appropriate systemd service files.

It looks like none of the ACs are about enabling this in production. Should I still do this as part of this ticket?

I suggest you either do that within this ticket, or in another subticket you create as part of the epic, or write down in the epic what would need to be done as the next step if you don't plan to do that.

Actions #12

Updated by livdywan over 3 years ago

  • Due date changed from 2020-12-26 to 2021-01-08

Updating the due date to account for holidays

Actions #13

Updated by mkittler over 3 years ago

It looks like the automatic restarting on package updates mentioned in #80986#note-9 only works when openqa-worker.target is started. So if that target is not started (and only openqa-worker@.service or openqa-worker-auto-restart@.service instances are), there should be no interference.

Here are some further findings regarding systemd's behavior with openqa-worker.target, which we might need to take into account when switching to a different service file:

  1. When starting openqa-worker.target
    1. it always starts openqa-worker@.service (even if openqa-worker-auto-restart@.service is enabled and openqa-worker@.service is not).
    2. openqa-worker-auto-restart@.service is stopped if already running.
    3. only the first worker instance is started when starting openqa-worker.target, even if more worker instances were previously running.
  2. When stopping openqa-worker.target
    1. openqa-worker@.service and openqa-worker-auto-restart@.service are stopped and all instances are affected.
  3. When starting openqa-worker-auto-restart@1 while openqa-worker@1.service is running the latter is stopped automatically as expected. openqa-worker.target remains active if it was active (but it remains active even after manually stopping all worker instances anyways).
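To observe the behavior listed above on a test host, it can help to see which service variant each worker slot currently uses. The filter below is a minimal sketch for that; the function name `worker_slots_by_variant` is made up, and it only assumes that unit names follow the `openqa-worker@N.service` and `openqa-worker-auto-restart@N.service` patterns mentioned in this ticket.

```shell
#!/bin/sh
# Hypothetical filter: reads `systemctl list-units` style output on stdin and
# prints "<slot> <unit prefix>" for every worker unit found. Other lines are
# dropped.
worker_slots_by_variant() {
    sed -n -e 's|.*\(openqa-worker\(-auto-restart\)\{0,1\}\)@\([^.]*\)\.service.*|\3 \1|p'
}
```

It could be fed with `systemctl list-units --no-legend 'openqa-worker*@*.service' | worker_slots_by_variant`, once before and once after starting openqa-worker.target, to compare the results.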

PR to document this: https://github.com/os-autoinst/openQA/pull/3667

Actions #14

Updated by livdywan over 3 years ago

  • Description updated (diff)
  • Due date deleted (2021-01-08)

We agreed we want the ticket to cover deployment on production, right? Please correct me if I'm wrong. Just trying to make it clearer what's still missing.

Actions #15

Updated by livdywan over 3 years ago

  • Blocks action #80910: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs added
Actions #16

Updated by openqa_review over 3 years ago

  • Due date set to 2021-02-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by livdywan about 3 years ago

cdywan wrote:

We agreed we want the ticket to cover deployment on production, right? Please correct me if I'm wrong. Just trying to make it clearer what's still missing.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/429

Actions #18

Updated by mkittler about 3 years ago

AC5: The auto-restarting service is now used on OSD workers.

Actions #19

Updated by mkittler about 3 years ago

  • Description updated (diff)
  1. I'd wait to see how well it works on OSD before (manually) reconfiguring the o3 workers (AC5).
  2. Note that since the solution involves using a different service, AC2 is not fulfilled. However, I'd refrain from updating the regular worker service for now (to add Restart=always).
  3. As discussed in the epic, it would be very desirable to cover idling workers as well to avoid running one more job (per worker slot) with the old configuration. See the epic for this (as it is not covered by any of this ticket's ACs).
Actions #20

Updated by mkittler about 3 years ago

  • Status changed from In Progress to Resolved

Configured on o3 yesterday via:

for i in aarch64 openqaworker1 openqaworker4 openqaworker7 power8 rebel imagetester; do
  echo $i && sshpass -p … ssh root@$i 'for worker_slot in $(systemctl list-units '\''openqa-worker@*.service'\'' | sed -e '\''/.*openqa-worker@.*\.service.*/!d'\'' -e '\''s|.*openqa-worker@\(.*\)\.service.*|\1|'\''); do systemctl disable openqa-worker@$worker_slot && systemctl enable openqa-worker-auto-restart@$worker_slot; done'
done

Run the same command with swapped service names to revert (if necessary). So far it looks like the change survived the reboot and didn't break anything.
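To make the direction of the swap explicit, the per-slot part could be wrapped in a small function that takes both unit names. This is only a sketch: `swap_worker_units` is a made-up name, and it prints the commands instead of running them so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the systemctl commands that move the
# given worker slots from one unit template to another.
# Usage: swap_worker_units FROM TO SLOT...
swap_worker_units() {
    from="$1"
    to="$2"
    shift 2
    for slot in "$@"; do
        printf 'systemctl disable %s@%s && systemctl enable %s@%s\n' \
            "$from" "$slot" "$to" "$slot"
    done
}
```

For the revert, `swap_worker_units openqa-worker-auto-restart openqa-worker 1 2` prints the commands for slots 1 and 2. Piping the output to `sh` on the worker host would execute them.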

Actions #21

Updated by mkittler about 3 years ago

  • Description updated (diff)
Actions #22

Updated by okurz about 3 years ago

  • Due date deleted (2021-02-06)