action #80910

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

action #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs

openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs

Added by okurz 10 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2020-12-09
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Motivation

We want to upgrade more often without disrupting running openQA jobs on package upgrades, and we want workers to re-read their configuration whenever a job finishes.

Acceptance criteria

  • AC1: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs

Suggestions

Maybe it's as easy as triggering a re-read of the config in the openQA worker service after a job finishes or before the worker looks for new jobs to pick up.


Related issues

Blocked by openQA Project - action #80986: terminate worker process after executing all currently assigned jobs based on config/env variable (Resolved, 2020-12-11)

History

#1 Updated by cdywan 10 months ago

  • Assignee set to cdywan

#2 Updated by okurz 10 months ago

cdywan please be aware of #80908#note-3 . We might be able to solve this story as well as the generic one to restart for upgrade by "terminate after executing all currently assigned jobs" and letting systemd restart and hence implicitly also load config again.

#3 Updated by openqa_review 10 months ago

  • Due date set to 2020-12-24

Setting due date based on mean cycle time of SUSE QE Tools

#4 Updated by cdywan 10 months ago

okurz wrote:

cdywan please be aware of #80908#note-3 . We might be able to solve this story as well as the generic one to restart for upgrade by "terminate after executing all currently assigned jobs" and letting systemd restart and hence implicitly also load config again.

Ack. We should probably have a ticket for that then since that's the epic.

#5 Updated by cdywan 10 months ago

  • Assignee deleted (cdywan)

cdywan wrote:

okurz wrote:

cdywan please be aware of #80908#note-3 . We might be able to solve this story as well as the generic one to restart for upgrade by "terminate after executing all currently assigned jobs" and letting systemd restart and hence implicitly also load config again.

Ack. We should probably have a ticket for that then since that's the epic.

Indeed we have #80986 now.

#6 Updated by okurz 10 months ago

  • Status changed from Workable to Blocked
  • Assignee set to mkittler

mkittler please check again after #80986 if this is implicitly done or not needed anymore.

#7 Updated by mkittler 9 months ago

  • Status changed from Blocked to In Progress

The phrasing "whenever they are ready to pick up new jobs" really calls for a solution which covers idling workers as well, like my PR https://github.com/os-autoinst/openQA/pull/3641. Since I've already created a PR, this ticket is no longer blocked.

#8 Updated by cdywan 9 months ago

  • Due date changed from 2020-12-24 to 2021-01-08

PR has not been merged yet. Updating due date to account for holidays.

See also !423 for the related salt change (not covered by this ticket)

#9 Updated by okurz 9 months ago

We (mkittler, cdywan, okurz) discussed this together. The mentioned PR is merged, but the caveat is that "one more job" will still run with the old config. mkittler will change the code accordingly.

#10 Updated by mkittler 9 months ago

I'd like to note that this even leaves one more caveat considering the example given in AC1:

AC1: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs

One further complication is that to actually apply the new WORKER_CLASS the worker does not only need to re-read the config but also to re-register with its web UIs. That should be easy but I'd like to mention it because the code changes will be a little bit more than expected.

There's another problem when it comes to triggering the re-reading. I would have implemented this feature so that the worker re-reads the config before starting a new job so the new job will definitely run under the new config to avoid the "one more job" problem mentioned in okurz's previous comment. That is usually fine except for settings which don't affect the job itself but the scheduling of further jobs like the WORKER_CLASS. Even if I also make the worker reload the config after finishing its current jobs the new WORKER_CLASS would still not be applied when a worker is idling. I could implement a periodic check for re-reading the config file while idling to solve this. I could also use Inotify using Linux::Perl::inotify or Linux::Inotify2. (None of them are currently in TW.)
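A minimal sketch of the periodic check mentioned above, in shell rather than in the worker's own code: it only compares the modification time of the config file against a previously recorded value. The function names `config_mtime` and `config_changed` are hypothetical, and `stat -c %Y` assumes GNU coreutils on a Linux worker host.

```shell
#!/bin/sh
# Hypothetical helper: print the modification time (seconds since epoch) of a file.
# `stat -c %Y` is the GNU coreutils form.
config_mtime() {
    stat -c %Y "$1"
}

# Hypothetical helper: succeed (exit 0) if the file's mtime differs from the
# previously recorded value.
#   $1: path of the config file
#   $2: mtime recorded at the last check
config_changed() {
    [ "$(config_mtime "$1")" != "$2" ]
}

# Sketch of an idle loop (not run here):
#   last=$(config_mtime /etc/openqa/workers.ini)
#   while sleep 10; do
#       if config_changed /etc/openqa/workers.ini "$last"; then
#           last=$(config_mtime /etc/openqa/workers.ini)
#           echo "config changed, re-read and re-register"
#       fi
#   done
```

Polling like this is simple but adds latency of up to one interval; an inotify-based approach would avoid that at the cost of an extra dependency.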

#11 Updated by cdywan 9 months ago

mkittler wrote:

I'd like to note that this even leaves one more caveat considering the example given in AC1:

AC1: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs

There's another problem when it comes to triggering the re-reading. I would have implemented this feature so that the worker re-reads the config before starting a new job so the new job will definitely run under the new config to avoid the "one more job" problem mentioned in okurz's previous comment. That is usually fine except for settings which don't affect the job itself but the scheduling of further jobs like the WORKER_CLASS. Even if I also make the worker reload the config after finishing its current jobs the new WORKER_CLASS would still not be applied when a worker is idling. I could implement a periodic check for re-reading the config file while idling to solve this. I could also use Inotify using Linux::Perl::inotify or Linux::Inotify2. (None of them are currently in TW.)

"read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs" could be taken to mean "read the file when there's a new job". No file monitoring required.

#12 Updated by cdywan 9 months ago

The systemd way suggested: PathModified which could emit a signal to terminate the worker (and also read the config as a side effect).
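A rough sketch of what such a setup could look like, written as a shell function that generates two template units into a directory. `PathModified=` is a documented systemd.path directive; the unit names `openqa-config-watch@.path` and `openqa-config-watch@.service` are hypothetical, and the choice of SIGHUP is an assumption based on the "Received signal HUP" log lines later in this ticket. `--kill-who=main` restricts the signal to the main process of the unit (newer systemd versions spell it `--kill-whom=`).

```shell
#!/bin/sh
# Write two hypothetical template units into the directory given as $1.
# A path unit activates the service of the same name by default, so
# openqa-config-watch@1.path triggers openqa-config-watch@1.service.
write_config_watch_units() {
    dir=$1
    cat > "$dir/openqa-config-watch@.path" <<'EOF'
[Unit]
Description=Watch workers.ini for openQA worker slot %i (hypothetical)

[Path]
PathModified=/etc/openqa/workers.ini

[Install]
WantedBy=multi-user.target
EOF
    cat > "$dir/openqa-config-watch@.service" <<'EOF'
[Unit]
Description=Signal openQA worker slot %i after a config change (hypothetical)

[Service]
Type=oneshot
# Assumption: SIGHUP makes the worker terminate after its current jobs.
ExecStart=/usr/bin/systemctl kill --signal=HUP --kill-who=main openqa-worker-auto-restart@%i.service
EOF
}
```

The generated files would then be copied to /etc/systemd/system and enabled per slot, e.g. with `systemctl enable --now openqa-config-watch@1.path`.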

#13 Updated by mkittler 9 months ago

read the file when there's a new job

As discussed this simplification is not possible. It would mean a job for the old WORKER_CLASS runs with the new configuration.

We also came to further conclusions:

  1. Using systemd to fire the signal is a conceivable idea.
  2. We should avoid the changes mentioned in #note-10 because that leads too far. The worker should not have to deal with watching its config file.
  3. For now I'm just going to enable OPENQA_WORKER_TERMINATE_AFTER_JOBS_DONE by switching to openqa-worker-auto-restart@.service to terminate and restart the worker after each job assignment has been processed (see https://github.com/os-autoinst/openQA/pull/3636). This means we will always run one more job with the old configuration (per idle worker slot) but it is likely better than nothing. This likely does not fulfill AC1 as it was originally meant but should be ok for now.

#14 Updated by mkittler 9 months ago

I've just done 3. from the previous comment on the o3 worker imagetester to see how it works in production.

To revert in case of problems, just use:

systemctl disable --now openqa-worker-auto-restart@{1,2}
systemctl enable --now openqa-worker@{1,2}

#15 Updated by mkittler 9 months ago

PR showing what the systemd way would look like: https://github.com/os-autoinst/openQA/pull/3666

#16 Updated by mkittler 9 months ago

openqa-worker-auto-restart@.service seems to work on imagetester. Note that the openqa-worker.target was not enabled/started on this machine but it is on other machines and interferes with enabling other worker services (see https://progress.opensuse.org/issues/80986#note-13). So it must be disabled before starting any custom service files in place of openqa-worker@.service. This disables the automatic restart of the service on package updates as well.
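The order of operations described here can be sketched as a small dry-run helper that only prints the commands instead of running them. The function name `print_switch_commands` is hypothetical, and the sketch assumes that worker slots are numbered consecutively from 1.

```shell
#!/bin/sh
# Print (do not run) the commands to switch worker slots 1..$1 from
# openqa-worker@ to openqa-worker-auto-restart@.
print_switch_commands() {
    slots=$1
    # The target has to go first, otherwise it interferes with the other services.
    echo "systemctl disable --now openqa-worker.target"
    i=1
    while [ "$i" -le "$slots" ]; do
        echo "systemctl disable --now openqa-worker@$i"
        echo "systemctl enable --now openqa-worker-auto-restart@$i"
        i=$((i + 1))
    done
}

print_switch_commands 2
```

Reviewing the printed commands before piping them to a shell keeps the switch easy to check on a production host.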

#17 Updated by cdywan 9 months ago

  • Due date changed from 2021-01-08 to 2021-01-15

#18 Updated by mkittler 9 months ago

By the way, here's a draft of what a switch to openqa-worker-auto-restart@.service would look like in salt: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/426/diffs

With this SR we still need to disable and stop the regular openqa-worker@.service manually (e.g. using salt -C 'G@roles:worker' cmd.run 'systemctl disable --now openqa-worker@*'). For the least interruption this could be done before the next deployment.

The users requesting this were only talking about OSD and we update o3 more frequently anyways. So I'm not going to apply usage of openqa-worker-auto-restart@.service on all o3 workers for now (unless someone says we want that).

#19 Updated by cdywan 8 months ago

  • Blocked by action #80986: terminate worker process after executing all currently assigned jobs based on config/env variable added

#20 Updated by cdywan 8 months ago

  • Due date deleted (2021-01-15)

I guess we're still waiting on #80986

#21 Updated by openqa_review 8 months ago

  • Due date set to 2021-02-06

Setting due date based on mean cycle time of SUSE QE Tools

#22 Updated by cdywan 8 months ago

  • Status changed from In Progress to Blocked

#23 Updated by mkittler 8 months ago

  • Status changed from Blocked to In Progress

#24 Updated by okurz 8 months ago

Both merged. Your salt change might help with #63874 as well, although I still hope for an easier solution than the sed+awk+tr magic. I had hoped we could rely on openqa-worker.target or something similar.

#25 Updated by mkittler 8 months ago

It does not work as expected in production. I edited workers.ini twice on openqaworker-arm-1; each time "Received signal HUP" was logged immediately, and the worker itself behaved as intended:

Feb 10 11:00:40 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:00:50 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:00 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:10 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:15 openqaworker-arm-1 worker[3961]: [info] [pid:3961] Received signal HUP
Feb 10 11:01:20 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:3961] Isotovideo exit status: 0
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] Stopping job 5442303 from openqa.suse.de: 05442303-sle-15-SP3-Full-aarch64-Build145.1-migration_offline_sle15sp2_ha_alpha_node02@aarch64 - reason: done
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:3961] +++ worker notes +++
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:3961] End time: 2021-02-10 11:01:21
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:3961] Result: done
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading vars.json
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading autoinst-log.txt
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading worker-log.txt
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading serial0.txt
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading video_time.vtt
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading serial_terminal.txt
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] Setting job 5442303 to done
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] Unable to read result-patch_sle.json: Can't open file "/var/lib/openqa/pool/1/testresults/result-patch_sle.json": No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1152.
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/set_done?reason=isotovideo+done%3A+isotovideo+received+signal+HUP&worker_id=476
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] Job 5442303 from openqa.suse.de finished - reason: done
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] Informing openqa.suse.de that we are going offline
Feb 10 11:01:23 openqaworker-arm-1 systemd[1]: openqa-worker-auto-restart@1.service: Service RestartSec=100ms expired, scheduling restart.
Feb 10 11:01:23 openqaworker-arm-1 systemd[1]: Stopped openQA Worker #1.
Feb 10 11:01:23 openqaworker-arm-1 systemd[1]: Starting openQA Worker #1...
Feb 10 11:01:23 openqaworker-arm-1 systemd[1]: Started openQA Worker #1.
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] worker 1:
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]:  - config file:           /etc/openqa/workers.ini
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]:  - worker hostname:       openqaworker-arm-1
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]:  - isotovideo version:    20
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]:  - websocket API version: 1
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]:  - web UI hosts:          openqa.suse.de
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]:  - class:                 qemu_aarch64,qemu_aarch64_slow_worker,tap,openqaworker-arm-1
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]:  - no cleanup:            no
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]:  - pool directory:        /var/lib/openqa/pool/1
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] Registering with openQA openqa.suse.de
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/476
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 476
Feb 10 11:01:36 openqaworker-arm-1 worker[35409]: [info] [pid:35409] Received signal HUP
Feb 10 11:01:36 openqaworker-arm-1 worker[35409]: [debug] [pid:35409] Informing openqa.suse.de that we are going offline
Feb 10 11:01:37 openqaworker-arm-1 systemd[1]: openqa-worker-auto-restart@1.service: Service RestartSec=100ms expired, scheduling restart.
Feb 10 11:01:37 openqaworker-arm-1 systemd[1]: Stopped openQA Worker #1.
Feb 10 11:01:37 openqaworker-arm-1 systemd[1]: Starting openQA Worker #1...
Feb 10 11:01:37 openqaworker-arm-1 systemd[1]: Started openQA Worker #1.
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: [info] [pid:35450] worker 1:
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]:  - config file:           /etc/openqa/workers.ini
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]:  - worker hostname:       openqaworker-arm-1
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]:  - isotovideo version:    20
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]:  - websocket API version: 1
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]:  - web UI hosts:          openqa.suse.de
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]:  - class:                 qemu_aarch64,qemu_aarch64_slow_worker,tap,openqaworker-arm-1
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]:  - no cleanup:            no
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]:  - pool directory:        /var/lib/openqa/pool/1
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: [info] [pid:35450] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: [info] [pid:35450] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: [info] [pid:35450] Registering with openQA openqa.suse.de

However, the job failed with

Result: failed finished 5 minutes ago ( 12:14 minutes )
Reason: isotovideo done: isotovideo received signal HUP

(https://openqa.suse.de/tests/5442303)

because systemd apparently sends the signal to the entire process group (and not just the worker process).
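The effect can be reproduced without systemd. The following sketch uses stand-ins only: a small shell script plays the worker (it traps HUP and keeps going) and a `sleep` child plays isotovideo. `run_demo` and `worker_script` are hypothetical names. In mode `main` only the stand-in worker is signalled; in mode `all` the child is signalled as well, which is roughly what happens when a signal reaches every process of a unit (the default of `systemctl kill` unless `--kill-who=main` is given).

```shell
#!/bin/sh
# Stand-in for the worker: handles HUP itself and runs a "job" (sleep) as a child.
worker_script='
trap "hup=1" HUP
sleep 3 &
job=$!
echo "$job" > "$1"
wait "$job"; status=$?
# wait returns early when a trapped signal arrives, so wait again while the job exists
while [ "$status" -gt 128 ] && kill -0 "$job" 2>/dev/null; do
    wait "$job"; status=$?
done
echo "job exit status: $status"
'

# $1 = "main": signal only the stand-in worker
# $1 = "all":  signal the stand-in worker and its job
run_demo() {
    pidfile=$(mktemp)
    sh -c "$worker_script" sh "$pidfile" &
    worker=$!
    sleep 1   # give the stand-in worker time to install its trap and start the job
    if [ "$1" = all ]; then
        kill -HUP "$worker" "$(cat "$pidfile")"
    else
        kill -HUP "$worker"
    fi
    wait "$worker"
    rm -f "$pidfile"
}
```

Calling `run_demo main` reports a job exit status of 0 because the job is left alone, whereas `run_demo all` reports a non-zero status because the job itself is killed by the signal.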

#27 Updated by okurz 8 months ago

From this morning, after the weekly automatic reboot:

# salt -l error --no-color -C 'G@roles:worker' cmd.run "systemctl list-units --failed | grep service"
openqaworker2.suse.de:
openqaworker8.suse.de:
    * openqa-worker@12.service loaded failed failed openQA Worker #12
openqaworker5.suse.de:
openqaworker9.suse.de:
QA-Power8-5-kvm.qa.suse.de:
openqaworker6.suse.de:
QA-Power8-4-kvm.qa.suse.de:
malbec.arch.suse.de:
grenache-1.qa.suse.de:
openqaworker10.suse.de:
openqaworker-arm-1.suse.de:
    * openqa-worker@2.service loaded failed failed openQA Worker #2
openqaworker-arm-3.suse.de:
    * openqa-worker@10.service loaded failed failed openQA Worker #10
    * openqa-worker@13.service loaded failed failed openQA Worker #13
    * openqa-worker@16.service loaded failed failed openQA Worker #16
    * openqa-worker@19.service loaded failed failed openQA Worker #19
    * openqa-worker@5.service  loaded failed failed openQA Worker #5 
    * openqa-worker@7.service  loaded failed failed openQA Worker #7 
    * openqa-worker@9.service  loaded failed failed openQA Worker #9
openqaworker-arm-2.suse.de:
    * openqa-worker@17.service loaded failed failed openQA Worker #17       
    * openqa-worker@2.service  loaded failed failed openQA Worker #2        
    * openqa-worker@6.service  loaded failed failed openQA Worker #6        
    * openqa-worker@9.service  loaded failed failed openQA Worker #9
powerqaworker-qam-1:
    Minion did not return. [Not connected]
openqaworker13.suse.de:
    Minion did not return. [Not connected]
openqaworker3.suse.de:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code

It seems there are multiple instances of openqa-worker@ that should not be running, since openqa-worker-auto-restart@ should be active instead. Can you check/explain/fix that?
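To get a compact overview from output like the above, a small filter can pair each failed unit with its host. This is only a sketch: the function name `list_failed_units` is hypothetical, and it assumes the layout shown here, i.e. an unindented `host:` line followed by indented `* unit.service ...` lines.

```shell
#!/bin/sh
# Read salt cmd.run output on stdin and print "<host> <unit>" per failed unit.
list_failed_units() {
    awk '
        # minion header line, e.g. "openqaworker8.suse.de:"
        /^[^[:space:]].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # failed unit line, e.g. "    * openqa-worker@12.service loaded failed failed ..."
        $1 == "*" && $2 ~ /\.service$/ { print host, $2 }
    '
}
```

It could be used as `salt -l error --no-color -C 'G@roles:worker' cmd.run "systemctl list-units --failed | grep service" | list_failed_units`.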

#28 Updated by cdywan 7 months ago

  • Due date changed from 2021-02-06 to 2021-02-19

mkittler Can you please check the issues mentioned?

#30 Updated by cdywan 7 months ago

  • Due date changed from 2021-02-19 to 2021-02-26

Should this be in Feedback?

#31 Updated by mkittler 7 months ago

I can continue working on this on the next OSD deployment. Until then I can't even get feedback, so it is actually blocked. The same goes for the parent ticket.

#32 Updated by okurz 7 months ago

  • Status changed from In Progress to Feedback

We use "Feedback" when we wait for defined events that need active checks by the assignee, e.g. "wait until after the next OSD deployment" is such a case. "In Progress" means that you are busy coding, researching, trying, debugging, etc.

#33 Updated by mkittler 7 months ago

  • Status changed from Feedback to Resolved

Works in production, see #80908#note-18

#34 Updated by okurz 7 months ago

  • Due date deleted (2021-02-26)
