action #80910
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs
openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs
Description
Motivation
We want to upgrade more often without disrupting openQA jobs on package upgrades, and workers should re-read their configuration whenever a job finishes.
Acceptance criteria
- AC1: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs
Suggestions
Maybe it's as easy as triggering a re-read of the config in the openQA worker service after a job finishes or before the worker looks for new jobs to pick up.
Updated by okurz almost 4 years ago
@cdywan please be aware of #80908#note-3 . We might be able to solve this story as well as the generic one to restart for upgrade by "terminate after executing all currently assigned jobs" and letting systemd restart and hence implicitly also load config again.
Updated by openqa_review almost 4 years ago
- Due date set to 2020-12-24
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan almost 4 years ago
okurz wrote:
@cdywan please be aware of #80908#note-3 . We might be able to solve this story as well as the generic one to restart for upgrade by "terminate after executing all currently assigned jobs" and letting systemd restart and hence implicitly also load config again.
Ack. We should probably have a ticket for that then since that's the epic.
Updated by livdywan almost 4 years ago
- Assignee deleted (livdywan)
cdywan wrote:
okurz wrote:
@cdywan please be aware of #80908#note-3 . We might be able to solve this story as well as the generic one to restart for upgrade by "terminate after executing all currently assigned jobs" and letting systemd restart and hence implicitly also load config again.
Ack. We should probably have a ticket for that then since that's the epic.
Indeed we have #80986 now.
Updated by mkittler almost 4 years ago
- Status changed from Blocked to In Progress
The phrasing "whenever they are ready to pick up new jobs" really calls for a solution which also covers idling workers, like my PR https://github.com/os-autoinst/openQA/pull/3641. Considering I've already created a PR, this is no longer blocked.
Updated by livdywan almost 4 years ago
- Due date changed from 2020-12-24 to 2021-01-08
PR has not been merged yet. Updating due date to account for holidays.
See also !423 for the related salt change (not covered by this ticket)
Updated by okurz almost 4 years ago
We (mkittler, cdywan, okurz) discussed this together. The mentioned PR is merged, but the caveat is that one more job will still run with the old config. @mkittler to change the code accordingly.
Updated by mkittler almost 4 years ago
I'd like to note that this even leaves one more caveat considering the example given in AC1:
AC1: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs
One further complication is that to actually apply a new WORKER_CLASS the worker needs not only to re-read the config but also to re-register with its web UIs. That should be easy, but I'd like to mention it because the code changes will be a little more than expected.
There's another problem when it comes to triggering the re-reading. I would have implemented this feature so that the worker re-reads the config before starting a new job, so the new job will definitely run under the new config, avoiding the "one more job" problem mentioned in @okurz's previous comment. That is usually fine, except for settings which don't affect the job itself but the scheduling of further jobs, like WORKER_CLASS. Even if I also make the worker reload the config after finishing its current jobs, a new WORKER_CLASS would still not be applied while a worker is idling. I could implement a periodic check for re-reading the config file while idling to solve this. I could also use inotify via Linux::Perl::inotify or Linux::Inotify2. (Neither of them is currently in TW.)
Updated by livdywan almost 4 years ago
mkittler wrote:
I'd like to note that this even leaves one more caveat considering the example given in AC1:
AC1: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs
There's another problem when it comes to triggering the re-reading. I would have implemented this feature so that the worker re-reads the config before starting a new job, so the new job will definitely run under the new config, avoiding the "one more job" problem mentioned in @okurz's previous comment. That is usually fine, except for settings which don't affect the job itself but the scheduling of further jobs, like WORKER_CLASS. Even if I also make the worker reload the config after finishing its current jobs, a new WORKER_CLASS would still not be applied while a worker is idling. I could implement a periodic check for re-reading the config file while idling to solve this. I could also use inotify via Linux::Perl::inotify or Linux::Inotify2. (Neither of them is currently in TW.)
read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs
could be taken to mean reading the file when there's a new job, i.e. no file monitoring would be required.
Updated by livdywan almost 4 years ago
The suggested systemd way: a PathModified path unit which could emit a signal to terminate the worker (and thereby also make it re-read the config as a side effect of restarting).
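A minimal sketch of what that could look like, assuming hypothetical unit names (openqa-worker-config.path/.service); this is purely illustrative and not the merged solution:
# /etc/systemd/system/openqa-worker-config.path (hypothetical)
[Unit]
Description=Watch the openQA worker configuration for changes
[Path]
# Activates the matching .service below whenever the file is modified
PathModified=/etc/openqa/workers.ini
[Install]
WantedBy=multi-user.target

# /etc/systemd/system/openqa-worker-config.service (hypothetical)
[Unit]
Description=Ask worker slots to terminate once their current jobs are done
[Service]
Type=oneshot
# HUP makes the worker finish its assigned jobs and exit; the auto-restart
# unit then brings it back up, which re-reads the config
ExecStart=/usr/bin/systemctl kill --signal=SIGHUP openqa-worker-auto-restart@*.service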
Updated by mkittler almost 4 years ago
read the file when there's a new job
As discussed, this simplification is not possible: it would mean a job matched against the old WORKER_CLASS runs with the new configuration.
We also came to further conclusions:
- Using systemd to fire the signal is a conceivable idea.
- We should avoid the changes mentioned in #note-10 because it leads too far. The worker should not have to deal with watching its config file.
- For now I'm just going to enable OPENQA_WORKER_TERMINATE_AFTER_JOBS_DONE by switching to openqa-worker-auto-restart@.service to terminate and restart the worker after each job assignment has been processed (see https://github.com/os-autoinst/openQA/pull/3636). This means we will always run one more job with the old configuration (per idle worker slot), but it is likely better than nothing. This likely does not fulfill AC1 as it was originally meant but should be ok for now.
Updated by mkittler almost 4 years ago
I've just done the third point from the previous comment on the o3 worker imagetester to see how it works in production.
To revert in case of problems, just use:
systemctl disable --now openqa-worker-auto-restart@{1,2}
systemctl enable --now openqa-worker@{1,2}
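For reference, switching slots over in the first place is the mirror image of the above (slot numbers to be adjusted as needed):
systemctl disable --now openqa-worker@{1,2}
systemctl enable --now openqa-worker-auto-restart@{1,2}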
Updated by mkittler almost 4 years ago
PR for how the systemd way would look like: https://github.com/os-autoinst/openQA/pull/3666
Updated by mkittler almost 4 years ago
openqa-worker-auto-restart@.service seems to work on imagetester. Note that the openqa-worker.target was not enabled/started on this machine, but it is on other machines and interferes with enabling other worker services (see https://progress.opensuse.org/issues/80986#note-13). So it must be disabled before starting any custom service files in place of openqa-worker@.service. This disables the automatic restart of the service on package updates as well.
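So the rough order of operations on such a machine would presumably be (illustrative, not copied from the actual deployment):
systemctl disable --now openqa-worker.target
systemctl disable --now openqa-worker@{1,2}
systemctl enable --now openqa-worker-auto-restart@{1,2}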
Updated by livdywan almost 4 years ago
- Due date changed from 2021-01-08 to 2021-01-15
Updated by mkittler almost 4 years ago
By the way, here's a draft for how a switch to openqa-worker-auto-restart@.service would look in salt: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/426/diffs
With this SR we still need to disable and stop the regular openqa-worker@.service manually (e.g. using salt -C 'G@roles:worker' cmd.run 'systemctl disable --now openqa-worker@*'). For the least interruption this could be done right before the next deployment.
The users requesting this were only talking about OSD, and we update o3 more frequently anyway. So I'm not going to apply openqa-worker-auto-restart@.service on all o3 workers for now (unless someone says we want that).
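After such a switch, one way to spot-check which unit flavour is active across the workers would be along these lines (illustrative):
salt -C 'G@roles:worker' cmd.run 'systemctl is-enabled openqa-worker@1 openqa-worker-auto-restart@1'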
Updated by livdywan almost 4 years ago
- Blocked by action #80986: terminate worker process after executing all currently assigned jobs based on config/env variable added
Updated by livdywan almost 4 years ago
- Due date deleted (2021-01-15)
I guess we're still waiting on #80986
Updated by openqa_review almost 4 years ago
- Due date set to 2021-02-06
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan almost 4 years ago
- Status changed from In Progress to Blocked
Updated by mkittler almost 4 years ago
- Status changed from Blocked to In Progress
- PR: https://github.com/os-autoinst/openQA/pull/3635
- SR to enable it on OSD: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/438
Updated by okurz almost 4 years ago
Both merged. Your salt change might help with #63874 as well, although I still hope for an easier solution than the sed+awk+tr magic. I had hoped we could rely on openqa-worker.target or something.
Updated by mkittler almost 4 years ago
It does not entirely work in production as expected. I've edited workers.ini twice on openqaworker-arm-1, Received signal HUP was logged immediately, and the behavior of the worker itself was as expected:
Feb 10 11:00:40 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:00:50 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:00 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:10 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:15 openqaworker-arm-1 worker[3961]: [info] [pid:3961] Received signal HUP
Feb 10 11:01:20 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:3961] Isotovideo exit status: 0
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] Stopping job 5442303 from openqa.suse.de: 05442303-sle-15-SP3-Full-aarch64-Build145.1-migration_offline_sle15sp2_ha_alpha_node02@aarch64 - reason: done
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:3961] +++ worker notes +++
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:3961] End time: 2021-02-10 11:01:21
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:3961] Result: done
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading vars.json
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading autoinst-log.txt
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading worker-log.txt
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading serial0.txt
Feb 10 11:01:21 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading video_time.vtt
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [info] [pid:35402] Uploading serial_terminal.txt
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] Setting job 5442303 to done
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] Unable to read result-patch_sle.json: Can't open file "/var/lib/openqa/pool/1/testresults/result-patch_sle.json": No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1152.
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/status
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5442303/set_done?reason=isotovideo+done%3A+isotovideo+received+signal+HUP&worker_id=476
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] Job 5442303 from openqa.suse.de finished - reason: done
Feb 10 11:01:22 openqaworker-arm-1 worker[3961]: [debug] [pid:3961] Informing openqa.suse.de that we are going offline
Feb 10 11:01:23 openqaworker-arm-1 systemd[1]: openqa-worker-auto-restart@1.service: Service RestartSec=100ms expired, scheduling restart.
Feb 10 11:01:23 openqaworker-arm-1 systemd[1]: Stopped openQA Worker #1.
Feb 10 11:01:23 openqaworker-arm-1 systemd[1]: Starting openQA Worker #1...
Feb 10 11:01:23 openqaworker-arm-1 systemd[1]: Started openQA Worker #1.
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] worker 1:
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: - config file: /etc/openqa/workers.ini
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: - worker hostname: openqaworker-arm-1
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: - isotovideo version: 20
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: - websocket API version: 1
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: - web UI hosts: openqa.suse.de
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: - class: qemu_aarch64,qemu_aarch64_slow_worker,tap,openqaworker-arm-1
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: - no cleanup: no
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: - pool directory: /var/lib/openqa/pool/1
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] Registering with openQA openqa.suse.de
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/476
Feb 10 11:01:26 openqaworker-arm-1 worker[35409]: [info] [pid:35409] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 476
Feb 10 11:01:36 openqaworker-arm-1 worker[35409]: [info] [pid:35409] Received signal HUP
Feb 10 11:01:36 openqaworker-arm-1 worker[35409]: [debug] [pid:35409] Informing openqa.suse.de that we are going offline
Feb 10 11:01:37 openqaworker-arm-1 systemd[1]: openqa-worker-auto-restart@1.service: Service RestartSec=100ms expired, scheduling restart.
Feb 10 11:01:37 openqaworker-arm-1 systemd[1]: Stopped openQA Worker #1.
Feb 10 11:01:37 openqaworker-arm-1 systemd[1]: Starting openQA Worker #1...
Feb 10 11:01:37 openqaworker-arm-1 systemd[1]: Started openQA Worker #1.
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: [info] [pid:35450] worker 1:
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: - config file: /etc/openqa/workers.ini
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: - worker hostname: openqaworker-arm-1
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: - isotovideo version: 20
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: - websocket API version: 1
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: - web UI hosts: openqa.suse.de
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: - class: qemu_aarch64,qemu_aarch64_slow_worker,tap,openqaworker-arm-1
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: - no cleanup: no
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: - pool directory: /var/lib/openqa/pool/1
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: [info] [pid:35450] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: [info] [pid:35450] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 10 11:01:40 openqaworker-arm-1 worker[35450]: [info] [pid:35450] Registering with openQA openqa.suse.de
However, the job failed with
Result: failed finished 5 minutes ago ( 12:14 minutes )
Reason: isotovideo done: isotovideo received signal HUP
(https://openqa.suse.de/tests/5442303)
because systemd apparently sends the signal to the entire process group (and not just the worker process).
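One conceivable way to avoid that is to deliver the signal only to the main worker process instead of the whole control group, e.g. (illustrative, not necessarily what the fix does):
systemctl kill --kill-who=main --signal=SIGHUP openqa-worker-auto-restart@1.service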
Updated by mkittler almost 4 years ago
This PR should fix it: https://github.com/os-autoinst/openQA/pull/3716
Updated by okurz almost 4 years ago
From this morning, after the weekly automatic reboot:
# salt -l error --no-color -C 'G@roles:worker' cmd.run "systemctl list-units --failed | grep service"
openqaworker2.suse.de:
openqaworker8.suse.de:
* openqa-worker@12.service loaded failed failed openQA Worker #12
openqaworker5.suse.de:
openqaworker9.suse.de:
QA-Power8-5-kvm.qa.suse.de:
openqaworker6.suse.de:
QA-Power8-4-kvm.qa.suse.de:
malbec.arch.suse.de:
grenache-1.qa.suse.de:
openqaworker10.suse.de:
openqaworker-arm-1.suse.de:
* openqa-worker@2.service loaded failed failed openQA Worker #2
openqaworker-arm-3.suse.de:
* openqa-worker@10.service loaded failed failed openQA Worker #10
* openqa-worker@13.service loaded failed failed openQA Worker #13
* openqa-worker@16.service loaded failed failed openQA Worker #16
* openqa-worker@19.service loaded failed failed openQA Worker #19
* openqa-worker@5.service loaded failed failed openQA Worker #5
* openqa-worker@7.service loaded failed failed openQA Worker #7
* openqa-worker@9.service loaded failed failed openQA Worker #9
openqaworker-arm-2.suse.de:
* openqa-worker@17.service loaded failed failed openQA Worker #17
* openqa-worker@2.service loaded failed failed openQA Worker #2
* openqa-worker@6.service loaded failed failed openQA Worker #6
* openqa-worker@9.service loaded failed failed openQA Worker #9
powerqaworker-qam-1:
Minion did not return. [Not connected]
openqaworker13.suse.de:
Minion did not return. [Not connected]
openqaworker3.suse.de:
Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
Seems there are multiple instances of openqa-worker@ that should not be running anymore since openqa-worker-auto-restart@ should be active instead. Can you check/explain/fix that?
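A manual cleanup along these lines would be conceivable (illustrative only, mirroring the disable command from above):
salt -C 'G@roles:worker' cmd.run 'systemctl disable --now openqa-worker@*'
salt -C 'G@roles:worker' cmd.run 'systemctl reset-failed openqa-worker@*'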
Updated by livdywan almost 4 years ago
- Due date changed from 2021-02-06 to 2021-02-19
@mkittler Can you please check the issues mentioned?
Updated by mkittler almost 4 years ago
I've already checked and the fix (https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/0ca8d841188f0353b8b48f90f03e9735675accd3) has been merged.
Updated by livdywan over 3 years ago
- Due date changed from 2021-02-19 to 2021-02-26
Should this be in Feedback?
Updated by mkittler over 3 years ago
On the next OSD deployment I can continue working on this. Until then I can't even get feedback, so it is actually blocked. The same applies to the parent ticket.
Updated by okurz over 3 years ago
- Status changed from In Progress to Feedback
We use "Feedback" when we wait for defined events that need active checks by the assignee, e.g. "wait until after the next OSD deployment" is such case. "In Progress" means that you are busy coding, researching, trying, debugging, etc.
Updated by mkittler over 3 years ago
- Status changed from Feedback to Resolved
Works in production, see #80908#note-18