coordination #80908 (closed)

Parent: coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs

Added by okurz over 3 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2020-12-09
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Motivation

We want to upgrade more often without disrupting openQA jobs on package upgrades, and have workers re-read their configuration whenever a job finishes.

Acceptance criteria

  • AC1: DONE openQA worker packages can be upgraded continuously without interrupting currently running openQA jobs
  • AC2: DONE openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs
  • AC3: Both o3 and osd deploy automatically after every change if all relevant checks have passed

Ideas

  • Use different git branches, e.g. "dev" or "main" and then "stable" or "tested" or "release" and create automatic merges by bots based on checks
  • Switch o3 workers to either deploy from worker containers which we update continuously or change the worker to allow non-transactional updates

Further details

One could try what apache does with apache2ctl graceful or systemctl reload apache2, e.g. see https://elearning.wsldp.com/pcmagazine/apache-graceful-restart-centos-7/

The restart of openQA workers could simply be prevented or delayed, e.g. with SendSIGKILL= in the openQA worker systemd service definitions, which every openQA user is free to do; but then we could potentially wait hours until the service restarts, if ever. Maybe we can still add a "graceful-stop" mode: wait a reasonable time for all jobs to finish and then restart (or even reboot the host).
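The delay approach could be sketched as a systemd drop-in. TimeoutStopSec and SendSIGKILL are standard systemd directives; the file name and values here are only illustrative:

```ini
# /etc/systemd/system/openqa-worker@.service.d/graceful-stop.conf (sketch)
[Service]
# Give a stopping worker up to 2 hours to finish its current job
# before systemd gives up waiting.
TimeoutStopSec=2h
# Never escalate to SIGKILL after the stop signal.
SendSIGKILL=no
```

After adding the drop-in, `systemctl daemon-reload` makes it effective.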


Subtasks 11 (0 open, 11 closed)

action #80910: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs (Resolved, mkittler, 2020-12-09)

action #80986: terminate worker process after executing all currently assigned jobs based on config/env variable (Resolved, mkittler, 2020-12-11)

openQA Infrastructure - action #81884: openqa-webui should automatically restart on config updates (Resolved, okurz, 2021-01-08)

action #89200: Switch OSD deployment to two-daily deployment (Resolved, mkittler, 2021-02-26)

action #90152: module results missing on quick job (on auto-restarting worker) (Resolved, mkittler, 2021-03-16)

action #104178: Increase OSD deployment rate from every second day to daily (Resolved, okurz, 2021-12-20)

action #104841: Prevent empty changelog messages from osd-deployment when there are no changes size:M (Resolved, mkittler, 2022-01-12)

action #105379: Continuous deployment of o3 workers - one worker first size:M (Resolved, mkittler, 2022-01-24)

action #105885: Continuous deployment of o3 workers - all the other o3 workers size:M (Resolved, mkittler)

action #111028: Continuous update of o3 webUI (Resolved, okurz, 2022-05-12)

action #111377: Continuous deployment of osd workers - similar as on o3 size:M (Rejected, okurz, 2022-05-20)
Actions #1

Updated by okurz over 3 years ago

  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to okurz
  • Target version changed from future to Ready

I will track this epic as we already have one specific subtask. That should be good enough for now.

Actions #2

Updated by okurz over 3 years ago

  • Description updated (diff)
Actions #3

Updated by okurz over 3 years ago

In https://chat.suse.de/group/qa-tools?msg=cRrsekSpzTHxMPRoz we discussed that maybe it can be as simple as "terminate after executing all currently assigned jobs" and let the worker be automatically restarted by systemd (or kubernetes). This way reading the config as well as reading any new files (after a package upgrade) would work.
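As a sketch, such a unit (assuming the standard openQA worker executable path and the environment variable later introduced in #80986) could look like:

```ini
# Sketch of an auto-restarting worker unit: the worker terminates after
# finishing its currently assigned jobs, and systemd immediately starts a
# fresh process which re-reads the config and any newly installed files.
[Service]
ExecStart=/usr/share/openqa/script/worker --instance %i
Environment=OPENQA_WORKER_TERMINATE_AFTER_JOBS_DONE=1
Restart=always
```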

Actions #4

Updated by Xiaojing_liu over 3 years ago

We discussed this with the Migration and Security teams. Before this ticket is resolved, could we postpone the OSD deployment if the deployment time falls on a milestone release candidate day? On that day they need to deliver the test report ASAP, and if the deployment stops and re-triggers their jobs, it costs them more time waiting for results. This is just a workaround until we improve the deployment process.

Actions #5

Updated by okurz over 3 years ago

Xiaojing_liu wrote:

We discussed this with the Migration and Security teams. Before this ticket is resolved, could we postpone the OSD deployment if the deployment time falls on a milestone release candidate day? On that day they need to deliver the test report ASAP, and if the deployment stops and re-triggers their jobs, it costs them more time waiting for results. This is just a workaround until we improve the deployment process.

I understand that. But we were at this point already some years ago and there was always something very important coming up blocking deployment. You brought up the idea in chat about shifting the deployment time. You could create a ticket for that and we can decide there.

Actions #6

Updated by mkittler over 3 years ago

Here's an idea to improve this further: https://github.com/os-autoinst/openQA/pull/3641 (besides #80986)

Actions #7

Updated by okurz about 3 years ago

  • Subject changed from [epic] Continuous deployment without interrupting currently running openQA jobs to [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs
Actions #8

Updated by mkittler about 3 years ago

The restart of openQA workers could simply be prevented or delayed, e.g. with SendSIGKILL= in the openQA worker systemd service definitions, which every openQA user is free to do; but then we could potentially wait hours until the service restarts, if ever.

Because of the "but" I would refrain from following this approach. It would be a huge regression if restarting a worker is no longer possible in the way it worked before.


One could try what apache does with apache2ctl graceful or systemctl reload apache2

This is also what I had in mind with my "SIGHUP" approach, and likely it is also the way to go, as we've seen that simply reloading the config within the worker is more complicated than expected and only solves half of the epic.

With https://github.com/os-autoinst/openQA/pull/3641 I implemented almost the Apache/NGINX behavior, except that my change only allows terminating the worker gracefully but won't start it again on its own. I suppose for that we'd need a master process and at least one worker process so the master can start the worker again as needed; at least, all applications I know of that can restart themselves have at least two processes. Since we always use systemd, the idea was to simply use Restart=always via openqa-worker-auto-restart@.service, see https://progress.opensuse.org/issues/80910#note-16.

My PR suggests using e.g. systemctl kill --signal SIGHUP openqa-worker-auto-restart@*.service. However, we could of course add ExecReload=/bin/kill -HUP $MAINPID to openqa-worker-auto-restart@.service to allow e.g. systemctl reload openqa-worker-auto-restart@*.service.
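Such an ExecReload could be added via a drop-in, for example (file name hypothetical):

```ini
# /etc/systemd/system/openqa-worker-auto-restart@.service.d/reload.conf (sketch)
[Service]
# `systemctl reload openqa-worker-auto-restart@N.service` then sends SIGHUP to
# the worker's main process so it finishes its jobs and exits; Restart=always
# in the base unit brings up a fresh process afterwards.
ExecReload=/bin/kill -HUP $MAINPID
```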

This still leaves it open how we'd like to trigger the "reload". I see three ways to do that; they are not meant as alternatives but would complement each other:

  1. Let systemd trigger the "reload" when the config file changes (draft: https://github.com/os-autoinst/openQA/pull/3666).
  2. Add an RPM hook to invoke systemctl kill --signal SIGHUP openqa-worker-auto-restart@*.service on updates. I don't know how to do that but considering there are already existing hooks for restarting it should be possible. (The implementation for restarting can be found in /usr/lib/rpm/macros.d/macros.systemd provided by the systemd-rpm-macros package. The relevant macro which is also used in our spec file is %service_del_postun which in turn uses %_restart_on_update().)
  3. Document how to trigger a graceful restart manually. To be more user-friendly, we could really go for setting ExecReload to allow using systemctl reload ….
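For way 1, a path unit watching the worker config could look roughly like this (a sketch based on the linked draft PR; the actual unit may differ):

```ini
# openqa-reload-worker-auto-restart@.path (sketch)
[Path]
# Trigger the matching .service (which performs the reload) whenever the
# worker configuration file changes.
PathChanged=/etc/openqa/workers.ini

[Install]
WantedBy=multi-user.target
```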

I'd also like to note that if we're able to trigger it reliably, we could actually revert the use of Environment=OPENQA_WORKER_TERMINATE_AFTER_JOBS_DONE=1 so workers would not restart unnecessarily. Unnecessary restarts of the worker are obviously a negative side effect of using openqa-worker-auto-restart@.service so far (considering the goal is only to apply updates and configuration changes). Note that support for the environment variable itself should stay because it might also be useful in other use cases, e.g. when running the worker within a container; I'm only suggesting to remove it from the systemd service. Maybe it would also be better to add yet another service instead of modifying the existing one in an incompatible way.

Actions #9

Updated by mkittler about 3 years ago

  • Status changed from Blocked to Feedback

I'm setting this to Feedback because I'd actually like to hear some feedback from the team before proceeding. (If the approach is accepted, this would solve the two subtasks in one go, so I added this comment on the epic level.)

Actions #10

Updated by okurz about 3 years ago

Your overall approach looks sound and safe.

mkittler wrote:

  1. Document how to trigger a graceful restart manually. To be more user-friendly, we could really go for setting ExecReload to allow using systemctl reload ….

I guess this could be done without interfering with other functionality. I know we are struggling a bit with this epic to know where to go. How about you create a draft pull request that starts with the actual documentation changes, written from the user's point of view based on the motivation and ACs for this epic, and do the implementation afterwards?

Actions #11

Updated by mkittler about 3 years ago

Ok, but it would really be a draft because it would not actually work until we switch to using openqa-worker-auto-restart@.service to get Restart=always. Enabling that service would be the next step then; I've already tested it on imagetester, so after that comes testing the salt change I've prepared on staging.

Actions #12

Updated by mkittler about 3 years ago

  • Assignee changed from okurz to mkittler

Since this first step is important for both sub tickets, I'm writing this comment in the epic ticket and am also assigning it to myself.

Actions #13

Updated by mkittler about 3 years ago

  • Documentation PR has been merged.
  • The SR has been merged so openqa-worker-auto-restart@.service is now used on OSD workers and openqa-worker.target (which would cause restarts on package updates) is disabled.
Actions #15

Updated by livdywan about 3 years ago

mkittler wrote:

Outstanding PRs:

All MRs/PRs are merged now

Actions #16

Updated by livdywan about 3 years ago

As mentioned during the daily, a follow-up was also needed here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/446 (discussed and merged)

Actions #17

Updated by mkittler about 3 years ago

It doesn't work in production, see #80910#note-25. I'm working on a fix. I stopped and masked the openqa-reload-worker-auto-restart@*.path units on OSD workers to avoid further jobs failing with "isotovideo received signal HUP".

Actions #18

Updated by mkittler about 3 years ago

After the deployment today everything seems to work:

  • We've seen that on package updates the workers receive SIGHUP.
  • I've unmasked the path units: salt -C 'G@roles:worker' cmd.run 'systemctl unmask openqa-reload-worker-auto-restart@{1..50}.path' (globbing is not possible here).
  • salt -C 'G@roles:worker' state.apply no longer complains about masked services which are not up and running.
  • I've tested editing the worker config and manually reloading the worker service; both lead to the worker receiving SIGHUP.
  • The worker behaves correctly when receiving SIGHUP while idling, while setting up a job and while running a job (also when receiving it twice). Jobs being executed while receiving SIGHUP pass normally. There are no further job failures like #89056.
Actions #19

Updated by mkittler about 3 years ago

  • Status changed from Feedback to Resolved
Actions #20

Updated by okurz about 3 years ago

Awesome. Would you be interested in demoing that feature in the next SUSE QE Tools workshop?

Actions #21

Updated by okurz about 3 years ago

  • Status changed from Resolved to Feedback

@mkittler: because I have not seen a reply from you to the above, and because I think we can do more here, at least finding follow-up tickets, I am reopening. Maybe we can discuss in the weekly what to do next about it.

Actions #22

Updated by okurz about 3 years ago

  • Description updated (diff)
  • Status changed from Feedback to Blocked
  • Assignee changed from mkittler to okurz
Actions #23

Updated by okurz over 2 years ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)
  • Target version changed from Ready to future
Actions #24

Updated by okurz about 2 years ago

  • Description updated (diff)
Actions #25

Updated by okurz about 2 years ago

  • Target version changed from future to Ready

#104841 is resolved. The parent is in the backlog so this epic should again be as well.

  • How should we trigger a continuous deployment? okurz suggests that for web services an HTTP route could be made accessible that is poked with authentication to trigger a self-update. However, this is likely only common for non-packaged, publicly available web server instances. For our o3 workers we could just do periodic polling, e.g. every 5 minutes run zypper -n ref -r devel:openQA | grep -q 'is up to date', which takes only 0.3s for a no-op.

So maybe we can just add a systemd timer that every 5 minutes runs zypper -n ref -r devel:openQA | grep -q 'is up to date' || zypper -n dup -r devel:openQA, i.e. dist-upgrade only when the refresh reports new repository metadata.
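As a sketch, that could be a timer/service pair like the following (unit names hypothetical; the zypper commands are the ones from above, with the dist-upgrade running only when the refresh does not report "is up to date"):

```ini
# openqa-continuous-update.timer (hypothetical name)
[Timer]
# Poll the repository every 5 minutes.
OnCalendar=*:0/5

[Install]
WantedBy=timers.target
```

```ini
# openqa-continuous-update.service (hypothetical name)
[Service]
Type=oneshot
# Refresh repo metadata; only dist-upgrade when the repo is not up to date.
ExecStart=/bin/sh -c "zypper -n ref -r devel:openQA | grep -q 'is up to date' || zypper -n dup -r devel:openQA"
```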

Actions #26

Updated by okurz about 2 years ago

  • Tracker changed from action to coordination
Actions #27

Updated by okurz about 2 years ago

  • Status changed from New to Blocked
  • Assignee set to okurz

tracking new subtasks

Actions #28

Updated by okurz almost 2 years ago

  • Status changed from Blocked to Resolved

Done here. We have continuous deployment on o3 and a layered daily osd deployment whenever o3 is healthy.
