coordination #80908 (closed)

Parent: coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs

Added by okurz over 3 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2020-12-09
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Motivation

We want to upgrade more often without disrupting openQA jobs on package upgrades, and have workers re-read their configuration whenever a job finishes.

Acceptance criteria

  • AC1: DONE openQA worker packages can be upgraded continuously without interrupting currently running openQA jobs
  • AC2: DONE openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs
  • AC3: Both o3 and osd deploy automatically after every change if all relevant checks have passed

Ideas

  • Use different git branches, e.g. "dev" or "main" and then "stable" or "tested" or "release" and create automatic merges by bots based on checks
  • Switch o3 workers to either deploy from worker containers which we update continuously or change the worker to allow non-transactional updates

Further details

One could try what apache does with apache2ctl graceful or systemctl reload apache2, e.g. see https://elearning.wsldp.com/pcmagazine/apache-graceful-restart-centos-7/

The restart of openQA workers could simply be prevented or delayed, e.g. with SendSIGKILL= in the openQA worker systemd service definitions, which every openQA user is free to do; but then we could potentially wait hours until the service restarts, if ever. Maybe we can still add a "graceful-stop" mode: wait a reasonable time for all jobs to finish and then restart (or even reboot the host).
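The delay approach could be sketched as a systemd drop-in. TimeoutStopSec and SendSIGKILL are standard systemd directives; the file name and values here are only illustrative:

```ini
# /etc/systemd/system/openqa-worker@.service.d/graceful-stop.conf (sketch)
[Service]
# Give a stopping worker up to 2 hours to finish its current job
# before systemd gives up waiting.
TimeoutStopSec=2h
# Never escalate to SIGKILL after the stop signal.
SendSIGKILL=no
```

After adding the drop-in, `systemctl daemon-reload` makes it effective.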


Subtasks 11 (0 open, 11 closed)

action #80910: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs (Resolved, mkittler, 2020-12-09)

action #80986: terminate worker process after executing all currently assigned jobs based on config/env variable (Resolved, mkittler, 2020-12-11)

openQA Infrastructure - action #81884: openqa-webui should automatically restart on config updates (Resolved, okurz, 2021-01-08)

action #89200: Switch OSD deployment to two-daily deployment (Resolved, mkittler, 2021-02-26)

action #90152: module results missing on quick job (on auto-restarting worker) (Resolved, mkittler, 2021-03-16)

action #104178: Increase OSD deployment rate from every second day to daily (Resolved, okurz, 2021-12-20)

action #104841: Prevent empty changelog messages from osd-deployment when there are no changes size:M (Resolved, mkittler, 2022-01-12)

action #105379: Continuous deployment of o3 workers - one worker first size:M (Resolved, mkittler, 2022-01-24)

action #105885: Continuous deployment of o3 workers - all the other o3 workers size:M (Resolved, mkittler)

action #111028: Continuous update of o3 webUI (Resolved, okurz, 2022-05-12)

action #111377: Continuous deployment of osd workers - similar as on o3 size:M (Rejected, okurz, 2022-05-20)
Actions #1

Updated by okurz over 3 years ago

  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to okurz
  • Target version changed from future to Ready

I will track this epic as we already have one specific subtask. That should be good enough for now.

Actions #2

Updated by okurz over 3 years ago

  • Description updated (diff)
Actions #3

Updated by okurz over 3 years ago

In https://chat.suse.de/group/qa-tools?msg=cRrsekSpzTHxMPRoz we discussed that maybe it can be as simple as "terminate after executing all currently assigned jobs" and let the worker be automatically restarted by systemd (or kubernetes). This way reading the config as well as reading any new files (after a package upgrade) would work.
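As a sketch, such a unit (assuming the standard openQA worker executable path and the environment variable later introduced in #80986) could look like:

```ini
# Sketch of an auto-restarting worker unit: the worker terminates after
# finishing its currently assigned jobs, and systemd immediately starts a
# fresh process which re-reads the config and any newly installed files.
[Service]
ExecStart=/usr/share/openqa/script/worker --instance %i
Environment=OPENQA_WORKER_TERMINATE_AFTER_JOBS_DONE=1
Restart=always
```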

Actions #4

Updated by Xiaojing_liu over 3 years ago

We discussed this with the Migration and Security teams. Before this ticket is resolved, could we postpone the OSD deployment if the deployment time falls on a milestone release candidate day? On that day they need to deliver the test report ASAP, and if the deployment stops and re-triggers their jobs, it costs them more time waiting for results. This is just a workaround until we improve the deployment process.

Actions #5

Updated by okurz over 3 years ago

Xiaojing_liu wrote:

We discussed this with the Migration and Security teams. Before this ticket is resolved, could we postpone the OSD deployment if the deployment time falls on a milestone release candidate day? On that day they need to deliver the test report ASAP, and if the deployment stops and re-triggers their jobs, it costs them more time waiting for results. This is just a workaround until we improve the deployment process.

I understand that. But we were at this point already some years ago and there was always something very important coming up blocking deployment. You brought up the idea in chat about shifting the deployment time. You could create a ticket for that and we can decide there.

Actions #6

Updated by mkittler over 3 years ago

Here's an idea to improve this further: https://github.com/os-autoinst/openQA/pull/3641 (besides #80986)

Actions #7

Updated by okurz about 3 years ago

  • Subject changed from [epic] Continuous deployment without interrupting currently running openQA jobs to [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs
Actions #8

Updated by mkittler about 3 years ago

The restart of openQA workers could simply be prevented or delayed, e.g. with SendSIGKILL= in the openQA worker systemd service definitions, which every openQA user is free to do; but then we could potentially wait hours until the service restarts, if ever.

Because of the "but" I would refrain from following this approach. It would be a huge regression if restarting a worker is no longer possible in the way it worked before.


One could try what apache does with apache2ctl graceful or systemctl reload apache2

This is also what I had in mind with my "SIGHUP" approach, and likely it is also the way to go, as we've seen that simply reloading the config within the worker is more complicated than expected and only solves half of the epic.

With https://github.com/os-autoinst/openQA/pull/3641 I implemented almost the Apache/NGINX behavior, except that my change only allows terminating the worker gracefully but won't start it again on its own. I suppose for that we'd need a master process and at least one worker process so the master can start the worker again as needed; at least, all applications I know of that can restart themselves have at least two processes. Since we always use systemd, the idea was to simply use Restart=always via openqa-worker-auto-restart@.service, see https://progress.opensuse.org/issues/80910#note-16.

My PR suggests using e.g. systemctl kill --signal SIGHUP openqa-worker-auto-restart@*.service. However, we could of course add ExecReload=/bin/kill -HUP $MAINPID to openqa-worker-auto-restart@.service to allow e.g. systemctl reload openqa-worker-auto-restart@*.service.
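Such an ExecReload could be added via a drop-in, for example (file name hypothetical):

```ini
# /etc/systemd/system/openqa-worker-auto-restart@.service.d/reload.conf (sketch)
[Service]
# `systemctl reload openqa-worker-auto-restart@N.service` then sends SIGHUP to
# the worker's main process so it finishes its jobs and exits; Restart=always
# in the base unit brings up a fresh process afterwards.
ExecReload=/bin/kill -HUP $MAINPID
```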

This still leaves it open how we'd like to trigger the "reload". I see three ways to do that; they are not meant as alternatives but would complement each other:

  1. Let systemd trigger the "reload" when the config file changes (draft: https://github.com/os-autoinst/openQA/pull/3666).
  2. Add an RPM hook to invoke systemctl kill --signal SIGHUP openqa-worker-auto-restart@*.service on updates. I don't know how to do that but considering there are already existing hooks for restarting it should be possible. (The implementation for restarting can be found in /usr/lib/rpm/macros.d/macros.systemd provided by the systemd-rpm-macros package. The relevant macro which is also used in our spec file is %service_del_postun which in turn uses %_restart_on_update().)
  3. Document how to trigger a graceful restart manually. To be more user-friendly, we could really go for setting ExecReload to allow using systemctl reload ….
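For way 1, a path unit watching the worker config could look roughly like this (a sketch based on the linked draft PR; the actual unit may differ):

```ini
# openqa-reload-worker-auto-restart@.path (sketch)
[Path]
# Trigger the matching .service (which performs the reload) whenever the
# worker configuration file changes.
PathChanged=/etc/openqa/workers.ini

[Install]
WantedBy=multi-user.target
```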

I'd also like to note that if we're able to trigger it reliably, we could actually revert the use of Environment=OPENQA_WORKER_TERMINATE_AFTER_JOBS_DONE=1 so workers would not restart unnecessarily. Unnecessary restarts of the worker are obviously a negative side effect of using openqa-worker-auto-restart@.service so far (considering the goal is only to apply updates and configuration changes). Note that support for the environment variable itself should stay because it might also be useful in other use cases, e.g. when running the worker within a container; I'm only suggesting to remove it from the systemd service. Maybe it would also be better to add yet another service instead of modifying the existing one in an incompatible way.

Actions #9

Updated by mkittler about 3 years ago

  • Status changed from Blocked to Feedback

I'm setting this to Feedback because I'd actually like to hear some feedback from the team before proceeding. (If the approach is accepted, this would solve the two subtasks in one go, so I added this comment on the epic level.)

Actions #10

Updated by okurz about 3 years ago

Your overall approach looks sound and safe.

mkittler wrote:

  1. Document how to trigger a graceful restart manually. To be more user-friendly, we could really go for setting ExecReload to allow using systemctl reload ….

I guess this could be done without interfering with other functionality. I know we are struggling a bit with this epic to know where to go. How about you create a draft pull request that starts with the actual documentation changes, written from the user's point of view based on the motivation and ACs for this epic, and do the implementation afterwards?

Actions #11

Updated by mkittler about 3 years ago

Ok, but it would really be a draft because it would not actually work until we switch to using openqa-worker-auto-restart@.service to get Restart=always. Enabling that service would be the next step then; I've already tested it on imagetester, so after that comes testing the salt change I've prepared on staging.

Actions #12

Updated by mkittler about 3 years ago

  • Assignee changed from okurz to mkittler

Since this first step is important for both sub tickets, I'm writing this comment in the epic ticket and am also assigning it to myself.

Actions #13

Updated by mkittler about 3 years ago

  • Documentation PR has been merged.
  • The SR has been merged so openqa-worker-auto-restart@.service is now used on OSD workers and openqa-worker.target (which would cause restarts on package updates) is disabled.
Actions #15

Updated by livdywan about 3 years ago

mkittler wrote:

Outstanding PRs:

All MRs/PRs are merged now

Actions #16

Updated by livdywan about 3 years ago

As mentioned during the daily, a follow-up was also needed here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/446 (discussed and merged)

Actions #17

Updated by mkittler about 3 years ago

It doesn't work in production, see #80910#note-25. I'm working on a fix. I stopped and masked the openqa-reload-worker-auto-restart@*.path units on OSD workers to avoid further jobs failing with "isotovideo received signal HUP".

Actions #18

Updated by mkittler about 3 years ago

After the deployment today everything seems to work:

  • We've seen that on package updates the workers receive SIGHUP.
  • I've unmasked the path units: salt -C 'G@roles:worker' cmd.run 'systemctl unmask openqa-reload-worker-auto-restart@{1..50}.path' (globbing is not possible here).
  • salt -C 'G@roles:worker' state.apply no longer complains about masked services which are not up and running.
  • I've tested editing the worker config and manually reloading the worker service; both lead to the worker receiving SIGHUP.
  • The worker behaves correctly when receiving SIGHUP while idling, while setting up a job and while running a job (also when receiving it twice). Jobs being executed while receiving SIGHUP pass normally. There are no further job failures like #89056.
Actions #19

Updated by mkittler about 3 years ago

  • Status changed from Feedback to Resolved
Actions #20

Updated by okurz about 3 years ago

Awesome. Would you be interested in demoing that feature in the next SUSE QE Tools workshop?

Actions #21

Updated by okurz about 3 years ago

  • Status changed from Resolved to Feedback

@mkittler: because I have not seen a reply from you to the above, and because I think we can do more here, at least finding follow-up tickets, I am reopening. Maybe we can discuss in the weekly what to do next about it.

Actions #22

Updated by okurz about 3 years ago

  • Description updated (diff)
  • Status changed from Feedback to Blocked
  • Assignee changed from mkittler to okurz
Actions #23

Updated by okurz over 2 years ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)
  • Target version changed from Ready to future
Actions #24

Updated by okurz about 2 years ago

  • Description updated (diff)
Actions #25

Updated by okurz about 2 years ago

  • Target version changed from future to Ready

#104841 is resolved. The parent is in the backlog so this epic should again be as well.

  • How should we trigger a continuous deployment? okurz suggests that for web services an HTTP route could be made accessible that is poked with authentication to trigger a self-update. However, this is likely only common for non-packaged, publicly available web server instances. For our o3 workers we could just do periodic polling, e.g. every 5 minutes run zypper -n ref -r devel:openQA | grep -q 'is up to date', which takes only 0.3s for a no-op.

So maybe we can just add a systemd timer that every 5 minutes runs zypper -n ref -r devel:openQA | grep -q 'is up to date' || zypper -n dup -r devel:openQA, i.e. dist-upgrade only when the refresh reports new repository metadata.
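As a sketch, that could be a timer/service pair like the following (unit names hypothetical; the zypper commands are the ones from above, with the dist-upgrade running only when the refresh does not report "is up to date"):

```ini
# openqa-continuous-update.timer (hypothetical name)
[Timer]
# Poll the repository every 5 minutes.
OnCalendar=*:0/5

[Install]
WantedBy=timers.target
```

```ini
# openqa-continuous-update.service (hypothetical name)
[Service]
Type=oneshot
# Refresh repo metadata; only dist-upgrade when the repo is not up to date.
ExecStart=/bin/sh -c "zypper -n ref -r devel:openQA | grep -q 'is up to date' || zypper -n dup -r devel:openQA"
```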

Actions #26

Updated by okurz about 2 years ago

  • Tracker changed from action to coordination
Actions #27

Updated by okurz about 2 years ago

  • Status changed from New to Blocked
  • Assignee set to okurz

tracking new subtasks

Actions #28

Updated by okurz almost 2 years ago

  • Status changed from Blocked to Resolved

Done here. We have continuous deployment on o3 and a layered daily osd deployment whenever o3 is healthy.
