coordination #80908
closed coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
[epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs
Added by okurz about 4 years ago. Updated over 2 years ago.
Description
Motivation
We want to upgrade more often without disrupting running openQA jobs during package upgrades, and we want workers to re-read their configuration whenever a job finishes.
Acceptance criteria
- AC1: DONE openQA worker packages can be upgraded continuously without interrupting currently running openQA jobs
- AC2: DONE openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs
- AC3: Both o3 and osd deploy automatically after every change if all relevant checks have passed
Ideas
- Use different git branches, e.g. "dev" or "main" and then "stable" or "tested" or "release" and create automatic merges by bots based on checks
- Switch o3 workers to either deploy from worker containers which we update continuously or change the worker to allow non-transactional updates
Further details
One could try what Apache does with apache2ctl graceful or systemctl reload apache2, e.g. see https://elearning.wsldp.com/pcmagazine/apache-graceful-restart-centos-7/
The restart of openQA workers could simply be prevented or delayed, e.g. with SendSIGKILL= in the openQA worker systemd service definitions, which every openQA user is free to do, but then we could potentially wait hours until the service restarts, if ever. Maybe we can still add a "graceful-stop" mode, wait a useful time for all jobs to finish and then restart (or even reboot the host).
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from New to Blocked
- Assignee set to okurz
- Target version changed from future to Ready
I will track this epic as we already have one specific subtask. That should be good enough for now.
Updated by okurz about 4 years ago
In https://chat.suse.de/group/qa-tools?msg=cRrsekSpzTHxMPRoz we discussed that maybe it can be as simple as "terminate after executing all currently assigned jobs" and let the worker be automatically restarted by systemd (or kubernetes). This way reading the config as well as reading any new files (after a package upgrade) would work.
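The "terminate after executing all currently assigned jobs" idea could look roughly like the following sketch. It is not the openQA worker implementation: run_job, run_jobs, request_stop and job_log are hypothetical names used only for illustration. The point is that the stop request is only checked between jobs, so a running job is never interrupted.

```shell
# Hypothetical stand-in for a worker main loop (not actual openQA code).
stop_requested=0
job_log=""

# Signal handler: only record the request, never abort the running job.
request_stop() { stop_requested=1; }
trap request_stop HUP

# Placeholder for real job execution; override it to do actual work.
run_job() { :; }

# run_jobs JOB...: run the assigned jobs one by one and stop picking up
# new ones once a stop was requested.
run_jobs() {
    for job in "$@"; do
        if [ "$stop_requested" -eq 1 ]; then
            break
        fi
        run_job "$job"
        job_log="$job_log $job"
    done
    # A real service would exit here and rely on the service manager
    # (e.g. systemd with Restart=always) to start a fresh process.
}
```

A fresh process started by the service manager would then pick up both new package files and the current configuration.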
Updated by Xiaojing_liu about 4 years ago
Discussed this with the Migration and Security teams: until this ticket is resolved, could we postpone the OSD deployment whenever it falls on a milestone release candidate day? On that day they need to deliver the test report ASAP. If the deployment stops and re-triggers their jobs, they have to wait longer for the results. This is just a workaround until we improve the deployment process.
Updated by okurz about 4 years ago
Xiaojing_liu wrote:
Discussed this with the Migration and Security teams: until this ticket is resolved, could we postpone the OSD deployment whenever it falls on a milestone release candidate day? On that day they need to deliver the test report ASAP. If the deployment stops and re-triggers their jobs, they have to wait longer for the results. This is just a workaround until we improve the deployment process.
I understand that. But we were at this point already some years ago and there was always something very important coming up blocking deployment. You brought up the idea in chat about shifting the deployment time. You could create a ticket for that and we can decide there.
Updated by mkittler almost 4 years ago
Here's an idea to improve this further: https://github.com/os-autoinst/openQA/pull/3641 (besides #80986)
Updated by okurz almost 4 years ago
- Subject changed from [epic] Continuous deployment without interrupting currently running openQA jobs to [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs
Updated by mkittler almost 4 years ago
The restart of openQA workers could be simply prevented or delayed, e.g. with SendSIGKILL= in the openQA worker systemd service definitions which every openQA user is free to do, but then we could potentially wait hours until the service restarts if ever.
Because of the "but" I would refrain from following this approach. It would be a huge regression if restarting a worker is no longer possible in the way it worked before.
One could try what apache does with apache2ctl graceful or systemctl reload apache2
This is also what I had in mind with my "SIGHUP" approach and likely it is also the way to go, as we've seen that simply reloading the config within the worker is more complicated than expected and only solves half of the epic.
With https://github.com/os-autoinst/openQA/pull/3641 I implemented almost the Apache/NGINX behavior, except that my change only allows terminating the worker gracefully but won't start it again on its own. I suppose for that we'd need a master process and at least one worker process so the master can start the worker again as needed. At least all applications I know which can restart themselves have at least 2 processes. Since we always use systemd the idea was to simply use Restart=always via openqa-worker-auto-restart@.service, see https://progress.opensuse.org/issues/80910#note-16.
My PR suggests to use e.g. systemctl kill --signal SIGHUP openqa-worker-auto-restart@*.service. However, we could of course add ExecReload=/bin/kill -HUP $MAINPID to openqa-worker-auto-restart@.service to allow e.g. systemctl reload openqa-worker-auto-restart@*.service.
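As a rough illustration of the ExecReload idea, the following sketch writes a systemd drop-in carrying the two settings named above. The helper name write_reload_dropin and the file name reload.conf are assumptions made for this example. The shipped openqa-worker-auto-restart@.service may already contain some or all of these settings.

```shell
# write_reload_dropin DIR: create DIR and place a drop-in file in it.
# The file name "reload.conf" is a hypothetical choice.
write_reload_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    # Quoted heredoc delimiter so $MAINPID is written literally and
    # expanded later by systemd, not by this shell.
    cat > "$dropin_dir/reload.conf" <<'EOF'
[Service]
Restart=always
ExecReload=/bin/kill -HUP $MAINPID
EOF
}
```

On a real host one would presumably call it with /etc/systemd/system/openqa-worker-auto-restart@.service.d as the directory and run systemctl daemon-reload afterwards.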
This still leaves it open how we'd like to trigger the "reload". I see three ways we can do that and these are not meant as different alternatives but as different ways which would complement each other:
- Let systemd trigger the "reload" when the config file changes (draft: https://github.com/os-autoinst/openQA/pull/3666).
- Add an RPM hook to invoke systemctl kill --signal SIGHUP openqa-worker-auto-restart@*.service on updates. I don't know how to do that but considering there are already existing hooks for restarting it should be possible. (The implementation for restarting can be found in /usr/lib/rpm/macros.d/macros.systemd provided by the systemd-rpm-macros package. The relevant macro which is also used in our spec file is %service_del_postun which in turn uses %_restart_on_update().)
  - If that's not possible we could also try to use salt but it will be the less generic approach. (draft SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/423/diffs)
  - We also learned that the existing hook to restart will not get in our way as long as openqa-worker.target is not active. (see https://progress.opensuse.org/issues/80910#note-16)
  - Draft: https://github.com/os-autoinst/openQA/pull/3699
- Document how to trigger a graceful restart manually. To be more user-friendly, we could really go for setting ExecReload to allow using systemctl reload ….
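For the first way (systemd triggering the reload on a config change), a pair of units along the following lines could work. This is only a sketch and not the content of the draft PR. The watched path /etc/openqa/workers.ini in the usage note, both function names and the unit contents are assumptions. The second unit also assumes the worker service defines ExecReload.

```shell
# print_reload_path_unit CONFIG: print a templated path unit watching CONFIG.
print_reload_path_unit() {
    cat <<EOF
[Unit]
Description=Watch $1 and reload openQA worker instance %i

[Path]
PathChanged=$1

[Install]
WantedBy=multi-user.target
EOF
}

# print_reload_service_unit: print the service a path unit of the same
# name would activate; it only asks systemd to reload the worker instance.
print_reload_service_unit() {
    cat <<'EOF'
[Unit]
Description=Reload openQA worker instance %i

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl reload openqa-worker-auto-restart@%i.service
EOF
}
```

For example, print_reload_path_unit /etc/openqa/workers.ini prints a unit that could be saved as a templated .path file, with the second function providing the service of the same name.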
I'd also like to note that if we're able to trigger it reliably, we could actually revert the use of Environment=OPENQA_WORKER_TERMINATE_AFTER_JOBS_DONE=1 so workers would not restart unnecessarily. Unnecessary restarts of the worker are obviously a negative "side-effect" of using openqa-worker-auto-restart@.service so far (considering the goal is to apply only updates and configuration changes). Note that support for the environment variable itself should stay because it might also be useful in other use cases, e.g. when running the worker within a container; I'm only suggesting to remove it from the systemd service. Maybe it would also be better to add yet another service instead of modifying the existing one in an incompatible way.
Updated by mkittler almost 4 years ago
- Status changed from Blocked to Feedback
I'm setting this to feedback because I'd actually like to hear some feedback from the team before proceeding. (If the approach is accepted this would solve the 2 sub tasks in one go so I added this comment on the epic-level.)
Updated by okurz almost 4 years ago
Your overall approach looks sound and safe.
mkittler wrote:
- Document how to trigger a graceful restart manually. To be more user-friendly, we could really go for setting ExecReload to allow using systemctl reload ….
I guess this could be done without interfering with other functionality. I know we are struggling a bit with this epic to know where to go. How about you create a draft pull request that starts with the actual documentation changes, written from the user's point of view and based on the motivation and ACs of this epic, and do the implementation afterwards?
Updated by mkittler almost 4 years ago
Ok, but it would really be a draft because it would not actually work until we switch to using openqa-worker-auto-restart@.service to have Restart=always. Enabling that service would be the next step then. I've already tested it on imagetester so the next step would be testing the salt change I've prepared on staging.
Updated by mkittler almost 4 years ago
- Assignee changed from okurz to mkittler
- PR for documentation: https://github.com/os-autoinst/openQA/pull/3682
- The SR for using openqa-worker-auto-restart@.service seems to work on a staging worker (see comments in GitLab): https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/426
Since this first step is important for both sub tickets I'm writing this comment in the epic ticket and am also assigning it to myself.
Updated by mkittler almost 4 years ago
- Documentation PR has been merged.
- The SR has been merged so openqa-worker-auto-restart@.service is now used on OSD workers and openqa-worker.target (which would cause restarts on package updates) is disabled.
Updated by mkittler almost 4 years ago
Outstanding PRs:
- reload on RPM update: https://github.com/os-autoinst/openQA/pull/3699
- reload on config change: https://github.com/os-autoinst/openQA/pull/3666
- OSD config: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/438
Updated by livdywan almost 4 years ago
mkittler wrote:
Outstanding PRs:
- reload on RPM update: https://github.com/os-autoinst/openQA/pull/3699
- reload on config change: https://github.com/os-autoinst/openQA/pull/3666
- OSD config: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/438
All M/PRs are merged now
Updated by livdywan almost 4 years ago
As mentioned during the daily, a follow-up was also needed here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/446 (discussed and merged)
Updated by mkittler almost 4 years ago
Doesn't work in production, see #80910#note-25. I'm working on a fix. I stopped and masked openqa-reload-worker-auto-restart@*.path services on OSD workers to avoid further jobs failing with isotovideo received signal HUP.
Updated by mkittler almost 4 years ago
After the deployment today everything seems to work:
- We've seen that workers receive SIGHUP on the package update.
- I've been unmasking the path services: salt -C 'G@roles:worker' cmd.run 'systemctl unmask openqa-reload-worker-auto-restart@{1..50}.path' (globbing not possible here)
- salt -C 'G@roles:worker' state.apply no longer complains about masked services which are now up and running.
- I've tested editing the worker config and manually reloading the worker service and both lead to the worker receiving SIGHUP.
- The worker behaves correctly when receiving SIGHUP while idling, while setting up a job and while running a job (also when receiving it twice). The jobs being executed while receiving SIGHUP pass normally. There are no further jobs like #89056.
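Since globbing is not possible for masked units, the instance names have to be enumerated explicitly. The {1..50} brace expansion used above is a bash feature. The following sketch is a more portable alternative that generates the same names in a loop. The function name path_units is hypothetical and the instance count is taken from the command above.

```shell
# path_units COUNT: print one path unit name per worker instance 1..COUNT.
path_units() {
    i=1
    while [ "$i" -le "$1" ]; do
        printf 'openqa-reload-worker-auto-restart@%s.path\n' "$i"
        i=$((i + 1))
    done
}
```

On a worker host this could be used as systemctl unmask $(path_units 50).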
Updated by mkittler almost 4 years ago
- Status changed from Feedback to Resolved
Updated by okurz almost 4 years ago
Awesome. Would you be interested in demoing that feature in the next SUSE QE Tools workshop?
Updated by okurz almost 4 years ago
- Status changed from Resolved to Feedback
@mkittler I am reopening because I have not seen a reply from you to the above and because I think we can do more here, at least find follow-up tickets. Maybe we can discuss in the weekly what to do next about it.
Updated by okurz almost 4 years ago
- Description updated (diff)
- Status changed from Feedback to Blocked
- Assignee changed from mkittler to okurz
Updated by okurz over 3 years ago
- Status changed from Blocked to New
- Assignee deleted (okurz)
- Target version changed from Ready to future
Updated by okurz almost 3 years ago
- Target version changed from future to Ready
#104841 is resolved. The parent is in the backlog so this epic should again be as well.
- How should we trigger a continuous deployment? okurz suggests that for webservices an HTTP route could be made accessible that is poked with authentication to trigger a self-update. However, this is likely only common for non-packaged, publicly available web server instances. For our o3 workers we could just do periodic polling, e.g. every 5 minutes run zypper -n ref -r devel:openQA | grep -q 'is up to date', which takes only 0.3s for a no-op.
So maybe we can just use a systemd timer that every 5 minutes runs zypper -n ref -r devel:openQA | grep -q 'is up to date' || zypper -n dup -r devel:openQA, i.e. upgrades only when the refresh does not report the repository as up to date.
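The polling idea could be wrapped in a small script that a systemd timer calls every 5 minutes. The sketch below assumes the intent is to upgrade only when the refresh does not report "is up to date". The names repo_is_current and maybe_upgrade and the ZYPPER override (used to stub the command for testing) are hypothetical.

```shell
REPO=devel:openQA

# repo_is_current: read "zypper ref" output on stdin and succeed if the
# repository was reported as up to date (the no-op case).
repo_is_current() { grep -q 'is up to date'; }

# maybe_upgrade: refresh the repository and upgrade from it only when the
# refresh found changes. The command can be replaced via $ZYPPER.
maybe_upgrade() {
    if ${ZYPPER:-zypper} -n ref -r "$REPO" | repo_is_current; then
        echo "no-op"
    else
        ${ZYPPER:-zypper} -n dup -r "$REPO"
    fi
}
```

Combined with workers that finish their current job before restarting, such an upgrade would not interrupt running jobs.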
Updated by okurz almost 3 years ago
- Tracker changed from action to coordination
Updated by okurz almost 3 years ago
- Status changed from New to Blocked
- Assignee set to okurz
tracking new subtasks
Updated by okurz over 2 years ago
- Status changed from Blocked to Resolved
Done here. We have continuous deployment on o3 and a layered daily osd deployment whenever o3 is healthy.