action #89200
closedcoordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs
Switch OSD deployment to two-daily deployment
0%
Description
Acceptance criteria
- AC1: GitLab CI pipeline switched from a weekly to a two-daily schedule
- AC2: The change was communicated to the main OSD users group
Suggestions
- Just change the schedule in GitLab to deploy every second day, either directly if possible or by adding multiple weekly schedules so that 3 or 4 days are covered each week.
- Carefully prepare a communication that we can now deploy more often because jobs are no longer disrupted during package upgrades :)
- Think about what we would like to see to feel safe about deploying every day
Updated by mkittler almost 4 years ago
I would do that after checking how well tomorrow's first "non-disruptive" deployment goes.
Updated by mkittler almost 4 years ago
It looks like some jobs were still restarted today on grenache-1 and QA-Power8-5-kvm:
openqa=# select id, t_finished, reason, (select host from workers where id = assigned_worker_id) from jobs where reason like '%quit%' and t_created > now() - interval '1 day';
id | t_finished | reason | host
---------+---------------------+--------------------------------------------+-----------------
5581139 | 2021-03-03 08:27:47 | quit: worker has been stopped or restarted | grenache-1
5581059 | 2021-03-03 08:27:26 | quit: worker has been stopped or restarted | grenache-1
5581185 | 2021-03-03 08:27:32 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
5581140 | 2021-03-03 08:27:44 | quit: worker has been stopped or restarted | grenache-1
5581141 | 2021-03-03 08:27:48 | quit: worker has been stopped or restarted | grenache-1
5581200 | 2021-03-03 08:26:59 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
5581199 | 2021-03-03 08:27:31 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
5581088 | 2021-03-03 08:27:49 | quit: worker has been stopped or restarted | grenache-1
5581142 | 2021-03-03 08:27:49 | quit: worker has been stopped or restarted | grenache-1
(9 rows)
I'll check what could be the cause of this before enabling a more frequent deployment.
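To see at a glance which hosts were affected, the query output above can be summarized per host. The following is only a sketch: the function name count_quit_jobs_per_host is hypothetical, and it assumes psql's default aligned output with the host in the fourth column, as in the table above.

```shell
# Count interrupted jobs per host from psql's default aligned output on stdin.
# Hypothetical helper; assumes four pipe-separated columns with the host last.
count_quit_jobs_per_host() {
    awk -F'|' '
        # Data rows have four columns and a purely numeric job id in the first one.
        NF == 4 && $1 ~ /^[ ]*[0-9]+[ ]*$/ {
            host = $4
            gsub(/^[ \t]+|[ \t]+$/, "", host)   # trim surrounding whitespace
            count[host]++
        }
        END {
            for (h in count) print count[h], h
        }
    ' | sort -k1,1nr -k2,2
}
```

Piping the saved query output into count_quit_jobs_per_host prints one line per host, with the most affected host first.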
Updated by mkittler almost 4 years ago
Looks like the services actually received SIGTERM:
Mär 03 09:26:55 QA-Power8-5-kvm systemd[1]: Stopping openQA Worker #2...
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Received signal TERM
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [debug] [pid:86146] Stopping job 5581185 from openqa.suse.de: 05581185-sle-15-SP3-Online-ppc64le-Build156.3-wicked_basic_ref@ppc64le - reason: quit
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [debug] [pid:86146] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5581185/status
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Trying to stop job gracefully by announcing it to command server via http://localhost:20023/ohFona2EOzURJmZN/broadcast
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Isotovideo exit status: 1
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] +++ worker notes +++
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] End time: 2021-03-03 08:26:55
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Result: quit
Mär 03 09:27:24 grenache-1 worker[523676]: [info] [pid:523676] Received signal TERM
Mär 03 09:27:24 grenache-1 worker[523676]: [debug] [pid:523676] Stopping job 5581139 from openqa.suse.de: 05581139-sle-15-SP3-Regression-on-Migration-from-SLE12-SPx-s390x-Buildhjluo_os-autoinst-distri-opensuse_unlock-offline_sles12sp4_ltss_pscc_sdk-asmm-contm-lgm-tcm-wsm_all_full@hjluo_os-autoinst-distri-opensuse_unlock@s390x-kvm-sle12 - reason: quit
Mär 03 09:27:24 grenache-1 worker[523676]: [debug] [pid:523676] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5581139/status
Mär 03 09:27:24 grenache-1 systemd[1]: Stopping openQA Worker #34...
So not a problem within the worker code. The problem is that openqa-worker.target is still active on these hosts:
martchus@grenache-1:~> systemctl status openqa-worker.target
● openqa-worker.target - openQA Worker
Loaded: loaded (/usr/lib/systemd/system/openqa-worker.target; disabled; vendor preset: disabled)
Active: active since Wed 2021-03-03 09:27:24 CET; 3h 25min ago
martchus@QA-Power8-5-kvm:~> sudo systemctl status openqa-worker.target
● openqa-worker.target - openQA Worker
Loaded: loaded (/usr/lib/systemd/system/openqa-worker.target; disabled; vendor preset: disabled)
Active: active since Wed 2021-03-03 09:26:55 CET; 3h 26min ago
Mär 03 09:26:55 QA-Power8-5-kvm systemd[1]: Stopping openQA Worker.
Mär 03 09:26:55 QA-Power8-5-kvm systemd[1]: Reached target openQA Worker.
Despite being disabled, the target was apparently started during today's deployment. That stopped the …-auto-restart services and thus interrupted the jobs.
Updated by mkittler almost 4 years ago
- Status changed from Workable to In Progress
There were actually 4 workers which still had openqa-worker.target active. None of them had been rebooted for about a month, which is why the target was still active. I never explicitly stopped it because that would mean stopping all jobs (and I assumed it would be stopped on the next reboot anyway). I have nevertheless stopped the target now on the remaining workers, so we can actually say that deployments from now on don't interrupt any jobs. I also applied the salt states again on these workers to start all worker slots again and to ensure the target remains dead after applying salt states. The output of sudo salt -C 'G@roles:worker' cmd.run 'systemctl is-active openqa-worker.target' looks good now (in the sense that I get inactive for every host).
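Checking the salt output by eye gets tedious with many workers, so the check above could be filtered down to the hosts that still need attention. This is a minimal sketch: the function name hosts_with_active_target is hypothetical, and it assumes salt's default output format, where each minion name ends with a colon and its result follows on an indented line.

```shell
# Print every minion whose openqa-worker.target is not reported as "inactive".
# Hypothetical helper; reads salt's default "minion:" / indented-value output from stdin.
hosts_with_active_target() {
    awk '
        # An unindented line ending in ":" names the minion.
        /^[^ \t].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # The first indented line after it carries the systemctl result.
        /^[ \t]+[^ \t]/ {
            if (host != "" && $1 != "inactive") print host
            host = ""
        }
    '
}

# Usage, reusing the salt command from the comment above:
# sudo salt -C 'G@roles:worker' cmd.run 'systemctl is-active openqa-worker.target' | hosts_with_active_target
```

Empty output then means the target is inactive on every worker.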
I'll write a mail to openqa@suse.de stating our plans for the deployment.
Updated by mkittler almost 4 years ago
- Status changed from In Progress to Feedback
I've changed the schedule to deploy on Monday, Wednesday and Friday (so the weekend is still excluded): https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit
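For illustration, a Monday/Wednesday/Friday schedule corresponds to a cron expression with 1,3,5 in the day-of-week field. The sketch below is not the actual pipeline schedule: the time of day (06:00) and the names DEPLOY_CRON and is_deploy_day are assumptions made up for this example.

```shell
# Hypothetical cron expression: 06:00 on Monday, Wednesday and Friday.
# Only the day-of-week field (1,3,5) reflects this ticket; the time is an assumption.
DEPLOY_CRON='0 6 * * 1,3,5'

# Return success if the given ISO weekday (1=Monday .. 7=Sunday) is a deployment day.
is_deploy_day() {
    day=$1
    # The fifth cron field holds the comma-separated days of the week.
    days=$(printf '%s\n' "$DEPLOY_CRON" | awk '{ print $5 }')
    case ",$days," in
        *",$day,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
# is_deploy_day "$(date +%u)" && echo "deployment scheduled today"
```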
Updated by mkittler almost 4 years ago
- Status changed from Feedback to Resolved
Looks like the deployment is now also triggered successfully on Mondays and Fridays. I suppose that's good for now. (And yes, yesterday's deployment failed but that's another story.)
I have also written a mail to openqa@suse.de and mentioned it in the workshop.
Think about what we would like to see to feel safe about deploying every day
Since we're already doing it on o3, I suppose it should be ok to trigger OSD's deployment more often as well. We can always deactivate the schedule temporarily if a bad commit lands on master. We could also extend the pre-checks to consider o3's state, e.g. to take the number of recent incompletes on o3 into account.
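Such a pre-check could look roughly like the sketch below. It is only an illustration of the idea: the function name should_deploy and the default threshold of 20 are made up, and how the number of recent incompletes on o3 would be obtained is deliberately left out.

```shell
# Decide whether to deploy based on the number of recent incomplete jobs on o3.
# Hypothetical helper; the caller must supply the count, the threshold defaults to 20.
should_deploy() {
    incompletes=$1
    threshold=${2:-20}
    # Reject anything that is not a non-negative integer.
    case "$incompletes" in
        ''|*[!0-9]*) echo "invalid incomplete count: '$incompletes'" >&2; return 2 ;;
    esac
    # Succeed only if the count does not exceed the threshold.
    [ "$incompletes" -le "$threshold" ]
}

# Usage:
# should_deploy "$recent_incompletes" 10 || { echo "too many incompletes on o3, skipping deployment"; exit 1; }
```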