action #89200: Switch OSD deployment to two-daily deployment - openQA Project (public) - openSUSE Project Management Tool

action #89200

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs

Switch OSD deployment to two-daily deployment

Added by okurz about 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Feature requests
Target version: Ready
Start date: 2021-02-26
Tags: osd, deployment, gitlab

Description

Acceptance criteria

AC1: The GitLab CI pipeline is switched from a weekly to a two-daily schedule
AC2: The change was communicated to the main OSD users group

Suggestions

Change the schedule in GitLab to deploy every second day, either directly if possible or by adding multiple weekly schedules so that 3 or 4 days are covered each week.
Carefully prepare a communication explaining that we can now deploy more often because jobs are no longer disrupted during package upgrades :)
Think about what we would like to see to feel safe about deploying every day
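
The two scheduling options above can be compared with a small sketch. It assumes the pipeline schedule uses standard cron syntax, where `*/2` in the day-of-month field expands to 1, 3, ..., 31. The helper names are hypothetical and not part of any openQA or GitLab tooling.

```python
from datetime import date, timedelta


def date_range(start, n_days):
    """Return n_days consecutive dates beginning at start."""
    return [start + timedelta(days=i) for i in range(n_days)]


def every_second_day_of_month(start, n_days):
    """Dates matched by a cron day-of-month field of '*/2' (1, 3, ..., 31)."""
    return [d for d in date_range(start, n_days) if d.day % 2 == 1]


def on_weekdays(start, n_days, weekdays):
    """Dates whose weekday (Monday=0 ... Sunday=6) is in weekdays."""
    return [d for d in date_range(start, n_days) if d.weekday() in weekdays]


def gaps_in_days(dates):
    """Number of days between each pair of consecutive deployment dates."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


if __name__ == "__main__":
    # Option 1: day-of-month step, looked at across the March/April boundary
    month_end = every_second_day_of_month(date(2021, 3, 27), 10)
    print(gaps_in_days(month_end))    # [2, 2, 1, 2, 2]

    # Option 2: fixed weekdays (Monday, Wednesday, Friday)
    mon_wed_fri = on_weekdays(date(2021, 3, 1), 14, {0, 2, 4})
    print(gaps_in_days(mon_wed_fri))  # [2, 2, 3, 2, 2]
```

With a day-of-month step, the 31st and the 1st fall on consecutive days, so the pipeline would run on two days in a row at the end of 31-day months and could also land on weekends. A fixed set of weekdays never deploys on consecutive days and keeps the weekend free.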

Updated by mkittler about 4 years ago

I would do that after checking how well tomorrow's first "non-disruptive" deployment goes.

Updated by mkittler about 4 years ago

Assignee set to mkittler

Updated by mkittler about 4 years ago

It looks like some jobs were still restarted today on grenache-1 and QA-Power8-5-kvm:

openqa=# select id, t_finished, reason, (select host from workers where id = assigned_worker_id) from jobs where reason like '%quit%' and t_created > now() - interval '1 day';
   id    |     t_finished      |                   reason                   |      host       
---------+---------------------+--------------------------------------------+-----------------
 5581139 | 2021-03-03 08:27:47 | quit: worker has been stopped or restarted | grenache-1
 5581059 | 2021-03-03 08:27:26 | quit: worker has been stopped or restarted | grenache-1
 5581185 | 2021-03-03 08:27:32 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581140 | 2021-03-03 08:27:44 | quit: worker has been stopped or restarted | grenache-1
 5581141 | 2021-03-03 08:27:48 | quit: worker has been stopped or restarted | grenache-1
 5581200 | 2021-03-03 08:26:59 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581199 | 2021-03-03 08:27:31 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581088 | 2021-03-03 08:27:49 | quit: worker has been stopped or restarted | grenache-1
 5581142 | 2021-03-03 08:27:49 | quit: worker has been stopped or restarted | grenache-1
(9 rows)

I'll check what could be the cause of this before enabling a more frequent deployment.
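
To see how the interrupted jobs are distributed, the rows returned by the query above can be tallied per host. This is a minimal sketch that hardcodes the job IDs and hosts from the query output rather than connecting to the database; the function name is made up for illustration.

```python
from collections import Counter

# (job id, worker host) pairs copied from the query result above
QUIT_JOBS = [
    (5581139, "grenache-1"),
    (5581059, "grenache-1"),
    (5581185, "QA-Power8-5-kvm"),
    (5581140, "grenache-1"),
    (5581141, "grenache-1"),
    (5581200, "QA-Power8-5-kvm"),
    (5581199, "QA-Power8-5-kvm"),
    (5581088, "grenache-1"),
    (5581142, "grenache-1"),
]


def jobs_per_host(rows):
    """Count how many interrupted jobs were assigned to each worker host."""
    return Counter(host for _job_id, host in rows)


if __name__ == "__main__":
    for host, count in jobs_per_host(QUIT_JOBS).most_common():
        print(f"{host}: {count}")
```

All nine interrupted jobs fall on just two hosts, which points at something host-specific rather than a general regression in the deployment.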

Updated by mkittler about 4 years ago

Looks like the services actually received SIGTERM:

Mar 03 09:26:55 QA-Power8-5-kvm systemd[1]: Stopping openQA Worker #2...
Mar 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Received signal TERM
Mar 03 09:26:55 QA-Power8-5-kvm worker[86146]: [debug] [pid:86146] Stopping job 5581185 from openqa.suse.de: 05581185-sle-15-SP3-Online-ppc64le-Build156.3-wicked_basic_ref@ppc64le - reason: quit
Mar 03 09:26:55 QA-Power8-5-kvm worker[86146]: [debug] [pid:86146] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5581185/status
Mar 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Trying to stop job gracefully by announcing it to command server via http://localhost:20023/ohFona2EOzURJmZN/broadcast
Mar 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Isotovideo exit status: 1
Mar 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] +++ worker notes +++
Mar 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] End time: 2021-03-03 08:26:55
Mar 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Result: quit

Mar 03 09:27:24 grenache-1 worker[523676]: [info] [pid:523676] Received signal TERM
Mar 03 09:27:24 grenache-1 worker[523676]: [debug] [pid:523676] Stopping job 5581139 from openqa.suse.de: 05581139-sle-15-SP3-Regression-on-Migration-from-SLE12-SPx-s390x-Buildhjluo_os-autoinst-distri-opensuse_unlock-offline_sles12sp4_ltss_pscc_sdk-asmm-contm-lgm-tcm-wsm_all_full@hjluo_os-autoinst-distri-opensuse_unlock@s390x-kvm-sle12 - reason: quit
Mar 03 09:27:24 grenache-1 worker[523676]: [debug] [pid:523676] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5581139/status
Mar 03 09:27:24 grenache-1 systemd[1]: Stopping openQA Worker #34...

So this is not a problem within the worker code. The problem is that openqa-worker.target is still active on these hosts:

martchus@grenache-1:~> systemctl status openqa-worker.target
● openqa-worker.target - openQA Worker
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker.target; disabled; vendor preset: disabled)
   Active: active since Wed 2021-03-03 09:27:24 CET; 3h 25min ago

martchus@QA-Power8-5-kvm:~> sudo systemctl status openqa-worker.target
● openqa-worker.target - openQA Worker
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker.target; disabled; vendor preset: disabled)
   Active: active since Wed 2021-03-03 09:26:55 CET; 3h 26min ago

Mar 03 09:26:55 QA-Power8-5-kvm systemd[1]: Stopping openQA Worker.
Mar 03 09:26:55 QA-Power8-5-kvm systemd[1]: Reached target openQA Worker.

Despite being disabled, the target was apparently started during today's deployment. That stopped the …-auto-restart services and thus interrupted the jobs.

Updated by mkittler about 4 years ago

Status changed from Workable to In Progress

There were actually 4 workers on which openqa-worker.target was still active. None of them had been rebooted for about a month, which is why the target was still active. I never explicitly stopped it because that would have meant stopping all jobs (and I assumed it would be stopped on the next reboot anyway). I have nevertheless stopped the target on the remaining workers now, so we can actually say that deployments no longer interrupt any jobs. I also applied the salt states again on these workers to start all worker slots again and to ensure the target remains dead after applying salt states.

sudo salt -C 'G@roles:worker' cmd.run 'systemctl is-active openqa-worker.target' now looks good (in the sense that I get inactive for every host).
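
The check described above could be automated along these lines. This is a minimal sketch, assuming the per-host states have already been collected (for example from the salt command) into a dictionary. The function name and the sample data are hypothetical; `worker-a` in particular is a made-up host.

```python
def hosts_with_active_target(states):
    """Return the sorted hosts whose openqa-worker.target is not 'inactive'.

    states maps a host name to the text printed by
    'systemctl is-active openqa-worker.target' on that host.
    """
    return sorted(
        host for host, state in states.items() if state.strip() != "inactive"
    )


if __name__ == "__main__":
    # Hypothetical sample resembling the situation before the target was stopped
    sample = {
        "grenache-1": "active",
        "QA-Power8-5-kvm": "active\n",
        "worker-a": "inactive",
    }
    offenders = hosts_with_active_target(sample)
    if offenders:
        print("target still active on:", ", ".join(offenders))
    else:
        print("openqa-worker.target is inactive everywhere")
```

Running such a check before a deployment would flag hosts where the target could still stop the worker services and interrupt jobs.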

I'll write a mail to openqa@suse.de stating our plans for the deployment.

Updated by mkittler about 4 years ago

Status changed from In Progress to Feedback

I've changed the schedule to deploy on Monday, Wednesday and Friday (so the weekend is still excluded): https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit

Updated by mkittler about 4 years ago

Status changed from Feedback to Resolved

It looks like the deployment is now also triggered successfully on Mondays and Fridays. I suppose that's good for now. (And yes, yesterday's deployment failed, but that's another story.)

I have also written a mail to openqa@suse.de and mentioned it in the workshop.

Think about what we would like to see to feel safe about deploying every day

Since we're already doing it on o3, I suppose it should be OK to trigger OSD's deployment more often as well. We can always deactivate the schedule temporarily if a bad commit lands on master. We could also extend the pre-checks that consider o3's state, e.g. to take the number of recent incompletes on o3 into account.
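
The idea of taking o3's recent incompletes into account in the pre-checks could look roughly like this. It is only a sketch: the function name, the threshold and the way the incomplete count is obtained are assumptions, not existing parts of the osd-deployment pipeline.

```python
def should_deploy(recent_incompletes_on_o3, max_incompletes=20, schedule_enabled=True):
    """Decide whether a scheduled OSD deployment should go ahead.

    recent_incompletes_on_o3: number of recently incomplete jobs on o3
        (how this number is queried is left open here).
    max_incompletes: hypothetical threshold above which o3 counts as unhealthy.
    schedule_enabled: False models a temporarily deactivated schedule,
        e.g. after a bad commit landed on master.
    """
    if not schedule_enabled:
        return False
    return recent_incompletes_on_o3 <= max_incompletes


if __name__ == "__main__":
    print(should_deploy(3))                           # healthy o3: deploy
    print(should_deploy(150))                         # many incompletes: skip
    print(should_deploy(3, schedule_enabled=False))   # schedule paused: skip
```

A simple threshold keeps the gate easy to reason about. A suitable value would have to be derived from the usual number of incompletes on o3.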
