action #89200

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs

Switch OSD deployment to two-daily deployment

Added by okurz about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-02-26
Due date:
% Done: 0%
Estimated time:
Description

Acceptance criteria

  • AC1: gitlab CI pipeline switched from weekly to two-daily schedule
  • AC2: The change was communicated to the main OSD users group

Suggestions

  • Just change the schedule in gitlab to deploy every second day, either directly if that is possible, or by adding multiple weekly schedules so that 3 or 4 days are covered each week.
  • Carefully prepare a communication that we can deploy more often now that jobs are no longer disrupted during package upgrades :)
  • Think about what we would like to see to feel safe about deploying every day
Actions #1

Updated by mkittler about 3 years ago

I would do that after checking how well tomorrow's first "non-disruptive" deployment goes.

Actions #2

Updated by mkittler about 3 years ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler about 3 years ago

It looks like some jobs were still restarted today on grenache-1 and QA-Power8-5-kvm:

openqa=# select id, t_finished, reason, (select host from workers where id = assigned_worker_id) from jobs where reason like '%quit%' and t_created > now() - interval '1 day';
   id    |     t_finished      |                   reason                   |      host       
---------+---------------------+--------------------------------------------+-----------------
 5581139 | 2021-03-03 08:27:47 | quit: worker has been stopped or restarted | grenache-1
 5581059 | 2021-03-03 08:27:26 | quit: worker has been stopped or restarted | grenache-1
 5581185 | 2021-03-03 08:27:32 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581140 | 2021-03-03 08:27:44 | quit: worker has been stopped or restarted | grenache-1
 5581141 | 2021-03-03 08:27:48 | quit: worker has been stopped or restarted | grenache-1
 5581200 | 2021-03-03 08:26:59 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581199 | 2021-03-03 08:27:31 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581088 | 2021-03-03 08:27:49 | quit: worker has been stopped or restarted | grenache-1
 5581142 | 2021-03-03 08:27:49 | quit: worker has been stopped or restarted | grenache-1
(9 rows)

I'll check what could be the cause of this before enabling a more frequent deployment.
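To see at a glance how the interrupted jobs are spread across the two hosts, the rows of the query result can be tallied per host. This is only a small sketch: the (job id, host) pairs are copied from the output above, while the names `interrupted_jobs` and `jobs_per_host` are made up for illustration.

```python
from collections import Counter

# (job id, host) pairs copied from the query result above.
interrupted_jobs = [
    (5581139, "grenache-1"),
    (5581059, "grenache-1"),
    (5581185, "QA-Power8-5-kvm"),
    (5581140, "grenache-1"),
    (5581141, "grenache-1"),
    (5581200, "QA-Power8-5-kvm"),
    (5581199, "QA-Power8-5-kvm"),
    (5581088, "grenache-1"),
    (5581142, "grenache-1"),
]


def jobs_per_host(rows):
    """Count how many interrupted jobs were assigned to each worker host."""
    return Counter(host for _job_id, host in rows)


if __name__ == "__main__":
    for host, count in jobs_per_host(interrupted_jobs).most_common():
        print(f"{host}: {count}")
```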

Actions #4

Updated by mkittler about 3 years ago

Looks like the services actually received SIGTERM:

Mär 03 09:26:55 QA-Power8-5-kvm systemd[1]: Stopping openQA Worker #2...
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Received signal TERM
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [debug] [pid:86146] Stopping job 5581185 from openqa.suse.de: 05581185-sle-15-SP3-Online-ppc64le-Build156.3-wicked_basic_ref@ppc64le - reason: quit
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [debug] [pid:86146] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5581185/status
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Trying to stop job gracefully by announcing it to command server via http://localhost:20023/ohFona2EOzURJmZN/broadcast
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Isotovideo exit status: 1
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] +++ worker notes +++
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] End time: 2021-03-03 08:26:55
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Result: quit
Mär 03 09:27:24 grenache-1 worker[523676]: [info] [pid:523676] Received signal TERM
Mär 03 09:27:24 grenache-1 worker[523676]: [debug] [pid:523676] Stopping job 5581139 from openqa.suse.de: 05581139-sle-15-SP3-Regression-on-Migration-from-SLE12-SPx-s390x-Buildhjluo_os-autoinst-distri-opensuse_unlock-offline_sles12sp4_ltss_pscc_sdk-asmm-contm-lgm-tcm-wsm_all_full@hjluo_os-autoinst-distri-opensuse_unlock@s390x-kvm-sle12 - reason: quit
Mär 03 09:27:24 grenache-1 worker[523676]: [debug] [pid:523676] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5581139/status
Mär 03 09:27:24 grenache-1 systemd[1]: Stopping openQA Worker #34...

So this is not a problem within the worker code. The problem is that openqa-worker.target is still active on these hosts:

martchus@grenache-1:~> systemctl status openqa-worker.target
● openqa-worker.target - openQA Worker
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker.target; disabled; vendor preset: disabled)
   Active: active since Wed 2021-03-03 09:27:24 CET; 3h 25min ago

martchus@QA-Power8-5-kvm:~> sudo systemctl status openqa-worker.target
● openqa-worker.target - openQA Worker
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker.target; disabled; vendor preset: disabled)
   Active: active since Wed 2021-03-03 09:26:55 CET; 3h 26min ago

Mär 03 09:26:55 QA-Power8-5-kvm systemd[1]: Stopping openQA Worker.
Mär 03 09:26:55 QA-Power8-5-kvm systemd[1]: Reached target openQA Worker.

Despite being disabled, the target was apparently started during today's deployment. That stopped the …-auto-restart services and thus interrupted the jobs.

Actions #5

Updated by mkittler about 3 years ago

  • Status changed from Workable to In Progress

There were actually 4 workers on which openqa-worker.target was still active. None of them had been rebooted for about a month, so the target had never been stopped. I never stopped it explicitly because that would have meant stopping all jobs (and I assumed it would be stopped on the next reboot anyway). Nevertheless, I have now stopped the target on the remaining workers, so we can actually say that deployments from now on don't interrupt any jobs. I also applied the salt states again on these workers to start all worker slots again and to ensure the target remains dead after applying salt states. The output of sudo salt -C 'G@roles:worker' cmd.run 'systemctl is-active openqa-worker.target' now looks good (in the sense that I get inactive for every host).
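The check "inactive for every host" could also be automated. The following is a minimal sketch, assuming the per-host output of `systemctl is-active openqa-worker.target` has already been collected into a dict mapping host name to output string. How that dict is obtained from salt is left open here, and the function name and sample data are hypothetical.

```python
def hosts_with_active_target(states):
    """Return the hosts whose reported unit state is anything but 'inactive'.

    `states` maps a host name to the raw output of `systemctl is-active`,
    e.g. "active" or "inactive" (possibly with trailing whitespace).
    """
    return sorted(
        host for host, state in states.items() if state.strip() != "inactive"
    )


if __name__ == "__main__":
    # Hypothetical sample data, not actual output from the workers.
    sample = {
        "grenache-1": "inactive\n",
        "QA-Power8-5-kvm": "active\n",
    }
    offenders = hosts_with_active_target(sample)
    if offenders:
        print("openqa-worker.target still active on:", ", ".join(offenders))
    else:
        print("openqa-worker.target is inactive on all hosts")
```

An empty result means a deployment should no longer stop the worker services through the target.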

I'll write a mail to openqa@suse.de stating our plans for the deployment.

Actions #6

Updated by mkittler about 3 years ago

  • Status changed from In Progress to Feedback

I've changed the schedule to deploy on Monday, Wednesday and Friday (so the weekend is still excluded): https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit
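GitLab pipeline schedules are defined with cron syntax, where a Monday/Wednesday/Friday schedule corresponds to a day-of-week field of `1,3,5`. The sketch below only illustrates that mapping. The field value and the helper name `is_deploy_day` are assumptions and are not copied from the actual schedule, and the time of day is deliberately left out.

```python
from datetime import date

# Assumed cron day-of-week field: 1 = Monday, 3 = Wednesday, 5 = Friday.
DEPLOY_DAYS_CRON = "1,3,5"


def is_deploy_day(day, cron_days=DEPLOY_DAYS_CRON):
    """Return True if `day` matches the cron day-of-week list.

    Cron counts Sunday as 0 (or 7), while date.isoweekday() counts Monday as 1
    and Sunday as 7, so both sides are reduced modulo 7 before comparing.
    """
    allowed = {int(field) % 7 for field in cron_days.split(",")}
    return day.isoweekday() % 7 in allowed


if __name__ == "__main__":
    # 2021-03-03 was a Wednesday, 2021-03-06 a Saturday.
    print(is_deploy_day(date(2021, 3, 3)))
    print(is_deploy_day(date(2021, 3, 6)))
```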

Actions #7

Updated by mkittler about 3 years ago

  • Status changed from Feedback to Resolved

Looks like the deployment is now also triggered successfully on Mondays and Fridays. I suppose that's good for now. (And yes, yesterday's deployment failed, but that's another story.)

I have also written a mail to openqa@suse.de and mentioned it in the workshop.


Think about what we would like to see to feel safe about deploying every day

Since we're already doing it on o3, I suppose it should be OK to trigger OSD's deployment more often as well. We can always deactivate the schedule temporarily if a bad commit lands on master. We could also extend the pre-checks which consider o3's state, e.g. to take the number of recent incompletes on o3 into account.
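A pre-check based on recent incompletes could look roughly like the sketch below. It is not the existing pre-check code. It assumes the o3 jobs have already been fetched and converted into dicts with a `result` string and a `t_finished` datetime. The 24-hour window and the threshold of 10 are placeholder values, and all function names are hypothetical.

```python
from datetime import datetime, timedelta


def count_recent_incompletes(jobs, now, window=timedelta(hours=24)):
    """Count jobs with result 'incomplete' that finished within `window` before `now`."""
    cutoff = now - window
    return sum(
        1
        for job in jobs
        if job["result"] == "incomplete"
        and job["t_finished"] is not None
        and job["t_finished"] >= cutoff
    )


def deployment_allowed(jobs, now, max_incompletes=10):
    """Allow the deployment only if the recent incompletes stay within the threshold."""
    return count_recent_incompletes(jobs, now) <= max_incompletes


if __name__ == "__main__":
    now = datetime(2021, 3, 3, 9, 0)
    # Hypothetical sample data, not real o3 jobs.
    jobs = [
        {"result": "incomplete", "t_finished": now - timedelta(hours=2)},
        {"result": "passed", "t_finished": now - timedelta(hours=1)},
        {"result": "incomplete", "t_finished": now - timedelta(days=3)},
    ]
    print(count_recent_incompletes(jobs, now))
    print(deployment_allowed(jobs, now, max_incompletes=0))
```

Keeping the threshold as a parameter makes it easy to start with a lenient value and tighten it once there is some experience with the more frequent deployments.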
