



action #89200


coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs

Switch OSD deployment to two-daily deployment

Added by okurz about 4 years ago. Updated about 4 years ago.

Feature requests
Acceptance criteria

  • AC1: gitlab CI pipeline switched from weekly to two-daily schedule
  • AC2: The change was communicated to the main OSD users group


  • Just change the schedule in gitlab to deploy every second day, either directly if possible or by adding multiple weekly schedules so that 3 or 4 days are covered each week.
  • Carefully prepare a communication that we can now deploy more often, since jobs are no longer disrupted during package upgrades :)
  • Think about what we would like to see to feel safe about deploying every day
Actions #1

Updated by mkittler about 4 years ago

I would do that after checking how well tomorrow's first "non-disruptive" deployment goes.

Actions #2

Updated by mkittler about 4 years ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler about 4 years ago

It looks like some jobs were still restarted today on grenache-1 and QA-Power8-5-kvm:

openqa=# select id, t_finished, reason, (select host from workers where id = assigned_worker_id) from jobs where reason like '%quit%' and t_created > now() - interval '1 day';
   id    |     t_finished      |                   reason                   |      host       
 5581139 | 2021-03-03 08:27:47 | quit: worker has been stopped or restarted | grenache-1
 5581059 | 2021-03-03 08:27:26 | quit: worker has been stopped or restarted | grenache-1
 5581185 | 2021-03-03 08:27:32 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581140 | 2021-03-03 08:27:44 | quit: worker has been stopped or restarted | grenache-1
 5581141 | 2021-03-03 08:27:48 | quit: worker has been stopped or restarted | grenache-1
 5581200 | 2021-03-03 08:26:59 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581199 | 2021-03-03 08:27:31 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581088 | 2021-03-03 08:27:49 | quit: worker has been stopped or restarted | grenache-1
 5581142 | 2021-03-03 08:27:49 | quit: worker has been stopped or restarted | grenache-1
(9 rows)

I'll check what could be the cause of this before enabling a more frequent deployment.
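As a quick cross-check of the query result above, the rows can be grouped by host. This is only a minimal sketch that works on the pasted psql output; the function name `interrupted_jobs_per_host` is made up for illustration and is not part of openQA.

```python
from collections import Counter

# Rows copied verbatim from the query result above
# (columns: id | t_finished | reason | host).
PSQL_ROWS = """\
 5581139 | 2021-03-03 08:27:47 | quit: worker has been stopped or restarted | grenache-1
 5581059 | 2021-03-03 08:27:26 | quit: worker has been stopped or restarted | grenache-1
 5581185 | 2021-03-03 08:27:32 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581140 | 2021-03-03 08:27:44 | quit: worker has been stopped or restarted | grenache-1
 5581141 | 2021-03-03 08:27:48 | quit: worker has been stopped or restarted | grenache-1
 5581200 | 2021-03-03 08:26:59 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581199 | 2021-03-03 08:27:31 | quit: worker has been stopped or restarted | QA-Power8-5-kvm
 5581088 | 2021-03-03 08:27:49 | quit: worker has been stopped or restarted | grenache-1
 5581142 | 2021-03-03 08:27:49 | quit: worker has been stopped or restarted | grenache-1
"""


def interrupted_jobs_per_host(rows):
    """Count result rows per host; the host is the fourth '|'-separated column."""
    counts = Counter()
    for line in rows.splitlines():
        columns = [column.strip() for column in line.split("|")]
        # Skip headers, separators and the trailing row count.
        if len(columns) == 4 and columns[0].isdigit():
            counts[columns[3]] += 1
    return counts


if __name__ == "__main__":
    for host, count in interrupted_jobs_per_host(PSQL_ROWS).most_common():
        print(f"{host}: {count}")
    # prints:
    # grenache-1: 6
    # QA-Power8-5-kvm: 3
```

So only two hosts were affected, which already hints at a host-specific cause rather than a general problem.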

Actions #4

Updated by mkittler about 4 years ago

Looks like the services actually received SIGTERM:

Mär 03 09:26:55 QA-Power8-5-kvm systemd[1]: Stopping openQA Worker #2...
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Received signal TERM
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [debug] [pid:86146] Stopping job 5581185 from 05581185-sle-15-SP3-Online-ppc64le-Build156.3-wicked_basic_ref@ppc64le - reason: quit
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [debug] [pid:86146] REST-API call: POST
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Trying to stop job gracefully by announcing it to command server via http://localhost:20023/ohFona2EOzURJmZN/broadcast
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Isotovideo exit status: 1
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] +++ worker notes +++
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] End time: 2021-03-03 08:26:55
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Result: quit
Mär 03 09:27:24 grenache-1 worker[523676]: [info] [pid:523676] Received signal TERM
Mär 03 09:27:24 grenache-1 worker[523676]: [debug] [pid:523676] Stopping job 5581139 from 05581139-sle-15-SP3-Regression-on-Migration-from-SLE12-SPx-s390x-Buildhjluo_os-autoinst-distri-opensuse_unlock-offline_sles12sp4_ltss_pscc_sdk-asmm-contm-lgm-tcm-wsm_all_full@hjluo_os-autoinst-distri-opensuse_unlock@s390x-kvm-sle12 - reason: quit
Mär 03 09:27:24 grenache-1 worker[523676]: [debug] [pid:523676] REST-API call: POST
Mär 03 09:27:24 grenache-1 systemd[1]: Stopping openQA Worker #34...

So this is not a problem within the worker code. The problem is that the openQA Worker target is still active on these hosts:

martchus@grenache-1:~> systemctl status
● - openQA Worker
   Loaded: loaded (/usr/lib/systemd/system/; disabled; vendor preset: disabled)
   Active: active since Wed 2021-03-03 09:27:24 CET; 3h 25min ago

martchus@QA-Power8-5-kvm:~> sudo systemctl status
● - openQA Worker
   Loaded: loaded (/usr/lib/systemd/system/; disabled; vendor preset: disabled)
   Active: active since Wed 2021-03-03 09:26:55 CET; 3h 26min ago

Mär 03 09:26:55 QA-Power8-5-kvm systemd[1]: Stopping openQA Worker.
Mär 03 09:26:55 QA-Power8-5-kvm systemd[1]: Reached target openQA Worker.

Despite being disabled, it was apparently started during today's deployment. That stopped the …-auto-restart services and thus interrupted the jobs.
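To match the journal against the database query, the interrupted job IDs can be pulled out of the worker log lines. The following is a rough sketch, assuming the line layout seen in the excerpt above; the regular expression and the function name `stopped_jobs` are illustrative only.

```python
import re

# Matches worker journal lines such as the ones quoted above:
# "<month> <day> <time> <host> worker[<pid>]: ... Stopping job <id> from <name> - reason: <reason>"
STOP_RE = re.compile(
    r"^\S+ \d+ [\d:]+ (?P<host>\S+) worker\[\d+\]: .*"
    r"Stopping job (?P<job>\d+) from .* - reason: (?P<reason>\w+)$"
)

# Lines copied from the journal excerpt above.
JOURNAL = """\
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [info] [pid:86146] Received signal TERM
Mär 03 09:26:55 QA-Power8-5-kvm worker[86146]: [debug] [pid:86146] Stopping job 5581185 from 05581185-sle-15-SP3-Online-ppc64le-Build156.3-wicked_basic_ref@ppc64le - reason: quit
Mär 03 09:27:24 grenache-1 worker[523676]: [info] [pid:523676] Received signal TERM
Mär 03 09:27:24 grenache-1 worker[523676]: [debug] [pid:523676] Stopping job 5581139 from 05581139-sle-15-SP3-Regression-on-Migration-from-SLE12-SPx-s390x-Buildhjluo_os-autoinst-distri-opensuse_unlock-offline_sles12sp4_ltss_pscc_sdk-asmm-contm-lgm-tcm-wsm_all_full@hjluo_os-autoinst-distri-opensuse_unlock@s390x-kvm-sle12 - reason: quit
"""


def stopped_jobs(journal):
    """Return (host, job id, reason) for every 'Stopping job' line."""
    found = []
    for line in journal.splitlines():
        match = STOP_RE.match(line)
        if match:
            found.append((match["host"], int(match["job"]), match["reason"]))
    return found


if __name__ == "__main__":
    for host, job, reason in stopped_jobs(JOURNAL):
        print(host, job, reason)
```

Both job IDs found this way (5581185 and 5581139) also appear in the query result from the earlier comment.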

Actions #5

Updated by mkittler about 4 years ago

  • Status changed from Workable to In Progress

There were actually 4 workers on which the target was still active. None of them had been rebooted for about a month, which is why the target was still active. I never explicitly stopped it because that would mean stopping all jobs (and I assumed it would be stopped on the next reboot anyway). Nevertheless, I have now stopped the target on the remaining workers, so we can actually say that deployments from now on don't interrupt any jobs. I also applied the salt states again on these workers to start all worker slots again and to ensure the target remains dead after applying salt states. sudo salt -C 'G@roles:worker' 'systemctl is-active' now looks good (in the sense that I get inactive for every host).
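The "inactive for every host" check can also be automated. Below is a minimal sketch that assumes salt's default text output, i.e. a `host:` line followed by an indented result line; the host names in the sample are hypothetical and the function `hosts_with_active_target` is not an existing tool.

```python
def hosts_with_active_target(salt_output):
    """Return the hosts whose reported state is anything other than 'inactive'."""
    still_active = []
    host = None
    for line in salt_output.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" ") and line.rstrip().endswith(":"):
            # Unindented "name:" line starts the block for one host.
            host = line.rstrip()[:-1]
        elif host is not None:
            # First indented line after the host name carries the state.
            if line.strip() != "inactive":
                still_active.append(host)
            host = None
    return still_active


if __name__ == "__main__":
    # Hypothetical sample output, not taken from the real workers.
    sample = "worker-a:\n    inactive\nworker-b:\n    active\n"
    print(hosts_with_active_target(sample))  # prints ['worker-b']
```

An empty list means the target is stopped everywhere, which is the state described above.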

I'll write a mail stating our plans for the deployment.

Actions #6

Updated by mkittler about 4 years ago

  • Status changed from In Progress to Feedback

I've changed the schedule to deploy on Monday, Wednesday and Friday (so the weekend is still excluded).
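For illustration, a Monday/Wednesday/Friday schedule can be written as a single cron expression with `1,3,5` in the day-of-week field. The sketch below computes the gap between consecutive deployments for such an expression. The time of day (`0 8`) is a made-up placeholder, not the real schedule, and the helper only understands comma-separated day numbers, not ranges or names.

```python
WEEKDAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def deployment_gaps(cron_expr):
    """Map each deployment weekday to the number of days until the next deployment."""
    day_field = cron_expr.split()[4]  # fifth cron field: day of week
    days = sorted(int(day) % 7 for day in day_field.split(","))
    gaps = {}
    for index, day in enumerate(days):
        next_day = days[(index + 1) % len(days)]
        # Wrap around the end of the week; a single weekly day yields 7.
        gaps[WEEKDAYS[day]] = (next_day - day - 1) % 7 + 1
    return gaps


if __name__ == "__main__":
    # Hypothetical time of day; only the day-of-week field matters here.
    print(deployment_gaps("0 8 * * 1,3,5"))
    # prints {'Mon': 2, 'Wed': 2, 'Fri': 3}
```

The only gap longer than two days is the one over the weekend, which matches the intent of excluding Saturday and Sunday.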

Actions #7

Updated by mkittler about 4 years ago

  • Status changed from Feedback to Resolved

Looks like the deployment is now also triggered successfully on Mondays and Fridays. I suppose that's good for now. (And yes, yesterday's deployment failed but that's another story.)

I have also written a mail and mentioned it in the workshop.

Think about what we would like to see to feel safe about deploying every day

Since we're already doing it on o3, I suppose it should be ok to trigger OSD's deployment more often as well. We can always deactivate the schedule temporarily if a bad commit lands on master. We could also extend the pre-checks to consider o3's state, e.g. to take the number of recent incompletes on o3 into account.
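Such a pre-check could look roughly like the sketch below. It is only an idea under stated assumptions: the list of recent o3 job results would have to be fetched elsewhere, and the threshold of 5 as well as the function name `deployment_allowed` are hypothetical.

```python
def deployment_allowed(o3_recent_results, max_incompletes=5):
    """Allow the OSD deployment only if o3 shows few recent incomplete jobs.

    o3_recent_results: result strings of recently finished o3 jobs.
    max_incompletes: hypothetical threshold, to be tuned.
    """
    incompletes = sum(1 for result in o3_recent_results if result == "incomplete")
    return incompletes <= max_incompletes


if __name__ == "__main__":
    # Made-up sample data for illustration.
    healthy = ["passed"] * 40 + ["failed"] * 5 + ["incomplete"] * 2
    broken = ["passed"] * 10 + ["incomplete"] * 30
    print(deployment_allowed(healthy))  # prints True
    print(deployment_allowed(broken))   # prints False
```

A check like this would skip a scheduled deployment automatically instead of relying on someone deactivating the schedule by hand.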

