action #105885
coordination #80142 (closed): [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs
Continuous deployment of o3 workers - all the other o3 workers size:M
Added by okurz almost 3 years ago. Updated over 2 years ago.
Description
Acceptance criteria
- AC1: All o3 workers automatically deploy after every update to os-autoinst or openQA-worker
Suggestions
- Since the automatic deployment has already been done on openqaworker7 and #105379 contains enough information to apply the approach on other o3 workers, we suggest continuing here
- Ensure the root filesystem is mounted read-write, i.e.
  mount -o rw,remount /
- Enable the timer for "openqa-continuous-update.service", i.e.
  systemctl enable --now openqa-continuous-update.timer
- For checking, call
  systemctl status openqa-continuous-update
  and inspect the results, e.g. with
  journalctl -e -u openqa-continuous-update
  and check for any unforeseen errors and such (see the combined sketch after this list)
- Monitor the state of the systems after some hours
- Monitor again on the next day
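Taken together, a minimal sketch of the per-worker steps from the suggestions above, to be run as root on each worker. It assumes the openQA-continuous-update package (which is installed later in this ticket) provides the timer unit:

# Sketch of the suggested per-worker setup; assumes the package is available in the configured repos
mount -o rw,remount /                                   # make the root filesystem writable
zypper -n in openQA-continuous-update                   # install the continuous-update service/timer
systemctl enable --now openqa-continuous-update.timer   # enable periodic deployment
systemctl status openqa-continuous-update               # verify the last run
journalctl -e -u openqa-continuous-update               # check for unforeseen errors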
Updated by okurz almost 3 years ago
- Copied from action #105379: Continuous deployment of o3 workers - one worker first size:M added
Updated by mkittler over 2 years ago
- Target version changed from future to Ready
Since it has been done on openqaworker7 and #105379 contains enough information to apply the approach on other o3 workers, I'd suggest we continue here.
Updated by livdywan over 2 years ago
- Subject changed from Continuous deployment of o3 workers - all the other o3 workers to Continuous deployment of o3 workers - all the other o3 workers size:m
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 2 years ago
Looks like just setting the mount option to rw won't cut it. What I've stated in #105379#note-10 is true, but when one wants to actually install a package one runs into:
error: can't create transaction lock on /usr/lib/sysimage/rpm/.rpm.lock (Read-only file system)
or just:
openqaworker1:~ # touch /foo
touch: cannot touch '/foo': Read-only file system
Using mount -o rw,remount /
doesn't help either. Not sure on which levels the file system is still set to be read-only.
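To narrow down on which level the filesystem is still read-only, one could check both the VFS mount options and the btrfs subvolume property, e.g. (a sketch):

# Show the mount options of the root filesystem (rw vs. ro at the VFS level)
findmnt -no OPTIONS /
# Show whether the root btrfs subvolume itself is flagged read-only
btrfs property get -ts / ro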
Updated by mkittler over 2 years ago
Looks like it is also read-only on the btrfs level, but one can simply make it read-write:
openqaworker1:~ # btrfs property get -ts / ro
ro=true
openqaworker1:~ # btrfs property set -ts / ro false
openqaworker1:~ # touch /foo
openqaworker1:~ # rm /foo
openqaworker1:~ # zypper in openQA-continuous-update
Loading repository data...
Reading installed packages...
Resolving package dependencies...
The following NEW package is going to be installed:
openQA-continuous-update
1 new package to install.
Overall download size: 0 B. Already in cache: 341.9 KiB. After the operation, additional 1.3 KiB will be used.
Continue? [y/n/v/...? shows all options] (y):
In cache openQA-continuous-update-4.6.1652868008.418a4ec-lp153.4993.1.noarch.rpm (1/1), 341.9 KiB ( 1.3 KiB unpacked)
…
(1/1) Installing: openQA-continuous-update-4.6.1652868008.418a4ec-lp153.4993.1.noarch
openqaworker1:~ # systemctl enable --now openqa-continuous-update.timer
Created symlink /etc/systemd/system/timers.target.wants/openqa-continuous-update.timer → /usr/lib/systemd/system/openqa-continuous-update.timer.
I've just invoked
openqaworker1:~ # systemctl start transactional-update.service
to see whether the setup will persist after rebooting the system via the transactional setup. Unfortunately there's currently nothing to be updated:
May 18 14:06:46 openqaworker1 transactional-update[30392]: Calling zypper --no-cd dup
May 18 14:06:52 openqaworker1 transactional-update[30392]: zypper: nothing to update
May 18 14:06:52 openqaworker1 transactional-update[30392]: Removing snapshot #1336...
May 18 14:06:52 openqaworker1 transactional-update[31633]: 2022-05-18 14:06:52 tukit 3.6.2 started
May 18 14:06:52 openqaworker1 transactional-update[31633]: 2022-05-18 14:06:52 Options: abort 1336
May 18 14:06:53 openqaworker1 transactional-update[31633]: 2022-05-18 14:06:53 Discarding snapshot 1336.
May 18 14:06:53 openqaworker1 transactional-update[31633]: 2022-05-18 14:06:53 Transaction completed.
May 18 14:06:53 openqaworker1 transactional-update[30392]: transactional-update finished
May 18 14:06:53 openqaworker1 systemd[1]: transactional-update.service: Succeeded.
May 18 14:06:53 openqaworker1 systemd[1]: Finished Update the system.
So I'm waiting for some updates to test whether it actually works (before applying the btrfs changes on all other workers).
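In the meantime one could also trigger the service manually and watch its journal to confirm it runs cleanly. A sketch; this only exercises the service and doesn't guarantee that an actual package update happens:

systemctl start openqa-continuous-update.service    # run one update cycle immediately
journalctl -f -u openqa-continuous-update.service   # follow the output and watch for errors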
Updated by mkittler over 2 years ago
- Status changed from Workable to In Progress
Updated by mkittler over 2 years ago
Now there were some updates. I also rebooted the system. The root filesystem is still read-write so I suppose it worked. It also doesn't look like there are any other read-only subvolumes left. So now I can apply the config on all other o3 workers (openqaworker1 and openqaworker7 have already been handled).
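One way to double-check that no read-only subvolumes remain is to query the ro property for every mounted btrfs filesystem, e.g. (a sketch; this only covers subvolumes that are actually mounted):

# Print the read-only property for each btrfs mount point
for m in $(findmnt -t btrfs -no TARGET); do
    echo -n "$m: " && btrfs property get -ts "$m" ro
done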
Updated by mkittler over 2 years ago
- Status changed from In Progress to Feedback
Edited /etc/fstab
and executed mount -o rw,remount / && btrfs property set -ts / ro false && zypper ref && zypper -n in openQA-continuous-update && systemctl enable --now openqa-continuous-update.timer
on all o3 workers mentioned on https://progress.opensuse.org/projects/openqav3/wiki/#Manual-command-execution-on-o3-workers. (Of course I left out the file system configuration where no read-only btrfs was used anyway.)
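The rollout could be scripted over the host list from that wiki page, e.g. (a sketch; the hostnames below mirror the check loop later in this ticket and stand in for the actual list):

# Hypothetical rollout loop; take the real host list from the wiki page linked above
for i in openqaworker1 openqaworker4 openqaworker7 imagetester rebel; do
    echo "deploying on $i"
    ssh "root@$i" 'mount -o rw,remount / && btrfs property set -ts / ro false && zypper ref && zypper -n in openQA-continuous-update && systemctl enable --now openqa-continuous-update.timer'
done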
Monitor again on the next day
For comparison, the fail/incomplete rate currently looks like this:
openqa=> with finished as (select result, t_finished from jobs) select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where result = 'failed' or result = 'incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_failures_or_incompletes, count(*) total from finished where t_finished >= '2022-05-10' group by year, month, day order by year, month, day asc;
year | month | day | ratio_of_all_failures_or_incompletes | total
------+-------+-----+--------------------------------------+-------
2022 | 5 | 10 | 39.6 | 2326
2022 | 5 | 11 | 59.21 | 983
2022 | 5 | 12 | 66.65 | 4893
2022 | 5 | 13 | 43.92 | 1152
2022 | 5 | 14 | 31.12 | 1401
2022 | 5 | 15 | 47.43 | 1908
2022 | 5 | 16 | 33.58 | 2543
2022 | 5 | 17 | 29.71 | 2198
2022 | 5 | 18 | 28.85 | 1234
(9 rows)
openqa=> with finished as (select result, t_finished from jobs) select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where result = 'incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_incompletes, count(*) total from finished where t_finished >= '2022-05-10' group by year, month, day order by year, month, day asc;
year | month | day | ratio_of_all_incompletes | total
------+-------+-----+--------------------------+-------
2022 | 5 | 10 | 6.88 | 2326
2022 | 5 | 11 | 13.22 | 983
2022 | 5 | 12 | 59.13 | 4893
2022 | 5 | 13 | 7.47 | 1152
2022 | 5 | 14 | 7.21 | 1401
2022 | 5 | 15 | 2.52 | 1908
2022 | 5 | 16 | 5.86 | 2543
2022 | 5 | 17 | 3.69 | 2198
2022 | 5 | 18 | 7.53 | 1235
(9 rows)
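To repeat this check later without pasting the query interactively, it could be wrapped in a non-interactive psql call, e.g. (a sketch; assumes shell access to the openqa database as in the session above):

# Daily incomplete ratio since a given date, runnable from a shell
psql openqa -c "with finished as (select result, t_finished from jobs) \
  select extract(YEAR from t_finished) as year, extract(MONTH from t_finished) as month, \
         extract(DAY from t_finished) as day, \
         round(count(*) filter (where result = 'incomplete') * 100. / count(*), 2) as ratio, \
         count(*) total \
  from finished where t_finished >= '2022-05-10' \
  group by year, month, day order by year, month, day asc"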
Updated by mkittler over 2 years ago
The journal looks good on the x86_64 workers. It seems like the repo wasn't reachable for some time but that didn't lead to any problems (like the service being stuck). The rate of failures/incompletes also didn't change significantly.
Apparently it didn't work on aarch64 and power8 at first, so I looked into it. It works now on these hosts as well; it was just a differently configured repo name.
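A quick way to spot such differences is to list the configured repositories with their URIs on each host and compare, e.g. (a sketch; the hostnames are illustrative):

# Compare repo aliases/URIs across architectures to spot naming differences
for i in openqaworker1 openqaworker-arm-1 power8; do
    echo "== $i ==" && ssh "root@$i" zypper lr -u
done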
Updated by okurz over 2 years ago
- Subject changed from Continuous deployment of o3 workers - all the other o3 workers size:m to Continuous deployment of o3 workers - all the other o3 workers size:M
Updated by okurz over 2 years ago
- Copied to action #111377: Continuous deployment of osd workers - similar as on o3 size:M added
Updated by okurz over 2 years ago
- Status changed from Feedback to Resolved
I checked
for i in openqaworker1 openqaworker4 openqaworker7 imagetester rebel; do echo $i && ssh root@$i "systemctl status openqa-continuous-update.service; rpm -q --changelog openQA | head; rpm -q --changelog os-autoinst | head" ; done
and all looks good.
Considering this done.