action #111989

Seems like o3 machines do not automatically reboot anymore, likely because we continuously call `zypper dup` so that the nightly upgrades don't find any changes? size:M

Added by okurz 3 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2022-01-24
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Motivation

Seems like o3 machines do not automatically reboot anymore, likely because we continuously call zypper dup so that the nightly upgrades don't find any changes?

See #111758#note-18

Acceptance criteria

  • AC1: O3 machines are rebooted after updates requiring a reboot
  • AC2: Eventually root filesystem snapshots are cleaned up

Suggestions

  • Monitor machines for some weeks to keep track of updates
  • Check that there's no pending kernel update, e.g. with `for i in openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do ssh root@$i "zypper ps | tail -n 2"; done`
  • Revisit the update logic to only upgrade openQA packages instead of everything. We discussed this and concluded that we have created an ugly Frankenstein's monster that nobody should replicate. This is yet another confirmation that the proper way to go is either a fully rolling distribution, i.e. Tumbleweed, or a transactional-update server plus containers with a continuously deployed workload on top
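
The pending-reboot check suggested above prints two lines per host without saying which host they belong to. A minimal sketch of a helper that turns the `zypper ps` summary into a one-word label could look like this. The function name and the labels are made up for illustration; the matched phrases are the summary messages that `zypper ps` prints.

```shell
#!/bin/sh
# Sketch only: classify the tail of `zypper ps` output read from stdin.
# classify_zypper_ps and its labels are hypothetical, not part of any package.
classify_zypper_ps() {
    out=$(cat)
    case "$out" in
        *"Reboot is suggested"*) echo "reboot-suggested" ;;
        *"Reboot is probably not necessary"*) echo "no-reboot-needed" ;;
        *"Check failed"*) echo "check-failed" ;;
        *) echo "unknown" ;;
    esac
}

# Hypothetical usage with the host list from this ticket:
# for i in openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do
#     printf '%s: ' "$i"
#     ssh root@$i "zypper ps | tail -n 2" | classify_zypper_ps
# done
```

Printing the host name in front of each label makes it easier to spot which machine is still waiting for a reboot when monitoring over several weeks.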

Related issues

Copied from openQA Project - action #105379: Continuous deployment of o3 workers - one worker first size:M | Resolved | 2022-01-24

History

#1 Updated by okurz 3 months ago

  • Copied from action #105379: Continuous deployment of o3 workers - one worker first size:M added

#2 Updated by tinita 2 months ago

  • Subject changed from Seems like o3 machines do not automatically reboot anymore, likely because we continuously call `zypper dup` so that the nightly upgrades don't find any changes? to Seems like o3 machines do not automatically reboot anymore, likely because we continuously call `zypper dup` so that the nightly upgrades don't find any changes? size:M
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by okurz 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

#4 Updated by okurz 2 months ago

  • Due date set to 2022-07-07
  • Status changed from In Progress to Feedback
  • Priority changed from High to Normal

Monitoring for the next few weeks.

#5 Updated by okurz 2 months ago

`for i in openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do ssh root@$i "zypper ps | tail -n 2; rebootmgrctl status; uptime"; done` shows that a reboot is pending but was never planned. On openqaworker1, `journalctl --since=today -u transactional-update` shows:

-- Logs begin at Thu 2022-05-12 03:17:54 CEST, end at Fri 2022-06-17 17:47:58 CEST. --
Jun 17 00:22:02 openqaworker1 systemd[1]: Starting Update the system...
Jun 17 00:22:02 openqaworker1 openqa-check-devel-repo[4675]: devel:openQA looks good for Leap 15.3 (x86_64)
Jun 17 00:22:02 openqaworker1 transactional-update[4686]: Checking for newer version.
Jun 17 00:22:10 openqaworker1 transactional-update[4686]: transactional-update 3.6.2 started
Jun 17 00:22:10 openqaworker1 transactional-update[4686]: Options: cleanup dup reboot
Jun 17 00:22:10 openqaworker1 transactional-update[4686]: Separate /var detected.
Jun 17 00:22:14 openqaworker1 transactional-update[4686]: 2022-06-17 00:22:10 tukit 3.6.2 started
Jun 17 00:22:14 openqaworker1 transactional-update[4686]: 2022-06-17 00:22:10 Options: -c1336 open
Jun 17 00:22:14 openqaworker1 transactional-update[4686]: 2022-06-17 00:22:13 Using snapshot 1336 as base for new snapshot 1803.
Jun 17 00:22:14 openqaworker1 transactional-update[4686]: 2022-06-17 00:22:13 Parent snapshot 1330 does not exist any more - skipping rsync
Jun 17 00:22:14 openqaworker1 transactional-update[4686]: ID: 1803
Jun 17 00:22:14 openqaworker1 transactional-update[4686]: 2022-06-17 00:22:14 Transaction completed.
Jun 17 00:22:14 openqaworker1 transactional-update[4686]: Calling zypper --no-cd dup
Jun 17 00:22:19 openqaworker1 transactional-update[4686]: zypper: nothing to update
Jun 17 00:22:19 openqaworker1 transactional-update[4686]: Removing snapshot #1803...
Jun 17 00:22:19 openqaworker1 transactional-update[5535]: 2022-06-17 00:22:19 tukit 3.6.2 started
Jun 17 00:22:19 openqaworker1 transactional-update[5535]: 2022-06-17 00:22:19 Options: abort 1803
Jun 17 00:22:19 openqaworker1 transactional-update[5535]: 2022-06-17 00:22:19 Discarding snapshot 1803.
Jun 17 00:22:20 openqaworker1 transactional-update[5535]: 2022-06-17 00:22:20 Transaction completed.
Jun 17 00:22:20 openqaworker1 transactional-update[4686]: transactional-update finished
Jun 17 00:22:20 openqaworker1 systemd[1]: transactional-update.service: Succeeded.
Jun 17 00:22:20 openqaworker1 systemd[1]: Finished Update the system.

So zypper reports "nothing to update", the freshly created snapshot is discarded again, and no reboot is triggered.
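
To spot this pattern on other machines without reading the whole journal, something like the following sketch could be used. It only looks for the "zypper: nothing to update" line seen in the log above; the function name and the two result labels are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: read transactional-update journal lines from stdin and report
# whether the run was a no-op. tu_run_result is a hypothetical helper name.
tu_run_result() {
    if grep -q 'zypper: nothing to update'; then
        echo "no-op"
    else
        echo "changed-or-unknown"
    fi
}

# Hypothetical usage on a worker:
# journalctl --since=today -u transactional-update | tu_run_result
```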

#6 Updated by okurz about 1 month ago

$ ssh o3
Last login: Wed Jul  6 09:55:54 2022 from 192.168.47.252
okurz@ariel:~> for i in openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do ssh root@$i "zypper ps | tail -n 2; rebootmgrctl status; uptime"; done
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
Status: Reboot not requested
 13:40:56  up 37 days  2:50,  0 users,  load average: 1.57, 1.59, 3.06
No core libraries or services have been updated since the last system boot.
Reboot is probably not necessary.
Status: Reboot not requested
 13:40:57  up 1 day 10:06,  0 users,  load average: 1.30, 1.69, 2.43
No core libraries or services have been updated since the last system boot.
Reboot is probably not necessary.
Status: Reboot not requested
 13:40:58  up  10:04,  0 users,  load average: 0.38, 0.62, 1.04
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
Status: Reboot not requested
 11:40:59  up 37 days  2:52,  0 users,  load average: 0.00, 0.02, 0.09
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
Status: Reboot not requested
 13:40:59  up 37 days  2:51,  0 users,  load average: 0.00, 0.00, 0.00
Check failed:
Please install package 'lsof' first.
Error: The name org.opensuse.RebootMgr was not provided by any .service files
 11:41:00  up 37 days  1:13,  0 users,  load average: 0.00, 0.02, 0.21
okurz@ariel:~> 

and

for i in openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do ssh root@$i "zypper ps | tail -n 2"; done
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
No core libraries or services have been updated since the last system boot.
Reboot is probably not necessary.
No core libraries or services have been updated since the last system boot.
Reboot is probably not necessary.
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
Check failed:
Please install package 'lsof' first.

So some machines have been up for 37 days without a reboot. We discussed this in the weekly SUSE QE Tools unblock on 2022-07-06 and agreed that the best approach is likely to just use the openqa-auto-update service. Like transactional-update, it will most likely never need to install anything because the continuous update already does that, but it also checks whether a reboot is needed and requests one over rebootmgr. With the maintenance window still set up correctly to be nightly, there will be a nightly reboot, but only when necessary.
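
For illustration, the "check whether a reboot is needed and request one over rebootmgr" step could be sketched as below. This is not the actual openqa-auto-update script. It assumes `zypper needs-rebooting` returns zypper's documented exit code 102 (ZYPPER_EXIT_INF_REBOOT_NEEDED) when a reboot is needed, which is a different signal than the `zypper ps` summary used above. `NEEDS_REBOOT_CMD` and `REBOOT_CMD` are hypothetical override hooks so the logic can be tried without touching a real machine.

```shell
#!/bin/sh
# Sketch only, not the real openqa-auto-update implementation.
# NEEDS_REBOOT_CMD and REBOOT_CMD are hypothetical variables for dry runs.
request_reboot_if_needed() {
    rc=0
    ${NEEDS_REBOOT_CMD:-zypper needs-rebooting} >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 102 ]; then
        # Assumption: 102 means "reboot needed"; hand the reboot over to
        # rebootmgr, which schedules it according to its configured strategy.
        ${REBOOT_CMD:-rebootmgrctl reboot}
    else
        echo "no reboot requested (check exit code $rc)"
    fi
}

# Hypothetical dry run that only prints what would happen:
# REBOOT_CMD="echo would request reboot" request_reboot_if_needed
```

Leaving the scheduling to rebootmgr keeps the reboot inside the nightly maintenance window instead of rebooting right after the check.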

I did

for i in openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do ssh root@$i "systemctl enable --now openqa-auto-update.timer"; done

and now back to monitoring over the next days.

EDIT: Wait, where is aarch64? … Ah, it's missing! Apparently we had openQA-auto-update installed on all machines except aarch64. Fixed that with `zypper -n in openQA-auto-update && systemctl enable --now openqa-auto-update.timer`

#7 Updated by okurz about 1 month ago

  • Due date changed from 2022-07-07 to 2022-07-22

#8 Updated by okurz about 1 month ago

  • Due date deleted (2022-07-22)
  • Status changed from Feedback to Resolved

As we were missing tooling on power8, I did `zypper -n in lsof rebootmgr && systemctl enable --now rebootmgr`.

Then I checked all machines again with:

for i in aarch64 openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do echo "## $i" && ssh root@$i "zypper ps | tail -n 2; rebootmgrctl status; uptime"; done

and all looks fine.
