action #128561

closed

salt managed host being down does not trigger any alert (was: jenkins.qa.suse.de stuck in emergency mode but no alert) size:M

Added by okurz 11 months ago. Updated 9 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-05-03
Due date:
2023-07-04
% Done:

0%

Estimated time:

Description

Observation

In https://suse.slack.com/archives/C02CANHLANP/p1683041198918179 DimStar noted:

(Dominique Leuenberger) @Oliver Kurz Hi; seems jenkins is no longer scheduling QA runs for GNOME:Next - https://openqa.opensuse.org/group_overview/35 lists last run 3 days ago
(Fabian Vogt) Time to migrate to obs_rsync? Probably. The jenkins host is unreachable. Either down or overloaded.

There is a monitoring panel "host up" with a corresponding alert. Looking at
https://monitor.qa.suse.de/d/GDjenkins/dashboard-for-jenkins?orgId=1&viewPanel=65105&from=1682650896818&to=1683189154032&editPanel=65105
one can see that there was a long window with no response but no alert. Likely we went a bit too far in ignoring all "no data" conditions. The panel should not look at the response time but at the "result_code" of the ping, which always has a valid value and can be checked to verify that the host responds (see the sketch below).
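
For illustration, a minimal sketch of the relevant telegraf ping input (option and field names are from the standard telegraf ping plugin; the actual configuration is maintained in salt-states-openqa and may differ):

[[inputs.ping]]
  urls = ["jenkins.qa.suse.de"]
  count = 3
  # result_code is written on every run: 0 means success, non-zero means
  # failure, so an alert on result_code > 0 can fire even when the
  # response time fields have no data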

Acceptance criteria

  • AC1: There is an alert when the machine is not fully up
  • AC2: There is no alert for usual planned reboots

Suggestions

  • Change "host up" to look at "result_code" and pick a sensible alert linked to that
  • Check whether we already have a ticket for the same issue

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #130132: jenkins.qa.suse.de seems down (Resolved, livdywan, 2023-05-31)
Has duplicate openQA Infrastructure - action #131303: [alert] Packet loss between worker hosts and other hosts (tumblesle.qa.suse.de) (Resolved, dheidler, 2023-06-23)
Copied to openQA Infrastructure - action #130633: Better documentation on jenkins.qa.suse.de alerts and recovery (Resolved, okurz)

Actions #1

Updated by okurz 11 months ago

Some error messages in the journal about brltty; removing those packages as they are not needed there.

Actions #2

Updated by okurz 11 months ago

  • Assignee deleted (okurz)

From journalctl --since=2023-04-20:

-- Boot be3e03e3619b442e9724da585dd49357 --
Apr 30 03:33:58 jenkins kernel: Linux version 5.14.21-150400.24.60-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.39.0.20220810-150100>

so a boot was conducted Sunday morning. This is due to the automatic updating and planned reboot during a rebootmgr maintenance window.

Later in the logs:

Apr 30 03:34:38 jenkins systemd[1]: Finished Remount Root and Kernel File Systems.
Apr 30 03:34:38 jenkins systemd[1]: Condition check resulted in One time configuration for iscsid.service being skipped.
Apr 30 03:34:38 jenkins systemd[1]: Condition check resulted in First Boot Wizard being skipped.
Apr 30 03:34:38 jenkins systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
Apr 30 03:34:38 jenkins systemd[1]: Condition check resulted in Create System Users being skipped.
Apr 30 03:34:38 jenkins systemd[1]: Starting Create Static Device Nodes in /dev...
Apr 30 03:34:39 jenkins systemd[1]: Finished Create Static Device Nodes in /dev.
Apr 30 03:34:39 jenkins systemd[1]: Reached target Preparation for Local File Systems.
Apr 30 03:34:39 jenkins systemd[1]: Starting Rule-based Manager for Device Events and Files...
Apr 30 03:34:40 jenkins systemd-udevd[517]: Network interface NamePolicy= disabled by default.
Apr 30 03:35:37 jenkins systemd[1]: Started Rule-based Manager for Device Events and Files.
Apr 30 03:35:41 jenkins mtp-probe[536]: checking bus 1, device 2: "/sys/devices/pci0000:00/0000:00:05.7/usb1/1-1"
Apr 30 03:35:41 jenkins mtp-probe[536]: bus: 1, device: 2 was not an MTP device
Apr 30 03:35:58 jenkins systemd[1]: dev-disk-by\x2duuid-0b8a4aea\x2d1f96\x2d4b60\x2db305\x2d57a7317d0067.device: Job dev-disk-by\x2duuid-0b8a4aea\x2d1f96\x2d4b60\x2db305\x2d57a7317d0067.de>
Apr 30 03:35:58 jenkins systemd[1]: Timed out waiting for device /dev/disk/by-uuid/0b8a4aea-1f96-4b60-b305-57a7317d0067.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /boot/grub2/i386-pc.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Local File Systems.
Apr 30 03:35:58 jenkins systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: local-fs.target: Triggering OnFailure= dependencies.
Apr 30 03:35:58 jenkins systemd[1]: boot-grub2-i386\x2dpc.mount: Job boot-grub2-i386\x2dpc.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /srv.
Apr 30 03:35:58 jenkins systemd[1]: srv.mount: Job srv.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /boot/grub2/x86_64-efi.
Apr 30 03:35:58 jenkins systemd[1]: boot-grub2-x86_64\x2defi.mount: Job boot-grub2-x86_64\x2defi.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /tmp.
Apr 30 03:35:58 jenkins systemd[1]: tmp.mount: Job tmp.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /opt.
Apr 30 03:35:58 jenkins systemd[1]: opt.mount: Job opt.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /.snapshots.
Apr 30 03:35:58 jenkins systemd[1]: \x2esnapshots.mount: Job \x2esnapshots.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /usr/local.
Apr 30 03:35:58 jenkins systemd[1]: usr-local.mount: Job usr-local.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /home.
Apr 30 03:35:58 jenkins systemd[1]: home.mount: Job home.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /var.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Load/Save Random Seed.
Apr 30 03:35:58 jenkins systemd[1]: systemd-random-seed.service: Job systemd-random-seed.service/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Fail if at least one kernel crash has been recorded under /var/crash.
Apr 30 03:35:58 jenkins systemd[1]: check-for-kernel-crash.service: Job check-for-kernel-crash.service/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Record Runlevel Change in UTMP.
Apr 30 03:35:58 jenkins systemd[1]: systemd-update-utmp-runlevel.service: Job systemd-update-utmp-runlevel.service/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Lock Directory.
Apr 30 03:35:58 jenkins systemd[1]: var-lock.mount: Job var-lock.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Record System Boot/Shutdown in UTMP.
Apr 30 03:35:58 jenkins systemd[1]: systemd-update-utmp.service: Job systemd-update-utmp.service/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Flush Journal to Persistent Storage.
Apr 30 03:35:58 jenkins systemd[1]: systemd-journal-flush.service: Job systemd-journal-flush.service/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Runtime Directory.
Apr 30 03:35:58 jenkins systemd[1]: var-run.mount: Job var-run.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: var.mount: Job var.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: dev-disk-by\x2duuid-0b8a4aea\x2d1f96\x2d4b60\x2db305\x2d57a7317d0067.device: Job dev-disk-by\x2duuid-0b8a4aea\x2d1f96\x2d4b60\x2db305\x2d57a7317d0067.de>
Apr 30 03:35:58 jenkins systemd[1]: dev-disk-by\x2duuid-4b7ce8a7\x2df32b\x2d419e\x2d9b4a\x2d5eddce8691a8.device: Job dev-disk-by\x2duuid-4b7ce8a7\x2df32b\x2d419e\x2d9b4a\x2d5eddce8691a8.de>
Apr 30 03:35:58 jenkins systemd[1]: Timed out waiting for device /dev/disk/by-uuid/4b7ce8a7-f32b-419e-9b4a-5eddce8691a8.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /dev/disk/by-uuid/4b7ce8a7-f32b-419e-9b4a-5eddce8691a8.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Swaps.

/dev/disk/by-uuid/0b8a4aea-1f96-4b60-b305-57a7317d0067 is sda2, which is a QEMU hard disk, but it is the same disk that is initially booted from, so I don't know how it can run into a timeout. Maybe qamaster rebooted itself during that time and then all VMs were started again, which took too long because the I/O performance of qamaster is not good enough?

I triggered a jenkins build of GNOME:Next explicitly now to not wait any longer. The corresponding openQA job is https://openqa.opensuse.org/tests/3260055.

I triggered another reboot and the system turned up fine again. Maybe we can configure automatic reboot tries instead of emergency mode?

Actions #3

Updated by okurz 11 months ago

  • Subject changed from jenkins.qa.suse.de stuck in emergency mode but no alert to jenkins.qa.suse.de stuck in emergency mode but no alert size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by osukup 10 months ago

Actions #5

Updated by livdywan 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

okurz wrote:

I triggered another reboot and the system turned up fine again. Maybe we can configure automatic reboot tries instead of emergency mode?

Let's do it:

sudo systemctl edit emergency
[Service]
ExecStartPre=/bin/sh -c "read -t 10 || /usr/bin/systemctl reboot"
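
For context, systemctl edit stores such a snippet as a drop-in override. A sketch of the resulting file, assuming the usual drop-in path, with comments on what the line does:

# /etc/systemd/system/emergency.service.d/override.conf (assumed path)
[Service]
# wait up to 10 seconds for input on the console; if nobody presses a key
# within that time, reboot instead of staying in emergency mode
ExecStartPre=/bin/sh -c "read -t 10 || /usr/bin/systemctl reboot"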
Actions #6

Updated by openqa_review 10 months ago

  • Due date set to 2023-06-15

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by okurz 10 months ago

cdywan wrote:

okurz wrote:

I triggered another reboot and the system turned up fine again. Maybe we can configure automatic reboot tries instead of emergency mode?

Let's do it:

sudo systemctl edit emergency
[Service]
ExecStartPre=/bin/sh -c "read -t 10 || /usr/bin/systemctl reboot"

Why not just systemctl mask emergency.{service,target}?

Actions #8

Updated by livdywan 10 months ago

okurz wrote:

cdywan wrote:

sudo systemctl edit emergency
[Service]
ExecStartPre=/bin/sh -c "read -t 10 || /usr/bin/systemctl reboot"

Why not just systemctl mask emergency.{service,target}?

Good point. I didn't realize that was an option. At least the Arch Linux wiki seems to agree it's a good idea :-D

Did that now
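
For reference, the masking approach boils down to the following commands on the affected host (a sketch of what was presumably run; the verification step is added for illustration):

sudo systemctl mask emergency.service emergency.target
# both units should now report "masked"
systemctl is-enabled emergency.service emergency.target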

Actions #9

Updated by okurz 10 months ago

  • Description updated (diff)
Actions #10

Updated by livdywan 10 months ago

Discussed in the Unblock:

Idea: Connect via SSH and run systemctl is-system-running.
https://www.man7.org/linux/man-pages/man1/systemctl.1.html
maintenance -> the system is in emergency mode. This could likely be done by telegraf every 5 minutes, or even only every couple of hours.
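
A minimal sketch of how such a probe could look as a telegraf exec input on the monitoring host (host name, interval and measurement name are illustrative assumptions, not the actual configuration):

[[inputs.exec]]
  interval = "5m"
  commands = [
    "ssh -o BatchMode=yes -o ConnectTimeout=10 jenkins.qa.suse.de systemctl is-system-running || true"
  ]
  timeout = "30s"
  data_format = "value"
  data_type = "string"
  name_override = "system_state"
  # "running" means all good, "degraded" means failed units,
  # "maintenance" means the host dropped into emergency mode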

Actions #11

Updated by livdywan 10 months ago

  • Status changed from In Progress to Feedback

cdywan wrote:

Discussed in the Unblock:

  • First try to trigger an alert if the VM is down or not pingable

Waiting to find out how long it takes until we see an alert if ever

Actions #12

Updated by livdywan 10 months ago

cdywan wrote:

cdywan wrote:

Discussed in the Unblock:

  • First try to trigger an alert if the VM is down or not pingable

Waiting to find out how long it takes until we see an alert if ever

According to Copy this should take 90 minutes

Actions #13

Updated by livdywan 10 months ago

cdywan wrote:

cdywan wrote:

cdywan wrote:

Discussed in the Unblock:

  • First try to trigger an alert if the VM is down or not pingable

Waiting to find out how long it takes until we see an alert if ever

According to Copy this should take 90 minutes

And indeed the "Packet loss between worker hosts and other hosts" alert caught it. Alas nobody thought to recover the machine, and our suspicion that nobody would notice was also confirmed 🤐

Actions #14

Updated by livdywan 10 months ago

  • Copied to action #130633: Better documentation on jenkins.qa.suse.de alerts and recovery added
Actions #15

Updated by okurz 10 months ago

cdywan wrote:

And indeed the "Packet loss between worker hosts and other hosts" alert caught it. Alas nobody thought to recover the machine, and our suspicion that nobody would notice was also confirmed 🤐

That's a very far stretch: you are assigned to this ticket, we decided together that you will shut down that machine to trigger an alert and you were on alert duty, many others weren't even working Thu+Fri. So I don't see the problem.

Actions #16

Updated by livdywan 10 months ago

okurz wrote:

cdywan wrote:

And indeed the "Packet loss between worker hosts and other hosts" alert caught it. Alas nobody thought to recover the machine, and our suspicion that nobody would notice was also confirmed 🤐

That's a very far stretch: you are assigned to this ticket, we decided together that you will shut down that machine to trigger an alert and you were on alert duty, many others weren't even working Thu+Fri. So I don't see the problem.

I would assume alerts are handled proactively by everyone in the team. And without anyone looking into it, there was no indication that the alert was covered by an existing ticket. This is why it's so important that we reply to alert emails and note down in tickets what measures were taken.

Actions #17

Updated by livdywan 10 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (livdywan)

Well, I guess I'm unassigning. I don't know how to recover this machine and nobody in the weekly was able to provide any suggestions.

Actions #18

Updated by okurz 10 months ago

  • Subject changed from jenkins.qa.suse.de stuck in emergency mode but no alert size:M to salt managed host being down does not trigger any alert (was: jenkins.qa.suse.de stuck in emergency mode but no alert) size:M
  • Due date deleted (2023-06-15)
  • Priority changed from Normal to High

I started the VM jenkins.qa.suse.de again with ssh qamaster.qa.suse.de "sudo virsh start jenkins". I took #130633 and provided documentation on how to manage that VM and others on qamaster.qa.suse.de.
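
For future reference, a short sketch of the recovery steps described above (host and VM names are taken from this ticket; the list command is only there to check the state first):

ssh qamaster.qa.suse.de "sudo virsh list --all"      # check whether the jenkins VM is shut off
ssh qamaster.qa.suse.de "sudo virsh start jenkins"   # start the VM again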

cdywan wrote:

okurz wrote:

cdywan wrote:

And indeed the "Packet loss between worker hosts and other hosts" alert caught it. Alas nobody thought to recover the machine, and our suspicion that nobody would notice was also confirmed 🤐

That's a very far stretch: you are assigned to this ticket, we decided together that you will shut down that machine to trigger an alert and you were on alert duty, many others weren't even working Thu+Fri. So I don't see the problem.

I would assume alerts are handled proactively by everyone in the team. And without anyone looking into it, there was no indication that the alert was covered by an existing ticket. This is why it's so important that we reply to alert emails and note down in tickets what measures were taken.

Yes, I agree. Everyone should handle alerts proactively, and if not, then the person on alert duty catches it, as you correctly did. I checked the timestamps and found that the alert was triggered at 2023-06-07 19:07:00 +0200 CEST, so after the usual EOB on Wednesday for many. On Thursday multiple members of the team were not working due to a public holiday, hence also taking Friday off. However, others were working on Thursday, so we should address them directly (blameless of course) and ensure that they know about their responsibilities. Likely #128789 added to the problem due to repeated daily alerts, which is also why I bumped the priority of #128789 to "Immediate".

Back to the topic of this ticket: There was apparently no "host up" alert even though jenkins.qa.suse.de was down for more than 24h. This means we can reject the hypothesis from nsinger that machines would still be reported as "up" when in emergency mode, because apparently there is not even an alert if a machine is powered off. This is the next task to work on: Ensure that any host being down (and I mean really "down", not in emergency mode) triggers a corresponding alert.
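
As a starting point, a minimal sketch of what an alert query on the ping result_code could look like (InfluxQL; measurement and tag names assume the telegraf ping plugin writing to InfluxDB, and the time ranges are only illustrative, not the actual implementation):

SELECT max("result_code") FROM "ping"
  WHERE "url" = 'jenkins.qa.suse.de' AND time > now() - 10m
  GROUP BY time(1m) fill(1)

An alert would then fire when every point in the evaluation window is above 0; filling missing points with 1 ensures that a completely unreachable or powered-off host also counts as down instead of as "no data".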

Actions #19

Updated by dheidler 10 months ago

  • Assignee set to dheidler
Actions #20

Updated by dheidler 9 months ago

  • Status changed from Workable to Feedback
Actions #21

Updated by okurz 9 months ago

  • Due date set to 2023-07-04

In the unblock call we worked together with dheidler on https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/889, which I have now merged. The next step is to test the effect on the actual alert. Decided with dheidler to use tumblesle for the tests.

Actions #22

Updated by okurz 9 months ago

  • Status changed from Feedback to In Progress
Actions #25

Updated by dheidler 9 months ago

  • Status changed from In Progress to Feedback
Actions #26

Updated by okurz 9 months ago

  • Has duplicate action #131303: [alert] Packet loss between worker hosts and other hosts (tumblesle.qa.suse.de) added
Actions #28

Updated by dheidler 9 months ago

  • Status changed from Feedback to Resolved