action #128561

closed

salt managed host being down does not trigger any alert (was: jenkins.qa.suse.de stuck in emergency mode but no alert) size:M

Added by okurz 11 months ago. Updated 9 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-05-03
Due date:
2023-07-04
% Done:

0%

Estimated time:

Description

Observation

In https://suse.slack.com/archives/C02CANHLANP/p1683041198918179 DimStar noted:

(Dominique Leuenberger) @Oliver Kurz Hi; seems jenkins is no longer scheduling QA runs for GNOME:Next - https://openqa.opensuse.org/group_overview/35 lists last run 3 days ago
(Fabian Vogt) Time to migrate to obs_rsync? Probably. The jenkins host is unreachable. Either down or overloaded.

There is a monitoring panel "host up" with a corresponding alert. Looking at
https://monitor.qa.suse.de/d/GDjenkins/dashboard-for-jenkins?orgId=1&viewPanel=65105&from=1682650896818&to=1683189154032&editPanel=65105
one can see that there was a long window with no response but no alert. Likely we went a bit too far in ignoring all "no data" conditions. The panel should not look at the response time but at the "result_code" of the ping, which always has a valid value and can be checked to verify that the host responds (see the sketch below).
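
For illustration, a minimal sketch of the relevant telegraf ping input (option and field names are from the standard telegraf ping plugin; the actual configuration is maintained in salt-states-openqa and may differ):

[[inputs.ping]]
  urls = ["jenkins.qa.suse.de"]
  count = 3
  # result_code is written on every run: 0 means success, non-zero means
  # failure, so an alert on result_code > 0 can fire even when the
  # response time fields have no data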

Acceptance criteria

  • AC1: There is an alert when the machine is not fully up
  • AC2: There is no alert for usual planned reboots

Suggestions

  • Change "host up" to look at "result_code" and pick a sensible alert linked to that
  • Check whether we already have a ticket for the same issue

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #130132: jenkins.qa.suse.de seems down (Resolved, livdywan, 2023-05-31)
Has duplicate openQA Infrastructure - action #131303: [alert] Packet loss between worker hosts and other hosts (tumblesle.qa.suse.de) (Resolved, dheidler, 2023-06-23)
Copied to openQA Infrastructure - action #130633: Better documentation on jenkins.qa.suse.de alerts and recovery (Resolved, okurz)

Actions #1

Updated by okurz 11 months ago

Some error messages in the journal about brltty; removing those packages as they are not needed there.

Actions #2

Updated by okurz 11 months ago

  • Assignee deleted (okurz)

From journalctl --since=2023-04-20:

-- Boot be3e03e3619b442e9724da585dd49357 --
Apr 30 03:33:58 jenkins kernel: Linux version 5.14.21-150400.24.60-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.39.0.20220810-150100>

so a boot was conducted Sunday morning. This is due to the automatic updating and planned reboot during a rebootmgr maintenance window.

Later in the logs:

Apr 30 03:34:38 jenkins systemd[1]: Finished Remount Root and Kernel File Systems.
Apr 30 03:34:38 jenkins systemd[1]: Condition check resulted in One time configuration for iscsid.service being skipped.
Apr 30 03:34:38 jenkins systemd[1]: Condition check resulted in First Boot Wizard being skipped.
Apr 30 03:34:38 jenkins systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
Apr 30 03:34:38 jenkins systemd[1]: Condition check resulted in Create System Users being skipped.
Apr 30 03:34:38 jenkins systemd[1]: Starting Create Static Device Nodes in /dev...
Apr 30 03:34:39 jenkins systemd[1]: Finished Create Static Device Nodes in /dev.
Apr 30 03:34:39 jenkins systemd[1]: Reached target Preparation for Local File Systems.
Apr 30 03:34:39 jenkins systemd[1]: Starting Rule-based Manager for Device Events and Files...
Apr 30 03:34:40 jenkins systemd-udevd[517]: Network interface NamePolicy= disabled by default.
Apr 30 03:35:37 jenkins systemd[1]: Started Rule-based Manager for Device Events and Files.
Apr 30 03:35:41 jenkins mtp-probe[536]: checking bus 1, device 2: "/sys/devices/pci0000:00/0000:00:05.7/usb1/1-1"
Apr 30 03:35:41 jenkins mtp-probe[536]: bus: 1, device: 2 was not an MTP device
Apr 30 03:35:58 jenkins systemd[1]: dev-disk-by\x2duuid-0b8a4aea\x2d1f96\x2d4b60\x2db305\x2d57a7317d0067.device: Job dev-disk-by\x2duuid-0b8a4aea\x2d1f96\x2d4b60\x2db305\x2d57a7317d0067.de>
Apr 30 03:35:58 jenkins systemd[1]: Timed out waiting for device /dev/disk/by-uuid/0b8a4aea-1f96-4b60-b305-57a7317d0067.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /boot/grub2/i386-pc.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Local File Systems.
Apr 30 03:35:58 jenkins systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: local-fs.target: Triggering OnFailure= dependencies.
Apr 30 03:35:58 jenkins systemd[1]: boot-grub2-i386\x2dpc.mount: Job boot-grub2-i386\x2dpc.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /srv.
Apr 30 03:35:58 jenkins systemd[1]: srv.mount: Job srv.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /boot/grub2/x86_64-efi.
Apr 30 03:35:58 jenkins systemd[1]: boot-grub2-x86_64\x2defi.mount: Job boot-grub2-x86_64\x2defi.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /tmp.
Apr 30 03:35:58 jenkins systemd[1]: tmp.mount: Job tmp.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /opt.
Apr 30 03:35:58 jenkins systemd[1]: opt.mount: Job opt.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /.snapshots.
Apr 30 03:35:58 jenkins systemd[1]: \x2esnapshots.mount: Job \x2esnapshots.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /usr/local.
Apr 30 03:35:58 jenkins systemd[1]: usr-local.mount: Job usr-local.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /home.
Apr 30 03:35:58 jenkins systemd[1]: home.mount: Job home.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /var.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Load/Save Random Seed.
Apr 30 03:35:58 jenkins systemd[1]: systemd-random-seed.service: Job systemd-random-seed.service/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Fail if at least one kernel crash has been recorded under /var/crash.
Apr 30 03:35:58 jenkins systemd[1]: check-for-kernel-crash.service: Job check-for-kernel-crash.service/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Record Runlevel Change in UTMP.
Apr 30 03:35:58 jenkins systemd[1]: systemd-update-utmp-runlevel.service: Job systemd-update-utmp-runlevel.service/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Lock Directory.
Apr 30 03:35:58 jenkins systemd[1]: var-lock.mount: Job var-lock.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Record System Boot/Shutdown in UTMP.
Apr 30 03:35:58 jenkins systemd[1]: systemd-update-utmp.service: Job systemd-update-utmp.service/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Flush Journal to Persistent Storage.
Apr 30 03:35:58 jenkins systemd[1]: systemd-journal-flush.service: Job systemd-journal-flush.service/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Runtime Directory.
Apr 30 03:35:58 jenkins systemd[1]: var-run.mount: Job var-run.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: var.mount: Job var.mount/start failed with result 'dependency'.
Apr 30 03:35:58 jenkins systemd[1]: dev-disk-by\x2duuid-0b8a4aea\x2d1f96\x2d4b60\x2db305\x2d57a7317d0067.device: Job dev-disk-by\x2duuid-0b8a4aea\x2d1f96\x2d4b60\x2db305\x2d57a7317d0067.de>
Apr 30 03:35:58 jenkins systemd[1]: dev-disk-by\x2duuid-4b7ce8a7\x2df32b\x2d419e\x2d9b4a\x2d5eddce8691a8.device: Job dev-disk-by\x2duuid-4b7ce8a7\x2df32b\x2d419e\x2d9b4a\x2d5eddce8691a8.de>
Apr 30 03:35:58 jenkins systemd[1]: Timed out waiting for device /dev/disk/by-uuid/4b7ce8a7-f32b-419e-9b4a-5eddce8691a8.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for /dev/disk/by-uuid/4b7ce8a7-f32b-419e-9b4a-5eddce8691a8.
Apr 30 03:35:58 jenkins systemd[1]: Dependency failed for Swaps.

/dev/disk/by-uuid/0b8a4aea-1f96-4b60-b305-57a7317d0067 is sda2, which is a QEMU hard disk, but it is the same disk that is initially booted from, so I don't know how it can run into a timeout. Maybe qamaster rebooted itself during that time and then all VMs were started again, which took too long because the I/O performance of qamaster is not good enough?

I triggered a jenkins build of GNOME:Next explicitly now to not wait any longer. The corresponding openQA job is https://openqa.opensuse.org/tests/3260055.

I triggered another reboot and the system turned up fine again. Maybe we can configure automatic reboot tries instead of emergency mode?

Actions #3

Updated by okurz 11 months ago

  • Subject changed from jenkins.qa.suse.de stuck in emergency mode but no alert to jenkins.qa.suse.de stuck in emergency mode but no alert size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by osukup 10 months ago

Actions #5

Updated by livdywan 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

okurz wrote:

I triggered another reboot and the system turned up fine again. Maybe we can configure automatic reboot tries instead of emergency mode?

Let's do it:

sudo systemctl edit emergency
[Service]
ExecStartPre=/bin/sh -c "read -t 10 || /usr/bin/systemctl reboot"
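
For context, systemctl edit stores such a snippet as a drop-in override. A sketch of the resulting file, assuming the usual drop-in path, with comments on what the line does:

# /etc/systemd/system/emergency.service.d/override.conf (assumed path)
[Service]
# wait up to 10 seconds for input on the console; if nobody presses a key
# within that time, reboot instead of staying in emergency mode
ExecStartPre=/bin/sh -c "read -t 10 || /usr/bin/systemctl reboot"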
Actions #6

Updated by openqa_review 10 months ago

  • Due date set to 2023-06-15

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by okurz 10 months ago

cdywan wrote:

okurz wrote:

I triggered another reboot and the system turned up fine again. Maybe we can configure automatic reboot tries instead of emergency mode?

Let's do it:

sudo systemctl edit emergency
[Service]
ExecStartPre=/bin/sh -c "read -t 10 || /usr/bin/systemctl reboot"

Why not just systemctl mask emergency.{service,target}?

Actions #8

Updated by livdywan 10 months ago

okurz wrote:

cdywan wrote:

sudo systemctl edit emergency
[Service]
ExecStartPre=/bin/sh -c "read -t 10 || /usr/bin/systemctl reboot"

Why not just systemctl mask emergency.{service,target}?

Good point. I didn't realize that was an option. At least the Arch Linux wiki seems to agree it's a good idea :-D

Did that now
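
For reference, the masking approach boils down to the following commands on the affected host (a sketch of what was presumably run; the verification step is added for illustration):

sudo systemctl mask emergency.service emergency.target
# both units should now report "masked"
systemctl is-enabled emergency.service emergency.target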

Actions #9

Updated by okurz 10 months ago

  • Description updated (diff)
Actions #10

Updated by livdywan 10 months ago

Discussed in the Unblock:

Idea: Connect via SSH and run systemctl is-system-running.
https://www.man7.org/linux/man-pages/man1/systemctl.1.html
maintenance -> the system is in emergency mode. This could likely be done by telegraf every 5 minutes, or even only every couple of hours.
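
A minimal sketch of how such a probe could look as a telegraf exec input on the monitoring host (host name, interval and measurement name are illustrative assumptions, not the actual configuration):

[[inputs.exec]]
  interval = "5m"
  commands = [
    "ssh -o BatchMode=yes -o ConnectTimeout=10 jenkins.qa.suse.de systemctl is-system-running || true"
  ]
  timeout = "30s"
  data_format = "value"
  data_type = "string"
  name_override = "system_state"
  # "running" means all good, "degraded" means failed units,
  # "maintenance" means the host dropped into emergency mode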

Actions #11

Updated by livdywan 10 months ago

  • Status changed from In Progress to Feedback

cdywan wrote:

Discussed in the Unblock:

  • First try to trigger an alert if the VM is down or not pingable

Waiting to find out how long it takes until we see an alert if ever

Actions #12

Updated by livdywan 10 months ago

cdywan wrote:

cdywan wrote:

Discussed in the Unblock:

  • First try to trigger an alert if the VM is down or not pingable

Waiting to find out how long it takes until we see an alert if ever

According to Copy this should take 90 minutes

Actions #13

Updated by livdywan 10 months ago

cdywan wrote:

cdywan wrote:

cdywan wrote:

Discussed in the Unblock:

  • First try to trigger an alert if the VM is down or not pingable

Waiting to find out how long it takes until we see an alert if ever

According to Copy this should take 90 minutes

And indeed the "Packet loss between worker hosts and other hosts" alert caught it. Alas nobody thought to recover the machine, and our suspicion that nobody would notice was also confirmed 🤐

Actions #14

Updated by livdywan 10 months ago

  • Copied to action #130633: Better documentation on jenkins.qa.suse.de alerts and recovery added
Actions #15

Updated by okurz 10 months ago

cdywan wrote:

And indeed the "Packet loss between worker hosts and other hosts" alert caught it. Alas nobody thought to recover the machine, and our suspicion that nobody would notice was also confirmed 🤐

That's a very far stretch: you are assigned to this ticket, we decided together that you will shut down that machine to trigger an alert and you were on alert duty, many others weren't even working Thu+Fri. So I don't see the problem.

Actions #16

Updated by livdywan 10 months ago

okurz wrote:

cdywan wrote:

And indeed the "Packet loss between worker hosts and other hosts" alert caught it. Alas nobody thought to recover the machine, and our suspicion that nobody would notice was also confirmed 🤐

That's a very far stretch: you are assigned to this ticket, we decided together that you will shut down that machine to trigger an alert and you were on alert duty, many others weren't even working Thu+Fri. So I don't see the problem.

I would assume alerts are handled proactively by everyone in the team. And without anyone looking into it, there was no indication that the alert was covered by an existing ticket. This is why it's so important that we reply to alert emails and note down in tickets what measures were taken.

Actions #17

Updated by livdywan 10 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (livdywan)

Well, I guess I'm unassigning. I don't know how to recover this machine and nobody in the weekly was able to provide any suggestions.

Actions #18

Updated by okurz 10 months ago

  • Subject changed from jenkins.qa.suse.de stuck in emergency mode but no alert size:M to salt managed host being down does not trigger any alert (was: jenkins.qa.suse.de stuck in emergency mode but no alert) size:M
  • Due date deleted (2023-06-15)
  • Priority changed from Normal to High

I started the VM jenkins.qa.suse.de again with ssh qamaster.qa.suse.de "sudo virsh start jenkins". I took #130633 and provided documentation on how to manage that VM and others on qamaster.qa.suse.de.
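
For future reference, a short sketch of the recovery steps described above (host and VM names are taken from this ticket; the list command is only there to check the state first):

ssh qamaster.qa.suse.de "sudo virsh list --all"      # check whether the jenkins VM is shut off
ssh qamaster.qa.suse.de "sudo virsh start jenkins"   # start the VM again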

cdywan wrote:

okurz wrote:

cdywan wrote:

And indeed the "Packet loss between worker hosts and other hosts" alert caught it. Alas nobody thought to recover the machine, and our suspicion that nobody would notice was also confirmed 🤐

That's a very far stretch: you are assigned to this ticket, we decided together that you will shut down that machine to trigger an alert and you were on alert duty, many others weren't even working Thu+Fri. So I don't see the problem.

I would assume alerts are handled proactively by everyone in the team. And without anyone looking into it, there was no indication that the alert was covered by an existing ticket. This is why it's so important that we reply to alert emails and note down in tickets what measures were taken.

Yes, I agree. Everyone should handle alerts proactively, and if not, then the person on alert duty catches it, as you correctly did. I checked the timestamps and found that the alert was triggered at 2023-06-07 19:07:00 +0200 CEST, so after the usual EOB on Wednesday for many. On Thursday multiple members of the team were not working due to a public holiday, hence also taking Friday off. However, others were working on Thursday, so we should address them directly (blameless of course) and ensure that they know about their responsibilities. Likely #128789 added to the problem due to repeated daily alerts, which is also why I bumped the priority of #128789 to "Immediate".

Back to the topic of this ticket: There was apparently no "host up" alert even though jenkins.qa.suse.de was down for more than 24h. This means we can reject the hypothesis from nsinger that machines would still be reported as "up" when in emergency mode, because apparently there is not even an alert if a machine is powered off. This is the next task to work on: Ensure that any host being down (and I mean really "down", not in emergency mode) triggers a corresponding alert.
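
As a starting point, a minimal sketch of what an alert query on the ping result_code could look like (InfluxQL; measurement and tag names assume the telegraf ping plugin writing to InfluxDB, and the time ranges are only illustrative, not the actual implementation):

SELECT max("result_code") FROM "ping"
  WHERE "url" = 'jenkins.qa.suse.de' AND time > now() - 10m
  GROUP BY time(1m) fill(1)

An alert would then fire when every point in the evaluation window is above 0; filling missing points with 1 ensures that a completely unreachable or powered-off host also counts as down instead of as "no data".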

Actions #19

Updated by dheidler 10 months ago

  • Assignee set to dheidler
Actions #20

Updated by dheidler 9 months ago

  • Status changed from Workable to Feedback
Actions #21

Updated by okurz 9 months ago

  • Due date set to 2023-07-04

In the unblock call we worked together with dheidler on https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/889, which I have now merged. The next step is to test the effect on the actual alert. Decided with dheidler to use tumblesle for the tests.

Actions #22

Updated by okurz 9 months ago

  • Status changed from Feedback to In Progress
Actions #25

Updated by dheidler 9 months ago

  • Status changed from In Progress to Feedback
Actions #26

Updated by okurz 9 months ago

  • Has duplicate action #131303: [alert] Packet loss between worker hosts and other hosts (tumblesle.qa.suse.de) added
Actions #28

Updated by dheidler 9 months ago

  • Status changed from Feedback to Resolved