Project

General

Profile

action #102575

Prevent false-positive ticket reporting for openqaworker-arm-3

Added by okurz 2 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
Start date:
2021-11-17
Due date:
2021-12-02
% Done:

0%

Estimated time:

Description

Observation

Apparently we create EngInfra tickets even though openqaworker-arm-3 is up.

Acceptance criteria

  • AC1: Tickets are only created if the ARM hosts are actually down and not recoverable by IPMI ourselves

History

#1 Updated by mkittler 2 months ago

  • Assignee set to mkittler

#2 Updated by mkittler 2 months ago

  • Status changed from New to In Progress

The alert actually is firing. We're just not noticing it anymore because the web hook is executed instead of sending a mail. (That makes also a lot of sense.)

However, it took indeed a over half an hour until the worker came back online (responds to pings again). I've been looking into the alert's state history and all recent occurrences needed around half an hour to recover. For comparison I've checked the state history of the alert for arm-1 and arm-2 and they recover within 10 minutes. 10 minutes is also the timeout of ipmi-recover-worker. That explains why only arm-3 is affected. It also means that ipmi-recover-worker is not generally broken; it is actually doing what it is supposed to do. We might want to increase the timeout, though.

Before increasing the timeout it would be good to clear up the following questions:

  1. What is taking arm-3 so long to recover? Maybe the boot is stuck (until some service timeout is exceeded)? I've been upgrading it to Leap 15.3 and didn't notice anything immediately.
  2. Is the worker really not reachable via IPMI anymore?
  3. Is Infra actually doing something to recover the machine? (Unfortunately I don't have access to the ticket.)

#3 Updated by mkittler 2 months ago

TLDR: The boot takes under 10 minutes (in-line with the other ARM workers) but it took 20 minutes after sending the power cycle until the OS boot actually starts. The worker was responding to IPMI commands.


journalct --boot on arm-3:

Nov 17 09:36:59 openqaworker-arm-3 kernel: Booting Linux on physical CPU 0x0000000000 [0x431f0a11]
…
Nov 17 09:40:20 openqaworker-arm-3 systemd[1]: Started User Login Management.
…
Nov 17 09:40:24 openqaworker-arm-3 worker[2994]: [info] [pid:2994] Registering with openQA openqa.suse.de
Nov 17 09:40:24 openqaworker-arm-3 worker[2994]: [warn] [pid:2994] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Nov 17 09:41:06 openqaworker-arm-3 salt-minion[3139]: [ERROR   ] Error while bringing up minion for multi-master. Is master at salt responding?
Nov 17 09:43:34 openqaworker-arm-3 os-autoinst-openvswitch[2904]: Waiting for IP on bridge 'br1', 109s left ...
Nov 17 09:43:35 openqaworker-arm-3 os-autoinst-openvswitch[2904]: Waiting for IP on bridge 'br1', 108s left ...
Nov 17 09:43:36 openqaworker-arm-3 os-autoinst-openvswitch[2904]: Waiting for IP on bridge 'br1', 107s left ...
Nov 17 09:43:37 openqaworker-arm-3 salt-minion[3139]: [ERROR   ] Error while bringing up minion for multi-master. Is master at salt responding?
Nov 17 09:43:37 openqaworker-arm-3 os-autoinst-openvswitch[2904]: Waiting for IP on bridge 'br1', 106s left ...
Nov 17 09:43:38 openqaworker-arm-3 os-autoinst-openvswitch[2904]: Waiting for IP on bridge 'br1', 105s left ...
Nov 17 09:43:39 openqaworker-arm-3 os-autoinst-openvswitch[2904]: Waiting for IP on bridge 'br1', 104s left ...
Nov 17 09:43:41 openqaworker-arm-3 kernel: nicvf 0002:01:00.1 eth1: Link is Up 10000 Mbps Full duplex
Nov 17 09:43:42 openqaworker-arm-3 os-autoinst-openvswitch[2904]: Waiting for IP on bridge 'br1', 103s left ...
Nov 17 09:43:43 openqaworker-arm-3 os-autoinst-openvswitch[2904]: Waiting for IP on bridge 'br1', 102s left ...
…
Nov 17 09:43:52 openqaworker-arm-3 os-autoinst-openvswitch[2904]: Waiting for IP on bridge 'br1', 93s left ...
Nov 17 09:43:52 openqaworker-arm-3 ovs-vsctl[4275]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap0
Nov 17 09:43:53 openqaworker-arm-3 wickedd-dhcp4[1887]: eth1: Request to acquire DHCPv4 lease with UUID bdc09461-ec5b-0c00-5b09-0000ba000000
Nov 17 09:43:53 openqaworker-arm-3 kernel: NET: Registered protocol family 17
Nov 17 09:43:53 openqaworker-arm-3 nscd[1756]: 1756 ignored inotify event for `/etc/resolv.conf` (file exists)
Nov 17 09:43:53 openqaworker-arm-3 nscd[1756]: 1756 ignored inotify event for `/etc/resolv.conf` (file exists)
Nov 17 09:43:53 openqaworker-arm-3 os-autoinst-openvswitch[2904]: Waiting for IP on bridge 'br1', 92s left ...
Nov 17 09:44:04 openqaworker-arm-3 worker[2992]: [info] [pid:2992] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1283
…

So it "only" takes 4 minutes to reach the user login but the network takes then another 4 minutes. That is quite long but in total the worker would still be under 10 minutes (in-line with the other ARM workers).

However, the alert had already been firing since 9:13 (which caused ipmi-recover-worker to run almost immediately) so it took over 20 minutes until reboot was invoked which made it far enough to leave traces in the journal.

By the way, that's the journal before the reboot:

Nov 17 09:06:31 openqaworker-arm-3 worker[45279]: [debug] [pid:45279] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7693886/status
Nov 17 09:06:31 openqaworker-arm-3 worker[45279]: [debug] [pid:45279] Upload concluded (at online_migration_setup)
Nov 17 09:06:37 openqaworker-arm-3 worker[2973]: [debug] [pid:2973] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7693274/status
Nov 17 09:06:37 openqaworker-arm-3 worker[2973]: [debug] [pid:2973] Upload concluded (at await_install)
-- Reboot --
Nov 17 09:10:52 openqaworker-arm-3 kernel: Booting Linux on physical CPU 0x0000000000 [0x431f0a11]

So it isn't really interesting. All it tells us that there weren't any logged reboot attempts before the current one.

In conclusion, the OS boot (plus getting to the point where network is available) is slow but not that slow. I'm wondering what in that 20 minutes before the boot happened.


Note that the machine was responding to IPMI commands, that's the relevant ipmi-recover-worker log:

Attempting to reboot openqaworker-arm-3
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : previous
Last Power Event     : 
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
Sleep Button Disable : not allowed
Diag Button Disable  : allowed
Reset Button Disable : allowed
Power Button Disable : allowed
Sleep Button Disabled: false
Diag Button Disabled : false
Reset Button Disabled: false
Power Button Disabled: false
Set Boot Device to disk
IPMI reports power status on ==> performing power cycle
Chassis Power Control: Cycle
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
…
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
openqaworker-arm-3 didn't come up after IPMI power cycle ==> creating infra ticket

It shows the chassis status from IPMI and looks like sending the power cycle worked.

By the way, our timeout is really for the ping command itself. So even if GitLab was very slow to start the script that should not matter at all.

#4 Updated by openqa_review 2 months ago

  • Due date set to 2021-12-02

Setting due date based on mean cycle time of SUSE QE Tools

#5 Updated by mkittler 2 months ago

Infra is only unplugging the server every couple of weeks. So maybe our script just needs to be more patient. I just wanted to increase the timeout (for arm-3 only) but currently Grafana is broken.

#6 Updated by mkittler 2 months ago

  • Status changed from In Progress to Feedback

I've got Grafana back working and appended &variables[timeout]=2100 to the web hook URL for the arm-3 recovery. So the ticket is now only created after 35 minutes pinging without response.

By the way, Grafana was broken after updating to 7.5.11. I installed the version from https://build.opensuse.org/request/show/932258 which fixes the problem.

#7 Updated by mkittler 2 months ago

So far the worker hasn't had to be recovered yet. I'll check the state history and ticket mails next week.

#8 Updated by mkittler 2 months ago

The recovery had to run 3 times since I've checked last time. This time it looked better but one of the time the 10 minutes would have been exceeded again. So I'll keep the workaround for now. I wouldn't know what to improve further.

#9 Updated by mkittler 2 months ago

  • Status changed from Feedback to Resolved

I think the urgent part is resolved here. Fixing the worker itself is a different story of course.

Also available in: Atom PDF