action #163778

closed

[alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S

Added by tinita 16 days ago. Updated 3 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-07-11
Due date: 2024-08-06
% Done: 0%
Estimated time:
Description

Observation

We had several alerts regarding s390zl12 today, firing and resolving shortly after each other:
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/5ddd66bd99b31f7597fd68af2cc96304f8d9e480/view?orgId=1
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/openqa_ping_time_alert_s390zl12/view?orgId=1

Date: Thu, 11 Jul 2024 14:24:34 +0200
From: Grafana osd-admins@suse.de
To: osd-admins@suse.de
Subject: [FIRING:2] s390zl12

2 firing alert instances

Suggestions

Rollback steps

Actions #1

Updated by tinita 16 days ago

  • Subject changed from [alert] to [alert] host_up & Average Ping time (ms) alert for s390zl12
Actions #2

Updated by tinita 16 days ago

  • Description updated (diff)

I created a 2-day silence for each of them.

Actions #3

Updated by okurz 16 days ago

  • Tags set to infra, alert, s390, reactive work
  • Priority changed from Normal to Urgent

@tinita thank you. 2 days is rather short, which means we need to treat this ticket as urgent. I would be OK with setting something like a 2-month silence and mentioning the removal of the silence as a rollback step.

Actions #4

Updated by tinita 16 days ago

ok, will do

Actions #5

Updated by tinita 16 days ago

  • Description updated (diff)
  • Priority changed from Urgent to High

Extended the silences to 2 months
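For reference, a silence like this can also be described as data. The following is only a minimal sketch of an Alertmanager-v2-style silence payload, not what was actually submitted here: the label name `rule_uid`, the 60-day approximation of "2 months", the start time and the comment text are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def build_silence(label, value, start, days, author, comment):
    """Build a silence payload in the Alertmanager v2 JSON shape.

    `label` and `value` are placeholders: the real label set of the
    alerts in this ticket is not shown, so they are assumptions.
    """
    end = start + timedelta(days=days)
    return {
        "matchers": [
            {"name": label, "value": value, "isRegex": False, "isEqual": True}
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


# Hypothetical start time on the day the alerts fired.
start = datetime(2024, 7, 11, 12, 0, tzinfo=timezone.utc)
silence = build_silence(
    "rule_uid",  # assumed label name
    "openqa_ping_time_alert_s390zl12",
    start,
    60,  # "2 months" approximated as 60 days
    "tinita",
    "see action #163778; removing this silence is the rollback step",
)
print(silence["endsAt"])
```

Keeping the ticket reference in the comment field makes it easier to find the silence again when following the rollback step.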

Actions #6

Updated by okurz 15 days ago

  • Subject changed from [alert] host_up & Average Ping time (ms) alert for s390zl12 to [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by nicksinger 5 days ago

  • Description updated (diff)
Actions #8

Updated by nicksinger 5 days ago

  • Description updated (diff)
Actions #9

Updated by nicksinger 5 days ago

  • Subject changed from [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S to [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S - auto_review:"Error connecting to VNC server <.*:5901>: IO::Socket::INET: connect: Connection refused":retry
Actions #10

Updated by nicksinger 5 days ago

  • Subject changed from [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S - auto_review:"Error connecting to VNC server <.*:5901>: IO::Socket::INET: connect: Connection refused":retry to [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S

Sorry, the auto review makes no sense here and is already covered in https://progress.opensuse.org/issues/76813, which this ticket might then duplicate, but I am not sure. I asked @livdywan and @okurz in Slack how we want to handle this ticket.

Actions #11

Updated by nicksinger 5 days ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #12

Updated by nicksinger 5 days ago

Going forward, I will focus on getting rid of the false alerts rather than further investigating the failing tests, because currently everything suggests that these two issues are not linked and just happen on the same host simultaneously.

Actions #13

Updated by openqa_review 5 days ago

  • Due date set to 2024-08-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by nicksinger 3 days ago

As discussed in the daily, the next step I will take is to compare the response times of these machines with those of other machines (same network, different network, architecture, etc.) using Grafana. Based on these results I can better judge which next steps would be helpful (e.g. an SD ticket about network performance, debugging s390 specifically, bumping alert thresholds, etc.).
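The comparison could look roughly like the sketch below once ping times have been exported from Grafana. Everything in it is an assumption for illustration: the reference host names `worker-a` and `worker-b`, the factor of 3, and above all the sample values, which are invented placeholders and not measurements from these machines.

```python
from statistics import mean, median


def flag_slow_hosts(samples, reference_hosts, factor):
    """Return hosts whose mean ping time exceeds factor * baseline.

    The baseline is the median over all samples of the reference hosts.
    """
    reference = [t for host in reference_hosts for t in samples[host]]
    threshold = factor * median(reference)
    return sorted(
        host
        for host, times in samples.items()
        if host not in reference_hosts and mean(times) > threshold
    )


# Invented placeholder ping times in ms, NOT real data.
samples = {
    "s390zl12": [1.0, 1.2, 30.0, 1.1],
    "s390zl13": [1.1, 1.0, 1.3, 1.2],
    "worker-a": [0.9, 1.0, 1.1, 1.0],  # hypothetical reference host
    "worker-b": [1.0, 1.1, 0.9, 1.2],  # hypothetical reference host
}

flagged = flag_slow_hosts(samples, ["worker-a", "worker-b"], factor=3)
print(flagged)
```

Using the mean for the hosts under test makes single spikes visible, while the median baseline keeps the reference robust against its own outliers. Whether that is the right trade-off depends on what the alert rule itself evaluates.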

Actions #15

Updated by nicksinger 3 days ago

  • Status changed from In Progress to Resolved

nicksinger wrote in #note-14:

As discussed in the daily, the next step I will take is to compare the response times of these machines with those of other machines (same network, different network, architecture, etc.) using Grafana. Based on these results I can better judge which next steps would be helpful (e.g. an SD ticket about network performance, debugging s390 specifically, bumping alert thresholds, etc.).

While discussing further, I took another look at the alert history. Unfortunately, the last entry is from 2024-07-11, which is apparently the same date the silence was created, so I am not sure whether the alert would still fire, but currently everything looks fine. If further debugging is needed, we can reopen the ticket.
