action #163778

closed

[alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S

Added by tinita 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
2024-07-11
Due date:
2024-08-06
% Done:

0%

Estimated time:

Description

Observation

We had several alerts regarding s390zl12 today, firing and resolving shortly after each other:
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/5ddd66bd99b31f7597fd68af2cc96304f8d9e480/view?orgId=1
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/openqa_ping_time_alert_s390zl12/view?orgId=1

Date: Thu, 11 Jul 2024 14:24:34 +0200
From: Grafana osd-admins@suse.de
To: osd-admins@suse.de
Subject: [FIRING:2] s390zl12

2 firing alert instances

Suggestions

Rollback steps


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #166136: s390 LPAR s390ZL12 down and unable to boot - potential corrupted filesystem | Resolved | nicksinger | 2024-09-02 | 2024-09-17

Actions #1

Updated by tinita 5 months ago

  • Subject changed from [alert] to [alert] host_up & Average Ping time (ms) alert for s390zl12
Actions #2

Updated by tinita 5 months ago

  • Description updated (diff)

I created a 2-day silence for each of them.

Actions #3

Updated by okurz 5 months ago

  • Tags set to infra, alert, s390, reactive work
  • Priority changed from Normal to Urgent

@tinita thank you. 2 days is rather on the low side, which means we need to treat this ticket as urgent. I would be OK with setting something like a 2-month silence and mentioning the removal of the silence as a rollback step.

Actions #4

Updated by tinita 5 months ago

ok, will do

Actions #5

Updated by tinita 5 months ago

  • Description updated (diff)
  • Priority changed from Urgent to High

Extended the silences to 2 months

Actions #6

Updated by okurz 5 months ago

  • Subject changed from [alert] host_up & Average Ping time (ms) alert for s390zl12 to [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by nicksinger 5 months ago

  • Description updated (diff)
Actions #8

Updated by nicksinger 5 months ago

  • Description updated (diff)
Actions #9

Updated by nicksinger 5 months ago

  • Subject changed from [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S to [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S - auto_review:"Error connecting to VNC server <.*:5901>: IO::Socket::INET: connect: Connection refused":retry
Actions #10

Updated by nicksinger 5 months ago

  • Subject changed from [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S - auto_review:"Error connecting to VNC server <.*:5901>: IO::Socket::INET: connect: Connection refused":retry to [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S

Sorry, the auto review makes no sense here and is already covered in https://progress.opensuse.org/issues/76813, which this ticket might therefore duplicate, but I am not sure. I asked @livdywan and @okurz in Slack how we want to handle this ticket.

Actions #11

Updated by nicksinger 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #12

Updated by nicksinger 5 months ago

Going forward, I will focus on getting rid of the false alerts rather than further investigating the failing tests, because currently everything suggests that these two issues are not linked and just happen on the same host simultaneously.

Actions #13

Updated by openqa_review 5 months ago

  • Due date set to 2024-08-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by nicksinger 5 months ago

As discussed in the daily, the next step I will take is to compare the response times of these machines with those of other machines (same network, different network, architecture, etc.) using Grafana. Based on the results I can better judge which next steps would be helpful (e.g. an SD ticket about network performance, debugging s390 specifically, bumping alert thresholds, etc.).
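
The comparison described above could be sketched roughly as follows. This is only an illustration, assuming the ping samples (in ms) have already been exported per host from Grafana into a plain dictionary. The function names, the data layout and the threshold are hypothetical and not taken from the actual alert definition.

```python
from statistics import mean, median


def summarize_ping(samples_by_host):
    """Return {host: (mean_ms, median_ms, max_ms)} for every host that has samples.

    `samples_by_host` is a hypothetical export format: host name -> list of
    ping times in milliseconds. Hosts without samples are skipped.
    """
    return {
        host: (mean(values), median(values), max(values))
        for host, values in samples_by_host.items()
        if values
    }


def hosts_above(summary, threshold_ms):
    """Return the sorted host names whose mean ping time exceeds threshold_ms."""
    return sorted(
        host for host, (avg_ms, _, _) in summary.items() if avg_ms > threshold_ms
    )
```

Running `summarize_ping` on samples from the s390 hosts and from machines in the same and in other networks would show whether the s390 hosts stand out or whether the whole network segment is slow, which in turn points to either an SD ticket or a threshold change.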

Actions #15

Updated by nicksinger 5 months ago

  • Status changed from In Progress to Resolved

nicksinger wrote in #note-14:

As discussed in the daily, the next step I will take is to compare the response times of these machines with those of other machines (same network, different network, architecture, etc.) using Grafana. Based on the results I can better judge which next steps would be helpful (e.g. an SD ticket about network performance, debugging s390 specifically, bumping alert thresholds, etc.).

While discussing this further I took another look at the alert history. Unfortunately the last entry is from 2024-07-11, which is apparently the same date the silence was created, so I am not sure whether the alert would still fire, but currently everything looks fine. If further debugging is needed we can reopen the ticket.

Actions #16

Updated by livdywan 3 months ago

  • Related to action #166136: s390 LPAR s390ZL12 down and unable to boot - potential corrupted filesystem added