action #128420

closed

[alert][grafana] 100% packet loss from qa-power8-4-kvm, grenache-1 and powerqaworker-qam-1 to s390zp{11,15,17}.suse.de size:M

Added by nicksinger 12 months ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Starting 2023-04-27 15:15:00, the machines mentioned in the title failed to access/ping the s390 LPARs. Something between these hosts has changed or broken and needs to be fixed.
We had similar issues in the past, see the following SD tickets:

Suggestions

  • Check what these machines have in common. A quick look showed that they are all in the "old" qa network and close by: https://racktables.suse.de/index.php?page=rack&rack_id=516
  • Check whether other machines in the same location, network, room or switch have the same problem
  • Create a new SD ticket referencing the old ones. Robert mentioned in one of them that we might need to get rid of a second uplink
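
To compare affected and unaffected machines, a small script like the following could be run on each source host. It is only a sketch, not what the Grafana alert actually uses: the helper names `parse_packet_loss`, `measure_loss` and `report_loss` are made up for illustration, the target names are expanded from the ticket subject, and it assumes a Linux `ping` that prints an "N% packet loss" summary line.

```python
import re
import subprocess
from typing import Optional

# Hostnames expanded from the ticket subject: s390zp{11,15,17}.suse.de
TARGETS = ["s390zp11.suse.de", "s390zp15.suse.de", "s390zp17.suse.de"]

# Matches the summary line printed by ping, e.g. "100% packet loss"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output: str) -> Optional[float]:
    """Return the packet loss percentage from ping output, or None if absent."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def measure_loss(target: str, count: int = 3, timeout: int = 30) -> Optional[float]:
    """Ping `target` `count` times and return the reported loss percentage.

    Returns None if ping cannot be started, times out, or prints no summary
    (for example when the hostname does not resolve).
    """
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), target],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return parse_packet_loss(result.stdout)


def report_loss(targets=TARGETS) -> None:
    """Print one line per target so results from several hosts can be compared."""
    for target in targets:
        loss = measure_loss(target)
        status = "no result" if loss is None else f"{loss:g}% packet loss"
        print(f"{target}: {status}")
```

Calling `report_loss()` on one of the failing workers and on a machine in a different network would show whether the loss is specific to hosts behind the same switch or uplink.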

Rollback steps

  1. Remove silence for rule_uid=2Z025iB4km
Actions #1

Updated by nicksinger 12 months ago

  • Description updated (diff)

Silence added.

Actions #2

Updated by nicksinger 12 months ago

  • Subject changed from [alert][grafana] 100% packet loss from qa-power8-4-kvm, grenache-1 and powerqaworker-qam-1 to s390zp{11,15,17}.suse.de to [alert][grafana] 100% packet loss from qa-power8-4-kvm, grenache-1 and powerqaworker-qam-1 to s390zp{11,15,17}.suse.de size:M
  • Status changed from New to Workable
Actions #3

Updated by nicksinger 11 months ago

  • Assignee set to nicksinger
Actions #4

Updated by nicksinger 11 months ago

  • Status changed from Workable to Feedback
Actions #5

Updated by nicksinger 11 months ago

  • Status changed from Feedback to In Progress

Robert was able to resolve the issue, and the silence has been removed. The broken connection to zl14 was discussed in https://app.slack.com/client/T02863RC2AC/C02CANHLANP/thread/C02CANHLANP-1685444859.828089, where @mgriessmeier said: "No it is down on purpose - it should not be defined as required host anymore." I will therefore remove it from our salt to get rid of the alert.

Actions #6

Updated by nicksinger 11 months ago

  • Status changed from In Progress to Resolved

Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/537 to remove zl14 from our monitoring. I also tried to clarify the situation further in https://progress.opensuse.org/issues/125186#note-21 and https://suse.slack.com/archives/C02CANHLANP/p1685528579073759 (to get the ticket actually reopened).
