action #128420
closed
[alert][grafana] 100% packet loss from qa-power8-4-kvm, grenache-1 and powerqaworker-qam-1 to s390zp{11,15,17}.suse.de size:M
Description
Observation
Starting 2023-04-27 15:15:00, the machines named in the title could no longer reach or ping the s390 LPARs. Something between these hosts has changed or broken and needs to be fixed.
We had similar issues in the past, see the following SD tickets:
- https://sd.suse.com/servicedesk/customer/portal/1/SD-92689
- https://sd.suse.com/servicedesk/customer/portal/1/SD-115963
Suggestions
- Check what these machines have in common. A quick look showed that they are located close together in the "old" QA network: https://racktables.suse.de/index.php?page=rack&rack_id=516
- Check whether other machines in the same location, network, room or switch have the same problem.
- Create a new SD ticket referencing the old ones. Robert mentioned in one of them that we might need to get rid of a second uplink.
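To compare hosts as suggested above, a small script run from each candidate machine can help. This is only a minimal sketch, not the check used by our monitoring: it assumes the Linux iputils `ping` summary format ("N% packet loss"), expands the target names from the ticket title, and the helper names `parse_packet_loss` and `packet_loss` are made up for illustration.

```python
import re
import subprocess

# Target hostnames expanded from the ticket title (s390zp{11,15,17}.suse.de).
TARGETS = [f"s390zp{n}.suse.de" for n in (11, 15, 17)]

# Assumption: ping prints a summary line containing "<number>% packet loss",
# as iputils ping on Linux does.
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output):
    """Return the packet loss percentage found in ping output, or None."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def packet_loss(host, count=5, timeout=15):
    """Ping `host` and return the loss percentage.

    Returns 100.0 if ping times out or prints no summary line
    (for example because the name could not be resolved).
    """
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 100.0
    loss = parse_packet_loss(result.stdout)
    return 100.0 if loss is None else loss
```

Calling `packet_loss(host)` for every entry in `TARGETS` on an affected worker and on a machine in a different network should show whether the loss is specific to the "old" QA network.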
Rollback steps
- Remove silence for rule_uid=2Z025iB4km
Updated by nicksinger over 1 year ago
- Subject changed from [alert][grafana] 100% packet loss from qa-power8-4-kvm, grenache-1 and powerqaworker-qam-1 to s390zp{11,15,17}.suse.de to [alert][grafana] 100% packet loss from qa-power8-4-kvm, grenache-1 and powerqaworker-qam-1 to s390zp{11,15,17}.suse.de size:M
- Status changed from New to Workable
Updated by nicksinger over 1 year ago
- Status changed from Workable to Feedback
Updated by nicksinger over 1 year ago
- Status changed from Feedback to In Progress
Robert was able to resolve the issue, and the silence has been removed. The broken connection to zl14 was discussed in https://app.slack.com/client/T02863RC2AC/C02CANHLANP/thread/C02CANHLANP-1685444859.828089, where @mgriessmeier said: "No it is down on purpose - it should not be defined as required host anymore." I will therefore remove it from our salt to get rid of the alert.
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Resolved
Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/537 to remove zl14 from our monitoring and tried to clarify the situation a little more in https://progress.opensuse.org/issues/125186#note-21 and https://suse.slack.com/archives/C02CANHLANP/p1685528579073759 (to get the ticket actually reopened).