
action #114802

Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M

Added by mkittler about 2 months ago. Updated 3 days ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Target version: Ready
Start date: 2022-07-28
Due date: 2022-10-12
% Done: 0%
Estimated time:

Description

The alerts introduced by #113746 are alerting as not all hosts mentioned in that ticket's description are actually pingable.

Acceptance criteria

  • AC1: All packet loss alerts are unpaused again and not alerting, as problematic hosts are either recovered or ignored after all.

Suggestions

  • Check whether problematic hosts should be online or offline. If they should be online, try recovering them. If they should be offline, remove them from the list of checked hosts.
  • At this time, there's actually only one problematic host (s390zp14.suse.de). The alert is only firing multiple times because it is fired for each worker that cannot reach that host.
  • To check the problematic hosts, just check the panel of one of the packet loss alerts.

Related issues

Related to openQA Infrastructure - action #113746: monitoring: The grafana "ping time" panel does not list all hosts size:S (Resolved, 2022-07-18 to 2022-08-09)

History

#1 Updated by mkittler about 2 months ago

  • Related to action #113746: monitoring: The grafana "ping time" panel does not list all hosts size:S added

#2 Updated by mkittler about 2 months ago

  • Description updated (diff)
  • Target version set to Ready

#3 Updated by mkittler about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

#4 Updated by mkittler about 2 months ago

I've asked about the worker on #eng-testing.

#5 Updated by openqa_review about 2 months ago

  • Due date set to 2022-08-12

Setting due date based on mean cycle time of SUSE QE Tools

#6 Updated by okurz about 2 months ago

It seems some confusion was caused because we suddenly received a large number of emails mentioning the workers, but the underlying problem is that only one target is down and that this is confirmed by multiple sending hosts. We should think of a solution to prevent the large number of alert messages when just one target is down.

How about adding the panel to be run from monitor.qa.suse.de itself and only enabling alerts there, removing the alerts from the workers and keeping just the monitoring? That way, if a host is down we receive one and only one alert, but we still have additional supporting monitoring data from the worker hosts.

#7 Updated by mkittler about 2 months ago

How about adding the panel to be run from monitor.qa.suse.de itself and only enabling alerts there, removing the alerts from the workers and keeping just the monitoring? That way, if a host is down we receive one and only one alert, but we still have additional supporting monitoring data from the worker hosts.

Sounds reasonable. Should I do that as part of this ticket?

#8 Updated by mkittler about 2 months ago

As discussed, it would make sense to simply have one graph without a filter by source host. It would show all "source - target" ping combinations and alert if there's a problem with any of them. If we sort the legend nicely and add a good alert description, that should be sufficient to see quickly where the problem is once the alert triggers.
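
The single-graph idea could be sketched with a query like the following. This is only an illustration: the measurement and field names ("ping", "percent_packet_loss") and the tag names ("host", "url") follow telegraf's ping plugin defaults, and the actual OSD dashboard query may differ.

```sql
-- Sketch of one panel covering all source->target ping combinations
-- (telegraf ping plugin default names assumed, not the actual query):
SELECT mean("percent_packet_loss")
FROM "ping"
WHERE $timeFilter
GROUP BY time($__interval), "host", "url"
-- "host" = source worker, "url" = ping target; the legend then lists
-- every source - target pair, and a single alert rule covers them all.
```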

#9 Updated by mkittler about 2 months ago

#10 Updated by mkittler about 2 months ago

s390zp14.suse.de is in fact used as openqaworker5:9. So nothing to do.

#11 Updated by mkittler about 2 months ago

The SR has been deployed and it looks good (https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4). Of course the alert is still on (see last point of #114802#note-9). I've paused it for now.

#12 Updated by mkittler about 2 months ago

Asked on #eng-infra about the problematic connections. In case we need to ignore those connections after all, we'd just have to merge this change: https://gitlab.suse.de/mkittler/salt-states-openqa/-/merge_requests/new/diffs?merge_request%5Bsource_branch%5D=packet-loss

#13 Updated by mkittler about 2 months ago

  • Status changed from In Progress to Feedback

Looks like Nick has already created an infra ticket for it: https://sd.suse.com/servicedesk/customer/portal/1/SD-92689

So let's wait and see what the outcome of that will be. In the meantime I could apply the SR to ignore those connections so that the alert can be enabled again (for everything else).

#14 Updated by cdywan about 2 months ago

  • Subject changed from Handle "QA network infrastructure Package loss alert" introduced by #113746 to Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M

#15 Updated by mkittler about 2 months ago

It is unlikely to be a problem on the infra side. It could be a problem with our switches or a problem on the hosts. Mikelis said he'll update the infra ticket accordingly.

#16 Updated by mkittler about 2 months ago

He hasn't updated the ticket, so here are the most important findings from his investigation:

  • I had tcpdump running on grenache; the weird thing is that it does receive ICMP requests from s390zp14 and sends out replies
  • in both directions ICMP is received and a reply is sent, but it seems the reply is not getting back
  • ICMP echo is being received on both sides and an ICMP reply is being sent out
  • the TCP session capture seems broken
  • I checked all the stuff on my side and there are no issues
  • there are a few switches in both directions that are not managed by me and I cannot see their config
  • I'm fairly confident that the issue is not on our infra side

#17 Updated by mkittler about 2 months ago

  • Due date changed from 2022-08-12 to 2022-10-12

#18 Updated by mkittler 5 days ago

I've been updating https://sd.suse.com/servicedesk/customer/portal/1/SD-92689 with an explanation that it can be closed. (I cannot close it myself.)

The SR https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/725 has the problem that it mentions OSD-specific host names in the salt states themselves (rather than in the pillars). We could move the patterns into the pillars, but further templating the Grafana config is not so nice (as it always makes updating harder). We could also try to filter at the telegraf level, ensuring that the alert won't fire for data that's already in InfluxDB.

#19 Updated by okurz 4 days ago

I paused the alert "Packet loss between worker hosts and other hosts alert" again as it was alerting.

#20 Updated by mkittler 4 days ago

This is how excluding the problematic connections at the telegraf level could look: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/443
Of course salt-states-openqa/monitoring/telegraf/telegraf-worker.conf needed to be adjusted for the altered data structure, and a conditional matching the regex needed to be added: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/741
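
As a rough illustration of what telegraf-level filtering can look like, the plugin's built-in metric filtering can drop a known-bad connection before it ever reaches InfluxDB. This is only a sketch under assumed names: the target list and the dropped host are illustrative and not the content of the actual SRs, which use a regex rendered from the pillars.

```toml
# Sketch only: drop ping metrics for a connection known to be broken
# outside our infrastructure, so the alert never sees that series.
[[inputs.ping]]
  urls = ["openqa.suse.de", "s390zp14.suse.de"]  # illustrative target list
  # telegraf's standard tagdrop filter removes matching metrics
  # before they are written to InfluxDB:
  [inputs.ping.tagdrop]
    url = ["s390zp14.suse.de"]
```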

#21 Updated by mkittler 3 days ago

  • Status changed from Feedback to Resolved

The SRs have been merged, and the config changes look good on the few workers I've checked.

I've also just added qa-power8-4-kvm back to salt. The salt package wasn't installed on the machine at all anymore, likely due to an issue from quite a while ago; the machine was probably offline or booted into a temporary snapshot when that issue was fixed elsewhere.

With that the alert is not firing anymore so I enabled it again.

#22 Updated by okurz 3 days ago

  • Status changed from Resolved to Feedback

mkittler wrote:

With that the alert is not firing anymore so I enabled it again.

On https://monitor.qa.suse.de/alerting/list I find "openqaworker10: package loss alert, PAUSED for 2 months", so that should be handled. And I am also confused because https://monitor.qa.suse.de/d/WDopenqaworker9/worker-dashboard-openqaworker9?orgId=1&viewPanel=65113 does not show any enabled alert. What am I missing? Or am I confusing something?

#23 Updated by mkittler 3 days ago

  • Status changed from Feedback to Resolved

"openqaworker10: package loss alert, PAUSED for 2 months" is just part of "tinas-dashboard". That's likely something she created for testing purposes but it has nothing to do with the alert this ticket is about.

The alert is only defined on https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&editPanel=4&tab=alert and not on the individual worker dashboards, so we don't get tons of mails for the same problem.
