
action #114802

Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M

Added by mkittler about 2 months ago. Updated 3 days ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Target version: Ready
Start date: 2022-07-28
Due date: 2022-10-12
% Done: 0%
Estimated time:

Description

The alerts introduced by #113746 are alerting as not all hosts mentioned in that ticket's description are actually pingable.

Acceptance criteria

  • AC1: All packet loss alerts are unpaused again and not alerting, as problematic hosts are either recovered or ignored after all.

Suggestions

  • Check whether problematic hosts should be online or offline. If they should be online, try recovering them. If they should be offline, remove them from the list of checked hosts.
  • At this time, there's actually only one problematic host (s390zp14.suse.de). The alert is only firing multiple times because it is fired for each worker that cannot reach that host.
  • To check the problematic hosts, just check the panel of one of the packet loss alerts.

Related issues

Related to openQA Infrastructure - action #113746: monitoring: The grafana "ping time" panel does not list all hosts size:S (Resolved, 2022-07-18 to 2022-08-09)

History

#1 Updated by mkittler about 2 months ago

  • Related to action #113746: monitoring: The grafana "ping time" panel does not list all hosts size:S added

#2 Updated by mkittler about 2 months ago

  • Description updated (diff)
  • Target version set to Ready

#3 Updated by mkittler about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

#4 Updated by mkittler about 2 months ago

I've asked about the worker on #eng-testing.

#5 Updated by openqa_review about 2 months ago

  • Due date set to 2022-08-12

Setting due date based on mean cycle time of SUSE QE Tools

#6 Updated by okurz about 2 months ago

It seems some confusion was caused because we suddenly received a large number of emails mentioning the workers, but the underlying problem is that only one target is down and that this is confirmed by multiple sending hosts. We should think of a solution to prevent the large number of alert messages when just one target is down.

How about adding the panel to be run from monitor.qa.suse.de itself and only enabling alerts there, removing the alerts from the workers and keeping just the monitoring? That way, if a host is down we receive one and only one alert, but we still have additional supporting monitoring data from the worker hosts.

#7 Updated by mkittler about 2 months ago

How about adding the panel to be run from monitor.qa.suse.de itself and only enabling alerts there, removing the alerts from the workers and keeping just the monitoring? That way, if a host is down we receive one and only one alert, but we still have additional supporting monitoring data from the worker hosts.

Sounds reasonable. Should I do that as part of this ticket?

#8 Updated by mkittler about 2 months ago

As discussed, it would make sense to simply have one graph without a filter by source host. It would show all "source - target" ping combinations and alert if there's a problem with any of them. If we sort the legend nicely and add a good alert description, that should be sufficient to see quickly where the problem is once the alert triggers.
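
The single-graph idea could be sketched with a query like the following. This is only an illustration: the measurement and field names ("ping", "percent_packet_loss") and the tag names ("host", "url") follow telegraf's ping plugin defaults, and the actual OSD dashboard query may differ.

```sql
-- Sketch of one panel covering all source->target ping combinations
-- (telegraf ping plugin default names assumed, not the actual query):
SELECT mean("percent_packet_loss")
FROM "ping"
WHERE $timeFilter
GROUP BY time($__interval), "host", "url"
-- "host" = source worker, "url" = ping target; the legend then lists
-- every source - target pair, and a single alert rule covers them all.
```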

#9 Updated by mkittler about 2 months ago

#10 Updated by mkittler about 2 months ago

s390zp14.suse.de is in fact used as openqaworker5:9. So nothing to do.

#11 Updated by mkittler about 2 months ago

The SR has been deployed and it looks good (https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4). Of course the alert is still on (see last point of #114802#note-9). I've paused it for now.

#12 Updated by mkittler about 2 months ago

Asked on #eng-infra about the problematic connections. In case we need to ignore those connections after all, we'd just have to merge this change: https://gitlab.suse.de/mkittler/salt-states-openqa/-/merge_requests/new/diffs?merge_request%5Bsource_branch%5D=packet-loss

#13 Updated by mkittler about 2 months ago

  • Status changed from In Progress to Feedback

Looks like Nick has already created an infra ticket for it: https://sd.suse.com/servicedesk/customer/portal/1/SD-92689

So let's wait and see what the outcome of that will be. In the meantime I could apply the SR to ignore those connections so that the alert can be enabled again (for everything else).

#14 Updated by cdywan about 2 months ago

  • Subject changed from Handle "QA network infrastructure Package loss alert" introduced by #113746 to Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M

#15 Updated by mkittler about 2 months ago

It is unlikely to be a problem on the infra side. It could be a problem with our switches or a problem on the hosts. Mikelis said he'll update the infra ticket accordingly.

#16 Updated by mkittler about 2 months ago

He hasn't updated the ticket, so here are the most important findings from his investigation:

  • I had tcpdump running on grenache; the weird thing is that it does receive ICMP requests from s390zp14 and sends out replies
  • in both directions ICMP is received and a reply is sent, but it seems the reply is not getting back
  • ICMP echo is being received on both sides and an ICMP reply is being sent out
  • the TCP session capture seems broken
  • I checked all the stuff on my side and there are no issues
  • there are a few switches in both directions that are not managed by me and I cannot see their config
  • I'm fairly confident that the issue is not on our infra side

#17 Updated by mkittler about 2 months ago

  • Due date changed from 2022-08-12 to 2022-10-12

#18 Updated by mkittler 5 days ago

I've been updating https://sd.suse.com/servicedesk/customer/portal/1/SD-92689 with an explanation that it can be closed. (I cannot close it myself.)

The SR https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/725 has the problem that it mentions OSD-specific host names in the salt states themselves (rather than in the pillars). We could move the patterns into the pillars, but further templating the Grafana config is not so nice (as it always makes updating harder). We could also try to filter at the telegraf level, ensuring that the alert won't fire for data that's already in InfluxDB.

#19 Updated by okurz 4 days ago

I paused the alert "Packet loss between worker hosts and other hosts alert" again as it was alerting.

#20 Updated by mkittler 4 days ago

This is how excluding the problematic connections at the telegraf level could look: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/443
Of course salt-states-openqa/monitoring/telegraf/telegraf-worker.conf needed to be adjusted for the altered data structure, and a conditional matching the regex needed to be added: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/741
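
As a rough illustration of what telegraf-level filtering can look like, the plugin's built-in metric filtering can drop a known-bad connection before it ever reaches InfluxDB. This is only a sketch under assumed names: the target list and the dropped host are illustrative and not the content of the actual SRs, which use a regex rendered from the pillars.

```toml
# Sketch only: drop ping metrics for a connection known to be broken
# outside our infrastructure, so the alert never sees that series.
[[inputs.ping]]
  urls = ["openqa.suse.de", "s390zp14.suse.de"]  # illustrative target list
  # telegraf's standard tagdrop filter removes matching metrics
  # before they are written to InfluxDB:
  [inputs.ping.tagdrop]
    url = ["s390zp14.suse.de"]
```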

#21 Updated by mkittler 3 days ago

  • Status changed from Feedback to Resolved

The SRs have been merged, and the config changes look good on the few workers I've checked.

I've also just added qa-power8-4-kvm back to salt. The salt package wasn't installed on the machine at all anymore, likely due to an issue from quite a while ago; the machine was probably offline or booted into a temporary snapshot when that issue was fixed elsewhere.

With that the alert is not firing anymore so I enabled it again.

#22 Updated by okurz 3 days ago

  • Status changed from Resolved to Feedback

mkittler wrote:

With that the alert is not firing anymore so I enabled it again.

On https://monitor.qa.suse.de/alerting/list I find "openqaworker10: package loss alert, PAUSED for 2 months", so that should be handled. And I am also confused because https://monitor.qa.suse.de/d/WDopenqaworker9/worker-dashboard-openqaworker9?orgId=1&viewPanel=65113 does not show any enabled alert. What am I missing? Or am I confusing something?

#23 Updated by mkittler 3 days ago

  • Status changed from Feedback to Resolved

"openqaworker10: package loss alert, PAUSED for 2 months" is just part of "tinas-dashboard". That's likely something she created for testing purposes but it has nothing to do with the alert this ticket is about.

The alert is only defined on https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&editPanel=4&tab=alert and not on the individual worker dashboards, so we don't get tons of mails for the same problem.
