action #113746
closed
monitoring: The grafana "ping time" panel does not list all hosts size:S
0%
Description
Observation
For example, https://monitor.qa.suse.de/d/WDopenqaworker10/worker-dashboard-openqaworker10?orgId=1&refresh=1m&viewPanel=65099&from=now-90d&to=now currently shows a list of hosts, e.g. dist.suse.de, download.opensuse.org, etc., but not scc.suse.com from the list at https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L18:
required_external_networks:
- dist.suse.de
- s390zp11.suse.de
- s390zp14.suse.de
- s390zp15.suse.de
- s390zp17.suse.de
- s390zp18.suse.de
- s390zp19.suse.de
- download.opensuse.org
- proxy.scc.suse.de
- qanet.qa.suse.de
- scc.suse.com
Acceptance criteria
- AC1: All hosts from "required_external_networks" are shown in the monitoring panel
- AC2: Alerts are triggered for unavailable hosts
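AC1 can be checked mechanically by comparing the configured host list with the hosts the panel actually has series for. The following is a minimal Python sketch, assuming the host names shown in the panel have already been collected into a list. The function name `missing_from_panel` and the sample `shown` list are hypothetical and only serve as illustration.

```python
# Hosts from "required_external_networks" in workerconf.sls, as quoted in this ticket.
REQUIRED_EXTERNAL_NETWORKS = [
    "dist.suse.de",
    "s390zp11.suse.de",
    "s390zp14.suse.de",
    "s390zp15.suse.de",
    "s390zp17.suse.de",
    "s390zp18.suse.de",
    "s390zp19.suse.de",
    "download.opensuse.org",
    "proxy.scc.suse.de",
    "qanet.qa.suse.de",
    "scc.suse.com",
]


def missing_from_panel(required, shown):
    """Return the required hosts that have no series in the panel, in config order."""
    shown_set = set(shown)
    return [host for host in required if host not in shown_set]


if __name__ == "__main__":
    # Hypothetical panel content, for illustration only.
    shown = ["dist.suse.de", "download.opensuse.org"]
    for host in missing_from_panel(REQUIRED_EXTERNAL_NETWORKS, shown):
        print(f"not shown in panel: {host}")
```

An empty result would mean that every configured host is represented in the panel.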
Suggestions
- Drop scc.suse.com since it can't be pinged
- Add another boolean panel for unreachable hosts, since otherwise we get no alerts when there is no data, i.e. no ping response at all
Updated by okurz over 2 years ago
- Related to action #113716: [qe-core] proxy-scc is down added
Updated by livdywan over 2 years ago
FYI proxy.scc.suse.de was offline before
scc.suse.com is currently not allowing pings. We should exclude scc.suse.com from the list
Updated by okurz over 2 years ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/423 to handle scc.suse.com
Updated by livdywan over 2 years ago
- Subject changed from monitoring: The grafana "ping time" panel does not list all hosts to monitoring: The grafana "ping time" panel does not list all hosts size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by tinita over 2 years ago
- Status changed from Workable to In Progress
- Assignee set to tinita
Updated by openqa_review over 2 years ago
- Due date set to 2022-08-09
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 2 years ago
Based on the query in https://monitor.qa.suse.de/d/WDopenqaworker9/worker-dashboard-openqaworker9?orgId=1&editPanel=65099&inspect=65099&inspectTab=query I took a look at how the data is stored in InfluxDB and found:
> select * from ping where ("host" = 'openqaworker9') AND time >= now() - 1h;
name: ping
time                average_response_ms host          ip maximum_response_ms minimum_response_ms packets_received packets_transmitted percent_packet_loss result_code standard_deviation_ms ttl url
----                ------------------- ----          -- ------------------- ------------------- ---------------- ------------------- ------------------- ----------- --------------------- --- ---
1658821380000000000 0.311               openqaworker9    0.311               0.311               1                1                   0                   0           0                     64  dist.suse.de
1658821380000000000 0.92                openqaworker9    0.92                0.92                1                1                   0                   0           0                     59  download.opensuse.org
1658821380000000000 0.346               openqaworker9    0.346               0.346               1                1                   0                   0           0                     64  proxy.scc.suse.de
1658821380000000000 0.167               openqaworker9    0.167               0.167               1                1                   0                   0           0                     63  s390zp15.suse.de
1658821380000000000 0.161               openqaworker9    0.161               0.161               1                1                   0                   0           0                     63  qanet.qa.suse.de
1658821380000000000 0.225               openqaworker9    0.225               0.225               1                1                   0                   0           0                     63  s390zp18.suse.de
1658821380000000000 3.66                openqaworker9    3.66                3.66                1                1                   0                   0           0                     64  openqa.suse.de
1658821380000000000 0.198               openqaworker9    0.198               0.198               1                1                   0                   0           0                     59  s390zp11.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp19.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp17.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp14.suse.de
…
I see two different groups: hosts with reasonably low ping times, and hosts that were pinged but never responded, e.g. s390zp19.suse.de with "packets_transmitted" being 1 but "packets_received" being 0 and "percent_packet_loss" being 100.
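The two groups can be told apart from the stored fields alone. The sketch below is a hedged Python illustration, assuming each row is available as a dict keyed by the field names from the query output. The helper `split_by_response` is hypothetical and not part of the dashboard. A panel that only plots "average_response_ms" has nothing to draw for the second group, because that field is absent there.

```python
def split_by_response(rows):
    """Split ping rows into (answered, unanswered) lists of target URLs.

    A row counts as unanswered when no packet came back, i.e.
    packets_received is 0 and percent_packet_loss is 100.
    """
    answered, unanswered = [], []
    for row in rows:
        if row["packets_received"] == 0 and row["percent_packet_loss"] == 100:
            unanswered.append(row["url"])
        else:
            answered.append(row["url"])
    return answered, unanswered


# Two rows taken from the query output above; absent fields are None.
ROWS = [
    {"url": "dist.suse.de", "average_response_ms": 0.311,
     "packets_received": 1, "packets_transmitted": 1, "percent_packet_loss": 0},
    {"url": "s390zp19.suse.de", "average_response_ms": None,
     "packets_received": 0, "packets_transmitted": 1, "percent_packet_loss": 100},
]

if __name__ == "__main__":
    answered, unanswered = split_by_response(ROWS)
    print("answered:", answered)
    print("unanswered:", unanswered)
```

Keying on "percent_packet_loss" rather than on the response time is what lets a boolean "unreachable" view cover hosts that never answer.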
Updated by tinita over 2 years ago
Updated by livdywan over 2 years ago
- Status changed from In Progress to Feedback
tinita wrote:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/718
Seems to do what we want. Approved and merged
Updated by tinita over 2 years ago
- Status changed from Feedback to Resolved
We just got an alert for s390zp14.suse.de (100% packet loss from all workers), so I think AC1 and AC2 are fulfilled.
Updated by okurz over 2 years ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/719 for slightly better wording and to fix false alerts that occur when the sending hosts themselves are down.
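The idea of not alerting when the sender itself is down can be sketched as follows. This is an illustration under stated assumptions and not necessarily what the merge request implements: a target is treated as unreachable only if at least one worker delivered a data point and every reporting worker saw 100% packet loss. The function `target_unreachable` and the shape of `reports` are hypothetical.

```python
def target_unreachable(reports):
    """Decide whether a ping target should raise an alert.

    reports maps worker name -> percent_packet_loss, or None when the
    worker delivered no data point (for example because it is down).
    """
    losses = [loss for loss in reports.values() if loss is not None]
    if not losses:
        # No sender reported anything: a sender problem, not a target problem.
        return False
    return all(loss >= 100 for loss in losses)


if __name__ == "__main__":
    # Hypothetical reports for one target, for illustration only.
    print(target_unreachable({"openqaworker9": 100, "openqaworker10": 100}))
    print(target_unreachable({"openqaworker9": None, "openqaworker10": None}))
```

Workers without data are ignored instead of being counted as failures, so a worker that is down does not by itself make a target look unreachable.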
Updated by tinita over 2 years ago
- Status changed from Resolved to Workable
- Assignee deleted (tinita)
I was told that the alert is actually a false alarm, and that my merge request caused it, and that I shall pause the alarm.
I don't understand that, since to me it looks like s390zp14.suse.de has 100% packet loss from every worker, and I thought that we actually created this ticket to be alerted about this.
Maybe I totally got the ticket wrong.
I unassigned me.
Updated by mkittler over 2 years ago
I was told that the alert is actually a false alarm
At least some alerts were false, but @okurz's SR should have fixed that (see #113746#note-11).
I don't understand that, since to me it looks like s390zp14.suse.de has 100% packet loss from every worker
That's not what @okurz meant. He meant the case when sending hosts are down themselves.
Maybe I totally got the ticket wrong.
Well, I was confused at first as well. If I've also still got it wrong, please revert my changes to this ticket.
I unassigned me.
However, if I've got it right now, then this ticket should be almost resolved. One should check whether all ACs are fulfilled, though. And the new alerts (the ones that are not false alerts) should be paused and handled. I created #114802 for that.
Updated by mkittler over 2 years ago
- Related to action #114802: Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M added