action #113746
closed
FYI proxy.scc.suse.de was offline before
scc.suse.com is currently not allowing pings. We should exclude scc.suse.com from the list
- Subject changed from monitoring: The grafana "ping time" panel does not list all hosts to monitoring: The grafana "ping time" panel does not list all hosts size:S
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to tinita
- Due date set to 2022-08-09
Setting due date based on mean cycle time of SUSE QE Tools
Based on the query in https://monitor.qa.suse.de/d/WDopenqaworker9/worker-dashboard-openqaworker9?orgId=1&editPanel=65099&inspect=65099&inspectTab=query I took a look at how the data is stored in InfluxDB and found:
> select * from ping where ("host" = 'openqaworker9') AND time >= now() - 1h;
name: ping
time                average_response_ms host          ip maximum_response_ms minimum_response_ms packets_received packets_transmitted percent_packet_loss result_code standard_deviation_ms ttl url
----                ------------------- ----          -- ------------------- ------------------- ---------------- ------------------- ------------------- ----------- --------------------- --- ---
1658821380000000000 0.311               openqaworker9    0.311               0.311               1                1                   0                   0           0                     64  dist.suse.de
1658821380000000000 0.92                openqaworker9    0.92                0.92                1                1                   0                   0           0                     59  download.opensuse.org
1658821380000000000 0.346               openqaworker9    0.346               0.346               1                1                   0                   0           0                     64  proxy.scc.suse.de
1658821380000000000 0.167               openqaworker9    0.167               0.167               1                1                   0                   0           0                     63  s390zp15.suse.de
1658821380000000000 0.161               openqaworker9    0.161               0.161               1                1                   0                   0           0                     63  qanet.qa.suse.de
1658821380000000000 0.225               openqaworker9    0.225               0.225               1                1                   0                   0           0                     63  s390zp18.suse.de
1658821380000000000 3.66                openqaworker9    3.66                3.66                1                1                   0                   0           0                     64  openqa.suse.de
1658821380000000000 0.198               openqaworker9    0.198               0.198               1                1                   0                   0           0                     59  s390zp11.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp19.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp17.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp14.suse.de
…
and I see two different groups: hosts with reasonably low ping times, and hosts that were pinged but sent no response, e.g. s390zp19.suse.de with "packets_transmitted" being 1 but "percent_packet_loss" being 100.
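The two groups described above can be told apart mechanically. The following is only a minimal sketch, assuming the rows have already been fetched and turned into dictionaries; the `rows` list and the `split_by_reachability` helper are hypothetical names for illustration, and `None` stands in for a field that the query output leaves empty.

```python
# Hypothetical sample rows mirroring three lines of the query output above.
# None marks a field that is empty in the InfluxDB result.
rows = [
    {"url": "dist.suse.de", "packets_transmitted": 1, "packets_received": 1,
     "percent_packet_loss": 0, "average_response_ms": 0.311},
    {"url": "openqa.suse.de", "packets_transmitted": 1, "packets_received": 1,
     "percent_packet_loss": 0, "average_response_ms": 3.66},
    {"url": "s390zp19.suse.de", "packets_transmitted": 1, "packets_received": 0,
     "percent_packet_loss": 100, "average_response_ms": None},
]


def split_by_reachability(rows):
    """Return (reachable, unreachable) lists of pinged target URLs."""
    reachable, unreachable = [], []
    for row in rows:
        # A target counts as reachable only if at least one packet came back
        # and a response time was actually recorded.
        answered = (row["packets_received"] > 0
                    and row["average_response_ms"] is not None)
        (reachable if answered else unreachable).append(row["url"])
    return reachable, unreachable


reachable, unreachable = split_by_reachability(rows)
print("reachable:", reachable)
print("unreachable:", unreachable)
```

Rows in the second group carry no "average_response_ms" value at all, so anything that looks only at response times would not see them; "percent_packet_loss" or "packets_received" is the field that still identifies them.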
- Status changed from In Progress to Feedback
- Status changed from Feedback to Resolved
- Status changed from Resolved to Workable
- Assignee deleted (tinita)
I was told that the alert is actually a false alarm, that my merge request caused it, and that I should pause the alert.
I don't understand that, since to me it looks like s390zp14.suse.de has 100% packet loss from every worker, and I thought that we actually created this ticket to be alerted about this.
Maybe I totally got the ticket wrong.
I unassigned myself.
> I was told that the alert is actually a false alarm
At least some alerts were false, but @okurz's SR should have fixed that (see #113746#note-11).
> I don't understand that, since to me it looks like s390zp14.suse.de has 100% packet loss from every worker
That's not what @okurz meant. He meant the case where the sending hosts are down themselves.
> Maybe I totally got the ticket wrong.
Well, I was confused at first as well. If I've still got it wrong, please revert my changes to this ticket.
> I unassigned myself.
However, if I've got it right now, this ticket should be almost resolved. Someone should check whether all ACs are fulfilled, though. And the new alerts (the ones that are not false alerts) should be paused and handled. I created #114802 for that.
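The distinction discussed above, a target with 100% packet loss from every worker versus a sending host that is down itself, can be illustrated with a short sketch. This is not how the Grafana alert is implemented; `classify_target` and the sender names `worker-a` and `worker-b` are hypothetical, and a value of `None` is assumed to mean that the sender delivered no data at all.

```python
def classify_target(loss_by_sender):
    """Classify one ping target from per-sender packet loss reports.

    loss_by_sender maps a sending host to its reported percent_packet_loss,
    or to None when that sender delivered no data (assumed to be down itself).
    """
    # Ignore senders that did not report anything; their silence says
    # nothing about the target.
    reporting = {sender: loss for sender, loss in loss_by_sender.items()
                 if loss is not None}
    if not reporting:
        return "no data"
    if all(loss == 100 for loss in reporting.values()):
        return "target unreachable"
    return "target reachable"


# Every reporting sender sees 100% loss: a real finding about the target.
print(classify_target({"worker-a": 100, "worker-b": 100}))
# The only sender is silent: a false alarm as far as the target is concerned.
print(classify_target({"worker-a": None}))
# One sender is down, the other still reaches the target.
print(classify_target({"worker-a": None, "worker-b": 0}))
```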
- Related to action #114802: Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M added
After checking again I would say both ACs are fulfilled. So I'm assigning the ticket back to @tinita and marking it as resolved (leaving us with #114802 as follow-up).
- Status changed from Workable to Resolved