action #113746

closed

monitoring: The grafana "ping time" panel does not list all hosts size:S

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-07-18
Due date: 2022-08-09
% Done: 0%
Estimated time:

Description

Observation

For example, https://monitor.qa.suse.de/d/WDopenqaworker10/worker-dashboard-openqaworker10?orgId=1&refresh=1m&viewPanel=65099&from=now-90d&to=now currently shows a list of hosts, e.g. dist.suse.de, download.opensuse.org, etc., but not scc.suse.com from the list in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L18:

 required_external_networks:
  - dist.suse.de
  - s390zp11.suse.de
  - s390zp14.suse.de
  - s390zp15.suse.de
  - s390zp17.suse.de
  - s390zp18.suse.de
  - s390zp19.suse.de
  - download.opensuse.org
  - proxy.scc.suse.de
  - qanet.qa.suse.de
  - scc.suse.com

Acceptance criteria

  • AC1: All hosts from "required_external_networks" are shown in the monitoring panel
  • AC2: Alerts are triggered for unavailable hosts
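
AC1 amounts to a set comparison between the configured hosts and the hosts the panel actually reports. A minimal Python sketch, assuming both lists have already been fetched by some other means; `missing_hosts` is a hypothetical helper, and the `reported` sample only contains the two hosts the observation names as visible, not a full export of the panel:

```python
# Hosts from required_external_networks, copied from the description above.
REQUIRED = [
    "dist.suse.de",
    "s390zp11.suse.de",
    "s390zp14.suse.de",
    "s390zp15.suse.de",
    "s390zp17.suse.de",
    "s390zp18.suse.de",
    "s390zp19.suse.de",
    "download.opensuse.org",
    "proxy.scc.suse.de",
    "qanet.qa.suse.de",
    "scc.suse.com",
]


def missing_hosts(required, reported):
    """Return the required hosts that are absent from `reported`, in config order."""
    seen = set(reported)
    return [host for host in required if host not in seen]


# Illustrative sample only (assumption): the hosts named as visible in the observation.
reported = ["dist.suse.de", "download.opensuse.org"]
print(missing_hosts(REQUIRED, reported))
```

An empty result would mean AC1 holds for the sampled panel data.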

Suggestions

  • Drop scc.suse.com since it can't be pinged
  • Add another boolean panel for unreachable hosts, since otherwise we get no alerts when there is no data, i.e. no ping response at all

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #113716: [qe-core] proxy-scc is down (Resolved, szarate, 2022-07-18, 2022-07-19)
Related to openQA Infrastructure - action #114802: Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M (Resolved, mkittler, 2022-07-28, 2022-10-12)

Actions #1

Updated by okurz over 1 year ago

Actions #2

Updated by livdywan over 1 year ago

FYI proxy.scc.suse.de was offline before

scc.suse.com currently does not allow pings, so we should exclude it from the list.

Actions #4

Updated by livdywan over 1 year ago

  • Subject changed from monitoring: The grafana "ping time" panel does not list all hosts to monitoring: The grafana "ping time" panel does not list all hosts size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by tinita over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita
Actions #6

Updated by openqa_review over 1 year ago

  • Due date set to 2022-08-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by okurz over 1 year ago

Based on the query in https://monitor.qa.suse.de/d/WDopenqaworker9/worker-dashboard-openqaworker9?orgId=1&editPanel=65099&inspect=65099&inspectTab=query I took a look at how the data is stored in InfluxDB and found:

> select * from ping where ("host" = 'openqaworker9') AND time >= now() - 1h;
name: ping
time                average_response_ms host          ip maximum_response_ms minimum_response_ms packets_received packets_transmitted percent_packet_loss result_code standard_deviation_ms ttl url
----                ------------------- ----          -- ------------------- ------------------- ---------------- ------------------- ------------------- ----------- --------------------- --- ---
1658821380000000000 0.311               openqaworker9    0.311               0.311               1                1                   0                   0           0                     64  dist.suse.de
1658821380000000000 0.92                openqaworker9    0.92                0.92                1                1                   0                   0           0                     59  download.opensuse.org
1658821380000000000 0.346               openqaworker9    0.346               0.346               1                1                   0                   0           0                     64  proxy.scc.suse.de
1658821380000000000 0.167               openqaworker9    0.167               0.167               1                1                   0                   0           0                     63  s390zp15.suse.de
1658821380000000000 0.161               openqaworker9    0.161               0.161               1                1                   0                   0           0                     63  qanet.qa.suse.de
1658821380000000000 0.225               openqaworker9    0.225               0.225               1                1                   0                   0           0                     63  s390zp18.suse.de
1658821380000000000 3.66                openqaworker9    3.66                3.66                1                1                   0                   0           0                     64  openqa.suse.de
1658821380000000000 0.198               openqaworker9    0.198               0.198               1                1                   0                   0           0                     59  s390zp11.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp19.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp17.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp14.suse.de
…

and I see two different groups. There are hosts with reasonably low ping times, and there are entries for hosts that were pinged but sent no response, e.g. s390zp19.suse.de with "packets_transmitted" being 1 but "packets_received" being 0 and "percent_packet_loss" being 100.
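
The two groups can be told apart by `percent_packet_loss` alone. A minimal Python sketch, assuming the rows have been exported as dictionaries keyed by the InfluxDB column names; `split_by_reachability` is a hypothetical helper, and the sample rows are a subset of the output above:

```python
def split_by_reachability(rows):
    """Split ping rows into (reachable, unreachable) lists of target URLs.

    A target counts as unreachable when every transmitted packet was lost.
    """
    reachable, unreachable = [], []
    for row in rows:
        if row["percent_packet_loss"] >= 100:
            unreachable.append(row["url"])
        else:
            reachable.append(row["url"])
    return reachable, unreachable


# Subset of the rows shown above; empty response columns are modelled as None.
rows = [
    {"url": "dist.suse.de", "average_response_ms": 0.311, "percent_packet_loss": 0},
    {"url": "s390zp19.suse.de", "average_response_ms": None, "percent_packet_loss": 100},
    {"url": "s390zp14.suse.de", "average_response_ms": None, "percent_packet_loss": 100},
]
reachable, unreachable = split_by_reachability(rows)
print(unreachable)
```

Rows in the second group carry no `average_response_ms`, so a panel that only plots response times would have nothing to draw for them, which may explain why they are missing from the "ping time" panel.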

Actions #9

Updated by livdywan over 1 year ago

  • Status changed from In Progress to Feedback

tinita wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/718

Seems to do what we want. Approved and merged

Actions #10

Updated by tinita over 1 year ago

  • Status changed from Feedback to Resolved

We just got an alert for s390zp14.suse.de (100% packet loss from all workers), so I think AC1 and AC2 are fulfilled.

Example: https://monitor.qa.suse.de/d/WDopenqaworker5/worker-dashboard-openqaworker5?orgId=1&from=1659006192563&to=1659011792563&viewPanel=65113

Actions #11

Updated by okurz over 1 year ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/719 improves the wording slightly and fixes false alerts that were raised when the sending hosts themselves are down.

Actions #12

Updated by tinita over 1 year ago

  • Status changed from Resolved to Workable
  • Assignee deleted (tinita)

I was told that the alert is actually a false alarm, and that my merge request caused it, and that I shall pause the alarm.

I don't understand that, since to me it looks like s390zp14.suse.de has 100% packet loss from every worker, and I thought that we actually created this ticket to be alerted about this.

Maybe I totally got the ticket wrong.

I unassigned me.

Actions #13

Updated by mkittler over 1 year ago

I was told that the alert is actually a false alarm

At least some alerts were false, but @okurz's merge request should have fixed that (see #113746#note-11).

I don't understand that, since to me it looks like s390zp14.suse.de has 100% packet loss from every worker

That's not what @okurz meant. He meant the case when sending hosts are down themselves.

Maybe I totally got the ticket wrong.

Well, I was confused at first as well. If I've also still got it wrong, please revert my changes to this ticket.

I unassigned me.

However, if I've got it right now, then this ticket should be almost resolved. One should check whether all ACs are fulfilled, though. And the new alerts (the ones that are not false alerts) should be paused and handled. I created #114802 for that.

Actions #14

Updated by mkittler over 1 year ago

  • Related to action #114802: Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M added
Actions #15

Updated by mkittler over 1 year ago

  • Assignee set to tinita

After checking again I would say both ACs are fulfilled. So I'm assigning the ticket back to @tinita and marking it as resolved (leaving us with #114802 as follow-up).

Actions #16

Updated by mkittler over 1 year ago

  • Status changed from Workable to Resolved