action #113746: monitoring: The grafana "ping time" panel does not list all hosts size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #113746

closed

monitoring: The grafana "ping time" panel does not list all hosts size:S

Added by okurz over 2 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

High

Assignee:

tinita

Category:

Target version:

openQA Project (public) - Ready

Start date:

2022-07-18

Due date:

2022-08-09

% Done:

Estimated time:

Tags:

reactive work

Description

Observation¶

For example, https://monitor.qa.suse.de/d/WDopenqaworker10/worker-dashboard-openqaworker10?orgId=1&refresh=1m&viewPanel=65099&from=now-90d&to=now currently shows a list of hosts, e.g. dist.suse.de, download.opensuse.org, etc., but not scc.suse.com from the list https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L18:

 required_external_networks:
  - dist.suse.de
  - s390zp11.suse.de
  - s390zp14.suse.de
  - s390zp15.suse.de
  - s390zp17.suse.de
  - s390zp18.suse.de
  - s390zp19.suse.de
  - download.opensuse.org
  - proxy.scc.suse.de
  - qanet.qa.suse.de
  - scc.suse.com

Acceptance criteria¶

AC1: All hosts from "required_external_networks" are shown in the monitoring panel
AC2: Alerts are triggered for unavailable hosts
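
AC1 could be checked mechanically along the following lines. This is only a sketch: it assumes the configured host list and the hosts that actually show up in the panel's data are both available as plain Python lists. The function name `missing_hosts` and the `reported` sample are hypothetical and not part of the salt states or the dashboard.

```python
def missing_hosts(required, reported):
    """Return the configured hosts that have no ping series, in configured order."""
    seen = set(reported)
    return [host for host in required if host not in seen]


# Host list copied from required_external_networks in the description.
required = [
    "dist.suse.de",
    "s390zp11.suse.de",
    "s390zp14.suse.de",
    "s390zp15.suse.de",
    "s390zp17.suse.de",
    "s390zp18.suse.de",
    "s390zp19.suse.de",
    "download.opensuse.org",
    "proxy.scc.suse.de",
    "qanet.qa.suse.de",
    "scc.suse.com",
]

# Illustrative only: pretend the panel lists everything except scc.suse.com.
reported = [host for host in required if host != "scc.suse.com"]

print(missing_hosts(required, reported))
```

An empty result would mean every configured host is represented in the panel.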

Suggestions¶

Drop scc.suse.com since it can't be pinged
Add another boolean panel for unreachable hosts, since otherwise we get no alerts when there is no data, i.e. no ping response at all

Related issues 2 (0 open — 2 closed)

Updated by okurz over 2 years ago

Related to action #113716: [qe-core] proxy-scc is down added

Updated by livdywan over 2 years ago

FYI proxy.scc.suse.de was offline before

scc.suse.com is currently not allowing pings. We should exclude scc.suse.com from the list

Updated by okurz over 2 years ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/423 to handle scc.suse.com

Updated by livdywan over 2 years ago

Subject changed from monitoring: The grafana "ping time" panel does not list all hosts to monitoring: The grafana "ping time" panel does not list all hosts size:S
Description updated (diff)
Status changed from New to Workable

Updated by tinita over 2 years ago

Status changed from Workable to In Progress
Assignee set to tinita

Updated by openqa_review over 2 years ago

Due date set to 2022-08-09

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz over 2 years ago

Based on the query in https://monitor.qa.suse.de/d/WDopenqaworker9/worker-dashboard-openqaworker9?orgId=1&editPanel=65099&inspect=65099&inspectTab=query I looked into how the data is stored in influxdb and found:

> select * from ping where ("host" = 'openqaworker9') AND time >= now() - 1h;
name: ping
time                average_response_ms host          ip maximum_response_ms minimum_response_ms packets_received packets_transmitted percent_packet_loss result_code standard_deviation_ms ttl url
----                ------------------- ----          -- ------------------- ------------------- ---------------- ------------------- ------------------- ----------- --------------------- --- ---
1658821380000000000 0.311               openqaworker9    0.311               0.311               1                1                   0                   0           0                     64  dist.suse.de
1658821380000000000 0.92                openqaworker9    0.92                0.92                1                1                   0                   0           0                     59  download.opensuse.org
1658821380000000000 0.346               openqaworker9    0.346               0.346               1                1                   0                   0           0                     64  proxy.scc.suse.de
1658821380000000000 0.167               openqaworker9    0.167               0.167               1                1                   0                   0           0                     63  s390zp15.suse.de
1658821380000000000 0.161               openqaworker9    0.161               0.161               1                1                   0                   0           0                     63  qanet.qa.suse.de
1658821380000000000 0.225               openqaworker9    0.225               0.225               1                1                   0                   0           0                     63  s390zp18.suse.de
1658821380000000000 3.66                openqaworker9    3.66                3.66                1                1                   0                   0           0                     64  openqa.suse.de
1658821380000000000 0.198               openqaworker9    0.198               0.198               1                1                   0                   0           0                     59  s390zp11.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp19.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp17.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp14.suse.de
…

and I see two different groups. There are hosts with reasonably low ping numbers and there are entries for the hosts that were pinged but no response was received, e.g. s390zp19.suse.de with "packets_transmitted" being 1 but "percent_packet_loss" being 100.
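
The two groups can be told apart from the stored fields alone. The following is a minimal sketch, assuming the rows have already been fetched and turned into dictionaries keyed by the column names from the query output above. The helper name `split_by_reachability` is hypothetical, and the two sample rows are taken from that output.

```python
def split_by_reachability(rows):
    """Split ping rows into (reachable, unreachable) lists of target URLs.

    A target counts as unreachable when all transmitted packets were lost.
    """
    reachable, unreachable = [], []
    for row in rows:
        if row["percent_packet_loss"] >= 100:
            unreachable.append(row["url"])
        else:
            reachable.append(row["url"])
    return reachable, unreachable


rows = [
    # Answered ping: response times are present, no loss.
    {"url": "dist.suse.de", "average_response_ms": 0.311,
     "packets_transmitted": 1, "percent_packet_loss": 0},
    # Unanswered ping: no response time stored, everything lost.
    {"url": "s390zp19.suse.de", "average_response_ms": None,
     "packets_transmitted": 1, "percent_packet_loss": 100},
]

reachable, unreachable = split_by_reachability(rows)
print("reachable:", reachable)
print("unreachable:", unreachable)
```

Because the unanswered rows carry no response time, a panel that only plots "average_response_ms" would not show those hosts, which is why the check here uses "percent_packet_loss" instead.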

Updated by tinita over 2 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/718

Updated by livdywan over 2 years ago

Status changed from In Progress to Feedback

tinita wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/718

Seems to do what we want. Approved and merged

#10

Updated by tinita over 2 years ago

Status changed from Feedback to Resolved

We just got an alert for s390zp14.suse.de (100% packet loss from all workers), so I think AC1 and AC2 are fulfilled.

Example: https://monitor.qa.suse.de/d/WDopenqaworker5/worker-dashboard-openqaworker5?orgId=1&from=1659006192563&to=1659011792563&viewPanel=65113

#11

Updated by okurz over 2 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/719 to have slightly better wording and to fix an issue with false alerts when the sending hosts themselves are down.
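
To illustrate the distinction between a target being unreachable and a sending host being down, here is a rough sketch. It is not the logic of the merge request above. It assumes a hypothetical mapping from worker name to the latest "percent_packet_loss" value for one target, where `None` stands for a worker that reported no data at all. The function name `should_alert` is likewise made up for this example.

```python
def should_alert(loss_by_worker):
    """Decide whether a target should raise a packet-loss alert.

    Workers that sent no data (None) are ignored, so a worker that is
    down itself cannot cause an alert for the target on its own.
    """
    reported = [loss for loss in loss_by_worker.values() if loss is not None]
    # Alert only if some worker reported and every report shows total loss.
    return bool(reported) and all(loss >= 100 for loss in reported)


# Target lost from every worker that reported: alert.
print(should_alert({"openqaworker5": 100, "openqaworker9": 100}))
# No worker reported anything: the senders are the problem, no alert.
print(should_alert({"openqaworker5": None, "openqaworker9": None}))
```

Whether missing data from a worker should be ignored or alerted on separately is a design choice. This sketch simply ignores it.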

#12

Updated by tinita over 2 years ago

Status changed from Resolved to Workable
Assignee deleted (~~tinita~~)

I was told that the alert is actually a false alarm, that my merge request caused it, and that I should pause the alarm.

I don't understand that, since to me it looks like s390zp14.suse.de has 100% packet loss from every worker, and I thought that we actually created this ticket to be alerted about this.

Maybe I totally got the ticket wrong.

I unassigned myself.

#13

Updated by mkittler over 2 years ago

I was told that the alert is actually a false alarm

At least some alerts were false, but @okurz's SR should have fixed it (see #113746#note-11).

I don't understand that, since to me it looks like s390zp14.suse.de has 100% packet loss from every worker

That's not what @okurz meant. He meant the case when sending hosts are down themselves.

Maybe I totally got the ticket wrong.

Well, I was confused at first as well. If I've also still got it wrong, please revert my changes to this ticket.

I unassigned myself.

However, if I've understood it correctly now, this ticket should be almost resolved. One should check whether all ACs are fulfilled, though. And the new alerts (the ones that are not false alerts) should be paused and handled. I created #114802 for that.

#14

Updated by mkittler over 2 years ago

Related to action #114802: Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M added

#15

Updated by mkittler over 2 years ago

Assignee set to tinita

After checking again I would say both ACs are fulfilled. So I'm assigning the ticket back to @tinita but marking it as resolved (leaving us with #114802 as follow-up).

#16

Updated by mkittler over 2 years ago

Status changed from Workable to Resolved
