action #113746: monitoring: The grafana "ping time" panel does not list all hosts size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #113746

closed

monitoring: The grafana "ping time" panel does not list all hosts size:S

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee: tinita
Category:
Target version: openQA Project (public) - Ready
Start date: 2022-07-18
Due date: 2022-08-09
% Done:
Estimated time:
Tags: reactive work

Description

Observation

For example, https://monitor.qa.suse.de/d/WDopenqaworker10/worker-dashboard-openqaworker10?orgId=1&refresh=1m&viewPanel=65099&from=now-90d&to=now currently shows a list of hosts, e.g. dist.suse.de, download.opensuse.org, etc., but not scc.suse.com from the list in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L18:

 required_external_networks:
  - dist.suse.de
  - s390zp11.suse.de
  - s390zp14.suse.de
  - s390zp15.suse.de
  - s390zp17.suse.de
  - s390zp18.suse.de
  - s390zp19.suse.de
  - download.opensuse.org
  - proxy.scc.suse.de
  - qanet.qa.suse.de
  - scc.suse.com
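
The gap described above amounts to a set difference between the configured hosts and the hosts the panel actually lists. A minimal Python sketch of that check follows. The helper name `missing_hosts` and the `shown_in_panel` sample are illustrative assumptions and are not taken from the dashboard or the salt states:

```python
# Hosts configured under required_external_networks (copied from the list above).
REQUIRED_EXTERNAL_NETWORKS = [
    "dist.suse.de",
    "s390zp11.suse.de",
    "s390zp14.suse.de",
    "s390zp15.suse.de",
    "s390zp17.suse.de",
    "s390zp18.suse.de",
    "s390zp19.suse.de",
    "download.opensuse.org",
    "proxy.scc.suse.de",
    "qanet.qa.suse.de",
    "scc.suse.com",
]


def missing_hosts(required, shown):
    """Return configured hosts that do not appear in the panel, in config order."""
    shown_set = set(shown)
    return [host for host in required if host not in shown_set]


# Hypothetical panel content, for illustration only.
shown_in_panel = ["dist.suse.de", "download.opensuse.org"]
print(missing_hosts(REQUIRED_EXTERNAL_NETWORKS, shown_in_panel))
```

An empty result would correspond to AC1 being met.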

Acceptance criteria

AC1: All hosts from "required_external_networks" are shown in the monitoring panel
AC2: Alerts are triggered for unavailable hosts

Suggestions

Drop scc.suse.com since it can't be pinged
Add another boolean panel for unreachable hosts, since otherwise we get no alerts when there is no data, i.e. no ping response at all

Related issues 2 (0 open — 2 closed)

Related to openQA Infrastructure (public) - action #113716: [qe-core] proxy-scc is down (Resolved, szarate, 2022-07-18, 2022-07-19)
Related to openQA Infrastructure (public) - action #114802: Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M (Resolved, mkittler, 2022-07-28, 2022-10-12)

History

Updated by okurz over 2 years ago

Related to action #113716: [qe-core] proxy-scc is down added

Updated by livdywan over 2 years ago

FYI proxy.scc.suse.de was offline before

scc.suse.com is currently not allowing pings. We should exclude scc.suse.com from the list

Updated by okurz over 2 years ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/423 to handle scc.suse.com

Updated by livdywan over 2 years ago

Subject changed from monitoring: The grafana "ping time" panel does not list all hosts to monitoring: The grafana "ping time" panel does not list all hosts size:S
Description updated (diff)
Status changed from New to Workable

Updated by tinita over 2 years ago

Status changed from Workable to In Progress
Assignee set to tinita

Updated by openqa_review over 2 years ago

Due date set to 2022-08-09

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz over 2 years ago

Based on the query in https://monitor.qa.suse.de/d/WDopenqaworker9/worker-dashboard-openqaworker9?orgId=1&editPanel=65099&inspect=65099&inspectTab=query, I looked at how the data is stored in InfluxDB and found:

> select * from ping where ("host" = 'openqaworker9') AND time >= now() - 1h;
name: ping
time                average_response_ms host          ip maximum_response_ms minimum_response_ms packets_received packets_transmitted percent_packet_loss result_code standard_deviation_ms ttl url
----                ------------------- ----          -- ------------------- ------------------- ---------------- ------------------- ------------------- ----------- --------------------- --- ---
1658821380000000000 0.311               openqaworker9    0.311               0.311               1                1                   0                   0           0                     64  dist.suse.de
1658821380000000000 0.92                openqaworker9    0.92                0.92                1                1                   0                   0           0                     59  download.opensuse.org
1658821380000000000 0.346               openqaworker9    0.346               0.346               1                1                   0                   0           0                     64  proxy.scc.suse.de
1658821380000000000 0.167               openqaworker9    0.167               0.167               1                1                   0                   0           0                     63  s390zp15.suse.de
1658821380000000000 0.161               openqaworker9    0.161               0.161               1                1                   0                   0           0                     63  qanet.qa.suse.de
1658821380000000000 0.225               openqaworker9    0.225               0.225               1                1                   0                   0           0                     63  s390zp18.suse.de
1658821380000000000 3.66                openqaworker9    3.66                3.66                1                1                   0                   0           0                     64  openqa.suse.de
1658821380000000000 0.198               openqaworker9    0.198               0.198               1                1                   0                   0           0                     59  s390zp11.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp19.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp17.suse.de
1658821390000000000                     openqaworker9                                            0                1                   100                 1                                     s390zp14.suse.de
…

I see two different groups: hosts with reasonably low ping times, and hosts that were pinged but sent no response, e.g. s390zp19.suse.de with "packets_transmitted" being 1, "packets_received" being 0 and "percent_packet_loss" being 100.
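
The two groups can be told apart by `percent_packet_loss` alone. A minimal Python sketch, assuming the rows have already been fetched and reduced to dictionaries with the `url` and `percent_packet_loss` columns shown above. The helper `split_by_reachability` is hypothetical, and the sample rows are copied from the query output:

```python
def split_by_reachability(rows):
    """Split ping rows into (reachable, unreachable) lists of target hosts."""
    reachable, unreachable = [], []
    for row in rows:
        # 100% loss means a packet was transmitted but no reply came back.
        if row["percent_packet_loss"] >= 100:
            unreachable.append(row["url"])
        else:
            reachable.append(row["url"])
    return reachable, unreachable


# Sample rows taken from the query output above.
rows = [
    {"url": "dist.suse.de", "percent_packet_loss": 0},
    {"url": "qanet.qa.suse.de", "percent_packet_loss": 0},
    {"url": "s390zp19.suse.de", "percent_packet_loss": 100},
    {"url": "s390zp14.suse.de", "percent_packet_loss": 100},
]
reachable, unreachable = split_by_reachability(rows)
print("reachable:", reachable)
print("unreachable:", unreachable)
```

Rows in the second group have empty response-time fields such as `average_response_ms`, so a panel that plots only response times has nothing to draw for them. That may be why those hosts were absent from the "ping time" panel.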

Updated by tinita over 2 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/718

Updated by livdywan over 2 years ago

Status changed from In Progress to Feedback

tinita wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/718

Seems to do what we want. Approved and merged

#10

Updated by tinita over 2 years ago

Status changed from Feedback to Resolved

We just got an alert for s390zp14.suse.de (100% packet loss from all workers), so I think AC1 and AC2 are fulfilled.

Example: https://monitor.qa.suse.de/d/WDopenqaworker5/worker-dashboard-openqaworker5?orgId=1&from=1659006192563&to=1659011792563&viewPanel=65113

#11

Updated by okurz over 2 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/719 for slightly better wording and to fix false alerts when the sending hosts themselves are down.
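
What the merge request changes in detail is not shown in this ticket. Purely as an illustration of the underlying idea, the following Python sketch reports a target as down only based on workers that actually delivered data, so a silent sender does not count as evidence against the target. The function `target_is_down` and the sample readings are assumptions made for this example:

```python
def target_is_down(loss_by_worker):
    """Decide whether a ping target should be reported as down.

    loss_by_worker maps a worker name to its latest percent_packet_loss,
    or to None when that worker delivered no data at all.
    """
    # Only workers that actually reported can say anything about the target.
    reported = [loss for loss in loss_by_worker.values() if loss is not None]
    if not reported:
        # Assumption: no reports at all is treated as a sender-side problem.
        return False
    return all(loss >= 100 for loss in reported)


# Hypothetical readings for one target.
print(target_is_down({"openqaworker5": 100, "openqaworker9": 100}))
print(target_is_down({"openqaworker5": 100, "openqaworker10": None}))
print(target_is_down({"openqaworker10": None}))
```

A sender that is down would then need its own availability alert, which is outside this sketch.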

#12

Updated by tinita over 2 years ago

Status changed from Resolved to Workable
Assignee deleted (~~tinita~~)

I was told that the alert is actually a false alarm, and that my merge request caused it, and that I shall pause the alarm.

I don't understand that, since to me it looks like s390zp14.suse.de has 100% packet loss from every worker, and I thought that we actually created this ticket to be alerted about this.

Maybe I totally got the ticket wrong.

I unassigned me.

#13

Updated by mkittler over 2 years ago

> I was told that the alert is actually a false alarm

At least some alerts were false, but @okurz's SR should have fixed that (see #113746#note-11).

> I don't understand that, since to me it looks like s390zp14.suse.de has 100% packet loss from every worker

That's not what @okurz meant. He meant the case when sending hosts are down themselves.

> Maybe I totally got the ticket wrong.

Well, I was confused at first as well. If I've still got it wrong, please revert my changes to this ticket.

> I unassigned me.

However, if I've got it right now, this ticket should be almost resolved. One should check whether all ACs are fulfilled, though. And the new alerts (the ones that are not false alerts) should be paused and handled. I created #114802 for that.

#14

Updated by mkittler over 2 years ago

Related to action #114802: Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M added

#15

Updated by mkittler over 2 years ago

Assignee set to tinita

After checking again I would say both ACs are fulfilled. So I'm assigning the ticket back to @tinita and marking it as resolved (leaving us with #114802 as follow-up).

#16

Updated by mkittler over 2 years ago

Status changed from Workable to Resolved
