action #106904: Monitoring for worker specific bandwidth size:M - openQA Project (public) - openSUSE Project Management Tool

action #106904

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #102951: [epic] Better network performance monitoring

Monitoring for worker specific bandwidth size:M

Added by okurz about 3 years ago. Updated about 3 years ago.

Status:

Resolved

Priority:

Low

Assignee:

kraih

Category:

Feature requests

Target version:

Ready

Start date:

2022-02-16

Description

User story

As an openQA instance administrator, I want to be alerted about unusual upload/download rates to or from individual workers so that I can identify problems in our infrastructure.

Acceptance criteria

AC1: monitor.qa has alerts for each worker depending on cache service bandwidth

Suggestions

Extend the telegraf config, based on the example of reading worker minion jobs, to read out the bandwidth data provided in the new user story #106901, see https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-worker.conf#L29
Add new panels in Grafana based on the new data
Define alert thresholds and add alert descriptions

Updated by okurz about 3 years ago

Priority changed from Normal to Low

Updated by kraih about 3 years ago

Assignee set to kraih

I'd like to learn a bit more about Grafana.

Updated by kraih about 3 years ago

Status changed from Workable to In Progress

Updated by kraih about 3 years ago

PR for Grafana dashboard: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/660

Updated by okurz about 3 years ago

Merged. https://monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?viewPanel=65109&orgId=1&from=now-7d&to=now looks quite nice, and the same goes for the other workers.

Updated by kraih about 3 years ago

I've looked through the data we've collected over the last few days, and for most workers the low point so far is around 19 MiB/s. Only openqaworker2 has dipped lower, to 9.52 MiB/s at 2022-03-01 12:35:00. For an alert, a threshold between 1 and 5 MiB/s (over an hour) might make sense.

Updated by okurz about 3 years ago

As I suggested over chat: use the max value and alert if it's below a threshold. We don't care about single dips of low transfer; we care about limited bandwidth affecting all downloads.
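
To illustrate the idea, here is a minimal Python sketch of the "max below threshold" check. It is not the actual Grafana alert rule; the function name `should_alert`, the sample lists, and the choice of 5 MiB/s (the upper end of the range proposed above) are assumptions made only for this example.

```python
from typing import Sequence


def should_alert(samples_mib_s: Sequence[float], threshold_mib_s: float) -> bool:
    """Return True if even the fastest sample in the window is below the threshold.

    Hypothetical helper for illustration only, not part of openQA or Grafana.
    """
    if not samples_mib_s:
        # No data in the window: leave that case to a separate "no data" alert.
        return False
    return max(samples_mib_s) < threshold_mib_s


# Made-up one-hour windows of cache service download rates in MiB/s.
single_dip = [19.0, 21.5, 9.52, 20.3]  # one slow transfer among normal ones
limited_link = [0.8, 0.9, 0.7, 0.85]   # every transfer in the window is slow

print(should_alert(single_dip, threshold_mib_s=5.0))    # False
print(should_alert(limited_link, threshold_mib_s=5.0))  # True
```

Because the check uses the maximum, one slow download in an otherwise healthy hour does not fire the alert. It only fires when no transfer in the window reaches the threshold.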
