action #106904
coordination #80142 (closed): [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #102951: [epic] Better network performance monitoring
Monitoring for worker specific bandwidth size:M
Description
User story
As an openQA instance administrator, I am alerted to unusual upload/download rates to/from individual workers so that I can identify problems in our infrastructure.
Acceptance criteria
- AC1: monitor.qa has alerts for each worker depending on cache service bandwidth
Suggestions
- Extend the telegraf config, following the example of reading worker minion jobs, to read out the bandwidth data provided by the new user story #106901; see https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-worker.conf#L29
- Add new panels in grafana based on the new data
- Define alert threshold and add alert descriptions
Updated by kraih over 2 years ago
PR for Grafana dashboard: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/660
Updated by okurz over 2 years ago
Merged. https://monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?viewPanel=65109&orgId=1&from=now-7d&to=now looks quite nice, and the same goes for the other workers.
Updated by kraih over 2 years ago
I've looked through the data we've collected over the last few days, and for most workers the low point is around 19 MiB/s so far. Only openqaworker2 has dipped as low as 9.52 MiB/s, at 2022-03-01 12:35:00. For an alert, a threshold between 1 and 5 MiB/s (over an hour) might make sense.
Updated by okurz over 2 years ago
As I suggested over chat: use the max value and alert if it's below a threshold. We don't care about single dips of low transfer; we care about limited bandwidth affecting all downloads.
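To illustrate the "max below threshold" idea, here is a minimal Python sketch. It is not the actual Grafana alert rule; the function name `bandwidth_alert`, the sample values, and the 5 MiB/s threshold are assumptions chosen for illustration (the threshold is taken from the upper end of the 1 to 5 MiB/s range proposed above).

```python
from typing import Sequence

MIB = 1024 * 1024  # bytes per MiB


def bandwidth_alert(samples_bps: Sequence[float], threshold_bps: float) -> bool:
    """Return True if even the best sample in the window is below the threshold.

    samples_bps: download rates (bytes/s) observed in the evaluation window,
                 e.g. one hour of data points (hypothetical input).
    threshold_bps: alert threshold in bytes/s (hypothetical value).
    """
    if not samples_bps:
        # No data in the window: do not alert here; a separate
        # "no data" check would have to cover this case.
        return False
    # Using max() ignores isolated dips: the alert only fires when
    # no transfer in the whole window reached the threshold.
    return max(samples_bps) < threshold_bps


# A single dip to 9.52 MiB/s among normal rates does not alert.
window_with_dip = [19 * MIB, 9.52 * MIB, 20 * MIB]
print(bandwidth_alert(window_with_dip, 5 * MIB))

# A window in which every transfer is slow does alert.
window_all_slow = [2 * MIB, 3 * MIB, 1 * MIB]
print(bandwidth_alert(window_all_slow, 5 * MIB))
```

Compared with alerting on the minimum or the mean, the maximum only drops below the threshold when every download in the window was slow, which matches the goal of detecting limited bandwidth rather than individual slow transfers.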
Updated by kraih over 2 years ago
PR for Grafana alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/661