action #127055

closed

[alert] Download rate alert openQA (openqaworker{14,16,17,18})

Added by jbaier_cz about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-04-01
Due date: 2023-04-19
% Done: 0%
Estimated time:
Tags:

Description

openqaworker14: Download rate alert

Cache service download rate is lower than expected. See https://progress.opensuse.org/issues/106904 for details.

Value: B=2.637854e+06

alertname: openqaworker14: Download rate alert
grafana_folder: openQA
hostname: openqaworker14
rule_uid: download_rate_alert_openqaworker14
type: worker

Link: http://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker14?orgId=1&viewPanel=65109

Similar alerts also fired for the other workers mentioned in the subject.


Related issues (1 open, 0 closed)

Copied to openQA Project - action #127265: Add timestamp to influxdb openqa_download_rate size:M (Workable)

Actions #1

Updated by okurz about 1 year ago

  • Tags changed from alert to alert, infra
  • Priority changed from Normal to Urgent
Actions #2

Updated by mkittler about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #3

Updated by mkittler about 1 year ago

The mean download speed was only 2.52 MiB/s for almost 8 hours. That's not much, indeed. That the speed stayed at exactly 2.52 MiB/s for that long and then suddenly jumps up to 60.7 MiB/s is strange. It then stayed at 60.7 MiB/s even longer before a trend with a normal amount of variation starts. This lets me think that this graph can stay at certain speeds for quite long times. Maybe no downloads happened in that time and simply the last value is repeated. That is of course problematic if the last value is slow: it then looks like downloads are slow for a very long time although simply no downloads are happening, leading to a false alert.

To check whether the hypothesis is plausible I've checked the graph of openqaworker14 for the last hour. It shows as many distinct speeds as there are cache_asset jobs in the Minion dashboard for that time frame (excluding jobs which did not actually download anything). It could be a coincidence, but it is in line with my hypothesis.

I remember that @kraih has worked on improving this in the past, so I'll look at the code to see what has already happened and what is still missing.

Actions #4

Updated by mkittler about 1 year ago

  • Priority changed from Urgent to High

Yes, we simply update the speed after every download. So it is not surprising that it looks like we have a very low speed for hours although it might have just been a brief dip. The only measure taken so far to improve this is excluding very small downloads.
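
To illustrate the effect, here is a minimal sketch (hypothetical Python, not the actual openQA code; timestamps and values are made up) of why a metric that is only written when a download finishes can look slow for hours on a panel that repeats the last value:

```python
from datetime import datetime

# Two cache-service downloads, hours apart (made-up numbers): the rate metric
# is only written when a download finishes.
samples = [
    (datetime(2023, 3, 25, 0, 5), 2.52),   # one slow download at 00:05 (MiB/s)
    (datetime(2023, 3, 25, 8, 0), 60.7),   # next finished download ~8 h later
]

def displayed_rate(t, samples):
    """Value a panel that repeats the previous sample would show at time t."""
    last = None
    for ts, value in samples:
        if ts <= t:
            last = value
    return last

# Every point between 00:05 and 08:00 shows 2.52 MiB/s, although nothing was
# actually slow in that window: no download was running at all.
print(displayed_rate(datetime(2023, 3, 25, 4, 0), samples))  # -> 2.52
```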

This means the alerts were most likely false alerts and we don't need to fix anything but the alerting (or the reporting of the metrics) itself. Since we have had the underlying problem for quite a while now, I suppose the false alerts aren't happening that often. So I'm lowering the priority of the ticket; we likely don't have to silence the alert.

Actions #5

Updated by mkittler about 1 year ago

The following improvements would be simple:

  1. Track the number of completed downloads within the cache service and report it as a metric like the download rate. This would be an ever-increasing counter, incremented whenever a download has been completed (see the sketch after this list).
  2. Compute the derivative of 1. when querying InfluxDB.
  3. Add the additional alert condition that 2. must be greater than 0, so we wouldn't fire any alerts if no downloads are ongoing.
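
A minimal sketch of item 1. (hypothetical Python with made-up names; the real cache service is Perl code, and openqa_download_count is an invented measurement while openqa_download_rate already exists):

```python
import threading

class CacheMetrics:
    """Hypothetical metrics holder for the cache service (names are made up)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.completed_downloads = 0  # ever-increasing counter (item 1.)
        self.download_rate = 0.0      # bytes/s of the most recent download

    def record_completed_download(self, size_bytes, duration_s):
        with self._lock:
            self.completed_downloads += 1
            self.download_rate = size_bytes / duration_s

    def influx_lines(self, host):
        # InfluxDB line-protocol points a collector could forward
        return [
            f"openqa_download_count,host={host} value={self.completed_downloads}i",
            f"openqa_download_rate,host={host} value={self.download_rate}",
        ]

metrics = CacheMetrics()
metrics.record_completed_download(size_bytes=1.5e9, duration_s=25)
print("\n".join(metrics.influx_lines("openqaworker14")))
```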

We could actually skip 1. and just compute the derivative of the download rate itself. Assuming we expect at least some change while downloads are ongoing, it would still make sense to suppress the alert if the derivative is 0.

Regardless of whether we do the simplification mentioned in the previous paragraph, there's one caveat to this approach: if downloads are ongoing but none of them finishes at all (because downloads are slow, which is exactly what we want to be alerted about), then we'd still suppress the alert because in that situation the derivative would also be 0.

Additionally, I'm not sure how expensive it is to compute the derivative.
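
Sketched as plain query strings plus the combined condition (hypothetical and not the provisioned Grafana alert; the openqa_download_count measurement, the $timeFilter placeholder and the threshold are assumptions):

```python
host = "openqaworker14"

# Existing input: mean download rate over the evaluation window.
rate_query = (
    'SELECT mean("value") FROM "openqa_download_rate" '
    f"WHERE \"host\" = '{host}' AND $timeFilter GROUP BY time(1m)"
)

# Item 2.: derivative of the ever-increasing completed-downloads counter.
count_derivative_query = (
    'SELECT non_negative_derivative(last("value"), 1m) '
    'FROM "openqa_download_count" '
    f"WHERE \"host\" = '{host}' AND $timeFilter GROUP BY time(1m)"
)

# Item 3.: only alert when the rate is low AND downloads actually finished.
def should_alert(mean_rate_bytes_per_s, completed_per_minute,
                 threshold_bytes_per_s=5e6):  # threshold is made up
    return (mean_rate_bytes_per_s < threshold_bytes_per_s
            and completed_per_minute > 0)

print(should_alert(2.6e6, 0))  # False: low rate value, but nothing finished
print(should_alert(2.6e6, 3))  # True: downloads finished and were slow
```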


The next, more involved solution would be tracking the number of ongoing downloads in the cache service. It would be a metric we increment and decrement by 1 each time we start and finish a download. We'd suppress any alerts while the number is 0, because downloads that aren't happening can't be slow. That's likely the best solution. The only caveat is that we must not forget to cover every place where a download finishes when implementing this, and it will likely be a bit more coding compared to using the derivative.
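
A minimal sketch of that idea (hypothetical Python; the actual cache service is Perl code, and the gauge name is invented). The important part is that every code path that finishes a download, including failures, decrements the gauge again:

```python
import threading
from contextlib import contextmanager

class OngoingDownloads:
    """Gauge of currently running downloads, e.g. exposed as
    openqa_downloads_in_progress (invented name)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    @contextmanager
    def track(self):
        with self._lock:
            self.count += 1
        try:
            yield
        finally:
            # decrement even when the download fails or is cancelled
            with self._lock:
                self.count -= 1

ongoing = OngoingDownloads()

def download_asset(url):
    with ongoing.track():
        ...  # perform the actual download here

# Alerting side: suppress the download-rate alert while ongoing.count == 0,
# because downloads that are not happening cannot be slow.
download_asset("http://example.com/asset.qcow2")
print(ongoing.count)  # -> 0 again after the download has finished
```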

Actions #6

Updated by mkittler about 1 year ago

Draft for implementing the idea from the last paragraph of my previous comment: https://github.com/os-autoinst/openQA/pull/5071

Actions #7

Updated by openqa_review about 1 year ago

  • Due date set to 2023-04-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by tinita about 1 year ago

  • Copied to action #127265: Add timestamp to influxdb openqa_download_rate size:M added
Actions #9

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Feedback
Actions #10

Updated by mkittler about 1 year ago

The PR has been merged but hasn't been deployed yet. I have created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/832 to adapt the alerting accordingly.

Actions #11

Updated by mkittler about 1 year ago

This is the first time I wanted to edit alerts since we migrated to unified alerting. It wasn't that easy, so I took some time to document my workflow, see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/833.

Actions #12

Updated by mkittler about 1 year ago

  • Status changed from Feedback to In Progress

All changes have been deployed and it now looks as expected in Grafana, e.g. https://stats.openqa-monitor.qa.suse.de/alerting/grafana/download_rate_alert_grenache-1/view and the query explorer accessible from there. So this problem shouldn't happen anymore.

However, we've got:

alertname: DatasourceError
grafana_folder: openQA
hostname: QA-Power8-4-kvm
rule_uid: download_rate_alert_QA-Power8-4-kvm
rulename: QA-Power8-4-kvm: Download rate alert
type: worker

So I'm looking into why it doesn't work for QA-Power8-4-kvm.

Actions #14

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Resolved