action #127055

closed

[alert] Download rate alert openQA (openqaworker{14,16,17,18})

Added by jbaier_cz about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-04-01
Due date: 2023-04-19
% Done: 0%
Estimated time:
Tags:

Description

openqaworker14: Download rate alert

Cache service download rate is lower than expected. See https://progress.opensuse.org/issues/106904 for details.

Value: B=2.637854e+06

alertname: openqaworker14: Download rate alert
grafana_folder: openQA
hostname: openqaworker14
rule_uid: download_rate_alert_openqaworker14
type: worker

Link: http://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker14?orgId=1&viewPanel=65109

Similar alerts also fired for the other workers mentioned in the subject.


Related issues (1 open, 0 closed)

Copied to openQA Project - action #127265: Add timestamp to influxdb openqa_download_rate size:M (Workable)

Actions #1

Updated by okurz about 1 year ago

  • Tags changed from alert to alert, infra
  • Priority changed from Normal to Urgent
Actions #2

Updated by mkittler about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #3

Updated by mkittler about 1 year ago

The mean download speed was only 2.52 MiB/s for almost 8 hours. That's not much, indeed. That the speed stayed at exactly 2.52 MiB/s for that long and then suddenly jumps up to 60.7 MiB/s is strange. It then stayed at 60.7 MiB/s even longer before a trend with a normal amount of variation starts. This lets me think that this graph can stay at certain speeds for quite long times. Maybe no downloads happened in that time and simply the last value is repeated. That is of course problematic if the last value is slow: it then looks like downloads are slow for a very long time although simply no downloads are happening, leading to a false alert.

To check whether the hypothesis is plausible I've checked the graph of openqaworker14 for the last hour. It shows as many distinct speeds as there are cache_asset jobs in the Minion dashboard for that time frame (excluding jobs which did not actually download anything). It could be a coincidence, but it is in line with my hypothesis.

I remember that @kraih has worked on improving this in the past, so I'll look at the code to see what has already happened and what is still missing.

Actions #4

Updated by mkittler about 1 year ago

  • Priority changed from Urgent to High

Yes, we simply update the speed after every download. So it is not surprising that it looks like we have a very low speed for hours although it might have just been a brief dip. The only measure taken so far to improve this is excluding very small downloads.
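
To illustrate the effect, here is a minimal sketch (hypothetical Python, not the actual openQA code; timestamps and values are made up) of why a metric that is only written when a download finishes can look slow for hours on a panel that repeats the last value:

```python
from datetime import datetime

# Two cache-service downloads, hours apart (made-up numbers): the rate metric
# is only written when a download finishes.
samples = [
    (datetime(2023, 3, 25, 0, 5), 2.52),   # one slow download at 00:05 (MiB/s)
    (datetime(2023, 3, 25, 8, 0), 60.7),   # next finished download ~8 h later
]

def displayed_rate(t, samples):
    """Value a panel that repeats the previous sample would show at time t."""
    last = None
    for ts, value in samples:
        if ts <= t:
            last = value
    return last

# Every point between 00:05 and 08:00 shows 2.52 MiB/s, although nothing was
# actually slow in that window: no download was running at all.
print(displayed_rate(datetime(2023, 3, 25, 4, 0), samples))  # -> 2.52
```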

This means the alerts were most likely false alerts and we don't need to fix anything but the alerting (or the reporting of the metrics) itself. Since we have had the underlying problem for quite a while now, I suppose the false alerts aren't happening that often. So I'm lowering the priority of the ticket; we likely don't have to silence the alert.

Actions #5

Updated by mkittler about 1 year ago

The following improvements would be simple:

  1. Track the number of completed downloads within the cache service and report it as a metric like the download rate. This would be an ever-increasing counter, incremented whenever a download has been completed (see the sketch after this list).
  2. Compute the derivative of 1. when querying InfluxDB.
  3. Add the additional alert condition that 2. must be greater than 0, so we wouldn't fire any alerts if no downloads are ongoing.
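
A minimal sketch of item 1. (hypothetical Python with made-up names; the real cache service is Perl code, and openqa_download_count is an invented measurement while openqa_download_rate already exists):

```python
import threading

class CacheMetrics:
    """Hypothetical metrics holder for the cache service (names are made up)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.completed_downloads = 0  # ever-increasing counter (item 1.)
        self.download_rate = 0.0      # bytes/s of the most recent download

    def record_completed_download(self, size_bytes, duration_s):
        with self._lock:
            self.completed_downloads += 1
            self.download_rate = size_bytes / duration_s

    def influx_lines(self, host):
        # InfluxDB line-protocol points a collector could forward
        return [
            f"openqa_download_count,host={host} value={self.completed_downloads}i",
            f"openqa_download_rate,host={host} value={self.download_rate}",
        ]

metrics = CacheMetrics()
metrics.record_completed_download(size_bytes=1.5e9, duration_s=25)
print("\n".join(metrics.influx_lines("openqaworker14")))
```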

We could actually skip 1. and just compute the derivative of the download rate itself. Assuming we expect at least some change while downloads are ongoing, it would still make sense to suppress the alert if the derivative is 0.

Regardless of whether we do the simplification mentioned in the previous paragraph, there's one caveat to this approach: if downloads are ongoing but none of them finishes at all (because downloads are slow, which is exactly what we want to be alerted about), then we'd still suppress the alert because in that situation the derivative would also be 0.

Additionally, I'm not sure how expensive it is to compute the derivative.
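
Sketched as plain query strings plus the combined condition (hypothetical and not the provisioned Grafana alert; the openqa_download_count measurement, the $timeFilter placeholder and the threshold are assumptions):

```python
host = "openqaworker14"

# Existing input: mean download rate over the evaluation window.
rate_query = (
    'SELECT mean("value") FROM "openqa_download_rate" '
    f"WHERE \"host\" = '{host}' AND $timeFilter GROUP BY time(1m)"
)

# Item 2.: derivative of the ever-increasing completed-downloads counter.
count_derivative_query = (
    'SELECT non_negative_derivative(last("value"), 1m) '
    'FROM "openqa_download_count" '
    f"WHERE \"host\" = '{host}' AND $timeFilter GROUP BY time(1m)"
)

# Item 3.: only alert when the rate is low AND downloads actually finished.
def should_alert(mean_rate_bytes_per_s, completed_per_minute,
                 threshold_bytes_per_s=5e6):  # threshold is made up
    return (mean_rate_bytes_per_s < threshold_bytes_per_s
            and completed_per_minute > 0)

print(should_alert(2.6e6, 0))  # False: low rate value, but nothing finished
print(should_alert(2.6e6, 3))  # True: downloads finished and were slow
```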


The next, more involved solution would be tracking the number of ongoing downloads in the cache service. It would be a metric we increment and decrement by 1 each time we start and finish a download. We'd suppress any alerts while the number is 0, because downloads that aren't happening can't be slow. That's likely the best solution. The only caveat is that we must not forget to cover every place where a download finishes when implementing this, and it will likely be a bit more coding compared to using the derivative.
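
A minimal sketch of that idea (hypothetical Python; the actual cache service is Perl code, and the gauge name is invented). The important part is that every code path that finishes a download, including failures, decrements the gauge again:

```python
import threading
from contextlib import contextmanager

class OngoingDownloads:
    """Gauge of currently running downloads, e.g. exposed as
    openqa_downloads_in_progress (invented name)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    @contextmanager
    def track(self):
        with self._lock:
            self.count += 1
        try:
            yield
        finally:
            # decrement even when the download fails or is cancelled
            with self._lock:
                self.count -= 1

ongoing = OngoingDownloads()

def download_asset(url):
    with ongoing.track():
        ...  # perform the actual download here

# Alerting side: suppress the download-rate alert while ongoing.count == 0,
# because downloads that are not happening cannot be slow.
download_asset("http://example.com/asset.qcow2")
print(ongoing.count)  # -> 0 again after the download has finished
```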

Actions #6

Updated by mkittler about 1 year ago

Draft for implementing the idea from the last paragraph of my previous comment: https://github.com/os-autoinst/openQA/pull/5071

Actions #7

Updated by openqa_review about 1 year ago

  • Due date set to 2023-04-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by tinita about 1 year ago

  • Copied to action #127265: Add timestamp to influxdb openqa_download_rate size:M added
Actions #9

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Feedback
Actions #10

Updated by mkittler about 1 year ago

The PR has been merged but hasn't been deployed yet. I have created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/832 to adapt the alerting accordingly.

Actions #11

Updated by mkittler about 1 year ago

This is the first time I wanted to edit alerts since we migrated to unified alerting. It wasn't that easy, so I took some time to document my workflow, see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/833.

Actions #12

Updated by mkittler about 1 year ago

  • Status changed from Feedback to In Progress

All changes have been deployed and it now looks as expected in Grafana, e.g. https://stats.openqa-monitor.qa.suse.de/alerting/grafana/download_rate_alert_grenache-1/view and the query explorer accessible from there. So this problem shouldn't happen anymore.

However, we've got:

alertname: DatasourceError
grafana_folder: openQA
hostname: QA-Power8-4-kvm
rule_uid: download_rate_alert_QA-Power8-4-kvm
rulename: QA-Power8-4-kvm: Download rate alert
type: worker

So I'm looking into why it doesn't work for QA-Power8-4-kvm.

Actions #14

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Resolved