action #127055
closed: [alert] Download rate alert openQA (openqaworker{14,16,17,18})
Description
openqaworker14: Download rate alert
Cache service download rate is lower than expected. See https://progress.opensuse.org/issues/106904 for details.
Value: B=2.637854e+06
alertname: openqaworker14: Download rate alert
grafana_folder: openQA
hostname: openqaworker14
rule_uid: download_rate_alert_openqaworker14
type: worker
Link: http://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker14?orgId=1&viewPanel=65109
Similar alerts also fired for the other workers mentioned in the subject.
Updated by okurz over 1 year ago
- Tags changed from alert to alert, infra
- Priority changed from Normal to Urgent
Updated by mkittler over 1 year ago
- Status changed from New to In Progress
- Assignee set to mkittler
Updated by mkittler over 1 year ago
The mean download speed was only 2.52 MiB/s for almost 8 hours. That's not much, indeed. That the speed stayed at exactly 2.52 MiB/s for that long and then suddenly jumped up to 60.7 MiB/s is strange. It then stayed at 60.7 MiB/s even longer before a trend with a normal amount of variation started. This leads me to think that this graph can stay at certain speeds for quite a long time. Maybe no downloads happened in that time and the last value is simply repeated. That is of course problematic if the last value is slow: then it looks like downloads are slow for a very long time although no downloads are happening at all, which leads to a false alert.
To check whether the hypothesis is plausible, I looked at the graph of openqaworker14 for the last hour. It shows as many different speeds as there are cache_asset jobs in the Minion dashboard for that time frame (excluding jobs which did not actually download anything). It could be a coincidence, but it is in line with my hypothesis.
I remember that @kraih has worked on improving this in the past, so I'll look at the code to see what has already been done and what is still missing.
Updated by mkittler over 1 year ago
- Priority changed from Urgent to High
Yes, we simply update the speed after every download. So it is not surprising that it looks like we have a very low speed for hours although it might have been just a brief dip. The only measure taken so far to improve this is excluding very small downloads.
This means the alerts were most likely false alerts and we don't need to fix anything but the alerting (or the reporting of the metrics) itself. Since the underlying problem has existed for quite a while now, I suppose false alerts aren't happening that often. So I'm lowering the priority of the ticket, and we likely don't have to silence the alert.
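A minimal sketch of this stale-metric behavior, in plain Python with hypothetical names (the actual cache service is written in Perl): the rate is only recalculated when a download finishes, so anything sampling the metric in between keeps seeing the last reported value.

```python
class DownloadRateMetric:
    def __init__(self):
        self.rate_bytes_per_s = None  # last reported rate, kept until the next download

    def report_download(self, size_bytes, duration_s):
        # Only called when a download completes; excluding very small downloads
        # (the only mitigation so far) would happen here.
        self.rate_bytes_per_s = size_bytes / duration_s

    def current_value(self):
        # A dashboard polling this periodically just repeats the stale value
        # while no downloads are happening.
        return self.rate_bytes_per_s


metric = DownloadRateMetric()
metric.report_download(size_bytes=10 * 1024 ** 2, duration_s=4)  # one slow-ish download
# Hours may pass without any further download ...
print(metric.current_value())  # ... and this still returns the old, low rate
```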
Updated by mkittler over 1 year ago
The following improvements would be simple:
1. Track the number of completed downloads within the cache service. We'd report it as a metric just like the download rate; it would simply be an ever-increasing counter, incremented whenever a download completes.
2. Compute the derivative of 1. when querying InfluxDB.
3. Add as an additional alert condition that 2. must be greater than 0, so we wouldn't fire any alerts if no downloads are ongoing.
We could actually skip 1. and just compute the derivative of the download rate itself. Assuming we expect at least some change while downloads are ongoing, it would still make sense to suppress the alert if the derivative is 0.
Regardless of whether we do the simplification mentioned in the previous paragraph, there's one caveat to this approach: if downloads are ongoing but none of them finishes at all (because downloads are slow, which is exactly what we want to be alerted about), we'd still suppress the alert because in that situation the derivative would also be 0.
Additionally, I'm not sure how expensive it is to compute the derivative.
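To illustrate the combined condition, here is a minimal sketch in plain Python with made-up sample data and names; in practice the counter would live in InfluxDB and the derivative would be computed by the alert query rather than in application code.

```python
def derivative(samples):
    """Differences between consecutive samples of an ever-increasing counter."""
    return [b - a for a, b in zip(samples, samples[1:])]


def should_alert(download_rates, completed_downloads, threshold_bytes_per_s):
    recent_rate = download_rates[-1]
    downloads_finished_recently = derivative(completed_downloads)[-1] > 0
    # Additional condition from point 3: only alert if something actually
    # completed in the evaluated interval, otherwise the rate is just stale.
    return recent_rate < threshold_bytes_per_s and downloads_finished_recently


# No download finished for hours, the rate is a stale 2.52 MiB/s -> no alert:
print(should_alert([2.52e6] * 5, [100, 100, 100, 100, 100], 5e6))  # False
# Downloads did finish and were genuinely slow -> alert:
print(should_alert([2.52e6] * 5, [100, 100, 101, 102, 103], 5e6))  # True
```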
The next, slightly more involved solution would be tracking the number of ongoing downloads in the cache service. This would be a metric we'd increment when a download starts and decrement when it finishes. We'd suppress any alerts while the number is 0 because if no downloads are running they also can't be slow. That is likely the best solution. The only caveat is that we must not forget to cover all the places where downloads finish when implementing this, and it'll likely be a bit more coding compared to using the derivative.
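A minimal sketch of the gauge idea, again in plain Python with hypothetical names (the real counter would be maintained by the cache service and reported to InfluxDB alongside the download rate):

```python
class DownloadTracker:
    def __init__(self):
        self.ongoing_downloads = 0  # gauge reported alongside the download rate

    def download(self, url):
        self.ongoing_downloads += 1
        try:
            return self._fetch(url)  # placeholder for the actual download
        finally:
            # The finally block is the "must not forget any place where downloads
            # finish" part: it also covers failed and aborted downloads.
            self.ongoing_downloads -= 1

    def _fetch(self, url):
        return b""  # stub


tracker = DownloadTracker()
tracker.download("http://openqa.example/some-asset.qcow2")
# Alert condition sketch: suppress the download-rate alert whenever
# tracker.ongoing_downloads is 0, because if nothing is being downloaded,
# nothing can be slow.
print(tracker.ongoing_downloads)  # 0
```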
Updated by mkittler over 1 year ago
Draft implementing the idea from the last paragraph of my previous comment: https://github.com/os-autoinst/openQA/pull/5071
Updated by openqa_review over 1 year ago
- Due date set to 2023-04-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita over 1 year ago
- Copied to action #127265: Add timestamp to influxdb openqa_download_rate size:M added
Updated by mkittler over 1 year ago
The PR has been merged but hasn't been deployed yet. I have created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/832 to adapt the alerting accordingly.
Updated by mkittler over 1 year ago
This is the first time I've wanted to edit alerts since we migrated to unified alerting. It wasn't that easy, so I took some time to document my workflow; see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/833.
Updated by mkittler over 1 year ago
- Status changed from Feedback to In Progress
All changes have been deployed and it now looks as expected in Grafana, e.g. https://stats.openqa-monitor.qa.suse.de/alerting/grafana/download_rate_alert_grenache-1/view and the query explorer accessible from there. So this problem shouldn't happen anymore.
However, we've got:
alertname: DatasourceError
grafana_folder: openQA
hostname: QA-Power8-4-kvm
rule_uid: download_rate_alert_QA-Power8-4-kvm
rulename: QA-Power8-4-kvm: Download rate alert
type: worker
So I'm looking into why it doesn't work for QA-Power8-4-kvm.
Updated by mkittler over 1 year ago
Should be fixed by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/835
Updated by mkittler over 1 year ago
- Status changed from In Progress to Resolved
It now looks good after https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/835 has been merged.