action #152503 (closed)
[FIRING:1] worker38 (worker38: partitions usage (%) alert openQA partitions_usage_alert_worker38 worker)
Description
On 12.12.2023 11:15 CET.
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/partitions_usage_alert_worker38/view?orgId=1
Maybe an issue with a partially crashed worker cache service or a job with a too large disk size (see the sketch below).
Also interesting:
br1 outgoing traffic peaked at just under 6 GB/s around that time:
https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?orgId=1&from=1702374388707&to=1702374657861&viewPanel=42026
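For illustration, here is a minimal sketch of the kind of check behind such a partitions-usage alert. The real alert is evaluated by Grafana from the worker's telegraf disk metrics, so the paths and the 90 % threshold below are only assumptions:

```python
#!/usr/bin/env python3
# Illustrative only: the actual alert is evaluated by Grafana from telegraf
# disk metrics. The partition paths and the threshold are assumptions.
import shutil

PARTITIONS = ["/", "/var/lib/openqa"]  # assumed partitions of interest on a worker
THRESHOLD = 90.0                       # percent, assumed alert threshold

for path in PARTITIONS:
    usage = shutil.disk_usage(path)
    percent = usage.used / usage.total * 100
    state = "ALERT" if percent >= THRESHOLD else "ok"
    print(f"{path}: {percent:.1f} % used ({state})")
```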
Updated by mkittler 11 months ago
- Status changed from In Progress to Feedback
Here are zoomed-out versions of the relevant panels:
- https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?viewPanel=65090&orgId=1&from=1702370216820&to=1702378829748
- https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?orgId=1&from=1702370216820&to=1702378829748&viewPanel=42026 (might be relevant; mentioned in the ticket description as "also interesting")
The cache service logs don't go back far enough.
According to the alert's state history it was also pending on the 14th. Looking at the graph of the last 7 days, it becomes obvious that spikes (going up to 80 %) are actually nothing special: https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?viewPanel=65090&orgId=1&from=1702409800646&to=1703006173106
Judging from the more recent cache service history, I'd say this is not a problem with the cache service. Additionally, these spikes usually go away again quickly, so the cleanup must be effective.
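One way to back up that hunch would be to sample the partition over time and confirm that excursions above the threshold clear again within minutes; a rough sketch, with the path, interval and threshold all being assumptions rather than values from the actual setup:

```python
#!/usr/bin/env python3
# Rough sketch to verify that usage spikes clear quickly again
# (i.e. that the cache/pool cleanup is effective). Path, interval
# and threshold are assumptions.
import shutil
import time
from datetime import datetime

PATH = "/var/lib/openqa"  # assumed partition holding cache and pool
THRESHOLD = 80.0          # percent
INTERVAL = 60             # seconds between samples

above_since = None
while True:
    usage = shutil.disk_usage(PATH)
    percent = usage.used / usage.total * 100
    now = datetime.now()
    if percent >= THRESHOLD and above_since is None:
        above_since = now
        print(f"{now:%H:%M:%S} spike started at {percent:.1f} %")
    elif percent < THRESHOLD and above_since is not None:
        duration = (now - above_since).total_seconds()
        print(f"{now:%H:%M:%S} back under threshold after {duration:.0f} s")
        above_since = None
    time.sleep(INTERVAL)
```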
It looks like other workers also had spikes in that time frame (and before), but not as many and not as high (e.g. only up to 65 % on worker39). Maybe worker38 is special because it has been the only worker with the tap worker class since last week? The spikes in disk usage only started to exceed 80 % on worker38 as of 2023-12-11, which is also the date of b4726bc8504e1f1db69e92384f201a631a735a81/4be80b2c720f6023b20355c9f4ac71096dc0aee4 to only use worker38 for MM tests. So that would make sense.
I suppose we can resolve the issue, as the extreme case where the alert condition lasted too long didn't happen again and nothing seems generally broken.