action #152503
closed
[FIRING:1] worker38 (worker38: partitions usage (%) alert openQA partitions_usage_alert_worker38 worker)
Added by dheidler about 1 year ago.
Updated almost 1 year ago.
- Tags set to reactive work
- Target version set to Ready
- Description updated (diff)
- Tags changed from reactive work to reactive work, infra
- Priority changed from Normal to High
- Status changed from New to In Progress
- Assignee set to mkittler
- Status changed from In Progress to Feedback
Here are zoomed-out versions of the relevant panels:
The cache service logs don't go back far enough.
According to the alert's state history it was also pending on the 14th. Looking at the graph from the last 7 days it becomes obvious that spikes (going up to or over 80 %) are actually nothing special: https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?viewPanel=65090&orgId=1&from=1702409800646&to=1703006173106
Judging from the more recent cache service history I'd say this is not a problem of the cache service. Additionally, these spikes usually go away quickly again so the cleanup must be effective.
It looks like other workers also had spikes in the time frame (and before) but not as many and not as high (e.g. only up to 65 % on worker39). Maybe worker38 is special because it has been the only worker with the tap worker class since last week? The spikes in disk usage only started to grow over 80 % on worker38 as of 2023-12-11, which is also the date of b4726bc8504e1f1db69e92384f201a631a735a81/4be80b2c720f6023b20355c9f4ac71096dc0aee4 to only use worker38 for MM tests. So that would make sense.
I suppose we can resolve the issue: the extreme case where the alert condition lasted for too long didn't happen again, and nothing seems generally broken.
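To illustrate why short spikes over 80 % only leave the alert pending while a long-lasting one makes it fire, here is a minimal sketch in Python. It is not the actual alert rule: the function name `alert_states` and the 5-minute hold duration are assumptions for illustration, and only the 80 % threshold is taken from the discussion above.

```python
from datetime import datetime, timedelta


def alert_states(samples, threshold=80.0, hold=timedelta(minutes=5)):
    """Classify each (timestamp, usage_percent) sample as ok, pending or firing.

    `hold` is a hypothetical value; the real alert's duration is not stated
    in this ticket.
    """
    states = []
    breach_start = None  # time of the first sample of the current breach
    for ts, usage in samples:
        if usage <= threshold:
            # Usage dropped back (e.g. after cleanup): reset the breach.
            breach_start = None
            states.append("ok")
            continue
        if breach_start is None:
            breach_start = ts
        # The alert only fires once the breach has lasted at least `hold`.
        states.append("firing" if ts - breach_start >= hold else "pending")
    return states


t0 = datetime(2023, 12, 14, 12, 0)
# A short spike that goes away quickly never gets past "pending".
short_spike = [(t0 + timedelta(minutes=i), u) for i, u in enumerate([60, 85, 82, 61])]
# A sustained breach eventually fires.
long_breach = [(t0 + timedelta(minutes=i), u) for i, u in enumerate([60] + [85] * 7)]

print(alert_states(short_spike))
print(alert_states(long_breach))
```

Under these assumptions the short spike goes ok, pending, pending, ok, which matches the observation that most spikes vanish before the alert fires.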
- Status changed from Feedback to Resolved
Agreed, thanks for the investigation and thorough writeup