action #152503

closed

[FIRING:1] worker38 (worker38: partitions usage (%) alert openQA partitions_usage_alert_worker38 worker)

Added by dheidler 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-12-12
Due date:
% Done: 0%
Estimated time:

Description

On 12.12.2023 11:15 CET.

http://stats.openqa-monitor.qa.suse.de/alerting/grafana/partitions_usage_alert_worker38/view?orgId=1

Maybe an issue with a partly crashed worker cache service or a job with too large a disk size.

Also interesting:
br1 out peaked just short of 6 GB/s around that time:
https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?orgId=1&from=1702374388707&to=1702374657861&viewPanel=42026
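For a quick local check of both hypotheses, something like the sketch below could be run on worker38. This is a minimal sketch only, not part of our tooling: it assumes the monitored partition holds /var/lib/openqa (the default openQA cache lives under /var/lib/openqa/cache) and that the usual cache service unit names are in use; adjust paths and unit names if the worker is configured differently.

```python
#!/usr/bin/env python3
"""Quick check of partition usage and cache service health on a worker.

Minimal sketch; assumes the default openQA cache location under
/var/lib/openqa and the usual systemd unit names for the cache service.
"""
import shutil
import subprocess

ALERT_THRESHOLD = 80  # percent, assumed to roughly match the Grafana alert


def partition_usage(path="/var/lib/openqa"):
    """Return used space of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def service_active(unit):
    """Return True if the given systemd unit reports 'active'."""
    result = subprocess.run(["systemctl", "is-active", unit],
                            capture_output=True, text=True)
    return result.stdout.strip() == "active"


if __name__ == "__main__":
    used = partition_usage()
    note = " (above alert threshold)" if used > ALERT_THRESHOLD else ""
    print(f"/var/lib/openqa usage: {used:.1f} %{note}")
    for unit in ("openqa-worker-cacheservice", "openqa-worker-cacheservice-minion"):
        print(f"{unit}: {'active' if service_active(unit) else 'NOT active'}")
```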

Actions #1

Updated by okurz 5 months ago

  • Tags set to reactive work
  • Target version set to Ready
Actions #2

Updated by dheidler 5 months ago

  • Description updated (diff)
Actions #3

Updated by okurz 5 months ago

  • Tags changed from reactive work to reactive work, infra
  • Priority changed from Normal to High
Actions #4

Updated by mkittler 4 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #5

Updated by mkittler 4 months ago

  • Status changed from In Progress to Feedback

Here are zoomed-out versions of the relevant panels:

The cache service logs don't go back far enough.
According to the alert's state history it was also pending on the 14th. Looking at the graph of the last 7 days it becomes obvious that spikes (going up to 80 %) are actually nothing special: https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?viewPanel=65090&orgId=1&from=1702409800646&to=1703006173106

Judging from the more recent cache service history I'd say this is not a problem with the cache service. Additionally, these spikes usually go away again quickly, so the cleanup must be effective.


It looks like other workers also had spikes in that time frame (and before), but not as many and not as high (e.g. only up to 65 % on worker39). Maybe worker38 is special because it has been the only worker with the tap worker class since last week? The spikes in disk usage only started to exceed 80 % on worker38 as of 2023-12-11, which is also the date of b4726bc8504e1f1db69e92384f201a631a735a81/4be80b2c720f6023b20355c9f4ac71096dc0aee4 to only use worker38 for MM tests. So that would make sense.


I suppose we can resolve the issue, as the extreme case where the alert condition lasted too long didn't happen again and nothing seems generally broken.
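To illustrate why brief spikes only ever show up as "pending" in the state history, here is a toy sketch of a "for"-style alert evaluation. It is not the actual Grafana evaluation logic; the threshold and pending window length are assumptions chosen for illustration.

```python
"""Toy model of a 'for'-style alert: it only fires when the condition
holds for a whole pending window, so short, cleanup-resolved spikes
never get past 'pending'. Threshold and window length are assumptions."""
from collections import deque

THRESHOLD = 80        # percent, assumed alert threshold
PENDING_SAMPLES = 5   # samples the condition must hold before firing


def alert_states(samples):
    """Yield 'ok', 'pending' or 'firing' for each usage sample."""
    window = deque(maxlen=PENDING_SAMPLES)
    for value in samples:
        window.append(value > THRESHOLD)
        if not window[-1]:
            yield "ok"
        elif len(window) == PENDING_SAMPLES and all(window):
            yield "firing"
        else:
            yield "pending"


# A short spike (two samples over 80 %) never reaches 'firing', while a
# sustained run of five samples over the threshold does.
print(list(alert_states([70, 85, 88, 75, 70, 82, 84, 86, 88, 90])))
```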

Actions #6

Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved

Agreed, thanks for the investigation and thorough writeup
