action #152503

closed

[FIRING:1] worker38 (worker38: partitions usage (%) alert openQA partitions_usage_alert_worker38 worker)

Added by dheidler 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-12-12
Due date:
% Done: 0%
Estimated time:

Description

On 12.12.2023 11:15 CET.

http://stats.openqa-monitor.qa.suse.de/alerting/grafana/partitions_usage_alert_worker38/view?orgId=1

Maybe an issue with a partly crashed worker cache service or a job with too large a disk size.

Also interesting:
br1 out peaked just short of 6 GB/s around that time:
https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?orgId=1&from=1702374388707&to=1702374657861&viewPanel=42026
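For a quick local check of both hypotheses, something like the sketch below could be run on worker38. This is a minimal sketch only, not part of our tooling: it assumes the monitored partition holds /var/lib/openqa (the default openQA cache lives under /var/lib/openqa/cache) and that the usual cache service unit names are in use; adjust paths and unit names if the worker is configured differently.

```python
#!/usr/bin/env python3
"""Quick check of partition usage and cache service health on a worker.

Minimal sketch; assumes the default openQA cache location under
/var/lib/openqa and the usual systemd unit names for the cache service.
"""
import shutil
import subprocess

ALERT_THRESHOLD = 80  # percent, assumed to roughly match the Grafana alert


def partition_usage(path="/var/lib/openqa"):
    """Return used space of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def service_active(unit):
    """Return True if the given systemd unit reports 'active'."""
    result = subprocess.run(["systemctl", "is-active", unit],
                            capture_output=True, text=True)
    return result.stdout.strip() == "active"


if __name__ == "__main__":
    used = partition_usage()
    note = " (above alert threshold)" if used > ALERT_THRESHOLD else ""
    print(f"/var/lib/openqa usage: {used:.1f} %{note}")
    for unit in ("openqa-worker-cacheservice", "openqa-worker-cacheservice-minion"):
        print(f"{unit}: {'active' if service_active(unit) else 'NOT active'}")
```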

Actions #1

Updated by okurz 5 months ago

  • Tags set to reactive work
  • Target version set to Ready
Actions #2

Updated by dheidler 5 months ago

  • Description updated (diff)
Actions #3

Updated by okurz 5 months ago

  • Tags changed from reactive work to reactive work, infra
  • Priority changed from Normal to High
Actions #4

Updated by mkittler 4 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #5

Updated by mkittler 4 months ago

  • Status changed from In Progress to Feedback

Here are zoomed-out versions of the relevant panels:

The cache service logs don't go back far enough.
According to the alert's state history it was also pending on the 14th. Looking at the graph of the last 7 days it becomes obvious that spikes (going up to 80 %) are actually nothing special: https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?viewPanel=65090&orgId=1&from=1702409800646&to=1703006173106

Judging from the more recent cache service history I'd say this is not a problem with the cache service. Additionally, these spikes usually go away again quickly, so the cleanup must be effective.


It looks like other workers also had spikes in that time frame (and before), but not as many and not as high (e.g. only up to 65 % on worker39). Maybe worker38 is special because it has been the only worker with the tap worker class since last week? The spikes in disk usage only started to exceed 80 % on worker38 as of 2023-12-11, which is also the date of b4726bc8504e1f1db69e92384f201a631a735a81/4be80b2c720f6023b20355c9f4ac71096dc0aee4 to only use worker38 for MM tests. So that would make sense.


I suppose we can resolve the issue, as the extreme case where the alert condition lasted too long didn't happen again and nothing seems generally broken.
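To illustrate why brief spikes only ever show up as "pending" in the state history, here is a toy sketch of a "for"-style alert evaluation. It is not the actual Grafana evaluation logic; the threshold and pending window length are assumptions chosen for illustration.

```python
"""Toy model of a 'for'-style alert: it only fires when the condition
holds for a whole pending window, so short, cleanup-resolved spikes
never get past 'pending'. Threshold and window length are assumptions."""
from collections import deque

THRESHOLD = 80        # percent, assumed alert threshold
PENDING_SAMPLES = 5   # samples the condition must hold before firing


def alert_states(samples):
    """Yield 'ok', 'pending' or 'firing' for each usage sample."""
    window = deque(maxlen=PENDING_SAMPLES)
    for value in samples:
        window.append(value > THRESHOLD)
        if not window[-1]:
            yield "ok"
        elif len(window) == PENDING_SAMPLES and all(window):
            yield "firing"
        else:
            yield "pending"


# A short spike (two samples over 80 %) never reaches 'firing', while a
# sustained run of five samples over the threshold does.
print(list(alert_states([70, 85, 88, 75, 70, 82, 84, 86, 88, 90])))
```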

Actions #6

Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved

Agreed, thanks for the investigation and thorough writeup
