openQA Infrastructure (public)
action #152503

closed

[FIRING:1] worker38 (worker38: partitions usage (%) alert openQA partitions_usage_alert_worker38 worker)

Added by dheidler about 1 year ago. Updated almost 1 year ago.

Status:

Resolved

Priority:

High

Assignee:

mkittler

Category:

Target version:

openQA Project (public) - Ready

Start date:

2023-12-12

Due date:

% Done:

Estimated time:

Tags:

infra, reactive work

Description

The alert fired on 2023-12-12 at 11:15 CET.

http://stats.openqa-monitor.qa.suse.de/alerting/grafana/partitions_usage_alert_worker38/view?orgId=1

Maybe an issue with a partly crashed worker cache service or a job with too large a disk size.

Also interesting:
br1 out peaked just short of 6 GB/s around that time:
https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?orgId=1&from=1702374388707&to=1702374657861&viewPanel=42026
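The `from` and `to` parameters in Grafana dashboard links like the one above are Unix timestamps in milliseconds. A minimal sketch for turning them into readable UTC times, so the panel's window can be compared with the alert time (the helper name `grafana_time_range` is hypothetical, not part of any Grafana tooling):

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def grafana_time_range(url):
    """Return (start, end) as UTC datetimes from a link's from/to parameters.

    Assumes both parameters are absolute epoch milliseconds; relative
    values such as "now-6h" are not handled in this sketch.
    """
    query = parse_qs(urlparse(url).query)
    start_ms = int(query["from"][0])
    end_ms = int(query["to"][0])
    # Integer milliseconds plus timedelta avoids float rounding issues.
    return (
        EPOCH + timedelta(milliseconds=start_ms),
        EPOCH + timedelta(milliseconds=end_ms),
    )


url = (
    "https://stats.openqa-monitor.qa.suse.de/d/WDworker38/"
    "worker-dashboard-worker38?orgId=1&from=1702374388707"
    "&to=1702374657861&viewPanel=42026"
)
start, end = grafana_time_range(url)
print(start.isoformat(), end.isoformat())
```

The printed times are in UTC; CET is one hour ahead.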

History

Updated by okurz about 1 year ago

Tags set to reactive work
Target version set to Ready

Updated by dheidler about 1 year ago

Description updated (diff)

Updated by okurz about 1 year ago

Tags changed from reactive work to reactive work, infra
Priority changed from Normal to High

Updated by mkittler almost 1 year ago

Status changed from New to In Progress
Assignee set to mkittler

Updated by mkittler almost 1 year ago

Status changed from In Progress to Feedback

Here are zoomed-out versions of the relevant panels:

https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?viewPanel=65090&orgId=1&from=1702370216820&to=1702378829748
https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?orgId=1&from=1702370216820&to=1702378829748&viewPanel=42026 (might be relevant as mentioned in the ticket description as "also interesting")

The cache service logs don't go back far enough.

According to the alert's state history it was also pending on the 14th. Looking at the graph from the last 7 days it becomes obvious that spikes (going up to and over 80 %) are actually nothing special: https://stats.openqa-monitor.qa.suse.de/d/WDworker38/worker-dashboard-worker38?viewPanel=65090&orgId=1&from=1702409800646&to=1703006173106

Judging from the more recent cache service history I'd say this is not a problem of the cache service. Additionally, these spikes usually go away quickly again so the cleanup must be effective.

It looks like other workers also had spikes in the time frame (and before) but not as many and not as high (e.g. only up to 65 % on worker39). Maybe worker38 is special because it has been the only worker with the tap worker class since last week? The spikes in disk usage only started to exceed 80 % on worker38 on 2023-12-11, which is also the date of b4726bc8504e1f1db69e92384f201a631a735a81/4be80b2c720f6023b20355c9f4ac71096dc0aee4 to only use worker38 for MM tests. So that would make sense.

I suppose we can resolve the issue, as the extreme case where the alert condition lasted for too long didn't happen again and nothing seems generally broken.
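To illustrate the difference between a short spike (alert only pending) and a condition that lasts too long (alert firing), here is a rough sketch. The 80 % threshold, the 5-minute pending period, and all sample values are assumptions made up for illustration; the actual alert rule settings are not quoted in this ticket.

```python
from datetime import datetime, timedelta


def longest_breach(samples, threshold):
    """Longest continuous time span during which usage stayed above threshold.

    samples: list of (timestamp, usage_percent), sorted by timestamp.
    """
    longest = timedelta(0)
    breach_start = None
    for timestamp, usage in samples:
        if usage > threshold:
            if breach_start is None:
                breach_start = timestamp
            longest = max(longest, timestamp - breach_start)
        else:
            breach_start = None
    return longest


def would_fire(samples, threshold=80.0, pending=timedelta(minutes=5)):
    """True if usage stayed above threshold for at least the pending period."""
    return longest_breach(samples, threshold) >= pending


# Hypothetical one-minute samples: a short spike, then a sustained breach.
base = datetime(2023, 12, 12, 10, 0)
usage_values = [60, 85, 83, 62, 81, 82, 84, 86, 88, 87, 70]
samples = [(base + timedelta(minutes=i), u) for i, u in enumerate(usage_values)]

print(longest_breach(samples, 80.0))  # 0:05:00
print(would_fire(samples))            # True
```

With these made-up numbers the first spike alone would never fire the alert; only the second, sustained breach reaches the assumed pending period.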

Updated by okurz almost 1 year ago

Status changed from Feedback to Resolved

Agreed, thanks for the investigation and thorough writeup.
