action #97136

closed

[alert] multiple unhandled alerts about "broken workers" size:M

Added by okurz about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2021-08-18
Due date:
% Done: 0%
Estimated time:

Description

Observation

There is at least one broken worker for more than 15 minutes. Have a look at
https://openqa.suse.de/admin/workers to find out which one it is (click the help
icon to view the concrete error message).

      Metric name                 Value
      Number of broken workers    15.000

View your Alert rule
http://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=96&orgId=1

Suggestions

  • Check the additional alert from the start of the week, when broken workers were reported after the weekly reboot
  • Understand https://github.com/os-autoinst/openQA/pull/4122 which is likely to be related
  • Look into worker logs
  • Prevent workers from being reported as broken too soon

Related issues 1 (0 open, 1 closed)

Copied to openQA Infrastructure - action #97139: [alert] multiple unhandled alerts about "malbec: Memory usage alert" size:M (Resolved, mkittler, 2021-08-18, 2021-09-09)

Actions #1

Updated by okurz about 3 years ago

  • Copied to action #97139: [alert] multiple unhandled alerts about "malbec: Memory usage alert" size:M added
Actions #2

Updated by okurz about 3 years ago

  • Subject changed from [alert] multiple unhandled alerts about "broken workers" to [alert] multiple unhandled alerts about "broken workers" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by dheidler about 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #4

Updated by dheidler about 3 years ago

  • Priority changed from Urgent to High

The broken workers alert was covering 5 minutes instead of the 15 minutes stated in its description, so let's fix that:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/564
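
To illustrate why the window length matters, here is a minimal sketch, assuming the alert should fire only when the condition held for the entire window. The `should_alert` function and the sample data are hypothetical and are not taken from the merge request or from the monitoring configuration.

```python
from datetime import datetime, timedelta


def should_alert(samples, now, window, threshold=0):
    """Return True only if every sample inside the window exceeds threshold.

    samples: list of (timestamp, broken_worker_count) tuples (hypothetical format).
    """
    start = now - window
    in_window = [count for ts, count in samples if start <= ts <= now]
    # No data inside the window: do not alert (an assumption of this sketch).
    if not in_window:
        return False
    return all(count > threshold for count in in_window)


now = datetime(2021, 8, 18, 12, 0)
# One sample per minute: all workers fine until 8 minutes ago, then one broke.
samples = [(now - timedelta(minutes=m), 1 if m <= 8 else 0) for m in range(20, -1, -1)]

print(should_alert(samples, now, timedelta(minutes=5)))   # True
print(should_alert(samples, now, timedelta(minutes=15)))  # False
```

With the shorter window, a worker that has been broken for only 8 minutes already triggers the alert. With the 15-minute window it does not.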

Actions #5

Updated by dheidler about 3 years ago

  • Status changed from In Progress to Feedback

Reverting my original PR, which marked workers as broken when download jobs pile up, as it seems to create more issues than it solves:
https://github.com/os-autoinst/openQA/pull/4144

Actions #6

Updated by dheidler about 3 years ago

Paused the broken workers alert at https://stats.openqa-monitor.qa.suse.de/alerting/list for now

Actions #7

Updated by dheidler about 3 years ago

MR to not count overloaded workers as broken:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/565
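
The idea of not counting overloaded workers as broken can be sketched as a filter applied before counting. This is only an illustration: the `Worker` record, its `error` field, and the `OVERLOAD_MARKER` string are hypothetical and do not reflect the actual query or error messages used in the merge request.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical substring that identifies an "overloaded" error message.
OVERLOAD_MARKER = "overloaded"


@dataclass
class Worker:
    name: str
    status: str
    error: Optional[str] = None


def count_broken(workers):
    """Count broken workers, skipping those whose error indicates overload."""
    return sum(
        1
        for w in workers
        if w.status == "broken"
        and OVERLOAD_MARKER not in (w.error or "").lower()
    )


workers = [
    Worker("worker1", "idle"),
    Worker("worker2", "broken", "Cache service overloaded"),
    Worker("worker3", "broken", "Unable to connect"),
]

print(count_broken(workers))  # 1
```

Only `worker3` is counted, so a backlog on an otherwise healthy worker would not raise the alert value.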

Actions #8

Updated by dheidler about 3 years ago

Re-enabled the alert.

Actions #9

Updated by dheidler about 3 years ago

  • Status changed from Feedback to Resolved

No new alerts so far.
