action #174916

closed

[alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) size: S

Added by gpuliti 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2024-12-31
Due date: 2025-01-25
% Done: 100%
Estimated time:

Description

Problem started at 23:31:06 on 2024.12.30
Problem name: Load average is too high (per CPU load over 4 for 5m)
Host: ariel.dmz-prg2.suse.org
Severity: Average
Operational data: Load averages(1m 5m 15m): (42.467285 41.881836 38.085449), # of CPUs: 10
Original problem ID: 1097854176


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #174316: [o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S (Resolved, jbaier_cz, 2024-12-12)

Actions #1

Updated by okurz 5 months ago

  • Tags changed from infra, alert to infra, alert, o3, ariel, load, reactive work
  • Category set to Regressions/Crashes
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #2

Updated by gpuliti 5 months ago

  • Assignee set to gpuliti

Looking at the Zabbix events, it seems the same alert has been raised 10+ times since 2024-12-25 07:17:42 (UTC), but apparently never during normal working hours (UTC). Could it be a cleanup process that is triggered when ariel has little or no usage? https://zabbix.suse.de/zabbix.php?show=2&name=&inventory%5B0%5D%5Bfield%5D=type&inventory%5B0%5D%5Bvalue%5D=&evaltype=0&tags%5B0%5D%5Btag%5D=&tags%5B0%5D%5Boperator%5D=0&tags%5B0%5D%5Bvalue%5D=&show_tags=3&tag_name_format=0&tag_priority=&show_opdata=0&show_timeline=1&filter_name=&filter_show_counter=0&filter_custom_time=0&sort=clock&sortorder=DESC&age_state=0&show_suppressed=0&unacknowledged=0&compact_view=0&details=0&highlight_row=0&action=problem.view&triggerids%5B%5D=115351

top showed that most of the processes are openQA related, as expected, with a maximum usage of 50%+ CPU and an average of 10% per process.

https://openqa.opensuse.org/tests does not seem to have been affected; for example, https://openqa.opensuse.org/tests/4739746# falls within the time window of the problem but was not affected.

Actions #3

Updated by gpuliti 5 months ago

  • Priority changed from Urgent to High

@okurz (slack thread):

we can consider to relax the alert check

Lowering the priority to high.

Actions #4

Updated by gpuliti 5 months ago · Edited

  • Status changed from New to Blocked
Actions #5

Updated by tinita 5 months ago · Edited

While the alert is about a load of over 4, looking at the graphs
https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=341938 (1m avg 30 days)
https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=341940 (5m avg 30 days)
http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/load.html
it seems there are indeed some unusual peaks of over 40 in recent weeks.

edit: for checking a relation with the job queue, use https://metrics.opensuse.org/d/osrt_openqa/osrt-openqa?orgId=1&from=now-30d&to=now&viewPanel=panel-2&refresh=1m
but it looks unrelated

Actions #6

Updated by livdywan 5 months ago

Let's block on #174316 before trying to adjust the numbers

Actions #7

Updated by jbaier_cz 5 months ago

  • Status changed from Blocked to Workable

Blocker solved

Actions #8

Updated by jbaier_cz 5 months ago

  • Related to action #174316: [o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S added
Actions #9

Updated by jbaier_cz 5 months ago · Edited

tinita wrote in #note-5:

While the alert is about a load of over 4, looking at the graphs

The alert is about per-CPU load over 4. ariel currently has 10 CPU cores, i.e. the alert is about a load (1m avg) over 40 for at least 5 minutes.
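
The arithmetic can be sketched in Python. This is only an illustration of the reasoning, not the actual Zabbix trigger expression: the function names and the sample windows in the usage note are hypothetical, and "over the limit for 5 minutes" is modelled as "the lowest 1m sample in the window is above the limit".

```python
def total_load_threshold(per_cpu_threshold, num_cpus):
    """Total load average at which the per-CPU threshold is crossed."""
    return per_cpu_threshold * num_cpus


def alert_fires(load_1m_samples, per_cpu_threshold, num_cpus):
    """Return True if every 1m load sample in the window exceeds the limit.

    Assumption: the window holds the 1m load averages collected over the
    last 5 minutes; an empty window never fires.
    """
    if not load_1m_samples:
        return False
    limit = total_load_threshold(per_cpu_threshold, num_cpus)
    return min(load_1m_samples) > limit


# Per-CPU threshold of 4 on a host with 10 cores.
print(total_load_threshold(4, 10))
```

With these assumptions, a window such as `[42.0, 41.5, 43.0]` would fire on a 10-core host, while `[42.0, 39.0]` would not, because one sample is below 40.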

Actions #10

Updated by gpuliti 5 months ago

  • Status changed from Workable to Blocked
Actions #11

Updated by gpuliti 5 months ago

  • Status changed from Blocked to In Progress
Actions #12

Updated by openqa_review 5 months ago

  • Due date set to 2025-01-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by gpuliti 5 months ago

  • Status changed from In Progress to Feedback
  • % Done changed from 0 to 90

I've changed the ariel macro {$LOAD_AVG_PER_CPU.MAX.WARN} to 6. Since ariel has 10 CPU cores, this raises the alert when the total load average is over 60.

I'm going to monitor for the next few days, but since the alert was last raised on 03/01/25, I think the chance of it happening again is small considering the recent history.
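
As a rough sanity check, the load averages from the "Operational data" line in the description can be compared against the old limit (4) and the new one (6). This is a minimal sketch: the variable and function names are illustrative, and it only looks at a single reading rather than the 5-minute window the trigger evaluates.

```python
NUM_CPUS = 10

# Load averages (1m, 5m, 15m) copied from the alert's "Operational data".
LOADS = {"1m": 42.467285, "5m": 41.881836, "15m": 38.085449}


def per_cpu(load, num_cpus=NUM_CPUS):
    """Load average divided by the number of CPU cores."""
    return load / num_cpus


for limit in (4, 6):
    value = per_cpu(LOADS["1m"])
    exceeded = value > limit
    print(f"limit {limit}: 1m per-CPU load {value:.2f}, exceeded: {exceeded}")
```

The reading that raised the original alert is about 4.25 per CPU, so it is above the old limit of 4 but well below the new limit of 6.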

Actions #14

Updated by robert.richardson 5 months ago

  • Subject changed from [alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) to [alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) size: S
Actions #15

Updated by gpuliti 5 months ago

  • Status changed from Feedback to Resolved
  • % Done changed from 90 to 100

No further problems have shown up, so I'm setting this as resolved.
