action #174916
closed[alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) size: S
100%
Description
Problem started at 23:31:06 on 2024.12.30
Problem name: Load average is too high (per CPU load over 4 for 5m)
Host: ariel.dmz-prg2.suse.org
Severity: Average
Operational data: Load averages(1m 5m 15m): (42.467285 41.881836 38.085449), # of CPUs: 10
Original problem ID: 1097854176
Updated by gpuliti 5 months ago
- Assignee set to gpuliti
Looking at the Zabbix events, it seems that the same alert has been raised 10+ times since 2024-12-25 07:17:42 (UTC), but apparently not during normal working hours (UTC). Could it be a cleanup process that is triggered when ariel has little or no usage? https://zabbix.suse.de/zabbix.php?show=2&name=&inventory%5B0%5D%5Bfield%5D=type&inventory%5B0%5D%5Bvalue%5D=&evaltype=0&tags%5B0%5D%5Btag%5D=&tags%5B0%5D%5Boperator%5D=0&tags%5B0%5D%5Bvalue%5D=&show_tags=3&tag_name_format=0&tag_priority=&show_opdata=0&show_timeline=1&filter_name=&filter_show_counter=0&filter_custom_time=0&sort=clock&sortorder=DESC&age_state=0&show_suppressed=0&unacknowledged=0&compact_view=0&details=0&highlight_row=0&action=problem.view&triggerids%5B%5D=115351
top showed that most of the processes are openQA-related, as expected, with a peak of 50%+ CPU and an average of about 10% per process.
The tests at https://openqa.opensuse.org/tests do not seem to have been affected. For example, https://openqa.opensuse.org/tests/4739746# ran within the time window of the problem but was not affected.
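The working-hours observation above could be checked with a small script. This is only a sketch: the 08:00 to 18:00 UTC, Monday to Friday window and the helper name outside_working_hours are assumptions for illustration, and the only timestamp used is the first event mentioned in this comment.

```python
from datetime import datetime, time, timezone

# Assumed working-hours window (UTC); not taken from the ticket.
WORK_START = time(8, 0)
WORK_END = time(18, 0)


def outside_working_hours(ts: datetime) -> bool:
    """Return True if ts falls on a weekend or outside the assumed window."""
    ts = ts.astimezone(timezone.utc)
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (WORK_START <= ts.time() < WORK_END)


# First occurrence of the alert, as reported above.
first_event = datetime(2024, 12, 25, 7, 17, 42, tzinfo=timezone.utc)
print(outside_working_hours(first_event))
```

Feeding the full list of event timestamps exported from Zabbix through the same function would show whether the pattern holds for all 10+ occurrences.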
Updated by tinita 5 months ago · Edited
While the alert is about a load of over 4, looking at the graphs
https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=341938 (1m avg 30 days)
https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=341940 (5m avg 30 days)
http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/load.html
it seems there have indeed been some unusual peaks of over 40 in recent weeks.
edit: for checking a relation with the job queue, use https://metrics.opensuse.org/d/osrt_openqa/osrt-openqa?orgId=1&from=now-30d&to=now&viewPanel=panel-2&refresh=1m
but it looks unrelated
Updated by jbaier_cz 5 months ago
- Related to action #174316: [o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S added
Updated by gpuliti 5 months ago
- Status changed from Workable to Blocked
Still blocked by https://sd.suse.com/servicedesk/customer/portal/1/SD-176515
Updated by openqa_review 5 months ago
- Due date set to 2025-01-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by gpuliti 5 months ago
- Status changed from In Progress to Feedback
- % Done changed from 0 to 90
I've changed the ariel macro {$LOAD_AVG_PER_CPU.MAX.WARN} to 6. Since ariel has 10 CPU cores, the alert will now be raised only when the total load average is over 60.
I'm going to monitor it for the next few days, but since the alert was last raised on 03/01/25, I think the chance of it happening again is small, considering the recent history.
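The effect of the threshold change can be illustrated with a little arithmetic. This is a minimal sketch, assuming the trigger simply compares the load average divided by the number of CPUs against the macro value; the function names are hypothetical and the real Zabbix trigger expression may include further conditions. The load value and CPU count are taken from the operational data in the description.

```python
def per_cpu_load(load_avg: float, num_cpus: int) -> float:
    """Load average normalised by the number of CPUs."""
    return load_avg / num_cpus


def would_alert(load_avg: float, num_cpus: int, threshold: float) -> bool:
    """True if the per-CPU load exceeds the threshold (simplified check)."""
    return per_cpu_load(load_avg, num_cpus) > threshold


NUM_CPUS = 10        # "# of CPUs: 10" from the operational data
LOAD_5M = 41.881836  # 5m load average from the operational data

print(would_alert(LOAD_5M, NUM_CPUS, 4))  # old threshold
print(would_alert(LOAD_5M, NUM_CPUS, 6))  # new threshold
print(6 * NUM_CPUS)                       # total load needed to exceed the new threshold
```

With the reported values the per-CPU load is about 4.19, which is above the old threshold of 4 but below the new threshold of 6.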
Updated by robert.richardson 5 months ago
- Subject changed from [alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) to [alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) size: S