action #151597

closed

[alert] osiris-1 (osiris-1: partitions usage (%) alert Generic partitions_usage_alert_osiris-1 generic

Added by tinita 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-11-28
Due date:
% Done: 0%
Estimated time:

Description

Observation

Alert from Grafana:

1 firing alert instance

Grouped by: hostname=osiris-1

1 firing instances

Firing [stats.openqa-monitor.qa.suse.de]
osiris-1: partitions usage (%) alert
View alert [stats.openqa-monitor.qa.suse.de]
Values:
A0=96.03554068410229
Labels:
alertname=osiris-1: partitions usage (%) alert
grafana_folder=Generic
hostname=osiris-1
rule_uid=partitions_usage_alert_osiris-1
type=generic
Silence [stats.openqa-monitor.qa.suse.de]
View dashboard [stats.openqa-monitor.qa.suse.de]
View panel [stats.openqa-monitor.qa.suse.de]
Observed 32s before this notification was delivered, at 2023-11-28 11:49:00 +0100 CET

http://stats.openqa-monitor.qa.suse.de/alerting/grafana/partitions_usage_alert_osiris-1/view?orgId=1


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #138650: partition usage panels show a long list of undefined and no reasonable graphs at least for some generic machines size:M (Resolved, tinita, 2023-10-27)

Actions #1

Updated by tinita 5 months ago

  • Related to action #138650: partition usage panels show a long list of undefined and no reasonable graphs at least for some generic machines size:M added
Actions #2

Updated by tinita 5 months ago · Edited

  • Status changed from New to In Progress

Since the alert doesn't really tell me which partition is problematic (see #138650), I had a look at osiris-1. The affected partition is /var/lib/libvirt/images/dist.suse.de.
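
For reference, a check like this can be sketched in Python. This is only an illustration, not what was actually run on osiris-1: the helper names (parse_mounts, used_percent, partitions_over) are made up for this example, and the percentage is computed roughly the way df reports it, which may differ slightly from the value the monitoring agent stores.

```python
import os


def parse_mounts(text):
    """Return (device, mountpoint, fstype) tuples from /proc/mounts-style text.

    Note: /proc/mounts escapes spaces in paths as \\040; this sketch does not
    unescape them.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[0], fields[1], fields[2]))
    return entries


def used_percent(path):
    """Used space of the filesystem at `path` in percent (df-like)."""
    st = os.statvfs(path)
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    avail = st.f_bavail * st.f_frsize
    total = used + avail
    return 100.0 * used / total if total else 0.0


def partitions_over(mounts_text, threshold):
    """List (mountpoint, fstype, percent) for mounts at or above `threshold`."""
    hits = []
    for _device, mountpoint, fstype in parse_mounts(mounts_text):
        try:
            pct = used_percent(mountpoint)
        except OSError:
            # Mount point vanished or is not accessible; skip it.
            continue
        if pct >= threshold:
            hits.append((mountpoint, fstype, round(pct, 1)))
    return hits
```

On a Linux host it could be called as partitions_over(open("/proc/mounts").read(), 90.0), where 90.0 is an arbitrary example threshold and not the threshold of the alert rule. The fstype column in the result is what shows whether a hit is a network mount.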

Actions #3

Updated by tinita 5 months ago

The partition mentioned above has the filesystem type nfs4.
The alert is supposed to ignore NFS mounts, but it only excludes the exact string nfs, so nfs4 mounts are still checked.

I just changed the check to a regex:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1053 Ignore all nfs partitions
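
The difference between the two checks can be illustrated with a small Python sketch. It only mimics the filter logic and is not the code from the merge request: the function names and the sample list of filesystem types are invented for the example, and whether the old query already excluded udf is an assumption.

```python
import re

# Sample filesystem types, chosen for illustration only.
FSTYPES = ["ext4", "nfs", "nfs4", "udf", "xfs"]


def monitored_exact(fstypes):
    """Old behaviour (assumed): only the literal string 'nfs' is excluded."""
    return [t for t in fstypes if t != "nfs" and t != "udf"]


def monitored_regex(fstypes):
    """New behaviour: everything starting with 'nfs' is excluded (/^nfs/)."""
    nfs = re.compile(r"^nfs")
    return [t for t in fstypes if not nfs.search(t) and t != "udf"]


print(monitored_exact(FSTYPES))  # nfs4 is still monitored here
print(monitored_regex(FSTYPES))  # nfs4 is filtered out
```

With the exact comparison, nfs4 stays in the monitored set, which matches what happened with the dist.suse.de mount. The anchored pattern drops every type whose name begins with nfs.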

Actions #4

Updated by tinita 5 months ago

Note that the alert is about disk usage in terms of size:

SELECT mean("used_percent") AS "used_percent" FROM "disk" WHERE ("host" = 'osiris-1' AND fstype !~ /^nfs/ AND fstype != 'udf') AND $timeFilter GROUP BY time($interval), "device", "fstype" fill(null)

while the linked panel is about diskio:

SELECT non_negative_derivative(mean(reads),1s) as "read" FROM "diskio" WHERE "host" = 'osiris-1' AND $timeFilter GROUP BY time($interval), *

I think we should have two different panels.

Actions #5

Updated by tinita 5 months ago

  • Tags set to infra, monitoring, grafana
Actions #8

Updated by tinita 5 months ago

Actions #9

Updated by tinita 5 months ago

  • Status changed from In Progress to Feedback
Actions #10

Updated by tinita 5 months ago

  • Status changed from Feedback to Resolved

Alert gone
