action #70834

[alert] Refine I/O time alerts for OSD

Added by nicksinger 5 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2020-09-02
Due date:
% Done: 0%
Estimated time:
Tags:

Description

We have several IO time alerts for OSD itself:

They need to be reworked so that:

  1. The right disk is shown for the right purpose (e.g. /dev/vde is not /results any longer)
  2. DONE: The alert thresholds need to be adjusted so that they do not trigger so often
    • Spikes of up to 7s seem to happen from time to time
    • The situation gets critical if these spikes continue for several minutes

All alerts linked above are paused right now, since they provide little benefit while they are this flaky.
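
The distinction in point 2 between tolerable spikes and a critical sustained condition can be sketched as follows. This is only an illustration, not the actual alert definition from our monitoring setup: the function name, the 7 s threshold and the 5 minute duration are assumptions picked to match the wording above.

```python
from typing import Iterable, Optional, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold_s: float = 7.0,       # assumed: spikes up to ~7s are tolerated
    min_duration_s: float = 300.0,  # assumed: "several minutes" taken as 5 minutes
) -> bool:
    """Return True if I/O time stays above the threshold long enough.

    `samples` is an iterable of (timestamp_seconds, io_time_seconds) pairs
    sorted by timestamp. A single sample above the threshold does not count;
    only an uninterrupted run lasting at least `min_duration_s` does.
    """
    breach_start: Optional[float] = None
    for timestamp, io_time in samples:
        if io_time > threshold_s:
            if breach_start is None:
                # First sample of a new run above the threshold
                breach_start = timestamp
            if timestamp - breach_start >= min_duration_s:
                return True
        else:
            # Any sample at or below the threshold ends the current run
            breach_start = None
    return False


# A single spike between normal samples should not alert
single_spike = [(0, 0.5), (60, 7.5), (120, 0.4)]
# Seven consecutive samples above the threshold, one per minute
sustained = [(minute * 60, 8.0) for minute in range(7)]

print(sustained_breach(single_spike))
print(sustained_breach(sustained))
```

With this approach an isolated spike resets nothing and triggers nothing, while a run of high values only alerts once it has lasted for the configured duration.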


Related issues

Related to openQA Infrastructure - action #69667: missing monitoring data for vde after partitions where reordered (Resolved, 2020-08-06)

Related to openQA Infrastructure - action #73165: [osd] Consolidate "expensive+fast" and "cheap+slow" storage after realizing vdc is "cheap+slow" as well (Resolved, 2020-09-02)

History

#1 Updated by okurz 5 months ago

  • Related to action #69667: missing monitoring data for vde after partitions where reordered added

#2 Updated by okurz 5 months ago

  • Tags set to alert
  • Target version set to Ready

#3 Updated by okurz 3 months ago

  • Related to action #73165: [osd] Consolidate "expensive+fast" and "cheap+slow" storage after realizing vdc is "cheap+slow" as well added

#4 Updated by okurz 3 months ago

  • Status changed from New to Feedback
  • Assignee set to okurz

With what I learned in #73165 I could update the current monitoring and alerting in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/375 . I merged it, but the merge apparently did not trigger a CI pipeline in master, so I triggered one manually.

#5 Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Priority changed from Normal to Low

Crossed off the point I have done. The rest is still to be done.

#6 Updated by okurz 3 months ago

  • Status changed from Workable to Resolved
  • Assignee set to okurz
  • Priority changed from Low to Normal

Hm, given that the current state is OK again and we change the partition layout so seldom, I think it is fine as it is. Of course, if someone has a cool idea we can rework our salt code; I have recorded that in #65271.
