action #70834

closed

[alert] Refine I/O time alerts for OSD

Added by nicksinger about 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2020-09-02
Due date:
% Done: 0%
Estimated time:
Tags:

Description

We have several IO time alerts for OSD itself:

They need to be reworked so that:

  1. The right disk is shown for the right purpose (e.g. /dev/vde is not /results any longer)
  2. DONE: The alert thresholds are adjusted so that the alerts do not trigger as often
    • Spikes of up to 7s seem to happen from time to time
    • The situation gets critical if these spikes continue for several minutes
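
The threshold idea in point 2 can be sketched as a "sustained breach" check: a single spike is ignored, and the alert fires only once the I/O time has stayed above the threshold for a minimum duration. This is a minimal illustration, not the actual alert definition. The names `Sample` and `should_alert` are hypothetical, the 7 s threshold is taken from the spike size mentioned above, and the 300 s duration is an assumed stand-in for "several minutes".

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Sample:
    timestamp: float  # seconds since some reference point
    io_time_s: float  # observed I/O time in seconds


def should_alert(
    samples: Iterable[Sample],
    threshold_s: float = 7.0,       # spikes up to this value are considered normal
    min_duration_s: float = 300.0,  # assumed value for "several minutes"
) -> bool:
    """Return True if I/O time stayed above threshold_s for at least min_duration_s."""
    breach_start = None
    for sample in sorted(samples, key=lambda s: s.timestamp):
        if sample.io_time_s > threshold_s:
            if breach_start is None:
                # First sample of a new breach
                breach_start = sample.timestamp
            if sample.timestamp - breach_start >= min_duration_s:
                return True
        else:
            # Any sample at or below the threshold ends the breach
            breach_start = None
    return False
```

With one sample per minute, an isolated spike does not alert, while six consecutive samples above the threshold (spanning 300 s) do.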

All of the alerts linked above are currently paused because, being this flaky, they provide little benefit.


Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #69667: missing monitoring data for vde after partitions where reordered (Resolved, mkittler, 2020-08-06)
Related to openQA Infrastructure - action #73165: [osd] Consolidate "expensive+fast" and "cheap+slow" storage after realizing vdc is "cheap+slow" as well (Resolved, okurz, 2020-09-02)
Related to openQA Infrastructure - action #110269: [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M (Resolved, kraih)

Actions #1

Updated by okurz about 4 years ago

  • Related to action #69667: missing monitoring data for vde after partitions where reordered added
Actions #2

Updated by okurz about 4 years ago

  • Tags set to alert
  • Target version set to Ready
Actions #3

Updated by okurz about 4 years ago

  • Related to action #73165: [osd] Consolidate "expensive+fast" and "cheap+slow" storage after realizing vdc is "cheap+slow" as well added
Actions #4

Updated by okurz about 4 years ago

  • Status changed from New to Feedback
  • Assignee set to okurz

Based on what I learned in #73165, I could update the current monitoring and alerting in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/375 . I merged it, but the merge no longer seems to trigger a CI pipeline on master, so I triggered one manually.

Actions #5

Updated by okurz about 4 years ago

  • Description updated (diff)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Priority changed from Normal to Low

Crossed off the point I have done. The rest is still to be done.

Actions #6

Updated by okurz almost 4 years ago

  • Status changed from Workable to Resolved
  • Assignee set to okurz
  • Priority changed from Low to Normal

Hm, given that the current state is OK again and we change the partition layout so seldom, I think it is fine as it is. Of course, if someone has a good idea, we can rework our salt code; I have recorded that in #65271.

Actions #7

Updated by okurz over 2 years ago

  • Related to action #110269: [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M added