action #89821

closed

alert: PROBLEM Service Alert: openqa.suse.de/fs_/srv is WARNING (flaky, partial recovery with OK messages)

Added by okurz almost 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2021-03-10
Due date:
% Done: 0%
Estimated time:

Description

Observation

Multiple alert email reports:
Notification: PROBLEM
Host: openqa.suse.de
State: WARNING
Date/Time: Tue Mar 9 13:17:18 UTC 2021
Info: WARN - 80.1% used (64.06 of 79.99 GB), trend: +573.77 MB / 24 hours

Service: fs_/srv

See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fsrv

Acceptance criteria

  • AC1: /srv on osd has enough free space
  • AC2: alert is handled
  • AC3: the icinga alert only triggers if the internal grafana alert is not handled or not effective

Suggestions

  • Follow the above thruk link to understand the monitoring data
  • Crosscheck alert limit "80%" with the limit we have in grafana
  • Make sure the grafana limit is smaller
  • Ensure there is enough space, e.g. ask EngInfra to increase the disk or clean up (a quick check of the current usage is sketched below)
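
A quick way to crosscheck the current usage against the warning limit directly on the machine (a minimal sketch; the 80% value is the threshold from the Icinga alert above, not taken from any config file):

# current usage of the filesystem the alert is about
df -h /srv
# compare the usage percentage against the Icinga WARNING threshold of 80%
THRESHOLD=80
USED=$(df --output=pcent /srv | tail -n 1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
    echo "/srv is at ${USED}%, at or above the ${THRESHOLD}% warning limit"
fi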

Related issues (1): 0 open, 1 closed

Related to openQA Infrastructure (public) - action #100859: investigate how to optimize /srv data utilization on OSD size:S (Resolved, mkittler, 2021-10-12)

Actions #1

Updated by okurz almost 4 years ago

  • Tags set to alert, thruk, icinga, srv, grafana, osd, postgres, storage, space
Actions #2

Updated by mkittler almost 4 years ago

Looks like the relevant device is /dev/vdb, so this is not about assets/results:

martchus@openqa:~> df -h / /srv/ /var/lib/openqa/share /var/lib/openqa/testresults
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G    8,1G   11G   44% /
/dev/vdb         80G     61G   20G   77% /srv
/dev/vdc        7,0T    5,6T  1,5T   80% /var/lib/openqa/share
/dev/vdd        5,5T    4,6T 1021G   82% /var/lib/openqa

In fact, it might be about the PostgreSQL database:

martchus@openqa:~> mount | grep /dev/vdb
/dev/vdb on /srv type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/vdb on /var/lib/pgsql type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
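
If the space is indeed going into PostgreSQL, a rough breakdown can be obtained directly on the host (a sketch; it assumes the cluster runs as the postgres user, which may differ):

# which directories on the /dev/vdb filesystem take the space
sudo du -xsh /srv/* | sort -h
# per-database sizes as reported by PostgreSQL itself
sudo -u postgres psql -c "SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database ORDER BY pg_database_size(datname) DESC;"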
Actions #3

Updated by okurz almost 4 years ago

Yes, /srv is the filesystem where we mostly store the database, but a bit of other data as well. Sorry if that was not clear. I have mentioned "postgres" only as a tag, not in the text of the description.

Actions #4

Updated by mkittler almost 4 years ago

It is really mostly the database:

--- /srv
   54,0 GiB [##########] /PSQL10
    6,2 GiB [#         ] /log
    1,3 GiB [          ]  homes.img
   23,1 MiB [          ] /salt
   10,0 MiB [          ] /pillar

We've already shared some figures in the chat; the indexes are using most of the disk space. Here are the commands I've used to check this: https://github.com/Martchus/openQA-helper#show-postgresql-table-sizes
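
For reference, a query along the lines of those helper commands (a sketch, assuming the openQA database is simply called openqa) to list the largest tables and indexes:

sudo -u postgres psql openqa -c "
  SELECT relname, relkind, pg_size_pretty(pg_relation_size(oid)) AS size
    FROM pg_class
   ORDER BY pg_relation_size(oid) DESC
   LIMIT 15;"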

Not sure whether we can easily improve/optimize this. We also likely can't just drop most of the indexes because they are actually there for a reason.


So I'd follow the last suggestion for now and ask infra to increase the size of /dev/vdb, e.g. to 100 GiB (so far we have 80 GiB).
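
Once EngInfra has enlarged the virtual disk, the filesystem itself should be growable online with something like the following (a sketch; XFS can be grown while mounted, and /var/lib/pgsql is one of the mount points of /dev/vdb):

# after the underlying virtual disk /dev/vdb has been enlarged:
sudo xfs_growfs /var/lib/pgsql   # grow the XFS filesystem to the new device size
df -h /srv                       # verify the new size is visible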

Actions #5

Updated by okurz almost 4 years ago

So I'd follow the last suggestion for now and ask infra to increase the size of /dev/vdb, e.g. to 100 GiB (so far we have 80 GiB).

Good. I assume you did so with a ticket. Can you please reference the ticket? And did you include osd-admins@suse.de in CC?

Actions #6

Updated by mkittler almost 4 years ago

  • Assignee set to mkittler

I haven't done anything because I wanted to wait for at least one reply within the team. I'll do it now then.

Actions #7

Updated by mkittler almost 4 years ago

  • Status changed from Workable to Blocked
Actions #8

Updated by mkittler almost 4 years ago

  • Status changed from Blocked to Feedback
  • Priority changed from Urgent to Normal
martchus@openqa:~> df -h /dev/vdb
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb         80G     62G   19G   78% /srv
martchus@openqa:~> sudo xfs_growfs /var/lib/pgsql
meta-data=/dev/vdb               isize=256    agcount=16, agsize=1310720 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=20971520, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 20971520 to 26214400
martchus@openqa:~> df -h /dev/vdb
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        100G     62G   39G   62% /srv

They gave us more space so I guess we're good for now (AC1 and AC2).


About AC3: Should I change the threshold in our monitoring from 90 % to 80 %, or should I ask Infra to change the threshold in their monitoring from 80 % to 90 %? Or should we just keep the two alerts at different thresholds so the Infra alert serves as the initial alert and our own alert as a last reminder?

Actions #9

Updated by okurz almost 4 years ago

About AC3: Should I change the threshold in our monitoring from 90 % to 80 %, or should I ask Infra to change the threshold in their monitoring from 80 % to 90 %? Or should we just keep the two alerts at different thresholds so the Infra alert serves as the initial alert and our own alert as a last reminder?

I suggest keeping the two alerts but the other way around: our alert fires first at 80% and the EngInfra alert at 90%, i.e. ask them to bump theirs to 90%.

Actions #11

Updated by okurz over 3 years ago

  • Status changed from Feedback to Blocked
Actions #12

Updated by okurz over 3 years ago

  • Status changed from Blocked to Resolved
Actions #13

Updated by okurz about 3 years ago

  • Related to action #100859: investigate how to optimize /srv data utilization on OSD size:S added