action #89821 (closed)
alert: PROBLEM Service Alert: openqa.suse.de/fs_/srv is WARNING (flaky, partial recovery with OK messages)
Description
Observation
Multiple alert email reports:
Notification: PROBLEM
Host: openqa.suse.de
State: WARNING
Date/Time: Tue Mar 9 13:17:18 UTC 2021
Info: WARN - 80.1% used (64.06 of 79.99 GB), trend: +573.77 MB / 24 hours
Service: fs_/srv
See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fsrv
Acceptance criteria
- AC1: /srv on osd has enough free space
- AC2: alert is handled
- AC3: the Icinga alert only triggers if the internal Grafana alert is not handled or not effective
Suggestions
- Follow the above thruk link to understand the monitoring data
- Crosscheck the alert limit "80%" with the limit we have in Grafana (see the sketch after this list)
- Make sure the grafana limit is smaller
- Ensure there is enough space, e.g. ask EngInfra to increase or cleanup
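A minimal sketch of such a crosscheck (assuming GNU coreutils; this is not the actual Icinga or Grafana check, the 80% limit is taken from the alert above):

# read the current usage of /srv as a bare number, e.g. "80"
usage=$(df --output=pcent /srv | tail -n 1 | tr -dc '0-9')
# compare against the Icinga warning limit of 80%
if [ "$usage" -ge 80 ]; then
    echo "WARN: /srv at ${usage}% (limit 80%)"
else
    echo "OK: /srv at ${usage}%"
fi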
Updated by okurz almost 4 years ago
- Tags set to alert, thruk, icinga, srv, grafana, osd, postgres, storage, space
Updated by mkittler almost 4 years ago
Looks like the relevant device is /dev/vdb, so this is not about assets/results:
martchus@openqa:~> df -h / /srv/ /var/lib/openqa/share /var/lib/openqa/testresults
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.1G   11G  44% /
/dev/vdb         80G   61G   20G  77% /srv
/dev/vdc        7.0T  5.6T  1.5T  80% /var/lib/openqa/share
/dev/vdd        5.5T  4.6T 1021G  82% /var/lib/openqa
In fact, it might be about the PostgreSQL database:
martchus@openqa:~> mount | grep /dev/vdb
/dev/vdb on /srv type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/vdb on /var/lib/pgsql type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
Updated by okurz almost 4 years ago
Yes, /srv is the filesystem where we store mostly the database, but a bit of other data as well. Sorry if that was not clear; I mentioned "postgres" only as a tag, not in the text of the description.
Updated by mkittler almost 4 years ago
It is really mostly the database:
--- /srv
   54.0 GiB [##########] /PSQL10
    6.2 GiB [#         ] /log
    1.3 GiB [          ]  homes.img
   23.1 MiB [          ] /salt
   10.0 MiB [          ] /pillar
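The breakdown above looks like ncdu output; a rough way to reproduce such numbers (a sketch, not necessarily the exact command used here):

# stay on the /srv filesystem (-x) and sort the top-level directories by size
sudo du -xh --max-depth=1 /srv | sort -h
# or interactively, with the same bar rendering as above
sudo ncdu -x /srv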
We've already shared some figures within the chat; the indexes use most of the disk space. Here are the commands I've used to check this: https://github.com/Martchus/openQA-helper#show-postgresql-table-sizes
Not sure whether we can easily improve/optimize this. We also likely can't just drop most of the indexes because they are actually there for a reason.
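For reference, a query along these lines lists the biggest tables together with their index sizes (a sketch; the database name "openqa" is an assumption, the exact commands are in the linked repository):

# hypothetical invocation; adjust the database name as needed
sudo -u postgres psql openqa -c "
    SELECT relname,
           pg_size_pretty(pg_total_relation_size(relid)) AS total,
           pg_size_pretty(pg_indexes_size(relid))        AS indexes
    FROM pg_catalog.pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC
    LIMIT 10;"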
So I'd follow the last suggestion for now and ask infra to increase the size of /dev/vdb, e.g. to 100 GiB (so far we have 80 GiB).
Updated by okurz almost 4 years ago
> So I'd follow the last suggestion for now and ask infra to increase the size of /dev/vdb, e.g. to 100 GiB (so far we have 80 GiB).
Good. I assume you did so with a ticket. Can you please reference the ticket? And did you include osd-admins@suse.de in CC?
Updated by mkittler almost 4 years ago
- Assignee set to mkittler
I haven't done anything because I wanted to wait for at least one reply within the team. I'll do it now then.
Updated by mkittler almost 4 years ago
- Status changed from Workable to Blocked
Updated by mkittler almost 4 years ago
- Status changed from Blocked to Feedback
- Priority changed from Urgent to Normal
martchus@openqa:~> df -h /dev/vdb
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb         80G   62G   19G  78% /srv
martchus@openqa:~> sudo xfs_growfs /var/lib/pgsql
meta-data=/dev/vdb               isize=256    agcount=16, agsize=1310720 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=20971520, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 20971520 to 26214400
martchus@openqa:~> df -h /dev/vdb
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        100G   62G   39G  62% /srv
They gave us more space so I guess we're good for now (AC1 and AC2).
About AC3: Should I change the threshold in our monitoring from 90% to 80%, or should I ask Infra to change the threshold in their monitoring from 80% to 90%? Or we could just keep the two alerts at different levels so that the Infra alert serves as the initial alert and our own alert as a last reminder.
Updated by okurz almost 4 years ago
> About AC3: Should I change the threshold in our monitoring from 90% to 80%, or should I ask Infra to change the threshold in their monitoring from 80% to 90%? Or we could just keep the two alerts at different levels so that the Infra alert serves as the initial alert and our own alert as a last reminder.
I suggest keeping the two alerts but the other way around: our alert first at 80% and the EngInfra alert at 90%, i.e. ask them to bump theirs to 90%.
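To illustrate the intended layering (a sketch only, not the real monitoring configuration): with the thresholds arranged this way, the team's Grafana alert always fires first and the EngInfra Icinga alert only adds a reminder if the situation keeps degrading:

usage=$(df --output=pcent /srv | tail -n 1 | tr -dc '0-9')
if   [ "$usage" -ge 90 ]; then echo "Icinga (EngInfra): last reminder at ${usage}%"
elif [ "$usage" -ge 80 ]; then echo "Grafana (tools team): first alert at ${usage}%"
fi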
Updated by mkittler almost 4 years ago
I've asked them: https://infra.nue.suse.com/SelfService/Display.html?id=186868
Updated by okurz over 3 years ago
- Status changed from Feedback to Blocked
Updated by okurz over 3 years ago
- Status changed from Blocked to Resolved
Confirmed with Daniel Rodríguez, https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_/srv#pnp_th2/1618211527/1618301527/0 looks good
Updated by okurz about 3 years ago
- Related to action #100859: investigate how to optimize /srv data utilization on OSD size:S added