action #89821
alert: PROBLEM Service Alert: openqa.suse.de/fs_/srv is WARNING (flaky, partial recovery with OK messages)
Description
Observation
Multiple alert email reports:
Notification: PROBLEM
Host: openqa.suse.de
State: WARNING
Date/Time: Tue Mar 9 13:17:18 UTC 2021
Info: WARN - 80.1% used (64.06 of 79.99 GB), trend: +573.77 MB / 24 hours
Service: fs_/srv
See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fsrv
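As a rough cross-check of the figures in the alert, the usage percentage and the time left until a given threshold can be estimated from the reported trend. This is only a sketch: it assumes the trend stays constant and that the monitoring tool uses binary units (1 GB = 1024 MB), neither of which is confirmed by the alert itself. The function names are made up for illustration.

```python
MB_PER_GB = 1024  # assumption: the monitoring tool reports binary units


def used_percent(used_gb: float, total_gb: float) -> float:
    """Return filesystem usage as a percentage of the total capacity."""
    return 100.0 * used_gb / total_gb


def days_until(threshold_percent: float, used_gb: float, total_gb: float,
               trend_mb_per_day: float) -> float:
    """Estimate the days until usage reaches threshold_percent at a constant trend."""
    if trend_mb_per_day <= 0:
        return float("inf")  # usage is flat or shrinking, threshold is never reached
    remaining_gb = total_gb * threshold_percent / 100.0 - used_gb
    return max(remaining_gb, 0.0) * MB_PER_GB / trend_mb_per_day


# Figures taken from the alert above
print(round(used_percent(64.06, 79.99), 1))  # 80.1, matching the alert text
print(round(days_until(90, 64.06, 79.99, 573.77), 1))
print(round(days_until(100, 64.06, 79.99, 573.77), 1))
```

Under these assumptions the 90% mark would be reached in roughly two weeks, which gives an idea of how urgent the alert is.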
Acceptance criteria
- AC1: /srv on osd has enough free space
- AC2: alert is handled
- AC3: The icinga alert only triggers if the internal grafana alert is not handled or not effective
Suggestions
- Follow the above thruk link to understand the monitoring data
- Crosscheck alert limit "80%" with the limit we have in grafana
- Make sure the grafana limit is smaller
- Ensure there is enough space, e.g. ask EngInfra to increase it or clean up
Related issues
History
#1
Updated by okurz about 2 years ago
- Tags set to alert, thruk, icinga, srv, grafana, osd, postgres, storage, space
#2
Updated by mkittler about 2 years ago
Looks like the relevant device is /dev/vdb, so this is not about assets/results:
martchus@openqa:~> df -h / /srv/ /var/lib/openqa/share /var/lib/openqa/testresults
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.1G   11G  44% /
/dev/vdb         80G   61G   20G  77% /srv
/dev/vdc        7.0T  5.6T  1.5T  80% /var/lib/openqa/share
/dev/vdd        5.5T  4.6T 1021G  82% /var/lib/openqa
In fact, it might be about the PostgreSQL database:
martchus@openqa:~> mount | grep /dev/vdb
/dev/vdb on /srv type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/vdb on /var/lib/pgsql type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
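The same check (do two paths live on the same block device?) can be sketched without parsing the mount output by comparing the device IDs reported by stat. The helper name same_device is made up for illustration and is not part of any openQA tooling.

```python
import os


def same_device(path_a: str, path_b: str) -> bool:
    """Return True if both paths are on the same device (same st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

On OSD, same_device("/srv", "/var/lib/pgsql") would be expected to return True given the mount output above. The paths must exist, otherwise os.stat raises FileNotFoundError.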
#3
Updated by okurz about 2 years ago
Yes, /srv is the filesystem where we store mostly the database but a bit of other data as well. Sorry if that was not clear. I mentioned "postgres" only as a tag, not in the text of the description.
#4
Updated by mkittler about 2 years ago
It is really mostly the database:
--- /srv
   54.0 GiB [##########] /PSQL10
    6.2 GiB [#         ] /log
    1.3 GiB [          ]  homes.img
   23.1 MiB [          ] /salt
   10.0 MiB [          ] /pillar
We've already shared some figures in the chat; the indexes use most of the disk space. Here are the commands I've used to check this out: https://github.com/Martchus/openQA-helper#show-postgresql-table-sizes
Not sure whether we can easily improve/optimize this. We also likely can't just drop most of the indexes because they are actually there for a reason.
So I'd follow the last suggestion for now and ask infra to increase the size of /dev/vdb, e.g. to 100 GiB (so far we have 80 GiB).
#5
Updated by okurz about 2 years ago
So I'd follow the last suggestion for now and ask infra to increase the size of /dev/vdb, e.g. to 100 GiB (so far we have 80 GiB).
Good. I assume you did so with a ticket. Can you please reference the ticket? And did you include osd-admins@suse.de in CC?
#6
Updated by mkittler about 2 years ago
- Assignee set to mkittler
I haven't done anything because I wanted to wait for at least one reply within the team. I'll do it now then.
#7
Updated by mkittler about 2 years ago
- Status changed from Workable to Blocked
#8
Updated by mkittler about 2 years ago
- Status changed from Blocked to Feedback
- Priority changed from Urgent to Normal
martchus@openqa:~> df -h /dev/vdb
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb         80G   62G   19G  78% /srv
martchus@openqa:~> sudo xfs_growfs /var/lib/pgsql
meta-data=/dev/vdb               isize=256    agcount=16, agsize=1310720 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=20971520, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 20971520 to 26214400
martchus@openqa:~> df -h /dev/vdb
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        100G   62G   39G  62% /srv
They gave us more space so I guess we're good for now (AC1 and AC2).
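For reference, the block counts printed by xfs_growfs can be converted to sizes to confirm the resize. This is just a small arithmetic sketch; the block size of 4096 bytes is taken from "bsize=4096" in the data section of the output, and the helper name is made up.

```python
BLOCK_SIZE = 4096  # bytes, from "bsize=4096" in the data section
GIB = 1024 ** 3


def blocks_to_gib(blocks: int, block_size: int = BLOCK_SIZE) -> float:
    """Convert a filesystem block count to GiB."""
    return blocks * block_size / GIB


before = blocks_to_gib(20971520)
after = blocks_to_gib(26214400)
print(before, after)  # 80.0 100.0
```

This matches the sizes df reports before and after the resize (80G and 100G).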
About AC3: Should I change the threshold in our monitoring from 90 % to 80 %, or should I ask Infra to change the threshold in their monitoring from 80 % to 90 %? Or we just keep the two alerts different so the Infra alert serves as the initial alert and our own alert as a last reminder.
#9
Updated by okurz about 2 years ago
About AC3: Should I change the threshold in our monitoring from 90 % to 80 %, or should I ask Infra to change the threshold in their monitoring from 80 % to 90 %? Or we just keep the two alerts different so the Infra alert serves as the initial alert and our own alert as a last reminder.
I suggest keeping the two alerts but the other way around: our alert first at 80% and the EngInfra alert at 90%, i.e. ask them to bump theirs to 90%.
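To illustrate the proposed layering, here is a minimal sketch of which alert would fire at a given usage level. The thresholds follow the proposal in this comment; the alert labels, the function names and the ">=" comparison are assumptions for illustration and do not reflect the actual grafana or icinga configuration.

```python
from typing import List, Tuple

# Proposed thresholds in escalation order; labels are illustrative only
ALERTS: List[Tuple[str, float]] = [
    ("grafana (team alert)", 80.0),
    ("icinga (EngInfra alert)", 90.0),
]


def firing_alerts(used_percent: float,
                  alerts: List[Tuple[str, float]] = ALERTS) -> List[str]:
    """Return the names of all alerts whose threshold is reached."""
    return [name for name, threshold in alerts if used_percent >= threshold]


def escalation_order_ok(alerts: List[Tuple[str, float]] = ALERTS) -> bool:
    """Check that each later alert has a strictly higher threshold."""
    thresholds = [threshold for _, threshold in alerts]
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


print(firing_alerts(62.0))  # []
print(firing_alerts(80.1))  # ['grafana (team alert)']
print(escalation_order_ok())  # True
```

With this ordering the EngInfra alert only fires if the team alert was not handled in time, which is what AC3 asks for.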
#10
Updated by mkittler about 2 years ago
I've asked them: https://infra.nue.suse.com/SelfService/Display.html?id=186868
#11
Updated by okurz about 2 years ago
- Status changed from Feedback to Blocked
#12
Updated by okurz about 2 years ago
- Status changed from Blocked to Resolved
Confirmed with Daniel Rodríguez, https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_/srv#pnp_th2/1618211527/1618301527/0 looks good
#13
Updated by okurz over 1 year ago
- Related to action #100859: investigate how to optimize /srv data utilization on OSD size:S added