openSUSE Project Management Tool (https://progress.opensuse.org/)
openQA Infrastructure - action #89821: alert: PROBLEM Service Alert: openqa.suse.de/fs_/srv is WARNING (flaky, partial recovery with OK messages)
okurz (okurz@suse.com), 2021-03-10T08:41:35Z, https://progress.opensuse.org/issues/89821?journal_id=390118
<ul><li><strong>Tags</strong> set to <i>alert, thruk, icinga, srv, grafana, osd, postgres, storage, space</i></li></ul>
mkittler (marius.kittler@suse.com), 2021-03-10T09:25:46Z, https://progress.opensuse.org/issues/89821?journal_id=390142
<ul></ul><p>Looks like the relevant device is <code>/dev/vdb</code> so this is not about assets/results:</p>
<pre><code>martchus@openqa:~> df -h / /srv/ /var/lib/openqa/share /var/lib/openqa/testresults
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 8,1G 11G 44% /
/dev/vdb 80G 61G 20G 77% /srv
/dev/vdc 7,0T 5,6T 1,5T 80% /var/lib/openqa/share
/dev/vdd 5,5T 4,6T 1021G 82% /var/lib/openqa
</code></pre>
<p>In fact, it might be about the PostgreSQL database:</p>
<pre><code>martchus@openqa:~> mount | grep /dev/vdb
/dev/vdb on /srv type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/vdb on /var/lib/pgsql type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
</code></pre>
okurz (okurz@suse.com), 2021-03-10T11:04:51Z, https://progress.opensuse.org/issues/89821?journal_id=390184
<ul></ul><p>Yes, /srv is the filesystem where we mostly store the database, plus a bit of other data. Sorry if that was not clear. I mentioned "postgres" only as a tag, not in the text of the description.</p>
mkittler (marius.kittler@suse.com), 2021-03-10T15:20:16Z, https://progress.opensuse.org/issues/89821?journal_id=390286
<ul></ul><p>It is really mostly the database:</p>
<pre><code>--- /srv
54,0 GiB [##########] /PSQL10
6,2 GiB [# ] /log
1,3 GiB [ ] homes.img
23,1 MiB [ ] /salt
10,0 MiB [ ] /pillar
</code></pre>
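<p>A per-directory breakdown like the one above could be approximated with a short script. This is only a minimal sketch: the function names <code>tree_size</code> and <code>child_sizes</code> are made up for illustration, and it sums apparent file sizes, which may differ from the on-disk usage a dedicated tool reports.</p>

```python
import os


def tree_size(path):
    """Sum the apparent sizes (in bytes) of all files below path."""
    total = 0
    # onerror swallows unreadable directories instead of aborting the walk
    for dirpath, _dirnames, filenames in os.walk(path, onerror=lambda err: None):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def child_sizes(root):
    """Return (name, bytes) for each direct child of root, largest first."""
    sizes = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes[entry.name] = tree_size(entry.path)
            else:
                sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "/srv" is the mount point discussed in this ticket; adjust as needed.
    for name, size in child_sizes("/srv"):
        print(f"{size / 1024 ** 3:8.1f} GiB  {name}")
```

<p>Running it against <code>/srv</code> usually needs elevated permissions to read the database directory; unreadable paths are silently skipped in this sketch.</p>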
<p>We've already shared some figures in the chat; the indexes are using most of the disk space. Here are the commands I used to check this: <a href="https://github.com/Martchus/openQA-helper#show-postgresql-table-sizes" class="external">https://github.com/Martchus/openQA-helper#show-postgresql-table-sizes</a></p>
<p>Not sure whether we can easily improve/optimize this. We also likely can't just drop most of the indexes because they are actually there for a reason.</p>
<hr>
<p>So I'd follow the last suggestion for now and ask infra to increase the size of <code>/dev/vdb</code>, e.g. to 100 GiB (so far we have 80 GiB).</p>
okurz (okurz@suse.com), 2021-03-10T18:16:49Z, https://progress.opensuse.org/issues/89821?journal_id=390301
<ul></ul><blockquote>
<p>So I'd follow the last suggestion for now and ask infra to increase the size of <code>/dev/vdb</code>, e.g. to 100 GiB (so far we have 80 GiB).</p>
</blockquote>
<p>Good. I assume you did so with a ticket. Can you please reference the ticket? And did you include <a href="mailto:osd-admins@suse.de">osd-admins@suse.de</a> in CC?</p>
mkittler (marius.kittler@suse.com), 2021-03-11T09:24:16Z, https://progress.opensuse.org/issues/89821?journal_id=390496
<ul><li><strong>Assignee</strong> set to <i>mkittler</i></li></ul><p>I haven't done anything because I wanted to wait for at least one reply within the team. I'll do it now then.</p>
mkittler (marius.kittler@suse.com), 2021-03-12T10:54:14Z, https://progress.opensuse.org/issues/89821?journal_id=390935
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>Blocked</i></li></ul><p><a href="https://infra.nue.suse.com/SelfService/Display.html?id=186825" class="external">https://infra.nue.suse.com/SelfService/Display.html?id=186825</a></p>
mkittler (marius.kittler@suse.com), 2021-03-12T15:49:46Z, https://progress.opensuse.org/issues/89821?journal_id=391244
<ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Feedback</i></li><li><strong>Priority</strong> changed from <i>Urgent</i> to <i>Normal</i></li></ul><pre><code>martchus@openqa:~> df -h /dev/vdb
Filesystem Size Used Avail Use% Mounted on
/dev/vdb 80G 62G 19G 78% /srv
martchus@openqa:~> sudo xfs_growfs /var/lib/pgsql
meta-data=/dev/vdb isize=256 agcount=16, agsize=1310720 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=0 finobt=0 spinodes=0 rmapbt=0
= reflink=0
data = bsize=4096 blocks=20971520, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal bsize=4096 blocks=2560, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
data blocks changed from 20971520 to 26214400
martchus@openqa:~> df -h /dev/vdb
Filesystem Size Used Avail Use% Mounted on
/dev/vdb 100G 62G 39G 62% /srv
</code></pre>
<p>They gave us more space so I guess we're good for now (AC1 and AC2).</p>
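<p>As a sanity check, the block counts printed by <code>xfs_growfs</code> can be converted to sizes by hand. The sketch below only uses numbers from the output above (<code>bsize=4096</code> and the two data block counts); the helper name <code>blocks_to_gib</code> is made up for illustration.</p>

```python
# Values copied from the xfs_growfs output in this comment.
BLOCK_SIZE = 4096          # "bsize" of the data section, in bytes
BLOCKS_BEFORE = 20971520   # data blocks before growing
BLOCKS_AFTER = 26214400    # data blocks after growing

GIB = 1024 ** 3


def blocks_to_gib(blocks, block_size=BLOCK_SIZE):
    """Convert a filesystem block count to GiB."""
    return blocks * block_size / GIB


if __name__ == "__main__":
    print(blocks_to_gib(BLOCKS_BEFORE))  # 80.0
    print(blocks_to_gib(BLOCKS_AFTER))   # 100.0
```

<p>This matches the sizes reported by <code>df -h</code> before and after the resize (80G and 100G).</p>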
<hr>
<p>About AC3: Should I change the threshold in our monitoring from 90 % to 80 %, or should I ask Infra to change the threshold in their monitoring from 80 % to 90 %? Or we could just keep the two alerts different, so that the Infra alert serves as the initial alert and our own alert as a last reminder.</p>
okurz (okurz@suse.com), 2021-03-12T16:06:45Z, https://progress.opensuse.org/issues/89821?journal_id=391247
<ul></ul><blockquote>
<p>About AC3: Should I change the threshold in our monitoring from 90 % to 80 %, or should I ask Infra to change the threshold in their monitoring from 80 % to 90 %? Or we could just keep the two alerts different, so that the Infra alert serves as the initial alert and our own alert as a last reminder.</p>
</blockquote>
<p>I suggest keeping the two alerts but the other way around: our alert first at 80% and the EngInfra alert at 90%, i.e. ask 'em to bump theirs to 90%.</p>
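<p>The two-tier idea can be sketched as a small classifier. This is only an illustration of the suggested 80%/90% split, not the actual monitoring configuration; the names <code>classify</code>, <code>usage_percent</code> and the returned labels are hypothetical.</p>

```python
import shutil

TEAM_THRESHOLD = 80   # percent; our own alert fires first (suggested value)
INFRA_THRESHOLD = 90  # percent; EngInfra alert as the last reminder (suggested value)


def classify(used_percent, team=TEAM_THRESHOLD, infra=INFRA_THRESHOLD):
    """Map a usage percentage to a hypothetical alert level."""
    if used_percent >= infra:
        return "infra-alert"
    if used_percent >= team:
        return "team-alert"
    return "ok"


def usage_percent(path):
    """Used space of the filesystem containing path, in percent.

    Computed as used / total, which can differ slightly from the
    Use% column printed by df.
    """
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


if __name__ == "__main__":
    percent = usage_percent("/")
    print(f"{percent:.0f}% -> {classify(percent)}")
```

<p>With the figures from this ticket, 78% before the resize and 62% after it would both classify as <code>ok</code> under these thresholds.</p>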
mkittler (marius.kittler@suse.com), 2021-03-12T16:35:28Z, https://progress.opensuse.org/issues/89821?journal_id=391259
<ul></ul><p>I've asked them: <a href="https://infra.nue.suse.com/SelfService/Display.html?id=186868" class="external">https://infra.nue.suse.com/SelfService/Display.html?id=186868</a></p>
okurz (okurz@suse.com), 2021-03-29T13:10:53Z, https://progress.opensuse.org/issues/89821?journal_id=394038
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Blocked</i></li></ul><p>blocked by <a href="https://infra.nue.suse.com/SelfService/Display.html?id=186868" class="external">https://infra.nue.suse.com/SelfService/Display.html?id=186868</a></p>
okurz (okurz@suse.com), 2021-04-13T08:13:41Z, https://progress.opensuse.org/issues/89821?journal_id=396899
<ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Resolved</i></li></ul><p>Confirmed with Daniel Rodríguez, <a href="https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_/srv#pnp_th2/1618211527/1618301527/0" class="external">https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_/srv#pnp_th2/1618211527/1618301527/0</a> looks good</p>
okurz (okurz@suse.com), 2021-10-14T09:11:32Z, https://progress.opensuse.org/issues/89821?journal_id=455408
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/100859">action #100859</a>: investigate how to optimize /srv data utilization on OSD size:S</i> added</li></ul>