openSUSE Project Management Tool: Issues | https://progress.opensuse.org/ | updated 2023-07-09T11:52:44Z
openQA Infrastructure - action #132470 (Resolved): salt states fail to apply due to glibc error o... | https://progress.opensuse.org/issues/132470 | 2023-07-09T11:52:44Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>On storage.oqa.suse.de:</p>
<pre><code># zypper ref
zypper: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /usr/lib64/libzypp.so.1722)
</code></pre>
<p>which then also shows up in <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1679723" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1679723</a></p>
openQA Infrastructure - action #123082 (Resolved): backup of o3 to storage.qa.suse.de was not con... | https://progress.opensuse.org/issues/123082 | 2023-01-13T10:39:30Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>On storage.qa.suse.de <code>ls -ltra /storage/rsnapshot/*/openqa.opensuse.org/</code> shows:</p>
<pre><code>/storage/rsnapshot/alpha.3/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root 6 Nov 19 2021 root
drwxr-xr-x 1 root root 8 Nov 22 2021 .
drwxr-xr-x 1 root root 38 Nov 22 2021 ..
/storage/rsnapshot/_delete.12732/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root 6 Dec 30 2021 root
drwxr-xr-x 1 root root 8 Dec 30 2021 .
drwxr-xr-x 1 root root 66 Dec 31 2021 ..
/storage/rsnapshot/beta.2/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root 0 Aug 25 00:00 .
drwxr-xr-x 1 root root 66 Aug 25 03:10 ..
/storage/rsnapshot/beta.1/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root 0 Sep 22 00:00 .
drwxr-xr-x 1 root root 66 Sep 22 03:38 ..
/storage/rsnapshot/beta.0/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root 0 Oct 24 00:00 .
drwxr-xr-x 1 root root 66 Oct 24 03:14 ..
/storage/rsnapshot/alpha.2/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root 0 Nov 28 00:00 .
drwxr-xr-x 1 root root 66 Nov 28 03:51 ..
/storage/rsnapshot/alpha.1/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root 0 Dec 1 00:00 .
drwxr-xr-x 1 root root 66 Dec 1 03:47 ..
/storage/rsnapshot/alpha.0/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root 0 Jan 12 00:00 .
drwxr-xr-x 1 root root 66 Jan 12 05:55 ..
</code></pre>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> We are alerted if the backup cannot be conducted</li>
<li><strong>AC2:</strong> o3 is backed up again</li>
</ul>
<a name="Rollback-steps"></a>
<h2 >Rollback steps<a href="#Rollback-steps" class="wiki-anchor">¶</a></h2>
<ul>
<li>Add storage.qa.suse.de back to salt</li>
</ul>
openQA Infrastructure - action #89821 (Resolved): alert: PROBLEM Service Alert: openqa.suse.de/fs... | https://progress.opensuse.org/issues/89821 | 2021-03-10T08:41:07Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Multiple alert email reports:<br>
Notification: PROBLEM<br>
Host: openqa.suse.de<br>
State: WARNING<br>
Date/Time: Tue Mar 9 13:17:18 UTC 2021<br>
Info: WARN - 80.1% used (64.06 of 79.99 GB), trend: +573.77 MB / 24 hours</p>
<p>Service: fs_/srv</p>
<p>See Online: <a href="https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fsrv" class="external">https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fsrv</a></p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> /srv on osd has enough free space</li>
<li><strong>AC2:</strong> alert is handled</li>
<li><strong>AC3:</strong> the icinga alert only triggers if the internal grafana alert is not handled or not effective</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Follow the above thruk link to understand the monitoring data</li>
<li>Crosscheck alert limit "80%" with the limit we have in grafana</li>
<li>Make sure the grafana limit is smaller</li>
<li>Ensure there is enough space, e.g. ask EngInfra to increase storage or clean up</li>
</ul>
openQA Project - action #88121 (Resolved): Trigger cleanup of results (or assets) if not enough f... | https://progress.opensuse.org/issues/88121 | 2021-01-21T11:50:18Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>See parent epic <a class="issue tracker-6 status-1 priority-4 priority-default child parent" title="coordination: [epic] Automatically remove assets+results based on available free space (New)" href="https://progress.opensuse.org/issues/76984">#76984</a>. To be able to progress with <a class="issue tracker-6 status-1 priority-4 priority-default child parent" title="coordination: [epic] Automatically remove assets+results based on available free space (New)" href="https://progress.opensuse.org/issues/76984">#76984</a> we should try to split out smaller, simpler stories and start with implementing "df" calls in general. This would also allow us to gather experience on whether calling df is cheap and reliable enough.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Regular cleanup of results (or assets) is triggered if free space for results (or assets) is below the configured limit</li>
<li><strong>AC2:</strong> If no free-space limit is configured, no df check is called and no cleanup is triggered</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Extend the existing asset+result cleanup to
<ul>
<li>check the free space of the filesystem containing the assets/results directory</li>
<li>compare the free space against a configured value, e.g. in openqa.ini</li>
<li>trigger the same cleanup that we would trigger from the systemd timer</li>
</ul></li>
<li>can use <a href="https://software.opensuse.org/package/perl-Filesys-Df?search_term=perl-FileSys-Df" class="external">https://software.opensuse.org/package/perl-Filesys-Df?search_term=perl-FileSys-Df</a></li>
<li>can mock "df" in tests to simply give back what we want, e.g. "enough free space available" or "free space exceeded"</li>
<li>Optional: Extend to assets as well</li>
</ul>
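<p>A minimal sketch of the suggested check, in shell rather than the Perl/Filesys::Df approach mentioned above (the function name, paths, and cleanup action are illustrative assumptions, not the actual openQA implementation). It only calls <code>df</code> and triggers a cleanup when a limit is actually configured, matching AC2:</p>

```shell
#!/bin/sh
# Hypothetical sketch: trigger a result cleanup only when a free-space
# limit is configured and the filesystem is below it.
maybe_trigger_cleanup() {
    dir=$1      # filesystem containing the results directory
    limit=$2    # minimum free percent; empty string means "feature disabled"
    if [ -z "$limit" ]; then
        echo "no limit configured, skipping df check"
        return 0
    fi
    # df -P prints one POSIX-format data line; field 5 is "use%"
    free=$(df -P "$dir" | awk 'NR==2 { sub(/%/, "", $5); print 100 - $5 }')
    if [ "$free" -lt "$limit" ]; then
        echo "free space ${free}% below limit ${limit}%, triggering cleanup"
    else
        echo "free space ${free}% is sufficient"
    fi
}

maybe_trigger_cleanup / ""    # AC2: no df call, no cleanup triggered
maybe_trigger_cleanup / 100   # an unreachably high limit forces the cleanup branch
```

<p>In the real implementation the echo in the cleanup branch would instead enqueue the same cleanup that the systemd timer triggers.</p>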
openQA Project - coordination #80546 (Resolved): [epic] Scale up: Enable to store more results | https://progress.opensuse.org/issues/80546 | 2020-11-27T21:03:09Z | okurz (okurz@suse.com)
<a name="Ideas"></a>
<h2 >Ideas<a href="#Ideas" class="wiki-anchor">¶</a></h2>
<ul>
<li>Simply enlarge the storage space for "results" on our production instances (which is also a test for scalability) -> <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: [easy] Extend OSD storage space for "results" to make bug investigation and failure archeology ea... (Resolved)" href="https://progress.opensuse.org/issues/77890">#77890</a></li>
<li>Setup additional storage (server) for old results
<ul>
<li>Use overlayfs to make archived results appear alongside the regular results</li>
<li>Move old results at some point from the regular web UI host to the external storage (server), e.g. automatically via a Minion job: just move the resultdir to a different place and point to the new location, but keep the database entry itself and set a database flag "archived" or similar</li>
<li>Mark jobs with archived results as such</li>
</ul></li>
</ul>
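<p>The overlayfs idea above could look roughly like the following sketch; all paths are assumptions, not the actual osd layout, and mounting requires root. The archive mount serves as the read-only lower layer while the live results directory stays writable as the upper layer:</p>

```shell
# Hypothetical overlay mount: archived results (e.g. NFS-mounted from the
# storage server under /archive) appear alongside the regular results.
# upperdir and workdir must reside on the same filesystem.
mount -t overlay overlay \
    -o lowerdir=/archive/openqa/testresults,upperdir=/var/lib/openqa/testresults,workdir=/var/lib/openqa/.overlay-work \
    /srv/merged-testresults
```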
openQA Infrastructure - action #77890 (Resolved): [easy] Extend OSD storage space for "results" t... | https://progress.opensuse.org/issues/77890 | 2020-11-14T16:09:19Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>More and more products are tested and more and more tests are running on OSD, which increases the amount of results we need to store per time period. This in turn means that we need to restrict the time duration for which we can save results, making it harder to investigate product bugs that have been reported based on openQA test results, as well as test failures. As the new department QE was formed, the two biggest user groups of OSD are now joined in one department, which can make some decisions easier. It is a good opportunity to let QE management coordinate an increase of storage space for "results" with EngInfra.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Significant increase of storage space for "results" on OSD</li>
<li><strong>AC2:</strong> Job group result retention periods have been increased to make efficient use of the available storage space</li>
<li><strong>AC3:</strong> "results" on OSD still has enough headroom, e.g. only used up to 80-85%</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Suggest the proposal to QE mgmt</li>
<li>Ask QE mgmt to create an EngInfra ticket with a corresponding "budget allocation" or at least something like a "yes, we need it to run the business" :)</li>
<li>If that does not happen, create the EngInfra ticket yourself and ask QE mgmt for support after the fact</li>
</ul>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>In mid-2020 EngInfra was so kind to provide us a comparably big increase of storage space for "assets", which is on "cheap+slow" rotating-disk storage, so it was cheaper and hence easier for them to just give us the space (someone could have come to that conclusion earlier). "results", however, is on "expensive+fast" SSD-backed storage, so they are likely not as easy to convince. Still, it is likely cheaper to buy more enterprise SSD storage, even at 10x consumer prices, than to keep people busy rerunning tests in openQA or manually collecting logs just to find out what errors are about.</p>
openQA Project - coordination #76984 (New): [epic] Automatically remove assets+results based on a... | https://progress.opensuse.org/issues/76984 | 2020-11-04T16:10:15Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>See examples like <a class="issue tracker-4 status-3 priority-6 priority-high2 closed behind-schedule" title="action: Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents) (Resolved)" href="https://progress.opensuse.org/issues/76822">#76822</a>: openQA has automatic removal of assets+results, but the sum of all configured retention periods and asset quotas can still exceed the available space, so manual administration is required. In case the cleanup based on these parameters cannot free enough space, we should do the next step and remove more until we have enough free space again. We already do something similar in <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/etc/master/cron.d/SLES.CRON#L18">https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/etc/master/cron.d/SLES.CRON#L18</a> to remove videos of older test jobs, which we identified as a big contributor to space usage.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> the filesystem containing the openQA results directory is ensured to have at least a configured amount of free space</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Read and understand <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/etc/master/cron.d/SLES.CRON#L18">https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/etc/master/cron.d/SLES.CRON#L18</a></li>
<li>Extend the existing asset+result cleanup to
<ul>
<li>check the free space of the filesystem containing the assets/results directory</li>
<li>compare the free space against a configured value, e.g. in openqa.ini</li>
<li>if free space is still below the limit after the regular results cleanup, remove more data from results, re-checking the free space in each step until the limit is reached, e.g.
<ul>
<li>videos from oldest, non-important jobs first ("oldest first" can simply mean job id numbers in ascending order)</li>
<li>other results from oldest, non-important jobs</li>
<li>videos from oldest, important jobs</li>
<li>other results from oldest, important jobs</li>
</ul></li>
<li>if the free space limit could still not be reached after all steps, i.e. if all result data was removed, raise an error</li>
<li>the above order can be configured as well, e.g. "results_free_space_cleanup_components=non-important-results-videos,non-important-results-other,important-results-videos,important-results-other"</li>
</ul></li>
<li>can use <a href="https://software.opensuse.org/package/perl-Filesys-Df?search_term=perl-FileSys-Df">https://software.opensuse.org/package/perl-Filesys-Df?search_term=perl-FileSys-Df</a></li>
<li>can mock "df" in tests to simply give back what we want, e.g. "enough free space available" or "free space exceeded"</li>
<li>Optional: Extend to assets as well</li>
</ul>
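<p>The staged removal described above can be sketched as a loop that re-checks free space after every stage. The stage names follow the hypothetical results_free_space_cleanup_components value from the suggestions, and cleanup_stage is a placeholder, not real deletion logic:</p>

```shell
#!/bin/sh
# Hedged sketch of the staged cleanup loop; the actual removal is stubbed out.
free_percent() { df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print 100 - $5 }'; }
cleanup_stage() { echo "would remove: $1"; }   # placeholder for real removal

run_staged_cleanup() {
    dir=$1 limit=$2
    for stage in non-important-results-videos non-important-results-other \
                 important-results-videos important-results-other; do
        # stop as soon as the free-space limit is reached
        [ "$(free_percent "$dir")" -ge "$limit" ] && return 0
        cleanup_stage "$stage"
    done
    [ "$(free_percent "$dir")" -ge "$limit" ] || \
        echo "ERROR: free-space limit still not reached after all stages" >&2
}

run_staged_cleanup / 0                  # a limit of 0% free is trivially met
run_staged_cleanup / 101 2>/dev/null    # an unreachable limit walks all stages
```

<p>Keeping the stage order in one configurable list makes the escalation policy explicit and easy to adjust per instance.</p>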
<a name="Impact"></a>
<h2 >Impact<a href="#Impact" class="wiki-anchor">¶</a></h2>
<p>This can also greatly help us as administrators of osd to ensure that /results limits are not exceeded, which has repeatedly caused us additional administration work.</p>
<a name="Workaround"></a>
<h2 >Workaround<a href="#Workaround" class="wiki-anchor">¶</a></h2>
<p>Have a periodic job calling "df" and checking against the limit, removing results if it is exceeded</p>
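<p>Such a periodic job could be wired up as a cron entry (the script path and schedule are assumptions), similar in spirit to the SLES.CRON video cleanup referenced above:</p>

```shell
# /etc/cron.d/results-free-space (hypothetical): check hourly and remove
# old results when free space on /results drops below the configured limit
0 * * * * root /usr/local/bin/check-results-free-space
```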
openQA Infrastructure - action #76822 (Resolved): Fix /results over-usage on osd (was: sudden inc... | https://progress.opensuse.org/issues/76822 | 2020-10-30T22:52:25Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>see <a href="https://w3.nue.suse.com/~okurz/job_group_results_2020-10-30.png" class="external">https://w3.nue.suse.com/~okurz/job_group_results_2020-10-30.png</a> , there seems to be a very sudden increase in the job group "Maintenance: Test Repo/Maintenance: SLE 15 SP2 Updates". I wonder if someone changed result settings or if just many recent results have accumulated now. I will just monitor :)</p>
<p>EDIT: In 2020-11-04: we have seen an email alert from grafana for /results</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> /results is way below the alarm threshold again to have headroom for some weeks at least</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Review the trend of individual job groups</li>
<li>Reduce result retention periods after coordinating with job group stakeholders or owners</li>
</ul>
openQA Infrastructure - action #73183 (Resolved): Extend vdc (/assets) | https://progress.opensuse.org/issues/73183 | 2020-10-09T19:13:14Z | okurz (okurz@suse.com)
<p>In <a href="https://infra.nue.suse.com/SelfService/Display.html?id=178140" class="external">https://infra.nue.suse.com/SelfService/Display.html?id=178140</a> we asked SUSE IT to give us more storage, in particular linked to the need for SLE4SAP+HA. gschlotter did that, and we learned that vdc (/assets) is actually on cheap+slow storage (rotating disks).</p>
<p>Did <code>xfs_growfs /assets</code> now, which gave us 7 TB for /assets. Can adjust job group quotas and will monitor over the next weeks how the usage turns out.</p>