action #164427

Updated by okurz 3 days ago

## Observation 
 See #163592-53. https://suse.slack.com/archives/C02CANHLANP/p1721841092771439 
 > As there is really just 1 (!) job running right now on OSD and 8 scheduled I will use the opportunity to conduct another important load experiment for https://progress.opensuse.org/issues/163592 . Expect spotty responsiveness of OSD for the next hours 

 ``` 
 date -Is && time sudo nice -n 19 ionice -c 3 /usr/sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo --verbose --quiet-unsupported 
 ``` 
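While that command is running, the per-filesystem progress can be followed from a second shell, e.g. like this (a minimal sketch of the `lsof` check mentioned below, not verbatim from the original session):

```
# List the files the running fstrim process has open every 5 seconds;
# this reveals which mount point / backing device is being trimmed right now
sudo watch -n 5 'lsof -p "$(pidof fstrim)"'
```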

From the output of that command and `lsof -p $(pidof fstrim)` it looks like the mount points are trimmed one after another. /home and /space-slow finished within seconds; /results took long, but maybe that was because I was running "ab" for benchmarking at the same time. In "ab" I saw just normal response times. Pretty much as soon as I aborted "ab", fstrim continued, stating `/assets` as done and only some seconds later the remaining file systems:

 ``` 
 $ date -Is && time sudo nice -n 19 ionice -c 3 /usr/sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo --verbose --quiet-unsupported 
 2024-07-24T19:13:17+02:00 
 /home: 3.4 GiB (3630792704 bytes) trimmed on /srv/homes.img 
 /space-slow: 1.6 TiB (1796142841856 bytes) trimmed on /dev/vde 
 /results: 1.1 TiB (1231325605888 bytes) trimmed on /dev/vdd 
 /assets: 2.6 TiB (2893093855232 bytes) trimmed on /dev/vdc 
 /srv: 117.4 GiB (126101643264 bytes) trimmed on /dev/vdb 
 /: 7.4 GiB (7976497152 bytes) trimmed on /dev/vda1 

 real      37m36.323s 
 user      0m0.009s 
 sys       2m55.929s 
 ``` 

So overall that took about 37m. Let's see how long the same takes w/o me running "ab": that run finished within 24m but, interestingly enough, again with no significant effect on service availability. Let's see if without nice/ionice we can actually trigger unresponsiveness. Nope, I can't. So maybe after repeated runs the effect is not reproducible anymore for now.

Over the past 30 days on https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-30d&to=now&viewPanel=78&refresh=1m it is visible that the outage reproduces each Monday morning. With that I consider it safe to simply override the fstrim service and observe the effect in production. I called `systemctl edit fstrim` and added

 ``` 
 [Service] 
 IOSchedulingClass=idle 
 CPUSchedulingPolicy=idle 
 ``` 
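To double-check that the override is in effect one can dump the effective unit file, which should now include the drop-in created by `systemctl edit` (a minimal sketch; the drop-in path is the systemd default for such overrides):

```
# Show fstrim.service together with all drop-ins, e.g.
# /etc/systemd/system/fstrim.service.d/override.conf created above
systemctl cat fstrim.service
```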

 ## Suggestions 
* Monitor the effect of nice & ionice on fstrim, e.g. as sketched below
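
One way to do that (a minimal sketch, assuming fstrim is started by the usual weekly fstrim.timer on Monday mornings) is to check the scheduling classes of the process during the next run:

```
# Run while fstrim.service is active; both should reflect the override
ionice -p "$(pidof fstrim)"   # expected I/O class: idle
chrt -p "$(pidof fstrim)"     # expected policy: SCHED_IDLE
```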

 ## Rollback actions 
* In `/etc/openqa/openqa.ini` on OSD bump the reduced job limit back up from 330 to 420, or maybe settle on some middle ground between those limits (why would 420 not be supported anymore?); see the sketch after this list
 * Remove notification policy override in https://monitor.qa.suse.de/alerting/routes 
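
For illustration, a sketch of what that rollback could look like in the config file; the section and key name (`[scheduler]` / `max_running_jobs`) are an assumption about how the job limit is configured on OSD and need to be verified against the actual file:

```
# /etc/openqa/openqa.ini (sketch, assumed section/key names)
[scheduler]
# reduced to 330 during the load experiments for #163592;
# rollback means bumping it back to 420 (or a middle ground)
max_running_jobs = 420
```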
