action #164427


HTTP response alert every Monday 01:00 CET/CEST due to fstrim

Added by okurz 3 days ago. Updated 3 days ago.

Status:
Feedback
Priority:
Low
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-07-10
Due date:
2024-09-20 (Due in 55 days)
% Done:

0%

Estimated time:

Description

Observation

See #163592-53. https://suse.slack.com/archives/C02CANHLANP/p1721841092771439

As there is really just 1 (!) job running right now on OSD and 8 scheduled, I will use the opportunity to conduct another important load experiment for https://progress.opensuse.org/issues/163592 . Expect spotty responsiveness of OSD for the next few hours.

date -Is && time sudo nice -n 19 ionice -c 3 /usr/sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo --verbose --quiet-unsupported

From the output of that command and from lsof -p $(pidof fstrim) it looks like the mount points are trimmed one after another. /home and /space-slow finished within seconds; /results took long, but maybe that was because I was running "ab" for benchmarking at the same time. In "ab" I saw just normal response times. Pretty much as soon as I aborted "ab", fstrim continued and reported /assets as finished, and only some seconds later the other file systems:

$ date -Is && time sudo nice -n 19 ionice -c 3 /usr/sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo --verbose --quiet-unsupported
2024-07-24T19:13:17+02:00
/home: 3.4 GiB (3630792704 bytes) trimmed on /srv/homes.img
/space-slow: 1.6 TiB (1796142841856 bytes) trimmed on /dev/vde
/results: 1.1 TiB (1231325605888 bytes) trimmed on /dev/vdd
/assets: 2.6 TiB (2893093855232 bytes) trimmed on /dev/vdc
/srv: 117.4 GiB (126101643264 bytes) trimmed on /dev/vdb
/: 7.4 GiB (7976497152 bytes) trimmed on /dev/vda1

real    37m36.323s
user    0m0.009s
sys     2m55.929s

So overall that took 37m. Let's see how long the same takes without me running "ab". That second run finished within 24m, but interestingly enough with no significant effect on service availability. Let's see if we can actually trigger unresponsiveness without nice/ionice. Nope, I can't. So maybe after repeated runs the effect is not reproducible anymore for now.

Over the past 30 days on https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-30d&to=now&viewPanel=78&refresh=1m it is visible that the outage reproduces each Monday morning. With that I consider that we can simply override the fstrim service and see the effect in production. I called "systemctl edit fstrim" and added

[Service]
IOSchedulingClass=idle
CPUSchedulingPolicy=idle
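
For reference, "systemctl edit" writes that override into a drop-in file; a sketch of the resulting drop-in, assuming the default path systemd uses for fstrim:

```
# /etc/systemd/system/fstrim.service.d/override.conf
# Drop-in created by "systemctl edit fstrim"; only these keys are overridden.
[Service]
# Only grant I/O time when no other process needs the disk
IOSchedulingClass=idle
# Only schedule the process on the CPU when it would otherwise be idle
CPUSchedulingPolicy=idle
```

The effective settings can be verified with "systemctl cat fstrim.service" or "systemctl show fstrim.service -p IOSchedulingClass -p CPUSchedulingPolicy".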

Suggestions

  • Monitor the effect of nice&ionice on fstrim

Rollback actions

  • In /etc/openqa/openqa.ini on OSD bump the reduced job limit again from 330 to 420 (or maybe settle on some middle ground between those limits?) (why would 420 not be supported anymore?)
  • Remove notification policy override in https://monitor.qa.suse.de/alerting/routes

Related issues: 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M (Resolved, okurz, 2024-07-10)

Actions #1

Updated by okurz 3 days ago

  • Copied from action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
Actions #2

Updated by okurz 3 days ago

  • Description updated (diff)