action #96554: Mitigate on-going disk I/O alerts size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

action #96554

closed

coordination #96447: [epic] Failed systemd services and job age alerts

Mitigate on-going disk I/O alerts size:M

Added by livdywan over 3 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

mkittler

Category:

Target version:

openQA Project (public) - Ready

Start date:

2021-08-04

Due date:

% Done:

Estimated time:

Description

Observation¶

Alerts Disk I/O time for /dev/vdd (/results)

Suggestion¶

Bump our thresholds
Monitor the systemd journal (while the alert is running)
Watch htop activity
Observe team/squad channels

Related issues 4 (2 open — 2 closed)

Related to openQA Project (public) - action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry

Resolved

mkittler

2021-08-04

2021-08-19

Actions

Related to openQA Infrastructure (public) - action #96807: Web UI is slow and Apache Response Time alert got triggered

Resolved

okurz

2021-08-12

2021-10-01

Actions

Copied to openQA Infrastructure (public) - action #97409: Re-use existing filesystems on workers after reboot if possible to prevent full worker asset cache re-syncing

New

Actions

Copied to openQA Infrastructure (public) - action #97412: Reduce I/O load on OSD by using more cache size on workers with using free disk space when available instead of hardcoded space

New

Actions

Copy link

Updated by livdywan over 3 years ago

Status changed from Workable to In Progress
Assignee set to mkittler

Actions

Copy link

Updated by mkittler over 3 years ago

It is more likely that the high number of incompletes we recently saw is caused by the deployment http://mailman.suse.de/mlarch/SuSE/openqa/2021/openqa.2021.08/msg00001.html because the alert never lasted very long.

It occurred just now (2021-08-04T13:01 to 2021-08-04T13:04) again (for /dev/vdc) for 3 minutes. I wasn't fast enough to check htop but the journal doesn't contain anything helpful. The dmesg log also doesn't have anything special in it:

2021-08-04T12:58:47,908296+02:00 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000011
2021-08-04T12:58:47,910678+02:00 RBP: 00007ffebec14690 R08: 0000000000000003 R09: 00007ffebec14570
2021-08-04T12:58:47,913060+02:00 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
2021-08-04T12:58:47,915242+02:00 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
2021-08-04T13:04:48,554468+02:00 warn_alloc: 1 callbacks suppressed
2021-08-04T13:04:48,554475+02:00 vsftpd: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
2021-08-04T13:04:48,561233+02:00 CPU: 3 PID: 4197 Comm: vsftpd Not tainted 5.3.18-lp152.84-default #1 openSUSE Leap 15.2
2021-08-04T13:04:48,563913+02:00 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011

Btw, when checking htop while the alert is not active the I/O traffic is still constantly high. For instance seeing at least one Apache worker with ~60 M/s seems normal. Sometimes there's an occasional peak where an Apache worker shows ~3 G/s for an Apache worker but it is usually not for a very long time. The top openQA prefork workers usually take around 1 to 4 M/s.

Actions

Copy link

Updated by openqa_review over 3 years ago

Due date set to 2021-08-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

Updated by livdywan over 3 years ago

Seems to be getting worse. Disk I/O time for /dev/vdc (/assets) was alerting for 11 minutes again.

Actions

Copy link

Updated by mkittler over 3 years ago

By the way, we're measuring the read_time/write_time here (and read_time seems to be the problem considering the alert history). From the documentation (https://github.com/influxdata/telegraf/tree/master/plugins/inputs/diskio#read_time--write_time):

These values count the number of milliseconds that I/O requests have waited on this block device. If there are multiple I/O requests waiting, these values will increase at a rate greater than 1000/second; for example, if 60 read requests wait for an average of 30 ms, the read_time field will increase by 60*30 = 1800.

If /dev/vdc is too slow then asset downloads might time out and that's exactly what we see in issue #96557. This only raises the question why only a few workers were affected. Maybe it really takes a high number of worker slots and jobs with big assets for this to have an impact (in terms of jobs exceeding the setup timeout)? (At least the table #96557#note-19 would allow this conclusion.)

Actions

Copy link

Updated by mkittler over 3 years ago

Related to action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry added

Actions

Copy link

Updated by mkittler over 3 years ago

The comment #96557#note-24 contains some investigation regarding /dev/vdc. At least currently the performance doesn't look bad.

Actions

Copy link

Updated by livdywan over 3 years ago

Disk I/O alerts 10.09-11.02, 9.45-10.01, 9.24-?, 8.43-9.03, 8.05-8.32, 7.44-7.46, 6.03-6.25, 5.43-5.45, 5.33-5.35, 5.13-5.17 (all CEST)

See also https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=47&orgId=1

There was also CPU Load 6.09-6.29, 5.44-5.46, 5.36-5.37, 5.20-5.22 (all CEST) correlating 🤔️

Actions

Copy link

Updated by mkittler over 3 years ago

This time the alert is still active. It is about the read performance of the device we're storing assets on. It doesn't have any impact so far.

Actions

Copy link

#10

Updated by mkittler over 3 years ago

There was also CPU Load … correlating

Now the CPU load alert was even shortly firing.

Maybe the number of incompletes we're currently seeing (https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=14&orgId=1&from=1628546400000&to=now) are related (#96710).

Actions

Copy link

#11

Updated by mkittler over 3 years ago

The SR to increase the threshold for the assets disk has been merged. Let's see whether this was sufficient.

I've been creating another SR to increase the threshold for the results disk as we've seen an alert about it yesterday: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/545
It looks like the threshold there was actually accidentally quite low.

Actions

Copy link

#12

Updated by mkittler over 3 years ago

Status changed from In Progress to Feedback

The 2nd SR has been merged as well.

Actions

Copy link

#13

Updated by livdywan over 3 years ago

https://github.com/os-autoinst/scripts/pull/99 which adds grep timeouts was merged

Actions

Copy link

#14

Updated by mkittler over 3 years ago

The alert happened again two times for the assets read I/O time last weekend. It could just be a symptom of having an unusual amount of scheduled jobs due to other issues so I'm currently hesitant to bump the threshold again.

Actions

Copy link

#15

Updated by okurz over 3 years ago

please pause the alerts until we found mitigations.
I monitored data on monitor.qa.suse.de even though there was right now no alert triggering. "disk I/O time for /space-slow" was at 40s for nearly 5 minutes. disk i/o requests and bytes spiked at the same time so it was not like the storage stalled.
It coincides with an increase in average ping time of multiple workers and an increase in the number of running jobs so very likely just many jobs starting and requesting assets, nothing wrong with that. I am sure we could spread the load a bit if we manage to have less assets that need to be downloaded. One thing I could think of is to try to reuse the cache after reboot of workers instead of recreating the filesystem everytime and another idea is to increase the cache size, e.g. use all available space instead of the artifical limits of the cache directory.

Actions

Copy link

#16

Updated by livdywan over 3 years ago

Related to action #96807: Web UI is slow and Apache Response Time alert got triggered added

Actions

Copy link

#17

Updated by mkittler over 3 years ago

Ok, I've been pausing the alert for /dev/vdd. I just kept in on because it was nice to see how the situation develops without having to check manually all the time.

One thing I could think of is to try to reuse the cache after reboot of workers instead of recreating the filesystem everytime

That would make sense and would certainly help in cases when multiple workers are rebooted at the same time. Maybe that means we should migrate away from ext2, though.

another idea is to increase the cache size, e.g. use all available space instead of the artifical limits of the cache directory

A good idea. The worker cache could just ensure a certain percentage of free disk space in the file system.

Actions

Copy link

#18

Updated by mkittler over 3 years ago

With https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/545 and https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/542 I made improvements for vdc and vdd so I'm resuming the alert for vdd again. (All other alerts aren't paused anyways.) I've also been creating a similar fix for vde: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/562

Actions

Copy link

#19

Updated by okurz over 3 years ago

Copied to action #97409: Re-use existing filesystems on workers after reboot if possible to prevent full worker asset cache re-syncing added

Actions

Copy link

#20

Updated by okurz over 3 years ago

Copied to action #97412: Reduce I/O load on OSD by using more cache size on workers with using free disk space when available instead of hardcoded space added

Actions

Copy link

#21

Updated by mkittler over 3 years ago

Status changed from Feedback to Resolved

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/562 has been merged and deployed. @okurz created tickets for further improvements. So I'm considering this ticket resolved.

Actions

Copy link

#22

Updated by okurz over 3 years ago

Due date changed from 2021-08-19 to 2021-09-10
Status changed from Resolved to Feedback

new alert from yesterday evening: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=57&orgId=1&from=1630005418432&to=1630024386708

Actions

Copy link

#23

Updated by mkittler over 3 years ago

Not sure what to do except for increasing the threshold: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/566

Actions

Copy link

#24

Updated by okurz over 3 years ago

Due date deleted (~~2021-09-10~~)
Status changed from Feedback to Resolved

MR merged. Currently there is no alert. Same as in before because we already have separate improvement tickets we can resolve.

Actions

Copy link

#25

Updated by tinita over 3 years ago

Status changed from Resolved to Workable
Assignee deleted (~~mkittler~~)

The alert was triggered again this morning:
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=48&orgId=1&from=1634589539385&to=1634626591378

Actions

Copy link

#26

Updated by livdywan over 3 years ago

There was an MR bumping the thresholds from 10 to 20s - maybe that already covers this? https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/608/diffs

Actions

Copy link

#27

Updated by okurz over 3 years ago

Careful. This ticket was one of the things that we identified went wrong regarding qcow compress. So bumping alert thresholds without a good explanation why we think this is fine is what we decided we want to avoid

Actions

Copy link

#28

Updated by okurz over 3 years ago

Assignee set to mkittler

@mkittler as you requested, assigning to you again to crosscheck.

Actions

Copy link

#29

Updated by mkittler over 3 years ago

Status changed from Workable to Resolved

Yes, the mentioned SR should fix this. It got only merged after the alert happened.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #96554

Mitigate on-going disk I/O alerts size:M

Observation¶

Suggestion¶

Updated by livdywan over 3 years ago

Updated by mkittler over 3 years ago

Updated by openqa_review over 3 years ago

Updated by livdywan over 3 years ago

Updated by mkittler over 3 years ago

Updated by mkittler over 3 years ago

Updated by mkittler over 3 years ago

Updated by livdywan over 3 years ago

Updated by mkittler over 3 years ago

Updated by mkittler over 3 years ago

Updated by mkittler over 3 years ago

Updated by mkittler over 3 years ago

Updated by livdywan over 3 years ago

Updated by mkittler over 3 years ago

Updated by okurz over 3 years ago

Updated by livdywan over 3 years ago

Updated by mkittler over 3 years ago

Updated by mkittler over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by mkittler over 3 years ago

Updated by okurz over 3 years ago

Updated by mkittler over 3 years ago

Updated by okurz over 3 years ago

Updated by tinita over 3 years ago

Updated by livdywan over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by mkittler over 3 years ago