Project

General

Profile

action #65202

o3 is very slow/unresponsive and runs into timeouts

Added by mkittler about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Start date:
2020-04-02
Due date:
% Done:

0%

Estimated time:

Description

The problem seems to be /dev/vdb which is performing very badly. From our side it seems like business as usual with no relevant changes within openQA so I created an infra ticket: https://infra.nue.suse.com/SelfService/Display.html?id=166937&results=318cb683a19b7861445d993a05756d17

The only thing we could still look into ourselves is the recently updated kernel.

update from okurz (2020-04-03)

The degraded performance started since the work on https://infra.nue.suse.com/SelfService/Display.html?id=164966 when the size of vdb was increased from 4TB to 5TB. We have as of time of writing (2020-04-03Z14:35) not yet resized the existing filesystem for /space to cover the complete space in case we need to revert to the older partition size. We do not know what the underlying storage is doing. Maybe it is even doing something useful, i.e. full data integrity check after the volume was increased from 4TB to 5TB.


Related issues

Related to openQA Project - action #65283: many screenshots on o3 show no result, no popup, update of URL in address bar only flashes for a secondResolved2020-04-03

History

#1 Updated by okurz about 2 years ago

  • Status changed from New to Blocked
  • Assignee set to mkittler

mkittler as you are the reporter of the infra ticket you will be informed about updates so please track this ticket here as "Blocked" and update it accordingly when there are news from https://infra.nue.suse.com/SelfService/Display.html?id=166937

#2 Updated by okurz about 2 years ago

I added information about the kernel versions that have been used in the SUSE IT ticket and also asked for help in https://chat.suse.de/channel/suse-it-ama?msg=hweMdK2ttE26nyxTc

#3 Updated by okurz about 2 years ago

I also updated https://status.opensuse.org/admin to reflect the current status.

#4 Updated by okurz about 2 years ago

I will try an older kernel myself:

zypper install --oldpackage kernel-default-4.12.14-lp151.28.36.1
grub2-set-default 'openSUSE, with Linux 4.12.14-lp151.28.36-default'
reboot

did not work, still booted most recent kernel. Trying first to preserve older version:

sed -i -n 's/latest-1/&,latest-2/p' /etc/zypp/zypp.conf
zypper install --oldpackage kernel-default-4.12.14-lp151.28.36.1
grub2-set-default 'openSUSE, with Linux 4.12.14-lp151.28.36-default'
grub2-set-default 'openSUSE, with Linux 4.12.14-lp151.28.40-default'
reboot

but the machine still reboots 4.12.14-lp151.28.44-default.

As it seems the biggest I/O load comes from workers trying to update their cache I try to reduce the pressure by temporarily stopping openqa worker instances on openqaworker4 plus the complete openqa-worker.target on openqaworker7:

ssh root@openqaworker4 "systemctl stop openqa-worker@{5..20}"
ssh root@openqaworker7 "systemctl stop openqa-worker.target"

this is not boot-persistent so will automatically enable again after next reboot the latest.

Installed hdparm and conducted hdparm -tT /dev/vdb:

 Timing cached reads:   6122 MB in  2.00 seconds = 3063.64 MB/sec
 Timing buffered disk reads:  14 MB in  3.28 seconds =   4.26 MB/sec

In comparison osd:

$ for i in a c e d ; do sudo hdparm -tT /dev/vd$i; done

/dev/vda:
 Timing cached reads:   3968 MB in  2.00 seconds = 1986.07 MB/sec
 Timing buffered disk reads: 368 MB in  3.03 seconds = 121.46 MB/sec

/dev/vdc:
 Timing cached reads:   3486 MB in  2.00 seconds = 1744.09 MB/sec
 Timing buffered disk reads: 214 MB in  3.02 seconds =  70.76 MB/sec

/dev/vde:
 Timing cached reads:   4238 MB in  2.00 seconds = 2121.77 MB/sec
 Timing buffered disk reads: 802 MB in  3.00 seconds = 267.22 MB/sec

/dev/vdd:
 Timing cached reads:   4800 MB in  2.00 seconds = 2401.98 MB/sec
 Timing buffered disk reads: 272 MB in  3.03 seconds =  89.69 MB/sec

No problems in cached reads but buffered disk reads suffer a lot on o3 in comparison.

#5 Updated by okurz about 2 years ago

Initially the machine was rebooted by gschlotter to bump up the storage from 4TB to 5TB which is certainly appreciated. gschlotter also kindly booted into older kernel
4.12.14-lp151.28.36-default but still same symptoms, so again most likely we need someone with access to the hypervisor and storage to find out the I/O regression.

Just for the sake of completeness, the openQA changes since the time of "last good", as found out from o3 /var/log/zypp/history:

$ git log1 --no-merges b26b51beb..2e73b2a4a
d509e69f4 (okurz/enhance/simplify_systemd_enque) Provide enqueue convenience scripts rather than systemd
c4827b6a1 (okurz/enhance/doc_clock) docs: Mention the importance of synchronized clocks for remote worker machines
3804e375d docs: Update "Run workers" to not only mention "virtual machines" as backend
2f5d14b7f The job result is a string and not a number
b0370449f (Martchus/result-size-2) Account freed disk space when deleting logs
721d73a2c (Martchus/restart-error) Turn warning about missing assets when restarting jobs into error
3f161b61b (okurz/enhance/license) Harmonize License copyright format
4495ec42e (kalikiana/job_groups_validation_docs) Validate parameters passed to job group API routes
03b69edf0 (kalikiana/audit-log-unify) Unify single/multi audit event rendering

the only thing that I could see remotely relevant for "I/O impact" would be "b0370449f (Martchus/result-size-2) Account freed disk space when deleting logs".

[03/04/2020 15:58:27] <Martchus> okurz: You could revert it hot-patch-style but it is so unlikely that I wouldn't revert the change in the repo. And it is unlikely because the gru process didn't even show up in iotop while everything was slow yesterday. (And the results cleanup task was not running anyways.)
[03/04/2020 15:59:26] <Martchus> And the function this PR changes is to my knowlege only executed within the cleanup task (I could check that again).
[03/04/2020 16:18:01] <okurz> Martchus: ok. Wasn't sure if there is anything e.g. running on inserting results or so. Of course, if it's only in the non-permanent minion jobs even less likely. Also, the problem today was very prominent when effectively no jobs were yet running, only requests to assets from cache services. And most of the according processes are just I/O stuck. We just don't know what the underlying storage is doing. Maybe it is even 
[03/04/2020 16:18:01] <okurz> doing something useful, i.e. full data integrity check after the volume was increased from 4TB to 5TB

#6 Updated by okurz about 2 years ago

  • Description updated (diff)

Restarting incomplete jobs on o3 with for i in $(ssh openqa.opensuse.org "sudo -u geekotest psql --no-align --tuples-only --command=\"select id from jobs where (clone_id is null and (reason ~ 'Service Unavailable' or reason ~ 'api failure') and t_finished >= '2020-04-02');\" openqa"); do openqa-client --host openqa.opensuse.org jobs/$i/restart post;done

Unfortunately I don't know the numbers from before but currently what I get on the different devices attached to o3, trying again measuring buffered disk reads:

$ avgtime -r 3 ssh o3 "sudo hdparm -t /dev/vd*"

/dev/vda: 110-130 MB/sec
/dev/vdb: 15-26 MB/sec
/dev/vdc: 25-35 MB/sec
/dev/vdd: 51-159 MB/sec

and for osd:

/dev/vda: 130-190 MB/sec
/dev/vdc: 150-190 MB/sec
/dev/vdd: 70-130 MB/sec
/dev/vde: 218-470 MB/sec

so osd is much faster in general but also it seems that the performance of /dev/vdb is increasing even though the system should still be under comparable load. Running for i in {1..1000}; do echo $i && hdparm -t /dev/vdb ; sleep 1; done now on o3.

I started the additional worker instances on openqaworker4 again and also everything on openqaworker7. Immediately the jobs request a lot of data over the cache service but still the hdparm read in parallel drops not below 20 MB/sec and stays at an average 30-6 MB/sec so much better than today in the morning.

#7 Updated by okurz about 2 years ago

  • Related to action #65283: many screenshots on o3 show no result, no popup, update of URL in address bar only flashes for a second added

#8 Updated by okurz about 2 years ago

  • Status changed from Blocked to Resolved

We closed https://infra.nue.suse.com/SelfService/Display.html?id=166937 after the performance improved again. Could very well be that the underlying netapp had to reshuffle how the data is stored after the change of the virtual disk.

[06/04/2020 10:16:29] <mkittler> Good that o3 is "fixed". It is a bit unsatisfying not to know what the problem was and that our investigation in the end didn't help at all.
[06/04/2020 10:18:28] <okurz> @mkittler I think we did good to identify that there is nothing *on* the machine that is impacting performance but it's on the outer layer of virtualization

Also available in: Atom PDF