openSUSE Project Management Tool (https://progress.opensuse.org/)
openQA Infrastructure - action #65202: o3 is very slow/unresponsive and runs into timeouts
https://progress.opensuse.org/issues/65202?journal_id=290400 2020-04-02T18:27:45Z okurz (okurz@suse.com)
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Blocked</i></li><li><strong>Assignee</strong> set to <i>mkittler</i></li></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/22072">@mkittler</a> as you are the reporter of the infra ticket you will be informed about updates so please track this ticket here as "Blocked" and update it accordingly when there are news from <a href="https://infra.nue.suse.com/SelfService/Display.html?id=166937" class="external">https://infra.nue.suse.com/SelfService/Display.html?id=166937</a></p>
openQA Infrastructure - action #65202: o3 is very slow/unresponsive and runs into timeouts
https://progress.opensuse.org/issues/65202?journal_id=290436 2020-04-03T04:56:13Z okurz (okurz@suse.com)
<ul></ul><p>I added information about the kernel versions that have been used in the SUSE IT ticket and also asked for help in <a href="https://chat.suse.de/channel/suse-it-ama?msg=hweMdK2ttE26nyxTc" class="external">https://chat.suse.de/channel/suse-it-ama?msg=hweMdK2ttE26nyxTc</a></p>
openQA Infrastructure - action #65202: o3 is very slow/unresponsive and runs into timeouts
https://progress.opensuse.org/issues/65202?journal_id=290439 2020-04-03T05:01:31Z okurz (okurz@suse.com)
<ul></ul><p>I also updated <a href="https://status.opensuse.org/admin" class="external">https://status.opensuse.org/admin</a> to reflect the current status.</p>
openQA Infrastructure - action #65202: o3 is very slow/unresponsive and runs into timeouts
https://progress.opensuse.org/issues/65202?journal_id=290445 2020-04-03T05:23:53Z okurz (okurz@suse.com)
<ul></ul><p>I will try an older kernel myself:</p>
<pre><code>zypper install --oldpackage kernel-default-4.12.14-lp151.28.36.1
grub2-set-default 'openSUSE, with Linux 4.12.14-lp151.28.36-default'
reboot
</code></pre>
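<p>Whether the pinned kernel actually came up can be checked after the reboot; a minimal sketch (the <code>grub2-editenv</code> call is shown commented out since it needs root, and the entry name is assumed to match the boot menu):</p>

```shell
# show the currently running kernel release
uname -r
# show the saved default boot entry (needs root; entry name must match the menu)
# grub2-editenv list
```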
<p>This did not work; the machine still booted the most recent kernel. Next, trying to make zypper preserve one more old kernel version first:</p>
<pre><code>sed -i -n 's/latest-1/&,latest-2/p' /etc/zypp/zypp.conf
zypper install --oldpackage kernel-default-4.12.14-lp151.28.36.1
grub2-set-default 'openSUSE, with Linux 4.12.14-lp151.28.36-default'
grub2-set-default 'openSUSE, with Linux 4.12.14-lp151.28.40-default'
reboot
</code></pre>
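<p>As a sanity check, the intended multiversion substitution can be tried on a sample line first (the config line shown is a typical default, assumed here rather than read from the real file):</p>

```shell
# typical zypp.conf multiversion line (assumed sample value)
line='multiversion.kernels = latest,latest-1,running'
# append latest-2 right after latest-1, as intended above
echo "$line" | sed 's/latest-1/&,latest-2/'
# prints: multiversion.kernels = latest,latest-1,latest-2,running
```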
<p>but the machine still boots into 4.12.14-lp151.28.44-default.</p>
<p>As the biggest I/O load seems to come from workers trying to update their cache, I am trying to reduce the pressure by temporarily stopping openQA worker instances on openqaworker4 plus the complete openqa-worker.target on openqaworker7:</p>
<pre><code>ssh root@openqaworker4 "systemctl stop openqa-worker@{5..20}"
ssh root@openqaworker7 "systemctl stop openqa-worker.target"
</code></pre>
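<p>Since the range sits inside the quotes, the brace expansion happens in the remote bash; what it expands to can be checked locally:</p>

```shell
# expand the same range locally to see which units would be stopped
bash -c 'echo openqa-worker@{5..20}'
# count them: instances 5 through 20, i.e. 16 units
bash -c 'echo openqa-worker@{5..20}' | wc -w
```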
<p>This is not boot-persistent, so the workers will be enabled again automatically at the latest after the next reboot.</p>
<p>Installed hdparm and ran <code>hdparm -tT /dev/vdb</code>:</p>
<pre><code> Timing cached reads: 6122 MB in 2.00 seconds = 3063.64 MB/sec
Timing buffered disk reads: 14 MB in 3.28 seconds = 4.26 MB/sec
</code></pre>
<p>In comparison osd:</p>
<pre><code>$ for i in a c e d ; do sudo hdparm -tT /dev/vd$i; done
/dev/vda:
Timing cached reads: 3968 MB in 2.00 seconds = 1986.07 MB/sec
Timing buffered disk reads: 368 MB in 3.03 seconds = 121.46 MB/sec
/dev/vdc:
Timing cached reads: 3486 MB in 2.00 seconds = 1744.09 MB/sec
Timing buffered disk reads: 214 MB in 3.02 seconds = 70.76 MB/sec
/dev/vde:
Timing cached reads: 4238 MB in 2.00 seconds = 2121.77 MB/sec
Timing buffered disk reads: 802 MB in 3.00 seconds = 267.22 MB/sec
/dev/vdd:
Timing cached reads: 4800 MB in 2.00 seconds = 2401.98 MB/sec
Timing buffered disk reads: 272 MB in 3.03 seconds = 89.69 MB/sec
</code></pre>
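<p>To compare such numbers over time it helps to strip the figures out of the <code>hdparm</code> output; a small sketch over a captured sample (the awk field positions are an assumption based on the output format shown above):</p>

```shell
# extract the MB/sec figures from captured hdparm output
sample=' Timing cached reads:   3968 MB in  2.00 seconds = 1986.07 MB/sec
 Timing buffered disk reads: 368 MB in  3.03 seconds = 121.46 MB/sec'
# the rate is the second-to-last field on each "Timing" line
echo "$sample" | awk '/Timing/ {print $(NF-1)}'
# prints:
# 1986.07
# 121.46
```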
<p>Cached reads show no problem, but buffered disk reads on o3 suffer a lot in comparison.</p>
openQA Infrastructure - action #65202: o3 is very slow/unresponsive and runs into timeouts
https://progress.opensuse.org/issues/65202?journal_id=290646 2020-04-03T13:36:59Z okurz (okurz@suse.com)
<ul></ul><p>Initially the machine was rebooted by gschlotter to bump the storage from 4 TB to 5 TB, which is certainly appreciated. gschlotter also kindly booted into the older kernel 4.12.14-lp151.28.36-default, but the symptoms stayed the same, so most likely we again need someone with access to the hypervisor and storage to track down the I/O regression.</p>
<p>Just for the sake of completeness, here are the openQA changes since the time of "last good", as determined from /var/log/zypp/history on o3:</p>
<pre><code>$ git log1 --no-merges b26b51beb..2e73b2a4a
d509e69f4 (okurz/enhance/simplify_systemd_enque) Provide enqueue convenience scripts rather than systemd
c4827b6a1 (okurz/enhance/doc_clock) docs: Mention the importance of synchronized clocks for remote worker machines
3804e375d docs: Update "Run workers" to not only mention "virtual machines" as backend
2f5d14b7f The job result is a string and not a number
b0370449f (Martchus/result-size-2) Account freed disk space when deleting logs
721d73a2c (Martchus/restart-error) Turn warning about missing assets when restarting jobs into error
3f161b61b (okurz/enhance/license) Harmonize License copyright format
4495ec42e (kalikiana/job_groups_validation_docs) Validate parameters passed to job group API routes
03b69edf0 (kalikiana/audit-log-unify) Unify single/multi audit event rendering
</code></pre>
<p>The only thing that I could see as remotely relevant for "I/O impact" would be "b0370449f (Martchus/result-size-2) Account freed disk space when deleting logs".</p>
<pre><code>[03/04/2020 15:58:27] <Martchus> okurz: You could revert it hot-patch-style but it is so unlikely that I wouldn't revert the change in the repo. And it is unlikely because the gru process didn't even show up in iotop while everything was slow yesterday. (And the results cleanup task was not running anyways.)
[03/04/2020 15:59:26] <Martchus> And the function this PR changes is to my knowledge only executed within the cleanup task (I could check that again).
[03/04/2020 16:18:01] <okurz> Martchus: ok. Wasn't sure if there is anything e.g. running on inserting results or so. Of course, if it's only in the non-permanent minion jobs even less likely. Also, the problem today was very prominent when effectively no jobs were yet running, only requests to assets from cache services. And most of the according processes are just I/O stuck. We just don't know what the underlying storage is doing. Maybe it is even
[03/04/2020 16:18:01] <okurz> doing something useful, i.e. full data integrity check after the volume was increased from 4TB to 5TB
</code></pre>
openQA Infrastructure - action #65202: o3 is very slow/unresponsive and runs into timeouts
https://progress.opensuse.org/issues/65202?journal_id=290655 2020-04-03T14:42:32Z okurz (okurz@suse.com)
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/290655/diff?detail_id=287478">diff</a>)</li></ul><p>Restarting incomplete jobs on o3 with:</p>
<pre><code>for i in $(ssh openqa.opensuse.org "sudo -u geekotest psql --no-align --tuples-only --command=\"select id from jobs where (clone_id is null and (reason ~ 'Service Unavailable' or reason ~ 'api failure') and t_finished >= '2020-04-02');\" openqa"); do
    openqa-client --host openqa.opensuse.org jobs/$i/restart post
done
</code></pre>
<p>Unfortunately I don't know the numbers from before, but here is what I currently get on the different devices attached to o3, measuring buffered disk reads again:</p>
<pre><code>$ avgtime -r 3 ssh o3 "sudo hdparm -t /dev/vd*"
</code></pre>
<p>/dev/vda: 110-130 MB/sec<br>
/dev/vdb: 15-26 MB/sec<br>
/dev/vdc: 25-35 MB/sec<br>
/dev/vdd: 51-159 MB/sec</p>
<p>and for osd:</p>
<p>/dev/vda: 130-190 MB/sec<br>
/dev/vdc: 150-190 MB/sec<br>
/dev/vdd: 70-130 MB/sec<br>
/dev/vde: 218-470 MB/sec</p>
<p>so osd is much faster in general, but it also seems that the performance of /dev/vdb is increasing even though the system should still be under comparable load. Running <code>for i in {1..1000}; do echo $i && hdparm -t /dev/vdb ; sleep 1; done</code> on o3 now.</p>
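<p>The repeated readings from such a loop can be averaged offline; a sketch assuming the MB/sec figures have already been extracted, one per line (sample values only):</p>

```shell
# average a list of throughput readings (sample values, one per line)
printf '15.2\n26.1\n19.8\n' \
  | awk '{ sum += $1; n++ } END { printf "%.1f MB/sec average over %d runs\n", sum/n, n }'
# prints: 20.4 MB/sec average over 3 runs
```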
<p>I started the additional worker instances on openqaworker4 again and also everything on openqaworker7. Immediately the jobs request a lot of data over the cache service, but the <code>hdparm</code> read running in parallel still does not drop below 20 MB/sec and stays at an average of 30-60 MB/sec, so much better than this morning.</p>
openQA Infrastructure - action #65202: o3 is very slow/unresponsive and runs into timeouts
https://progress.opensuse.org/issues/65202?journal_id=290685 2020-04-03T19:00:42Z okurz (okurz@suse.com)
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed" href="/issues/65283">action #65283</a>: many screenshots on o3 show no result, no popup, update of URL in address bar only flashes for a second</i> added</li></ul>
openQA Infrastructure - action #65202: o3 is very slow/unresponsive and runs into timeouts
https://progress.opensuse.org/issues/65202?journal_id=291069 2020-04-06T17:00:31Z okurz (okurz@suse.com)
<ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Resolved</i></li></ul><p>We closed <a href="https://infra.nue.suse.com/SelfService/Display.html?id=166937" class="external">https://infra.nue.suse.com/SelfService/Display.html?id=166937</a> after the performance improved again. It could very well be that the underlying NetApp had to reshuffle how the data is stored after the change of the virtual disk.</p>
<pre><code>[06/04/2020 10:16:29] <mkittler> Good that o3 is "fixed". It is a bit unsatisfying not to know what the problem was and that our investigation in the end didn't help at all.
[06/04/2020 10:18:28] <okurz> @mkittler I think we did good to identify that there is nothing *on* the machine that is impacting performance but it's on the outer layer of virtualization
</code></pre>