action #58805
closed [infra] Severe storage performance issue on openqa.suse.de workers
Description
Last week on Thursday, a handful of tests in two LTP testsuites started timing out. I initially reported it as a kernel performance regression: https://bugzilla.suse.com/show_bug.cgi?id=1155018
However, I then tried to reproduce the problem on a released kernel version which did not show the issue 3 weeks ago, and succeeded: https://openqa.suse.de/tests/overview?build=15ga_mdoucha_bsc_1155018&version=15&distri=sle
This successful reproduction on a known-good kernel indicates that the problem is somewhere in the openQA infrastructure, possibly a bug introduced during the weekly deployment on Wednesday, October 23rd. The timeout continues to appear in kernel-of-the-day LTP tests: https://openqa.suse.de/tests/3533819#step/DOR000/7
Both ppc64le and x86_64 are affected. Reproducibility on aarch64 and s390 is currently unknown because we don't run the affected testsuites on those two platforms. The failing tests mostly belong to the async & direct I/O stress testsuite.
Updated by coolo over 4 years ago
https://openqa.suse.de/tests/3513596#step/DOR000/7 as given in chat is 6 days old, and the last deployment of os-autoinst on openqaworker13 is from Oct 16th (as of today, 13 days old). The uptime is 74 days and there has been no qemu update since Sep 8th.
As such I have serious doubts that this is caused by infrastructure, especially as https://openqa.suse.de/tests/3529082#next_previous shows it also occurred 26 days ago: https://openqa.suse.de/tests/3426802#step/DOR000/7
Updated by MDoucha over 4 years ago
coolo wrote:
https://openqa.suse.de/tests/3513596#step/DOR000/7 as given in chat is 6 days old, and the last deployment of os-autoinst on openqaworker13 is from Oct 16th (as of today, 13 days old). The uptime is 74 days and there has been no qemu update since Sep 8th.
As such I have serious doubts that this is caused by infrastructure, especially as https://openqa.suse.de/tests/3529082#next_previous shows it also occurred 26 days ago: https://openqa.suse.de/tests/3426802#step/DOR000/7
https://openqa.suse.de/tests/3426802#step/DOR000/7 is literally the only recorded case where DOR000 failed before last week (in QAM Single Incidents jobs). As such, I'm inclined to write that one off as a random hiccup which would normally be ignored. Here's a breakdown of DOR000 failures in the entire recorded history of QAM Single Incidents:
12SP1@x86_64: 0 since 2019-08-21
12SP2@x86_64: 1 since 2019-10-23, 0 more since 2019-08-23
12SP2@ppc64le: 6 since 2019-10-23, 0 more since 2019-08-23
12SP3@x86_64: 3 since 2019-10-23, 0 more since 2019-08-21
12SP3@ppc64le: 11 since 2019-10-23, 0 more since 2019-08-21
12SP4@x86_64: 4 since 2019-10-23, 0 more since 2019-08-01
12SP4@ppc64le: 12 since 2019-10-23, 1 more since 2019-08-01
15GA@x86_64: 2 since 2019-10-23, 0 more since 2019-10-01
15GA@ppc64le: 9 since 2019-10-23, 0 more since 2019-10-01
15SP1@x86_64: 5 since 2019-10-23, 0 more since 2019-08-21
15SP1@ppc64le: 6 since 2019-10-23, 0 more since 2019-08-21
The 13 extra failures reproduced today on the 3-week-old 15GA live patch 9 are not included. As you can see, the problem was almost non-existent before 2019-10-23. Now it reproduces regularly even on kernels originally tested on 2019-10-04, when just 1 failure was recorded out of 96 DOR000 tests.
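The asymmetry is easier to see when the counts are totalled; a quick sketch summing the breakdown above (the numbers are copied verbatim from the list, the dict layout is just for illustration):

```python
# DOR000 failure counts per product@arch, copied from the breakdown above:
# (failures since 2019-10-23, earlier failures in the recorded window)
dor000_failures = {
    "12SP1@x86_64": (0, 0),
    "12SP2@x86_64": (1, 0),
    "12SP2@ppc64le": (6, 0),
    "12SP3@x86_64": (3, 0),
    "12SP3@ppc64le": (11, 0),
    "12SP4@x86_64": (4, 0),
    "12SP4@ppc64le": (12, 1),
    "15GA@x86_64": (2, 0),
    "15GA@ppc64le": (9, 0),
    "15SP1@x86_64": (5, 0),
    "15SP1@ppc64le": (6, 0),
}

since_cutoff = sum(after for after, _ in dor000_failures.values())
before_cutoff = sum(before for _, before in dor000_failures.values())
print(since_cutoff, before_cutoff)  # 59 failures after 2019-10-23 vs. 1 before
```

That is 59 failures in under a week against a single failure in the months of recorded history before the cutoff date.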
Updated by MDoucha over 4 years ago
Some more relevant data:
- all failing DOR000@ppc64le jobs in the last 6 days ran on QA-Power8-4 or QA-Power8-5
- all passing DOR000@ppc64le jobs in the last 6 days ran on powerqaworker-qam-1 or malbec
- all failing DOR000@x86_64 jobs in the last 6 days ran on openqaworker3, openqaworker10 or openqaworker13
- all passing DOR000@x86_64 jobs in the last 6 days ran on openqaworker2, openqaworker3, openqaworker5, openqaworker6, openqaworker7, openqaworker8, openqaworker9 or openqaworker10
openqaworker3 has 1 failure, otherwise passes
openqaworker10 has roughly half passes, half failures
Updated by coolo over 4 years ago
We will have to see how the performance problem caused by #58823 impacts this. Once we deploy a fix for that, the workers should be better off.
Updated by coolo over 4 years ago
This is a 'regression' caused by https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/203 - with this change we basically prefer reproducible failures over random timeouts, and that is what you are seeing.
There is no good setting for this, so the fix is to avoid these slow disks for running I/O tests. tmpfs is no solution either due to out-of-disk errors, so we need to buy SSDs/NVMe drives :(
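One way to implement "avoid these slow disks for I/O tests" would be for a worker to check whether its pool device is rotating media before accepting I/O-stress jobs. The sysfs flag read below is standard Linux; the worker-class integration in the trailing comment is purely hypothetical, not actual openQA code:

```python
from pathlib import Path


def is_rotational(device: str, sysfs_root: str = "/sys/block") -> bool:
    """True if the kernel reports the block device as rotating media (HDD).

    Reads the standard Linux sysfs flag, e.g. /sys/block/sda/queue/rotational,
    which is "1" for spinning disks and "0" for SSDs/NVMe.
    """
    flag = Path(sysfs_root) / device / "queue" / "rotational"
    return flag.read_text().strip() == "1"


# Hypothetical worker-side usage: drop an I/O-stress worker class on HDDs.
# if is_rotational("sda"):
#     worker_classes.discard("io_stress")
```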
Updated by coolo over 4 years ago
- Related to action #20914: [tools] configure vm settings for workers with rotating discs added
Updated by coolo over 4 years ago
- Status changed from New to Feedback
Hopefully fixed, but we need to keep monitoring.
Updated by MDoucha over 4 years ago
coolo wrote:
I deployed a partial revert
If you're trying to prevent random timeouts during VM shutdown due to delayed I/O buffer writeback, wouldn't it be better to simply flush VM buffers between test modules using the commit QEMU command?
https://qemu.weilnetz.de/doc/qemu-doc.html#index-commit
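For reference, QEMU exposes HMP commands such as commit through QMP via the standard human-monitor-command call. A sketch of what sending "commit all" over a QMP UNIX socket could look like; the socket path and any integration with os-autoinst are assumptions, not actual code from the project:

```python
import json
import socket


def hmp_via_qmp(command_line: str) -> bytes:
    """Wrap an HMP (human monitor) command in the QMP envelope.

    'human-monitor-command' is a standard QMP command; 'commit all' asks QEMU
    to commit pending overlay/snapshot-mode changes down to the backing image.
    """
    return json.dumps({
        "execute": "human-monitor-command",
        "arguments": {"command-line": command_line},
    }).encode()


def qmp_commit_all(sock_path: str) -> None:
    """Send 'commit all' to a running QEMU via its QMP socket (path assumed)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.recv(4096)  # QMP greeting banner
        s.sendall(json.dumps({"execute": "qmp_capabilities"}).encode())
        s.recv(4096)  # capabilities acknowledgement
        s.sendall(hmp_via_qmp("commit all"))
        s.recv(4096)  # command result
```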
Updated by okurz over 4 years ago
That might reduce the probability, but it cannot prevent other VMs from interfering, nor does it help within the "await_install" module where a lot of I/O happens and we need to abort the timeout just after that – before being able to sync to disk.
Updated by okurz about 4 years ago
- Related to coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results added
Updated by okurz about 4 years ago
- Status changed from Feedback to Resolved
- Assignee set to okurz
I haven't heard anything critical regarding this for some time, so I assume the issue as described here is fixed. Further research is conducted in #64746 as well. I do not think we should "commit" between test modules, because especially between test modules we do not want to miss anything from the test flow scope. Rather, for await_install we could trigger an explicit commit exactly at the point when we have time, e.g. after having stopped at the "stop countdown" dialog. Besides, for the specific problem in await_install I have contributed a new feature in YaST, available in at least the more recent product versions, that automatically stops at the end of installation waiting for user confirmation. For the tmpfs workers we could use bcache/lvmcache in the corresponding tickets we have.