action #58805

[infra] Severe storage performance issue on openqa.suse.de workers

Added by MDoucha almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
Start date:
2019-10-29
Due date:
% Done:

0%

Estimated time:

Description

Last week on Thursday, a handful of tests in two LTP testsuites started timing out. I've initially reported it as a kernel performance regression: https://bugzilla.suse.com/show_bug.cgi?id=1155018

However, I tried to reproduce the problem on a released kernel version that did not show the issue 3 weeks ago, and succeeded: https://openqa.suse.de/tests/overview?build=15ga_mdoucha_bsc_1155018&version=15&distri=sle

This successful reproduction on a known good kernel indicates that the problem is somewhere in OpenQA infrastructure, possibly a bug introduced during the weekly deployment on Wednesday, October 23rd. The timeout continues to appear in kernel-of-the-day LTP tests: https://openqa.suse.de/tests/3533819#step/DOR000/7

Both PPC64LE and x86_64 are affected. Reproducibility on aarch64 and s390 is currently unknown because we don't run the affected testsuites on those two platforms. The failing tests mostly belong to the async & direct I/O stress testsuite.


Related issues

Related to openQA Infrastructure - action #20914: [tools] configure vm settings for workers with rotating discs (Resolved, 2017-07-28 to 2019-11-05)

Related to openQA Project - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results (Blocked, 2020-03-18 to 2021-10-05)

History

#1 Updated by MDoucha almost 2 years ago

  • Description updated (diff)

#2 Updated by MDoucha almost 2 years ago

  • Description updated (diff)

#3 Updated by coolo almost 2 years ago

https://openqa.suse.de/tests/3513596#step/DOR000/7 as given in chat is 6 days old, and the last deployment of os-autoinst on openqaworker13 is from Oct 16th (as of today 13 days old). The uptime is 74 days and there has been no qemu update since Sep 8th.

As such I have serious doubts that this is caused by infrastructure, especially as https://openqa.suse.de/tests/3529082#next_previous shows it failing 26 days ago as well: https://openqa.suse.de/tests/3426802#step/DOR000/7

#4 Updated by MDoucha almost 2 years ago

coolo wrote:

https://openqa.suse.de/tests/3513596#step/DOR000/7 as given in chat is 6 days old, and the last deployment of os-autoinst on openqaworker13 is from Oct 16th (as of today 13 days old). The uptime is 74 days and there has been no qemu update since Sep 8th.

As such I have serious doubts that this is caused by infrastructure, especially as https://openqa.suse.de/tests/3529082#next_previous shows it failing 26 days ago as well: https://openqa.suse.de/tests/3426802#step/DOR000/7

https://openqa.suse.de/tests/3426802#step/DOR000/7 is literally the only recorded case where DOR000 failed before last week (in QAM Single Incidents jobs). As such, I'm inclined to write that one off as a random hiccup which would normally be ignored. Here's a breakdown of DOR000 failures in the entire recorded history of QAM Single Incidents:

12SP1@x86_64: 0 since 2019-08-21
12SP2@x86_64: 1 since 2019-10-23, 0 more since 2019-08-23
12SP2@ppc64le: 6 since 2019-10-23, 0 more since 2019-08-23
12SP3@x86_64: 3 since 2019-10-23, 0 more since 2019-08-21
12SP3@ppc64le: 11 since 2019-10-23, 0 more since 2019-08-21
12SP4@x86_64: 4 since 2019-10-23, 0 more since 2019-08-01
12SP4@ppc64le: 12 since 2019-10-23, 1 more since 2019-08-01
15GA@x86_64: 2 since 2019-10-23, 0 more since 2019-10-01
15GA@ppc64le: 9 since 2019-10-23, 0 more since 2019-10-01
15SP1@x86_64: 5 since 2019-10-23, 0 more since 2019-08-21
15SP1@ppc64le: 6 since 2019-10-23, 0 more since 2019-08-21

The 13 extra failures reproduced today on the 3-week-old 15GA live patch 9 are not included. As you can see, the problem was almost non-existent before 2019-10-23. Now it reproduces regularly even on kernels originally tested on 2019-10-04, when just 1 failure was recorded out of 96 DOR000 tests.

#5 Updated by MDoucha almost 2 years ago

Some more relevant data:

  • all failing DOR000@PPC64LE jobs in the last 6 days ran on QA-Power8-4 or QA-Power8-5.
  • all passing DOR000@PPC64LE jobs in the last 6 days ran on powerqaworker-qam-1 and malbec
  • all failing DOR000@x86_64 jobs in the last 6 days ran on openqaworker3, openqaworker10 or openqaworker13
  • all passing DOR000@x86_64 jobs in the last 6 days ran on openqaworker2, openqaworker3, openqaworker5, openqaworker6, openqaworker7, openqaworker8, openqaworker9 or openqaworker10

openqaworker3 has 1 failure, otherwise passes
openqaworker10 has roughly half passes, half failures

#6 Updated by coolo almost 2 years ago

We will have to see how the performance problem caused by #58823 impacts this. Once we deploy a fix for that, the workers should be better off.

#7 Updated by coolo almost 2 years ago

This is a 'regression' caused by https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/203 - with this change we basically prefer reproducible failures over random timeouts. And that is what you see.

There is no good setting for this - so the fix is to avoid these slow disks for running I/O tests. tmpfs is no solution either due to out-of-disk errors, so we need to buy SSDs/NVMe drives :(

#8 Updated by coolo almost 2 years ago

  • Related to action #20914: [tools] configure vm settings for workers with rotating discs added

#9 Updated by coolo almost 2 years ago

I deployed a partial revert

#10 Updated by coolo almost 2 years ago

  • Status changed from New to Feedback

Hopefully fixed, but we need to keep monitoring.

#11 Updated by MDoucha almost 2 years ago

coolo wrote:

I deployed a partial revert

If you're trying to prevent random timeouts during VM shutdown caused by delayed I/O buffer writeback, wouldn't it be better to simply flush VM buffers between test modules using the QEMU commit command?
https://qemu.weilnetz.de/doc/qemu-doc.html#index-commit
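For reference, the HMP `commit` command (which flushes changes from a snapshot overlay into the backing file) can also be driven through QMP's `human-monitor-command`. A sketch of building that request; the surrounding socket wiring is omitted, and os-autoinst may talk to QEMU through a different channel:

```python
import json

def hmp_commit_request(device="all"):
    """Build the QMP message that wraps the HMP 'commit' command.

    Sent over the QMP socket (after the capabilities handshake),
    this is equivalent to typing 'commit all' at the QEMU monitor.
    """
    return json.dumps({
        "execute": "human-monitor-command",
        "arguments": {"command-line": f"commit {device}"},
    })
```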

#12 Updated by okurz almost 2 years ago

It might reduce the probability but cannot prevent other VMs from interfering. It also would not help within the "await_install" module, where a lot of I/O happens and the timeout is hit just after that, before the VM is able to sync to disk.

#13 Updated by okurz over 1 year ago

  • Related to coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results added

#14 Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved
  • Assignee set to okurz

I haven't heard anything critical regarding this for some time. I assume the issue as described here is fixed. Further research is conducted in #64746 as well.

I do not think we should "commit" between test modules, because especially between test modules we do not want to miss anything from the test flow. Rather, for await_install we could trigger an explicit commit exactly at the point when we have time, e.g. after having stopped at the "stop countdown" dialog. Besides, for the specific problem in await_install I have contributed a new feature to YaST, available at least in more recent product versions, that automatically stops at the end of the installation and waits for user confirmation.

For tmpfs workers we could use bcache/lvmcache in the corresponding tickets we have.
