coordination #99579: [epic][retro] Follow-up to "Published QCOW images appear to be uncompressed" - openQA Project (public) - openSUSE Project Management Tool

coordination #99579

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic][retro] Follow-up to "Published QCOW images appear to be uncompressed"

Added by okurz over 3 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

okurz

Category:

Organisational

Target version:

Ready

Start date:

2021-10-01

Due date:

% Done:

100%

Estimated time:

(Total: 0.00 h)

Description

Motivation

In #99246 mdoucha thankfully identified a big performance regression caused by https://github.com/os-autoinst/os-autoinst/pull/1699/commits/eb207de0a372d832a60a081dd08dc674c90ef950 . After that very specific bug report we deployed a fix to openqa.suse.de within 2 hours, which is very quick. But before that we had nearly two months of vague issues: user reports about reduced performance and multiple alerts related to high CPU time, high I/O pressure, long test runtimes and long test schedule queues.

For example:

Looking at https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=47&orgId=1&from=1626435701542&to=1632687629034 it does indeed look like the Disk I/O times in 2021-07 were significantly lower than in 2021-08.
In late 2021-08 and 2021-09 there were multiple Disk I/O related alerts but no relevant follow-up was conducted. For me this is another reminder that we should diligently act on alerts and try really hard to understand the reasons for any failure.

Acceptance criteria

AC1: A Five-Whys analysis has been conducted and results documented
AC2: Improvements are planned

Suggestions

Bring up in retro
Conduct "Five-Whys" analysis for the topic
Identify follow-up tasks in tickets

Five Whys

Why did we not prevent the merge of the PR?

e.g. increase code coverage in https://app.codecov.io/gh/os-autoinst/os-autoinst/ , especially https://app.codecov.io/gh/os-autoinst/os-autoinst/tree/master/OpenQA/Qemu
Add specific test checking that qcow images are compressed (and have a test for published HDD images)
Potentially extend end-to-end tests, e.g. openQA-in-openQA, could use published images (and sanity check them)
We should use Perl signatures everywhere -> #99660
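
A minimal sketch of what a "qcow images are compressed" check could look like, not the test that was actually added. It assumes that `qemu-img check --output=json` reports `allocated-clusters` and `compressed-clusters` for qcow2 images; the function names and the `min_ratio` threshold are hypothetical and would need tuning against real published HDD images.

```python
import json
import subprocess


def compressed_ratio(check_report: dict) -> float:
    """Fraction of allocated clusters that are compressed.

    `check_report` is the parsed JSON of `qemu-img check --output=json`
    (field names are an assumption about that output).
    """
    allocated = check_report.get("allocated-clusters", 0)
    compressed = check_report.get("compressed-clusters", 0)
    return compressed / allocated if allocated else 0.0


def image_is_compressed(path: str, min_ratio: float = 0.5) -> bool:
    """Run qemu-img on `path` and compare against a hypothetical threshold."""
    # check=True makes any non-zero exit status (e.g. a damaged image) raise
    # instead of being silently treated as "uncompressed".
    result = subprocess.run(
        ["qemu-img", "check", "--output=json", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return compressed_ratio(json.loads(result.stdout)) >= min_ratio
```

Keeping the ratio computation separate from the `qemu-img` call means the logic can be unit-tested without a real image, while an end-to-end test such as openQA-in-openQA could call `image_is_compressed` on an image it just published.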

Why could we not link problems to the code change immediately after deployment?

Monitor mean value of asset sizes
Monitor mean value of job completion times
Deploy synchronously after every merged pull request so that alerts can be made stricter and are more likely to be linked to the smaller deployed changes
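
The idea of monitoring mean values around a deployment can be sketched as a simple before/after comparison. This is only an illustration under stated assumptions: the samples (asset sizes or job completion times) are assumed to be already collected per deployment window, and `max_increase` is a hypothetical threshold, not an agreed alert level.

```python
from statistics import mean


def relative_mean_change(before: list[float], after: list[float]) -> float:
    """Relative change of the mean from `before` to `after` (0.25 means +25%)."""
    if not before or not after:
        raise ValueError("need at least one sample on each side of the deployment")
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("baseline mean is zero, relative change is undefined")
    return (mean(after) - baseline) / baseline


def regressed(before: list[float], after: list[float], max_increase: float = 0.2) -> bool:
    """True if the mean grew by more than `max_increase` (hypothetical threshold)."""
    return relative_mean_change(before, after) > max_increase
```

The smaller the deployed change, the shorter the list of commits to inspect when `regressed` turns true, which is the main argument for deploying after every merged pull request.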

Why did we not link the I/O alerts to the deployed change?

Distracted by multiple network problems and increased load due to crashing workers ->
Deploying more often should help
Network performance monitoring
Have a "test-openQA-job" that we use as reference, e.g. the openQA-in-openQA tests or the os-autoinst full-stack test, and check its runtime
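
One possible way to check the runtime of such a reference job is to compare the latest run against the median of earlier runs, using the median absolute deviation (MAD) as a robust measure of normal jitter. This is a sketch, not an existing openQA feature; `tolerance` and the 1% floor on the spread are hypothetical values.

```python
from statistics import median


def runtime_is_suspicious(history: list[float], latest: float, tolerance: float = 3.0) -> bool:
    """True if `latest` is slower than the historical median by more than
    `tolerance` times the median absolute deviation of `history`."""
    if len(history) < 3:
        raise ValueError("need at least three earlier runtimes as reference")
    center = median(history)
    mad = median(abs(x - center) for x in history)
    # Hypothetical floor: avoid flagging tiny deviations when all earlier
    # runs happened to take almost exactly the same time.
    spread = max(mad, 0.01 * center)
    return latest - center > tolerance * spread
```

Only slowdowns are flagged, since a faster run is not what this retro is worried about. Median and MAD are used instead of mean and standard deviation so that a single earlier outlier, e.g. from a crashing worker, does not widen the accepted range.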

Why could we not link multiple user reports to the alerts (mentioned above)?

User reports did not tell us more than what we should have seen from monitoring data, but they confirmed the presence of the issue, which we should have linked to the alerts. For this it's good to have tickets.
Look up whether the alert levels for "Disk I/O time for /dev/vdc (/assets)" have been bumped to very high numbers for good reason, and lower them if possible. The GitHub PR was likely deployed on 2021-08-04, when we saw a first spike. There was an alert on "2021-08-04 13:01:35" (only for a short time).

Why did we not look into the I/O alerts in more detail?

The alert triggered, e.g. in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=query&editPanel=47&viewPanel=47&orgId=1&from=1628074003444&to=1628077380302 but turned "OK" soon after. Our wiki already explains that we need to take alerts seriously, regardless of whether they are "OK" again or not. ->
DONE: We need an accepted hypothesis when we want to change alerts -> https://progress.opensuse.org/projects/qa/wiki/Wiki/diff?utf8=%E2%9C%93&version=335&version_from=334&commit=View+differences

Subtasks 4 (0 open — 4 closed)

Related issues 2 (0 open — 2 closed)
