coordination #99579

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic][retro] Follow-up to "Published QCOW images appear to be uncompressed"

Added by okurz about 3 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Organisational
Target version:
Start date:
2021-10-01
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

Motivation

In #99246, mdoucha fortunately identified a big performance regression caused by https://github.com/os-autoinst/os-autoinst/pull/1699/commits/eb207de0a372d832a60a081dd08dc674c90ef950 . After this very specific bug report we deployed a fix to openqa.suse.de within 2 hours, which is very quick. Before that, however, we had nearly two months of vague issues: user reports about reduced performance and multiple alerts related to high CPU time, high I/O pressure, long test runtimes and long test schedule queues.

For example:

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets

Five Whys

  1. Why did we not prevent the merge of the PR?
  2. Why could we not link problems to the code change immediately after deployment?
    • Monitor mean value of asset sizes
    • Monitor mean value of job completion times
    • Deploy synchronously after every merged pull request so that alerts can be made stricter and are more likely to be linked to the smaller deployed changes
  3. Why did we not link the I/O alerts to the deployed change?
    • Distracted by multiple network problems and increased load due to crashing workers, hence:
    • Deploying more often should help
    • Network performance monitoring
    • Have a "test-openQA-job" that we use as reference, e.g. the openQA-in-openQA tests or the os-autoinst full-stack test, and also check its runtime
  4. Why could we not link multiple user reports to the alerts (mentioned above)?
    • User reports did not tell us more than what we should have seen from monitoring data, but they confirmed the presence of the issue, which we should have linked to the alerts. For this it's good to have tickets
    • Look up whether the alert levels for "Disk I/O time for /dev/vdc (/assets)" have been bumped to very high numbers for good reason, and lower them if possible. The GitHub PR was likely deployed on 2021-08-04, when we saw a first spike. There was an alert at "2021-08-04 13:01:35" (only for a short time).
  5. Why did we not look into the I/O alerts in more detail?
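
The suggestions under items 2 and 3 (monitoring mean asset sizes, mean job completion times, and the runtime of a reference job) all come down to comparing a recent mean against a baseline. A minimal sketch of that comparison follows. It is not how the actual alerting is implemented; the function name `mean_regressed`, the 20% tolerance and all sample numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean


def mean_regressed(baseline, recent, tolerance=0.2):
    """Return True if the mean of `recent` exceeds the mean of
    `baseline` by more than `tolerance` (given as a fraction).

    Hypothetical helper: the 20% default is an assumption, not a
    threshold taken from the ticket.
    """
    if not baseline or not recent:
        raise ValueError("both sample lists must be non-empty")
    return mean(recent) > mean(baseline) * (1 + tolerance)


# Made-up sample data, e.g. runtimes in seconds of a reference job
# before and after a deployment.
runtimes_before = [610, 590, 640, 605]
runtimes_after = [910, 880, 950, 990]

# Made-up sample data, e.g. published asset sizes in GiB.
asset_sizes_before = [1.1, 1.2, 1.0]
asset_sizes_after = [3.9, 4.2, 4.0]

if mean_regressed(runtimes_before, runtimes_after):
    print("mean job runtime regressed beyond tolerance")
if mean_regressed(asset_sizes_before, asset_sizes_after):
    print("mean asset size regressed beyond tolerance")
```

Running such a check after each deployment, rather than over a long window, would tie a detected regression to a small set of changes, which is the point of the "deploy more often" suggestion.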

Subtasks 4 (0 open, 4 closed)

action #99654: Revisit decision in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/545 regarding I/O alerts size:S (Status: Resolved, Assignee: mkittler, Start date: 2021-10-01)

coordination #99660: [epic] Use more perl signatures in our perl projects (Status: Resolved, Assignee: okurz, Start date: 2021-10-01)

action #99663: Use more perl signatures - os-autoinst size:M (Status: Resolved, Assignee: okurz, Start date: 2021-10-01)

action #105127: Use more perl signatures - openQA - some simple classes size:S (Status: Resolved, Assignee: kodymo)

Related issues 2 (0 open, 2 closed)

Related to openQA Project (public) - action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry (Status: Resolved, Assignee: mkittler, Start date: 2021-08-04, Due date: 2021-08-19)

Copied from openQA Project (public) - action #99246: Published QCOW images appear to be uncompressed (Status: Resolved, Assignee: okurz, Start date: 2021-09-24, Due date: 2021-10-09)
