Project

General

Profile

Actions

coordination #99579

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic][retro] Follow-up to "Published QCOW images appear to be uncompressed"

Added by okurz about 3 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Organisational
Target version:
Start date:
2021-10-01
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

Motivation

In #99246 gladly mdoucha could identify a big performance regression due to https://github.com/os-autoinst/os-autoinst/pull/1699/commits/eb207de0a372d832a60a081dd08dc674c90ef950 . After the very specific bug report we could deploy a fix to openqa.suse.de within 2 hours so very quick. But before that we had nearly two months of vague issues, user reports about reduced performance, multiple alerts related to high CPU time, high I/O pressure, long test runtimes and long test schedule queues.

For example:

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets

Five Whys

  1. Why did we not prevent the merge of the PR?
  2. Why could we not link problems to the code change immediately after deployment?
    • Monitor mean value of asset sizes
    • Monitor mean value of job completion times
    • Deploy synchronously after every merged pull request to make alerts more stricter and be more likely to link alerts to the smaller deployed changes
  3. Why did we not link the I/O alerts to the deployed change?
    • Distracted by multiple network problems and increased load due to crashing workers ->
    • Deploying more often should help
    • Network performance monitoring
    • Have a "test-openQA-job" that we use as reference, e.g. the openQA-in-openQA tests or the os-autoinst full-stack test but check their runtime
  4. Why could we not link multiple user reports to the alerts (mentioned above)?
    • user reports did not tell us more than what we should have seen from monitoring data but confirmed the presence of the issue which we should have linked. For this it's good to have tickets
    • lookup if the alert levels for "Disk I/O time for /dev/vdc (/assets)" have been bumped to very high numbers for good reason, lower if possible. The github PR was likely deployed on 2021-08-04 where we saw a first spike. There was an alert on "2021-08-04 13:01:35" (only for a short time).
  5. Why did we not look into the I/O alerts in more detail?

Subtasks 4 (0 open4 closed)

action #99654: Revisit decision in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/545 regarding I/O alerts size:SResolvedmkittler2021-10-01

Actions
coordination #99660: [epic] Use more perl signatures in our perl projectsResolvedokurz2021-10-01

Actions
action #99663: Use more perl signatures - os-autoinst size:MResolvedokurz2021-10-01

Actions
action #105127: Use more perl signatures - openQA - some simple classes size:SResolvedkodymo

Actions

Related issues 2 (0 open2 closed)

Related to openQA Project (public) - action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retryResolvedmkittler2021-08-042021-08-19

Actions
Copied from openQA Project (public) - action #99246: Published QCOW images appear to be uncompressedResolvedokurz2021-09-242021-10-09

Actions
Actions #1

Updated by okurz about 3 years ago

  • Copied from action #99246: Published QCOW images appear to be uncompressed added
Actions #2

Updated by okurz about 3 years ago

  • Description updated (diff)
Actions #3

Updated by okurz about 3 years ago

  • Tracker changed from action to coordination
  • Subject changed from [retro] Follow-up to "Published QCOW images appear to be uncompressed" to [epic][retro] Follow-up to "Published QCOW images appear to be uncompressed"
  • Description updated (diff)
  • Parent task set to #64746

Conducted five-why analysis during retro with the team. Thank you team!

Actions #4

Updated by okurz about 3 years ago

  • Related to action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry added
Actions #5

Updated by okurz about 3 years ago

  • Description updated (diff)
Actions #6

Updated by okurz about 3 years ago

  • Description updated (diff)
Actions #7

Updated by MDoucha about 3 years ago

okurz wrote:

Five Whys

  1. Why did we not prevent the merge of the PR?

I'm going to give you an answer which you won't like: The code review as currently practiced by the Tools team is a farce that only wastes everyone's time. You keep bikeshedding complete nonsense like comment wording while paying zero attention to potential bugs. And worse, some of the reviewer even prefer to needlessly increase the risk of errors in order to make the code "pretty". (I could link specific examples but I'll refrain from doing so to protect the guilty. You know who you are.)

The PR that introduced the bug was adding a new parameter to two existing functions. The logical thing to do during code review is to grep the codebase for any calls of those two functions that were not updated in the PR. The bug slipped through because nobody bothered to do that trivial check.

Actions #8

Updated by okurz about 3 years ago

MDoucha wrote:

okurz wrote:

Five Whys

  1. Why did we not prevent the merge of the PR?

I'm going to give you an answer which you won't like: The code review as currently practiced by the Tools team is a farce that only wastes everyone's time. You keep bikeshedding complete nonsense like comment wording while paying zero attention to potential bugs. And worse, some of the reviewer even prefer to needlessly increase the risk of errors in order to make the code "pretty".

Well, I don't dislike your comment because you are often right and here as well. Obviously most people would have it phrased differently but it's still true what you say what can happen. However, this is also not a workable suggestion that we can follow with. So unless you have a proper suggestion I will just read it as "Be more diligent in code review".

(I could link specific examples but I'll refrain from doing so to protect the guilty. You know who you are.)

Well, I know that you mean me and I am not afraid :D

The PR that introduced the bug was adding a new parameter to two existing functions. The logical thing to do during code review is to grep the codebase for any calls of those two functions that were not updated in the PR. The bug slipped through because nobody bothered to do that trivial check.

Correct. Hence we introduce signatures next to all the other suggestions. Obviously my experience was confirmed and I forgot about that during review: A commit starting with the word "Fix …" is often a cause for regressions so we should be even more careful with that ;)

Actions #9

Updated by okurz about 3 years ago

  • Status changed from New to Workable

So we consider it "Workable" as an epic meaning that we can define further individual specific subtasks while others can work on subtasks.

Actions #10

Updated by MDoucha about 3 years ago

okurz wrote:

MDoucha wrote:

I'm going to give you an answer which you won't like: The code review as currently practiced by the Tools team is a farce that only wastes everyone's time. You keep bikeshedding complete nonsense like comment wording while paying zero attention to potential bugs. And worse, some of the reviewer even prefer to needlessly increase the risk of errors in order to make the code "pretty".

Well, I don't dislike your comment because you are often right and here as well. Obviously most people would have it phrased differently but it's still true what you say what can happen. However, this is also not a workable suggestion that we can follow with. So unless you have a proper suggestion I will just read it as "Be more diligent in code review".

I can be more specific and give you a code review checklist:

  1. How does this code interact with the rest of the codebase? What can break if this gets merged?
  2. Under which conditions will this code produce errors? Can any of those errors be avoided?
  3. Are there any structural issues, e.g. copy-pasted code, inefficient loops or function call patterns that would cause performance issues?
  4. Are there any obvious future changes to this code? Can it be improved now to make those future changes easier?

There's also the employee exchange program so people from your team could join Kernel QA for a week or two to learn how to do code review better. Or you can just read the LTP mailing list archives:
https://lists.linux.it/pipermail/ltp/

(I could link specific examples but I'll refrain from doing so to protect the guilty. You know who you are.)

Well, I know that you mean me and I am not afraid :D

No, I don't mean you.

Actions #11

Updated by tinita about 3 years ago

And worse, some of the reviewer even prefer to needlessly increase the risk of errors in order to make the code "pretty".

I can not imagine an example here, and if noone else can, this is pretty useless.
I can only speak for myself, if it's me, please give examples. Otherwise, maybe get in touch with the person(s) in question directly.
We want to improve.
Thanks.

Actions #12

Updated by MDoucha about 3 years ago

tinita wrote:

And worse, some of the reviewer even prefer to needlessly increase the risk of errors in order to make the code "pretty".

I can not imagine an example here, and if noone else can, this is pretty useless.
I can only speak for myself, if it's me, please give examples. Otherwise, maybe get in touch with the person(s) in question directly.
We want to improve.
Thanks.

If it were you, you'd remember a long argument with me about why a particular suggestion is wrong. I will at least describe the main example I had in mind when I wrote the sentence above:

The code is updating a file and needs to avoid corrupting it if the update fails halfway through for whatever reason. Solution: write to a temporary file and rename it. But the reviewer argued that rename() with manual cleanup in case of failure is ugly and insisted on using Mojo::File::tempfile and renaming it using move_to() which is not guaranteed to be atomic. (In addition to also needing an unnecessary chmod() call because the tempfile is always created with 0600 access permissions.)

Actions #13

Updated by okurz about 3 years ago

MDoucha wrote:

tinita wrote:

And worse, some of the reviewer even prefer to needlessly increase the risk of errors in order to make the code "pretty".

I can not imagine an example here, and if noone else can, this is pretty useless.
I can only speak for myself, if it's me, please give examples. Otherwise, maybe get in touch with the person(s) in question directly.
We want to improve.
Thanks.

If it were you, you'd remember a long argument with me about why a particular suggestion is wrong. I will at least describe the main example I had in mind when I wrote the sentence above:

The code is updating a file and needs to avoid corrupting it if the update fails halfway through for whatever reason. Solution: write to a temporary file and rename it. But the reviewer argued that rename() with manual cleanup in case of failure is ugly and insisted on using Mojo::File::tempfile and renaming it using move_to() which is not guaranteed to be atomic. (In addition to also needing an unnecessary chmod() call because the tempfile is always created with 0600 access permissions.)

I guess you mean https://github.com/os-autoinst/openQA/pull/3144#pullrequestreview-424274015 . I don't see the argument in there as "increase the risk of errors" but rather avoid overly defensive programming

By the way. I like your code review checklist that you posted above

Actions #14

Updated by okurz about 3 years ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

More sub tasks could be defined but right now we have enough in the backlog. I can track this based on what we already have defined at least until the current open sub tasks are done

Actions #15

Updated by okurz almost 3 years ago

  • Parent task changed from #64746 to #80142
Actions #16

Updated by okurz about 2 years ago

subtasks actively being worked on with low prio, no problems.

Actions #17

Updated by okurz almost 2 years ago

  • Status changed from Blocked to Resolved

#99663 was fixed which I consider the most critical part. With openQA we can continue on the go. I wouldn't restrict our work to need that strictly.

Actions

Also available in: Atom PDF