action #94606

closed

New builds of aggregate tests should not obsolete old ones size:M

Added by okurz almost 3 years ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2021-06-22
Due date:
% Done: 0%
Estimated time:

Description

Motivation

From a discussion between okurz and mgrifalconi. Currently SLE maintenance aggregate tests are scheduled twice per day. Often only the first build of a day is interesting for reviewers, as it is likely more complete and the second build would likely only include a smaller intra-day delta. But currently (to be confirmed) aggregate tests are scheduled by obsoleting older builds, meaning that the tests of the first build of a day might not yet be finished and get aborted when the second build is triggered. As openQA supports deprioritizing older builds instead of obsoleting them, this would give aggregate tests the chance to finish.

Acceptance criteria

  • AC1: SLE maintenance aggregate jobs from older builds can (mostly) finish even if not finished by the time another build is scheduled
  • AC2: OSD can still ensure a reasonable job age for all related architectures and worker classes

Suggestions

As documented on http://open.qa/docs/#_spawning_multiple_jobs_based_on_templates_isos_post use _DEPRIORITIZEBUILD instead of _OBSOLETE, e.g. in https://gitlab.suse.de/qa-maintenance/openQABot/-/blob/400f79aa9bb8283870aba16f8b6749f37400d454/openqabot/openqabot.py#L184

  • Monitor the impact of _DEPRIORITIZEBUILD
  • Tweak _DEPRIORITIZE_LIMIT based on monitoring data and observation over some days/weeks
  • Consider setting the _ONLY_OBSOLETE_SAME_BUILD option
  • Consider introducing the option to set scheduling flags in the metadata project e.g. by product/team/group
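
As a rough illustration of the first suggestion, the sketch below builds the parameters for an isos post request with one flag or the other. The flag names _OBSOLETE, _DEPRIORITIZEBUILD and _DEPRIORITIZE_LIMIT are the ones named in this ticket and in the openQA documentation linked above. The function name, the example product values and the dictionary layout are hypothetical and do not reflect the actual openQABot code.

```python
from typing import Dict, Optional


def build_schedule_settings(
    base: Dict[str, str],
    deprioritize: bool = True,
    deprioritize_limit: Optional[int] = None,
) -> Dict[str, str]:
    """Return a copy of `base` with exactly one old-build handling flag set.

    Hypothetical helper, not part of openQABot. Only the underscore-prefixed
    flag names are taken from the openQA documentation.
    """
    settings = dict(base)
    if deprioritize:
        # Unfinished jobs of older builds are deprioritized, not cancelled.
        settings["_DEPRIORITIZEBUILD"] = "1"
        if deprioritize_limit is not None:
            settings["_DEPRIORITIZE_LIMIT"] = str(deprioritize_limit)
    else:
        # Behaviour before this ticket: older builds are obsoleted.
        settings["_OBSOLETE"] = "1"
    return settings


# Example values are made up for illustration only.
example = build_schedule_settings(
    {"DISTRI": "sle", "VERSION": "15-SP3", "BUILD": "20210622-1"},
    deprioritize_limit=120,
)
print(example)
```

Keeping the two flags mutually exclusive in one place would make it harder to post both by accident; whether the bot should work this way is a design choice left open here.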

Challenges

  • AFAIR originally there had been even more "aggregate tests". The next build is always scheduled with a constant time offset (unlike in product validation, where a rapid succession of builds can occasionally happen). If the first build of a day is not even able to finish all tests by then, and this is not blocking the release of any updates, then I guess we won't significantly benefit from such a behaviour change. IMHO the criterion for releasability should not be "any failed test blocks the release" but "no fewer passed tests than on our reference". If we stuck to that, we would have a direct motivation to have efficient, fast, relevant tests.
Actions #1

Updated by okurz almost 3 years ago

  • Status changed from New to Feedback
  • Assignee set to okurz
Actions #2

Updated by okurz almost 3 years ago

  • Due date set to 2021-08-03

There was no response on https://gitlab.suse.de/qa-maintenance/openQABot/-/merge_requests/73 . I asked in https://chat.suse.de/channel/testing?msg=eMyffDHQumTrJvBva for feedback. As I don't have access to see the bot in production I refrain from merging myself for now.

Actions #3

Updated by okurz over 2 years ago

  • Due date deleted (2021-08-03)
  • Status changed from Feedback to Blocked

I received some response but … it's complicated. https://gitlab.suse.de/qa-maintenance/openQABot/-/merge_requests/73#note_330982 suggests setting the obsoletion settings on job templates, not within the bot. If that works I like it, but it should be tested first -> #95539

Actions #4

Updated by okurz over 2 years ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)
  • Target version changed from Ready to future

It's a nice idea but outside current team's capacity. To be followed up later.

Actions #5

Updated by okurz 6 months ago

  • Assignee set to okurz
  • Target version changed from future to Ready

A related discussion came up in https://suse.slack.com/archives/C02CANHLANP/p1697105710342209?thread_ts=1695707529.901119&cid=C02CANHLANP

I will suggest again to use _DEPRIORITIZEBUILD

Actions #6

Updated by okurz 6 months ago

  • Status changed from New to Feedback
Actions #7

Updated by okurz 6 months ago

  • Priority changed from Low to High

PR merged. https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1898472#L1 shows "schedule updates" with the setting _DEPRIORITIZEBUILD instead of _OBSOLETE. https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1899988#L726 shows "schedule incidents" with the same setting. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1697100466597&to=1697201416553 shows that we do have a significant increase in scheduled and executed jobs. I suspect that this is due to my change.

I wrote an announcement message in https://suse.slack.com/archives/C02CANHLANP/p1697202087781019

I would like to inform you about a change in scheduling SLE maintenance updates. As part of https://progress.opensuse.org/issues/94606 we switched the openQA job scheduling from obsoleting older tests (_OBSOLETE) to deprioritizing jobs of older builds while still giving them a chance to finish (_DEPRIORITIZEBUILD). This should help to provide a complete picture of test stability and results, as well as help with investigating new unknown issues. This comes at the cost of executing more openQA tests; however, I expect our infrastructure to be able to cope with that. In rare cases I expect particularly long-running, unreviewed but recurring job failures to cause a delay in special scenarios, but such cases should be addressed in the specific test scenarios using one of the usual review best practices and mitigations.

Let's monitor over the next days hence changing prio to "High".

Actions #9

Updated by okurz 6 months ago

  • Status changed from Feedback to Blocked

Due to #138026 there were likely either unrelated failures or not even enough jobs scheduled to evaluate the impact of my changes. So let's wait for #138026 first and then monitor for longer.

Actions #10

Updated by okurz 6 months ago

  • Status changed from Blocked to Resolved

#138026 resolved. I see no problem in https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 yet, resolving.

Actions #11

Updated by MDoucha 6 months ago

I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.

The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.

Actions #12

Updated by okurz 6 months ago

  • Status changed from Resolved to Feedback

MDoucha wrote in #note-11:

I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.

I am not sure. I agree that the original problem was aggregate tests but why should incident tests be obsoleted by default?

The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.

We can also tweak the limit until when jobs are deprioritized until obsoletion. WDYT?

Actions #13

Updated by MDoucha 6 months ago

okurz wrote in #note-12:

MDoucha wrote in #note-11:

I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.

I am not sure. I agree that the original problem was aggregate tests but why should incident tests be obsoleted by default?

Because the new jobs are identical to the old jobs except for REPOHASH which is just bot's internal note that has no effect on test behavior.

Also, I believe that this ticket could have been solved by simply adding _ONLY_OBSOLETE_SAME_BUILD=1 to openqabot/types/aggregates.py. Then different aggregate builds even from the same day would not interfere with each other.

The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.

We can also tweak the limit until when jobs are deprioritized until obsoletion. WDYT?

I have no idea what you mean.

Actions #14

Updated by okurz 5 months ago

  • Status changed from Feedback to New

MDoucha wrote in #note-13:

okurz wrote in #note-12:

MDoucha wrote in #note-11:

I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.

I am not sure. I agree that the original problem was aggregate tests but why should incident tests be obsoleted by default?

Because the new jobs are identical to the old jobs except for REPOHASH which is just bot's internal note that has no effect on test behavior.

Also, I believe that this ticket could have been solved by simply adding _ONLY_OBSOLETE_SAME_BUILD=1 to openqabot/types/aggregates.py. Then different aggregate builds even from the same day would not interfere with each other.

Thanks, we will look into that.

The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.

We can also tweak the limit until when jobs are deprioritized until obsoletion. WDYT?

I have no idea what you mean.

I mean the variables from
http://open.qa/docs/#_spawning_multiple_jobs_based_on_templates_isos_post

_DEPRIORITIZEBUILD
Setting this switch to '1' will deprioritize the unfinished jobs of old builds, and it will obsolete the jobs once the configurable limit of the priority value is reached.

_DEPRIORITIZE_LIMIT
The configurable limit of priority value up to which jobs should be deprioritized. Needs _DEPRIORITIZEBUILD. Defaults to 100.

I would like to check that _DEPRIORITIZEBUILD is properly set.

Actions #15

Updated by okurz 5 months ago

  • Subject changed from New builds of aggregate tests should not obsolete old ones to New builds of aggregate tests should not obsolete old ones size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #16

Updated by okurz 5 months ago

  • Status changed from Workable to Resolved

Over the past weeks I have not seen more problematic cases of overly long schedules when looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 so I consider this not problematic anymore in general. I cross-checked how SLE maintenance products are scheduled on OSD on openqa.suse.de/admin/productlog and found that the corresponding variables are set correctly. I also double-checked the implementation in openQA: we set a default deprioritization limit of 100 with a step size of 10. So, assuming that a product is triggered with the default priority 60 as seen for SLE incident tests, up to three deprioritizations are conducted, so in total no more than 4 unfinished builds should exist. In qem-bot the obsoletion flag is set separately for incidents and aggregates, so in case of further problems one could still consider using different settings. However, in general I would prefer to give all tests a chance to finish regardless of the state of external repositories, as having a finished history of jobs can help with test review.
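
The count of three deprioritizations can be reproduced with a small simulation. It assumes, purely to match the numbers in this comment, that a job is obsoleted as soon as the next step would reach the limit; the actual openQA implementation may differ in detail, and the function name is hypothetical.

```python
from typing import List


def deprioritization_steps(start: int = 60, step: int = 10, limit: int = 100) -> List[int]:
    """List the priority values a job holds before it would be obsoleted.

    Assumption (not verified against openQA's code): a job is obsoleted
    once raising its priority value by `step` would reach `limit`.
    """
    values = [start]
    while values[-1] + step < limit:
        values.append(values[-1] + step)
    return values


values = deprioritization_steps()
print(values)           # priority values while the job is still kept
print(len(values) - 1)  # number of deprioritizations before obsoletion
```

Under this assumption a job triggered at priority 60 passes through 70, 80 and 90, which gives the three deprioritizations and at most four coexisting unfinished builds mentioned above. A higher _DEPRIORITIZE_LIMIT would lengthen that list accordingly.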

Actions #17

Updated by MDoucha 5 months ago

okurz wrote in #note-16:

Over the past weeks I have not seen more problematic cases of overly long schedules when looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 so I consider this not problematic anymore in general. I cross-checked how SLE maintenance products are scheduled on OSD on openqa.suse.de/admin/productlog and found that the corresponding variables are set correctly. I also double-checked the implementation in openQA: we set a default deprioritization limit of 100 with a step size of 10. So, assuming that a product is triggered with the default priority 60 as seen for SLE incident tests, up to three deprioritizations are conducted, so in total no more than 4 unfinished builds should exist. In qem-bot the obsoletion flag is set separately for incidents and aggregates, so in case of further problems one could still consider using different settings. However, in general I would prefer to give all tests a chance to finish regardless of the state of external repositories, as having a finished history of jobs can help with test review.

Let's do some math.

  • we get ~70 livepatch incidents in one batch every month
  • each incident gets 141 test jobs (63 for PPC64LE alone)
  • most of those incidents get restarted due to repo changes (rebuilds, channel settings changes, patchinfo changes), often more than once

That's 9870 livepatch jobs in total every month (4410 for PPC64LE alone). With current settings and 3 incident restarts on average, you'll get 29610 jobs in the queue at once (13230 for PPC64LE alone).
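
The figures above follow directly from the three bullet points. A short sketch of the arithmetic, using only the numbers given in this comment (the variable names are made up, and the factor of 3 is the one the comment applies for restarts):

```python
INCIDENTS_PER_BATCH = 70          # "~70 livepatch incidents in one batch"
JOBS_PER_INCIDENT = 141           # test jobs per incident, all architectures
PPC64LE_JOBS_PER_INCIDENT = 63    # PPC64LE share of those jobs
RESTART_FACTOR = 3                # "3 incident restarts on average"

jobs_per_batch = INCIDENTS_PER_BATCH * JOBS_PER_INCIDENT
ppc64le_per_batch = INCIDENTS_PER_BATCH * PPC64LE_JOBS_PER_INCIDENT

# Without obsoletion, every restart leaves the earlier jobs in the queue.
queued_total = jobs_per_batch * RESTART_FACTOR
queued_ppc64le = ppc64le_per_batch * RESTART_FACTOR

print(jobs_per_batch, ppc64le_per_batch)   # 9870 4410
print(queued_total, queued_ppc64le)        # 29610 13230
```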

Actions #18

Updated by okurz 5 months ago

MDoucha wrote in #note-17:

  • most of those incidents get restarted due to repo changes (rebuilds, channel settings changes, patchinfo changes), often more than once

That's 9870 livepatch jobs in total every month (4410 for PPC64LE alone). With current settings and 3 incident restarts on average, you'll get 29610 jobs in the queue at once (13230 for PPC64LE alone).

Why should there be 3 incident restarts happening within a short timeframe?

Actions #19

Updated by MDoucha 5 months ago

okurz wrote in #note-18:

Why should there be 3 incident restarts happening within a short timeframe?

Because the incident manager often has to make manual changes in the incident after it was submitted to openQA, usually because the tests found some trivial issue such as misconfigured update channels or a wrong incident number in the repo metadata. The bot will retrigger all tests after every change in the incident repo.

Actions #20

Updated by okurz 5 months ago

OK. If your concern is only the live-patch tests, how about using _OBSOLETE=1 in the corresponding product settings for live-patch tests?

Actions #21

Updated by MDoucha 5 months ago

okurz wrote in #note-20:

OK. If your concern is only the live-patch tests, how about using _OBSOLETE=1 in the corresponding product settings for live-patch tests?

The same applies to all maintenance update tests. Livepatches are just the most obvious and most painful fallout of this change. Single incident jobs need to be obsoleted by a fresh isos post. Letting the old jobs finish will not yield any interesting results. It's just a pointless waste of testing resources.

Actions #22

Updated by okurz 5 months ago

MDoucha wrote in #note-21:

The same applies to all maintenance update tests. Livepatches are just the most obvious and most painful fallout of this change. Single incident jobs need to be obsoleted by a fresh isos post. Letting the old jobs finish will not yield any interesting results. It's just a pointless waste of testing resources.

Other people's opinions might differ. I suggest opening up a discussion in a broader context, e.g. in Slack #eng-testing, and asking others; otherwise I tend to keep the behaviour as is.
