action #94606
New builds of aggregate tests should not obsolete old ones size:M
Description
Motivation
From a discussion between okurz and mgrifalconi: Currently SLE maintenance aggregate tests are scheduled twice per day. Often only the first build of a day is interesting for reviewers, as it is likely more complete and the second build would likely only include a smaller inter-day delta. But currently (to be confirmed) aggregate tests are scheduled by obsoleting older builds, meaning that the tests of the first build of a day might not yet be finished and get aborted when the second build is triggered. As openQA supports deprioritizing older builds instead of obsoleting them, this can also give aggregate tests the possibility to finish.
Acceptance criteria
- AC1: SLE maintenance aggregate jobs from older builds can (mostly) finish even if not finished by the time another build is scheduled
- AC2: OSD can still ensure a reasonable job age for all related architectures and worker classes
Suggestions
- As documented on http://open.qa/docs/#_spawning_multiple_jobs_based_on_templates_isos_post use _DEPRIORITIZEBUILD instead of _OBSOLETE, e.g. in https://gitlab.suse.de/qa-maintenance/openQABot/-/blob/400f79aa9bb8283870aba16f8b6749f37400d454/openqabot/openqabot.py#L184
- Monitor the impact of _DEPRIORITIZEBUILD
- Tweak _DEPRIORITIZE_LIMIT based on monitoring data and observation over some days/weeks
- Consider setting the _ONLY_OBSOLETE_SAME_BUILD option
- Consider introducing the option to set scheduling flags in the metadata project e.g. by product/team/group
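The first suggestions could be sketched roughly as follows. This is only an illustration and not the actual openQABot code: the helper name `scheduling_flags` and the example product settings are hypothetical, and only the underscore-prefixed flag names are taken from the openQA documentation linked above.

```python
from urllib.parse import urlencode


def scheduling_flags(deprioritize=True, limit=None, only_same_build=False):
    """Return the scheduling flags to add to an `isos post` request.

    Hypothetical helper for illustration; not part of openQABot.
    """
    flags = {}
    if deprioritize:
        # Deprioritize unfinished jobs of older builds instead of cancelling them.
        flags["_DEPRIORITIZEBUILD"] = 1
        if limit is not None:
            # Optional tuning knob; openQA documents a default of 100.
            flags["_DEPRIORITIZE_LIMIT"] = limit
    else:
        # Previous behaviour: obsolete jobs of older builds right away.
        flags["_OBSOLETE"] = 1
    if only_same_build:
        flags["_ONLY_OBSOLETE_SAME_BUILD"] = 1
    return flags


# Example product settings; all values are made up for illustration.
settings = {
    "DISTRI": "sle",
    "VERSION": "15-SP3",
    "FLAVOR": "Server-DVD-Updates",
    "ARCH": "x86_64",
    "BUILD": "20210730-1",
}
settings.update(scheduling_flags(limit=120))
print(urlencode(settings))
```

Keeping the flags in one place like this would also make it easier to pick different values per product, team or group, as the last suggestion proposes.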
Challenges
- AFAIR originally there had been even more "aggregate tests". The next build is always scheduled with a constant time offset (unlike in product validation, where builds can exceptionally follow each other in rapid succession). If the first build of a day cannot even finish all tests by then, and this does not block the release of any updates, then I guess we won't significantly benefit from such a behaviour change. IMHO the criterion for releasability should not be "any failed test blocks the release" but "not fewer passed tests than on our reference". If we stuck to that, we would have a direct motivation to have efficient, fast, relevant tests.
Updated by okurz over 3 years ago
- Status changed from New to Feedback
- Assignee set to okurz
Updated by okurz over 3 years ago
- Due date set to 2021-08-03
There was no response on https://gitlab.suse.de/qa-maintenance/openQABot/-/merge_requests/73 . I asked in https://chat.suse.de/channel/testing?msg=eMyffDHQumTrJvBva for feedback. As I don't have access to see the bot in production I refrain from merging myself for now.
Updated by okurz about 3 years ago
- Due date deleted (2021-08-03)
- Status changed from Feedback to Blocked
I received some response but … it's complicated. https://gitlab.suse.de/qa-maintenance/openQABot/-/merge_requests/73#note_330982 suggests setting the obsoletion settings on job templates, not within the bot. If that works I like it, but we should first test whether it does -> #95539
Updated by okurz about 3 years ago
- Status changed from Blocked to New
- Assignee deleted (okurz)
- Target version changed from Ready to future
It's a nice idea but outside current team's capacity. To be followed up later.
Updated by okurz 12 months ago
- Assignee set to okurz
- Target version changed from future to Ready
A related discussion came up in https://suse.slack.com/archives/C02CANHLANP/p1697105710342209?thread_ts=1695707529.901119&cid=C02CANHLANP
I will suggest again to use _DEPRIORITIZEBUILD
Updated by okurz 12 months ago
- Priority changed from Low to High
PR merged. https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1898472#L1 shows "schedule updates" with the setting _DEPRIORITIZEBUILD instead of _OBSOLETE. https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1899988#L726 shows "schedule incidents" with the same setting. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1697100466597&to=1697201416553 shows that we do have a significant increase in scheduled and executed jobs. I suspect that this is due to my change.
I wrote an announcement message in https://suse.slack.com/archives/C02CANHLANP/p1697202087781019
I would like to inform you about a change in scheduling SLE maintenance updates. As part of https://progress.opensuse.org/issues/94606 we switched the openQA job scheduling from obsoleting older tests (_OBSOLETE) to deprioritizing jobs of older builds but still giving them a chance to finish (_DEPRIORITIZEBUILD). This should help to provide a complete picture of test stability and results as well as help with investigating new unknown issues. This comes at the cost of executing more openQA tests, however I see our infrastructure as able to cope with that. In rare cases I expect particularly long-running, unreviewed but recurring job failures to cause a delay in special scenarios, but such cases should be addressed in the specific test scenarios using one of the usual review best practices and mitigations.
Let's monitor over the next days hence changing prio to "High".
Updated by okurz 12 months ago
- Status changed from Blocked to Resolved
#138026 resolved. I see no problem in https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 yet, resolving.
Updated by MDoucha 12 months ago
I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.
The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.
Updated by okurz 12 months ago
- Status changed from Resolved to Feedback
MDoucha wrote in #note-11:
I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.
I am not sure. I agree that the original problem was aggregate tests but why should incident tests be obsoleted by default?
The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.
We can also tweak the limit until when jobs are deprioritized until obsoletion. WDYT?
Updated by MDoucha 12 months ago
okurz wrote in #note-12:
MDoucha wrote in #note-11:
I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.
I am not sure. I agree that the original problem was aggregate tests but why should incident tests be obsoleted by default?
Because the new jobs are identical to the old jobs except for REPOHASH which is just bot's internal note that has no effect on test behavior.
Also, I believe that this ticket could have been solved by simply adding _ONLY_OBSOLETE_SAME_BUILD=1 to openqabot/types/aggregates.py. Then different aggregate builds even from the same day would not interfere with each other.
The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.
We can also tweak the limit until when jobs are deprioritized until obsoletion. WDYT?
I have no idea what you mean.
Updated by okurz 11 months ago
- Status changed from Feedback to New
MDoucha wrote in #note-13:
okurz wrote in #note-12:
MDoucha wrote in #note-11:
I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.
I am not sure. I agree that the original problem was aggregate tests but why should incident tests be obsoleted by default?
Because the new jobs are identical to the old jobs except for REPOHASH which is just bot's internal note that has no effect on test behavior.
Also, I believe that this ticket could have been solved by simply adding _ONLY_OBSOLETE_SAME_BUILD=1 to openqabot/types/aggregates.py. Then different aggregate builds even from the same day would not interfere with each other.
Thanks, we will look into that.
The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.
We can also tweak the limit until when jobs are deprioritized until obsoletion. WDYT?
I have no idea what you mean.
I mean the variables from http://open.qa/docs/#_spawning_multiple_jobs_based_on_templates_isos_post
- _DEPRIORITIZEBUILD: Setting this switch to '1' will deprioritize the unfinished jobs of old builds, and it will obsolete the jobs once the configurable limit of the priority value is reached.
- _DEPRIORITIZE_LIMIT: The configurable limit of priority value up to which jobs should be deprioritized. Needs _DEPRIORITIZEBUILD. Defaults to 100.
I would like to check that _DEPRIORITIZEBUILD is properly set.
Updated by okurz 11 months ago
- Status changed from Workable to Resolved
Over the past weeks I haven't seen more problematic cases of overly long schedules looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 so I consider this not problematic anymore in general. I crosschecked how SLE maintenance products are scheduled on OSD on openqa.suse.de/admin/productlog and found that the corresponding variables are set correctly. I also double-checked the implementation in openQA: we set a default deprioritization limit of 100 with a step size of 10. So, assuming that a product is triggered with default priority 60 as seen for SLE incident tests, up to three deprioritizations are conducted, so in total not more than 4 unfinished builds should exist. In qem-bot the obsoletion flag is set separately for incidents and aggregates, so on further problems one could still consider using different settings. However, in general I would prefer to give all tests a chance to finish regardless of the state of external repositories, as having a finished history of jobs can help test reviewing.
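The numbers in this comment can be traced with a small simulation. It is a sketch of the behaviour as described here, not the openQA implementation: it assumes that each newly scheduled build raises the priority value of unfinished older jobs by the step size, and that a job is obsoleted as soon as the raised value would reach the limit. The function name and that exact boundary condition are assumptions.

```python
def deprioritization_steps(start_prio=60, limit=100, step=10):
    """Return the priority values a job passes through before it is obsoleted.

    Assumption: a job is obsoleted once prio + step would reach the limit.
    """
    values = []
    prio = start_prio
    while prio + step < limit:
        prio += step
        values.append(prio)
    return values


steps = deprioritization_steps()
print(steps)  # [70, 80, 90]

# The newest build plus one older build per deprioritization step.
max_unfinished_builds = len(steps) + 1
print(max_unfinished_builds)  # 4
```

Under these assumptions, raising or lowering _DEPRIORITIZE_LIMIT directly changes how many older builds can stay in the queue.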
Updated by MDoucha 11 months ago
okurz wrote in #note-16:
Over the past weeks I haven't seen more problematic cases of overly long schedules looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 so I consider this not problematic anymore in general. I crosschecked how SLE maintenance products are scheduled on OSD on openqa.suse.de/admin/productlog and found that the corresponding variables are set correctly. I also double-checked the implementation in openQA: we set a default deprioritization limit of 100 with a step size of 10. So, assuming that a product is triggered with default priority 60 as seen for SLE incident tests, up to three deprioritizations are conducted, so in total not more than 4 unfinished builds should exist. In qem-bot the obsoletion flag is set separately for incidents and aggregates, so on further problems one could still consider using different settings. However, in general I would prefer to give all tests a chance to finish regardless of the state of external repositories, as having a finished history of jobs can help test reviewing.
Let's do some math.
- we get ~70 livepatch incidents in one batch every month
- each incident gets 141 test jobs (63 for PPC64LE alone)
- most of those incidents get restarted due to repo changes (rebuilds, channel settings changes, patchinfo changes), often more than once
That's 9870 livepatch jobs in total every month (4410 for PPC64LE alone). With current settings and 3 incident restarts on average, you'll get 29610 jobs in the queue at once (13230 for PPC64LE alone).
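Spelled out as a quick calculation (the variable names are only for illustration; the inputs are the figures from this comment, and the queue size counts three sets of jobs per incident as stated above):

```python
incidents_per_batch = 70         # livepatch incidents per monthly batch
jobs_per_incident = 141          # test jobs per incident, all architectures
ppc64le_jobs_per_incident = 63   # of which PPC64LE
restarts = 3                     # average incident restarts

monthly_jobs = incidents_per_batch * jobs_per_incident
monthly_ppc64le_jobs = incidents_per_batch * ppc64le_jobs_per_incident
print(monthly_jobs, monthly_ppc64le_jobs)  # 9870 4410

# Without obsoletion, every restart leaves another full set of jobs queued.
queued_jobs = monthly_jobs * restarts
queued_ppc64le_jobs = monthly_ppc64le_jobs * restarts
print(queued_jobs, queued_ppc64le_jobs)  # 29610 13230
```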
Updated by okurz 11 months ago
MDoucha wrote in #note-17:
- most of those incidents get restarted due to repo changes (rebuilds, channel settings changes, patchinfo changes), often more than once
That's 9870 livepatch jobs in total every month (4410 for PPC64LE alone). With current settings and 3 incident restarts on average, you'll get 29610 jobs in the queue at once (13230 for PPC64LE alone).
Why should there be 3 incident restarts happening within a short timeframe?
Updated by MDoucha 11 months ago
okurz wrote in #note-18:
Why should there be 3 incident restarts happening within a short timeframe?
Because the incident manager often has to make manual changes in the incident after it was submitted to openQA, usually because the test found some trivial issue like misconfigured update channels or a wrong incident number in the repo metadata. The bot will retrigger all tests after every change in the incident repo.
Updated by MDoucha 11 months ago
okurz wrote in #note-20:
ok. If your concern is only the live-patch tests how about using _OBSOLETE=1 in according product settings for live-patch tests?
The same applies to all maintenance update tests. Livepatches are just the most obvious and most painful fallout of this change. Single incident jobs need to be obsoleted by fresh isos post. Letting the old jobs finish will not yield any interesting results. It's just a pointless waste of testing resources.
Updated by okurz 11 months ago
MDoucha wrote in #note-21:
The same applies to all maintenance update tests. Livepatches are just the most obvious and most painful fallout of this change. Single incident jobs need to be obsoleted by fresh isos post. Letting the old jobs finish will not yield any interesting results. It's just a pointless waste of testing resources.
Other people's opinions might differ. I suggest opening up a discussion in a broader context, e.g. in Slack #eng-testing, and asking others. Otherwise I tend to keep the behaviour as is.