action #94606
closed
New builds of aggregate tests should not obsolete old ones size:M
Added by okurz over 3 years ago.
Updated about 1 year ago.
Category:
Feature requests
Description
Motivation
From a discussion between okurz and mgrifalconi. Currently SLE maintenance aggregate tests are scheduled twice per day. Often only the first build of a day is interesting for reviewers, as it is likely more complete and the second build would likely only include a smaller inter-day delta. But currently (to be confirmed) aggregate tests are scheduled by obsoleting older builds, meaning that the tests of the first build of the day might not yet be finished and get aborted when the second build is triggered. As openQA supports deprioritizing older builds instead of obsoleting them, this can give aggregate tests the possibility to finish.
Acceptance criteria
- AC1: SLE maintenance aggregate jobs from older builds can (mostly) finish even if not finished by the time another build is scheduled
- AC2: OSD can still ensure a reasonable job age for all related architectures and worker classes
Suggestions
- As documented on http://open.qa/docs/#_spawning_multiple_jobs_based_on_templates_isos_post use _DEPRIORITIZEBUILD instead of _OBSOLETE, e.g. in https://gitlab.suse.de/qa-maintenance/openQABot/-/blob/400f79aa9bb8283870aba16f8b6749f37400d454/openqabot/openqabot.py#L184
- Monitor the impact of _DEPRIORITIZEBUILD
- Tweak _DEPRIORITIZE_LIMIT based on monitoring data and observation over some days/weeks
- Consider setting the _ONLY_OBSOLETE_SAME_BUILD option
- Consider introducing the option to set scheduling flags in the metadata project e.g. by product/team/group
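The first suggestion could be sketched as a small helper that rewrites the parameters sent to the isos post route. This is only an illustration: the function name `deprioritize_instead_of_obsolete` and the example product values are hypothetical and not taken from openQABot; only the underscore-prefixed scheduling parameters come from the openQA documentation linked above.

```python
def deprioritize_instead_of_obsolete(params, limit=None):
    """Return a copy of isos-post parameters that deprioritizes old builds.

    `params` is a plain dict of settings as they would be posted to openQA.
    `limit` optionally sets _DEPRIORITIZE_LIMIT (openQA defaults to 100).
    """
    new_params = dict(params)
    # Drop the obsoletion switch so old jobs are no longer cancelled outright.
    new_params.pop("_OBSOLETE", None)
    new_params["_DEPRIORITIZEBUILD"] = 1
    if limit is not None:
        new_params["_DEPRIORITIZE_LIMIT"] = limit
    return new_params


# Hypothetical aggregate settings, for illustration only.
old = {"DISTRI": "sle", "ARCH": "x86_64", "BUILD": "20210701-1", "_OBSOLETE": 1}
new = deprioritize_instead_of_obsolete(old, limit=120)
print(new)
```

Leaving `limit` unset keeps the openQA default, which makes it easy to start with the defaults and only tune the limit once monitoring data is available.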
Challenges
- AFAIR originally there had been even more "aggregate tests". The next build is always scheduled with a constant time offset (unlike in product validation, where there can be the exception of a rapid succession of builds). If the first build of a day cannot even finish all tests by then, and this is not blocking the release of any updates, then I guess we won't significantly benefit from such a behaviour change. IMHO the criterion for releasability should not be "any failed test blocks the release" but "not fewer passed tests than on our reference". If we stuck to that, we would have a direct motivation to have efficient, fast, relevant tests.
- Status changed from New to Feedback
- Assignee set to okurz
- Due date set to 2021-08-03
- Due date deleted (2021-08-03)
- Status changed from Feedback to Blocked
- Status changed from Blocked to New
- Assignee deleted (okurz)
- Target version changed from Ready to future
It's a nice idea but outside current team's capacity. To be followed up later.
- Assignee set to okurz
- Target version changed from future to Ready
- Status changed from New to Feedback
- Priority changed from Low to High
- Status changed from Feedback to Blocked
Due to #138026 there were likely either unrelated failures or not even enough jobs scheduled to evaluate the impact of my changes. So let's wait for #138026 first and then monitor for longer.
- Status changed from Blocked to Resolved
I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.
The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.
- Status changed from Resolved to Feedback
MDoucha wrote in #note-11:
I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.
I am not sure. I agree that the original problem was aggregate tests but why should incident tests be obsoleted by default?
The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.
We can also tweak the limit until when jobs are deprioritized until obsoletion. WDYT?
okurz wrote in #note-12:
MDoucha wrote in #note-11:
I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.
I am not sure. I agree that the original problem was aggregate tests but why should incident tests be obsoleted by default?
Because the new jobs are identical to the old jobs except for REPOHASH which is just bot's internal note that has no effect on test behavior.
Also, I believe that this ticket could have been solved by simply adding _ONLY_OBSOLETE_SAME_BUILD=1 to openqabot/types/aggregates.py. Then different aggregate builds even from the same day would not interfere with each other.
The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.
We can also tweak the limit until when jobs are deprioritized until obsoletion. WDYT?
I have no idea what you mean.
- Status changed from Feedback to New
MDoucha wrote in #note-13:
okurz wrote in #note-12:
MDoucha wrote in #note-11:
I've cancelled over 3500 PPC64LE kernel livepatch jobs today which should have been obsoleted automatically. The ticket title and description say that deprioritize should be used for aggregate tests. Incident tests should be obsoleted by default.
I am not sure. I agree that the original problem was aggregate tests but why should incident tests be obsoleted by default?
Because the new jobs are identical to the old jobs except for REPOHASH which is just bot's internal note that has no effect on test behavior.
Also, I believe that this ticket could have been solved by simply adding _ONLY_OBSOLETE_SAME_BUILD=1 to openqabot/types/aggregates.py. Then different aggregate builds even from the same day would not interfere with each other.
Thanks, we will look into that.
The bot does not support different settings for aggregate and incident tests defined in the same config file so fixing this in the config would require splitting all SLES bot configs into separate aggregate-only and incident-only files.
We can also tweak the limit until when jobs are deprioritized until obsoletion. WDYT?
I have no idea what you mean.
I mean the variables from http://open.qa/docs/#_spawning_multiple_jobs_based_on_templates_isos_post
- _DEPRIORITIZEBUILD: Setting this switch to '1' will deprioritize the unfinished jobs of old builds, and it will obsolete the jobs once the configurable limit of the priority value is reached.
- _DEPRIORITIZE_LIMIT: The configurable limit of priority value up to which jobs should be deprioritized. Needs _DEPRIORITIZEBUILD. Defaults to 100.
I would like to check that _DEPRIORITIZEBUILD is properly set.
- Subject changed from New builds of aggregate tests should not obsolete old ones to New builds of aggregate tests should not obsolete old ones size:M
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to Resolved
Over the past weeks I haven't seen any more problematic cases of overly long schedules looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1, so I consider this not problematic anymore in general. I cross-checked how SLE maintenance products are scheduled on OSD on openqa.suse.de/admin/productlog and found that the corresponding variables are set correctly. I also double-checked the implementation in openQA: we do set a default deprioritization limit of 100 with a step size of 10. So, assuming that a product is triggered with the default priority 60 as seen for SLE incident tests, up to three deprioritizations are conducted, so in total not more than 4 unfinished builds should exist. In qem-bot the obsoletion flag is set separately for incidents and aggregates, so on further problems one could still consider using different settings. However, in general I would prefer to give all tests a chance to finish regardless of the state of external repositories, as having a finished history of jobs can help test reviewing.
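The count of deprioritizations can be reproduced with a small simulation. This is a rough sketch, not the openQA implementation: it assumes a job is obsoleted as soon as the next step would reach the limit, which is the reading that yields the three deprioritizations mentioned here; the exact comparison used by openQA is not quoted in this ticket.

```python
def deprioritization_steps(start_prio=60, limit=100, step=10):
    """Return the priority values an unfinished job passes through.

    Assumption: once raising the value by `step` would reach `limit`,
    the job is obsoleted instead of being deprioritized again.
    """
    prios = []
    prio = start_prio
    while prio + step < limit:
        prio += step
        prios.append(prio)
    return prios


steps = deprioritization_steps()
print(steps)           # [70, 80, 90]
# Each surviving older build plus the newest one can be unfinished together.
print(len(steps) + 1)  # 4
```

Raising _DEPRIORITIZE_LIMIT or starting from a lower priority value lengthens the list, and with it the number of builds that can coexist in the queue.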
okurz wrote in #note-16:
Over the past weeks I haven't seen any more problematic cases of overly long schedules looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1, so I consider this not problematic anymore in general. I cross-checked how SLE maintenance products are scheduled on OSD on openqa.suse.de/admin/productlog and found that the corresponding variables are set correctly. I also double-checked the implementation in openQA: we do set a default deprioritization limit of 100 with a step size of 10. So, assuming that a product is triggered with the default priority 60 as seen for SLE incident tests, up to three deprioritizations are conducted, so in total not more than 4 unfinished builds should exist. In qem-bot the obsoletion flag is set separately for incidents and aggregates, so on further problems one could still consider using different settings. However, in general I would prefer to give all tests a chance to finish regardless of the state of external repositories, as having a finished history of jobs can help test reviewing.
Let's do some math.
- we get ~70 livepatch incidents in one batch every month
- each incident gets 141 test jobs (63 for PPC64LE alone)
- most of those incidents get restarted due to repo changes (rebuilds, channel settings changes, patchinfo changes), often more than once
That's 9870 livepatch jobs in total every month (4410 for PPC64LE alone). With current settings and 3 incident restarts on average, you'll get 29610 jobs in the queue at once (13230 for PPC64LE alone).
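For anyone who wants to re-run the numbers with different inputs, the arithmetic can be written down as below. The constant and function names are made up for this sketch, and it reads "3 incident restarts" as three scheduling rounds sitting in the queue together, which is what the totals above imply.

```python
INCIDENTS_PER_BATCH = 70         # livepatch incidents per monthly batch
JOBS_PER_INCIDENT = 141          # test jobs per incident, all architectures
PPC64LE_JOBS_PER_INCIDENT = 63   # of those, jobs on PPC64LE
ROUNDS_IN_QUEUE = 3              # scheduling rounds that are not obsoleted


def queued_jobs(jobs_per_incident, rounds=1):
    """Jobs in the queue for one batch after `rounds` scheduling rounds."""
    return INCIDENTS_PER_BATCH * jobs_per_incident * rounds


print(queued_jobs(JOBS_PER_INCIDENT))                           # 9870
print(queued_jobs(PPC64LE_JOBS_PER_INCIDENT))                   # 4410
print(queued_jobs(JOBS_PER_INCIDENT, ROUNDS_IN_QUEUE))          # 29610
print(queued_jobs(PPC64LE_JOBS_PER_INCIDENT, ROUNDS_IN_QUEUE))  # 13230
```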
MDoucha wrote in #note-17:
- most of those incidents get restarted due to repo changes (rebuilds, channel settings changes, patchinfo changes), often more than once
That's 9870 livepatch jobs in total every month (4410 for PPC64LE alone). With current settings and 3 incident restarts on average, you'll get 29610 jobs in the queue at once (13230 for PPC64LE alone).
Why should there be 3 incident restarts happening within a short timeframe?
okurz wrote in #note-18:
why should there be 3 incident restarts happening within short timeframe?
Because the incident manager often has to make manual changes in the incident after it was submitted into openQA, usually because some trivial issue was found by the tests, like misconfigured update channels or a wrong incident number in the repo metadata. The bot will retrigger all tests after every change in the incident repo.
OK. If your concern is only the live-patch tests, how about using _OBSOLETE=1 in the corresponding product settings for live-patch tests?
okurz wrote in #note-20:
ok. If your concern is only the live-patch tests how about using _OBSOLETE=1 in according product settings for live-patch tests?
The same applies to all maintenance update tests. Livepatches are just the most obvious and most painful fallout of this change. Single incident jobs need to be obsoleted by a fresh isos post. Letting the old jobs finish will not yield any interesting results. It's just a pointless waste of testing resources.
MDoucha wrote in #note-21:
The same applies to all maintenance update test. Livepatches are just the most obvious and most painful fallout of this change. Single incident jobs need to be obsoleted by fresh isos post. Letting the old jobs finish will not yield any interesting results. It's just a pointless waste of testing resources.
Other people's opinions might differ. I suggest opening up a discussion in a broader context, e.g. in Slack #eng-testing, and asking others; otherwise I tend to keep the behaviour as is.