action #178204
open
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #178243: [epic] More efficient handling of big job schedules, not executable jobs, never matching worker classes, etc.
Reduce test start time on openqa.suse.de size:S
Added by gpuliti about 21 hours ago.
Updated 8 minutes ago.
Category:
Regressions/Crashes
- Copied from action #174235: Cover code of os-autoinst path script/os-autoinst-openvswitch fully (statement coverage) size:S added
- Copied from deleted (action #174235: Cover code of os-autoinst path script/os-autoinst-openvswitch fully (statement coverage) size:S)
- Tags set to osd, infra, administration, openqa, tests
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Category changed from Regressions/Crashes to Regressions/Crashes
- Priority changed from Normal to Urgent
Made urgent as this is related to a recent alert which is not silenced and has no mitigation applied yet
I mentioned the problematic old jobs on #eng-testing:
There are jobs scheduled on OSD with the worker class qemu_x86_64,intel,tap. Those cannot be scheduled because the combination intel,tap doesn't exist at the moment. I suppose qesapworker-prgX workers would in theory provide that but the tap worker class is disabled there as tap_secondary. Not sure what the best solution is.
There is also an s390-kvm,tap job, which is another combination that doesn't exist.
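To illustrate why such jobs stay scheduled forever, here is a minimal sketch of the matching rule, assuming a job can only be picked up by a worker that provides every class listed in the job's worker class setting. The function name, worker names and class lists are hypothetical examples, not the real OSD configuration and not openQA's actual scheduler code.

```python
def unschedulable_jobs(jobs, workers):
    """Return ids of jobs whose worker class combination no worker fully provides.

    jobs:    mapping of job id -> comma-separated worker class string
    workers: mapping of worker name -> list of worker classes it provides
    """
    worker_sets = [set(classes) for classes in workers.values()]
    stuck = []
    for job_id, worker_class in jobs.items():
        # Split "a,b,c" into the set of classes the job requires
        required = {c.strip() for c in worker_class.split(",") if c.strip()}
        # The job is schedulable only if one worker provides all required classes
        if not any(required <= provided for provided in worker_sets):
            stuck.append(job_id)
    return stuck


# Hypothetical data: "tap" is only offered as "tap_secondary" on the intel worker
workers = {
    "example-worker-1": ["qemu_x86_64", "tap"],
    "example-worker-2": ["qemu_x86_64", "intel", "tap_secondary"],
    "example-worker-3": ["s390-kvm"],
}
jobs = {
    1: "qemu_x86_64,intel,tap",
    2: "s390-kvm,tap",
    3: "qemu_x86_64,tap",
}
print(unschedulable_jobs(jobs, workers))  # → [1, 2]
```

Jobs 1 and 2 are reported because each class exists somewhere, but never together on a single worker. Job 3 is fine.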
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to okurz
- Description updated (diff)
- Priority changed from Urgent to High
- Related to action #73174: [osd][alert] Job age (scheduled) (median) alert added
- Parent task set to #178243
- Description updated (diff)
- Priority changed from High to Normal
Asked in Slack #discuss-qe-security https://suse.slack.com/archives/C044KDGKW58/p1741003651502569
(Oliver Kurz) Hi, I found https://openqa.suse.de/tests/16908416# which has been scheduled for 4 days and not picked up. Jobs with this worker class config can never be picked up as there are no workers for "qemu_x86_64,intel,tap". Unless somebody added those tests only recently, I wonder why that did not impact you more, as with such jobs never finishing, build validation would never finish either. Can you elaborate?
(Timo Jyrinki) They worked in October when they were added https://openqa.suse.de/tests/15723250, so it's likely a bit of wishful thinking: that the worker type went away due to infrastructure arrangements and would come back some day, and that the stuck jobs would be a useful indicator that "this type of worker is still missing". The same happened e.g. with coppi (x86 IPMI), it eventually came back after we had a couple of months of similar never-executed jobs. I'm not sure why the next & previous tab doesn't correctly show the months of runs, it only shows whatever is being looked at right now + the latest (scheduled) one. We should also still have our own ticket to track that. Maybe it has not been created because scheduled / cancelled jobs do not prevent "black badge" from appearing, only failing ones, and the black badging has been our main goal of review.
(Oliver Kurz) That's a very good explanation. The point "it has not been created because scheduled / cancelled jobs do not prevent "black badge" from appearing" is crucial for us to consider. […] from the reviewer's perspective it makes sense to look at the black certificate, which means: there are no failures left for review. But from a product quality perspective that is for sure not enough, because everything could be reviewed while critical failures or a critical loss of test coverage remain. So am I right to assume that the main problem is then the mediocre process around build validation, as in "just tell me if you found critical bugs", not caring if tests never run?
(Timo Jyrinki) That is one way of putting it, yes. The focus is on "did you find critical bugs" combined with "report tomorrow afternoon now that we finally managed to do a build", not "were you able to fix all issues you may have in test infra or tests, to have full coverage, and if not how much more time you need?". Of course, generally when doing a report I consider if the coverage was "ok", but just on a high level compared to the previous builds on the summary page (taking into account possible refactorings affecting the numbers), and if we have considerably lower coverage I'll report that too.
(Oliver Kurz) ok. So I am tending towards not alerting the tools team on too high job age, as we can't fix such cases of tests not being picked up without you as test owners. I understand why there is little motivation on your side to do it, so we also shouldn't care about it, as such jobs will be cancelled anyway after 7 days when not picked up.
- Subject changed from Reduce test start time on openqa.suse.de to Reduce test start time on openqa.suse.de size:S
- Description updated (diff)