action #174583
openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow job obsolescence? size:S
0%
Description
Observation
The pipeline is failing because the openQA jobs got obsoleted:
See: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3562638
{"blocked_by_id":null,"id":4713396,"result":"obsoleted","state":"done"}
{"blocked_by_id":null,"id":4713397,"result":"obsoleted","state":"done"}
The multimachine case looks a bit more involved e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3625091 :
{"blocked_by_id":null,"id":16374878,"result":"skipped","state":"cancelled"}
{"blocked_by_id":null,"id":16374879,"result":"timeout_exceeded","state":"done"}
Acceptance Criteria
- AC1: Obsoleted jobs don't cause failures in GitLab pipelines
Suggestions
- Verify if this is a specific worker or workers and take them out of production
- Consider restarting affected jobs
- An "obsolete" should be considered part of expected behavior. How about a new openQA API route to follow job obsolescence?
- Ignore the case of "obsoleted" jobs as the pipeline runs frequently enough anyway.
- Check whether we cancel the full parallel cluster in case a job in it is cancelled/obsoleted as we also saw jobs with parallel dependencies ending up with the result "timeout_exceeded".
- Treat skipped/cancelled the same as obsoleted (and ignore it)
- Ensure this is logged in case it is not always the case
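The "ignore obsoleted/skipped" suggestions above could be sketched roughly as follows. This is only an illustration and not how scripts-ci or openqa-cli implement it: it parses one JSON job record per line, as in the Observation, logs ignored jobs, and fails only on other non-passing results. The sets `IGNORED_RESULTS` and `OK_RESULTS` and the function names are assumptions made for this example.

```python
import json

# Assumption: results that should not fail the pipeline (per the suggestions above).
IGNORED_RESULTS = {"obsoleted", "skipped"}
# Assumption: results counted as success.
OK_RESULTS = {"passed", "softfailed"}


def classify(lines):
    """Split JSON job records into (ok, ignored, failed) lists of job ids."""
    ok, ignored, failed = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        job = json.loads(line)
        result = job.get("result")
        if result in IGNORED_RESULTS:
            ignored.append(job["id"])
        elif result in OK_RESULTS:
            ok.append(job["id"])
        else:
            failed.append(job["id"])
    return ok, ignored, failed


def exit_code(lines):
    """Return 0 unless at least one job has a result that is neither ok nor ignored."""
    _ok, ignored, failed = classify(lines)
    for job_id in ignored:
        # Log ignored jobs so the behavior stays visible in the pipeline output.
        print(f"ignoring job {job_id}: obsoleted or skipped")
    return 1 if failed else 0
```

With this sketch the two obsoleted jobs from the first pipeline would no longer fail it, while the multi-machine case would still fail because of the "timeout_exceeded" job.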
Mitigations
- ~DONE Pause affected pipelines on GitLab i.e. openqa-schedule-mm-ping-test o3/osd~
Updated by okurz about 1 month ago
- Tags changed from alert to alert, infra, reactive work
- Priority changed from High to Urgent
Updated by gpuliti about 1 month ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
I've rerun the job and it is now successful: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3564579
Updated by jbaier_cz about 1 month ago
- Status changed from Resolved to Feedback
- Priority changed from Urgent to Normal
That won't stop the issue from happening again. Of course the rerun helped, because now the openQA jobs are not obsoleted (and that mitigates the urgency). IMHO the solution to this ticket is to not consider obsoleted jobs as a failure in the pipeline.
Updated by okurz 27 days ago
- Tags changed from alert, infra, reactive work to alert, reactive work
- Subject changed from openqa/scripts-ci pipeline fails, jobs obsoleted to openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow restarts?
- Description updated (diff)
- % Done changed from 50 to 0
Updated by livdywan 10 days ago
https://github.com/os-autoinst/openQA/pull/6101 - the "monitor" command we rely on here needs to support skipping of "aborted" jobs, which it turns out we already have a constant for.
Updated by openqa_review 9 days ago
- Due date set to 2025-01-24
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan 9 days ago
tinita wrote in #note-23:
livdywan wrote in #note-22:
Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.
Why wouldn't that help? It would schedule new jobs.
It won't help ensuring the original jobs are known to pass or be superseded as mentioned in the GitHub pull request thread? But that is something I also wanted to discuss again to make sure we are on the same page.
Updated by livdywan 9 days ago
- Description updated (diff)
- Priority changed from Urgent to High
Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.
Actually, as briefly discussed in the daily, it should be fine to pause the pipelines now. Previously I didn't want to do that because it wasn't clear whether the failures discussed were related to the same issue, but we are getting alert-fatigued enough at this point.
Updated by tinita 9 days ago
livdywan wrote in #note-25:
It won't help ensuring the original jobs are known to pass or be superseded as mentioned in the GitHub pull request thread? But that is something I also wanted to discuss again to make sure we are on the same page.
My approach in the PR was: check if the job was obsoleted/cancelled and do the openqa-cli schedule call again.
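A rough sketch of that reschedule-on-obsoletion idea, not the code from the PR: the scheduling and waiting steps are passed in as callables (`schedule` and `wait` are hypothetical stand-ins for whatever wraps the openqa-cli schedule call), and `max_attempts` is an assumed safety limit.

```python
def needs_reschedule(job):
    """A job counts as superseded if it was obsoleted or ended up cancelled."""
    return job.get("result") == "obsoleted" or job.get("state") == "cancelled"


def run_with_reschedule(schedule, wait, max_attempts=3):
    """Schedule jobs and wait for them; schedule again if any were obsoleted/cancelled.

    schedule() -> list of job ids (hypothetical wrapper around the schedule call)
    wait(job_ids) -> list of finished job dicts with "id", "result", "state"
    Returns the job dicts of the last attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    jobs = []
    for attempt in range(1, max_attempts + 1):
        jobs = wait(schedule())
        superseded = [job["id"] for job in jobs if needs_reschedule(job)]
        if not superseded:
            return jobs
        print(f"attempt {attempt}: jobs {superseded} obsoleted/cancelled")
    return jobs
```

The caller would still have to evaluate the returned jobs, since after the last attempt they may remain obsoleted or have failed for unrelated reasons.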
Updated by livdywan 6 days ago
- Copied to action #175299: Option to ignore obsoleted jobs when using openqa-cli monitor added
Updated by livdywan 6 days ago
- Copied to action #175305: Flag to return restarted jobs when using openQA jobs API route size:S added
Updated by jbaier_cz 2 days ago
- Related to action #175698: [tools][multi-machine tests] Timeout_exceeded on multiple workers including arm1, arm2 and mania added
Updated by livdywan 1 day ago · Edited
- Description updated (diff)
- Status changed from In Progress to Workable
- Assignee deleted (livdywan)
Regular emails enabled again.
Now as for how openqa-schedule-mm-ping-test ends up obsoleting jobs despite our docs saying otherwise:
- t/43-cli-schedule.t doesn't cover this since "cancelled" jobs are mocked here.
- t/api/02-iso.t has a "build obsoletion/deprioritization" case using API routes directly. And these tests fail when run without _OBSOLETE or with _OBSOLETE=0 set.
I'm documenting what I was looking at since I wanted to come up with a unit test to cover the gap but couldn't find what we aren't testing here.