action #174583

open

openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow job obsolescence? size:S

Added by jbaier_cz about 1 month ago. Updated 1 day ago.

Status: Workable
Priority: High
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2024-12-19
Due date: 2025-01-24 (Due in 5 days)
% Done: 0%
Estimated time:

Description

Observation

The pipeline is failing because the openQA jobs got obsoleted:

See: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3562638

{"blocked_by_id":null,"id":4713396,"result":"obsoleted","state":"done"}
{"blocked_by_id":null,"id":4713397,"result":"obsoleted","state":"done"}

The multimachine case looks a bit more involved, e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3625091:

{"blocked_by_id":null,"id":16374878,"result":"skipped","state":"cancelled"}
{"blocked_by_id":null,"id":16374879,"result":"timeout_exceeded","state":"done"}
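
To compare the two cases more easily, here is a minimal sketch (Python, not part of scripts-ci) that parses status lines of the shape shown above and groups job ids by their (state, result) pair. The function name `summarize` is illustrative only.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Group job ids by (state, result) from JSON status lines."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines between status records
        job = json.loads(line)
        groups[(job["state"], job["result"])].append(job["id"])
    return dict(groups)


# Status lines copied from the multimachine example above.
status_lines = [
    '{"blocked_by_id":null,"id":16374878,"result":"skipped","state":"cancelled"}',
    '{"blocked_by_id":null,"id":16374879,"result":"timeout_exceeded","state":"done"}',
]

for key, ids in summarize(status_lines).items():
    print(key, ids)
```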

Acceptance Criteria

  • AC1: Obsoleted jobs don't cause failures in GitLab pipelines

Suggestions

  • Verify if this is a specific worker or workers and take them out of production
  • Consider restarting affected jobs
  • An "obsoleted" result should be considered part of expected behavior. How about a new openQA API route to follow job obsolescence?
  • Ignore the case of "obsoleted" jobs as the pipeline runs frequently enough anyway.
  • Check whether we cancel the full parallel cluster in case a job in it is cancelled/obsoleted as we also saw jobs with parallel dependencies ending up with the result "timeout_exceeded".
    • Treat skipped/cancelled the same as obsoleted (and ignore it)
    • Ensure this is logged in case it is not always the case
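
The "ignore obsoleted/skipped, but log it" suggestions could look roughly like the following sketch. It is an illustration only: the sets `IGNORED_RESULTS` and `OK_RESULTS` and the function `pipeline_exit_code` are assumptions made for this example, not existing scripts-ci or openqa-cli behavior.

```python
# Results treated as "not a failure of this pipeline run". This set is an
# assumption derived from the suggestions above.
IGNORED_RESULTS = {"obsoleted", "skipped"}
# Results counted as success (also an assumption for this sketch).
OK_RESULTS = {"passed", "softfailed"}


def pipeline_exit_code(jobs, log=print):
    """Return 0 unless a job has a result that is neither ok nor ignored."""
    exit_code = 0
    for job in jobs:
        result = job["result"]
        if result in OK_RESULTS:
            continue
        if result in IGNORED_RESULTS:
            # Log ignored jobs so the case stays visible in the pipeline output.
            log(f"ignoring job {job['id']}: state={job['state']} result={result}")
            continue
        log(f"job {job['id']} failed: state={job['state']} result={result}")
        exit_code = 1
    return exit_code
```

Under this policy the two obsoleted jobs from the Observation would yield 0, while the multimachine example would still yield 1 because `timeout_exceeded` is not in the ignored set.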

Mitigations


Related issues 3 (3 open, 0 closed)

  • Related to openQA Tests (public) - action #175698: [tools][multi-machine tests] Timeout_exceeded on multiple workers including arm1, arm2 and mania (Blocked, jbaier_cz, 2025-01-17)
  • Copied to openQA Project (public) - action #175299: Option to ignore obsoleted jobs when using openqa-cli monitor (New, 2024-12-19)
  • Copied to openQA Project (public) - action #175305: Flag to return restarted jobs when using openQA jobs API route size:S (Feedback, dheidler, 2024-12-19, 2025-01-31)

Actions #1

Updated by jbaier_cz about 1 month ago

  • Description updated (diff)
Actions #2

Updated by okurz about 1 month ago

  • Tags changed from alert to alert, infra, reactive work
  • Priority changed from High to Urgent
Actions #3

Updated by gpuliti about 1 month ago

  • Assignee set to gpuliti
Actions #4

Updated by jbaier_cz about 1 month ago

  • Description updated (diff)
Actions #5

Updated by livdywan about 1 month ago

  • Description updated (diff)
Actions #6

Updated by gpuliti about 1 month ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

I've rerun the job and it is now successful: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3564579

Actions #7

Updated by jbaier_cz about 1 month ago

  • Status changed from Resolved to Feedback
  • Priority changed from Urgent to Normal

That won't stop the issue from happening again. Of course the rerun helped, because now the openQA jobs are not obsoleted (and that mitigates the urgency). IMHO the solution to this ticket is to not consider obsoleted jobs as a failure in the pipeline.

Actions #8

Updated by gpuliti 30 days ago

  • % Done changed from 100 to 50
Actions #9

Updated by gpuliti 30 days ago

  • Status changed from Feedback to Workable
  • Assignee deleted (gpuliti)
Actions #10

Updated by okurz 27 days ago

  • Status changed from Workable to New
Actions #11

Updated by okurz 27 days ago

  • Tags changed from alert, infra, reactive work to alert, reactive work
  • Subject changed from openqa/scripts-ci pipeline fails, jobs obsoleted to openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow restarts?
  • Description updated (diff)
  • % Done changed from 50 to 0
Actions #13

Updated by okurz 19 days ago

  • Priority changed from Normal to High
Actions #14

Updated by okurz 11 days ago

  • Priority changed from High to Urgent
Actions #15

Updated by livdywan 11 days ago

  • Status changed from New to In Progress
  • Assignee set to livdywan

Taking a look as discussed

Actions #16

Updated by livdywan 11 days ago

  • Status changed from In Progress to Workable

I'm afraid there is something more important on my mind.

Actions #17

Updated by mkittler 10 days ago

  • Subject changed from openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow restarts? to openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow job obsolescence? size:S
  • Description updated (diff)
Actions #18

Updated by livdywan 10 days ago

  • Description updated (diff)
Actions #19

Updated by livdywan 10 days ago

  • Status changed from Workable to In Progress

There's no dry run and there are no unit tests, so to see which jobs are being filtered I'm also splitting the code a little for manual validation.

Actions #20

Updated by livdywan 10 days ago

  • Status changed from In Progress to Feedback
Actions #21

Updated by livdywan 10 days ago

https://github.com/os-autoinst/openQA/pull/6101: the "monitor" command we rely on here needs to support skipping "aborted" jobs, which it turns out we already have a constant for.

Actions #22

Updated by livdywan 9 days ago

  • Status changed from Feedback to In Progress

I should know better than to put an Urgent ticket in feedback 🙈

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Actions #23

Updated by tinita 9 days ago

livdywan wrote in #note-22:

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Why wouldn't that help? It would schedule new jobs.

Actions #24

Updated by openqa_review 9 days ago

  • Due date set to 2025-01-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #25

Updated by livdywan 9 days ago

tinita wrote in #note-23:

livdywan wrote in #note-22:

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Why wouldn't that help? It would schedule new jobs.

It won't help ensure that the original jobs are known to pass or be superseded, as mentioned in the GitHub pull request thread? But that is something I also wanted to discuss again to make sure we are on the same page.

Actions #26

Updated by livdywan 9 days ago

  • Description updated (diff)
  • Priority changed from Urgent to High

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Actually, as briefly discussed in the daily, it should be fine to pause the pipelines now. Previously I didn't want to do it as it wasn't clear whether the failures discussed were related to the same issue, but we are getting alert-fatigued enough at this point.

Actions #27

Updated by tinita 9 days ago

livdywan wrote in #note-25:

It won't help ensure that the original jobs are known to pass or be superseded, as mentioned in the GitHub pull request thread? But that is something I also wanted to discuss again to make sure we are on the same page.

My approach in the PR was: check if the job was obsoleted/cancelled and do the openqa-cli schedule call again.
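
A minimal sketch of that "check and schedule again" idea, under stated assumptions: `schedule` is a hypothetical callable that wraps the scheduling call, waits for the jobs and returns their final status dicts. The names `needs_reschedule` and `schedule_until_settled` and the attempt limit are made up for illustration and are not taken from the PR.

```python
def needs_reschedule(job):
    """True if the job ended without a real verdict (cancelled or obsoleted)."""
    return job["state"] == "cancelled" or job["result"] == "obsoleted"


def schedule_until_settled(schedule, max_attempts=3):
    """Call schedule() until no returned job needs rescheduling.

    `schedule` is a hypothetical callable: it schedules the jobs, waits for
    them to finish and returns a list of job status dicts.
    Returns (attempts_used, jobs_from_last_attempt).
    """
    attempt = 0
    jobs = []
    while attempt < max_attempts:
        attempt += 1
        jobs = schedule()
        if not any(needs_reschedule(job) for job in jobs):
            break  # every job reached a real result, stop retrying
    return attempt, jobs
```

The attempt limit keeps the pipeline from looping indefinitely if jobs keep getting obsoleted.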

Actions #28

Updated by okurz 6 days ago

  1. Add an API query parameter "follow=1" to follow restarts
  2. Investigate why jobs ended up as "obsoleted" even though we don't trigger with obsoletion and not obsoleting should be the default -> cross-check that builds are not obsoleted by default
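
What "following restarts" could mean on the client side, as a rough sketch: starting from the original job, walk to the newest job that replaced it. This assumes each job record carries a `clone_id` field pointing to its replacement (or null); `get_job` is a hypothetical lookup function, and the proposed "follow=1" parameter itself is not implemented here.

```python
def follow_restarts(job_id, get_job, max_hops=50):
    """Return the newest job in the restart chain starting at job_id.

    `get_job` is a hypothetical callable mapping a job id to a job dict.
    Assumes the dict has "id" and an optional "clone_id" (id of the job
    that replaced it, or None).
    """
    seen = set()
    job = get_job(job_id)
    while job.get("clone_id") is not None:
        if job["id"] in seen or len(seen) >= max_hops:
            raise RuntimeError(f"restart chain of job {job_id} loops or is too long")
        seen.add(job["id"])
        job = get_job(job["clone_id"])
    return job
```

A pipeline could then judge the result of the last job in the chain instead of the original one.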
Actions #29

Updated by livdywan 6 days ago

  • Copied to action #175299: Option to ignore obsoleted jobs when using openqa-cli monitor added
Actions #30

Updated by livdywan 6 days ago

  • Copied to action #175305: Flag to return restarted jobs when using openQA jobs API route size:S added
Actions #31

Updated by livdywan 2 days ago

I don't want to put it to Workable or Feedback right now, but the investigation is going slower because of other ongoing tickets.

  • Pipelines are running but manually monitored (by me) without alert emails.
  • No fix per se identified so far.
Actions #32

Updated by jbaier_cz 2 days ago

  • Related to action #175698: [tools][multi-machine tests] Timeout_exceeded on multiple workers including arm1, arm2 and mania added
Actions #33

Updated by livdywan 1 day ago · Edited

  • Description updated (diff)
  • Status changed from In Progress to Workable
  • Assignee deleted (livdywan)

Regular emails enabled again.

Now as for how openqa-schedule-mm-ping-test ends up obsoleting jobs despite our docs saying otherwise:

  • t/43-cli-schedule.t doesn't cover this since "cancelled" jobs are mocked here.
  • t/api/02-iso.t has a build obsoletion/deprioritization case using API routes directly. And these tests fail when run without _OBSOLETE or with _OBSOLETE=0 set.

I'm documenting what I was looking at since I wanted to come up with a unit test to cover the gap but couldn't find what we aren't testing here.
