action #174583


openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence? size:S

Added by jbaier_cz 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-12-19
Due date:
% Done: 0%
Estimated time:

Description

Observation

The pipeline is failing because the openQA jobs got obsoleted:

See: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3562638

{"blocked_by_id":null,"id":4713396,"result":"obsoleted","state":"done"}
{"blocked_by_id":null,"id":4713397,"result":"obsoleted","state":"done"}

This happened for an unknown reason: we don't trigger with _OBSOLETE, and obsoletion should not be the default according to the openQA documentation.
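
For reference, the state/result pairs shown above can be queried directly from the openQA API; a minimal sketch using openqa-cli and jq (the o3 host is an assumption here, matching the obsoletion case discussed in later comments):

# query one of the jobs above and extract the fields the pipeline checks
openqa-cli api --host https://openqa.opensuse.org jobs/4713396 | jq '.job | {id, state, result, blocked_by_id}'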

The multi-machine case looks a bit more involved, e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3625091:

{"blocked_by_id":null,"id":16374878,"result":"skipped","state":"cancelled"}
{"blocked_by_id":null,"id":16374879,"result":"timeout_exceeded","state":"done"}

Acceptance Criteria

  • AC1: Unfinished jobs don't cause failures in GitLab pipelines

Suggestions

  • Verify if this is a specific worker or workers and take them out of production
  • Consider restarting affected jobs
  • An "obsolete" should be considered part of expected behavior. How about a new openQA API route to follow job obsolescence? -> handled in #175299
  • Ignore the case of "obsoleted" jobs as the pipeline runs frequently enough anyway; check why jobs ended up as obsoleted even though scripts-ci doesn't trigger with obsoletion (see the sketch after this list)
  • Check whether we cancel the full parallel cluster in case a job in it is cancelled/obsoleted as we also saw jobs with parallel dependencies ending up with the result "timeout_exceeded".
    • Treat skipped/cancelled the same as obsoleted (and ignore it)
    • Ensure this is logged in case it is not always the case
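
A minimal sketch of how the pipeline itself could tolerate such results without any new openqa-cli option (the set of "ignorable" results and the job IDs are assumptions for illustration, not current scripts-ci code):

#!/bin/bash
# hypothetical wrapper around "openqa-cli monitor": do not fail the pipeline
# only because jobs ended up obsoleted/skipped/cancelled
host=https://openqa.opensuse.org   # o3 assumed; adjust to the instance the pipeline targets
job_ids="4713396 4713397"          # example IDs from the observation above

openqa-cli monitor --host "$host" $job_ids && exit 0

# monitor reported failure: only fail if a job ended in a genuinely bad result
for id in $job_ids; do
    result=$(openqa-cli api --host "$host" "jobs/$id" | jq -r '.job.result')
    case $result in
        passed|softfailed|obsoleted|skipped|user_cancelled) ;;
        *) echo "job $id ended with result '$result'" >&2; exit 1 ;;
    esac
done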

Mitigations


Related issues 3 (1 open, 2 closed)

Related to openQA Tests (public) - action #175698: [tools][multi-machine tests] Timeout_exceeded on multiple workers including arm1, arm2 and mania (Resolved, mkittler, 2025-01-17)

Copied to openQA Project (public) - action #175299: Option to ignore obsoleted jobs when using openqa-cli monitor (New, 2024-12-19)

Copied to openQA Project (public) - action #175305: Flag to return restarted jobs when using openQA jobs API route size:S (Resolved, dheidler, 2024-12-19)

Actions #1

Updated by jbaier_cz 2 months ago

  • Description updated (diff)
Actions #2

Updated by okurz 2 months ago

  • Tags changed from alert to alert, infra, reactive work
  • Priority changed from High to Urgent
Actions #3

Updated by gpuliti 2 months ago

  • Assignee set to gpuliti
Actions #4

Updated by jbaier_cz 2 months ago

  • Description updated (diff)
Actions #5

Updated by livdywan 2 months ago

  • Description updated (diff)
Actions #6

Updated by gpuliti 2 months ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

I've rerun the job and now it is successful: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3564579

Actions #7

Updated by jbaier_cz 2 months ago

  • Status changed from Resolved to Feedback
  • Priority changed from Urgent to Normal

That won't stop the issue from happening again. Of course the rerun helped, because now the openQA jobs are not obsoleted (and that mitigates the urgency). IMHO the solution to this ticket is to not consider obsoleted jobs as a failure in the pipeline.

Actions #8

Updated by gpuliti 2 months ago

  • % Done changed from 100 to 50
Actions #9

Updated by gpuliti 2 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (gpuliti)
Actions #10

Updated by okurz 2 months ago

  • Status changed from Workable to New
Actions #11

Updated by okurz 2 months ago

  • Tags changed from alert, infra, reactive work to alert, reactive work
  • Subject changed from openqa/scripts-ci pipeline fails, jobs obsoleted to openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow restarts?
  • Description updated (diff)
  • % Done changed from 50 to 0
Actions #13

Updated by okurz about 2 months ago

  • Priority changed from Normal to High
Actions #14

Updated by okurz about 2 months ago

  • Priority changed from High to Urgent
Actions #15

Updated by livdywan about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to livdywan

Taking a look as discussed

Actions #16

Updated by livdywan about 2 months ago

  • Status changed from In Progress to Workable

I'm afraid there is something more important on my mind.

Actions #17

Updated by mkittler about 2 months ago

  • Subject changed from openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow restarts? to openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow job obsolescence? size:S
  • Description updated (diff)
Actions #18

Updated by livdywan about 2 months ago

  • Description updated (diff)
Actions #19

Updated by livdywan about 2 months ago

  • Status changed from Workable to In Progress

There's no dry run and no unit tests, so to see which jobs are being filtered I'm also splitting the code a little for manual validation.

Actions #20

Updated by livdywan about 2 months ago

  • Status changed from In Progress to Feedback
Actions #21

Updated by livdywan about 2 months ago

https://github.com/os-autoinst/openQA/pull/6101: the "monitor" command we rely on here needs to support skipping "aborted" jobs, which it turns out we already have a constant for.

Actions #22

Updated by livdywan about 2 months ago

  • Status changed from Feedback to In Progress

I should know better than to put an Urgent ticket in feedback 🙈

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Actions #23

Updated by tinita about 2 months ago

livdywan wrote in #note-22:

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Why wouldn't that help? It would schedule new jobs.

Actions #24

Updated by openqa_review about 2 months ago

  • Due date set to 2025-01-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #25

Updated by livdywan about 2 months ago

tinita wrote in #note-23:

livdywan wrote in #note-22:

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Why wouldn't that help? It would schedule new jobs.

It won't help ensure the original jobs are known to pass or be superseded, as mentioned in the GitHub pull request thread. But that is something I also wanted to discuss again to make sure we are on the same page.

Actions #26

Updated by livdywan about 2 months ago

  • Description updated (diff)
  • Priority changed from Urgent to High

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Actually, as briefly discussed in the daily, it should be fine to pause the pipelines now. Previously I didn't want to do it as it wasn't clear whether the discussed failures were related to the same issue, but we are getting alert-fatigued enough at this point.

Actions #27

Updated by tinita about 2 months ago

livdywan wrote in #note-25:

It won't help ensure the original jobs are known to pass or be superseded, as mentioned in the GitHub pull request thread. But that is something I also wanted to discuss again to make sure we are on the same page.

My approach in the PR was: check if the job was obsoleted/cancelled and do the openqa-cli schedule call again.
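
A minimal sketch of that idea, with schedule_and_monitor standing in for whatever the pipeline already runs (e.g. openqa-schedule-mm-ping-test); this is an illustration, not the actual PR:

# retry the whole schedule+monitor step a few times in case the jobs were
# obsoleted or cancelled in the meantime
for attempt in 1 2 3; do
    if schedule_and_monitor; then
        exit 0
    fi
    echo "attempt $attempt did not succeed, scheduling again" >&2
done
exit 1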

Actions #28

Updated by okurz about 2 months ago

  1. add API query parameter "follow=1" to follow restarts
  2. investigate why jobs ended up as "obsoleted" as we don't trigger with obsoletion and obsoletion should not be the default -> cross-check that the builds are not obsoleted by default
Actions #29

Updated by livdywan about 2 months ago

  • Copied to action #175299: Option to ignore obsoleted jobs when using openqa-cli monitor added
Actions #30

Updated by livdywan about 2 months ago

  • Copied to action #175305: Flag to return restarted jobs when using openQA jobs API route size:S added
Actions #31

Updated by livdywan about 1 month ago

I don't want to put it to Workable or Feedback right now, but the investigation is going slower because of other ongoing tickets.

  • Pipelines are running but manually monitored (by me) without alert emails.
  • No fix per se identified so far.
Actions #32

Updated by jbaier_cz about 1 month ago

  • Related to action #175698: [tools][multi-machine tests] Timeout_exceeded on multiple workers including arm1, arm2 and mania added
Actions #33

Updated by livdywan about 1 month ago · Edited

  • Description updated (diff)
  • Status changed from In Progress to Workable
  • Assignee deleted (livdywan)

Regular emails enabled again.

Now as for how openqa-schedule-mm-ping-test ends up obsoleting jobs despite our docs saying otherwise:

  • t/43-cli-schedule.t doesn't cover this since "cancelled" jobs are mocked here.
  • t/api/02-iso.t has a build obsoletion/deprioritization case using API routes directly. And these tests fail when run without _OBSOLETE or with _OBSOLETE=0 set.

I'm documenting what I was looking at since I wanted to come up with a unit test to cover the gap but couldn't find what we aren't testing here.

Actions #34

Updated by mkittler about 1 month ago

  • Assignee set to mkittler
Actions #35

Updated by mkittler about 1 month ago

  • Status changed from Workable to In Progress

With #175305 resolved, this only leaves implementing an option to ignore obsoleted jobs, which has already been attempted in https://github.com/os-autoinst/openQA/pull/6101. However, for this we now have #175299, which is not on the backlog.

This leaves the following for this ticket:

  • Resume pipelines when #175299 is merged and deployed.
  • Check whether we cancel the full parallel cluster in case a job in it is cancelled/obsoleted as we also saw jobs with parallel dependencies ending up with the result "timeout_exceeded".

So I'm looking into the last point.

Actions #36

Updated by mkittler about 1 month ago · Edited

We are in fact explicitly excluding parallel children when cancelling jobs from the job cluster by default. This is how the code is written and it looks intentional but wrong at the same time. It explains the timed-out job we saw.

EDIT: I misread the not so straightforward code. We actually do cancel the cluster completely by default. There must nevertheless be a bug in the logic.

EDIT: I couldn't find any problem with the code, though. I extended the unit tests and they didn't show this problem: https://github.com/os-autoinst/openQA/pull/6130

EDIT: I'm not sure how the skipped job even ended up skipped. There's nothing in the audit log.

I suppose at this point it isn't worth looking into this case any further.

So for now I'm just blocking this ticket on #175299.

Actions #37

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Blocked
Actions #38

Updated by okurz about 1 month ago

  • Status changed from Blocked to Workable

First, we shouldn't block tickets in the backlog on tickets in the future. Second, we had identified that we don't trigger the scripts-ci pipelines with obsoletion enabled, so unless our documentation is wrong about the default, jobs should not be obsoleted, hence we don't need #175299. Third, this ticket has a due date of 2025-01-24 and it's unrealistic to assume that #175299 would be finished by then.

Actions #39

Updated by okurz about 1 month ago

  • Subject changed from openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow job obsolescence? size:S to openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence?
  • Description updated (diff)
  • Status changed from Workable to New

The latest failure from today is https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3688628#L56 showing

{"blocked_by_id":null,"id":16517839,"result":"parallel_failed","state":"done"}
{"blocked_by_id":null,"id":16517840,"result":"incomplete","state":"done"}

The latter is https://openqa.suse.de/tests/16517840, which ended up incomplete with "Reason: backend died: QEMU exited unexpectedly, see log for details" and was cloned automatically as https://openqa.suse.de/tests/16517843. How about just using the new "follow" query parameter in scripts-ci?
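
For illustration, such a query would look roughly like this, assuming the "follow" parameter is accepted on the single-job route as implemented for #175305 (job ID taken from the failure above):

# with follow=1 the API is expected to return the latest job in the
# restart/clone chain instead of the original, superseded one
openqa-cli api --host https://openqa.suse.de 'jobs/16517840?follow=1' | jq '.job | {id, state, result, clone_id}'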

I updated the ticket and removed the estimate as I removed that much from it. Are you ok with the ticket changes?

Actions #40

Updated by okurz about 1 month ago

  • Subject changed from openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence? to openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence? size:S
Actions #41

Updated by okurz about 1 month ago

  • Status changed from New to In Progress
Actions #42

Updated by mkittler about 1 month ago · Edited

The failure from today (https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3688628#L56) is actually a different issue, see #170209#note-38.

I checked my inbox again and saw that the pipeline also failed yesterday, and that failure was actually one of the cases we also saw previously when creating this ticket (a combination of timeout_exceeded and skipped, see https://openqa.suse.de/tests/16512096 and https://openqa.suse.de/tests/16512097). So it may be worth looking into a little further after all.

The pipelines are also still running. Depending on what I can find out (or fix) I'll see whether it makes sense to pause them again.

Actions #43

Updated by mkittler about 1 month ago · Edited

I think the suggestion

Check whether we cancel the full parallel cluster in case a job in it is cancelled/obsoleted as we also saw jobs with parallel dependencies ending up with the result "timeout_exceeded".

is maybe looking at the problem from the wrong perspective. Maybe it is simply that only one of the two jobs is ever executed. The one that is executed then exceeds the timeout, and the other job is cancelled because of it. This would not explain the case of obsoleted jobs and raises the question why only one of the jobs is assigned to a worker. However, it is very likely what happened because t_finished of both jobs is the same:

openqa=> select id, state, result, reason, t_created, t_started, t_finished from jobs where id in ( 16512096, 16512097 ) order by id;
    id    |   state   |      result      |                    reason                     |      t_created      |      t_started      |     t_finished      
----------+-----------+------------------+-----------------------------------------------+---------------------+---------------------+---------------------
 16512096 | done      | timeout_exceeded | timeout: test execution exceeded MAX_JOB_TIME | 2025-01-20 17:13:24 | 2025-01-20 18:42:47 | 2025-01-20 20:43:24
 16512097 | cancelled | skipped          |                                               | 2025-01-20 17:13:24 |                     | 2025-01-20 20:43:24
(2 rows)

It can of course be that 16512097 was assigned to a worker but the worker went away for some reason. Then the assignment is taken back. However, normally a re-assignment is supposed to happen.

Actions #44

Updated by mkittler about 1 month ago · Edited

Looks like 16512097 was assigned to a worker. Not sure what happened then but apparently the assignment wasn't effectively happening and then the job was skipped:

$ xzgrep '16512097' /var/log/openqa_scheduler*
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:48.515970Z] [debug] [pid:1734] Assigned job '16512097' to worker ID '2672'
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:48.517437Z] [debug] [pid:1734] [Job#16512097] Prepare for being processed by worker 2672
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:48.544525Z] [debug] [pid:1734] Sent job(s) '16512097' to worker '2672' (worker33:26)
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:48.822243Z] [debug] [pid:1734] Allocated: { job => 16512097, worker => 2672 }
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:43:29.559985Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:43:40.989996Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready
…
/var/log/openqa_scheduler.3.xz:[2025-01-20T20:18:33.044806Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready
/var/log/openqa_scheduler.3.xz:[2025-01-20T20:19:41.399959Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready
/var/log/openqa_scheduler.3.xz:[2025-01-20T20:22:15.657510Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready
/var/log/openqa_scheduler.3.xz:[2025-01-20T20:22:57.245009Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready

The other job was assigned as expected:

$ xzgrep '16512096' /var/log/openqa_scheduler*
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:47.093956Z] [debug] [pid:1734] Need to schedule 2 parallel jobs for job 16512096 (with priority 50)
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:47.120683Z] [debug] [pid:1734] Assigned job '16512096' to worker ID '2679' (worker33:33)
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:47.123886Z] [debug] [pid:1734] [Job#16512096] Prepare for being processed by worker 2679
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:47.154420Z] [debug] [pid:1734] Created network for 16512096: 3
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:47.184515Z] [debug] [pid:1734] Sent job(s) '16512096' to worker '2679'
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:48.821476Z] [debug] [pid:1734] Allocated: { job => 16512096, worker => 2679 }

I don't think the "dependent jobs are not ready" situation is caused by Minion jobs because scheduling this product shouldn't create any Minion jobs. We have git_auto_update = no on OSD and the jobs don't specify any Git URLs. I also confirmed that no Minion jobs are created by running openqa_url=http://localhost:9526 distri=sle version=15-SP5 flavor=Server-DVD-Updates test_name=ovs-client ./openqa-schedule-mm-ping-test locally.

I cannot reproduce the problem locally with two workers using PARALLEL_ONE_HOST_ONLY=1. Even if one of the workers doesn't react to the assignment and the half-scheduled cluster needs to be repaired, everything works as expected. I can reproduce it if I comment out $self->_pick_siblings_of_running($allocated_jobs, $allocated_workers);. So there is most likely an issue within that function.

EDIT: I can reproduce the following situation that might be exactly what happened in production: when only one of two worker slots accepts its assignment and then only another worker slot on a mismatching host is available to repair the half-assigned cluster, the "Skipping job … because dependent jobs are not ready" message is logged. This is quite confusing; it just means there's no free slot on the required worker host to schedule the job.
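
For reference, the local two-worker setup used for the reproduction above was roughly the following (a sketch only: the workers.ini placement of PARALLEL_ONE_HOST_ONLY and the worker slot numbers are assumptions about a local setup, not the exact configuration used):

# two local worker slots on the same host, with the one-host-only restriction
sudo tee -a /etc/openqa/workers.ini <<'EOF'
[global]
HOST = http://localhost:9526
PARALLEL_ONE_HOST_ONLY = 1
EOF
sudo systemctl start openqa-worker@1 openqa-worker@2

# then schedule the mm ping test against the local instance as in comment #44
openqa_url=http://localhost:9526 distri=sle version=15-SP5 flavor=Server-DVD-Updates \
    test_name=ovs-client ./openqa-schedule-mm-ping-test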

Actions #45

Updated by mkittler about 1 month ago

PR https://github.com/os-autoinst/openQA/pull/6132 will fix scheduling-related problems that could have led to this.

This still leaves the obsoletion case but I guess it isn't worth looking into it further.

Actions #46

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

The PR has been merged and deployed. Let's see whether it helps.

Actions #47

Updated by mkittler about 1 month ago · Edited

  • Status changed from Feedback to In Progress

The pipeline failed again. This time it is the obsoletion case again, see https://openqa.opensuse.org/tests/4801850 and https://openqa.opensuse.org/tests/4801851. The reason is "cancelled based on job settings".

That means my changes for the timeout/skipped case might still have done the trick but I will look into the obsolescence case once more.

Actions #48

Updated by mkittler about 1 month ago

The documentation about _OBSOLETE=1 says that it obsoletes "jobs in older builds with same DISTRI and VERSION". Judging by the code, FLAVOR and ARCH must also match. In our case DISTRI and VERSION are just "opensuse" and "Tumbleweed" and FLAVOR and ARCH are just "DVD" and "x86_64". That's all very generic, so the jobs are probably obsoleted by a completely unrelated scheduled product. We specify _GROUP=0, but judging by the documentation and code the group is not one of the filter parameters.
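
For illustration, any product post matching those generic values and using _OBSOLETE=1 would cancel the jobs of older builds regardless of job group; a hypothetical example (the BUILD value is made up):

# an unrelated scheduled product like this obsoletes jobs of older builds
# with the same DISTRI/VERSION/FLAVOR/ARCH
openqa-cli api --host https://openqa.opensuse.org -X POST isos \
    DISTRI=opensuse VERSION=Tumbleweed FLAVOR=DVD ARCH=x86_64 \
    BUILD=20250130 _OBSOLETE=1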

It should be easier to find out which scheduled product caused the obsolescence, so I created https://github.com/os-autoinst/openQA/pull/6134.

Of course we still need to think how to prevent the obsolescence from happening here.

Actions #49

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

This PR will prevent the obsolescence from happening: https://github.com/os-autoinst/scripts/pull/366

Actions #51

Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved

I haven't seen another failed CI run so I would consider the ticket resolved. (We can still re-open it if we see the same kinds of issues again.)

Actions #52

Updated by okurz about 1 month ago

  • Due date deleted (2025-01-24)
