action #174583


openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence? size:S

Added by jbaier_cz 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-12-19
Due date:
% Done: 0%
Estimated time:

Description

Observation

The pipeline is failing because the openQA jobs got obsoleted:

See: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3562638

{"blocked_by_id":null,"id":4713396,"result":"obsoleted","state":"done"}
{"blocked_by_id":null,"id":4713397,"result":"obsoleted","state":"done"}

This happened for an unknown reason: we don't trigger with _OBSOLETE, and obsoletion should not be the default according to the openQA documentation.
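
For reference, the state/result pairs shown above can be queried directly from the openQA API; a minimal sketch using openqa-cli and jq (the o3 host is an assumption here, matching the obsoletion case discussed in later comments):

# query one of the jobs above and extract the fields the pipeline checks
openqa-cli api --host https://openqa.opensuse.org jobs/4713396 | jq '.job | {id, state, result, blocked_by_id}'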

The multi-machine case looks a bit more involved, e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3625091:

{"blocked_by_id":null,"id":16374878,"result":"skipped","state":"cancelled"}
{"blocked_by_id":null,"id":16374879,"result":"timeout_exceeded","state":"done"}

Acceptance Criteria

  • AC1: Unfinished jobs don't cause failures in GitLab pipelines

Suggestions

  • Verify if this is a specific worker or workers and take them out of production
  • Consider restarting affected jobs
  • An "obsolete" should be considered part of expected behavior. How about a new openQA API route to follow job obsolescence? -> handled in #175299
  • Ignore the case of "obsoleted" jobs as the pipeline runs frequently enough anyway; check why jobs ended up as obsoleted even though scripts-ci doesn't trigger with obsoletion (see the sketch after this list)
  • Check whether we cancel the full parallel cluster in case a job in it is cancelled/obsoleted as we also saw jobs with parallel dependencies ending up with the result "timeout_exceeded".
    • Treat skipped/cancelled the same as obsoleted (and ignore it)
    • Ensure this is logged in case it is not always the case
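
A minimal sketch of how the pipeline itself could tolerate such results without any new openqa-cli option (the set of "ignorable" results and the job IDs are assumptions for illustration, not current scripts-ci code):

#!/bin/bash
# hypothetical wrapper around "openqa-cli monitor": do not fail the pipeline
# only because jobs ended up obsoleted/skipped/cancelled
host=https://openqa.opensuse.org   # o3 assumed; adjust to the instance the pipeline targets
job_ids="4713396 4713397"          # example IDs from the observation above

openqa-cli monitor --host "$host" $job_ids && exit 0

# monitor reported failure: only fail if a job ended in a genuinely bad result
for id in $job_ids; do
    result=$(openqa-cli api --host "$host" "jobs/$id" | jq -r '.job.result')
    case $result in
        passed|softfailed|obsoleted|skipped|user_cancelled) ;;
        *) echo "job $id ended with result '$result'" >&2; exit 1 ;;
    esac
done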

Mitigations


Related issues 3 (1 open, 2 closed)

Related to openQA Tests (public) - action #175698: [tools][multi-machine tests] Timeout_exceeded on multiple workers including arm1, arm2 and mania (Resolved, mkittler, 2025-01-17)

Copied to openQA Project (public) - action #175299: Option to ignore obsoleted jobs when using openqa-cli monitor (New, 2024-12-19)

Copied to openQA Project (public) - action #175305: Flag to return restarted jobs when using openQA jobs API route size:S (Resolved, dheidler, 2024-12-19)

Actions #1

Updated by jbaier_cz 2 months ago

  • Description updated (diff)
Actions #2

Updated by okurz 2 months ago

  • Tags changed from alert to alert, infra, reactive work
  • Priority changed from High to Urgent
Actions #3

Updated by gpuliti 2 months ago

  • Assignee set to gpuliti
Actions #4

Updated by jbaier_cz 2 months ago

  • Description updated (diff)
Actions #5

Updated by livdywan 2 months ago

  • Description updated (diff)
Actions #6

Updated by gpuliti 2 months ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

I've rerun the job and now it is successful: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3564579

Actions #7

Updated by jbaier_cz 2 months ago

  • Status changed from Resolved to Feedback
  • Priority changed from Urgent to Normal

That won't stop the issue from happening again. Of course the rerun helped, because now the openQA jobs are not obsoleted (and that mitigates the urgency). IMHO the solution to this ticket is to not consider obsoleted jobs as a failure in the pipeline.

Actions #8

Updated by gpuliti 2 months ago

  • % Done changed from 100 to 50
Actions #9

Updated by gpuliti 2 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (gpuliti)
Actions #10

Updated by okurz 2 months ago

  • Status changed from Workable to New
Actions #11

Updated by okurz 2 months ago

  • Tags changed from alert, infra, reactive work to alert, reactive work
  • Subject changed from openqa/scripts-ci pipeline fails, jobs obsoleted to openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow restarts?
  • Description updated (diff)
  • % Done changed from 50 to 0
Actions #13

Updated by okurz about 2 months ago

  • Priority changed from Normal to High
Actions #14

Updated by okurz about 2 months ago

  • Priority changed from High to Urgent
Actions #15

Updated by livdywan about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to livdywan

Taking a look as discussed

Actions #16

Updated by livdywan about 2 months ago

  • Status changed from In Progress to Workable

I'm afraid there is something more important on my mind.

Actions #17

Updated by mkittler about 2 months ago

  • Subject changed from openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow restarts? to openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow job obsolescence? size:S
  • Description updated (diff)
Actions #18

Updated by livdywan about 2 months ago

  • Description updated (diff)
Actions #19

Updated by livdywan about 2 months ago

  • Status changed from Workable to In Progress

There's no dry run and no unit tests, so to see which jobs are being filtered I'm also splitting the code a little for manual validation.

Actions #20

Updated by livdywan about 2 months ago

  • Status changed from In Progress to Feedback
Actions #21

Updated by livdywan about 2 months ago

https://github.com/os-autoinst/openQA/pull/6101: the "monitor" command we rely on here needs to support skipping "aborted" jobs, which it turns out we already have a constant for.

Actions #22

Updated by livdywan about 2 months ago

  • Status changed from Feedback to In Progress

I should know better than to put an Urgent ticket in feedback 🙈

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Actions #23

Updated by tinita about 2 months ago

livdywan wrote in #note-22:

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Why wouldn't that help? It would schedule new jobs.

Actions #24

Updated by openqa_review about 2 months ago

  • Due date set to 2025-01-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #25

Updated by livdywan about 2 months ago

tinita wrote in #note-23:

livdywan wrote in #note-22:

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Why wouldn't that help? It would schedule new jobs.

It won't help ensure the original jobs are known to pass or be superseded, as mentioned in the GitHub pull request thread. But that is something I also wanted to discuss again to make sure we are on the same page.

Actions #26

Updated by livdywan about 2 months ago

  • Description updated (diff)
  • Priority changed from Urgent to High

Couldn't really come up with a mitigation since retrying the pipeline won't help if a job is obsoleted.

Actually, as briefly discussed in the daily, it should be fine to pause the pipelines now. Previously I didn't want to do it as it wasn't clear whether the discussed failures were related to the same issue, but we are getting alert-fatigued enough at this point.

Actions #27

Updated by tinita about 2 months ago

livdywan wrote in #note-25:

It won't help ensure the original jobs are known to pass or be superseded, as mentioned in the GitHub pull request thread. But that is something I also wanted to discuss again to make sure we are on the same page.

My approach in the PR was: check if the job was obsoleted/cancelled and do the openqa-cli schedule call again.
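
A minimal sketch of that idea, with schedule_and_monitor standing in for whatever the pipeline already runs (e.g. openqa-schedule-mm-ping-test); this is an illustration, not the actual PR:

# retry the whole schedule+monitor step a few times in case the jobs were
# obsoleted or cancelled in the meantime
for attempt in 1 2 3; do
    if schedule_and_monitor; then
        exit 0
    fi
    echo "attempt $attempt did not succeed, scheduling again" >&2
done
exit 1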

Actions #28

Updated by okurz about 2 months ago

  1. add API query parameter "follow=1" to follow restarts
  2. investigate why jobs ended up as "obsoleted" as we don't trigger with obsoletion and obsoletion should not be the default -> cross-check that the builds are not obsoleted by default
Actions #29

Updated by livdywan about 2 months ago

  • Copied to action #175299: Option to ignore obsoleted jobs when using openqa-cli monitor added
Actions #30

Updated by livdywan about 2 months ago

  • Copied to action #175305: Flag to return restarted jobs when using openQA jobs API route size:S added
Actions #31

Updated by livdywan about 1 month ago

I don't want to put it to Workable or Feedback right now, but the investigation is going slower because of other ongoing tickets.

  • Pipelines are running but manually monitored (by me) without alert emails.
  • No fix per se identified so far.
Actions #32

Updated by jbaier_cz about 1 month ago

  • Related to action #175698: [tools][multi-machine tests] Timeout_exceeded on multiple workers including arm1, arm2 and mania added
Actions #33

Updated by livdywan about 1 month ago · Edited

  • Description updated (diff)
  • Status changed from In Progress to Workable
  • Assignee deleted (livdywan)

Regular emails enabled again.

Now as for how openqa-schedule-mm-ping-test ends up obsoleting jobs despite our docs saying otherwise:

  • t/43-cli-schedule.t doesn't cover this since "cancelled" jobs are mocked here.
  • t/api/02-iso.t has a build obsoletion/deprioritization case using API routes directly. And these tests fail when run without _OBSOLETE or with _OBSOLETE=0 set.

I'm documenting what I was looking at since I wanted to come up with a unit test to cover the gap but couldn't find what we aren't testing here.

Actions #34

Updated by mkittler about 1 month ago

  • Assignee set to mkittler
Actions #35

Updated by mkittler about 1 month ago

  • Status changed from Workable to In Progress

With #175305 resolved, this only leaves implementing an option to ignore obsoleted jobs, which has already been attempted in https://github.com/os-autoinst/openQA/pull/6101. However, for this we now have #175299, which is not on the backlog.

This leaves the following for this ticket:

  • Resume pipelines when #175299 is merged and deployed.
  • Check whether we cancel the full parallel cluster in case a job in it is cancelled/obsoleted as we also saw jobs with parallel dependencies ending up with the result "timeout_exceeded".

So I'm looking into the last point.

Actions #36

Updated by mkittler about 1 month ago · Edited

We are in fact explicitly excluding parallel children when cancelling jobs from the job cluster by default. This is how the code is written and it looks intentional but wrong at the same time. It explains the timed-out job we saw.

EDIT: I misread the not so straightforward code. We actually do cancel the cluster completely by default. There must nevertheless be a bug in the logic.

EDIT: I couldn't find any problem with the code, though. I extended the unit tests and they didn't show this problem: https://github.com/os-autoinst/openQA/pull/6130

EDIT: I'm not sure how the skipped job even ended up skipped. There's nothing in the audit log.

I suppose at this point it isn't worth looking into this case any further.

So for now I'm just blocking this ticket on #175299.

Actions #37

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Blocked
Actions #38

Updated by okurz about 1 month ago

  • Status changed from Blocked to Workable

First, we shouldn't block tickets in the backlog on tickets in the future. Second, we had identified that we don't trigger the scripts-ci pipelines with obsoletion enabled, so unless our documentation is wrong about the default, jobs should not be obsoleted, hence we don't need #175299. Third, this ticket has a due date of 2025-01-24 and it's unrealistic to assume that #175299 would be finished by then.

Actions #39

Updated by okurz about 1 month ago

  • Subject changed from openqa/scripts-ci pipeline fails, jobs obsoleted - New openQA API route to follow job obsolescence? size:S to openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence?
  • Description updated (diff)
  • Status changed from Workable to New

The latest failure from today is https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3688628#L56 showing

{"blocked_by_id":null,"id":16517839,"result":"parallel_failed","state":"done"}
{"blocked_by_id":null,"id":16517840,"result":"incomplete","state":"done"}

The latter is https://openqa.suse.de/tests/16517840, which ended up incomplete with "Reason: backend died: QEMU exited unexpectedly, see log for details" and was cloned automatically as https://openqa.suse.de/tests/16517843. How about just using the new "follow" query parameter in scripts-ci?
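
For illustration, such a query would look roughly like this, assuming the "follow" parameter is accepted on the single-job route as implemented for #175305 (job ID taken from the failure above):

# with follow=1 the API is expected to return the latest job in the
# restart/clone chain instead of the original, superseded one
openqa-cli api --host https://openqa.suse.de 'jobs/16517840?follow=1' | jq '.job | {id, state, result, clone_id}'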

I updated the ticket and removed the estimate as I removed that much from it. Are you ok with the ticket changes?

Actions #40

Updated by okurz about 1 month ago

  • Subject changed from openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence? to openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence? size:S
Actions #41

Updated by okurz about 1 month ago

  • Status changed from New to In Progress
Actions #42

Updated by mkittler about 1 month ago · Edited

The failure from today (https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3688628#L56) is actually a different issue, see #170209#note-38.

I checked my inbox again and saw that the pipeline also failed yesterday, and that failure was actually one of the cases we also saw previously when creating this ticket (a combination of timeout_exceeded and skipped, see https://openqa.suse.de/tests/16512096 and https://openqa.suse.de/tests/16512097). So it may be worth looking into a little further after all.

The pipelines are also still running. Depending on what I can find out (or fix) I'll see whether it makes sense to pause them again.

Actions #43

Updated by mkittler about 1 month ago · Edited

I think the suggestion

Check whether we cancel the full parallel cluster in case a job in it is cancelled/obsoleted as we also saw jobs with parallel dependencies ending up with the result "timeout_exceeded".

is maybe looking at the problem from the wrong perspective. Maybe it is simply that only one of the two jobs is ever executed. The one that is executed then exceeds the timeout, and the other job is cancelled because of it. This would not explain the case of obsoleted jobs and raises the question why only one of the jobs is assigned to a worker. However, it is very likely what happened because t_finished of both jobs is the same:

openqa=> select id, state, result, reason, t_created, t_started, t_finished from jobs where id in ( 16512096, 16512097 ) order by id;
    id    |   state   |      result      |                    reason                     |      t_created      |      t_started      |     t_finished      
----------+-----------+------------------+-----------------------------------------------+---------------------+---------------------+---------------------
 16512096 | done      | timeout_exceeded | timeout: test execution exceeded MAX_JOB_TIME | 2025-01-20 17:13:24 | 2025-01-20 18:42:47 | 2025-01-20 20:43:24
 16512097 | cancelled | skipped          |                                               | 2025-01-20 17:13:24 |                     | 2025-01-20 20:43:24
(2 rows)

It can of course be that 16512097 was assigned to a worker but the worker went away for some reason. Then the assignment is taken back. However, normally a re-assignment is supposed to happen.

Actions #44

Updated by mkittler about 1 month ago · Edited

Looks like 16512097 was assigned to a worker. Not sure what happened then but apparently the assignment wasn't effectively happening and then the job was skipped:

$ xzgrep '16512097' /var/log/openqa_scheduler*
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:48.515970Z] [debug] [pid:1734] Assigned job '16512097' to worker ID '2672'
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:48.517437Z] [debug] [pid:1734] [Job#16512097] Prepare for being processed by worker 2672
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:48.544525Z] [debug] [pid:1734] Sent job(s) '16512097' to worker '2672' (worker33:26)
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:48.822243Z] [debug] [pid:1734] Allocated: { job => 16512097, worker => 2672 }
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:43:29.559985Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:43:40.989996Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready
…
/var/log/openqa_scheduler.3.xz:[2025-01-20T20:18:33.044806Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready
/var/log/openqa_scheduler.3.xz:[2025-01-20T20:19:41.399959Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready
/var/log/openqa_scheduler.3.xz:[2025-01-20T20:22:15.657510Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready
/var/log/openqa_scheduler.3.xz:[2025-01-20T20:22:57.245009Z] [debug] [pid:1734] Skipping job 16512097 because dependent jobs are not ready

The other job was assigned as expected:

$ xzgrep '16512096' /var/log/openqa_scheduler*
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:47.093956Z] [debug] [pid:1734] Need to schedule 2 parallel jobs for job 16512096 (with priority 50)
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:47.120683Z] [debug] [pid:1734] Assigned job '16512096' to worker ID '2679' (worker33:33)
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:47.123886Z] [debug] [pid:1734] [Job#16512096] Prepare for being processed by worker 2679
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:47.154420Z] [debug] [pid:1734] Created network for 16512096: 3
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:47.184515Z] [debug] [pid:1734] Sent job(s) '16512096' to worker '2679'
/var/log/openqa_scheduler.3.xz:[2025-01-20T18:42:48.821476Z] [debug] [pid:1734] Allocated: { job => 16512096, worker => 2679 }

I don't think the "dependent jobs are not ready" situation is caused by Minion jobs because scheduling this product shouldn't create any Minion jobs. We have git_auto_update = no on OSD and the jobs don't specify any Git URLs. I also confirmed that no Minion jobs are created by running openqa_url=http://localhost:9526 distri=sle version=15-SP5 flavor=Server-DVD-Updates test_name=ovs-client ./openqa-schedule-mm-ping-test locally.

I cannot reproduce the problem locally with two workers using PARALLEL_ONE_HOST_ONLY=1. Even if one of the workers doesn't react to the assignment and the half-scheduled cluster needs to be repaired, everything works as expected. I can reproduce it if I comment out $self->_pick_siblings_of_running($allocated_jobs, $allocated_workers);. So there is most likely an issue within that function.

EDIT: I can reproduce the following situation that might be exactly what happened in production: when only one of two worker slots accepts its assignment and then only another worker slot on a mismatching host is available to repair the half-assigned cluster, the "Skipping job … because dependent jobs are not ready" message is logged. This is quite confusing; it just means there's no free slot on the required worker host to schedule the job.
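
For reference, the local two-worker setup used for the reproduction above was roughly the following (a sketch only: the workers.ini placement of PARALLEL_ONE_HOST_ONLY and the worker slot numbers are assumptions about a local setup, not the exact configuration used):

# two local worker slots on the same host, with the one-host-only restriction
sudo tee -a /etc/openqa/workers.ini <<'EOF'
[global]
HOST = http://localhost:9526
PARALLEL_ONE_HOST_ONLY = 1
EOF
sudo systemctl start openqa-worker@1 openqa-worker@2

# then schedule the mm ping test against the local instance as in comment #44
openqa_url=http://localhost:9526 distri=sle version=15-SP5 flavor=Server-DVD-Updates \
    test_name=ovs-client ./openqa-schedule-mm-ping-test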

Actions #45

Updated by mkittler about 1 month ago

PR https://github.com/os-autoinst/openQA/pull/6132 will fix scheduling-related problems that could have led to this.

This still leaves the obsoletion case but I guess it isn't worth looking into it further.

Actions #46

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

The PR has been merged and deployed. Let's see whether it helps.

Actions #47

Updated by mkittler about 1 month ago · Edited

  • Status changed from Feedback to In Progress

The pipeline failed again. This time it is the obsoletion case again, see https://openqa.opensuse.org/tests/4801850 and https://openqa.opensuse.org/tests/4801851. The reason is "cancelled based on job settings".

That means my changes for the timeout/skipped case might still have done the trick but I will look into the obsolescence case once more.

Actions #48

Updated by mkittler about 1 month ago

The documentation about _OBSOLETE=1 says that it obsoletes "jobs in older builds with same DISTRI and VERSION". Judging by the code, FLAVOR and ARCH must also match. In our case DISTRI and VERSION are just "opensuse" and "Tumbleweed" and FLAVOR and ARCH are just "DVD" and "x86_64". That's all very generic, so the jobs are probably obsoleted by a completely unrelated scheduled product. We specify _GROUP=0, but judging by the documentation and code the group is not one of the filter parameters.
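
For illustration, any product post matching those generic values and using _OBSOLETE=1 would cancel the jobs of older builds regardless of job group; a hypothetical example (the BUILD value is made up):

# an unrelated scheduled product like this obsoletes jobs of older builds
# with the same DISTRI/VERSION/FLAVOR/ARCH
openqa-cli api --host https://openqa.opensuse.org -X POST isos \
    DISTRI=opensuse VERSION=Tumbleweed FLAVOR=DVD ARCH=x86_64 \
    BUILD=20250130 _OBSOLETE=1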

It should be easier to find out which scheduled product caused the obsolescence, so I created https://github.com/os-autoinst/openQA/pull/6134.

Of course we still need to think how to prevent the obsolescence from happening here.

Actions #49

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

This PR will prevent the obsolescence from happening: https://github.com/os-autoinst/scripts/pull/366

Actions #51

Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved

I haven't seen another failed CI run so I would consider the ticket resolved. (We can still re-open it if we see the same kinds of issues again.)

Actions #52

Updated by okurz about 1 month ago

  • Due date deleted (2025-01-24)
