action #62984

closed

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #62420: [epic] Distinguish all types of incompletes

coordination #61922: [epic] Incomplete jobs with no logs at all

Fix problem with job-worker assignment resulting in API errors

Added by mkittler over 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version: -
Start date: 2020-02-03
Due date:
% Done: 0%
Estimated time:

Description

Now that the reason is passed from the worker to the web UI, we can query the database for jobs that ended up incomplete due to API errors. Unfortunately, the following query usually returns some jobs on OSD:

openqa=> select id, t_started, state, reason from jobs where reason like '%Got status update%' and result = 'incomplete' and t_finished >= (NOW() - interval '12 hour') order by id;
   id    |      t_started      | state |                                                                            reason                                                                             
---------+---------------------+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------
 3856945 | 2020-02-03 00:52:54 | done  | api failure: 400 response: Got status update for job 3856945 with unexpected worker ID 679 (expected no updates anymore, job is done with result incomplete)
 3856983 |                     | done  | api failure: 400 response: Got status update for job 3856983 and worker 1310 but there is not even a worker assigned to this job (job is scheduled)
 3857048 | 2020-02-03 01:36:32 | done  | api failure: 400 response: Got status update for job 3857048 with unexpected worker ID 679 (expected no updates anymore, job is done with result incomplete)
 3857698 | 2020-02-03 09:08:04 | done  | api failure: 400 response: Got status update for job 3857698 with unexpected worker ID 1030 (expected no updates anymore, job is done with result incomplete)
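
To see which workers and which kind of rejection dominate, the reason strings from the output above can be grouped with a small script. This is only a sketch: the regular expression is an assumption derived from the two message shapes shown here, and the names `classify`, `summarize`, `job_already_done` and `no_worker_assigned` are made up for illustration and are not part of openQA.

```python
import re
from collections import Counter

# Reason strings copied verbatim from the query output above.
REASONS = [
    "api failure: 400 response: Got status update for job 3856945 with unexpected worker ID 679 (expected no updates anymore, job is done with result incomplete)",
    "api failure: 400 response: Got status update for job 3856983 and worker 1310 but there is not even a worker assigned to this job (job is scheduled)",
    "api failure: 400 response: Got status update for job 3857048 with unexpected worker ID 679 (expected no updates anymore, job is done with result incomplete)",
    "api failure: 400 response: Got status update for job 3857698 with unexpected worker ID 1030 (expected no updates anymore, job is done with result incomplete)",
]

# Assumed pattern: covers only the two message shapes seen in this ticket.
PATTERN = re.compile(
    r"Got status update for job (?P<job>\d+) "
    r"(?P<shape>with unexpected worker ID|and worker) (?P<worker>\d+)"
    r".*\((?P<detail>[^()]*)\)\s*$"
)

# Illustrative labels for the two shapes (not openQA terminology).
KINDS = {
    "with unexpected worker ID": "job_already_done",
    "and worker": "no_worker_assigned",
}


def classify(reason):
    """Return (job_id, worker_id, kind, detail), or None if the shape is unknown."""
    match = PATTERN.search(reason)
    if match is None:
        return None
    return (
        int(match["job"]),
        int(match["worker"]),
        KINDS[match["shape"]],
        match["detail"],
    )


def summarize(reasons):
    """Count occurrences per worker ID and per kind, skipping unknown shapes."""
    per_worker, per_kind = Counter(), Counter()
    for reason in reasons:
        parsed = classify(reason)
        if parsed is None:
            continue
        _job, worker, kind, _detail = parsed
        per_worker[worker] += 1
        per_kind[kind] += 1
    return per_worker, per_kind


per_worker, per_kind = summarize(REASONS)
for worker, count in per_worker.most_common():
    print(f"worker {worker}: {count} job(s)")
for kind, count in per_kind.most_common():
    print(f"{kind}: {count} job(s)")
```

Reasons that do not match either shape are skipped rather than guessed at, so the counts only cover messages of the kind shown in this ticket.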

I suppose https://github.com/os-autoinst/openQA/pull/2667 helps only a little: it fixes one small race condition, but there is apparently a bigger problem.

Note that these jobs might have been marked as incomplete by the web UI first, with the worker overriding the reason afterwards, so the recorded reason might be misleading. That should be fixed, too.
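
One way to picture the overriding problem is a "first writer wins" rule for the final result and reason. The following is a toy model under that assumption, not how openQA implements or fixed it: the `Job` class, the `finish` function and the placeholder reason strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    # Hypothetical in-memory stand-in for a row of the jobs table.
    id: int
    state: str = "running"
    result: Optional[str] = None
    reason: Optional[str] = None


def finish(job: Job, result: str, reason: str) -> bool:
    """Record result and reason only if the job is not done yet.

    Returns True if the values were applied and False if the job had
    already been finished, in which case the earlier reason is kept.
    """
    if job.state == "done":
        return False
    job.state = "done"
    job.result = result
    job.reason = reason
    return True


job = Job(id=1)
# The web UI marks the job as incomplete first (placeholder reason text).
applied_by_web_ui = finish(job, "incomplete", "reason set by the web UI")
# The worker reports its own reason later; it no longer overrides the first one.
applied_by_worker = finish(job, "incomplete", "reason reported later by the worker")
print(job.reason)
```

In a real database-backed setup the check and the update would have to happen atomically (for example in one conditional UPDATE), otherwise the same race just moves into this function.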


Files

fedorastg20200207.zip (451 KB) AdamWill, 2020-02-07 20:25

Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #62015: jobs incomplete without logs as some workers are rejected (was: Scheduler does not work) (Resolved, mkittler, 2020-01-10)

Related to openQA Project - action #62417: os-autoinst occasionally crashing on startup (Resolved, mkittler, 2020-01-21)
