Project

General

Profile

Actions

action #60866

closed

Periodic stale job detection keeps scheduler busy producing a massive number of incomplete jobs (was: Again many incomplete jobs with no logs at all)

Added by okurz about 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2019-12-10
Due date:
% Done:

0%

Estimated time:

Description

Observation

E.g. https://openqa.suse.de/tests/3687641 shows modules on the details, no result for any module but no logs. Only from the system journal one can get a bit:

Dec 10 17:11:37 openqaworker6 worker[5647]: [info] [pid:10183] 10183: WORKING 3687641
Dec 10 17:11:37 openqaworker6 worker[5647]: [debug] [pid:10183] +++ worker notes +++
Dec 10 17:11:37 openqaworker6 worker[5647]: [debug] [pid:10183] Start time: 2019-12-10 16:11:37
Dec 10 17:11:37 openqaworker6 worker[5647]: [debug] [pid:10183] Running on openqaworker6:10 (Linux 4.12.14-lp151.28.7-default #1 SMP Mon Jun 17 16:36:38 UTC 2019 (f8a1872) x86_64)
Dec 10 17:11:37 openqaworker6 worker[5647]: [info] [pid:5647] isotovideo has been started (PID: 10183)
Dec 10 17:11:37 openqaworker6 worker[5647]: [debug] [pid:5647] Running job 3687641 from openqa.suse.de: 03687641-sle-15-SP2-Online-x86_64-Build104.1-sles4sap_gnome_extended@64bit-2gbram.
Dec 10 17:11:37 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/status
Dec 10 17:11:47 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/status
Dec 10 17:11:58 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/status
Dec 10 17:11:58 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 2)
Dec 10 17:12:03 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 1)
Dec 10 17:12:08 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Dec 10 17:12:08 openqaworker6 worker[5647]: [debug] [pid:5647] Stopping job 3687641 from openqa.suse.de: 03687641-sle-15-SP2-Online-x86_64-Build104.1-sles4sap_gnome_extended@64bit-2gbram - reason: api-failure
Dec 10 17:12:08 openqaworker6 worker[5647]: [error] [pid:5647] Aborting job because web UI doesn't accept new images anymore (likely considers this job dead)
Dec 10 17:12:08 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/status
Dec 10 17:12:08 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 2)
Dec 10 17:12:13 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 1)
Dec 10 17:12:18 openqaworker6 worker[5647]: [warn] [pid:5647] Can't open /var/lib/openqa/pool/10/testresults/result-bootloader.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file or directory
Dec 10 17:12:18 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/status
Dec 10 17:12:18 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 2)
Dec 10 17:12:18 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Dec 10 17:12:18 openqaworker6 worker[5647]: [info] [pid:5647] Trying to stop job gracefully by announcing it to command server via http://localhost:20103/uhMJRhgnkWGej2zu/broadcast
Dec 10 17:12:20 openqaworker6 worker[5647]: [info] [pid:5647] Isotovideo exit status: 0
Dec 10 17:12:20 openqaworker6 worker[5647]: [debug] [pid:5647] Registered process:10273
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] +++ worker notes +++
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] End time: 2019-12-10 16:12:27
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] Result: api-failure
Dec 10 17:12:27 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/set_done?result=incomplete
Dec 10 17:12:27 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 1)
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] Registering with openQA openqa.suse.de
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/972
Dec 10 17:12:27 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/set_done
Dec 10 17:12:27 openqaworker6 worker[5647]: [debug] [pid:5647] Websocket connection to http://openqa.suse.de/api/v1/ws/972 finished from our side.
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 972
Dec 10 17:12:27 openqaworker6 worker[5647]: [debug] [pid:5647] Job 3687641 from openqa.suse.de finished - reason: api-failure
Dec 10 17:12:27 openqaworker6 worker[5647]: [debug] [pid:5647] Cleaning up for next job
Dec 10 17:12:32 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Dec 10 17:12:32 openqaworker6 worker[5647]: [error] [pid:5647] Unable to make final image uploads. Maybe the web UI considers this job already dead.

Reproducible

To find these jobs one can search the database with the SQL query:

select id,test from jobs where (result='incomplete' and t_finished >= (NOW() - interval '24 hour') and result_dir is null and id not in (select job_id from comments where id is null)) order by id desc;

Related issues 2 (0 open2 closed)

Related to openQA Project (public) - action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cacheResolvedkraih2019-08-14

Actions
Related to openQA Project (public) - coordination #61922: [epic] Incomplete jobs with no logs at allResolvedmkittler2020-02-03

Actions
Actions

Also available in: Atom PDF