action #62984

action #39719: [saga][epic] Detect "known failures" and mark jobs as such

action #62420: [epic] Distinguish all types of incompletes

action #61922: [epic] Incomplete jobs with no logs at all

Fix problem with job-worker assignment resulting in API errors

Added by mkittler 2 months ago. Updated 9 days ago.

Status:ResolvedStart date:03/02/2020
Priority:NormalDue date:
Assignee:mkittler% Done:

0%

Category:Concrete Bugs
Target version:-
Difficulty:
Duration:

Description

Now with the reason being passed from the worker to the web UI we're able to query the database for jobs incompleted due to API errors. Unfortunately, the following query usually returns some jobs on OSD:

openqa=> select id, t_started, state, reason from jobs where reason like '%Got status update%' and result = 'incomplete' and t_finished >= (NOW() - interval '12 hour') order by id;
   id    |      t_started      | state |                                                                            reason                                                                             
---------+---------------------+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------
 3856945 | 2020-02-03 00:52:54 | done  | api failure: 400 response: Got status update for job 3856945 with unexpected worker ID 679 (expected no updates anymore, job is done with result incomplete)
 3856983 |                     | done  | api failure: 400 response: Got status update for job 3856983 and worker 1310 but there is not even a worker assigned to this job (job is scheduled)
 3857048 | 2020-02-03 01:36:32 | done  | api failure: 400 response: Got status update for job 3857048 with unexpected worker ID 679 (expected no updates anymore, job is done with result incomplete)
 3857698 | 2020-02-03 09:08:04 | done  | api failure: 400 response: Got status update for job 3857698 with unexpected worker ID 1030 (expected no updates anymore, job is done with result incomplete)

I suppose https://github.com/os-autoinst/openQA/pull/2667 helps only a little by fixing one small race condition but there's apparently a bigger problem.

Note that these jobs might have been marked as incomplete by the web UI and then the reason got overridden by the worker again so the reason might be misleading. That should be fixed, too.

fedorastg20200207.zip (451 KB) AdamWill, 07/02/2020 08:25 pm


Related issues

Related to openQA Project - action #62015: jobs incomplete without logs as some workers are rejected... Resolved 10/01/2020
Related to openQA Project - action #62417: os-autoinst occasionally crashing on startup Resolved 21/01/2020

History

#1 Updated by mkittler 2 months ago

  • Related to action #62015: jobs incomplete without logs as some workers are rejected (was: Scheduler does not work) added

#2 Updated by mkittler 2 months ago

  • Related to action #62417: os-autoinst occasionally crashing on startup added

#3 Updated by mkittler about 1 month ago

It is weird that worker 1310 tries to execute job 3856983 which is still scheduled and has t_started not set:

  • That means that it is currently possible to mark a job as done by the API when it is still scheduled. Not sure whether we want to allow that.
  • The following might have happened here: The job has been set back to scheduled (from assigned) because the worker didn't grab the job in time. Then the worker actually grabs the job and tries to report the status for that scheduled job. I have to check whether that's really what happened but I suppose there's currently nothing which would prevent it from happening.

#4 Updated by mkittler about 1 month ago

I've investigated one of the jobs now (https://openqa.suse.de/tests/3857698) and it was indeed just duplicated as dead job by the web UI:

[2020-02-03T11:14:07.0404 CET] [warn] [pid:24590] Dead job 3857698 aborted and duplicated 3857969

The API errors on the worker side are only a consequence of this and we shouldn't let the worker override the reason set the by stale job detection. PR https://github.com/os-autoinst/openQA/pull/2723 implements that.

However, it is very strange that the job has been declared dead in the first place. We increased the threshold for this to 2 days but the job only ran for 1:06 hours. So there must be something wrong within the stale job detection.

#5 Updated by mkittler about 1 month ago

  • Status changed from New to In Progress

The job https://openqa.suse.de/tests/3861757 from today seems to be a little bit more interesting:

Relevant lines from the web UI and scheduler log:

Feb 04 17:07:24 [debug] [24590] need to schedule 1 jobs for 3861757(55)
Feb 04 17:07:26 [debug] [pid:24590] Job '3861757' has worker '1083' assigned
Feb 04 17:07:26 [debug] [pid:24590] Allocated: { job => 3861757, worker => 1083 }

Feb 04 17:07:28 [debug] [pid:24590] need to schedule 1 jobs for 3861757(55)
Feb 04 17:07:30 [debug] [pid:24590] Job '3861757' has worker '1280' assigned
Feb 04T17:07:31 [debug] [pid:24590] Allocated: { job => 3861757, worker => 1280 }

Feb 04 17:07:30 openqa openqa-websockets[24591]: [debug] [pid:24591] Started to send message to 1280 for job 3861757
Feb 04 17:07:51 openqa openqa[24593]: [debug] [d95d2498] POST "/api/v1/jobs/3861757/status"
Feb 04 17:08:21 openqa openqa[24593]: [debug] [6a8fddc8] POST "/api/v1/jobs/3861757/status"
Feb 04 17:08:21 openqa openqa[24593]: [info] [pid:6358] Got status update for job 3861757 with unexpected worker ID 1280 (expected no updates anymore, job is done with result incomplete)

So the job has been assigned to a different worker in the first place and then re-scheduled to worker 1280. That's nothing unusual. However, the status update from worker 1280 is quite delayed and the second status update at 17:08:21 already fails. This time I'm not sure why the job has been marked as incomplete. I guess https://github.com/os-autoinst/openQA/pull/2723 will help debugging these cases as the first and initial reason for incompleting the job will not be overridden anymore.

#6 Updated by mkittler about 1 month ago

Today there are more jobs like 3856983 (e.g. https://openqa.suse.de/tests/3863115). So I had a look at the logs to confirm my theory. Apparently the job has actually assigned to the worker and accepted by the worker:

[2020-02-05T07:01:23.0546 CET] [debug] [pid:27733] Job '3863115' has worker '593' assigned
Feb 05 07:01:23 openqaworker9 worker[60384]: [debug] [pid:60384] Accepting job 3863115 from openqa.suse.de
Feb 05 07:01:29 openqa openqa[27735]: [debug] [38b25ef4] POST "/api/v1/jobs/3863115/status"
Feb 05 07:01:29 openqa openqa[27735]: [info] [pid:27803] Got status update for job 3863115 and worker 593 but there is not even a worker assigned to this job (job is scheduled)

However, no line Job 3863115 reset to state scheduled has been logged and in fact the worker accepted the job quite fast. So it looks like my theory isn't correct (although it is a conceivable problem, too).

The worker also made some apparently successful status updates during the setup phase:

Feb 05 07:01:23 openqaworker9 worker[60384]: [debug] [pid:60384] Found ISO, caching SLE-15-SP2-Online-x86_64-BuildH.104.7-Media1.iso
Feb 05 07:01:23 openqaworker9 worker[60384]: [info] [pid:60384] Downloading SLE-15-SP2-Online-x86_64-BuildH.104.7-Media1.iso, request #29948 sent to Cache Service
Feb 05 07:01:28 openqaworker9 worker[60384]: [debug] [pid:60384] Updating status so job 3863115 is not considered dead.
Feb 05 07:01:28 openqaworker9 worker[60384]: [debug] [pid:60384] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3863115/status
Feb 05 07:01:34 openqaworker9 worker[60384]: [debug] [pid:60384] Updating status so job 3863115 is not considered dead.
Feb 05 07:01:34 openqaworker9 worker[60384]: [debug] [pid:60384] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3863115/status
Feb 05 07:01:39 openqaworker9 worker[60384]: [debug] [pid:60384] Updating status so job 3863115 is not considered dead.
Feb 05 07:01:39 openqaworker9 worker[60384]: [debug] [pid:60384] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3863115/status
Feb 05 07:01:44 openqaworker9 worker[60384]: [debug] [pid:60384] Updating status so job 3863115 is not considered dead.
Feb 05 07:01:44 openqaworker9 worker[60384]: [debug] [pid:60384] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3863115/status
Feb 05 07:01:44 openqaworker9 worker[60384]: [info] [pid:60384] Download of SLE-15-SP2-Online-x86_64-BuildH.104.7-Media1.iso processed:
Feb 05 07:01:44 openqaworker9 worker[60384]: [info] [#29948] Cache size of "/var/lib/openqa/cache" is 45GiB, with limit 50GiB
Feb 05 07:01:44 openqaworker9 worker[60384]: [info] [#29948] Downloading "SLE-15-SP2-Online-x86_64-BuildH.104.7-Media1.iso" from "http://openqa.suse.de/tests/3863115/asset/iso/SLE-15-SP2-Online-x86_64-BuildH.104.7-Media1.iso"
Feb 05 07:01:44 openqaworker9 worker[60384]: [info] [#29948] Size of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP2-Online-x86_64-BuildH.104.7-Media1.iso" is 370MiB, with ETag ""17200000-59dcdda34e5f0""
Feb 05 07:01:44 openqaworker9 worker[60384]: [info] [#29948] Download of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP2-Online-x86_64-BuildH.104.7-Media1.iso" successful, new cache size is 46GiB
Feb 05 07:01:44 openqaworker9 worker[60384]: [debug] [pid:60384] Symlinked asset because hardlink failed: Cannot create link from "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP2-Online-x86_64-BuildH.104.7-Media1.iso" to "/var/lib/openq>
Feb 05 07:01:44 openqaworker9 worker[60384]: [debug] [pid:60384] Linked asset "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP2-Online-x86_64-BuildH.104.7-Media1.iso" to "/var/lib/openqa/pool/9/SLE-15-SP2-Online-x86_64-BuildH.104.7-Media>
Feb 05 07:01:44 openqaworker9 worker[60384]: [info] [pid:60384] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #29949 sent to Cache Service
Feb 05 07:01:49 openqaworker9 worker[60384]: [debug] [pid:60384] Updating status so job 3863115 is not considered dead.
Feb 05 07:01:49 openqaworker9 worker[60384]: [debug] [pid:60384] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3863115/status
Feb 05 07:01:49 openqaworker9 worker[60384]: [info] [pid:60384] Output of rsync:
Feb 05 07:01:49 openqaworker9 worker[60384]: [info] [#29949] Calling: rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Feb 05 07:01:49 openqaworker9 worker[60384]: receiving incremental file list

So it took the worker actually 5 minutes to do the first status update after accepting the job. I couldn't find Worker 593 accepted job 3863115 which was never assigned to it or has already finished so at the point when the worker accepted the job it was actually correctly assigned. The job must have been re-scheduled after it. It is weird that the re-scheduling isn't logged, though. Maybe there's a place when re-scheduling happens I'm not aware of? Neverthless, in this case it would have helped if the web socket server would mark the job as running when the worker accepts it. (So far the job is only marked as running on the first status update.)

#7 Updated by okurz about 1 month ago

https://github.com/os-autoinst/openQA/pull/2723 merged. Please monitor incompletes on o3+osd, e.g. https://gitlab.suse.de/openqa/auto-review/pipelines can give you some insight if the original issue is fixed or if the fix caused any regressions. Also https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now can be helpful, especially crosslinking "running" jobs vs. "incomplete jobs" as so far it looks like having 100+ running is likely to trigger many incompletes. Would be great if this symptom vanishes due to your fix.

#8 Updated by AdamWill about 1 month ago

Bad news, I'm afraid. I just updated our staging instance to latest git openQA and os-autoinst, but got this almost right away:

https://openqa.stg.fedoraproject.org/tests/716136

Server log extract:

Feb 06 06:23:18 openqa-stg01.qa.fedoraproject.org openqa[3730045]: [info] Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled)
Feb 06 06:23:18 openqa-stg01.qa.fedoraproject.org openqa-websockets[3730044]: [warn] Worker 73 accepted job 716136 which was never assigned to it or has already finished
Feb 06 06:23:23 openqa-stg01.qa.fedoraproject.org openqa[3730045]: [info] Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled)
Feb 06 06:23:28 openqa-stg01.qa.fedoraproject.org openqa[3730045]: [info] Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled)
Feb 06 06:23:28 openqa-stg01.qa.fedoraproject.org openqa[3730045]: [info] Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled)
Feb 06 06:23:33 openqa-stg01.qa.fedoraproject.org openqa[3730045]: [info] Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled)
Feb 06 06:23:38 openqa-stg01.qa.fedoraproject.org openqa[3730045]: [info] Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled)
Feb 06 06:23:38 openqa-stg01.qa.fedoraproject.org openqa[3730045]: [info] Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled)
Feb 06 06:23:43 openqa-stg01.qa.fedoraproject.org openqa[3730045]: [info] Got status update for job 716136 with unexpected worker ID 73 (expected no updates anymore, job is done with result incomplete)
Feb 06 06:23:48 openqa-stg01.qa.fedoraproject.org openqa[3730045]: [info] Got status update for job 716136 with unexpected worker ID 73 (expected no updates anymore, job is done with result incomplete)

Worker log extract:

Feb 06 06:23:18 qa01.qa.fedoraproject.org worker[1268257]: [info] 1430800: WORKING 716136
Feb 06 06:23:18 qa01.qa.fedoraproject.org worker[1268257]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/716136/status): 400 response: Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 2)
Feb 06 06:23:19 qa01.qa.fedoraproject.org worker[1268257]: [warn] Ignoring WS message from http://openqa-stg01.qa.fedoraproject.org with type grab_job but no job ID (currently running 716136 for http://openqa-stg01.qa.fedoraproject.org):
Feb 06 06:23:23 qa01.qa.fedoraproject.org worker[1268257]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/716136/status): 400 response: Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 1)
Feb 06 06:23:28 qa01.qa.fedoraproject.org worker[1268257]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/716136/status): 400 response: Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 0)
Feb 06 06:23:28 qa01.qa.fedoraproject.org worker[1268257]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/716136/status): 400 response: Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 2)
Feb 06 06:23:33 qa01.qa.fedoraproject.org worker[1268257]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/716136/status): 400 response: Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 1)
Feb 06 06:23:38 qa01.qa.fedoraproject.org worker[1268257]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/716136/status): 400 response: Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 2)
Feb 06 06:23:38 qa01.qa.fedoraproject.org worker[1268257]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/716136/status): 400 response: Got status update for job 716136 and worker 73 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 0)
Feb 06 06:23:43 qa01.qa.fedoraproject.org worker[1268257]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/716136/status): 400 response: Got status update for job 716136 with unexpected worker ID 73 (expected no updates anymore, job is done with result incomplete) (remaining tries: 1)
Feb 06 06:23:48 qa01.qa.fedoraproject.org worker[1268257]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/716136/status): 400 response: Got status update for job 716136 with unexpected worker ID 73 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)

#9 Updated by AdamWill about 1 month ago

The worker doesn't seem to have delayed at the start - it seems to have finished the previous job at 06:23:02 and started this job at 06:23:18. I can't tell from the logs what time the server sent the job out, though, that doesn't seem to be logged (at info).

#10 Updated by okurz about 1 month ago

as you mentioned the ambiguous "latest git", can you check if you have https://github.com/os-autoinst/openQA/pull/2723 included? And also, of course, there could be more helpful information if you run at debug log level :)

#11 Updated by dzedro about 1 month ago

Another few api failures for debugging:
Reason: api failure: 400 response: Got status update for job 3870582 with unexpected worker ID 605 (expected no updates anymore, job is done with result incomplete)
https://openqa.suse.de/tests/3870582
Reason: api failure: 400 response: Got status update for job 3870388 with unexpected worker ID 399 (expected no updates anymore, job is done with result incomplete)
https://openqa.suse.de/tests/3870388
Reason: api failure: 400 response: Got status update for job 3864302 and worker 590 but there is not even a worker assigned to this job (job is scheduled)
https://openqa.suse.de/tests/3864302

#12 Updated by mkittler about 1 month ago

Note that OSD hasn't PR https://github.com/os-autoinst/openQA/pull/2723 yet. On o3 it is deployed now. There haven't been any incompletes with the reason api failure: 400 response: on o3 since then. Instead there are some incompletes with the reason associated worker has not sent any status updates for too long. So at least the reason from the stall job detection is no longer overridden. It also seems that the stall job detection isn't generally causing lots of incompletes so it seems that the improvements in https://github.com/os-autoinst/openQA/pull/2723 have helped with that (previously the stall job detection wasn't as patient as it should have been).

Of course these API errors are just a symptom. PR https://github.com/os-autoinst/openQA/pull/2723 helps only in certain cases. I haven't found out the cause for the incompletes on @AdamWill's instance yet and also haven't investigated all the incompletes we've seen on OSD/o3 yet.

#13 Updated by mkittler about 1 month ago

@AdamWill As in the other ticket: I'm not able to make sense of the server log by only looking at the log lines from the first failed status updated. The interesting part is what lead to that situation in the first place. That's basically everything from the first attempt of the scheduler and web socket server to assign the job to a worker. It is also good to know the story from the worker-side from the beginning. And yes, it is annoying to dig in the logs but currently it seems to be the only way to figure out what's going wrong. (See my previous comments for examples of relevant log lines.)

#14 Updated by AdamWill about 1 month ago

@okurz "as you mentioned the ambiguous "latest git", can you check if you have https://github.com/os-autoinst/openQA/pull/2723 included?"

yes. when I say latest I mean latest. commit 4861e34893826b6da4f5bdcaecad1cb66fc4e5cc . That's why I wrote after your previous comment and said "bad news". :)

@mkittler I couldn't identify any other relevant log lines at all. I can just throw the log files at you and let you figure it out, though. :) I'll set logging to debug on all the staging boxes shortly and do that next time I get an incomplete.

#15 Updated by AdamWill about 1 month ago

Here are the scheduler and websockets logs from our staging server, and the whole journal from our qa01 worker host, for today (00:00 to now). Here are five incomplete jobs from that period:

https://openqa.stg.fedoraproject.org/tests/717338
https://openqa.stg.fedoraproject.org/tests/717463
https://openqa.stg.fedoraproject.org/tests/717435
https://openqa.stg.fedoraproject.org/tests/717451
https://openqa.stg.fedoraproject.org/tests/717733

The time stamps should be consistent between the logs and the UI.

#16 Updated by AdamWill about 1 month ago

welp, for now Fedora's instances are running with this:

https://src.fedoraproject.org/rpms/openqa/blob/master/f/FedoraUpdateRestart.pm#_49

KWALITEE!

#17 Updated by mkittler about 1 month ago

Let's go through the logs of the first job you've mentioned:

Feb 07 09:06:28 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] Discarding 717338(30) due to incomplete cluster
Feb 07 09:06:31 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] Discarding 717338(29) due to incomplete cluster
Feb 07 09:06:51 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] Discarding 717338(28) due to incomplete cluster
[...]
Feb 07 09:07:52 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] Discarding 717338(23) due to incomplete cluster
[...]
Feb 07 09:08:12 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] Job '717338' has worker '63' assigned
Feb 07 09:08:12 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] [Job#717338] Prepare for being processed by worker 63
Feb 07 09:08:13 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] Created network for 717338 : 2
Feb 07 09:08:14 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] Sent job(s) '717338' to worker '63'
Feb 07 09:08:14 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] Allocated: { job => 717338, worker => 63 }
Feb 07 09:08:14 qa01.qa.fedoraproject.org worker[1268188]: [info] 1967011: WORKING 717338
Feb 07 09:08:19 qa01.qa.fedoraproject.org worker[1268188]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/717338/status): 400 response: Got status update for job 717338 and worker 63 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 1)
Feb 07 09:08:24 qa01.qa.fedoraproject.org worker[1268188]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/717338/status): 400 response: Got status update for job 717338 and worker 63 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 0)
Feb 07 09:08:24 qa01.qa.fedoraproject.org worker[1268188]: [error] Aborting job because web UI doesn't accept new images anymore (likely considers this job dead)
Feb 07 09:08:24 qa01.qa.fedoraproject.org worker[1268188]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/717338/status): 400 response: Got status update for job 717338 and worker 63 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 2)
Feb 07 09:08:29 qa01.qa.fedoraproject.org worker[1268188]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/717338/status): 400 response: Got status update for job 717338 and worker 63 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 1)
[...]
Feb 07 09:08:32 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] Job '717338' has worker '63' assigned
Feb 07 09:08:32 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] [Job#717338] Prepare for being processed by worker 63
Feb 07 09:08:32 openqa-stg01.qa.fedoraproject.org openqa-scheduler[3984044]: [debug] [pid:3984044] Sent job(s) '717338' to worker '63'
Feb 07 09:08:32 qa01.qa.fedoraproject.org worker[1268188]: [warn] Ignoring WS message from http://openqa-stg01.qa.fedoraproject.org with type grab_job but no job ID (currently running 717338 for http://openqa-stg01.qa.fedoraproject.org):

It seems that the job is discarded a few times before it is actually assigned to the worker. Nevertheless it is assigned at some point to the expected worker which then also starts working on the job. It needs 5 seconds for the first status update and that fails because the web UI is apparently not aware of the job assignment (anymore). The web UI even tries to send the job again to the worker although the worker hasn't even given completely up at this point. The worker claims that the grab_job message doesn't even have a job ID which also indicated that something is broken on the web UI side. So either setting the worker-job relation in the web UI database didn't work at all or the assignment has been undone for some reason.

It looks similar for the other jobs although not all of them have been discarded due to an incomplete cluster. So discarding is likely not the problem.

#18 Updated by mkittler about 1 month ago

By the way, in the cases of @AdamWill's instance there's no worker assigned anymore at all (... but there is not even a worker assigned to this job ...). Sometimes, e.g. in job https://openqa.opensuse.org/tests/1176435 on o3, there is also a wrong worker assigned (... unexpected worker ID 52 (expected 58, job is running)). It is neverthless likely the same issue. In the case a wrong worker is assigned the web UI likely also "forgot" the assignment but had already made a new assignment.

By the way, that's the only incomplete job in o3 in the last 24 hours cause by an API failure. On OSD there have been 3 incomplete jobs in the last 24 hours caused by an API failure.

#19 Updated by mkittler about 1 month ago

When looking at the code I see one potential race condition which might cause the behavior we see. When the worker sends a status update and claims it doesn't run a job the web UI "re-schedules" the job (it sets it back to "scheduled" and removes the job-worker relation again).

So the following might happen:

  1. The web UI creates the job-worker relation (sub set_assigned_worker for single jobs).
  2. Before the job is sent to the worker the worker itself sends a status update to the web socket server claiming it isn't working on any job. The web socket server will remove the job-worker relation again.
  3. The web UI does some further preparations, e.g. setting the JOBTOKEN and then finally sends the job to the worker via the web socket server. This step does not rely on an existing job-worker relation which explains why we don't run into an error already at this point.
  4. The job might be assigned to another worker or left scheduled.
  5. When the worker does its first status update for the job it runs into the error we see because of 2. Depending on 4. the error is either ... but there is not even a worker assigned to this job or ... unexpected worker ID 52.

Not sure whether that is really the case in practice but it is a conceivable explanation considering the code. The likeliness for this to happen seems low but there's also only a small fraction of jobs affected.

#20 Updated by mkittler about 1 month ago

At least for the job on o3 this explanation seems to be true (in the logs from @AdamWill is wasn't clear because he hasn't the debug log level enabled):

[2020-02-17T02:46:18.0003 UTC] [debug] [pid:12756] Updating worker seen from worker_status
[2020-02-17T02:46:18.0018 UTC] [debug] [pid:12756] Found job 1176435 in DB from worker_status update sent by worker 52
[2020-02-17T02:46:18.0020 UTC] [debug] [pid:12756] Worker 52 for job 1176435 has token H3FnaVPFW6M2MLU9
[2020-02-17T02:46:18.0030 UTC] [debug] [pid:12756] Job  1176435 reset to state scheduled
[2020-02-17T02:46:18.0070 UTC] [debug] [pid:12756] Started to send message to 52 for job 1176435
[2020-02-17T02:46:18.0070 UTC] [debug] [1dcdbbb4] 200 OK (0.002195s, 455.581/s)
[2020-02-17T02:46:18.0161 UTC] [debug] [4e5c2ed9] POST "/api/send_job"
[2020-02-17T02:46:23.0116 UTC] [info] [pid:21478] Got status update for job 1176435 and worker 52 but there is not even a worker assigned to this job (job is scheduled)

#21 Updated by AdamWill about 1 month ago

where don't I have debug logging enabled?

#22 Updated by mkittler about 1 month ago

@AdamWill The only of your log files which has [debug] messages is the scheduler log. However, the "re-scheduling" is done by the web socket server.

#23 Updated by AdamWill about 1 month ago

oh sigh, I misread 'web socket server' as 'web server' and restarted the web UI service, not the websockets one...

#25 Updated by okurz about 1 month ago

  • Status changed from In Progress to Feedback

PR was merged and is running fine on o3, e.g. also https://gitlab.suse.de/openqa/auto-review/-/jobs/173689 shows no related incompletes. I guess it's up to AdamWill to give the final feedback?

#26 Updated by mkittler about 1 month ago

Yes, we could not reproduce a single instance of that problem on o3 since the last deployment. The number of other incomplete jobs seems unaffected as well. But the reproduction rate was very low so let's wait a few more days and also OSD.

@AdamWill It would be good to have your feedback, too.

#27 Updated by mkittler about 1 month ago

Seems like the PR to fix this can cause jobs being stuck in the assigned state (until the stale job detection marks them as incomplete). This of course also blocks the worker slots for other jobs:

[16:24] <guillaume_g> Martchus_, okurz: Currently, the following workers are blocked in 'assigned' state: openqa-aarch64:{2,15,16} I tried to restart a test, but the worker is still blocked. Any idea?

These workers did not send the rejected message. I'm not yet sure whether that is a bug in the PR or whether this just reveals an existing bug within the worker which causes it to be stuck (where "stuck" means it is still sending status updates but not accepting/rejecting any jobs).

The re-scheduling on worker registration works correctly so the problem can be worked around by simply restarting the workers.

#28 Updated by mkittler about 1 month ago

Having an outdated worker causes now the same problem in certain cases (one worker serves multiple web UIs and some error conditions when the worker accepts a job). I have expected this but assumed it wouldn't be a problem in production. However, I have seen this on one worker slot of the power8 worker on o3 because apparently it hasn't been updated to the latest version last night. Likely no new openQA-worker package wasn't available at the time of the nightly deployment. It now is available so this problem should go away on its own on the next deployment.

#29 Updated by okurz about 1 month ago

The above problem was mentioned by guillaume_g regarding aarch64 which seems to have openQA-worker-4.6.1582560835.235bbba12-lp151.2313.1.noarch just as well as other machines so I doubt this can explain it.

Currently on o3 I can find openqa-client --json-output --host https://openqa.opensuse.org jobs state=assigned | jq '.jobs | .[] | .id' | wc -l
22
, that is 22 assigned and hence blocked worker instances which is worrying.

#30 Updated by okurz about 1 month ago

I have merged a revert of the latest change with
https://github.com/os-autoinst/openQA/pull/2772 to give us more time and also because of next automatic deployments, e.g. osd tomorrow morning. @mkittler I hope we have possibility to t talk so that I can offer my help

EDIT: 2020-02-26: On o3 all worker instances are currently busy hence no worker stuck in assigned. Good sign for the revert at least. The current state of master including the revert has also been successfully deployed to osd right now: https://gitlab.suse.de/openqa/osd-deployment/pipelines/54057

EDIT: Discussed with mkittler. We agreed that a first next step should be to have a scalability test with N workers, e.g. 100, connecting to M webuis, e.g. 1, running K jobs, e.g. 1000, asserting 1. N workers connected to webui, online and idle, 2. M jobs executed and done, 3. N workers idle again

#31 Updated by mkittler about 1 month ago

Before continuing to rework the fix I created a test to reproduce the issue with @okurz similar to what he's already mentioned in the previous comment.

The test actually works and allows to reproduce the issue locally within seconds: https://github.com/os-autoinst/openQA/pull/2777

#32 Updated by mkittler 28 days ago

I'm currently checking how I can make the fix for scheduling issues work without leaving jobs in assigned state. I guess implementing the following on top of my previous attempt would help:

  1. Send the assigned job to the worker again if it sends a status update claiming it does not work on any job. This should help in case the assignment didn't reach the worker e.g. due to network issues. If the worker is really just about to start that job or otherwise busy the additional message wouldn't hurt.
  2. Re-schedule the job if the worker works an a different job than it is supposed to. This is what happens right now on the current master and I guess this behavior can be preserved without risking running into the race condition we want to avoid.
  3. Ensure the periodic stale job detection takes care of assigned jobs, too (maybe it already does). That should help if the worker really went completely offline.
  4. Reduce the patiency of the periodic stale job detection again so 3. is actually effective. Having those jobs around for 2 days isn't nice. We previously came up with the idea to use the max. job time here.

Implementing step 1. is a little bit more effort than I initially thought because the code for making the JSON-data for the job is meddled into the scheduler and now parts of that need to be done by the web socket server. But with a little bit refactoring (splitting up functions) it should be doable.

#33 Updated by mkittler 25 days ago

@coolo came up with a simpler idea on #opensuse-factory: The idea is to simply give the worker a 2nd chance if it reports an idle status when there's actually a job assigned. I've implemented this idea last Friday. After giving it some testing today it seems to work well. I've combined the change with letting the worker explicitly reject jobs. So if the communication between the web socket server and the worker works flawless jobs are even shorter in the assigned state.

#35 Updated by mkittler 21 days ago

I've just had a look on o3. Seems like the PR didn't cause jobs being stuck in assigned or an increased number of incomplete jobs. That's good - however, the "fix" might not have covered all cases. At least there's been one job which is incomplete due to an API error since the deployment of the PR on 2020-03-12 03:00:06:

1201141 | 2020-03-12 15:43:31 | 2020-03-12 15:45:24 | done  | api failure: 400 response: Got status update for job 1201141 with unexpected worker ID 58 (expected 197, job is running)

#36 Updated by mkittler 18 days ago

There's still another one on o3: https://openqa.opensuse.org/tests/1203904

That's the entire worker log for that job (the first line is from the previous job):

Mär 16 05:09:45 openqaworker1 worker[2973]: [info] Uploading worker-log.txt
Mär 16 06:40:21 openqaworker1 worker[2973]: [info] +++ setup notes +++
Mär 16 06:40:21 openqaworker1 worker[2973]: [info] Running on openqaworker1:2 (Linux 4.12.14-lp151.28.40-default #1 SMP Fri Mar 6 13:48:15 UTC 2020 (f0f1262) x86_64)
Mär 16 06:40:21 openqaworker1 worker[2973]: [info] Downloading openSUSE-Tumbleweed-NET-x86_64-Snapshot20200314-Media.iso, request #28076 sent to Cache Service
Mär 16 06:40:36 openqaworker1 worker[2973]: [info] Download of openSUSE-Tumbleweed-NET-x86_64-Snapshot20200314-Media.iso processed:
Mär 16 06:40:36 openqaworker1 worker[2973]: [info] [#28076] Cache size of "/var/lib/openqa/cache" is 50GiB, with limit 50GiB
Mär 16 06:40:36 openqaworker1 worker[2973]: [info] [#28076] Downloading "openSUSE-Tumbleweed-NET-x86_64-Snapshot20200314-Media.iso" from "http://openqa1-opensuse/tests/1203904/asset/iso/openSUSE-Tumbleweed-NET-x86_64-Snapshot20200314-Me>
Mär 16 06:40:36 openqaworker1 worker[2973]: [info] [#28076] Size of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-NET-x86_64-Snapshot20200314-Media.iso" is 139MiB, with ETag ""8b00000-5a0f2223d0901""
Mär 16 06:40:36 openqaworker1 worker[2973]: [info] [#28076] Download of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-NET-x86_64-Snapshot20200314-Media.iso" successful, new cache size is 50GiB
Mär 16 06:40:36 openqaworker1 worker[2973]: [info] Rsync from 'rsync://openqa1-opensuse/tests' to '/var/lib/openqa/cache/openqa1-opensuse', request #28100 sent to Cache Service
Mär 16 06:40:41 openqaworker1 worker[2973]: [info] Output of rsync:
Mär 16 06:40:41 openqaworker1 worker[2973]: [info] [#28100] Calling: rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/
Mär 16 06:40:41 openqaworker1 worker[2973]: receiving incremental file list
Mär 16 06:40:41 openqaworker1 worker[2973]: sent 1,927 bytes  received 1,402,446 bytes  936,248.67 bytes/sec
Mär 16 06:40:41 openqaworker1 worker[2973]: total size is 3,842,060,740  speedup is 2,735.78
Mär 16 06:40:41 openqaworker1 worker[2973]: [info] Finished to rsync tests
Mär 16 06:40:41 openqaworker1 worker[2973]: [info] Preparing cgroup to start isotovideo
Mär 16 06:40:41 openqaworker1 worker[2973]: [info] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker@2.service/1203904
Mär 16 06:40:41 openqaworker1 worker[2973]: [info] Starting isotovideo container
Mär 16 06:40:41 openqaworker1 worker[2973]: [info] 12319: WORKING 1203904
Mär 16 06:40:41 openqaworker1 worker[2973]: [info] isotovideo has been started (PID: 12319)
Mär 16 06:40:41 openqaworker1 worker[2973]: [warn] Refusing to grab job from http://openqa1-opensuse: already busy with job(s) 1203904
Mär 16 06:40:41 openqaworker1 worker[2973]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1203904/status): 400 response: Got status update for job 1203904 and worker 52 but there is not even a worker assigned to this >
Mär 16 06:40:45 openqaworker1 worker[2973]: [warn] Refusing to grab job from http://openqa1-opensuse: already busy with job(s) 1203904
Mär 16 06:40:46 openqaworker1 worker[2973]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1203904/status): 400 response: Got status update for job 1203904 and worker 52 but there is not even a worker assigned to this >
Mär 16 06:40:51 openqaworker1 worker[2973]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1203904/status): 400 response: Got status update for job 1203904 and worker 52 but there is not even a worker assigned to this >
Mär 16 06:40:51 openqaworker1 worker[2973]: [error] Aborting job because web UI doesn't accept new images anymore (likely considers this job dead)
Mär 16 06:40:51 openqaworker1 worker[2973]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1203904/status): 400 response: Got status update for job 1203904 and worker 52 but there is not even a worker assigned to this >
Mär 16 06:40:56 openqaworker1 worker[2973]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1203904/status): 400 response: Got status update for job 1203904 and worker 52 but there is not even a worker assigned to this >
Mär 16 06:41:06 openqaworker1 worker[2973]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1203904/status): 400 response: Got status update for job 1203904 and worker 52 but there is not even a worker assigned to this >
Mär 16 06:41:06 openqaworker1 worker[2973]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1203904/status): 400 response: Got status update for job 1203904 and worker 52 but there is not even a worker assigned to this >
Mär 16 06:41:06 openqaworker1 worker[2973]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20023/Rw70PE6v6wPcycJ5/broadcast
Mär 16 06:41:08 openqaworker1 worker[2973]: [info] Isotovideo exit status: 0
Mär 16 06:41:15 openqaworker1 worker[2973]: [info] +++ worker notes +++
Mär 16 06:41:15 openqaworker1 worker[2973]: [info] End time: 2020-03-16 05:41:15
Mär 16 06:41:15 openqaworker1 worker[2973]: [info] Result: api-failure

So from the worker side the problem looks still like before.

Scheduler log for that job:

[2020-03-16T05:40:21.0161 UTC] [debug] [pid:4183] Job '1203904' has worker '52' assigned
[2020-03-16T05:40:21.0171 UTC] [debug] [pid:4183] [Job#1203904] Prepare for being processed by worker 52
[2020-03-16T05:40:21.0327 UTC] [debug] [pid:4183] Sent job(s) '1203904' to worker '52'
[2020-03-16T05:40:25.0019 UTC] [debug] [pid:4183] Job '1203904' has worker '183' assigned
[2020-03-16T05:40:25.0022 UTC] [debug] [pid:4183] [Job#1203904] Prepare for being processed by worker 183
[2020-03-16T05:40:25.0087 UTC] [debug] [pid:4183] Failed sending job(s) '1203904' to worker '183'
[2020-03-16T05:40:45.0862 UTC] [debug] [pid:4183] Job '1203904' has worker '191' assigned
[2020-03-16T05:40:45.0866 UTC] [debug] [pid:4183] [Job#1203904] Prepare for being processed by worker 191
[2020-03-16T05:40:45.0937 UTC] [debug] [pid:4183] Failed sending job(s) '1203904' to worker '191'
[2020-03-16T05:41:07.0774 UTC] [debug] [pid:4183] Job '1203904' has worker '195' assigned
[2020-03-16T05:41:07.0777 UTC] [debug] [pid:4183] [Job#1203904] Prepare for being processed by worker 195
[2020-03-16T05:41:07.0866 UTC] [debug] [pid:4183] Failed sending job(s) '1203904' to worker '195'

Web UI log:

[2020-03-16T05:39:13.0441 UTC] [debug] [pid:4184] Received from worker "52" worker_status message "{
  "status" => "free",
  "type" => "worker_status"
}
"
[2020-03-16T05:39:47.0353 UTC] [debug] [pid:4184] Received from worker "52" worker_status message "{
  "status" => "free",
  "type" => "worker_status"
}
"
[2020-03-16T05:39:47.0357 UTC] [debug] [pid:4184] Updating seen of worker 52 from worker_status
[2020-03-16T05:40:21.0267 UTC] [debug] [pid:4184] Received from worker "52" worker_status message "{
  "status" => "free",
  "type" => "worker_status"
}
"
[2020-03-16T05:40:21.0269 UTC] [debug] [pid:4184] Updating seen of worker 52 from worker_status
[2020-03-16T05:40:21.0284 UTC] [debug] [pid:4184] Found job 1203904 in DB from worker_status update sent by worker 52
[2020-03-16T05:40:21.0286 UTC] [debug] [pid:4184] Worker 52 for job 1203904 has token Rw70PE6v6wPcycJ5
[2020-03-16T05:40:21.0286 UTC] [debug] [pid:4184] Rescheduling jobs assigned to worker 52
[2020-03-16T05:40:21.0294 UTC] [debug] [pid:4184] Job 1203904 reset to state scheduled

The problem is that the worker was idling a while before taking that job. Hence the "2nd chance logic" didn't prevent the problem because the worker already claimed it was free several times before (when it was actually free). I need to improve the "2nd chance logic" so only "free vs. assignment" mismatches cause the worker to lose its 2nd chance. I actually could have thought of this problem earlier but at least the previous PR didn't make things worse and now only needs some further tweaking.

I suppose this problem could not be detected by the scalability test because there the workers are not idling long enough (before a job is assigned) to send a status update that they are free.

#37 Updated by mkittler 18 days ago

PR for fixing "2nd chance logic": https://github.com/os-autoinst/openQA/pull/2839

#38 Updated by mkittler 15 days ago

So far no occurrence of the problem on o3 since the last deployment (and also no unwanted side-effects of the fix). I'll watch it a few more days.

@AdamWill In your case the problem seemed more apparent. Does it work for you now with the most recently mentioned PR?

#39 Updated by AdamWill 15 days ago

I'm afraid I hadn't got a chance to deploy it yet - we went into the F32 Beta release cycle and that was pretty intense, then I had to investigate a bunch of test failures (still working on one mysterious one now). I'll try and deploy it on staging today and see.

#40 Updated by AdamWill 15 days ago

OK, I updated our staging deployment to latest git now, and disabled the workaround I had in place for this. Will keep an eye on it for the next few days.

#41 Updated by mkittler 10 days ago

  • Status changed from Feedback to Resolved
  • Target version deleted (Current Sprint)

Adam's and my last comment got lost. Adam was stating that there were no incompletes in the last 500 jobs. I have checked o3 right now again and there are still no such incompletes and also not jobs stuck in assigned. So I consider this issue resolved because statistically it is likely that we would have encountered at least one case by now.

#42 Updated by okurz 9 days ago

@mkittler you don't need to waste brain cycles on removing "Current Sprint" in the target version as we agreed in the team :)

Also available in: Atom PDF