action #21028
Some jobs immediately fail with a 404 (new scheduling logic)
Status: Closed
Description
I just upgraded our staging instance to latest git master (with the new scheduling logic where the server sends jobs to the workers rather than the workers grabbing jobs from the server). I then kicked off a full compose test run. It mostly worked, but a few jobs came up incomplete with no logs / assets, like this one:
https://openqa.stg.fedoraproject.org/tests/140359
Looking at the worker logs, I see this:
Aug 02 23:56:07 qa07.qa.fedoraproject.org worker[16436]: [INFO] 19425: WORKING 140359
Aug 02 23:56:17 qa07.qa.fedoraproject.org worker[16436]: [ERROR] 404 response: Not Found (remaining tries: 0)
Aug 02 23:56:17 qa07.qa.fedoraproject.org worker[16436]: [ERROR] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Aug 02 23:56:18 qa07.qa.fedoraproject.org worker[16436]: [ERROR] 404 response: Not Found (remaining tries: 0)
Aug 02 23:56:18 qa07.qa.fedoraproject.org worker[16436]: killed 19425
Aug 02 23:56:59 qa07.qa.fedoraproject.org worker[16436]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Aug 02 23:56:59 qa07.qa.fedoraproject.org worker[16436]: [INFO] cleaning up 00140359-fedora-Rawhide-KDE-live-iso-x86_64-BuildFedora-Rawhide-20170731.n.0-desktop_notifications_live@64bit
On the server end I see this:
Aug 02 23:56:07 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker: 1 accepted job 140359
Aug 02 23:56:07 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker 1 websocket connection closed - 1005
Aug 02 23:56:09 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] GET "/ws/1"
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Routing to a callback
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] API key from client: f1f2fb7abe8eb23f
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] API auth by user: http://adamwill.id.fedoraproject.org/, operator: 1
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Routing to a callback
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] 101 Switching Protocols (0.012517s, 79.891/s)
Aug 02 23:56:14 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:56:17 openqa-stg01.qa.fedoraproject.org openqa[7877]: [7884:info] Got status update for job with no worker assigned (maybe running job already considered dead): 140359
Aug 02 23:56:18 openqa-stg01.qa.fedoraproject.org openqa[7877]: [7886:info] Got status update for job with no worker assigned (maybe running job already considered dead): 140359
Probably some kind of race condition with the dead job detection stuff...
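To make the suspected sequence concrete, here's a rough sketch of what the worker side appears to be doing, judging purely from the log messages above (this is not the actual openQA worker code; the endpoint and all helper names are invented): status updates for the running job are POSTed to the web UI, and once the retries for an update are exhausted the worker concludes the web UI has written the job off and aborts with an api-failure reason.

# Hypothetical sketch of the worker behaviour suggested by the logs above; not
# the real openQA worker code - the endpoint and helper names are invented.
import time
import requests

def post_status_update(base_url, job_id, payload, tries=3, delay=5):
    """POST one status update, retrying a few times before giving up."""
    for remaining in reversed(range(tries)):
        resp = requests.post(f"{base_url}/api/v1/jobs/{job_id}/status", json=payload)
        if resp.ok:
            return True
        print(f"[ERROR] {resp.status_code} response: {resp.reason} (remaining tries: {remaining})")
        if remaining:
            time.sleep(delay)
    return False

def update_loop(base_url, job_id):
    while job_is_running(job_id):                         # invented helper
        if not post_status_update(base_url, job_id, collect_status(job_id)):
            # The web UI no longer accepts updates - most likely it already
            # considers this job dead - so kill the backend and stop.
            print("[ERROR] Job aborted because web UI doesn't accept updates anymore")
            stop_job(job_id, reason="api-failure")         # invented helper
            return
        time.sleep(10)

The race, then, would be the server deciding the job has no worker assigned (and 404ing the updates) at more or less the same moment the worker starts sending them.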
Updated by coolo over 7 years ago
hmm, I wonder why it closed the websocket connection. I fear that's the root of the problem
Updated by EDiGiacinto over 7 years ago
This also happens when the scheduler decides to kill the job on the machine, specifically when the worker was too slow to respond to the scheduler's job allocation message - this behaviour should be adjusted automatically so it happens as rarely as possible.
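If I understand that right, the pattern is roughly the one below (an illustrative sketch only, not the actual scheduler code; the timeout value and every name are invented): the scheduler pushes the job to the worker, waits a bounded time for the acknowledgement, and if the ack is too slow it aborts the allocation and frees the job again - so a worker that did in fact start will then see its status updates rejected.

# Illustrative sketch of timeout-based job allocation; not the actual openQA
# scheduler. ACK_TIMEOUT and every helper here are invented.
import time

ACK_TIMEOUT = 5  # seconds the scheduler waits for the worker's acknowledgement

def allocate(job, worker):
    send_grab_job(worker, job)                  # push the job over the websocket
    deadline = time.monotonic() + ACK_TIMEOUT
    while time.monotonic() < deadline:
        if worker_acknowledged(worker, job):
            mark_assigned(job, worker)
            return True
        time.sleep(0.1)
    # Worker was too slow to respond: give up on this allocation. If the worker
    # actually started the job anyway, its status updates now hit a job with no
    # worker assigned and come back as 404s.
    send_scheduler_abort(worker, job)
    release(job)
    return False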
Updated by AdamWill over 7 years ago
So, hum, this may not actually be caused by the new scheduler code, exactly. Now that I look for the 1005 status code, I see stuff that looks at least similar - 1005 status codes, "Got status update for job with no worker assigned (maybe running job already considered dead)" messages, incomplete jobs - from long before I upgraded. Here's one from early July, for example. Server logs:
Jul 05 21:12:08 openqa-stg01.qa.fedoraproject.org openqa[1710]: [Wed Jul 5 21:12:08 2017] [18446:info] Got status update for job with no worker assigned (maybe running job already considered dead): 132171
Jul 05 21:12:09 openqa-stg01.qa.fedoraproject.org openqa[1710]: [Wed Jul 5 21:12:09 2017] [18915:info] Got status update for job with no worker assigned (maybe running job already considered dead): 132171
Jul 05 21:12:18 openqa-stg01.qa.fedoraproject.org openqa-websockets[1704]: [Wed Jul 5 21:12:18 2017] [websockets:debug] Worker 2 websocket connection closed - 1005
worker logs:
Jul 05 21:06:57 qa07.qa.fedoraproject.org worker[943]: [INFO] got job 132171: 00132171-fedora-26-Atomic-dvd_ostree-iso-x86_64-BuildFedora-26-20170705.n.0-install_default@uefi
Jul 05 21:06:57 qa07.qa.fedoraproject.org worker[943]: [INFO] 31068: WORKING 132171
Jul 05 21:12:08 qa07.qa.fedoraproject.org worker[943]: [ERROR] 404 response: Not Found (remaining tries: 0)
Jul 05 21:12:08 qa07.qa.fedoraproject.org worker[943]: [ERROR] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Jul 05 21:12:09 qa07.qa.fedoraproject.org worker[943]: [ERROR] 404 response: Not Found (remaining tries: 0)
Jul 05 21:12:09 qa07.qa.fedoraproject.org worker[943]: killed 31068
Jul 05 21:12:18 qa07.qa.fedoraproject.org worker[943]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Jul 05 21:12:18 qa07.qa.fedoraproject.org worker[943]: [INFO] cleaning up 00132171-fedora-26-Atomic-dvd_ostree-iso-x86_64-BuildFedora-26-20170705.n.0-install_default@uefi
How it looks in the web UI: https://openqa.stg.fedoraproject.org/tests/132171
The timing seems slightly different, but we're certainly in the same ballpark here. Note that per the WebSocket RFC, 1005 is a reserved status code that, as I read it, means the connection was closed without any actual status code being sent:
https://tools.ietf.org/html/rfc6455#section-7.4.1
"1005 is a reserved value and MUST NOT be set as a status code in a Close control frame by an endpoint. It is designated for use in applications expecting a status code to indicate that no status code was actually present."
An obvious difference in behaviour, though, is that in the older cases I can find, the job got automatically cloned and the clone was picked up and ran successfully, so the failure was hidden to some extent. In the recent cases, there was no automatic cloning...
Updated by AdamWill over 7 years ago
Well, hum, on the other hand, there are significant differences there. In all the old cases I can see, it looks like this happened while a job had been running for a while, and the server's 'dead worker' detection decided the worker was dead for some reason, killed it, and duped the job. That doesn't look like what's going on here: there are none of the log messages from the 'dead worker' detection code, and in every case the 1005 websocket disconnection comes immediately after the worker picks up the job:
--
Aug 02 23:55:13 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:55:15 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker: 1 accepted job 140400
Aug 02 23:55:15 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker 1 websocket connection closed - 1005
Aug 02 23:55:18 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:55:20 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] GET "/ws/1"
Aug 02 23:56:04 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:56:07 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker: 1 accepted job 140359
Aug 02 23:56:07 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker 1 websocket connection closed - 1005
Aug 02 23:56:09 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] GET "/ws/1"
Aug 02 23:56:54 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:56:59 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker: 1 accepted job 140310
Aug 02 23:56:59 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker 1 websocket connection closed - 1005
Aug 02 23:56:59 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:57:04 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] GET "/ws/1"
Aug 02 23:57:30 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] +=----------------=+
Aug 02 23:57:51 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker: 1 accepted job 140311
Aug 02 23:57:51 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker 1 websocket connection closed - 1005
Aug 02 23:57:55 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] [Congestion control] Resetting failures count. Next scheduler round will be reset to 8000ms
Aug 02 23:57:55 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] [scheduler] Current tick is at 57000ms. New tick will be in: 9000ms
Aug 02 23:58:41 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] +=----------------=+
Aug 02 23:58:43 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker: 1 accepted job 140379
Aug 02 23:58:43 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker 1 websocket connection closed - 1005
Aug 02 23:58:48 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] GET "/ws/1"
Aug 02 23:58:48 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Routing to a callback
I guess it's also notable that they all happened on the same worker and all in the space of three minutes...but I can't find any further log messages anywhere that could shed more light on why the connection closed...
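For what it's worth, the pattern is easy to spot mechanically, since every bad case is an 'accepted job' line immediately followed by a 1005 close for the same worker. Something like this rough sketch (it assumes journal lines formatted exactly like the excerpts above, and that the websockets unit is called openqa-websockets) pulls them out:

# Rough diagnostic: print "accepted job" lines that are immediately followed by
# a 1005 websocket close for the same worker. Assumes journal lines shaped like
# the excerpts above; feed it e.g.
#   journalctl -u openqa-websockets | python3 find_1005.py
import re
import sys

accept_re = re.compile(r"Worker:?\s*(\d+) accepted job (\d+)")
close_re = re.compile(r"Worker (\d+) websocket connection closed - 1005")

prev = ""
for line in sys.stdin:
    m_close = close_re.search(line)
    if m_close:
        m_accept = accept_re.search(prev)
        if m_accept and m_accept.group(1) == m_close.group(1):
            print(prev.rstrip())
            print(line.rstrip())
            print("--")
    prev = line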
I'm gonna re-run the same set of compose tests again, and see what happens this time.
Updated by AdamWill over 7 years ago
So after that last comment I updated to git master once again, going from 25c355a to b5911e5. Things have been significantly worse for every compose tested since then:
https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=Rawhide&build=Fedora-Rawhide-20170807.n.0&groupid=1
https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=Rawhide&build=Fedora-Rawhide-20170806.n.0&groupid=1
https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=Rawhide&build=Fedora-Rawhide-20170805.n.0&groupid=1
https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=Rawhide&build=Fedora-Rawhide-20170804.n.1&groupid=1
You can compare to the same composes on our production instance - just remove the '.stg' from the URL - and see the difference. There are multiple spurious 'incomplete' tests for each compose. I haven't looked at each case yet, but from the first one I picked:
https://openqa.stg.fedoraproject.org/tests/141226
it looks like a problem with the new code to detect when a status update doesn't come from the expected worker. On the worker end I see these messages:
Aug 05 11:30:34 qa09.qa.fedoraproject.org worker[6600]: [INFO] cleaning up 00141323-fedora-Rawhide-universal-x86_64-BuildFedora-Rawhide-20170804.n.1-upgrade_desktop_64bit@64bit
Aug 05 11:30:34 qa09.qa.fedoraproject.org worker[6600]: [DEBUG] Build finished, setting us free to pick up new jobs
Aug 05 11:30:34 qa09.qa.fedoraproject.org worker[6600]: [DEBUG] Build finished, setting us free to pick up new jobs
Aug 05 11:30:34 qa09.qa.fedoraproject.org worker[6600]: [WARN] Received command scheduler_abort for job 141265, but we do not have any assigned. Ignoring!
Aug 05 11:30:34 qa09.qa.fedoraproject.org worker[6600]: [INFO] registering worker with openQA http://openqa-stg01.qa.fedoraproject.org...
Aug 05 11:30:34 qa09.qa.fedoraproject.org worker[6600]: [DEBUG] Job 141226 scheduled for next cycle
Aug 05 11:30:34 qa09.qa.fedoraproject.org worker[6600]: [INFO] got job 141226: 00141226-fedora-Rawhide-Server-dvd-iso-x86_64-BuildFedora-Rawhide-20170804.n.1-base_services_start@64bit
Aug 05 11:30:34 qa09.qa.fedoraproject.org worker[6600]: [INFO] 46394: WORKING 141226
Aug 05 11:31:44 qa09.qa.fedoraproject.org worker[6600]: [ERROR] 502 response: Proxy Error (remaining tries: 2)
Aug 05 11:32:49 qa09.qa.fedoraproject.org worker[6600]: [ERROR] 502 response: Proxy Error (remaining tries: 1)
Aug 05 11:33:54 qa09.qa.fedoraproject.org worker[6600]: [ERROR] 502 response: Proxy Error (remaining tries: 0)
Aug 05 11:33:54 qa09.qa.fedoraproject.org worker[6600]: [ERROR] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Aug 05 11:34:04 qa09.qa.fedoraproject.org worker[6600]: [DEBUG] Refusing, we are already performing another job
Aug 05 11:34:04 qa09.qa.fedoraproject.org worker[6600]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Aug 05 11:34:55 qa09.qa.fedoraproject.org worker[6600]: [ERROR] 502 response: Proxy Error (remaining tries: 2)
Aug 05 11:35:53 qa09.qa.fedoraproject.org worker[6600]: [WARN] Received command scheduler_abort for different job id 141294 (our 141226). Ignoring!
Aug 05 11:36:00 qa09.qa.fedoraproject.org worker[6600]: [ERROR] 502 response: Proxy Error (remaining tries: 1)
Aug 05 11:37:05 qa09.qa.fedoraproject.org worker[6600]: [ERROR] 502 response: Proxy Error (remaining tries: 0)
Aug 05 11:37:05 qa09.qa.fedoraproject.org worker[6600]: [ERROR] ERROR base_services_start-nsswitch.conf: 404 response: Not Found
Aug 05 11:37:05 qa09.qa.fedoraproject.org worker[6600]: [ERROR] ERROR video.ogv: 404 response: Not Found
Aug 05 11:37:15 qa09.qa.fedoraproject.org worker[6600]: [INFO] registering worker with openQA http://openqa-stg01.qa.fedoraproject.org...
On the server end:
Aug 05 11:30:44 openqa-stg01.qa.fedoraproject.org openqa[5260]: [5297:info] Got status update for job 141226 that does not belong to Worker 25
Aug 05 11:31:49 openqa-stg01.qa.fedoraproject.org openqa[5260]: [5298:info] Got status update for job 141226 that does not belong to Worker 25
Aug 05 11:32:54 openqa-stg01.qa.fedoraproject.org openqa[5260]: [5295:info] Got status update for job 141226 that does not belong to Worker 25
Aug 05 11:33:55 openqa-stg01.qa.fedoraproject.org openqa[5260]: [5298:info] Got status update for job 141226 that does not belong to Worker 25
Aug 05 11:35:00 openqa-stg01.qa.fedoraproject.org openqa[5260]: [5287:info] Got status update for job 141226 that does not belong to Worker 25
Aug 05 11:36:05 openqa-stg01.qa.fedoraproject.org openqa[5260]: [5288:info] Got status update for job 141226 that does not belong to Worker 25
Aug 05 11:37:05 openqa-stg01.qa.fedoraproject.org openqa[5260]: [5284:info] Got artefact for job with no worker assigned (maybe running job already considered dead): 141226
Aug 05 11:37:05 openqa-stg01.qa.fedoraproject.org openqa[5260]: [5294:info] Got artefact for job with no worker assigned (maybe running job already considered dead): 141226
Aug 05 11:37:05 openqa-stg01.qa.fedoraproject.org openqa[5260]: [5284:info] Got status update for job 141226 that does not belong to Worker 25
Aug 05 11:38:11 openqa-stg01.qa.fedoraproject.org openqa[5260]: [5297:info] Got status update for job 141226 that does not belong to Worker 25
Aug 05 11:39:16 openqa-stg01.qa.fedoraproject.org openqa[5260]: [5297:info] Got status update for job 141226 that does not belong to Worker 25
Note that several of the 'does not belong' messages are exactly 60 seconds before the worker reports a 502.
And really, both server and worker logs are filled with these kinds of messages for dozens of jobs. It seems like something is going fundamentally wrong in this 'who's got what job' determination somewhere...
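I haven't read the actual check, but conceptually the code producing these messages must be doing something like the sketch below (not the real openQA implementation; the attribute names are my guesses): the job's recorded assigned worker either got cleared or reassigned, so updates from the worker that is still running it get rejected.

# Conceptual sketch of the ownership check behind the "does not belong" /
# "no worker assigned" messages; not the real openQA implementation, and the
# attribute names are assumptions.

def accept_status_update(job, sending_worker_id, log):
    if job.assigned_worker_id is None:
        log.info(f"Got status update for job with no worker assigned "
                 f"(maybe running job already considered dead): {job.id}")
        return False  # the worker sees this as a 404
    if job.assigned_worker_id != sending_worker_id:
        log.info(f"Got status update for job {job.id} that does not belong "
                 f"to Worker {sending_worker_id}")
        return False
    job.apply_update()  # assumed method; record modules, screenshots, etc.
    return True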
Updated by AdamWill over 7 years ago
Here's the scale of the problem:
[root@openqa-stg01 adamwill][PROD]# journalctl --since 2017-08-01 | grep "does not belong" | wc -l
3495
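And a rough breakdown of how many distinct jobs that covers (a quick sketch that assumes the journal lines look like the ones quoted above):

# Quick sketch: count "does not belong" messages per job id. Feed it e.g.
#   journalctl --since 2017-08-01 | python3 count_affected_jobs.py
import re
import sys
from collections import Counter

jobs = Counter()
for line in sys.stdin:
    m = re.search(r"status update for job (\d+) that does not belong", line)
    if m:
        jobs[m.group(1)] += 1

print(f"{sum(jobs.values())} messages across {len(jobs)} distinct jobs")
for job_id, n in jobs.most_common(10):
    print(f"  job {job_id}: {n}")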
Updated by AdamWill over 7 years ago
So just a personal perspective here, after talking to @coolo a bit, if you don't mind me sticking my oar in: I think it'd really help to take the overall approach of keeping things simple. Pick the single way we want this to work, do our best to implement it, then debug issues one by one as they arise. Adding options to vary the behaviour and so forth, or making big changes to the approach in midstream, is just going to confuse us. I'm pretty busy this week, but when I can get time I'll try to dig into the issues I'm seeing in detail and try to identify specific causes and help come up with fixes, but the simpler the code is and the fewer alternatives there are to deal with, the easier it'll be. Thanks a lot!
Updated by EDiGiacinto over 7 years ago
- Related to action #20378: [tools] Too many 502 on openqa added
Updated by AdamWill over 7 years ago
So with latest git I'm not seeing this any more, but I am very frequently seeing a problem where workers effectively get stuck. Will file a new issue for that one.
Updated by EDiGiacinto over 7 years ago
- Status changed from In Progress to Resolved