action #21028
closedSome jobs immediately fail with a 404 (new scheduling logic)
0%
Description
I just upgraded our staging instance to latest git master (with the new scheduling logic where the server sends jobs to the workers rather than the workers grabbing jobs from the server). I then kicked off a full compose test run. It mostly worked, but a few jobs came up incomplete with no logs / assets, like this one:
https://openqa.stg.fedoraproject.org/tests/140359
Looking at the worker logs, I see this:
Aug 02 23:56:07 qa07.qa.fedoraproject.org worker[16436]: [INFO] 19425: WORKING 140359
Aug 02 23:56:17 qa07.qa.fedoraproject.org worker[16436]: [ERROR] 404 response: Not Found (remaining tries: 0)
Aug 02 23:56:17 qa07.qa.fedoraproject.org worker[16436]: [ERROR] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Aug 02 23:56:18 qa07.qa.fedoraproject.org worker[16436]: [ERROR] 404 response: Not Found (remaining tries: 0)
Aug 02 23:56:18 qa07.qa.fedoraproject.org worker[16436]: killed 19425
Aug 02 23:56:59 qa07.qa.fedoraproject.org worker[16436]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Aug 02 23:56:59 qa07.qa.fedoraproject.org worker[16436]: [INFO] cleaning up 00140359-fedora-Rawhide-KDE-live-iso-x86_64-BuildFedora-Rawhide-20170731.n.0-desktop_notifications_live@64bit
On the server end I see this:
Aug 02 23:56:07 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker: 1 accepted job 140359
Aug 02 23:56:07 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker 1 websocket connection closed - 1005
Aug 02 23:56:09 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] GET "/ws/1"
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Routing to a callback
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] API key from client: f1f2fb7abe8eb23f
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] API auth by user: http://adamwill.id.fedoraproject.org/, operator: 1
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Routing to a callback
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] 101 Switching Protocols (0.012517s, 79.891/s)
Aug 02 23:56:14 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:56:17 openqa-stg01.qa.fedoraproject.org openqa[7877]: [7884:info] Got status update for job with no worker assigned (maybe running job already considered dead): 140359
Aug 02 23:56:18 openqa-stg01.qa.fedoraproject.org openqa[7877]: [7886:info] Got status update for job with no worker assigned (maybe running job already considered dead): 140359
Probably some kind of race condition with the dead job detection stuff...