Project

General

Profile

Actions

action #21028

closed

Some jobs immediately fail with a 404 (new scheduling logic)

Added by AdamWill over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
Feature requests
Target version:
-
Start date:
2017-08-03
Due date:
% Done:

0%

Estimated time:

Description

I just upgraded our staging instance to latest git master (with the new scheduling logic where the server sends jobs to the workers rather than the workers grabbing jobs from the server). I then kicked off a full compose test run. It mostly worked, but a few jobs came up incomplete with no logs / assets, like this one:

https://openqa.stg.fedoraproject.org/tests/140359

Looking at the worker logs, I see this:

Aug 02 23:56:07 qa07.qa.fedoraproject.org worker[16436]: [INFO] 19425: WORKING 140359
Aug 02 23:56:17 qa07.qa.fedoraproject.org worker[16436]: [ERROR] 404 response: Not Found (remaining tries: 0)
Aug 02 23:56:17 qa07.qa.fedoraproject.org worker[16436]: [ERROR] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Aug 02 23:56:18 qa07.qa.fedoraproject.org worker[16436]: [ERROR] 404 response: Not Found (remaining tries: 0)
Aug 02 23:56:18 qa07.qa.fedoraproject.org worker[16436]: killed 19425
Aug 02 23:56:59 qa07.qa.fedoraproject.org worker[16436]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Aug 02 23:56:59 qa07.qa.fedoraproject.org worker[16436]: [INFO] cleaning up 00140359-fedora-Rawhide-KDE-live-iso-x86_64-BuildFedora-Rawhide-20170731.n.0-desktop_notifications_live@64bit

On the server end I see this:

Aug 02 23:56:07 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker: 1 accepted job 140359
Aug 02 23:56:07 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Worker 1 websocket connection closed - 1005
Aug 02 23:56:09 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] GET "/ws/1"
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Routing to a callback
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] API key from client: f1f2fb7abe8eb23f
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] API auth by user: http://adamwill.id.fedoraproject.org/, operator: 1
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] Routing to a callback
Aug 02 23:56:12 openqa-stg01.qa.fedoraproject.org openqa-websockets[7851]: [websockets:debug] 101 Switching Protocols (0.012517s, 79.891/s)
Aug 02 23:56:14 openqa-stg01.qa.fedoraproject.org openqa-scheduler[7849]: [scheduler:debug] IPC calling mutex_lock
Aug 02 23:56:17 openqa-stg01.qa.fedoraproject.org openqa[7877]: [7884:info] Got status update for job with no worker assigned (maybe running job already considered dead): 140359
Aug 02 23:56:18 openqa-stg01.qa.fedoraproject.org openqa[7877]: [7886:info] Got status update for job with no worker assigned (maybe running job already considered dead): 140359

Probably some kind of race condition with the dead job detection stuff...


Related issues 1 (0 open1 closed)

Related to openQA Tests (public) - action #20378: [tools]Too many 502 on openqaResolvedcoolo2017-07-18

Actions
Actions

Also available in: Atom PDF