The symptom is pretty bad and needs fixing regardless (this is also affecting both Fedora openQA deployments and killing our workers).
But yes, we do also need to figure out why workers are getting 404s in the first place. Here are the logs from one case on our staging instance.
Worker log:
Mar 14 17:18:12 qa09.qa.fedoraproject.org worker[5943]: [INFO] 19431: WORKING 79745
Mar 14 17:23:01 qa09.qa.fedoraproject.org worker[5943]: [ERROR] 404 response: Not Found (remaining tries: 0)
Mar 14 17:23:01 qa09.qa.fedoraproject.org worker[5943]: [ERROR] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Mar 14 17:23:01 qa09.qa.fedoraproject.org worker[5943]: Mojo::Reactor::Poll: Timer failed: No worker id or webui host set! at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 181.
Mar 14 17:23:03 qa09.qa.fedoraproject.org worker[5943]: WebUI Mojo::IOLoop=HASH(0x3558158) is unknown! - Should not happen but happened, exiting! at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 404.
Mar 14 17:23:03 qa09.qa.fedoraproject.org worker[5943]: [INFO] registering worker with openQA Mojo::IOLoop=HASH(0x3558158)...
Server Apache log, grepped for 79745:
[root@openqa-stg01 adamwill][PROD]# grep 79745 /var/log/httpd/access_log
10.5.124.239 - - [14/Mar/2017:17:18:22 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:18:32 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:18:42 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:18:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:19:02 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:19:12 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:19:22 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 261 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:19:32 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 226 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:19:42 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 191 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:19:43 +0000] "POST /api/v1/jobs/79745/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:19:44 +0000] "POST /api/v1/jobs/79745/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:19:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:20:02 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:20:12 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:20:22 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:20:32 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:20:42 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:20:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:21:02 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:21:12 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:21:22 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:21:32 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:21:42 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:21:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"
10.5.124.239 - - [14/Mar/2017:17:22:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 404 49 "-" "Mojolicious (Perl)"
Server journal, grepped for 79745:
Mar 14 17:22:44 openqa-stg01.qa.fedoraproject.org openqa-websockets[20266]: [Tue Mar 14 17:22:44 2017] [websockets:debug] job considered dead: 79745 worker 24 not seen. In state running
Mar 14 17:22:45 openqa-stg01.qa.fedoraproject.org openqa-websockets[20266]: [Tue Mar 14 17:22:45 2017] [websockets:warn] dead job 79745 aborted and duplicated 79978
Mar 14 17:22:52 openqa-stg01.qa.fedoraproject.org openqa[20267]: [Tue Mar 14 17:22:52 2017] [13376:info] Got status update for job with no worker assigned (maybe running job already considered dead): 79745
So basically it seems like the worker is checking in every ten seconds and everything is hunky dory, then the worker fails to check in for a minute (between 17:21:52 and 17:22:52). During that minute, at 17:22:44, websockets decides the job is dead and aborts it, duplicating it as 79978 a second later. Then when the worker does try to check in at 17:22:52, the server returns a 404 and logs it. And that's the 404 that triggers the worker death due to the issue with the register_worker call.
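
For anyone who wants to check their own access logs for the same pattern, here's a rough sketch of a script that flags gaps between status updates for a job. The function name `find_gaps` and the 30-second threshold are my own assumptions, not anything from openQA, and the regex only assumes the Apache log format shown above.

```python
import re
from datetime import datetime, timedelta

# Pulls the timestamp, job id and response code out of a status-update
# line in the Apache access log format shown above (an assumption about
# the log format, not something openQA guarantees).
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "POST /api/v1/jobs/(?P<job>\d+)/status HTTP/1\.1" (?P<code>\d{3})'
)


def find_gaps(lines, job_id, max_gap=timedelta(seconds=30)):
    """Hypothetical helper: return (previous, current, status_code) for each
    status update that arrived more than max_gap after the previous one."""
    gaps = []
    prev = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m or m.group("job") != str(job_id):
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if prev is not None and ts - prev > max_gap:
            gaps.append((prev, ts, int(m.group("code"))))
        prev = ts
    return gaps


# The last three status updates from the access log above.
sample = [
    '10.5.124.239 - - [14/Mar/2017:17:21:42 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"',
    '10.5.124.239 - - [14/Mar/2017:17:21:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"',
    '10.5.124.239 - - [14/Mar/2017:17:22:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 404 49 "-" "Mojolicious (Perl)"',
]

for prev, cur, code in find_gaps(sample, 79745):
    print(f"gap of {(cur - prev).total_seconds():.0f}s ending at {cur:%H:%M:%S} (HTTP {code})")
```

To run it against a real log, pass an open file handle instead of `sample`, e.g. `find_gaps(open("/var/log/httpd/access_log"), 79745)`.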