openSUSE Project Management Tool
openQA Project - action #16380: workers die like flies (https://progress.opensuse.org/issues/16380)

Updated by oholecek (oholecek@suse.com) on 2017-02-01T10:31:20Z (https://progress.opensuse.org/issues/16380?journal_id=37804)
<p>This happened to me when a job was set done before all status updates were finished; we need to review the ordering in stop_job. The second part is that the response to an unsuccessful API call is too harsh, I think: a 404 should not end up as an API failure that shuts down the worker, IMO.</p>
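The behaviour being argued for (treat a 404 on a status update as "abandon this job", not "kill the worker") can be sketched roughly. This is a hypothetical Python illustration of the classification, not openQA's actual Perl code, and the action names are made up:

```python
# Hypothetical sketch (not openQA's real worker code): map the HTTP status of
# a job status update to a worker-level action, so that a 404 stops only the
# affected job instead of taking the whole worker down.
def handle_status_update_response(status_code):
    """Return the action a worker should take for a given HTTP status."""
    if status_code == 200:
        return "ok"
    if status_code == 404:
        # The web UI no longer knows the job (it likely marked it dead):
        # abandon the job, but keep the worker alive and ready for new work.
        return "stop_job"
    if 500 <= status_code < 600:
        # Server-side trouble is usually transient: retry with backoff.
        return "retry"
    # Anything else (auth errors etc.) is a genuine API failure.
    return "api_failure"
```

The key design point is that only the last branch should ever be fatal to the worker process.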
Updated by coolo (coolo@suse.com) on 2017-02-01T11:37:45Z (https://progress.opensuse.org/issues/16380?journal_id=37814)
<p>No, but it should stop the job. Dead jobs are marked this way.</p>
Updated by oholecek (oholecek@suse.com) on 2017-02-01T15:28:54Z (https://progress.opensuse.org/issues/16380?journal_id=37848)
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Assignee</strong> set to <i>oholecek</i></li></ul><p>There is an obvious bug: worker registration is scheduled without <code>$host</code> specified, while stop_job clears <code>$current_host</code> in the meantime. Testing a fix for it now. But this fixes only the symptom (the dying worker), not the actual cause, which will still mark the job as incomplete.</p>
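The race described above can be illustrated with a minimal, hypothetical sketch (Python rather than the worker's actual Perl; the `Worker` class and the captured-host fix are illustrative assumptions, not the merged patch):

```python
# Hypothetical sketch of the race: a re-registration callback that reads a
# shared current_host at fire time can find it already cleared by stop_job,
# which is what produces "No worker id or webui host set!". Capturing the
# host when the callback is scheduled avoids that.
class Worker:
    def __init__(self, host):
        self.current_host = host

    def stop_job(self):
        # stop_job clears the current host while callbacks may still be queued
        self.current_host = None

    def schedule_register(self):
        host = self.current_host  # capture now, not when the timer fires

        def register():
            if host is None:
                raise RuntimeError("No worker id or webui host set!")
            return "registering with " + host

        return register

w = Worker("openqa.example")
cb = w.schedule_register()
w.stop_job()   # the race: host is cleared before the "timer" fires
print(cb())    # prints "registering with openqa.example"
```

Had `register` read `self.current_host` directly, the call after `stop_job` would raise instead of re-registering.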
Updated by okurz (okurz@suse.com) on 2017-03-01T09:24:19Z (https://progress.opensuse.org/issues/16380?journal_id=42408)
<ul><li><strong>Target version</strong> changed from <i>Milestone 5</i> to <i>Milestone 6</i></li></ul>

Updated by AdamWill on 2017-03-14T19:30:49Z (https://progress.opensuse.org/issues/16380?journal_id=43872)
<p>The symptom is pretty bad and needs fixing regardless (this is also affecting both Fedora openQA deployments and killing our workers).</p>
<p>But yes, we do need to also figure out <em>why</em> workers are getting 404s in the first place. Here are the logs from one case on our staging instance.</p>
<p>Worker log:</p>
<p>Mar 14 17:18:12 qa09.qa.fedoraproject.org worker[5943]: [INFO] 19431: WORKING 79745<br>
Mar 14 17:23:01 qa09.qa.fedoraproject.org worker[5943]: [ERROR] 404 response: Not Found (remaining tries: 0)<br>
Mar 14 17:23:01 qa09.qa.fedoraproject.org worker[5943]: [ERROR] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)<br>
Mar 14 17:23:01 qa09.qa.fedoraproject.org worker[5943]: Mojo::Reactor::Poll: Timer failed: No worker id or webui host set! at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 181.<br>
Mar 14 17:23:03 qa09.qa.fedoraproject.org worker[5943]: WebUI Mojo::IOLoop=HASH(0x3558158) is unknown! - Should not happen but happened, exiting! at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 404.<br>
Mar 14 17:23:03 qa09.qa.fedoraproject.org worker[5943]: [INFO] registering worker with openQA Mojo::IOLoop=HASH(0x3558158)...</p>
<p>Server Apache log, grepped for 79745:</p>
<p>[root@openqa-stg01 adamwill][PROD]# grep 79745 /var/log/httpd/access_log<br>
10.5.124.239 - - [14/Mar/2017:17:18:22 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:18:32 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:18:42 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:18:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:19:02 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:19:12 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:19:22 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 261 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:19:32 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 226 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:19:42 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 191 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:19:43 +0000] "POST /api/v1/jobs/79745/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:19:44 +0000] "POST /api/v1/jobs/79745/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:19:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:20:02 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:20:12 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:20:22 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:20:32 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:20:42 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:20:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:21:02 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:21:12 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:21:22 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:21:32 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:21:42 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:21:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 200 52 "-" "Mojolicious (Perl)"<br>
10.5.124.239 - - [14/Mar/2017:17:22:52 +0000] "POST /api/v1/jobs/79745/status HTTP/1.1" 404 49 "-" "Mojolicious (Perl)"</p>
<p>Server journal, grepped for 79745:</p>
<p>Mar 14 17:22:44 openqa-stg01.qa.fedoraproject.org openqa-websockets[20266]: [Tue Mar 14 17:22:44 2017] [websockets:debug] job considered dead: 79745 worker 24 not seen. In state running<br>
Mar 14 17:22:45 openqa-stg01.qa.fedoraproject.org openqa-websockets[20266]: [Tue Mar 14 17:22:45 2017] [websockets:warn] dead job 79745 aborted and duplicated 79978<br>
Mar 14 17:22:52 openqa-stg01.qa.fedoraproject.org openqa[20267]: [Tue Mar 14 17:22:52 2017] [13376:info] Got status update for job with no worker assigned (maybe running job already considered dead): 79745</p>
<p>So basically the worker is checking in every ten seconds and everything is hunky dory, then it fails to check in for a minute (between 17:21:52 and 17:22:52). During that minute, at 17:22:44, the websockets server decides the job is dead and aborts it. When the worker <em>does</em> try to check in at 17:22:52, the server returns a 404 and logs it. That is the 404 that triggers the worker death via the issue with the <code>register_worker</code> call.</p>
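The server-side timing in this timeline can be sketched in a few lines; the numbers below are illustrative assumptions reconstructed from the logs, not openQA's actual configuration:

```python
# Rough sketch of the dead-job detection timing described above: the worker
# posts status every 10 s, and the web UI declares a running job dead once the
# worker has not been seen for some timeout. DEAD_AFTER is an assumed value
# chosen to match the observed timeline, not a real openQA setting.
DEAD_AFTER = 50  # seconds without a status update before the job is "dead"

def job_is_dead(last_seen, now):
    """True if the worker's last check-in is older than the timeout."""
    return (now - last_seen) > DEAD_AFTER

# Timeline in seconds since the last successful check-in at 17:21:52:
assert not job_is_dead(last_seen=0, now=10)  # a normal 10 s heartbeat is fine
assert job_is_dead(last_seen=0, now=52)      # 17:22:44: job declared dead
# The 17:22:52 POST then hits a job with no worker assigned and gets the 404.
```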
Updated by AdamWill on 2017-03-14T20:12:08Z (https://progress.opensuse.org/issues/16380?journal_id=43874)
<p>Wow, OK, so something wacky clearly happened on the worker host in question around that time in this case (and one other): there are some messy complaints from the kernel. Not sure exactly what was going on, but logs are attached for interest. Dunno if this is what's happening to you guys, or if it's something different.</p>
Updated by AdamWill on 2017-03-14T20:13:32Z (https://progress.opensuse.org/issues/16380?journal_id=43876)
<ul><li><strong>File</strong> <a href="/attachments/2964">wtf.txt</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/2964/wtf.txt">wtf.txt</a> added</li></ul>

Updated by oholecek (oholecek@suse.com) on 2017-03-20T13:13:36Z (https://progress.opensuse.org/issues/16380?journal_id=44328)
<ul><li><strong>% Done</strong> changed from <i>0</i> to <i>80</i></li></ul><p><a href="https://github.com/os-autoinst/openQA/pull/1215" class="external">PR</a> merged; the tests are still missing.</p>
Updated by okurz (okurz@suse.com) on 2017-03-23T10:08:28Z (https://progress.opensuse.org/issues/16380?journal_id=44692)
<ul><li><strong>Priority</strong> changed from <i>High</i> to <i>Normal</i></li></ul><p>I guess this is less critical now. Reducing priority.</p>
Updated by okurz (okurz@suse.com) on 2017-04-20T09:12:21Z (https://progress.opensuse.org/issues/16380?journal_id=48034)
<ul><li><strong>Target version</strong> changed from <i>Milestone 6</i> to <i>Milestone 7</i></li></ul><p><em>bump</em></p>
Updated by okurz (okurz@suse.com) on 2017-05-23T06:48:50Z (https://progress.opensuse.org/issues/16380?journal_id=50950)
<ul><li><strong>Target version</strong> changed from <i>Milestone 7</i> to <i>Milestone 8</i></li></ul><p>Who has information on whether this still applies? Are there any workaround cron/monitoring jobs that restart workers because of this, or are we done here?</p>
Updated by coolo (coolo@suse.com) on 2017-05-23T07:14:58Z (https://progress.opensuse.org/issues/16380?journal_id=50962)
<p>If you could stop messing with target versions outside of backlog planning meetings, we would have a way to go through this systematically. But continue messing.</p>
Updated by coolo (coolo@suse.com) on 2017-07-17T07:28:56Z (https://progress.opensuse.org/issues/16380?journal_id=56642)
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Rejected</i></li></ul><p>The GitHub issue is there, but it was left WIP for months.</p>