Project

General

Profile

Actions

action #25124

closed

[tools][sprint 201709.1] Workers disconnects from websocket server and getting stuck: job shows as 'State: assigned' forever

Added by AdamWill over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
-
Start date:
2017-09-08
Due date:
% Done:

0%

Estimated time:

Description

So with a git checkout from 20170907, we're still seeing workers get stuck in the Fedora openQA instances. The symptoms are:

after that it just does that same thing over and over - 'Sending worker status', with the occasional 'uninitialized value' error - until I restart the process.

I see that there's one further tweak to the scheduler code this morning:
https://github.com/os-autoinst/openQA/pull/1450

I'll deploy that, and see if it helps.


Related issues 2 (0 open2 closed)

Related to openQA Project - coordination #32851: [tools][EPIC] Scheduling redesignResolvedokurz2018-05-05

Actions
Related to openQA Project - action #35296: Error messages on worker about "Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 359, <GEN298662> line 4."Rejected2018-04-20

Actions
Actions #1

Updated by AdamWill over 6 years ago

And I already got a stuck worker with the latest code, here are the worker logs, starting several jobs back at the point suspicious errors start appearing:

Sep 08 22:25:41 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Job 156843 scheduled for next cycle
Sep 08 22:25:41 qa07.qa.fedoraproject.org worker[28095]: [INFO] got job 156843: 00156843-fedora-27-universal-x86_64-BuildFedora-27-20170907.n.0-ANA272013-NOREPORT-install_delete_partial@uefi
Sep 08 22:25:43 qa07.qa.fedoraproject.org worker[28095]: [INFO] 7062: WORKING 156843
Sep 08 22:25:43 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending IMMEDIATELY worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:25:45 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:30:53 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Refusing, we are already performing another job
Sep 08 22:30:53 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:30:53 qa07.qa.fedoraproject.org worker[28095]: [ERROR] 400 response: Bad Request (remaining tries: 2)
Sep 08 22:30:58 qa07.qa.fedoraproject.org worker[28095]: [ERROR] 400 response: Bad Request (remaining tries: 1)
Sep 08 22:31:03 qa07.qa.fedoraproject.org worker[28095]: [ERROR] 400 response: Bad Request (remaining tries: 0)
Sep 08 22:31:03 qa07.qa.fedoraproject.org worker[28095]: [ERROR] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Sep 08 22:31:04 qa07.qa.fedoraproject.org worker[28095]: [ERROR] 400 response: Bad Request (remaining tries: 2)
Sep 08 22:31:08 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:31:09 qa07.qa.fedoraproject.org worker[28095]: [ERROR] 400 response: Bad Request (remaining tries: 1)
Sep 08 22:31:12 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Refusing, we are already performing another job
Sep 08 22:31:13 qa07.qa.fedoraproject.org worker[28095]: [INFO] registering worker with openQA http://openqa-stg01.qa.fedoraproject.org...
Sep 08 22:31:13 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Connection turned off from http://openqa-stg01.qa.fedoraproject.org - 1005 : Not specified
Sep 08 22:31:14 qa07.qa.fedoraproject.org worker[28095]: [ERROR] 400 response: Bad Request (remaining tries: 0)
Sep 08 22:31:14 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Sep 08 22:31:14 qa07.qa.fedoraproject.org worker[28095]: killed 7062
Sep 08 22:31:55 qa07.qa.fedoraproject.org worker[28095]: [INFO] cleaning up 00156843-fedora-27-universal-x86_64-BuildFedora-27-20170907.n.0-ANA272013-NOREPORT-install_delete_partial@uefi
Sep 08 22:31:55 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Build finished, setting us free to pick up new jobs
Sep 08 22:31:55 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending IMMEDIATELY worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:31:55 qa07.qa.fedoraproject.org worker[28095]: [INFO] registering worker with openQA http://openqa-stg01.qa.fedoraproject.org...
Sep 08 22:31:55 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Connection turned off from http://openqa-stg01.qa.fedoraproject.org - 1008 : Connection terminated from WebSocket server - thought dead
Sep 08 22:32:09 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Job 156823 scheduled for next cycle
Sep 08 22:32:09 qa07.qa.fedoraproject.org worker[28095]: [INFO] got job 156823: 00156823-fedora-27-universal-x86_64-BuildFedora-27-20170907.n.0-ANA272013-NOREPORT-install_lvmthin@64bit
Sep 08 22:32:09 qa07.qa.fedoraproject.org worker[28095]: [INFO] 7519: WORKING 156823
Sep 08 22:32:09 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending IMMEDIATELY worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:32:15 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:32:30 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:32:45 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:33:00 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:33:15 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:33:30 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:33:45 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:34:00 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:34:15 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:34:33 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:34:48 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:35:03 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:35:18 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:35:33 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:35:48 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:36:03 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:36:18 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:36:33 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:36:48 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:36:55 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Connection turned off from http://openqa-stg01.qa.fedoraproject.org - 1006 : Not specified
Sep 08 22:37:15 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:37:30 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:37:45 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:38:00 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:38:15 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:38:34 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:38:53 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:39:08 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:39:23 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:39:38 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:39:41 qa07.qa.fedoraproject.org worker[28095]: [INFO] cleaning up 00156823-fedora-27-universal-x86_64-BuildFedora-27-20170907.n.0-ANA272013-NOREPORT-install_lvmthin@64bit
Sep 08 22:39:41 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Build finished, setting us free to pick up new jobs
Sep 08 22:39:41 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending IMMEDIATELY worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:39:41 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Connection turned off from http://openqa-stg01.qa.fedoraproject.org - 1008 : Connection terminated from WebSocket server - thought dead
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 137.
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 140.
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Job 156829 scheduled for next cycle
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: [INFO] got job 156829: 00156829-fedora-27-universal-x86_64-BuildFedora-27-20170907.n.0-ANA272013-NOREPORT-install_blivet_xfs@64bit
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Use of uninitialized value $host in pattern match (m//) at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 468.
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 131.
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 148.
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: [WARN] job is missing files, releasing job
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Mojo::Reactor::Poll: I/O watcher failed: No worker id or webui host set! at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 184.
Sep 08 22:40:01 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:40:16 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:40:31 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:40:46 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:41:01 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:41:16 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:41:31 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:41:46 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:42:01 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:42:16 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:42:31 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:42:46 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:43:01 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:43:16 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:43:31 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:43:46 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:44:01 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:44:16 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:44:31 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:44:44 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Connection turned off from http://openqa-stg01.qa.fedoraproject.org - 1006 : Not specified
Sep 08 22:45:04 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:45:19 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:45:34 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:45:49 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:46:04 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:46:19 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:46:34 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:46:49 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:47:04 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:47:19 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:47:34 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:47:50 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:48:05 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:48:20 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:48:35 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:48:50 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:49:05 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:49:20 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:49:35 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:49:37 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Connection turned off from http://openqa-stg01.qa.fedoraproject.org - 1006 : Not specified
Sep 08 22:49:57 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:50:12 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:50:27 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:50:42 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:50:57 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:51:12 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:51:27 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:51:42 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:51:57 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Actions #2

Updated by AdamWill over 6 years ago

Still seeing this with the 20170908 code and Mojolicious 7.45. Here's a current case:

https://openqa.stg.fedoraproject.org/tests/157566

Actions #3

Updated by EDiGiacinto over 6 years ago

  • Subject changed from Workers still getting stuck with 20170907 scheduler: job shows as 'State: assigned' forever, worker logs constantly sending status to server but never does anything to Workers disconnects from websocket server and getting stuck: job shows as 'State: assigned' forever
  • Category changed from 122 to Regressions/Crashes

AdamWill wrote:

Sep 08 22:39:38 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:39:41 qa07.qa.fedoraproject.org worker[28095]: [INFO] cleaning up 00156823-fedora-27-universal-x86_64-BuildFedora-27-20170907.n.0-ANA272013-NOREPORT-install_lvmthin@64bit
Sep 08 22:39:41 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Build finished, setting us free to pick up new jobs
Sep 08 22:39:41 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Sending IMMEDIATELY worker status to http://openqa-stg01.qa.fedoraproject.org
Sep 08 22:39:41 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Connection turned off from http://openqa-stg01.qa.fedoraproject.org - 1008 : Connection terminated from WebSocket server - thought dead
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 137.
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 140.
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: [DEBUG] Job 156829 scheduled for next cycle
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: [INFO] got job 156829: 00156829-fedora-27-universal-x86_64-BuildFedora-27-20170907.n.0-ANA272013-NOREPORT-install_blivet_xfs@64bit
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Use of uninitialized value $host in pattern match (m//) at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 468.
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 131.
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 148.
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: [WARN] job is missing files, releasing job
Sep 08 22:39:44 qa07.qa.fedoraproject.org worker[28095]: Mojo::Reactor::Poll: I/O watcher failed: No worker id or webui host set! at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 184.

As i see it, the problem hides here, not in the websocket server: i see failure in updating the web interface about the job, due to a premature connection close that is cleaning the job current_host

Actions #4

Updated by EDiGiacinto over 6 years ago

Seems like so, we see the same errors in one of the workers that is currently stuck:

Sep 10 12:37:37 openqaworker2 worker[21911]: [DEBUG] Sending worker status to openqa.suse.de                           
Sep 10 12:37:19 openqaworker2 worker[21911]: Mojo::Reactor::Poll: I/O watcher failed: No worker id or webui host set! at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 184.                                                    
Sep 10 12:37:19 openqaworker2 worker[21911]: [WARN] job is missing files, releasing job                                
Sep 10 12:37:19 openqaworker2 worker[21911]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 160.                   
Sep 10 12:37:19 openqaworker2 worker[21911]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 148.                   
Sep 10 12:37:19 openqaworker2 worker[21911]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 131.                   
Sep 10 12:37:19 openqaworker2 worker[21911]: Use of uninitialized value $host in pattern match (m//) at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 468.                                                                       
Sep 10 12:37:19 openqaworker2 worker[21911]: [INFO] got job 1157542: 01157542-sle-15-Leanos-DVD-Staging:G-x86_64-BuildG.21.2-default_install@64bit-staging                                                                                    
Sep 10 12:37:19 openqaworker2 worker[21911]: [DEBUG] Job 1157542 scheduled for next cycle                              
Sep 10 12:37:19 openqaworker2 worker[21911]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 140.                                                                          
Sep 10 12:37:19 openqaworker2 worker[21911]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 137.                                                                          
Sep 10 12:37:17 openqaworker2 worker[21911]: [DEBUG] Connection turned off from openqa.suse.de - 1006 :                
Sep 10 12:37:17 openqaworker2 worker[21911]: Use of uninitialized value $reason in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 362.                                                           
Sep 10 12:37:11 openqaworker2 worker[21911]: [DEBUG] Sending worker status to openqa.suse.de                           
Sep 10 12:36:56 openqaworker2 worker[21911]: [DEBUG] Sending worker status to openqa.suse.de                           
Sep 10 12:36:41 openqaworker2 worker[21911]: [DEBUG] Sending worker status to openqa.suse.de                           

Edit:
If i'm correct, as a confirmation of a bad behavior from workers, you should see in websocket server logs that the worker is sending a wrong message, declaring that the job is in "queue" for the worker instead of declaring itself free, with state in scheduled (you should see it as running) - with the $job still set due to not to a clear teardown. The beauty of shared global variables :)

Actions #5

Updated by EDiGiacinto over 6 years ago

https://github.com/os-autoinst/openQA/pull/1451 Merged, please let us know if the problem still persist

Actions #6

Updated by AdamWill over 6 years ago

Am trying with that plus the commit from your git repo that sets the websocket inactivity timeout to 0.

Actions #7

Updated by AdamWill over 6 years ago

  • Status changed from New to Resolved

Well, since this fix, I'm not getting stuck workers any more. I am still getting jobs incompleted without being duplicated, but I'm filing other issues on those problems. So, let's close this one.

Actions #8

Updated by szarate over 6 years ago

  • Subject changed from Workers disconnects from websocket server and getting stuck: job shows as 'State: assigned' forever to [tools][sprint 201709.1] Workers disconnects from websocket server and getting stuck: job shows as 'State: assigned' forever
Actions #9

Updated by EDiGiacinto about 6 years ago

Actions #10

Updated by EDiGiacinto about 6 years ago

  • Related to action #35296: Error messages on worker about "Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 359, <GEN298662> line 4." added
Actions

Also available in: Atom PDF