Project

General

Profile

Actions

action #54680

closed

Worker re-registration after websocket disconnects causes incompletes

Added by mkittler almost 5 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2019-07-26
Due date:
% Done:

0%

Estimated time:

Description

Jul 26 11:36:05 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:06 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:07 openqaworker8 worker[64627]: [error] [pid:64627] Websocket connection to http://openqa.suse.de/api/v1/ws/1275 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 26 11:36:07 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:08 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:09 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:10 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:11 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:12 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:13 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:14 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:15 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:17 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:17 openqaworker8 worker[64627]: [info] [pid:64627] Registering with openQA openqa.suse.de
Jul 26 11:36:17 openqaworker8 worker[64627]: [info] [pid:64627] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1275
Jul 26 11:36:17 openqaworker8 worker[64627]: [info] [pid:64627] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1275
Jul 26 11:36:18 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:18 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 2)
Jul 26 11:36:23 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 1)
Jul 26 11:36:28 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 0)
Jul 26 11:36:28 openqaworker8 worker[64627]: [debug] [pid:64627] Stopping job 3136372 from openqa.suse.de: 03136372-sle-15-SP1-Installer-DVD-x86_64-Build228.2-zluo-poo_50336@64bit - reason: api-failure
Jul 26 11:36:28 openqaworker8 worker[64627]: [error] [pid:64627] Aborting job because web UI doesn't accept new images anymore (likely considers this job dead)
Jul 26 11:36:28 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:28 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 2)
Jul 26 11:36:29 openqaworker8 worker[64627]: [warn] [pid:64627] Can't open /var/lib/openqa/pool/1/testresults/result-consoletest_setup.json for result upload - likely isotovideo could not be started or failed early. Error message: No su>
Jul 26 11:36:29 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:29 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 2)
Jul 26 11:36:33 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 1)
Jul 26 11:36:34 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 1)
Jul 26 11:36:38 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 0)
Jul 26 11:36:38 openqaworker8 worker[64627]: [info] [pid:64627] Trying to stop job gracefully by announcing it to command server via http://localhost:20013/2rO1xssSrlFHWKfa/broadcast
Jul 26 11:36:42 openqaworker8 worker[64627]: [info] [pid:64627] Isotovideo exit status: 1
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] +++ worker notes +++
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] End time: 2019-07-26 09:36:47
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] Result: api-failure
Jul 26 11:36:47 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/set_done?result=incomplete
Jul 26 11:36:47 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 0)
Jul 26 11:36:47 openqaworker8 worker[64627]: [error] [pid:64627] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] Registering with openQA openqa.suse.de
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1275
Jul 26 11:36:47 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/set_done
Jul 26 11:36:47 openqaworker8 worker[64627]: [debug] [pid:64627] Websocket connection to http://openqa.suse.de/api/v1/ws/1275 finished from our side.
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1275
Jul 26 11:36:47 openqaworker8 worker[64627]: [debug] [pid:64627] Job 3136372 from openqa.suse.de finished - reason: api-failure
Jul 26 11:36:47 openqaworker8 worker[64627]: [debug] [pid:64627] Cleaning up for next job
Jul 26 11:38:53 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:38:53 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:38:54 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:38:54 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:38:54 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:38:54 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:38:55 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:38:55 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:38:57 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:38:57 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:39:06 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:39:06 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:39:06 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:39:06 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }

To simplify the worker/web UI communication and to prevent having job in 'running' state which are no longer actually worked on the worker now tries to re-register after loosing its web socket connection or after too many fatal API errors.

Unfortunately the re-registration in the middle of the job execution leads to the behavior above where the web UI rejects any further status updates and in consequence the worker abandons the job. Interestingly the websocket server still tries to send messages for that job.

I suppose the web UI should not instantly refuse any further updates when the worker registers again. However, we still need to take care not to have any jobs in perpetual 'running' state. It would also be nice if the websocket server would behave accordingly if the web UI gives up on a job although the worker seems to be good at ignoring those web socket messages.

Actions #1

Updated by mkittler over 4 years ago

  • Description updated (diff)
Actions #2

Updated by mkittler over 4 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version set to Current Sprint

It seems that not all incompletes on o3 are caused by this issue. But there are other problems, e.g.

Jul 29 09:51:25 openqa-aarch64 worker[3075]: [info] Registering with openQA http://openqa1-opensuse
Jul 29 09:51:25 openqa-aarch64 worker[3075]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/172
Jul 29 09:51:26 openqa-aarch64 worker[3075]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 172
Jul 29 09:51:42 openqa-aarch64 worker[3075]: [info] +++ setup notes +++
Jul 29 09:51:42 openqa-aarch64 worker[3075]: [info] Start time: 2019-07-29 07:51:42
Jul 29 09:51:42 openqa-aarch64 worker[3075]: [info] Running on openqa-aarch64:7 (Linux 4.12.14-lp151.28.10-default #1 SMP Sat Jul 13 17:59:31 UTC 2019 (0ab03b7) aarch64)
Jul 29 09:52:02 openqa-aarch64 worker[3075]: [info] Preparing cgroups to start isotovideo
Jul 29 09:52:02 openqa-aarch64 worker[3075]: Use of uninitialized value in subroutine entry at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 97.
Jul 29 09:52:02 openqa-aarch64 worker[3075]: [info] Starting isotovideo container
Jul 29 09:52:02 openqa-aarch64 worker[3075]: [info] isotovideo has been started (PID: 3506)
Jul 29 09:52:02 openqa-aarch64 worker[3075]: [info] 3506: WORKING 993618
Jul 29 09:55:14 openqa-aarch64 worker[3075]: [warn] Ignoring WS message from http://openqa1-opensuse with type grab_job but no job ID (currently running 993618 for http://openqa1-opensuse):
Jul 29 09:55:14 openqa-aarch64 worker[3075]: {
Jul 29 09:55:14 openqa-aarch64 worker[3075]:   job  => {
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             assets             => {
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     iso  => ["openSUSE-MicroOS-DVD-aarch64-Snapshot20190726-Media.iso"],
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     repo => [
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                               "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                               "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-source",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                               "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-debuginfo",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                             ],
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                   },
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             assigned_worker_id => 172,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             blocked_by_id      => undef,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             clone_id           => undef,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             group              => "openSUSE Tumbleweed AArch64",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             group_id           => 3,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             id                 => 993628,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             name               => "microos-Tumbleweed-DVD-aarch64-Build20190726-container-host\@aarch64-4G-HD40G",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             priority           => 50,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             result             => "none",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             settings           => {
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     ARCH                        => "aarch64",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     BACKEND                     => "qemu",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     BUILD                       => 20190726,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     CHECKSUM_ISO                => "093afea92a51b262433c87709af12f1e66ef0c3bab572a2eb507595af0c53175",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     DESKTOP                     => "textmode",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     DISTRI                      => "microos",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     EXTRA                       => "FEATURES",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     FLAVOR                      => "DVD",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     FULLURL                     => 1,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     GRUB_KERNEL_OPTION_APPEND   => " rcupdate.rcu_cpu_stall_suppress=1 ",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     HDDMODEL                    => "virtio-blk-device",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     HDDSIZEGB                   => 40,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     ISO                         => "openSUSE-MicroOS-DVD-aarch64-Snapshot20190726-Media.iso",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     JOBTOKEN                    => "s1iLsojiELq1nL4K",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     MACHINE                     => "aarch64-4G-HD40G",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     MIRROR_HTTP                 => "http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     MIRROR_HTTPS                => "https://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     MIRROR_PREFIX               => "http://openqa.opensuse.org/assets/repo",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     MULTI_STEP_KUBIC_FLOW       => 1,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     NAME                        => "00993628-microos-Tumbleweed-DVD-aarch64-Build20190726-container-host\@aarch64-4G-HD40G",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     QEMU                        => "aarch64",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     QEMU_COMPRESS_LEVEL         => 3,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     QEMU_COMPRESS_THREADS       => 3,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     QEMU_HUGE_PAGES_PATH        => "/dev/hugepages/",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     QEMU_VIRTIO_RNG             => 1,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     QEMUCPU                     => "host",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     QEMUMACHINE                 => "virt,gic-version=host",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     QEMURAM                     => 4096,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_0                      => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_1                      => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-debuginfo",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_3                      => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-source",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_OSS                    => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_OSS_DEBUGINFO          => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-debuginfo",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_OSS_DEBUGINFO_PACKAGES => "java*,kernel-default-debug*,mraa-debug*",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_OSS_SOURCE             => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-source",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_OSS_SOURCE_PACKAGES    => "coreutils",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     ROOTONLY                    => 1,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     SERIALDEV                   => "ttyAMA0",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     SUSEMIRROR                  => "http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     SYSTEM_ROLE                 => "container-host",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     TEST                        => "container-host",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     TIMEOUT_SCALE               => 2,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     UEFI                        => 1,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     UEFI_PFLASH_CODE            => "/usr/share/qemu/aavmf-aarch64-code.bin",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     UEFI_PFLASH_VARS            => "/usr/share/qemu/aavmf-aarch64-vars.bin",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     VERSION                     => "Tumbleweed",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     VNC_TYPING_LIMIT            => 10,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     WORKER_CLASS                => "qemu_aarch64",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_1                      => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-debuginfo",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_3                      => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-source",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_OSS                    => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_OSS_DEBUGINFO          => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-debuginfo",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_OSS_DEBUGINFO_PACKAGES => "java*,kernel-default-debug*,mraa-debug*",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_OSS_SOURCE             => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-source",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     REPO_OSS_SOURCE_PACKAGES    => "coreutils",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     ROOTONLY                    => 1,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     SERIALDEV                   => "ttyAMA0",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     SUSEMIRROR                  => "http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     SYSTEM_ROLE                 => "container-host",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     TEST                        => "container-host",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     TIMEOUT_SCALE               => 2,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     UEFI                        => 1,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     UEFI_PFLASH_CODE            => "/usr/share/qemu/aavmf-aarch64-code.bin",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     UEFI_PFLASH_VARS            => "/usr/share/qemu/aavmf-aarch64-vars.bin",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     VERSION                     => "Tumbleweed",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     VNC_TYPING_LIMIT            => 10,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                     WORKER_CLASS                => "qemu_aarch64",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:                                   },
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             state              => "scheduled",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             t_finished         => undef,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             t_started          => undef,
Jul 29 09:55:14 openqa-aarch64 worker[3075]:             test               => "container-host",
Jul 29 09:55:14 openqa-aarch64 worker[3075]:           },
Jul 29 09:55:14 openqa-aarch64 worker[3075]:   type => "grab_job",
Jul 29 09:55:14 openqa-aarch64 worker[3075]: }
Jul 29 09:55:15 openqa-aarch64 worker[3075]: [error] 400 response: Got status update for job 993618 with unexpected worker ID 172 (expected none) (remaining tries: 2)
Jul 29 09:55:20 openqa-aarch64 worker[3075]: [error] 400 response: Got status update for job 993618 with unexpected worker ID 172 (expected none) (remaining tries: 1)
Jul 29 09:55:24 openqa-aarch64 worker[3075]: [warn] Ignoring WS message from http://openqa1-opensuse with type grab_job but no job ID (currently running 993618 for http://openqa1-opensuse):
Jul 29 09:55:24 openqa-aarch64 worker[3075]: {
Jul 29 09:55:24 openqa-aarch64 worker[3075]:   job  => {
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             assets             => {
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     iso  => ["openSUSE-MicroOS-DVD-aarch64-Snapshot20190726-Media.iso"],
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     repo => [
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                               "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-debuginfo",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                               "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-source",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                               "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                             ],
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                   },
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             assigned_worker_id => 172,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             blocked_by_id      => undef,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             clone_id           => undef,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             group              => "openSUSE Tumbleweed AArch64",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             group_id           => 3,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             id                 => 993627,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             name               => "microos-Tumbleweed-DVD-aarch64-Build20190726-microos_10G-disk\@aarch64",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             priority           => 50,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             result             => "none",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             settings           => {
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     ARCH                        => "aarch64",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     BACKEND                     => "qemu",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     BUILD                       => 20190726,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     CHECKSUM_ISO                => "093afea92a51b262433c87709af12f1e66ef0c3bab572a2eb507595af0c53175",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     DESKTOP                     => "textmode",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     DISTRI                      => "microos",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     FLAVOR                      => "DVD",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     FULLURL                     => 1,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     GRUB_KERNEL_OPTION_APPEND   => " rcupdate.rcu_cpu_stall_suppress=1 ",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     HDDMODEL                    => "virtio-blk-device",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     HDDSIZEGB                   => 10,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     ISO                         => "openSUSE-MicroOS-DVD-aarch64-Snapshot20190726-Media.iso",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     JOBTOKEN                    => "v51CpFHvDE_sg0Mp",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     MACHINE                     => "aarch64",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     MIRROR_HTTP                 => "http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     MIRROR_HTTPS                => "https://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     MIRROR_PREFIX               => "http://openqa.opensuse.org/assets/repo",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     MULTI_STEP_KUBIC_FLOW       => 1,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     NAME                        => "00993627-microos-Tumbleweed-DVD-aarch64-Build20190726-microos_10G-disk\@aarch64",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     QEMU                        => "aarch64",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     QEMU_COMPRESS_LEVEL         => 3,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     QEMU_COMPRESS_THREADS       => 3,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     QEMU_HUGE_PAGES_PATH        => "/dev/hugepages/",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     QEMU_VIRTIO_RNG             => 1,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     QEMUCPU                     => "host",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     QEMUMACHINE                 => "virt,gic-version=host",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     QEMURAM                     => 2048,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     REPO_0                      => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     REPO_1                      => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-debuginfo",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     REPO_3                      => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-source",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     REPO_OSS                    => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     REPO_OSS_DEBUGINFO          => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-debuginfo",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     REPO_OSS_DEBUGINFO_PACKAGES => "java*,kernel-default-debug*,mraa-debug*",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     REPO_OSS_SOURCE             => "openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726-source",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     REPO_OSS_SOURCE_PACKAGES    => "coreutils",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     ROOTONLY                    => 1,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     SERIALDEV                   => "ttyAMA0",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     SUSEMIRROR                  => "http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20190726",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     SYSTEM_ROLE                 => "microos",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     TEST                        => "microos_10G-disk",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     TIMEOUT_SCALE               => 2,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     UEFI                        => 1,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     UEFI_PFLASH_CODE            => "/usr/share/qemu/aavmf-aarch64-code.bin",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     UEFI_PFLASH_VARS            => "/usr/share/qemu/aavmf-aarch64-vars.bin",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     VERSION                     => "Tumbleweed",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     VNC_TYPING_LIMIT            => 7,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                     WORKER_CLASS                => "qemu_aarch64",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:                                   },
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             state              => "scheduled",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             t_finished         => undef,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             t_started          => undef,
Jul 29 09:55:24 openqa-aarch64 worker[3075]:             test               => "microos_10G-disk",
Jul 29 09:55:24 openqa-aarch64 worker[3075]:           },
Jul 29 09:55:24 openqa-aarch64 worker[3075]:   type => "grab_job",
Jul 29 09:55:24 openqa-aarch64 worker[3075]: }
Jul 29 09:55:25 openqa-aarch64 worker[3075]: [error] 400 response: Got status update for job 993618 with unexpected worker ID 172 (expected none) (remaining tries: 0)
Jul 29 09:55:25 openqa-aarch64 worker[3075]: [error] Aborting job because web UI doesn't accept new images anymore (likely considers this job dead)
Jul 29 09:55:25 openqa-aarch64 worker[3075]: [error] 400 response: Got status update for job 993618 with unexpected worker ID 172 (expected none) (remaining tries: 2)
Jul 29 09:55:30 openqa-aarch64 worker[3075]: [error] 400 response: Got status update for job 993618 with unexpected worker ID 172 (expected none) (remaining tries: 1)
Jul 29 09:55:35 openqa-aarch64 worker[3075]: [warn] Can't open /var/lib/openqa/pool/7/testresults/result-bootloader_uefi.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file or dir>
Jul 29 09:55:35 openqa-aarch64 worker[3075]: [error] 400 response: Got status update for job 993618 with unexpected worker ID 172 (expected none) (remaining tries: 2)
Jul 29 09:55:35 openqa-aarch64 worker[3075]: [error] 400 response: Got status update for job 993618 with unexpected worker ID 172 (expected none) (remaining tries: 0)
Jul 29 09:55:35 openqa-aarch64 worker[3075]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20073/_DW7_Ro7OhzvBUS1/broadcast
Jul 29 09:55:40 openqa-aarch64 worker[3075]: [error] 400 response: Got status update for job 993618 with unexpected worker ID 172 (expected none) (remaining tries: 1)
Jul 29 09:55:45 openqa-aarch64 worker[3075]: [info] Unable to stop the command server gracefully:
Jul 29 09:55:45 openqa-aarch64 worker[3075]: [info] Command server likely not reachable at all
Jul 29 09:55:52 openqa-aarch64 worker[3075]: [info] Isotovideo exit status: 1
Jul 29 09:55:53 openqa-aarch64 worker[3075]: [info] +++ worker notes +++
Jul 29 09:55:53 openqa-aarch64 worker[3075]: [info] End time: 2019-07-29 07:55:53
Jul 29 09:55:53 openqa-aarch64 worker[3075]: [info] Result: api-failure
Jul 29 09:55:53 openqa-aarch64 worker[3075]: [error] 400 response: Got status update for job 993618 with unexpected worker ID 172 (expected none) (remaining tries: 0)
Jul 29 09:55:53 openqa-aarch64 worker[3075]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Jul 29 09:55:53 openqa-aarch64 worker[3075]: [info] Registering with openQA http://openqa1-opensuse
Jul 29 09:55:53 openqa-aarch64 worker[3075]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/172
Jul 29 09:55:54 openqa-aarch64 worker[3075]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 172
Jul 29 09:56:04 openqa-aarch64 worker[3075]: [info] +++ setup notes +++
Jul 29 09:56:04 openqa-aarch64 worker[3075]: [info] Start time: 2019-07-29 07:56:04
Jul 29 09:56:04 openqa-aarch64 worker[3075]: [info] Running on openqa-aarch64:7 (Linux 4.12.14-lp151.28.10-default #1 SMP Sat Jul 13 17:59:31 UTC 2019 (0ab03b7) aarch64)
Jul 29 09:56:09 openqa-aarch64 worker[3075]: [info] Preparing cgroups to start isotovideo
Jul 29 09:56:09 openqa-aarch64 worker[3075]: Use of uninitialized value in subroutine entry at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 97.
Jul 29 09:56:09 openqa-aarch64 worker[3075]: [info] Starting isotovideo container
Jul 29 09:56:09 openqa-aarch64 worker[3075]: [info] isotovideo has been started (PID: 5349)
Jul 29 09:56:10 openqa-aarch64 worker[3075]: [info] 5349: WORKING 993629

Here no websocket disconnect is logged. However, the web UI thinks at some point that the worker is not registered anymore and not executing the job anymore. Hence it blocks the worker's API calls and tries to assign a different job to it. That the worker re-registers after an API failure is quite helpful in this case because it allows the worker to continue with the next job successfully (rather than being stuck). That was also one of the motivations behind the re-registration. Not sure why the web UI behaved so strange here. Maybe it considered the worker dead because it was too unresponsive.

Actions #3

Updated by okurz over 4 years ago

I think this problem also happens when we upgrade the openQA package and therefore restart services, e.g. the websockets service. Restarting the webUI, e.g. after a package update of openQA itself, terminates jobs from workers which were assigned to the "old" instance and are retriggered as new ones. An example is the job https://openqa.opensuse.org/tests/994980#downloads which does not even have the worker or autoinst-log.txt so without the knowledge that the webUI restarted the reader of the job would not know at all what is going on. Wasn't that different some time ago? Let's see if the PR helps in this regard.

Actions #4

Updated by mkittler over 4 years ago

  • Status changed from In Progress to Feedback

PRs: https://github.com/os-autoinst/openQA/pull/2227, https://github.com/os-autoinst/openQA/pull/2231

The PRs have been merged. I'll keep this on feedback to see whether it helps in practice. But at least locally it seems to work. So when restarting the service the job is continued. Of course if the REST API is not reachable as well it would still be stopped and when the web UI sees the worker the next time it would mark the job still as incomplete. There shouldn't be jobs remaining in perpetual running state.

Actions #5

Updated by okurz over 4 years ago

  • Category changed from Feature requests to Regressions/Crashes
  • Status changed from Feedback to Resolved
[31/07/2019 09:39:24] <okurz> Martchus_: let's see if the websockets disconnects are now handled more gracefully. if I update the openQA webui with the new packages and the workers already have it since today, should it incomplete as many jobs as in before?
[31/07/2019 09:41:49] <okurz> to answer myself, seems like it :) I updated the webui and it seems not a single running job had a problem with it :)
[31/07/2019 09:42:14] <guillaume_g> okurz: good job! :)

seems you did good ;) With this I recon you got the feedback you were looking for and we can resolve the issue. Also if you don't mind I set it to "Concrete Bugs" as we learned it was a regression in behaviour due to the worker rework.

Actions #6

Updated by mkittler over 4 years ago

  • Target version changed from Current Sprint to Done
Actions

Also available in: Atom PDF