action #47060

[worker service][scheduling] openqaworker2:21 ~ openqaworker2:24 stops getting new jobs for over 1 day.

Added by xlai about 1 year ago. Updated 7 months ago.

Status: Resolved
Start date: 03/02/2019
Priority: High
Due date:
Assignee: mkittler
% Done: 0%
Category: Concrete Bugs
Target version: Current Sprint
Difficulty:
Duration:

Description

I checked these four workers in https://openqa.nue.suse.com/admin/workers; all show as working, but after clicking into the jobs these workers are assigned to, none of the jobs is actually running: they are all either cancelled or finished.

Please help to recover these 4 workers first. It would also be good to find the root cause so that such issues can be avoided in the future. Thanks for the help!


Related issues

Related to openQA Project - action #47117: EPIC: Fix worker->websocket->scheduler->webui connection New 04/02/2019
Related to openQA Project - action #47087: [scheduling] Workers on openqaworker2 stuck frequently Resolved 04/02/2019

History

#1 Updated by xlai about 1 year ago

  • Assignee set to nicksinger

@nicksinger, if you want someone else to take this, please reassign. I will be on holiday from next Monday; sorry for assigning it to you directly, but this is an urgent request and needs a quick response.

#2 Updated by coolo about 1 year ago

  • Related to action #47117: EPIC: Fix worker->websocket->scheduler->webui connection added

#3 Updated by coolo about 1 year ago

  • Category set to 122
  • Assignee deleted (nicksinger)
  • Target version set to Ready

There isn't much nick can do - restarting the workers only helps for a short time ;(

#4 Updated by mkittler about 1 year ago

  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint

trying to investigate

#5 Updated by mkittler about 1 year ago

The workers have already been restarted. According to the log of openqaworker2:21, it started to hang on Feb 01 21:31:14 and was restarted on Feb 04 10:15:56. Since then it has been running fine. The log shows that the worker got stuck when starting the next job. Unfortunately, the logs contain no useful information about why it couldn't execute that job and just hung.

For openqaworker2:22 and openqaworker2:24 it looks the same. openqaworker2:23 actually still executed jobs on Feb 02 (but not Feb 03) and maybe didn't hang at all.

The jobs those workers hung on were:

Hard to tell what the exact problem was in these cases.

Since @coolo already added an epic for the bigger problem, of which this is just (yet another) symptom, I would actually like to close this ticket. At least I don't see anything in particular in the worker that could be fixed; we need to improve the overall architecture.

#6 Updated by mkittler about 1 year ago

  • Priority changed from Urgent to High

It seems those workers are still running, so I'm lowering the priority.

It seems that openqaworker2:27 is stuck now. I won't restart it yet so I can investigate tomorrow.

#7 Updated by mkittler about 1 year ago

Restarted openqaworker2:27. It seems it got stuck very early in OpenQA::Worker::Engines::isotovideo::engine_workit, maybe because the job was cancelled after the setup for isotovideo had been started but before isotovideo was actually running. Audit log:

11 days ago coolo   job_cancel  { "id": 2431736 }

As far as I remember, the worker log for the slots @21 to @24 (which the issue was initially about) looked similar.

#8 Updated by okurz about 1 year ago

  • Related to action #47087: [scheduling] Workers on openqaworker2 stuck frequently added

#9 Updated by mkittler about 1 year ago

Since the workers are likely hanging inside OpenQA::Worker::Engines::isotovideo::engine_workit or right after it, I'm investigating this function. It may be worth adding a bit more debug printing.

If somebody has better ideas feel free to comment.
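As an illustration of what such step-level debug printing buys you (a Python sketch, not the worker's actual Perl code; the step names and return values are made up), each setup step logs on entry and exit, so a hang shows up as a last "entering" line without a matching "finished" line:

```python
import logging
import time

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s [%(levelname)s] %(message)s")
log = logging.getLogger("engine_workit")


def traced(step_name):
    """Decorator logging entry/exit of a setup step. If the worker hangs
    inside a step, the journal ends with 'entering step: <name>' and no
    'finished step' line, pinpointing where it got stuck."""
    def wrap(func):
        def inner(*args, **kwargs):
            log.debug("entering step: %s", step_name)
            start = time.monotonic()
            result = func(*args, **kwargs)
            log.debug("finished step: %s (%.2fs)", step_name,
                      time.monotonic() - start)
            return result
        return inner
    return wrap


@traced("download assets")
def download_assets():
    return ["SLES-15-SP1.qcow2"]  # hypothetical asset list


@traced("start isotovideo")
def start_isotovideo():
    return 4854  # hypothetical child PID


if __name__ == "__main__":
    download_assets()
    start_isotovideo()
```

The same idea applied to engine_workit's individual steps would narrow the hang down to one concrete operation instead of "somewhere in the function".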

#10 Updated by okurz about 1 year ago

I again restarted a bunch of workers myself on grenache-1.qa, as there is still a long backlog of jobs normally done by this machine. Could it be that these workers are more likely to fail because they try to register against *two* web UIs?:

Jan 31 07:46:20 grenache-1 worker[695309]: [info] registering worker grenache-1 version 13 with openQA openqa.suse.de using protocol version [1]
Jan 31 07:46:20 grenache-1 worker[695309]: [info] registering worker grenache-1 version 13 with openQA openqa.chell.qa.suse.de using protocol version [1]
Feb 15 14:49:48 grenache-1 worker[695309]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. Apache modules proxy_wstunnel and rewrite enabled?

#11 Updated by coolo about 1 year ago

Then this wouldn't happen on openqaworker2 as well.

#12 Updated by mkittler about 1 year ago

I have now deployed the new version with more debug printing on grenache-1 and openqaworker2. If you see a stuck worker on these instances, please let me know.

#13 Updated by mkittler about 1 year ago

While looking for stuck workers I found none stuck during startup, but instead lots of jobs stuck in uploading. I'll just add the information here instead of creating yet another "stuck worker" ticket:

-- Logs begin at Do 2018-12-06 12:12:40 CET, end at Di 2019-02-19 17:55:29 CET. --
Feb 19 01:06:34 openqaworker5 worker[4721]: [error] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Feb 19 01:06:34 openqaworker5 worker[4721]: [error] 400 response: Bad Request (remaining tries: 0)
Feb 19 01:06:29 openqaworker5 worker[4721]: [error] 400 response: Bad Request (remaining tries: 1)
Feb 19 01:06:24 openqaworker5 worker[4721]: [error] 400 response: Bad Request (remaining tries: 2)
Feb 19 01:06:24 openqaworker5 worker[4721]: Use of uninitialized value in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 987. -- $ret->{$extra_test->{name}} = $extra_result;
Feb 19 01:06:24 openqaworker5 worker[4721]: Use of uninitialized value $test in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 932. -- my $result = read_json_file("result-$test.json");
Feb 19 01:06:24 openqaworker5 worker[4721]: [info] uploading worker-log.txt
Feb 19 01:06:24 openqaworker5 worker[4721]: [info] uploading autoinst-log.txt
Feb 19 01:06:24 openqaworker5 worker[4721]: [info] uploading serial0.txt
Feb 19 01:06:24 openqaworker5 worker[4721]: [info] uploading vars.json
Feb 19 01:06:24 openqaworker5 worker[4721]: [info] uploading video_time.vtt
Feb 19 01:06:23 openqaworker5 worker[4721]: [info] uploading video.ogv
Feb 19 01:06:23 openqaworker5 worker[4721]: [info] uploading test-status.log
Feb 19 01:06:23 openqaworker5 worker[4721]: [info] uploading log-generic.tar.xz
Feb 19 01:06:23 openqaworker5 worker[4721]: [info] uploading generate_report-system-status.log
Feb 19 01:06:23 openqaworker5 worker[4721]: [info] uploading generate_report-output.xml
Feb 19 01:06:23 openqaworker5 worker[4721]: [info] uploading generate_report-journalctl.log
Feb 19 01:06:23 openqaworker5 worker[4721]: [info] uploading generate_report-dmesg.log
Feb 19 01:06:23 openqaworker5 worker[4721]: [info] result: done
Feb 19 01:06:23 openqaworker5 worker[4721]: [info] end time: 2019-02-19 00:06:23
Feb 19 01:06:23 openqaworker5 worker[4721]: [info] +++ worker notes +++
Feb 19 01:06:23 openqaworker5 worker[4721]: [info] Isotovideo exit status: 0
Feb 19 00:20:13 openqaworker5 qemu-system-x86_64[4910]: looking for plugins in '/usr/lib64/sasl2', failed to open directory, error: No such file or directory
Feb 19 00:20:10 openqaworker5 worker[4721]: [info] 4854: WORKING 2472073
Feb 19 00:20:10 openqaworker5 worker[4721]: Use of uninitialized value $exit in numeric eq (==) at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 246.
Feb 19 00:20:10 openqaworker5 worker[4721]: [info] rsync:
Feb 19 00:20:10 openqaworker5 worker[4721]: Use of uninitialized value in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 245.
Feb 19 00:20:05 openqaworker5 worker[4721]: [error] Can't download SLES-15-SP1-x86_64-169.1@64bit-smp-minimal_with_sdk169.1_installed_withhome-uefi-vars.qcow2
Feb 19 00:19:08 openqaworker5 worker[4721]: [info] running on openqaworker5:3 (Linux 4.4.172-86-default #1 SMP Fri Feb 1 06:14:18 UTC 2019 (62e27f4) x86_64)
Feb 19 00:19:08 openqaworker5 worker[4721]: [info] start time: 2019-02-18 23:19:08
Feb 19 00:19:08 openqaworker5 worker[4721]: [info] +++ setup notes +++
Feb 19 00:19:08 openqaworker5 worker[4721]: GLOB(0x9ae3300)[info] got job 2472073: 02472073-sle-15-SP1-Installer-DVD-x86_64-Build169.1-xfstests_btrfs-generic-101-200@64bit-smp

I restarted all affected worker services (except one which I've left for investigation). A lot of different worker hosts were affected (3, 5, 6, 7, 8, 9).
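The pattern above can be detected mechanically. A hedged sketch (Python, hypothetical helper names, not part of openQA tooling): parse journal lines shaped like the excerpt and flag worker processes whose most recent activity is an upload message older than some cutoff.

```python
from datetime import datetime


def last_action(journal_lines, year=2019):
    """Return {pid: (timestamp, message)} for each worker process's most
    recent line. Assumes chronological order and lines shaped like
    'Feb 19 01:06:24 host worker[4721]: [info] uploading vars.json'."""
    latest = {}
    for line in journal_lines:
        if "worker[" not in line:
            continue  # skip qemu and other non-worker lines
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        pid = line.split("worker[", 1)[1].split("]", 1)[0]
        msg = line.split("]: ", 1)[1]
        latest[pid] = (stamp, msg)
    return latest


def stuck_uploading(journal_lines, now, max_idle_minutes=60):
    """PIDs whose last recorded activity is an upload line older than
    the cutoff -- candidates for being stuck in uploading."""
    return [
        pid
        for pid, (stamp, msg) in last_action(journal_lines).items()
        if "uploading" in msg
        and (now - stamp).total_seconds() > max_idle_minutes * 60
    ]
```

Running something like this periodically against `journalctl -u openqa-worker@*` output would surface stuck slots without manually clicking through the admin page.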

#14 Updated by xlai about 1 year ago

mkittler wrote:

While looking for stuck workers I found none stuck during startup, but instead lots of jobs stuck in uploading. I'll just add the information here instead of creating yet another "stuck worker" ticket:



@mkittler,
A reminder that there was a network outage from Feb 16 until Feb 19. I am not sure whether it is related to the upload issue; I'm just providing the info in case it would otherwise send the investigation in the wrong direction.

#15 Updated by mkittler about 1 year ago

It might be related. However, I could also observe the problem an hour ago on a colleague's remote worker. Besides, it is not a timeout, just a 400 response. By the way, AppArmor was disabled on that remote worker, so AppArmor is likely not the problem.

With some more information from the log with debug output enabled I think the following is going on:

  1. The worker fails to do the (final) status update marking the job as done. Maybe the worker's request is invalid, or the web UI fails to handle it.
  2. After 3 attempts the worker gives up, but it seems to correctly reset its own state to 'no job being executed' again.
  3. The web UI still considers the worker as working. Cancelling the job via the web UI has no effect and only produces "stop_job called after job has already been asked to stop (reason: api-failure)" in the worker log.
  4. Only restarting the worker convinces the web UI that the worker is free again and incompletes the job.
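The four steps can be modelled in a few lines to make the resulting state divergence explicit (a toy Python sketch with made-up class names, not openQA code):

```python
class WebUI:
    """Toy stand-in for the web UI REST endpoint; it rejects every status
    update with 400, as in the logs above."""
    def __init__(self):
        self.worker_state = "working"  # what the web UI believes

    def post_status(self, payload):
        return 400  # Bad Request, e.g. due to an invalid request body


class Worker:
    def __init__(self, webui, max_tries=3):
        self.webui = webui
        self.max_tries = max_tries
        self.job = "job-1305"

    def finalize_job(self):
        """Try to mark the job done; after max_tries failures give up and
        reset only the local state -- the web UI is never told."""
        for remaining in reversed(range(self.max_tries)):
            if self.webui.post_status({"status": "done"}) == 200:
                self.webui.worker_state = "free"
                break
            print(f"[error] 400 response: Bad Request (remaining tries: {remaining})")
        else:
            self.job = None  # worker considers itself free again


webui = WebUI()
worker = Worker(webui)
worker.finalize_job()
# The two sides now disagree: the worker is idle, but the web UI still
# believes it is working, so it never assigns a new job to this slot.
assert worker.job is None and webui.worker_state == "working"
```

The assertion at the end is exactly the deadlock described in steps 3 and 4: neither side ever updates the other until the worker service is restarted.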

Log with debug verbosity:


Feb 20 14:55:05 zluo-vm worker[14687]: [debug] [pid:14687] updating status
Feb 20 14:55:05 zluo-vm worker[14687]: [debug] [pid:14687] POST http://f40.suse.de/api/v1/jobs/1305/status
Feb 20 14:55:05 zluo-vm worker[14687]: [debug] [pid:14687] updating status
Feb 20 14:55:05 zluo-vm worker[14687]: [debug] [pid:14687] POST http://f40.suse.de/api/v1/jobs/1305/status
Feb 20 14:55:06 zluo-vm worker[14687]: [debug] [pid:14687] updating status
Feb 20 14:55:06 zluo-vm worker[14687]: [debug] [pid:14687] POST http://f40.suse.de/api/v1/jobs/1305/status
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] Isotovideo exit status: 0
Feb 20 14:55:06 zluo-vm worker[14687]: [debug] [pid:14687] stop_job done
Feb 20 14:55:06 zluo-vm worker[14687]: [debug] [pid:14687] ## removing timer update_status
Feb 20 14:55:06 zluo-vm worker[14687]: [debug] [pid:14687] ## removing timer job_timeout
Feb 20 14:55:06 zluo-vm worker[14687]: [debug] [pid:14687] POST http://f40.suse.de/api/v1/jobs/1305/status
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] +++ worker notes +++
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] end time: 2019-02-20 13:55:06
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] result: done
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] uploading execute_test_run-dmesg.log
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] uploading execute_test_run-journalctl.log
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] uploading execute_test_run-output.xml
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] uploading execute_test_run-system-status.log
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] uploading execute_test_run-testlog.tar.gz
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] uploading video.ogv
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] uploading video_time.vtt
Feb 20 14:55:06 zluo-vm worker[14687]: [info] [pid:14687] uploading vars.json
Feb 20 14:55:07 zluo-vm worker[14687]: [info] [pid:14687] uploading serial0.txt
Feb 20 14:55:07 zluo-vm worker[14687]: [info] [pid:14687] uploading autoinst-log.txt
Feb 20 14:55:07 zluo-vm worker[14687]: [info] [pid:14687] uploading worker-log.txt
Feb 20 14:55:07 zluo-vm worker[14687]: [debug] [pid:14687] setting job 1305 to done
Feb 20 14:55:07 zluo-vm worker[14687]: Use of uninitialized value $test in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 932.
Feb 20 14:55:07 zluo-vm worker[14687]: Use of uninitialized value in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 987.
Feb 20 14:55:07 zluo-vm worker[14687]: [debug] [pid:14687] POST http://f40.suse.de/api/v1/jobs/1305/status
Feb 20 14:55:07 zluo-vm worker[14687]: [error] [pid:14687] 400 response: Bad Request (remaining tries: 2)
Feb 20 14:55:07 zluo-vm worker[14687]: [debug] [pid:14687] ## adding timer 5
Feb 20 14:55:12 zluo-vm worker[14687]: [error] [pid:14687] 400 response: Bad Request (remaining tries: 1)
Feb 20 14:55:12 zluo-vm worker[14687]: [debug] [pid:14687] ## adding timer 5
Feb 20 14:55:17 zluo-vm worker[14687]: [error] [pid:14687] 400 response: Bad Request (remaining tries: 0)
Feb 20 14:55:17 zluo-vm worker[14687]: [debug] [pid:14687] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 20 14:55:17 zluo-vm worker[14687]: [error] [pid:14687] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Feb 20 14:55:17 zluo-vm worker[14687]: [debug] [pid:14687] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 20 14:56:17 zluo-vm worker[14687]: [debug] [pid:14687] Sending worker status to http://f40.suse.de (workerstatus timer)
Feb 20 14:56:17 zluo-vm worker[14687]: [debug] [pid:14687] Population for http://f40.suse.de is 27
Feb 20 14:56:17 zluo-vm worker[14687]: [debug] [pid:14687] ## changing timer workerstatus-http://f40.suse.de
Feb 20 14:56:17 zluo-vm worker[14687]: [debug] [pid:14687] ## removing timer workerstatus-http://f40.suse.de
Feb 20 14:56:17 zluo-vm worker[14687]: [debug] [pid:14687] ## adding timer workerstatus-http://f40.suse.de 86.53
Feb 20 14:57:44 zluo-vm worker[14687]: [debug] [pid:14687] Sending worker status to http://f40.suse.de (workerstatus timer)
Feb 20 14:57:44 zluo-vm worker[14687]: [debug] [pid:14687] Population for http://f40.suse.de is 27
Feb 20 14:57:44 zluo-vm worker[14687]: [debug] [pid:14687] ## changing timer workerstatus-http://f40.suse.de
Feb 20 14:57:44 zluo-vm worker[14687]: [debug] [pid:14687] ## removing timer workerstatus-http://f40.suse.de
Feb 20 14:57:44 zluo-vm worker[14687]: [debug] [pid:14687] ## adding timer workerstatus-http://f40.suse.de 86.53

#16 Updated by okurz about 1 year ago

  • Blocks action #48152: [sle][functional][u] qa_userspace_apparmor-profiles uploading last over 4 hours added

#17 Updated by mkittler 12 months ago

The 2nd problem (job/worker stuck on upload), which was very apparent last Tuesday and Wednesday, doesn't seem to have occurred again. (I wasn't working on Thursday and Friday, but today OSD looks good.) By the way, I also cloned one of the affected jobs (the one mentioned in the blocked ticket) locally but couldn't reproduce the problem so far.

The originally reported problem (job/worker stuck on startup) didn't reoccur either. But debug printing for further investigation is now deployed on openqaworker2.

#18 Updated by mkittler 12 months ago

  • Status changed from New to In Progress

Again 2 jobs being stuck on upload (this time on openqaworker-arm-1): https://openqa.suse.de/tests/2525067, https://openqa.suse.de/tests/2525076

#19 Updated by mkittler 12 months ago

With the latest modifications to stop_job, the log for the upload problem now looks like this (instance 6):

Mär 07 13:41:29 openqaworker-arm-1 worker[36970]: [debug] [pid:36970] Population for openqa.suse.de is 352
Mär 07 13:41:29 openqaworker-arm-1 worker[36970]: [debug] [pid:36970] Sending worker status to openqa.suse.de (workerstatus timer)
Mär 07 13:41:08 openqaworker-arm-1 worker[36970]: [debug] [pid:36970] stop_job called after job has already been asked to stop (reason: api-failure)
Mär 07 13:41:08 openqaworker-arm-1 worker[36970]: [error] [pid:36970] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Mär 07 13:41:08 openqaworker-arm-1 worker[36970]: [debug] [pid:36970] upload shutdown_ltp-52.txt
Mär 07 13:41:08 openqaworker-arm-1 worker[36970]: [debug] [pid:36970] upload shutdown_ltp-45.txt
Mär 07 13:41:08 openqaworker-arm-1 worker[36970]: [debug] [pid:36970] upload shutdown_ltp-44.txt
Mär 07 13:41:08 openqaworker-arm-1 worker[36970]: [debug] [pid:36970] upload shutdown_ltp-43.txt

and (instance 7):

Mär 07 13:41:24 openqaworker-arm-1 worker[36543]: [debug] [pid:36543] Population for openqa.suse.de is 352
Mär 07 13:41:24 openqaworker-arm-1 worker[36543]: [debug] [pid:36543] Sending worker status to openqa.suse.de (workerstatus timer)
Mär 07 13:41:06 openqaworker-arm-1 worker[36543]: [debug] [pid:36543] stop_job called after job has already been asked to stop (reason: api-failure)
Mär 07 13:41:06 openqaworker-arm-1 worker[36543]: [error] [pid:36543] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Mär 07 13:41:06 openqaworker-arm-1 worker[36543]: [debug] [pid:36543] upload shutdown_ltp-52.txt
Mär 07 13:41:06 openqaworker-arm-1 worker[36543]: [debug] [pid:36543] upload shutdown_ltp-45.txt
Mär 07 13:41:06 openqaworker-arm-1 worker[36543]: [debug] [pid:36543] upload shutdown_ltp-44.txt
Mär 07 13:41:06 openqaworker-arm-1 worker[36543]: [debug] [pid:36543] upload shutdown_ltp-43.txt

(Logs are in reverse order.)

Because the worker can't reach the web UI's REST API anymore, it aborts the job without further status updates. It seems the worker would actually be available again to take the next job. But the web UI still thinks the job is uploading and that the worker is working on that job, so it never attempts to assign a new job to the worker and the job stays in uploading forever. Only restarting the worker causes the web UI to mark the job as incomplete, likely because restarting disconnects the WebSocket connection, which is instantly treated as a failure.

There would be 2 ways to fix this:

  1. Implement a timeout for upload inactivity on the web UI side.
  2. Let the worker close and re-open the WebSocket connection when aborting a job after an API failure. I assume that would have the same effect as manually restarting the worker (the current workaround).

Which option would be the best? Maybe even both? In any case, I need to find a way to reproduce the problem locally.
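A sketch of option 1 under stated assumptions (Python, all names hypothetical; this is not the web UI's actual code): the web UI records the last time each uploading job sent any update, and a periodic sweep marks jobs incomplete once they exceed an inactivity timeout, which frees the worker slot.

```python
from datetime import datetime, timedelta


class JobTracker:
    """Toy model of option 1: an upload-inactivity timeout on the web UI
    side. Jobs in 'uploading' that stop sending updates are eventually
    marked incomplete instead of staying in uploading forever."""
    def __init__(self, timeout=timedelta(minutes=30)):
        self.timeout = timeout
        self.last_update = {}  # job_id -> datetime of last status/artefact update
        self.state = {}        # job_id -> 'uploading' | 'incomplete' | ...

    def record_update(self, job_id, now):
        self.last_update[job_id] = now
        self.state[job_id] = "uploading"

    def sweep(self, now):
        """Periodic check: incomplete any uploading job idle past the timeout."""
        for job_id, seen in self.last_update.items():
            if self.state[job_id] == "uploading" and now - seen > self.timeout:
                self.state[job_id] = "incomplete"  # frees the worker slot


t0 = datetime(2019, 3, 7, 13, 41)
tracker = JobTracker()
tracker.record_update(2525067, t0)
tracker.sweep(t0 + timedelta(hours=1))
assert tracker.state[2525067] == "incomplete"
```

The timeout value is the main design question: too short and slow uploads on loaded workers get killed, too long and stuck slots stay blocked for hours.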


By the way, I haven't observed the initially reported problem again so far.

#20 Updated by okurz 12 months ago

  • Blocks deleted (action #48152: [sle][functional][u] qa_userspace_apparmor-profiles uploading last over 4 hours)

#21 Updated by okurz 8 months ago

  • Subject changed from [worker service]openqaworker2:21 ~ openqaworker2:24 stops getting new jobs for over 1 day. to [worker service][scheduling] openqaworker2:21 ~ openqaworker2:24 stops getting new jobs for over 1 day.
  • Category changed from 122 to Concrete Bugs

#22 Updated by mkittler 7 months ago

I implemented option 2 as part of the worker restructuring, because I thought it would be a good idea to give the worker-web-UI communication a fresh start after an error occurs anyway. So jobs abandoned by the worker due to connection problems with the web UI should no longer stay in uploading forever.

related PR: https://github.com/os-autoinst/openQA/pull/2192
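The idea behind option 2, sketched in Python (the actual change is the Perl in the PR above; class names here are hypothetical): on an API failure the worker drops its WebSocket connection and registers afresh, mimicking what a manual service restart achieved.

```python
class WebSocketConnection:
    """Toy connection object; the real worker holds a WebSocket to the web UI.
    The web UI treats a disconnect as a failure and incompletes the job."""
    def __init__(self):
        self.open = True

    def close(self):
        self.open = False


class WorkerClient:
    """On an API failure the worker does not merely reset its internal
    state -- it also drops and re-establishes the connection, giving the
    worker-web-UI communication a fresh start."""
    def __init__(self):
        self.connection = WebSocketConnection()
        self.registrations = 1

    def on_api_failure(self):
        self.connection.close()  # web UI notices the disconnect and
                                 # marks the stuck job incomplete
        self.connection = WebSocketConnection()
        self.registrations += 1  # fresh registration, clean state


client = WorkerClient()
old = client.connection
client.on_api_failure()
assert not old.open and client.connection.open and client.registrations == 2
```

This keeps all the recovery logic on the worker side, so no web UI change is needed, at the cost of relying on the disconnect being detected promptly.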


But the initial problem stated in the ticket description was actually not the upload problem but the worker being stuck somewhere within OpenQA::Worker::Engines::isotovideo::engine_workit. That function wasn't changed much during the worker restructuring, but there is now a little more debug printing, so maybe we can find out more when it happens in production again. That leaves the question: do we still observe this issue in production? (Note that I was never able to reproduce it in my local environment.)

#23 Updated by okurz 7 months ago

mkittler wrote:

That leaves the question: Do we still observe this issue in production?

I guess you could try to answer that yourself, but it's probably too early after the latest deployment with the worker restructuring to say. For example, you could take a look at https://openqa.suse.de/tests/ to see whether there are any long-scheduled jobs that are not picked up; that would point to the related workers (those that match and should pick the jobs up) being potentially stuck. Interestingly enough, the schedule queue is currently even empty, which is a good sign :)
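That check could be scripted. A minimal sketch (Python; the job-list shape is hypothetical, the real data would come from the openQA API or database): filter for jobs that have been in state 'scheduled' longer than some threshold.

```python
from datetime import datetime, timedelta


def long_scheduled(jobs, now, max_wait=timedelta(hours=4)):
    """Return IDs of jobs still in state 'scheduled' for longer than
    max_wait. A cluster of these matching the same worker class would
    hint at stuck workers that should have picked them up."""
    return [
        job["id"]
        for job in jobs
        if job["state"] == "scheduled" and now - job["t_created"] > max_wait
    ]


now = datetime(2019, 9, 1, 12, 0)
jobs = [
    {"id": 1, "state": "scheduled", "t_created": now - timedelta(hours=6)},
    {"id": 2, "state": "scheduled", "t_created": now - timedelta(minutes=10)},
    {"id": 3, "state": "running",   "t_created": now - timedelta(hours=8)},
]
assert long_scheduled(jobs, now) == [1]
```

An empty result, matching the currently empty schedule queue, would support resolving the ticket.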

#24 Updated by xlai 7 months ago

okurz wrote:

mkittler wrote:

That leaves the question: Do we still observe this issue in production?



One update from my side: we have not seen such an issue happening, at least in our job group, for quite some time. So feel free to change this ticket's status as appropriate.

#25 Updated by mkittler 7 months ago

@okurz I'm also using some SQL queries, e.g.:

select lazy_workers.*,
       jobs.TEST as last_job_name,
       jobs.t_finished as last_job_finished,
       jobs.state as last_job_state,
       jobs.result as last_job_result
from (
    select host, instance, max(jobs.id) as last_job_id
    from workers
    join jobs on workers.id = jobs.assigned_worker_id
    group by host, instance
    order by last_job_id
    limit 200
) as lazy_workers
join jobs on jobs.id = lazy_workers.last_job_id;

I was just asking because somebody might already know something.

#26 Updated by mkittler 7 months ago

  • Status changed from In Progress to Resolved

So feel free to change this ticket status as it should be.

I guess it is enough to keep one of those "worker stuck" tickets open, so I'm closing this one now.
