Project

General

Profile

Actions

action #115106

closed

Cancelled jobs can end up being stuck associated with worker

Added by mkittler almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-08-08
Due date:
% Done:

0%

Estimated time:

Description

Observation

On o3 there were three jobs cancelled with reason "cancelled based on job settings" and no result "obsoleted" (the result is now actually shown in the web UI). The according worker slots still had these jobs associated after 12 days so these worker slots were basically blocked/stuck. I suppose it is correct that these jobs ended up cancelled but the associated worker slots should not be blocked/stuck.

Jobs: https://openqa.opensuse.org/tests/2483081, https://openqa.opensuse.org/tests/2496053, https://openqa.opensuse.org/tests/2492150

NOTE: I've created #116078 for the concrete problems on power8. I'll keep that ticket around for determining whether we can still do anything better in openQA (see suggestions below). However, it seems unlikely considering the worker slots were actually stuck. So maybe we also want to decide closing this generic ticket and only focus on recovering power8 and debugging its strange behavior.

The worker log for one of the affected slots (on power8):

Jul 26 16:32:05 power8 systemd[1]: Started openQA Worker #5.
Jul 26 16:32:06 power8 worker[158130]: [info] worker 5:
Jul 26 16:32:06 power8 worker[158130]:  - config file:           /etc/openqa/workers.ini
Jul 26 16:32:06 power8 worker[158130]:  - worker hostname:       power8
Jul 26 16:32:06 power8 worker[158130]:  - isotovideo version:    26
Jul 26 16:32:06 power8 worker[158130]:  - websocket API version: 1
Jul 26 16:32:06 power8 worker[158130]:  - web UI hosts:          http://openqa1-opensuse
Jul 26 16:32:06 power8 worker[158130]:  - class:                 qemu_ppc64le,qemu_ppc64,qemu_ppc,heavyload
Jul 26 16:32:06 power8 worker[158130]:  - no cleanup:            no
Jul 26 16:32:06 power8 worker[158130]:  - pool directory:        /var/lib/openqa/pool/5
Jul 26 16:32:06 power8 worker[158130]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
Jul 26 16:32:06 power8 worker[158130]: [info] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share
Jul 26 16:32:06 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 26 16:32:06 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 26 16:32:06 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 26 16:32:25 power8 worker[158130]: [info] +++ setup notes +++
Jul 26 16:32:25 power8 worker[158130]: [info] Running on power8:5 (Linux 5.14.21-150400.24.11-default #1 SMP Sun Jul 17 20:46:33 UTC 2022 (be260ca) ppc64le)
Jul 26 16:32:25 power8 worker[158130]: [info] Downloading openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso.sha256, request #207848 sent to Cache Service
Jul 26 16:32:35 power8 worker[158130]: [info] Download of openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso.sha256 processed:
Jul 26 16:32:35 power8 worker[158130]: [info] [#207848] Cache size of "/var/lib/openqa/cache" is 599 GiB, with limit 600 GiB
Jul 26 16:32:35 power8 worker[158130]: [info] [#207848] Downloading "openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso.sha256" from "http://openqa1-opensuse/tests/2483081/asset/other/openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso.sha256"
Jul 26 16:32:35 power8 worker[158130]: [info] [#207848] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso.sha256" has not changed, updating last use
Jul 26 16:32:35 power8 worker[158130]: [info] Downloading opensuse-Tumbleweed-ppc64le-20220725-DVD@ppc64le-with-ltp.qcow2, request #207849 sent to Cache Service
Jul 26 16:32:46 power8 worker[158130]: [info] Download of opensuse-Tumbleweed-ppc64le-20220725-DVD@ppc64le-with-ltp.qcow2 processed:
Jul 26 16:32:46 power8 worker[158130]: [info] [#207849] Cache size of "/var/lib/openqa/cache" is 599 GiB, with limit 600 GiB
Jul 26 16:32:46 power8 worker[158130]: [info] [#207849] Downloading "opensuse-Tumbleweed-ppc64le-20220725-DVD@ppc64le-with-ltp.qcow2" from "http://openqa1-opensuse/tests/2483081/asset/hdd/opensuse-Tumbleweed-ppc64le-20220725-DVD@ppc64le-with-ltp.qcow2"
Jul 26 16:32:46 power8 worker[158130]: [info] [#207849] Content of "/var/lib/openqa/cache/openqa1-opensuse/opensuse-Tumbleweed-ppc64le-20220725-DVD@ppc64le-with-ltp.qcow2" has not changed, updating last use
Jul 26 16:32:46 power8 worker[158130]: [info] Downloading openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso, request #207850 sent to Cache Service
Jul 26 16:32:51 power8 worker[158130]: [info] Download of openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso processed:
Jul 26 16:32:51 power8 worker[158130]: [info] [#207850] Cache size of "/var/lib/openqa/cache" is 599 GiB, with limit 600 GiB
Jul 26 16:32:51 power8 worker[158130]: [info] [#207850] Downloading "openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso" from "http://openqa1-opensuse/tests/2483081/asset/iso/openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso"
Jul 26 16:32:51 power8 worker[158130]: [info] [#207850] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso" has not changed, updating last use
Jul 26 16:32:51 power8 worker[158130]: [info] Rsync from 'rsync://openqa1-opensuse/tests' to '/var/lib/openqa/cache/openqa1-opensuse', request #207851 sent to Cache Service
Jul 26 16:32:56 power8 worker[158130]: [info] Output of rsync:
Jul 26 16:32:56 power8 worker[158130]: [info] [#207851] Calling: rsync -avHP --timeout 1800 rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/
Jul 26 16:32:56 power8 worker[158130]: Try 1:
Jul 26 16:32:56 power8 worker[158130]: receiving incremental file list
Jul 26 16:32:56 power8 worker[158130]: sent 1,985 bytes  received 1,494,517 bytes  997,668.00 bytes/sec
Jul 26 16:32:56 power8 worker[158130]: total size is 7,044,120,358  speedup is 4,707.06
Jul 26 16:32:56 power8 worker[158130]: [info] Finished to rsync tests
Jul 26 16:32:56 power8 worker[158130]: [info] Preparing cgroup to start isotovideo
Jul 26 16:32:56 power8 worker[158130]: [info] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@5.service/2483081
Jul 26 16:32:56 power8 worker[158130]: [info] Starting isotovideo container
Jul 26 16:32:56 power8 worker[158130]: [info] isotovideo has been started (PID: 158393)
Jul 26 16:32:56 power8 worker[158393]: [info] 158393: WORKING 2483081
Jul 26 16:33:57 power8 worker[158130]: [info] Test schedule has changed, reloading test_order.json
Jul 26 16:34:07 power8 worker[158130]: [info] Test schedule has changed, reloading test_order.json
Jul 26 22:33:03 power8 worker[158130]: [info] +++ worker notes +++
Jul 26 22:33:03 power8 worker[158130]: [info] End time: 2022-07-26 22:33:03
Jul 26 22:33:03 power8 worker[158130]: [info] Result: timeout
Jul 26 22:33:03 power8 worker[24903]: [info] Uploading boot_ltp-config-5.18.11-1-default
Jul 26 22:33:03 power8 worker[24903]: [info] Uploading boot_ltp-framebuffer.dat.gz
Jul 26 22:33:03 power8 worker[24903]: [info] Uploading boot_ltp-kernel-pkg.txt
Jul 26 22:33:03 power8 worker[24903]: [info] Uploading boot_ltp-ver_linux_before.txt
Jul 26 22:33:03 power8 worker[24903]: [info] Uploading ltp_known_issues.yaml
Jul 27 14:18:23 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 27 14:18:33 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 27 14:18:33 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 27 14:18:34 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 28 03:51:00 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 28 03:51:10 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 28 03:51:10 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 28 03:51:11 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 29 12:51:23 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 29 12:51:33 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 29 12:51:33 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 29 12:51:34 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 29 19:40:40 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 29 19:40:50 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 29 19:40:50 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 29 19:40:51 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 30 10:38:29 power8 systemd[1]: Reloading openQA Worker #5...
Jul 30 10:38:29 power8 worker[158130]: [info] Received signal HUP
Jul 30 10:38:29 power8 systemd[1]: Reloaded openQA Worker #5.
Jul 30 10:53:25 power8 systemd[1]: Reloading openQA Worker #5...
Jul 30 10:53:25 power8 worker[158130]: [info] Received signal HUP
Jul 30 10:53:25 power8 systemd[1]: Reloaded openQA Worker #5.
Jul 30 11:05:29 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 30 11:05:40 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 30 11:05:40 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 30 11:05:40 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 30 11:58:58 power8 systemd[1]: Reloading openQA Worker #5...
Jul 30 11:58:58 power8 worker[158130]: [info] Received signal HUP
Jul 30 11:58:58 power8 systemd[1]: Reloaded openQA Worker #5.
Jul 30 12:11:59 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 30 12:12:09 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 30 12:12:10 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 30 12:12:11 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 01 08:05:41 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 01 08:05:51 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 01 08:05:52 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 01 08:05:52 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 01 08:21:20 power8 systemd[1]: Reloading openQA Worker #5...
Aug 01 08:21:20 power8 worker[158130]: [info] Received signal HUP
Aug 01 08:21:20 power8 systemd[1]: Reloaded openQA Worker #5.
Aug 01 13:16:41 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 01 13:16:51 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 01 13:16:51 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 01 13:16:53 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 01 13:37:18 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 01 13:37:28 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 01 13:37:29 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 01 13:37:30 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 01 15:04:22 power8 systemd[1]: Reloading openQA Worker #5...
Aug 01 15:04:22 power8 worker[158130]: [info] Received signal HUP
Aug 01 15:04:22 power8 systemd[1]: Reloaded openQA Worker #5.
Aug 02 01:15:45 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 02 01:15:55 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 02 01:15:56 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 02 01:15:56 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 02 12:45:53 power8 systemd[1]: Reloading openQA Worker #5...
Aug 02 12:45:53 power8 worker[158130]: [info] Received signal HUP
Aug 02 12:45:53 power8 systemd[1]: Reloaded openQA Worker #5.
Aug 03 11:11:36 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 03 11:11:46 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 03 11:11:47 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 03 11:11:48 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 03 13:25:30 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 03 13:25:40 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 03 13:25:40 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 03 13:25:41 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 03 20:04:34 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 03 20:04:44 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 03 20:04:44 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 03 20:04:45 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 04 04:10:24 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 04 04:10:34 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 04 04:10:35 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 04 04:10:35 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 04 04:25:39 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 04 04:25:49 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 04 04:25:49 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 04 04:25:50 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 04 14:18:00 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 04 14:18:10 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 04 14:18:11 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 04 14:18:11 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 04 15:39:54 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 04 15:40:04 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 04 15:40:05 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 04 15:40:05 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 06 12:05:40 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 06 12:05:50 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 06 12:05:51 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 06 12:05:51 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 08 14:42:25 power8 systemd[1]: Stopping openQA Worker #5...
Aug 08 14:42:25 power8 worker[158130]: [info] Received signal TERM
Aug 08 14:43:55 power8 systemd[1]: openqa-worker-auto-restart@5.service: State 'stop-sigterm' timed out. Killing.
Aug 08 14:43:55 power8 systemd[1]: openqa-worker-auto-restart@5.service: Killing process 158130 (worker) with signal SIGKILL.
Aug 08 14:43:55 power8 systemd[1]: openqa-worker-auto-restart@5.service: Killing process 24903 (worker) with signal SIGKILL.
Aug 08 14:43:55 power8 systemd[1]: openqa-worker-auto-restart@5.service: Main process exited, code=killed, status=9/KILL
Aug 08 14:43:55 power8 systemd[1]: openqa-worker-auto-restart@5.service: Failed with result 'timeout'.
Aug 08 14:43:55 power8 systemd[1]: Stopped openQA Worker #5.
Aug 08 14:43:55 power8 systemd[1]: Starting openQA Worker #5...
Aug 08 14:43:55 power8 systemd[1]: Started openQA Worker #5.
Aug 08 14:43:56 power8 worker[63906]: [info] worker 5:
Aug 08 14:43:56 power8 worker[63906]:  - config file:           /etc/openqa/workers.ini
Aug 08 14:43:56 power8 worker[63906]:  - worker hostname:       power8
Aug 08 14:43:56 power8 worker[63906]:  - isotovideo version:    26
Aug 08 14:43:56 power8 worker[63906]:  - websocket API version: 1
Aug 08 14:43:56 power8 worker[63906]:  - web UI hosts:          http://openqa1-opensuse
Aug 08 14:43:56 power8 worker[63906]:  - class:                 qemu_ppc64le,qemu_ppc64,qemu_ppc,heavyload
Aug 08 14:43:56 power8 worker[63906]:  - no cleanup:            no
Aug 08 14:43:56 power8 worker[63906]:  - pool directory:        /var/lib/openqa/pool/5
Aug 08 14:43:56 power8 worker[63906]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
Aug 08 14:43:56 power8 worker[63906]: [info] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share
Aug 08 14:43:56 power8 worker[63906]: [info] Registering with openQA http://openqa1-opensuse
Aug 08 14:44:00 power8 worker[63906]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 08 14:44:00 power8 worker[63906]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17

The last stopping/killing was triggered by me restarting the service. So apparently the worker was stuck and did not react to SIGTERM anymore.

Note that after restarting the services the jobs are now shown properly as "obsoleted" on the web UI and the worker slots are used again.

Suggestions

  • Find out why the workers got stuck; even if the actual problem is on the web UI side the worker service should react to SIGTERM.
  • Find out whether something went wrong on the web UI side.
    • That jobs were cancelled is probably correct behavior.
    • That the stale job detection did nothing about the jobs makes sense considering they were cancelled.
    • That the workers kept being associated with the jobs does not seem to be a problem considering the workers were actually stuck and restarting them actually freed up the relation.
    • So maybe nothing wrong on the web UI side?
Actions

Also available in: Atom PDF