Project

General

Profile

Actions

action #115106

closed

Cancelled jobs can end up being stuck associated with worker

Added by mkittler over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-08-08
Due date:
% Done:

0%

Estimated time:

Description

Observation

On o3 there were three jobs cancelled with reason "cancelled based on job settings" and no result "obsoleted" (the result is now actually shown in the web UI). The according worker slots still had these jobs associated after 12 days so these worker slots were basically blocked/stuck. I suppose it is correct that these jobs ended up cancelled but the associated worker slots should not be blocked/stuck.

Jobs: https://openqa.opensuse.org/tests/2483081, https://openqa.opensuse.org/tests/2496053, https://openqa.opensuse.org/tests/2492150

NOTE: I've created #116078 for the concrete problems on power8. I'll keep that ticket around for determining whether we can still do anything better in openQA (see suggestions below). However, it seems unlikely considering the worker slots were actually stuck. So maybe we also want to decide closing this generic ticket and only focus on recovering power8 and debugging its strange behavior.

The worker log for one of the affected slots (on power8):

Jul 26 16:32:05 power8 systemd[1]: Started openQA Worker #5.
Jul 26 16:32:06 power8 worker[158130]: [info] worker 5:
Jul 26 16:32:06 power8 worker[158130]:  - config file:           /etc/openqa/workers.ini
Jul 26 16:32:06 power8 worker[158130]:  - worker hostname:       power8
Jul 26 16:32:06 power8 worker[158130]:  - isotovideo version:    26
Jul 26 16:32:06 power8 worker[158130]:  - websocket API version: 1
Jul 26 16:32:06 power8 worker[158130]:  - web UI hosts:          http://openqa1-opensuse
Jul 26 16:32:06 power8 worker[158130]:  - class:                 qemu_ppc64le,qemu_ppc64,qemu_ppc,heavyload
Jul 26 16:32:06 power8 worker[158130]:  - no cleanup:            no
Jul 26 16:32:06 power8 worker[158130]:  - pool directory:        /var/lib/openqa/pool/5
Jul 26 16:32:06 power8 worker[158130]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
Jul 26 16:32:06 power8 worker[158130]: [info] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share
Jul 26 16:32:06 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 26 16:32:06 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 26 16:32:06 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 26 16:32:25 power8 worker[158130]: [info] +++ setup notes +++
Jul 26 16:32:25 power8 worker[158130]: [info] Running on power8:5 (Linux 5.14.21-150400.24.11-default #1 SMP Sun Jul 17 20:46:33 UTC 2022 (be260ca) ppc64le)
Jul 26 16:32:25 power8 worker[158130]: [info] Downloading openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso.sha256, request #207848 sent to Cache Service
Jul 26 16:32:35 power8 worker[158130]: [info] Download of openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso.sha256 processed:
Jul 26 16:32:35 power8 worker[158130]: [info] [#207848] Cache size of "/var/lib/openqa/cache" is 599 GiB, with limit 600 GiB
Jul 26 16:32:35 power8 worker[158130]: [info] [#207848] Downloading "openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso.sha256" from "http://openqa1-opensuse/tests/2483081/asset/other/openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso.sha256"
Jul 26 16:32:35 power8 worker[158130]: [info] [#207848] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso.sha256" has not changed, updating last use
Jul 26 16:32:35 power8 worker[158130]: [info] Downloading opensuse-Tumbleweed-ppc64le-20220725-DVD@ppc64le-with-ltp.qcow2, request #207849 sent to Cache Service
Jul 26 16:32:46 power8 worker[158130]: [info] Download of opensuse-Tumbleweed-ppc64le-20220725-DVD@ppc64le-with-ltp.qcow2 processed:
Jul 26 16:32:46 power8 worker[158130]: [info] [#207849] Cache size of "/var/lib/openqa/cache" is 599 GiB, with limit 600 GiB
Jul 26 16:32:46 power8 worker[158130]: [info] [#207849] Downloading "opensuse-Tumbleweed-ppc64le-20220725-DVD@ppc64le-with-ltp.qcow2" from "http://openqa1-opensuse/tests/2483081/asset/hdd/opensuse-Tumbleweed-ppc64le-20220725-DVD@ppc64le-with-ltp.qcow2"
Jul 26 16:32:46 power8 worker[158130]: [info] [#207849] Content of "/var/lib/openqa/cache/openqa1-opensuse/opensuse-Tumbleweed-ppc64le-20220725-DVD@ppc64le-with-ltp.qcow2" has not changed, updating last use
Jul 26 16:32:46 power8 worker[158130]: [info] Downloading openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso, request #207850 sent to Cache Service
Jul 26 16:32:51 power8 worker[158130]: [info] Download of openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso processed:
Jul 26 16:32:51 power8 worker[158130]: [info] [#207850] Cache size of "/var/lib/openqa/cache" is 599 GiB, with limit 600 GiB
Jul 26 16:32:51 power8 worker[158130]: [info] [#207850] Downloading "openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso" from "http://openqa1-opensuse/tests/2483081/asset/iso/openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso"
Jul 26 16:32:51 power8 worker[158130]: [info] [#207850] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-ppc64le-Snapshot20220725-Media.iso" has not changed, updating last use
Jul 26 16:32:51 power8 worker[158130]: [info] Rsync from 'rsync://openqa1-opensuse/tests' to '/var/lib/openqa/cache/openqa1-opensuse', request #207851 sent to Cache Service
Jul 26 16:32:56 power8 worker[158130]: [info] Output of rsync:
Jul 26 16:32:56 power8 worker[158130]: [info] [#207851] Calling: rsync -avHP --timeout 1800 rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/
Jul 26 16:32:56 power8 worker[158130]: Try 1:
Jul 26 16:32:56 power8 worker[158130]: receiving incremental file list
Jul 26 16:32:56 power8 worker[158130]: sent 1,985 bytes  received 1,494,517 bytes  997,668.00 bytes/sec
Jul 26 16:32:56 power8 worker[158130]: total size is 7,044,120,358  speedup is 4,707.06
Jul 26 16:32:56 power8 worker[158130]: [info] Finished to rsync tests
Jul 26 16:32:56 power8 worker[158130]: [info] Preparing cgroup to start isotovideo
Jul 26 16:32:56 power8 worker[158130]: [info] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@5.service/2483081
Jul 26 16:32:56 power8 worker[158130]: [info] Starting isotovideo container
Jul 26 16:32:56 power8 worker[158130]: [info] isotovideo has been started (PID: 158393)
Jul 26 16:32:56 power8 worker[158393]: [info] 158393: WORKING 2483081
Jul 26 16:33:57 power8 worker[158130]: [info] Test schedule has changed, reloading test_order.json
Jul 26 16:34:07 power8 worker[158130]: [info] Test schedule has changed, reloading test_order.json
Jul 26 22:33:03 power8 worker[158130]: [info] +++ worker notes +++
Jul 26 22:33:03 power8 worker[158130]: [info] End time: 2022-07-26 22:33:03
Jul 26 22:33:03 power8 worker[158130]: [info] Result: timeout
Jul 26 22:33:03 power8 worker[24903]: [info] Uploading boot_ltp-config-5.18.11-1-default
Jul 26 22:33:03 power8 worker[24903]: [info] Uploading boot_ltp-framebuffer.dat.gz
Jul 26 22:33:03 power8 worker[24903]: [info] Uploading boot_ltp-kernel-pkg.txt
Jul 26 22:33:03 power8 worker[24903]: [info] Uploading boot_ltp-ver_linux_before.txt
Jul 26 22:33:03 power8 worker[24903]: [info] Uploading ltp_known_issues.yaml
Jul 27 14:18:23 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 27 14:18:33 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 27 14:18:33 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 27 14:18:34 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 28 03:51:00 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 28 03:51:10 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 28 03:51:10 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 28 03:51:11 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 29 12:51:23 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 29 12:51:33 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 29 12:51:33 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 29 12:51:34 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 29 19:40:40 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 29 19:40:50 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 29 19:40:50 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 29 19:40:51 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 30 10:38:29 power8 systemd[1]: Reloading openQA Worker #5...
Jul 30 10:38:29 power8 worker[158130]: [info] Received signal HUP
Jul 30 10:38:29 power8 systemd[1]: Reloaded openQA Worker #5.
Jul 30 10:53:25 power8 systemd[1]: Reloading openQA Worker #5...
Jul 30 10:53:25 power8 worker[158130]: [info] Received signal HUP
Jul 30 10:53:25 power8 systemd[1]: Reloaded openQA Worker #5.
Jul 30 11:05:29 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 30 11:05:40 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 30 11:05:40 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 30 11:05:40 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Jul 30 11:58:58 power8 systemd[1]: Reloading openQA Worker #5...
Jul 30 11:58:58 power8 worker[158130]: [info] Received signal HUP
Jul 30 11:58:58 power8 systemd[1]: Reloaded openQA Worker #5.
Jul 30 12:11:59 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 30 12:12:09 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Jul 30 12:12:10 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Jul 30 12:12:11 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 01 08:05:41 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 01 08:05:51 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 01 08:05:52 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 01 08:05:52 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 01 08:21:20 power8 systemd[1]: Reloading openQA Worker #5...
Aug 01 08:21:20 power8 worker[158130]: [info] Received signal HUP
Aug 01 08:21:20 power8 systemd[1]: Reloaded openQA Worker #5.
Aug 01 13:16:41 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 01 13:16:51 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 01 13:16:51 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 01 13:16:53 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 01 13:37:18 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 01 13:37:28 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 01 13:37:29 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 01 13:37:30 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 01 15:04:22 power8 systemd[1]: Reloading openQA Worker #5...
Aug 01 15:04:22 power8 worker[158130]: [info] Received signal HUP
Aug 01 15:04:22 power8 systemd[1]: Reloaded openQA Worker #5.
Aug 02 01:15:45 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 02 01:15:55 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 02 01:15:56 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 02 01:15:56 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 02 12:45:53 power8 systemd[1]: Reloading openQA Worker #5...
Aug 02 12:45:53 power8 worker[158130]: [info] Received signal HUP
Aug 02 12:45:53 power8 systemd[1]: Reloaded openQA Worker #5.
Aug 03 11:11:36 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 03 11:11:46 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 03 11:11:47 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 03 11:11:48 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 03 13:25:30 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 03 13:25:40 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 03 13:25:40 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 03 13:25:41 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 03 20:04:34 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 03 20:04:44 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 03 20:04:44 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 03 20:04:45 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 04 04:10:24 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 04 04:10:34 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 04 04:10:35 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 04 04:10:35 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 04 04:25:39 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 04 04:25:49 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 04 04:25:49 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 04 04:25:50 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 04 14:18:00 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 04 14:18:10 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 04 14:18:11 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 04 14:18:11 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 04 15:39:54 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 04 15:40:04 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 04 15:40:05 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 04 15:40:05 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 06 12:05:40 power8 worker[158130]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/17 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 06 12:05:50 power8 worker[158130]: [info] Registering with openQA http://openqa1-opensuse
Aug 06 12:05:51 power8 worker[158130]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 06 12:05:51 power8 worker[158130]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17
Aug 08 14:42:25 power8 systemd[1]: Stopping openQA Worker #5...
Aug 08 14:42:25 power8 worker[158130]: [info] Received signal TERM
Aug 08 14:43:55 power8 systemd[1]: openqa-worker-auto-restart@5.service: State 'stop-sigterm' timed out. Killing.
Aug 08 14:43:55 power8 systemd[1]: openqa-worker-auto-restart@5.service: Killing process 158130 (worker) with signal SIGKILL.
Aug 08 14:43:55 power8 systemd[1]: openqa-worker-auto-restart@5.service: Killing process 24903 (worker) with signal SIGKILL.
Aug 08 14:43:55 power8 systemd[1]: openqa-worker-auto-restart@5.service: Main process exited, code=killed, status=9/KILL
Aug 08 14:43:55 power8 systemd[1]: openqa-worker-auto-restart@5.service: Failed with result 'timeout'.
Aug 08 14:43:55 power8 systemd[1]: Stopped openQA Worker #5.
Aug 08 14:43:55 power8 systemd[1]: Starting openQA Worker #5...
Aug 08 14:43:55 power8 systemd[1]: Started openQA Worker #5.
Aug 08 14:43:56 power8 worker[63906]: [info] worker 5:
Aug 08 14:43:56 power8 worker[63906]:  - config file:           /etc/openqa/workers.ini
Aug 08 14:43:56 power8 worker[63906]:  - worker hostname:       power8
Aug 08 14:43:56 power8 worker[63906]:  - isotovideo version:    26
Aug 08 14:43:56 power8 worker[63906]:  - websocket API version: 1
Aug 08 14:43:56 power8 worker[63906]:  - web UI hosts:          http://openqa1-opensuse
Aug 08 14:43:56 power8 worker[63906]:  - class:                 qemu_ppc64le,qemu_ppc64,qemu_ppc,heavyload
Aug 08 14:43:56 power8 worker[63906]:  - no cleanup:            no
Aug 08 14:43:56 power8 worker[63906]:  - pool directory:        /var/lib/openqa/pool/5
Aug 08 14:43:56 power8 worker[63906]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
Aug 08 14:43:56 power8 worker[63906]: [info] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share
Aug 08 14:43:56 power8 worker[63906]: [info] Registering with openQA http://openqa1-opensuse
Aug 08 14:44:00 power8 worker[63906]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/17
Aug 08 14:44:00 power8 worker[63906]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 17

The last stopping/killing was triggered by me restarting the service. So apparently the worker was stuck and did not react to SIGTERM anymore.

Note that after restarting the services the jobs are now shown properly as "obsoleted" on the web UI and the worker slots are used again.

Suggestions

  • Find out why the workers got stuck; even if the actual problem is on the web UI side the worker service should react to SIGTERM.
  • Find out whether something went wrong on the web UI side.
    • That jobs were cancelled is probably correct behavior.
    • That the stale job detection did nothing about the jobs makes sense considering they were cancelled.
    • That the workers kept being associated with the jobs does not seem to be a problem considering the workers were actually stuck and restarting them actually freed up the relation.
    • So maybe nothing wrong on the web UI side?
Actions #1

Updated by mkittler over 2 years ago

The newly assigned jobs are stuck in assigned state. It looks like the newly restarted worker slots are already stuck again. The log looks like the one from the ticket description.

Multiple similar kernel call trace have been logged, e.g.:

[1335924.570973] Kernel attempted to read user page (100000016) - exploit attempt? (uid: 0)
[1335924.571020] BUG: Unable to handle kernel data access on read at 0x100000016
[1335924.571037] Faulting instruction address: 0xc0000000005561dc
[1335924.571054] Oops: Kernel access of bad area, sig: 11 [#21]
[1335924.571069] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
[1335924.571089] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache netfs kvm_hv kvm af_packet rfkill crct10dif_vpmsum ipmi_powernv(X) ipmi_devintf ipmi_msghandler uio_pdrv_genirq leds_powernv(X) rtc_opal(X) powernv_op_panel(N) i2c_opal(X) uio tg3 ses enclosure scsi_transport_sas libphy drm drm_panel_orientation_quirks configfs backlight ip_tables x_tables sr_mod cdrom sd_mod t10_pi xhci_pci xhci_pci_renesas xhci_hcd ipr(X) vmx_crypto gf128mul usbcore libata usb_common ext4 crc32c_vpmsum crc16 mbcache jbd2 fuse binfmt_misc dm_snapshot dm_bufio loop sg br_netfilter bridge stp llc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[1335924.571195] Supported: No, Unsupported modules are loaded
[1335924.571212] CPU: 32 PID: 63061 Comm: Zypp-main Tainted: G      D W      X  N 5.14.21-150400.24.11-default #1 SLE15-SP4 5031505b0a65e234cdf253965338bef90a38442d
[1335924.571232] NIP:  c0000000005561dc LR: c00000000053c9a8 CTR: 0000000000000001
[1335924.571248] REGS: c000000024223730 TRAP: 0300   Tainted: G      D W      X  N  (5.14.21-150400.24.11-default)
[1335924.571265] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44442222  XER: 00000000
[1335924.571288] CFAR: c0000000005561cc DAR: 0000000100000016 DSISR: 40000000 IRQMASK: 0 
                 GPR00: c00000000053c9a8 c0000000242239d0 c000000002833a00 c00000147a580fc0 
                 GPR04: c000000024223be8 c000000024223ae4 0000000000000000 0000000000000000 
                 GPR08: c000000024a9a047 00000000020c0008 c008000000000000 ffffff0000000000 
                 GPR12: 0000000000002200 c0000007fffdc080 00007fffc2632c80 00007fffc2632c60 
                 GPR16: 00007fffc2632c30 00007fffc2632c50 c000000024a9a020 00007fffc2632c70 
                 GPR20: 0000000000000000 c000000024223be8 0000000000000043 c000000024223af8 
                 GPR24: ffffffffffffffff 0000000000000000 c000000024223be8 0000000000000000 
                 GPR28: c00000147a580fc0 c00000147a580fc0 00000000830003f4 00000000fffffffe 
[1335924.571355] NIP [c0000000005561dc] __d_lookup+0x8c/0x290
[1335924.571375] LR [c00000000053c9a8] lookup_fast+0x108/0x240
[1335924.571392] Call Trace:
[1335924.571406] [c0000000242239d0] [c000000024223a00] 0xc000000024223a00 (unreliable)
[1335924.571424] [c000000024223a40] [c00000000053c9a8] lookup_fast+0x108/0x240
[1335924.571442] [c000000024223aa0] [c00000000054267c] path_openat+0x25c/0x1330
[1335924.571459] [c000000024223ba0] [c0000000005456e4] do_filp_open+0xa4/0x130
[1335924.571477] [c000000024223ce0] [c000000000524418] do_sys_openat2+0x2e8/0x440
[1335924.571496] [c000000024223d50] [c000000000526198] do_sys_open+0x78/0xc0
[1335924.571512] [c000000024223db0] [c00000000003269c] system_call_exception+0x15c/0x330
[1335924.571531] [c000000024223e10] [c00000000000c74c] system_call_common+0xec/0x250
[1335924.571550] --- interrupt: c00 at 0x7fffa58d9e14
[1335924.571565] NIP:  00007fffa58d9e14 LR: 00007fffa589d7ec CTR: 0000000000000000
[1335924.571581] REGS: c000000024223e80 TRAP: 0c00   Tainted: G      D W      X  N  (5.14.21-150400.24.11-default)
[1335924.571597] MSR:  900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28444488  XER: 00000000
[1335924.571622] IRQMASK: 0 
                 GPR00: 000000000000011e 00007fffc2632800 00007fffa59c7200 ffffffffffffff9c 
                 GPR04: 0000010038cf89b0 0000000000084800 0000000000000000 00007fffc26348f0 
                 GPR08: 0000000000004000 0000000000000000 0000000000000000 0000000000000000 
                 GPR12: 0000000000000000 00007fffa40e2450 00007fffc2632c80 00007fffc2632c60 
                 GPR16: 00007fffc2632c30 00007fffc2632c50 00007fffa67775a0 00007fffc2632c70 
                 GPR20: 00007fffa67862e0 00007fffc2632c90 00007fffa67862d8 0000000000000000 
                 GPR24: 0000000000000000 00007fffc2632c40 00007fffc2632ca0 00007fffc2632c30 
                 GPR28: 0000010038eb2003 0000010038cf86f0 0000010038eb1ff0 0000010038cf85c0 
[1335924.571684] NIP [00007fffa58d9e14] 0x7fffa58d9e14
[1335924.571699] LR [00007fffa589d7ec] 0x7fffa589d7ec
[1335924.571714] --- interrupt: c00
[1335924.571728] Instruction dump:
[1335924.571743] fb610048 fb810050 fba10058 7c9a2378 7c7c1b78 3b600000 3b200000 3b00ffff 
[1335924.571764] 48000010 ebff0000 2fbf0000 419e0060 <813f0018> 7f89f000 409effec 3bbf0050 
[1335924.571785] ---[ end trace c8499ad55c2d02e2 ]---

Zypper seems to get stuck as well (within openqa-continuous-update.service but also when calling it manually). Like the openQA worker it doesn't react to SIGTERM.

Unfortunately strace and coredumpctl aren't installed. Rebooting the system now.

Actions #2

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Feedback

I've been restarting the machine. The serial log shows that a few more processes were stuck:

$ ipmi-openqaworker-power8-ipmi.suse.de sol activate
[SOL Session operational.  Use ~? for help]
[1338058.866349][    T1] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[1338058.888658][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 31816 (run-crons).
[1338058.888799][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 31843 (suse.de-abuild-).
[1338058.888889][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 31849 (Zypp-main).
[1338058.889012][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 36522 (run-crons).
[1338058.889114][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 36549 (suse.de-abuild-).
[1338058.889208][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 36555 (Zypp-main).
[1338058.889311][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 38207 (run-crons).
[1338058.889413][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 38234 (suse.de-abuild-).
[1338058.889506][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 38240 (Zypp-main).
[1338058.889610][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 38801 (run-crons).
[1338058.889713][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 38828 (suse.de-abuild-).
[1338058.889804][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 38834 (Zypp-main).
[1338058.889914][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 43644 (run-crons).
[1338058.890018][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 43671 (suse.de-abuild-).
[1338058.890111][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 43677 (Zypp-main).
[1338058.891776][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 118342 (run-crons).
[1338058.891890][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 118369 (suse.de-abuild-).
[1338058.891994][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 118375 (Zypp-main).
[1338058.892097][    T1] systemd-shutdown[1]: Sending SIGKILL to PID 141390 (Zypp-main).
[1338058.897068][    T1] systemd-shutdown[1]: Unmounting file systems.
[1338058.898551][T68259] [68259]: Remounting '/var/lib/openqa' read-only in with options 'nobarrier'.
[1338058.961074][T68259] [68259]: Failed to remount '/var/lib/openqa' read-only: Device or resource busy
[1338058.962212][T68260] [68260]: Unmounting '/var/lib/openqa'.
[1338058.962279][T68260] [68260]: Failed to unmount /var/lib/openqa: Device or resource busy
[1338058.963292][T68261] [68261]: Remounting '/' read-only in with options 'nobarrier'.
[1338059.006747][T68261] EXT4-fs (sda3): re-mounted. Opts: nobarrier. Quota mode: none.
[1338059.012773][    T1] systemd-shutdown[1]: Not all file systems unmounted, 1 left.
[1338059.012814][    T1] systemd-shutdown[1]: Deactivating swaps.
[1338059.012872][    T1] systemd-shutdown[1]: All swaps deactivated.
[1338059.012903][    T1] systemd-shutdown[1]: Detaching loop devices.
[1338059.019960][    T1] systemd-shutdown[1]: All loop devices detached.
[1338059.019991][    T1] systemd-shutdown[1]: Stopping MD devices.
[1338059.020108][    T1] systemd-shutdown[1]: All MD devices stopped.
[1338059.020141][    T1] systemd-shutdown[1]: Detaching DM devices.
[1338059.020257][    T1] systemd-shutdown[1]: All DM devices detached.
[1338059.020427][    T1] systemd-shutdown[1]: Unmounting file systems.
[1338059.021845][T68262] [68262]: Remounting '/var/lib/openqa' read-only in with options 'nobarrier'.
[1338059.021983][T68262] [68262]: Failed to remount '/var/lib/openqa' read-only: Device or resource busy
[1338059.023036][T68263] [68263]: Unmounting '/var/lib/openqa'.
[1338059.023110][T68263] [68263]: Failed to unmount /var/lib/openqa: Device or resource busy
[1338059.023255][    T1] systemd-shutdown[1]: Not all file systems unmounted, 1 left.
[1338059.023290][    T1] systemd-shutdown[1]: Cannot finalize remaining file systems, continuing.
[1338059.032688][    T1] systemd-shutdown[1]: Failed to finalize file systems, ignoring.
[1338059.032810][    T1] systemd-shutdown[1]: Syncing filesystems and block devices.
[1338059.033101][    T1] systemd-shutdown[1]: Rebooting.
[1338059.033319][    T1] kvm: exiting hardware virtualization
[1338062.234507][T67998] ata1: translated ATA stat/err 0x50/00 to SCSI SK/ASC/ASCQ 0x5/21/04
[1338062.234907][    T1] reboot: Restarting system

However, the system came up just fine again. I've installed systemd-coredump and strace which also worked (without zypper getting stuck).


So far I'd conclude it isn't a problem with the worker code but with the machine (considering zypper was stuck as well). I'll check on the machine in the next days. If it gets stuck again I can try checking with strace on which system call those processes got stuck.

Actions #3

Updated by favogt over 2 years ago

I'll check on the machine in the next days. If it gets stuck again I can try checking with strace on which system call those processes got stuck.

FTR, I did run cat /proc/$(pidof zypper)/task/*/stack once and it showed that it was stuck in poll. Considering that there's something wrong with the filesystem (VFS?) layer, basically anything with paths could trigger it. The first OOPS was triggered by systemd-journald IIRC.

Actions #4

Updated by okurz over 2 years ago

  • Category set to Regressions/Crashes
  • Assignee set to mkittler
  • Target version set to Ready
Actions #5

Updated by mkittler over 2 years ago

It looks like slot 6 got stuck:

ug 08 21:13:32 power8 worker[59642]: [info] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@6.service/2504351
Aug 08 21:13:32 power8 worker[59642]: [info] Starting isotovideo container
Aug 08 21:13:32 power8 worker[59642]: [info] isotovideo has been started (PID: 59739)
Aug 08 21:13:32 power8 worker[59739]: [info] 59739: WORKING 2504351
Aug 08 21:14:13 power8 worker[59642]: [info] Test schedule has changed, reloading test_order.json
Aug 08 21:14:23 power8 worker[59642]: [info] Test schedule has changed, reloading test_order.json
Aug 09 03:13:39 power8 worker[59642]: [info] +++ worker notes +++
Aug 09 03:13:39 power8 worker[59642]: [info] End time: 2022-08-09 03:13:39
Aug 09 03:13:39 power8 worker[59642]: [info] Result: timeout
Aug 09 03:13:39 power8 worker[120511]: [info] Uploading boot_ltp-config-5.18.15-1-default
Aug 09 03:13:39 power8 worker[120511]: [info] Uploading boot_ltp-framebuffer.dat.gz
Aug 09 03:13:39 power8 worker[120511]: [info] Uploading boot_ltp-kernel-pkg.txt
Aug 09 03:13:39 power8 worker[120511]: [info] Uploading boot_ltp-ver_linux_before.txt
Aug 09 03:13:39 power8 worker[120511]: [info] Uploading ltp_known_issues.yaml
Aug 09 08:00:52 power8 worker[59642]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/18 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 09 08:01:02 power8 worker[59642]: [info] Registering with openQA http://openqa1-opensuse
Aug 09 08:01:02 power8 worker[59642]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/18
Aug 09 08:01:03 power8 worker[59642]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 18
Aug 09 10:49:38 power8 worker[59642]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/18 finished by remote side with code 1006, no reason - trying again in 10 seconds
Aug 09 10:49:48 power8 worker[59642]: [info] Registering with openQA http://openqa1-opensuse
Aug 09 10:49:49 power8 worker[59642]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/18
Aug 09 10:49:49 power8 worker[59642]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 18

(related job is https://openqa.opensuse.org/tests/2504351)

The processes' stack:

power8:~ # cat /proc/59642/task/*/stack
[<0>] 0xc000000803d63520
[<0>] __switch_to+0x2d4/0x490
[<0>] do_sys_poll+0x37c/0x780
[<0>] sys_poll+0xe4/0x150
[<0>] system_call_exception+0x15c/0x330
[<0>] system_call_common+0xec/0x250

When attaching strace, it isn't actually completely stuck:

power8:~ # strace -p 59642
strace: Process 59642 attached
restart_syscall(<... resuming interrupted restart_syscall ...>) = 0
poll([{fd=5, events=POLLIN|POLLPRI|POLLOUT}, {fd=9, events=POLLIN|POLLPRI|POLLOUT}], 2, 504383) = 1 ([{fd=5, revents=POLLOUT}])
write(5, "\201\376\20p\0\v\266!{)\325Try\323OtT\301Db~\337~hd\305U\"1\224I"..., 4216) = 4216
poll([{fd=5, events=POLLIN|POLLPRI}, {fd=9, events=POLLIN|POLLPRI|POLLOUT}], 2, 95630) = 1 ([{fd=5, revents=POLLIN}])
read(5, "\201 {\"population\":130,\"type\":\"info"..., 131072) = 34
poll([{fd=5, events=POLLIN|POLLPRI}, {fd=9, events=POLLIN|POLLPRI|POLLOUT}], 2, 95597

Then I restarted the service via systemd and got (first line is the last line of the previous paste):

poll([{fd=5, events=POLLIN|POLLPRI}, {fd=9, events=POLLIN|POLLPRI|POLLOUT}], 2, 95597) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=1, si_uid=0} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Unterbrechung während des Betriebssystemaufrufs)
rt_sigprocmask(SIG_BLOCK, [TERM], [], 8) = 0
rt_sigaction(SIGTERM, NULL, {sa_handler=0x12abd7920, sa_mask=[], sa_flags=0}, 8) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=114, ...}) = 0
flock(8, LOCK_EX)                       = 0
write(8, "[2022-08-09T14:07:03.717162Z] [i"..., 58) = 58
flock(8, LOCK_UN)                       = 0
flock(1, LOCK_EX)                       = 0
write(1, "[info] Received signal TERM\n", 28) = 28
flock(1, LOCK_UN)                       = 0
rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0
poll([{fd=5, events=POLLIN|POLLPRI}, {fd=9, events=POLLIN|POLLPRI|POLLOUT}], 2, 35544) = 0 (Timeout)
poll([{fd=5, events=POLLIN|POLLPRI|POLLOUT}, {fd=9, events=POLLIN|POLLPRI|POLLOUT}], 2, 504375) = 1 ([{fd=5, revents=POLLOUT}])
write(5, "\201\376\20p\0/NR{\r-'r]+<tp97bZ'\rh@=&\"\25l:"..., 4216) = 4216
poll([{fd=5, events=POLLIN|POLLPRI}, {fd=9, events=POLLIN|POLLPRI|POLLOUT}], 2, 95630) = 1 ([{fd=5, revents=POLLIN}])
read(5, "\201 {\"population\":130,\"type\":\"info"..., 131072) = 34
poll([{fd=5, events=POLLIN|POLLPRI}, {fd=9, events=POLLIN|POLLPRI|POLLOUT}], 2, 95597 <unfinished ...>) = ?
+++ killed by SIGKILL +++
power8:~ # 

Note that zypper ref works currently just fine on the system.

Actions #6

Updated by favogt over 2 years ago

  • Status changed from Feedback to Blocked

And the kernel is complaining again. I reported a product bug: https://bugzilla.opensuse.org/show_bug.cgi?id=1202275

Actions #7

Updated by favogt over 2 years ago

I downgraded the kernel on power8 to the latest version from 15.3. While that's technically incompatible with util-linux, it's probably still better than constant crashes. Let's see.
I'm not sure how the PPC boot process handles the default boot entry, so if that ignores saved_entry from grubenv (set by grub2-set-default) it might end up with 5.14 after a reboot though.

Actions #8

Updated by mkittler about 2 years ago

  • Description updated (diff)
  • Status changed from Blocked to New

Added note in ticket description. Let's re-evaluate this ticket in the next estimation meeting.

Actions #9

Updated by mkittler about 2 years ago

  • Status changed from New to Feedback

Find out why the workers got stuck; even if the actual problem is on the web UI side the worker service should react to SIGTERM.

That we now know. It is stuck on syscalls and not the only affected service. So I suppose there's nothing we should improve in our code.

Find out whether something went wrong on the web UI side.

The points in the ticket description already suggest there's nothing wrong.

The newly assigned jobs are stuck in assigned state. It looks like the newly restarted worker slots are already stuck again. The log looks like the one from the ticket description.

If one would have taken the worker offline then the assigned jobs would have been set back to scheduled after a while. So I suppose this situation is also ok. Of course as long as the worker appears idling the scheduler might re-attempt the assignment. Thus jobs might go back and forth between the scheduled and assigned state until they're finally assigned to a responsive worker. When I remember correctly I have also observed that. We could introduce a scoring system on the web UI side to avoid this. That would mean the scheduler would reduce the score of workers that don't pickup assigned jobs and when assigning jobs it would generally prefer workers with a higher score. Not sure whether it is worth implementing something like that. It is just the only improvement that comes to my mind we could do on the web UI side. (The score also needed to be reset/increased in certain cases so I suppose it wouldn't be that simple.)


Considering the mentioned improvements are likely not worth it I'd rather close this issue and maybe add the scoring idea to our ticket for misc ideas.

Actions #10

Updated by livdywan about 2 years ago

mkittler wrote:

Considering the mentioned improvements are likely not worth it I'd rather close this issue and maybe add the scoring idea to our ticket for misc ideas.

I would suggest to file a proper ticket for this since it otherwise seems like we have no solution here. Except we're assuming that it's unlikely to occur again. As I understand it the issue hasn't be resolved.

Actions #11

Updated by mkittler about 2 years ago

As I understand it the issue hasn't be resolved.

But we have https://progress.opensuse.org/issues/116078 for the immediate issue.

I've been adding the idea on https://progress.opensuse.org/journals/551209/diff?detail_id=520102 because this kind of improvement is likely nothing we'll have time to work on right now anyways.

Actions #12

Updated by mkittler about 2 years ago

  • Status changed from Feedback to Resolved
Actions

Also available in: Atom PDF