action #44351
closed
[ipmi] Workers/jobs stuck
Added by xlai over 5 years ago.
Updated over 5 years ago.
Description
Currently at least 4 ipmi workers (openqaworker2:23/24/25/26) are stuck. They can not finish jobs or take new jobs for over a day and some even several days.
Despite the misleading message from the developer mode, this issue has nothing to do with the developer mode. (The misleading message is also already fixed on latest master.)
This issue is likely tied to the latest refactoring of the worker cache. At least the autoinst-log.txt
leads to that conclusion:
tail -f /var/lib/openqa/pool/26/autoinst-log.txt
[2018-11-26T11:07:10.0806 CET] [info] +++ setup notes +++
[2018-11-26T11:07:10.0806 CET] [info] start time: 2018-11-26 10:07:10
[2018-11-26T11:07:10.0806 CET] [info] running on openqaworker2:26 (Linux 4.7.5-2.g02c4d35-default #1 SMP PREEMPT Mon Sep 26 08:11:45 UTC 2016 (02c4d35) x86_64)
[2018-11-26T11:07:10.0829 CET] [debug] Downloading SLE-15-SP1-Installer-DVD-x86_64-Build100.4-Media1.iso - request sent to Cache Service.
Since spvm jobs on grenache show the same symptom, this is likely also not ipmi-specific.
Job links: https://openqa.suse.de/tests/2272494 (openqaworker2), https://openqa.suse.de/tests/2278365 (grenache-1)
- Status changed from New to Workable
- Assignee deleted (
nicksinger)
we will take a look into it.
Please keep in mind that this ticket-queue is not only for me but for the whole tools-team so I'd kindly ask you to not assign tasks everybody can help with :)
Of course I'll help you but it could happen that other people never take a look into the ticket if it's assigned to one person already.
- Subject changed from [ipmi] Workers keep in developer mode. to [ipmi] Workers/jobs stuck in developer mode.
- Subject changed from [ipmi] Workers/jobs stuck in developer mode. to [ipmi] Workers/jobs stuck
- Description updated (diff)
The openqa-cacheservice-minon service has crashed on openqaworker2 because it failed to access the SQLite database which was already locked at this point:
sudo systemctl status openqa-worker-cacheservice-minion
● openqa-worker-cacheservice-minion.service - OpenQA Worker Cache Service Minion
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice-minion.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sa 2018-11-24 15:54:20 CET; 1 day 21h ago
Main PID: 12040 (code=exited, status=11)
Nov 15 14:00:19 openqaworker2 openqa-workercache[12040]: rsync warning: some files vanished before they could be transferred (code 24) at main.c(1652) [generator=3.1.0]
Nov 21 11:30:25 openqaworker2 openqa-workercache[12040]: file has vanished: "/sle/products/opensuse/needles/.git/HEAD.lock" (in tests)
Nov 21 11:30:27 openqaworker2 openqa-workercache[12040]: rsync warning: some files vanished before they could be transferred (code 24) at main.c(1652) [generator=3.1.0]
Nov 23 14:45:13 openqaworker2 openqa-workercache[12040]: file has vanished: "/sle/.git/objects/90/tmp_obj_hwho0r" (in tests)
Nov 23 14:45:13 openqaworker2 openqa-workercache[12040]: file has vanished: "/sle/.git/objects/bc/tmp_obj_N2vtLr" (in tests)
Nov 23 14:45:13 openqaworker2 openqa-workercache[12040]: rsync warning: some files vanished before they could be transferred (code 24) at main.c(1652) [generator=3.1.0]
Nov 24 15:54:20 openqaworker2 openqa-workercache[12040]: DBD::SQLite::st execute failed: database is locked at /usr/lib/perl5/vendor_perl/5.18.2/Minion/Backend/SQLite.pm line 357.
Nov 24 15:54:20 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=11/n/a
Nov 24 15:54:20 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 24 15:54:20 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
After restarting the service manually, it seems to run again.
This still leaves the jobs and the service openqa-workercache in a faulty state. Restarting openqa-workercache doesn't help. It still attempts to access a Minion job but can't:
[e] Can't call method "info" on an undefined value at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Service.pm line 100.
The jobs are now incomplete because restarting the cache services caused the worker to fail:
sudo systemctl status openqa-worker@26
● openqa-worker@26.service - openQA Worker #26
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Mo 2018-11-26 13:18:56 CET; 18min ago
Main PID: 23598 (code=exited, status=0/SUCCESS)
Nov 26 11:06:55 openqaworker2 systemd[1]: Started openQA Worker #26.
Nov 26 11:06:56 openqaworker2 worker[23598]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Nov 26 11:06:56 openqaworker2 worker[23598]: [info] Project dir for host openqa.suse.de is /var/lib/openqa/share
Nov 26 11:06:57 openqaworker2 worker[23598]: [info] registering worker openqaworker2 version 13 with openQA openqa.suse.de using protocol version [1]
Nov 26 11:07:10 openqaworker2 worker[23598]: GLOB(0x5c28388)[info] got job 2272494: 02272494-sle-15-SP1-Installer-DVD-x86_64-Build100.4-virt-guest-migration-sles15-from-sles15-to-developing-kvm-dst@virt-mm-64bit-ipmi
Nov 26 11:07:10 openqaworker2 worker[23598]: [info] +++ setup notes +++
Nov 26 11:07:10 openqaworker2 worker[23598]: [info] start time: 2018-11-26 10:07:10
Nov 26 11:07:10 openqaworker2 worker[23598]: [info] running on openqaworker2:26 (Linux 4.7.5-2.g02c4d35-default #1 SMP PREEMPT Mon Sep 26 08:11:45 UTC 2016 (02c4d35) x86_64)
Nov 26 13:18:56 openqaworker2 worker[23598]: Mojo::Reactor::Poll: I/O watcher failed: Connection refused at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 39.
Nov 26 13:18:56 openqaworker2 worker[23598]: stopping because a critical error occurred.
It seems like the worker are not restarted automatically so I'm doing that manually now.
- Related to action #44363: Exit worker with non-zero return code in error cases added
nicksinger wrote:
we will take a look into it.
Please keep in mind that this ticket-queue is not only for me but for the whole tools-team so I'd kindly ask you to not assign tasks everybody can help with :)
Of course I'll help you but it could happen that other people never take a look into the ticket if it's assigned to one person already.
Sure, I will avoid it next time. Sorry for the trouble.
Thank you all for the help!
We have another worker in this state, grenache-1:17. Would you please help to recover it?
- Status changed from Workable to Feedback
- Assignee set to nicksinger
- Status changed from Feedback to Resolved
Also available in: Atom
PDF