action #44351: [ipmi] Workers/jobs stuck - openQA Infrastructure - openSUSE Project Management Tool

Actions

Copy link

action #44351

closed

[ipmi] Workers/jobs stuck

Added by xlai over 5 years ago. Updated over 5 years ago.

Status:

Resolved

Priority:

High

Assignee:

nicksinger

Category:

Target version:

Start date:

2018-11-26

Due date:

% Done:

Estimated time:

Description

Currently at least 4 ipmi workers (openqaworker2:23/24/25/26) are stuck. They can not finish jobs or take new jobs for over a day and some even several days.

Despite the misleading message from the developer mode, this issue has nothing to do with the developer mode. (The misleading message is also already fixed on latest master.)

This issue is likely tied to the latest refactoring of the worker cache. At least the autoinst-log.txt leads to that conclusion:

tail -f /var/lib/openqa/pool/26/autoinst-log.txt 
[2018-11-26T11:07:10.0806 CET] [info] +++ setup notes +++
[2018-11-26T11:07:10.0806 CET] [info] start time: 2018-11-26 10:07:10
[2018-11-26T11:07:10.0806 CET] [info] running on openqaworker2:26 (Linux 4.7.5-2.g02c4d35-default #1 SMP PREEMPT Mon Sep 26 08:11:45 UTC 2016 (02c4d35) x86_64)
[2018-11-26T11:07:10.0829 CET] [debug] Downloading SLE-15-SP1-Installer-DVD-x86_64-Build100.4-Media1.iso - request sent to Cache Service.

Since spvm jobs on grenache show the same symptom, this is likely also not ipmi-specific.

Job links: https://openqa.suse.de/tests/2272494 (openqaworker2), https://openqa.suse.de/tests/2278365 (grenache-1)

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by nicksinger over 5 years ago

Status changed from New to Workable
Assignee deleted (~~nicksinger~~)

we will take a look into it.
Please keep in mind that this ticket-queue is not only for me but for the whole tools-team so I'd kindly ask you to not assign tasks everybody can help with :)
Of course I'll help you but it could happen that other people never take a look into the ticket if it's assigned to one person already.

Actions

Copy link

Updated by nicksinger over 5 years ago

Subject changed from [ipmi] Workers keep in developer mode. to [ipmi] Workers/jobs stuck in developer mode.

Actions

Copy link

Updated by mkittler over 5 years ago

Subject changed from [ipmi] Workers/jobs stuck in developer mode. to [ipmi] Workers/jobs stuck
Description updated (diff)

Actions

Copy link

Updated by mkittler over 5 years ago

The openqa-cacheservice-minon service has crashed on openqaworker2 because it failed to access the SQLite database which was already locked at this point:

sudo systemctl status openqa-worker-cacheservice-minion
● openqa-worker-cacheservice-minion.service - OpenQA Worker Cache Service Minion
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice-minion.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sa 2018-11-24 15:54:20 CET; 1 day 21h ago
 Main PID: 12040 (code=exited, status=11)

Nov 15 14:00:19 openqaworker2 openqa-workercache[12040]: rsync warning: some files vanished before they could be transferred (code 24) at main.c(1652) [generator=3.1.0]
Nov 21 11:30:25 openqaworker2 openqa-workercache[12040]: file has vanished: "/sle/products/opensuse/needles/.git/HEAD.lock" (in tests)
Nov 21 11:30:27 openqaworker2 openqa-workercache[12040]: rsync warning: some files vanished before they could be transferred (code 24) at main.c(1652) [generator=3.1.0]
Nov 23 14:45:13 openqaworker2 openqa-workercache[12040]: file has vanished: "/sle/.git/objects/90/tmp_obj_hwho0r" (in tests)
Nov 23 14:45:13 openqaworker2 openqa-workercache[12040]: file has vanished: "/sle/.git/objects/bc/tmp_obj_N2vtLr" (in tests)
Nov 23 14:45:13 openqaworker2 openqa-workercache[12040]: rsync warning: some files vanished before they could be transferred (code 24) at main.c(1652) [generator=3.1.0]
Nov 24 15:54:20 openqaworker2 openqa-workercache[12040]: DBD::SQLite::st execute failed: database is locked at /usr/lib/perl5/vendor_perl/5.18.2/Minion/Backend/SQLite.pm line 357.
Nov 24 15:54:20 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=11/n/a
Nov 24 15:54:20 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 24 15:54:20 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.

After restarting the service manually, it seems to run again.

This still leaves the jobs and the service openqa-workercache in a faulty state. Restarting openqa-workercache doesn't help. It still attempts to access a Minion job but can't:

[e] Can't call method "info" on an undefined value at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Service.pm line 100.

Actions

Copy link

Updated by mkittler over 5 years ago

The jobs are now incomplete because restarting the cache services caused the worker to fail:

sudo systemctl status openqa-worker@26
● openqa-worker@26.service - openQA Worker #26
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mo 2018-11-26 13:18:56 CET; 18min ago
 Main PID: 23598 (code=exited, status=0/SUCCESS)

Nov 26 11:06:55 openqaworker2 systemd[1]: Started openQA Worker #26.
Nov 26 11:06:56 openqaworker2 worker[23598]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Nov 26 11:06:56 openqaworker2 worker[23598]: [info] Project dir for host openqa.suse.de is /var/lib/openqa/share
Nov 26 11:06:57 openqaworker2 worker[23598]: [info] registering worker openqaworker2 version 13 with openQA openqa.suse.de using protocol version [1]
Nov 26 11:07:10 openqaworker2 worker[23598]: GLOB(0x5c28388)[info] got job 2272494: 02272494-sle-15-SP1-Installer-DVD-x86_64-Build100.4-virt-guest-migration-sles15-from-sles15-to-developing-kvm-dst@virt-mm-64bit-ipmi
Nov 26 11:07:10 openqaworker2 worker[23598]: [info] +++ setup notes +++
Nov 26 11:07:10 openqaworker2 worker[23598]: [info] start time: 2018-11-26 10:07:10
Nov 26 11:07:10 openqaworker2 worker[23598]: [info] running on openqaworker2:26 (Linux 4.7.5-2.g02c4d35-default #1 SMP PREEMPT Mon Sep 26 08:11:45 UTC 2016 (02c4d35) x86_64)
Nov 26 13:18:56 openqaworker2 worker[23598]: Mojo::Reactor::Poll: I/O watcher failed: Connection refused at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 39.
Nov 26 13:18:56 openqaworker2 worker[23598]: stopping because a critical error occurred.

It seems like the worker are not restarted automatically so I'm doing that manually now.

Actions

Copy link

Updated by mkittler over 5 years ago

Related to action #44363: Exit worker with non-zero return code in error cases added

Actions

Copy link

Updated by xlai over 5 years ago

nicksinger wrote:

we will take a look into it.
Please keep in mind that this ticket-queue is not only for me but for the whole tools-team so I'd kindly ask you to not assign tasks everybody can help with :)
Of course I'll help you but it could happen that other people never take a look into the ticket if it's assigned to one person already.

Sure, I will avoid it next time. Sorry for the trouble.

Thank you all for the help!

Actions

Copy link