Project

General

Profile

action #44351

[ipmi] Workers/jobs stuck

Added by xlai about 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
Start date:
2018-11-26
Due date:
% Done:

0%

Estimated time:

Description

Currently at least 4 ipmi workers (openqaworker2:23/24/25/26) are stuck. They can not finish jobs or take new jobs for over a day and some even several days.

Despite the misleading message from the developer mode, this issue has nothing to do with the developer mode. (The misleading message is also already fixed on latest master.)

This issue is likely tied to the latest refactoring of the worker cache. At least the autoinst-log.txt leads to that conclusion:

tail -f /var/lib/openqa/pool/26/autoinst-log.txt 
[2018-11-26T11:07:10.0806 CET] [info] +++ setup notes +++
[2018-11-26T11:07:10.0806 CET] [info] start time: 2018-11-26 10:07:10
[2018-11-26T11:07:10.0806 CET] [info] running on openqaworker2:26 (Linux 4.7.5-2.g02c4d35-default #1 SMP PREEMPT Mon Sep 26 08:11:45 UTC 2016 (02c4d35) x86_64)
[2018-11-26T11:07:10.0829 CET] [debug] Downloading SLE-15-SP1-Installer-DVD-x86_64-Build100.4-Media1.iso - request sent to Cache Service.

Since spvm jobs on grenache show the same symptom, this is likely also not ipmi-specific.

Job links: https://openqa.suse.de/tests/2272494 (openqaworker2), https://openqa.suse.de/tests/2278365 (grenache-1)


Related issues

Related to openQA Project - action #44363: Exit worker with non-zero return code in error casesResolved2018-11-26

History

#1 Updated by nicksinger about 4 years ago

  • Status changed from New to Workable
  • Assignee deleted (nicksinger)

we will take a look into it.
Please keep in mind that this ticket-queue is not only for me but for the whole tools-team so I'd kindly ask you to not assign tasks everybody can help with :)
Of course I'll help you but it could happen that other people never take a look into the ticket if it's assigned to one person already.

#2 Updated by nicksinger about 4 years ago

  • Subject changed from [ipmi] Workers keep in developer mode. to [ipmi] Workers/jobs stuck in developer mode.

#3 Updated by mkittler about 4 years ago

  • Subject changed from [ipmi] Workers/jobs stuck in developer mode. to [ipmi] Workers/jobs stuck
  • Description updated (diff)

#4 Updated by mkittler about 4 years ago

The openqa-cacheservice-minon service has crashed on openqaworker2 because it failed to access the SQLite database which was already locked at this point:

sudo systemctl status openqa-worker-cacheservice-minion
● openqa-worker-cacheservice-minion.service - OpenQA Worker Cache Service Minion
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice-minion.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sa 2018-11-24 15:54:20 CET; 1 day 21h ago
 Main PID: 12040 (code=exited, status=11)

Nov 15 14:00:19 openqaworker2 openqa-workercache[12040]: rsync warning: some files vanished before they could be transferred (code 24) at main.c(1652) [generator=3.1.0]
Nov 21 11:30:25 openqaworker2 openqa-workercache[12040]: file has vanished: "/sle/products/opensuse/needles/.git/HEAD.lock" (in tests)
Nov 21 11:30:27 openqaworker2 openqa-workercache[12040]: rsync warning: some files vanished before they could be transferred (code 24) at main.c(1652) [generator=3.1.0]
Nov 23 14:45:13 openqaworker2 openqa-workercache[12040]: file has vanished: "/sle/.git/objects/90/tmp_obj_hwho0r" (in tests)
Nov 23 14:45:13 openqaworker2 openqa-workercache[12040]: file has vanished: "/sle/.git/objects/bc/tmp_obj_N2vtLr" (in tests)
Nov 23 14:45:13 openqaworker2 openqa-workercache[12040]: rsync warning: some files vanished before they could be transferred (code 24) at main.c(1652) [generator=3.1.0]
Nov 24 15:54:20 openqaworker2 openqa-workercache[12040]: DBD::SQLite::st execute failed: database is locked at /usr/lib/perl5/vendor_perl/5.18.2/Minion/Backend/SQLite.pm line 357.
Nov 24 15:54:20 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=11/n/a
Nov 24 15:54:20 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 24 15:54:20 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.

After restarting the service manually, it seems to run again.

This still leaves the jobs and the service openqa-workercache in a faulty state. Restarting openqa-workercache doesn't help. It still attempts to access a Minion job but can't:

[e] Can't call method "info" on an undefined value at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Service.pm line 100.

#5 Updated by mkittler about 4 years ago

The jobs are now incomplete because restarting the cache services caused the worker to fail:

sudo systemctl status openqa-worker@26
● openqa-worker@26.service - openQA Worker #26
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mo 2018-11-26 13:18:56 CET; 18min ago
 Main PID: 23598 (code=exited, status=0/SUCCESS)

Nov 26 11:06:55 openqaworker2 systemd[1]: Started openQA Worker #26.
Nov 26 11:06:56 openqaworker2 worker[23598]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Nov 26 11:06:56 openqaworker2 worker[23598]: [info] Project dir for host openqa.suse.de is /var/lib/openqa/share
Nov 26 11:06:57 openqaworker2 worker[23598]: [info] registering worker openqaworker2 version 13 with openQA openqa.suse.de using protocol version [1]
Nov 26 11:07:10 openqaworker2 worker[23598]: GLOB(0x5c28388)[info] got job 2272494: 02272494-sle-15-SP1-Installer-DVD-x86_64-Build100.4-virt-guest-migration-sles15-from-sles15-to-developing-kvm-dst@virt-mm-64bit-ipmi
Nov 26 11:07:10 openqaworker2 worker[23598]: [info] +++ setup notes +++
Nov 26 11:07:10 openqaworker2 worker[23598]: [info] start time: 2018-11-26 10:07:10
Nov 26 11:07:10 openqaworker2 worker[23598]: [info] running on openqaworker2:26 (Linux 4.7.5-2.g02c4d35-default #1 SMP PREEMPT Mon Sep 26 08:11:45 UTC 2016 (02c4d35) x86_64)
Nov 26 13:18:56 openqaworker2 worker[23598]: Mojo::Reactor::Poll: I/O watcher failed: Connection refused at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 39.
Nov 26 13:18:56 openqaworker2 worker[23598]: stopping because a critical error occurred.

It seems like the worker are not restarted automatically so I'm doing that manually now.

#6 Updated by mkittler about 4 years ago

  • Related to action #44363: Exit worker with non-zero return code in error cases added

#7 Updated by xlai about 4 years ago

nicksinger wrote:

we will take a look into it.
Please keep in mind that this ticket-queue is not only for me but for the whole tools-team so I'd kindly ask you to not assign tasks everybody can help with :)
Of course I'll help you but it could happen that other people never take a look into the ticket if it's assigned to one person already.

Sure, I will avoid it next time. Sorry for the trouble.

Thank you all for the help!

#8 Updated by xlai almost 4 years ago

We have another worker in this state, grenache-1:17. Would you please help to recover it?

#9 Updated by nicksinger almost 4 years ago

Restarted the services and worker once again. Looks like it's working again (instance 17 included: https://openqa.suse.de/tests/2285036).

#10 Updated by nicksinger almost 4 years ago

  • Status changed from Workable to Feedback
  • Assignee set to nicksinger

in the meantime the worker got updated and https://progress.opensuse.org/issues/44750 should also help mitigate these problems. Closing.

#11 Updated by nicksinger almost 4 years ago

  • Status changed from Feedback to Resolved

Also available in: Atom PDF