action #28355

closed

[tools][bonus][Sprint 201711.2] Worker loop dies during job setup

Added by EDiGiacinto over 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
2017-11-24
Due date:
% Done:

0%

Estimated time:

Description

Today this happened on openqaworker8:

Nov 23 17:31:28 openqaworker8 worker[25677]: total size is 5,382,757,930  speedup is 2,216.76
Nov 23 17:31:28 openqaworker8 worker[25677]: [Thu Nov 23 17:31:28 2017] [worker:info] Waiting for subprocess
Nov 23 17:31:29 openqaworker8 worker[25677]: [Thu Nov 23 17:31:29 2017] [worker:info] 64461: WORKING 1269063
Nov 23 17:31:30 openqaworker8 qemu-system-x86_64[64486]: looking for plugins in '/usr/lib64/sasl2', failed to open directory, error: No such file or directory
Nov 23 17:41:30 openqaworker8 worker[25677]: [Thu Nov 23 17:41:30 2017] [worker:info] cleaning up 01269063-sle-15-Installer-DVD-x86_64-Build349.1-remote_vnc_target_nfs@64bit
Nov 23 17:46:33 openqaworker8 worker[25677]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7fe59a8)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
Nov 24 09:17:23 openqaworker8 worker[25677]: [Fri Nov 24 09:17:23 2017] [worker:warn] Job ids match but current host not set

Since then, the worker has been stuck.
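The "Can't use string ("HASH(0x...)") as a HASH ref" failure is the generic Perl error raised when a hash reference has been accidentally stringified and is later dereferenced under `use strict`. A minimal standalone sketch (hypothetical illustration, not actual openQA code) reproducing the same class of error:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical illustration (not openQA code): under "use strict",
# dereferencing a hash reference that was accidentally stringified
# produces exactly the error seen in the worker journal.
my $job = { id => 1269063, state => 'working' };

# Interpolating the reference into a string (or any other accidental
# stringification) turns it into the plain string "HASH(0x...)".
my $stringified = "$job";

# With "strict refs" in effect, treating that string as a hash
# reference dies instead of silently acting as a symbolic reference.
eval { my $id = $stringified->{id} };
print $@;
# prints: Can't use string ("HASH(0x...)") as a HASH ref while "strict refs" in use at ...
```

If this is what happens in Jobs.pm line 456, somewhere the job hash reference is being replaced by its stringified form before the I/O watcher callback dereferences it.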


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #18164: [devops][tools] monitoring of openqa worker instances (Resolved, nicksinger, 2018-04-25)

Related to openQA Project - action #28714: [tools] Investigate why sporadically job is set to scalar value of the reference instead of the reference itself. (Resolved, mkittler, 2017-12-01)

Actions #1

Updated by EDiGiacinto over 6 years ago

The same happened on another instance:

Nov 23 17:29:14 openqaw2 worker[16151]: [Thu Nov 23 17:29:14 2017] [worker:info] got job 1268880: 01268880-sle-15-Installer-DVD-s390x-Build349.1-xfs@s390x-kvm
Nov 23 17:29:14 openqaw2 worker[16151]: [Thu Nov 23 17:29:14 2017] [worker:info] OpenQA::Worker::Cache: Initialized with openqa.suse.de at /var/lib/openqa/cache, current size is 53064695808
Nov 23 17:29:14 openqaw2 worker[16151]: [Thu Nov 23 17:29:14 2017] [worker:info] Waiting for subprocess
Nov 23 17:29:14 openqaw2 worker[16151]: receiving incremental file list
Nov 23 17:29:15 openqaw2 worker[16151]: [Thu Nov 23 17:29:15 2017] [worker:info] Waiting for subprocess
Nov 23 17:29:15 openqaw2 worker[16151]: sent 2,409 bytes  received 1,497,146 bytes  999,703.33 bytes/sec
Nov 23 17:29:15 openqaw2 worker[16151]: total size is 5,382,122,337  speedup is 3,589.15
Nov 23 17:29:15 openqaw2 worker[16151]: [Thu Nov 23 17:29:15 2017] [worker:info] 3148: WORKING 1268880
Nov 23 17:41:17 openqaw2 worker[16151]: [Thu Nov 23 17:41:17 2017] [worker:info] cleaning up 01268880-sle-15-Installer-DVD-s390x-Build349.1-xfs@s390x-kvm
Nov 23 17:46:33 openqaw2 worker[16151]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x79ffa80)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
Actions #2

Updated by EDiGiacinto over 6 years ago

  • Description updated (diff)
Actions #3

Updated by EDiGiacinto over 6 years ago

The same happened on malbec:

Nov 27 17:26:18 malbec worker[77746]: receiving incremental file list
Nov 27 17:26:18 malbec worker[77746]: [Mon Nov 27 17:26:18 2017] [worker:info] Waiting for subprocess
Nov 27 17:26:18 malbec worker[77746]: sent 2,402 bytes  received 1,532,075 bytes  1,022,984.67 bytes/sec
Nov 27 17:26:18 malbec worker[77746]: total size is 5,400,925,980  speedup is 3,519.72
Nov 27 17:26:18 malbec worker[77746]: [Mon Nov 27 17:26:18 2017] [worker:info] Waiting for subprocess
Nov 27 17:26:19 malbec worker[77746]: [Mon Nov 27 17:26:19 2017] [worker:info] 59072: WORKING 1274999
Nov 27 18:00:55 malbec worker[77746]: [Mon Nov 27 18:00:55 2017] [worker:info] cleaning up 01274999-sle-12-SP3-Server-DVD-Incidents-Minimal-ppc64le-Build:5745:grub2.1511796549-qam-minimal@ppc64le
Nov 27 18:10:58 malbec worker[77746]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7efc170)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
Actions #4

Updated by EDiGiacinto over 6 years ago

  • Subject changed from [tools] Worker loop dies during job setup (sporadic) to [tools] Worker loop dies during job setup

It happened again today, but this time the job is stuck in the running state:

Nov 27 15:12:16 openqaworker10 worker[2027]: [Mon Nov 27 15:12:16 2017] [worker:info] total size is 5,399,209,103  speedup is 3,529.49
Nov 27 15:12:16 openqaworker10 worker[2027]: [Mon Nov 27 15:12:16 2017] [worker:info] Waiting for subprocess
Nov 27 15:12:17 openqaworker10 worker[2027]: [Mon Nov 27 15:12:17 2017] [worker:info] 11348: WORKING 1274588
Nov 27 15:12:20 openqaworker10 qemu-system-x86_64[11410]: looking for plugins in '/usr/lib64/sasl2', failed to open directory, error: No such file or directory
Nov 27 15:49:50 openqaworker10 worker[2027]: [Mon Nov 27 15:49:50 2017] [worker:info] cleaning up 01274588-sle-12-SP3-Server-DVD-Incidents-x86_64-Build:4905:nfs-utils.1511789388-create_hdd_minimal_base+sdk@64bit
Nov 27 15:59:40 openqaworker10 worker[2027]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7a126c8)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.

Removing "(sporadic)" from the subject, since we have been seeing this more frequently this week.

Actions #5

Updated by szarate over 6 years ago

  • Priority changed from Normal to High
Actions #6

Updated by szarate over 6 years ago

  • Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Actions #7

Updated by szarate over 6 years ago

To some extent this is maybe related to poo#27832?

Actions #8

Updated by EDiGiacinto over 6 years ago

szarate wrote:

To some extend maybe related to poo#27832?

Definitely, poo#27832 would help us have more information and a better status overview of the workers.

After hitting that error, the worker appears almost dead, and restarting it takes the full timeout, which likely means it is not being shut down by the SIGTERM signals sent by systemd. I'm more worried about whether https://github.com/os-autoinst/openQA/pull/1467/commits/6fada432251f7ed39fd2c51b9c3c47ae6c8318b0 (read the comment) is also related somehow; that's tricky to exclude.

I would handle this kind of error and prevent the worker from just dying in the meantime.

Actions #9

Updated by EDiGiacinto over 6 years ago

  • Status changed from New to In Progress
  • Assignee set to EDiGiacinto
Actions #10

Updated by EDiGiacinto over 6 years ago

  • Status changed from In Progress to Feedback

PR: https://github.com/os-autoinst/openQA/pull/1514. This should prevent the workers from dying in the first place.
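The PR itself is the authoritative fix; as a rough sketch of the defensive pattern it implies (all names below are hypothetical, not taken from the PR), one can verify that the value really is a hash reference before dereferencing it, so a stringified reference warns instead of killing the worker's event loop:

```perl
use strict;
use warnings;
use Scalar::Util 'reftype';

# Hypothetical helper (name and structure assumed, not from the PR):
# check that the job slot actually holds a hash reference before
# dereferencing it, so a stringified reference like "HASH(0x...)"
# cannot crash the I/O watcher.
sub safe_job_id {
    my ($job) = @_;
    unless (ref $job && (reftype($job) // '') eq 'HASH') {
        my $shown = defined $job ? $job : 'undef';
        warn "job is not a HASH ref (got '$shown'), skipping\n";
        return undef;
    }
    return $job->{id};
}
```

With this guard, `safe_job_id("HASH(0x123)")` warns and returns undef instead of raising the fatal "Can't use string ... as a HASH ref" error that took down the loop.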

Actions #11

Updated by EDiGiacinto over 6 years ago

It happened again yesterday:

Nov 30 21:31:52 openqaworker2 worker[3445]: [Thu Nov 30 21:31:52 2017] [worker:info] Waiting for subprocess
Nov 30 21:31:52 openqaworker2 worker[3445]: [Thu Nov 30 21:31:52 2017] [worker:info] Waiting for subprocess
Nov 30 21:31:53 openqaworker2 worker[3445]: [Thu Nov 30 21:31:53 2017] [worker:info] 5898: WORKING 1283324
Nov 30 22:05:35 openqaworker2 worker[3445]: can't open /var/lib/openqa/pool/13/testresults/result-consoletest_setup.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 759.
Nov 30 22:05:57 openqaworker2 worker[3445]: [Thu Nov 30 22:05:57 2017] [worker:info] cleaning up 01283324-sle-15-Installer-DVD-x86_64-Build360.1-gnome@svirt-xen-pv
Nov 30 22:13:24 openqaworker2 worker[3445]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7e930a8)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
Actions #12

Updated by EDiGiacinto over 6 years ago

  • Related to action #28714: [tools] Investigate why sporadically job is set to scalar value of the reference instead of the reference itself. added
Actions #13

Updated by EDiGiacinto over 6 years ago

  • Status changed from Feedback to Resolved

Setting this to resolved; the workaround should be enough for the moment. Reopen if necessary.

Since finding out why this actually happens would require further investigation and debugging, I filed a new issue to track updates on that.

Actions #14

Updated by szarate over 6 years ago

  • Subject changed from [tools] Worker loop dies during job setup to [tools][bonus][Sprint 201711.2] Worker loop dies during job setup

Adding the bonus tag, as it was not planned.

Actions #15

Updated by EDiGiacinto about 6 years ago

  • Description updated (diff)
