action #28355
closed
[tools][bonus][Sprint 201711.2] Worker loop dies during job setup
Added by EDiGiacinto over 7 years ago.
Updated about 7 years ago.
Description
Today this happened in openqaworker8:
Nov 23 17:31:28 openqaworker8 worker[25677]: total size is 5,382,757,930 speedup is 2,216.76
Nov 23 17:31:28 openqaworker8 worker[25677]: [Thu Nov 23 17:31:28 2017] [worker:info] Waiting for subprocess
Nov 23 17:31:29 openqaworker8 worker[25677]: [Thu Nov 23 17:31:29 2017] [worker:info] 64461: WORKING 1269063
Nov 23 17:31:30 openqaworker8 qemu-system-x86_64[64486]: looking for plugins in '/usr/lib64/sasl2', failed to open directory, error: No such file or directory
Nov 23 17:41:30 openqaworker8 worker[25677]: [Thu Nov 23 17:41:30 2017] [worker:info] cleaning up 01269063-sle-15-Installer-DVD-x86_64-Build349.1-remote_vnc_target_nfs@64bit
Nov 23 17:46:33 openqaworker8 worker[25677]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7fe59a8)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
Nov 24 09:17:23 openqaworker8 worker[25677]: [Fri Nov 24 09:17:23 2017] [worker:warn] Job ids match but current host not set
Since then the worker is stuck.
Same happened to another instance:
Nov 23 17:29:14 openqaw2 worker[16151]: [Thu Nov 23 17:29:14 2017] [worker:info] got job 1268880: 01268880-sle-15-Installer-DVD-s390x-Build349.1-xfs@s390x-kvm
Nov 23 17:29:14 openqaw2 worker[16151]: [Thu Nov 23 17:29:14 2017] [worker:info] OpenQA::Worker::Cache: Initialized with openqa.suse.de at /var/lib/openqa/cache, current size is 53064695808
Nov 23 17:29:14 openqaw2 worker[16151]: [Thu Nov 23 17:29:14 2017] [worker:info] Waiting for subprocess
Nov 23 17:29:14 openqaw2 worker[16151]: receiving incremental file list
Nov 23 17:29:15 openqaw2 worker[16151]: [Thu Nov 23 17:29:15 2017] [worker:info] Waiting for subprocess
Nov 23 17:29:15 openqaw2 worker[16151]: sent 2,409 bytes received 1,497,146 bytes 999,703.33 bytes/sec
Nov 23 17:29:15 openqaw2 worker[16151]: total size is 5,382,122,337 speedup is 3,589.15
Nov 23 17:29:15 openqaw2 worker[16151]: [Thu Nov 23 17:29:15 2017] [worker:info] 3148: WORKING 1268880
Nov 23 17:41:17 openqaw2 worker[16151]: [Thu Nov 23 17:41:17 2017] [worker:info] cleaning up 01268880-sle-15-Installer-DVD-s390x-Build349.1-xfs@s390x-kvm
Nov 23 17:46:33 openqaw2 worker[16151]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x79ffa80)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
- Description updated (diff)
Same happened to malbec:
Nov 27 17:26:18 malbec worker[77746]: receiving incremental file list
Nov 27 17:26:18 malbec worker[77746]: [Mon Nov 27 17:26:18 2017] [worker:info] Waiting for subprocess
Nov 27 17:26:18 malbec worker[77746]: sent 2,402 bytes received 1,532,075 bytes 1,022,984.67 bytes/sec
Nov 27 17:26:18 malbec worker[77746]: total size is 5,400,925,980 speedup is 3,519.72
Nov 27 17:26:18 malbec worker[77746]: [Mon Nov 27 17:26:18 2017] [worker:info] Waiting for subprocess
Nov 27 17:26:19 malbec worker[77746]: [Mon Nov 27 17:26:19 2017] [worker:info] 59072: WORKING 1274999
Nov 27 18:00:55 malbec worker[77746]: [Mon Nov 27 18:00:55 2017] [worker:info] cleaning up 01274999-sle-12-SP3-Server-DVD-Incidents-Minimal-ppc64le-Build:5745:grub2.1511796549-qam-minimal@ppc64le
Nov 27 18:10:58 malbec worker[77746]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7efc170)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
- Subject changed from [tools] Worker loop dies during job setup (sporadic) to [tools] Worker loop dies during job setup
Today happened as well, but the job is stuck on running this time:
Nov 27 15:59:40 openqaworker10 worker[2027]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7a126c8)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
Nov 27 15:49:50 openqaworker10 worker[2027]: [Mon Nov 27 15:49:50 2017] [worker:info] cleaning up 01274588-sle-12-SP3-Server-DVD-Incidents-x86_64-Build:4905:nfs-utils.1511789388-create_hdd_minimal_base+sdk@64bit
Nov 27 15:12:20 openqaworker10 qemu-system-x86_64[11410]: looking for plugins in '/usr/lib64/sasl2', failed to open directory, error: No such file or directory
Nov 27 15:12:17 openqaworker10 worker[2027]: [Mon Nov 27 15:12:17 2017] [worker:info] 11348: WORKING 1274588
Nov 27 15:12:16 openqaworker10 worker[2027]: [Mon Nov 27 15:12:16 2017] [worker:info] Waiting for subprocess
Nov 27 15:12:16 openqaworker10 worker[2027]: total size is 5,399,209,103 speedup is 3,529.49
Removing (sporadic) since we are seeing this more frequently since this week
- Priority changed from Normal to High
- Related to action #18164: [devops][tools] monitoring of openqa worker instances added
To some extend maybe related to poo#27832?
szarate wrote:
To some extend maybe related to poo#27832?
Definetly poo#27832 would help to have more informations/better status about workers.
The worker after it comes with that error appears almost dead and to restart takes the full timeout time, so likely means it's not being shutten down by sigterm signals sent by systemd; i'm more worried if also https://github.com/os-autoinst/openQA/pull/1467/commits/6fada432251f7ed39fd2c51b9c3c47ae6c8318b0 (read the comment) is related somehow, tricky to exclude.
I would go to treat such kinda of error, and prevent the worker to just die meanwhile.
- Status changed from New to In Progress
- Assignee set to EDiGiacinto
- Status changed from In Progress to Feedback
Again happened yesterday:
Nov 30 21:31:52 openqaworker2 worker[3445]: [Thu Nov 30 21:31:52 2017] [worker:info] Waiting for subprocess
Nov 30 21:31:52 openqaworker2 worker[3445]: [Thu Nov 30 21:31:52 2017] [worker:info] Waiting for subprocess
Nov 30 21:31:53 openqaworker2 worker[3445]: [Thu Nov 30 21:31:53 2017] [worker:info] 5898: WORKING 1283324
Nov 30 22:05:35 openqaworker2 worker[3445]: can't open /var/lib/openqa/pool/13/testresults/result-consoletest_setup.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 759.
Nov 30 22:05:57 openqaworker2 worker[3445]: [Thu Nov 30 22:05:57 2017] [worker:info] cleaning up 01283324-sle-15-Installer-DVD-x86_64-Build360.1-gnome@svirt-xen-pv
Nov 30 22:13:24 openqaworker2 worker[3445]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7e930a8)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
- Related to action #28714: [tools] Investigate why sporadically job is set to scalar value of the reference instead of the reference itself. added
- Status changed from Feedback to Resolved
Setting this as resolved, the workaround should be enough for the moment, reopen if necessary.
Since it would require further investigation and debugging on why this actually happens - filed a new issue to track updates on that.
- Subject changed from [tools] Worker loop dies during job setup to [tools][bonus][Sprint 201711.2] Worker loop dies during job setup
Adding the bonus tag, as it was not planned.
- Description updated (diff)
Also available in: Atom
PDF