action #28355
closed - [tools][bonus][Sprint 201711.2] Worker loop dies during job setup
0%
Description
Today this happened on openqaworker8:
Nov 23 17:31:28 openqaworker8 worker[25677]: total size is 5,382,757,930 speedup is 2,216.76
Nov 23 17:31:28 openqaworker8 worker[25677]: [Thu Nov 23 17:31:28 2017] [worker:info] Waiting for subprocess
Nov 23 17:31:29 openqaworker8 worker[25677]: [Thu Nov 23 17:31:29 2017] [worker:info] 64461: WORKING 1269063
Nov 23 17:31:30 openqaworker8 qemu-system-x86_64[64486]: looking for plugins in '/usr/lib64/sasl2', failed to open directory, error: No such file or directory
Nov 23 17:41:30 openqaworker8 worker[25677]: [Thu Nov 23 17:41:30 2017] [worker:info] cleaning up 01269063-sle-15-Installer-DVD-x86_64-Build349.1-remote_vnc_target_nfs@64bit
Nov 23 17:46:33 openqaworker8 worker[25677]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7fe59a8)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
Nov 24 09:17:23 openqaworker8 worker[25677]: [Fri Nov 24 09:17:23 2017] [worker:warn] Job ids match but current host not set
Since then, the worker has been stuck.
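For context, this is the usual Perl failure mode behind that message: once a hash reference has been stringified (to something like "HASH(0x7fe59a8)"), dereferencing it again dies under "strict refs". A minimal sketch, not the actual worker code, with a hypothetical $job variable standing in for whatever Jobs.pm line 456 dereferences:

use strict;
use warnings;

# Hypothetical stand-in for the worker's job state.
my $job = { id => 1269063, state => 'running' };

# If the reference is ever turned into its string form ("HASH(0x...)"),
# for example by string interpolation or a round-trip through plain text...
my $stringified = "$job";

# ...any later attempt to dereference it dies with exactly this message:
# Can't use string ("HASH(0x...)") as a HASH ref while "strict refs" in use
my $id = $stringified->{id};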
Updated by EDiGiacinto about 7 years ago
The same happened on another worker instance:
Nov 23 17:29:14 openqaw2 worker[16151]: [Thu Nov 23 17:29:14 2017] [worker:info] got job 1268880: 01268880-sle-15-Installer-DVD-s390x-Build349.1-xfs@s390x-kvm
Nov 23 17:29:14 openqaw2 worker[16151]: [Thu Nov 23 17:29:14 2017] [worker:info] OpenQA::Worker::Cache: Initialized with openqa.suse.de at /var/lib/openqa/cache, current size is 53064695808
Nov 23 17:29:14 openqaw2 worker[16151]: [Thu Nov 23 17:29:14 2017] [worker:info] Waiting for subprocess
Nov 23 17:29:14 openqaw2 worker[16151]: receiving incremental file list
Nov 23 17:29:15 openqaw2 worker[16151]: [Thu Nov 23 17:29:15 2017] [worker:info] Waiting for subprocess
Nov 23 17:29:15 openqaw2 worker[16151]: sent 2,409 bytes received 1,497,146 bytes 999,703.33 bytes/sec
Nov 23 17:29:15 openqaw2 worker[16151]: total size is 5,382,122,337 speedup is 3,589.15
Nov 23 17:29:15 openqaw2 worker[16151]: [Thu Nov 23 17:29:15 2017] [worker:info] 3148: WORKING 1268880
Nov 23 17:41:17 openqaw2 worker[16151]: [Thu Nov 23 17:41:17 2017] [worker:info] cleaning up 01268880-sle-15-Installer-DVD-s390x-Build349.1-xfs@s390x-kvm
Nov 23 17:46:33 openqaw2 worker[16151]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x79ffa80)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
Updated by EDiGiacinto almost 7 years ago
The same happened on malbec:
Nov 27 17:26:18 malbec worker[77746]: receiving incremental file list
Nov 27 17:26:18 malbec worker[77746]: [Mon Nov 27 17:26:18 2017] [worker:info] Waiting for subprocess
Nov 27 17:26:18 malbec worker[77746]: sent 2,402 bytes received 1,532,075 bytes 1,022,984.67 bytes/sec
Nov 27 17:26:18 malbec worker[77746]: total size is 5,400,925,980 speedup is 3,519.72
Nov 27 17:26:18 malbec worker[77746]: [Mon Nov 27 17:26:18 2017] [worker:info] Waiting for subprocess
Nov 27 17:26:19 malbec worker[77746]: [Mon Nov 27 17:26:19 2017] [worker:info] 59072: WORKING 1274999
Nov 27 18:00:55 malbec worker[77746]: [Mon Nov 27 18:00:55 2017] [worker:info] cleaning up 01274999-sle-12-SP3-Server-DVD-Incidents-Minimal-ppc64le-Build:5745:grub2.1511796549-qam-minimal@ppc64le
Nov 27 18:10:58 malbec worker[77746]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7efc170)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
Updated by EDiGiacinto almost 7 years ago
- Subject changed from [tools] Worker loop dies during job setup (sporadic) to [tools] Worker loop dies during job setup
It happened again today, but this time the job is stuck in the running state:
Nov 27 15:12:16 openqaworker10 worker[2027]: total size is 5,399,209,103 speedup is 3,529.49
Nov 27 15:12:16 openqaworker10 worker[2027]: [Mon Nov 27 15:12:16 2017] [worker:info] Waiting for subprocess
Nov 27 15:12:17 openqaworker10 worker[2027]: [Mon Nov 27 15:12:17 2017] [worker:info] 11348: WORKING 1274588
Nov 27 15:12:20 openqaworker10 qemu-system-x86_64[11410]: looking for plugins in '/usr/lib64/sasl2', failed to open directory, error: No such file or directory
Nov 27 15:49:50 openqaworker10 worker[2027]: [Mon Nov 27 15:49:50 2017] [worker:info] cleaning up 01274588-sle-12-SP3-Server-DVD-Incidents-x86_64-Build:4905:nfs-utils.1511789388-create_hdd_minimal_base+sdk@64bit
Nov 27 15:59:40 openqaworker10 worker[2027]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7a126c8)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
Removing "(sporadic)" from the subject since we have been seeing this more frequently this week.
Updated by szarate almost 7 years ago
- Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Updated by szarate almost 7 years ago
To some extent, maybe related to poo#27832?
Updated by EDiGiacinto almost 7 years ago
szarate wrote:
To some extent, maybe related to poo#27832?
Definitely, poo#27832 would help to get more information and better status reporting about the workers.
After the worker hits that error it appears almost dead, and restarting it takes the full timeout, which likely means it is not being shut down by the SIGTERM signals sent by systemd. I'm more worried about whether https://github.com/os-autoinst/openQA/pull/1467/commits/6fada432251f7ed39fd2c51b9c3c47ae6c8318b0 (read the comment) is also related somehow; that is tricky to exclude.
For now I would handle this kind of error and prevent the worker from simply dying in the meantime.
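As a rough illustration of that idea (a hypothetical sketch, not the content of the eventual fix), a job-handling callback running inside the event loop can be wrapped in an eval so an unexpected exception is logged instead of killing the whole worker loop:

use strict;
use warnings;
use Mojo::IOLoop;

# Hypothetical helper: run a callback inside eval so a single failure
# (such as the bogus "HASH(0x...)" string) is logged instead of taking
# down the worker's event loop.
sub guarded {
    my ($name, $cb) = @_;
    return sub {
        my @args = @_;
        eval { $cb->(@args); 1 } or warn "handler '$name' failed: $@";
    };
}

# Example: a recurring timer on the Mojo::IOLoop reactor whose callback
# can no longer kill the loop by dying.
Mojo::IOLoop->recurring(5 => guarded(check_job => sub {
    # ... job status handling would go here ...
}));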
Updated by EDiGiacinto almost 7 years ago
- Status changed from New to In Progress
- Assignee set to EDiGiacinto
Updated by EDiGiacinto almost 7 years ago
- Status changed from In Progress to Feedback
PR: https://github.com/os-autoinst/openQA/pull/1514. This should prevent workers from dying in the first place.
Updated by EDiGiacinto almost 7 years ago
It happened again yesterday:
Nov 30 21:31:52 openqaworker2 worker[3445]: [Thu Nov 30 21:31:52 2017] [worker:info] Waiting for subprocess
Nov 30 21:31:52 openqaworker2 worker[3445]: [Thu Nov 30 21:31:52 2017] [worker:info] Waiting for subprocess
Nov 30 21:31:53 openqaworker2 worker[3445]: [Thu Nov 30 21:31:53 2017] [worker:info] 5898: WORKING 1283324
Nov 30 22:05:35 openqaworker2 worker[3445]: can't open /var/lib/openqa/pool/13/testresults/result-consoletest_setup.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 759.
Nov 30 22:05:57 openqaworker2 worker[3445]: [Thu Nov 30 22:05:57 2017] [worker:info] cleaning up 01283324-sle-15-Installer-DVD-x86_64-Build360.1-gnome@svirt-xen-pv
Nov 30 22:13:24 openqaworker2 worker[3445]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x7e930a8)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 456.
Updated by EDiGiacinto almost 7 years ago
- Related to action #28714: [tools] Investigate why sporadically job is set to scalar value of the reference instead of the reference itself. added
Updated by EDiGiacinto almost 7 years ago
- Status changed from Feedback to Resolved
Setting this as resolved; the workaround should be enough for the moment. Reopen if necessary.
Since it would require further investigation and debugging to find out why this actually happens, I filed a new issue to track updates on that.
Updated by szarate almost 7 years ago
- Subject changed from [tools] Worker loop dies during job setup to [tools][bonus][Sprint 201711.2] Worker loop dies during job setup
Adding the bonus tag, as it was not planned.