action #28714
closed
[tools] Investigate why sporadically job is set to scalar value of the reference instead of the reference itself.
0%
Description
It seems that under certain conditions (possibly websocket connection turned down) the worker sets the job to an invalid value.
Logs of this happening can be seen in #28355. Currently we avoid it by not starting invalid jobs, but this should not happen in the first place: the job goes from assigned back to scheduled, which can cause problems, e.g. wrt MM clusters.
ACs:
- Investigate, verify that it still happens, and fix it properly, as #28355 is a workaround
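The idea of not starting invalid jobs can be sketched as a type check on the job before it is used. This is only an illustration: `job_is_valid` is a hypothetical helper, not the worker's actual check, and requiring an `id` key is an assumption about what a usable job looks like.

```perl
use strict;
use warnings;

# Hypothetical helper (not the worker's real implementation): accept a job
# only if it is a real hash reference and carries an id.
sub job_is_valid {
    my ($job) = @_;
    return 0 unless ref($job) eq 'HASH';    # rejects undef and strings like "HASH(0x...)"
    return 0 unless defined $job->{id};     # assumed minimum content of a job
    return 1;
}

# Try a proper job, a stringified reference as seen in the logs, and undef.
for my $candidate ({id => 1}, "HASH(0xaf90080)", undef) {
    my $desc = defined $candidate ? "$candidate" : 'undef';
    printf "%-20s %s\n", $desc, job_is_valid($candidate) ? 'accepted' : 'rejected';
}
```

A check like this only prevents the crash; it does not explain where the reference was turned into a string.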
Updated by EDiGiacinto about 7 years ago
- Related to action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup added
Updated by EDiGiacinto over 6 years ago
- Related to coordination #32851: [tools][EPIC] Scheduling redesign added
Updated by EDiGiacinto over 6 years ago
- Description updated (diff)
- Category set to 122
- Priority changed from Normal to Low
- Target version set to Ready
Setting this to low priority and putting it in the ready queue, as we have a workaround for it. It is still a bit scary, though, as it can become a real problem (mostly for MM tests, as jumping back from assigned->scheduled makes things more complex) and the workaround hides it from the logs.
Updated by szarate over 6 years ago
So this seems to be happening:
Apr 05 10:21:48 QA-Power8-4-kvm worker[41841]: [info] quit due to signal TERM
Apr 05 10:21:48 QA-Power8-4-kvm worker[41841]: Mojo::Reactor::Poll: Timer failed: Can't use string ("HASH(0xaf90080)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 151.
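For reference, the error text is what Perl produces when a hash reference has been converted to its string form and is later dereferenced under `use strict`. The minimal sketch below reproduces the message in isolation; the job structure is made up, and it says nothing about where in openQA the stringification actually happens.

```perl
use strict;
use warnings;

# Made-up job structure for illustration only.
my $job = {id => 42, settings => {ARCH => 'ppc64le'}};

# Stringifying a reference (interpolation, concatenation, use as a hash key, ...)
# yields a plain string of the form "HASH(0x...)". The data is no longer reachable.
my $stringified = "$job";
print "stringified: $stringified\n";

# Dereferencing that string under strict refs dies with the message from the log.
my $error = '';
eval { my $settings = $stringified->{settings}; 1 } or $error = $@;
print "error: $error";
```

The address in the string is only a leftover of the original reference, so the job data cannot be recovered from it once this has happened.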
Updated by szarate over 6 years ago
@mudler's theory is that the string itself is coming from the webUI somehow.
Updated by EDiGiacinto over 6 years ago
For reference: https://openqa.suse.de/tests/1587182
Updated by EDiGiacinto over 6 years ago
PR opened with a temporary workaround: https://github.com/os-autoinst/openQA/pull/1618
Updated by EDiGiacinto over 6 years ago
Just saw this again:
Aug 01 12:03:04 openqaworker12 worker[26285]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x9584728)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 522.
This happened while I was testing new scheduler changes, but eventually the jobs went back to scheduled.
Updated by EDiGiacinto over 6 years ago
Happened once again:
Aug 15 18:56:30 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:56:08 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:55:49 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:55:31 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:55:12 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 14 17:37:13 openqaworker6 worker[13360]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x9ec9610)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 522.
Updated by mkittler almost 6 years ago
- Status changed from New to In Progress
- Assignee set to mkittler
- Target version changed from Ready to 445
Note that line 522 is now 529:
sub start_job {
    my ($host) = @_;
    return _reset_state unless verify_job;
    # block the job from having dangerous settings (isotovideo specific though)
    # it needs to come from worker_settings
->  delete $job->{settings}->{GENERAL_HW_CMD_DIR};
    # add_log_channel('worker', path => 'worker-log.txt', level => $worker_settings->{LOG_LEVEL} // 'info');
    # update settings with worker-specific stuff
    copy_job_settings($job, $worker_settings);
Updated by mkittler almost 6 years ago
Updated by mkittler almost 6 years ago
- Target version changed from 445 to Current Sprint
Updated by mkittler almost 6 years ago
- Status changed from In Progress to Resolved
Not sure whether we still see this in production. If we observe it again, we can reopen the ticket. The PR has been merged, so we should have slightly better debug output.