action #28714

[tools] Investigate why the job is sporadically set to the scalar value of the reference instead of the reference itself.

Added by EDiGiacinto over 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2017-12-01
Due date:
% Done: 0%
Estimated time:
Difficulty:
Duration:

Description

It seems that under certain conditions (possibly when the websocket connection goes down) the worker sets the job to an invalid value.

Logs of this happening can be seen in #28355. Currently we avoid it by not starting invalid jobs, but this should not happen in the first place: the job goes from assigned back to scheduled, which can cause problems, e.g. with regard to MM clusters.

ACs:

  • Investigate, verify that it still happens, and fix it properly, as #28355 is only a workaround

Related issues

Related to openQA Project - action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup (Resolved, 2017-11-24)

Related to openQA Project - action #32851: [tools][EPIC] Scheduling redesign (Resolved, 2018-05-05)

History

#1 Updated by EDiGiacinto over 2 years ago

  • Related to action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup added

#2 Updated by EDiGiacinto over 2 years ago

  • Description updated (diff)

#3 Updated by EDiGiacinto over 2 years ago

  • Related to action #32851: [tools][EPIC] Scheduling redesign added

#4 Updated by EDiGiacinto about 2 years ago

  • Description updated (diff)

#5 Updated by EDiGiacinto about 2 years ago

  • Description updated (diff)
  • Category set to 122
  • Priority changed from Normal to Low
  • Target version set to Ready

Setting this to low priority and putting it in the ready queue, as we have a workaround for it. It is still a bit scary, though: it can become a real problem (mostly for MM tests, as jumping back from assigned to scheduled makes things more complex), and the workaround hides it from the logs.

#6 Updated by szarate about 2 years ago

So this seems to be happening:

Apr 05 10:21:48 QA-Power8-4-kvm worker[41841]: [info] quit due to signal TERM
Apr 05 10:21:48 QA-Power8-4-kvm worker[41841]: Mojo::Reactor::Poll: Timer failed: Can't use string ("HASH(0xaf90080)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 151.
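For context, the message in the log above is what Perl reports when a hash reference has been turned into a string and that string is then dereferenced under "use strict". The following is only a minimal sketch of that mechanism; the job data is hypothetical and this is not the worker code:

```perl
use strict;
use warnings;

# Hypothetical job data; the real structure in openQA differs.
my $job = { id => 42, settings => { ARCH => 'ppc64le' } };

# Any string context turns the reference into text of the form "HASH(0x...)".
my $stringified = "$job";

# Dereferencing that text would be a symbolic reference, which "strict refs"
# forbids, so Perl dies with the error seen in the worker log.
my $error = '';
eval { my $settings = $stringified->{settings}; 1 } or $error = $@;

print "stringified: $stringified\n";
print "error: $error";
```

Once the value has been stringified, the original hash cannot be recovered from the text, so the fix has to be at the place where the stringification happens.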

#7 Updated by szarate about 2 years ago

@mudler's theory is that the string itself is somehow coming from the webUI.
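One hypothetical way such a string could arrive from the other side is a reference being stringified before the message is serialized. The sketch below uses the core JSON::PP module; the message layout is an assumption made for illustration and is not taken from the openQA websocket protocol:

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json decode_json);

# Hypothetical job data.
my $job = { id => 42, settings => {} };

# Sender side (assumed): the reference is interpolated into a string
# before the message is serialized.
my $message = encode_json({ job => "$job" });

# Receiver side: the payload decodes cleanly, but "job" is now plain text.
my $received = decode_json($message);
print ref($received->{job})
  ? "job is a reference\n"
  : "job is a plain string: $received->{job}\n";
```

If this were the cause, the raw message received by the worker would already contain the "HASH(0x...)" text, which could be checked by logging the payload before it is decoded.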

#9 Updated by EDiGiacinto about 2 years ago

PR opened with a temporary workaround: https://github.com/os-autoinst/openQA/pull/1618

#10 Updated by EDiGiacinto almost 2 years ago

Just saw this again:

Aug 01 12:03:04 openqaworker12 worker[26285]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x9584728)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 522.

This happened while I was testing the new scheduler changes; the jobs eventually went back to scheduled.

#11 Updated by EDiGiacinto almost 2 years ago

Happened once again:

Aug 15 18:56:30 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:56:08 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:55:49 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:55:31 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:55:12 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 14 17:37:13 openqaworker6 worker[13360]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x9ec9610)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 522.

#12 Updated by mkittler over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint - kernel

Note that line 522 is now line 529 (marked with -> below):

sub start_job {
    my ($host) = @_;

    return _reset_state unless verify_job;
    # block the job from having dangerous settings (isotovideo specific though)
    # it needs to come from worker_settings
->  delete $job->{settings}->{GENERAL_HW_CMD_DIR};
    # add_log_channel('worker', path => 'worker-log.txt', level => $worker_settings->{LOG_LEVEL} // 'info');

    # update settings with worker-specific stuff
    copy_job_settings($job, $worker_settings);
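A check of the kind that has to pass before the marked line can run safely might look like the sketch below. It is not openQA's verify_job; the helper name looks_like_job and the job data are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of openQA: true only for a real hash
# reference that also carries a settings hash.
sub looks_like_job {
    my ($candidate) = @_;
    return 0 unless ref($candidate) eq 'HASH';
    return 0 unless ref($candidate->{settings}) eq 'HASH';
    return 1;
}

my $good = { id => 1, settings => { GENERAL_HW_CMD_DIR => '/tmp' } };
my $bad  = "$good";    # stringified reference of the form "HASH(0x...)"

for my $candidate ($good, $bad) {
    if (looks_like_job($candidate)) {
        # Safe to dereference here.
        delete $candidate->{settings}->{GENERAL_HW_CMD_DIR};
        print "job accepted\n";
    }
    else {
        # Log the raw value so the origin of the bad value can be traced.
        print "job rejected: ", (defined $candidate ? $candidate : 'undef'), "\n";
    }
}
```

Such a guard only avoids the crash. Logging the rejected value is what keeps the underlying problem visible instead of hiding it.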

#14 Updated by mkittler over 1 year ago

  • Target version changed from Current Sprint - kernel to Current Sprint

#15 Updated by mkittler over 1 year ago

  • Status changed from In Progress to Resolved

Not sure whether we still see this in production. If we observe it again, we can reopen the ticket. The PR has been merged, so we should have slightly better debug output.
