action #28714

closed

[tools] Investigate why the job is sporadically set to the scalar value of the reference instead of the reference itself

Added by EDiGiacinto almost 7 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2017-12-01
Due date:
% Done: 0%
Estimated time:

Description

It seems that under certain conditions (possibly when the websocket connection goes down) the worker sets the job to an invalid value.

Logs of this happening can be seen in #28355. Currently we avoid it by not starting invalid jobs, but this should not happen in the first place: the job goes from assigned back to scheduled, which can cause problems, e.g. with regard to MM clusters.
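
The workaround of refusing to start on an invalid job could look roughly like the sketch below. This is an illustration only: `is_valid_job` is a hypothetical helper, not the actual check in OpenQA::Worker::Jobs, and the assumption that a valid job has an `id` and a `settings` hash is mine.

```perl
use strict;
use warnings;

# Hypothetical helper (not the real openQA code): accept a job only if it
# is a hash reference carrying an id and a settings hash.
sub is_valid_job {
    my ($job) = @_;

    # A stringified reference such as "HASH(0x9584728)" is a plain scalar,
    # so ref() returns an empty string for it.
    return 0 unless ref($job) eq 'HASH';
    return 0 unless defined $job->{id};
    return 0 unless ref($job->{settings}) eq 'HASH';
    return 1;
}

print is_valid_job({id => 1, settings => {}}) ? "valid\n" : "invalid\n";
print is_valid_job("HASH(0x9584728)")         ? "valid\n" : "invalid\n";
```

A check like this only keeps the worker from dying. It does not explain where the string comes from, which is what this ticket is about.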

ACs:

  • Investigate, verify that it still happens, and fix it properly, as #28355 is only a workaround

Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup (Resolved, EDiGiacinto, 2017-11-24)
Related to openQA Project - coordination #32851: [tools][EPIC] Scheduling redesign (Resolved, okurz, 2018-05-05)

Actions #1

Updated by EDiGiacinto almost 7 years ago

  • Related to action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup added
Actions #2

Updated by EDiGiacinto almost 7 years ago

  • Description updated (diff)
Actions #3

Updated by EDiGiacinto over 6 years ago

Actions #4

Updated by EDiGiacinto over 6 years ago

  • Description updated (diff)
Actions #5

Updated by EDiGiacinto over 6 years ago

  • Description updated (diff)
  • Category set to 122
  • Priority changed from Normal to Low
  • Target version set to Ready

Setting this to low priority and putting it in the ready queue, since we have a workaround for it. It is still a bit scary, though: it can become a real problem (mostly for MM tests, as jumping back from assigned to scheduled makes things more complex), and the workaround hides it from the logs.

Actions #6

Updated by szarate over 6 years ago

So this seems to be happening:

Apr 05 10:21:48 QA-Power8-4-kvm worker[41841]: [info] quit due to signal TERM
Apr 05 10:21:48 QA-Power8-4-kvm worker[41841]: Mojo::Reactor::Poll: Timer failed: Can't use string ("HASH(0xaf90080)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 151.
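
For reference, the error message itself can be reproduced with a few lines of standalone Perl. This is a minimal sketch that only shows what the message means; it makes no claim about how the worker ends up with the string, and the job contents are made up.

```perl
use strict;
use warnings;

# A job as the worker would normally hold it: a reference to a hash.
# The contents here are invented for the example.
my $job = {id => 42, settings => {GENERAL_HW_CMD_DIR => '/tmp'}};

# Anything that forces string context (interpolation, use as a hash key,
# concatenation) turns the reference into plain text like "HASH(0x...)".
my $stringified = "$job";

# Dereferencing that text dies under "strict refs", as in the log above.
my $error = '';
eval { my $settings = $stringified->{settings}; 1 } or $error = $@;

print "stringified: $stringified\n";
print "error: $error";
```

Once the reference has been turned into a string, the original hash cannot be recovered from it, so the fix has to be at the place where the stringification happens.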

Actions #7

Updated by szarate over 6 years ago

@mudler's theory is that the string itself is somehow coming from the webUI.

Actions #9

Updated by EDiGiacinto over 6 years ago

PR opened with a temporary workaround: https://github.com/os-autoinst/openQA/pull/1618

Actions #10

Updated by EDiGiacinto over 6 years ago

Just saw this again:

Aug 01 12:03:04 openqaworker12 worker[26285]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x9584728)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 522.

This happened while I was testing new scheduler changes, but eventually the jobs went back to scheduled.

Actions #11

Updated by EDiGiacinto over 6 years ago

Happened once again:

Aug 15 18:56:30 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:56:08 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:55:49 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:55:31 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 15 18:55:12 openqaworker6 worker[13360]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. proxy_wstunnel enabled?
Aug 14 17:37:13 openqaworker6 worker[13360]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("HASH(0x9ec9610)") as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 522.
Actions #12

Updated by mkittler almost 6 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version changed from Ready to 445

Note that line 522 is now 529:

sub start_job {
    my ($host) = @_;

    return _reset_state unless verify_job;
    # block the job from having dangerous settings (isotovideo specific though)
    # it needs to come from worker_settings
->  delete $job->{settings}->{GENERAL_HW_CMD_DIR};
    # add_log_channel('worker', path => 'worker-log.txt', level => $worker_settings->{LOG_LEVEL} // 'info');

    # update settings with worker-specific stuff
    copy_job_settings($job, $worker_settings);
Actions #14

Updated by mkittler almost 6 years ago

  • Target version changed from 445 to Current Sprint
Actions #15

Updated by mkittler almost 6 years ago

  • Status changed from In Progress to Resolved

Not sure whether we still see this in production. If we observe it again, we can reopen the ticket. The PR has been merged, so we should have slightly better debug output.
