Project

General

Profile

Actions

action #45275

closed

Fix bug in worker status update and improve error handling

Added by mkittler over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2018-12-18
Due date:
% Done:

0%

Estimated time:

Description

It seems like the worker upload / status update is buggy and requires improvements from both sides.

On the worker side it looks like this (observed on openqaworker3):

Dez 10 17:48:02 openqaworker3 worker[21703]: [info] uploading worker-log.txt
Dez 10 17:48:02 openqaworker3 worker[21703]: Use of uninitialized value $test in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 915.
Dez 10 17:48:02 openqaworker3 worker[21703]: Use of uninitialized value in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 970.
Dez 10 17:48:02 openqaworker3 worker[21703]: [error] 500 response: Internal Server Error (remaining tries: 2)
Dez 10 17:48:07 openqaworker3 worker[21703]: [error] 500 response: Internal Server Error (remaining tries: 1)
Dez 10 17:48:12 openqaworker3 worker[21703]: [error] 500 response: Internal Server Error (remaining tries: 0)
Dez 10 17:48:12 openqaworker3 worker[21703]: [error] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Dez 10 17:48:22 openqaworker3 worker[21703]: [info] Collected unknown process with pid 3661 and exit status: 0
Dez 10 17:48:22 openqaworker3 worker[21703]: [info] registering worker openqaworker3 version 13 with openQA openqa.suse.de using protocol version [1]
Dez 12 10:22:28 openqaworker3 worker[21703]: [warn] Received command cancel for different job id 2323020 (our 2318720). Ignoring!

The job is abandoned as expected after the web UI doesn't behave as expected. However, the job ID isn't cleared and the worker is basically stuck and unable to do further work until restarting it manually.

On the web UI side like this (observed on OSD):

[2018-12-10T17:48:02.0238 CET] [error] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  null value in column "name" violates not-null constraint
DETAIL:  Failing row contains (48873251, 2318720, null, tests/qa_automation/execute_test_run.pm, null, 2018-12-10 16:48:02, 2018-12-10 16:48:02, none, 0, 0, 0, 0). [for Statement "INSERT INTO job_modules ( category, job_id, name, script, t_created, t_updated) VALUES ( ?, ?, ?, ?, ?, ? ) RETURNING id" with ParamValues: 1=undef, 2='2318720', 3=undef, 4='tests/qa_automation/execute_test_run.pm', 5='2018-12-10 16:48:02', 6='2018-12-10 16:48:02'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line 170

[2018-12-10T17:48:07.0315 CET] [error] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  null value in column "name" violates not-null constraint
DETAIL:  Failing row contains (48873252, 2318720, null, tests/qa_automation/execute_test_run.pm, null, 2018-12-10 16:48:07, 2018-12-10 16:48:07, none, 0, 0, 0, 0). [for Statement "INSERT INTO job_modules ( category, job_id, name, script, t_created, t_updated) VALUES ( ?, ?, ?, ?, ?, ? ) RETURNING id" with ParamValues: 1=undef, 2='2318720', 3=undef, 4='tests/qa_automation/execute_test_run.pm', 5='2018-12-10 16:48:07', 6='2018-12-10 16:48:07'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line 170

[2018-12-10T17:48:11.0596 CET] [info] Worker 10333 stopped
[2018-12-10T17:48:11.0610 CET] [info] Worker 27101 started
[2018-12-10T17:48:12.0389 CET] [error] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  null value in column "name" violates not-null constraint
DETAIL:  Failing row contains (48873253, 2318720, null, tests/qa_automation/execute_test_run.pm, null, 2018-12-10 16:48:12, 2018-12-10 16:48:12, none, 0, 0, 0, 0). [for Statement "INSERT INTO job_modules ( category, job_id, name, script, t_created, t_updated) VALUES ( ?, ?, ?, ?, ?, ? ) RETURNING id" with ParamValues: 1=undef, 2='2318720', 3=undef, 4='tests/qa_automation/execute_test_run.pm', 5='2018-12-10 16:48:12', 6='2018-12-10 16:48:12'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line 170

It looks like we try to insert a job module which already exists. So far I'm not sure why this is the case. For sure the error handling for this should be improved.

Actions #1

Updated by mkittler over 5 years ago

  • Description updated (diff)
Actions #2

Updated by mkittler over 5 years ago

Another similar problem observed on openqaw2:

Dez 10 18:02:54 openqaw2 worker[31413]: [info] end time: 2018-12-10 17:02:54
Dez 10 18:02:54 openqaw2 worker[31413]: [info] result: done
Dez 10 18:02:54 openqaw2 worker[31413]: [info] uploading video.ogv
Dez 10 18:02:54 openqaw2 worker[31413]: [error] ERROR video.ogv: 404 response: Not Found
Dez 10 18:02:54 openqaw2 worker[31413]: [error] 400 response: Bad Request (remaining tries: 2)
Dez 10 18:02:59 openqaw2 worker[31413]: [error] 400 response: Bad Request (remaining tries: 1)
Dez 10 18:03:04 openqaw2 worker[31413]: [error] 400 response: Bad Request (remaining tries: 0)
Dez 10 18:03:04 openqaw2 worker[31413]: [error] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Dez 10 18:03:14 openqaw2 worker[31413]: [info] registering worker openqaw2 version 13 with openQA openqa.suse.de using protocol version [1]
Dez 12 10:40:29 openqaw2 worker[31413]: [warn] Received command cancel for different job id 2324113 (our 2318750). Ignoring!

Here the error is different but the bad handling of the error is the same.

Actions #3

Updated by mkittler over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint
Actions #4

Updated by mkittler over 5 years ago

Updated the PR, see https://github.com/os-autoinst/openQA/pull/1940#issuecomment-450902783.

The errors on the web UI side are likely caused by the worker sending wrong/insufficient data which violates the not-null constraint. So likely also a worker issue but it is hard to tell what went wrong in this case.

Actions #5

Updated by mkittler over 5 years ago

  • Status changed from In Progress to Resolved

PR has been merged

Actions

Also available in: Atom PDF