Actions
action #45275
closedFix bug in worker status update and improve error handling
Start date:
2018-12-18
Due date:
% Done:
0%
Estimated time:
Description
It seems like the worker upload / status update is buggy and requires improvements from both sides.
On the worker side it looks like this (observed on openqaworker3):
Dez 10 17:48:02 openqaworker3 worker[21703]: [info] uploading worker-log.txt
Dez 10 17:48:02 openqaworker3 worker[21703]: Use of uninitialized value $test in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 915.
Dez 10 17:48:02 openqaworker3 worker[21703]: Use of uninitialized value in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 970.
Dez 10 17:48:02 openqaworker3 worker[21703]: [error] 500 response: Internal Server Error (remaining tries: 2)
Dez 10 17:48:07 openqaworker3 worker[21703]: [error] 500 response: Internal Server Error (remaining tries: 1)
Dez 10 17:48:12 openqaworker3 worker[21703]: [error] 500 response: Internal Server Error (remaining tries: 0)
Dez 10 17:48:12 openqaworker3 worker[21703]: [error] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Dez 10 17:48:22 openqaworker3 worker[21703]: [info] Collected unknown process with pid 3661 and exit status: 0
Dez 10 17:48:22 openqaworker3 worker[21703]: [info] registering worker openqaworker3 version 13 with openQA openqa.suse.de using protocol version [1]
Dez 12 10:22:28 openqaworker3 worker[21703]: [warn] Received command cancel for different job id 2323020 (our 2318720). Ignoring!
The job is abandoned as expected after the web UI doesn't behave as expected. However, the job ID isn't cleared and the worker is basically stuck and unable to do further work until restarting it manually.
On the web UI side like this (observed on OSD):
[2018-12-10T17:48:02.0238 CET] [error] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: null value in column "name" violates not-null constraint
DETAIL: Failing row contains (48873251, 2318720, null, tests/qa_automation/execute_test_run.pm, null, 2018-12-10 16:48:02, 2018-12-10 16:48:02, none, 0, 0, 0, 0). [for Statement "INSERT INTO job_modules ( category, job_id, name, script, t_created, t_updated) VALUES ( ?, ?, ?, ?, ?, ? ) RETURNING id" with ParamValues: 1=undef, 2='2318720', 3=undef, 4='tests/qa_automation/execute_test_run.pm', 5='2018-12-10 16:48:02', 6='2018-12-10 16:48:02'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line 170
[2018-12-10T17:48:07.0315 CET] [error] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: null value in column "name" violates not-null constraint
DETAIL: Failing row contains (48873252, 2318720, null, tests/qa_automation/execute_test_run.pm, null, 2018-12-10 16:48:07, 2018-12-10 16:48:07, none, 0, 0, 0, 0). [for Statement "INSERT INTO job_modules ( category, job_id, name, script, t_created, t_updated) VALUES ( ?, ?, ?, ?, ?, ? ) RETURNING id" with ParamValues: 1=undef, 2='2318720', 3=undef, 4='tests/qa_automation/execute_test_run.pm', 5='2018-12-10 16:48:07', 6='2018-12-10 16:48:07'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line 170
[2018-12-10T17:48:11.0596 CET] [info] Worker 10333 stopped
[2018-12-10T17:48:11.0610 CET] [info] Worker 27101 started
[2018-12-10T17:48:12.0389 CET] [error] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: null value in column "name" violates not-null constraint
DETAIL: Failing row contains (48873253, 2318720, null, tests/qa_automation/execute_test_run.pm, null, 2018-12-10 16:48:12, 2018-12-10 16:48:12, none, 0, 0, 0, 0). [for Statement "INSERT INTO job_modules ( category, job_id, name, script, t_created, t_updated) VALUES ( ?, ?, ?, ?, ?, ? ) RETURNING id" with ParamValues: 1=undef, 2='2318720', 3=undef, 4='tests/qa_automation/execute_test_run.pm', 5='2018-12-10 16:48:12', 6='2018-12-10 16:48:12'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line 170
It looks like we try to insert a job module which already exists. So far I'm not sure why this is the case. For sure the error handling for this should be improved.
Actions