Project

General

Profile

Actions

action #13814

closed

dead job is still too eager

Added by coolo over 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2016-09-19
Due date:
% Done:

0%

Estimated time:

Description

https://openqa.opensuse.org/tests/265989 is incomplete as dead. Trying to figure out why

Actions #1

Updated by coolo over 8 years ago

[Mon Sep 19 13:42:41 2016] [websockets:debug] [DBIx debug] Took 0.00182104 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:42:41', '104'.
[Mon Sep 19 13:42:46 2016] [websockets:debug] [DBIx debug] Took 0.00141501 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:42:46', '104'.
[Mon Sep 19 13:42:51 2016] [websockets:debug] [DBIx debug] Took 0.00140905 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:42:51', '104'.
[Mon Sep 19 13:42:56 2016] [websockets:debug] [DBIx debug] Took 0.00981402 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:42:56', '104'.
[Mon Sep 19 13:42:56 2016] [4195:debug] [DBIx debug] Took 0.01209998 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:42:56', '104'.
[Mon Sep 19 13:43:01 2016] [websockets:debug] [DBIx debug] Took 0.00182796 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:43:01', '104'.
[Mon Sep 19 13:43:06 2016] [websockets:debug] [DBIx debug] Took 0.00152612 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:43:06', '104'.
[Mon Sep 19 13:43:51 2016] [websockets:debug] job considered dead: 265989
[Mon Sep 19 13:43:51 2016] [websockets:debug] [DBIx debug] Took 0.00207400 seconds executed: UPDATE workers SET job_id = ?, t_updated = ? WHERE ( id = ? ): NULL, '2016-09-19 11:43:51', '104'.

So there was quite some silence from the worker - but the websocket server was active, it's not him.

Actions #2

Updated by coolo over 8 years ago

OK, it wasn't silent - it was talking to the other side:

[Mon Sep 19 13:43:06 2016] [4195:debug] POST "/api/v1/jobs/265989/status"
[Mon Sep 19 13:43:12 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:12 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:19 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:19 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:25 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:25 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:31 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:32 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:40 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:40 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:46 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:46 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"

Actions #3

Updated by coolo over 8 years ago

  • Status changed from New to Resolved
Actions #4

Updated by coolo over 8 years ago

  • Status changed from Resolved to New
  • Assignee changed from coolo to dgutu

next try ;(

Actions #5

Updated by coolo over 8 years ago

[Mon Sep 19 18:56:27 2016] [scheduler:debug] [DBIx debug] Took 0.00041199 seconds executed: UPDATE workers SET job_id = ? WHERE ( id = ? ): '266180', '50'.
[Mon Sep 19 19:18:34 2016] [13312:debug] [DBIx debug] Took 0.01566005 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:34', '50'.
[Mon Sep 19 19:18:35 2016] [13315:debug] [DBIx debug] Took 0.00622296 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:35', '50'.
[Mon Sep 19 19:18:36 2016] [13352:debug] [DBIx debug] Took 0.00609016 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:36', '50'.
[Mon Sep 19 19:18:37 2016] [13351:debug] [DBIx debug] Took 0.00381804 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:37', '50'.
[Mon Sep 19 19:18:37 2016] [13312:debug] POST "/api/v1/jobs/266180/artefact"
[Mon Sep 19 19:18:37 2016] [13314:debug] POST "/api/v1/jobs/266180/artefact"
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00395203 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '14'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00445509 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '54'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00793791 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '16'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00825691 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '29'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.02400804 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '98'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00660801 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '56'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00552702 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '60'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00415897 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '97'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00476885 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '43'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00495505 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '63'.
[Mon Sep 19 19:19:37 2016] [13314:debug] POST "/api/v1/jobs/266180/artefact"
[Mon Sep 19 19:20:18 2016] [websockets:debug] job considered dead: 266180
[Mon Sep 19 19:20:19 2016] [websockets:debug] [DBIx debug] Took 0.00286794 seconds executed: UPDATE workers SET job_id = ?, t_updated = ? WHERE ( id = ? ): NULL, '2016-09-19 17:20:19', '50'.
[Mon Sep 19 19:20:37 2016] [13351:debug] POST "/api/v1/jobs/266180/artefact"
Actions #6

Updated by coolo over 8 years ago

what looks related is this:

192.168.112.6 - - [19/Sep/2016:19:18:37 +0200] "POST /api/v1/jobs/266180/artefact HTTP/1.1" 406 - "-" "Mojolicious (Perl)"
192.168.112.6 - - [19/Sep/2016:19:19:37 +0200] "POST /api/v1/jobs/266180/artefact HTTP/1.1" 406 - "-" "Mojolicious (Perl)"
192.168.112.6 - - [19/Sep/2016:19:20:37 +0200] "POST /api/v1/jobs/266180/artefact HTTP/1.1" 404 30248 "-" "Mojolicious (Perl)"
Actions #7

Updated by coolo over 8 years ago

  • Target version set to Milestone 2

hah, because

[Mon Sep 19 19:18:37 2016] [13314:error] Can't write to log: at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious.pm line 104.

Actions #8

Updated by coolo over 8 years ago

ok, the dead job detection only does its job in this case - the worker stopped action for a minute.

Actions #9

Updated by coolo over 8 years ago

Trouble ahead

Sep 19 19:10:48 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:10:54 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:13 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:15 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:22 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:29 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to asset: Cannot allocate memory at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Cont
Sep 19 19:11:29 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:38 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to asset: Cannot allocate memory at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Cont
Sep 19 19:11:52 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:52 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Actions #10

Updated by coolo over 8 years ago


[Mon Sep 19 19:01:46 2016] [13315:debug] POST "/api/v1/workers/44/grab_job"
[Mon Sep 19 19:01:46 2016] [13315:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator"
[Mon Sep 19 19:01:46 2016] [13315:debug] API key from client: *5D46836035B1FE08*
[Mon Sep 19 19:01:46 2016] [13315:debug] [DBIx debug] Took 0.00244999 seconds executed: SELECT me.id, me.key, me.secret, me.user_id, me.t_expiration, me.t_created, me.t
_updated FROM api_keys me WHERE ( me.key = ? ): '5D46836035B1FE08'.
[Mon Sep 19 19:01:46 2016] [13315:debug] [DBIx debug] Took 0.00038695 seconds executed: SELECT me.id, me.username, me.email, me.fullname, me.nickname, 
me.is_operator, m
e.is_admin, me.t_created, me.t_updated FROM users me WHERE ( me.id = ? ): '66'.
[Mon Sep 19 19:01:46 2016] [13315:debug] Key is for user 'https://www.opensuse.org/openid/user/k0da'
[Mon Sep 19 19:01:46 2016] [13315:debug] API auth by user: https://www.opensuse.org/openid/user/k0da, operator: 1
[Mon Sep 19 19:01:46 2016] [13315:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "grab"
[Mon Sep 19 19:01:46 2016] [13315:debug] dispatching IPC job_grab to scheduler: [
[Mon Sep 19 19:02:46 2016] [13315:debug] Template "exception.production.html.ep" not found
[Mon Sep 19 19:02:46 2016] [13315:debug] Rendering template "exception.html.ep"
[Mon Sep 19 19:02:46 2016] [13315:debug] Rendering cached template "layouts/error.html.ep"
[Mon Sep 19 19:02:46 2016] [13315:debug] 500 Internal Server Error (60.202517s, 0.017/s)

[Mon Sep 19 18:59:42 2016] [13349:error] org.freedesktop.DBus.Error.Failed: Can't use string ("HASH(0x9daff38)") as a HASH ref while "strict refs" in use at /usr/share/
openqa/script/../lib/OpenQA/Schema/Result/Workers.pm line 78.

Actions #11

Updated by coolo about 8 years ago

  • Assignee changed from dgutu to coolo

it's still happening, e.g. 587977 on openqa.suse.de

[Thu Sep 22 22:20:54 2016] [scheduler:debug] [DBIx debug] Took 0.00220895 seconds executed: UPDATE workers SET job_id = ?, t_updated = ? WHERE ( id = ? ): '587977', '2016-09-22 20:20:54', '240'.
[Thu Sep 22 22:23:14 2016] [websockets:debug] job considered dead: 587977

in between no status update from the worker, but some mmapi calls. Strange

Actions #12

Updated by coolo about 8 years ago

But the worker didn't report any obvious problem

Sep 22 22:20:54 openqaworker3 worker[8618]: 12311: WORKING 587977
Sep 22 22:23:15 openqaworker3 qemu-system-x86_64[12505]: looking for plugins in '/usr/lib64/sasl2', failed to open directory, error: No such file or directory
Sep 22 22:35:56 openqaworker3 worker[8618]: ERROR clone-autoinst.xml: 404 response: Not Found
Sep 22 22:35:56 openqaworker3 worker[8618]: ERROR video.ogv: 404 response: Not Found
Sep 22 22:36:03 openqaworker3 worker[8618]: cleaning up 00587977-sle-12-SP2-Server-DVD-x86_64-Build2141-sles12sp2_autoyast_bug-879147_autoinst@64bit...
Sep 22 22:36:05 openqaworker3 worker[8618]: job 587977 incomplete

the 404 are because it's already dead by then.

Actions #13

Updated by coolo about 8 years ago

  • Priority changed from Urgent to High

not urgent anymore

Actions #14

Updated by coolo about 8 years ago

  • Status changed from New to Resolved

the dead job detection will kill non-dead jobs. But as long as we have it, we have to live with the fact that we misjudge half dead jobs as dead ;(

Actions

Also available in: Atom PDF