action #13814
closeddead job is still too eager
0%
Description
https://openqa.opensuse.org/tests/265989 is incomplete as dead. Trying to figure out why
Updated by coolo over 8 years ago
[Mon Sep 19 13:42:41 2016] [websockets:debug] [DBIx debug] Took 0.00182104 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:42:41', '104'.
[Mon Sep 19 13:42:46 2016] [websockets:debug] [DBIx debug] Took 0.00141501 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:42:46', '104'.
[Mon Sep 19 13:42:51 2016] [websockets:debug] [DBIx debug] Took 0.00140905 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:42:51', '104'.
[Mon Sep 19 13:42:56 2016] [websockets:debug] [DBIx debug] Took 0.00981402 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:42:56', '104'.
[Mon Sep 19 13:42:56 2016] [4195:debug] [DBIx debug] Took 0.01209998 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:42:56', '104'.
[Mon Sep 19 13:43:01 2016] [websockets:debug] [DBIx debug] Took 0.00182796 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:43:01', '104'.
[Mon Sep 19 13:43:06 2016] [websockets:debug] [DBIx debug] Took 0.00152612 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 11:43:06', '104'.
[Mon Sep 19 13:43:51 2016] [websockets:debug] job considered dead: 265989
[Mon Sep 19 13:43:51 2016] [websockets:debug] [DBIx debug] Took 0.00207400 seconds executed: UPDATE workers SET job_id = ?, t_updated = ? WHERE ( id = ? ): NULL, '2016-09-19 11:43:51', '104'.
So there was quite some silence from the worker - but the websocket server was active, it's not him.
Updated by coolo over 8 years ago
OK, it wasn't silent - it was talking to the other side:
[Mon Sep 19 13:43:06 2016] [4195:debug] POST "/api/v1/jobs/265989/status"
[Mon Sep 19 13:43:12 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:12 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:19 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:19 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:25 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:25 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:31 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:32 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:40 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:40 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:46 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
[Mon Sep 19 13:43:46 2016] [4195:debug] POST "/api/v1/jobs/265989/artefact"
Updated by coolo over 8 years ago
- Status changed from New to Resolved
Updated by coolo over 8 years ago
- Status changed from Resolved to New
- Assignee changed from coolo to dgutu
next try ;(
Updated by coolo over 8 years ago
[Mon Sep 19 18:56:27 2016] [scheduler:debug] [DBIx debug] Took 0.00041199 seconds executed: UPDATE workers SET job_id = ? WHERE ( id = ? ): '266180', '50'.
[Mon Sep 19 19:18:34 2016] [13312:debug] [DBIx debug] Took 0.01566005 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:34', '50'.
[Mon Sep 19 19:18:35 2016] [13315:debug] [DBIx debug] Took 0.00622296 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:35', '50'.
[Mon Sep 19 19:18:36 2016] [13352:debug] [DBIx debug] Took 0.00609016 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:36', '50'.
[Mon Sep 19 19:18:37 2016] [13351:debug] [DBIx debug] Took 0.00381804 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:37', '50'.
[Mon Sep 19 19:18:37 2016] [13312:debug] POST "/api/v1/jobs/266180/artefact"
[Mon Sep 19 19:18:37 2016] [13314:debug] POST "/api/v1/jobs/266180/artefact"
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00395203 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '14'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00445509 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '54'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00793791 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '16'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00825691 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '29'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.02400804 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '98'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00660801 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '56'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00552702 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '60'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00415897 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '97'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00476885 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '43'.
[Mon Sep 19 19:18:50 2016] [websockets:debug] [DBIx debug] Took 0.00495505 seconds executed: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2016-09-19 17:18:50', '63'.
[Mon Sep 19 19:19:37 2016] [13314:debug] POST "/api/v1/jobs/266180/artefact"
[Mon Sep 19 19:20:18 2016] [websockets:debug] job considered dead: 266180
[Mon Sep 19 19:20:19 2016] [websockets:debug] [DBIx debug] Took 0.00286794 seconds executed: UPDATE workers SET job_id = ?, t_updated = ? WHERE ( id = ? ): NULL, '2016-09-19 17:20:19', '50'.
[Mon Sep 19 19:20:37 2016] [13351:debug] POST "/api/v1/jobs/266180/artefact"
Updated by coolo over 8 years ago
what looks related is this:
192.168.112.6 - - [19/Sep/2016:19:18:37 +0200] "POST /api/v1/jobs/266180/artefact HTTP/1.1" 406 - "-" "Mojolicious (Perl)"
192.168.112.6 - - [19/Sep/2016:19:19:37 +0200] "POST /api/v1/jobs/266180/artefact HTTP/1.1" 406 - "-" "Mojolicious (Perl)"
192.168.112.6 - - [19/Sep/2016:19:20:37 +0200] "POST /api/v1/jobs/266180/artefact HTTP/1.1" 404 30248 "-" "Mojolicious (Perl)"
Updated by coolo over 8 years ago
- Target version set to Milestone 2
hah, because
[Mon Sep 19 19:18:37 2016] [13314:error] Can't write to log: at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious.pm line 104.
Updated by coolo over 8 years ago
ok, the dead job detection only does its job in this case - the worker stopped action for a minute.
Updated by coolo over 8 years ago
Trouble ahead
Sep 19 19:10:48 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:10:54 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:13 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:15 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:22 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:29 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to asset: Cannot allocate memory at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Cont
Sep 19 19:11:29 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:38 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to asset: Cannot allocate memory at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Cont
Sep 19 19:11:52 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Sep 19 19:11:52 ariel openqa[9157]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Plugin/DefaultHelpers
Updated by coolo over 8 years ago
[Mon Sep 19 19:01:46 2016] [13315:debug] POST "/api/v1/workers/44/grab_job"
[Mon Sep 19 19:01:46 2016] [13315:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator"
[Mon Sep 19 19:01:46 2016] [13315:debug] API key from client: *5D46836035B1FE08*
[Mon Sep 19 19:01:46 2016] [13315:debug] [DBIx debug] Took 0.00244999 seconds executed: SELECT me.id, me.key, me.secret, me.user_id, me.t_expiration, me.t_created, me.t
_updated FROM api_keys me WHERE ( me.key = ? ): '5D46836035B1FE08'.
[Mon Sep 19 19:01:46 2016] [13315:debug] [DBIx debug] Took 0.00038695 seconds executed: SELECT me.id, me.username, me.email, me.fullname, me.nickname,
me.is_operator, m
e.is_admin, me.t_created, me.t_updated FROM users me WHERE ( me.id = ? ): '66'.
[Mon Sep 19 19:01:46 2016] [13315:debug] Key is for user 'https://www.opensuse.org/openid/user/k0da'
[Mon Sep 19 19:01:46 2016] [13315:debug] API auth by user: https://www.opensuse.org/openid/user/k0da, operator: 1
[Mon Sep 19 19:01:46 2016] [13315:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "grab"
[Mon Sep 19 19:01:46 2016] [13315:debug] dispatching IPC job_grab to scheduler: [
[Mon Sep 19 19:02:46 2016] [13315:debug] Template "exception.production.html.ep" not found
[Mon Sep 19 19:02:46 2016] [13315:debug] Rendering template "exception.html.ep"
[Mon Sep 19 19:02:46 2016] [13315:debug] Rendering cached template "layouts/error.html.ep"
[Mon Sep 19 19:02:46 2016] [13315:debug] 500 Internal Server Error (60.202517s, 0.017/s)
[Mon Sep 19 18:59:42 2016] [13349:error] org.freedesktop.DBus.Error.Failed: Can't use string ("HASH(0x9daff38)") as a HASH ref while "strict refs" in use at /usr/share/
openqa/script/../lib/OpenQA/Schema/Result/Workers.pm line 78.
Updated by coolo about 8 years ago
- Assignee changed from dgutu to coolo
it's still happening, e.g. 587977 on openqa.suse.de
[Thu Sep 22 22:20:54 2016] [scheduler:debug] [DBIx debug] Took 0.00220895 seconds executed: UPDATE workers SET job_id = ?, t_updated = ? WHERE ( id = ? ): '587977', '2016-09-22 20:20:54', '240'.
[Thu Sep 22 22:23:14 2016] [websockets:debug] job considered dead: 587977
in between no status update from the worker, but some mmapi calls. Strange
Updated by coolo about 8 years ago
But the worker didn't report any obvious problem
Sep 22 22:20:54 openqaworker3 worker[8618]: 12311: WORKING 587977
Sep 22 22:23:15 openqaworker3 qemu-system-x86_64[12505]: looking for plugins in '/usr/lib64/sasl2', failed to open directory, error: No such file or directory
Sep 22 22:35:56 openqaworker3 worker[8618]: ERROR clone-autoinst.xml: 404 response: Not Found
Sep 22 22:35:56 openqaworker3 worker[8618]: ERROR video.ogv: 404 response: Not Found
Sep 22 22:36:03 openqaworker3 worker[8618]: cleaning up 00587977-sle-12-SP2-Server-DVD-x86_64-Build2141-sles12sp2_autoyast_bug-879147_autoinst@64bit...
Sep 22 22:36:05 openqaworker3 worker[8618]: job 587977 incomplete
the 404 are because it's already dead by then.
Updated by coolo about 8 years ago
- Status changed from New to Resolved
the dead job detection will kill non-dead jobs. But as long as we have it, we have to live with the fact that we misjudge half dead jobs as dead ;(