Project

General

Profile

Actions

action #57017

closed

websocket service stalls for 10 minutes when a worker misbehaves

Added by coolo over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2019-09-18
Due date:
% Done:

0%

Estimated time:

Description

Yesterday we had total blackout of the webui as we ran out of database handles. And it's mostly about job locks. The apache log looks like this:

2620:113:80c0:8080:10:160:1:93 - - [17/Sep/2019:16:00:14 +0200] "POST /api/v1/barrier/DRBD_RESOURCE_RESTARTED_alpha?check_dead_job=0 HTTP/1.1" 200 3 "-" "Mojolicious (Perl)" 50267
2620:113:80c0:8080:10:160:1:100 - - [17/Sep/2019:16:01:46 +0200] "POST /api/v1/barrier/REMOVE_NODE_BY_IP_DONE_delta?check_dead_job=0 HTTP/1.1" 409 4 "-" "Mojolicious (Perl)" 69502
2620:113:80c0:8080:10:160:0:243 - - [17/Sep/2019:16:08:33 +0200] "POST /api/v1/barrier/PACEMAKER_CTS_CHECKED_pacemaker?check_dead_job=0 HTTP/1.1" 409 4 "-" "Mojolicious (Perl)" 25025
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:00:52 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480104796
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:01:37 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480100422
2620:113:80c0:8080:10:160:1:20 - - [17/Sep/2019:16:10:12 +0200] "POST /api/v1/barrier/PACEMAKER_CTS_CHECKED_pacemaker?check_dead_job=0 HTTP/1.1" 409 4 "-" "Mojolicious (Perl)" 37502
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:05:52 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480098754
2620:113:80c0:8080:10:160:1:20 - - [17/Sep/2019:16:13:54 +0200] "POST /api/v1/barrier/PACEMAKER_CTS_CHECKED_pacemaker?check_dead_job=0 HTTP/1.1" 409 4 "-" "Mojolicious (Perl)" 50245
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:06:37 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480104153
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:10:52 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480101377
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:11:42 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480099105
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:21:44 +0200] "POST /api/v1/barrier/test_t04_sit_tunnel_xml_pre_run?check_dead_job=0 HTTP/1.1" 409 4 "-" "Mojolicious (Perl)" 25022
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:16:10 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480083671
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:16:42 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480102288
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:21:18 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480102722
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:21:42 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480099029
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:22:52 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480092022
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:26:26 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480063841
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:26:47 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480100969
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:27:19 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480100403
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:27:52 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480010150
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:31:40 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480101156
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:31:48 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480092209
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:32:19 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480102590
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:32:53 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480102287
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:33:45 +0200] "POST /api/v1/barrier/test_t04_sit_tunnel_xml_pre_run?check_dead_job=0 HTTP/1.1" 502 1202 "-" "Mojolicious (Perl)" 480102560
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:36:48 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480102510
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:36:48 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480032829
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:37:19 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480027596
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:37:58 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480011141
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:38:45 +0200] "POST /api/v1/barrier/test_t04_sit_tunnel_xml_pre_run?check_dead_job=0 HTTP/1.1" 502 1202 "-" "Mojolicious (Perl)" 480002487
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:41:53 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480102649
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:41:56 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480102665
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:41:56 +0200] "POST /api/v1/barrier/test_t04_sit_tunnel_xml_pre_run?check_dead_job=0 HTTP/1.1" 502 1202 "-" "Mojolicious (Perl)" 480025992
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:42:24 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480102684
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:42:58 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480102514
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:43:45 +0200] "POST /api/v1/barrier/test_t04_sit_tunnel_xml_pre_run?check_dead_job=0 HTTP/1.1" 502 1202 "-" "Mojolicious (Perl)" 480102540

In parallel to this we have this in our postgresql log (just an example, there are tons):

2019-09-17 16:53:00.498 CEST openqa geekotest [7863]LOG:  duration: 1154803.870 ms  execute <unnamed>: UPDATE job_locks SET locked_by = $1 WHERE ( ( locked_by = $2 AND name = $3 AND owner = $4 ) )
2019-09-17 16:53:00.498 CEST openqa geekotest [7863]DETAIL:  parameters: $1 = '3368401,3368405', $2 = '3368401', $3 = 'test_t04_sit_tunnel_xml_pre_run', $4 = '3368401'
2019-09-17 16:53:00.498 CEST openqa geekotest [8763]LOG:  duration: 1807604.689 ms  execute <unnamed>: UPDATE job_locks SET locked_by = $1 WHERE ( ( locked_by IS NULL AND name = $2 AND owner = $3 ) )
2019-09-17 16:53:00.498 CEST openqa geekotest [8763]DETAIL:  parameters: $1 = '3368563', $2 = 'test_t03_bridge_ifup_ifreload_pre_run', $3 = '3368436'
2019-09-17 16:53:00.498 CEST openqa geekotest [10596]LOG:  duration: 3572978.037 ms  execute dbdpg_p10595_23: UPDATE jobs SET passed_module_count = $1, t_

And as these job_locks update happen in the webui, it blocks the whole mojolicous worker, leading to a domino effect.

Actions

Also available in: Atom PDF