action #57017

websocket service stalls for 10 minutes when a worker misbehaves

Added by coolo 5 months ago. Updated 4 months ago.

Status:ResolvedStart date:18/09/2019
Priority:NormalDue date:
Assignee:mkittler% Done:

0%

Category:Concrete Bugs
Target version:Done
Difficulty:hard
Duration:

Description

Yesterday we had total blackout of the webui as we ran out of database handles. And it's mostly about job locks. The apache log looks like this:

2620:113:80c0:8080:10:160:1:93 - - [17/Sep/2019:16:00:14 +0200] "POST /api/v1/barrier/DRBD_RESOURCE_RESTARTED_alpha?check_dead_job=0 HTTP/1.1" 200 3 "-" "Mojolicious (Perl)" 50267
2620:113:80c0:8080:10:160:1:100 - - [17/Sep/2019:16:01:46 +0200] "POST /api/v1/barrier/REMOVE_NODE_BY_IP_DONE_delta?check_dead_job=0 HTTP/1.1" 409 4 "-" "Mojolicious (Perl)" 69502
2620:113:80c0:8080:10:160:0:243 - - [17/Sep/2019:16:08:33 +0200] "POST /api/v1/barrier/PACEMAKER_CTS_CHECKED_pacemaker?check_dead_job=0 HTTP/1.1" 409 4 "-" "Mojolicious (Perl)" 25025
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:00:52 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480104796
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:01:37 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480100422
2620:113:80c0:8080:10:160:1:20 - - [17/Sep/2019:16:10:12 +0200] "POST /api/v1/barrier/PACEMAKER_CTS_CHECKED_pacemaker?check_dead_job=0 HTTP/1.1" 409 4 "-" "Mojolicious (Perl)" 37502
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:05:52 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480098754
2620:113:80c0:8080:10:160:1:20 - - [17/Sep/2019:16:13:54 +0200] "POST /api/v1/barrier/PACEMAKER_CTS_CHECKED_pacemaker?check_dead_job=0 HTTP/1.1" 409 4 "-" "Mojolicious (Perl)" 50245
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:06:37 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480104153
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:10:52 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480101377
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:11:42 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480099105
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:21:44 +0200] "POST /api/v1/barrier/test_t04_sit_tunnel_xml_pre_run?check_dead_job=0 HTTP/1.1" 409 4 "-" "Mojolicious (Perl)" 25022
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:16:10 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480083671
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:16:42 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480102288
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:21:18 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480102722
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:21:42 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480099029
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:22:52 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480092022
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:26:26 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480063841
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:26:47 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480100969
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:27:19 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480100403
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:27:52 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480010150
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:31:40 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480101156
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:31:48 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480092209
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:32:19 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480102590
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:32:53 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480102287
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:33:45 +0200] "POST /api/v1/barrier/test_t04_sit_tunnel_xml_pre_run?check_dead_job=0 HTTP/1.1" 502 1202 "-" "Mojolicious (Perl)" 480102560
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:36:48 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480102510
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:36:48 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480032829
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:37:19 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480027596
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:37:58 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480011141
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:38:45 +0200] "POST /api/v1/barrier/test_t04_sit_tunnel_xml_pre_run?check_dead_job=0 HTTP/1.1" 502 1202 "-" "Mojolicious (Perl)" 480002487
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:41:53 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480102649
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:41:56 +0200] "POST /api/v1/barrier/test_t02_bonding_ab_pre_run?check_dead_job=0 HTTP/1.1" 502 1194 "-" "Mojolicious (Perl)" 480102665
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:41:56 +0200] "POST /api/v1/barrier/test_t04_sit_tunnel_xml_pre_run?check_dead_job=0 HTTP/1.1" 502 1202 "-" "Mojolicious (Perl)" 480025992
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:42:24 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480102684
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:42:58 +0200] "POST /api/v1/barrier/test_t03_bridge_ifup_ifreload_pre_run?check_dead_job=0 HTTP/1.1" 502 1214 "-" "Mojolicious (Perl)" 480102514
2620:113:80c0:8080:10:160:0:227 - - [17/Sep/2019:16:43:45 +0200] "POST /api/v1/barrier/test_t04_sit_tunnel_xml_pre_run?check_dead_job=0 HTTP/1.1" 502 1202 "-" "Mojolicious (Perl)" 480102540

In parallel to this we have this in our postgresql log (just an example, there are tons):

2019-09-17 16:53:00.498 CEST openqa geekotest [7863]LOG:  duration: 1154803.870 ms  execute <unnamed>: UPDATE job_locks SET locked_by = $1 WHERE ( ( locked_by = $2 AND name = $3 AND owner = $4 ) )
2019-09-17 16:53:00.498 CEST openqa geekotest [7863]DETAIL:  parameters: $1 = '3368401,3368405', $2 = '3368401', $3 = 'test_t04_sit_tunnel_xml_pre_run', $4 = '3368401'
2019-09-17 16:53:00.498 CEST openqa geekotest [8763]LOG:  duration: 1807604.689 ms  execute <unnamed>: UPDATE job_locks SET locked_by = $1 WHERE ( ( locked_by IS NULL AND name = $2 AND owner = $3 ) )
2019-09-17 16:53:00.498 CEST openqa geekotest [8763]DETAIL:  parameters: $1 = '3368563', $2 = 'test_t03_bridge_ifup_ifreload_pre_run', $3 = '3368436'
2019-09-17 16:53:00.498 CEST openqa geekotest [10596]LOG:  duration: 3572978.037 ms  execute dbdpg_p10595_23: UPDATE jobs SET passed_module_count = $1, t_

And as these job_locks update happen in the webui, it blocks the whole mojolicous worker, leading to a domino effect.

History

#2 Updated by kraih 5 months ago

If this is just the webui running out of database connections (and no connections are leaking), it might be fixable by installing a PostgreSQL connection manager. https://progress.opensuse.org/issues/55262

#3 Updated by coolo 5 months ago

It's not. The database connections ran out 15 minutes after the problem started. The problem is that UPDATE job_locks SET locked_by takes several minutes - hinting a livelock (as postgresql didn't report a deadlock)

#4 Updated by coolo 5 months ago

I enabled lock logging and there aren't deadlocks (that postgresql detects as some) but some processes keep locks

2019-09-20 04:05:35.461 CEST openqa geekotest [1164]LOG:  process 1164 still waiting for ShareLock on transaction 181474480 after 1000.231 ms
2019-09-20 04:05:35.461 CEST openqa geekotest [1164]DETAIL:  Process holding the lock: 22303. Wait queue: 22510, 19629, 1164.
2019-09-20 04:05:35.461 CEST openqa geekotest [1164]CONTEXT:  while updating tuple (9317,101) in relation "jobs"
2019-09-20 04:05:35.461 CEST openqa geekotest [1164]STATEMENT:  UPDATE jobs SET state = $1, t_updated = $2 WHERE ( id = $3 )
2019-09-20 04:05:45.468 CEST openqa geekotest [21717]LOG:  process 21717 still waiting for ExclusiveLock on tuple (567,4) of relation 16737 of database 16386 after 1000.163 ms
2019-09-20 04:05:45.468 CEST openqa geekotest [21717]DETAIL:  Process holding the lock: 22510. Wait queue: 21717.
2019-09-20 04:05:45.468 CEST openqa geekotest [21717]STATEMENT:  UPDATE workers SET t_updated = $1 WHERE ( id = $2 )
2019-09-20 04:06:53.998 CEST openqa geekotest [3469]LOG:  process 3469 still waiting for ExclusiveLock on tuple (9316,110) of relation 16534 of database 16386 after 1000.245 ms
2019-09-20 04:06:53.998 CEST openqa geekotest [3469]DETAIL:  Process holding the lock: 19629. Wait queue: 3469.
2019-09-20 04:06:53.998 CEST openqa geekotest [3469]STATEMENT:  UPDATE jobs SET skipped_module_count = $1, t_updated = $2 WHERE ( id = $3 )
2019-09-20 04:13:55.581 CEST openqa geekotest [29435]LOG:  process 29435 still waiting for ExclusiveLock on tuple (567,4) of relation 16737 of database 16386 after 1001.942 ms
2019-09-20 04:13:55.581 CEST openqa geekotest [29435]DETAIL:  Process holding the lock: 22510. Wait queue: 21717, 29435.
2019-09-20 04:13:55.581 CEST openqa geekotest [29435]STATEMENT:  UPDATE workers SET t_updated = $1 WHERE ( id = $2 )
2019-09-20 04:15:54.145 CEST openqa geekotest [17418]LOG:  process 17418 still waiting for ExclusiveLock on tuple (9316,110) of relation 16534 of database 16386 after 1000.197 ms
2019-09-20 04:15:54.145 CEST openqa geekotest [17418]DETAIL:  Process holding the lock: 19629. Wait queue: 3469, 17418.
2019-09-20 04:15:54.145 CEST openqa geekotest [17418]STATEMENT:  UPDATE jobs SET skipped_module_count = $1, t_updated = $2 WHERE ( id = $3 )
2019-09-20 04:17:24.952 CEST openqa geekotest [19629]LOG:  process 19629 acquired ShareLock on transaction 181474480 after 1172135.551 ms
2019-09-20 04:17:24.952 CEST openqa geekotest [19629]CONTEXT:  while updating tuple (9316,110) in relation "jobs"
2019-09-20 04:17:24.952 CEST openqa geekotest [19629]STATEMENT:  UPDATE jobs SET skipped_module_count = $1, t_updated = $2 WHERE ( id = $3 )
2019-09-20 04:17:24.952 CEST openqa geekotest [22510]LOG:  process 22510 acquired ShareLock on transaction 181474480 after 1190605.905 ms
2019-09-20 04:17:24.952 CEST openqa geekotest [22510]CONTEXT:  while updating tuple (567,4) in relation "workers"
2019-09-20 04:17:24.952 CEST openqa geekotest [22510]STATEMENT:  UPDATE workers SET t_updated = $1 WHERE ( id = $2 )
2019-09-20 04:17:24.952 CEST openqa geekotest [1164]LOG:  process 1164 acquired ShareLock on transaction 181474480 after 710491.349 ms
2019-09-20 04:17:24.952 CEST openqa geekotest [1164]CONTEXT:  while updating tuple (9317,101) in relation "jobs"
2019-09-20 04:17:24.952 CEST openqa geekotest [1164]STATEMENT:  UPDATE jobs SET state = $1, t_updated = $2 WHERE ( id = $3 )
2019-09-20 04:17:24.952 CEST openqa geekotest [21717]LOG:  process 21717 acquired ExclusiveLock on tuple (567,4) of relation 16737 of database 16386 after 700484.700 ms
2019-09-20 04:17:24.952 CEST openqa geekotest [21717]STATEMENT:  UPDATE workers SET t_updated = $1 WHERE ( id = $2 )
2019-09-20 04:17:24.952 CEST openqa geekotest [19629]LOG:  duration: 1172135.926 ms  execute dbdpg_p19626_23: UPDATE jobs SET skipped_module_count = $1, t_updated = $2 WHERE ( id = $3 )
2019-09-20 04:17:24.952 CEST openqa geekotest [19629]DETAIL:  parameters: $1 = '1', $2 = '2019-09-20 01:57:52', $3 = '3384018'
2019-09-20 04:17:24.952 CEST openqa geekotest [22510]LOG:  duration: 1190606.286 ms  execute dbdpg_p22509_5: UPDATE workers SET t_updated = $1 WHERE ( id = $2 )
2019-09-20 04:17:24.952 CEST openqa geekotest [22510]DETAIL:  parameters: $1 = '2019-09-20 01:57:34', $2 = '1251'
2019-09-20 04:17:24.952 CEST openqa geekotest [29435]LOG:  process 29435 acquired ExclusiveLock on tuple (567,4) of relation 16737 of database 16386 after 210373.646 ms
2019-09-20 04:17:24.952 CEST openqa geekotest [29435]STATEMENT:  UPDATE workers SET t_updated = $1 WHERE ( id = $2 )
2019-09-20 04:17:24.953 CEST openqa geekotest [1164]LOG:  duration: 710492.252 ms  execute dbdpg_p1159_1b: UPDATE jobs SET state = $1, t_updated = $2 WHERE ( id = $3 )
2019-09-20 04:17:24.953 CEST openqa geekotest [1164]DETAIL:  parameters: $1 = 'uploading', $2 = '2019-09-20 02:05:34', $3 = '3384056'
2019-09-20 04:17:24.953 CEST openqa geekotest [3469]LOG:  process 3469 acquired ExclusiveLock on tuple (9316,110) of relation 16534 of database 16386 after 631955.753 ms
2019-09-20 04:17:24.953 CEST openqa geekotest [3469]STATEMENT:  UPDATE jobs SET skipped_module_count = $1, t_updated = $2 WHERE ( id = $3 )
2019-09-20 04:17:24.953 CEST openqa geekotest [17418]LOG:  process 17418 acquired ExclusiveLock on tuple (9316,110) of relation 16534 of database 16386 after 91808.918 ms
2019-09-20 04:17:24.953 CEST openqa geekotest [17418]STATEMENT:  UPDATE jobs SET skipped_module_count = $1, t_updated = $2 WHERE ( id = $3 )
2019-09-20 04:17:24.954 CEST openqa geekotest [29435]LOG:  duration: 210374.886 ms  execute dbdpg_p29427_7: UPDATE workers SET t_updated = $1 WHERE ( id = $2 )
2019-09-20 04:17:24.954 CEST openqa geekotest [29435]DETAIL:  parameters: $1 = '2019-09-20 02:13:54', $2 = '1251'
2019-09-20 04:17:24.955 CEST openqa geekotest [21717]LOG:  duration: 700487.607 ms  execute dbdpg_p21714_5: UPDATE workers SET t_updated = $1 WHERE ( id = $2 )
2019-09-20 04:17:24.955 CEST openqa geekotest [21717]DETAIL:  parameters: $1 = '2019-09-20 02:05:44', $2 = '1251'
2019-09-20 04:17:24.957 CEST openqa geekotest [17418]LOG:  duration: 91812.500 ms  execute dbdpg_p17401_18: UPDATE jobs SET skipped_module_count = $1, t_updated = $2 WHERE ( id = $3 )
2019-09-20 04:17:24.957 CEST openqa geekotest [17418]DETAIL:  parameters: $1 = '1', $2 = '2019-09-20 02:15:53', $3 = '3384018'
2019-09-20 04:17:24.963 CEST openqa geekotest [3469]LOG:  duration: 631965.528 ms  execute dbdpg_p3468_1e: UPDATE jobs SET skipped_module_count = $1, t_updated = $2 WHERE ( id = $3 )
2019-09-20 04:17:24.963 CEST openqa geekotest [3469]DETAIL:  parameters: $1 = '1', $2 = '2019-09-20 02:06:52', $3 = '3384018'

#5 Updated by coolo 5 months ago

  • Subject changed from Barriers can lead to 502 (deadlock?) to websocket service stalls for 10 minutes when a worker misbehaves
  • Priority changed from High to Urgent
  • Target version changed from Ready to Current Sprint

ok, this is easier than expected once you got the debug logs. from the journal of websocket service (left out noise):

Sep 20 03:57:22 openqa openqa-websockets[22287]: BEGIN WORK
Sep 20 03:57:22 openqa openqa-websockets[22287]: COMMIT
Sep 20 04:07:24 openqa openqa-websockets[22287]: [error] [pid:22287] Failed dispatching message to websocket server over ipc for worker "openqaworker8:": Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Cl
Sep 20 04:07:24 openqa openqa-websockets[22287]: [debug] [pid:22287] enqueuing abort for 3384018 581
Sep 20 04:17:24 openqa openqa-websockets[22287]: [error] [pid:22287] Failed dispatching message to websocket server over ipc for worker "openqaworker8:": Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm >
Sep 20 04:17:24 openqa openqa-websockets[22287]: [debug] [pid:22287] new job 3384373
Sep 20 04:17:24 openqa openqa-websockets[22287]: [debug] [pid:22287] [DBIC] Took 0.00031100 seconds: SELECT me.id, me.result_dir, me.state, me.priority, me.result, me.clone_id, me.blocked_by_id, me.backend, me.backend_info, me.TEST, me.>
Sep 20 04:17:24 openqa openqa-websockets[22287]: [debug] [pid:22287] [Job#3384056] Duplicated as job '3384373'
Sep 20 04:17:24 openqa openqa-websockets[22287]: COMMIT

I can't find another BEGIN for the COMMIT - but there are 10 minute timeouts in the very crucial websocket service and it sounds like it holds all the important tables locked in there. So it's not actually job_locks, but jobs (where jobs_lock got a relation into)

#6 Updated by coolo 5 months ago

Parallel in the webui:

[2019-09-20T04:17:24.0953 CEST] [debug] [pid:1159] [DBIC] Took 710.49299600 seconds: UPDATE jobs SET state = ?, t_updated = ? WHERE ( id = ? ): 'uploading', '2019-09-20 02:05:34', '3384056'
[2019-09-20T04:17:24.0954 CEST] [debug] [pid:19626] [DBIC] Took 1172.13732700 seconds: UPDATE jobs SET skipped_module_count = ?, t_updated = ? WHERE ( id = ? ): '1', '2019-09-20 01:57:52', '3384018'
[2019-09-20T04:17:24.0954 CEST] [debug] [pid:22509] [DBIC] Took 1190.60780300 seconds: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2019-09-20 01:57:34', '1251'
[2019-09-20T04:17:24.0955 CEST] [debug] [pid:29427] [DBIC] Took 210.37648600 seconds: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2019-09-20 02:13:54', '1251'
[2019-09-20T04:17:24.0957 CEST] [debug] [pid:17401] [DBIC] Took 91.81311800 seconds: UPDATE jobs SET skipped_module_count = ?, t_updated = ? WHERE ( id = ? ): '1', '2019-09-20 02:15:53', '3384018'
[2019-09-20T04:17:24.0961 CEST] [debug] [pid:21714] [DBIC] Took 700.49337800 seconds: UPDATE workers SET t_updated = ? WHERE ( id = ? ): '2019-09-20 02:05:44', '1251'
[2019-09-20T04:17:24.0963 CEST] [debug] [pid:3468] [DBIC] Took 631.96631300 seconds: UPDATE jobs SET skipped_module_count = ?, t_updated = ? WHERE ( id = ? ): '1', '2019-09-20 02:06:52', '3384018'

So it blocked 7 mojo workers. the others only did SELECTs and that kept running.

#7 Updated by coolo 5 months ago

Of course this can be actually 2 bugs, but Marius and I tried for quite some time to make sense out of job_locks itself and it just doesn't make sense.

#8 Updated by coolo 5 months ago

  • Assignee set to mkittler

Assigning Marius, but we most likely need more brains to find a non-blocking solution there :(

#9 Updated by mkittler 5 months ago

You're saying the log is "from the journal of websocket service" and the openqa-websockets[22287] prefix says it, too. I'm confused. The messages look like they are from any other daemon which tries to speak with the websocket server. But that could explain why the timeout happens: The websockt server waits for an answer from itself. Seems like I have to look for any code paths reachable from the websocket server which possibly try to send a command to the websocket server.

#10 Updated by mkittler 5 months ago

Theory is that the following nested calls inside the web socket server cause it to block: OpenQA::WebSockets::Model::Status::workers_checker -> OpenQA::Schema::Result::Jobs::auto_duplicate -> OpenQA::Schema::Result::Workers::send_coomand

But this is too obvious. Why haven't we seen this earlier or more often?

#11 Updated by mkittler 5 months ago

It really seems to be that call sequence.

PR for immediate fix: https://github.com/os-autoinst/openQA/pull/2340 (if merged we can lower the priority of the issue)

Note that the problem only occurs when the following conditions are met at the same time:

  1. The web socket server incompletes a job because its worker hasn't seen for too long.
  2. That job is part of a cluster and other jobs in the cluster need to be aborted.

That's why it only happened rarely.


Marking jobs as incomplete is likely better done in the scheduler because it is possibly an extensive task and the ws server should stay reactive. So maybe create another API route on the scheduler-side to incomplete a job (and figure out which jobs in the cluster need to be rescheduled/aborted and so on). The ws server could call that route in an async manner and wouldn't care much about the response.

#12 Updated by mkittler 5 months ago

So I would create an API on the scheduler side which allows the web socket server to report jobs to be considered incomplete. Since the scheduler already has the "wakeup" API and therefore is already an HTTP server this should not be much to implement.

I'm wondering whether we should add a check in the IPC clients (e.g. on instantiation) to assert that it is never used from the daemon itself. This would allow us to identify problems caused by this type of problem more easily in the future.

#13 Updated by mkittler 5 months ago

  • Status changed from New to In Progress
  • Priority changed from Urgent to Normal

The immediate fix has been merged and hot-fixed on OSD. No further dead locks have happened since then so I'm lowering the prio.

#14 Updated by mkittler 5 months ago

I updated the approach: https://github.com/os-autoinst/openQA/pull/2348

So the idea is to rely only on t_updated for the incompleting stale jobs.


But I've discovered another code path in the ws code which incompletes and duplicates jobs. When a worker is online (again) the job the web UI think it is running needs to be incompleted unless the worker reports that it is still working on it. That will really require an IPC call from the ws server to the scheduler. Also the "re-scheduling" logic for assigned jobs should be moved to the scheduler (for consistency).

#15 Updated by mkittler 5 months ago

Draft/PR for the 2nd code path: https://github.com/os-autoinst/openQA/pull/2351

#16 Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback

The 2 PRs have been merged. I'd still like to follow the idea if adding a die to the client instantiation in case of a misuse. (This issue was harder to debug than it should have been.) But I'd wait for @kraih with that because it is not urgent and he might have a better idea how to implement it than me.

#17 Updated by mkittler 4 months ago

PR for the idea mentioned in previous comment: https://github.com/os-autoinst/openQA/pull/2398

#18 Updated by okurz 4 months ago

merged

#19 Updated by mkittler 4 months ago

  • Status changed from Feedback to Resolved

#20 Updated by mkittler 4 months ago

  • Target version changed from Current Sprint to Done

Also available in: Atom PDF