Project

General

Profile

action #25970

Updated by EDiGiacinto about 6 years ago

_workers_checker is triggered every 120 secs, and in SQLite DB this is what happens when it decides to take action: 

     [Wed Oct 11 15:04:35 2017] [scheduler:debug] +=----------------=+ 
     [Wed Oct 11 15:04:35 2017] [websockets:debug] job considered dead: 1241 worker 5 not seen. In state running 
     [Wed Oct 11 15:04:35 2017] [websockets:debug] dispatching IPC ws_send to websockets: [9, "cancel", 1242] 
     [Wed Oct 11 15:04:38 2017] [scheduler:debug] [Congestion control] Resetting no actions count and rescheduling if necessary 
     [Wed Oct 11 15:04:38 2017] [scheduler:debug] [rescheduling] Current tick is at 10000ms. New tick will be in: 5000ms 
     [Wed Oct 11 15:04:43 2017] [scheduler:debug] +=----------------=+ 
     [Wed Oct 11 15:04:43 2017] [scheduler:debug] -> Scheduling new jobs. 
     [Wed Oct 11 15:04:56 2017] [8623:debug] POST "/api/v1/jobs/1241/status" 
     [Wed Oct 11 15:04:56 2017] [8623:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator" 
     [Wed Oct 11 15:04:56 2017] [8623:debug] API key from client: *FC2C0DAC88379885* 
     [Wed Oct 11 15:04:56 2017] [8623:debug] Key is for user 'Demo' 
     [Wed Oct 11 15:04:56 2017] [8623:debug] API auth by user: Demo, operator: 1 
     [Wed Oct 11 15:04:56 2017] [8623:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "update_status" 
     [Wed Oct 11 15:05:01 2017] [8622:debug] POST "/api/v1/jobs/1242/status" 
     [Wed Oct 11 15:05:01 2017] [8622:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator" 
     [Wed Oct 11 15:05:01 2017] [8622:debug] API key from client: *FC2C0DAC88379885* 
     [Wed Oct 11 15:05:01 2017] [8622:debug] Key is for user 'Demo' 
     [Wed Oct 11 15:05:01 2017] [8622:debug] API auth by user: Demo, operator: 1 
     [Wed Oct 11 15:05:01 2017] [8622:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "update_status" 
     [Wed Oct 11 15:05:04 2017] [8606:debug] POST "/api/v1/jobs/1245/status" 
     [Wed Oct 11 15:05:04 2017] [8606:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator" 
     [Wed Oct 11 15:05:04 2017] [8606:debug] API key from client: *FC2C0DAC88379885* 
     [Wed Oct 11 15:05:04 2017] [8606:debug] Key is for user 'Demo' 
     [Wed Oct 11 15:05:04 2017] [8606:debug] API auth by user: Demo, operator: 1 
     [Wed Oct 11 15:05:04 2017] [8606:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "update_status" 
     [Wed Oct 11 15:05:06 2017] [8611:debug] POST "/api/v1/jobs/1244/status" 
     [Wed Oct 11 15:05:06 2017] [8611:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator" 
     [Wed Oct 11 15:05:06 2017] [8611:debug] API key from client: *FC2C0DAC88379885* 
     [Wed Oct 11 15:05:06 2017] [8611:debug] Key is for user 'Demo' 
     [Wed Oct 11 15:05:06 2017] [8611:debug] API auth by user: Demo, operator: 1 
     [Wed Oct 11 15:05:06 2017] [8611:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "update_status" 
     [Wed Oct 11 15:05:12 2017] [8614:debug] POST "/api/v1/jobs/1243/status" 
     [Wed Oct 11 15:05:12 2017] [8614:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator" 
     [Wed Oct 11 15:05:12 2017] [8614:debug] API key from client: *FC2C0DAC88379885* 
     [Wed Oct 11 15:05:12 2017] [8614:debug] Key is for user 'Demo' 
     [Wed Oct 11 15:05:12 2017] [8614:debug] API auth by user: Demo, operator: 1 
     [Wed Oct 11 15:05:12 2017] [8614:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "update_status" 
     [Wed Oct 11 15:05:13 2017] [scheduler:debug] Could not get new jobs to allocate: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::SQLite::st execute failed: database is locked [for Statement "SELECT COUNT( * ) FROM workers m 
     e"] at /usr/share/openqa/script/../lib/OpenQA/Scheduler/Scheduler.pm line 221 
     [Wed Oct 11 15:05:13 2017] [scheduler:debug] Scheduler took 30.05802s to perform operations and allocated 0 jobs 
     [Wed Oct 11 15:05:13 2017] [scheduler:debug] Scheduling took too much time. Increasing failure count. (3) 
     [Wed Oct 11 15:05:13 2017] [scheduler:debug] [Congestion control] Calculated backoff: 10000ms 

 It follows with all the subsequent queries from **all** the services to fail (not only the scheduler, that is not able to find free workers anymore due to the lock). 

 The same actually happened in production with postgres - with the result of completely killing the DBus services ( See https://github.com/os-autoinst/openQA/pull/1433/files#diff-4234193dc63be2e7aee5593a65382621R390 for reference ). 
 The statement was wrapped inside a transaction since it's doing actions that can potentially lead to race conditions, and it's fine with postgres. The problem now regards only SQLite, and i really think nested transactions are the culprit.  

 To note from the logs, one of the possible causes: logs: 

     [Wed Oct 11 15:04:35 2017] [websockets:debug] job considered dead: 1241 worker 5 not seen. In state running 
     [Wed Oct 11 15:04:35 2017] [websockets:debug] dispatching IPC ws_send to websockets: [9, "cancel", 1242] 

 Here the WS server is calling itself, most likely when cancelling the job the schema is dispatching even if it's not a message over dbus gain real problem - for now - this should can be avoided to not pollute even more DBus. 

 AC:  

   - profile _workers_checker(), re-elaborate over necessary optimizing it 
   - remove the dbus message loop generated by it 

    

Back