Project

General

Profile

Actions

action #25970

closed

Profile/Optimize _workers_checker in WebSockets server

Added by EDiGiacinto almost 7 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Feature requests
Target version:
-
Start date:
2017-10-11
Due date:
% Done:

0%

Estimated time:

Description

_workers_checker is triggered every 120 secs, and in SQLite DB this is what happens when it decides to take action:

[Wed Oct 11 15:04:35 2017] [scheduler:debug] +=----------------=+
[Wed Oct 11 15:04:35 2017] [websockets:debug] job considered dead: 1241 worker 5 not seen. In state running
[Wed Oct 11 15:04:35 2017] [websockets:debug] dispatching IPC ws_send to websockets: [9, "cancel", 1242]
[Wed Oct 11 15:04:38 2017] [scheduler:debug] [Congestion control] Resetting no actions count and rescheduling if necessary
[Wed Oct 11 15:04:38 2017] [scheduler:debug] [rescheduling] Current tick is at 10000ms. New tick will be in: 5000ms
[Wed Oct 11 15:04:43 2017] [scheduler:debug] +=----------------=+
[Wed Oct 11 15:04:43 2017] [scheduler:debug] -> Scheduling new jobs.
[Wed Oct 11 15:04:56 2017] [8623:debug] POST "/api/v1/jobs/1241/status"
[Wed Oct 11 15:04:56 2017] [8623:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator"
[Wed Oct 11 15:04:56 2017] [8623:debug] API key from client: *FC2C0DAC88379885*
[Wed Oct 11 15:04:56 2017] [8623:debug] Key is for user 'Demo'
[Wed Oct 11 15:04:56 2017] [8623:debug] API auth by user: Demo, operator: 1
[Wed Oct 11 15:04:56 2017] [8623:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "update_status"
[Wed Oct 11 15:05:01 2017] [8622:debug] POST "/api/v1/jobs/1242/status"
[Wed Oct 11 15:05:01 2017] [8622:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator"
[Wed Oct 11 15:05:01 2017] [8622:debug] API key from client: *FC2C0DAC88379885*
[Wed Oct 11 15:05:01 2017] [8622:debug] Key is for user 'Demo'
[Wed Oct 11 15:05:01 2017] [8622:debug] API auth by user: Demo, operator: 1
[Wed Oct 11 15:05:01 2017] [8622:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "update_status"
[Wed Oct 11 15:05:04 2017] [8606:debug] POST "/api/v1/jobs/1245/status"
[Wed Oct 11 15:05:04 2017] [8606:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator"
[Wed Oct 11 15:05:04 2017] [8606:debug] API key from client: *FC2C0DAC88379885*
[Wed Oct 11 15:05:04 2017] [8606:debug] Key is for user 'Demo'
[Wed Oct 11 15:05:04 2017] [8606:debug] API auth by user: Demo, operator: 1
[Wed Oct 11 15:05:04 2017] [8606:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "update_status"
[Wed Oct 11 15:05:06 2017] [8611:debug] POST "/api/v1/jobs/1244/status"
[Wed Oct 11 15:05:06 2017] [8611:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator"
[Wed Oct 11 15:05:06 2017] [8611:debug] API key from client: *FC2C0DAC88379885*
[Wed Oct 11 15:05:06 2017] [8611:debug] Key is for user 'Demo'
[Wed Oct 11 15:05:06 2017] [8611:debug] API auth by user: Demo, operator: 1
[Wed Oct 11 15:05:06 2017] [8611:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "update_status"
[Wed Oct 11 15:05:12 2017] [8614:debug] POST "/api/v1/jobs/1243/status"
[Wed Oct 11 15:05:12 2017] [8614:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1" and action "auth_operator"
[Wed Oct 11 15:05:12 2017] [8614:debug] API key from client: *FC2C0DAC88379885*
[Wed Oct 11 15:05:12 2017] [8614:debug] Key is for user 'Demo'
[Wed Oct 11 15:05:12 2017] [8614:debug] API auth by user: Demo, operator: 1
[Wed Oct 11 15:05:12 2017] [8614:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "update_status"
[Wed Oct 11 15:05:13 2017] [scheduler:debug] Could not get new jobs to allocate: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::SQLite::st execute failed: database is locked [for Statement "SELECT COUNT( * ) FROM workers m
e"] at /usr/share/openqa/script/../lib/OpenQA/Scheduler/Scheduler.pm line 221
[Wed Oct 11 15:05:13 2017] [scheduler:debug] Scheduler took 30.05802s to perform operations and allocated 0 jobs
[Wed Oct 11 15:05:13 2017] [scheduler:debug] Scheduling took too much time. Increasing failure count. (3)
[Wed Oct 11 15:05:13 2017] [scheduler:debug] [Congestion control] Calculated backoff: 10000ms

It follows with all the subsequent queries from all the services to fail (not only the scheduler, that is not able to find free workers anymore due to the lock).

The same actually happened in production with postgres - with the result of completely killing the DBus services ( See https://github.com/os-autoinst/openQA/pull/1433/files#diff-4234193dc63be2e7aee5593a65382621R390 for reference ).
The statement was wrapped inside a transaction since it's doing actions that can potentially lead to race conditions, and it's fine with postgres.

To note from the logs, one of the possible causes:

[Wed Oct 11 15:04:35 2017] [websockets:debug] job considered dead: 1241 worker 5 not seen. In state running
[Wed Oct 11 15:04:35 2017] [websockets:debug] dispatching IPC ws_send to websockets: [9, "cancel", 1242]

Here the WS server is calling itself, most likely when cancelling the job the schema is dispatching a message over dbus gain - this should be avoided to not pollute even more DBus.

AC:

  • profile _workers_checker(), re-elaborate over necessary optimizing it
  • remove the dbus message loop generated by it

Related issues 3 (0 open3 closed)

Related to openQA Project - action #27454: [tools][scheduling] Worker's seen DB field is ignored by WebSocket server when checking for stale jobsResolvedmkittler2018-05-05

Actions
Related to openQA Project - action #28429: [tools][sprint 201711.2] Sqlite support bustedResolvedcoolo2017-11-27

Actions
Related to openQA Project - coordination #32851: [tools][EPIC] Scheduling redesignResolvedokurz2018-05-05

Actions
Actions #1

Updated by EDiGiacinto over 6 years ago

  • Related to action #27454: [tools][scheduling] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs added
Actions #2

Updated by coolo over 6 years ago

  • Subject changed from [tools] Profile/Optimize _workers_checker in WebSockets server to Profile/Optimize _workers_checker in WebSockets server
  • Category set to 122
  • Target version set to Ready
Actions #3

Updated by EDiGiacinto over 6 years ago

  • Related to action #28429: [tools][sprint 201711.2] Sqlite support busted added
Actions #4

Updated by EDiGiacinto over 6 years ago

  • Description updated (diff)
Actions #5

Updated by EDiGiacinto over 6 years ago

Actions #6

Updated by mkittler about 5 years ago

@kraih Most likely no longer relevant after throwing out D-Bus in the websocket service, right?

Actions #7

Updated by okurz about 5 years ago

  • Status changed from New to Resolved

@mkittler I thought the same

Actions #8

Updated by coolo almost 5 years ago

  • Target version deleted (Ready)
Actions

Also available in: Atom PDF