action #23536

closed

[tools] org.freedesktop.DBus.Error.NoReply: Did not receive a reply. appears regularly in openQA logs

Added by nicksinger almost 7 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
2017-08-11
Due date:
% Done:

0%

Estimated time:

Description

Since the heavy modification of the scheduler (rough estimate), the following error has appeared regularly in the openQA log files:

[Wed Aug 23 09:56:24 2017] [11197:error] org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Context from the log file:

[Wed Aug 23 09:56:21 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 23 09:56:22 2017] [3069:info] Stopping worker 16795 gracefully (800 seconds)
[Wed Aug 23 09:56:22 2017] [23576:info] Worker 23576 started
[Wed Aug 23 09:56:22 2017] [23576:info] Connecting to AMQP server
[Wed Aug 23 09:56:22 2017] [3069:info] Worker 16795 stopped
[Wed Aug 23 09:56:22 2017] [23576:info] AMQP connection established
[Wed Aug 23 09:56:24 2017] [3069:info] Stopping worker 16889 gracefully (800 seconds)
[Wed Aug 23 09:56:24 2017] [23578:info] Worker 23578 started
[Wed Aug 23 09:56:24 2017] [23578:info] Connecting to AMQP server
[Wed Aug 23 09:56:24 2017] [23578:info] AMQP connection established
[Wed Aug 23 09:56:24 2017] [11197:error] org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[Wed Aug 23 09:56:24 2017] [18942:error] org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[Wed Aug 23 09:56:24 2017] [13669:error] org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[Wed Aug 23 09:56:24 2017] [22897:error] org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[Wed Aug 23 09:56:25 2017] [7513:debug] removing screenshot 4fb/518/987743821945823012420a62bd.png
[Wed Aug 23 09:56:25 2017] [7513:debug] removing screenshot 83c/f60/e85ad7da4f25b1eb96f0680aa9.png
[Wed Aug 23 09:56:25 2017] [7513:debug] removing screenshot 581/fd4/bdf3965a15065f31523cef9463.png
[Wed Aug 23 09:56:25 2017] [7513:debug] removing screenshot 7ad/564/ae42ed15f1806c65f71cbfd4f9.png
[Wed Aug 23 09:56:25 2017] [7513:debug] removing screenshot 898/5a6/6807e3c39b97dd10458dc2b70d.png
[Wed Aug 23 09:56:25 2017] [3069:info] Stopping worker 5110 gracefully (800 seconds)
[Wed Aug 23 09:56:25 2017] [3069:info] Worker 5110 stopped
[Wed Aug 23 09:56:25 2017] [23579:info] Worker 23579 started
[Wed Aug 23 09:56:25 2017] [23579:info] Connecting to AMQP server

Everything in the log related to one of the workers that raised this message:

[Wed Aug 23 08:57:57 2017] [11197:info] Worker 11197 started
[Wed Aug 23 08:57:57 2017] [11197:info] Connecting to AMQP server
[Wed Aug 23 08:57:57 2017] [11197:info] AMQP connection established
[Wed Aug 23 09:13:00 2017] [7513:debug] removing screenshot a4f/fc9/e72271244111977be85ad8dcc1.png
[Wed Aug 23 09:53:30 2017] [11197:info] Got status update for job 1125270 that does not belong to Worker 543
[Wed Aug 23 09:56:24 2017] [11197:error] org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[Wed Aug 23 09:56:39 2017] [3069:info] Stopping worker 11197 gracefully (800 seconds)
[Wed Aug 23 09:56:39 2017] [3069:info] Worker 11197 stopped
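
An excerpt like the one above can be pulled out of a log with a small filter. The following is a minimal sketch, assuming every line has the "[timestamp] [pid:level] message" shape seen in this ticket; the function name lines_for_worker is hypothetical and not part of openQA. It matches the PID only as a whole number, so it skips lines such as the 09:13:00 screenshot removal, where 11197 merely occurs inside a file hash.

```python
import re

# Assumed line shape, taken from the excerpts in this ticket:
#   [Wed Aug 23 09:56:24 2017] [11197:error] message text
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<tag>[^:\]]+):(?P<level>\w+)\] (?P<msg>.*)$"
)


def lines_for_worker(lines, pid):
    """Return the lines logged by `pid` or mentioning `pid` as a whole number."""
    pid = str(pid)
    mention = re.compile(r"\b" + re.escape(pid) + r"\b")
    selected = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a log line in the assumed format
        if match.group("tag") == pid or mention.search(match.group("msg")):
            selected.append(line)
    return selected
```

Called as lines_for_worker(open(path), 11197), with path pointing at the log file, this keeps the entries written by process 11197 as well as the "Stopping worker 11197" entries written by the parent process.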

Unfortunately I cannot see what exactly is causing the issue here.

Suggestions on how to improve this message:

  • If possible, include more specific reasons for this (what is the context of the message? What did the worker try to do?)

If this message is critical (not self recovering):

  • Add hints on where an admin could look for more information
  • Expand the message to tell the admin: "Hey, something just broke - you need to interact"

If this message should just inform the admin:

  • Lower the log level to "warn" at most
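
The suggestions above could look roughly like this in code. This is only a sketch of the idea, not openQA code: call_with_context, its parameters and the logger name are hypothetical, and Python's built-in TimeoutError stands in for the D-Bus NoReply error.

```python
import logging

log = logging.getLogger("openqa.sketch")  # hypothetical logger name


def call_with_context(action, func, *, critical, hint=None):
    """Run func() and log a failed call together with what was being attempted.

    action   -- short description of the attempted operation
    critical -- True if an admin has to step in, False if it is informational
    hint     -- optional pointer to where more information can be found
    """
    try:
        return func()
    except TimeoutError as err:  # stand-in for the D-Bus NoReply error
        message = f"No D-Bus reply while {action}: {err}"
        if critical:
            if hint:
                message += f" (manual intervention needed, see {hint})"
            log.error(message)
            raise
        # Informational case: log at "warn" at most and carry on.
        log.warning(message)
        return None
```

With a wrapper of this kind, the log line says what the process was trying to do, and the log level reflects whether someone has to act.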
Actions #1

Updated by nicksinger almost 7 years ago

  • Copied from action #23320: [tools][sprint 201709.2][sprint 201710.1][sprint 201710.2] move locks/mutexes/barriers/job restarts out of scheduler added
Actions #2

Updated by nicksinger almost 7 years ago

  • Copied from deleted (action #23320: [tools][sprint 201709.2][sprint 201710.1][sprint 201710.2] move locks/mutexes/barriers/job restarts out of scheduler)
Actions #3

Updated by nicksinger almost 7 years ago

  • Description updated (diff)
Actions #4

Updated by EDiGiacinto almost 7 years ago

We see more D-Bus errors because we now rely on websockets even more and also dispatch jobs over D-Bus. Most probably, though, the culprit is the huge load coming from the websocket server while it receives updates from the workers; see https://github.com/os-autoinst/openQA/pull/1433 for reference on the D-Bus load. https://github.com/os-autoinst/openQA/pull/1436 is a proposal to reduce the number of messages from the worker (these can now be handled as a single message that does not need to be sent in a fixed timing window).
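
To illustrate the general idea of sending fewer worker messages, here is a minimal sketch that forwards a status only when it differs from the last one sent, instead of on a fixed timer. It is not taken from either pull request; the class StatusSender and its send callback are hypothetical.

```python
class StatusSender:
    """Forward a worker status only when it changed since the last send."""

    def __init__(self, send):
        self._send = send  # hypothetical callback that delivers one message
        self._last = None  # last status that was actually sent

    def update(self, status):
        """Send `status` if it is new; return True if a message went out."""
        if status == self._last:
            return False  # unchanged, so no message and no extra load
        self._send(status)
        self._last = status
        return True
```

A worker that reports the same state many times in a row would then produce one message rather than one per interval.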

Actions #5

Updated by coolo over 6 years ago

  • Status changed from New to Resolved