action #90872 (closed)
openQA / os-autoinst 'either does not dequeue its messages, or exhibits some other buggy client-behavior'
Description
I recently updated Fedora's openQA deployments to very recent git snapshots of openQA and os-autoinst. Note also that on Fedora, we use dbus-broker by default. Since this update, I periodically see a test that fails with a dbus error, like https://openqa.fedoraproject.org/tests/847777 :
"backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
Open vSwitch command 'set_vlan' with arguments 'tap9 20' failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken."
Looking at the worker logs, this happens at a time when those logs are flooded with messages like this:
Apr 08 03:21:01 openqa-x86-worker01.iad2.fedoraproject.org dbus-broker[76154]: Peer :1.10193 is being disconnected as it does not have the resources to receive a reply or unicast signal it expects.
Apr 08 03:21:01 openqa-x86-worker01.iad2.fedoraproject.org dbus-broker[76154]: Peer :1.10194 is being disconnected as it does not have the resources to receive a reply or unicast signal it expects.
Apr 08 03:21:01 openqa-x86-worker01.iad2.fedoraproject.org dbus-broker[76154]: Peer :1.10195 is being disconnected as it does not have the resources to receive a reply or unicast signal it expects.
Apr 08 03:21:01 openqa-x86-worker01.iad2.fedoraproject.org dbus-broker[76154]: Peer :1.10196 is being disconnected as it does not have the resources to receive a reply or unicast signal it expects.
Apr 08 03:21:01 openqa-x86-worker01.iad2.fedoraproject.org dbus-broker[76154]: Peer :1.10197 is being disconnected as it does not have the resources to receive a reply or unicast signal it expects.
Apr 08 03:21:02 openqa-x86-worker01.iad2.fedoraproject.org dbus-broker[76154]: Peer :1.10198 is being disconnected as it does not have the resources to receive a reply or unicast signal it expects.
Apr 08 03:21:02 openqa-x86-worker01.iad2.fedoraproject.org dbus-broker[76154]: Peer :1.10199 is being disconnected as it does not have the resources to receive a reply or unicast signal it expects.
Apr 08 03:21:02 openqa-x86-worker01.iad2.fedoraproject.org dbus-broker[76154]: Peer :1.10200 is being disconnected as it does not have the resources to receive a reply or unicast signal it expects.
Googling that error, I found https://github.com/bus1/dbus-broker/issues/201 . Following the discussion there, I searched for "exceeded" messages, and indeed found these:
[root@openqa-x86-worker01 adamwill][PROD]# journalctl --since 2021-04-08 | grep "exceeded"
Apr 08 03:18:40 openqa-x86-worker01.iad2.fedoraproject.org dbus-broker[76154]: UID 991 exceeded its 'bytes' quota on UID 991.
Apr 08 09:26:25 openqa-x86-worker01.iad2.fedoraproject.org dbus-broker[76154]: UID 991 exceeded its 'bytes' quota on UID 991.
Apr 08 11:59:19 openqa-x86-worker01.iad2.fedoraproject.org dbus-broker[76154]: UID 991 exceeded its 'bytes' quota on UID 991.
There are three floods of "is being disconnected" messages, each preceded by an "exceeded its 'bytes' quota" message. So we definitely seem to be in the pattern from that dbus-broker report. Note also that UID 991 is the _openqa_worker user that openQA jobs run as.
So, it sounds like openQA / os-autoinst is acting as dvdhrm described in https://github.com/bus1/dbus-broker/issues/201#issuecomment-485715973 : it "does not dequeue its messages, or exhibits some other buggy client-behavior". I'm not sure exactly where the issue lies.
Updated by AdamWill over 3 years ago
- File dbusdebug.txt added
- File dbusps.txt added
Here's the debug info and ps auxf output that was suggested as a way to debug this in the upstream issue.
Updated by okurz over 3 years ago
- Target version set to Ready
Interesting case. I guess, though, that this is an issue that others on the team can't reproduce and hence can hardly help with. Let's see if anyone has an idea.
Updated by AdamWill over 3 years ago
Well, I would think that the basic behaviour must be the same whatever the dbus daemon is, it's just a question of how the daemon handles it. dbus-broker apparently uses a hard limit; perhaps dbus-daemon is willing to just let a client keep piling up a backlog of messages (or whatever is going on) until system RAM is exhausted and it gets OOM killed, or something. But the underlying behaviour should be there and hence debuggable either way.
I will try and find someone who can help me debug this if no-one else gets to it first, but it's not #1 on my priority list as we're in final freeze for F34 right now.
Updated by dvdhrm over 3 years ago
The log shows 14 processes with significant OutgoingBytes buffers in dbus-broker. This means there is data leaving dbus-broker but not being dequeued by the client (of course, there is always the possibility of this being a bug in the broker).
The 14 processes have between 12 and 60 megabytes of data queued, so this is a sign of something being wrong. All 14 processes are instances of "/usr/bin/isotovideo: backend". No other process in this log shows this behavior. Everything else looks normal, on first glance. (Search for OutgoingBytes in the logs, and look at the entries which have multiple megabytes of data listed.)
The logs also show that all these clients have a single match rule installed. So my suspicion is that these processes do not dispatch the dbus file-descriptor, while someone keeps sending them messages.
Maybe someone with knowledge of "isotovideo: backend" can have a quick look? I might not be able to look further into this before next Monday.
Updated by AdamWill over 3 years ago
Thanks a lot for the info!
So in the openQA system, isotovideo is the process that ultimately does the heavy work of actually running tests. It launches an instance of qemu and interacts with it, recording a video and several screenshots along the way.
An openQA "worker host" will have several long-running "worker" services; each time a test job is scheduled, one of these "worker" instances launches isotovideo to run the actual test. The isotovideo process only lasts as long as it takes to run the test job. I believe the worker process and isotovideo signal to each other via DBus, and that's probably where the problem lies.
Updated by AdamWill over 3 years ago
Hmm, I might have butchered that (I was going from memory). Poking through the code, it looks more like the worker service and isotovideo have their own system for communicating. DBus is used by isotovideo to signal to os-autoinst-openvswitch, which is a helper process for interacting with openvswitch (we use openvswitch for some advanced network configurations in certain openQA tests). Broadly, it looks like os-autoinst-openvswitch provides the org.opensuse.os_autoinst.switch interface and exports the methods "set_vlan", "unset_vlan", and "show"; isotovideo is only meant, AFAICT, to call those methods, not to receive anything.
Key files to look at are https://github.com/os-autoinst/os-autoinst/blob/master/backend/qemu.pm (which has the only dbus-related code in the isotovideo backend that I can find) and https://github.com/os-autoinst/os-autoinst/blob/master/os-autoinst-openvswitch (the os-autoinst-openvswitch process which does act as an exporter).
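For reference, here's a minimal sketch of what such an exporter looks like with Net::DBus::Object, modelled loosely on os-autoinst-openvswitch (the method signature, return values, and the ovs-vsctl invocation are illustrative, not copied from the real script):

#!/usr/bin/perl
# Minimal sketch of a Net::DBus::Object exporter shaped like
# os-autoinst-openvswitch (simplified for illustration).
use strict;
use warnings;

package OVS;
use base 'Net::DBus::Object';
use Net::DBus::Exporter 'org.opensuse.os_autoinst.switch';

# declare the exported method: (tap device, vlan id) -> (rc, error message)
dbus_method('set_vlan', ['string', 'uint32'], ['int32', 'string']);

sub new {
    my ($class, $service) = @_;
    return $class->SUPER::new($service, '/switch');
}

sub set_vlan {
    my ($self, $tap, $vlan) = @_;
    # the real service shells out to ovs-vsctl roughly like this
    my $rc = system('ovs-vsctl', 'set', 'port', $tap, "tag=$vlan");
    return ($rc, $rc ? "failed to set vlan $vlan on $tap" : '');
}

package main;
use Net::DBus;
use Net::DBus::Reactor;

my $bus     = Net::DBus->system;
my $service = $bus->export_service('org.opensuse.os_autoinst.switch');
OVS->new($service);

# the Reactor main loop reads and dispatches incoming D-Bus messages, so
# the service side never builds up a backlog
Net::DBus::Reactor->main->run;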
Updated by dvdhrm over 3 years ago
Check out the line in backend/qemu.pm where the bus connection is created: it calls into Net::DBus->system and thus into Net::DBus::_new(). This function installs a D-Bus match for NameOwnerChanged. Hence, this connection will continuously accumulate signals for every name-change on the system bus. If you do not dequeue those signals, they will slowly pile up. Note that every connect/disconnect to the bus will trigger such signals. Given that your previous logs showed connection-ids like :1.230716, this means there have been at least about half a million NameOwnerChanged signals. These signals will pile up inside of dbus-broker, because D-Bus is not a lossy protocol.
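To make the failure mode concrete, the client side boils down to roughly this (a sketch of the pattern in backend/qemu.pm; error handling omitted):

# Rough sketch of the client pattern in backend/qemu.pm.
use strict;
use warnings;
use Net::DBus;

# Net::DBus->system returns a shared bus connection; as described above,
# its setup installs a match on NameOwnerChanged for name tracking.
my $bus    = Net::DBus->system;
my $object = $bus->get_service('org.opensuse.os_autoinst.switch')
                 ->get_object('/switch', 'org.opensuse.os_autoinst.switch');

# The method call itself round-trips fine, but the connection then stays
# open for the life of the process with no main loop running, so the
# matched NameOwnerChanged signals are never read and queue up in the broker.
my ($rc, $message) = $object->set_vlan('tap9', 20);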
This is just a guess. My Perl knowledge is rather small. Looking at the Net::DBus code, it seems to integrate with the Net/DBus/Reactor.pm mainloop, but only if you import that module beforehand. I would assume there is nothing dispatching your D-Bus messages otherwise.
I would recommend passing private => 1 to the constructor and then making sure to destroy the connection when you no longer need it. But this really needs input from the respective developers. This is beyond what I can safely judge.
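In Perl terms the suggestion might look like this (a sketch only; the ovs_call helper is invented here, and whether disconnect() is the right teardown for Net::DBus is exactly the part that needs the module developers' input):

# Sketch of the suggested fix: a private, short-lived connection per call.
use strict;
use warnings;
use Net::DBus;

# hypothetical helper wrapping the openvswitch D-Bus calls
sub ovs_call {
    my ($method, @args) = @_;
    # private => 1 requests a dedicated connection instead of the shared one
    my $bus    = Net::DBus->system(private => 1);
    my $object = $bus->get_service('org.opensuse.os_autoinst.switch')
                     ->get_object('/switch', 'org.opensuse.os_autoinst.switch');
    my @ret = $object->$method(@args);
    # dropping the connection means no signal backlog can build up against it
    $bus->get_connection->disconnect;
    return @ret;
}

my ($rc, $message) = ovs_call('set_vlan', 'tap9', 20);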
Updated by mkittler over 3 years ago
But this really needs input of the respective developers.
The D-Bus service was introduced by https://github.com/os-autoinst/os-autoinst/commit/315074ecb8b12817771fde11ad80291cdd4cf534, and its author only contributes occasionally. I don't think anybody in the current team has enough experience with Perl's Net::DBus::Object module to know anything beyond what @dvdhrm has already found out.
I suppose the first step would be to come up with a test which reproduces the problem. The real fix might be rather small.
Alternatively we could also just ditch D-Bus and simply make it an HTTP server. That would be a bigger change but also an easy one which would definitely solve the issue.
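Something like this, perhaps, using Mojolicious (already an openQA dependency); the route, port, and JSON shape here are invented for illustration:

# Hypothetical sketch of the D-Bus service redone as a tiny local HTTP API.
use Mojolicious::Lite -signatures;

post '/set_vlan' => sub ($c) {
    my $tap  = $c->param('tap');
    my $vlan = $c->param('vlan');
    # the real implementation would shell out to ovs-vsctl, as the
    # current D-Bus service does
    my $rc = system('ovs-vsctl', 'set', 'port', $tap, "tag=$vlan");
    $c->render(json => {rc => $rc, message => $rc ? 'set_vlan failed' : ''});
};

# listen on localhost only; isotovideo would POST here instead of using D-Bus
app->start('daemon', '-l', 'http://127.0.0.1:9999');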
Updated by mkittler over 3 years ago
- Assignee set to mkittler
Looks like our code is very much C&P from https://metacpan.org/pod/Net::DBus::Object except for Net::DBus->system (the example uses Net::DBus->find). I suppose this shouldn't make a difference here, though. Maybe one really needs to use Net::DBus::Reactor explicitly.
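For completeness, explicit dispatching would look roughly like this (a sketch; whether running a reactor iteration fits into isotovideo's backend loop is a separate question):

# Sketch: using Net::DBus::Reactor so queued messages on the shared
# connection actually get read instead of piling up in the broker.
use strict;
use warnings;
use Net::DBus;
use Net::DBus::Reactor;   # must be loaded for the connection to integrate

my $bus     = Net::DBus->system;
my $reactor = Net::DBus::Reactor->main;

# step() performs one iteration of the event loop, reading and dispatching
# pending messages; a long-lived process would call this (or run()) regularly
$reactor->step;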
Updated by AdamWill over 3 years ago
The perl-Net-DBus author (dpberrange) is actually a Fedora/Red Hat person, I believe. I'll ask them if they can take a look and see what they think/suggest.
Updated by AdamWill over 3 years ago
Sent https://github.com/os-autoinst/os-autoinst/pull/1641 as a possible way to deal with this, based on discussion with danpb. It at least doesn't seem to break anything. Will have to run with it for a while to see if the bug is fixed.
Updated by openqa_review over 3 years ago
- Due date set to 2021-04-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 3 years ago
- Status changed from New to In Progress
- Assignee changed from mkittler to AdamWill
I'm assigning the ticket to you then. The PR has been merged. It would be interesting to know whether it actually fixed the issue in the end.
Updated by AdamWill over 3 years ago
Yeah, I have it running on both our deployments now. If we haven't had any more incompletions by next week I'll try and remember to close this with a note.
Updated by okurz over 3 years ago
- Status changed from In Progress to Resolved
Assuming fixed then :)
Updated by AdamWill over 3 years ago
I was gonna confirm tomorrow, but yeah, it looks like it's resolved indeed.