action #109878
bot-ng schedule/approve aborted
Description
Observation
The postgresql container on qam2.suse.de changed its IP address again, and the dashboard then lost its connection to the database.
After a restart of the dashboard service (which reconnected to the database using the new IP), the service worked for a while, but the connection between the container and the host still seems to break intermittently.
-- Logs begin at Tue 2022-04-12 18:07:21 CEST. --
dub 12 18:07:26 qam2 dnsmasq-dhcp[616]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
dub 12 18:07:26 qam2 dnsmasq-dhcp[616]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql
dub 12 18:13:27 qam2 dnsmasq-dhcp[616]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
dub 12 18:13:27 qam2 dnsmasq-dhcp[616]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
dub 12 18:13:27 qam2 dnsmasq-dhcp[616]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
dub 12 18:13:27 qam2 dnsmasq-dhcp[616]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql
dub 12 18:13:37 qam2 dnsmasq-dhcp[616]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
dub 12 18:13:37 qam2 dnsmasq-dhcp[616]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
dub 12 18:13:37 qam2 dnsmasq-dhcp[616]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
dub 12 18:13:37 qam2 dnsmasq-dhcp[616]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql
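In the log above, dnsmasq offers 192.168.0.168 while the client keeps requesting 192.168.0.48. A minimal sketch for flagging that pattern in journal output, assuming the dnsmasq-dhcp line format seen here; the function name and the regular expression are illustrative and not part of any existing tool:

```python
import re

# Matches dnsmasq-dhcp journal lines such as
# "... dnsmasq-dhcp[616]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13"
# The IP is optional because DHCPDISCOVER lines carry only the MAC address.
DHCP_LINE = re.compile(
    r"dnsmasq-dhcp\[\d+\]: (?P<kind>DHCP[A-Z]+)\((?P<iface>[^)]+)\)"
    r"(?: (?P<ip>\d+\.\d+\.\d+\.\d+))? (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
)


def find_offer_request_mismatches(lines):
    """Return (mac, offered_ip, requested_ip) for each DHCPREQUEST whose
    address differs from the most recent DHCPOFFER sent to the same MAC."""
    last_offer = {}
    mismatches = []
    for line in lines:
        match = DHCP_LINE.search(line)
        if not match:
            continue
        kind, ip, mac = match["kind"], match["ip"], match["mac"]
        if kind == "DHCPOFFER":
            last_offer[mac] = ip
        elif kind == "DHCPREQUEST" and mac in last_offer and ip != last_offer[mac]:
            mismatches.append((mac, last_offer[mac], ip))
    return mismatches
```

Feeding it the output of `journalctl -u dnsmasq` (assuming that is the unit name on the host) would list every request that did not follow the preceding offer, which is one hint that a second DHCP server may be answering on the same network.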
Acceptance criteria
- AC1: The pipeline doesn't fail due to dashboard connection problems with the database
Suggestions
Updated by osukup over 2 years ago
- Copied from action #107227: bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M added
Updated by jbaier_cz over 2 years ago
I will just add some more context from the logs, in case someone is interested in further debugging. The server was restarted recently (not a clean or planned restart, though).
dub 12 12:47:56 qam2 kernel: Linux version 5.3.18-150300.59.60-default (geeko@buildhost) (gcc version 7.5.0 (SUSE Linux)) #1 SMP Fri Mar 18 18:37:08 UTC 2022 (79e1683)
dub 12 12:47:57 qam2 systemd-fsck[295]: ROOT: recovering journal
dub 12 12:48:05 qam2 dnsmasq-dhcp[621]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
dub 12 12:48:05 qam2 dnsmasq-dhcp[621]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
dub 12 12:48:05 qam2 dnsmasq-dhcp[621]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
dub 12 12:48:05 qam2 dnsmasq-dhcp[621]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql
dub 12 12:48:05 postgresql systemd-networkd[29]: host0: DHCPv4 address 192.168.0.48/24 via 192.168.0.1
dub 12 15:06:23 qam2 dashboard[9990]: [9990] [i] GET http://127.0.0.1:4000/app/api/incident/23085 -> 500 (0.003944s, 253.550/s)
dub 12 15:06:33 qam2 dashboard[9991]: [9991] [e] [_OFIWrQOXDVd] DBI connect('dbname=dashboard_db;host=postgresql;port=5432','dashboard_user',...) failed: could not translate host name "postgresql" to address: No address associated with hostname at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Pg.pm line 73.
dub 12 15:06:39 qam2 dnsmasq-dhcp[621]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
dub 12 15:06:39 qam2 dnsmasq-dhcp[621]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
dub 12 15:06:39 qam2 dnsmasq-dhcp[621]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
dub 12 15:06:39 qam2 dnsmasq-dhcp[621]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql
dub 12 15:06:39 postgresql systemd-networkd[504]: host0: DHCPv4 address 192.168.0.48/24 via 192.168.0.1
Updated by jbaier_cz over 2 years ago
- Status changed from New to Feedback
- Assignee set to jbaier_cz
And because I was extra curious, I investigated further. First, I found that /var/lib/misc/dnsmasq.leases was empty, which explained why the host could not resolve the name. That also gave me an idea, and I found the actual problem and solved it once and for all (fingers crossed): we had yet another DHCP server running for that network, which explains the observed behavior very well. So I deleted DHCPServer=yes from the systemd network interface configuration, and I expect the IP to be stable from now on.
I would still recommend implementing #106547, though.
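The two symptoms described in this comment, an empty dnsmasq lease file and a host name that does not resolve, can be checked with a short script. This is only a sketch: the helper names are hypothetical, the lease file path is the one quoted above, and a missing lease file is treated the same as an empty one, which is an assumption.

```python
import os
import socket


def has_no_leases(path="/var/lib/misc/dnsmasq.leases"):
    """Return True if the dnsmasq lease file is empty or does not exist."""
    try:
        return os.path.getsize(path) == 0
    except FileNotFoundError:
        return True


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses a name resolves to, or [] if the
    lookup fails (the condition behind the dashboard's DBI connect error)."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print("lease file empty:", has_no_leases())
    print("postgresql resolves to:", resolve_ipv4("postgresql"))
```

Run on the host, an empty lease file together with an empty address list for postgresql would match the situation found here.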
Updated by jbaier_cz over 2 years ago
- Status changed from Feedback to Resolved
I have not seen any misbehavior since. The IP address seems to be stable and there are no unexpected DHCP messages in the journal.