action #136274
closed
Failing DNS resolution on o3 for hosts like github.com
Description
Observation
User report https://suse.slack.com/archives/C02CANHLANP/p1695315433186189
(Ana Guerrero Lopez) I can't update a needle in openqa.opensuse.org, looks like some DNS problem?
Failed to save system_role-system-role-gnome-selected-20230921.
Unable to push Git commit: ssh: Could not resolve hostname github.com: Temporary failure in name resolution
fatal: Could not read from remote repository.
Steps to reproduce
- On o3
host github.com
- Observe error
;; connection timed out; no servers could be reached
Problem
- The DNS server from /etc/resolv.conf is nameserver 127.0.0.1, which is served by a local dnsmasq instance.
journalctl --since="1h ago" -u dnsmasq states:
Sep 21 16:51:32 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:54:08 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:54:14 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:54:20 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:54:26 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:54:32 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 available DHCP subnet: 10.150.1.0/255.255.255.0
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 client provides name: openqaworker25
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 DHCPREQUEST(eth1) 10.150.1.28 7c:c2:55:24:de:98
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 tags: known, eth1
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 DHCPACK(eth1) 10.150.1.28 7c:c2:55:24:de:98 openqaworker25
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 58:T1, 59:T2, 1:netmask, 28:broadcast, 26:mtu,
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 121:classless-static-route, 3:router, 33:static-route,
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 12:hostname, 119:domain-search, 15:domain-name,
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 6:dns-server, 40:nis-domain, 41:nis-server,
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 42:ntp-server, 17:root-path, 85, 86, 87,
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 100:posix-timezone, 101:tzdb-timezone
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 server name: 10.150.1.11
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 next server: 10.150.1.11
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 1 option: 53 message-type 5
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 4 option: 54 server-identifier 10.150.1.11
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 4 option: 51 lease-time 1h
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 4 option: 58 T1 27m39s
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 4 option: 59 T2 50m9s
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 4 option: 1 netmask 255.255.255.0
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 4 option: 28 broadcast 10.150.1.255
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 4 option: 6 dns-server 10.150.1.11
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 22 option: 15 domain-name openqanet.opensuse.org
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 14 option: 12 hostname openqaworker25
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 4 option: 42 ntp-server 10.150.1.11
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 4 option: 3 router 10.150.1.254
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 available DHCP subnet: 10.150.1.0/255.255.255.0
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 client provides name: openqaworker28
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 DHCPREQUEST(eth1) 10.150.1.31 3c:ec:ef:fd:c5:cc
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 tags: known, eth1
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 DHCPACK(eth1) 10.150.1.31 3c:ec:ef:fd:c5:cc openqaworker28
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 58:T1, 59:T2, 1:netmask, 28:broadcast, 26:mtu,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 121:classless-static-route, 3:router, 33:static-route,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 12:hostname, 119:domain-search, 15:domain-name,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 6:dns-server, 40:nis-domain, 41:nis-server,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 42:ntp-server, 17:root-path, 85, 86, 87,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 44:netbios-ns, 45:netbios-dd, 46:netbios-nodetype,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 47:netbios-scope, 100:posix-timezone, 101:tzdb-timezone
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 server name: 10.150.1.11
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 next server: 10.150.1.11
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 1 option: 53 message-type 5
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 4 option: 54 server-identifier 10.150.1.11
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 4 option: 51 lease-time 1h
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 4 option: 58 T1 26m16s
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 4 option: 59 T2 48m46s
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 4 option: 1 netmask 255.255.255.0
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 4 option: 28 broadcast 10.150.1.255
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 4 option: 6 dns-server 10.150.1.11
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 22 option: 15 domain-name openqanet.opensuse.org
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 14 option: 12 hostname openqaworker28
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 4 option: 42 ntp-server 10.150.1.11
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 4 option: 3 router 10.150.1.254
Sep 21 16:56:56 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:57:33 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:57:39 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
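To see how often dnsmasq hit its query limit, the journal excerpt above can be summarized per minute. The following is only a minimal sketch: the helper name `count_limit_hits` and the regular expression are assumptions made for illustration, matched against the log format shown in this ticket, and are not part of any dnsmasq tooling.

```python
import re
from collections import Counter

# Matches the dnsmasq warning as it appears in the journal excerpt above.
# The timestamp is captured down to the minute so hits can be bucketed.
LIMIT_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+ \d{2}:\d{2}):\d{2} \S+ dnsmasq\[\d+\]: "
    r"Maximum number of concurrent DNS queries reached \(max: (?P<max>\d+)\)"
)


def count_limit_hits(lines):
    """Count query-limit warnings per minute (hypothetical helper)."""
    hits = Counter()
    for line in lines:
        match = LIMIT_RE.match(line)
        if match:
            hits[match.group("minute")] += 1
    return hits


# Lines copied from the journal output in this ticket.
sample = [
    "Sep 21 16:54:08 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)",
    "Sep 21 16:54:14 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)",
    "Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 tags: known, eth1",
    "Sep 21 16:56:56 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)",
]

for minute, count in sorted(count_limit_hits(sample).items()):
    print(minute, count)
```

The DHCP lines are skipped because they are logged under dnsmasq-dhcp, so only the DNS limit warnings are counted.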
Updated by okurz about 1 year ago
- Status changed from New to In Progress
- Assignee set to okurz
okurz@new-ariel:~> nslookup github.com 10.150.1.11
^C
okurz@new-ariel:~> nslookup github.com 1.1.1.1
Server: 1.1.1.1
Address: 1.1.1.1#53
Non-authoritative answer:
Name: github.com
Address: 140.82.121.4
okurz@new-ariel:~> for i in 10.150.1.11 1.1.1.1; do timeout 4 nslookup github.com $i; done
Server: 10.150.1.11
Address: 10.150.1.11#53
** server can't find github.com: REFUSED
Server: 1.1.1.1
Address: 1.1.1.1#53
Non-authoritative answer:
Name: github.com
Address: 140.82.121.3
okurz@new-ariel:~> for i in 10.150.1.11 1.1.1.1; do timeout 4 nslookup github.com $i; done
Server: 1.1.1.1
Address: 1.1.1.1#53
Non-authoritative answer:
Name: github.com
Address: 140.82.121.3
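The nslookup loop above can also be approximated without external tools. This is a rough sketch only: `build_query` and `probe` are hypothetical helpers, the query is a bare A-record lookup over UDP, and the response ID is not verified, so it is suitable for a quick check but not for anything security-sensitive.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS A-record query in RFC 1035 wire format."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server, name="github.com", timeout=4.0):
    """Return the DNS response code from `server`, or None on timeout/error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return None
    if len(data) < 4:
        return None
    # The low 4 bits of the second flags byte carry the response code
    # (0 = NOERROR, 5 = REFUSED).
    return data[3] & 0x0F
```

Calling `probe("10.150.1.11")` and `probe("1.1.1.1")` in turn would mirror the loop above: a return value of None corresponds to the timeout case, 5 to REFUSED.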
but why try 10.150.1.11 which is
Updated by okurz about 1 year ago
- Related to action #135740: [alert] Munin - minion hook failed - opensuse.org :: openqa.opensuse.org - only "label_known_issues" hook scripts size:M added
Updated by nicksinger about 1 year ago
These are the upstream DNS servers defined in our dnsmasq config
new-ariel:/etc/dnsmasq.d # grep -ri "server="
openqa.conf:server=8.8.8.8
openqa.conf:server=/infra.opensuse.org/192.168.47.4
openqa.conf:server=/47.168.192.in-addr.arpa/192.168.47.4
Only 8.8.8.8 is defined as a general upstream server, and relying on it alone has been reported elsewhere to cause problems similar to this: https://forum.openwrt.org/t/dnsmasq-maximum-concurrent-dns-queries-limit/164427
I've added 1.1.1.1 and 9.9.9.9 now and reloaded (trying to avoid having the issue with 8.8.8.8 or the queue size fixed by a restart) and queries seem to resolve after a short timeout:
new-ariel:/etc # dig heise.de
; <<>> DiG 9.16.43 <<>> heise.de
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1329
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;heise.de. IN A
;; ANSWER SECTION:
heise.de. 84524 IN A 193.99.144.80
;; Query time: 0 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Thu Sep 21 17:23:20 UTC 2023
;; MSG SIZE rcvd: 53
Updated by okurz about 1 year ago
@nicksinger thank you. Seems we stepped on each other's toes
… us ourselves on eth1 anyway? Maybe I am just getting confused with the entry that is passed on to DHCP clients.
/etc/dnsmasq.d/openqa.conf says server=10.151.53.53
but 10.151.53.53 does not respond to pings.
I tried to add another entry in /etc/dnsmasq.d/openqa.conf with server=1.1.1.1
but that does not seem to help.
As a workaround I added nameserver 1.1.1.1 to /etc/resolv.conf. That seems to work:
okurz@new-ariel:~> nslookup github.com
Server: 1.1.1.1
Address: 1.1.1.1#53
Non-authoritative answer:
Name: github.com
Address: 140.82.121.4
I reported https://sd.suse.com/servicedesk/customer/portal/1/SD-132971
Updated by okurz about 1 year ago
- Status changed from In Progress to Blocked
Updated by nicksinger about 1 year ago
Right, I was confused why server=10.151.53.53 was commented out just as I wanted to update the ticket :')
From the journal of dnsmasq I see that we hand out the IP of new-ariel as DNS server:
Aug 29 23:05:06 new-ariel dnsmasq-dhcp[24171]: 1804410612 sent size: 4 option: 6 dns-server 10.150.1.11
which is correct, as new-ariel is a recursive server for all other hosts in the network. Adding anything else to /etc/resolv.conf shouldn't be necessary, as dnsmasq should provide DNS for ariel itself too, and doing so would hide issues that still affect the workers. Upstream resolvers are handled by the corresponding server= entries in the dnsmasq config. Checking dnsmasq right now with strace -p 16831 (the PID of the process listening on port 53) shows that it is stuck trying to reach 10.151.53.53:
strace: Process 16831 attached
sendto(5, "\0-\0\1\1\0\0\1\0\0\0\0\0\1\3dns\10msftncsi\3com\0"..., 47, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.151.53.53")}, 16
ss -tupn | grep 10.151.53.53
confirms open connections:
tcp SYN-SENT 0 1 10.150.2.10:53254 10.151.53.53:53 users:(("dnsmasq",pid=18926,fd=5))
tcp SYN-SENT 0 1 10.150.2.10:55616 10.151.53.53:53 users:(("dnsmasq",pid=22076,fd=5))
tcp SYN-SENT 0 1 10.150.2.10:33776 10.151.53.53:53 users:(("dnsmasq",pid=19462,fd=5))
tcp SYN-SENT 0 1 10.150.2.10:56814 10.151.53.53:53 users:(("dnsmasq",pid=18922,fd=5))
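Counting the stuck connection attempts per upstream makes it easier to spot which resolver is unreachable. The sketch below parses text in the column layout of the ss output above; `stuck_syn_sent` is a hypothetical helper, and the column positions are an assumption based on that output rather than a guaranteed ss format.

```python
from collections import Counter


def stuck_syn_sent(ss_output, port=53):
    """Count TCP connections in SYN-SENT state per remote address."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: netid state recv-q send-q local peer [process]
        if len(fields) < 6 or fields[0] != "tcp" or fields[1] != "SYN-SENT":
            continue
        host, _, peer_port = fields[5].rpartition(":")
        if peer_port == str(port):
            counts[host] += 1
    return counts


# Lines copied from the ss output in this ticket.
sample = """\
tcp SYN-SENT 0 1 10.150.2.10:53254 10.151.53.53:53 users:(("dnsmasq",pid=18926,fd=5))
tcp SYN-SENT 0 1 10.150.2.10:55616 10.151.53.53:53 users:(("dnsmasq",pid=22076,fd=5))
tcp SYN-SENT 0 1 10.150.2.10:33776 10.151.53.53:53 users:(("dnsmasq",pid=19462,fd=5))
tcp SYN-SENT 0 1 10.150.2.10:56814 10.151.53.53:53 users:(("dnsmasq",pid=18922,fd=5))
"""

print(dict(stuck_syn_sent(sample)))
```

A remote that only ever shows up in SYN-SENT never completed the TCP handshake, which matches 10.151.53.53 not responding to pings.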
Updated by nicksinger about 1 year ago
I just restarted dnsmasq.service to kill these hanging connections so they do not cause further problems on our workers. Now 127.0.0.1 responds immediately:
/etc/dnsmasq.d # dig heise.de @127.0.0.1
; <<>> DiG 9.16.43 <<>> heise.de @127.0.0.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54140
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;heise.de. IN A
;; ANSWER SECTION:
heise.de. 549 IN A 193.99.144.80
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 21 17:37:24 UTC 2023
;; MSG SIZE rcvd: 53
So would you agree that we can remove 1.1.1.1 again from /etc/resolv.conf?
Updated by nicksinger about 1 year ago
- Priority changed from Urgent to Normal
We reduced the priority after pointing /etc/resolv.conf at a public resolver and adding more public resolvers to our dnsmasq.conf (which mitigates the problem for workers in that network). An infra ticket was created to fix the broken DNS server which started these problems.
Updated by okurz about 1 year ago
- Status changed from Blocked to Resolved
nicksinger wrote in #note-7:
So would you agree that we can remove 1.1.1.1 again from /etc/resolv.conf?
yes, done.
Responded in SD-ticket with:
As we are using dnsmasq we have now the following configuration:
server=10.151.53.53
server=10.151.53.54
# fallback servers
# nsinger: 2023-09-21: added additional servers as fallback if google misbehaves
server=192.168.47.4 #internal resolver which is also recursive
server=1.1.1.1
server=9.9.9.9
server=8.8.8.8
server=/infra.opensuse.org/192.168.47.4
server=/47.168.192.in-addr.arpa/192.168.47.4
no-resolv
All good now
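To double-check which upstreams dnsmasq would use for general queries versus specific domains, the server= lines can be split programmatically. This is a minimal sketch: `parse_upstreams` is a hypothetical helper, it only handles the plain server=ADDRESS and server=/domain/ADDRESS forms seen in this ticket, and it assumes a "#" starts a comment only at the beginning of a line or after whitespace.

```python
import re

# Assumption: "#" begins a comment only at line start or after whitespace,
# so an address with a "#port" suffix would be left intact.
COMMENT_RE = re.compile(r"(^|\s)#.*$")


def parse_upstreams(config_text):
    """Split dnsmasq server= lines into general and per-domain upstreams."""
    general, per_domain = [], {}
    for raw in config_text.splitlines():
        line = COMMENT_RE.sub("", raw).strip()
        if not line.startswith("server="):
            continue
        value = line[len("server="):]
        if value.startswith("/"):
            # Form: /domain[/domain...]/address
            *domains, address = value[1:].split("/")
            for domain in domains:
                per_domain.setdefault(domain, []).append(address)
        else:
            general.append(value)
    return general, per_domain


# Configuration copied from the service desk response in this ticket.
sample = """\
server=10.151.53.53
server=10.151.53.54
# fallback servers
# nsinger: 2023-09-21: added additional servers as fallback if google misbehaves
server=192.168.47.4 #internal resolver which is also recursive
server=1.1.1.1
server=9.9.9.9
server=8.8.8.8
server=/infra.opensuse.org/192.168.47.4
server=/47.168.192.in-addr.arpa/192.168.47.4
no-resolv
"""

general, per_domain = parse_upstreams(sample)
print(general)
print(per_domain)
```

With more than one general upstream listed, a single unreachable resolver such as 10.151.53.53 no longer leaves dnsmasq without a working alternative.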
Updated by jbaier_cz 8 months ago
- Related to action #156322: zabbix-proxy.dmz-prg2.suse.org not reachable from ariel.suse-dmz.opensuse.org added