action #136274

closed

Failing DNS resolution on o3 for hosts like github.com

Added by okurz 7 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2023-09-21
Due date:
% Done:

0%

Estimated time:

Description

Observation

User report https://suse.slack.com/archives/C02CANHLANP/p1695315433186189

(Ana Guerrero Lopez) I can't update a needle in openqa.opensuse.org, looks like some DNS problem?
Failed to save system_role-system-role-gnome-selected-20230921.
Unable to push Git commit: ssh: Could not resolve hostname github.com: Temporary failure in name resolution
fatal: Could not read from remote repository.

Steps to reproduce

  • On o3, run host github.com
  • Observe the error ;; connection timed out; no servers could be reached

Problem

  • The DNS server configured in /etc/resolv.conf is nameserver 127.0.0.1, which is served by a local dnsmasq instance. journalctl --since="1h ago" -u dnsmasq shows:
Sep 21 16:51:32 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:54:08 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:54:14 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:54:20 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:54:26 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:54:32 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 available DHCP subnet: 10.150.1.0/255.255.255.0
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 client provides name: openqaworker25
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 DHCPREQUEST(eth1) 10.150.1.28 7c:c2:55:24:de:98
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 tags: known, eth1
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 DHCPACK(eth1) 10.150.1.28 7c:c2:55:24:de:98 openqaworker25
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 58:T1, 59:T2, 1:netmask, 28:broadcast, 26:mtu,
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 121:classless-static-route, 3:router, 33:static-route,
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 12:hostname, 119:domain-search, 15:domain-name,
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 6:dns-server, 40:nis-domain, 41:nis-server,
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 42:ntp-server, 17:root-path, 85, 86, 87,
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 requested options: 100:posix-timezone, 101:tzdb-timezone
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 server name: 10.150.1.11
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 next server: 10.150.1.11
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size:  1 option: 53 message-type  5
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size:  4 option: 54 server-identifier  10.150.1.11
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size:  4 option: 51 lease-time  1h
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size:  4 option: 58 T1  27m39s
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size:  4 option: 59 T2  50m9s
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size:  4 option:  1 netmask  255.255.255.0
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size:  4 option: 28 broadcast  10.150.1.255
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size:  4 option:  6 dns-server  10.150.1.11
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 22 option: 15 domain-name  openqanet.opensuse.org
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size: 14 option: 12 hostname  openqaworker25
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size:  4 option: 42 ntp-server  10.150.1.11
Sep 21 16:55:33 new-ariel dnsmasq-dhcp[1739]: 282234984 sent size:  4 option:  3 router  10.150.1.254
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 available DHCP subnet: 10.150.1.0/255.255.255.0
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 client provides name: openqaworker28
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 DHCPREQUEST(eth1) 10.150.1.31 3c:ec:ef:fd:c5:cc
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 tags: known, eth1
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 DHCPACK(eth1) 10.150.1.31 3c:ec:ef:fd:c5:cc openqaworker28
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 58:T1, 59:T2, 1:netmask, 28:broadcast, 26:mtu,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 121:classless-static-route, 3:router, 33:static-route,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 12:hostname, 119:domain-search, 15:domain-name,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 6:dns-server, 40:nis-domain, 41:nis-server,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 42:ntp-server, 17:root-path, 85, 86, 87,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 44:netbios-ns, 45:netbios-dd, 46:netbios-nodetype,
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 requested options: 47:netbios-scope, 100:posix-timezone, 101:tzdb-timezone
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 server name: 10.150.1.11
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 next server: 10.150.1.11
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size:  1 option: 53 message-type  5
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size:  4 option: 54 server-identifier  10.150.1.11
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size:  4 option: 51 lease-time  1h
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size:  4 option: 58 T1  26m16s
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size:  4 option: 59 T2  48m46s
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size:  4 option:  1 netmask  255.255.255.0
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size:  4 option: 28 broadcast  10.150.1.255
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size:  4 option:  6 dns-server  10.150.1.11
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 22 option: 15 domain-name  openqanet.opensuse.org
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size: 14 option: 12 hostname  openqaworker28
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size:  4 option: 42 ntp-server  10.150.1.11
Sep 21 16:55:50 new-ariel dnsmasq-dhcp[1739]: 376277500 sent size:  4 option:  3 router  10.150.1.254
Sep 21 16:56:56 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:57:33 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
Sep 21 16:57:39 new-ariel dnsmasq[1739]: Maximum number of concurrent DNS queries reached (max: 150)
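
To get a feel for how often dnsmasq hits the query limit, the journal output above can be condensed into a per-minute count. This is a minimal sketch; the function name count_query_limit_hits is made up for illustration, and it assumes the default journalctl line format shown above (month, day, time as the first three fields).

```shell
# Count "Maximum number of concurrent DNS queries reached" messages per minute.
# Reads journalctl-style lines on stdin and prints "<count> <Mon> <day> <HH:MM>".
# The output is sorted as text, which is good enough within a single day.
count_query_limit_hits() {
    grep -F 'Maximum number of concurrent DNS queries reached' |
        awk '{ print $1, $2, substr($3, 1, 5) }' |
        sort | uniq -c |
        awk '{ print $1, $2, $3, $4 }'
}

# Intended use on the affected host (not executed here):
#   journalctl --since="1h ago" -u dnsmasq | count_query_limit_hits
```

A minute with many hits suggests that forwarded queries are piling up faster than upstream answers come back.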

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #135740: [alert] Munin - minion hook failed - opensuse.org :: openqa.opensuse.org - only "label_known_issues" hook scripts size:M (Resolved, livdywan, 2023-07-16, 2023-10-05)

Related to openQA Infrastructure - action #156322: zabbix-proxy.dmz-prg2.suse.org not reachable from ariel.suse-dmz.opensuse.org (Resolved, jbaier_cz, 2024-02-29)

Actions #1

Updated by okurz 7 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
okurz@new-ariel:~> nslookup github.com 10.150.1.11
^C
okurz@new-ariel:~> nslookup github.com 1.1.1.1
Server:     1.1.1.1
Address:    1.1.1.1#53

Non-authoritative answer:
Name:   github.com
Address: 140.82.121.4

okurz@new-ariel:~> for i in 10.150.1.11 1.1.1.1; do timeout 4 nslookup github.com $i; done
Server:     10.150.1.11
Address:    10.150.1.11#53

** server can't find github.com: REFUSED

Server:     1.1.1.1
Address:    1.1.1.1#53

Non-authoritative answer:
Name:   github.com
Address: 140.82.121.3

okurz@new-ariel:~> for i in 10.150.1.11 1.1.1.1; do timeout 4 nslookup github.com $i; done

Server:     1.1.1.1
Address:    1.1.1.1#53

Non-authoritative answer:
Name:   github.com
Address: 140.82.121.3

but why try 10.150.1.11 which is

Actions #2

Updated by okurz 7 months ago

  • Related to action #135740: [alert] Munin - minion hook failed - opensuse.org :: openqa.opensuse.org - only "label_known_issues" hook scripts size:M added
Actions #3

Updated by nicksinger 7 months ago

These are the upstream DNS servers defined in our dnsmasq config

new-ariel:/etc/dnsmasq.d # grep -ri "server="
openqa.conf:server=8.8.8.8
openqa.conf:server=/infra.opensuse.org/192.168.47.4
openqa.conf:server=/47.168.192.in-addr.arpa/192.168.47.4

Only 8.8.8.8 is defined as a general upstream, and relying on it alone has been reported elsewhere to cause problems similar to this one: https://forum.openwrt.org/t/dnsmasq-maximum-concurrent-dns-queries-limit/164427
I've now added 1.1.1.1 and 9.9.9.9 and reloaded dnsmasq (a reload rather than a restart, so that a restart does not mask the issue with 8.8.8.8 or the queue size), and queries seem to resolve after a short timeout:

new-ariel:/etc # dig heise.de

; <<>> DiG 9.16.43 <<>> heise.de
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1329
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;heise.de.          IN  A

;; ANSWER SECTION:
heise.de.       84524   IN  A   193.99.144.80

;; Query time: 0 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Thu Sep 21 17:23:20 UTC 2023
;; MSG SIZE  rcvd: 53
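
For reference, a quick way to list which upstream resolvers apply to general queries is to filter the server= lines and skip the domain-specific ones (those starting with server=/domain/). This is only a sketch; list_global_upstreams is a made-up helper name, and it expects plain config text, not grep output with a filename prefix.

```shell
# Print the upstream resolvers dnsmasq would use for general queries,
# i.e. "server=<ip>" lines without a "/domain/" restriction.
# Reads dnsmasq configuration text on stdin. Anything after the address
# (a trailing comment or a "#port" suffix) is dropped.
list_global_upstreams() {
    sed -n 's/^server=\([^/#[:space:]]*\).*$/\1/p' | grep -v '^$'
}

# Intended use on the host (not executed here):
#   cat /etc/dnsmasq.d/*.conf | list_global_upstreams
```

With the three server= lines from the grep output above, this prints only 8.8.8.8, which matches the observation that a single general upstream was configured.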
Actions #4

Updated by okurz 7 months ago

@nicksinger thank you. Seems we stepped on each other's toes

… us ourselves on eth1 anyway? Maybe I am just getting confused with the entry that is passed on to DHCP clients.

/etc/dnsmasq.d/openqa.conf says server=10.151.53.53 but 10.151.53.53 does not respond to pings.

I tried to add another entry in /etc/dnsmasq.d/openqa.conf with server=1.1.1.1 but that does not seem to help.

As a workaround, I added nameserver 1.1.1.1 to /etc/resolv.conf. That seems to work:

okurz@new-ariel:~> nslookup github.com
Server:     1.1.1.1
Address:    1.1.1.1#53

Non-authoritative answer:
Name:   github.com
Address: 140.82.121.4

I reported https://sd.suse.com/servicedesk/customer/portal/1/SD-132971

Actions #5

Updated by okurz 7 months ago

  • Status changed from In Progress to Blocked
Actions #6

Updated by nicksinger 7 months ago

Right, I was confused why server=10.151.53.53 got commented out just as I wanted to update the ticket :')
From the journal of dnsmasq I see that we hand out the IP of new-ariel as DNS server:

Aug 29 23:05:06 new-ariel dnsmasq-dhcp[24171]: 1804410612 sent size:  4 option:  6 dns-server  10.150.1.11

which is correct as new-ariel is a recursive server for all other hosts in the network. Adding anything else into /etc/resolv.conf shouldn't be necessary as dnsmasq should provide DNS for ariel itself too and would hide away issues for the workers. Upstream resolvers are handled by the corresponding server= entries in the config of dnsmasq. Checking dnsmasq right now with strace -p 16831 (the PID of the process listening on port 53) shows that it is stuck in trying to reach 10.151.53.53:

strace: Process 16831 attached
sendto(5, "\0-\0\1\1\0\0\1\0\0\0\0\0\1\3dns\10msftncsi\3com\0"..., 47, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.151.53.53")}, 16

ss -tupn | grep 10.151.53.53 confirms open connections:

tcp   SYN-SENT   0      1        10.150.2.10:53254    10.151.53.53:53    users:(("dnsmasq",pid=18926,fd=5))
tcp   SYN-SENT   0      1        10.150.2.10:55616    10.151.53.53:53    users:(("dnsmasq",pid=22076,fd=5))
tcp   SYN-SENT   0      1        10.150.2.10:33776    10.151.53.53:53    users:(("dnsmasq",pid=19462,fd=5))
tcp   SYN-SENT   0      1        10.150.2.10:56814    10.151.53.53:53    users:(("dnsmasq",pid=18922,fd=5))
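
To summarize output like the above, the connections stuck in SYN-SENT can be counted per peer. A rough sketch follows; count_syn_sent_by_peer is a hypothetical helper, and it assumes the ss column order shown above (state, Recv-Q, Send-Q, local address, peer address).

```shell
# Count TCP connections stuck in SYN-SENT, grouped by peer address:port.
# Reads `ss -tupn`-style lines on stdin. The peer is taken as the fourth
# field after the SYN-SENT state column (Recv-Q, Send-Q, local, peer).
count_syn_sent_by_peer() {
    awk '{
        for (i = 1; i + 4 <= NF; i++)
            if ($i == "SYN-SENT") { peers[$(i + 4)]++; break }
    }
    END { for (p in peers) print peers[p], p }' | sort -k2
}

# Intended use on the host (not executed here):
#   ss -tupn | count_syn_sent_by_peer
```

A peer that keeps accumulating SYN-SENT entries is most likely not answering on that port at all.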
Actions #7

Updated by nicksinger 7 months ago

I just restarted dnsmasq.service to kill these hanging connections to not cause further problems on our workers. Now 127.0.0.1 responds immediately:

/etc/dnsmasq.d # dig heise.de @127.0.0.1

; <<>> DiG 9.16.43 <<>> heise.de @127.0.0.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54140
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;heise.de.          IN  A

;; ANSWER SECTION:
heise.de.       549 IN  A   193.99.144.80

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 21 17:37:24 UTC 2023
;; MSG SIZE  rcvd: 53

So would you agree that we can remove 1.1.1.1 again from /etc/resolv.conf?

Actions #8

Updated by nicksinger 7 months ago

  • Priority changed from Urgent to Normal

We reduced the priority because /etc/resolv.conf was adjusted to use a public resolver, more public resolvers were added to our dnsmasq.conf (which mitigates the problem for workers in that network), and an infra ticket was created to fix the broken DNS server which started these problems.

Actions #9

Updated by okurz 7 months ago

  • Status changed from Blocked to Resolved

nicksinger wrote in #note-7:

So would you agree that we can remove 1.1.1.1 again from /etc/resolv.conf?

yes, done.

Responded in SD-ticket with:

As we are using dnsmasq we have now the following configuration:

server=10.151.53.53
server=10.151.53.54
# fallback servers
# nsinger: 2023-09-21: added additional servers as fallback if google misbehaves
server=192.168.47.4 #internal resolver which is also recursive
server=1.1.1.1
server=9.9.9.9
server=8.8.8.8
server=/infra.opensuse.org/192.168.47.4
server=/47.168.192.in-addr.arpa/192.168.47.4
no-resolv

All good now
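
To re-check the upstreams from a configuration like the one above, each resolver can be queried directly with a hard timeout, similar to the nslookup loop in note 1. This is a sketch under assumptions: probe_resolvers and the LOOKUP override are made up for illustration, and it relies on nslookup exiting non-zero when a lookup fails, which may differ between versions.

```shell
# Query one name against several resolvers, each with a 4 second timeout,
# and report which ones answered.
# LOOKUP is a hypothetical override for the lookup command (default: nslookup).
probe_resolvers() {
    name=$1
    shift
    for server in "$@"; do
        if timeout 4 "${LOOKUP:-nslookup}" "$name" "$server" >/dev/null 2>&1; then
            echo "$server ok"
        else
            echo "$server FAILED"
        fi
    done
}

# Intended use on the host (not executed here):
#   probe_resolvers github.com 10.151.53.53 10.151.53.54 192.168.47.4 1.1.1.1 9.9.9.9 8.8.8.8
```

A resolver that hangs is reported as FAILED once the timeout expires, so one dead upstream does not block the whole check.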

Actions #10

Updated by jbaier_cz 25 days ago

  • Related to action #156322: zabbix-proxy.dmz-prg2.suse.org not reachable from ariel.suse-dmz.opensuse.org added