tickets #90455

closed

random DNS problems causing various issues

Added by cboltz about 3 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Core services and virtual infrastructure
Target version: -
Start date: 2021-03-27
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

It seems we have some random nameserver problems with *.infra.opensuse.org

There are various symptoms that are probably related:

mx1 and mx2 randomly can't resolve mailman3.i.o.o

On mx1 and mx2, some mails were rejected with:

2021-03-28T17:08:50.273016+00:00 mx2 postfix/smtp[16122]: 9E0513301: to=<offtopic@lists.opensuse.org>, relay=none, delay=2.6, delays=2.6/0/0/0, dsn=5.3.0, status=bounced (unable to look up host mailman3.infra.opensuse.org: No address associated with hostname)

There are 850 successful deliveries to mailman3 vs. 28 failures in today's log. According to pjessen, the DNS issue only started on 26 March, at 17:05 UTC.
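For reference, counts like the ones above can be pulled out of the mail log with a few lines of Python. This is only a rough sketch, not the method actually used for the numbers in this ticket: the function name `summarize_deliveries` is made up here, and it assumes that successful deliveries are logged with `status=sent` and a `relay=<host>[...]` field, while lookup failures contain the "unable to look up host" text seen in the bounce above.

```python
import re
from collections import Counter

# Matches the status field of a Postfix smtp delivery log line.
STATUS_RE = re.compile(r"\bstatus=(\w+)")


def summarize_deliveries(lines, host):
    """Count successful deliveries to `host` and DNS lookup failures for it.

    Hypothetical helper: keys are "sent" plus "dns_<status>", for example
    "dns_bounced" or "dns_deferred".
    """
    counts = Counter()
    for line in lines:
        # Only look at lines written by the Postfix smtp client.
        if "postfix/smtp[" not in line:
            continue
        match = STATUS_RE.search(line)
        if not match:
            continue
        status = match.group(1)
        if status == "sent" and f"relay={host}[" in line:
            counts["sent"] += 1
        elif f"unable to look up host {host}:" in line:
            counts[f"dns_{status}"] += 1
    return counts
```

Feeding it the lines of the day's mail log with `host="mailman3.infra.opensuse.org"` should give one number for `sent` and one for `dns_bounced`, which makes it easy to see whether the failure rate changes after a config change.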

The resolv.conf on mx* lists anna/elsa, so I tried removing FreeIPA from the dnsmasq config there (leaving only chip). That made things much worse, so I have to assume that chip is the one causing the problems. (Needless to say, I reverted the dnsmasq config. Better to get results from the possibly outdated FreeIPA than to get nothing.)

Note: the affected mails were bounced, which means the nameserver said something like "this domain doesn't exist" (not "temporary DNS error", which would have caused a 4xx code).
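To see how often a lookup fails, and whether the failures look permanent or temporary from the resolver's point of view, a small probe like the following might help. It is a sketch under assumptions: `probe` and `classify` are names invented for this example, and treating everything except `EAI_AGAIN` as "permanent" only approximates the 4xx/5xx distinction Postfix makes.

```python
import socket
import time
from collections import Counter


def classify(exc):
    """Map a resolver error to 'temporary' or 'permanent'.

    Assumption: EAI_AGAIN is the only temporary failure; every other
    resolver error (no such name, no address) is treated as permanent.
    """
    return "temporary" if exc.errno == socket.EAI_AGAIN else "permanent"


def probe(hostname, attempts=20, pause=0.5, resolve=socket.getaddrinfo):
    """Resolve `hostname` repeatedly and count the outcomes."""
    outcomes = Counter()
    for i in range(attempts):
        try:
            resolve(hostname, None)
            outcomes["ok"] += 1
        except socket.gaierror as exc:
            outcomes[classify(exc)] += 1
        # Wait between attempts, but not after the last one.
        if pause and i < attempts - 1:
            time.sleep(pause)
    return outcomes
```

Running `probe("mailman3.infra.opensuse.org")` on mx1 or mx2 should show a mix of `ok` and `permanent` if the problem is the one described here. The `resolve` parameter exists only so the function can be tried without touching the network.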

ssh login on chip

Several login attempts on chip.i.o.o (as cboltz) ended up with a Password: prompt instead of letting me in with my SSH key.

Using the salt "backdoor", I tracked the issue down to

fetch_freeipa_ldap_sshpubkey.sh cboltz
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)

It sometimes works (running ldapsearch in debug mode seems to help...) - but even in debug mode, I got the above message once, without further details.

I guess it's also the same DNS issue.

reverse DNS lookups on anna

2021-03-28T17:55:32.502090+00:00 anna postfix/smtpd[25704]: warning: hostname mailman3.infra.opensuse.org does not resolve to address 192.168.47.80: No address associated with hostname

Looks like reverse DNS fails sometimes: 15 times in today's log.
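As far as I understand the warning above, Postfix looks up the PTR record for the client address and then checks that the resulting name resolves back to the same address. The sketch below mimics that check so it can be repeated by hand; it is not Postfix's code, the name `forward_confirmed` is made up, and it handles IPv4 only because it relies on `socket.gethostbyname_ex`.

```python
import socket


def forward_confirmed(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return (name, status) for a forward-confirmed reverse DNS check."""
    try:
        # Step 1: reverse lookup, address -> name (PTR record).
        name = reverse(ip)[0]
    except (socket.herror, socket.gaierror):
        return None, "no PTR record"
    try:
        # Step 2: forward lookup, name -> list of IPv4 addresses.
        addresses = forward(name)[2]
    except socket.gaierror:
        return name, "forward lookup failed"
    status = "confirmed" if ip in addresses else "address mismatch"
    return name, status
```

A result of `("mailman3.infra.opensuse.org", "forward lookup failed")` for 192.168.47.80 would match the log line: the PTR lookup worked, but the name could not be resolved back to an address.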

reports about database connection errors

[20:09:33] <robin_listas> Yes, got hit now. "Welcome to Elgg. / Elgg couldn't connect to the database using the given credentials." (about 2 hours ago)

It magically fixed itself, and without having looked into the details, it might also be a random DNS issue of not finding mysql.i.o.o.

Thinking about it, we had a similar report for survey.o.o in the last few days, which also magically fixed itself.

and more?

I'm quite sure the list above isn't complete - but it clearly shows that the recent DNS changes come with "some" side effects :-(

Please check what's wrong, and fix it ASAP.

(See also the #opensuse-admin IRC log from the last 3 hours for more details.)


Subtasks 1 (0 open, 1 closed)

tickets #90449: survey.o.o DB down (Resolved, 2021-03-27)

Actions #1

Updated by cboltz about 3 years ago

  • Private changed from Yes to No
Actions #2

Updated by cboltz about 3 years ago

Turns out that chip no longer answers DNS queries for *.infra.o.o which explains the problem (FreeIPA is still master for this zone).
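To compare what each nameserver says for the same name without going through dnsmasq, a single UDP query can be sent to one server at a time. The following is a minimal stdlib-only sketch, assuming plain DNS over UDP port 53 and reading only the response header; the function names are made up for this example, and the server addresses have to be filled in by whoever runs it.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (RFC 1035). qtype 1 is an A record."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def parse_header(response):
    """Return (rcode, answer_count) from a DNS response header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return flags & 0x000F, ancount


def ask(server, name, timeout=2.0):
    """Send one A query for `name` to `server` and return (rcode, answers)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        data, _ = sock.recvfrom(512)
    return parse_header(data)
```

Calling `ask(<address of chip>, "mailman3.infra.opensuse.org")` and the same against a FreeIPA server should make the difference visible: rcode 0 with at least one answer means the server knows the name, rcode 3 is NXDOMAIN, rcode 5 is REFUSED, and a timeout raises an exception.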

The obvious solution is to move infra.o.o to chip (so that it becomes master for it), but I'm afraid my pdns knowledge isn't good enough to do this myself.
(If there are other zones left with FreeIPA as master, please also move them to chip so that we have everything in one place again.)

When this is done, we'll have to remove references to FreeIPA from the dnsmasq config on anna/elsa (and any other host that runs dnsmasq).

Actions #3

Updated by cboltz about 3 years ago

As a workaround, I changed the dnsmasq config on anna/elsa to only query FreeIPA (not chip) for infra.o.o, but servers that don't have anna/elsa in resolv.conf might still suffer from the problem.

Actions #4

Updated by lrupp almost 3 years ago

  • Status changed from New to Closed

Old ticket. Setup has been changed. Closing...
