random DNS problems causing various issues
We seem to have intermittent nameserver problems with *.infra.opensuse.org.
There are various symptoms that are probably related:
mx1 and mx2 randomly can't resolve mailman3.i.o.o
On mx1 and mx2, some mails were rejected with
2021-03-28T17:08:50.273016+00:00 mx2 postfix/smtp: 9E0513301: to=<firstname.lastname@example.org>, relay=none, delay=2.6, delays=2.6/0/0/0, dsn=5.3.0, status=bounced (unable to look up host mailman3.infra.opensuse.org: No address associated with hostname)
There are 850 successful deliveries to mailman3 vs. 28 failures in today's log. According to pjessen, the DNS issue only started on 26 March at 17:05 UTC.
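To make the success/failure count repeatable on other days, something like the following rough sketch could tally the postfix/smtp lines. It assumes that failed lookups always carry the "unable to look up host" text seen above and that successful deliveries mention the hostname in the relay= field; the function name and the log path are made up for illustration.

```python
import re
import sys
from collections import Counter

HOST = "mailman3.infra.opensuse.org"
STATUS_RE = re.compile(r"\bstatus=(\w+)")


def tally(lines, host=HOST):
    """Count delivery outcomes for postfix/smtp log lines that mention `host`."""
    counts = Counter()
    for line in lines:
        if "postfix/smtp" not in line or host not in line:
            continue
        match = STATUS_RE.search(line)
        if not match:
            continue  # e.g. smtpd warnings, which carry no status= field
        if "unable to look up host" in line:
            counts["dns_failure"] += 1
        else:
            counts[match.group(1)] += 1  # sent, deferred, bounced, ...
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 tally.py /var/log/mail
    with open(sys.argv[1], errors="replace") as handle:
        for outcome, number in tally(handle).most_common():
            print(f"{number:6d}  {outcome}")
```

Running it against the mail log of mx1 and mx2 for each day since 26 March should show whether the failure rate is stable or getting worse.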
The resolv.conf on mx* points at anna/elsa, so I tried removing FreeIPA from the dnsmasq config there (leaving only chip). That made things much worse, so I have to assume that chip is the one causing the problems. (Needless to say, I reverted the dnsmasq config: better to get results from the possibly outdated FreeIPA than to get nothing.)
Note: the affected mails were bounced, which means the nameserver said something like "this domain doesn't exist" (not "temporary DNS error", which would have caused a 4xx code).
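To get a feeling for how often the system resolver gives a permanent rather than a temporary answer, a small probe along these lines could be run on mx1/mx2. This is only a sketch: it goes through whatever resolv.conf points at, treats EAI_AGAIN as temporary and everything else as permanent, and the attempt count and delay are arbitrary. Local caching may hide some failures.

```python
import socket
import time
from collections import Counter


def classify(exc):
    """Map a socket.gaierror to 'temporary' (retry later) or 'permanent'."""
    if exc.errno == socket.EAI_AGAIN:
        return "temporary"
    return "permanent"


def probe(name, attempts=50, delay=0.2, resolve=socket.getaddrinfo):
    """Resolve `name` repeatedly and count ok / temporary / permanent results."""
    counts = Counter()
    for _ in range(attempts):
        try:
            resolve(name, None)
            counts["ok"] += 1
        except socket.gaierror as exc:
            counts[classify(exc)] += 1
        time.sleep(delay)
    return counts
```

For example, `probe("mailman3.infra.opensuse.org")` returns a Counter; any non-zero "permanent" count would match the bounces seen in the log.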
ssh login on chip
Several login attempts on chip.i.o.o (as cboltz) ended up with a
Password: prompt instead of letting me in with my SSH key.
Using the salt "backdoor", I tracked the issue down to
fetch_freeipa_ldap_sshpubkey.sh cboltz
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
It sometimes works (running ldapsearch in debug mode seems to help...) - but even in debug mode, I got the above message once, without further details.
I guess it's also the same DNS issue.
reverse DNS lookups on anna
2021-03-28T17:55:32.502090+00:00 anna postfix/smtpd: warning: hostname mailman3.infra.opensuse.org does not resolve to address 192.168.47.80: No address associated with hostname
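Looks like the reverse DNS check fails sometimes (the PTR lookup returns the name, but the name then doesn't resolve back to the address) - 15 times in today's log.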
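The check behind that warning can be reproduced by hand with a small sketch like this one: look up the PTR name for the address, then resolve that name again and see whether the original address comes back. It uses the system resolver and is IPv4-only; the function name is made up.

```python
import socket


def forward_confirmed(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return (ptr_name, confirmed) for an IPv4 address."""
    try:
        ptr_name = reverse(ip)[0]
    except (socket.herror, socket.gaierror):
        return None, False  # no PTR answer at all
    try:
        addresses = forward(ptr_name)[2]
    except (socket.herror, socket.gaierror):
        return ptr_name, False  # PTR exists, but the name did not resolve
    return ptr_name, ip in addresses
```

Calling `forward_confirmed("192.168.47.80")` in a loop on anna should occasionally return the name together with False if the forward lookup is the part that fails.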
reports about database connection errors
[20:09:33] <robin_listas> Yes, got hit now. "Welcome to Elgg. / Elgg couldn't connect to the database using the given credentials." (about 2 hours ago)
It magically fixed itself, and without having looked into the details, it might also be a random DNS issue of not finding mysql.i.o.o.
Thinking about it, we had a similar report for survey.o.o in the last days, which also magically fixed itself.
I'm quite sure the list above isn't complete - but it clearly shows that the recent DNS changes come with "some" side effects :-(
Please check what's wrong, and fix it ASAP.
(See also the #opensuse-admin IRC log from the last 3 hours for more details.)
Turns out that chip no longer answers DNS queries for *.infra.o.o, which explains the problem (FreeIPA is still master for this zone).
The obvious solution is to move infra.o.o to chip (so that it becomes master for it), but I'm afraid my pdns knowledge isn't good enough to do this myself.
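To compare what chip and FreeIPA answer for the same name without going through dnsmasq, a single nameserver can be asked directly. The sketch below uses only the Python standard library and sends one plain UDP query for an A record; the server address has to be given on the command line (I'm not putting IPs here), and only the response code and answer count from the reply header are reported.

```python
import socket
import struct
import sys

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    # Flags 0x0100: standard query with the "recursion desired" bit set.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def parse_reply(data):
    """Return (rcode name, answer count) from the 12-byte DNS reply header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", data[:12])
    rcode = flags & 0x000F
    return RCODES.get(rcode, str(rcode)), ancount


def ask(server, name, timeout=2.0):
    """Send one UDP query for `name` to `server` and summarise the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        try:
            data, _ = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT", 0
    return parse_reply(data)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python3 ask.py <nameserver-ip> mailman3.infra.opensuse.org
    print(sys.argv[1], sys.argv[2], *ask(sys.argv[1], sys.argv[2]))
```

A NOERROR reply with zero answers, NXDOMAIN, or REFUSED from chip, next to NOERROR with one answer from FreeIPA, would confirm the picture described above.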
(If there are other zones left with FreeIPA as master, please also move them to chip so that we have everything in one place again.)
When this is done, we'll have to remove references to FreeIPA from the dnsmasq config on anna/elsa (and any other host that runs dnsmasq).