tickets #135779
closed
How's postfix mail queue doing?
Added by luc14n0 about 1 year ago.
Updated 3 months ago.
Category:
Core services and virtual infrastructure
Description
Yesterday (Wed Sep 13, 2023), or the day before, Icinga started showing an alarm (or maybe that was just when I noticed it; the alarm says it started 2d 4h ago) about the Postfix mail queue being higher than its threshold for progress.i.o.o. What exactly is this threshold there to avoid?
progress.i.o.o
CRITICAL: postfix mailq is 85 (threshold c = 50)
Other similar alarms that seem to have been there for a while (or they come and go, I can't say right now):
mailman3.i.o.o
CRITICAL: postfix mailq is 413 (threshold c = 50)
openqa.i.o.o
CRITICAL: postfix mailq is 121 (threshold c = 50)
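For context, a check of this kind can be sketched roughly as follows. This is only a minimal illustration, not the actual Icinga plugin used here: the function names, the warning threshold of 20, the `>=` comparison and the non-critical message formats are all assumptions; only the critical threshold of 50 and the CRITICAL message format come from the alarms above.

```python
import re


def mailq_count(mailq_output: str) -> int:
    """Return the number of queued messages reported by `mailq`.

    Relies on the summary line Postfix prints at the end of the listing,
    e.g. "-- 23 Kbytes in 2 Requests.", or "Mail queue is empty".
    """
    if "Mail queue is empty" in mailq_output:
        return 0
    match = re.search(r"^-- \d+ Kbytes in (\d+) Requests?\.", mailq_output, re.MULTILINE)
    if match is None:
        raise ValueError("unrecognised mailq output")
    return int(match.group(1))


def check_mailq(count: int, warn: int = 20, crit: int = 50) -> str:
    """Map a queue length to a Nagios/Icinga style status line.

    warn=20 and the use of >= are hypothetical; crit=50 matches the alarms.
    """
    if count >= crit:
        return f"CRITICAL: postfix mailq is {count} (threshold c = {crit})"
    if count >= warn:
        return f"WARNING: postfix mailq is {count} (threshold w = {warn})"
    return f"OK: postfix mailq is {count}"
```

Feeding it the output of `mailq` on the host in question would give the queue length that is compared against the threshold.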
- Private changed from Yes to No
luc14n0 wrote:
Yesterday (Wed Sep 13, 2023), or the day before, Icinga started showing an alarm (or maybe that was just when I noticed it; the alarm says it started 2d 4h ago) about the Postfix mail queue being higher than its threshold for progress.i.o.o. What exactly is this threshold there to avoid?
On any "normal" system, there really should never be much of a postfix queue. Having a queue means mails could not be delivered.
- mailman3 - 407 mails queued. All due to the receiving domain not being found. 205 for "spamergency.com" for instance. This is pretty typical for mailman3.
- progress - 86 mails queued. That is highly unusual. It seems to be a queue of inbound mails: "temporary failure. Command output: Failed to contact your Redmine server (502)."
Destination is "redmine-opensuse-admin+admin@localhost.redmine". I'll open a separate ticket.
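A breakdown like the one above (how many queued mails per destination domain) can be produced along these lines. This is a sketch under assumptions: it expects the one-JSON-object-per-line output of `postqueue -j` (Postfix 3.1 or newer), and the helper name `queued_by_domain` is made up for illustration. Note that it counts recipients, not messages, so a mail with two recipients is counted twice.

```python
import json
from collections import Counter


def queued_by_domain(postqueue_json: str) -> Counter:
    """Count queued recipients per destination domain.

    `postqueue_json` is the text printed by `postqueue -j`:
    one JSON object per queued message, one per line.
    """
    counts = Counter()
    for line in postqueue_json.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        for rcpt in entry.get("recipients", []):
            # Everything after the last "@" is the domain part.
            domain = rcpt["address"].rpartition("@")[2].lower()
            counts[domain] += 1
    return counts
```

`queued_by_domain(text).most_common(5)` then lists the domains that account for most of the queue.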
- Related to tickets #135809: progress.i.o.o - temporary failure. Command output: Failed to contact your Redmine server (502). added
- Status changed from New to Feedback
OK. I guess that sums it up. Thanks for the insight.
The openqa.i.o.o alarm is from Jul 26, around the time the openQA migration was ending and they started "warming the engines". So I'd suppose that queue of about 120 mails is a remnant of hiccups from the migration.
luc14n0 wrote in #note-4:
OK. I guess that sums it up. Thanks for the insight.
The openqa.i.o.o alarm is from Jul 26, around the time the openQA migration was ending and they started "warming the engines".
So I'd suppose that queue of about 120 mails is a remnant of hiccups from the migration.
I would log on to check it out, but I don't have access. 120 mails queued now is not normal - by default, undeliverables are discarded after 5 days.
The openQA infrastructure is tracked in a different project. But maybe one of @okurz @nicksinger could check (or alternatively let us know who's a better person to ping about openqa.i.o.o)?
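For whoever does have access: one way to test the "remnant from July" theory is to look for queue entries older than the 5-day lifetime mentioned above (as far as I know this corresponds to the `maximal_queue_lifetime` default). The following is a rough sketch that again assumes the `postqueue -j` output format, where `arrival_time` is in epoch seconds; the function name `stale_entries` is hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def stale_entries(postqueue_json, max_age=timedelta(days=5), now=None):
    """Return (queue_id, age) pairs for messages older than max_age.

    `postqueue_json` is the text printed by `postqueue -j`.
    max_age defaults to the 5 days mentioned in this ticket.
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for line in postqueue_json.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        arrived = datetime.fromtimestamp(entry["arrival_time"], tz=timezone.utc)
        age = now - arrived
        if age > max_age:
            stale.append((entry["queue_id"], age))
    return stale
```

If this returns nothing, the queued mails are recent, so something is still generating them rather than them being leftovers from the migration.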
crameleon wrote in #note-6:
The openQA infrastructure is tracked in a different project. But maybe one of @okurz @nicksinger could check (or alternatively let us know who's a better person to ping about openqa.i.o.o)?
Yes, it is. I'm going to open a ticket/action in their project just to make sure, as a quick search didn't return anything for me.
- Related to action #135848: Icinga alarm about Postfix mail queue since July 26 added
More fun - from mx1.o.o:
2023-10-19T05:11:04.549091+00:00 mx1 postfix/smtpd[23097]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>
2023-10-19T05:26:04.753980+00:00 mx1 postfix/smtpd[24588]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>
2023-10-19T05:31:04.493952+00:00 mx1 postfix/smtpd[24837]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>
So, mirrorcache@opensuse.org on static.o.o sent an email to someone, which bounced and thus produced an NDR. static.o.o is now trying to deliver this NDR to mirrorcache@o.o, which however is an unknown address, so mx1 of course rejects it.
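To pull the relevant fields out of reject lines like the ones above, a regular expression along these lines works for this particular log format. It is a sketch tailored to the three lines quoted here, not a general Postfix log parser, and the names `REJECT_RE` and `parse_reject` are made up. The empty `from=<>` (null sender) is what marks the mail as a bounce/NDR.

```python
import re

# Matches smtpd "NOQUEUE: reject: RCPT" lines in the format quoted above.
REJECT_RE = re.compile(
    r"^(?P<time>\S+) (?P<host>\S+) postfix/smtpd\[\d+\]: NOQUEUE: reject: "
    r"RCPT from (?P<client>\S+?)\[(?P<client_ip>[^\]]+)\]: "
    r"(?P<code>\d{3}) (?P<dsn>\d\.\d+\.\d+) .*?; "
    r"from=<(?P<sender>[^>]*)> to=<(?P<recipient>[^>]*)> "
    r"proto=(?P<proto>\S+) helo=<(?P<helo>[^>]*)>"
)


def parse_reject(line: str):
    """Return a dict of fields for a reject line, or None if it does not match."""
    match = REJECT_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # A null envelope sender means the message is a bounce notification.
    fields["is_bounce"] = fields["sender"] == ""
    return fields
```

Comparing the `client` and `helo` fields of the result also shows that the connecting address resolves to static.opensuse.org while the host introduces itself as anna.opensuse.org.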
Okay, part of the answer is that "2001:67c:2178:8::18" is assigned to anna (for the haproxy setup, I presume). This is plainly wrong; I think anna is missing an "smtp_bind_address6" setting.
Second, more pertinent to the topic of this ticket, anna has 1288 mails queued, of which 962 are from mirrorcache@o.o to mirrorcache@mirrorcache.infra.opensuse.org. The rest is a mixture:
- some are being refused by https://forwardemail.net. I have written to the intended recipient and suggested he fix the problem.
- some are being refused by Google: "To protect our users from spam, mail has been temporarily rate limited."
- other misc. errors.
- I have added a rate limit for Gmail.
- the 962 mails are reports of a failed cron job on mirrorcache.i.o.o, see #138257. The mail is being sent from mirrorcache@mirrorcache.infra.opensuse.org (envelope sender mirrorcache@opensuse.org) to mirrorcache@mirrorcache.infra.opensuse.org, relayed via anna. Of course they can't be delivered 😱
- Status changed from Feedback to Resolved
- Assignee changed from opensuse-admin to crameleon
The issues mentioned earlier in this ticket are no longer relevant. The recent mail queue events have cleared up. New events in the new monitoring are inspected and tracked separately as needed.