tickets #135779

open

How's postfix mail queue doing?

Added by luc14n0 8 months ago. Updated 6 months ago.

Status: Feedback
Priority: Normal
Category: Core services and virtual infrastructure
Target version: -
Start date: 2023-09-14
Due date:
% Done: 0%
Estimated time:
Description

Yesterday (Wed Sep 13, 2023), or the day before, Icinga started showing an alarm about the Postfix mail queue being higher than its threshold for progress.i.o.o (or maybe that is just when I noticed it; the alarm says it started 2d 4h ago). What exactly is this threshold there to avoid?

progress.i.o.o
CRITICAL: postfix mailq is 85 (threshold c = 50)

Other similar alarms that seem to have been there for a while (or they come and go; I can't say right now):

mailman3.i.o.o

CRITICAL: postfix mailq is 413 (threshold c = 50)

openqa.i.o.o

CRITICAL: postfix mailq is 121 (threshold c = 50)


Files

Screenshot from 2023-09-14 16-07-00.png (51.2 KB): Icinga nrpe-mailq search (luc14n0, 2023-09-14 19:08)

Related issues 2 (0 open, 2 closed)

Related to openSUSE admin - tickets #135809: progress.i.o.o - temporary failure. Command output: Failed to contact your Redmine server (502). (Resolved, crameleon, 2023-09-15)

Related to openQA Infrastructure - action #135848: Icinga alarm about Postfix mail queue since July 26 (Resolved, okurz, 2023-09-16)

Actions #1

Updated by luc14n0 8 months ago

  • Private changed from Yes to No
Actions #2

Updated by pjessen 8 months ago

luc14n0 wrote:

Yesterday (Wed Sep 13, 2023), or the day before, Icinga started showing an alarm about the Postfix mail queue being higher than its threshold for progress.i.o.o (or maybe that is just when I noticed it; the alarm says it started 2d 4h ago). What exactly is this threshold there to avoid?

On any "normal" system, there really should never be much of a postfix queue. Having a queue means mails could not be delivered.

  • mailman3 - 407 mails queued. All due to the receiving domain not being found. 205 for "spamergency.com" for instance. This is pretty typical for mailman3.
  • progress - 86 mails queued. That is highly unusual. It seems to be a queue of inbound mails: "temporary failure. Command output: Failed to contact your Redmine server (502).)" Destination is "redmine-opensuse-admin+admin@localhost.redmine". I'll open a separate ticket.
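A per-domain breakdown like the one above can be derived from the queue listing. The following is a minimal sketch, assuming Postfix 3.1 or later, where `postqueue -j` prints one JSON object per queued message with a `recipients` list. The function name `count_by_recipient_domain`, the helper `make_record`, and the sample records are hypothetical and not taken from the hosts discussed here.

```python
import json
from collections import Counter


def count_by_recipient_domain(postqueue_json_lines):
    """Count queued recipients per destination domain.

    Expects an iterable of lines shaped like `postqueue -j` output:
    one JSON object per queued message, each with a "recipients" list.
    """
    counts = Counter()
    for line in postqueue_json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        message = json.loads(line)
        for recipient in message.get("recipients", []):
            address = recipient.get("address", "")
            if "@" in address:
                # Everything after the last "@" is the domain.
                domain = address.rpartition("@")[2].lower()
            else:
                # Bare local names are grouped under a placeholder.
                domain = "(local)"
            counts[domain] += 1
    return counts


def make_record(queue_id, address):
    """Build one hypothetical queue record (illustration only)."""
    return json.dumps({
        "queue_name": "deferred",
        "queue_id": queue_id,
        "sender": "list-bounces@example.org",
        "recipients": [
            {"address": address,
             "delay_reason": "Host or domain name not found"},
        ],
    })


# Made-up sample data; on a live host the lines would come from `postqueue -j`.
SAMPLE_QUEUE = [
    make_record("AAA111", "alice@gone.example"),
    make_record("BBB222", "bob@gone.example"),
    make_record("CCC333", "carol@Other.example"),
]

domain_counts = count_by_recipient_domain(SAMPLE_QUEUE)
for domain, count in domain_counts.most_common():
    print(f"{count:5d}  {domain}")
```

Sorting by count puts the domains responsible for most of the queue at the top, which is usually enough to tell a few dead list subscribers apart from a delivery problem on the host itself.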
Actions #3

Updated by pjessen 8 months ago

  • Related to tickets #135809: progress.i.o.o - temporary failure. Command output: Failed to contact your Redmine server (502). added
Actions #4

Updated by luc14n0 7 months ago

  • Status changed from New to Feedback

OK. I guess that sums it up. Thanks for the insight.

The openqa.i.o.o alarm is from Jul 26. That is around the time the migration of openQA was ending and they started "warming the engines". So I'd suppose that the queue of 120 mails is a remnant of hiccups from the migration.

Actions #5

Updated by pjessen 7 months ago

luc14n0 wrote in #note-4:

OK. I guess that sums it up. Thanks for the insight.

The openqa.i.o.o alarm is from Jul 26. That is around the time the migration of openQA was ending and they started "warming the engines".
So I'd suppose that the queue of 120 mails is a remnant of hiccups from the migration.

I would log on to check it out, but I don't have access. 120 mails queued now is not normal - by default, undeliverables are discarded after 5 days.
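The 5-day figure matches the default of Postfix's `maximal_queue_lifetime` parameter (`5d`). As a rough sketch of how someone with access might check whether queued mails have outlived that limit, the following assumes records in the `postqueue -j` format (Postfix 3.1 or later) with an `arrival_time` field in Unix seconds. The function name, the helper, and the sample data are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

# Postfix default for maximal_queue_lifetime; a host may override it.
DEFAULT_QUEUE_LIFETIME = timedelta(days=5)


def overdue_queue_ids(postqueue_json_lines, now, lifetime=DEFAULT_QUEUE_LIFETIME):
    """Return queue IDs of messages older than the given lifetime."""
    overdue = []
    for line in postqueue_json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        message = json.loads(line)
        arrived = datetime.fromtimestamp(message["arrival_time"], tz=timezone.utc)
        if now - arrived > lifetime:
            overdue.append(message["queue_id"])
    return overdue


# Fixed reference time so the example is reproducible.
NOW = datetime(2023, 9, 20, 12, 0, tzinfo=timezone.utc)


def make_record(queue_id, age):
    """Build one hypothetical queue record of the given age (illustration only)."""
    arrival = int((NOW - age).timestamp())
    return json.dumps({"queue_id": queue_id, "arrival_time": arrival})


SAMPLE_QUEUE = [
    make_record("FRESH1", timedelta(hours=3)),
    make_record("OLD001", timedelta(days=7)),
]

overdue = overdue_queue_ids(SAMPLE_QUEUE, NOW)
print(overdue)
```

If nothing in the queue is older than the limit, the roughly 120 queued mails would be recent arrivals rather than leftovers from the migration.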

Actions #6

Updated by crameleon 7 months ago

The openQA infrastructure is tracked in a different project. But maybe one of @okurz @nicksinger could check (or alternatively let us know who's a better person to ping about openqa.i.o.o)?

Actions #7

Updated by luc14n0 7 months ago

crameleon wrote in #note-6:

The openQA infrastructure is tracked in a different project. But maybe one of @okurz @nicksinger could check (or alternatively let us know who's a better person to ping about openqa.i.o.o)?

Yes, it is. I'm going to open a ticket/action in their project just to make sure, as a quick search didn't return anything for me.

Actions #8

Updated by okurz 7 months ago

  • Related to action #135848: Icinga alarm about Postfix mail queue since July 26 added
Actions #9

Updated by pjessen 6 months ago

More fun - from mx1.o.o:

2023-10-19T05:11:04.549091+00:00 mx1 postfix/smtpd[23097]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>
2023-10-19T05:26:04.753980+00:00 mx1 postfix/smtpd[24588]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>
2023-10-19T05:31:04.493952+00:00 mx1 postfix/smtpd[24837]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>

So, mirrorcache@opensuse.org on static.o.o sent an email to someone which bounced and thus produced an NDR (non-delivery report). static.o.o is now trying to deliver this NDR to mirrorcache@o.o, which is an unknown address, so mx1 of course rejects it.
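The reading above rests on a few fields in each log line: the empty envelope sender (`from=<>`) marks the message as a bounce, `helo=` names the host that actually connected, and `to=` is the rejected recipient. A minimal sketch of pulling those fields out follows; the regular expression is an assumption fitted to the three lines quoted above, not a general Postfix log parser, and the names `REJECT_RE` and `parse_reject` are made up for illustration.

```python
import re
from collections import Counter

# Fitted to the smtpd "NOQUEUE: reject: RCPT" lines quoted in this ticket.
REJECT_RE = re.compile(
    r"^(?P<timestamp>\S+) (?P<host>\S+) postfix/smtpd\[\d+\]: NOQUEUE: reject: "
    r"RCPT from (?P<client>\S+)\[(?P<client_ip>[^\]]+)\]: "
    r"(?P<code>\d{3}) (?P<dsn>\d\.\d+\.\d+) <(?P<rejected>[^>]*)>: (?P<reason>.*?); "
    r"from=<(?P<sender>[^>]*)> to=<(?P<recipient>[^>]*)> "
    r"proto=(?P<proto>\S+) helo=<(?P<helo>[^>]*)>$"
)


def parse_reject(line):
    """Return the fields of one reject line as a dict, or None if it does not match."""
    match = REJECT_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # An empty envelope sender is the null sender used for bounces/NDRs.
    fields["is_bounce"] = fields["sender"] == ""
    return fields


LOG_LINES = [
    "2023-10-19T05:11:04.549091+00:00 mx1 postfix/smtpd[23097]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>",
    "2023-10-19T05:26:04.753980+00:00 mx1 postfix/smtpd[24588]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>",
    "2023-10-19T05:31:04.493952+00:00 mx1 postfix/smtpd[24837]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>",
]

parsed = [fields for fields in map(parse_reject, LOG_LINES) if fields is not None]

# Tally rejected bounces by (connecting host per HELO, intended recipient).
bounce_tally = Counter(
    (fields["helo"], fields["recipient"]) for fields in parsed if fields["is_bounce"]
)
for (helo, recipient), count in bounce_tally.items():
    print(f"{count} rejected bounce(s) from {helo} to {recipient}")
```

Note that the reverse DNS name of the client (`static.opensuse.org`) and the HELO name (`anna.opensuse.org`) differ in these lines, which is why grouping by HELO is useful here.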

Actions #10

Updated by pjessen 6 months ago

Okay, part of the answer is that "2001:67c:2178:8::18" is assigned to anna (the haproxy setup, I presume). This is plainly wrong; I think anna is missing an "smtp_bind_address6" config.
Second, and more pertinent to the topic of this ticket, anna has 1288 mails queued, of which 962 are from mirrorcache@o.o to mirrorcache@mirrorcache.infra.opensuse.org. The rest is a mixture:

  • some are being refused by https://forwardemail.net. I have written to the intended recipient and suggested he fix the problem.
  • some are being refused by Google: "To protect our users from spam, mail has been temporarily rate limited."
  • other misc. errors.
  1. I have added a rate limit for Gmail.
  2. The 962 mails are reports of a failed cron job on mirrorcache.i.o.o, see #138257. The mail is being sent from mirrorcache@mirrorcache.infra.opensuse.org (envelope sender mirrorcache@opensuse.org) to mirrorcache@mirrorcache.infra.opensuse.org, relayed via anna. Of course they can't be delivered 😱