tickets #135779

closed

How's postfix mail queue doing?

Added by luc14n0 about 1 year ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Core services and virtual infrastructure
Target version: -
Start date: 2023-09-14
Due date:
% Done: 0%
Estimated time:

Description

Yesterday (Wed Sep 13, 2023), or the day before, Icinga started showing an alarm about the Postfix mail queue on progress.i.o.o being above its threshold (or maybe that was just when I noticed it; the alarm says it started 2d 4h ago). What exactly is this threshold there to guard against?

progress.i.o.o
CRITICAL: postfix mailq is 85 (threshold c = 50)

Other similar alarms that seem to have been there for a while (or they come and go, I can't say right now):

mailman3.i.o.o

CRITICAL: postfix mailq is 413 (threshold c = 50)

openqa.i.o.o

CRITICAL: postfix mailq is 121 (threshold c = 50)
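
For readers unfamiliar with this kind of check, the alarm text above can be reproduced with a minimal sketch like the following. It is not the actual plugin used on these hosts. The function name `check_mailq`, the warning threshold, and the strict "greater than" comparison are assumptions; only the critical threshold of 50 and the CRITICAL message format come from the alarms quoted here. The exit codes follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
from typing import Tuple

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_mailq(queued: int, warn: int, crit: int) -> Tuple[int, str]:
    """Map a queue length to an (exit_code, message) pair.

    `warn` is a hypothetical warning threshold; the ticket only shows
    the critical one (c = 50). Using ">" rather than ">=" is a guess.
    """
    if queued > crit:
        return CRITICAL, f"CRITICAL: postfix mailq is {queued} (threshold c = {crit})"
    if queued > warn:
        return WARNING, f"WARNING: postfix mailq is {queued} (threshold w = {warn})"
    return OK, f"OK: postfix mailq is {queued}"


# Example with the value reported for progress.i.o.o:
code, message = check_mailq(85, warn=20, crit=50)
print(code, message)
```

A real check would obtain `queued` from the mail system itself and exit with `code` so the monitoring server can pick up the state.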


Files

Screenshot from 2023-09-14 16-07-00.png (51.2 KB): Icinga nrpe-mailq search (luc14n0, 2023-09-14 19:08)

Related issues: 2 (0 open, 2 closed)

Related to openSUSE admin - tickets #135809: progress.i.o.o - temporary failure. Command output: Failed to contact your Redmine server (502). (Resolved, crameleon, 2023-09-15)

Related to openQA Infrastructure - action #135848: Icinga alarm about Postfix mail queue since July 26 (Resolved, okurz, 2023-09-16)

Actions #1

Updated by luc14n0 about 1 year ago

  • Private changed from Yes to No
Actions #2

Updated by pjessen about 1 year ago

luc14n0 wrote:

Yesterday (Wed Sep 13, 2023), or the day before, Icinga started showing an alarm about the Postfix mail queue on progress.i.o.o being above its threshold (or maybe that was just when I noticed it; the alarm says it started 2d 4h ago). What exactly is this threshold there to guard against?

On any "normal" system, there really should never be much of a postfix queue. Having a queue means mails could not be delivered.

  • mailman3 - 407 mails queued. All due to the receiving domain not being found. 205 for "spamergency.com" for instance. This is pretty typical for mailman3.
  • progress - 86 mails queued. That is highly unusual. It seems to be a queue of inbound mails: "temporary failure. Command output: Failed to contact your Redmine server (502).)" Destination is "redmine-opensuse-admin+admin@localhost.redmine". I'll open a separate ticket.
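
A breakdown like the one above (how many queued mails per receiving domain, and why they are stuck) can be produced from the JSON output of `postqueue -j`, available in Postfix 3.1 and later. The following is a minimal sketch, not the method actually used in this ticket. The function name `tally_queue` is made up for illustration, and the code assumes the one-JSON-object-per-line format with a `recipients` list carrying `address` and `delay_reason` fields.

```python
import json
from collections import Counter
from typing import Iterable


def tally_queue(lines: Iterable[str]) -> Counter:
    """Count queued recipients per (recipient domain, delay reason)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        entry = json.loads(line)
        for rcpt in entry.get("recipients", []):
            # Everything after the last "@"; an address without "@"
            # is counted under its full text.
            domain = rcpt.get("address", "").rpartition("@")[2].lower()
            reason = rcpt.get("delay_reason", "")
            counts[(domain, reason)] += 1
    return counts


# Possible use, e.g. saved as a hypothetical tally.py and fed from the shell:
#   postqueue -j | python3 tally.py
# with a small driver such as:
#   import sys
#   for (domain, reason), n in tally_queue(sys.stdin).most_common():
#       print(n, domain, reason)
```

Sorting by count puts dominant causes, such as a single unresolvable domain, at the top.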
Actions #3

Updated by pjessen about 1 year ago

  • Related to tickets #135809: progress.i.o.o - temporary failure. Command output: Failed to contact your Redmine server (502). added
Actions #4

Updated by luc14n0 about 1 year ago

  • Status changed from New to Feedback

OK. I guess that sums it up. Thanks for the insight.

The openqa.i.o.o alarm is from Jul 26, around the time the openQA migration was ending and they started "warming the engines". So I'd suppose that queue of about 120 mails is a remnant of hiccups from the migration.

Actions #5

Updated by pjessen about 1 year ago

luc14n0 wrote in #note-4:

OK. I guess that sums it up. Thanks for the insight.

The openqa.i.o.o alarm is from Jul 26, around the time the openQA migration was ending and they started "warming the engines".
So I'd suppose that queue of about 120 mails is a remnant of hiccups from the migration.

I would log on to check it out, but I don't have access. 120 mails queued now is not normal - by default, undeliverables are discarded after 5 days.
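
The 5-day figure corresponds to the Postfix default for `maximal_queue_lifetime` (5d). Someone with access to the host could verify the reasoning above with a sketch like the one below, which lists queue entries older than that lifetime. The function name `overdue_ids` is hypothetical, and the code assumes `postqueue -j` output in which `arrival_time` is given in seconds since the UNIX epoch.

```python
import json
import time
from typing import Iterable, List, Optional

FIVE_DAYS = 5 * 24 * 3600  # Postfix default maximal_queue_lifetime, in seconds


def overdue_ids(
    lines: Iterable[str],
    max_age: int = FIVE_DAYS,
    now: Optional[float] = None,
) -> List[str]:
    """Return queue IDs of messages that have been queued longer than max_age."""
    now = time.time() if now is None else now
    overdue = []
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        if now - entry["arrival_time"] > max_age:
            overdue.append(entry["queue_id"])
    return overdue
```

If this returns nothing while the queue stays large, the queued mails are recent, which would point to an ongoing delivery problem rather than leftovers from an old incident.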

Actions #6

Updated by crameleon about 1 year ago

The openQA infrastructure is tracked in a different project. But maybe one of @okurz @nicksinger could check (or alternatively let us know who's a better person to ping about openqa.i.o.o)?

Actions #7

Updated by luc14n0 about 1 year ago

crameleon wrote in #note-6:

The openQA infrastructure is tracked in a different project. But maybe one of @okurz @nicksinger could check (or alternatively let us know who's a better person to ping about openqa.i.o.o)?

Yes, it is. I'm going to open a ticket/action in their project just to make sure, as a quick search didn't return anything for me.

Actions #8

Updated by okurz about 1 year ago

  • Related to action #135848: Icinga alarm about Postfix mail queue since July 26 added
Actions #9

Updated by pjessen about 1 year ago

More fun - from mx1.o.o:

2023-10-19T05:11:04.549091+00:00 mx1 postfix/smtpd[23097]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>
2023-10-19T05:26:04.753980+00:00 mx1 postfix/smtpd[24588]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>
2023-10-19T05:31:04.493952+00:00 mx1 postfix/smtpd[24837]: NOQUEUE: reject: RCPT from static.opensuse.org[2001:67c:2178:8::18]: 550 5.1.1 <mirrorcache@opensuse.org>: Recipient address rejected: User unknown in virtual alias table; from=<> to=<mirrorcache@opensuse.org> proto=ESMTP helo=<anna.opensuse.org>

So, mirrorcache@opensuse.org on static.o.o sent an email to someone, which bounced and thus produced an NDR (non-delivery report). static.o.o is now trying to deliver this NDR to mirrorcache@o.o, which is an unknown address, so mx1 of course rejects it.
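
Repeated rejects like these are easier to spot when grouped. The following is a minimal sketch that parses `NOQUEUE: reject` lines in the format shown above and counts them per HELO name and recipient, flagging bounces by their empty envelope sender (`from=<>`). The regular expression is written against the three log lines in this comment only, and the names `REJECT_RE` and `summarize_rejects` are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the smtpd reject lines quoted above; other Postfix log
# variants may need a different pattern.
REJECT_RE = re.compile(
    r"NOQUEUE: reject: RCPT from (?P<client>[^\[\s]+)\[(?P<ip>[^\]]+)\]: "
    r"(?P<code>\d{3}) (?P<dsn>\d\.\d+\.\d+) .*?"
    r"from=<(?P<sender>[^>]*)> to=<(?P<rcpt>[^>]*)> "
    r"proto=(?P<proto>\S+) helo=<(?P<helo>[^>]*)>"
)


def summarize_rejects(lines: Iterable[str]) -> Counter:
    """Count rejects per (helo, recipient, is_bounce)."""
    counts: Counter = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match is None:
            continue  # not a reject line in the expected format
        is_bounce = match.group("sender") == ""  # null sender marks an NDR
        counts[(match.group("helo"), match.group("rcpt"), is_bounce)] += 1
    return counts
```

Grouping by HELO name rather than by the reverse-DNS client name matters here, since the connection shows up as static.opensuse.org while the host introduces itself as anna.opensuse.org.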

Actions #10

Updated by pjessen about 1 year ago

Okay, part of the answer is that "2001:67c:2178:8::18" is assigned to anna - haproxy setup I presume. This is plainly wrong; I think anna is missing an "smtp_bind_address6" config.
Second, more pertinent to the topic of this ticket, anna has 1288 mails queued, of which 962 are from mirrorcache@o.o to mirrorcache@mirrorcache.infra.opensuse.org. The rest is a mixture:

  • some are being refused by https://forwardemail.net. I have written to the intended recipient and suggested he fix the problem.
  • some are being refused by Google: "To protect our users from spam, mail has been temporarily rate limited."
  • other misc. errors.
  1. I have added a ratelimit for Gmail.
  2. The 962 mails are reports of a failed cron job on mirrorcache.i.o.o, see #138257. The mail is being sent from mirrorcache@mirrorcache.infra.opensuse.org (envelope sender mirrorcache@opensuse.org) to mirrorcache@mirrorcache.infra.opensuse.org, relayed via anna. Of course they can't be delivered 😱
Actions #11

Updated by crameleon 3 months ago

  • Status changed from Feedback to Resolved
  • Assignee changed from opensuse-admin to crameleon

The issues mentioned earlier in this ticket are no longer relevant. Recent mail queue events cleared up. New events in new monitoring are inspected and tracked separately as needed.
