tickets #121993

closed

unable to process held_messages (reject, discard) in mailman

Added by lkocman over 1 year ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Mailing lists
Target version: -
Start date: 2022-12-14
Due date: 2022-12-22
% Done: 100%
Estimated time:
Description

Hello team

This has been the situation for the past two weeks or so.
I'm the person handling any potential source-dvd requests for openSUSE Lea
https://en.opensuse.org/Source_code

For the past two or three weeks I have not been able to reject or discard the numerous spam messages that we're getting on a daily basis.
That increases the chance that I could eventually miss a valid request.
https://lists.opensuse.org/manage/lists/sourcedvd.lists.opensuse.org/held_messages

I've tried both discard and reject, on a single message or on many. Any action leads to a wait of about 30 seconds and ends with a 502 Bad Gateway error.

Could you please look into this?


Files

502.png (17.8 KB) lkocman, 2022-12-14 12:08

Related issues 2 (1 open, 1 closed)

Related to openSUSE admin - tickets #129463: mailman3 - the admin-auto list has 10000 held messages (Resolved, pjessen, 2023-05-17, 2023-05-23)

Is duplicate of openSUSE admin - tickets #116084: lists.opensuse.org / mailing list web archive - timeouts, sluggishness, nonresponsive, nginx timeout etc (New, 2022-08-31)

Actions #1

Updated by pjessen over 1 year ago

  • Is duplicate of tickets #116084: lists.opensuse.org / mailing list web archive - timeouts, sluggishness, nonresponsive, nginx timeout etc added
Actions #2

Updated by pjessen over 1 year ago

  • Private changed from Yes to No

This issue is well known.

Actions #3

Updated by pjessen over 1 year ago

Looking at /var/log/nginx/error.logs, I see e.g.

/var/log/nginx/error.log-20221014.xz:2022/10/13 13:43:35 [error] 2062#2062: *624197 client intended to send too large body: 5071396 bytes, client: 127.0.0.1, server: lists.opensuse.org, request: "POST /archives/api/mailman/archive HTTP/1.1", host: "localhost"

This started on 14 October and has been going on ever since. Was some limit reset? As far as I can tell, we have "client_max_body_size 400M;", but that obviously does not work, somehow.

Actions #4

Updated by pjessen over 1 year ago

I also see messages like this (when trying to discard a message):

2022/12/14 13:05:36 [error] 8935#8935: *5233 upstream prematurely closed connection while reading response header from upstream, client: 2a03:7520:4c68:1:ff99:ffff:0:98fc, server: lists.opensuse.org, request: "POST /manage/lists/sourcedvd.lists.opensuse.org/held_messages HTTP/1.1", upstream: "http://127.0.0.1:8000/manage/lists/sourcedvd.lists.opensuse.org/held_messages", host: "lists.opensuse.org", referrer: "https://lists.opensuse.org/manage/lists/sourcedvd.lists.opensuse.org/held_messages"

Upstream being http://127.0.0.1:8000 - that is gunicorn.

In /var/log/postorius/gunicorn.log, I see numerous "[CRITICAL] WORKER TIMEOUT".
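
To get a feel for how often the workers time out, the log can be tallied per day. This is only a rough sketch: the function name is made up for illustration, and the regular expression assumes the usual gunicorn error-log layout (a bracketed timestamp, a bracketed PID, then the level), which may need adjusting if the log format was customised.

```python
import re
from collections import Counter

# Assumed line layout (gunicorn default error log), e.g.
#   [<date> <time> <tz>] [<pid>] [CRITICAL] WORKER TIMEOUT (pid:<n>)
TIMEOUT_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2}) [^\]]*\] \[\d+\] \[CRITICAL\] WORKER TIMEOUT"
)


def count_worker_timeouts(lines):
    """Return {date: count} of WORKER TIMEOUT entries in the given log lines."""
    per_day = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            per_day[match.group(1)] += 1
    return dict(per_day)


def count_in_file(path="/var/log/postorius/gunicorn.log"):
    """Convenience wrapper for the log path mentioned above."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_worker_timeouts(handle)
```

Calling count_in_file() before and after a configuration change gives a quick comparison of whether the timeouts have actually stopped.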

Actions #5

Updated by pjessen over 1 year ago

pjessen wrote:

This started on 14 October and has been going on ever since. Was some limit reset? As far as I can tell, we have "client_max_body_size 400M;", but that obviously does not work, somehow.

I have reduced to "client_max_body_size 10M;" and this seems to work ??

Actions #6

Updated by pjessen over 1 year ago

I have changed the gunicorn timeout to 0, via the command line, with mailman-web.service.d/timeout.conf. Instead of a 502, I'm now getting a 504, which suggests it is nginx timing out.

Actions #7

Updated by pjessen over 1 year ago

By default nginx has a 60 second timeout - I have added this to the http{} section in /etc/nginx/nginx.conf:

proxy_send_timeout          600;
proxy_read_timeout          600;
send_timeout                600;

This does the trick - why discarding a message takes about 90secs, I have no idea.

Actions #8

Updated by pjessen over 1 year ago

  • Due date set to 2022-12-22
  • Status changed from New to In Progress
  • Assignee set to pjessen
  • Priority changed from High to Normal
  • % Done changed from 0 to 30

Three changes:

  • nginx - /etc/nginx/vhosts.d/lists.opensuse.org.conf - client_max_body_size 10M; I don't see any reason why this should work any better than the previous client_max_body_size 400M;
  • gunicorn - mailman-web.service.d/timeout.conf - --timeout=0
  • nginx - /etc/nginx/nginx.conf - the three timeouts as above.

I'll leave this for now and review next week.

Actions #9

Updated by pjessen over 1 year ago

pjessen wrote:

I'll leave this for now and review next week.

So far it is looking good. The "worker timeout" messages have gone from the gunicorn log, but there are still some "upstream prematurely closed connection" messages in the nginx errorlog, interestingly all on exports of archives in mailbox format.

Actions #11

Updated by pjessen over 1 year ago

luc14n0 wrote:

Use MySQL for redirection based on table look-ups, instead of Nginx.

I don't see nginx and the huge redirection maps as being the culprit here, but it ought to be easy to prove/disprove.
Just a little earlier, I wanted to remove some non-members addresses from users.lists:

  • After hitting "delete", it took 2min16sec for the Confirm page to appear
  • after hitting "Confirm", it took 2m12sec to return to the list page.
Actions #12

Updated by pjessen over 1 year ago

pjessen wrote:

luc14n0 wrote:

Use MySQL for redirection based on table look-ups, instead of Nginx.

I don't see nginx and the huge redirection maps as being the culprit here, but it ought to be easy to prove/disprove.

I removed the include for the mail redirects, almost 3 million lines, did an "nginx reload" and tried to remove a non-member again:

  • After hitting "delete", it took 2min8sec for the Confirm page to appear.
  • after hitting "Confirm", it took 2min7sec to return to the list page.
Actions #13

Updated by pjessen 12 months ago

luc14n0 wrote:

Use MySQL for redirection based on table look-ups, instead of Nginx.

FWIW, I wrote a small proxy daemon for that, see #101842. It reduced the memory footprint, but did nothing for this issue. Discarding a held message currently takes 3 minutes and 15 seconds.

Actions #14

Updated by pjessen 11 months ago

  • Related to tickets #129463: mailman3 - the admin-auto list has 10000 held messages added
Actions #15

Updated by pjessen 11 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 30 to 100

It looks like all that was needed was a clean-up - with too many held messages, processing time increased to three minutes and more.
It is clearly a poor design when such a relatively small amount of data can affect the processing in this way. Even with 1000 held messages, it was very noticeable - once I had the total down to double-digits, the UI was responding in single-digit seconds.
I know, I know - I ought to open a bug with the mailman3 project .....
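
For anyone facing the same backlog, a clean-up like this could in principle be scripted against the Mailman core REST API rather than clicked through the slow web UI. The sketch below is not what was done in this ticket; it assumes the mailmanclient Python package is installed, and the REST URL and credentials shown are the common defaults, not values taken from this server. The helper names discard_held and clean_list are made up for illustration.

```python
def discard_held(mlist, keep_ids=frozenset(), limit=None):
    """Discard held messages on `mlist`, skipping request IDs in `keep_ids`.

    `mlist` is expected to behave like mailmanclient's MailingList: a `held`
    attribute yielding objects that have `request_id` and `discard()`.
    Returns the number of messages discarded.
    """
    discarded = 0
    for held in list(mlist.held):
        if limit is not None and discarded >= limit:
            break
        if held.request_id in keep_ids:
            continue  # e.g. a genuine request that still needs a human look
        held.discard()
        discarded += 1
    return discarded


def clean_list(fqdn_listname, keep_ids=frozenset()):
    """Connect to Mailman core and discard the held messages of one list."""
    # Imported lazily so discard_held() can be used without mailmanclient.
    from mailmanclient import Client

    # Assumed endpoint and credentials; take the real ones from mailman.cfg.
    client = Client("http://localhost:8001/3.1", "restadmin", "restpass")
    return discard_held(client.get_list(fqdn_listname), keep_ids)
```

Since a list like sourcedvd can contain valid requests among the spam, it is safer to review the queue first and pass the request IDs worth keeping in keep_ids, or to work in small batches with limit.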
