tickets #121993

unable to process held_messages (reject, discard) in mailman

Added by lkocman 3 months ago. Updated 2 months ago.

Status: In Progress
Priority: Normal
Assignee:
Category: Mailing lists
Target version: -
Start date: 2022-12-14
Due date: 2022-12-22
% Done: 30%
Estimated time:

Description

Hello team

This has been the situation for the past two weeks or so.
I'm the person handling any potential source-dvd requests for openSUSE Lea
https://en.opensuse.org/Source_code

For the past two or three weeks I have not been able to reject or discard the numerous spam messages that we get on a daily basis.
That increases the chance that I could eventually miss a valid request.
https://lists.opensuse.org/manage/lists/sourcedvd.lists.opensuse.org/held_messages

I've tried both discard and reject, on single and multiple messages; however, any action leads to a wait of about 30 seconds and ends with a 502 Bad Gateway error.

Could you please look into this?

502.png (17.8 KB) lkocman, 2022-12-14 12:08

Related issues

Is duplicate of openSUSE admin - tickets #116084: lists.opensuse.org / mailing list web archive - timeouts, sluggishness, nonresponsive, nginx timeout etc (New, 2022-08-31)

History

#1 Updated by pjessen 3 months ago

  • Is duplicate of tickets #116084: lists.opensuse.org / mailing list web archive - timeouts, sluggishness, nonresponsive, nginx timeout etc added

#2 Updated by pjessen 3 months ago

  • Private changed from Yes to No

This issue is well known.

#3 Updated by pjessen 3 months ago

Looking at the nginx error logs in /var/log/nginx/, I see e.g.

/var/log/nginx/error.log-20221014.xz:2022/10/13 13:43:35 [error] 2062#2062: *624197 client intended to send too large body: 5071396 bytes, client: 127.0.0.1, server: lists.opensuse.org, request: "POST /archives/api/mailman/archive HTTP/1.1", host: "localhost"

This started on 14 October and has been going on ever since. Was some limit reset? As far as I can tell, we have "client_max_body_size 400M;", but that obviously does not work, somehow.
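One way to sanity-check the limit question is to pull the rejected body size out of the log line and compare it with a few candidate limits. The following is only a minimal Python sketch: the function names are made up for illustration, the log line is the one quoted above (shortened), and the 1m value is nginx's documented default for client_max_body_size, included here as an assumption about what might be in effect if the configured directive is not being applied.

```python
import re

# Matches the size nginx reports when it rejects a request body.
TOO_LARGE = re.compile(r"client intended to send too large body: (\d+) bytes")

def rejected_body_size(log_line):
    """Return the rejected body size in bytes, or None if the line does not match."""
    match = TOO_LARGE.search(log_line)
    return int(match.group(1)) if match else None

def parse_nginx_size(value):
    """Convert an nginx size such as '400M' or '10m' to bytes (k and m suffixes only)."""
    units = {"k": 1024, "m": 1024 ** 2}
    suffix = value[-1].lower()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)

# Log line from the ticket, shortened after the server name.
line = ('2022/10/13 13:43:35 [error] 2062#2062: *624197 client intended to send '
        'too large body: 5071396 bytes, client: 127.0.0.1, server: lists.opensuse.org')

size = rejected_body_size(line)
for limit in ("1m", "10M", "400M"):  # 1m is the nginx default, an assumption here
    verdict = "rejected" if size > parse_nginx_size(limit) else "accepted"
    print(f"{limit}: {verdict}")
```

A body of 5071396 bytes fits under both 10M and 400M, so the rejection would be consistent with a smaller limit applying to that particular request. That is a guess, not something the logs confirm.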

#4 Updated by pjessen 3 months ago

I also see messages like this (when trying to discard a message):

2022/12/14 13:05:36 [error] 8935#8935: *5233 upstream prematurely closed connection while reading response header from upstream, client: 2a03:7520:4c68:1:ff99:ffff:0:98fc, server: lists.opensuse.org, request: "POST /manage/lists/sourcedvd.lists.opensuse.org/held_messages HTTP/1.1", upstream: "http://127.0.0.1:8000/manage/lists/sourcedvd.lists.opensuse.org/held_messages", host: "lists.opensuse.org", referrer: "https://lists.opensuse.org/manage/lists/sourcedvd.lists.opensuse.org/held_messages"

Upstream being http://127.0.0.1:8000 - that is gunicorn.

In /var/log/postorius/gunicorn.log, I see numerous "[CRITICAL] WORKER TIMEOUT".
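To see when the timeouts cluster, the log can be tallied per day. This is a rough Python sketch that assumes the usual gunicorn error-log layout of "[date time zone] [pid] [LEVEL] message"; the sample lines and pids below are hypothetical, not copied from the real log.

```python
import re
from collections import Counter

# Assumed gunicorn error-log layout: "[date time zone] [pid] [LEVEL] message".
TIMEOUT = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2}) [^\]]*\] \[\d+\] \[CRITICAL\] WORKER TIMEOUT"
)

def timeouts_per_day(lines):
    """Count '[CRITICAL] WORKER TIMEOUT' entries, keyed by date string."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines for illustration only.
sample = [
    "[2022-12-14 13:05:36 +0000] [8901] [CRITICAL] WORKER TIMEOUT (pid:8935)",
    "[2022-12-14 13:05:37 +0000] [9001] [INFO] Booting worker with pid: 9001",
    "[2022-12-14 13:40:02 +0000] [8901] [CRITICAL] WORKER TIMEOUT (pid:9001)",
]
counts = timeouts_per_day(sample)
for day, count in sorted(counts.items()):
    print(day, count)
```

Against the real file, the same function could be fed an open file handle for /var/log/postorius/gunicorn.log instead of the sample list.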

#5 Updated by pjessen 3 months ago

pjessen wrote:

This started on 14 October and has been going on ever since. Was some limit reset? As far as I can tell, we have "client_max_body_size 400M;", but that obviously does not work, somehow.

I have reduced to "client_max_body_size 10M;" and this seems to work ??

#6 Updated by pjessen 3 months ago

Have changed the gunicorn timeout to 0 on the command line, with mailman-web.service.d/timeout.conf. Instead of a 502, I'm now getting a 504, which suggests it is nginx timing out.

#7 Updated by pjessen 3 months ago

By default nginx has a 60 second timeout - I have added this to the http{} section in /etc/nginx/nginx.conf:

proxy_send_timeout          600;
proxy_read_timeout          600;
send_timeout                600;

This does the trick - why discarding a message takes about 90secs, I have no idea.
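The 502, then 504, then success sequence seen so far can be summarised in a small model. This Python sketch is a simplification under stated assumptions: gunicorn kills a worker that runs past its timeout (0 disables the timeout), which nginx reports as 502, and nginx returns 504 when the upstream stays silent past proxy_read_timeout. The function name is hypothetical, and the 30 and 60 second values are the gunicorn and nginx defaults.

```python
def proxy_outcome(request_secs, gunicorn_timeout, nginx_read_timeout):
    """Predict the status a client sees for one slow request behind nginx + gunicorn.

    Simplified model: whichever timeout is shorter fires first. A gunicorn
    timeout of 0 means "no timeout".
    """
    gunicorn_limit = gunicorn_timeout if gunicorn_timeout > 0 else float("inf")
    first_limit = min(gunicorn_limit, nginx_read_timeout)
    if request_secs <= first_limit:
        return 200
    # The worker is killed first -> 502; otherwise nginx gives up first -> 504.
    return 502 if gunicorn_limit <= nginx_read_timeout else 504

# A discard that takes about 90 seconds, under the three configurations tried:
print(proxy_outcome(90, 30, 60))   # gunicorn default timeout, nginx default
print(proxy_outcome(90, 0, 60))    # gunicorn --timeout=0, nginx default
print(proxy_outcome(90, 0, 600))   # gunicorn --timeout=0, nginx timeouts at 600
```

Under this model the timeouts only stop the request from being cut off; they do nothing about the 90 seconds themselves.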

#8 Updated by pjessen 3 months ago

  • Due date set to 2022-12-22
  • Status changed from New to In Progress
  • Assignee set to pjessen
  • Priority changed from High to Normal
  • % Done changed from 0 to 30

Three changes:

  • nginx - /etc/nginx/vhosts.d/lists.opensuse.org.conf - client_max_body_size 10M; I don't see any reason why this should work any better than the previous client_max_body_size 400M;
  • gunicorn - mailman-web.service.d/timeout.conf - --timeout=0
  • nginx - /etc/nginx/nginx.conf - the three timeouts as above.

I'll leave this for now and review next week.

#9 Updated by pjessen 3 months ago

pjessen wrote:

I'll leave this for now and review next week.

So far it is looking good. The "worker timeout" messages have gone from the gunicorn log, but there are still some "upstream prematurely closed connection" messages in the nginx error log, interestingly all on exports of archives in mailbox format.

#11 Updated by pjessen 2 months ago

luc14n0 wrote:

Use MySQL for redirection based on table look-ups, instead of Nginx.

I don't see nginx and the huge redirection maps as being the culprit here, but it ought to be easy to prove/disprove.
Just a little earlier, I wanted to remove some non-member addresses from users.lists:

  • After hitting "delete", it took 2min16sec for the Confirm page to appear
  • After hitting "Confirm", it took 2m12sec to return to the list page.

#12 Updated by pjessen 2 months ago

pjessen wrote:

luc14n0 wrote:

Use MySQL for redirection based on table look-ups, instead of Nginx.

I don't see nginx and the huge redirection maps as being the culprit here, but it ought to be easy to prove/disprove.

I removed the include for the mail redirects, almost 3 million lines, did an "nginx reload" and tried to remove a non-member again:

  • After hitting "delete", it took 2min8sec for the Confirm page to appear.
  • After hitting "Confirm", it took 2min7sec to return to the list page.
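To put the two sets of timings side by side, the durations can be converted to seconds and compared. This is a small Python sketch; the helper name and the accepted duration format are assumptions chosen to match how the times are written in this ticket.

```python
import re

# Accepts the spellings used in this ticket, e.g. "2min16sec" or "2m12sec".
DURATION = re.compile(r"(\d+)\s*m(?:in)?\s*(\d+)\s*s(?:ec)?")

def to_seconds(text):
    """Convert a 'XminYsec' style duration to a number of seconds."""
    match = DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds

# Timings with the redirect maps included, then with the include removed.
with_map = [to_seconds("2min16sec"), to_seconds("2m12sec")]
without_map = [to_seconds("2min8sec"), to_seconds("2min7sec")]

for step, before, after in zip(("delete", "confirm"), with_map, without_map):
    change = (before - after) / before * 100
    print(f"{step}: {before}s -> {after}s ({change:.1f}% faster)")
```

Both steps still take over two minutes without the maps, and the difference is only a few percent, which fits the view that the redirect maps are not the main cause. With a single measurement per case, that small difference could also just be noise.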
