tickets #121993
closed
unable to process held_messages (reject, discard) in mailman
Added by lkocman about 2 years ago.
Updated over 1 year ago.
Description
Hello team
this has been the situation for the past two weeks or so
I'm the person handling any potential source-dvd requests for openSUSE Lea
https://en.opensuse.org/Source_code
For the past two or three weeks I have not been able to reject or discard the numerous spam messages that we're getting on a daily basis.
That increases the chance that I could eventually miss a valid request.
https://lists.opensuse.org/manage/lists/sourcedvd.lists.opensuse.org/held_messages
I've tried both discard and reject, on single and on multiple messages; any action leads to a wait of about 30 seconds and ends with a 502 Bad Gateway.
Could you please look into this?
- Is duplicate of tickets #116084: lists.opensuse.org / mailing list web archive - timeouts, sluggishness, nonresponsive, nginx timeout etc added
- Private changed from Yes to No
This issue is well known.
Looking at the /var/log/nginx/error.log files, I see e.g.
/var/log/nginx/error.log-20221014.xz:2022/10/13 13:43:35 [error] 2062#2062: *624197 client intended to send too large body: 5071396 bytes, client: 127.0.0.1, server: lists.opensuse.org, request: "POST /archives/api/mailman/archive HTTP/1.1", host: "localhost"
This started on 14 October and has been going on ever since. Was some limit reset? As far as I can tell, we have "client_max_body_size 400M;", but that obviously does not work, somehow.
I also see messages like this (when trying to discard a message):
2022/12/14 13:05:36 [error] 8935#8935: *5233 upstream prematurely closed connection while reading response header from upstream, client: 2a03:7520:4c68:1:ff99:ffff:0:98fc, server: lists.opensuse.org, request: "POST /manage/lists/sourcedvd.lists.opensuse.org/held_messages HTTP/1.1", upstream: "http://127.0.0.1:8000/manage/lists/sourcedvd.lists.opensuse.org/held_messages", host: "lists.opensuse.org", referrer: "https://lists.opensuse.org/manage/lists/sourcedvd.lists.opensuse.org/held_messages"
Upstream being http://127.0.0.1:8000 - that is gunicorn.
In /var/log/postorius/gunicorn.log, I see numerous "[CRITICAL] WORKER TIMEOUT".
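To see when the worker timeouts cluster, the gunicorn log can be tallied per day. This is only a minimal sketch: it assumes the usual gunicorn error-log line shape (timestamp, arbiter pid, level, message), and the function name `timeouts_per_day` is made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape (not copied from the real log on this host):
# [2022-12-14 13:05:36 +0000] [8935] [CRITICAL] WORKER TIMEOUT (pid:8940)
TIMEOUT_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2}) [^\]]*\] \[\d+\] \[CRITICAL\] WORKER TIMEOUT"
)


def timeouts_per_day(lines):
    """Count '[CRITICAL] WORKER TIMEOUT' entries per calendar day."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It could be fed the log directly, e.g. `timeouts_per_day(open("/var/log/postorius/gunicorn.log"))`.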
pjessen wrote:
This started on 14 October and has been going on ever since. Was some limit reset? As far as I can tell, we have "client_max_body_size 400M;", but that obviously does not work, somehow.
I have reduced it to "client_max_body_size 10M;" and this seems to work??
I have changed the gunicorn timeout to 0 on the command line, with mailman-web.service.d/timeout.conf. Instead of a 502, I'm now getting a 504, which suggests it is nginx timing out.
By default nginx has a 60 second timeout - I have added this to the http{} section in /etc/nginx/nginx.conf:
proxy_send_timeout 600;
proxy_read_timeout 600;
send_timeout 600;
This does the trick - why discarding a message takes about 90secs, I have no idea.
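The sequence 502, then 504, then success fits a simple model in which whichever timeout is shorter decides the outcome. The sketch below is a simplification for illustration only: it treats nginx's proxy_read_timeout as a limit on the whole response (it really applies between two successive reads), and the function name `proxied_outcome` is made up. The 30 and 60 second values are the gunicorn and nginx defaults.

```python
def proxied_outcome(request_secs, gunicorn_timeout, nginx_read_timeout):
    """Return (http_status, seconds_until_response) under a simplified model.

    request_secs       -- how long the backend needs to answer
    gunicorn_timeout   -- gunicorn --timeout value; 0 disables the worker timeout
    nginx_read_timeout -- nginx proxy_read_timeout in seconds
    """
    # gunicorn treats a timeout of 0 as "never kill the worker".
    worker_deadline = float("inf") if gunicorn_timeout == 0 else gunicorn_timeout

    if request_secs <= min(worker_deadline, nginx_read_timeout):
        return 200, request_secs
    if worker_deadline <= nginx_read_timeout:
        # Worker is killed first: nginx sees the upstream close early.
        return 502, worker_deadline
    # nginx gives up waiting first.
    return 504, nginx_read_timeout


# A discard that needs about 90 seconds, as observed above:
print(proxied_outcome(90, 30, 60))   # defaults on both sides
print(proxied_outcome(90, 0, 60))    # gunicorn timeout disabled
print(proxied_outcome(90, 0, 600))   # plus proxy_read_timeout 600
```

Under these assumptions the three calls give a 502 after 30 seconds, a 504 after 60 seconds, and a normal response after 90 seconds, which matches what was seen at each step.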
- Due date set to 2022-12-22
- Status changed from New to In Progress
- Assignee set to pjessen
- Priority changed from High to Normal
- % Done changed from 0 to 30
Three changes:
- nginx, /etc/nginx/vhosts.d/lists.opensuse.org.conf:
  client_max_body_size 10M;
  I don't see any reason why this should work any better than the previous client_max_body_size 400M;
- gunicorn, mailman-web.service.d/timeout.conf:
  --timeout=0
- nginx, /etc/nginx/nginx.conf: the three timeouts as above.
I'll leave this for now and review next week.
pjessen wrote:
I'll leave this for now and review next week.
So far it is looking good. The "worker timeout" messages have gone from the gunicorn log, but there are still some "upstream prematurely closed connection" messages in the nginx error log; interestingly, all of them are on exports of archives in mailbox format.
luc14n0 wrote:
Use MySQL for redirection based on table look-ups, instead of Nginx.
I don't see nginx and the huge redirection maps as being the culprit here, but it ought to be easy to prove/disprove.
Just a little earlier, I wanted to remove some non-member addresses from users.lists:
- After hitting "delete", it took 2min16sec for the Confirm page to appear.
- After hitting "Confirm", it took 2min12sec to return to the list page.
pjessen wrote:
luc14n0 wrote:
Use MySQL for redirection based on table look-ups, instead of Nginx.
I don't see nginx and the huge redirection maps as being the culprit here, but it ought to be easy to prove/disprove.
I removed the include for the mail redirects, almost 3 million lines, did an "nginx reload" and tried to remove a non-member again:
- After hitting "delete", it took 2min8sec for the Confirm page to appear.
- after hitting "Confirm", it took 2min7sec to return to the list page.
luc14n0 wrote:
Use MySQL for redirection based on table look-ups, instead of Nginx.
FWIW, I wrote a small proxy daemon for that, see #101842. It reduced the memory footprint, but did nothing for this issue. Discarding a held message currently takes 3 minutes and 15 seconds.
- Related to tickets #129463: mailman3 - the admin-auto list has 10000 held messages added
- Status changed from In Progress to Resolved
- % Done changed from 30 to 100
It looks like all that was needed was a clean-up - with too many held messages, processing time increased to three minutes and more.
It is clearly a poor design when such a relatively small amount of data can affect the processing in this way. Even with 1000 held messages, it was very noticeable - once I had the total down to double-digits, the UI was responding in single-digit seconds.
I know, I know - I ought to open a bug with the mailman3 project .....
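For future clean-ups, the held queue could probably be emptied through the Mailman core REST API instead of the web UI. The following is only a sketch, assuming the mailmanclient package is installed; the REST URL and credentials are placeholders, `discard_held` and its `keep` predicate are hypothetical helpers, and whether this is any faster than Postorius with thousands of held messages is untested.

```python
def discard_held(mlist, keep=lambda held: False):
    """Discard every held message on `mlist` unless keep(held) is true.

    `mlist` is expected to behave like a mailmanclient MailingList:
    `mlist.held` yields held-message objects that have a discard() method.
    Returns the number of messages discarded.
    """
    discarded = 0
    for held in list(mlist.held):
        if keep(held):
            continue  # leave this one for manual moderation
        held.discard()
        discarded += 1
    return discarded


def main():
    # Not called automatically. Needs mailmanclient and a reachable
    # Mailman core REST API; URL, user and password are placeholders.
    from mailmanclient import Client

    client = Client("http://localhost:8001/3.1", "restadmin", "restpass")
    mlist = client.get_list("sourcedvd.lists.opensuse.org")
    print(discard_held(mlist))
```

On a list like sourcedvd, where a valid request may be hiding among the spam, a `keep` predicate (for example one that inspects the held message's subject) would be needed so that only obvious junk is discarded.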