tickets #104899
closedmoderation of lists.opensuse.org very slow / timing out
100%
Description
hi,
when trying to moderating emails via the lists.opensuse.org interface
i only get timeouts, as if it runs into database locks / timeouts.
readonly is quick and fine, but any write operation is timing out.
Ciao, Marcus
Updated by pjessen about 3 years ago
- Category set to Mailing lists
- Private changed from Yes to No
Confirm, I saw this too in the last couple of days, when trying to discard messages from the moderation queue.
Updated by pjessen almost 3 years ago
From /var/log/mailman/uwsgi-error.log:
Mon Jan 17 15:51:10 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:12 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:14 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:15 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:16 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:17 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:18 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:19 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:20 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:21 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:22 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:23 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:24 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:25 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:26 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:27 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Updated by pjessen almost 3 years ago
FWIW, discarding 6 held messages from packaging.lists.o.o just took 1m35s.
Updated by pjessen almost 3 years ago
Tuesday 18 Jan I added a robots.txt file to restrict some robots from accessing the archives. I had a hunch that too frequent archive access might cause performance issue. Granted, it's a shot in the dark.
/var/lib/mailman_webui/static/robots.txt :
User-Agent: AhrefsBot
Disallow: /archives/
User-Agent: YisouSpider
Disallow: /archives/
User-Agent: Bytespider
Disallow: /archives/
User-Agent: SeznamBot
Disallow: /archives/
User-Agent: SemrushBot
Disallow: /archives/
User-Agent: magpie-crawler
Disallow: /archives/
User-Agent: *
Crawl-Delay: 10
Updated by msmeissn almost 3 years ago
it feels a bit faster. but it is still slow and i still run into timeouts
Updated by pjessen almost 3 years ago
msmeissn wrote:
it feels a bit faster. but it is still slow and i still run into timeouts
Thanks for the feedback - that is my experience too. I have a feeling it is not the discarding of messages that is slow, but the re-fetching of the list afterwards. I have tried to spot any long-running database trx, but no luck.
Updated by pjessen almost 3 years ago
pjessen wrote:
I have a feeling it is not the discarding of messages that is slow, but the re-fetching of the list afterwards.
Nah, ignore that. It simply takes 15-20seconds to discard one message. I have just timed it for 1-2-3-4.
Updated by msmeissn almost 3 years ago
i would expect more database slowness or application slowness than overload by bots ... are there perhaps a lot (thousands) of emails in other moderation queues?
Updated by pjessen almost 3 years ago
msmeissn wrote:
i would expect more database slowness or application slowness than overload by bots ...
I agree, but I can't find any evidence. I simply don't know what the time is spent on.
are there perhaps a lot (thousands) of emails in other moderation queues?
Many lists have a few hundred messages in queues, yes.
Updated by pjessen almost 3 years ago
On 'users-hu', we currently have 529 held messages and it takes 50-60seconds to discard one.
Updated by cboltz almost 3 years ago
~500 messages should be nothing for the database server ("let's talk again when we have 5000000" ;-)
OTOH, I wouldn't be surprised if the mailman developers only test with only 5 held messages because they expect everybody to moderate quickly, and never even thought about the performance in this area. (And, to be honest, with 500 mails in the queue for one list it might be a better idea to switch to reject-nonmembers or even reject-what-would-be-moderated. That's better than deleting these mails a month later.)
Since the overall mailman performance isn't that bad, my guess is that the terrible slowdown is caused by a query or maybe "just" a missing index around the held messages.
Is there a way to find out which queries run for a) deleting a held message and b) getting the list of held messages? And what does EXPLAIN $query
*) say for them?
*) Disclaimer: I only have some experience with debugging MySQL queries, and hope that similar methods work on Postgres. And while I can probably give some hints, I'll need someone who is familiar with Postgres to actually do whatever is needed.
BTW: Are you sure that it's really discarding a message that is so slow? Or is it (re)loading the list of held messages afterwards?
Updated by pjessen almost 3 years ago
cboltz wrote:
~500 messages should be nothing for the database server ("let's talk again when we have 5000000" ;-)
Agree.
OTOH, I wouldn't be surprised if the mailman developers only test with only 5 held messages because they expect everybody to
moderate quickly, and never even thought about the performance in this area. (And, to be honest, with 500 mails in the queue
for one list it might be a better idea to switch to reject-nonmembers or even reject-what-would-be-moderated.
That's better than deleting these mails a month later.
Generally I agree, although we would still end up with a couple of lists open to all (a board decision IIRC?)
Since the overall mailman performance isn't that bad, my guess is that the terrible slowdown is caused by a query or maybe
"just" a missing index around the held messages.
Been there, had that suspicion. There are some slow queries, but I cannot make them coincide with discarding any held messages.
With strace, all I see is lots of communication (sendto, recvfrom) with something.
Is there a way to find out which queries run for a) deleting a held message and b) getting the list of held messages? And what does
EXPLAIN $query
*) say for them?
It really does not appear to be a database issue. Besides, a 50sec delay for 500 rows?
BTW: Are you sure that it's really discarding a message that is so slow? Or is it (re)loading the list of held messages afterwards?
I was asking myself that, see my comment nr. 7 above. For each message, it took 15-20sec.
Updated by pjessen almost 3 years ago
- Related to tickets #101848: Poor performance of lists.opensuse.org - web ui, running very slowly. added
Updated by pjessen almost 3 years ago
Just one observation - yesterday I discarded a single message from a moderation queue of only one, it also took 20-30 seconds.
Updated by pjessen almost 3 years ago
Today, I am able to discard messages in blocks of 10, only takes 5sec apiece.
Updated by pjessen almost 3 years ago
- Status changed from New to Resolved
- Assignee set to pjessen
- % Done changed from 0 to 100
pjessen wrote:
Today, I am able to discard messages in blocks of 10, only takes 5sec apiece.
Same now, takes about 1 minute to discard 10 messages from the moderation queue. This is still very poor performance, but I doubt if we can do much about that. I am closing as resolved, but feel free to re-open.
Updated by pjessen almost 3 years ago
- Status changed from Resolved to New
Re-opening. Discarding a message now takes about 20sek.
Updated by pjessen almost 3 years ago
- % Done changed from 100 to 0
Currently, it takes 25-30 seconds to discard a single held message from the xfce list. Back in Feb, I did a count of held messages, and got to about 18'000, for all lists.
Updated by pjessen almost 3 years ago
- Status changed from New to In Progress
I have managed to figure out, more or less, how to use the mailman3 REST API via wget and I can now discard messages "in batch", ie. without human interaction. It still takes 20-25-30 seconds per message, but I can now leave it to run unattended overnight or just in the background while I do something else.
The basic idea is:
- on mailman3, get a list of held messages for a list, "wget -O list.json http://localhost:8001/3.0/lists/kde@lists.opensuse.org/held"
- the resulting json data can be easily processed, use whatever your favourite tool is. I used php.
- I printed out the request_ids and the email subjects, easy to grep.
- use a list of request_ids and produce urls: http://localhost:8001/3.0/lists/kde@lists.opensuse.org/held/REQID
- wget with --post-data="action=discard"
To anyone who is python-literate, I'm sure this is backwards and stupid, but for me, it does the job.
Updated by pjessen almost 3 years ago
There must be a leak or some such, somewhere in mailman. I think the mailman "runner" that services the REST API at port 8001 was restarted this morning and now discard requests are processed in 4-5 seconds instead of 25-30 seconds yesterday. I am continuing clearing out held messages, but the general problem will not fix itself.
Updated by pjessen almost 3 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
I have cleared out about 90% of the 20K held messages, there is now 2118 left. As I wrote earlier, I think this is a bug in postorious, so the slowness will likely reappear. Feel free to re-open.
Updated by pjessen over 2 years ago
- Related to tickets #113111: lists.opensuse.org - webserver backand timeouts - nginx 504 gateway timeout added