Project

General

Profile

Actions

tickets #104899

closed

moderation of lists.opensuse.org very slow / timing out

Added by meissner@suse.de over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Mailing lists
Target version:
-
Start date:
2022-01-14
Due date:
% Done:

100%

Estimated time:

Description

hi,

when trying to moderating emails via the lists.opensuse.org interface
i only get timeouts, as if it runs into database locks / timeouts.

readonly is quick and fine, but any write operation is timing out.

Ciao, Marcus


Related issues 2 (1 open1 closed)

Related to openSUSE admin - tickets #101848: Poor performance of lists.opensuse.org - web ui, running very slowly.Resolvedpjessen2021-11-02

Actions
Related to openSUSE admin - tickets #113111: lists.opensuse.org - webserver backand timeouts - nginx 504 gateway timeoutNew2022-06-28

Actions
Actions #1

Updated by pjessen over 2 years ago

  • Category set to Mailing lists
  • Private changed from Yes to No

Confirm, I saw this too in the last couple of days, when trying to discard messages from the moderation queue.

Actions #2

Updated by pjessen over 2 years ago

From /var/log/mailman/uwsgi-error.log:

Mon Jan 17 15:51:10 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:12 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:14 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:15 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:16 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:17 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:18 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:19 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:20 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:21 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:22 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:23 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:24 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:25 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Mon Jan 17 15:51:26 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (101/100) ***
Mon Jan 17 15:51:27 2022 - *** uWSGI listen queue of socket "0.0.0.0:8000" (fd: 8) full !!! (100/100) ***
Actions #3

Updated by pjessen over 2 years ago

FWIW, discarding 6 held messages from packaging.lists.o.o just took 1m35s.

Actions #4

Updated by pjessen over 2 years ago

Tuesday 18 Jan I added a robots.txt file to restrict some robots from accessing the archives. I had a hunch that too frequent archive access might cause performance issue. Granted, it's a shot in the dark.

/var/lib/mailman_webui/static/robots.txt :

User-Agent: AhrefsBot
Disallow: /archives/
User-Agent: YisouSpider
Disallow: /archives/
User-Agent: Bytespider
Disallow: /archives/
User-Agent: SeznamBot
Disallow: /archives/
User-Agent: SemrushBot
Disallow: /archives/
User-Agent: magpie-crawler
Disallow: /archives/
User-Agent: *
Crawl-Delay: 10
Actions #5

Updated by msmeissn over 2 years ago

it feels a bit faster. but it is still slow and i still run into timeouts

Actions #6

Updated by pjessen over 2 years ago

msmeissn wrote:

it feels a bit faster. but it is still slow and i still run into timeouts

Thanks for the feedback - that is my experience too. I have a feeling it is not the discarding of messages that is slow, but the re-fetching of the list afterwards. I have tried to spot any long-running database trx, but no luck.

Actions #7

Updated by pjessen over 2 years ago

pjessen wrote:

I have a feeling it is not the discarding of messages that is slow, but the re-fetching of the list afterwards.

Nah, ignore that. It simply takes 15-20seconds to discard one message. I have just timed it for 1-2-3-4.

Actions #8

Updated by msmeissn over 2 years ago

i would expect more database slowness or application slowness than overload by bots ... are there perhaps a lot (thousands) of emails in other moderation queues?

Actions #9

Updated by pjessen over 2 years ago

msmeissn wrote:

i would expect more database slowness or application slowness than overload by bots ...

I agree, but I can't find any evidence. I simply don't know what the time is spent on.

are there perhaps a lot (thousands) of emails in other moderation queues?

Many lists have a few hundred messages in queues, yes.

Actions #10

Updated by pjessen over 2 years ago

On 'users-hu', we currently have 529 held messages and it takes 50-60seconds to discard one.

Actions #11

Updated by cboltz over 2 years ago

~500 messages should be nothing for the database server ("let's talk again when we have 5000000" ;-)

OTOH, I wouldn't be surprised if the mailman developers only test with only 5 held messages because they expect everybody to moderate quickly, and never even thought about the performance in this area. (And, to be honest, with 500 mails in the queue for one list it might be a better idea to switch to reject-nonmembers or even reject-what-would-be-moderated. That's better than deleting these mails a month later.)

Since the overall mailman performance isn't that bad, my guess is that the terrible slowdown is caused by a query or maybe "just" a missing index around the held messages.

Is there a way to find out which queries run for a) deleting a held message and b) getting the list of held messages? And what does EXPLAIN $query *) say for them?

*) Disclaimer: I only have some experience with debugging MySQL queries, and hope that similar methods work on Postgres. And while I can probably give some hints, I'll need someone who is familiar with Postgres to actually do whatever is needed.

BTW: Are you sure that it's really discarding a message that is so slow? Or is it (re)loading the list of held messages afterwards?

Actions #12

Updated by pjessen over 2 years ago

cboltz wrote:

~500 messages should be nothing for the database server ("let's talk again when we have 5000000" ;-)

Agree.

OTOH, I wouldn't be surprised if the mailman developers only test with only 5 held messages because they expect everybody to
moderate quickly, and never even thought about the performance in this area. (And, to be honest, with 500 mails in the queue
for one list it might be a better idea to switch to reject-nonmembers or even reject-what-would-be-moderated.
That's better than deleting these mails a month later.

Generally I agree, although we would still end up with a couple of lists open to all (a board decision IIRC?)

Since the overall mailman performance isn't that bad, my guess is that the terrible slowdown is caused by a query or maybe
"just" a missing index around the held messages.

Been there, had that suspicion. There are some slow queries, but I cannot make them coincide with discarding any held messages.
With strace, all I see is lots of communication (sendto, recvfrom) with something.

Is there a way to find out which queries run for a) deleting a held message and b) getting the list of held messages? And what does EXPLAIN $query *) say for them?

It really does not appear to be a database issue. Besides, a 50sec delay for 500 rows?

BTW: Are you sure that it's really discarding a message that is so slow? Or is it (re)loading the list of held messages afterwards?

I was asking myself that, see my comment nr. 7 above. For each message, it took 15-20sec.

Actions #13

Updated by pjessen about 2 years ago

  • Related to tickets #101848: Poor performance of lists.opensuse.org - web ui, running very slowly. added
Actions #14

Updated by pjessen about 2 years ago

Just one observation - yesterday I discarded a single message from a moderation queue of only one, it also took 20-30 seconds.

Actions #15

Updated by pjessen about 2 years ago

Today, I am able to discard messages in blocks of 10, only takes 5sec apiece.

Actions #16

Updated by pjessen about 2 years ago

  • Status changed from New to Resolved
  • Assignee set to pjessen
  • % Done changed from 0 to 100

pjessen wrote:

Today, I am able to discard messages in blocks of 10, only takes 5sec apiece.

Same now, takes about 1 minute to discard 10 messages from the moderation queue. This is still very poor performance, but I doubt if we can do much about that. I am closing as resolved, but feel free to re-open.

Actions #17

Updated by pjessen about 2 years ago

  • Status changed from Resolved to New

Re-opening. Discarding a message now takes about 20sek.

Actions #18

Updated by pjessen about 2 years ago

  • % Done changed from 100 to 0

Currently, it takes 25-30 seconds to discard a single held message from the xfce list. Back in Feb, I did a count of held messages, and got to about 18'000, for all lists.

Actions #19

Updated by pjessen about 2 years ago

  • Status changed from New to In Progress

I have managed to figure out, more or less, how to use the mailman3 REST API via wget and I can now discard messages "in batch", ie. without human interaction. It still takes 20-25-30 seconds per message, but I can now leave it to run unattended overnight or just in the background while I do something else.
The basic idea is:

To anyone who is python-literate, I'm sure this is backwards and stupid, but for me, it does the job.

Actions #20

Updated by pjessen about 2 years ago

There must be a leak or some such, somewhere in mailman. I think the mailman "runner" that services the REST API at port 8001 was restarted this morning and now discard requests are processed in 4-5 seconds instead of 25-30 seconds yesterday. I am continuing clearing out held messages, but the general problem will not fix itself.

Actions #21

Updated by pjessen about 2 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

I have cleared out about 90% of the 20K held messages, there is now 2118 left. As I wrote earlier, I think this is a bug in postorious, so the slowness will likely reappear. Feel free to re-open.

Actions #22

Updated by pjessen almost 2 years ago

  • Related to tickets #113111: lists.opensuse.org - webserver backand timeouts - nginx 504 gateway timeout added
Actions

Also available in: Atom PDF