tickets #135785

Assess raising matrix.i.o.o resources

Added by luc14n0 8 months ago. Updated 7 months ago.

Status: Closed
Priority: Normal
Category: Core services and virtual infrastructure
Target version: -
Start date: 2023-09-14
Due date:
% Done: 0%
Estimated time:

Description

I have been watching matrix.i.o.o's resource consumption over the last few days, and we should assess raising its RAM, at least:

luc14n0@matrix:~> free -h
               total        used        free      shared  buff/cache   available
Mem:            14Gi        13Gi       525Mi       0.0Ki       784Mi       961Mi
Swap:          255Mi       255Mi       0.0Ki

It's been like that for almost as long as I've been watching it -- since around three days ago, when I took a closer look at the Hookshot bridge failures.

The average load, from what I've seen, has been reasonably OK at the times I was looking at it. Sometimes there are some crazy CPU spikes, though:

luc14n0@matrix:~> w
 21:25:38 up 10 days,  8:44,  3 users,  load average: 5.76, 11.07, 23.69

Like this one I saw today, but I can't say exactly what is causing them or how frequently they happen.


Files

Screenshot from 2023-09-15 22-16-52.png (38.1 KB) -- Htop: used resources after a while, post-reboot (luc14n0, 2023-09-16 01:20)
clipboard-202310022239-hlbma.png (81.3 KB) -- matrix.i.o.o Grafana (luc14n0, 2023-10-03 01:39)
Actions #1

Updated by luc14n0 8 months ago

  • Private changed from Yes to No
Actions #2

Updated by pjessen 8 months ago

256Mb swap is ridiculous, in my opinion, but no more ridiculous than 14Gb for exchanging a few messages 😱
Any chance this is yet-another-memory-leak? Something that needs tidying up every so often? I can't login to matrix.i.o.o to have a look, so just thinking out loud.

Actions #3

Updated by luc14n0 8 months ago

pjessen wrote in #note-2:

256Mb swap is ridiculous, in my opinion, but no more ridiculous than 14Gb for exchanging a few messages 😱

(Un)Fortunately we're doing much more than exchanging a few messages nowadays. I can't say exact numbers, Jacob knows better, but we have dozens of rooms. I imagine at least some are (very) busy, at times. We do have thousands of users, most of which are bots. We have 17 workers, if I'm counting them correctly. We have three bridges (Discord, Telegram, and Hookshot).

All of that doesn't scale well with Synapse, which is written in Python.

Any chance this is yet-another-memory-leak? Something that needs tidying up every so often? I can't login to matrix.i.o.o to have a look, so just thinking out loud.

I don't think so. Looking at it right now, it's still sitting at around 13.2G/14.4G -- in Htop:

matrix (matrix.o.o):~ # free --mega
               total        used        free      shared  buff/cache   available
Mem:           15506       14584         207           0        1079         921
Swap:            268         268           0

Just like yesterday. But I'm going to reboot the VM and watch it over time.

Actions #4

Updated by luc14n0 8 months ago

Right now the load is below 100%, though:

21:06:29 up 11 days,  8:25,  4 users,  load average: 4.88, 4.68, 4.60
Actions #5

Updated by hellcp 8 months ago

Before you restart, you could run zypper dup on the machine, as there's a pending Synapse update.

Actions #6

Updated by luc14n0 8 months ago

OK. I did a zypper dup, then rebooted.

When the system came back up, it stayed at around 3.8G. Now, after a while, it has gotten higher. Let's see if it keeps climbing tomorrow.

matrix (matrix.o.o):~ # free --mega
               total        used        free      shared  buff/cache   available
Mem:           15506        6334        6817           0        2719        9171
Swap:            268           0         268
matrix (matrix.o.o):~ # w
 00:55:57 up  2:27,  1 user,  load average: 1.19, 0.47, 0.42
Actions #7

Updated by luc14n0 8 months ago

BTW, I left htop open the whole time, and it shows that there was a RAM usage spike (around 8G, judging by the bar in the attached picture). Just an FYI.

Actions #8

Updated by pjessen 8 months ago

luc14n0 wrote in #note-3:

pjessen wrote in #note-2:

256Mb swap is ridiculous, in my opinion, but no more ridiculous than 14Gb for exchanging a few messages 😱

(Un)Fortunately we're doing much more than exchanging a few messages nowadays. I can't say exact numbers, Jacob knows better, but we have dozens of rooms.
I imagine at least some are (very) busy, at times. We do have thousands of users, most of which are bots. We have 17 workers, if I'm counting them correctly.
We have three bridges (Discord, Telegram, and Hookshot).

I may be old fashioned, but that still sounds like "exchanging a few messages" 🙂

Any chance this is yet-another-memory-leak? Something that needs tidying up every so often? I can't login to matrix.i.o.o to have a look, so just thinking out loud.

I don't think so. Looking at it right now, it's still sitting at around 13.2G/14.4G -- in Htop:

I only mentioned it because that was the case with mailman3. (also python ...)
Restarting the gunicorn workers after X requests produced a significant reduction in memory footprint. (--max-requests=500)
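For context, that's gunicorn's own --max-requests knob. Roughly something like this -- the WSGI module path and worker count here are only illustrative, not our actual mailman3 setup:

gunicorn --workers 4 --max-requests 500 --max-requests-jitter 50 \
    mailman_web.wsgi:application

The jitter just spreads the restarts out so all workers don't recycle at the same moment.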

Actions #9

Updated by luc14n0 8 months ago

  • Status changed from New to Feedback

pjessen wrote in #note-8:

luc14n0 wrote in #note-3:

pjessen wrote in #note-2:

256Mb swap is ridiculous, in my opinion, but no more ridiculous than 14Gb for exchanging a few messages 😱

(Un)Fortunately we're doing much more than exchanging a few messages nowadays. I can't say exact numbers, Jacob knows better, but we have dozens of rooms.
I imagine at least some are (very) busy, at times. We do have thousands of users, most of which are bots. We have 17 workers, if I'm counting them correctly.
We have three bridges (Discord, Telegram, and Hookshot).

I may be old fashioned, but that still sounds like "exchanging a few messages" 🙂

Well, in a way it might very well be. However, in general there are more things Synapse is doing in the background than, let's say, an IRC server. There are so many cogwheels spinning day in, day out.

Any chance this is yet-another-memory-leak? Something that needs tidying up every so often? I can't login to matrix.i.o.o to have a look, so just thinking out loud.

I don't think so. Looking at it right now, it's still sitting at around 13.2G/14.4G -- in Htop:

I only mentioned it because that was the case with mailman3. (also python ...)
Restarting the gunicorn workers after X requests produced a significant reduction in memory footprint. (--max-requests=500)

Taking another look:

matrix (matrix.o.o):~ # free --mega
               total        used        free      shared  buff/cache   available
Mem:           15506        6651        6180           0        3039        8854
Swap:            268           0         268
matrix (matrix.o.o):~ # uptime
 23:08:48  up 1 day  0:40,  1 user,  load average: 0.15, 0.23, 0.43

And you know what? At this point I won't say that there isn't a memory leak, or something similar -- like a worker spawning too many children. I'm going to keep an eye on the monitors for a while.

Actions #10

Updated by luc14n0 8 months ago

I believe the proposition I made when opening this ticket was a hasty judgement, and I've changed my mind about raising the resources allocated to the matrix.i.o.o VM. However, I'm going to keep this ticket in feedback status while I continue to keep watch.

Actions #11

Updated by luc14n0 7 months ago

Indeed Per, you do have a point.

I've kept matrix.i.o.o under my watch, and day by day the RAM consumption has been rising, little by little. I have a suspicion it has to do with our currently broken federation with matrix.org.

luc14n0@matrix:~> free --mega
               total        used        free      shared  buff/cache   available
Mem:           15506        9053         552           0        6265        6452
Swap:            268           1         266

The system load, however, has stayed tame ever since the last system reboot:

luc14n0@matrix:~> uptime
 00:08:59  up 13 days  1:40,  1 user,  load average: 0.23, 0.33, 0.40

Since there are incoming updates for element-web, and soon there will be one for Synapse as well, I'm thinking of creating a script to try to find out more about possible culprits after updating the system -- there is a kernel update in the stack too. I do see lots of processes related to federation.

If anyone knows a handy script that would fit the job for monitoring RAM usage and process spawning, please speak up. In the meantime, I'm going to take a look at openSUSE's System Analysis and Tuning Guide, more specifically the System Monitoring part.
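
Something quick and dirty along these lines might be enough for a start -- just a sketch, with the log path and the five-minute interval as placeholders:

while true; do
    date --iso-8601=seconds
    free --mega
    ps -eo pid,user,%mem,rss,cmd --sort=-%mem | head -6
    echo '----'
    sleep 300
done >> /var/log/matrix-mem-watch.log 2>&1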

Actions #12

Updated by crameleon 7 months ago

If anyone knows a handy script that would fit the job for monitoring RAM usage

Is it not monitored by nrpe, visible in Icinga?

Actions #13

Updated by luc14n0 7 months ago

crameleon wrote in #note-12:

If anyone knows a handy script that would fit the job for monitoring RAM usage

Is it not monitored by nrpe, visible in Icinga?

Yes, it is. And here is the graph for the last 13 days.

I don't think we're going to need any over-engineering here, though:

luc14n0@matrix:~> ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -4
  PID  PPID CMD                         %MEM %CPU
 1232     1 /usr/bin/python3 -m synapse 18.2 18.5
 1924     1 /usr/bin/python3 -m synapse  3.6  7.2
 1921     1 /usr/bin/python3 -m synapse  3.6  7.6

luc14n0@matrix:~> ps -e -o pid,user,%mem,cmd --sort=-%mem | head -4
  PID USER     %MEM CMD
 1232 synapse  18.2 /usr/bin/python3 -m synapse.app.homeserver --config-path=/etc/matrix-synapse/homeserver.yaml --config-path=/etc/matrix-synapse/conf.d/
 1924 synapse   3.6 /usr/bin/python3 -m synapse.app.generic_worker --config-path=/etc/matrix-synapse/homeserver.yaml --config-path=/etc/matrix-synapse/conf.d/ --config-path=/etc/matrix-synapse/workers/federation_sender2.yaml
 1921 synapse   3.6 /usr/bin/python3 -m synapse.app.generic_worker --config-path=/etc/matrix-synapse/homeserver.yaml --config-path=/etc/matrix-synapse/conf.d/ --config-path=/etc/matrix-synapse/workers/federation_sender1.yaml

luc14n0@matrix:~> python3 -c 'print(15506 * 0.182)'
2822.092

So the main Synapse process is using about 2,822 MB (~2.8 GB) of RAM at this moment. I'm going to update the machine this Saturday, as there's a Synapse update available (the most recent upstream release should be available next week), and let's see how things go from there.
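
To get the total across the homeserver and all the workers in one go, something like this should do (assuming everything runs as the synapse user, as the ps output above suggests):

ps -u synapse -o rss= | awk '{s+=$1} END {printf "%.0f MiB\n", s/1024}'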

Actions #14

Updated by luc14n0 7 months ago

For posterity's sake -- the graph will go away at some point:

matrix.i.o.o Grafana

Actions #15

Updated by luc14n0 7 months ago

  • Status changed from Feedback to Closed

Alright, I'm going to close this one now, since there's nothing actionable here in the context of what this ticket was opened for.
