tickets #135785 (closed)

Assess raising matrix.i.o.o resources

Added by luc14n0 about 1 year ago. Updated about 1 year ago.

Status:
Closed
Priority:
Normal
Category:
Core services and virtual infrastructure
Target version:
-
Start date:
2023-09-14
Due date:
% Done:

0%

Estimated time:

Description

I have been watching matrix.i.o.o's resource consumption over the last few days, and we should assess raising its RAM, at the very least:

luc14n0@matrix:~> free -h
               total        used        free      shared  buff/cache   available
Mem:            14Gi        13Gi       525Mi       0.0Ki       784Mi       961Mi
Swap:          255Mi       255Mi       0.0Ki

It's been like that for about as long as I've been watching it -- since around three days ago, when I took a closer look at the Hookshot bridge failures.

The load average, from what I've seen, has been reasonably OK at the times I was looking. Sometimes, though, there are some crazy CPU spikes:

luc14n0@matrix:~> w
 21:25:38 up 10 days,  8:44,  3 users,  load average: 5.76, 11.07, 23.69

Like this one I saw today, but I can't say exactly what is causing them or how frequently they happen.


Files

Screenshot from 2023-09-15 22-16-52.png (38.1 KB): Htop - used resources after a while, post-reboot. luc14n0, 2023-09-16 01:20
clipboard-202310022239-hlbma.png (81.3 KB): matrix.i.o.o Grafana. luc14n0, 2023-10-03 01:39
Actions #1

Updated by luc14n0 about 1 year ago

  • Private changed from Yes to No
Actions #2

Updated by pjessen about 1 year ago

256Mb swap is ridiculous, in my opinion, but no more ridiculous than 14Gb for exchanging a few messages 😱
Any chance this is yet-another-memory-leak? Something that needs tidying up every so often? I can't log in to matrix.i.o.o to have a look, so I'm just thinking out loud.

Actions #3

Updated by luc14n0 about 1 year ago

pjessen wrote in #note-2:

256Mb swap is ridiculous, in my opinion, but no more ridiculous than 14Gb for exchanging a few messages 😱

(Un)Fortunately we're doing much more than exchanging a few messages nowadays. I can't say exact numbers, Jacob knows better, but we have dozens of rooms. I imagine at least some are (very) busy, at times. We do have thousands of users, most of which are bots. We have 17 workers, if I'm counting them correctly. We have three bridges (Discord, Telegram, and Hookshot).

All of that doesn't scale well with Synapse, which is written in Python.

Any chance this is yet-another-memory-leak? Something that needs tidying up every so often? I can't log in to matrix.i.o.o to have a look, so I'm just thinking out loud.

I don't think so. Looking at it right now, it's still sitting at around 13.2G/14.4G -- in htop:

matrix (matrix.o.o):~ # free --mega
               total        used        free      shared  buff/cache   available
Mem:           15506       14584         207           0        1079         921
Swap:            268         268           0

Just like yesterday. But I'm going to reboot the VM and watch it over time.

Actions #4

Updated by luc14n0 about 1 year ago

Right now the load is below 100%, though:

21:06:29 up 11 days,  8:25,  4 users,  load average: 4.88, 4.68, 4.60
Actions #5

Updated by hellcp about 1 year ago

Before you restart, you could run zypper dup on the machine, as there's a pending synapse update.

Actions #6

Updated by luc14n0 about 1 year ago

OK. I did a zypper dup, then rebooted.

When the system came back up, it sat at around 3.8G. Now, after a while, it has climbed higher. Let's see if it keeps climbing tomorrow.

matrix (matrix.o.o):~ # free --mega
               total        used        free      shared  buff/cache   available
Mem:           15506        6334        6817           0        2719        9171
Swap:            268           0         268
matrix (matrix.o.o):~ # w
 00:55:57 up  2:27,  1 user,  load average: 1.19, 0.47, 0.42
Actions #7

Updated by luc14n0 about 1 year ago

BTW, I left htop open the whole time, and it shows there was a RAM usage spike (around 8G, judging by the bar in the attached picture). Just an FYI.

Actions #8

Updated by pjessen about 1 year ago

luc14n0 wrote in #note-3:

pjessen wrote in #note-2:

256Mb swap is ridiculous, in my opinion, but no more ridiculous than 14Gb for exchanging a few messages 😱

(Un)Fortunately we're doing much more than exchanging a few messages nowadays. I can't say exact numbers, Jacob knows better, but we have dozens of rooms.
I imagine at least some are (very) busy, at times. We do have thousands of users, most of which are bots. We have 17 workers, if I'm counting them correctly.
We have three bridges (Discord, Telegram, and Hookshot).

I may be old-fashioned, but that still sounds like "exchanging a few messages" 🙂

Any chance this is yet-another-memory-leak? Something that needs tidying up every so often? I can't log in to matrix.i.o.o to have a look, so I'm just thinking out loud.

I don't think so. Looking at it right now, it's still sitting at around 13.2G/14.4G -- in htop:

I only mentioned it because that was the case with mailman3. (also python ...)
Restarting the gunicorn workers after X requests produced a significant reduction in memory footprint. (--max-requests=500)
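
For reference, a minimal sketch of that kind of worker recycling with gunicorn (the worker count and the app module below are made-up placeholders, not the actual mailman3 setup):

# Recycle each worker after roughly 500 requests, with some jitter so they
# don't all restart at the same time, to cap per-worker memory growth.
gunicorn --workers 4 --max-requests 500 --max-requests-jitter 50 myapp.wsgi:application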

Actions #9

Updated by luc14n0 about 1 year ago

  • Status changed from New to Feedback

pjessen wrote in #note-8:

luc14n0 wrote in #note-3:

pjessen wrote in #note-2:

256Mb swap is ridiculous, in my opinion, but no more ridiculous than 14Gb for exchanging a few messages 😱

(Un)Fortunately we're doing much more than exchanging a few messages nowadays. I can't say exact numbers, Jacob knows better, but we have dozens of rooms.
I imagine at least some are (very) busy, at times. We do have thousands of users, most of which are bots. We have 17 workers, if I'm counting them correctly.
We have three bridges (Discord, Telegram, and Hookshot).

I may be old-fashioned, but that still sounds like "exchanging a few messages" 🙂

Well, in a way it might very well be. However, in general Synapse is doing more things in the background than, let's say, an IRC server. There are so many cogwheels spinning day in, day out.

Any chance this is yet-another-memory-leak? Something that needs tidying up every so often? I can't log in to matrix.i.o.o to have a look, so I'm just thinking out loud.

I don't think so. Looking at it right now, it's still sitting at around 13.2G/14.4G -- in htop:

I only mentioned it because that was the case with mailman3. (also python ...)
Restarting the gunicorn workers after X requests produced a significant reduction in memory footprint. (--max-requests=500)

Taking another look:

matrix (matrix.o.o):~ # free --mega
               total        used        free      shared  buff/cache   available
Mem:           15506        6651        6180           0        3039        8854
Swap:            268           0         268
matrix (matrix.o.o):~ # uptime
 23:08:48  up 1 day  0:40,  1 user,  load average: 0.15, 0.23, 0.43

And you know what? At this point I won't say that there isn't a memory leak, or something similar -- like a worker spawning too many children. I'm going to keep an eye on the monitors for a while.

Actions #10

Updated by luc14n0 about 1 year ago

I believe my initial proposition when opening this ticket was a hasty judgement, and I have changed my mind about raising the resources allocated to the matrix.i.o.o VM. However, I'm going to keep this ticket in feedback status while I keep watching.

Actions #11

Updated by luc14n0 about 1 year ago

Indeed Per, you do have a point.

I've kept matrix.i.o.o under watch, and day by day the RAM consumption has been rising, little by little. I have a suspicion it has to do with our currently broken federation with matrix.org.

luc14n0@matrix:~> free --mega
               total        used        free      shared  buff/cache   available
Mem:           15506        9053         552           0        6265        6452
Swap:            268           1         266

The system load, however, has been tame ever since the last reboot:

luc14n0@matrix:~> uptime
 00:08:59  up 13 days  1:40,  1 user,  load average: 0.23, 0.33, 0.40

Since there are incoming updates for element-web, and soon there will be one for Synapse as well, I'm thinking of creating a script to try to find out more about possible culprits after updating the system -- there is a kernel update in the stack too. I do see lots of processes related to federation.

If anyone knows a handy script that would fit the job of monitoring RAM usage and process spawning, please speak up. In the meantime I'm going to take a look at openSUSE's System Analysis and Tuning Guide, more specifically the System Monitoring part.
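
If nothing better turns up, here's a minimal sketch of such a watcher (the log path, the five-minute interval, and the synapse process pattern are just assumptions on my part, not anything we've agreed on):

#!/bin/bash
# Append a timestamped memory and top-process snapshot to a log file.
LOG=/var/log/matrix-resource-watch.log   # hypothetical path
while true; do
    {
        echo "=== $(date --iso-8601=seconds) ==="
        free --mega
        # top five memory consumers
        ps -eo pid,ppid,user,%mem,%cpu,cmd --sort=-%mem | head -6
        # number of synapse processes currently running
        echo "synapse processes: $(pgrep -c -f 'python3 -m synapse')"
    } >> "$LOG"
    sleep 300   # every 5 minutes
done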

Actions #12

Updated by crameleon about 1 year ago

If anyone knows a handy script that would fit the job for monitoring RAM usage

Is it not monitored by nrpe, visible in Icinga?

Actions #13

Updated by luc14n0 about 1 year ago

crameleon wrote in #note-12:

If anyone knows a handy script that would fit the job for monitoring RAM usage

Is it not monitored by nrpe, visible in Icinga?

Yes, it is. And here is the graph for the last 13 days.

I don't think we're going to need any over-engineering here, though:

luc14n0@matrix:~> ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -4
  PID  PPID CMD                         %MEM %CPU
 1232     1 /usr/bin/python3 -m synapse 18.2 18.5
 1924     1 /usr/bin/python3 -m synapse  3.6  7.2
 1921     1 /usr/bin/python3 -m synapse  3.6  7.6

luc14n0@matrix:~> ps -e -o pid,user,%mem,cmd --sort=-%mem | head -4
  PID USER     %MEM CMD
 1232 synapse  18.2 /usr/bin/python3 -m synapse.app.homeserver --config-path=/etc/matrix-synapse/homeserver.yaml --config-path=/etc/matrix-synapse/conf.d/
 1924 synapse   3.6 /usr/bin/python3 -m synapse.app.generic_worker --config-path=/etc/matrix-synapse/homeserver.yaml --config-path=/etc/matrix-synapse/conf.d/ --config-path=/etc/matrix-synapse/workers/federation_sender2.yaml
 1921 synapse   3.6 /usr/bin/python3 -m synapse.app.generic_worker --config-path=/etc/matrix-synapse/homeserver.yaml --config-path=/etc/matrix-synapse/conf.d/ --config-path=/etc/matrix-synapse/workers/federation_sender1.yaml

luc14n0@matrix:~> python3 -c 'print(15506 * 0.182)'
2822.092

So the main Synapse process is using about 2,822 MB (roughly 2.8G) of RAM at this moment. I'm going to update the machine this Saturday, as there's a Synapse update available (the most recent upstream release should be available next week), and we'll see how things go from there.

Actions #14

Updated by luc14n0 about 1 year ago

For posterity's sake -- the graph will go away at some point:

(attached screenshot: matrix.i.o.o Grafana)

Actions #15

Updated by luc14n0 about 1 year ago

  • Status changed from Feedback to Closed

Alright, I'm going to close this one now, since there's nothing actionable here in the context of what this ticket was opened for.
