Project

General

Profile

Actions

tickets #135527

closed

matrix.i.o.o hookshot bridge failures and CPU spikes

Added by luc14n0 12 months ago. Updated 12 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
IRC and Matrix
Target version:
-
Start date:
2023-09-11
Due date:
% Done:

100%

Estimated time:

Description

Lately I've noticed that sometimes when I enter Element on my phone that animation of syncing never stops and rooms don't sync latest messages.

Today I logged in on matrix.i.o.o and checked the system logs, and this loop was all over the place:

Sep 11 14:04:13 matrix systemd[1]: hookshot.service: Main process exited, code=exited, status=1/FAILURE
Sep 11 14:04:13 matrix systemd[1]: hookshot.service: Failed with result 'exit-code'.
Sep 11 14:04:13 matrix systemd[1]: hookshot.service: Consumed 1.576s CPU time.
Sep 11 14:04:14 matrix systemd[1]: hookshot.service: Scheduled restart job, restart counter is at 260306.
Sep 11 14:04:14 matrix systemd[1]: Stopped Matrix Appservice hookshot.
Sep 11 14:04:14 matrix systemd[1]: hookshot.service: Consumed 1.576s CPU time.
Sep 11 14:04:14 matrix systemd[1]: Started Matrix Appservice hookshot.
Sep 11 14:04:15 matrix node[14486]: node:internal/modules/cjs/loader:1051
Sep 11 14:04:15 matrix node[14486]:   throw err;
Sep 11 14:04:15 matrix node[14486]:   ^
Sep 11 14:04:15 matrix node[14486]: Error: Cannot find module 'rss-parser'
Sep 11 14:04:15 matrix node[14486]: Require stack:
Sep 11 14:04:15 matrix node[14486]: - /data/var/lib/matrix-synapse/hookshot/lib/feeds/FeedReader.js
Sep 11 14:04:15 matrix node[14486]: - /data/var/lib/matrix-synapse/hookshot/lib/Bridge.js
Sep 11 14:04:15 matrix node[14486]: - /data/var/lib/matrix-synapse/hookshot/lib/App/BridgeApp.js
Sep 11 14:04:15 matrix node[14486]:     at Module._resolveFilename (node:internal/modules/cjs/loader:1048:15)
Sep 11 14:04:15 matrix node[14486]:     at Module._load (node:internal/modules/cjs/loader:901:27)
Sep 11 14:04:15 matrix node[14486]:     at Module.require (node:internal/modules/cjs/loader:1115:19)
Sep 11 14:04:15 matrix node[14486]:     at require (node:internal/modules/helpers:130:18)
Sep 11 14:04:15 matrix node[14486]:     at Object.<anonymous> (/data/var/lib/matrix-synapse/hookshot/lib/feeds/FeedReader.js:11:38)
Sep 11 14:04:15 matrix node[14486]:     at Module._compile (node:internal/modules/cjs/loader:1233:14)
Sep 11 14:04:15 matrix node[14486]:     at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)
Sep 11 14:04:15 matrix node[14486]:     at Module.load (node:internal/modules/cjs/loader:1091:32)
Sep 11 14:04:15 matrix node[14486]:     at Module._load (node:internal/modules/cjs/loader:938:12)
Sep 11 14:04:15 matrix node[14486]:     at Module.require (node:internal/modules/cjs/loader:1115:19)
Sep 11 14:04:15 matrix node[14486]:     at require (node:internal/modules/helpers:130:18)
Sep 11 14:04:15 matrix node[14486]:     at Object.<anonymous> (/data/var/lib/matrix-synapse/hookshot/lib/Bridge.js:34:22)
Sep 11 14:04:15 matrix node[14486]:     at Module._compile (node:internal/modules/cjs/loader:1233:14)
Sep 11 14:04:15 matrix node[14486]:     at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)
Sep 11 14:04:15 matrix node[14486]:     at Module.load (node:internal/modules/cjs/loader:1091:32)
Sep 11 14:04:15 matrix node[14486]:     at Module._load (node:internal/modules/cjs/loader:938:12) {
Sep 11 14:04:15 matrix node[14486]:   code: 'MODULE_NOT_FOUND',
Sep 11 14:04:15 matrix node[14486]:   requireStack: [
Sep 11 14:04:15 matrix node[14486]:     '/data/var/lib/matrix-synapse/hookshot/lib/feeds/FeedReader.js',
Sep 11 14:04:15 matrix node[14486]:     '/data/var/lib/matrix-synapse/hookshot/lib/Bridge.js',
Sep 11 14:04:15 matrix node[14486]:     '/data/var/lib/matrix-synapse/hookshot/lib/App/BridgeApp.js'
Sep 11 14:04:15 matrix node[14486]:   ]
Sep 11 14:04:15 matrix node[14486]: }
Sep 11 14:04:15 matrix node[14486]: Node.js v20.5.1
Sep 11 14:04:15 matrix systemd[1]: hookshot.service: Main process exited, code=exited, status=1/FAILURE
Sep 11 14:04:15 matrix systemd[1]: hookshot.service: Failed with result 'exit-code'.
Sep 11 14:04:15 matrix systemd[1]: hookshot.service: Consumed 1.515s CPU time.
Sep 11 14:04:16 matrix systemd[1]: Stopped Matrix Appservice hookshot.
Sep 11 14:04:16 matrix systemd[1]: hookshot.service: Consumed 1.515s CPU time.

The hookshot bridge was failing and retrying, failing and retrying, due to Error: Cannot find module 'rss-parser'. What are we using the hookshot bridge for?

The system load was ~3 ~3 ~3 (higher than seen in the attached image) before I stopped hookshot.service. And after observing Htop for a while the load average is still high -- more than 2, and less than 3.


Files

Screenshot from 2023-09-11 11-17-54.png (33 KB) Screenshot from 2023-09-11 11-17-54.png Htop - used resources luc14n0, 2023-09-11 14:50
Screenshot from 2023-09-11 11-18-53.png (76.1 KB) Screenshot from 2023-09-11 11-18-53.png Htop - top CPU eaters luc14n0, 2023-09-11 14:50
Actions #1

Updated by luc14n0 12 months ago

  • Private changed from Yes to No
Actions #2

Updated by crameleon 12 months ago

I remember chatting about this with @hellcp the other day but we did not reach a conclusion regarding hookshot. Something went awry with updates.

Actions #3

Updated by luc14n0 12 months ago

crameleon wrote in #note-2:

I remember chatting about this with @hellcp the other day but we did not reach a conclusion regarding hookshot. Something went awry with updates.

I imagine. Those NodeJS apps can be such a pain, at times.

Actions #4

Updated by luc14n0 12 months ago

Another thing I noticed, is this normal?

luc14n0@matrix:~> ping -c4 opensuse.org
PING opensuse.org (195.135.221.140) 56(84) bytes of data.
From scar.infra.opensuse.org (192.168.47.252) icmp_seq=1 Destination Port Unreachable
From scar.infra.opensuse.org (192.168.47.252) icmp_seq=2 Destination Port Unreachable
From scar.infra.opensuse.org (192.168.47.252) icmp_seq=3 Destination Port Unreachable
From scar.infra.opensuse.org (192.168.47.252) icmp_seq=4 Destination Port Unreachable

--- opensuse.org ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3003ms
Actions #5

Updated by hellcp 12 months ago

Hookshot is used for various channels with rss feeds like #mastodon:opensuse.org

Actions #6

Updated by luc14n0 12 months ago

hellcp wrote in #note-5:

Hookshot is used for various channels with rss feeds like #mastodon:opensuse.org

Great, that's good to know. Personally I never used Mastodon, nor stopped by such channels before.

Actions #7

Updated by luc14n0 12 months ago

BTW, how are we updating things over there, more specifically that particular bridge?

P.S.: I spoke too soon. From Bash history, I see sudo -u synapse yarn run build and sudo -u synapse npm run build, I suppose that's run under /var/lib/matrix-synapse/hookshot after doing sudo -u synapse git foo.

Actions #8

Updated by crameleon 12 months ago

Another thing I noticed, is this normal?

It's not normal, but I couldn't quite figure out what the issue is by inspecting /etc/nftables.conf on scar.i.o.o. Other hosts in the network are allowed ICMP/to ping just fine. But I couldn't notice it causing any issues with the applications on matrix.i.o.o either.

From Bash history, I see sudo -u synapse yarn run build and sudo -u synapse npm run build, I suppose that's run under /var/lib/matrix-synapse/hookshot after doing sudo -u synapse git foo.

That was just me debugging what is usually executed by Salt, sorry for the confusion.

Actions #9

Updated by luc14n0 12 months ago

crameleon wrote in #note-8:

Another thing I noticed, is this normal?

It's not normal, but I couldn't quite figure out what the issue is by inspecting /etc/nftables.conf on scar.i.o.o. Other hosts in the network are allowed ICMP/to ping just fine. But I couldn't notice it causing any issues with the applications on matrix.i.o.o either.

OK. I'm not sure it causes any issue with the app either.

From Bash history, I see sudo -u synapse yarn run build and sudo -u synapse npm run build, I suppose that's run under /var/lib/matrix-synapse/hookshot after doing sudo -u synapse git foo.

That was just me debugging what is usually executed by Salt, sorry for the confusion.

Thanks for the clarification.

Actions #10

Updated by luc14n0 12 months ago

I believe I was able to hopefully fix the Hookshot bridge by using (probably) unorthodox means, with my limited NPM/Yarn skills.

Since rss-parser wasn't installed, which by the way is odd because if I got this right:

matrix (matrix.o.o):/var/lib/matrix-synapse/hookshot # grep -n rss-parser node_modules/.yarn-integrity
./node_modules/.yarn-integrity:59:    "rss-parser@^3.12.0",                                                               
./node_modules/.yarn-integrity:921:    "rss-parser@^3.12.0": "https://registry.yarnpkg.com/rss-parser/-/rss-parser-3.12.0.
tgz#b8888699ea46304a74363fbd8144671b2997984c",

Does that means that at some point it was installed, somehow?! I can't say.

$ yarn help | grep integrity
    --skip-integrity-check              run install without checking if node_modules is installed

So, what did I do? First I added rss-parser with Yarn:

matrix (matrix.o.o):/var/lib/matrix-synapse/hookshot # sudo -u synapse yarn add rss-parser

This recompiled all the Node modules successfully, but failed with:

...
$ tsc --project tsconfig.json                                     
error TS5109: Option 'moduleResolution' must be set to 'node16' (or left unspecified) when option 'module' is set to 'node16'.

So I edited tsconfig.json:

diff --git a/tsconfig.json b/tsconfig.json
index 7d9a92a..839c386 100644
--- a/tsconfig.json
+++ b/tsconfig.json
@@ -1,6 +1,8 @@
 {
   "extends": "@tsconfig/node18/tsconfig.json",
   "compilerOptions": {
+    "module": "NodeNext",
+    "moduleResolution": "NodeNext",
     "incremental": true,
     "declaration": false,
     "outDir": "./lib",

And tried that failed step again with:

matrix (matrix.o.o):/var/lib/matrix-synapse/hookshot # sudo -u synapse yarn run tsc --project tsconfig.json               
yarn run v1.22.19                                                                                                         
$ /data/var/lib/matrix-synapse/hookshot/node_modules/.bin/tsc --project tsconfig.json                                     
Done in 10.42s.

And then I restarted the hookshot.service. It seems fine:

Sep 11 22:39:44 matrix systemd[1]: Started Matrix Appservice hookshot.
Sep 11 22:39:45 matrix node[6394]: INFO 22:39:45:720 [ListenerService] Listening on http://127.0.0.1:9005 for webhooks
Sep 11 22:39:45 matrix node[6394]: INFO 22:39:45:724 [Appservice] Initialising memory storage
Sep 11 22:39:45 matrix node[6394]: INFO 22:39:45:733 [Bridge] Starting up
Sep 11 22:39:45 matrix node[6394]: INFO 22:39:45:733 [UserTokenStore] Loading token key file passkey.pem
Sep 11 22:39:45 matrix node[6394]: INFO 22:39:45:737 [Bridge] Ensuring homeserver can be reached...
Sep 11 22:39:45 matrix node[6394]: INFO 22:39:45:795 [BotUsersManager] Ensuring bot users are set up...
Sep 11 22:39:45 matrix node[6394]: INFO 22:39:45:815 [BotUsersManager] Getting joined rooms...
Sep 11 22:39:45 matrix node[6394]: INFO 22:39:45:821 [Config] Prefilling room membership for permissions for 0 rooms
Sep 11 22:39:49 matrix node[6394]: ERROR 22:39:49:375 [ConnectionManager] User @_hookshot:opensuse.org is disallowed to manage state for generic in !NXyYLakNWAIiSBMuWD:opensuse.org
Sep 11 22:39:51 matrix node[6394]: INFO 22:39:51:918 [Bridge] All connections loaded
Sep 11 22:39:51 matrix node[6394]: INFO 22:39:51:921 [Bridge] Bridge is now ready. Found 4 connections
Sep 11 22:41:39 matrix node[6394]: INFO 22:41:39:380 [Appservice] Processing transaction 4645132
Sep 11 22:41:39 matrix node[6394]: INFO 22:41:39:382 [Appservice] Processing event of type m.room.member
Sep 11 22:41:39 matrix node[6394]: INFO 22:41:39:387 [Appservice] 127.0.0.1 - - [11/Sep/2023:22:41:39 +0000] "PUT /_matrix/app/v1/transactions/4645132?access_token=%3Credacted%3E HTTP/1.1" 200 2 "-" "Synapse/1.90.0"
Sep 11 22:41:39 matrix node[6394]: INFO 22:41:39:412 [Appservice] Processing transaction 4645138
Sep 11 22:41:39 matrix node[6394]: INFO 22:41:39:412 [Appservice] Processing event of type m.room.member
Sep 11 22:41:39 matrix node[6394]: INFO 22:41:39:413 [Appservice] 127.0.0.1 - - [11/Sep/2023:22:41:39 +0000] "PUT /_matrix/app/v1/transactions/4645138?access_token=%3Credacted%3E HTTP/1.1" 200 2 "-" "Synapse/1.90.0"
Sep 11 22:41:39 matrix node[6394]: INFO 22:41:39:441 [Appservice] Processing transaction 4645190
Sep 11 22:41:39 matrix node[6394]: INFO 22:41:39:441 [Appservice] Processing event of type m.room.member
Sep 11 22:41:39 matrix node[6394]: INFO 22:41:39:442 [Appservice] 127.0.0.1 - - [11/Sep/2023:22:41:39 +0000] "PUT /_matrix/app/v1/transactions/4645190?access_token=%3Credacted%3E HTTP/1.1" 200 2 "-" "Synapse/1.90.0"

About whether we need to report something to upstream. I'm just not sure.

Actions #11

Updated by luc14n0 12 months ago

  • % Done changed from 0 to 50
Actions #12

Updated by luc14n0 12 months ago

  • Status changed from New to Resolved
  • % Done changed from 50 to 100

The Hookshot bridge seems to be fixed. Of course at this moment other issues started to crop up and make the homeserver unstable.

And about the CPU spikes I mention in the title of the ticket, I realized that the spikes I saw at the time I was taking a look at the Hookshot bridge were normal after all, and within the range of what we could call "normal" average load (~50%) for a 6-core system.

So I'd say we can close this ticket.

Actions

Also available in: Atom PDF