tickets #134786: lnt.opensuse.org down (closed)

Added by mjambor@suse.cz 8 months ago. Updated 5 months ago.

Status: Closed
Priority: Normal
Assignee: crameleon
Category: Servers hosted in NBG
Target version: -
Start date: 2023-08-29
Due date:
% Done: 0%
Estimated time:

Description

Hello admin@opensuse.org,

lnt.opensuse.org is not responding to http requests. From within the
opensuse network (specifically from gcc.infra.opensuse.org) the machine
lnt.infra.opensuse.org responds to ping but does not accept ssh
connections.

Can you please help us to restore the service?

Thanks,

Martin Jambor

Actions #1

Updated by cboltz 8 months ago

  • Assignee set to crameleon
  • Private changed from Yes to No

Looks like the load heavily increased today around 9:53 UTC (according to the monitoring, it jumped to a load of 23 - and since then, the monitoring couldn't get newer data, so the actual load is probably even higher).

Georg, IMHO the VM needs a forced reboot... (unless you manage to login somehow and can find out what happened)

Actions #2

Updated by crameleon 8 months ago

  • Category set to Servers hosted in NBG
  • Status changed from New to In Progress
Actions #3

Updated by crameleon 8 months ago

I can reach a serial console, but after entering root at the login prompt it takes over a minute for the passphrase prompt to appear, at which point the login is cancelled with "timed out after 60 seconds". Will reboot it.

Actions #4

Updated by crameleon 8 months ago

  • Status changed from In Progress to Resolved

Shutdown took a long time; many services had to run into their timeouts before stopping.

The boot on the other hand went quickly, and the machine is online again:

$ ssh lnt.infra.opensuse.org
Last login: Wed Jun  7 20:33:20 2023 from 192.168.252.162
Have a lot of fun...
crameleon@lnt:/home/crameleon> systemctl --failed
  UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed.

I'll leave further investigation up to you.

Actions #5

Updated by jamborm 8 months ago

  • Status changed from Resolved to In Progress

Thank you very much. The machine was accessible for a while after you rebooted it, but before I had a chance to look at it, it stumbled again, then probably recovered during the night, and now it is inaccessible again.

These issues started two or three days ago; I am not aware of any changes which might have caused them.

From the lnt configuration file I can see that it uses an underlying database server, postgresql.infra.opensuse.org. I assume it also has not undergone any important changes recently?

More generally, although the service has only been maintained in a limited way since Martin Liška left SUSE, it seems that Petr Hodač has agreed to step in and become a proper maintainer. He should be in the process of requesting openSUSE infra VPN access soon.

Meanwhile, can you perhaps reboot the machine again, and be ready to do so once more in a more coordinated fashion when I and (especially) Petr are ready to jump in and see what is going on? Thanks a lot, and sorry for missing the earlier chance; the transition unfortunately takes time.

Actions #6

Updated by crameleon 8 months ago

Hey @jamborm,

The last change I know of was me installing updates on the proxy and the Postgres server and upgrading the PostgreSQL version, but that already happened over a month ago; I don't think anything noteworthy happened in the last few days.

I'm glad to hear about a new maintainer, we're happy to support him with getting access when requested.

As for checking the machine together, I suggest you ping me in #opensuse-admin on ircs://libera.chat when you have time. My nickname there is acidsys. Then I can reboot it at a good time and you get quick feedback when it's ready for you to jump in. ;-)

Actions #7

Updated by jamborm 8 months ago

Thanks a lot. The issue has happened again, but this time I have an established, somewhat responsive root ssh connection to the machine.

The problem is clearly that the machine is completely overwhelmed with processes running /usr/sbin/httpd-prefork.

I am totally inexperienced when it comes to debugging these kinds of issues. Could this be some common problem? Is it possible to easily determine whether there are simply that many incoming requests nowadays, or whether the machine somehow spins out of control and spawns the processes itself?
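
A rough way to check, assuming the default openSUSE process name and log path on this machine, is to count the worker processes and tally which user agents are hitting the site:

# count the running prefork worker processes
$ ps -C httpd-prefork --no-headers | wc -l

# most frequent user agents in the access log
$ awk -F'"' '{print $6}' /var/log/apache2/access_log | sort | uniq -c | sort -rn | head

If the vhost sits behind the proxy, the first field of each log line will be the proxy address, so tallying user agents (or an X-Forwarded-For header, if one is logged) says more than counting source IPs.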

Actions #8

Updated by crameleon 8 months ago

Running the Apache httpd in prefork mode is very prone to denial-of-service attacks that target the spawning of many webserver processes. Whether this is a malicious attack or simply higher legitimate demand for the service, I highly recommend switching to the event MPM either way. This can easily be done by installing apache2-event and setting APACHE_MPM to event in /etc/sysconfig/apache2 (alternatively, uninstalling apache2-prefork should make it detect the new MPM automatically as well). Afterwards, restart apache2.
You can compare the systemctl status apache2 output before and after to notice the change in process name.
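
A minimal sketch of that switch on openSUSE, assuming the stock apache2 sysconfig layout (the sed pattern presumes an existing APACHE_MPM= line):

# install the event MPM and select it
$ sudo zypper install apache2-event
$ sudo sed -i 's/^APACHE_MPM=.*/APACHE_MPM="event"/' /etc/sysconfig/apache2
$ sudo systemctl restart apache2

# the process name in the status output should now show httpd-event instead of httpd-prefork
$ systemctl status apache2 | grep -i httpd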

If you want to investigate whether it's a malicious attack or not, you can enable a status endpoint to check for abnormally high requests from individual sources:

https://httpd.apache.org/docs/2.4/mod/mod_status.html

It might be desirable to only expose the status endpoint on the local network.
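
A minimal sketch of such a status endpoint restricted to local access, following the usual openSUSE config layout (the file name is just an example):

# enable mod_status (adds it to APACHE_MODULES in /etc/sysconfig/apache2)
$ sudo a2enmod status

# /etc/apache2/conf.d/status.conf
<Location "/server-status">
    SetHandler server-status
    Require local
</Location>

$ sudo systemctl reload apache2
$ curl http://localhost/server-status?auto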

Edit: I should mention that switching the MPM is only easy if you do not use any modules that are only compatible with the prefork mode. A common example is mod_php for executing PHP code. If this is running a PHP application, it would also mean switching to PHP-FPM; in that case I can provide more instructions. :-)

Actions #9

Updated by pjessen_invalid 8 months ago

jamborm wrote in #note-7:

Thanks a lot. The issue has happened again, but this time I have an established, somewhat responsive root ssh connection to the machine.

The problem is clearly that the machine is completely overwhelmed with processes running /usr/sbin/httpd-prefork.

I am totally inexperienced when it comes to debugging these kinds of issues. Could this be some common problem? Is it possible to easily determine whether there are simply that many incoming requests nowadays, or whether the machine somehow spins out of control and spawns the processes itself?

Too many processes could be an indication of requests taking too long or never finishing. Of course, it could be a DDoS attack, or it could be that the machine has insufficient resources.
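
One way to tell slow or hanging requests apart from merely numerous ones is to log how long each request takes; as a sketch (not the current config on lnt), mod_log_config's %D field records the service time in microseconds:

# additional log format with the request duration appended
LogFormat "%h %l %u %t \"%r\" %>s %b %Dus \"%{User-Agent}i\"" timed
CustomLog /var/log/apache2/access_timed_log timed

Sorting that log by the duration column shows whether a few slow endpoints are tying up the prefork workers.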

Actions #10

Updated by crameleon 8 months ago

  • Status changed from In Progress to Feedback
Actions #11

Updated by jamborm 8 months ago

We have turned down a few of the Apache2 knobs so that the machine would not get so overloaded.
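
The ticket does not say which knobs were changed; on openSUSE the prefork limits usually live in /etc/apache2/server-tuning.conf and look roughly like this (values here are purely illustrative):

# /etc/apache2/server-tuning.conf, prefork section
<IfModule prefork.c>
    StartServers             5
    MinSpareServers          5
    MaxSpareServers         10
    ServerLimit             50
    MaxRequestWorkers       50
    MaxConnectionsPerChild  2000
</IfModule>

Lowering ServerLimit and MaxRequestWorkers caps the number of httpd-prefork processes, so the machine stays reachable even when crawlers pile on.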

Looking into /var/log/apache2/access_log, it seems that quite a few bots regularly download what is probably the entire site, which could effectively DoS it. We see many accesses like:

192.168.47.102 - - [06/Sep/2023:12:46:39 +0000] "GET /db_default/v4/CPP/6268/graph?test.758=8 HTTP/1.1" 302 333 "-" "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)"

or

192.168.47.102 - - [06/Sep/2023:12:46:32 +0000] "GET /db_default/v4/SPEC/15770?compare_to=15685 HTTP/1.1" 200 551910 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"

or

192.168.47.102 - - [06/Sep/2023:12:46:34 +0000] "GET /db_default/v4/CPP/5691?compare_to=5067 HTTP/1.1" 200 262162 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"

Actions #12

Updated by cboltz 8 months ago

May I recommend creating a robots.txt? ;-)

You can block the most annoying bots, and/or deny paths that are not too useful for search engine users.
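
A minimal robots.txt for the crawlers seen in #note-11 might look like this (the path pattern is only an example of shielding the expensive comparison pages):

# robots.txt served from the site root
User-agent: Bytespider
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow: /db_default/v4/

Note that some crawlers (Bytespider is often reported among them) ignore robots.txt, so a webserver-level deny on their user agent may be needed as well.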

Actions #13

Updated by crameleon 5 months ago

  • Status changed from Feedback to Closed

No feedback, closing.
