tickets #123472

mailman3 - nginx oom killed ?

Added by fkrueger 2 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Mailing lists
Target version: -
Start date: 2023-01-21
Due date:
% Done: 0%
Estimated time:

Description

The above-mentioned websites have been unavailable for several hours, showing the error message "We are very sorry, but the requested service is currently not available." There is no mention of this at https://status.opensuse.org/.

Regards,
Frank


Related issues

Has duplicate: openSUSE admin - tickets #123631: lists.o.o is down (Resolved, 2023-01-25)

History

#1 Updated by pjessen 2 months ago

  • Tracker changed from communication to tickets
  • Category set to Mailing lists
  • Private changed from Yes to No

Indeed, it looks like nginx stopped yesterday at 10:45 UTC.

Jan 20 10:24:02 mailman3 systemd[1]: Reloading The nginx HTTP and reverse proxy server...
Jan 20 10:24:02 mailman3 systemd[1]: Reloaded The nginx HTTP and reverse proxy server.
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Main process exited, code=killed, status=9/KILL
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Killing process 24437 (nginx) with signal SIGKILL.
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Killing process 13820 (nginx) with signal SIGKILL.
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Killing process 13873 (nginx) with signal SIGKILL.
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Killing process 13897 (nginx) with signal SIGKILL.
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Killing process 13919 (nginx) with signal SIGKILL.
Jan 21 10:45:14 mailman3 systemd[1]: nginx.service: Failed with result 'signal'.
Jan 21 10:45:14 mailman3 systemd[1]: nginx.service: Consumed 20h 13min 18.620s CPU time.

I have restarted nginx.

#2 Updated by fkrueger 2 months ago

Seems to work again. Thx. Feel free to close it.

#3 Updated by pjessen 2 months ago

At first, dmesg did not show nginx being killed by the OOM killer, but /var/log/messages does:

2023-01-21T10:45:13.488872+00:00 mailman3 kernel: [6224189.860720][ T9631] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice,task=nginx,pid=32317,uid=0
2023-01-21T10:45:13.488873+00:00 mailman3 kernel: [6224189.860757][ T9631] Out of memory: Killed process 32317 (nginx) total-vm:1384308kB, anon-rss:1346128kB, file-rss:68kB, shmem-rss:4kB, UID:0 pgtables:2748kB oom_score_adj:0
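For anyone who wants to pull the figures out of such a kernel message, here is a minimal sketch in Python. It is not part of the admin tooling on mailman3: the names `OOM_RE` and `parse_oom_kill` are made up for illustration, and the regular expression only assumes the "Out of memory: Killed process ..." layout seen in the log line above.

```python
import re

# Kernel message copied from the ticket, joined onto a single line.
OOM_LINE = (
    "2023-01-21T10:45:13.488873+00:00 mailman3 kernel: "
    "[6224189.860757][ T9631] Out of memory: Killed process 32317 (nginx) "
    "total-vm:1384308kB, anon-rss:1346128kB, file-rss:68kB, shmem-rss:4kB, "
    "UID:0 pgtables:2748kB oom_score_adj:0"
)

# Hypothetical pattern: matches only the fields used below.
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r".*?total-vm:(?P<total_vm>\d+)kB"
    r".*?anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, command name and memory figures (kB), or None if no match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "comm": match.group("comm"),
        "total_vm_kb": int(match.group("total_vm")),
        "anon_rss_kb": int(match.group("anon_rss")),
    }


info = parse_oom_kill(OOM_LINE)
# Convert kB to GiB to make the size easier to read.
anon_rss_gib = info["anon_rss_kb"] / (1024 * 1024)
print(f"{info['comm']} (pid {info['pid']}): {anon_rss_gib:.2f} GiB anonymous RSS")
```

Running the same function over all of /var/log/messages would show which processes the OOM killer picked over time, which may help tell whether nginx is always the one growing.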

#4 Updated by pjessen 2 months ago

  • Subject changed from Service down at https://lists.opensuse.org/archives/ to mailman3 - nginx oom killed ?

pjessen wrote:

2023-01-21T10:45:13.488873+00:00 mailman3 kernel: [6224189.860757][ T9631] Out of memory: Killed process 32317 (nginx) total-vm:1384308kB, anon-rss:1346128kB

I guess nginx was in fact gobbling up most of the memory on mailman3. That sounds very unusual. Even when reloading the config (with the big rewrite maps), it should never get that high.

#5 Updated by fkrueger about 2 months ago

FYI: https://lists.opensuse.org/ is down again.

#6 Updated by pjessen about 2 months ago

Yes, I've been trying to restart nginx all day. It seems to be running now.

#7 Updated by pjessen about 2 months ago

#8 Updated by pjessen about 2 months ago

Cop-out: Because nginx seems to have become the preferred victim of the OOM killer, I have added an automatic nginx restart.

# /etc/systemd/system/nginx.service.d/restart.conf
[Service]
RestartSec=600s
Restart=on-failure
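As a rough illustration of why this drop-in catches the OOM case: with Restart=on-failure, systemd restarts a service that exits with a non-zero code or is terminated by a signal other than SIGHUP, SIGINT, SIGTERM or SIGPIPE. The sketch below models only that part of the rule. The function name `restarts_on_failure` is hypothetical, the signal numbers are the usual Linux values, and timeouts, watchdog expiry and settings such as SuccessExitStatus= or RestartPreventExitStatus= are deliberately left out.

```python
# Signal numbers as commonly defined on Linux (assumption for this sketch).
SIGHUP, SIGINT, SIGKILL, SIGPIPE, SIGTERM = 1, 2, 9, 13, 15

# Signals that systemd treats as a clean termination by default.
CLEAN_SIGNALS = {SIGHUP, SIGINT, SIGTERM, SIGPIPE}


def restarts_on_failure(exit_code=None, term_signal=None):
    """Simplified model of Restart=on-failure.

    Pass term_signal if the main process was killed by a signal,
    otherwise pass its exit_code.
    """
    if term_signal is not None:
        # Any signal outside the clean set counts as a failure.
        return term_signal not in CLEAN_SIGNALS
    # A missing or zero exit code is a clean exit; anything else is a failure.
    return exit_code not in (None, 0)


# The OOM killer sends SIGKILL, matching "status=9/KILL" in the journal.
print(restarts_on_failure(term_signal=SIGKILL))
# A normal "systemctl stop" ends the process with SIGTERM.
print(restarts_on_failure(term_signal=SIGTERM))
```

With RestartSec=600s, the restart would then be attempted ten minutes after the kill, so an outage of at least that length is still expected each time.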

#9 Updated by fkrueger about 2 months ago

pjessen wrote:

Cop-out: Because nginx seems to have become the preferred victim of the OOM killer, I have added an automatic nginx restart.

# /etc/systemd/system/nginx.service.d/restart.conf
[Service]
RestartSec=600s
Restart=on-failure

Unfortunately, https://lists.opensuse.org/ has been down again for quite some time now. By the way, why doesn't this issue show up at https://status.opensuse.org/?

#10 Updated by pjessen about 2 months ago

fkrueger wrote:

Unfortunately, https://lists.opensuse.org/ has been down again for quite some time now.

Yes, it looks like the automatic restart of nginx is working, but when the rest of the machine is misbehaving ...

By the way, why doesn't this issue show up at https://status.opensuse.org/?

Updating https://status.opensuse.org/ is not automatic; it is a manual operation.

#11 Updated by pjessen about 2 months ago

pjessen wrote:

fkrueger wrote:

Unfortunately, https://lists.opensuse.org/ has been down again for quite some time now.

Yes, it looks like the automatic restart of nginx is working, but when the rest of the machine is misbehaving ...

postfix also got OOM-killed last night, around 22:20.
