tickets #123472: mailman3 - nginx oom killed ? - openSUSE admin - openSUSE Project Management Tool

tickets #123472

closed

mailman3 - nginx oom killed ?

Added by fkrueger almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: pjessen
Category: Mailing lists
Target version:
Start date: 2023-01-21
Due date:
% Done: 100%
Estimated time:
Description

The above-mentioned websites have been unavailable for several hours, with the error message "We are very sorry, but the requested service is currently not available." There is no notice about this at https://status.opensuse.org/.

Regards,
Frank

Related issues 1 (0 open — 1 closed)

Has duplicate: openSUSE admin - tickets #123631: lists.o.o is down (Resolved, pjessen, 2023-01-25)


History

Updated by pjessen almost 2 years ago

Tracker changed from communication to tickets
Category set to Mailing lists
Private changed from Yes to No

Indeed, it looks like nginx stopped yesterday at 10:45 UTC.

Jan 20 10:24:02 mailman3 systemd[1]: Reloading The nginx HTTP and reverse proxy server...
Jan 20 10:24:02 mailman3 systemd[1]: Reloaded The nginx HTTP and reverse proxy server.
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Main process exited, code=killed, status=9/KILL
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Killing process 24437 (nginx) with signal SIGKILL.
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Killing process 13820 (nginx) with signal SIGKILL.
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Killing process 13873 (nginx) with signal SIGKILL.
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Killing process 13897 (nginx) with signal SIGKILL.
Jan 21 10:45:13 mailman3 systemd[1]: nginx.service: Killing process 13919 (nginx) with signal SIGKILL.
Jan 21 10:45:14 mailman3 systemd[1]: nginx.service: Failed with result 'signal'.
Jan 21 10:45:14 mailman3 systemd[1]: nginx.service: Consumed 20h 13min 18.620s CPU time.

I have restarted nginx.

Updated by fkrueger almost 2 years ago

pjessen wrote:

Indeed, it looks like nginx stopped yesterday at 10:45 UTC.

I have restarted nginx.

It seems to work again. Thanks. Feel free to close the ticket.

Updated by pjessen almost 2 years ago

At first, dmesg did not show nginx being killed by the OOM killer, but /var/log/messages does:

2023-01-21T10:45:13.488872+00:00 mailman3 kernel: [6224189.860720][ T9631] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice,task=nginx,pid=32317,uid=0
2023-01-21T10:45:13.488873+00:00 mailman3 kernel: [6224189.860757][ T9631] Out of memory: Killed process 32317 (nginx) total-vm:1384308kB, anon-rss:1346128kB, file-rss:68kB, shmem-rss:4kB, UID:0 pgtables:2748kB oom_score_adj:0
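
For triage, the "Out of memory: Killed process" line above can be parsed mechanically. The following is a minimal sketch, assuming the message format shown in this ticket (the wording differs between kernel versions). `parse_oom_kill` is a hypothetical helper name, not part of any tool mentioned here.

```python
import re

# Matches the "Out of memory: Killed process ..." kernel line as it appears
# in this ticket. The format is an assumption; other kernels word it differently.
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, command and memory figures from an OOM-kill line, or None."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "comm": m.group("comm"),
        "total_vm_kb": int(m.group("total_vm_kb")),
        "anon_rss_kb": int(m.group("anon_rss_kb")),
    }


line = (
    "2023-01-21T10:45:13.488873+00:00 mailman3 kernel: [6224189.860757][ T9631] "
    "Out of memory: Killed process 32317 (nginx) total-vm:1384308kB, "
    "anon-rss:1346128kB, file-rss:68kB, shmem-rss:4kB, UID:0 pgtables:2748kB "
    "oom_score_adj:0"
)
info = parse_oom_kill(line)
# The kernel's kB figures are in units of 1024 bytes, so divide twice for GiB.
anon_rss_gib = round(info["anon_rss_kb"] / 1024 / 1024, 2)
print(info["comm"], info["pid"], anon_rss_gib, "GiB")
```

For the line quoted above this reports roughly 1.28 GiB of anonymous memory held by the killed nginx process.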

Updated by pjessen almost 2 years ago

Subject changed from Service down at https://lists.opensuse.org/archives/ to mailman3 - nginx oom killed ?

pjessen wrote:

2023-01-21T10:45:13.488873+00:00 mailman3 kernel: [6224189.860757][ T9631] Out of memory: Killed process 32317 (nginx) total-vm:1384308kB, anon-rss:1346128kB

I guess nginx was in fact gobbling up most of the memory on mailman3. That sounds very unusual. Even when reloading the config (with the big rewrite maps), it should never get that high.

Updated by fkrueger almost 2 years ago

FYI: https://lists.opensuse.org/ is down again.

Updated by pjessen almost 2 years ago

Yes, I've been trying to restart nginx all day. It seems to be running now.

Updated by pjessen almost 2 years ago

Has duplicate tickets #123631: lists.o.o is down added

Updated by pjessen almost 2 years ago

Cop-out: Because nginx seems to have become the preferred victim, I have added automatic nginx restart.

# /etc/systemd/system/nginx.service.d/restart.conf
[Service]
RestartSec=600s
Restart=on-failure
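
As to why this drop-in catches the OOM kill: with Restart=on-failure, systemd restarts a service whose main process exits with a non-zero code or is terminated by a signal other than SIGHUP, SIGINT, SIGTERM or SIGPIPE. The journal earlier in this ticket shows code=killed, status=9/KILL, which qualifies. Below is a rough model of that rule. It is a simplification that ignores timeouts, the watchdog, core dumps and per-unit overrides such as SuccessExitStatus=, and the function name is hypothetical.

```python
import signal

# Signals systemd treats as a clean termination by default.
CLEAN_SIGNALS = {signal.SIGHUP, signal.SIGINT, signal.SIGTERM, signal.SIGPIPE}


def restarts_on_failure(code, status):
    """Model Restart=on-failure for a main process exit.

    code   -- "exited" or "killed", as printed in the journal
    status -- exit code (for "exited") or signal number (for "killed")
    """
    if code == "exited":
        return status != 0
    if code == "killed":
        return signal.Signals(status) not in CLEAN_SIGNALS
    raise ValueError(f"unexpected code: {code!r}")


print(restarts_on_failure("killed", 9))   # SIGKILL, as sent by the OOM killer
print(restarts_on_failure("killed", 15))  # SIGTERM, e.g. a normal stop
print(restarts_on_failure("exited", 0))   # clean exit
```

With RestartSec=600s, the restart is attempted ten minutes after the failure.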

Updated by fkrueger almost 2 years ago

pjessen wrote:

Cop-out: Because nginx seems to have become the preferred victim, I have added automatic nginx restart.

Unfortunately, https://lists.opensuse.org/ has been down again for quite some time now. By the way, why doesn't this issue show up at https://status.opensuse.org/?

Updated by pjessen almost 2 years ago

fkrueger wrote:

Unfortunately, https://lists.opensuse.org/ has been down again for quite some time now.

Yes, it looks like the automatic restart of nginx is working, but when the rest of the machine is misbehaving ...

By the way, why doesn't this issue show up at https://status.opensuse.org/?

Updating https://status.opensuse.org/ is not automatic; it is a manual operation.

Updated by pjessen almost 2 years ago

pjessen wrote:

fkrueger wrote:

Unfortunately, https://lists.opensuse.org/ has been down again for quite some time now.

Yes, it looks like the automatic restart of nginx is working, but when the rest of the machine is misbehaving ...

postfix was also OOM-killed last night, around 22:20.

Updated by pjessen over 1 year ago

Status changed from New to Resolved
Assignee set to pjessen
% Done changed from 0 to 100

The memory issue on mailman3 has been resolved:

- it was given a lot more memory and room for a swapfile (didn't help a lot)
- gunicorn workers were told to restart regularly #102203 (helped a lot)
- the nginx rewrite map was moved to a proxy daemon #101842 (helped, but not a lot)
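
This ticket does not say how the gunicorn workers were told to restart; the details are in #102203. One common way to get this behaviour is gunicorn's `max_requests` and `max_requests_jitter` settings, sketched here as a `gunicorn.conf.py` fragment. The numbers are illustrative assumptions, not the values used on mailman3.

```python
# gunicorn.conf.py (illustrative; not the actual mailman3 configuration)

# Restart a worker after it has handled this many requests, so that memory
# accumulated by a long-lived worker is returned to the system.
max_requests = 1000

# Add a random amount between 0 and this value to each worker's limit, so
# that the workers do not all restart at the same moment.
max_requests_jitter = 100
```

The jitter matters on a small machine: without it, all workers tend to reach the limit together and restart at once.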
