tickets #93686

closed

Postgres currently down

Added by hellcp almost 3 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Urgent
Category:
Core services and virtual infrastructure
Target version:
-
Start date:
2021-06-09
Due date:
% Done:
100%
Estimated time:

Description

Seems like mirrordb1 went down this morning and hasn't come back up. A simple restart apparently doesn't work.
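
For reference, the kind of restart-and-check sequence meant here would look roughly like this (the unit name postgresql.service is an assumption for this host):

systemctl restart postgresql.service                     # assumed unit name
systemctl status postgresql.service
journalctl -u postgresql.service --since "1 hour ago"    # why did it fail to come up?
grep -i 'segfault\|PANIC' /var/log/messages | tail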

Actions #1

Updated by hellcp almost 3 years ago

  • Private changed from Yes to No
Actions #2

Updated by pjessen almost 3 years ago

I tried restarting it twice; both attempts resulted in a segfault.

The same thing repeatedly:

# grep segfault /var/log/messages
2021-06-05T10:56:53.856510+00:00 mirrordb1 kernel: [1344517.476156] postgres[21877]: segfault at 5626bf85dffa ip 00007ff966d76147 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:03:59.645338+00:00 mirrordb1 kernel: [1348543.361004] postgres[17228]: segfault at 5626bf85dffa ip 00007ff966d76133 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:07:44.854244+00:00 mirrordb1 kernel: [1348768.627274] postgres[18005]: segfault at 5626bf85dffa ip 00007ff966d76151 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:12:10.410150+00:00 mirrordb1 kernel: [1349034.190949] postgres[19194]: segfault at 5626bf85dffa ip 00007ff966d76133 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:18:45.116673+00:00 mirrordb1 kernel: [1349428.912847] postgres[20605]: segfault at 5626bf85dffa ip 00007ff966d76151 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:26:47.459836+00:00 mirrordb1 kernel: [1349911.271712] postgres[22555]: segfault at 5626bf85dffa ip 00007ff966d7612e sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:33:10.156571+00:00 mirrordb1 kernel: [1350293.982984] postgres[24549]: segfault at 5626bf85dffa ip 00007ff966d7614c sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-08T15:26:17.358152+00:00 mirrordb1 kernel: [1619890.699769] postgres[4776]: segfault at 5626bf85dffa ip 00007ff966d76147 sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-08T18:28:07.089464+00:00 mirrordb1 kernel: [1630800.817691] postgres[28928]: segfault at 5626bf85dffa ip 00007ff966d76151 sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-08T23:04:58.963109+00:00 mirrordb1 kernel: [1647413.271332] postgres[30652]: segfault at 5626bf85dffa ip 00007ff966d76138 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T03:25:10.534559+00:00 mirrordb1 kernel: [1663025.397677] postgres[28557]: segfault at 5626bf85dffa ip 00007ff966d76147 sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T04:58:06.275839+00:00 mirrordb1 kernel: [1668601.338594] postgres[29635]: segfault at 5626bf85dffa ip 00007ff966d7612e sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T04:59:07.199999+00:00 mirrordb1 kernel: [1668662.268971] postgres[999]: segfault at 5626bfbeabf2 ip 00007ff966d73700 sp 00007ffd90592918 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T07:51:34.897584+00:00 mirrordb1 kernel: [1679010.333734] postgres[24774]: segfault at 5618704bfc92 ip 00007ff512413700 sp 00007ffeae458f08 error 4 in libc-2.31.so[7ff5122d0000+1cb000]
2021-06-09T07:59:14.420472+00:00 mirrordb1 kernel: [1679469.872615] postgres[29702]: segfault at 5576df7d1c92 ip 00007f444ba53700 sp 00007fff8a7ecce8 error 4 in libc-2.31.so[7f444b910000+1cb000]
2021-06-09T08:05:50.521976+00:00 mirrordb1 kernel: [1679865.986628] postgres[1349]: segfault at 55b0a9cb1c82 ip 00007f2aacf9b700 sp 00007ffd2bed75f8 error 4 in libc-2.31.so[7f2aace58000+1cb000]
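
For what it's worth, the ip in those kernel lines can be turned into a libc symbol by subtracting the mapping base; a sketch using the first entry above (library path assumed for this host, needs binutils and the matching glibc debuginfo installed):

# ip minus the mapping base gives the offset inside libc-2.31.so
printf 'offset: 0x%x\n' $(( 0x7ff966d76147 - 0x7ff966c30000 ))   # -> 0x146147
addr2line -f -C -e /lib64/libc-2.31.so 0x146147
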
Actions #3

Updated by pjessen almost 3 years ago

I'm no wizard at debugging postgres, but at the first segfault this morning:

2021-06-09 03:25:10.552 UTC [1159]: [302-1] db=,user= LOG:  server process (PID 28557) was terminated by signal 11: Segmentation fault
2021-06-09 03:25:10.552 UTC [1159]: [303-1] db=,user= DETAIL:  Failed process was running: SELECT COUNT(mirr_del_byid(169, id) order by id) FROM temp1
2021-06-09 03:25:10.552 UTC [1159]: [304-1] db=,user= LOG:  terminating any other active server processes
Actions #4

Updated by mstrigl almost 3 years ago

We are currently working on it.

Actions #5

Updated by mstrigl almost 3 years ago

postgresql on mirrordb1 is up again.
We used a snapshot from yesterday.

I enabled core dump writing on mirrordb1 so that we get a core dump the next time this happens.
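
For context, a minimal sketch of what enabling core dumps can look like on such a host (unit name, drop-in path and core location are assumptions, not necessarily what was done here):

# Lift the core size limit for the PostgreSQL service via a systemd drop-in.
mkdir -p /etc/systemd/system/postgresql.service.d
cat > /etc/systemd/system/postgresql.service.d/coredump.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
systemctl daemon-reload
systemctl restart postgresql.service
# Write cores to a predictable location with enough free space.
sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p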

Actions #6

Updated by hellcp almost 3 years ago

Aaaaand it's down again

Actions #7

Updated by pjessen almost 3 years ago

Looks similar to last time:

2021-06-11 18:46:12.227 UTC [24774]: [27-1] db=,user= LOG:  server process (PID 10410) was terminated by signal 11: Segmentation fault
2021-06-11 18:46:12.227 UTC [24774]: [28-1] db=,user= DETAIL:  Failed process was running: SELECT COUNT(mirr_del_byid(583, id) order by id) FROM temp1
2021-06-11 18:46:12.227 UTC [24774]: [29-1] db=,user= LOG:  terminating any other active server processes
Actions #8

Updated by andriinikitin almost 3 years ago

Bernhard made a backup of the datadir today, and I decided to try resetting the write-ahead log to see if it helps (it loses some of the last transactions, but I don't think we have a better option).
After postgres@mirrordb1:~> pg_resetwal -f /var/lib/pgsql/data (run as the postgres user), the service was able to start.
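
Roughly, the full sequence behind that one-liner would be (unit name and backup path are assumptions; the backup itself was Bernhard's copy of the datadir):

systemctl stop postgresql.service                      # pg_resetwal refuses to touch a running cluster
cp -a /var/lib/pgsql/data /var/lib/pgsql/data.bak      # keep a copy before discarding WAL state
sudo -u postgres pg_resetwal -f /var/lib/pgsql/data    # force the reset even if pg_control looks bad
systemctl start postgresql.service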

Actions #9

Updated by cboltz almost 3 years ago

For the record: andriinikitin migrated the databases to mirrordb2 and changed pgbouncer on anna/elsa so that it now uses mirrordb2. So for now, everything works again.
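
In pgbouncer terms the switch is just repointing the backend host in the [databases] section; a sketch of that kind of change (config path and admin-console port are assumptions):

# Swap the backend host in the connection strings, then tell pgbouncer to reload.
sed -i 's/host=mirrordb1/host=mirrordb2/g' /etc/pgbouncer/pgbouncer.ini
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'RELOAD;'   # admin console; existing client connections stay up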

The reasons for the segfaults and for duplicate rows being allowed into unique indexes are still unclear.
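
A hedged example of how such duplicates can be surfaced, using the filearr(path) column mentioned later in this ticket (database name is an assumption):

# Rows that a unique index on filearr(path) should have rejected:
sudo -u postgres psql mirrorbrain -c \
  "SELECT path, count(*) FROM filearr GROUP BY path HAVING count(*) > 1 LIMIT 20;"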

Actions #10

Updated by KaratekHD almost 3 years ago

It seems to be down again; at least Matrix, which was affected by the last downtime too, is down.

Actions #11

Updated by pjessen almost 3 years ago

KaratekHD wrote:

It seems to be down again; at least Matrix, which was affected by the last downtime too, is down.

Confirmed.

Actions #12

Updated by bmwiedemann almost 3 years ago

The current plan is to downgrade to postgresql12 on mirrordb2, re-import the SQL dumps, and ensure that the unique indexes on mirrorbrain's filearr(path) entries exist and actually work this time.
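
As a sketch of that plan (dump location, database and index names are assumptions):

# Dump from the old cluster, replay into the fresh postgresql12 cluster.
sudo -u postgres pg_dumpall -f /var/tmp/mirrordb.sql
sudo -u postgres psql -f /var/tmp/mirrordb.sql postgres
# Recreate the unique index so PostgreSQL itself verifies no duplicates survived the import.
sudo -u postgres psql mirrorbrain -c \
  "CREATE UNIQUE INDEX filearr_path_uniq ON filearr (path);"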

Meanwhile, I edited the download.o.o DNS to have 2 A and 2 AAAA records, shifting 50% of the load to mirrorcache to avoid overloading either of them.
Without the DB, download.o.o would point every user to its downloadcontent.o.o alias and cause packet loss, and mirrorcache is not yet optimized enough to handle 300 requests per second.
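
A quick way to confirm the split is live (download.o.o expands to download.opensuse.org):

dig +short A download.opensuse.org      # should return 2 addresses while the split is active
dig +short AAAA download.opensuse.org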

Actions #13

Updated by andriinikitin almost 3 years ago

mirrorcache.o.o has run out of disk space and has some other issues, so I reverted the DNS change; download.o.o redirects to mirrors properly at the moment.

Actions #14

Updated by bmwiedemann almost 3 years ago

  • % Done changed from 0 to 10

Filed https://bugzilla.opensuse.org/show_bug.cgi?id=1187392 for our postgresql13 segfault

Actions #15

Updated by lrupp over 2 years ago

  • Status changed from New to Closed
  • % Done changed from 10 to 100

The issue has meanwhile been solved by restoring the latest backup onto a clean postgresql12 installation. Closing here.
For further reference, please check https://bugzilla.opensuse.org/show_bug.cgi?id=1187392
