Ticket #93686
Postgres currently down
% Done: 100%
Description
Seems like mirrordb1 went down this morning and hasn't come back up. A simple restart apparently doesn't work.
History
#1
Updated by hellcp almost 2 years ago
- Private changed from Yes to No
#2
Updated by pjessen almost 2 years ago
I tried restarting it twice; both attempts resulted in a segfault.
The same thing repeatedly:
# grep segfault /var/log/messages
2021-06-05T10:56:53.856510+00:00 mirrordb1 kernel: [1344517.476156] postgres[21877]: segfault at 5626bf85dffa ip 00007ff966d76147 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:03:59.645338+00:00 mirrordb1 kernel: [1348543.361004] postgres[17228]: segfault at 5626bf85dffa ip 00007ff966d76133 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:07:44.854244+00:00 mirrordb1 kernel: [1348768.627274] postgres[18005]: segfault at 5626bf85dffa ip 00007ff966d76151 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:12:10.410150+00:00 mirrordb1 kernel: [1349034.190949] postgres[19194]: segfault at 5626bf85dffa ip 00007ff966d76133 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:18:45.116673+00:00 mirrordb1 kernel: [1349428.912847] postgres[20605]: segfault at 5626bf85dffa ip 00007ff966d76151 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:26:47.459836+00:00 mirrordb1 kernel: [1349911.271712] postgres[22555]: segfault at 5626bf85dffa ip 00007ff966d7612e sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:33:10.156571+00:00 mirrordb1 kernel: [1350293.982984] postgres[24549]: segfault at 5626bf85dffa ip 00007ff966d7614c sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-08T15:26:17.358152+00:00 mirrordb1 kernel: [1619890.699769] postgres[4776]: segfault at 5626bf85dffa ip 00007ff966d76147 sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-08T18:28:07.089464+00:00 mirrordb1 kernel: [1630800.817691] postgres[28928]: segfault at 5626bf85dffa ip 00007ff966d76151 sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-08T23:04:58.963109+00:00 mirrordb1 kernel: [1647413.271332] postgres[30652]: segfault at 5626bf85dffa ip 00007ff966d76138 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T03:25:10.534559+00:00 mirrordb1 kernel: [1663025.397677] postgres[28557]: segfault at 5626bf85dffa ip 00007ff966d76147 sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T04:58:06.275839+00:00 mirrordb1 kernel: [1668601.338594] postgres[29635]: segfault at 5626bf85dffa ip 00007ff966d7612e sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T04:59:07.199999+00:00 mirrordb1 kernel: [1668662.268971] postgres[999]: segfault at 5626bfbeabf2 ip 00007ff966d73700 sp 00007ffd90592918 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T07:51:34.897584+00:00 mirrordb1 kernel: [1679010.333734] postgres[24774]: segfault at 5618704bfc92 ip 00007ff512413700 sp 00007ffeae458f08 error 4 in libc-2.31.so[7ff5122d0000+1cb000]
2021-06-09T07:59:14.420472+00:00 mirrordb1 kernel: [1679469.872615] postgres[29702]: segfault at 5576df7d1c92 ip 00007f444ba53700 sp 00007fff8a7ecce8 error 4 in libc-2.31.so[7f444b910000+1cb000]
2021-06-09T08:05:50.521976+00:00 mirrordb1 kernel: [1679865.986628] postgres[1349]: segfault at 55b0a9cb1c82 ip 00007f2aacf9b700 sp 00007ffd2bed75f8 error 4 in libc-2.31.so[7f2aace58000+1cb000]
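All the faulting instruction pointers fall inside libc-2.31.so, and the kernel prints the library's load address in the brackets, so the crash location can be narrowed down without a core dump. A minimal sketch for the first entry; the libc path is the usual location and is an assumption, not taken from mirrordb1:
# error 4 is a user-mode read of an unmapped address; ip minus the
# module base gives the offset inside libc-2.31.so:
printf '0x%x\n' $((0x7ff966d76147 - 0x7ff966c30000))    # -> 0x146147
# resolve the offset to a function name:
addr2line -f -e /lib64/libc-2.31.so 0x146147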
#3
Updated by pjessen almost 2 years ago
I'm no wizard at debugging postgres, but at the first segfault this morning:
2021-06-09 03:25:10.552 UTC [1159]: [302-1] db=,user= LOG: server process (PID 28557) was terminated by signal 11: Segmentation fault
2021-06-09 03:25:10.552 UTC [1159]: [303-1] db=,user= DETAIL: Failed process was running: SELECT COUNT(mirr_del_byid(169, id) order by id) FROM temp1
2021-06-09 03:25:10.552 UTC [1159]: [304-1] db=,user= LOG: terminating any other active server processes
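The DETAIL line pins the crash to a query applying mirr_del_byid() over temp1. One way to narrow it down would be to run the function row by row until a call kills the backend. A rough sketch only: it assumes temp1 is an ordinary table visible to new sessions and that the database is named mirrorbrain (neither is stated in the ticket), and each successful call commits its deletion:
psql -At -d mirrorbrain -c "SELECT id FROM temp1 ORDER BY id" | while read id; do
  # a backend crash makes psql return non-zero, so stop at the bad row
  psql -d mirrorbrain -c "SELECT mirr_del_byid(169, $id)" || { echo "crashed at id=$id"; break; }
done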
#4
Updated by mstrigl almost 2 years ago
We are currently working on it.
#5
Updated by mstrigl almost 2 years ago
postgresql on mirrordb1 is up again.
We used a snapshot from yesterday.
I enabled core dump writing on mirrordb1 so that we get a core dump the next time it crashes.
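A sketch of one way to do that for a systemd-managed postgresql service; the exact mechanism used on mirrordb1 is not recorded in this ticket, and the postgres binary path in the gdb line is an assumption:
# allow the postgresql unit to write core files
mkdir -p /etc/systemd/system/postgresql.service.d
cat > /etc/systemd/system/postgresql.service.d/coredump.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
systemctl daemon-reload
systemctl restart postgresql
# once a dump exists, a backtrace for the bug report:
gdb -batch -ex 'bt full' /usr/lib/postgresql13/bin/postgres /path/to/core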
#6
Updated by hellcp almost 2 years ago
Aaaaand it's down again
#7
Updated by pjessen almost 2 years ago
Looks similar to last time:
2021-06-11 18:46:12.227 UTC [24774]: [27-1] db=,user= LOG: server process (PID 10410) was terminated by signal 11: Segmentation fault
2021-06-11 18:46:12.227 UTC [24774]: [28-1] db=,user= DETAIL: Failed process was running: SELECT COUNT(mirr_del_byid(583, id) order by id) FROM temp1
2021-06-11 18:46:12.227 UTC [24774]: [29-1] db=,user= LOG: terminating any other active server processes
#8
Updated by andriinikitin almost 2 years ago
Bernhard made a backup of the datadir today, and I decided to try resetting the write-ahead log to see if it helps (it loses the last few transactions, but I don't think we have a better option).
After running (as the postgres user):
postgres@mirrordb1:~> pg_resetwal -f /var/lib/pgsql/data
the service was able to start.
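For anyone repeating this: pg_resetwal also has a dry-run mode that only reports the values it would change, which is worth checking before forcing the reset:
postgres@mirrordb1:~> pg_resetwal -n /var/lib/pgsql/data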
#9
Updated by cboltz almost 2 years ago
For the record: andriinikitin migrated the databases to mirrordb2 and changed pgbouncer on anna/elsa so that it now uses mirrordb2. So for now, everything works again.
The reasons for the segfaults and for the unique indexes allowing duplicate rows are still unclear.
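One way to check whether a btree index still matches its table is PostgreSQL's amcheck extension. A sketch, assuming superuser access; the index name filearr_path_key and the database name mirrorbrain are illustrative, not from the ticket:
psql -d mirrorbrain -c "CREATE EXTENSION IF NOT EXISTS amcheck;"
# heapallindexed=true also verifies every heap tuple is present in the index
psql -d mirrorbrain -c "SELECT bt_index_check('filearr_path_key', true);"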
#10
Updated by KaratekHD almost 2 years ago
It seems to be down again; at least Matrix is, which was also affected by the last downtime.
#11
Updated by pjessen almost 2 years ago
KaratekHD wrote:
It seems to be down again; at least Matrix is, which was also affected by the last downtime.
Confirmed.
#12
Updated by bmwiedemann almost 2 years ago
The current plan is to downgrade to postgresql12 on mirrordb2, re-import the SQL dumps, and ensure that the unique indexes on mirrorbrain's filearr(path) entries exist and are working this time.
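A sketch of that verification step; the index name is illustrative and the database name is an assumption, neither is from the ticket:
# the unique index can only be (re)created once this returns no rows:
psql -d mirrorbrain -c "SELECT path, count(*) FROM filearr GROUP BY path HAVING count(*) > 1;"
psql -d mirrorbrain -c "CREATE UNIQUE INDEX CONCURRENTLY filearr_path_uq ON filearr (path);"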
Meanwhile, I have edited the download.o.o DNS to have 2 A and 2 AAAA records, shifting 50% of the load to mirrorcache to avoid overloading either of them.
Without the DB, download.o.o would point every user to its downloadcontent.o.o alias and cause packet loss, and mirrorcache is not yet optimized enough to handle 300 requests per second.
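The record split can be checked from any client with dig (download.o.o being the usual shorthand for download.opensuse.org):
# should return two addresses each while the split is active
dig +short download.opensuse.org A
dig +short download.opensuse.org AAAA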
#13
Updated by andriinikitin almost 2 years ago
mirrorcache.o.o has run out of disk space and has some other issues, so I reverted the DNS change; download.o.o is redirecting to mirrors properly at the moment.
#14
Updated by bmwiedemann almost 2 years ago
- % Done changed from 0 to 10
Filed https://bugzilla.opensuse.org/show_bug.cgi?id=1187392 for our postgresql13 segfault
#15
Updated by lrupp almost 2 years ago
- Status changed from New to Closed
- % Done changed from 10 to 100
The issue has meanwhile been solved by using the latest backup and a clean postgresql12 installation. Closing here.
For further reference, please check https://bugzilla.opensuse.org/show_bug.cgi?id=1187392