tickets #93686
closed
Postgres currently down
100%
Description
It seems mirrordb1 went down this morning and hasn't come back up. A simple restart apparently doesn't work.
Updated by pjessen over 3 years ago
I tried restarting it twice, both resulted in a segfault.
The same thing repeatedly:
# grep segfault /var/log/messages
2021-06-05T10:56:53.856510+00:00 mirrordb1 kernel: [1344517.476156] postgres[21877]: segfault at 5626bf85dffa ip 00007ff966d76147 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:03:59.645338+00:00 mirrordb1 kernel: [1348543.361004] postgres[17228]: segfault at 5626bf85dffa ip 00007ff966d76133 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:07:44.854244+00:00 mirrordb1 kernel: [1348768.627274] postgres[18005]: segfault at 5626bf85dffa ip 00007ff966d76151 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:12:10.410150+00:00 mirrordb1 kernel: [1349034.190949] postgres[19194]: segfault at 5626bf85dffa ip 00007ff966d76133 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:18:45.116673+00:00 mirrordb1 kernel: [1349428.912847] postgres[20605]: segfault at 5626bf85dffa ip 00007ff966d76151 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:26:47.459836+00:00 mirrordb1 kernel: [1349911.271712] postgres[22555]: segfault at 5626bf85dffa ip 00007ff966d7612e sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-05T12:33:10.156571+00:00 mirrordb1 kernel: [1350293.982984] postgres[24549]: segfault at 5626bf85dffa ip 00007ff966d7614c sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-08T15:26:17.358152+00:00 mirrordb1 kernel: [1619890.699769] postgres[4776]: segfault at 5626bf85dffa ip 00007ff966d76147 sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-08T18:28:07.089464+00:00 mirrordb1 kernel: [1630800.817691] postgres[28928]: segfault at 5626bf85dffa ip 00007ff966d76151 sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-08T23:04:58.963109+00:00 mirrordb1 kernel: [1647413.271332] postgres[30652]: segfault at 5626bf85dffa ip 00007ff966d76138 sp 00007ffd90595538 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T03:25:10.534559+00:00 mirrordb1 kernel: [1663025.397677] postgres[28557]: segfault at 5626bf85dffa ip 00007ff966d76147 sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T04:58:06.275839+00:00 mirrordb1 kernel: [1668601.338594] postgres[29635]: segfault at 5626bf85dffa ip 00007ff966d7612e sp 00007ffd90595438 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T04:59:07.199999+00:00 mirrordb1 kernel: [1668662.268971] postgres[999]: segfault at 5626bfbeabf2 ip 00007ff966d73700 sp 00007ffd90592918 error 4 in libc-2.31.so[7ff966c30000+1cb000]
2021-06-09T07:51:34.897584+00:00 mirrordb1 kernel: [1679010.333734] postgres[24774]: segfault at 5618704bfc92 ip 00007ff512413700 sp 00007ffeae458f08 error 4 in libc-2.31.so[7ff5122d0000+1cb000]
2021-06-09T07:59:14.420472+00:00 mirrordb1 kernel: [1679469.872615] postgres[29702]: segfault at 5576df7d1c92 ip 00007f444ba53700 sp 00007fff8a7ecce8 error 4 in libc-2.31.so[7f444b910000+1cb000]
2021-06-09T08:05:50.521976+00:00 mirrordb1 kernel: [1679865.986628] postgres[1349]: segfault at 55b0a9cb1c82 ip 00007f2aacf9b700 sp 00007ffd2bed75f8 error 4 in libc-2.31.so[7f2aace58000+1cb000]
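To compare these crashes across restarts, it helps to subtract the libc load address from the instruction pointer, because the load address changes with every new process image while the offset inside the library does not. The following is a minimal sketch, not something that was run on mirrordb1; the function names are made up for illustration. It assumes the x86 kernel log format shown above, where the error code is printed in hexadecimal and its low bits mean: bit 0 = page was present, bit 1 = write access, bit 2 = fault happened in user mode.

```python
import re
from collections import Counter

# Matches the kernel's segfault report in the format shown in the log above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return the decoded fields of one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "lib": m["lib"],
        # Offset of the faulting instruction inside the mapped library.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


def offset_histogram(lines):
    """Count how often each (library, offset) pair occurs in the given log lines."""
    counts = Counter()
    for line in lines:
        info = parse_segfault(line)
        if info is not None:
            counts[(info["lib"], hex(info["offset"]))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: grep segfault /var/log/messages | python3 this_script.py
    for (lib, offset), n in offset_histogram(sys.stdin).most_common():
        print(f"{n:3d}  {lib}+{offset}")
```

Applied to the lines above, `error 4` decodes to a user-mode read of a page that was not mapped. The first crash lands at libc offset 0x146147, and the last four crashes (from 04:59 on 2021-06-09 onwards) all land at offset 0x143700 even though libc was loaded at a different address each time. With matching libc debug symbols installed, such offsets could be resolved to a function name.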
Updated by pjessen over 3 years ago
I'm no wizard at debugging postgres, but at the first segfault this morning:
2021-06-09 03:25:10.552 UTC [1159]: [302-1] db=,user= LOG: server process (PID 28557) was terminated by signal 11: Segmentation fault
2021-06-09 03:25:10.552 UTC [1159]: [303-1] db=,user= DETAIL: Failed process was running: SELECT COUNT(mirr_del_byid(169, id) order by id) FROM temp1
2021-06-09 03:25:10.552 UTC [1159]: [304-1] db=,user= LOG: terminating any other active server processes
Updated by mstrigl over 3 years ago
postgresql on mirrordb1 is up again.
We used a snapshot from yesterday.
I enabled core dump writing on mirrordb1 so that we get a core dump the next time it crashes.
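How core dumps were enabled is not stated here. As a hedged sketch, one way to confirm that a running server process is actually allowed to write a core file is to read its resource limits from `/proc/<pid>/limits` on Linux. The helper names below are hypothetical. A soft limit of `0` on the "Max core file size" row means no core file would be written. Where the dump ends up is governed separately by the kernel's `core_pattern` setting, which this sketch does not check.

```python
from pathlib import Path


def core_limit_from_limits_text(text):
    """Extract (soft, hard) core size limits from the content of /proc/<pid>/limits."""
    for line in text.splitlines():
        if line.startswith("Max core file size"):
            # Columns after the four-word label: soft limit, hard limit, units.
            soft, hard = line.split()[4:6]
            return soft, hard
    return None


def core_limit_of_pid(pid):
    """Return (soft, hard) for a live process, or None if it cannot be read."""
    try:
        text = Path(f"/proc/{pid}/limits").read_text()
    except OSError:
        return None
    return core_limit_from_limits_text(text)


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py <pid of the postgres postmaster>
    print(core_limit_of_pid(int(sys.argv[1])))
```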
Updated by pjessen over 3 years ago
Looks similar to last time:
2021-06-11 18:46:12.227 UTC [24774]: [27-1] db=,user= LOG: server process (PID 10410) was terminated by signal 11: Segmentation fault
2021-06-11 18:46:12.227 UTC [24774]: [28-1] db=,user= DETAIL: Failed process was running: SELECT COUNT(mirr_del_byid(583, id) order by id) FROM temp1
2021-06-11 18:46:12.227 UTC [24774]: [29-1] db=,user= LOG: terminating any other active server processes
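Both crashes were reported while the same kind of statement was running, so it may be worth pulling every such pair out of the server log. This is a minimal sketch under the assumption that the log uses the format quoted above (a "terminated by signal" LOG line followed by a "Failed process was running" DETAIL line); `crashed_statements` is a hypothetical helper name.

```python
import re

CRASH_RE = re.compile(
    r"server process \(PID (?P<pid>\d+)\) was terminated by signal (?P<sig>\d+)"
)
DETAIL_RE = re.compile(r"DETAIL:\s+Failed process was running: (?P<query>.*)$")


def crashed_statements(log_lines):
    """Pair each crash report with the statement logged in the DETAIL line after it."""
    results = []
    pending = None
    for line in log_lines:
        crash = CRASH_RE.search(line)
        if crash:
            pending = {
                "pid": int(crash["pid"]),
                "signal": int(crash["sig"]),
                "query": None,  # stays None if no DETAIL line follows
            }
            results.append(pending)
            continue
        detail = DETAIL_RE.search(line)
        if detail and pending is not None:
            pending["query"] = detail["query"].strip()
            pending = None
    return results


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py < postgresql.log
    for entry in crashed_statements(sys.stdin):
        print(entry["pid"], entry["signal"], entry["query"])
```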
Updated by andriinikitin over 3 years ago
Bernhard made a backup of the datadir today, and I decided to try resetting the write-ahead log to see if it helps (it loses the last few transactions, but I don't think we have a better option).
After running
postgres@mirrordb1:~> pg_resetwal -f /var/lib/pgsql/data
(as the postgres user), the service was able to start.
Updated by cboltz over 3 years ago
For the record: andriinikitin migrated the databases to mirrordb2 and changed pgbouncer on anna/elsa so that it now uses mirrordb2. So for now, everything works again.
The reasons for the segfaults and for duplicate rows being allowed in unique indexes are still unclear.
Updated by KaratekHD over 3 years ago
It seems to be down again. At least Matrix is, which was also affected by the last downtime.
Updated by pjessen over 3 years ago
KaratekHD wrote:
It seems to be down again. At least Matrix is, which was also affected by the last downtime.
Confirmed.
Updated by bmwiedemann over 3 years ago
The current plan is to downgrade to postgresql12 on mirrordb2, re-import the SQL dumps, and ensure that the unique indexes on mirrorbrain's filearr(path) entries exist and actually work this time.
Meanwhile, I edited the download.o.o DNS entry to have 2 A and 2 AAAA records, shifting 50% of the load to mirrorcache so that neither of them gets overloaded.
Without the DB, download.o.o would point every user to its downloadcontent.o.o alias and cause packet loss, and mirrorcache is not yet optimized enough to handle 300 requests per second on its own.
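The unique-index check on filearr(path) mentioned above can be illustrated as follows. This is only a sketch: it uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere, and apart from the table name `filearr` and the column `path`, everything (the `id` column, the index name, the sample paths, the function names) is an assumption and not mirrorbrain's real schema. The idea is to list paths that occur more than once, remove the extra rows, and only then create the unique index so that further duplicates are rejected.

```python
import sqlite3


def find_duplicate_paths(conn):
    """Return (path, count) for every path stored more than once."""
    return conn.execute(
        "SELECT path, COUNT(*) FROM filearr "
        "GROUP BY path HAVING COUNT(*) > 1 ORDER BY path"
    ).fetchall()


def dedupe_and_index(conn):
    """Keep the lowest id per path, then enforce uniqueness with an index."""
    conn.execute(
        "DELETE FROM filearr "
        "WHERE id NOT IN (SELECT MIN(id) FROM filearr GROUP BY path)"
    )
    # Hypothetical index name; creation fails if duplicates are still present.
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS filearr_path_key ON filearr(path)"
    )
    conn.commit()


def make_demo_db():
    """Build a tiny stand-in table containing one duplicated path."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE filearr (id INTEGER PRIMARY KEY, path TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO filearr (path) VALUES (?)",
        [("a/b.rpm",), ("a/b.rpm",), ("c/d.iso",)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    print("duplicates before:", find_duplicate_paths(db))
    dedupe_and_index(db)
    print("duplicates after:", find_duplicate_paths(db))
```

Against the real database, the same `GROUP BY ... HAVING COUNT(*) > 1` query would be the first thing to run after the re-import, before relying on the index.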
Updated by andriinikitin over 3 years ago
mirrorcache.o.o has run out of disk space and has some other issues, so I reverted the DNS change, since download.o.o is redirecting to mirrors properly at the moment.
Updated by bmwiedemann over 3 years ago
- % Done changed from 0 to 10
Filed https://bugzilla.opensuse.org/show_bug.cgi?id=1187392 for our postgresql13 segfault
Updated by lrupp over 3 years ago
- Status changed from New to Closed
- % Done changed from 10 to 100
The issue has meanwhile been solved by using the latest backup and a clean postgresql12 installation. Closing here.
For further reference, please check https://bugzilla.opensuse.org/show_bug.cgi?id=1187392