Actions
action #125903
closedDatabase connection error occurs when database is restarted size:M
Start date:
2023-03-13
Due date:
% Done:
0%
Estimated time:
Tags:
Description
Observation¶
Getting such an error should be no surprise. I was asked to create a ticket nevertheless as this triggered logwarn to send a mail:
[2023-03-10T13:55:28.093124Z] [error] [yx50S4-ffPMk] DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is shutting down at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/Shared/Controller/Auth.pm line 145
[2023-03-10T13:55:28.436955Z] [error] [pid:18233] Failed updating seen and error status of worker 131: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request. at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 161
[2023-03-10T13:55:28.439034Z] [error] [2HvCcSs7Fqxy] DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request. at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/Shared/Controller/Auth.pm line 145
[2023-03-10T13:55:28.445117Z] [warn] [pid:18233] Unable to verify whether worker 131 runs its job(s) as expected: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
Is the server running locally and accepting connections on that socket? at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 181
According to systemd, the DB was in fact restarted at this time. Logs shortly before the termination:
…
2023-03-10 13:55:28.053 UTC openqa geekotest [22389]FATAL: terminating connection due to administrator command
2023-03-10 13:55:28.053 UTC openqa geekotest [19144]FATAL: terminating connection due to administrator command
2023-03-10 13:55:28.057 UTC openqa geekotest [22122]FATAL: terminating connection due to administrator command
2023-03-10 13:55:28.061 UTC [1744]LOG: background worker "logical replication launcher" (PID 1796) exited with exit code 1
2023-03-10 13:55:28.084 UTC [1787]LOG: shutting down
2023-03-10 13:55:28.088 UTC openqa geekotest [30880]FATAL: the database system is shutting down
2023-03-10 13:55:28.431 UTC [1744]LOG: database system is shut down
Logs from the subsequent startup:
2023-03-10 13:55:28.654 UTC [30891]LOG: starting PostgreSQL 14.7 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 7.5.0, 64-bit
2023-03-10 13:55:28.654 UTC [30891]LOG: listening on IPv6 address "::1", port 5432
2023-03-10 13:55:28.654 UTC [30891]LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-03-10 13:55:28.657 UTC [30891]LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-03-10 13:55:28.659 UTC [30891]LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-03-10 13:55:28.664 UTC [30893]LOG: database system was shut down at 2023-03-10 13:55:28 UTC
2023-03-10 13:55:28.676 UTC [30891]LOG: database system is ready to accept connections
2023-03-10 14:00:16.599 UTC openqa geekotest [31070]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:14:26.827 UTC openqa geekotest [6903]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:18:55.219 UTC openqa geekotest [11631]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:19:44.924 UTC openqa geekotest [6851]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:28:23.899 UTC openqa geekotest [14926]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:32:57.743 UTC openqa geekotest [18054]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:35:08.184 UTC openqa geekotest [19151]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:35:09.278 UTC openqa geekotest [23335]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:39:29.405 UTC openqa geekotest [25991]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:00:11.623 UTC openqa geekotest [13127]LOG: duration: 5983.118 ms statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 15:00:56.281 UTC openqa geekotest [13127]LOG: duration: 43972.851 ms statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 15:01:14.303 UTC openqa geekotest [13127]LOG: duration: 18020.498 ms statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 15:01:28.712 UTC openqa geekotest [13127]LOG: duration: 10114.855 ms statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 15:03:07.298 UTC openqa geekotest [13127]LOG: duration: 98585.286 ms statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 15:03:42.250 UTC openqa geekotest [14576]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:04:36.760 UTC openqa geekotest [13127]LOG: duration: 89447.230 ms statement: COPY public.screenshots (id, filename, t_created) TO stdout;
2023-03-10 15:12:56.244 UTC openqa geekotest [19729]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:12:58.562 UTC openqa geekotest [20122]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:14:27.531 UTC openqa geekotest [19720]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:15:52.839 UTC openqa geekotest [19722]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:16:09.346 UTC openqa geekotest [20157]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:18:20.241 UTC openqa geekotest [18185]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:19:35.025 UTC openqa geekotest [22763]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:24:51.309 UTC openqa geekotest [25027]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:42:32.267 UTC openqa geekotest [21997]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:46:34.498 UTC openqa geekotest [20391]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:49:42.852 UTC openqa postgres [23662]LOG: duration: 87067.164 ms statement: select id, clone_id from jobs where (id > 3150000) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 16:06:01.035 UTC openqa postgres [23662]ERROR: canceling statement due to user request
2023-03-10 16:06:01.035 UTC openqa postgres [23662]STATEMENT: select id, clone_id from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 16:07:23.615 UTC openqa postgres [23662]LOG: duration: 76436.426 ms statement: select id, clone_id, t_finished from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 18:01:47.905 UTC openqa geekotest [18974]LOG: could not receive data from client: Connection reset by peer
2023-03-10 18:39:14.560 UTC openqa geekotest [27743]LOG: duration: 50166.827 ms statement: select distinct distri,version,flavor,test,arch,machine from jobs where id in (select job_id from job_modules where (name='None')and t_updated > now() - interval '90 days') limit 30;
2023-03-10 19:00:09.858 UTC openqa geekotest [7206]LOG: duration: 5945.785 ms statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 19:00:50.728 UTC openqa geekotest [7206]LOG: duration: 40332.712 ms statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 19:01:08.961 UTC openqa geekotest [7206]LOG: duration: 18231.136 ms statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 19:01:23.372 UTC openqa geekotest [7206]LOG: duration: 10327.766 ms statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 19:03:00.531 UTC openqa geekotest [7206]LOG: duration: 97157.132 ms statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 19:04:26.322 UTC openqa geekotest [7206]LOG: duration: 85774.954 ms statement: COPY public.screenshots (id, filename, t_created) TO stdout;
2023-03-10 22:01:54.004 UTC openqa geekotest [26566]LOG: could not receive data from client: Connection reset by peer
2023-03-10 23:00:10.793 UTC openqa geekotest [10571]LOG: duration: 6084.983 ms statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 23:00:53.068 UTC openqa geekotest [10571]LOG: duration: 41735.429 ms statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 23:01:10.412 UTC openqa geekotest [10571]LOG: duration: 17342.030 ms statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 23:01:24.367 UTC openqa geekotest [10571]LOG: duration: 9969.798 ms statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 23:02:58.563 UTC openqa geekotest [10571]LOG: duration: 94195.186 ms statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 23:04:24.386 UTC openqa geekotest [10571]LOG: duration: 85806.685 ms statement: COPY public.screenshots (id, filename, t_created) TO stdout;
Acceptance criteria¶
- AC1: It is known why the database was shut down before the openQA services using it
Suggestions¶
- Related(?): #122578
- Why is the dependency handling on the systemd level not enough? https://github.com/os-autoinst/openQA/pull/4976/files
- Check if the openQA services depending on PostgreSQL are shut down properly (correct signals, systemd settings?)
- Check if the wrong PostgreSQL shutdown mode is used (https://www.postgresql.org/docs/current/server-shutdown.html)
- Maybe we need to use
BindsTo=
?
Actions