Project

General

Profile

Actions

action #125903

closed

Database connection error occurs when database is restarted size:M

Added by mkittler about 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-03-13
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

Getting such an error should be no surprise. I was asked to create a ticket nevertheless as this triggered logwarn to send a mail:

[2023-03-10T13:55:28.093124Z] [error] [yx50S4-ffPMk] DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is shutting down at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/Shared/Controller/Auth.pm line 145

[2023-03-10T13:55:28.436955Z] [error] [pid:18233] Failed updating seen and error status of worker 131: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request. at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 161

[2023-03-10T13:55:28.439034Z] [error] [2HvCcSs7Fqxy] DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request. at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/Shared/Controller/Auth.pm line 145

[2023-03-10T13:55:28.445117Z] [warn] [pid:18233] Unable to verify whether worker 131 runs its job(s) as expected: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
        Is the server running locally and accepting connections on that socket? at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 181

According to systemd, the DB was in fact restarted at this time. Logs shortly before the termination:

…
2023-03-10 13:55:28.053 UTC openqa geekotest [22389]FATAL:  terminating connection due to administrator command
2023-03-10 13:55:28.053 UTC openqa geekotest [19144]FATAL:  terminating connection due to administrator command
2023-03-10 13:55:28.057 UTC openqa geekotest [22122]FATAL:  terminating connection due to administrator command
2023-03-10 13:55:28.061 UTC   [1744]LOG:  background worker "logical replication launcher" (PID 1796) exited with exit code 1
2023-03-10 13:55:28.084 UTC   [1787]LOG:  shutting down
2023-03-10 13:55:28.088 UTC openqa geekotest [30880]FATAL:  the database system is shutting down
2023-03-10 13:55:28.431 UTC   [1744]LOG:  database system is shut down

Logs from the subsequent startup:

2023-03-10 13:55:28.654 UTC   [30891]LOG:  starting PostgreSQL 14.7 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 7.5.0, 64-bit
2023-03-10 13:55:28.654 UTC   [30891]LOG:  listening on IPv6 address "::1", port 5432
2023-03-10 13:55:28.654 UTC   [30891]LOG:  listening on IPv4 address "127.0.0.1", port 5432
2023-03-10 13:55:28.657 UTC   [30891]LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-03-10 13:55:28.659 UTC   [30891]LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-03-10 13:55:28.664 UTC   [30893]LOG:  database system was shut down at 2023-03-10 13:55:28 UTC
2023-03-10 13:55:28.676 UTC   [30891]LOG:  database system is ready to accept connections
2023-03-10 14:00:16.599 UTC openqa geekotest [31070]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:14:26.827 UTC openqa geekotest [6903]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:18:55.219 UTC openqa geekotest [11631]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:19:44.924 UTC openqa geekotest [6851]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:28:23.899 UTC openqa geekotest [14926]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:32:57.743 UTC openqa geekotest [18054]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:35:08.184 UTC openqa geekotest [19151]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:35:09.278 UTC openqa geekotest [23335]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:39:29.405 UTC openqa geekotest [25991]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:00:11.623 UTC openqa geekotest [13127]LOG:  duration: 5983.118 ms  statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 15:00:56.281 UTC openqa geekotest [13127]LOG:  duration: 43972.851 ms  statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 15:01:14.303 UTC openqa geekotest [13127]LOG:  duration: 18020.498 ms  statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 15:01:28.712 UTC openqa geekotest [13127]LOG:  duration: 10114.855 ms  statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 15:03:07.298 UTC openqa geekotest [13127]LOG:  duration: 98585.286 ms  statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 15:03:42.250 UTC openqa geekotest [14576]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:04:36.760 UTC openqa geekotest [13127]LOG:  duration: 89447.230 ms  statement: COPY public.screenshots (id, filename, t_created) TO stdout;
2023-03-10 15:12:56.244 UTC openqa geekotest [19729]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:12:58.562 UTC openqa geekotest [20122]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:14:27.531 UTC openqa geekotest [19720]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:15:52.839 UTC openqa geekotest [19722]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:16:09.346 UTC openqa geekotest [20157]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:18:20.241 UTC openqa geekotest [18185]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:19:35.025 UTC openqa geekotest [22763]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:24:51.309 UTC openqa geekotest [25027]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:42:32.267 UTC openqa geekotest [21997]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:46:34.498 UTC openqa geekotest [20391]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:49:42.852 UTC openqa postgres [23662]LOG:  duration: 87067.164 ms  statement: select id, clone_id from jobs where (id > 3150000) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 16:06:01.035 UTC openqa postgres [23662]ERROR:  canceling statement due to user request
2023-03-10 16:06:01.035 UTC openqa postgres [23662]STATEMENT:  select id, clone_id from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 16:07:23.615 UTC openqa postgres [23662]LOG:  duration: 76436.426 ms  statement: select id, clone_id, t_finished from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 18:01:47.905 UTC openqa geekotest [18974]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 18:39:14.560 UTC openqa geekotest [27743]LOG:  duration: 50166.827 ms  statement: select distinct distri,version,flavor,test,arch,machine from jobs where id in (select job_id from job_modules where (name='None')and t_updated > now() - interval '90 days') limit 30;
2023-03-10 19:00:09.858 UTC openqa geekotest [7206]LOG:  duration: 5945.785 ms  statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 19:00:50.728 UTC openqa geekotest [7206]LOG:  duration: 40332.712 ms  statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 19:01:08.961 UTC openqa geekotest [7206]LOG:  duration: 18231.136 ms  statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 19:01:23.372 UTC openqa geekotest [7206]LOG:  duration: 10327.766 ms  statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 19:03:00.531 UTC openqa geekotest [7206]LOG:  duration: 97157.132 ms  statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 19:04:26.322 UTC openqa geekotest [7206]LOG:  duration: 85774.954 ms  statement: COPY public.screenshots (id, filename, t_created) TO stdout;
2023-03-10 22:01:54.004 UTC openqa geekotest [26566]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 23:00:10.793 UTC openqa geekotest [10571]LOG:  duration: 6084.983 ms  statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 23:00:53.068 UTC openqa geekotest [10571]LOG:  duration: 41735.429 ms  statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 23:01:10.412 UTC openqa geekotest [10571]LOG:  duration: 17342.030 ms  statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 23:01:24.367 UTC openqa geekotest [10571]LOG:  duration: 9969.798 ms  statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 23:02:58.563 UTC openqa geekotest [10571]LOG:  duration: 94195.186 ms  statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 23:04:24.386 UTC openqa geekotest [10571]LOG:  duration: 85806.685 ms  statement: COPY public.screenshots (id, filename, t_created) TO stdout;

Acceptance criteria

  • AC1: It is known why the database was shut down before the openQA services using it

Suggestions

Actions

Also available in: Atom PDF