action #125903
closedDatabase connection error occurs when database is restarted size:M
0%
Description
Observation¶
Getting such an error should be no surprise. I was asked to create a ticket nevertheless as this triggered logwarn to send a mail:
[2023-03-10T13:55:28.093124Z] [error] [yx50S4-ffPMk] DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is shutting down at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/Shared/Controller/Auth.pm line 145
[2023-03-10T13:55:28.436955Z] [error] [pid:18233] Failed updating seen and error status of worker 131: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request. at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 161
[2023-03-10T13:55:28.439034Z] [error] [2HvCcSs7Fqxy] DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request. at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/Shared/Controller/Auth.pm line 145
[2023-03-10T13:55:28.445117Z] [warn] [pid:18233] Unable to verify whether worker 131 runs its job(s) as expected: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
Is the server running locally and accepting connections on that socket? at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 181
According to systemd, the DB was in fact restarted at this time. Logs shortly before the termination:
…
2023-03-10 13:55:28.053 UTC openqa geekotest [22389]FATAL: terminating connection due to administrator command
2023-03-10 13:55:28.053 UTC openqa geekotest [19144]FATAL: terminating connection due to administrator command
2023-03-10 13:55:28.057 UTC openqa geekotest [22122]FATAL: terminating connection due to administrator command
2023-03-10 13:55:28.061 UTC [1744]LOG: background worker "logical replication launcher" (PID 1796) exited with exit code 1
2023-03-10 13:55:28.084 UTC [1787]LOG: shutting down
2023-03-10 13:55:28.088 UTC openqa geekotest [30880]FATAL: the database system is shutting down
2023-03-10 13:55:28.431 UTC [1744]LOG: database system is shut down
Logs from the subsequent startup:
2023-03-10 13:55:28.654 UTC [30891]LOG: starting PostgreSQL 14.7 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 7.5.0, 64-bit
2023-03-10 13:55:28.654 UTC [30891]LOG: listening on IPv6 address "::1", port 5432
2023-03-10 13:55:28.654 UTC [30891]LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-03-10 13:55:28.657 UTC [30891]LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-03-10 13:55:28.659 UTC [30891]LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-03-10 13:55:28.664 UTC [30893]LOG: database system was shut down at 2023-03-10 13:55:28 UTC
2023-03-10 13:55:28.676 UTC [30891]LOG: database system is ready to accept connections
2023-03-10 14:00:16.599 UTC openqa geekotest [31070]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:14:26.827 UTC openqa geekotest [6903]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:18:55.219 UTC openqa geekotest [11631]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:19:44.924 UTC openqa geekotest [6851]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:28:23.899 UTC openqa geekotest [14926]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:32:57.743 UTC openqa geekotest [18054]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:35:08.184 UTC openqa geekotest [19151]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:35:09.278 UTC openqa geekotest [23335]LOG: could not receive data from client: Connection reset by peer
2023-03-10 14:39:29.405 UTC openqa geekotest [25991]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:00:11.623 UTC openqa geekotest [13127]LOG: duration: 5983.118 ms statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 15:00:56.281 UTC openqa geekotest [13127]LOG: duration: 43972.851 ms statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 15:01:14.303 UTC openqa geekotest [13127]LOG: duration: 18020.498 ms statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 15:01:28.712 UTC openqa geekotest [13127]LOG: duration: 10114.855 ms statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 15:03:07.298 UTC openqa geekotest [13127]LOG: duration: 98585.286 ms statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 15:03:42.250 UTC openqa geekotest [14576]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:04:36.760 UTC openqa geekotest [13127]LOG: duration: 89447.230 ms statement: COPY public.screenshots (id, filename, t_created) TO stdout;
2023-03-10 15:12:56.244 UTC openqa geekotest [19729]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:12:58.562 UTC openqa geekotest [20122]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:14:27.531 UTC openqa geekotest [19720]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:15:52.839 UTC openqa geekotest [19722]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:16:09.346 UTC openqa geekotest [20157]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:18:20.241 UTC openqa geekotest [18185]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:19:35.025 UTC openqa geekotest [22763]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:24:51.309 UTC openqa geekotest [25027]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:42:32.267 UTC openqa geekotest [21997]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:46:34.498 UTC openqa geekotest [20391]LOG: could not receive data from client: Connection reset by peer
2023-03-10 15:49:42.852 UTC openqa postgres [23662]LOG: duration: 87067.164 ms statement: select id, clone_id from jobs where (id > 3150000) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 16:06:01.035 UTC openqa postgres [23662]ERROR: canceling statement due to user request
2023-03-10 16:06:01.035 UTC openqa postgres [23662]STATEMENT: select id, clone_id from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 16:07:23.615 UTC openqa postgres [23662]LOG: duration: 76436.426 ms statement: select id, clone_id, t_finished from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 18:01:47.905 UTC openqa geekotest [18974]LOG: could not receive data from client: Connection reset by peer
2023-03-10 18:39:14.560 UTC openqa geekotest [27743]LOG: duration: 50166.827 ms statement: select distinct distri,version,flavor,test,arch,machine from jobs where id in (select job_id from job_modules where (name='None')and t_updated > now() - interval '90 days') limit 30;
2023-03-10 19:00:09.858 UTC openqa geekotest [7206]LOG: duration: 5945.785 ms statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 19:00:50.728 UTC openqa geekotest [7206]LOG: duration: 40332.712 ms statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 19:01:08.961 UTC openqa geekotest [7206]LOG: duration: 18231.136 ms statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 19:01:23.372 UTC openqa geekotest [7206]LOG: duration: 10327.766 ms statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 19:03:00.531 UTC openqa geekotest [7206]LOG: duration: 97157.132 ms statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 19:04:26.322 UTC openqa geekotest [7206]LOG: duration: 85774.954 ms statement: COPY public.screenshots (id, filename, t_created) TO stdout;
2023-03-10 22:01:54.004 UTC openqa geekotest [26566]LOG: could not receive data from client: Connection reset by peer
2023-03-10 23:00:10.793 UTC openqa geekotest [10571]LOG: duration: 6084.983 ms statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 23:00:53.068 UTC openqa geekotest [10571]LOG: duration: 41735.429 ms statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 23:01:10.412 UTC openqa geekotest [10571]LOG: duration: 17342.030 ms statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 23:01:24.367 UTC openqa geekotest [10571]LOG: duration: 9969.798 ms statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 23:02:58.563 UTC openqa geekotest [10571]LOG: duration: 94195.186 ms statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 23:04:24.386 UTC openqa geekotest [10571]LOG: duration: 85806.685 ms statement: COPY public.screenshots (id, filename, t_created) TO stdout;
Acceptance criteria¶
- AC1: It is known why the database was shut down before the openQA services using it
Suggestions¶
- Related(?): #122578
- Why is the dependency handling on the systemd level not enough? https://github.com/os-autoinst/openQA/pull/4976/files
- Check if the openQA services depending on PostgreSQL are shut down properly (correct signals, systemd settings?)
- Check if the wrong PostgreSQL shutdown mode is used (https://www.postgresql.org/docs/current/server-shutdown.html)
- Maybe we need to use
BindsTo=
?
Updated by okurz almost 2 years ago
- Tags changed from alert, infra to alert
- Category set to Regressions/Crashes
- Target version set to Ready
Updated by livdywan almost 2 years ago
- Subject changed from Database connection error occurs when database is restarted to Database connection error occurs when database is restarted size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by jbaier_cz almost 2 years ago
- Status changed from Workable to In Progress
#122578 should solve this issue as well, but the fix depends on openQA-local-db which is not installed on osd
Updated by jbaier_cz almost 2 years ago
I removed the zypper lock to allow the installation of openQA-local-db package (which includes the systemd-service overrides with the appropriate postgresql requires).
Updated by openqa_review almost 2 years ago
- Due date set to 2023-04-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by jbaier_cz almost 2 years ago
- Status changed from In Progress to Feedback
The require snippets are now present, so the restarting postgresql database should stop the affected services first and then start them again after the unit goes active.
Updated by okurz almost 2 years ago
I can confirm that openQA-local-db is installed on OSD but on o3 it's not installed and I think this is where the original problem happened.
Updated by jbaier_cz almost 2 years ago
I believed we solved o3 within #122578, I will check that and install the missing package manually if needed.
Updated by livdywan almost 2 years ago
jbaier_cz wrote:
I believed we solved o3 within #122578, I will check that and install the missing package manually if needed.
Do we still need a manual action here?
Updated by okurz almost 2 years ago
- Due date deleted (
2023-04-11) - Status changed from Feedback to Resolved
mkittler, tinita and me checked the current state on 2023-04-12. We ensured that on both o3+osd there are no package locks for openQA-local-db and the package is installed. We triggered a restart of "postgresql.service" on o3 and checked in the system journal that openQA services were stopped and started properly as dependency.