Project

General

Profile

Actions

action #125903

closed

Database connection error occurs when database is restarted size:M

Added by mkittler over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-03-13
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

Getting such an error should be no surprise. I was asked to create a ticket nevertheless as this triggered logwarn to send a mail:

[2023-03-10T13:55:28.093124Z] [error] [yx50S4-ffPMk] DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is shutting down at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/Shared/Controller/Auth.pm line 145

[2023-03-10T13:55:28.436955Z] [error] [pid:18233] Failed updating seen and error status of worker 131: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request. at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 161

[2023-03-10T13:55:28.439034Z] [error] [2HvCcSs7Fqxy] DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request. at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/Shared/Controller/Auth.pm line 145

[2023-03-10T13:55:28.445117Z] [warn] [pid:18233] Unable to verify whether worker 131 runs its job(s) as expected: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
        Is the server running locally and accepting connections on that socket? at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 181

According to systemd, the DB was in fact restarted at this time. Logs shortly before the termination:

…
2023-03-10 13:55:28.053 UTC openqa geekotest [22389]FATAL:  terminating connection due to administrator command
2023-03-10 13:55:28.053 UTC openqa geekotest [19144]FATAL:  terminating connection due to administrator command
2023-03-10 13:55:28.057 UTC openqa geekotest [22122]FATAL:  terminating connection due to administrator command
2023-03-10 13:55:28.061 UTC   [1744]LOG:  background worker "logical replication launcher" (PID 1796) exited with exit code 1
2023-03-10 13:55:28.084 UTC   [1787]LOG:  shutting down
2023-03-10 13:55:28.088 UTC openqa geekotest [30880]FATAL:  the database system is shutting down
2023-03-10 13:55:28.431 UTC   [1744]LOG:  database system is shut down

Logs from the subsequent startup:

2023-03-10 13:55:28.654 UTC   [30891]LOG:  starting PostgreSQL 14.7 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 7.5.0, 64-bit
2023-03-10 13:55:28.654 UTC   [30891]LOG:  listening on IPv6 address "::1", port 5432
2023-03-10 13:55:28.654 UTC   [30891]LOG:  listening on IPv4 address "127.0.0.1", port 5432
2023-03-10 13:55:28.657 UTC   [30891]LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-03-10 13:55:28.659 UTC   [30891]LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-03-10 13:55:28.664 UTC   [30893]LOG:  database system was shut down at 2023-03-10 13:55:28 UTC
2023-03-10 13:55:28.676 UTC   [30891]LOG:  database system is ready to accept connections
2023-03-10 14:00:16.599 UTC openqa geekotest [31070]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:14:26.827 UTC openqa geekotest [6903]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:18:55.219 UTC openqa geekotest [11631]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:19:44.924 UTC openqa geekotest [6851]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:28:23.899 UTC openqa geekotest [14926]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:32:57.743 UTC openqa geekotest [18054]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:35:08.184 UTC openqa geekotest [19151]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:35:09.278 UTC openqa geekotest [23335]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 14:39:29.405 UTC openqa geekotest [25991]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:00:11.623 UTC openqa geekotest [13127]LOG:  duration: 5983.118 ms  statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 15:00:56.281 UTC openqa geekotest [13127]LOG:  duration: 43972.851 ms  statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 15:01:14.303 UTC openqa geekotest [13127]LOG:  duration: 18020.498 ms  statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 15:01:28.712 UTC openqa geekotest [13127]LOG:  duration: 10114.855 ms  statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 15:03:07.298 UTC openqa geekotest [13127]LOG:  duration: 98585.286 ms  statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 15:03:42.250 UTC openqa geekotest [14576]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:04:36.760 UTC openqa geekotest [13127]LOG:  duration: 89447.230 ms  statement: COPY public.screenshots (id, filename, t_created) TO stdout;
2023-03-10 15:12:56.244 UTC openqa geekotest [19729]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:12:58.562 UTC openqa geekotest [20122]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:14:27.531 UTC openqa geekotest [19720]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:15:52.839 UTC openqa geekotest [19722]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:16:09.346 UTC openqa geekotest [20157]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:18:20.241 UTC openqa geekotest [18185]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:19:35.025 UTC openqa geekotest [22763]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:24:51.309 UTC openqa geekotest [25027]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:42:32.267 UTC openqa geekotest [21997]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:46:34.498 UTC openqa geekotest [20391]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 15:49:42.852 UTC openqa postgres [23662]LOG:  duration: 87067.164 ms  statement: select id, clone_id from jobs where (id > 3150000) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 16:06:01.035 UTC openqa postgres [23662]ERROR:  canceling statement due to user request
2023-03-10 16:06:01.035 UTC openqa postgres [23662]STATEMENT:  select id, clone_id from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 16:07:23.615 UTC openqa postgres [23662]LOG:  duration: 76436.426 ms  statement: select id, clone_id, t_finished from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
2023-03-10 18:01:47.905 UTC openqa geekotest [18974]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 18:39:14.560 UTC openqa geekotest [27743]LOG:  duration: 50166.827 ms  statement: select distinct distri,version,flavor,test,arch,machine from jobs where id in (select job_id from job_modules where (name='None')and t_updated > now() - interval '90 days') limit 30;
2023-03-10 19:00:09.858 UTC openqa geekotest [7206]LOG:  duration: 5945.785 ms  statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 19:00:50.728 UTC openqa geekotest [7206]LOG:  duration: 40332.712 ms  statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 19:01:08.961 UTC openqa geekotest [7206]LOG:  duration: 18231.136 ms  statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 19:01:23.372 UTC openqa geekotest [7206]LOG:  duration: 10327.766 ms  statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 19:03:00.531 UTC openqa geekotest [7206]LOG:  duration: 97157.132 ms  statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 19:04:26.322 UTC openqa geekotest [7206]LOG:  duration: 85774.954 ms  statement: COPY public.screenshots (id, filename, t_created) TO stdout;
2023-03-10 22:01:54.004 UTC openqa geekotest [26566]LOG:  could not receive data from client: Connection reset by peer
2023-03-10 23:00:10.793 UTC openqa geekotest [10571]LOG:  duration: 6084.983 ms  statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-03-10 23:00:53.068 UTC openqa geekotest [10571]LOG:  duration: 41735.429 ms  statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-03-10 23:01:10.412 UTC openqa geekotest [10571]LOG:  duration: 17342.030 ms  statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-03-10 23:01:24.367 UTC openqa geekotest [10571]LOG:  duration: 9969.798 ms  statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-03-10 23:02:58.563 UTC openqa geekotest [10571]LOG:  duration: 94195.186 ms  statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-03-10 23:04:24.386 UTC openqa geekotest [10571]LOG:  duration: 85806.685 ms  statement: COPY public.screenshots (id, filename, t_created) TO stdout;

Acceptance criteria

  • AC1: It is known why the database was shut down before the openQA services using it

Suggestions

Actions #1

Updated by okurz over 1 year ago

  • Tags changed from alert, infra to alert
  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #2

Updated by livdywan over 1 year ago

  • Subject changed from Database connection error occurs when database is restarted to Database connection error occurs when database is restarted size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by mkittler over 1 year ago

  • Description updated (diff)
Actions #4

Updated by jbaier_cz over 1 year ago

  • Assignee set to jbaier_cz
Actions #5

Updated by jbaier_cz over 1 year ago

  • Status changed from Workable to In Progress

#122578 should solve this issue as well, but the fix depends on openQA-local-db which is not installed on osd

Actions #6

Updated by jbaier_cz over 1 year ago

I removed the zypper lock to allow the installation of openQA-local-db package (which includes the systemd-service overrides with the appropriate postgresql requires).

Actions #7

Updated by openqa_review over 1 year ago

  • Due date set to 2023-04-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by jbaier_cz over 1 year ago

  • Status changed from In Progress to Feedback

The require snippets are now present, so the restarting postgresql database should stop the affected services first and then start them again after the unit goes active.

Actions #9

Updated by okurz over 1 year ago

I can confirm that openQA-local-db is installed on OSD but on o3 it's not installed and I think this is where the original problem happened.

Actions #10

Updated by jbaier_cz over 1 year ago

I believed we solved o3 within #122578, I will check that and install the missing package manually if needed.

Actions #11

Updated by livdywan over 1 year ago

jbaier_cz wrote:

I believed we solved o3 within #122578, I will check that and install the missing package manually if needed.

Do we still need a manual action here?

Actions #12

Updated by okurz over 1 year ago

  • Due date deleted (2023-04-11)
  • Status changed from Feedback to Resolved

mkittler, tinita and me checked the current state on 2023-04-12. We ensured that on both o3+osd there are no package locks for openQA-local-db and the package is installed. We triggered a restart of "postgresql.service" on o3 and checked in the system journal that openQA services were stopped and started properly as dependency.

Actions

Also available in: Atom PDF