Project

General

Profile

action #125903

Updated by mkittler over 1 year ago

### Observation 
 Getting such an error should be no surprise. I was asked to create a ticket nevertheless as this triggered logwarn to send a mail: 

 ``` 
 [2023-03-10T13:55:28.093124Z] [error] [yx50S4-ffPMk] DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:    the database system is shutting down at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/Shared/Controller/Auth.pm line 145 

 [2023-03-10T13:55:28.436955Z] [error] [pid:18233] Failed updating seen and error status of worker 131: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: server closed the connection unexpectedly 
         This probably means the server terminated abnormally 
         before or while processing the request. at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 161 

 [2023-03-10T13:55:28.439034Z] [error] [2HvCcSs7Fqxy] DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: server closed the connection unexpectedly 
         This probably means the server terminated abnormally 
         before or while processing the request. at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/Shared/Controller/Auth.pm line 145 

 [2023-03-10T13:55:28.445117Z] [warn] [pid:18233] Unable to verify whether worker 131 runs its job(s) as expected: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory 
         Is the server running locally and accepting connections on that socket? at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 181 
 ``` 

 According to systemd, the DB was in fact restarted at this time. Logs shortly before the termination: 

 ``` 
 … 
 2023-03-10 13:55:28.053 UTC openqa geekotest [22389]FATAL:    terminating connection due to administrator command 
 2023-03-10 13:55:28.053 UTC openqa geekotest [19144]FATAL:    terminating connection due to administrator command 
 2023-03-10 13:55:28.057 UTC openqa geekotest [22122]FATAL:    terminating connection due to administrator command 
 2023-03-10 13:55:28.061 UTC     [1744]LOG:    background worker "logical replication launcher" (PID 1796) exited with exit code 1 
 2023-03-10 13:55:28.084 UTC     [1787]LOG:    shutting down 
 2023-03-10 13:55:28.088 UTC openqa geekotest [30880]FATAL:    the database system is shutting down 
 2023-03-10 13:55:28.431 UTC     [1744]LOG:    database system is shut down 
 ``` 

 Logs from the subsequent startup: 
 ``` 
 2023-03-10 13:55:28.654 UTC     [30891]LOG:    starting PostgreSQL 14.7 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 7.5.0, 64-bit 
 2023-03-10 13:55:28.654 UTC     [30891]LOG:    listening on IPv6 address "::1", port 5432 
 2023-03-10 13:55:28.654 UTC     [30891]LOG:    listening on IPv4 address "127.0.0.1", port 5432 
 2023-03-10 13:55:28.657 UTC     [30891]LOG:    listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 
 2023-03-10 13:55:28.659 UTC     [30891]LOG:    listening on Unix socket "/tmp/.s.PGSQL.5432" 
 2023-03-10 13:55:28.664 UTC     [30893]LOG:    database system was shut down at 2023-03-10 13:55:28 UTC 
 2023-03-10 13:55:28.676 UTC     [30891]LOG:    database system is ready to accept connections 
 2023-03-10 14:00:16.599 UTC openqa geekotest [31070]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 14:14:26.827 UTC openqa geekotest [6903]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 14:18:55.219 UTC openqa geekotest [11631]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 14:19:44.924 UTC openqa geekotest [6851]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 14:28:23.899 UTC openqa geekotest [14926]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 14:32:57.743 UTC openqa geekotest [18054]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 14:35:08.184 UTC openqa geekotest [19151]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 14:35:09.278 UTC openqa geekotest [23335]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 14:39:29.405 UTC openqa geekotest [25991]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:00:11.623 UTC openqa geekotest [13127]LOG:    duration: 5983.118 ms    statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout; 
 2023-03-10 15:00:56.281 UTC openqa geekotest [13127]LOG:    duration: 43972.851 ms    statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout; 
 2023-03-10 15:01:14.303 UTC openqa geekotest [13127]LOG:    duration: 18020.498 ms    statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout; 
 2023-03-10 15:01:28.712 UTC openqa geekotest [13127]LOG:    duration: 10114.855 ms    statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout; 
 2023-03-10 15:03:07.298 UTC openqa geekotest [13127]LOG:    duration: 98585.286 ms    statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout; 
 2023-03-10 15:03:42.250 UTC openqa geekotest [14576]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:04:36.760 UTC openqa geekotest [13127]LOG:    duration: 89447.230 ms    statement: COPY public.screenshots (id, filename, t_created) TO stdout; 
 2023-03-10 15:12:56.244 UTC openqa geekotest [19729]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:12:58.562 UTC openqa geekotest [20122]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:14:27.531 UTC openqa geekotest [19720]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:15:52.839 UTC openqa geekotest [19722]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:16:09.346 UTC openqa geekotest [20157]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:18:20.241 UTC openqa geekotest [18185]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:19:35.025 UTC openqa geekotest [22763]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:24:51.309 UTC openqa geekotest [25027]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:42:32.267 UTC openqa geekotest [21997]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:46:34.498 UTC openqa geekotest [20391]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 15:49:42.852 UTC openqa postgres [23662]LOG:    duration: 87067.164 ms    statement: select id, clone_id from jobs where (id > 3150000) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc; 
 2023-03-10 16:06:01.035 UTC openqa postgres [23662]ERROR:    canceling statement due to user request 
 2023-03-10 16:06:01.035 UTC openqa postgres [23662]STATEMENT:    select id, clone_id from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc; 
 2023-03-10 16:07:23.615 UTC openqa postgres [23662]LOG:    duration: 76436.426 ms    statement: select id, clone_id, t_finished from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc; 
 2023-03-10 18:01:47.905 UTC openqa geekotest [18974]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 18:39:14.560 UTC openqa geekotest [27743]LOG:    duration: 50166.827 ms    statement: select distinct distri,version,flavor,test,arch,machine from jobs where id in (select job_id from job_modules where (name='None')and t_updated > now() - interval '90 days') limit 30; 
 2023-03-10 19:00:09.858 UTC openqa geekotest [7206]LOG:    duration: 5945.785 ms    statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout; 
 2023-03-10 19:00:50.728 UTC openqa geekotest [7206]LOG:    duration: 40332.712 ms    statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout; 
 2023-03-10 19:01:08.961 UTC openqa geekotest [7206]LOG:    duration: 18231.136 ms    statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout; 
 2023-03-10 19:01:23.372 UTC openqa geekotest [7206]LOG:    duration: 10327.766 ms    statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout; 
 2023-03-10 19:03:00.531 UTC openqa geekotest [7206]LOG:    duration: 97157.132 ms    statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout; 
 2023-03-10 19:04:26.322 UTC openqa geekotest [7206]LOG:    duration: 85774.954 ms    statement: COPY public.screenshots (id, filename, t_created) TO stdout; 
 2023-03-10 22:01:54.004 UTC openqa geekotest [26566]LOG:    could not receive data from client: Connection reset by peer 
 2023-03-10 23:00:10.793 UTC openqa geekotest [10571]LOG:    duration: 6084.983 ms    statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout; 
 2023-03-10 23:00:53.068 UTC openqa geekotest [10571]LOG:    duration: 41735.429 ms    statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout; 
 2023-03-10 23:01:10.412 UTC openqa geekotest [10571]LOG:    duration: 17342.030 ms    statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout; 
 2023-03-10 23:01:24.367 UTC openqa geekotest [10571]LOG:    duration: 9969.798 ms    statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout; 
 2023-03-10 23:02:58.563 UTC openqa geekotest [10571]LOG:    duration: 94195.186 ms    statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout; 
 2023-03-10 23:04:24.386 UTC openqa geekotest [10571]LOG:    duration: 85806.685 ms    statement: COPY public.screenshots (id, filename, t_created) TO stdout; 
 ``` 

 ## Acceptance criteria 
 * **AC1:** It is known why the database was shut down before the openQA services using it 

 ## Suggestions 
 - Related(?): #122578 
 - Why is the dependency handling on the systemd level not enough? https://github.com/os-autoinst/openQA/pull/4976/files 
 - Check if the openQA services depending on PostgreSQL are shut down properly (correct signals, systemd settings?) 
 - Check if the wrong PostgreSQL shutdown mode    is used (https://www.postgresql.org/docs/current/server-shutdown.html) 
 - Maybe we need to use [`BindsTo=`](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#BindsTo=)?

Back