action #72139
openQA services on OSD failed to connect to database
Description
All openQA services which use the database showed connection errors. That's the first error logged by PostgreSQL:
2020-09-30 12:30:53.437 CEST openqa geekotest [7311] FATAL: remaining connection slots are reserved for non-replication superuser connections
From the openQA-side the errors look like:
Sep 30 12:47:45 openqa openqa[32459]: [error] [vJyMDc-a] DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','geekotest',...) failed: FATAL: remaining connection slots are reserved for non-replication superuser connections at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at /usr/share/openqa/script/../lib/OpenQA/Schema.pm line 172
This led to various alerts being triggered (Minion jobs alert, HTTP Response alert, Workers alert). A restart of the main openqa-webui service and the postgresql service helped to fix the error. (Likely the restart of openqa-webui was unnecessary, considering the other services could recover on their own without a restart.)
I also retried the failed Minion jobs and all of them passed. So there shouldn't be any active warnings anymore.
The question is what caused the connection limit to be exceeded. Theoretically we have a fixed number of services using a fixed number of connections.
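For illustration, something along these lines can be used to check how close the server is to that limit (a rough sketch, not part of the original report, assuming local access to the openqa database with the geekotest role from the DSN in the error above):

#!/usr/bin/env perl
# Rough sketch: compare the number of active backends with max_connections and
# the connection slots reserved for superusers.
use Mojo::Base -strict;
use Mojo::Pg;

my $pg  = Mojo::Pg->new('postgresql://geekotest@/openqa');
my $row = $pg->db->query(
  q{SELECT count(*)                                               AS used,
           current_setting('max_connections')::int                AS max_conns,
           current_setting('superuser_reserved_connections')::int AS reserved
    FROM pg_stat_activity}
)->hash;
printf "%d of %d connection slots used (%d reserved for superusers)\n",
  $row->{used}, $row->{max_conns}, $row->{reserved};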
History
#2
Updated by okurz 4 months ago
- Status changed from New to In Progress
- Assignee set to mkittler
- Priority changed from Normal to Urgent
- Target version set to Ready
This is urgent and certainly you are already working on it. Everyone can help, of course, but we should make it obvious that this is actively tracked by at least one person.
#3
Updated by mkittler 4 months ago
PRs by Nick to improve the monitoring: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/364, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/365
In the meantime I've been restarting the database again to keep OSD operational. I also restarted the failed Minion jobs again. After the database restart I have been monitoring the number of database connections manually and saw a slow increase over time. At 62 connections I reverted the perl-Mojolicious package to version 8.59. The diff doesn't look that suspicious, but it was the only lead. After the downgrade the number of connections is around 49 and doesn't seem to increase further. So maybe we've found the culprit.
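A breakdown of pg_stat_activity by user and application name makes it easier to see which service is accumulating the connections; roughly along these lines (a sketch with the same local-access assumption as above, not necessarily the exact query used here):

#!/usr/bin/env perl
# Rough sketch: group the open connections by user and application name to see
# which service is slowly accumulating them.
use Mojo::Base -strict;
use Mojo::Pg;

my $pg      = Mojo::Pg->new('postgresql://geekotest@/openqa');
my $results = $pg->db->query(
  q{SELECT usename, application_name, count(*) AS connections
    FROM pg_stat_activity
    GROUP BY usename, application_name
    ORDER BY connections DESC}
);
printf "%-12s %-32s %s\n", map { $_ // '' } @$_{qw(usename application_name connections)}
  for $results->hashes->each;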
#7
Updated by kraih 4 months ago
I've started running some tests and so far I can only confirm that there's a risk of leaks. For example, Mojo::UserAgent does accumulate some garbage state information after a Mojo::IOLoop->reset call. But I've not been able to find anything in Minion or Mojo::Pg yet. There are a few more things I can try, though, to get to the root of it.
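As an illustration of that kind of leak check, a simplified probe can count the file descriptors that survive a Mojo::IOLoop->reset (a sketch, not the actual test mentioned above; Linux-only since it reads /proc/self/fd):

#!/usr/bin/env perl
# Simplified probe: register a listen socket and a client connection on the
# singleton loop, reset the loop and check whether the file descriptors were
# released again.
use Mojo::Base -strict;
use Mojo::IOLoop;

sub open_fds { my @fds = glob '/proc/self/fd/*'; return scalar @fds }

my $before = open_fds();

# Throwaway listen socket on a random port plus one client connection to it
my $id   = Mojo::IOLoop->server({address => '127.0.0.1'} => sub { });
my $port = Mojo::IOLoop->acceptor($id)->port;
Mojo::IOLoop->client({address => '127.0.0.1', port => $port} => sub { });
Mojo::IOLoop->one_tick;
my $during = open_fds();

# Reset the singleton loop, like Minion jobs (and subprocesses) do
Mojo::IOLoop->reset;
my $after = open_fds();

print "fds before: $before, with connections: $during, after reset: $after\n";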
#8
Updated by kraih 4 months ago
I'm getting closer; the Mojo::IOLoop->reset leak was probably just a trigger for other bugs to become visible. Here's a first fix for an upstream leak in Mojo::Pg: https://github.com/mojolicious/mojo-pg/compare/a546b0454d7a...29b22d36a953
#9
Updated by kraih 4 months ago
And a follow-up, with one commit also reverting the upstream change Marius suspected of being the cause: https://github.com/mojolicious/mojo/compare/b181827b1a75...0ca361b75e38
#10
Updated by kraih 4 months ago
I could not exactly replicate the problem from this ticket, but I think it wasn't actually the Minion jobs, but the subprocess from OpenQA::WebAPI::Controller::API::V1::Job. Subprocesses behave very similarly to Minion jobs and also call Mojo::IOLoop->reset. And due to the calling context there are a lot more file descriptors to leak.
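To make that concrete, this is roughly the pattern in question (a minimal standalone sketch, not the actual controller code):

#!/usr/bin/env perl
# Minimal sketch of the Mojo::IOLoop->subprocess pattern: the first callback
# runs in a forked child, which also resets the event loop, and the child
# inherits every file descriptor the parent process had open at that point.
use Mojo::Base -strict;
use Mojo::IOLoop;

Mojo::IOLoop->subprocess(
  sub {
    my $subprocess = shift;
    # Runs in the forked child process
    return "computed in PID $$";
  },
  sub {
    my ($subprocess, $err, @results) = @_;
    # Runs in the parent once the child has finished
    die $err if $err;
    say for @results;
  },
);

Mojo::IOLoop->start unless Mojo::IOLoop->is_running;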
#11
Updated by okurz 4 months ago
- Related to action #72196: t/24-worker-jobs.t fails in OBS added
#12
Updated by okurz 4 months ago
- Status changed from In Progress to Blocked
- Assignee changed from kraih to okurz
- Priority changed from Urgent to High
It turns out #72196 is a consequence of perl-Mojolicious 8.59->8.60 as well. So we have a clear, reproducible test with t/24-worker-jobs.t showing that 8.60 introduces problems. Of course we did not see that in the CircleCI tests because we only run the old version until the dependency PR is accepted; that PR showed the corresponding problems as well, but was additionally blocked by many CircleCI tests failing because the container image could not be retrieved from registry.opensuse.org. #69160 should help to prevent such deployments going forward by running tests within the Leap repos as well. As there had not been an automatic submit request for the new Mojolicious version yet, I have now created https://build.opensuse.org/request/show/839303 with a manual update. You can also use the repo from https://build.opensuse.org/package/show/home:okurz:branches:devel:languages:perl/perl-Mojolicious if you need the new, fixed version.
kraih, I guess with this you do not plan further immediate work. I will take over the ticket while waiting for the SR to be accepted. I am also reducing the priority as a fixed version is available and OSD was already hotpatched some days ago.
EDIT: SR accepted, now https://build.opensuse.org/request/show/839314 for openSUSE:Factory
#13
Updated by okurz 4 months ago
To prevent the broken version from being redeployed to OSD tomorrow, I will now try to link 8.61 into devel:openQA:Leap:15.1, overriding the current link from Factory:
for i in 1 2; do osc linkpac -f devel:languages:perl perl-Mojolicious devel:openQA:Leap:15.$i; done
After the SR to Factory https://build.opensuse.org/request/show/839314 has been accepted, we can revert this with
for i in 1 2; do osc linkpac -f openSUSE:Factory perl-Mojolicious devel:openQA:Leap:15.$i; done
#14
Updated by okurz 4 months ago
- Status changed from Blocked to In Progress
I will monitor https://build.opensuse.org/project/show/devel:openQA:Leap:15.1 and then see if I can retrigger the build dependency generation CI job that should create a new PR.
#15
Updated by okurz 4 months ago
- Due date set to 2020-10-13
- Status changed from In Progress to Feedback
https://app.circleci.com/pipelines/github/os-autoinst/openQA/4417/workflows/1cd6b181-9298-48b6-8cf9-db9b12781c3c/jobs/42437 created https://github.com/os-autoinst/openQA/pull/3446, which now has two approvals. Tests are currently running and, unless any of them fail, we should merge this PR to unblock the tests of other PRs.
Setting a due date as a reminder to check whether perl-Mojolicious-8.61 has reached Factory so that we can remove the workaround and point the link back to the Factory package.
#16
Updated by okurz 3 months ago
- Due date changed from 2020-10-13 to 2020-10-20
https://build.opensuse.org/request/show/839314 for the submission of perl-Mojolicious to openSUSE:Factory is still pending and I am getting annoyed by all these failed tests and failed package builds. I have now linked the fixed version directly into devel:openQA so that we also have that version available for any tests and builds based on Factory and/or Tumbleweed.
$ osc linkpac devel:languages:perl perl-Mojolicious devel:openQA
Sending meta data...
Done.
Creating _link...
Done.
I hope this works. This should be reverted as well, but here by deleting the (linked) package once the submission to openSUSE:Factory is accepted.