action #87838
closed
*alert* osd: Open database connections by user alert
0%
Description
Observation
Alert message received 2021-01-18 05:59, automatically resolved again at 06:04:
Alert message details:
90% of postgresql connections reached or connections are already not possible any more. Check on OSD what query is blocking with e.g.:
```
SELECT datid, datname, pid, usesysid, usename, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start
LIMIT 10;
```
| Metric name | Value   |
|-------------|---------|
| TOTAL       | 113.000 |
Updated by okurz over 3 years ago
- Status changed from New to In Progress
- Assignee set to okurz
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=2&orgId=1&from=1610908184397&to=1610953743720 shows the interesting time period. There are some big spikes this morning, but the values return to normal quickly. The SQL query mentioned in the alert description does not list any relevant blocking queries. There are also not many users logged in:
$ last -a -n 30 --since 2021-01-18
leli pts/0 Mon Jan 18 06:35 still logged in 10.163.30.246
leli pts/0 Mon Jan 18 04:07 - 06:31 (02:24) 10.163.30.246
Updated by okurz over 3 years ago
- Status changed from In Progress to Feedback
Updated by livdywan over 3 years ago
I guess you were right, it wasn't just a one-off. I saw it cropping up a couple more times now.
okurz wrote:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/430
So changing the alert from max to avg might help if we're above 90% briefly - should we split off a hard fail, though? Will we miss alerts when we can't run any more queries? 🤔
Updated by okurz over 3 years ago
cdywan wrote:
I guess you were right, it wasn't just a one-off. I saw it cropping up a couple more times now.
okurz wrote:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/430
So changing the alert from max to avg might help if we're above 90% briefly - should we split off a hard fail, though? Will we miss alerts when we can't run any more queries? 🤔
I would assume that if the system stays clogged for 5m we would still get a notification. As telegraf on osd runs in 10s intervals, we could reduce this period. However, we only introduced the new alert in 2020-09, have not done any tweaking yet, and the alert values "each 1m for 5m" are the defaults. So I would go with one change at a time and wait until we either get another alert or miss an alert in case of problems.
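To illustrate the max vs. avg reasoning above, here is a minimal sketch in Python. It is a simplified model and not how Grafana evaluates alerts internally. The connection limit of 125, the baseline of 60 connections and the spike length are hypothetical; only the 90% threshold, the 10s telegraf interval, the 5m window and the observed value of 113 come from this ticket.

```python
from statistics import mean

# Hypothetical values for illustration; the real connection limit on OSD
# is not stated in this ticket.
MAX_CONNECTIONS = 125
THRESHOLD = 0.9 * MAX_CONNECTIONS  # alert when above 90% of the limit

SAMPLES_PER_WINDOW = 5 * 60 // 10  # 5m window of 10s telegraf samples = 30


def fires(samples, reducer):
    """Return True if the reduced value of one window exceeds the threshold."""
    return reducer(samples) > THRESHOLD


# A short spike: two samples (about 20s) at 113 connections, otherwise calm.
brief_spike = [60] * (SAMPLES_PER_WINDOW - 2) + [113] * 2
# A clogged system: the whole window stays at 113 connections.
clogged = [113] * SAMPLES_PER_WINDOW

spike_fires_with_max = fires(brief_spike, max)
spike_fires_with_avg = fires(brief_spike, mean)
clogged_fires_with_avg = fires(clogged, mean)

# prints: True False True
print(spike_fires_with_max, spike_fires_with_avg, clogged_fires_with_avg)
```

In this model a brief spike trips the alert with max but not with avg, while a system that stays clogged for the whole window still trips it with avg.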
Updated by okurz over 3 years ago
I have paused the alert for now because there have been several more spikes recently. The CI jobs in GitLab fail, likely due to some recent Python problems introduced in upstream Tumbleweed.
Updated by okurz over 3 years ago
- Due date set to 2021-01-26
Merged and active. There was another change by mkittler to show all users to which the connections belong, as recorded by the database. I have unpaused the alert in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=2&tab=alert again now that the monitoring data looks better. However, there are still spikes and I could not yet see which user these connections might belong to, for example whether they are "internal" or something similar.
Updated by mkittler over 3 years ago
Really? It looks like the numbers add up after my SR. At least the spikes I've checked are caused by geekotest. Its connections almost double when a peak occurs.
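A per-user breakdown like the one described above could be sketched as follows in Python. The query string uses the standard `pg_stat_activity` view and its `usename` column, but it is an assumption and not the query behind the dashboard panel. In the sample data only the user name geekotest comes from this ticket; the other user names and all counts are made up, and the helper names are hypothetical.

```python
from collections import Counter

# Aggregate query one could run against PostgreSQL; pg_stat_activity and its
# usename column are standard, the query itself is not taken from the ticket.
QUERY = """
SELECT usename, count(*) AS connections
FROM pg_stat_activity
GROUP BY usename
ORDER BY connections DESC;
"""


def connections_by_user(usenames):
    """Count connections per database user from a list of usename values."""
    return Counter(usenames)


def biggest_increase(before, after):
    """Return the user whose connection count grew the most, or None."""
    growth = after - before  # Counter subtraction keeps only positive differences
    return growth.most_common(1)[0][0] if growth else None


# Hypothetical snapshots: only "geekotest" comes from the ticket; the other
# user names and all counts are made up.
normal = connections_by_user(["geekotest"] * 40 + ["postgres"] * 3 + ["telegraf"] * 2)
peak = connections_by_user(["geekotest"] * 78 + ["postgres"] * 3 + ["telegraf"] * 2)

print(biggest_increase(normal, peak))  # prints: geekotest
```

Comparing a snapshot taken during a peak with one taken during normal operation points at the user whose connections grew the most.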
Updated by okurz over 3 years ago
yes, I can confirm. My mouse cursor slipped ;)
Updated by okurz over 3 years ago
- Status changed from Feedback to Resolved
No more spikes for some days and no alert anymore: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=2&orgId=1&from=now-7d&to=now