action #87838
closed
*alert* osd: Open database connections by user alert
Added by okurz almost 4 years ago.
Updated almost 4 years ago.
Description
Observation
Alert message received 2021-01-18 05:59, automatically resolved again at 06:04:
Alert message details:
90% of postgresql connections reached or connections are already not possible any more. Check on OSD what query is blocking with e.g.:
```
SELECT datid, datname, pid, usesysid, usename, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start
LIMIT 10;
```
Metric name    Value
TOTAL          113.000
- Status changed from New to In Progress
- Assignee set to okurz
- Status changed from In Progress to Feedback
I guess you were right, it wasn't just a one-off. I saw it pop up a couple more times now.
okurz wrote:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/430
So changing the alert from max to avg might help if we're above 90% briefly - should we split off a hard fail, though? Will we miss alerts when we can't run any more queries? 🤔
cdywan wrote:
I guess you were right, it wasn't just a one-off. I saw it pop up a couple more times now.
okurz wrote:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/430
So changing the alert from max to avg might help if we're above 90% briefly - should we split off a hard fail, though? Will we miss alerts when we can't run any more queries? 🤔
I would assume that if the system stays clogged for 5m we would still get a notification. As telegraf on osd runs in 10s intervals we could reduce this period. However, we only introduced the new alert in 2020-09, we have not done any tweaking yet, and the alert values "each 1m for 5m" are the defaults. So I would go with one change at a time and wait until the next time we either get an alert or miss one in case of problems.
I have paused the alert for now because there have been several more spikes recently. The CI jobs in GitLab fail, likely due to some recent Python problems introduced in upstream Tumbleweed.
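To illustrate the max versus avg question discussed above, here is a minimal sketch in Python. It is not the actual alert rule: the connection limit, the sample values and the helper `fires` are hypothetical, and only the 90% threshold, the 10s telegraf interval and the 5m window come from this ticket.

```python
from statistics import mean

# Hypothetical values for illustration only; the ticket does not state the limit.
MAX_CONNECTIONS = 100               # assumed PostgreSQL connection limit
THRESHOLD = 0.9 * MAX_CONNECTIONS   # the "90%" from the alert message
SAMPLES_PER_WINDOW = 30             # 5m window at telegraf's 10s interval


def fires(window, reducer):
    """Return True if the reduced value of the window exceeds the threshold."""
    return reducer(window) > THRESHOLD


# Brief spike: three samples (30s) near the limit, otherwise moderate load.
brief_spike = [60] * 27 + [98] * 3
# Sustained saturation: the whole window sits at the assumed limit.
clogged = [100] * SAMPLES_PER_WINDOW

print("brief spike:", fires(brief_spike, max), fires(brief_spike, mean))
print("clogged:    ", fires(clogged, max), fires(clogged, mean))
```

Under these assumptions the brief spike trips a max-based check but not an avg-based one, while a window that stays saturated trips both, which matches the expectation that a system clogged for 5m would still notify.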
- Due date set to 2021-01-26
Really? It looks like the numbers add up after my SR. At least the spikes I've checked are caused by geekotest. Its connections almost double when a peak occurs.
yes, I can confirm. My mouse cursor slipped ;)
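The per-user comparison described above can be sketched as follows. This is only an illustration: the snapshots are made-up numbers, not real OSD data, the `telegraf` database user and the helpers `connections_by_user` and `growth` are hypothetical, and the input is assumed to be the `usename` column of `pg_stat_activity`.

```python
from collections import Counter


def connections_by_user(usenames):
    """Count open connections per user from a list of usename values."""
    return Counter(usenames)


def growth(before, after):
    """Return the after/before ratio for users present in both snapshots."""
    return {u: after[u] / before[u] for u in after if before.get(u)}


# Made-up snapshots for illustration, not real OSD data.
normal = connections_by_user(["geekotest"] * 40 + ["telegraf"] * 2)
peak = connections_by_user(["geekotest"] * 78 + ["telegraf"] * 2)

ratios = growth(normal, peak)
# Name of the user whose connection count grew the most between snapshots.
print(max(ratios, key=ratios.get))
```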
- Status changed from Feedback to Resolved