action #87838


*alert* osd: Open database connections by user alert

Added by okurz almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: High
Assignee: okurz
Category: -
Target version: -
Start date: 2021-01-18
Due date: 2021-01-26
% Done: 0%
Estimated time: -

Description

Observation

Alert message received 2021-01-18 05:59; the automatic resolution followed at 06:04:

Alert message details:

90% of postgresql connections reached or connections are already not possible any more. Check on OSD what query is blocking with e.g.:

```
SELECT datid, datname, pid, usesysid, usename, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start
LIMIT 10;
```

Metric name: TOTAL
Value: 113.000
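
For reference, the 90% in the alert relates the current connection count to PostgreSQL's max_connections. A minimal sketch to check how close OSD is to that threshold (assuming only the standard pg_stat_activity view and current_setting(); this query is not part of the original alert text):

```
-- approximate: pg_stat_activity also lists background workers
SELECT count(*) AS current_connections,
       current_setting('max_connections')::int AS max_connections,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS used_pct
FROM pg_stat_activity;
```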

Actions #1

Updated by okurz almost 4 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=2&orgId=1&from=1610908184397&to=1610953743720 shows the interesting time period. There are some big spikes this morning, but they return to normal quickly. The SQL query mentioned in the alert description does not list any relevant blocking queries. There are also not many users logged in:

$ last -a -n 30 --since 2021-01-18
leli     pts/0        Mon Jan 18 06:35   still logged in    10.163.30.246
leli     pts/0        Mon Jan 18 04:07 - 06:31  (02:24)     10.163.30.246
Actions #2

Updated by okurz almost 4 years ago

  • Status changed from In Progress to Feedback
Actions #3

Updated by livdywan almost 4 years ago

I guess you were right, it wasn't just a one-off. I have seen it spike a couple more times now.

okurz wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/430

So changing the alert from max to avg might help if we're above 90% briefly - should we split off a hard fail, though? Will we miss alerts when we can't run any more queries? 🤔

Actions #4

Updated by okurz almost 4 years ago

livdywan wrote:

I guess you were right, it wasn't just a one-off. I have seen it spike a couple more times now.

okurz wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/430

So changing the alert from max to avg might help if we're above 90% briefly - should we split off a hard fail, though? Will we miss alerts when we can't run any more queries? 🤔

I would assume that if the system stays clogged for 5 minutes we would still get a notification. As telegraf on OSD runs in 10 s intervals we could reduce this period, but since we only introduced the new alert in 2020-09, have not done any tweaking yet, and the alert values "each 1m for 5m" are the defaults, I would go with one change at a time and wait until the next time we either get an alert or miss one in case of problems.

Actions #5

Updated by okurz almost 4 years ago

I have paused the alert for now because there have been multiple more spikes recently. The CI jobs in GitLab fail, likely due to recent Python problems introduced in upstream Tumbleweed.

Actions #6

Updated by okurz almost 4 years ago

  • Due date set to 2021-01-26

Merged and active. There was another change by mkittler to show all users to which the connections belong, as recorded by the database. I have unpaused the alert in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=2&tab=alert again after the monitoring data started looking better. However, there are still spikes and I could not yet see which user these connections belong to, e.g. whether they are "internal" or something.
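
A query along these lines could show the per-user breakdown (a minimal sketch against the standard pg_stat_activity view; not necessarily what the monitoring change itself uses):

```
-- count open connections per database user
SELECT usename, count(*) AS connections
FROM pg_stat_activity
GROUP BY usename
ORDER BY connections DESC;
```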

Actions #7

Updated by mkittler almost 4 years ago

Really? It looks like the numbers add up after my SR. At least the spikes I've checked are caused by geekotest. Its connections almost double when a peak occurs.
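
To inspect what those connections are doing during a peak, something like the following could be used (a sketch; "geekotest" is the user named above, pg_stat_activity is standard PostgreSQL):

```
-- show state and current query of all geekotest connections, oldest first
SELECT pid, state, query_start, left(query, 80) AS query
FROM pg_stat_activity
WHERE usename = 'geekotest'
ORDER BY query_start;
```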

Actions #8

Updated by okurz almost 4 years ago

yes, I can confirm. My mouse cursor slipped ;)

Actions #9

Updated by okurz almost 4 years ago

  • Status changed from Feedback to Resolved