action #87838: *alert* osd: Open database connections by user alert - openQA Infrastructure (public) - openSUSE Project Management Tool

action #87838

closed

alert osd: Open database connections by user alert

Added by okurz over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee: okurz
Category:
Target version: openQA Project (public) - Ready
Start date: 2021-01-18
Due date: 2021-01-26
% Done:
Estimated time:

Description

Observation

Alert message received 2021-01-18 05:59, automatically resolved again at 06:04:

Alert message details:

90% of postgresql connections reached or connections are already not possible any more. Check on OSD what query is blocking with e.g.:

```
SELECT datid, datname, pid, usesysid, usename, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start
LIMIT 10;
```

Metric name    Value
TOTAL          113.000
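
For orientation, the 90% condition from the alert text can be sketched as a simple ratio check. This is only an illustration: the function name `connections_alert` and the limit of 125 are assumptions, since this ticket does not state the connection limit configured on OSD.

```python
def connections_alert(total: int, max_connections: int, threshold: float = 0.9) -> bool:
    """Return True when open connections reach the given share of the limit."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return total / max_connections >= threshold


# 113 is the TOTAL reported by the alert above.
# 125 is a made-up limit used for illustration only, not the real OSD setting.
print(connections_alert(113, 125))
```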


Updated by okurz over 4 years ago

Status changed from New to In Progress
Assignee set to okurz

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=2&orgId=1&from=1610908184397&to=1610953743720 shows the interesting time period. There are some big spikes this morning, but the values return to normal quickly. The SQL query mentioned in the alert description does not list any relevant blocking queries. There are also not many users logged in:

$ last -a -n 30 --since 2021-01-18
leli     pts/0        Mon Jan 18 06:35   still logged in    10.163.30.246
leli     pts/0        Mon Jan 18 04:07 - 06:31  (02:24)     10.163.30.246


Updated by okurz over 4 years ago

Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/430


Updated by livdywan over 4 years ago

I guess you were right, it wasn't just a one-off. I saw it spiking a couple more times now.

okurz wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/430

So changing the alert from max to avg might help if we're above 90% briefly - should we split off a hard fail, though? Will we miss alerts when we can't run any more queries? 🤔


Updated by okurz over 4 years ago

cdywan wrote:

I guess you were right, it wasn't just a one-off. I saw it spiking a couple more times now.

okurz wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/430

So changing the alert from max to avg might help if we're above 90% briefly - should we split off a hard fail, though? Will we miss alerts when we can't run any more queries? 🤔

I would assume that if the system stays clogged for 5m we would still get a notification. As telegraf on osd runs in 10s intervals we could reduce this period. However, we only introduced the new alert in 2020-09, have not done any tweaking yet, and the alert values "each 1m for 5m" are the defaults. So I would go with one change at a time and wait until we either get another alert or miss an alert during an actual problem.
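
To illustrate the max versus avg trade-off discussed here, the following is a simplified sketch, not Grafana's actual evaluation logic. The helper `alert_fires`, the threshold of 100, and all sample values are assumptions made up for the example. Only the 10s sampling interval and the 5m window come from this thread.

```python
from statistics import mean


def alert_fires(samples, threshold, reducer):
    """Evaluate one alert window: reduce the samples and compare to the threshold."""
    return reducer(samples) > threshold


# Hypothetical 5-minute window sampled every 10s (30 samples); values are made up.
# It contains one short spike above a made-up threshold of 100 connections.
threshold = 100
short_spike = [60] * 27 + [113, 110, 70]

# Hypothetical window where the database stays clogged for the whole 5 minutes.
clogged = [113] * 30

print("max, short spike:", alert_fires(short_spike, threshold, max))
print("avg, short spike:", alert_fires(short_spike, threshold, mean))
print("avg, clogged:", alert_fires(clogged, threshold, mean))
```

With these made-up numbers the brief spike trips the max-based condition but not the avg-based one, while a window that stays clogged still trips the avg-based condition.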


Updated by okurz over 4 years ago

I have paused the alert for now because there have been several more spikes recently. The CI jobs in gitlab fail, likely due to some recent Python problems introduced in upstream Tumbleweed.


Updated by okurz over 4 years ago

Due date set to 2021-01-26

Merged and active. There was another change by mkittler to show all users to which the connections belong, as recorded by the database. I have unpaused the alert in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=2&tab=alert again now that the monitoring data looks better. However, there are still spikes and I could not yet see which user these connections might belong to, for example whether they are "internal" connections.


Updated by mkittler over 4 years ago

Really? It looks like the numbers add up after my SR. At least the spikes I've checked are caused by geekotest. Its connections almost double when a peak occurs.
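
The kind of check described here can be sketched as a comparison of two per-user snapshots. This is a hypothetical illustration: the helper `per_user_delta`, the user names other than geekotest, and all counts are made up and are not taken from the OSD monitoring data.

```python
def per_user_delta(before, during):
    """Return how much each user's connection count changed between two snapshots."""
    users = set(before) | set(during)
    return {user: during.get(user, 0) - before.get(user, 0) for user in users}


# Made-up per-user connection counts before and during a peak.
# "postgres" and "telegraf" are placeholder user names for illustration.
baseline = {"geekotest": 50, "postgres": 3, "telegraf": 1}
peak = {"geekotest": 98, "postgres": 3, "telegraf": 1}

delta = per_user_delta(baseline, peak)
top_user = max(delta, key=delta.get)

# The per-user counts should add up to the total shown by the alert panel.
print("total during peak:", sum(peak.values()))
print("largest increase:", top_user, delta[top_user])
```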
