Project

General

Profile

Actions

coordination #112718

closed

openQA Project - coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids

[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend

Added by nicksinger almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2022-06-22
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

Observation

We received a lot of alerts over the weekend regarding failed minion jobs and others. Checking Grafana I can see that the problem started Saturday, 18th of June around 13:00 CET: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1655549105000&to=now
The amount of returned PostgreSQL rows looks very suspicious and is now five times as high as before: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1655475539000&to=now&viewPanel=89

Suggestions

  • Load OSD database dump from after the incident started and try to reproduce the problem
  • Research how to find out where heavy queries come from
  • Research what can cause rows returned to grow from <100k to 20-60M

Problem

  • H1: The migration to "bigint" has triggered a query planner update causing to end up with sub-optimal routing. As also auto-vacuum is eventually triggering "ANALYZE" we assume that eventually the system would recover automatically by using optimized queries. This is likely what happened on o3 after the period of 1-2 days. On OSD we do not have enough performance headroom (in particular CPU and potentially disk I/O) to cover for such periods.

Rollback and cleanup steps


Files

vaccuum-analyze-o3.txt (16.6 KB) vaccuum-analyze-o3.txt tinita, 2022-06-22 12:36
o3-munin-pg-rows.png (66.7 KB) o3-munin-pg-rows.png tinita, 2022-06-22 15:25
vaccuum-analyze-osd.txt (7.9 KB) vaccuum-analyze-osd.txt tinita, 2022-06-22 20:47
postgres_queries3_openqa-week.png (29.1 KB) postgres_queries3_openqa-week.png PostgreSQL database queries - week - 2022-06-15 to 2022-06-22 okurz, 2022-06-23 06:13
Screenshot_20220623_081721_osd_spotty_http_response.png (36.5 KB) Screenshot_20220623_081721_osd_spotty_http_response.png spotty OSD response starting 2022-06-18, some minutes after "bigint" deployment okurz, 2022-06-23 06:18
Screenshot_20220623_082117_increase_in_rows_returned_after_bigint_migration.png (71 KB) Screenshot_20220623_082117_increase_in_rows_returned_after_bigint_migration.png Increase in rows returned in PostgreSQL queries okurz, 2022-06-23 06:21
Screenshot_20220623_082633_different_mitigations_to_OSD_incident.png (49.4 KB) Screenshot_20220623_082633_different_mitigations_to_OSD_incident.png Different mitigations as visible in PostgreSQL queries okurz, 2022-06-23 06:27

Subtasks 1 (0 open1 closed)

openQA Project - action #112859: Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend"Resolvedokurz2022-06-22

Actions

Related issues 11 (4 open7 closed)

Related to openQA Project - action #112265: Just use bigint in all our database tables (for auto-incremented ID-columns) size:MResolvedmkittler2022-06-10

Actions
Related to openQA Project - action #112736: Better alert based on 2022-06-18 incident size:MResolvedmkittler2022-06-20

Actions
Related to openQA Infrastructure - action #112733: Webui Summary dashboard in Grafana is missing I/O panels size:MResolvednicksinger2022-06-20

Actions
Related to openQA Project - action #54902: openQA on osd fails at "incomplete" status when uploading, "502 response: Proxy Error"Resolvedokurz2019-07-31

Actions
Related to openQA Infrastructure - action #112769: errors from telegraf on osd webUI "[inputs.procstat] Error in plugin: open /sys/fs/cgroup/systemd/system.slice/salt-master.service/cgroup.procs: no such file or directory", same for openqa-scheduler.serviceRejectedokurz2022-06-21

Actions
Related to openQA Project - coordination #99549: [epic] Split production workload onto multiple hosts (focusing on OSD)New2021-09-30

Actions
Related to openQA Project - action #94705: Monitor number of SQL queries in grafanaNew2021-06-25

Actions
Related to openQA Infrastructure - action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:MResolvedmkittler2022-09-18

Actions
Copied to openQA Infrastructure - action #112787: Use different users for our database accesses so that we can debug where load comes fromNew

Actions
Copied to openQA Infrastructure - action #112850: Make pgstatstatements persistent on OSD (in salt)New

Actions
Copied to openQA Infrastructure - action #112916: postgresql.conf is invalid after recent salt changes size:MResolvedokurz2022-06-22

Actions
Actions

Also available in: Atom PDF