coordination #109846: [epic] Ensure all our database tables accommodate enough data, e.g. bigint for ids
openQA Infrastructure - coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend
Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend"
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
Updated by okurz over 1 year ago
Results from our lessons learned meeting
On 2022-04-12 we ran into an incident of exhausting available IDs for job module results, handled in #109836, with the follow-up to migrate relevant tables to "bigint". We already foresaw a potentially significant performance impact during the migration on rollout, especially on OSD, so we planned the migration to start on a Friday afternoon or later – because every good team deploys on a Friday afternoon :) – The migration was conducted on Saturday morning, first on O3 (openqa.opensuse.org); after no significant impact was observed there, the migration on OSD (openqa.suse.de) was started some hours later, 2022-06-18 1000Z. The services on OSD were generally reachable again 18 minutes afterwards, but with a performance impact causing high response times or timeouts.
Why didn't we see any bad impact directly after the migration?
- The migration itself finished, but the query planner statistics could not have been refreshed at this time. The webUI itself was up and reachable when okurz tested it, but we would likely have seen problems when looking at the HTTP responses over multiple tries. Also, there were minion job alerts
=> think about improving our alerts #112736
- It was the weekend and the load was low because nobody was scheduling or restarting jobs => It was a good idea to deploy explicitly during a time of low load
- Some performance regressions were expected directly after the migration (first autovacuum after migration) => plan a manual "vacuum analyze" for each table affected by the migration -> https://github.com/os-autoinst/openQA/pull/4716 => conduct an automatic "vacuum analyze" after migration -> #112718
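The planned manual "vacuum analyze" could look like the following sketch; the table names here are illustrative stand-ins for whichever tables had their id columns migrated to bigint:

```sql
-- Refresh planner statistics and clean up dead tuples right after the
-- migration instead of waiting for the first autovacuum.
-- Table names are examples, not the definitive migration list.
VACUUM (VERBOSE, ANALYZE) jobs;
VACUUM (VERBOSE, ANALYZE) job_modules;
```

Running this per table keeps each lock window short compared to a database-wide `VACUUM ANALYZE;`.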
Why did OSD have performance issues a few hours after the big migration?
- Initially, after the start of the services, everything was available, but the query planner had not yet optimized its plans, so significant problems could only show up over time => run pg_stat_statements_reset() and keep an eye on the slowest queries -> https://github.com/os-autoinst/openQA/pull/4716/files#r904941555
- We already have post-deployment checks in osd-deployment that run at various times after migration, e.g. 1h and 8h afterwards. These checks failed and signaled the problems, so they worked as intended
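The pg_stat_statements approach mentioned above can be sketched as follows (it assumes the pg_stat_statements extension is installed and loaded; the `mean_exec_time` column exists in PostgreSQL 13+, older versions call it `mean_time`):

```sql
-- Reset statement statistics right after the migration so only
-- post-migration behavior is measured.
SELECT pg_stat_statements_reset();

-- Later: look at the ten slowest queries by mean execution time.
SELECT query, calls, mean_exec_time, rows
  FROM pg_stat_statements
 ORDER BY mean_exec_time DESC
 LIMIT 10;
```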
Why did we not fix it sooner?
- We observed some problems over the weekend but expected that they might resolve themselves (the first autovacuum should have triggered an analyze and refreshed the query planner statistics). This is backed by our observation on O3, which never showed significant problems and improved by itself after two days
- Also, we did not want to make things worse, because the services were generally available
- It is a good idea to look at the monitoring data, not only alerts, after such exceptional deployments or changes
- We needed the time to check our database changes for consistency and to learn about "pg_stat", "explain analyze" and "vacuum analyze". Understanding our "rows" dashboard was really helpful. => It is a good idea to learn more about PostgreSQL, consider doing courses => #112946 Extend openQA documentation with best practices on what to do after a migration, e.g. look at pg_stats
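As a starting point for such post-migration checks, the standard statistics views already answer the key questions; the `SELECT * FROM jobs WHERE id = 4242` query below is just a hypothetical example of a suspect query:

```sql
-- Have the migrated tables been (auto)analyzed yet? NULLs first
-- surface tables the planner still has no fresh statistics for.
SELECT relname, last_analyze, last_autoanalyze, last_autovacuum
  FROM pg_stat_user_tables
 ORDER BY last_autoanalyze NULLS FIRST;

-- Inspect the actual plan and runtime of a suspect query.
EXPLAIN ANALYZE SELECT * FROM jobs WHERE id = 4242;
```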
Why was OSD looking completely fine after Monday morning and only became bad again in the late afternoon?
- Maybe people were using OSD less and less, and our mitigations, like stopping openqa-scheduler, were helping to lessen the load
- OSD was working on jobs, including the openqa-scheduler, but around 1k new jobs (nothing special) triggered a cascading effect of sequential scans from bad query plans
=> It is a good idea to switch off non-critical features and conduct emergency drills -> #112949
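Such a cascade of sequential scans is visible in the statistics views before it becomes an outage; a minimal check, assuming access to the OSD database, could be:

```sql
-- Tables where sequential scans read the most tuples: a strong hint
-- that the planner is choosing bad plans (e.g. missing fresh
-- statistics after a migration) instead of using indexes.
SELECT relname, seq_scan, seq_tup_read, idx_scan
  FROM pg_stat_user_tables
 ORDER BY seq_tup_read DESC
 LIMIT 10;
```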
Why was OSD so severely affected and unable to recover by itself, while other instances were not?
- Other instances have less data in the relevant tables (explain analyze showed the same bad query plans there)
- Other instances receive fewer bot requests -> prepare a blocklist/passlist for requests by user or user agent -> "Quality of Service"/rate limiting to deliberately slow down requests when the system is too loaded, e.g. rate limits for certain slow endpoints; there are likely reverse-proxy modules for this, e.g. for apache, nginx, haproxy -> #65271#note-101 -> labels/comments for SQL queries to be able to identify them -> #65271#note-101
- More performance headroom => get more hardware for OSD => move Postgres to a separate VM -> already existing ticket #99549