action #112859
closed
coordination #109846: [epic] Ensure all our database tables accommodate enough data, e.g. bigint for ids
openQA Infrastructure (public) - coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend
Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend"
Added by kraih over 2 years ago.
Updated over 2 years ago.
Description
Observation¶
See #112718
Acceptance criteria¶
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions¶
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
- Status changed from Workable to In Progress
- Assignee set to okurz
Results from our lessons learned meeting
What happened¶
On 2022-04-12 we ran into an incident of exhausting the available IDs for job module results, handled in #109836 with the follow-up to migrate the relevant tables to "bigint". We already foresaw a potentially significant performance impact of the migration on rollout, especially on OSD, so we planned the migration to start on a Friday afternoon or later (because every good team deploys on a Friday afternoon :)). The migration was conducted on Saturday morning first on O3 (openqa.opensuse.org), and after no significant impact was observed there, the migration on OSD (openqa.suse.de) was started some hours later at 2022-06-18 10:00Z. The services on OSD were generally reachable again 18 minutes afterwards, but with a performance impact causing high response times or timeouts.
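For illustration, a minimal sketch of this kind of type migration; the table and sequence names are hypothetical, and the actual openQA migration was carried out by its own schema deployment tooling:

```sql
-- Hypothetical example: widening an id column from integer to bigint.
-- The statement rewrites the whole table and holds an exclusive lock,
-- which is why the rollout was scheduled for a low-traffic window.
ALTER TABLE job_modules
  ALTER COLUMN id TYPE bigint;

-- The sequence backing the column may need to be widened as well.
ALTER SEQUENCE job_modules_id_seq AS bigint;
```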
Five Whys¶
Why didn't we see any bad impact directly after the migration?
- The migration itself finished, but the query planner's statistics could not have been up to date at that point. The web UI itself was up and reachable when okurz tested it, but we would likely have seen problems when looking at the HTTP responses of repeated requests. Also, there were Minion job alerts
=> Think about improving our alerts: #112736
Why did OSD have performance issues a few hours after the big migration?
- Initially, after the services were started, everything was reachable, but the query planner could not yet optimize its plans, so significant problems only showed up over time
=> Run pg_stat_statements_reset() and keep an eye on the slowest queries (see the sketch below) -> https://github.com/os-autoinst/openQA/pull/4716/files#r904941555
- We already have post-deployment checks in osd-deployment that run at various intervals after a migration, e.g. 1h and 8h afterwards. They did fail and signal the problems, so that part worked as intended
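A sketch of how pg_stat_statements could be used for that, assuming the extension is installed; the column names are those of PostgreSQL 13 and later (older releases use total_time/mean_time instead):

```sql
-- Reset the collected statistics right after the migration so that only
-- post-migration queries show up.
SELECT pg_stat_statements_reset();

-- Later: list the slowest queries by mean execution time.
SELECT calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       left(query, 80)                    AS query
  FROM pg_stat_statements
 ORDER BY mean_exec_time DESC
 LIMIT 10;
```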
Why did we not fix it sooner?
- We observed some problems over the weekend and expected that they might resolve themselves once the first autovacuum run had triggered an analyze and refreshed the query planner's statistics. This is backed by our observation on O3, which never showed significant problems and improved by itself after two days
- Also, we did not want to make things worse while the services were still generally available
- It is a good idea to look at the monitoring data, not only at alerts, after such exceptional deployments or changes
- We needed the time to check our database changes for consistency and to learn about the "pg_stat" views, EXPLAIN ANALYZE and VACUUM ANALYZE (see the sketch below). Understanding our "rows" dashboard was really helpful.
=> It is a good idea to learn more about PostgreSQL, consider doing courses
=> #112946: Extend the openQA documentation with best practices on what to do after a migration, e.g. look at pg_stats
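A sketch of those checks, with a hypothetical table name and query:

```sql
-- Check when autovacuum/autoanalyze last ran, i.e. whether the planner
-- statistics could already be up to date.
SELECT relname, last_autovacuum, last_autoanalyze
  FROM pg_stat_user_tables
 WHERE relname = 'job_modules';  -- hypothetical table name

-- Refresh the statistics manually instead of waiting for autovacuum.
VACUUM (VERBOSE, ANALYZE) job_modules;

-- Inspect the actual plan of a suspicious query to spot sequential scans.
EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM job_modules WHERE job_id = 42;  -- hypothetical query
```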
Why was OSD looking completely fine after Monday morning and only became bad again in the late afternoon?
- Maybe people were using OSD less and, together with our mitigations like stopping openqa-scheduler, this lessened the load
- OSD was working on jobs, including the openqa-scheduler, but around 1k new jobs (nothing special) triggered a cascading effect of sequential scans from bad query plans (see the sketch below)
=> It is a good idea to switch off non-critical features and conduct emergency drills -> #112949
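One way to spot such a build-up of sequential scans, sketched with the standard PostgreSQL statistics views:

```sql
-- Tables where sequential scans dominate over index scans are candidates
-- for bad query plans or missing/unused indexes.
SELECT relname,
       seq_scan,
       seq_tup_read,
       idx_scan
  FROM pg_stat_user_tables
 ORDER BY seq_tup_read DESC
 LIMIT 10;
```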
Why was OSD so severely affected and unable to recover by itself while other instances were not?
- Other instances have less data in the relevant tables (EXPLAIN ANALYZE showed the same bad query plans)
- Fewer bot requests
-> Prepare a blocklist/passlist for requests by user or user agent
-> "Quality of service"/rate limiting to deliberately slow down requests when the system is overloaded, e.g. rate limits for certain slow endpoints. There are likely reverse-proxy modules for this, e.g. for Apache, nginx, HAProxy -> #65271#note-101
-> Add labels/comments to SQL queries to be able to identify them (see the sketch below) -> #65271#note-101
- More performance headroom
=> Get more hardware for OSD
=> Move PostgreSQL to a separate VM -> already existing ticket #99549
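A sketch of the query-labeling idea from the list above: a leading SQL comment travels with the statement and shows up in pg_stat_activity and in statement logs, so a slow query can be mapped back to the endpoint or code path that issued it. The label text and column names here are made up for illustration:

```sql
-- The comment is part of the query text and therefore visible in
-- pg_stat_activity.query and in statement logs.
/* openQA route: /api/v1/jobs (hypothetical label) */
SELECT id, state, result
  FROM jobs
 WHERE state = 'running';
```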
- Status changed from In Progress to Resolved
All points from the meeting minutes are covered. Follow-up tasks are defined in tickets. Additionally, a SUSE QE Tools workshop session is planned for 2022-07-08.
- Related to action #126212: openqa.suse.de response times very slow. No alert fired size:M added