Project

General

Profile

action #112859

coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids

openQA Infrastructure - coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend

Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend"

Added by kraih about 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Organisational
Target version:
Start date:
2022-06-22
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

See #112718

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

History

#2 Updated by okurz about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

#3 Updated by okurz about 2 months ago

Results from our lessons learned meeting

What happened

On 2022-04-12 we ran into an incident of exhausting available IDs for job module results, handled in #109836 with follow-up to migrate relevant tables to "bigint". We already foresaw a potentially significant performance impact during migration on rollout, especially on OSD, so we planned the migration to happen starting on a Friday afternoon or later – because every good team deploys on a Friday afternoon :) – The migration was conducted on Saturday morning first on O3 (openqa.opensuse.org) and after no significant impact was observed the migration on OSD (openqa.suse.de) was started some hours later 2022-06-18 1000Z. The services on OSD were generally reachable 18 again minutes afterwards but with a performance impact causing high response times or timeouts.

Five Whys

  1. why didn't we see any bad impact directly after migration?

    • the migration itself finished but the query planner could not have been finished at this time. The webUI itself was up and reachable when okurz tested it but likely we would have seen problems looking at the http responses when trying multiple times. Also there were minion job alerts

    => think about improving our alerts #112736

  2. Why did OSD have performance issues a few hours after the big migration?

    • Initially after start of services everything was available but the query planner could not optimize the plans so significant problems can only show up over time => run pg_stat_statements_reset() and keep an eye on slowest queries -> https://github.com/os-autoinst/openQA/pull/4716/files#r904941555
    • We already have post-deployment checks in osd-deployment in various time periods after migration, e.g. 1h and 8h after migration. They already failed and signaled problems so all good
  3. Why did we not fix it sooner?

    • We observed some problems over the weekend and would have expected that they might even resolve themselves (first autovacuum should have triggered analyze), including the query planner. This is backed by our observation on o3 which was never showing significant problems and improved itself after two days
    • Also we did not want to make it worse because the services were generally available
    • It is a good idea to look at the monitoring data, not only alerts, after such exceptional deployments or changes
    • We needed the time to check our database changes for consistency and learn about "pg_stat" and explain analyze and vacuum analyze. Understanding our "rows" dashboard was really helpful. => It is a good idea to learn more about PostgreSQL, consider doing courses => #112946 Extend openQA documentation with best practices what to do after migration, e.g. look at pg_stats
  4. Why was OSD looking completely fine after Monday morning and only became bad again in the late afternoon?

    • maybe people were using OSD less and less and with our mitigations, like stopping openqa-scheduler, was helping to lessen the load
    • OSD was working on jobs, including the openqa-scheduler, but around 1k new jobs (nothing special) triggered a cascading effect with sequential scans from bad query plans

      => It is a good idea to switch off non-critical features and conduct emergency drills -> #112949

  5. Why was OSD so severly affected and could not recover itself but other instances not?

    • Less data in the relevant tables (explain analyze showed the same bad query plans)
    • Less bot requests -> Prepare blocklist/passlist for requests by user or user-agent -> "Quality of Service"/Rate Limiting for requests to deliberately slow down requests when the system is too loaded. Like rate limits for certain slow endpoints. There are likely reverse-proxy modules, e.g. for apache, nginx, haproxy -> #65271#note-101 -> label/comments for SQL queries to be able to identify them -> #65271#note-101
    • More performance headroom => get more hardware for OSD => move Postgres to a separate VM -> already existing ticket #99549

#4 Updated by okurz about 2 months ago

  • Status changed from In Progress to Resolved

All minutes are covered. Follow-up tasks are defined in tickets. Additionally a SUSE QE Tools workshop session is planned for 2022-07-08.

Also available in: Atom PDF