coordination #112718

Updated by okurz almost 2 years ago

## Observation 
 We received many alerts over the weekend about failed minion jobs and other problems. Grafana shows that the problem started on Saturday, 18 June 2022, around 13:00 CET: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1655549105000&to=now 
 The number of rows returned by PostgreSQL looks very suspicious and is now five times as high as before: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1655475539000&to=now&viewPanel=89 

 ## Suggestions 
 * Load [OSD database dump](https://progress.opensuse.org/projects/openqav3/wiki/#Backup) from after the incident started and try to reproduce the problem 
 * Research how to find out where heavy queries come from 
 * Research what can cause rows returned to grow from <100k to 20-60M 
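
 A possible starting point for the last two suggestions, as a minimal sketch only. It assumes that the `pg_stat_statements` extension is enabled on the database host, that PostgreSQL is version 13 or newer (older versions name the column `total_time` instead of `total_exec_time`), and that the database is called `openqa`. None of these assumptions is confirmed by this ticket. The helper name `heavy_queries_sql` is hypothetical; it only prints the query so that it can be reviewed before it is piped into `psql`.

```shell
#!/bin/sh
# Hypothetical helper: print a query that lists the statements which
# returned the most rows. Assumes the pg_stat_statements extension is
# enabled (it must be listed in shared_preload_libraries).
heavy_queries_sql() {
    limit="${1:-10}"   # number of statements to show, default 10
    cat <<EOF
SELECT calls, rows, round(total_exec_time) AS total_ms, left(query, 120) AS query
FROM pg_stat_statements
ORDER BY rows DESC
LIMIT ${limit};
EOF
}

# Usage (the database name "openqa" is an assumption):
#   heavy_queries_sql 20 | sudo -u postgres psql openqa
```

 Sorting by `rows` rather than by execution time matches the symptom seen in Grafana, where the returned row count grew far more than expected.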

 ## Problem 
 * **H1:** The migration to "bigint" has triggered a query planner update, causing the planner to end up with sub-optimal query plans. As auto-vacuum also eventually triggers "ANALYZE", we assume that the system would eventually recover on its own by using optimized queries again. This is likely what happened on o3 after a period of 1-2 days. On OSD we do not have enough performance headroom (in particular CPU and potentially disk I/O) to cover such periods. 
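
 One way H1 could be checked, as a sketch under assumptions: the standard `pg_stat_user_tables` view records when each table's planner statistics were last refreshed, either manually (`last_analyze`) or by auto-vacuum (`last_autoanalyze`). If the largest tables show no refresh since the migration, that would be consistent with H1. The helper name `analyze_status_sql` and the database name `openqa` are hypothetical and not taken from this ticket.

```shell
#!/bin/sh
# Hypothetical helper: print a query showing when planner statistics were
# last refreshed for the largest tables (manual ANALYZE vs. auto-vacuum).
analyze_status_sql() {
    limit="${1:-10}"   # number of tables to show, default 10
    cat <<EOF
SELECT relname, n_live_tup, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC
LIMIT ${limit};
EOF
}

# Usage (the database name "openqa" is an assumption):
#   analyze_status_sql | sudo -u postgres psql openqa
```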

 ## Rollback and cleanup steps 
 * *DONE:* on osd `systemctl enable --now telegraf` 
 * *DONE:* on osd `systemctl unmask --now salt-master` and ensure that /etc/telegraf/telegraf.d/telegraf-webui.conf is reverted 
 * on osd `systemctl unmask --now openqa-scheduler` -> #112718#note-57 
 * *DONE:* Retrigger all incomplete jobs since 2022-06-18 with https://github.com/os-autoinst/scripts/blob/master/openqa-advanced-retrigger-jobs -> #112718#note-57 
 * *DONE:* Retrigger failed obs-sync trigger events: https://openqa.suse.de/admin/obs_rsync/ https://openqa.suse.de/admin/obs_rsync/-/pipelines 
 * *DONE:* Retrigger failed qem-bot trigger events: https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipelines 
 * *DONE:* Retrigger failed openQA bot trigger events: https://gitlab.suse.de/qa-maintenance/openQABot/-/pipelines 
 * *DONE:* Unmask and start on openqaworker10 and 13: `sudo systemctl unmask --now openqa-worker-cacheservice openqa-worker@{1..20}` 
 * *DONE:* Remove /etc/openqa/templates/main/index.html.ep 
 * Apply salt high state and check that files are back to maintained format, e.g. 
  * *DONE:* /usr/share/openqa/script/openqa-gru 
 * on osd `systemctl unmask --now salt-master` and ensure that /etc/telegraf/telegraf.d/telegraf-webui.conf is reverted 
 * Unpause alerts: 
  * Broken workers 
  * Failed systemd services (except openqa.suse.de) 
  * Open database connections by user 
  * openqa-scheduler.service 
  * salt-master.service 
  * web UI: Too many minion job failures
