coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend - openQA Infrastructure (public) - openSUSE Project Management Tool

# Observation 
 We received many alerts over the weekend about failed minion jobs and other problems. Checking Grafana, I can see that the problem started on Saturday, 18 June, around 13:00 CET: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1655549105000&to=now 
 The number of returned PostgreSQL rows looks very suspicious and is now five times as high as before: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1655475539000&to=now&viewPanel=89 


 ## Rollback and cleanup steps 
 * *DONE:* on osd `systemctl enable --now telegraf` 
 * on osd `systemctl unmask --now salt-master` and ensure that /etc/telegraf/telegraf.d/telegraf-webui.conf is reverted 
 * on osd `systemctl unmask --now openqa-scheduler` 
 * *DONE:* Retrigger all incomplete jobs since 2022-06-18 with https://github.com/os-autoinst/scripts/blob/master/openqa-advanced-retrigger-jobs 
 * Retrigger failed obs-sync trigger events: https://openqa.suse.de/admin/obs_rsync/-/pipelines 
 * Retrigger failed qem-bot trigger events: https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipelines 
 * *DONE:* Retrigger failed openQA bot trigger events: https://gitlab.suse.de/qa-maintenance/openQABot/-/pipelines 
 * On openqaworker10 and openqaworker13, unmask and start the worker services: `sudo systemctl unmask --now openqa-worker-cacheservice openqa-worker@{1..20}`
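
After working through the steps above, it can help to confirm that each unit is really unmasked and running again. The following is only a sketch: `check_unit` and the `SYSTEMCTL` override are hypothetical names introduced here, not something from this ticket or from openQA. It relies only on the standard `systemctl is-enabled` and `systemctl is-active` subcommands.

```shell
#!/bin/sh
# Hypothetical helper, not part of the ticket: report whether a systemd unit
# is unmasked and running after the rollback steps.
# SYSTEMCTL can be overridden (e.g. with a stub) for dry runs.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

check_unit() {
    unit="$1"
    # Both subcommands print the state and exit non-zero for "bad" states,
    # so the exit status is ignored here and the printed state is inspected.
    enabled="$("$SYSTEMCTL" is-enabled "$unit" 2>/dev/null || true)"
    active="$("$SYSTEMCTL" is-active "$unit" 2>/dev/null || true)"
    printf '%s enabled=%s active=%s\n' "$unit" "${enabled:-unknown}" "${active:-unknown}"
    # Succeed only if the unit is running and no longer masked.
    [ "$active" = "active" ] && [ "$enabled" != "masked" ]
}
```

On osd this could be used as `for u in telegraf salt-master openqa-scheduler; do check_unit "$u" || echo "needs attention: $u"; done`. On the workers, the same loop would run over `openqa-worker-cacheservice` and the `openqa-worker@N` instances.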
