action #163592

Updated by livdywan 5 months ago

## Observation 

 We saw several alerts this morning: 
 http://stats.openqa-monitor.qa.suse.de/alerting/grafana/tm0h5mf4k/view?orgId=1 
 ``` 
 Date: Wed, 10 Jul 2024 04:45:31 +0200
 From: Grafana <osd-admins@suse.de>
 To: osd-admins@suse.de
 Subject: [FIRING:1] (HTTP Response alert Salt tm0h5mf4k)

 1 firing alert instance 
 [IMAGE] 

   🔥 1 firing instances 

 Firing [stats.openqa-monitor.qa.suse.de] 
 HTTP Response alert 
 View alert [stats.openqa-monitor.qa.suse.de] 
 Values 
 B0=6.424936551249999  
 Labels 
 alertname 
 HTTP Response alert 
 grafana_folder 
 Salt 
 ``` 

 ## Suggestions 
 * *DONE* Reduce global job limit 
 * Consider mitigations for route causing the overload 
 * Actively monitor while mitigations are applied 
 * Retrigger affected jobs 
 * Handle communication with affected users and downstream services 
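
 A minimal sketch of how the monitoring suggestion could be scripted while mitigations are applied. It assumes the alert value `B0` is an HTTP response time in seconds, which the alert mail does not state explicitly. The function names and the 5-second threshold are hypothetical and not taken from the Grafana alert rule.

```python
import time
import urllib.request


def measure_response_time(url: str, timeout: float = 30.0) -> float:
    """Return the wall-clock seconds needed to fetch `url` once."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.monotonic() - start


def is_overloaded(samples: list, threshold: float = 5.0) -> bool:
    """Return True if the mean of the samples exceeds the threshold.

    The default threshold is an illustrative assumption, not the
    value configured in the real alert rule.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return sum(samples) / len(samples) > threshold
```

 For example, `is_overloaded([measure_response_time(url) for _ in range(5)])` could be run periodically against the web UI to check whether a mitigation brings the response time back down.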

 ## Mitigations 
 * Manual changes to `/usr/share/openqa/lib/OpenQA/WebAPI.pm` on osd 
 * ~~https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules switched off~~ 
 * `sudo systemctl stop auto-update.timer` on osd 
 * Reduced the job limit in `/etc/openqa/openqa.ini` on osd from 420 to 300
 * okurz still suspects that the actions taken in #159396 might have an effect and suggests temporarily disabling the database backup to see whether that helps.
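
 The job limit change above could be scripted roughly as follows. This is only a sketch: the section name `scheduler` and the key `max_running_jobs` are assumptions about where the limit lives in `openqa.ini`, and the helper `set_job_limit` is hypothetical. Note that `configparser` drops comments when it rewrites a file, so a manual edit may be preferable on a production host.

```python
import configparser


def set_job_limit(path, limit, section="scheduler", key="max_running_jobs"):
    """Set the job limit in an INI file and return the previous value.

    The section and key defaults are assumptions, not confirmed by the
    ticket. The previous value (a string, or None if the key was unset)
    is returned so that a rollback can restore it later.
    """
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)
    if not config.has_section(section):
        config.add_section(section)
    previous = config.get(section, key, fallback=None)
    config.set(section, key, str(limit))
    with open(path, "w") as handle:
        config.write(handle)
    return previous
```

 Keeping the returned previous value (420 in this ticket) gives a concrete number to restore once the mitigation is no longer needed.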

 ## Rollback actions 
 * See mitigations. TODO (@livdywan): define rollback actions
