action #134927

Updated by okurz 8 months ago

## Observation 
 User report from https://suse.slack.com/archives/C02CANHLANP/p1693474449529259: 
 > openqa.suse.de throws 503 and sometimes doesn't respond (timeout on http requests) - anyone else or is it just me? 

 Monitoring also shows spotty HTTP responses: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1693471438216&to=1693475746164 

 ![Screenshot_20230831_131009_grafana_spotty_http_response](Screenshot_20230831_131009_grafana_spotty_http_response) 

 ## Acceptance criteria 
 * **AC1:** Measures have been applied to make unresponsiveness of OSD during "many jobs upload" events unlikely 

 ## Suggestions 
 * Based on monitoring over multiple days, tweak a jobs limit value and apply it on OSD 
 * Think about relevant alerts -> *done*: we found that OSD does not respond to pings, e.g. from a worker, during an outage period, see https://monitor.qa.suse.de/explore?panes=%7B%22edM%22:%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22influxdb%22,%22uid%22:%22000000001%22%7D,%22resultFormat%22:%22time_series%22,%22orderByTime%22:%22ASC%22,%22tags%22:%5B%7B%22key%22:%22url::tag%22,%22value%22:%22openqa.suse.de%22,%22operator%22:%22%3D%22%7D%5D,%22groupBy%22:%5B%7B%22type%22:%22time%22,%22params%22:%5B%22$__interval%22%5D%7D,%7B%22type%22:%22tag%22,%22params%22:%5B%22host::tag%22%5D%7D,%7B%22type%22:%22fill%22,%22params%22:%5B%22null%22%5D%7D%5D,%22select%22:%5B%5B%7B%22type%22:%22field%22,%22params%22:%5B%22result_code%22%5D%7D,%7B%22type%22:%22mean%22,%22params%22:%5B%5D%7D%5D%5D,%22policy%22:%22autogen%22,%22measurement%22:%22ping%22%7D%5D,%22range%22:%7B%22from%22:%221694681402853%22,%22to%22:%221694686985098%22%7D%7D%7D&schemaVersion=1&orgId=1 . Without access to the hypervisor we cannot tell whether the system is just rebooting or will recover from being unresponsive, so we decided that we cannot come up with a better alert for now.
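
 One possible way to turn the "tweak a jobs limit based on monitoring" suggestion into a number is sketched below. This is only an illustration: the function name `suggest_jobs_limit`, the sample format (running job count, whether an HTTP probe succeeded) and the 99% success threshold are assumptions made here, not something taken from OSD or openQA.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def suggest_jobs_limit(
    samples: Iterable[Tuple[int, bool]],
    min_ok_ratio: float = 0.99,
) -> int:
    """Suggest a jobs limit from (running_jobs, http_ok) monitoring samples.

    Hypothetical heuristic: walk the observed running-job counts in
    ascending order and return the highest count reached before the first
    level whose HTTP success ratio falls below min_ok_ratio.
    Returns 0 if there are no samples or the lowest level is already bad.
    """
    # Count total and successful probes per observed running-job level.
    totals = defaultdict(int)
    successes = defaultdict(int)
    for running_jobs, http_ok in samples:
        totals[running_jobs] += 1
        if http_ok:
            successes[running_jobs] += 1

    limit = 0
    for running_jobs in sorted(totals):
        ok_ratio = successes[running_jobs] / totals[running_jobs]
        if ok_ratio < min_ok_ratio:
            break  # first level at which the web UI became unreliable
        limit = running_jobs
    return limit


if __name__ == "__main__":
    # Made-up sample data purely for demonstration.
    demo = [(100, True), (100, True), (200, True), (300, False), (300, False)]
    print(suggest_jobs_limit(demo, min_ok_ratio=0.9))
```

 The result would still need a safety margin and a sanity check against the dashboards before being applied on OSD.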
