action #134927
closed
OSD throws 503, unresponsive for some minutes size:M
Start date: 2023-08-31
Due date:
% Done: 0%
Estimated time:
Description
Observation
User report from https://suse.slack.com/archives/C02CANHLANP/p1693474449529259:
"openqa.suse.de throws 503 and sometimes doesn't respond (timeout on http requests) - anyone else or is it just me?"
Monitoring also shows spotty HTTP responses: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1693471438216&to=1693475746164
Acceptance criteria
- AC1: Measures have been applied to make unresponsiveness of OSD during "many jobs upload" events unlikely
Suggestions
- Based on monitoring over multiple days, tweak a job limit value and apply it on OSD
- Think about relevant alerts -> done. We found that OSD does not respond to pings, e.g. from a worker, during an outage period, see https://monitor.qa.suse.de/explore?panes=%7B%22edM%22:%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22influxdb%22,%22uid%22:%22000000001%22%7D,%22resultFormat%22:%22time_series%22,%22orderByTime%22:%22ASC%22,%22tags%22:%5B%7B%22key%22:%22url::tag%22,%22value%22:%22openqa.suse.de%22,%22operator%22:%22%3D%22%7D%5D,%22groupBy%22:%5B%7B%22type%22:%22time%22,%22params%22:%5B%22$__interval%22%5D%7D,%7B%22type%22:%22tag%22,%22params%22:%5B%22host::tag%22%5D%7D,%7B%22type%22:%22fill%22,%22params%22:%5B%22null%22%5D%7D%5D,%22select%22:%5B%5B%7B%22type%22:%22field%22,%22params%22:%5B%22result_code%22%5D%7D,%7B%22type%22:%22mean%22,%22params%22:%5B%5D%7D%5D%5D,%22policy%22:%22autogen%22,%22measurement%22:%22ping%22%7D%5D,%22range%22:%7B%22from%22:%221694681402853%22,%22to%22:%221694686985098%22%7D%7D%7D&schemaVersion=1&orgId=1 . Without access to the hypervisor we cannot tell whether the system is just rebooting or will recover from being unresponsive, so we decided that we cannot come up with a better alert for now.
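One possible way to approach the first suggestion (deriving a job limit from monitoring data) is sketched below. This is only an illustration under assumptions: the function name `suggest_job_limit`, the sample format (pairs of running job count and whether the web UI answered normally) and all numbers are hypothetical and not taken from the OSD monitoring. The heuristic picks the highest job count that was still healthy below the lowest job count at which an outage was observed.

```python
from typing import Iterable, Optional, Tuple


def suggest_job_limit(samples: Iterable[Tuple[int, bool]]) -> Optional[int]:
    """Suggest a job limit from (running_jobs, web_ui_ok) samples.

    Hypothetical helper: returns the highest healthy job count that lies
    strictly below the lowest job count seen during an outage, or None if
    the samples give no evidence for a limit.
    """
    samples = list(samples)
    # Job counts observed while the web UI was failing (e.g. 503 or timeout).
    failing = [jobs for jobs, ok in samples if not ok]
    if not failing:
        return None  # no outage observed, nothing to derive a limit from
    lowest_failing = min(failing)
    # Healthy observations strictly below the first problematic load level.
    healthy_below = [jobs for jobs, ok in samples if ok and jobs < lowest_failing]
    return max(healthy_below) if healthy_below else None


# Made-up example data, one sample per monitoring interval.
example_samples = [
    (120, True),
    (180, True),
    (240, True),
    (260, True),
    (290, False),
    (310, False),
]
suggested_limit = suggest_job_limit(example_samples)
print(suggested_limit)
```

The result is only a starting point. As the suggestion says, the value would still need to be tweaked based on monitoring over multiple days before applying it on OSD.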