action #134927
closed
OSD throws 503, unresponsive for some minutes size:M
Description
Observation
User report we heard in https://suse.slack.com/archives/C02CANHLANP/p1693474449529259:
"openqa.suse.de throws 503 and sometimes doesn't respond (timeout on http requests) - anyone else or is it just me?"
Also a spotty HTTP response: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1693471438216&to=1693475746164
Acceptance criteria
- AC1: Measures have been applied to make unresponsiveness of OSD during "many jobs upload" events unlikely
Suggestions
- Based on monitoring over multiple days, tweak the job limit value and apply it on OSD
- Think about relevant alerts -> done: we found that OSD does not respond to pings, e.g. from workers, during an outage period, see e.g. https://monitor.qa.suse.de/explore?panes=%7B%22edM%22:%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22influxdb%22,%22uid%22:%22000000001%22%7D,%22resultFormat%22:%22time_series%22,%22orderByTime%22:%22ASC%22,%22tags%22:%5B%7B%22key%22:%22url::tag%22,%22value%22:%22openqa.suse.de%22,%22operator%22:%22%3D%22%7D%5D,%22groupBy%22:%5B%7B%22type%22:%22time%22,%22params%22:%5B%22$__interval%22%5D%7D,%7B%22type%22:%22tag%22,%22params%22:%5B%22host::tag%22%5D%7D,%7B%22type%22:%22fill%22,%22params%22:%5B%22null%22%5D%7D%5D,%22select%22:%5B%5B%7B%22type%22:%22field%22,%22params%22:%5B%22result_code%22%5D%7D,%7B%22type%22:%22mean%22,%22params%22:%5B%5D%7D%5D%5D,%22policy%22:%22autogen%22,%22measurement%22:%22ping%22%7D%5D,%22range%22:%7B%22from%22:%221694681402853%22,%22to%22:%221694686985098%22%7D%7D%7D&schemaVersion=1&orgId=1. Without access to the hypervisor we do not know whether the system is just rebooting or will recover from being unresponsive, so we decided we cannot come up with a better alert for now. The ping query behind that link is sketched below.
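For reference, the Grafana explore link above encodes an InfluxQL query over the ping measurement (mean of result_code, filtered to url = openqa.suse.de, grouped by time and host). A minimal sketch of running it directly, assuming an InfluxDB 1.x CLI and that the data lives in a database named telegraf (both assumptions, not stated in this ticket):

```sh
# Sketch of the ping query encoded in the Grafana explore URL above.
# Assumptions: InfluxDB 1.x "influx" CLI and a database named "telegraf";
# time(1m) stands in for Grafana's $__interval.
influx -database 'telegraf' -execute "
  SELECT mean(\"result_code\")
  FROM \"autogen\".\"ping\"
  WHERE \"url\" = 'openqa.suse.de' AND time > now() - 1h
  GROUP BY time(1m), \"host\" fill(null)"
```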
Files
Updated by okurz about 1 year ago
- File Screenshot_20230831_131009_grafana_spotty_http_response.png Screenshot_20230831_131009_grafana_spotty_http_response.png added
- Description updated (diff)
Updated by okurz about 1 year ago
- Due date set to 2023-09-14
- Status changed from In Progress to Feedback
In OSD /etc/openqa/openqa.ini I reduced max_running_jobs from 300 to 260.
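For context, a minimal sketch of that change as shell commands, assuming max_running_jobs lives in the [scheduler] section of openqa.ini (the section name is an assumption) and using the same service restart mentioned in later comments:

```sh
# Sketch: lower the global job limit on OSD and restart the affected services.
# Assumption: max_running_jobs is configured in the [scheduler] section.
sudo sed -i 's/^max_running_jobs = 300$/max_running_jobs = 260/' /etc/openqa/openqa.ini
sudo systemctl restart openqa-scheduler openqa-webui
```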
Updated by tinita about 1 year ago
We were never over 260 in the last few hours. I reduced it to 220 now.
Updated by okurz about 1 year ago
- Related to action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M added
Updated by okurz about 1 year ago
- Due date deleted (2023-09-14)
- Status changed from Feedback to Blocked
- Priority changed from High to Low
No problems with responsiveness have been recorded for some days. To be looked at again after #129619
Updated by okurz about 1 year ago
With recent changes to workers and such I have now changed the limit 220->300 and ran systemctl restart openqa-scheduler openqa-webui
Updated by okurz about 1 year ago
- Status changed from Blocked to Feedback
Given that our job schedule is still very long, and after monitoring https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-6h&to=now over the past hours, I have now disabled the limit on OSD completely again and will check whether the webUI becomes unresponsive again.
EDIT: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694529064786&to=1694537789909 shows small gaps in the data and ~1s HTTP responses after that, but according to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-12h&to=now only around 180 jobs were running at that moment, so I doubt the job limit would have helped here.
Updated by okurz about 1 year ago
- Related to action #135578: Long job age and jobs not executed for long size:M added
Updated by okurz about 1 year ago
- Status changed from Feedback to Resolved
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1694631788720&to=1694650046022&viewPanel=9 shows that for a period of about 1h OSD happily executed up to 600 jobs. In https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694631865287&to=1694653322713 we can see a corresponding surge in HTTP response times and high CPU usage and load, however nothing alarming. The system handled that quite well, possibly also due to the recent increase in the VM's CPU core and RAM values. I have now set a sensible value of 600 in osd:/etc/openqa/openqa.ini.
No further mentions of an unresponsive webUI have been brought up in the past days, so I consider this task resolved.
Updated by okurz about 1 year ago
- Status changed from Resolved to Feedback
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694682883898&to=1694684794243 showed a 20 min outage. There were network problems in the past minutes, so this is maybe not really related to openQA itself. Will monitor more over the next days.
Updated by okurz about 1 year ago
User report in https://suse.slack.com/archives/C02CANHLANP/p1694691995523719 about OSD being "down".
The "outage" of data is also visible e.g. in
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=9&orgId=1&from=1694691676260&to=1694692264799
I lowered the limit to 400 and ran systemctl restart openqa-{webui,scheduler}
Updated by okurz about 1 year ago
Given the very long job schedule queue, I am setting the limit 400->600 for now, even with the downside of a sometimes unresponsive system, so that openQA has a chance to finish more jobs.
Updated by okurz about 1 year ago
- Subject changed from OSD throws 503, unresponsive for some minutes to OSD throws 503, unresponsive for some minutes size:M
- Description updated (diff)
Updated by okurz about 1 year ago
As expected, with a job limit of 600 we sometimes have short outages. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1694986530963&to=1694991782981&viewPanel=9 shows an outage of some minutes just as many of the 600 jobs finish, confirming our hypothesis that many jobs uploading at once bog down the system. That is the compromise we accept for now.
Updated by okurz about 1 year ago
Received an alert "FIRING:1", so good to see that working :) Overnight the job queue went down to 500 in total (running+scheduled+blocked), see https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1695093189335&to=1695109010794&viewPanel=9. Reducing the limit again to 480 and monitoring.
Updated by okurz about 1 year ago
Based on user feedback in https://suse.slack.com/archives/C02CANHLANP/p1695125910277669 I reduced the job limit further 480->420
Updated by okurz about 1 year ago
Based on user feedback in https://suse.slack.com/archives/C02CANHLANP/p1695292686462249?thread_ts=1695292454.700889&cid=C02CANHLANP I reduced the job limit further 420->380
Updated by okurz about 1 year ago
Based on user feedback in https://suse.slack.com/archives/C02CANHLANP/p1695650480344229 I reduced the job limit further 380->340
In https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1695649754191&to=1695651189280&viewPanel=15 one can see quite clearly that all apache workers become busy, possibly handling uploading jobs with lots and lots of data, up to the point where the webUI is not responsive as there are simply no free apache workers left to handle further requests. A quick check for this is sketched below.
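One quick way to confirm that from the host itself, assuming Apache's mod_status is enabled and reachable on localhost (an assumption; the Grafana panel above shows the same data collected via monitoring):

```sh
# Sketch: compare busy vs. idle Apache workers during an upload surge.
# Assumption: mod_status is enabled and /server-status is reachable locally.
curl -s http://localhost/server-status?auto | grep -E '^(BusyWorkers|IdleWorkers):'
```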
Updated by tinita about 1 year ago
- Copied to action #136967: Monitor number of uploading jobs in grafana added
Updated by okurz about 1 year ago
- Status changed from Feedback to Resolved
No more user reports, no more related alerts. We can keep the current job limit in place and resolve here.
Updated by okurz 6 months ago
- Copied to action #160478: Try out higher global openQA job limit on OSD again after switch to nginx size:S added