action #134927

OSD throws 503, unresponsive for some minutes size:M

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Low
Assignee:
Category: -
Start date: 2023-08-31
Due date:
% Done: 0%
Estimated time:

Description

Observation

User report we heard in https://suse.slack.com/archives/C02CANHLANP/p1693474449529259:

openqa.suse.de throws 503 and sometimes doesn't respond (timeout on http requests) - anyone else or is it just me?

There was also a spotty HTTP response: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1693471438216&to=1693475746164

Attached screenshot: Screenshot_20230831_131009_grafana_spotty_http_response

Acceptance criteria

  • AC1: Measures have been applied to make unresponsiveness of OSD during "many jobs upload" events unlikely

Suggestions


Files


Related issues 4 (1 open, 3 closed)

Related to openQA Project (public) - action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M (Resolved, tinita, 2023-05-20)

Related to openQA Infrastructure (public) - action #135578: Long job age and jobs not executed for long size:M (Resolved, nicksinger)

Copied to openQA Infrastructure (public) - action #136967: Monitor number of uploading jobs in grafana (New, 2023-08-31)

Copied to openQA Infrastructure (public) - action #160478: Try out higher global openQA job limit on OSD again after switch to nginx size:S (Resolved, okurz, 2023-08-31)

Actions #2

Updated by okurz over 1 year ago

  • Due date set to 2023-09-14
  • Status changed from In Progress to Feedback

In OSD's /etc/openqa/openqa.ini I reduced max_running_jobs from 300 to 260.
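
For reference, a minimal sketch of the corresponding openqa.ini snippet; this assumes the limit lives in the [scheduler] section as in current openQA defaults:

    # /etc/openqa/openqa.ini on OSD (sketch, not the full file)
    [scheduler]
    # global limit of concurrently running jobs; lowered from 300 to reduce
    # load during "many jobs upload" events
    max_running_jobs = 260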

Actions #3

Updated by tinita over 1 year ago

We were never over 260 in the last few hours. I reduced it to 220 now.

Actions #4

Updated by okurz over 1 year ago

  • Related to action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M added
Actions #5

Updated by okurz over 1 year ago

  • Due date deleted (2023-09-14)
  • Status changed from Feedback to Blocked
  • Priority changed from High to Low

No problems with responsiveness recorded for some days. To be looked at again after #129619.

Actions #6

Updated by okurz over 1 year ago

With recent changes to workers and such, I have now changed the limit 220->300 and ran systemctl restart openqa-scheduler openqa-webui.
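
For completeness, the change boils down to the following shell steps on OSD; the sed call is only illustrative, the edit can just as well be done by hand:

    # bump the global job limit in /etc/openqa/openqa.ini (illustrative sed)
    sudo sed -i 's/^max_running_jobs = .*/max_running_jobs = 300/' /etc/openqa/openqa.ini
    # restart scheduler and webUI so they pick up the new value
    sudo systemctl restart openqa-scheduler openqa-webui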

Actions #7

Updated by okurz over 1 year ago

  • Status changed from Blocked to Feedback

Given that our job schedule is still very long, and after monitoring https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-6h&to=now over the past hours, I have now disabled the limit on OSD completely again and will check whether the webUI becomes unresponsive again.

EDIT: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694529064786&to=1694537789909 shows small outages of data and ~1s HTTP responses after that, but according to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-12h&to=now only around 180 jobs were running at that moment, so I doubt that the job limit would have helped here.

Actions #8

Updated by okurz over 1 year ago

  • Related to action #135578: Long job age and jobs not executed for long size:M added
Actions #9

Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1694631788720&to=1694650046022&viewPanel=9 shows that for a period of about 1h OSD is happily executing up to 600 jobs. In https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694631865287&to=1694653322713 we can see a corresponding surge in HTTP response times and high CPU usage and load, however nothing alarming. The system is handling that quite well, possibly also due to the recent increase in the VM's CPU core and RAM values. I have now set a sensible value of 600 in osd:/etc/openqa/openqa.ini.

No further mentions of an unresponsive webUI have been brought up in the past days, so I consider this task resolved.

Actions #10

Updated by okurz over 1 year ago

  • Status changed from Resolved to Feedback

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694682883898&to=1694684794243 showed a 20 min outage. There were problems in the network in the past minutes, so this is maybe not really related to openQA itself. Will monitor more over the next days.

Actions #11

Updated by okurz over 1 year ago

User report in https://suse.slack.com/archives/C02CANHLANP/p1694691995523719 about OSD being "down".
The "outage" of data is also visible e.g. in
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=9&orgId=1&from=1694691676260&to=1694692264799
I lowered the limit to 400 and did systemctl restart openqa-{webui,scheduler}

Actions #12

Updated by okurz over 1 year ago

Given the very long job schedule queue, and even with the downside of a sometimes unresponsive system, I am setting the limit 400->600 for now so that openQA has a chance to finish more jobs.

Actions #13

Updated by okurz over 1 year ago

  • Subject changed from OSD throws 503, unresponsive for some minutes to OSD throws 503, unresponsive for some minutes size:M
  • Description updated (diff)
Actions #14

Updated by okurz over 1 year ago

As expected with a job limit of 600 we sometimes have short outages. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1694986530963&to=1694991782981&viewPanel=9 shows an outage of some minutes just as many of the 600 jobs finish, matching our hypothesis that many jobs uploading at once bog down the system, but that is the compromise we accept for now.

Actions #15

Updated by okurz about 1 year ago

Received an alert "FIRING:1", so good to see that working :) Overnight the job queue went down to 500 in total running+scheduled+blocked, see https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1695093189335&to=1695109010794&viewPanel=9 . Reducing the limit again to 480 and monitoring.

Actions #16

Updated by okurz about 1 year ago

Based on user feedback in https://suse.slack.com/archives/C02CANHLANP/p1695125910277669 I reduced the job limit further 480->420

Actions #17

Updated by okurz about 1 year ago

Actions #18

Updated by okurz about 1 year ago

okurz wrote in #note-17:

Based on user feedback in https://suse.slack.com/archives/C02CANHLANP/p1695650480344229 I reduced the job limit further 380->340

In https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1695649754191&to=1695651189280&viewPanel=15 one can see quite clearly that all Apache workers become busy, possibly handling uploading jobs with lots of data, up to the point where the webUI is not responsive because there are simply no free Apache workers left to handle requests.
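
A quick way to confirm this on the host is to look at the Apache scoreboard via mod_status; this is just a sketch and assumes mod_status is enabled and reachable on localhost:

    # count busy vs. idle Apache workers; when BusyWorkers sits at the configured
    # maximum and IdleWorkers is 0, new requests (including plain webUI page
    # loads) have to wait and eventually time out or get a 503
    curl -s http://localhost/server-status?auto | grep -E 'BusyWorkers|IdleWorkers'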

Actions #19

Updated by tinita about 1 year ago

  • Copied to action #136967: Monitor number of uploading jobs in grafana added
Actions #20

Updated by okurz about 1 year ago

  • Status changed from Feedback to Resolved

No more user reports, no more related alerts. We can keep the current job limit in place and resolve here.

Actions #21

Updated by okurz 7 months ago

  • Copied to action #160478: Try out higher global openQA job limit on OSD again after switch to nginx size:S added