action #130636
open
coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
Added by livdywan 11 months ago.
Updated 4 days ago.
Category:
Feature requests
Description
Motivation
Apache in prefork mode uses a lot of resources to provide mediocre performance.
Acceptance criteria
- AC1: Nginx has been deployed successfully on OSD
- AC2: No alerts regarding "oh no, apache is down" ;)
Suggestions
- Make sure there is an easy way to switch back to Apache in case something goes wrong
- See #129490 for results from O3
- Adapt OSD nginx config for HTTP + HTTPS (O3 only requires HTTP)
- We can prepare the deployment of nginx in parallel to Apache, deploy it, and decide at any time when to switch by just disabling/enabling the services accordingly. The deployment needs to consider dehydrated+nginx as well. We can switch OSD to nginx to gather real-time data before we suggest using nginx as the default in our openQA documentation and CI infrastructure.
- Add changes to salt-states-openqa excluding monitoring
- Ensure that we have no alerts regarding "oh no, apache is down" ;)
- If there are any bigger issues observed then just revert and note down in follow-up tickets what needs to be solved first (to limit the ticket to size:S)
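The "switch by just disabling/enabling services" idea, including the easy way back to Apache, could look roughly like the following. This is only a minimal sketch, not the actual salt-states-openqa change: the unit names `apache2.service` and `nginx.service` and the helper names `switch_commands` and `switch` are assumptions made for illustration.

```python
import subprocess

# Assumed systemd unit names; adjust to the actual units on the host.
SERVICES = {"apache": "apache2.service", "nginx": "nginx.service"}


def switch_commands(target):
    """Return the systemctl commands that make `target` the active web server."""
    if target not in SERVICES:
        raise ValueError(f"unknown web server: {target!r}")
    other = next(name for name in SERVICES if name != target)
    return [
        # Stop and disable the other server first so ports 80/443 are free.
        ["systemctl", "disable", "--now", SERVICES[other]],
        # Then enable and start the requested server.
        ["systemctl", "enable", "--now", SERVICES[target]],
    ]


def switch(target, dry_run=True):
    """Print the commands, and run them only when dry_run is False."""
    for cmd in switch_commands(target):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: only shows what would be executed.
    switch("nginx")
```

Rolling back would then be the same call with `"apache"` as the target, which keeps the revert path as simple as the switch itself.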
Out of scope
- Finding out whether Nginx rate limiting features work for our use cases
- Full monitoring integration
- Copied from action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features added
- Description updated (diff)
- Description updated (diff)
During the openQA weekly we talked about this ticket and consider it a good candidate for a mob session. The main problems to solve are the Salt deployment and the SSL configuration, as well as a simple way to roll back the deployment and use Apache again in case something goes wrong.
- Description updated (diff)
We can prepare the deployment of nginx in parallel to Apache, deploy it, and decide at any time when to switch by just disabling/enabling the services accordingly. The deployment needs to consider dehydrated+nginx as well. We can switch OSD to nginx to gather real-time data before we suggest using nginx as the default in our openQA documentation and CI infrastructure.
- Related to action #157081: OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z added
- Related to action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z added
- Tags set to infra
- Target version changed from future to Ready
Due to repeated issues with unresponsiveness, we should give this more focus and bring it onto the backlog now.
- Copied to action #159651: high response times on osd - nginx with enabled rate limiting features size:S added
- Subject changed from high response times on osd - Try nginx on osd with enabled load limiting or load balancing features to high response times on osd - Try nginx on OSD size:S
- Status changed from New to Workable
- Description updated (diff)