Project

General

Profile

Actions

action #162533

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

[alert] OSD nginx yields 502 responses rather than being more resilient of e.g. openqa-webui restarts size:S

Added by okurz 6 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2024-05-24
Due date:
% Done:

0%

Estimated time:

Description

Motivation

Similar as in #160877 but concerning the server rather than tooling.

Did we managed do DoS the server? Do we need to tweak the nginx even more?

Acceptance criteria

Suggestions

Rollback steps


Related issues 3 (0 open3 closed)

Copied from openQA Project (public) - action #160877: [alert] Scripts CI pipeline failing due to osd yielding 502 size:MResolvedmkittler2024-05-24

Actions
Copied to openQA Project (public) - action #162611: Easy local development setup for comparing apache2+nginx as openQA web proxy size:SResolvedlivdywan2024-05-24

Actions
Copied to openQA Project (public) - action #162614: Consider other options how to restart openqa-webui to prevent 502's responses by nginxRejectedokurz2024-05-24

Actions
Actions #1

Updated by okurz 6 months ago

  • Copied from action #160877: [alert] Scripts CI pipeline failing due to osd yielding 502 size:M added
Actions #2

Updated by okurz 6 months ago

  • Copied to action #162611: Easy local development setup for comparing apache2+nginx as openQA web proxy size:S added
Actions #3

Updated by okurz 6 months ago

  • Subject changed from [alert] OSD nginx yields 502 responses rather than being more resilient of e.g. openqa-webui restarts to [alert] OSD nginx yields 502 responses rather than being more resilient of e.g. openqa-webui restarts size:S
  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to okurz
Actions #4

Updated by okurz 6 months ago

  • Copied to action #162614: Consider other options how to restart openqa-webui to prevent 502's responses by nginx added
Actions #5

Updated by dheidler 6 months ago

  • Description updated (diff)
Actions #6

Updated by okurz 6 months ago

  • Description updated (diff)

disabled the alert for now. Added the according rollback step.

Actions #7

Updated by okurz 6 months ago

  • Parent task set to #108209
Actions #8

Updated by livdywan 5 months ago

Pending on #162611 which I intend to resolve once the second PR is merged

Actions #9

Updated by livdywan 5 months ago

  • Status changed from Blocked to Workable

livdywan wrote in #note-8:

Pending on #162611 which I intend to resolve once the second PR is merged

Unlocked 🎉

Actions #10

Updated by okurz 5 months ago

  • Assignee deleted (okurz)
Actions #11

Updated by mkittler 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #12

Updated by mkittler 5 months ago · Edited

Locally, my apache is also not very resilient. I get a 503 error right away. My apache has the same settings as we used in production (as far as I can tell) and I'm also using systemd like in production.

I also restored my systemd units to be before https://github.com/os-autoinst/openQA/commit/78365ca96a3f3f429be7b2c47db07fef626afa80 and invoked systemctl daemon-reload. Now systemctl start openqa-webui also started apache2. However, restarting openqa-webui behaved exactly the same. So I still got a 503 error immediately and apache2 was neither restarted or reloaded as well.

Making sure that the Retry-After header is set also doesn't change the behavior. So I still have no idea how this was ever better with apache (although I'm pretty sure it was better).

Maybe we can rewrite 503 responses (or actually 502 in case of nginx) to some custom page with 200 return code that will automatically reload after a certain time? At least with Apache it worked to customize the error page (that's also how I added the Retry-After header) so I would only need to figure out how to do it with nginx.

Actions #13

Updated by openqa_review 5 months ago

  • Due date set to 2024-08-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by mkittler 4 months ago

  • Status changed from In Progress to Feedback

The PR has been merged and is deployed on o3. It looks good so far.

Actions #16

Updated by mkittler 4 months ago

The package is also reloading (and not restarting) as expected on o3:

martchus@ariel:~> sudo journalctl -fu openqa-webui.service
Aug 06 10:49:33 ariel openqa-webui-daemon[23345]: Web application available at http://127.0.0.1:9526
Aug 06 10:49:33 ariel openqa-webui-daemon[23345]: Web application available at http://[::1]:9526
Aug 06 11:10:58 ariel systemd[1]: Reloading The openQA web UI...
Aug 06 11:10:58 ariel systemd[1]: Reloaded The openQA web UI.
Aug 06 11:10:59 ariel openqa-webui-daemon[2865]: Web application available at http://127.0.0.1:9526
Aug 06 11:10:59 ariel openqa-webui-daemon[2865]: Web application available at http://[::1]:9526
Aug 06 14:19:47 ariel systemd[1]: Reloading The openQA web UI...
Aug 06 14:19:47 ariel systemd[1]: Reloaded The openQA web UI.
Aug 06 14:19:50 ariel openqa-webui-daemon[6604]: Web application available at http://127.0.0.1:9526
Aug 06 14:19:50 ariel openqa-webui-daemon[6604]: Web application available at http://[::1]:9526

(The first reload was me testing a manual reload when I wrote #162533#note-15. The last second reload happened because 4.6.1722948816.653e0d8a-lp155.7014.1 was installed via our normal updating services.)

So we'll only have to see how well it works on OSD.

Actions #17

Updated by mkittler 4 months ago

It has now been deployed on OSD as well and works. I reloaded the service two times refreshing the page at the same time and didn't get any 5xx responses anymore.

Actions #18

Updated by mkittler 4 months ago

While I still haven't seen any problems I cannot resolve the ticket just now because the alert is still firing. It has been firing since the deployment this morning (see the alert query's graph) and still needs to cool down.

Actions #19

Updated by livdywan 4 months ago · Edited

There was 952 502 responses at 7.16 this morning and sudo journalctl -S '2024-08-08 07:00:00' -U '2024-08-08 07:30:00' confirms openQA services were being restarted at that time (Note the time range of the link in the AC is only 1 hour).

Actions #20

Updated by okurz 4 months ago

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=80&from=1723610837160&to=1723614966741 shows the 502 responses same as every day when an OSD deployment is conducted.

Actions #21

Updated by mkittler 4 months ago · Edited

On OSD I only see

Aug 08 07:16:30 openqa systemd[1]: Reloading The openQA web UI...
Aug 08 07:16:30 openqa systemd[1]: Reloaded The openQA web UI.

and not Stopping …/Starting The openQA web UI. It also looks very much like things worked as expected:

Aug 08 07:16:33 openqa openqa-webui-daemon[30637]: [info] Listening at "http://127.0.0.1:9526?reuse=1"
Aug 08 07:16:33 openqa openqa-webui-daemon[30637]: Web application available at http://127.0.0.1:9526
Aug 08 07:16:33 openqa openqa-webui-daemon[30637]: [info] Listening at "http://[::1]:9526?reuse=1"
Aug 08 07:16:33 openqa openqa-webui-daemon[30637]: Web application available at http://[::1]:9526
Aug 08 07:16:33 openqa openqa-webui-daemon[30637]: [info] Manager 30637 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30665]: [info] Worker 30665 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30668]: [info] Worker 30668 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30669]: [info] Worker 30669 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30666]: [info] Worker 30666 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30670]: [info] Worker 30670 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30667]: [info] Worker 30667 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30672]: [info] Worker 30672 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30674]: [info] Worker 30674 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30673]: [info] Worker 30673 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30676]: [info] Worker 30676 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30671]: [info] Worker 30671 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30681]: [info] Worker 30681 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30684]: [info] Worker 30684 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30686]: [info] Worker 30686 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30688]: [info] Worker 30688 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30690]: [info] Worker 30690 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30694]: [info] Worker 30694 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30696]: [info] Worker 30696 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30693]: [info] Worker 30693 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30695]: [info] Worker 30695 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30701]: [info] Worker 30701 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30702]: [info] Worker 30702 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30704]: [info] Worker 30704 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30710]: [info] Worker 30710 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30714]: [info] Worker 30714 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30709]: [info] Worker 30709 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30712]: [info] Worker 30712 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30637]: [info] Creating process id file "/var/lib/openqa/webui/prefork-1.pid"
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21043 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1537 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19434 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21138 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 23987 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 29727 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3123 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 26139 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 20874 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19596 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25394 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 20081 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 22958 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 24531 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16242 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16927 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3610 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 4879 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19428 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 28670 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 12256 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11923 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19734 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 17014 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11800 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[30708]: [info] Worker 30708 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30717]: [info] Worker 30717 started
Aug 08 07:16:33 openqa openqa-webui-daemon[30716]: [info] Worker 30716 started
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 25394 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 29727 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21043 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1537 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19434 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21138 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 23987 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3123 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 26139 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 20874 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19596 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 20081 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 22958 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 24531 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16242 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16927 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3610 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 4879 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19428 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 28670 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 12256 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11923 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19734 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 17014 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11800 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 16927 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21043 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1537 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19434 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21138 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 23987 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3123 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 26139 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 20874 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19596 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 20081 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 22958 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 24531 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16242 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3610 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 4879 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19428 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 28670 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 12256 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11923 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19734 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 17014 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11800 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 16927 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21043 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1537 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19434 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21138 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 23987 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3123 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 26139 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 20874 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19596 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 20081 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 22958 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 24531 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16242 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3610 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 4879 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19428 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 28670 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 12256 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11923 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19734 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 17014 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11800 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 20081 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21043 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1537 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19434 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21138 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 23987 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3123 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 26139 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 20874 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19596 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 22958 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 24531 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16242 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3610 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 4879 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19428 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 28670 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 12256 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11923 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19734 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 17014 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11800 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 22958 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21043 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1537 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19434 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21138 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 23987 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3123 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 3123 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 28670 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 26139 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 20874 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19596 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 24531 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16242 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3610 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 4879 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19428 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 12256 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11923 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19734 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 17014 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11800 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 23987 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21043 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1537 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19434 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21138 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 26139 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 20874 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19596 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 24531 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16242 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 3610 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 4879 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 17014 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19428 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 12256 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11923 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19734 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11800 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 4879 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21043 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1537 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 20874 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 19434 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 21138 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
…
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 24531 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16242 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 11800 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 11800 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1537 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 24531 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16242 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 1537 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 24531 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16242 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 24531 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16242 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 16242 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 15801 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 15801 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 25514 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 8487 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 8487 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 25514 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 16603 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 16603 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [warn] Stopping worker 1100 immediately
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Worker 1100 stopped
Aug 08 07:16:33 openqa openqa-webui-daemon[13722]: [info] Manager 13722 stopped

On o3 I don't see any relevant log lines in that time frame.

Unfortunately it still seems to produce 5xx responses according to the monitoring. So there were 899 "5xx" responses in the time frame @livdywan mentioned. (Not sure where the number 952 comes from.)

I think those 5xx responses are due to the additional openQA services still being restarted normally. I excluded those for the sake of simplicity because as mentioned in my commit message on https://github.com/os-autoinst/openQA/pull/5820 this is actually not a problem as e.g. workers will simply retry the connection and so does the JavaScript code relying on the live handler.

So should I take care of the other services as well? It would just be for the sake of the alerting. Not sure whether we can exclude those responses on the alerting level.

Actions #22

Updated by mkittler 4 months ago · Edited

I guess we could add the condition request::field !~ /\/(ws|liveviewhandler)\// in Grafana but this would leave a few "5xx" responses.

I'm still not sure why we get those responses. They'd still be way below our alert threshold (only 50 and the threshold is 2000) but it still shouldn't happen. I queries some of the routes manually and they are definitely working under normal circumstances (and not produce a 500 response anyway).

Actions #23

Updated by mkittler 4 months ago · Edited

As a first step I'd exclude the websocket server and liveviewhandler from our 5xx alerting: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1251/diffs (I tested that change in the query explorer.)

If we want to do the zero-downtime approach for all services later we can still revert that commit.

About the remaining 50 problematic responses: Maybe those are caused by log messages like [warn] Stopping worker 24531 immediately? It is also strange that those messages show up multiple times within the same second. Maybe the shutdown of the previous web UI is not done cleanly?

Actions #24

Updated by mkittler 4 months ago

This is an excerpt of the relevant source code (/usr/lib/perl5/vendor_perl/5.*/Mojo/Server/Prefork.pm):

    # No heartbeat (graceful stop)
    $log->error("Worker $pid has no heartbeat ($ht seconds), restarting (see FAQ for more)") and $w->{graceful} = $time
      if !$w->{graceful} && ($w->{time} + $interval + $ht <= $time);

    # Graceful stop with timeout
    my $graceful = $w->{graceful} ||= $self->{graceful} ? $time : undef;
    $log->info("Stopping worker $pid gracefully ($gt seconds)") and (kill 'QUIT', $pid or $self->_stopped($pid))
      if $graceful && !$w->{quit}++;
    $w->{force} = 1 if $graceful && $graceful + $gt <= $time;

    # Normal stop
    $log->warn("Stopping worker $pid immediately") and (kill 'KILL', $pid or $self->_stopped($pid))
      if $w->{force} || ($self->{finished} && !$graceful);

So even though considered "normal", the message "Stopping … immediately" means sending SIGKILL. Considering not enough time passed I don't think this is because the workers were busy for too long. Maybe the convoluted logic in the prefork shutdown code is a bit too convoluted and might kill workers needlessly.

Actions #25

Updated by okurz 4 months ago

  • Status changed from Feedback to In Progress
Actions #26

Updated by mkittler 4 months ago

This should help with the remaining 5xx responses of the main openQA service: https://github.com/os-autoinst/openQA/pull/5859

Actions #27

Updated by mkittler 4 months ago

  • Status changed from In Progress to Feedback

The PR has been merged, let's see whether it works in production.

Actions #28

Updated by mkittler 4 months ago

The change hasn't been deployed yet. If it is deployed I will manually restart the main web UI service to apply the change to the launcher script. (This will unfortunately not be a zero-downtime restart.) Then I can wait for the next deployment or just reload the service manually and then revisit the query mentioned on #162533#note-22 with the time frame shortly after the restart.

Note that we might still get 5xx responses when rebooting the machine as this is of course also not zero-downtime. Not sure whether it is important enough to care about.

Actions #29

Updated by okurz 4 months ago

mkittler wrote in #note-28:

Note that we might still get 5xx responses when rebooting the machine as this is of course also not zero-downtime. Not sure whether it is important enough to care about.

good point. How would it that situation look like to users? When nginx itself is also not responsive during the reboot then the browser will not receive 5xx responses. IIRC browsers will show busy loading a page for a bit longer. During that time the openQA instance might boot up and I guess as soon as nginx is up but openQA not yet the browser might receive a 5xx, right?

Actions #30

Updated by livdywan 4 months ago

  • Due date changed from 2024-08-17 to 2024-08-23

Let's try to answer the open questions concerning reboot behavior this week.

I guess as soon as nginx is up but openQA not yet the browser might receive a 5xx, right?

Should we enforce openQA starting before nginx?

Actions #31

Updated by mkittler 4 months ago · Edited

I was also thinking about that. I suppose it would make sense. Of course just adding a dependency on systemd level won't cut it because at this point none of our services can signal systemd when they are actually ready. We needed to implement that at least for the main service. I suppose we can use some Mojolicious hook to know when the service is ready and then we'll probably have to see how call/implement sd_notify() from/in Perl.

EDIT: Or does it actually make sense to use a Mojolicious hook here? We only want to notify systemd once. So maybe this needed to be handled by the prefork backend or our launcher script.

Actions #32

Updated by mkittler 4 months ago · Edited

  • Description updated (diff)

I removed the silence because now there are not many 5xx responses left. I suppose the remaining 5xx responses are really just due to the reboot of the machine and there we needed to synchronize the startup as mentioned in previous comments.

The full list of 5xx responses over the weekend¹ (when the VM was rebooted) only contained 16 entries. So I'm not sure whether it is worth investing more time into this. For the sake of avoiding false alerts (which is the main scope of this ticket) it is not necessary (as our threshold is 2000).


¹ Without livehandler and websocket responses as we also filter them out in our monitoring now.

Actions #33

Updated by livdywan 4 months ago

mkittler wrote in #note-32:

I removed the silence because now there are not many 5xx responses left. I suppose the remaining 5xx responses are really just due to the reboot of the machine and there we needed to synchronize the startup as mentioned in previous comments.

The full list of 5xx responses over the weekend¹ (when the VM was rebooted) only contained 16 entries. So I'm not sure whether it is worth investing more time into this. For the sake of avoiding false alerts (which is the main scope of this ticket) it is not necessary (as our threshold is 2000).

I'm inclined to say file a ticket based on #162533#note-31 and we are probably good here

Actions #34

Updated by mkittler 4 months ago

  • Status changed from Feedback to Resolved

I filed #165515.

Actions #35

Updated by okurz 4 months ago

  • Due date deleted (2024-08-23)
Actions

Also available in: Atom PDF