action #13876 (closed)

Error 502: Bad gateway on uploading

Added by okurz over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: Regressions/Crashes
Target version: -
Start date: 2016-09-23
Due date: -
% Done: 0%
Estimated time: -

Description

observation

https://openqa.suse.de/tests/588618/file/autoinst-log.txt shows

7:20:42.5600 17609 preparing hdd 1 for upload as sle-12-SP2-aarch64-gnome-encrypted.qcow2 in qcow2
17609: EXIT 0
+++ worker notes +++
end time: 2016-09-23 07:20:43
result: done
uploading consoletest_setup-loadavg_consoletest_setup.txt
uploading consoletest_setup-psaxf.log
uploading install_and_reboot-y2logs.tar.bz2
uploading textinfo-info.txt
uploading textinfo-logs.tar.bz2
uploading sle-12-SP2-aarch64-gnome-encrypted.qcow2
ERROR sle-12-SP2-aarch64-gnome-encrypted.qcow2: 502 response: Bad Gateway

Related issues: 1 (0 open, 1 closed)

Related to openQA Project - action #12246: [tools]upload of log files can fail sometimes (was: https://openqa.suse.de/tests/412464 has no X-related log files) - Resolved - szarate - 2016-06-07

Actions #1

Updated by okurz over 7 years ago

  • Category set to Regressions/Crashes
  • Priority changed from Normal to High
Actions #2

Updated by okurz over 7 years ago

  • Related to action #12246: [tools]upload of log files can fail sometimes (was: https://openqa.suse.de/tests/412464 has no X-related log files) added
Actions #3

Updated by oholecek over 7 years ago

Can't this be a race during an openQA update? A 502 error is returned by the Apache proxy when the destination is not available. IIRC, previously the errors during upload were about operation timeouts or being forbidden by HMAC validation.

Actions #4

Updated by coolo over 7 years ago

What do you consider an 'openqa update'? See test 645084 on osd:

end time: 2016-11-27 23:30:03
result: done
uploading install_and_reboot-y2logs.tar.bz2
uploading SLES-12-SP2-x86_64-mru-install-minimal-with-addons-Build20161128-1.qcow2
ERROR SLES-12-SP2-x86_64-mru-install-minimal-with-addons-Build20161128-1.qcow2: 502 response: Bad Gateway

Let me guarantee you that no one touched the production instance on Sunday night at 23:30.

Actions #5

Updated by oholecek over 7 years ago

Right, right.
As far as my grepping got, we don't issue a 502 result code ourselves, so another idea is that there are too many simultaneous long-lasting uploads and none of our preforked Mojo workers is available. I guess the number of Apache workers does not match the number of openQA Mojo workers.
As a quick scaling measure we can try to increase the number of preforked Mojo workers. Or lower the number of Apache workers, but then we'll probably get connection timeouts.
As a long-term solution, moving to uploading in parts, as suggested previously, would I think help here too.
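
A minimal sketch of that scaling idea, assuming the web UI were driven directly via Mojo::Server::Prefork; the script path and the numbers are illustrative, not the production values:

    use Mojo::Server::Prefork;

    my $prefork = Mojo::Server::Prefork->new(listen => ['http://127.0.0.1:9526']);
    $prefork->workers(16);                  # more preforked workers to accept connections
    $prefork->load_app('./script/openqa');  # hypothetical application script path
    $prefork->run;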

Actions #6

Updated by coolo over 7 years ago

According to kraih, a mojo "worker" (we have to be careful with the term, it has a different meaning in openQA than in Mojo :) can handle multiple uploads at the same time, and this happens unblocked. So what else can block an openQA web UI process so heavily that 10 of them are unable to accept connections at 23:30 on a Saturday night?

Actions #7

Updated by coolo over 7 years ago

Looking at the error_log scares me:

[Mon Nov 28 00:29:46.268479 2016] [ssl:info] [pid 4915] [client 149.44.176.14:37760] AH01964: Connection to child 104 established (server openqa.suse.de:443)
[Mon Nov 28 00:30:03.602870 2016] [reqtimeout:info] [pid 4915] [client 149.44.176.14:37760] AH01382: Request header read timeout
[Mon Nov 28 00:30:03.602973 2016] [ssl:info] [pid 4915] (70007)The timeout specified has expired: [client 149.44.176.14:37760] AH01991: SSL input filter read failed.
[Mon Nov 28 00:30:22.531272 2016] [proxy_http:error] [pid 4915] (104)Connection reset by peer: [client 2620:113:80c0:8080:100:d99d:cf0c:2b3:48326] AH01102: error reading status line from remote server localhost:9526
[Mon Nov 28 00:30:22.531373 2016] [proxy:error] [pid 4915] [client 2620:113:80c0:8080:100:d99d:cf0c:2b3:48326] AH00898: Error reading from remote server returned by /api/v1/jobs/645041/status
[Mon Nov 28 00:30:22.554725 2016] [negotiation:error] [pid 4915] [client 2620:113:80c0:8080:100:d99d:cf0c:2b3:48326] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var
[Mon Nov 28 00:30:25.291854 2016] [proxy:error] [pid 4914] (104)Connection reset by peer: [client 2620:113:80c0:8080:100:d99d:cf0c:2b3:48200] AH01084: pass request body failed to [::1]:9526 (localhost)
[Mon Nov 28 00:30:25.296938 2016] [proxy_http:error] [pid 4914] [client 2620:113:80c0:8080:100:d99d:cf0c:2b3:48200] AH01097: pass request body failed to [::1]:9526 (localhost) from 2620:113:80c0:8080:100:d99d:cf0c:2b3 ()
[Mon Nov 28 00:30:25.297774 2016] [negotiation:error] [pid 4914] [client 2620:113:80c0:8080:100:d99d:cf0c:2b3:48200] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var
[Mon Nov 28 00:32:45.074391 2016] [mpm_prefork:info] [pid 2023] AH00162: server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 0 idle, and 100 total children
[Mon Nov 28 00:32:46.083552 2016] [mpm_prefork:info] [pid 2023] AH00162: server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 16 children, there are 0 idle, and 108 total children
[Mon Nov 28 00:32:47.099927 2016] [mpm_prefork:info] [pid 2023] AH00162: server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 32 children, there are 9 idle, and 124 total children
[Mon Nov 28 00:34:10.246295 2016] [mpm_prefork:info] [pid 2023] AH00162: server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 3 idle, and 100 total children
[Mon Nov 28 00:39:53.468174 2016] [ssl:info] [pid 3176] [client 149.44.176.14:37937] AH01964: Connection to child 71 established (server openqa.suse.de:443)
Actions #8

Updated by coolo over 7 years ago

These are:

[Mon Nov 28 00:30:03 2016] [28554:debug] uploaded install_and_reboot-y2logs.tar.bz2
00:30:33 video.ogv

2620:113:80c0:8080:100:d99d:cf0c:2b3 - - [28/Nov/2016:00:30:03 +0100] "POST /api/v1/jobs/645084/artefact HTTP/1.1" 406 - "-" "Mojolicious (Perl)"

So we have 30 seconds between the successful operations, and the gateway timeout happens in between. We actually still have the tmp file of the asset:

-rw-r--r-- 1 geekotest nogroup 2302615174 Nov 28 00:30 mojo.tmp.6698afc40db845d3883ed0d16221d1c3

Actions #9

Updated by coolo over 7 years ago

  • Assignee set to kraih
Actions #10

Updated by coolo over 7 years ago

Actions #11

Updated by oholecek over 7 years ago

  • Assignee deleted (kraih)

Right at that time at least two mojo workers are being stopped. Why is that?

[Mon Nov 28 00:30:03 2016] [2021:debug] Stopping worker 28567 gracefully

Actions #12

Updated by oholecek over 7 years ago

  • Assignee set to kraih

eeh.. sorry, didn't mean to do that.

Actions #13

Updated by oholecek over 7 years ago

Grepped the management process log:

[Mon Nov 28 00:30:01 2016] [2021:debug] Worker 28568 stopped
[Mon Nov 28 00:30:02 2016] [2021:debug] Stopping worker 28573 gracefully
[Mon Nov 28 00:30:02 2016] [2021:debug] Stopping worker 28550 gracefully
[Mon Nov 28 00:30:02 2016] [2021:debug] Stopping worker 28575 gracefully
[Mon Nov 28 00:30:03 2016] [2021:debug] Stopping worker 28556 gracefully
[Mon Nov 28 00:30:03 2016] [2021:debug] Stopping worker 28567 gracefully
[Mon Nov 28 00:30:04 2016] [2021:debug] Stopping worker 28578 gracefully
[Mon Nov 28 00:30:04 2016] [2021:debug] Stopping worker 29105 gracefully
[Mon Nov 28 00:30:05 2016] [2021:debug] Stopping worker 28554 gracefully
[Mon Nov 28 00:30:05 2016] [2021:debug] Stopping worker 28564 gracefully
[Mon Nov 28 00:30:22 2016] [2021:debug] Stopping worker 28573
[Mon Nov 28 00:30:22 2016] [2021:debug] Worker 28573 stopped
[Mon Nov 28 00:30:22 2016] [2021:debug] Stopping worker 28550
[Mon Nov 28 00:30:22 2016] [2021:debug] Worker 28550 stopped
[Mon Nov 28 00:30:22 2016] [2021:debug] Stopping worker 28575
[Mon Nov 28 00:30:22 2016] [2021:debug] Stopping worker 28575
[Mon Nov 28 00:30:22 2016] [2021:debug] Worker 28575 stopped
[Mon Nov 28 00:30:23 2016] [2021:debug] Stopping worker 28556
[Mon Nov 28 00:30:23 2016] [2021:debug] Worker 28556 stopped
[Mon Nov 28 00:30:23 2016] [2021:debug] Stopping worker 28567
[Mon Nov 28 00:30:23 2016] [2021:debug] Worker 28567 stopped
[Mon Nov 28 00:30:24 2016] [2021:debug] Stopping worker 28578
[Mon Nov 28 00:30:24 2016] [2021:debug] Worker 28578 stopped
[Mon Nov 28 00:30:24 2016] [2021:debug] Stopping worker 29105
[Mon Nov 28 00:30:24 2016] [2021:debug] Stopping worker 29105
[Mon Nov 28 00:30:24 2016] [2021:debug] Worker 29105 stopped
[Mon Nov 28 00:30:25 2016] [2021:debug] Stopping worker 28554
[Mon Nov 28 00:30:25 2016] [2021:debug] Worker 28554 stopped
[Mon Nov 28 00:30:26 2016] [2021:debug] Stopping worker 28564
[Mon Nov 28 00:30:26 2016] [2021:debug] Worker 28564 stopped

So, for some reason, mojo seems to be stopping workers; it takes them ~20 s to stop, and during this time, I guess, they don't accept any new connections.

Actions #14

Updated by coolo over 7 years ago

So the kill -QUIT is not as graceful as we hoped :)

cat /etc/cron.hourly/openqa-restart-workers

#! /bin/sh

pkill -QUIT -f 'openqa prefork' -P $(pgrep -f 'openqa prefork' -P 1)

Actions #15

Updated by coolo over 7 years ago

In this context it's important to know that the Apache default proxy timeout is 60 seconds.

Actions #16

Updated by coolo over 7 years ago

But the graceful restart doesn't appear to work at all. E.g.:

openqa:~ # egrep "[Ww]orker 544" /var/log/openqa
[Mon Nov 28 14:35:30 2016] [544:debug] Worker 544 started
[Mon Nov 28 15:30:06 2016] [2021:debug] Stopping worker 544 gracefully
[Mon Nov 28 15:30:16 2016] [2021:debug] Worker 544 stopped

So it took 10 seconds for it to stop. But if you look, it still got new requests in these 10 seconds:

[Mon Nov 28 15:30:05 2016] [544:debug] POST "/api/v1/jobs/645676/artefact"
[Mon Nov 28 15:30:05 2016] [544:debug] 200 OK (0.037611s, 26.588/s)
[Mon Nov 28 15:30:06 2016] [544:debug] POST "/api/v1/jobs/646177/artefact"
[Mon Nov 28 15:30:06 2016] [544:debug] 200 OK (0.010903s, 91.718/s)
[Mon Nov 28 15:30:06 2016] [544:debug] POST "/api/v1/jobs/645768/status"
[Mon Nov 28 15:30:06 2016] [544:debug] 200 OK (0.037931s, 26.364/s)
[Mon Nov 28 15:30:07 2016] [544:debug] POST "/api/v1/jobs/645923/artefact"
[Mon Nov 28 15:30:07 2016] [544:debug] 200 OK (0.044114s, 22.669/s)
[Mon Nov 28 15:30:09 2016] [544:debug] POST "/api/v1/jobs/645767/status"
[Mon Nov 28 15:30:09 2016] [544:debug] 200 OK (0.040649s, 24.601/s)
[Mon Nov 28 15:30:11 2016] [544:debug] POST "/api/v1/jobs/645923/artefact"
[Mon Nov 28 15:30:11 2016] [544:debug] 200 OK (0.011489s, 87.040/s)
[Mon Nov 28 15:30:13 2016] [544:debug] POST "/api/v1/jobs/646182/status"
[Mon Nov 28 15:30:13 2016] [544:debug] 200 OK (0.035543s, 28.135/s)
[Mon Nov 28 15:30:14 2016] [544:debug] POST "/api/v1/jobs/645769/status"
[Mon Nov 28 15:30:14 2016] [544:debug] 200 OK (0.054106s, 18.482/s)
[Mon Nov 28 15:30:15 2016] [544:debug] POST "/api/v1/jobs/645891/status"
[Mon Nov 28 15:30:16 2016] [544:debug] 200 OK (0.415147s, 2.409/s)
Actions #17

Updated by oholecek over 7 years ago

Could be. But also, given that I don't know how many connections one worker can service, could these be remaining keep-alive connections the worker was servicing before? So they may not really be new connections?

If they are indeed new connections, I'm out of ideas. Still, I would like to know why mojo is shutting the workers down. Is that some maintenance thing? Edit: I didn't notice your previous answer.

Actions #18

Updated by coolo over 7 years ago

So we want to replace this crontab with http://mojolicious.org/perldoc/Mojo/Server/Hypnotoad#accepts - i.e. 1000 or so. But we also need to make sure this restart is fully graceful (if in doubt, by setting graceful_timeout to MAXINT).
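
A minimal sketch of that idea, assuming the server were started via hypnotoad and picked up its settings from the application config; the numbers are placeholders, not the values chosen for the production instance:

    use Mojolicious::Lite;

    app->config(hypnotoad => {
        listen           => ['http://127.0.0.1:9526'],
        accepts          => 1000,    # restart a worker after roughly 1000 accepted connections
        graceful_timeout => 1800,    # give in-flight uploads plenty of time to finish
    });

    app->start;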

Actions #19

Updated by coolo over 7 years ago

I disabled the crontab now on both instances

Actions #20

Updated by coolo over 7 years ago

Looking at the source, SIGQUIT and max_accepts both call the same function. So if the 20 seconds aren't enough to stop, we need to do something about it ;(

Actions #21

Updated by oholecek over 7 years ago

coolo wrote:

So we want to replace this crontab with http://mojolicious.org/perldoc/Mojo/Server/Hypnotoad#accepts - i.e. 1000 or so. But we also need to make sure this restart is fully graceful (if in doubt, by setting graceful_timeout to MAXINT).

According to http://mojolicious.org/perldoc/Mojo/Server/Prefork#accepts, the prefork accepts setting defaults to 10000. Why are we doing this restart at all? Just in case? Like a healthy restart against memory leaks and such?

Actions #22

Updated by oholecek over 7 years ago

coolo wrote:

Looking at the source, SIGQUIT and max_accepts both call the same function. So if the 20 seconds aren't enough to stop, we need to do something about it ;(

I'm not sure how this works, but from the logs we see that individual requests are completed quickly. That means the mojo worker is not closing connections that carry a keep-alive header and is waiting until the remote side closes them. But that will only happen when the job is finished, nowhere near the 20 s time limit. Should the mojo worker just close connections after finishing the request, of course only in shutdown mode? Mojo::UserAgent should be able to open a new connection when the current one is closed.

Actions #23

Updated by kraih over 7 years ago

Ok, I believe the problem could be keep-alive connections. When a Mojolicious worker receives SIGQUIT, it immediately stops accepting new incoming TCP connections, but it still accepts new incoming keep-alive HTTP requests on already accepted TCP connections. So what happens here is that a graceful shutdown got initiated, then an HTTP request got accepted on the keep-alive connection, the graceful shutdown timeout ran out, the manager killed the worker immediately, closing the TCP connection, and Apache sent a 502 Bad Gateway to the client.

There are multiple solutions, but one thing that definitely needs to be done for all of them is to increase the graceful timeout to the maximum amount of time any request can take.

Next, a quick and (performance-wise) inefficient fix would be to reduce the number of keep-alive requests per connection to 1, essentially disabling keep-alive. Alternatively, it might also be possible to disable keep-alive in the mod_proxy configuration. But I would strongly recommend fixing this upstream in Mojolicious: workers should simply not accept new keep-alive requests once graceful shutdown has been initiated.
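
A rough sketch of that quick fix, again assuming direct use of Mojo::Server::Prefork; allowing only one request per connection effectively disables keep-alive, and the graceful timeout is raised well above the slowest expected upload. Values and the script path are illustrative:

    use Mojo::Server::Prefork;

    my $prefork = Mojo::Server::Prefork->new(listen => ['http://127.0.0.1:9526']);
    $prefork->max_requests(1);              # at most one keep-alive request per connection
    $prefork->graceful_timeout(1800);       # longer than the longest expected upload
    $prefork->load_app('./script/openqa');  # hypothetical application script path
    $prefork->run;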

Actions #24

Updated by coolo over 7 years ago

https://github.com/os-autoinst/openQA/pull/1046 for a temporary workaround. 10000 connections are effectively "forever" if prefork at the same time handles 100 requests per connection. And these 100 requests have to finish within the 20 second graceful timeout.

Actions #25

Updated by szarate over 7 years ago

I just sent https://github.com/os-autoinst/openQA/pull/1053, which should address this one in particular from the worker perspective.
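
Purely as an illustration of a worker-side retry (a sketch only, not necessarily what that PR does), reusing the upload() helper and the variables from the worker code quoted in #27 below; the retry budget and the sleep are assumptions:

    my $tries = 5;
    my $ok;
    for my $attempt (1 .. $tries) {
        $ok = upload($job_id, {file => {file => "$pooldir/$file", filename => $ofile}});
        last if $ok;
        sleep 5;    # give the web UI time to finish a worker restart
    }
    $aborted = "failed to upload $file" unless $ok;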

Actions #26

Updated by kraih over 7 years ago

I've committed a fix upstream in Mojolicious (released with 7.11) that should make bad gateway errors much less likely. https://github.com/kraih/mojo/commit/02049ddabb93e07699edab6ad28f208d4064daf4

Actions #27

Updated by szarate over 7 years ago

Notice that for retries we should also think about the file size... I can think of a scenario where a 10 GB disk is being uploaded... and let's say 2-4 jobs failing at the same time for the same reason... no fun my friend, but can we really do something about it?

Also I noticed that in the code, if one upload in a group fails (assets, ulogs or other files), then the remaining files of that group are lost... meaning if the 10th of 100 logs fails to upload... the rest will not be uploaded...

e.g.:

            unless (
                upload(
                    $job_id,
                    {
                        file => {file => "$pooldir/$file", filename => $ofile}}))
            {
                $aborted = "failed to upload $file";
                last;
            }
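
A hypothetical variant of the loop this excerpt comes from (a sketch only, not the actual openQA code): keep uploading the remaining files when one fails and only record the failures, instead of aborting the whole group with last:

    my @failed;                 # names of files whose upload failed
    for my $file (@files) {     # @files: assumed list of files in this upload group
        my $ofile = $file;      # assumed mapping from pool file to upload name
        unless (upload($job_id, {file => {file => "$pooldir/$file", filename => $ofile}})) {
            push @failed, $file;    # remember the failure, but keep going
            next;
        }
    }
    $aborted = 'failed to upload: ' . join(', ', @failed) if @failed;
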
Actions #28

Updated by coolo over 7 years ago

  • Status changed from New to Resolved

Resolved in Mojolicious - and we changed our prefork options to restart more often with bigger timeouts. This helped.
