action #106762

Prevent proxy timeout errors on `isos post` requests that take too long

Added by okurz about 2 years ago. Updated about 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version: -
Start date: 2022-02-14
Due date: -
% Done: 0%
Estimated time: -

Description

Motivation

`isos post` runs synchronously by default, and depending on the number of tests to be scheduled it can take very long, so we run into ugly timeouts. There is already a custom Apache config in https://github.com/os-autoinst/openQA/blob/master/etc/apache2/vhosts.d/openqa-common.inc#L89 which, as of today, failed on openqa.suse.de because requests took even longer than 5 minutes. The commands fail with a proxy error, but at least parts of the tests are still scheduled. This behaviour should be changed to ensure that we don't run into timeouts without good error feedback. I am not sure it is a good idea to bump the timeout up further. The request should likely be handled explicitly in a better way before we hit the timeout, regardless of whether that timeout is 60s, 5m or 20m. So maybe the server should respond with an explicit error pointing to the async route to use instead.

Also see:
  • https://suse.slack.com/archives/C02CANHLANP/p1644819600199619
  • https://suse.slack.com/archives/C02CANHLANP/p1644830282823459
  • https://suse.slack.com/archives/C02CANHLANP/p1644830317616679
  • https://suse.slack.com/archives/C02CANHLANP/p1644831776045659

Acceptance criteria

  • AC1: No jobs are scheduled when requests time out
  • AC2: No custom proxy timeout settings are required
  • AC3: openQA explicitly handles timeouts on requests before the proxy intervenes, e.g. clear error feedback pointing to alternatives

Suggestions

How about aborting on an openQA-internal timeout before any proxy timeout fires, e.g. after 30s when the Apache timeout is 60s by default, and suggesting the use of the async parameter?

Actions #1

Updated by mkittler about 2 years ago

AC1: That's very hard to implement. The Mojolicious web app would need to react to the connection being closed by removing all pending events/handlers related to that connection. All ongoing handlers would need to be cancelled gracefully, which means having well-defined cancellation points within our code. Possibly some cleanup would need to happen as well. I suppose even handling the fact that the connection is being closed is a problem, because the code to handle it would be blocked by the very code we want to abort (as we only have one thread in Perl).

AC2: So the timeout in our default Apache2 config should be increased?

AC3: Simply using a timer for that won't cut it because it wouldn't be handled while the event loop is blocked (by the code we want to abort in the first place). We could use SIGALRM. I suppose the Perl handler for that signal would only be invoked after any blocking C code is done (and the next line of Perl code is executed). In the handler we could set some variable to indicate that the request should be aborted, and check for it in certain places during the isos post code. However, that seems rather ugly - e.g. we would need to keep track of which requests should be affected when SIGALRM fires (as multiple requests can be handled/queued at the same time).


Note that the async=1 parameter was actually supposed to be used generally (while async=0 was supposed to be phased out due to the timeout problem). So I'd simply recommend using `openqa-cli api -X POST isos async=1` and, if needed, polling via `openqa-cli api isos/$scheduled_product_id` for the result.
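The schedule-then-poll workflow recommended above can be scripted, e.g. from Python. `openqa-cli` and the `isos` routes are taken from the comment above, but the JSON field names (`scheduled_product_id`, `status`) and the terminal `scheduled` state are assumptions, so treat this as a sketch:

```python
import json
import subprocess
import time

def run_cli(args):
    # Thin wrapper around openqa-cli so the polling logic below can also be
    # exercised with a stub instead of a real openQA instance.
    out = subprocess.run(["openqa-cli", "api", *args],
                         check=True, capture_output=True, text=True)
    return json.loads(out.stdout)

def schedule_async(params, poll_interval=5, run=run_cli):
    """POST isos with async=1, then poll the scheduled product until done."""
    result = run(["-X", "POST", "isos", "async=1", *params])
    product_id = result["scheduled_product_id"]  # assumed response field
    while True:
        status = run([f"isos/{product_id}"])
        if status.get("status") == "scheduled":  # assumed terminal state
            return status
        time.sleep(poll_interval)
```

With this pattern the HTTP request itself returns immediately, so no proxy timeout can fire regardless of how long scheduling takes.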

Actions #2

Updated by kraih about 2 years ago

I'd like to see all ACs replaced with AC1: Deprecate the current async=0 default and make async=1 the default in the future. Trying to gracefully deal with connection timeouts on the server side is too complex to be feasible. And it's not just the Perl code: a socket close is often not propagated across hops, or only propagated with a large delay. For example, a reverse proxy may reuse backend connections even if clients disconnect. And of course you can't assume that there are no further hops between the client and our reverse proxy... you get the idea.
