action #106759
closed coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #92854: [epic] limit overload of openQA webUI by heavy requests
Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M
Description
Observation
# /var/log/openqa
[2022-02-12T04:12:47.195902Z] [error] Worker 22011 has no heartbeat (400 seconds), restarting
[2022-02-12T04:12:56.228360Z] [error] Worker 28596 has no heartbeat (400 seconds), restarting
Acceptance criteria
AC1: The cause of the heartbeat message is known
Suggestions
- Add the message to the blocklist (a pattern sketch follows below this list)
- Look at Mojolicious APIs related to preforking of Mojo workers (problem is a blocked worker process, the Mojolicious API can't help with that)
- Extend the configured timeout from 400s (the timeout's already pretty high)
- Confirm where the errors are logged, and add context (Mojolicious logs the error, and there is no context information to add; different process)
- 500 error in access_log from live_view_handler (unrelated: the time doesn't match and the live view handler doesn't use preforking)
  [15/Feb/2022:07:16:02 +0000] "GET /liveviewhandler/tests/2189494/developer/ws-proxy HTTP/1.1" 500
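A minimal, hedged sketch for the first suggestion (the exact openqa-logwarn ignore-file syntax is not reproduced here): a Perl pattern that matches the heartbeat message, wrapped in a small filter reading log lines on STDIN and printing only the matches.

use strict;
use warnings;

# Hypothetical helper, not part of openqa-logwarn: match the heartbeat error line
my $heartbeat_re = qr/\[error\] Worker \d+ has no heartbeat \(\d+ seconds\), restarting/;
print for grep { /$heartbeat_re/ } <STDIN>;

For example, xzcat /var/log/openqa.1.xz | perl match-heartbeat.pl would show exactly the lines such a blocklist entry would suppress.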
Additional info
From /usr/lib/perl5/vendor_perl/5.34.0/Mojolicious/Guides/FAQ.pod:
=head2 What does "Worker 31842 has no heartbeat (50 seconds), restarting" mean?
As long as they are accepting new connections, worker processes of all built-in pre-forking web servers send heartbeat
messages to the manager process at regular intervals, to signal that they are still responsive. A blocking operation
such as an infinite loop in your application can prevent this, and will force the affected worker to be restarted after
a timeout. This timeout defaults to C<50> seconds and can be extended with the attribute
L<Mojo::Server::Prefork/"heartbeat_timeout"> if your application requires it.
lib/Mojo/Server/Prefork.pm
# No heartbeat (graceful stop)
$log->error("Worker $pid has no heartbeat ($ht seconds), restarting") and $w->{graceful} = $time
  if !$w->{graceful} && ($w->{time} + $interval + $ht <= $time);

# Graceful stop with timeout
my $graceful = $w->{graceful} ||= $self->{graceful} ? $time : undef;
$log->info("Stopping worker $pid gracefully ($gt seconds)") and (kill 'QUIT', $pid or $self->_stopped($pid))
  if $graceful && !$w->{quit}++;
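The FAQ above points to the heartbeat_timeout attribute of Mojo::Server::Prefork. As a minimal, hedged sketch (a plain Mojolicious::Lite app rather than openQA itself), the timeout could be raised through the hypnotoad section of the application config, which the prefork server picks up; the suggestions above note that 400 seconds is already quite high, so this is shown only for completeness.

use Mojolicious::Lite -signatures;

# Assumption: the app is run under hypnotoad, which reads these settings from the app config
app->config(hypnotoad => {
  heartbeat_interval => 5,    # seconds between heartbeats of a healthy worker
  heartbeat_timeout  => 900,  # seconds of silence before the manager restarts a worker
});

get '/' => sub ($c) { $c->render(text => 'ok') };

app->start;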
Updated by livdywan over 2 years ago
- Copied from action #105828: 4-7 logreport emails a day cause alert fatigue size:M added
Updated by okurz over 2 years ago
- Due date deleted (2022-02-17)
- Priority changed from Normal to High
I think here, too, you left in the target version and due date by mistake, didn't you? But it certainly needs to be handled with priority to be covered in openqa-logwarn, although I think such a task should be trivial for you to do as part of alert handling as well.
Updated by livdywan over 2 years ago
- Subject changed from Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 to Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from High to Low
But it certainly needs to be handled with priority to be covered in openqa-logwarn
During the estimation we agreed this shouldn't simply be ignored. However, we don't consider it urgent, and we should monitor this further since it's not clear what effect it has in production.
Updated by kraih over 2 years ago
Seen the same in the OSD journal:
Feb 08 14:49:52 openqa openqa[13856]: [error] Worker 19842 has no heartbeat (400 seconds), restarting
There is really only one cause for this: a blocking operation (like a database query) running for more than 400 seconds, which prevents the worker process from sending a heartbeat to the manager process. The fix would be either a) to make sure nothing blocks the process for more than 400 seconds, or b) to increase the heartbeat timeout.
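To illustrate the failure mode (a hypothetical /block route, not an openQA one): any handler that blocks the event loop for longer than the heartbeat timeout reproduces the message, because the worker cannot send its heartbeat while stuck in the blocking call.

use Mojolicious::Lite -signatures;

get '/block' => sub ($c) {
  sleep 500;    # blocks the whole worker process; no heartbeat is sent meanwhile
  $c->render(text => 'finally done');
};

app->start;

Running this under hypnotoad and requesting /block should eventually produce a similar "has no heartbeat ... restarting" error in the log.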
Updated by kraih over 2 years ago
Looking through the OSD logs a bit more, i also found a few seemingly unrelated liveviewhandler 504 errors in the access_log around the same time (i guess those might just be that common). The 502 seems suspiciously close to when the 400 second timeout must have started.
Feb 08 14:49:52 openqa openqa[13856]: [error] Worker 19842 has no heartbeat (400 seconds), restarting
[08/Feb/2022:14:43:10 +0100] "GET /tests/overview?build=&distri=sle&version=15-SP1 HTTP/1.1" 502 1071 "https://openqa.suse.de/tests"
[08/Feb/2022:14:58:50 +0100] "GET /liveviewhandler/tests/8119824/developer/ws-proxy/status HTTP/1.1" 504 - "-"
[08/Feb/2022:14:58:37 +0100] "GET /liveviewhandler/tests/8119823/developer/ws-proxy/status HTTP/1.1" 504 - "-"
The last thing the killed worker process logged was:
Feb 08 14:43:10 openqa openqa[19842]: [debug] No build and no group specified, will lookup build based on the other parameters
It does not appear to be a general problem with the request though; when i manually open the URL it finishes loading completely in 4.92s.
Updated by livdywan over 2 years ago
From openQA logreport on o3:
[2022-02-17T03:51:53.414618Z] [error] Worker 13242 has no heartbeat (400 seconds), restarting
[2022-02-17T03:52:08.416699Z] [error] Worker 7572 has no heartbeat (400 seconds), restarting
[2022-02-17T03:52:26.022451Z] [error] Worker 7891 has no heartbeat (400 seconds), restarting
[2022-02-17T03:54:41.762629Z] [error] Worker 14719 has no heartbeat (400 seconds), restarting
[2022-02-17T03:54:55.849639Z] [error] Worker 15029 has no heartbeat (400 seconds), restarting
[2022-02-17T03:55:15.120880Z] [error] Worker 13288 has no heartbeat (400 seconds), restarting
[2022-02-17T03:57:08.128696Z] [error] Worker 16322 has no heartbeat (400 seconds), restarting
[2022-02-17T03:59:52.744987Z] [error] Worker 17689 has no heartbeat (400 seconds), restarting
Updated by okurz over 2 years ago
- Priority changed from Low to Urgent
cdywan wrote:
But it certainly needs to be handled with priority to be covered in openqa-logwarn
During the estimation we agreed this shouldn't simply be ignored. However, we don't consider it urgent, and we should monitor this further since it's not clear what effect it has in production.
You reduced prio to Low but nobody added something to logwarn. This means that every day we receive monitoring alerts about this causing the opposite of "this shouldn't be simply ignored"
Updated by livdywan over 2 years ago
okurz wrote:
You reduced prio to Low but nobody added something to logwarn. This means that every day we receive monitoring alerts about this causing the opposite of "this shouldn't be simply ignored"
That didn't look to be the case when I was on alert duty ;-) (Not disagreeing, answering your question)
Updated by tinita over 2 years ago
Re https://progress.opensuse.org/issues/106759#note-9
I looked for long running HTTP requests in the log from 17/Feb/2022:
% cat long-requests | grep -v "GET /api/v1/ws/" | grep -v "GET /assets" | grep -v "GET /liveviewhandler/tests" | grep -v "GET .*/streaming " | grep '[0-9][0-9][0-9][0-9][0-9][0-9]$'
192.168.112.100 - - [17/Feb/2022:00:47:29 +0000] "GET /tests/2192620/investigation_ajax HTTP/1.1" 200 1929881 "-" "openqa-investigate" 101193
192.168.47.46 - - [17/Feb/2022:03:32:41 +0000] "GET /tests/overview HTTP/1.1" 502 1082 "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300102
192.168.47.46 - - [17/Feb/2022:03:32:51 +0000] "GET /tests/overview HTTP/1.1" 502 1082 "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300135
192.168.47.46 - - [17/Feb/2022:03:45:13 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300101
192.168.47.46 - - [17/Feb/2022:03:45:43 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 304052
192.168.47.46 - - [17/Feb/2022:03:48:01 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300097
192.168.47.46 - - [17/Feb/2022:03:48:31 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300498
192.168.47.46 - - [17/Feb/2022:03:45:23 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 600197
192.168.47.46 - - [17/Feb/2022:03:50:23 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300098
192.168.47.46 - - [17/Feb/2022:03:48:11 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 600199
192.168.47.46 - - [17/Feb/2022:03:53:11 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300100
As you can see, most of them are between 3:32 and 3:53, and two of them took over 600 seconds.
So it doesn't exactly match the times from https://progress.opensuse.org/issues/106759#note-9, but maybe it's still related?
Updated by kraih over 2 years ago
We could try activating slow query logging in Postgres (300 seconds or more) and then match slow query timestamps with the "worker has no heartbeat" messages.
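A sketch of what enabling that could look like, assuming superuser access via DBD::Pg (editing postgresql.conf and reloading would work just as well; this is not necessarily how it was done on o3):

use strict;
use warnings;
use DBI;

# Log every statement that runs for 300 s or more, then reload the config without a restart
my $dbh = DBI->connect('dbi:Pg:dbname=openqa', 'postgres', '', {RaiseError => 1});
$dbh->do(q{ALTER SYSTEM SET log_min_duration_statement = '300s'});
$dbh->selectrow_array('SELECT pg_reload_conf()');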
Updated by tinita over 2 years ago
- Status changed from Workable to In Progress
- Assignee set to tinita
Updated by tinita over 2 years ago
I enabled slow query logging (>= 300s) on o3.
Updated by tinita over 2 years ago
PR for ignoring the messages for now: https://github.com/os-autoinst/openqa-logwarn/pull/27
Not sure if this is wanted because Oli's statement in https://progress.opensuse.org/issues/106759#note-10 is hard to parse
Updated by okurz over 2 years ago
- Priority changed from Urgent to Normal
tinita wrote:
Not sure if this is wanted because Oli's statement in https://progress.opensuse.org/issues/106759#note-10 is hard to parse
Don't worry about it. The ticket is now set to "Urgent" and you are working on it. You prepared https://github.com/os-autoinst/openqa-logwarn/pull/27 and we merged that, which is perfect. So we can reduce the prio now, and you can of course either unassign again or continue with the real problem, which you already looked into. So all good from my side :)
Updated by tinita over 2 years ago
I changed the slow query threshold to >= 200s yesterday.
So far I can't see a slow query log.
Updated by tinita over 2 years ago
I set the slow query duration to 5s now.
I was able to trigger one by using the search box in the web UI (duration: 5119.870 ms).
The slow queries don't land in their own log like in MySQL, but in the general log under /var/lib/pgsql/data/log.
I will keep it at 5s and monitor how many we get.
I did not find any slow queries over 200s for the last few days.
Some long-running queries seem to be aborted though, e.g. https://openqa.opensuse.org/tests/overview will return a 502 after 5 minutes.
Updated by tinita over 2 years ago
Had a look at the o3 logs today. No "heartbeat" messages (logs go back to March 9). Very few slow (>=5s) queries.
Updated by tinita over 2 years ago
- Status changed from Feedback to Workable
- Assignee deleted (tinita)
I had a look at the logs.
We had two occurrences:
openqa.19.xz:[2022-03-23T17:52:18.080336Z] [error] Worker 18831 has no heartbeat (400 seconds), restarting
openqa.7.xz:[2022-03-26T13:15:33.754772Z] [error] Worker 1803 has no heartbeat (400 seconds), restarting
I can see these postgres logs around those times:
... (several same log messages)
2022-03-23 17:43:24.908 UTC openqa geekotest [14594]LOG: could not receive data from client: Connection reset by peer
2022-03-23 17:51:21.737 UTC openqa geekotest [19501]LOG: could not receive data from client: Connection reset by peer
2022-03-23 17:56:04.548 UTC openqa geekotest [23583]LOG: could not receive data from client: Connection reset by peer
2022-03-23 18:01:58.291 UTC openqa geekotest [23899]LOG: could not receive data from client: Connection reset by peer
... (several same log messages)
... (several same log messages)
2022-03-26 13:11:32.097 UTC openqa geekotest [3381]LOG: could not receive data from client: Connection reset by peer
2022-03-26 13:12:43.550 UTC openqa geekotest [6950]LOG: could not receive data from client: Connection reset by peer
2022-03-26 13:16:58.260 UTC openqa geekotest [11630]LOG: could not receive data from client: Connection reset by peer
2022-03-26 13:17:04.136 UTC openqa geekotest [8792]LOG: could not receive data from client: Connection reset by peer
... (several same log messages)
No slow queries at that time.
The apache log has one suspicious 502 a couple of minutes before:
[23/Mar/2022:17:45:36 +0000] "GET /tests/overview?result=skipped&result=obsoleted&result=parallel_failed&result=parallel_restarted&result=user_cancelled&result=user_restarted&distri=microos&distri=opensuse&version=Tumbleweed&build=20220318&groupid=38 HTTP/1.1" 502 1076 "-" "Mozilla/5.0 ..." 300108
# and
[26/Mar/2022:13:08:52 +0000] "GET /tests/overview?distri=opensuse&version=Tumbleweed HTTP/1.1" 502 1076 "-" "Mozilla/5.0 ..." 300102
But maybe @kraih or someone else knows better what to do with that information. I have no idea right now.
Updated by kraih over 2 years ago
- Assignee set to kraih
Will take a look at the collected information.
Updated by kraih over 2 years ago
- Assignee deleted (kraih)
I think the most common cause for could not receive data from client: Connection reset by peer is simply a process getting killed before it could properly shut down the socket to the PostgreSQL server. Unfortunately that doesn't tell us much that we didn't know before. Currently there are no Worker *** has no heartbeat (400 seconds), restarting messages in the O3 logs, but once i'm back from vacation i'll have another look at the log messages surrounding that one, maybe there's something to be found. One thing we do know now though, is that it's not just a slow query. If it is postgres related, it has to be something so severe that it kills the socket (maybe excessively large placeholder values breaking the wire protocol?).
Updated by kraih over 2 years ago
- Status changed from Workable to In Progress
- Assignee set to kraih
Updated by openqa_review over 2 years ago
- Due date set to 2022-05-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by kraih over 2 years ago
I'm not actually seeing the error in the logs anymore, perhaps it has been resolved unintentionally by another change.
Updated by kraih over 2 years ago
If this issue pops up again we should maybe try logging all database errors with one of the DBI hook APIs. (Not doing that now since in some cases we intentionally create exceptions and handle them with try/catch, so it could lead to a lot of log output...)
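For reference, a hedged sketch of one such hook (DBI's HandleError attribute; not something the ticket implemented): every database error would be logged centrally before the application's own try/catch sees the raised exception, which is exactly the log-volume concern mentioned above.

use strict;
use warnings;
use DBI;
use Mojo::Log;

my $log = Mojo::Log->new;
my $dbh = DBI->connect('dbi:Pg:dbname=openqa', 'geekotest', '', {
  RaiseError  => 1,
  HandleError => sub {
    my ($message, $handle) = @_;
    $log->error("DBI error: $message");  # log every error centrally
    return 0;                            # false: fall through to normal RaiseError handling
  },
});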
Updated by kraih over 2 years ago
- Status changed from Feedback to Resolved
Currently there are still no cases, so i'll consider this resolved for now. In case someone stumbles over this again in the future: if you see a single case it may very well be harmless (a database vacuum could block an already slow query long enough to trigger it, for example), but if there are multiple cases over a short time period it's definitely worth investigating.
Updated by tinita over 2 years ago
Just noting that we won't really notice it if the message appears because we are ignoring it.
Updated by okurz over 2 years ago
- Due date changed from 2022-05-04 to 2022-05-13
- Status changed from Resolved to Feedback
@kraih I removed the ignore rule from openqa-logwarn with https://github.com/os-autoinst/openqa-logwarn/pull/34 and immediately the next day I saw the message by email again, so I am reverting that removal with https://github.com/os-autoinst/openqa-logwarn/pull/35. It looks like our logs on o3 do not go back that far, but xzgrep 'heartbeat' /var/log/openqa* shows the error at least twice in the past two days:
/var/log/openqa.1.xz:[2022-05-03T01:48:15.126683Z] [error] Worker 324 has no heartbeat (400 seconds), restarting
/var/log/openqa.10.xz:[2022-04-30T10:12:07.993477Z] [error] Worker 16174 has no heartbeat (400 seconds), restarting
Updated by kraih over 2 years ago
- Status changed from Feedback to Workable
- Assignee deleted (kraih)
Bummer, i'll put the ticket back into the queue then, since all ideas i have to track it down further would be worse than the symptom.
Updated by livdywan over 2 years ago
Let's talk about it in tomorrow's mob session
Updated by okurz over 2 years ago
We looked into this together. On o3, with xzgrep heartbeat /var/log/openqa* we found:
/var/log/openqa.10.xz:[2022-05-03T01:48:15.126683Z] [error] Worker 324 has no heartbeat (400 seconds), restarting
/var/log/openqa.19.xz:[2022-04-30T10:12:07.993477Z] [error] Worker 16174 has no heartbeat (400 seconds), restarting
Comparing with the Apache access logs, excluding some services that don't use the main openQA webUI server, we found:
xzgrep '[0-9]\{6,\}$' /space/logs/apache2/openqa.access_log-20220504.xz | grep -v '\/v1\/ws' | grep -v '\/assets\/' | grep -v '\/liveviewhandler'
192.168.47.102 - - [03/May/2022:01:41:34 +0000] "GET /tests/overview?version=Tumbleweed&groupid=24 HTTP/1.1" 502 1076 "-" "Mozilla/5.0 (iPad; CPU OS 9_3_4 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13G35 Safari/601.1" 300104
The last number is the time the request took in ms, as configured by tinita in the Apache config.
The funny part is that "groupid=24" is the openQA-in-openQA test group, but the quoting with & is likely done wrongly here, so the real URL being called is probably /tests/overview?version=Tumbleweed. If one wanted to kill the server, one could just call /tests/overview without any filter parameters.
Updated by mkittler over 2 years ago
It is also the investigation route:
192.168.112.100 - - [03/May/2022:22:23:54 +0000] "GET /tests/2326528/investigation_ajax HTTP/1.1" 502 1076 "-" "python-requests/2.24.0" 300108
So the investigation should likely be a Minion job, and the web UI would poll for the results. As a further optimization we could also cache the results of the investigation (e.g. store them as a hidden asset so we can make use of the asset cleanup).
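A hedged sketch of that idea (hypothetical task and route names, not openQA's actual code): the expensive work runs in a Minion worker, so the prefork web worker only enqueues the job and keeps sending heartbeats, while the UI polls the job until the result is available.

use Mojolicious::Lite -signatures;

plugin Minion => {Pg => 'postgresql://geekotest@/openqa'};

# The slow investigation work happens here, in a separate Minion worker process
app->minion->add_task(investigate => sub ($job, $test_id) {
  my $result = {test => $test_id, note => 'placeholder investigation result'};
  $job->finish($result);
});

# The web route only enqueues the job and returns immediately
get '/tests/:id/investigation' => sub ($c) {
  my $job_id = $c->minion->enqueue(investigate => [$c->param('id')]);
  $c->render(json => {job_id => $job_id});
};

# The UI polls this until state is 'finished' and result is set
get '/minion_job/:job_id' => sub ($c) {
  my $info = $c->minion->job($c->param('job_id'))->info;
  $c->render(json => {state => $info->{state}, result => $info->{result}});
};

app->start;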
Updated by kraih over 2 years ago
We have replicated the issue on O3 with the path /tests/overview?version=Tumbleweed&groupid=24, which caused a new log entry:
[2022-05-05T12:29:55.485177Z] [error] Worker 22758 has no heartbeat (400 seconds), restarting
[2022-05-05T12:29:55.485366Z] [info] Stopping worker 22758 gracefully (800 seconds)
[2022-05-05T12:29:55.541685Z] [info] Worker 32438 started
Updated by mkittler over 2 years ago
For the test results overview page: I'd just put a limit on the database query (configurable). Paging is not useful and would also be very hard to implement. We can just add a hint "Limited to X results. Please use more specific parameters.".
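A hedged sketch of that limit (hypothetical query, column names, and config knob): fetching one row more than the cap makes it cheap to know whether the hint needs to be shown.

use strict;
use warnings;
use DBI;

my $limit = $ENV{OVERVIEW_MAX_JOBS} // 500;   # hypothetical, configurable cap
my $dbh   = DBI->connect('dbi:Pg:dbname=openqa', 'geekotest', '', {RaiseError => 1});
my $rows  = $dbh->selectall_arrayref(
  'SELECT id, result FROM jobs ORDER BY id DESC LIMIT ?',
  {Slice => {}}, $limit + 1,
);
my $limited = @$rows > $limit;   # true if the cap was hit
pop @$rows if $limited;
print "Limited to $limit results. Please use more specific parameters.\n" if $limited;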
Updated by livdywan over 2 years ago
- Copied to action #110677: Investigation page shouldn't involve blocking long-running API routes size:M added
Updated by livdywan over 2 years ago
- Copied to action #110680: Overview page shouldn't allow long-running requests without limits size:M added
Updated by okurz over 2 years ago
- Project changed from openQA Infrastructure to openQA Project
- Category set to Regressions/Crashes
Updated by okurz over 2 years ago
- Tags deleted (reactive work)
- Project changed from openQA Project to openQA Infrastructure
- Category deleted (Regressions/Crashes)
- Parent task set to #92854
Updated by okurz over 2 years ago
- Project changed from openQA Infrastructure to openQA Project
- Category set to Regressions/Crashes
Updated by livdywan over 2 years ago
- Tags set to reactive work
- Project changed from openQA Project to openQA Infrastructure
- Description updated (diff)
- Category deleted (Regressions/Crashes)
- Status changed from Workable to Resolved
- Parent task deleted (#92854)
Since we tracked down the cause together I'm resolving this ticket; see the related tickets for follow-ups (note the updated AC to reflect that this ticket isn't the solution).
Updated by okurz over 2 years ago
- Project changed from openQA Infrastructure to openQA Project
- Due date deleted (2022-05-13)
- Category set to Regressions/Crashes
- Status changed from Resolved to Feedback
- Assignee set to livdywan
- Parent task set to #92854
not resolved as long as we ignore the error message in openqa-logwarn.
Updated by livdywan over 2 years ago
okurz wrote:
not resolved as long as we ignore the error message in openqa-logwarn.
Updated by livdywan over 2 years ago
- Status changed from Feedback to Blocked
Updated by livdywan over 2 years ago
- Copied to deleted (action #110677: Investigation page shouldn't involve blocking long-running API routes size:M)
Updated by livdywan over 2 years ago
- Copied to deleted (action #110680: Overview page shouldn't allow long-running requests without limits size:M)
Updated by livdywan over 2 years ago
- Blocked by action #110677: Investigation page shouldn't involve blocking long-running API routes size:M added
Updated by livdywan over 2 years ago
- Blocks action #110680: Overview page shouldn't allow long-running requests without limits size:M added
Updated by livdywan over 2 years ago
- Blocks deleted (action #110680: Overview page shouldn't allow long-running requests without limits size:M)
Updated by livdywan over 2 years ago
- Blocked by action #110680: Overview page shouldn't allow long-running requests without limits size:M added
Updated by livdywan over 2 years ago
- Status changed from Blocked to Feedback
kraih wrote:
The blocker has been resolved.
The underlying causes have been addressed, hence we shouldn't ignore this anymore: https://github.com/os-autoinst/openqa-logwarn/pull/40
Updated by okurz over 2 years ago
- Status changed from Feedback to Resolved
Updated by okurz over 1 year ago
- Related to action #128345: [logwarn] Worker 30538 has no heartbeat (400 seconds), restarting size:M added
Updated by okurz 5 months ago
- Related to action #138536: Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:S added