action #106759
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #92854: [epic] limit overload of openQA webUI by heavy requests
Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M
Description
Observation
# /var/log/openqa
[2022-02-12T04:12:47.195902Z] [error] Worker 22011 has no heartbeat (400 seconds), restarting
[2022-02-12T04:12:56.228360Z] [error] Worker 28596 has no heartbeat (400 seconds), restarting
Acceptance criteria
AC1: The cause of the heartbeat message is known
Suggestions
- Add the message to the blocklist
- Look at Mojolicious APIs related to preforking of Mojo workers (problem is a blocked worker process, the Mojolicious API can't help with that)
- Extend the configured timeout from 400s (the timeout is already pretty high)
- Confirm where the errors are logged, and add context (Mojolicious logs the error, and there is no context information to add, different process)
- 500 error in access_log from live_view_handler (unrelated, the time doesn't match and the live view handler doesn't use preforking):
  [15/Feb/2022:07:16:02 +0000] "GET /liveviewhandler/tests/2189494/developer/ws-proxy HTTP/1.1" 500
Additional info
from /usr/lib/perl5/vendor_perl/5.34.0/Mojolicious/Guides/FAQ.pod:

=head2 What does "Worker 31842 has no heartbeat (50 seconds), restarting" mean?

As long as they are accepting new connections, worker processes of all built-in pre-forking web servers send heartbeat messages to the manager process at regular intervals, to signal that they are still responsive. A blocking operation such as an infinite loop in your application can prevent this, and will force the affected worker to be restarted after a timeout. This timeout defaults to C<50> seconds and can be extended with the attribute L<Mojo::Server::Prefork/"heartbeat_timeout"> if your application requires it.
lib/Mojo/Server/Prefork.pm:

  # No heartbeat (graceful stop)
  $log->error("Worker $pid has no heartbeat ($ht seconds), restarting") and $w->{graceful} = $time
    if !$w->{graceful} && ($w->{time} + $interval + $ht <= $time);

  # Graceful stop with timeout
  my $graceful = $w->{graceful} ||= $self->{graceful} ? $time : undef;
  $log->info("Stopping worker $pid gracefully ($gt seconds)") and (kill 'QUIT', $pid or $self->_stopped($pid))
    if $graceful && !$w->{quit}++;
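For completeness, a minimal sketch of raising that timeout on a standalone Mojo::Server::Prefork instance; the application class and listen port are placeholders, this is not how the openQA service is actually started:

  use Mojo::Server::Prefork;

  # Raise the heartbeat timeout from the default 50 seconds to 400 seconds.
  my $prefork = Mojo::Server::Prefork->new(listen => ['http://*:9526']);
  $prefork->heartbeat_timeout(400);
  $prefork->build_app('MyApp');    # load and instantiate the application class
  $prefork->run;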
Related issues
History
#1
Updated by cdywan 4 months ago
- Copied from action #105828: 4-7 logreport emails a day cause alert fatigue size:M added
#3
Updated by okurz 4 months ago
- Due date deleted (2022-02-17)
- Priority changed from Normal to High
I think you also left in the target version and due date here by mistake, didn't you? But it certainly needs to be handled with priority to be covered in openqa-logwarn, although I think such a task should be trivial for you to do during alert handling as well.
#4
Updated by cdywan 4 months ago
- Subject changed from Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 to Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from High to Low
But it certainly needs to be handled with priority to be covered in openqa-logwarn
During the estimation we agreed this shouldn't be simply ignored. Although we don't consider it urgent, we should monitor this further since it's not clear what effect it has in production.
#6
Updated by kraih 4 months ago
Seen the same in the OSD journal:
Feb 08 14:49:52 openqa openqa[13856]: [error] Worker 19842 has no heartbeat (400 seconds), restarting
There is only really one cause for this, and that's a blocking operation (like a database query) running for more than 400 seconds, which would prevent the worker process from sending a heartbeat to the manager process. The fix would be either a) to make sure nothing blocks the process for more than 400 seconds, or b) to increase the heartbeat timeout.
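To illustrate the difference (route names and the query helper below are made up for this sketch, it is not openQA code): a synchronous call that runs longer than the timeout starves the heartbeat, while offloading the work to a subprocess keeps the worker's event loop, and with it the heartbeat, running:

  use Mojolicious::Lite -signatures;
  use Mojo::IOLoop;

  # Stand-in for a database query that takes several minutes.
  sub run_expensive_query { sleep 500; return {ok => 1} }

  # Blocks the whole worker process; no heartbeat is sent while this runs,
  # so after heartbeat_timeout seconds the manager restarts the worker.
  get '/slow_blocking' => sub ($c) {
    $c->render(json => run_expensive_query());
  };

  # Runs the slow work in a forked subprocess; the worker stays responsive.
  get '/slow_nonblocking' => sub ($c) {
    $c->render_later;
    Mojo::IOLoop->subprocess(
      sub { run_expensive_query() },          # executed in the child process
      sub ($subprocess, $err, $result) {      # back in the worker process
        return $c->reply->exception($err) if $err;
        $c->render(json => $result);
      }
    );
  };

  app->start;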
#7
Updated by kraih 4 months ago
Looking through the OSD logs a bit more, I also found a few seemingly unrelated liveviewhandler 504 errors in the access_log around the same time (I guess those might just be that common). The 502 seems suspiciously close to when the 400 second timeout must have started.
Feb 08 14:49:52 openqa openqa[13856]: [error] Worker 19842 has no heartbeat (400 seconds), restarting
[08/Feb/2022:14:43:10 +0100] "GET /tests/overview?build=&distri=sle&version=15-SP1 HTTP/1.1" 502 1071 "https://openqa.suse.de/tests"
[08/Feb/2022:14:58:50 +0100] "GET /liveviewhandler/tests/8119824/developer/ws-proxy/status HTTP/1.1" 504 - "-"
[08/Feb/2022:14:58:37 +0100] "GET /liveviewhandler/tests/8119823/developer/ws-proxy/status HTTP/1.1" 504 - "-"
The last thing the killed worker process logged was:
Feb 08 14:43:10 openqa openqa[19842]: [debug] No build and no group specified, will lookup build based on the other parameters
It does not appear to be a general problem with the request though; when I manually open the URL it finishes loading completely in 4.92s.
#9
Updated by cdywan 4 months ago
From openQA logreport on o3:
[2022-02-17T03:51:53.414618Z] [error] Worker 13242 has no heartbeat (400 seconds), restarting
[2022-02-17T03:52:08.416699Z] [error] Worker 7572 has no heartbeat (400 seconds), restarting
[2022-02-17T03:52:26.022451Z] [error] Worker 7891 has no heartbeat (400 seconds), restarting
[2022-02-17T03:54:41.762629Z] [error] Worker 14719 has no heartbeat (400 seconds), restarting
[2022-02-17T03:54:55.849639Z] [error] Worker 15029 has no heartbeat (400 seconds), restarting
[2022-02-17T03:55:15.120880Z] [error] Worker 13288 has no heartbeat (400 seconds), restarting
[2022-02-17T03:57:08.128696Z] [error] Worker 16322 has no heartbeat (400 seconds), restarting
[2022-02-17T03:59:52.744987Z] [error] Worker 17689 has no heartbeat (400 seconds), restarting
#10
Updated by okurz 4 months ago
- Priority changed from Low to Urgent
cdywan wrote:
But it certainly needs to be handled with priority to be covered in openqa-logwarn
During the estimation we agreed this shouldn't be simply ignored. Although we don't consider it urgent, we should monitor this further since it's not clear what effect it has in production.
You reduced the prio to Low but nobody added anything to logwarn. This means that every day we receive monitoring alerts about this, causing the opposite of "this shouldn't be simply ignored".
#11
Updated by cdywan 4 months ago
okurz wrote:
You reduced the prio to Low but nobody added anything to logwarn. This means that every day we receive monitoring alerts about this, causing the opposite of "this shouldn't be simply ignored".
That didn't look to be the case when I was on alert duty ;-) (Not disagreeing, answering your question)
#12
Updated by tinita 4 months ago
Re https://progress.opensuse.org/issues/106759#note-9
I looked for long running HTTP requests in the log from 17/Feb/2022:
% cat long-requests | grep -v "GET /api/v1/ws/" | grep -v "GET /assets" | grep -v "GET /liveviewhandler/tests" | grep -v "GET .*/streaming " | grep '[0-9][0-9][0-9][0-9][0-9][0-9]$'
192.168.112.100 - - [17/Feb/2022:00:47:29 +0000] "GET /tests/2192620/investigation_ajax HTTP/1.1" 200 1929881 "-" "openqa-investigate" 101193
192.168.47.46 - - [17/Feb/2022:03:32:41 +0000] "GET /tests/overview HTTP/1.1" 502 1082 "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300102
192.168.47.46 - - [17/Feb/2022:03:32:51 +0000] "GET /tests/overview HTTP/1.1" 502 1082 "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300135
192.168.47.46 - - [17/Feb/2022:03:45:13 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300101
192.168.47.46 - - [17/Feb/2022:03:45:43 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 304052
192.168.47.46 - - [17/Feb/2022:03:48:01 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300097
192.168.47.46 - - [17/Feb/2022:03:48:31 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300498
192.168.47.46 - - [17/Feb/2022:03:45:23 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 600197
192.168.47.46 - - [17/Feb/2022:03:50:23 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300098
192.168.47.46 - - [17/Feb/2022:03:48:11 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 600199
192.168.47.46 - - [17/Feb/2022:03:53:11 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300100
As you can see, most of them are between 3:32 and 3:53, and two of them took over 600 seconds.
So it doesn't exactly match the times from https://progress.opensuse.org/issues/106759#note-9, but it might still be related.
#16
Updated by tinita 4 months ago
PR for ignoring the messages for now: https://github.com/os-autoinst/openqa-logwarn/pull/27
Not sure if this is wanted because Oli's statement in https://progress.opensuse.org/issues/106759#note-10 is hard to parse
#18
Updated by okurz 4 months ago
- Priority changed from Urgent to Normal
tinita wrote:
Not sure if this is wanted because Oli's statement in https://progress.opensuse.org/issues/106759#note-10 is hard to parse
Don't worry about it. The ticket is now set to "Urgent" and you are working on it. You prepared https://github.com/os-autoinst/openqa-logwarn/pull/27 and we merged that, which is perfect. So we can reduce the prio now, and you can of course unassign again or continue with the real problem, which you already looked into. So all good from my side :)
#20
Updated by tinita 4 months ago
I set the slow query duration to 5s now.
I was able to trigger one using the search box in the webUI (duration: 5119.870 ms).
The slow queries don't land in their own log like in MySQL, but in the general log under /var/lib/pgsql/data/log.
I will keep it at 5s and monitor how many we will get.
I did not find any slow queries over 200s for the last days.
Some long running queries seem to be aborted though, e.g. https://openqa.opensuse.org/tests/overview will return with a 502 after 5 minutes.
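For reference, a rough sketch of what the 5s threshold boils down to on the PostgreSQL side, the log_min_duration_statement setting; the connection string and superuser access are assumptions, not how it was actually configured on o3 (setting it directly in postgresql.conf achieves the same):

  use Mojo::Pg;

  # Log every statement that runs longer than 5 seconds into the general
  # PostgreSQL log; requires superuser privileges and a config reload.
  my $pg = Mojo::Pg->new('postgresql://postgres@localhost/openqa');
  $pg->db->query(q{ALTER SYSTEM SET log_min_duration_statement = '5s'});
  $pg->db->query('SELECT pg_reload_conf()');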
#22
Updated by tinita 3 months ago
- Status changed from Feedback to Workable
- Assignee deleted (tinita)
I had a look at the logs.
We had two occurrences:
openqa.19.xz:[2022-03-23T17:52:18.080336Z] [error] Worker 18831 has no heartbeat (400 seconds), restarting
openqa.7.xz:[2022-03-26T13:15:33.754772Z] [error] Worker 1803 has no heartbeat (400 seconds), restarting
I can see these postgres logs around those times:
... (several same log messages)
2022-03-23 17:43:24.908 UTC openqa geekotest [14594]LOG: could not receive data from client: Connection reset by peer
2022-03-23 17:51:21.737 UTC openqa geekotest [19501]LOG: could not receive data from client: Connection reset by peer
2022-03-23 17:56:04.548 UTC openqa geekotest [23583]LOG: could not receive data from client: Connection reset by peer
2022-03-23 18:01:58.291 UTC openqa geekotest [23899]LOG: could not receive data from client: Connection reset by peer
... (several same log messages)

... (several same log messages)
2022-03-26 13:11:32.097 UTC openqa geekotest [3381]LOG: could not receive data from client: Connection reset by peer
2022-03-26 13:12:43.550 UTC openqa geekotest [6950]LOG: could not receive data from client: Connection reset by peer
2022-03-26 13:16:58.260 UTC openqa geekotest [11630]LOG: could not receive data from client: Connection reset by peer
2022-03-26 13:17:04.136 UTC openqa geekotest [8792]LOG: could not receive data from client: Connection reset by peer
... (several same log messages)
No slow queries at that time.
The apache log has one suspicious 502 a couple of minutes before:
[23/Mar/2022:17:45:36 +0000] "GET /tests/overview?result=skipped&result=obsoleted&result=parallel_failed&result=parallel_restarted&result=user_cancelled&result=user_restarted&distri=microos&distri=opensuse&version=Tumbleweed&build=20220318&groupid=38 HTTP/1.1" 502 1076 "-" "Mozilla/5.0 ..." 300108
# and
[26/Mar/2022:13:08:52 +0000] "GET /tests/overview?distri=opensuse&version=Tumbleweed HTTP/1.1" 502 1076 "-" "Mozilla/5.0 ..." 300102
But maybe kraih or someone else knows better what to do with that information. I have no idea right now.
#24
Updated by kraih 3 months ago
- Assignee deleted (kraih)
I think the most common cause for "could not receive data from client: Connection reset by peer" is simply a process getting killed before it could properly shut down the socket to the PostgreSQL server. Unfortunately that doesn't tell us much that we didn't know before. Currently there are no "Worker *** has no heartbeat (400 seconds), restarting" messages in the O3 logs, but once I'm back from vacation I'll have another look at the log messages surrounding that one, maybe there's something to be found. One thing we do know now though is that it's not just a slow query. If it is postgres related, it has to be something so severe that it kills the socket (maybe excessively large placeholder values breaking the wire protocol?).
#26
Updated by openqa_review 2 months ago
- Due date set to 2022-05-04
Setting due date based on mean cycle time of SUSE QE Tools
#30
Updated by kraih about 2 months ago
- Status changed from Feedback to Resolved
Currently there are still no cases, so I'll consider this resolved for now. In case someone stumbles over this in the future again: if you see a single case it may very well be harmless (a database vacuum could block an already slow query long enough to trigger it, for example), but if there are multiple cases over a short time period it's definitely worth investigating.
#31
Updated by tinita about 2 months ago
Just noting that we won't really notice if the message appears, because we are ignoring it.
#32
Updated by okurz about 2 months ago
- Due date changed from 2022-05-04 to 2022-05-13
- Status changed from Resolved to Feedback
kraih I removed the ignoring from openqa-logwarn with https://github.com/os-autoinst/openqa-logwarn/pull/34 and immediately the next day I saw the message by email again, so I am reverting that removal with https://github.com/os-autoinst/openqa-logwarn/pull/35. It looks like our logs on o3 do not go back that far, but xzgrep 'heartbeat' /var/log/openqa* shows the error at least two times in the past two days:
/var/log/openqa.1.xz:[2022-05-03T01:48:15.126683Z] [error] Worker 324 has no heartbeat (400 seconds), restarting
/var/log/openqa.10.xz:[2022-04-30T10:12:07.993477Z] [error] Worker 16174 has no heartbeat (400 seconds), restarting
#33
Updated by kraih about 2 months ago
- Status changed from Feedback to Workable
- Assignee deleted (kraih)
Bummer, I'll put the ticket back into the queue then, since all ideas I have to track it down further would be worse than the symptom.
#34
Updated by cdywan about 2 months ago
Let's talk about it in tomorrow's mob session
#35
Updated by okurz about 2 months ago
We looked into this together. On o3 with xzgrep heartbeat /var/log/openqa* we found:
/var/log/openqa.10.xz:[2022-05-03T01:48:15.126683Z] [error] Worker 324 has no heartbeat (400 seconds), restarting
/var/log/openqa.19.xz:[2022-04-30T10:12:07.993477Z] [error] Worker 16174 has no heartbeat (400 seconds), restarting
Comparing with the apache access logs, excluding some services that don't use the main openQA webUI server, we found:
xzgrep '[0-9]\{6,\}$' /space/logs/apache2/openqa.access_log-20220504.xz | grep -v '\/v1\/ws' | grep -v '\/assets\/' | grep -v '\/liveviewhandler'
192.168.47.102 - - [03/May/2022:01:41:34 +0000] "GET /tests/overview?version=Tumbleweed&groupid=24 HTTP/1.1" 502 1076 "-" "Mozilla/5.0 (iPad; CPU OS 9_3_4 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13G35 Safari/601.1" 300104
The last number is the time the request took in ms, as configured by tinita in the apache config.
The funny part is that "groupid=24" is the openQA-in-openQA test group, but the quoting with & is likely done wrongly here, so the real URL that is called is probably /tests/overview?version=Tumbleweed. If you wanted to kill the server, you could just call /tests/overview without any filter parameters.
#36
Updated by mkittler about 2 months ago
It is also the investigation route:
192.168.112.100 - - [03/May/2022:22:23:54 +0000] "GET /tests/2326528/investigation_ajax HTTP/1.1" 502 1076 "-" "python-requests/2.24.0" 300108
So the investigation should likely be a Minion job, and the web UI would poll for the results. As a further optimization we could also cache the results of the investigation (e.g. store them as a hidden asset so we can make use of the asset cleanup).
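A rough sketch of that idea; the route, task name, backend, and parameters are assumptions for illustration, not the eventual implementation. The first request enqueues the job, and subsequent polls pass back the Minion job id until the result is ready:

  use Mojolicious::Lite -signatures;

  plugin Minion => {Pg => 'postgresql://openqa@localhost/openqa'};    # backend is an assumption

  # Placeholder task doing the (potentially slow) investigation work.
  app->minion->add_task(investigate_job => sub ($job, $testid) {
    $job->finish({testid => $testid, diff => '...'});
  });

  get '/tests/:testid/investigation_ajax' => sub ($c) {
    my $minion = $c->app->minion;
    my $id     = $c->param('minion_id')
      // $minion->enqueue(investigate_job => [$c->param('testid')]);
    my $job    = $minion->job($id) or return $c->reply->not_found;
    my $info   = $job->info;
    return $c->render(json => {minion_id => $id, state => $info->{state}})
      unless $info->{state} eq 'finished';
    $c->render(json => $info->{result});
  };

  app->start;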
#37
Updated by kraih about 2 months ago
We have replicated the issue on O3 with the path /tests/overview?version=Tumbleweed&groupid=24, which caused a new log entry:
[2022-05-05T12:29:55.485177Z] [error] Worker 22758 has no heartbeat (400 seconds), restarting
[2022-05-05T12:29:55.485366Z] [info] Stopping worker 22758 gracefully (800 seconds)
[2022-05-05T12:29:55.541685Z] [info] Worker 32438 started
#38
Updated by mkittler about 2 months ago
For the test results overview page: I'd just put a limit on the database query (configurable). Paging is not useful and would also be very hard to implement. We can just add a hint "Limited to X results. Please use more specific parameters.".
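A sketch of what that could look like with a DBIx::Class resultset; the helper name, resultset, and config key are placeholders, not the eventual openQA change:

  use Mojo::Base -strict, -signatures;

  # Fetch at most $limit + 1 rows so we can tell whether the result was cut off.
  sub limited_results ($rs, $limit) {
    my @rows           = $rs->search(undef, {rows => $limit + 1})->all;
    my $limit_exceeded = @rows > $limit;
    pop @rows if $limit_exceeded;
    return (\@rows, $limit_exceeded);
  }

  # In the controller (sketch): stash the flag so the template can show the
  # "Limited to X results. Please use more specific parameters." hint.
  # my ($jobs, $limit_exceeded) = limited_results($jobs_rs, $config->{limit} // 5000);
  # $c->stash(jobs => $jobs, limit_exceeded => $limit_exceeded);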
#39
Updated by cdywan about 2 months ago
- Copied to action #110677: Investigation page shouldn't involve blocking long-running API routes size:M added
#40
Updated by cdywan about 2 months ago
- Copied to action #110680: Overview page shouldn't allow long-running requests without limits size:M added
#41
Updated by okurz about 2 months ago
- Project changed from openQA Infrastructure to openQA Project
- Category set to Concrete Bugs
#42
Updated by okurz about 2 months ago
- Tags deleted (reactive work)
- Project changed from openQA Project to openQA Infrastructure
- Category deleted (Concrete Bugs)
- Parent task set to #92854
#43
Updated by okurz about 2 months ago
- Project changed from openQA Infrastructure to openQA Project
- Category set to Concrete Bugs
#44
Updated by cdywan about 2 months ago
- Tags set to reactive work
- Project changed from openQA Project to openQA Infrastructure
- Description updated (diff)
- Category deleted (Concrete Bugs)
- Status changed from Workable to Resolved
- Parent task deleted (#92854)
Since we tracked down the cause together I'm resolving this ticket; see the related tickets for follow-ups (note the updated AC to reflect that this ticket isn't the solution).
#45
Updated by okurz about 2 months ago
- Project changed from openQA Infrastructure to openQA Project
- Due date deleted (2022-05-13)
- Category set to Concrete Bugs
- Status changed from Resolved to Feedback
- Assignee set to cdywan
- Parent task set to #92854
not resolved as long as we ignore the error message in openqa-logwarn.
#46
Updated by cdywan about 2 months ago
okurz wrote:
not resolved as long as we ignore the error message in openqa-logwarn.
#47
Updated by cdywan about 2 months ago
- Status changed from Feedback to Blocked
#48
Updated by cdywan about 2 months ago
- Copied to deleted (action #110677: Investigation page shouldn't involve blocking long-running API routes size:M)
#49
Updated by cdywan about 2 months ago
- Copied to deleted (action #110680: Overview page shouldn't allow long-running requests without limits size:M)
#50
Updated by cdywan about 2 months ago
- Blocked by action #110677: Investigation page shouldn't involve blocking long-running API routes size:M added
#51
Updated by cdywan about 2 months ago
- Blocks action #110680: Overview page shouldn't allow long-running requests without limits size:M added
#52
Updated by cdywan about 2 months ago
- Blocks deleted (action #110680: Overview page shouldn't allow long-running requests without limits size:M)
#53
Updated by cdywan about 2 months ago
- Blocked by action #110680: Overview page shouldn't allow long-running requests without limits size:M added
#55
Updated by cdywan 3 days ago
- Status changed from Blocked to Feedback
kraih wrote:
The blocker has been resolved.
The underlying causes have been addressed, hence we shouldn't ignore this anymore: https://github.com/os-autoinst/openqa-logwarn/pull/40