action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M - openQA Project (public) - openSUSE Project Management Tool

Actions

action #106759

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #92854: [epic] limit overload of openQA webUI by heavy requests

Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M

Added by livdywan about 3 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

livdywan

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2022-02-03

Due date:

% Done:

Estimated time:

Tags:

reactive work

Description

Observation¶

# /var/log/openqa
[2022-02-12T04:12:47.195902Z] [error] Worker 22011 has no heartbeat (400 seconds), restarting
[2022-02-12T04:12:56.228360Z] [error] Worker 28596 has no heartbeat (400 seconds), restarting

Acceptance criteria¶

AC1: The cause of the heartbeat message is known

Suggestions¶

~~Add the message to the blocklist~~
~~Look at Mojolicious APIs related to preforking of Mojo workers~~ (problem is a blocked worker process, Mojolicious API can't help with that)
~~Extend the configured timeout from 400s~~ The timeout's already pretty high
~~Confirm where the errors are logged, and add context~~ (Mojolicious logs the error, and there is no context information to add, different process)
~~500 error in access_log from live_view_handler~~ (unrelated, time doesn't match and the live view handler doesn't use preforking) [15/Feb/2022:07:16:02 +0000] "GET /liveviewhandler/tests/2189494/developer/ws-proxy HTTP/1.1" 500

Additional info¶

from /usr/lib/perl5/vendor_perl/5.34.0/Mojolicious/Guides/FAQ.pod:

=head2 What does "Worker 31842 has no heartbeat (50 seconds), restarting" mean?

As long as they are accepting new connections, worker processes of all built-in pre-forking web servers send heartbeat
messages to the manager process at regular intervals, to signal that they are still responsive. A blocking operation
such as an infinite loop in your application can prevent this, and will force the affected worker to be restarted after
a timeout. This timeout defaults to C<50> seconds and can be extended with the attribute
L<Mojo::Server::Prefork/"heartbeat_timeout"> if your application requires it.

lib/Mojo/Server/Prefork.pm
 10     # No heartbeat (graceful stop)                                              
  9     $log->error("Worker $pid has no heartbeat ($ht seconds), restarting") and $w->{graceful} = $time
  8       if !$w->{graceful} && ($w->{time} + $interval + $ht <= $time);            
  7                                                                                 
  6     # Graceful stop with timeout                                                
  5     my $graceful = $w->{graceful} ||= $self->{graceful} ? $time : undef;        
  4     $log->info("Stopping worker $pid gracefully ($gt seconds)") and (kill 'QUIT', $pid or $self->_stopped($pid))
  3       if $graceful && !$w->{quit}++;

Related issues 5 (0 open — 5 closed)

Related to openQA Project (public) - action #128345: [logwarn] Worker 30538 has no heartbeat (400 seconds), restarting size:M

Resolved

kraih

2023-04-27

2023-05-20

Actions

Related to openQA Infrastructure (public) - action #138536: Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:S

Resolved

mkittler

2023-10-25

Actions

Blocked by openQA Project (public) - action #110677: Investigation page shouldn't involve blocking long-running API routes size:M

Resolved

tinita

2022-02-03

Actions

Blocked by openQA Project (public) - action #110680: Overview page shouldn't allow long-running requests without limits size:M

Resolved

kraih

2022-02-03

Actions

Copied from openQA Infrastructure (public) - action #105828: 4-7 logreport emails a day cause alert fatigue size:M

Resolved

tinita

2022-02-03

2022-02-17

Actions

Copy link

Updated by livdywan about 3 years ago

Copied from action #105828: 4-7 logreport emails a day cause alert fatigue size:M added

Actions

Copy link

Updated by tinita about 3 years ago

Assignee deleted (~~tinita~~)

Actions

Copy link

Updated by okurz about 3 years ago

Due date deleted (~~2022-02-17~~)
Priority changed from Normal to High

I think also here you did by mistake leave in target version and due date, didn't you? But it certainly needs to be handled with priority to be covered in openqa-logwarn although I think such task should be trivial to do for you on alert handling as well.

Actions

Copy link

Updated by livdywan about 3 years ago

Subject changed from Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 to Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M
Description updated (diff)
Status changed from New to Workable
Priority changed from High to Low

But it certainly needs to be handled with priority to be covered in openqa-logwarn

During the estimation we agreed this shouldn't be simply ignored. Although we don't consider it urgent and should monitor this further since it's not clear what effect it has in production.

Actions

Copy link

Updated by tinita about 3 years ago

Description updated (diff)

Actions

Copy link

Updated by kraih about 3 years ago

Seen the same in the OSD journal:

Feb 08 14:49:52 openqa openqa[13856]: [error] Worker 19842 has no heartbeat (400 seconds), restarting

There is only really one cause for this, and that's a blocking operation (like a database query) running for more than 400 seconds. Which would prevent the worker process from sending a heartbeat to the manager process. The fix would be a) to make sure nothing blocks the process for more than 400 seconds, or b) to increase the heartbeat timeout.

Actions

Copy link

Updated by kraih about 3 years ago

Looking through the OSD logs a bit more, i also found a few seemingly unrelated liveviewhandler 504 errors in the access_log around the same time (guess those might just be that common). The 502 seems suspiciously close to when the 400 second timeout must have started.

Feb 08 14:49:52 openqa openqa[13856]: [error] Worker 19842 has no heartbeat (400 seconds), restarting

[08/Feb/2022:14:43:10 +0100] "GET /tests/overview?build=&distri=sle&version=15-SP1 HTTP/1.1" 502 1071 "https://openqa.suse.de/tests"
[08/Feb/2022:14:58:50 +0100] "GET /liveviewhandler/tests/8119824/developer/ws-proxy/status HTTP/1.1" 504 - "-"
[08/Feb/2022:14:58:37 +0100] "GET /liveviewhandler/tests/8119823/developer/ws-proxy/status HTTP/1.1" 504 - "-"

The last thing the killed worker process logged was:

Feb 08 14:43:10 openqa openqa[19842]: [debug] No build and no group specified, will lookup build based on the other parameters

It does not appear to be a general problem with the request though, when i manually open the URL it finishes loading completely in 4.92s.

Actions

Copy link

Updated by kraih about 3 years ago

Description updated (diff)

Actions

Copy link

Updated by livdywan about 3 years ago

From openQA logreport on o3:

[2022-02-17T03:51:53.414618Z] [error] Worker 13242 has no heartbeat (400 seconds), restarting
[2022-02-17T03:52:08.416699Z] [error] Worker 7572 has no heartbeat (400 seconds), restarting
[2022-02-17T03:52:26.022451Z] [error] Worker 7891 has no heartbeat (400 seconds), restarting
[2022-02-17T03:54:41.762629Z] [error] Worker 14719 has no heartbeat (400 seconds), restarting
[2022-02-17T03:54:55.849639Z] [error] Worker 15029 has no heartbeat (400 seconds), restarting
[2022-02-17T03:55:15.120880Z] [error] Worker 13288 has no heartbeat (400 seconds), restarting
[2022-02-17T03:57:08.128696Z] [error] Worker 16322 has no heartbeat (400 seconds), restarting
[2022-02-17T03:59:52.744987Z] [error] Worker 17689 has no heartbeat (400 seconds), restarting

Actions

Copy link

#10

Updated by okurz about 3 years ago

Priority changed from Low to Urgent

cdywan wrote:

But it certainly needs to be handled with priority to be covered in openqa-logwarn

During the estimation we agreed this shouldn't be simply ignored. Although we don't consider it urgent and should monitor this further since it's not clear what effect it has in production.

You reduced prio to Low but nobody added something to logwarn. This means that every day we receive monitoring alerts about this causing the opposite of "this shouldn't be simply ignored"

Actions

Copy link

#11

Updated by livdywan about 3 years ago

okurz wrote:

You reduced prio to Low but nobody added something to logwarn. This means that every day we receive monitoring alerts about this causing the opposite of "this shouldn't be simply ignored"

That didn't look to be the case when I was on alert duty ;-) (Not disagreeing, answering your question)

Actions

Copy link

#12

Updated by tinita about 3 years ago

Re https://progress.opensuse.org/issues/106759#note-9

I looked for long running HTTP requests in the log from 17/Feb/2022:

% cat long-requests  | grep -v "GET /api/v1/ws/" | grep -v "GET /assets" | grep -v "GET /liveviewhandler/tests" | grep -v "GET .*/streaming " | grep '[0-9][0-9][0-9][0-9][0-9][0-9]$' 
192.168.112.100 - - [17/Feb/2022:00:47:29 +0000] "GET /tests/2192620/investigation_ajax HTTP/1.1" 200 1929881 "-" "openqa-investigate" 101193
192.168.47.46 - - [17/Feb/2022:03:32:41 +0000] "GET /tests/overview HTTP/1.1" 502 1082 "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300102
192.168.47.46 - - [17/Feb/2022:03:32:51 +0000] "GET /tests/overview HTTP/1.1" 502 1082 "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300135
192.168.47.46 - - [17/Feb/2022:03:45:13 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300101
192.168.47.46 - - [17/Feb/2022:03:45:43 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 304052
192.168.47.46 - - [17/Feb/2022:03:48:01 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300097
192.168.47.46 - - [17/Feb/2022:03:48:31 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300498
192.168.47.46 - - [17/Feb/2022:03:45:23 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 600197
192.168.47.46 - - [17/Feb/2022:03:50:23 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300098
192.168.47.46 - - [17/Feb/2022:03:48:11 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 600199
192.168.47.46 - - [17/Feb/2022:03:53:11 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300100

As you can see, most of them are between 3:32 and 3:53, and two of them took over 600 seconds.

So it doesn't exactly match the times from https://progress.opensuse.org/issues/106759#note-9 but still maybe related?

Actions

Copy link

#13

Updated by kraih about 3 years ago

We could try activating slow query logging in Postgres (300 seconds or more) and then match slow query timestamps with the worker has no heartbeat messages.

Actions

Copy link

#14

Updated by tinita about 3 years ago

Status changed from Workable to In Progress
Assignee set to tinita

Actions

Copy link

#15

Updated by tinita about 3 years ago

I enabled slow query logging (>= 300s) on o3.

Actions

Copy link

#16

Updated by tinita about 3 years ago

PR for ignoring the messages for now: https://github.com/os-autoinst/openqa-logwarn/pull/27

Not sure if this is wanted because Oli's statement in https://progress.opensuse.org/issues/106759#note-10 is hard to parse

Actions

Copy link

#17

Updated by tinita about 3 years ago

Status changed from In Progress to Feedback

Actions

Copy link

#18

Updated by okurz about 3 years ago

Priority changed from Urgent to Normal

tinita wrote:

Not sure if this is wanted because Oli's statement in https://progress.opensuse.org/issues/106759#note-10 is hard to parse

don't worry about it. The ticket is now set to "Urgent" and you are working on it. You prepared https://github.com/os-autoinst/openqa-logwarn/pull/27 and we merged that which is perfect. So we can reduce prio now and you can of course unassign again just as well as continue about the real problem which you already looked into. So all good from my side :)

Actions

Copy link

#19

Updated by tinita about 3 years ago

I changed the slow query threshold to >= 200s yesterday.
So far I can't see a slow query log.

Actions

Copy link

#20

Updated by tinita about 3 years ago

I set the slow query duration to 5s now.
I was able to trigger one with using the search box in the webui (duration: 5119.870 ms).
The slow queries don't land in their own log like in mysql, but in the general log under /var/lib/pgsql/data/log.
I will keep it at 5s and monitor how many we will get.
I did not find any slow queries over 200s for the last days.
Some long running queries seem to be aborted though, e.g. https://openqa.opensuse.org/tests/overview will return with a 502 after 5 minutes.

Actions

Copy link

#21

Updated by tinita about 3 years ago

Had a look at the o3 logs today. No "heartbeat" messages (logs go back to March 9). Very few slow (>=5s) queries.

Actions

Copy link

#22

Updated by tinita about 3 years ago

Status changed from Feedback to Workable
Assignee deleted (~~tinita~~)

I had a look at the logs.
We had two occurrences

openqa.19.xz:[2022-03-23T17:52:18.080336Z] [error] Worker 18831 has no heartbeat (400 seconds), restarting
openqa.7.xz:[2022-03-26T13:15:33.754772Z] [error] Worker 1803 has no heartbeat (400 seconds), restarting

I can see these postgres logs around those times:

... (several same log messages)
2022-03-23 17:43:24.908 UTC openqa geekotest [14594]LOG:  could not receive data from client: Connection reset by peer
2022-03-23 17:51:21.737 UTC openqa geekotest [19501]LOG:  could not receive data from client: Connection reset by peer
2022-03-23 17:56:04.548 UTC openqa geekotest [23583]LOG:  could not receive data from client: Connection reset by peer
2022-03-23 18:01:58.291 UTC openqa geekotest [23899]LOG:  could not receive data from client: Connection reset by peer
... (several same log messages)

... (several same log messages)
2022-03-26 13:11:32.097 UTC openqa geekotest [3381]LOG:  could not receive data from client: Connection reset by peer
2022-03-26 13:12:43.550 UTC openqa geekotest [6950]LOG:  could not receive data from client: Connection reset by peer
2022-03-26 13:16:58.260 UTC openqa geekotest [11630]LOG:  could not receive data from client: Connection reset by peer
2022-03-26 13:17:04.136 UTC openqa geekotest [8792]LOG:  could not receive data from client: Connection reset by peer
... (several same log messages)

No slow queries at that time.

The apache log has one suspicious 502 a couple of minutes before:

[23/Mar/2022:17:45:36 +0000] "GET /tests/overview?result=skipped&amp;result=obsoleted&amp;result=parallel_failed&amp;result=parallel_restarted&amp;result=user_cancelled&amp;result=user_restarted&amp;distri=microos&amp;distri=opensuse&amp;version=Tumbleweed&amp;build=20220318&amp;groupid=38 HTTP/1.1" 502 1076 "-" "Mozilla/5.0 ..." 300108

# and
[26/Mar/2022:13:08:52 +0000] "GET /tests/overview?distri=opensuse&version=Tumbleweed HTTP/1.1" 502 1076 "-" "Mozilla/5.0 ..." 300102

But maybe @kraih or someone else knows better what to do with that information. I have no idea right now.

Actions

Copy link

#23

Updated by kraih about 3 years ago

Assignee set to kraih

Will take a look at the collected information.

Actions

Copy link

#24

Updated by kraih about 3 years ago

Assignee deleted (~~kraih~~)

I think the most common cause for could not receive data from client: Connection reset by peer is simply a process getting killed before it could properly shut down the socket to the PostgreSQL server. Unfortunately that doesn't tell us much that we didn't know before. Currently there are no Worker *** has no heartbeat (400 seconds), restarting messages in the O3 logs, but once i'm back from vacation i'll have another look at the log messages surrounding that one, maybe there's something to be found. One thing we do know now though, is that it's not just a slow query. If it is postgres related, it has to be something so severe that it kills the socket (maybe excessively large placeholder values breaking the wire protocol?).

Actions

Copy link

#25

Updated by kraih almost 3 years ago

Status changed from Workable to In Progress
Assignee set to kraih

Actions

Copy link

#26

Updated by openqa_review almost 3 years ago

Due date set to 2022-05-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#27

Updated by kraih almost 3 years ago

I'm not actually seeing the error in the logs anymore, perhaps it has been resolved unintentionally by another change.

Actions

Copy link

#28

Updated by kraih almost 3 years ago

Status changed from In Progress to Feedback

Actions

Copy link

#29

Updated by kraih almost 3 years ago

If this issue pops up again we should maybe try logging all database errors with one of the DBI hook APIs. (Not doing that now since in some cases we intentionally create exceptions and handle them with try/catch, so it could lead to a lot of log output...)

Actions

Copy link

#30

Updated by kraih almost 3 years ago

Status changed from Feedback to Resolved

Currently there are still no cases, so i'll consider this resolved for now. In case someone stumbles over this in the future again: if you see a single case it may very well be harmless (database vacuum could block an already slow query long enough to trigger it for example), but if there's multiple cases over a short time period it's definitely worth investigating.

Actions

Copy link

#31

Updated by tinita almost 3 years ago

Just noting that we won't really notice it if the message appears because we are ignoring it.

Actions

Copy link

#32

Updated by okurz almost 3 years ago

Due date changed from 2022-05-04 to 2022-05-13
Status changed from Resolved to Feedback

@kraih I removed the ignoring from openqa-logwarn with https://github.com/os-autoinst/openqa-logwarn/pull/34 and immediately on the next day I now saw the message by email again so reverting the removal of ignoring again with https://github.com/os-autoinst/openqa-logwarn/pull/35 . It looks like our logs on o3 do not go back that long but xzgrep 'heartbeat' /var/log/openqa* shows the error at least two times from the past two days:

/var/log/openqa.1.xz:[2022-05-03T01:48:15.126683Z] [error] Worker 324 has no heartbeat (400 seconds), restarting
/var/log/openqa.10.xz:[2022-04-30T10:12:07.993477Z] [error] Worker 16174 has no heartbeat (400 seconds), restarting

Actions

Copy link

#33

Updated by kraih almost 3 years ago

Status changed from Feedback to Workable
Assignee deleted (~~kraih~~)

Bummer, i'll put the ticket back into the queue then, since all ideas i have to track it down further would be worse than the symptom.

Actions

Copy link

#34

Updated by livdywan almost 3 years ago

Let's talk about it in tomorrow's mob session

Actions

Copy link

#35

Updated by okurz almost 3 years ago

We looked into this together. On o3 with xzgrep heartbeat /var/log/openqa* we found

/var/log/openqa.10.xz:[2022-05-03T01:48:15.126683Z] [error] Worker 324 has no heartbeat (400 seconds), restarting 
/var/log/openqa.19.xz:[2022-04-30T10:12:07.993477Z] [error] Worker 16174 has no heartbeat (400 seconds), restarting

Comparing with apache access logs excluding some services that don't use the main openQA webUI server we found:

xzgrep '[0-9]\{6,\}$' /space/logs/apache2/openqa.access_log-20220504.xz | grep -v '\/v1\/ws' | grep -v '\/assets\/' | grep -v '\/liveviewhandler'                                                                                                                192.168.47.102 - - [03/May/2022:01:41:34 +0000] "GET /tests/overview?version=Tumbleweed&amp;groupid=24 HTTP/1.1" 502 1076 "-" "Mozilla/5.0 (iPad; CPU OS 9_3_4 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13G35 Safari/601.1" 300104

the last number is the time the request took in ms, configured by tinita in apache config.

The funny part is that "groupid=24" is openQA-in-openQA tests but likely the quoting with & is done wrongly here. So likely the real URL that is called is /tests/overview?version=Tumbleweed. If you try to kill the server then one could just call /tests/overview without any filter parameters.

Actions

Copy link

#36

Updated by mkittler almost 3 years ago

It is also the investigation route:

192.168.112.100 - - [03/May/2022:22:23:54 +0000] "GET /tests/2326528/investigation_ajax HTTP/1.1" 502 1076 "-" "python-requests/2.24.0" 300108

So the investigation should likely be a Minion job and the web UI would poll for the results. As a further optimization we could also cache the results of the investigation (e.g. store it as hidden asset so we can make use of the asset cleanup).

Actions

Copy link

#37

Updated by kraih almost 3 years ago

We have replicated the issue on O3 with the path /tests/overview?version=Tumbleweed&groupid=24, which caused a new log entry:

[2022-05-05T12:29:55.485177Z] [error] Worker 22758 has no heartbeat (400 seconds), restarting
[2022-05-05T12:29:55.485366Z] [info] Stopping worker 22758 gracefully (800 seconds)
[2022-05-05T12:29:55.541685Z] [info] Worker 32438 started

Actions

Copy link

#38

Updated by mkittler almost 3 years ago

For the test results overview page: I'd just put a limit on the database query (configurable). Paging is not useful and would also be very hard to implement. We can just add a hint "Limited to X results. Please use more specific parameters.".

Actions

Copy link

#39

Updated by livdywan almost 3 years ago

Copied to action #110677: Investigation page shouldn't involve blocking long-running API routes size:M added

Actions

Copy link

#40

Updated by livdywan almost 3 years ago

Copied to action #110680: Overview page shouldn't allow long-running requests without limits size:M added

Actions

Copy link

#41

Updated by okurz almost 3 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Category set to Regressions/Crashes

Actions

Copy link

#42

Updated by okurz almost 3 years ago

Tags deleted (~~reactive work~~)
Project changed from openQA Project (public) to openQA Infrastructure (public)
Category deleted (~~Regressions/Crashes~~)
Parent task set to #92854

Actions

Copy link

#43

Updated by okurz almost 3 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Category set to Regressions/Crashes

Actions

Copy link

#44

Updated by livdywan almost 3 years ago

Tags set to reactive work
Project changed from openQA Project (public) to openQA Infrastructure (public)
Description updated (diff)
Category deleted (~~Regressions/Crashes~~)
Status changed from Workable to Resolved
Parent task deleted (~~#92854~~)

Since we tracked down the cause together I'm resolving this ticket, see the related tickets for follow-ups (note the updated AC to reflect that this ticket isn't the solution).

Actions

Copy link

#45

Updated by okurz almost 3 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Due date deleted (~~2022-05-13~~)
Category set to Regressions/Crashes
Status changed from Resolved to Feedback
Assignee set to livdywan
Parent task set to #92854