Project

General

Profile

action #106759

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #92854: [epic] limit overload of openQA webUI by heavy requests

Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M

Added by cdywan 4 months ago. Updated 3 days ago.

Status:
Feedback
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2022-02-03
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

# /var/log/openqa
[2022-02-12T04:12:47.195902Z] [error] Worker 22011 has no heartbeat (400 seconds), restarting
[2022-02-12T04:12:56.228360Z] [error] Worker 28596 has no heartbeat (400 seconds), restarting

Acceptance criteria

AC1: The cause of the heartbeat message is known

Suggestions

  • Add the message to the blocklist
  • Look at Mojolicious APIs related to preforking of Mojo workers (problem is a blocked worker process, Mojolicious API can't help with that)
  • Extend the configured timeout from 400s The timeout's already pretty high
  • Confirm where the errors are logged, and add context (Mojolicious logs the error, and there is no context information to add, different process)
  • 500 error in access_log from live_view_handler (unrelated, time doesn't match and the live view handler doesn't use preforking) [15/Feb/2022:07:16:02 +0000] "GET /liveviewhandler/tests/2189494/developer/ws-proxy HTTP/1.1" 500

Additional info

from /usr/lib/perl5/vendor_perl/5.34.0/Mojolicious/Guides/FAQ.pod:

=head2 What does "Worker 31842 has no heartbeat (50 seconds), restarting" mean?

As long as they are accepting new connections, worker processes of all built-in pre-forking web servers send heartbeat
messages to the manager process at regular intervals, to signal that they are still responsive. A blocking operation
such as an infinite loop in your application can prevent this, and will force the affected worker to be restarted after
a timeout. This timeout defaults to C<50> seconds and can be extended with the attribute
L<Mojo::Server::Prefork/"heartbeat_timeout"> if your application requires it.
lib/Mojo/Server/Prefork.pm
 10     # No heartbeat (graceful stop)                                              
  9     $log->error("Worker $pid has no heartbeat ($ht seconds), restarting") and $w->{graceful} = $time
  8       if !$w->{graceful} && ($w->{time} + $interval + $ht <= $time);            
  7                                                                                 
  6     # Graceful stop with timeout                                                
  5     my $graceful = $w->{graceful} ||= $self->{graceful} ? $time : undef;        
  4     $log->info("Stopping worker $pid gracefully ($gt seconds)") and (kill 'QUIT', $pid or $self->_stopped($pid))
  3       if $graceful && !$w->{quit}++;                                            

Related issues

Blocked by openQA Project - action #110677: Investigation page shouldn't involve blocking long-running API routes size:MResolved2022-02-03

Blocked by openQA Project - action #110680: Overview page shouldn't allow long-running requests without limits size:MResolved2022-02-03

Copied from openQA Infrastructure - action #105828: 4-7 logreport emails a day cause alert fatigue size:MResolved2022-02-032022-02-17

History

#1 Updated by cdywan 4 months ago

  • Copied from action #105828: 4-7 logreport emails a day cause alert fatigue size:M added

#2 Updated by tinita 4 months ago

  • Assignee deleted (tinita)

#3 Updated by okurz 4 months ago

  • Due date deleted (2022-02-17)
  • Priority changed from Normal to High

I think also here you did by mistake leave in target version and due date, didn't you? But it certainly needs to be handled with priority to be covered in openqa-logwarn although I think such task should be trivial to do for you on alert handling as well.

#4 Updated by cdywan 4 months ago

  • Subject changed from Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 to Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from High to Low

But it certainly needs to be handled with priority to be covered in openqa-logwarn

During the estimation we agreed this shouldn't be simply ignored. Although we don't consider it urgent and should monitor this further since it's not clear what effect it has in production.

#5 Updated by tinita 4 months ago

  • Description updated (diff)

#6 Updated by kraih 4 months ago

Seen the same in the OSD journal:

Feb 08 14:49:52 openqa openqa[13856]: [error] Worker 19842 has no heartbeat (400 seconds), restarting

There is only really one cause for this, and that's a blocking operation (like a database query) running for more than 400 seconds. Which would prevent the worker process from sending a heartbeat to the manager process. The fix would be a) to make sure nothing blocks the process for more than 400 seconds, or b) to increase the heartbeat timeout.

#7 Updated by kraih 4 months ago

Looking through the OSD logs a bit more, i also found a few seemingly unrelated liveviewhandler 504 errors in the access_log around the same time (guess those might just be that common). The 502 seems suspiciously close to when the 400 second timeout must have started.

Feb 08 14:49:52 openqa openqa[13856]: [error] Worker 19842 has no heartbeat (400 seconds), restarting
[08/Feb/2022:14:43:10 +0100] "GET /tests/overview?build=&distri=sle&version=15-SP1 HTTP/1.1" 502 1071 "https://openqa.suse.de/tests"
[08/Feb/2022:14:58:50 +0100] "GET /liveviewhandler/tests/8119824/developer/ws-proxy/status HTTP/1.1" 504 - "-"
[08/Feb/2022:14:58:37 +0100] "GET /liveviewhandler/tests/8119823/developer/ws-proxy/status HTTP/1.1" 504 - "-"

The last thing the killed worker process logged was:

Feb 08 14:43:10 openqa openqa[19842]: [debug] No build and no group specified, will lookup build based on the other parameters

It does not appear to be a general problem with the request though, when i manually open the URL it finishes loading completely in 4.92s.

#8 Updated by kraih 4 months ago

  • Description updated (diff)

#9 Updated by cdywan 4 months ago

From openQA logreport on o3:

[2022-02-17T03:51:53.414618Z] [error] Worker 13242 has no heartbeat (400 seconds), restarting
[2022-02-17T03:52:08.416699Z] [error] Worker 7572 has no heartbeat (400 seconds), restarting
[2022-02-17T03:52:26.022451Z] [error] Worker 7891 has no heartbeat (400 seconds), restarting
[2022-02-17T03:54:41.762629Z] [error] Worker 14719 has no heartbeat (400 seconds), restarting
[2022-02-17T03:54:55.849639Z] [error] Worker 15029 has no heartbeat (400 seconds), restarting
[2022-02-17T03:55:15.120880Z] [error] Worker 13288 has no heartbeat (400 seconds), restarting
[2022-02-17T03:57:08.128696Z] [error] Worker 16322 has no heartbeat (400 seconds), restarting
[2022-02-17T03:59:52.744987Z] [error] Worker 17689 has no heartbeat (400 seconds), restarting

#10 Updated by okurz 4 months ago

  • Priority changed from Low to Urgent

cdywan wrote:

But it certainly needs to be handled with priority to be covered in openqa-logwarn

During the estimation we agreed this shouldn't be simply ignored. Although we don't consider it urgent and should monitor this further since it's not clear what effect it has in production.

You reduced prio to Low but nobody added something to logwarn. This means that every day we receive monitoring alerts about this causing the opposite of "this shouldn't be simply ignored"

#11 Updated by cdywan 4 months ago

okurz wrote:

You reduced prio to Low but nobody added something to logwarn. This means that every day we receive monitoring alerts about this causing the opposite of "this shouldn't be simply ignored"

That didn't look to be the case when I was on alert duty ;-) (Not disagreeing, answering your question)

#12 Updated by tinita 4 months ago

Re https://progress.opensuse.org/issues/106759#note-9

I looked for long running HTTP requests in the log from 17/Feb/2022:

% cat long-requests  | grep -v "GET /api/v1/ws/" | grep -v "GET /assets" | grep -v "GET /liveviewhandler/tests" | grep -v "GET .*/streaming " | grep '[0-9][0-9][0-9][0-9][0-9][0-9]$' 
192.168.112.100 - - [17/Feb/2022:00:47:29 +0000] "GET /tests/2192620/investigation_ajax HTTP/1.1" 200 1929881 "-" "openqa-investigate" 101193
192.168.47.46 - - [17/Feb/2022:03:32:41 +0000] "GET /tests/overview HTTP/1.1" 502 1082 "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300102
192.168.47.46 - - [17/Feb/2022:03:32:51 +0000] "GET /tests/overview HTTP/1.1" 502 1082 "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300135
192.168.47.46 - - [17/Feb/2022:03:45:13 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300101
192.168.47.46 - - [17/Feb/2022:03:45:43 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 304052
192.168.47.46 - - [17/Feb/2022:03:48:01 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300097
192.168.47.46 - - [17/Feb/2022:03:48:31 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300498
192.168.47.46 - - [17/Feb/2022:03:45:23 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 600197
192.168.47.46 - - [17/Feb/2022:03:50:23 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300098
192.168.47.46 - - [17/Feb/2022:03:48:11 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 600199
192.168.47.46 - - [17/Feb/2022:03:53:11 +0000] "GET /tests/overview HTTP/1.1" 406 - "-" "Mozilla/5.0 [en] (X11, U; OpenVAS-VT 21.4.3)" 300100

As you can see, most of them are between 3:32 and 3:53, and two of them took over 600 seconds.

So it doesn't exactly match the times from https://progress.opensuse.org/issues/106759#note-9 but still maybe related?

#13 Updated by kraih 4 months ago

We could try activating slow query logging in Postgres (300 seconds or more) and then match slow query timestamps with the worker has no heartbeat messages.

#14 Updated by tinita 4 months ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita

#15 Updated by tinita 4 months ago

I enabled slow query logging (>= 300s) on o3.

#16 Updated by tinita 4 months ago

PR for ignoring the messages for now: https://github.com/os-autoinst/openqa-logwarn/pull/27

Not sure if this is wanted because Oli's statement in https://progress.opensuse.org/issues/106759#note-10 is hard to parse

#17 Updated by tinita 4 months ago

  • Status changed from In Progress to Feedback

#18 Updated by okurz 4 months ago

  • Priority changed from Urgent to Normal

tinita wrote:

Not sure if this is wanted because Oli's statement in https://progress.opensuse.org/issues/106759#note-10 is hard to parse

don't worry about it. The ticket is now set to "Urgent" and you are working on it. You prepared https://github.com/os-autoinst/openqa-logwarn/pull/27 and we merged that which is perfect. So we can reduce prio now and you can of course unassign again just as well as continue about the real problem which you already looked into. So all good from my side :)

#19 Updated by tinita 4 months ago

I changed the slow query threshold to >= 200s yesterday.
So far I can't see a slow query log.

#20 Updated by tinita 4 months ago

I set the slow query duration to 5s now.
I was able to trigger one with using the search box in the webui (duration: 5119.870 ms).
The slow queries don't land in their own log like in mysql, but in the general log under /var/lib/pgsql/data/log.
I will keep it at 5s and monitor how many we will get.
I did not find any slow queries over 200s for the last days.
Some long running queries seem to be aborted though, e.g. https://openqa.opensuse.org/tests/overview will return with a 502 after 5 minutes.

#21 Updated by tinita 3 months ago

Had a look at the o3 logs today. No "heartbeat" messages (logs go back to March 9). Very few slow (>=5s) queries.

#22 Updated by tinita 3 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (tinita)

I had a look at the logs.
We had two occurrences

openqa.19.xz:[2022-03-23T17:52:18.080336Z] [error] Worker 18831 has no heartbeat (400 seconds), restarting
openqa.7.xz:[2022-03-26T13:15:33.754772Z] [error] Worker 1803 has no heartbeat (400 seconds), restarting

I can see these postgres logs around those times:

... (several same log messages)
2022-03-23 17:43:24.908 UTC openqa geekotest [14594]LOG:  could not receive data from client: Connection reset by peer
2022-03-23 17:51:21.737 UTC openqa geekotest [19501]LOG:  could not receive data from client: Connection reset by peer
2022-03-23 17:56:04.548 UTC openqa geekotest [23583]LOG:  could not receive data from client: Connection reset by peer
2022-03-23 18:01:58.291 UTC openqa geekotest [23899]LOG:  could not receive data from client: Connection reset by peer
... (several same log messages)

... (several same log messages)
2022-03-26 13:11:32.097 UTC openqa geekotest [3381]LOG:  could not receive data from client: Connection reset by peer
2022-03-26 13:12:43.550 UTC openqa geekotest [6950]LOG:  could not receive data from client: Connection reset by peer
2022-03-26 13:16:58.260 UTC openqa geekotest [11630]LOG:  could not receive data from client: Connection reset by peer
2022-03-26 13:17:04.136 UTC openqa geekotest [8792]LOG:  could not receive data from client: Connection reset by peer
... (several same log messages)

No slow queries at that time.

The apache log has one suspicious 502 a couple of minutes before:

[23/Mar/2022:17:45:36 +0000] "GET /tests/overview?result=skipped&amp;result=obsoleted&amp;result=parallel_failed&amp;result=parallel_restarted&amp;result=user_cancelled&amp;result=user_restarted&amp;distri=microos&amp;distri=opensuse&amp;version=Tumbleweed&amp;build=20220318&amp;groupid=38 HTTP/1.1" 502 1076 "-" "Mozilla/5.0 ..." 300108

# and
[26/Mar/2022:13:08:52 +0000] "GET /tests/overview?distri=opensuse&version=Tumbleweed HTTP/1.1" 502 1076 "-" "Mozilla/5.0 ..." 300102

But maybe kraih or someone else knows better what to do with that information. I have no idea right now.

#23 Updated by kraih 3 months ago

  • Assignee set to kraih

Will take a look at the collected information.

#24 Updated by kraih 3 months ago

  • Assignee deleted (kraih)

I think the most common cause for could not receive data from client: Connection reset by peer is simply a process getting killed before it could properly shut down the socket to the PostgreSQL server. Unfortunately that doesn't tell us much that we didn't know before. Currently there are no Worker *** has no heartbeat (400 seconds), restarting messages in the O3 logs, but once i'm back from vacation i'll have another look at the log messages surrounding that one, maybe there's something to be found. One thing we do know now though, is that it's not just a slow query. If it is postgres related, it has to be something so severe that it kills the socket (maybe excessively large placeholder values breaking the wire protocol?).

#25 Updated by kraih 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to kraih

#26 Updated by openqa_review 2 months ago

  • Due date set to 2022-05-04

Setting due date based on mean cycle time of SUSE QE Tools

#27 Updated by kraih 2 months ago

I'm not actually seeing the error in the logs anymore, perhaps it has been resolved unintentionally by another change.

#28 Updated by kraih 2 months ago

  • Status changed from In Progress to Feedback

#29 Updated by kraih 2 months ago

If this issue pops up again we should maybe try logging all database errors with one of the DBI hook APIs. (Not doing that now since in some cases we intentionally create exceptions and handle them with try/catch, so it could lead to a lot of log output...)

#30 Updated by kraih about 2 months ago

  • Status changed from Feedback to Resolved

Currently there are still no cases, so i'll consider this resolved for now. In case someone stumbles over this in the future again: if you see a single case it may very well be harmless (database vacuum could block an already slow query long enough to trigger it for example), but if there's multiple cases over a short time period it's definitely worth investigating.

#31 Updated by tinita about 2 months ago

Just noting that we won't really notice it if the message appears because we are ignoring it.

#32 Updated by okurz about 2 months ago

  • Due date changed from 2022-05-04 to 2022-05-13
  • Status changed from Resolved to Feedback

kraih I removed the ignoring from openqa-logwarn with https://github.com/os-autoinst/openqa-logwarn/pull/34 and immediately on the next day I now saw the message by email again so reverting the removal of ignoring again with https://github.com/os-autoinst/openqa-logwarn/pull/35 . It looks like our logs on o3 do not go back that long but xzgrep 'heartbeat' /var/log/openqa* shows the error at least two times from the past two days:

/var/log/openqa.1.xz:[2022-05-03T01:48:15.126683Z] [error] Worker 324 has no heartbeat (400 seconds), restarting
/var/log/openqa.10.xz:[2022-04-30T10:12:07.993477Z] [error] Worker 16174 has no heartbeat (400 seconds), restarting

#33 Updated by kraih about 2 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (kraih)

Bummer, i'll put the ticket back into the queue then, since all ideas i have to track it down further would be worse than the symptom.

#34 Updated by cdywan about 2 months ago

Let's talk about it in tomorrow's mob session

#35 Updated by okurz about 2 months ago

We looked into this together. On o3 with xzgrep heartbeat /var/log/openqa* we found

/var/log/openqa.10.xz:[2022-05-03T01:48:15.126683Z] [error] Worker 324 has no heartbeat (400 seconds), restarting 
/var/log/openqa.19.xz:[2022-04-30T10:12:07.993477Z] [error] Worker 16174 has no heartbeat (400 seconds), restarting

Comparing with apache access logs excluding some services that don't use the main openQA webUI server we found:

xzgrep '[0-9]\{6,\}$' /space/logs/apache2/openqa.access_log-20220504.xz | grep -v '\/v1\/ws' | grep -v '\/assets\/' | grep -v '\/liveviewhandler'                                                                                                                192.168.47.102 - - [03/May/2022:01:41:34 +0000] "GET /tests/overview?version=Tumbleweed&amp;groupid=24 HTTP/1.1" 502 1076 "-" "Mozilla/5.0 (iPad; CPU OS 9_3_4 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13G35 Safari/601.1" 300104

the last number is the time the request took in ms, configured by tinita in apache config.

The funny part is that "groupid=24" is openQA-in-openQA tests but likely the quoting with &amp; is done wrongly here. So likely the real URL that is called is /tests/overview?version=Tumbleweed. If you try to kill the server then one could just call /tests/overview without any filter parameters.

#36 Updated by mkittler about 2 months ago

It is also the investigation route:

192.168.112.100 - - [03/May/2022:22:23:54 +0000] "GET /tests/2326528/investigation_ajax HTTP/1.1" 502 1076 "-" "python-requests/2.24.0" 300108

So the investigation should likely be a Minion job and the web UI would poll for the results. As a further optimization we could also cache the results of the investigation (e.g. store it as hidden asset so we can make use of the asset cleanup).

#37 Updated by kraih about 2 months ago

We have replicated the issue on O3 with the path /tests/overview?version=Tumbleweed&amp;groupid=24, which caused a new log entry:

[2022-05-05T12:29:55.485177Z] [error] Worker 22758 has no heartbeat (400 seconds), restarting
[2022-05-05T12:29:55.485366Z] [info] Stopping worker 22758 gracefully (800 seconds)
[2022-05-05T12:29:55.541685Z] [info] Worker 32438 started

#38 Updated by mkittler about 2 months ago

For the test results overview page: I'd just put a limit on the database query (configurable). Paging is not useful and would also be very hard to implement. We can just add a hint "Limited to X results. Please use more specific parameters.".

#39 Updated by cdywan about 2 months ago

  • Copied to action #110677: Investigation page shouldn't involve blocking long-running API routes size:M added

#40 Updated by cdywan about 2 months ago

  • Copied to action #110680: Overview page shouldn't allow long-running requests without limits size:M added

#41 Updated by okurz about 2 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Category set to Concrete Bugs

#42 Updated by okurz about 2 months ago

  • Tags deleted (reactive work)
  • Project changed from openQA Project to openQA Infrastructure
  • Category deleted (Concrete Bugs)
  • Parent task set to #92854

#43 Updated by okurz about 2 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Category set to Concrete Bugs

#44 Updated by cdywan about 2 months ago

  • Tags set to reactive work
  • Project changed from openQA Project to openQA Infrastructure
  • Description updated (diff)
  • Category deleted (Concrete Bugs)
  • Status changed from Workable to Resolved
  • Parent task deleted (#92854)

Since we tracked down the cause together I'm resolving this ticket, see the related tickets for follow-ups (note the updated AC to reflect that this ticket isn't the solution).

#45 Updated by okurz about 2 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Due date deleted (2022-05-13)
  • Category set to Concrete Bugs
  • Status changed from Resolved to Feedback
  • Assignee set to cdywan
  • Parent task set to #92854

not resolved as long as we ignore the error message in openqa-logwarn.

#46 Updated by cdywan about 2 months ago

okurz wrote:

not resolved as long as we ignore the error message in openqa-logwarn.

https://github.com/os-autoinst/openqa-logwarn/pull/36

#47 Updated by cdywan about 2 months ago

  • Status changed from Feedback to Blocked

cdywan wrote:

okurz wrote:

not resolved as long as we ignore the error message in openqa-logwarn.

https://github.com/os-autoinst/openqa-logwarn/pull/36

As suggested I'm blocking this ticket on #110680 and #110677.

#48 Updated by cdywan about 2 months ago

  • Copied to deleted (action #110677: Investigation page shouldn't involve blocking long-running API routes size:M)

#49 Updated by cdywan about 2 months ago

  • Copied to deleted (action #110680: Overview page shouldn't allow long-running requests without limits size:M)

#50 Updated by cdywan about 2 months ago

  • Blocked by action #110677: Investigation page shouldn't involve blocking long-running API routes size:M added

#51 Updated by cdywan about 2 months ago

  • Blocks action #110680: Overview page shouldn't allow long-running requests without limits size:M added

#52 Updated by cdywan about 2 months ago

  • Blocks deleted (action #110680: Overview page shouldn't allow long-running requests without limits size:M)

#53 Updated by cdywan about 2 months ago

  • Blocked by action #110680: Overview page shouldn't allow long-running requests without limits size:M added

#54 Updated by kraih 27 days ago

The blocker has been resolved.

#55 Updated by cdywan 3 days ago

  • Status changed from Blocked to Feedback

kraih wrote:

The blocker has been resolved.

The underlying causes have been addressed, hence we shouldn't ignore this anymore: https://github.com/os-autoinst/openqa-logwarn/pull/40

Also available in: Atom PDF