action #106759

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #92854: [epic] limit overload of openQA webUI by heavy requests

Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M

Added by livdywan almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2022-02-03
Due date:
% Done: 0%
Estimated time:
Description

Observation

# /var/log/openqa
[2022-02-12T04:12:47.195902Z] [error] Worker 22011 has no heartbeat (400 seconds), restarting
[2022-02-12T04:12:56.228360Z] [error] Worker 28596 has no heartbeat (400 seconds), restarting

Acceptance criteria

AC1: The cause of the heartbeat message is known

Suggestions

  • Add the message to the blocklist
  • Look at Mojolicious APIs related to preforking of Mojo workers (the problem is a blocked worker process; the Mojolicious API can't help with that)
  • Extend the configured timeout from 400s (the timeout is already quite high); see the sketch after this list
  • Confirm where the errors are logged and add context (Mojolicious logs the error from a different process, so there is no context information to add)
  • 500 error in access_log from live_view_handler (unrelated: the timestamp doesn't match and the live view handler doesn't use preforking): [15/Feb/2022:07:16:02 +0000] "GET /liveviewhandler/tests/2189494/developer/ws-proxy HTTP/1.1" 500
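
Purely to illustrate the timeout suggestion above (even though extending it is not considered a fix), a minimal sketch of how a Mojolicious application deployed with Hypnotoad could raise the heartbeat timeout. The listen address and the 900 second value are assumptions for the example; openQA's own deployment may configure this differently.

  #!/usr/bin/env perl
  # Illustrative sketch only, not openQA's startup code: Hypnotoad reads its
  # settings from the application's "hypnotoad" config value, so a larger
  # heartbeat timeout can be set there. 900 is a hypothetical value; the log
  # above shows the currently configured 400 seconds.
  use Mojolicious::Lite;

  app->config(hypnotoad => {
    listen            => ['http://*:8080'],  # assumed listen address
    heartbeat_timeout => 900,                # seconds without heartbeat before a worker is restarted
  });

  get '/' => {text => 'ok'};

  app->start;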

Additional info

from /usr/lib/perl5/vendor_perl/5.34.0/Mojolicious/Guides/FAQ.pod:

=head2 What does "Worker 31842 has no heartbeat (50 seconds), restarting" mean?

As long as they are accepting new connections, worker processes of all built-in pre-forking web servers send heartbeat
messages to the manager process at regular intervals, to signal that they are still responsive. A blocking operation
such as an infinite loop in your application can prevent this, and will force the affected worker to be restarted after
a timeout. This timeout defaults to C<50> seconds and can be extended with the attribute
L<Mojo::Server::Prefork/"heartbeat_timeout"> if your application requires it.

from lib/Mojo/Server/Prefork.pm:

    # No heartbeat (graceful stop)
    $log->error("Worker $pid has no heartbeat ($ht seconds), restarting") and $w->{graceful} = $time
      if !$w->{graceful} && ($w->{time} + $interval + $ht <= $time);

    # Graceful stop with timeout
    my $graceful = $w->{graceful} ||= $self->{graceful} ? $time : undef;
    $log->info("Stopping worker $pid gracefully ($gt seconds)") and (kill 'QUIT', $pid or $self->_stopped($pid))
      if $graceful && !$w->{quit}++;
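
For context, a minimal reproduction sketch of the mechanism described in the FAQ, assuming a local Mojolicious installation; the /block route, the port and the deliberately short heartbeat values are made up for this example. A request to /block keeps the worker busy past the heartbeat timeout, so the manager process logs exactly this "no heartbeat" error and restarts the worker.

  #!/usr/bin/env perl
  # Reproduction sketch (assumptions: Mojolicious installed, port 3000 free).
  use Mojolicious::Lite;
  use Mojo::Server::Prefork;

  # Stand-in for a long-running request such as an expensive overview or
  # investigation query; sleep() blocks the worker process completely.
  get '/block' => sub {
    my $c = shift;
    sleep 30;                      # no heartbeat is sent while this blocks
    $c->render(text => 'done');
  };

  # Deliberately short intervals so the restart is observable within seconds;
  # the Mojolicious default timeout is 50 seconds, the log above shows 400.
  my $prefork = Mojo::Server::Prefork->new(
    app                => app,
    listen             => ['http://*:3000'],
    heartbeat_interval => 5,
    heartbeat_timeout  => 10,
  );
  $prefork->run;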

Related issues 5 (0 open, 5 closed)

  • Related to openQA Project (public) - action #128345: [logwarn] Worker 30538 has no heartbeat (400 seconds), restarting size:M (Resolved, kraih, 2023-04-27 to 2023-05-20)
  • Related to openQA Infrastructure (public) - action #138536: Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:S (Resolved, mkittler, 2023-10-25)
  • Blocked by openQA Project (public) - action #110677: Investigation page shouldn't involve blocking long-running API routes size:M (Resolved, tinita, 2022-02-03)
  • Blocked by openQA Project (public) - action #110680: Overview page shouldn't allow long-running requests without limits size:M (Resolved, kraih, 2022-02-03)
  • Copied from openQA Infrastructure (public) - action #105828: 4-7 logreport emails a day cause alert fatigue size:M (Resolved, tinita, 2022-02-03 to 2022-02-17)
