action #106759 (closed)

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #92854: [epic] limit overload of openQA webUI by heavy requests

Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M

Added by livdywan about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2022-02-03
Due date:
% Done: 0%
Estimated time:

Description

Observation

# /var/log/openqa
[2022-02-12T04:12:47.195902Z] [error] Worker 22011 has no heartbeat (400 seconds), restarting
[2022-02-12T04:12:56.228360Z] [error] Worker 28596 has no heartbeat (400 seconds), restarting

Acceptance criteria

AC1: The cause of the heartbeat message is known

Suggestions

  • Add the message to the blocklist
  • Look at the Mojolicious APIs related to preforking of Mojo workers (the problem is a blocked worker process, which the Mojolicious API can't help with)
  • Extend the configured timeout beyond 400 seconds (the timeout is already pretty high); a sketch follows this list
  • Confirm where the errors are logged and add context (Mojolicious logs the error from a different process, so there is no context information to add)
  • A 500 error in the access_log from the live view handler (unrelated: the timestamp doesn't match and the live view handler doesn't use preforking): [15/Feb/2022:07:16:02 +0000] "GET /liveviewhandler/tests/2189494/developer/ws-proxy HTTP/1.1" 500
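
If extending the timeout were pursued anyway, the relevant knob is Mojo::Server::Prefork's heartbeat_timeout attribute (see the FAQ excerpt under "Additional info"). A minimal standalone sketch of raising it to the 400 seconds seen in the logs; openQA starts its web UI through its own service scripts, so the application path and port below are assumptions for illustration only:

#!/usr/bin/env perl
# standalone_prefork.pl - illustrative sketch, not how openQA is started in production
use Mojo::Server::Prefork;

my $prefork = Mojo::Server::Prefork->new(listen => ['http://*:9526']);

# Give workers 400 seconds between heartbeats before a forced restart
# (the Mojolicious default is 50 seconds)
$prefork->heartbeat_timeout(400);

# Application script path is assumed here
$prefork->load_app('script/openqa');
$prefork->run;

Raising the value only hides the symptom, though: a worker that blocks for minutes still serves nothing else in the meantime, which is what the "Blocked by" tickets below address.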

Additional info

from /usr/lib/perl5/vendor_perl/5.34.0/Mojolicious/Guides/FAQ.pod:

=head2 What does "Worker 31842 has no heartbeat (50 seconds), restarting" mean?

As long as they are accepting new connections, worker processes of all built-in pre-forking web servers send heartbeat
messages to the manager process at regular intervals, to signal that they are still responsive. A blocking operation
such as an infinite loop in your application can prevent this, and will force the affected worker to be restarted after
a timeout. This timeout defaults to C<50> seconds and can be extended with the attribute
L<Mojo::Server::Prefork/"heartbeat_timeout"> if your application requires it.
lib/Mojo/Server/Prefork.pm

    # No heartbeat (graceful stop)
    $log->error("Worker $pid has no heartbeat ($ht seconds), restarting") and $w->{graceful} = $time
      if !$w->{graceful} && ($w->{time} + $interval + $ht <= $time);

    # Graceful stop with timeout
    my $graceful = $w->{graceful} ||= $self->{graceful} ? $time : undef;
    $log->info("Stopping worker $pid gracefully ($gt seconds)") and (kill 'QUIT', $pid or $self->_stopped($pid))
      if $graceful && !$w->{quit}++;
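
To make the excerpt above concrete: the manager only receives heartbeats while a worker can run its event loop, so any long synchronous operation inside a route handler starves the heartbeat until the condition above ($w->{time} + $interval + $ht <= $time) triggers a restart. A minimal sketch that reproduces the log message under hypnotoad; the route, the sleep and the config values are stand-ins, not openQA code:

# repro.pl - run with "hypnotoad repro.pl", then request /block
use Mojolicious::Lite -signatures;

# hypnotoad reads its settings from the application config;
# 400 matches the timeout seen in the o3 logs above
app->config(hypnotoad => {heartbeat_timeout => 400});

get '/block' => sub ($c) {
  sleep 500;    # any blocking work longer than heartbeat_timeout has the same effect
  $c->render(text => 'done');
};

app->start;

After roughly heartbeat_timeout seconds the manager process logs "Worker <pid> has no heartbeat (400 seconds), restarting", matching the messages seen in /var/log/openqa.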

Related issues: 4 (0 open, 4 closed)

Related to openQA Project - action #128345: [logwarn] Worker 30538 has no heartbeat (400 seconds), restarting size:M (Resolved, kraih, 2023-04-27 - 2023-05-20)

Blocked by openQA Project - action #110677: Investigation page shouldn't involve blocking long-running API routes size:M (Resolved, tinita, 2022-02-03)

Blocked by openQA Project - action #110680: Overview page shouldn't allow long-running requests without limits size:M (Resolved, kraih, 2022-02-03)

Copied from openQA Infrastructure - action #105828: 4-7 logreport emails a day cause alert fatigue size:M (Resolved, tinita, 2022-02-03 - 2022-02-17)
