Project

General

Profile

Actions

action #138536

closed

Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:S

Added by livdywan about 1 year ago. Updated 7 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
2023-10-25
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

OpenQA logreport says

[2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more)

From /var/log/openqa

# grep 17519 /var/log/openqa
[2023-10-25T02:16:31.175195Z] [debug] [pid:2595] Updating seen of worker 1060 from worker_status (free)
[2023-10-25T10:19:00.517519Z] [debug] [pid:31333] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T11:59:29.405375Z] [info] Worker 17519 started
[2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:14:44.746126Z] [info] Stopping worker 17519 gracefully (800 seconds)
[2023-10-25T12:16:27.274372Z] [info] Worker 17519 stopped
# grep 19825 /var/log/openqa
[2023-10-25T09:04:16.519825Z] [debug] [a7YZwJtcRdbL] looking for "autoinst-log.txt" in [
[2023-10-25T11:15:18.028018Z] [info] Worker 19825 started
[2023-10-25T11:17:54.719018Z] [debug] [pid:19825] _carry_over_candidate(3674609): _failure_reason=firefox_audio:softfailed
[2023-10-25T11:17:54.750455Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T11:17:54.750867Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T11:17:54.801831Z] [debug] [pid:19825] opensuse.openqa.job.done published
[2023-10-25T11:21:33.090407Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.comment.create
[2023-10-25T11:21:33.090755Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T11:21:33.127004Z] [debug] [pid:19825] opensuse.openqa.comment.create published
  pid  => 19825,
  pid  => 19825,
  pid  => 19825,
[2023-10-25T12:00:06.480220Z] [debug] [pid:19825] _carry_over_candidate(3674853): _failure_reason=GOOD
[2023-10-25T12:00:06.631183Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T12:00:06.631658Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T12:00:07.055590Z] [debug] [pid:19825] opensuse.openqa.job.done published
[2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:15:26.832215Z] [info] Stopping worker 19825 gracefully (800 seconds)
[2023-10-25T12:21:03.316812Z] [info] Worker 19825 stopped
[2023-10-25T12:41:34.619825Z] [debug] [pid:3076] _carry_over_candidate(3674823): _failure_reason=GOOD

Acceptance criteria

  • AC1: DONE It is understood why we saw the no heartbeat message here
    • The message was logged because the route to set a job to "done" was slow (see the log messages).
  • AC2: The heartbeat message is no longer treated as an error (as it is usually not that severe).

Suggestions

Out of scope

  • The _carry_over_candidate messages are unrelated - grep just matches the timestamp here

Related issues 2 (0 open2 closed)

Related to openQA Project (public) - action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:MResolvedlivdywan2022-02-03

Actions
Copied to openQA Infrastructure (public) - action #165033: Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 - againResolvedtinita2023-10-25

Actions
Actions #1

Updated by livdywan about 1 year ago

  • Description updated (diff)
Actions #2

Updated by livdywan about 1 year ago

  • Description updated (diff)
Actions #3

Updated by livdywan about 1 year ago

  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by livdywan about 1 year ago

  • Subject changed from Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 to Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:M
Actions #5

Updated by livdywan about 1 year ago

Alerting at 12.10+1

[2023-10-30T10:19:51.055850Z] [error] Worker 29863 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:30:47.204800Z] [error] Worker 1975 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:41:48.420849Z] [error] Worker 16109 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:49:43.452901Z] [error] Worker 13125 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:52:49.800412Z] [error] Worker 13008 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T11:00:40.792969Z] [error] Worker 17409 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T11:03:51.947210Z] [error] Worker 29747 has no heartbeat (900 seconds), restarting (see FAQ for more)
Actions #6

Updated by livdywan about 1 year ago

Haven't seen it for a while so I assume this can wait til after hackweek.

Actions #7

Updated by livdywan about 1 year ago

  • Status changed from Workable to Rejected

I'm going to assume this is not happening anymore, hence resolving.

Actions #8

Updated by livdywan 8 months ago

  • Subject changed from Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:M to Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3
  • Status changed from Rejected to New
[2024-05-08T09:46:20.374639Z] [error] Worker 1513 has no heartbeat (900 seconds), restarting (see FAQ for more)

I'm re-opening this since we didn't fulfill the AC, and maybe we should re-estimate it and ponder how to make this more useful.

Actions #9

Updated by okurz 7 months ago

The message comes from https://github.com/mojolicious/mojo/blob/431cc0a9d7c55b04d7283340cc22a36a09d803e6/lib/Mojo/Server/Prefork.pm#L113
Probably we should just submit a patch to the upstream project to reduce from error to warning or info

Actions #10

Updated by livdywan 7 months ago

  • Subject changed from Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 to Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #11

Updated by mkittler 7 months ago

  • Assignee set to mkittler
Actions #12

Updated by mkittler 7 months ago ยท Edited

  • Status changed from Workable to Feedback

I asked @kraih whether https://github.com/mojolicious/mojo/compare/main...Martchus:mojo:no-heartbeat-warning?expand=1 has a chance of being merged. (If not we can just ignore this in logwarn.)


I'm re-opening this since we didn't fulfill the AC, and maybe we should re-estimate it and ponder how to make this more useful.

Note that there was maybe some context in the logs around that time which would have contained useful information. (Just like it is the case in the initial ticket description.)

Actions #13

Updated by mkittler 7 months ago

The change in Mojolicious would require a community vote so let's just ignore the log message from our side: https://github.com/os-autoinst/openqa-logwarn/pull/49

Actions #14

Updated by okurz 7 months ago

I will try a submission to mojo then

Actions #16

Updated by okurz 7 months ago

  • Related to action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
Actions #17

Updated by okurz 7 months ago

  • Category set to Regressions/Crashes
  • Status changed from Feedback to Resolved

https://github.com/os-autoinst/openqa-logwarn/pull/49 is merged and should be effective already. I will follow up with my PR but it's not strictly necessary for this ticket so we can resolve

Actions #18

Updated by livdywan 4 months ago

  • Copied to action #165033: Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 - again added
Actions

Also available in: Atom PDF