Project

General

Profile

Actions

action #138536

closed

Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:S

Added by livdywan 7 months ago. Updated 3 days ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-10-25
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

OpenQA logreport says

[2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more)

From /var/log/openqa

# grep 17519 /var/log/openqa
[2023-10-25T02:16:31.175195Z] [debug] [pid:2595] Updating seen of worker 1060 from worker_status (free)
[2023-10-25T10:19:00.517519Z] [debug] [pid:31333] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T11:59:29.405375Z] [info] Worker 17519 started
[2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:14:44.746126Z] [info] Stopping worker 17519 gracefully (800 seconds)
[2023-10-25T12:16:27.274372Z] [info] Worker 17519 stopped
# grep 19825 /var/log/openqa
[2023-10-25T09:04:16.519825Z] [debug] [a7YZwJtcRdbL] looking for "autoinst-log.txt" in [
[2023-10-25T11:15:18.028018Z] [info] Worker 19825 started
[2023-10-25T11:17:54.719018Z] [debug] [pid:19825] _carry_over_candidate(3674609): _failure_reason=firefox_audio:softfailed
[2023-10-25T11:17:54.750455Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T11:17:54.750867Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T11:17:54.801831Z] [debug] [pid:19825] opensuse.openqa.job.done published
[2023-10-25T11:21:33.090407Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.comment.create
[2023-10-25T11:21:33.090755Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T11:21:33.127004Z] [debug] [pid:19825] opensuse.openqa.comment.create published
  pid  => 19825,
  pid  => 19825,
  pid  => 19825,
[2023-10-25T12:00:06.480220Z] [debug] [pid:19825] _carry_over_candidate(3674853): _failure_reason=GOOD
[2023-10-25T12:00:06.631183Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T12:00:06.631658Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T12:00:07.055590Z] [debug] [pid:19825] opensuse.openqa.job.done published
[2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:15:26.832215Z] [info] Stopping worker 19825 gracefully (800 seconds)
[2023-10-25T12:21:03.316812Z] [info] Worker 19825 stopped
[2023-10-25T12:41:34.619825Z] [debug] [pid:3076] _carry_over_candidate(3674823): _failure_reason=GOOD

Acceptance criteria

  • AC1: DONE It is understood why we saw the no heartbeat message here
    • The message was logged because the route to set a job to "done" was slow (see the log messages).
  • AC2: The heartbeat message is no longer treated as an error (as it is usually not that severe).

Suggestions

Out of scope

  • The _carry_over_candidate messages are unrelated - grep just matches the timestamp here

Related issues 1 (0 open1 closed)

Related to openQA Project - action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:MResolvedlivdywan2022-02-03

Actions
Actions #1

Updated by livdywan 7 months ago

  • Description updated (diff)
Actions #2

Updated by livdywan 7 months ago

  • Description updated (diff)
Actions #3

Updated by livdywan 7 months ago

  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by livdywan 7 months ago

  • Subject changed from Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 to Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:M
Actions #5

Updated by livdywan 7 months ago

Alerting at 12.10+1

[2023-10-30T10:19:51.055850Z] [error] Worker 29863 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:30:47.204800Z] [error] Worker 1975 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:41:48.420849Z] [error] Worker 16109 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:49:43.452901Z] [error] Worker 13125 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:52:49.800412Z] [error] Worker 13008 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T11:00:40.792969Z] [error] Worker 17409 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T11:03:51.947210Z] [error] Worker 29747 has no heartbeat (900 seconds), restarting (see FAQ for more)
Actions #6

Updated by livdywan 7 months ago

Haven't seen it for a while so I assume this can wait til after hackweek.

Actions #7

Updated by livdywan 6 months ago

  • Status changed from Workable to Rejected

I'm going to assume this is not happening anymore, hence resolving.

Actions #8

Updated by livdywan 11 days ago

  • Subject changed from Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:M to Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3
  • Status changed from Rejected to New
[2024-05-08T09:46:20.374639Z] [error] Worker 1513 has no heartbeat (900 seconds), restarting (see FAQ for more)

I'm re-opening this since we didn't fulfill the AC, and maybe we should re-estimate it and ponder how to make this more useful.

Actions #9

Updated by okurz 5 days ago

The message comes from https://github.com/mojolicious/mojo/blob/431cc0a9d7c55b04d7283340cc22a36a09d803e6/lib/Mojo/Server/Prefork.pm#L113
Probably we should just submit a patch to the upstream project to reduce from error to warning or info

Actions #10

Updated by livdywan 5 days ago

  • Subject changed from Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 to Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #11

Updated by mkittler 4 days ago

  • Assignee set to mkittler
Actions #12

Updated by mkittler 4 days ago ยท Edited

  • Status changed from Workable to Feedback

I asked @kraih whether https://github.com/mojolicious/mojo/compare/main...Martchus:mojo:no-heartbeat-warning?expand=1 has a chance of being merged. (If not we can just ignore this in logwarn.)


I'm re-opening this since we didn't fulfill the AC, and maybe we should re-estimate it and ponder how to make this more useful.

Note that there was maybe some context in the logs around that time which would have contained useful information. (Just like it is the case in the initial ticket description.)

Actions #13

Updated by mkittler 4 days ago

The change in Mojolicious would require a community vote so let's just ignore the log message from our side: https://github.com/os-autoinst/openqa-logwarn/pull/49

Actions #14

Updated by okurz 3 days ago

I will try a submission to mojo then

Actions #16

Updated by okurz 3 days ago

  • Related to action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
Actions #17

Updated by okurz 3 days ago

  • Category set to Regressions/Crashes
  • Status changed from Feedback to Resolved

https://github.com/os-autoinst/openqa-logwarn/pull/49 is merged and should be effective already. I will follow up with my PR but it's not strictly necessary for this ticket so we can resolve

Actions

Also available in: Atom PDF