action #138536
closedAlert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:S
0%
Description
Observation¶
OpenQA logreport says
[2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more)
From /var/log/openqa
# grep 17519 /var/log/openqa
[2023-10-25T02:16:31.175195Z] [debug] [pid:2595] Updating seen of worker 1060 from worker_status (free)
[2023-10-25T10:19:00.517519Z] [debug] [pid:31333] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T11:59:29.405375Z] [info] Worker 17519 started
[2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:14:44.746126Z] [info] Stopping worker 17519 gracefully (800 seconds)
[2023-10-25T12:16:27.274372Z] [info] Worker 17519 stopped
# grep 19825 /var/log/openqa
[2023-10-25T09:04:16.519825Z] [debug] [a7YZwJtcRdbL] looking for "autoinst-log.txt" in [
[2023-10-25T11:15:18.028018Z] [info] Worker 19825 started
[2023-10-25T11:17:54.719018Z] [debug] [pid:19825] _carry_over_candidate(3674609): _failure_reason=firefox_audio:softfailed
[2023-10-25T11:17:54.750455Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T11:17:54.750867Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T11:17:54.801831Z] [debug] [pid:19825] opensuse.openqa.job.done published
[2023-10-25T11:21:33.090407Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.comment.create
[2023-10-25T11:21:33.090755Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T11:21:33.127004Z] [debug] [pid:19825] opensuse.openqa.comment.create published
pid => 19825,
pid => 19825,
pid => 19825,
[2023-10-25T12:00:06.480220Z] [debug] [pid:19825] _carry_over_candidate(3674853): _failure_reason=GOOD
[2023-10-25T12:00:06.631183Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T12:00:06.631658Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T12:00:07.055590Z] [debug] [pid:19825] opensuse.openqa.job.done published
[2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:15:26.832215Z] [info] Stopping worker 19825 gracefully (800 seconds)
[2023-10-25T12:21:03.316812Z] [info] Worker 19825 stopped
[2023-10-25T12:41:34.619825Z] [debug] [pid:3076] _carry_over_candidate(3674823): _failure_reason=GOOD
Acceptance criteria¶
- AC1: DONE
It is understood why we saw the no heartbeat message here- The message was logged because the route to set a job to "done" was slow (see the log messages).
- AC2: The heartbeat message is no longer treated as an error (as it is usually not that severe).
Suggestions¶
- https://docs.mojolicious.org/Mojolicious/Guides/FAQ#What-does-Worker-31842-has-no-heartbeat-50-seconds-restarting-mean
- Maybe also related to #138535
- Should this be an error message vs debug in the first place? Maybe it is too low level to make it useful?
- Find out if we can replace the message upstream, or introduce our own handling with more specific errors
- Figure out what worker/pid corresponds to
- The message comes from https://github.com/mojolicious/mojo/blob/431cc0a9d7c55b04d7283340cc22a36a09d803e6/lib/Mojo/Server/Prefork.pm#L113 Probably we should just submit a patch to the upstream project to reduce from error to warning or info
- Could we log performance of requests? Assuming the message could mean "req took too long"
Out of scope¶
- The _carry_over_candidate messages are unrelated - grep just matches the timestamp here
Updated by livdywan about 1 year ago
- Description updated (diff)
- Status changed from New to Workable
Updated by livdywan about 1 year ago
- Subject changed from Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 to Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:M
Updated by livdywan about 1 year ago
Alerting at 12.10+1
[2023-10-30T10:19:51.055850Z] [error] Worker 29863 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:30:47.204800Z] [error] Worker 1975 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:41:48.420849Z] [error] Worker 16109 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:49:43.452901Z] [error] Worker 13125 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:52:49.800412Z] [error] Worker 13008 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T11:00:40.792969Z] [error] Worker 17409 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T11:03:51.947210Z] [error] Worker 29747 has no heartbeat (900 seconds), restarting (see FAQ for more)
Updated by livdywan about 1 year ago
Haven't seen it for a while so I assume this can wait til after hackweek.
Updated by livdywan about 1 year ago
- Status changed from Workable to Rejected
I'm going to assume this is not happening anymore, hence resolving.
Updated by livdywan 8 months ago
- Subject changed from Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:M to Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3
- Status changed from Rejected to New
[2024-05-08T09:46:20.374639Z] [error] Worker 1513 has no heartbeat (900 seconds), restarting (see FAQ for more)
I'm re-opening this since we didn't fulfill the AC, and maybe we should re-estimate it and ponder how to make this more useful.
Updated by okurz 7 months ago
The message comes from https://github.com/mojolicious/mojo/blob/431cc0a9d7c55b04d7283340cc22a36a09d803e6/lib/Mojo/Server/Prefork.pm#L113
Probably we should just submit a patch to the upstream project to reduce from error to warning or info
Updated by mkittler 7 months ago ยท Edited
- Status changed from Workable to Feedback
I asked @kraih whether https://github.com/mojolicious/mojo/compare/main...Martchus:mojo:no-heartbeat-warning?expand=1 has a chance of being merged. (If not we can just ignore this in logwarn.)
I'm re-opening this since we didn't fulfill the AC, and maybe we should re-estimate it and ponder how to make this more useful.
Note that there was maybe some context in the logs around that time which would have contained useful information. (Just like it is the case in the initial ticket description.)
Updated by mkittler 7 months ago
The change in Mojolicious would require a community vote so let's just ignore the log message from our side: https://github.com/os-autoinst/openqa-logwarn/pull/49
Updated by okurz 7 months ago
- Related to action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
Updated by okurz 7 months ago
- Category set to Regressions/Crashes
- Status changed from Feedback to Resolved
https://github.com/os-autoinst/openqa-logwarn/pull/49 is merged and should be effective already. I will follow up with my PR but it's not strictly necessary for this ticket so we can resolve
Updated by livdywan 4 months ago
- Copied to action #165033: Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 - again added