Project

General

Profile

Actions

action #138536

closed

Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:S

Added by livdywan about 1 year ago. Updated 7 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
2023-10-25
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

OpenQA logreport says

[2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more)

From /var/log/openqa

# grep 17519 /var/log/openqa
[2023-10-25T02:16:31.175195Z] [debug] [pid:2595] Updating seen of worker 1060 from worker_status (free)
[2023-10-25T10:19:00.517519Z] [debug] [pid:31333] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T11:59:29.405375Z] [info] Worker 17519 started
[2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:14:44.746126Z] [info] Stopping worker 17519 gracefully (800 seconds)
[2023-10-25T12:16:27.274372Z] [info] Worker 17519 stopped
# grep 19825 /var/log/openqa
[2023-10-25T09:04:16.519825Z] [debug] [a7YZwJtcRdbL] looking for "autoinst-log.txt" in [
[2023-10-25T11:15:18.028018Z] [info] Worker 19825 started
[2023-10-25T11:17:54.719018Z] [debug] [pid:19825] _carry_over_candidate(3674609): _failure_reason=firefox_audio:softfailed
[2023-10-25T11:17:54.750455Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T11:17:54.750867Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T11:17:54.801831Z] [debug] [pid:19825] opensuse.openqa.job.done published
[2023-10-25T11:21:33.090407Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.comment.create
[2023-10-25T11:21:33.090755Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T11:21:33.127004Z] [debug] [pid:19825] opensuse.openqa.comment.create published
  pid  => 19825,
  pid  => 19825,
  pid  => 19825,
[2023-10-25T12:00:06.480220Z] [debug] [pid:19825] _carry_over_candidate(3674853): _failure_reason=GOOD
[2023-10-25T12:00:06.631183Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T12:00:06.631658Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T12:00:07.055590Z] [debug] [pid:19825] opensuse.openqa.job.done published
[2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:15:26.832215Z] [info] Stopping worker 19825 gracefully (800 seconds)
[2023-10-25T12:21:03.316812Z] [info] Worker 19825 stopped
[2023-10-25T12:41:34.619825Z] [debug] [pid:3076] _carry_over_candidate(3674823): _failure_reason=GOOD

Acceptance criteria

  • AC1: DONE It is understood why we saw the no heartbeat message here
    • The message was logged because the route to set a job to "done" was slow (see the log messages).
  • AC2: The heartbeat message is no longer treated as an error (as it is usually not that severe).

Suggestions

Out of scope

  • The _carry_over_candidate messages are unrelated - grep just matches the timestamp here

Related issues 2 (0 open2 closed)

Related to openQA Project (public) - action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:MResolvedlivdywan2022-02-03

Actions
Copied to openQA Infrastructure (public) - action #165033: Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 - againResolvedtinita2023-10-25

Actions
Actions

Also available in: Atom PDF