Project

General

Profile

Actions

action #138536

open

Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3

Added by livdywan 7 months ago. Updated 27 minutes ago.

Status:
New
Priority:
High
Assignee:
-
Category:
-
Target version:
Start date:
2023-10-25
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

OpenQA logreport says

[2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more)

From /var/log/openqa

# grep 17519 /var/log/openqa
[2023-10-25T02:16:31.175195Z] [debug] [pid:2595] Updating seen of worker 1060 from worker_status (free)
[2023-10-25T10:19:00.517519Z] [debug] [pid:31333] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T11:59:29.405375Z] [info] Worker 17519 started
[2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:14:44.746126Z] [info] Stopping worker 17519 gracefully (800 seconds)
[2023-10-25T12:16:27.274372Z] [info] Worker 17519 stopped
# grep 19825 /var/log/openqa
[2023-10-25T09:04:16.519825Z] [debug] [a7YZwJtcRdbL] looking for "autoinst-log.txt" in [
[2023-10-25T11:15:18.028018Z] [info] Worker 19825 started
[2023-10-25T11:17:54.719018Z] [debug] [pid:19825] _carry_over_candidate(3674609): _failure_reason=firefox_audio:softfailed
[2023-10-25T11:17:54.750455Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T11:17:54.750867Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T11:17:54.801831Z] [debug] [pid:19825] opensuse.openqa.job.done published
[2023-10-25T11:21:33.090407Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.comment.create
[2023-10-25T11:21:33.090755Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T11:21:33.127004Z] [debug] [pid:19825] opensuse.openqa.comment.create published
  pid  => 19825,
  pid  => 19825,
  pid  => 19825,
[2023-10-25T12:00:06.480220Z] [debug] [pid:19825] _carry_over_candidate(3674853): _failure_reason=GOOD
[2023-10-25T12:00:06.631183Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done
[2023-10-25T12:00:06.631658Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-10-25T12:00:07.055590Z] [debug] [pid:19825] opensuse.openqa.job.done published
[2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-25T12:15:26.832215Z] [info] Stopping worker 19825 gracefully (800 seconds)
[2023-10-25T12:21:03.316812Z] [info] Worker 19825 stopped
[2023-10-25T12:41:34.619825Z] [debug] [pid:3076] _carry_over_candidate(3674823): _failure_reason=GOOD

Acceptance criteria

  • AC1: It is understood why we saw the no heartbeat message here

Suggestions

Out of scope

  • The _carry_over_candidate messages are unrelated - grep just matches the timestamp here
Actions #1

Updated by livdywan 7 months ago

  • Description updated (diff)
Actions #2

Updated by livdywan 7 months ago

  • Description updated (diff)
Actions #3

Updated by livdywan 7 months ago

  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by livdywan 7 months ago

  • Subject changed from Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 to Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:M
Actions #5

Updated by livdywan 6 months ago

Alerting at 12.10+1

[2023-10-30T10:19:51.055850Z] [error] Worker 29863 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:30:47.204800Z] [error] Worker 1975 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:41:48.420849Z] [error] Worker 16109 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:49:43.452901Z] [error] Worker 13125 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T10:52:49.800412Z] [error] Worker 13008 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T11:00:40.792969Z] [error] Worker 17409 has no heartbeat (900 seconds), restarting (see FAQ for more)
[2023-10-30T11:03:51.947210Z] [error] Worker 29747 has no heartbeat (900 seconds), restarting (see FAQ for more)
Actions #6

Updated by livdywan 6 months ago

Haven't seen it for a while so I assume this can wait til after hackweek.

Actions #7

Updated by livdywan 6 months ago

  • Status changed from Workable to Rejected

I'm going to assume this is not happening anymore, hence resolving.

Actions #8

Updated by livdywan 27 minutes ago

  • Subject changed from Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3 size:M to Alert Worker .* has no heartbeat (900 seconds), restarting (see FAQ for more) on o3
  • Status changed from Rejected to New
[2024-05-08T09:46:20.374639Z] [error] Worker 1513 has no heartbeat (900 seconds), restarting (see FAQ for more)

I'm re-opening this since we didn't fulfill the AC, and maybe we should re-estimate it and ponder how to make this more useful.

Actions

Also available in: Atom PDF