action #138536

Updated by livdywan about 1 year ago

## Observation 

 The openQA logreport says:

 ``` 
 [2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more) 
 [2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more) 
 ``` 

 From /var/log/openqa 

 ``` 
 # grep 17519 /var/log/openqa 
 [2023-10-25T02:16:31.175195Z] [debug] [pid:2595] Updating seen of worker 1060 from worker_status (free) 
 [2023-10-25T10:19:00.517519Z] [debug] [pid:31333] Sending AMQP event: opensuse.openqa.job.done 
 [2023-10-25T11:59:29.405375Z] [info] Worker 17519 started 
 [2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more) 
 [2023-10-25T12:14:44.746126Z] [info] Stopping worker 17519 gracefully (800 seconds) 
 [2023-10-25T12:16:27.274372Z] [info] Worker 17519 stopped 
 # grep 19825 /var/log/openqa 
 [2023-10-25T09:04:16.519825Z] [debug] [a7YZwJtcRdbL] looking for "autoinst-log.txt" in [ 
 [2023-10-25T11:15:18.028018Z] [info] Worker 19825 started 
 [2023-10-25T11:17:54.719018Z] [debug] [pid:19825] _carry_over_candidate(3674609): _failure_reason=firefox_audio:softfailed 
 [2023-10-25T11:17:54.750455Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done 
 [2023-10-25T11:17:54.750867Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub 
 [2023-10-25T11:17:54.801831Z] [debug] [pid:19825] opensuse.openqa.job.done published 
 [2023-10-25T11:21:33.090407Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.comment.create 
 [2023-10-25T11:21:33.090755Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub 
 [2023-10-25T11:21:33.127004Z] [debug] [pid:19825] opensuse.openqa.comment.create published 
   pid    => 19825, 
   pid    => 19825, 
   pid    => 19825, 
 [2023-10-25T12:00:06.480220Z] [debug] [pid:19825] _carry_over_candidate(3674853): _failure_reason=GOOD 
 [2023-10-25T12:00:06.631183Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done 
 [2023-10-25T12:00:06.631658Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub 
 [2023-10-25T12:00:07.055590Z] [debug] [pid:19825] opensuse.openqa.job.done published 
 [2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more) 
 [2023-10-25T12:15:26.832215Z] [info] Stopping worker 19825 gracefully (800 seconds) 
 [2023-10-25T12:21:03.316812Z] [info] Worker 19825 stopped 
 [2023-10-25T12:41:34.619825Z] [debug] [pid:3076] _carry_over_candidate(3674823): _failure_reason=GOOD 
 ``` 
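 The plain `grep 17519` above also matches lines where the digits only appear in the microsecond part of a timestamp. A minimal sketch of a stricter filter, assuming the relevant lines always contain either `Worker <pid>`/`worker <pid>` or `[pid:<pid>]` as in the excerpt (the helper name `lines_for_worker` is hypothetical, and multi-line dumps such as the `pid    => 19825,` lines would be dropped by it):

```python
import re

def lines_for_worker(lines, pid):
    """Keep lines that name the worker process explicitly.

    Lines that merely contain the digits elsewhere (for example in a
    timestamp) are dropped.
    """
    pattern = re.compile(rf"\b[Ww]orker {pid}\b|\[pid:{pid}\]")
    return [line for line in lines if pattern.search(line)]

# Three lines copied from the grep output in this ticket.
sample = [
    "[2023-10-25T10:19:00.517519Z] [debug] [pid:31333] Sending AMQP event: opensuse.openqa.job.done",
    "[2023-10-25T11:59:29.405375Z] [info] Worker 17519 started",
    "[2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more)",
]

for line in lines_for_worker(sample, 17519):
    print(line)
```

 With this filter the AMQP line from pid 31333 is no longer reported for worker 17519, because its only match was inside the timestamp.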

 ## Acceptance criteria 
 * **AC1:** It is understood why the "no heartbeat" message was logged for workers 17519 and 19825

  

 ## Suggestions 
 * https://docs.mojolicious.org/Mojolicious/Guides/FAQ#What-does-Worker-31842-has-no-heartbeat-50-seconds-restarting-mean 
 * Maybe also related to #138535 
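
 According to the Mojolicious FAQ linked above, the manager restarts a worker that has not sent a heartbeat within the heartbeat timeout, typically because the worker is stuck in a blocking operation. As a rough check against the log excerpt in the Observation, the following sketch computes how long each worker was silent before the error. It assumes that the last line shown by grep is also the last activity of that worker, which the excerpt alone cannot prove; the helper names are hypothetical.

```python
from datetime import datetime

def parse_ts(ts):
    """Parse a timestamp in the format used by the log excerpt."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

def silent_seconds(last_activity, heartbeat_error):
    """Seconds between the last logged activity and the heartbeat error."""
    return (parse_ts(heartbeat_error) - parse_ts(last_activity)).total_seconds()

# Worker 17519: "started" line vs. "no heartbeat" line.
gap_17519 = silent_seconds("2023-10-25T11:59:29.405375Z",
                           "2023-10-25T12:14:44.745934Z")

# Worker 19825: last "job.done published" line vs. "no heartbeat" line.
gap_19825 = silent_seconds("2023-10-25T12:00:07.055590Z",
                           "2023-10-25T12:15:26.832072Z")

print(f"17519 silent for {gap_17519:.1f} s")
print(f"19825 silent for {gap_19825:.1f} s")
```

 Both gaps come out slightly above the 900 seconds named in the error message, which would fit a worker that blocked shortly after its last logged line. This is an indication, not a confirmed root cause.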

 ## Out of scope 
 * The `_carry_over_candidate` messages are unrelated; grep matches them because of the `[pid:19825]` tag or, as in the last line, because the digits happen to appear in the timestamp
