Project

General

Profile

action #138536

Updated by livdywan 7 months ago

## Observation 

 OpenQA logreport says 

 ``` 
 [2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more) 
 [2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more) 
 ``` 

 From /var/log/openqa 

 ``` 
 # grep 17519 /var/log/openqa 
 [2023-10-25T02:16:31.175195Z] [debug] [pid:2595] Updating seen of worker 1060 from worker_status (free) 
 [2023-10-25T10:19:00.517519Z] [debug] [pid:31333] Sending AMQP event: opensuse.openqa.job.done 
 [2023-10-25T11:59:29.405375Z] [info] Worker 17519 started 
 [2023-10-25T12:14:44.745934Z] [error] Worker 17519 has no heartbeat (900 seconds), restarting (see FAQ for more) 
 [2023-10-25T12:14:44.746126Z] [info] Stopping worker 17519 gracefully (800 seconds) 
 [2023-10-25T12:16:27.274372Z] [info] Worker 17519 stopped 
 # grep 19825 /var/log/openqa 
 [2023-10-25T09:04:16.519825Z] [debug] [a7YZwJtcRdbL] looking for "autoinst-log.txt" in [ 
 [2023-10-25T11:15:18.028018Z] [info] Worker 19825 started 
 [2023-10-25T11:17:54.719018Z] [debug] [pid:19825] _carry_over_candidate(3674609): _failure_reason=firefox_audio:softfailed 
 [2023-10-25T11:17:54.750455Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done 
 [2023-10-25T11:17:54.750867Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub 
 [2023-10-25T11:17:54.801831Z] [debug] [pid:19825] opensuse.openqa.job.done published 
 [2023-10-25T11:21:33.090407Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.comment.create 
 [2023-10-25T11:21:33.090755Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub 
 [2023-10-25T11:21:33.127004Z] [debug] [pid:19825] opensuse.openqa.comment.create published 
   pid    => 19825, 
   pid    => 19825, 
   pid    => 19825, 
 [2023-10-25T12:00:06.480220Z] [debug] [pid:19825] _carry_over_candidate(3674853): _failure_reason=GOOD 
 [2023-10-25T12:00:06.631183Z] [debug] [pid:19825] Sending AMQP event: opensuse.openqa.job.done 
 [2023-10-25T12:00:06.631658Z] [debug] [pid:19825] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub 
 [2023-10-25T12:00:07.055590Z] [debug] [pid:19825] opensuse.openqa.job.done published 
 [2023-10-25T12:15:26.832072Z] [error] Worker 19825 has no heartbeat (900 seconds), restarting (see FAQ for more) 
 [2023-10-25T12:15:26.832215Z] [info] Stopping worker 19825 gracefully (800 seconds) 
 [2023-10-25T12:21:03.316812Z] [info] Worker 19825 stopped 
 [2023-10-25T12:41:34.619825Z] [debug] [pid:3076] _carry_over_candidate(3674823): _failure_reason=GOOD 
 ``` 

 ## Acceptance criteria 
 * **AC1:** DONE ~~It It is understood why we saw the no heartbeat message here~~ 
     * The message was logged because the route to set a job to "done" was slow (see the log messages). 
 * **AC2:** The heartbeat message is no longer treated as an error (as it is usually not that severe). here 

 ## Suggestions 
 * https://docs.mojolicious.org/Mojolicious/Guides/FAQ#What-does-Worker-31842-has-no-heartbeat-50-seconds-restarting-mean 
 * Maybe also related to #138535 
 * Should this be an error message vs debug in the first place? Maybe it is too low level to make it useful? 
 * Find out if we can replace the message upstream, or introduce our own handling with more specific errors 
 * Figure out what worker/pid corresponds to 
 * The message comes from https://github.com/mojolicious/mojo/blob/431cc0a9d7c55b04d7283340cc22a36a09d803e6/lib/Mojo/Server/Prefork.pm#L113 
 Probably we should just submit a patch to the upstream project to reduce from error to warning or info 
 * Could we log performance of requests? Assuming the message could mean "req took too long" 

 ## Out of scope 
 * The _carry_over_candidate messages are unrelated - grep just matches the timestamp here 

Back