action #128345
closed[logwarn] Worker 30538 has no heartbeat (400 seconds), restarting size:M
Description
Observation¶
from OpenQA logreport for ariel.suse-dmz.opensuse.org:
[2023-04-27T07:29:38.886066Z] [error] Worker 30538 has no heartbeat (400 seconds), restarting
Acceptance criteria¶
- AC1: Heart beat messages are not observed in log reports
Suggestions¶
- "Worker $PID" refers to the process on the web UI since this is where we use Mojo workers (prefork HTTP server) and nowhere else
- Heartbeats are a recurring write of a small message from the prefork workers to the manager every few seconds, this must have been blocked by a blocking operation (like a syscall or database query)
- Check the WebUI logs for relevant messages
Updated by livdywan over 1 year ago
- Subject changed from [logwarn] Worker 30538 has no heartbeat (400 seconds), restarting to [logwarn] Worker 30538 has no heartbeat (400 seconds), restarting size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by kraih over 1 year ago
- Project changed from openQA Infrastructure to openQA Project
- Assignee set to kraih
This is not openQA Infrastructure, it should be a bug somewhere in the code.
Updated by kraih over 1 year ago
There has been a second case since:
[2023-05-04T10:12:42.760284Z] [error] Worker 6195 has no heartbeat (400 seconds), restarting
Updated by kraih over 1 year ago
Here we have the collected log messages from 10 minutes before the event, until the worker has been stopped:
[2023-04-27T07:19:06.097799Z] [debug] [pid:30538] Sending AMQP event: opensuse.openqa.job.done
[2023-04-27T07:19:06.098571Z] [debug] [pid:30538] AMQP URL: amqps://USER:PASS@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-04-27T07:19:06.151559Z] [debug] [pid:30538] opensuse.openqa.job.done published
[2023-04-27T07:22:54.723983Z] [debug] [pid:30538] _carry_over_candidate(3247307): _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,generic-676:failed,run:softfailed
[2023-04-27T07:24:12.667803Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3245855: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:25:27.850158Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3243714: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:25:27.850408Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3243714 with repeated problem
[2023-04-27T07:26:32.150038Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3241095: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:26:32.150357Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3241095 with repeated problem
[2023-04-27T07:27:37.565926Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3239952: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:27:37.566128Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3239952 with repeated problem
[2023-04-27T07:28:37.334215Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3236443: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:28:37.334405Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3236443 with repeated problem
[2023-04-27T07:29:32.966574Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3234707: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:29:32.966728Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3234707 with repeated problem
[2023-04-27T07:29:38.886066Z] [error] Worker 30538 has no heartbeat (400 seconds), restarting
[2023-04-27T07:29:38.886376Z] [info] Stopping worker 30538 gracefully (800 seconds)
[2023-04-27T07:30:28.142825Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3233435: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:30:28.143061Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3233435 with repeated problem
[2023-04-27T07:31:24.087082Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3231074: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:31:24.087241Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3231074 with repeated problem
[2023-04-27T07:32:18.253362Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3229569: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:32:18.253559Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3229569 with repeated problem
[2023-04-27T07:33:11.214452Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3228736: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:33:11.214596Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3228736 with repeated problem
[2023-04-27T07:33:11.264477Z] [debug] [pid:30538] Sending AMQP event: opensuse.openqa.job.done
[2023-04-27T07:33:11.265532Z] [debug] [pid:30538] AMQP URL: amqps://USER:PASS@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-04-27T07:33:11.495977Z] [info] Worker 30538 stopped
This should be the request from the Apache logs:
192.168.112.17 - - [27/Apr/2023:07:22:54 +0000] "POST /api/v1/jobs/3247307/set_done?worker_id=520 HTTP/1.1" 406 - "-" "Mojolicious (Perl)" 300020
The picture seems pretty clear, around 07:22:54
there was an OpenQA::Schema::Result::Jobs::done()
call that ran SQL queries for job 3247307
until 07:33:11
. Not letting the event loop send a heartbeat via recurring timer for 11
minutes, going way above our heartbeat timeout of 400
seconds.
Updated by kraih over 1 year ago
And for the new case it looks similar:
[2023-05-04T10:06:03.008008Z] [debug] [pid:6195] _carry_over_candidate(3262555): _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,generic-528:failed,run:softfailed
[2023-05-04T10:09:09.560880Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3258496: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,run:softfailed
[2023-05-04T10:11:28.641581Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3256753: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,run:softfailed
[2023-05-04T10:12:42.760284Z] [error] Worker 6195 has no heartbeat (400 seconds), restarting
[2023-05-04T10:12:42.760511Z] [info] Stopping worker 6195 gracefully (800 seconds)
[2023-05-04T10:13:09.021717Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3254957: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,run:softfailed
[2023-05-04T10:13:09.021931Z] [debug] [pid:6195] _carry_over_candidate(3262555): ignoring job 3254957 with repeated problem
[2023-05-04T10:14:21.028710Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3254138: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,run:softfailed
[2023-05-04T10:15:22.415947Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3252131: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,run:softfailed
[2023-05-04T10:15:22.416135Z] [debug] [pid:6195] _carry_over_candidate(3262555): changed state more than 3 (4), aborting search
[2023-05-04T10:15:22.474994Z] [debug] [pid:6195] Sending AMQP event: opensuse.openqa.job.done
[2023-05-04T10:15:22.478586Z] [debug] [pid:6195] AMQP URL: amqps://USER:PASS@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-05-04T10:15:22.690578Z] [info] Worker 6195 stopped
Updated by openqa_review over 1 year ago
- Due date set to 2023-05-20
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
- Related to action #121771: openqaworker20 has no heartbeat added
Updated by okurz over 1 year ago
- Related to action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
Updated by okurz over 1 year ago
- Priority changed from Normal to Urgent
repeated logreport messages, e.g. from 2023-05-06. I suggest we handle the urgency of repeated alert messages as priority.
Updated by kraih over 1 year ago
okurz wrote:
repeated logreport messages, e.g. from 2023-05-06. I suggest we handle the urgency of repeated alert messages as priority.
Ok, quick fix should be easy enough by increasing the timeout. I'll prepare a patch.
Updated by kraih over 1 year ago
Updated by okurz over 1 year ago
as you wrote in https://github.com/os-autoinst/openQA/pull/5124 "Long term we should consider moving such slow code into Minion background jobs though.": How about improving the error message to give that hint directly? And is moving to minion background jobs what you plan to do in particular for this ticket?
Updated by kraih over 1 year ago
as you wrote in https://github.com/os-autoinst/openQA/pull/5124 "Long term we should consider moving such slow code into Minion background jobs though.": How about improving the error message to give that hint directly?
Yes, that might be worth changing upstream. Not the first time this particular log message has caused some confusion.
And is moving to minion background jobs what you plan to do in particular for this ticket?
I was pondering how to do it before the priority was raised to urgent. :) But it's not as easy at it seems at first, because $job->done
is called from 79 places in different contexts, and has a significant return value a bunch of them depend on. For this ticket i think we are good with the timeout increase. But in general, we shouldn't have any API endpoints that take multiple minutes. So a followup ticket might be in order.
Updated by kraih over 1 year ago
The vast majority of /api/v1/jobs/:id/set_done
requests are pretty fast. But there always seems to be a bunch of outliers that run really slow (here a few from yesterday):
192.168.112.18 - - [06/May/2023:01:39:00 +0000] "POST /api/v1/jobs/3267772/set_done?worker_id=519 HTTP/1.1" 200 37 "-" "Mojolicious (Perl)" 101578
192.168.112.7 - - [06/May/2023:01:58:27 +0000] "POST /api/v1/jobs/3267447/set_done?worker_id=93 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 47713
192.168.112.18 - - [06/May/2023:02:22:47 +0000] "POST /api/v1/jobs/3267445/set_done?worker_id=505 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 46972
192.168.112.19 - - [06/May/2023:10:37:47 +0000] "POST /api/v1/jobs/3267905/set_done?worker_id=575 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 215399
192.168.112.19 - - [06/May/2023:11:01:37 +0000] "POST /api/v1/jobs/3267907/set_done?worker_id=571 HTTP/1.1" 406 - "-" "Mojolicious (Perl)" 300104
192.168.112.19 - - [06/May/2023:11:14:54 +0000] "POST /api/v1/jobs/3267955/set_done?worker_id=574 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 110528
192.168.112.17 - - [06/May/2023:14:29:27 +0000] "POST /api/v1/jobs/3268327/set_done?worker_id=540 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 84127
192.168.112.19 - - [06/May/2023:18:24:27 +0000] "POST /api/v1/jobs/3268700/set_done?worker_id=573 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 59262
192.168.112.12 - - [06/May/2023:22:02:48 +0000] "POST /api/v1/jobs/3269647/set_done?worker_id=186 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 50023
192.168.112.17 - - [06/May/2023:22:17:35 +0000] "POST /api/v1/jobs/3269577/set_done?worker_id=538 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 119143
192.168.112.17 - - [06/May/2023:22:41:08 +0000] "POST /api/v1/jobs/3269668/set_done?worker_id=537 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 62961
192.168.112.17 - - [06/May/2023:22:41:41 +0000] "POST /api/v1/jobs/3269675/set_done?worker_id=522 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 61466
Updated by kraih over 1 year ago
Created a followup ticket to make the API endpoint faster, in whatever way possible: #128936
Updated by kraih over 1 year ago
- Related to action #128936: API endpoint /api/v1/jobs/:jobid/set_done can be very slow added
Updated by okurz over 1 year ago
- Priority changed from Urgent to Normal
Haven't seen more logreport messages recently so the urgency was addressed.
Updated by okurz over 1 year ago
The message could include a reference to the according faq section like a shortened URL: https://docs.mojolicious.org/faq/42