action #128345

closed

[logwarn] Worker 30538 has no heartbeat (400 seconds), restarting size:M

Added by livdywan over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: kraih
Category: Regressions/Crashes
Target version:
Start date: 2023-04-27
Due date: 2023-05-20
% Done: 0%
Estimated time:

Description

Observation

from OpenQA logreport for ariel.suse-dmz.opensuse.org:

[2023-04-27T07:29:38.886066Z] [error] Worker 30538 has no heartbeat (400 seconds), restarting

Acceptance criteria

  • AC1: Heartbeat messages are not observed in log reports

Suggestions

  • "Worker $PID" refers to the process on the web UI since this is where we use Mojo workers (prefork HTTP server) and nowhere else
  • Heartbeats are a recurring write of a small message from the prefork workers to the manager every few seconds, this must have been blocked by a blocking operation (like a syscall or database query)
  • Check the WebUI logs for relevant messages
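
The 400 second heartbeat timeout and the 800 second graceful stop seen in the log correspond to the heartbeat_timeout and graceful_timeout settings of Mojo::Server::Prefork. As a minimal sketch only (the values are taken from the log above; how and where openQA actually configures them is not shown here and is an assumption):

use Mojo::Base -strict;
use Mojo::Server::Prefork;

# Values mirror the log above; the exact wiring in openQA is an assumption.
my $prefork = Mojo::Server::Prefork->new(
  heartbeat_interval => 5,      # each worker pings the manager every few seconds
  heartbeat_timeout  => 400,    # the manager restarts a worker silent for this long
  graceful_timeout   => 800,    # extra time a stopping worker gets to finish up
);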

Related issues: 3 (1 open, 2 closed)

Related to openQA Infrastructure (public) - action #121771: openqaworker20 has no heartbeat (Resolved, okurz)

Related to openQA Project (public) - action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M (Resolved, livdywan, 2022-02-03)

Related to openQA Project (public) - action #128936: API endpoint /api/v1/jobs/:jobid/set_done can be very slow (New, 2023-05-08)

Actions #1

Updated by livdywan over 1 year ago

  • Subject changed from [logwarn] Worker 30538 has no heartbeat (400 seconds), restarting to [logwarn] Worker 30538 has no heartbeat (400 seconds), restarting size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by kraih over 1 year ago

  • Project changed from openQA Infrastructure (public) to openQA Project (public)
  • Assignee set to kraih

This is not openQA Infrastructure; it should be a bug somewhere in the code.

Actions #3

Updated by kraih over 1 year ago

  • Category set to Regressions/Crashes
Actions #4

Updated by kraih over 1 year ago

  • Status changed from Workable to In Progress
Actions #5

Updated by kraih over 1 year ago

There has been a second case since:

[2023-05-04T10:12:42.760284Z] [error] Worker 6195 has no heartbeat (400 seconds), restarting
Actions #6

Updated by kraih over 1 year ago

Here are the log messages collected from 10 minutes before the event until the worker was stopped:

[2023-04-27T07:19:06.097799Z] [debug] [pid:30538] Sending AMQP event: opensuse.openqa.job.done
[2023-04-27T07:19:06.098571Z] [debug] [pid:30538] AMQP URL: amqps://USER:PASS@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-04-27T07:19:06.151559Z] [debug] [pid:30538] opensuse.openqa.job.done published
[2023-04-27T07:22:54.723983Z] [debug] [pid:30538] _carry_over_candidate(3247307): _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,generic-676:failed,run:softfailed
[2023-04-27T07:24:12.667803Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3245855: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:25:27.850158Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3243714: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:25:27.850408Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3243714 with repeated problem
[2023-04-27T07:26:32.150038Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3241095: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:26:32.150357Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3241095 with repeated problem
[2023-04-27T07:27:37.565926Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3239952: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:27:37.566128Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3239952 with repeated problem
[2023-04-27T07:28:37.334215Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3236443: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:28:37.334405Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3236443 with repeated problem
[2023-04-27T07:29:32.966574Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3234707: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:29:32.966728Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3234707 with repeated problem
[2023-04-27T07:29:38.886066Z] [error] Worker 30538 has no heartbeat (400 seconds), restarting
[2023-04-27T07:29:38.886376Z] [info] Stopping worker 30538 gracefully (800 seconds)
[2023-04-27T07:30:28.142825Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3233435: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:30:28.143061Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3233435 with repeated problem
[2023-04-27T07:31:24.087082Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3231074: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:31:24.087241Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3231074 with repeated problem
[2023-04-27T07:32:18.253362Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3229569: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:32:18.253559Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3229569 with repeated problem
[2023-04-27T07:33:11.214452Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3228736: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:33:11.214596Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3228736 with repeated problem
[2023-04-27T07:33:11.264477Z] [debug] [pid:30538] Sending AMQP event: opensuse.openqa.job.done
[2023-04-27T07:33:11.265532Z] [debug] [pid:30538] AMQP URL: amqps://USER:PASS@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-04-27T07:33:11.495977Z] [info] Worker 30538 stopped

This should be the request from the Apache logs:

192.168.112.17 - - [27/Apr/2023:07:22:54 +0000] "POST /api/v1/jobs/3247307/set_done?worker_id=520 HTTP/1.1" 406 - "-" "Mojolicious (Perl)" 300020

The picture seems pretty clear: around 07:22:54 an OpenQA::Schema::Result::Jobs::done() call started running SQL queries for job 3247307 and did not finish until 07:33:11. That kept the event loop from sending a heartbeat via its recurring timer for about 11 minutes, far beyond our heartbeat timeout of 400 seconds.
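
To make the mechanism concrete, here is a minimal, self-contained illustration (not openQA code) of a single blocking call starving a recurring Mojo::IOLoop timer; the sleep simply mimics the roughly 11 minute gap seen in the log:

use Mojo::Base -strict;    # enables strict, warnings and say()
use Mojo::IOLoop;

# Stand-in for the prefork heartbeat that is normally written every few seconds
Mojo::IOLoop->recurring(5 => sub { say 'heartbeat' });

# Stand-in for the slow $job->done call: while this sleeps, the event loop is
# blocked and no heartbeat can be written, just like in the log excerpt above
Mojo::IOLoop->timer(1 => sub { sleep 660 });

Mojo::IOLoop->start;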

Actions #7

Updated by kraih over 1 year ago

And for the new case it looks similar:

[2023-05-04T10:06:03.008008Z] [debug] [pid:6195] _carry_over_candidate(3262555): _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,generic-528:failed,run:softfailed
[2023-05-04T10:09:09.560880Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3258496: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,run:softfailed
[2023-05-04T10:11:28.641581Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3256753: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,run:softfailed
[2023-05-04T10:12:42.760284Z] [error] Worker 6195 has no heartbeat (400 seconds), restarting
[2023-05-04T10:12:42.760511Z] [info] Stopping worker 6195 gracefully (800 seconds)
[2023-05-04T10:13:09.021717Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3254957: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,run:softfailed
[2023-05-04T10:13:09.021931Z] [debug] [pid:6195] _carry_over_candidate(3262555): ignoring job 3254957 with repeated problem
[2023-05-04T10:14:21.028710Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3254138: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,run:softfailed
[2023-05-04T10:15:22.415947Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3252131: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,run:softfailed
[2023-05-04T10:15:22.416135Z] [debug] [pid:6195] _carry_over_candidate(3262555): changed state more than 3 (4), aborting search
[2023-05-04T10:15:22.474994Z] [debug] [pid:6195] Sending AMQP event: opensuse.openqa.job.done
[2023-05-04T10:15:22.478586Z] [debug] [pid:6195] AMQP URL: amqps://USER:PASS@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-05-04T10:15:22.690578Z] [info] Worker 6195 stopped
Actions #8

Updated by openqa_review over 1 year ago

  • Due date set to 2023-05-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by okurz over 1 year ago

  • Related to action #121771: openqaworker20 has no heartbeat added

Actions #10

Updated by okurz over 1 year ago

  • Related to action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
Actions #11

Updated by okurz over 1 year ago

  • Priority changed from Normal to Urgent

repeated logreport messages, e.g. from 2023-05-06. I suggest we handle the urgency of repeated alert messages as priority.

Actions #12

Updated by kraih over 1 year ago

okurz wrote:

repeated logreport messages, e.g. from 2023-05-06. I suggest we handle the urgency of repeated alert messages as priority.

Ok, a quick fix should be easy enough by increasing the timeout. I'll prepare a patch.

Actions #14

Updated by okurz over 1 year ago

As you wrote in https://github.com/os-autoinst/openQA/pull/5124 "Long term we should consider moving such slow code into Minion background jobs though.": How about improving the error message to give that hint directly? And is moving to Minion background jobs what you plan to do in particular for this ticket?

Actions #15

Updated by kraih over 1 year ago

As you wrote in https://github.com/os-autoinst/openQA/pull/5124 "Long term we should consider moving such slow code into Minion background jobs though.": How about improving the error message to give that hint directly?

Yes, that might be worth changing upstream. Not the first time this particular log message has caused some confusion.

And is moving to Minion background jobs what you plan to do in particular for this ticket?

I was pondering how to do it before the priority was raised to urgent. :) But it's not as easy as it seems at first, because $job->done is called from 79 places in different contexts and has a significant return value that a bunch of them depend on (a rough sketch of a Minion-based approach follows below). For this ticket I think we are good with the timeout increase. But in general we shouldn't have any API endpoints that take multiple minutes, so a followup ticket might be in order.
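
For illustration only, here is a minimal Mojolicious::Lite sketch of that idea; the task name, backend choice and route handling are made up and heavily simplified, and this is not how openQA is implemented today:

use Mojolicious::Lite -signatures;

# Backend choice is illustrative; openQA uses its own database setup.
plugin Minion => {SQLite => 'sqlite:minion.db'};

# Hypothetical task that would run the slow carry-over/done logic in a Minion
# worker instead of blocking a web UI prefork worker.
app->minion->add_task(finalize_job => sub ($minion_job, $openqa_job_id) {
    # ... the expensive part of $job->done would go here ...
});

# The API controller would only enqueue the slow work. Caveat from above:
# callers relying on the synchronous return value of done() would need a
# different mechanism, which is why this is not a drop-in change.
post '/api/v1/jobs/:jobid/set_done' => sub ($c) {
    $c->minion->enqueue(finalize_job => [$c->param('jobid')]);
    $c->render(json => {result => 'queued'});
};

app->start;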

Actions #16

Updated by kraih over 1 year ago

The vast majority of /api/v1/jobs/:id/set_done requests are pretty fast, but there always seem to be a few outliers that run really slowly (here are a few from yesterday; a throwaway script for pulling such requests out of the access log is sketched after the excerpt):

192.168.112.18 - - [06/May/2023:01:39:00 +0000] "POST /api/v1/jobs/3267772/set_done?worker_id=519 HTTP/1.1" 200 37 "-" "Mojolicious (Perl)" 101578
192.168.112.7 - - [06/May/2023:01:58:27 +0000] "POST /api/v1/jobs/3267447/set_done?worker_id=93 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 47713
192.168.112.18 - - [06/May/2023:02:22:47 +0000] "POST /api/v1/jobs/3267445/set_done?worker_id=505 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 46972
192.168.112.19 - - [06/May/2023:10:37:47 +0000] "POST /api/v1/jobs/3267905/set_done?worker_id=575 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 215399
192.168.112.19 - - [06/May/2023:11:01:37 +0000] "POST /api/v1/jobs/3267907/set_done?worker_id=571 HTTP/1.1" 406 - "-" "Mojolicious (Perl)" 300104
192.168.112.19 - - [06/May/2023:11:14:54 +0000] "POST /api/v1/jobs/3267955/set_done?worker_id=574 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 110528
192.168.112.17 - - [06/May/2023:14:29:27 +0000] "POST /api/v1/jobs/3268327/set_done?worker_id=540 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 84127
192.168.112.19 - - [06/May/2023:18:24:27 +0000] "POST /api/v1/jobs/3268700/set_done?worker_id=573 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 59262
192.168.112.12 - - [06/May/2023:22:02:48 +0000] "POST /api/v1/jobs/3269647/set_done?worker_id=186 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 50023
192.168.112.17 - - [06/May/2023:22:17:35 +0000] "POST /api/v1/jobs/3269577/set_done?worker_id=538 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 119143
192.168.112.17 - - [06/May/2023:22:41:08 +0000] "POST /api/v1/jobs/3269668/set_done?worker_id=537 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 62961
192.168.112.17 - - [06/May/2023:22:41:41 +0000] "POST /api/v1/jobs/3269675/set_done?worker_id=522 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 61466
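
Assuming the trailing number on each access log line is the response time in microseconds (Apache's %D), a throwaway script like the one below can pull out the slow set_done requests; the script name and the 60 second threshold are arbitrary:

#!/usr/bin/perl
# slow_set_done.pl: print set_done requests slower than the threshold, assuming
# the last whitespace-separated field is the request duration in microseconds.
use strict;
use warnings;

my $threshold_us = 60_000_000;    # 60 seconds
while (my $line = <>) {
    next unless $line =~ m{POST /api/v1/jobs/\d+/set_done};
    my $duration_us = (split ' ', $line)[-1];
    print $line if $duration_us > $threshold_us;
}

Usage would be something like: perl slow_set_done.pl access_log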
Actions #17

Updated by kraih over 1 year ago

Created a followup ticket to make the API endpoint faster, in whatever way possible: #128936

Actions #18

Updated by kraih over 1 year ago

  • Related to action #128936: API endpoint /api/v1/jobs/:jobid/set_done can be very slow added
Actions #19

Updated by kraih over 1 year ago

  • Status changed from In Progress to Feedback
Actions #20

Updated by okurz over 1 year ago

  • Priority changed from Urgent to Normal

Haven't seen more logreport messages recently, so the urgency was addressed.

Actions #21

Updated by kraih over 1 year ago

  • Status changed from Feedback to Resolved
Actions #22

Updated by okurz over 1 year ago

The message could include a reference to the corresponding FAQ section, e.g. as a shortened URL: https://docs.mojolicious.org/faq/42
