action #128345

closed

[logwarn] Worker 30538 has no heartbeat (400 seconds), restarting size:M

Added by livdywan over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: kraih
Category: Regressions/Crashes
Target version:
Start date: 2023-04-27
Due date: 2023-05-20
% Done: 0%
Estimated time:

Description

Observation

from OpenQA logreport for ariel.suse-dmz.opensuse.org:

[2023-04-27T07:29:38.886066Z] [error] Worker 30538 has no heartbeat (400 seconds), restarting

Acceptance criteria

  • AC1: Heartbeat messages are not observed in log reports

Suggestions

  • "Worker $PID" refers to the process on the web UI since this is where we use Mojo workers (prefork HTTP server) and nowhere else
  • Heartbeats are a recurring write of a small message from the prefork workers to the manager every few seconds, this must have been blocked by a blocking operation (like a syscall or database query)
  • Check the WebUI logs for relevant messages
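
The 400 second heartbeat timeout and the 800 second graceful stop seen in the log correspond to the heartbeat_timeout and graceful_timeout settings of Mojo::Server::Prefork. As a minimal sketch only (the values are taken from the log above; how and where openQA actually configures them is not shown here and is an assumption):

use Mojo::Base -strict;
use Mojo::Server::Prefork;

# Values mirror the log above; the exact wiring in openQA is an assumption.
my $prefork = Mojo::Server::Prefork->new(
  heartbeat_interval => 5,      # each worker pings the manager every few seconds
  heartbeat_timeout  => 400,    # the manager restarts a worker silent for this long
  graceful_timeout   => 800,    # extra time a stopping worker gets to finish up
);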

Related issues: 3 (1 open, 2 closed)

Related to openQA Infrastructure (public) - action #121771: openqaworker20 has no heartbeat (Resolved, okurz)

Related to openQA Project (public) - action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M (Resolved, livdywan, 2022-02-03)

Related to openQA Project (public) - action #128936: API endpoint /api/v1/jobs/:jobid/set_done can be very slow (New, 2023-05-08)

Actions #1

Updated by livdywan over 1 year ago

  • Subject changed from [logwarn] Worker 30538 has no heartbeat (400 seconds), restarting to [logwarn] Worker 30538 has no heartbeat (400 seconds), restarting size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by kraih over 1 year ago

  • Project changed from openQA Infrastructure (public) to openQA Project (public)
  • Assignee set to kraih

This is not openQA Infrastructure; it should be a bug somewhere in the code.

Actions #3

Updated by kraih over 1 year ago

  • Category set to Regressions/Crashes
Actions #4

Updated by kraih over 1 year ago

  • Status changed from Workable to In Progress
Actions #5

Updated by kraih over 1 year ago

There has been a second case since:

[2023-05-04T10:12:42.760284Z] [error] Worker 6195 has no heartbeat (400 seconds), restarting
Actions #6

Updated by kraih over 1 year ago

Here are the log messages collected from 10 minutes before the event until the worker was stopped:

[2023-04-27T07:19:06.097799Z] [debug] [pid:30538] Sending AMQP event: opensuse.openqa.job.done
[2023-04-27T07:19:06.098571Z] [debug] [pid:30538] AMQP URL: amqps://USER:PASS@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-04-27T07:19:06.151559Z] [debug] [pid:30538] opensuse.openqa.job.done published
[2023-04-27T07:22:54.723983Z] [debug] [pid:30538] _carry_over_candidate(3247307): _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,generic-676:failed,run:softfailed
[2023-04-27T07:24:12.667803Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3245855: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:25:27.850158Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3243714: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:25:27.850408Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3243714 with repeated problem
[2023-04-27T07:26:32.150038Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3241095: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:26:32.150357Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3241095 with repeated problem
[2023-04-27T07:27:37.565926Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3239952: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:27:37.566128Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3239952 with repeated problem
[2023-04-27T07:28:37.334215Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3236443: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:28:37.334405Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3236443 with repeated problem
[2023-04-27T07:29:32.966574Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3234707: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:29:32.966728Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3234707 with repeated problem
[2023-04-27T07:29:38.886066Z] [error] Worker 30538 has no heartbeat (400 seconds), restarting
[2023-04-27T07:29:38.886376Z] [info] Stopping worker 30538 gracefully (800 seconds)
[2023-04-27T07:30:28.142825Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3233435: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:30:28.143061Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3233435 with repeated problem
[2023-04-27T07:31:24.087082Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3231074: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:31:24.087241Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3231074 with repeated problem
[2023-04-27T07:32:18.253362Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3229569: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:32:18.253559Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3229569 with repeated problem
[2023-04-27T07:33:11.214452Z] [debug] [pid:30538] _carry_over_candidate(3247307): checking take over from 3228736: _failure_reason=generic-053:failed,generic-099:failed,generic-105:failed,generic-294:failed,generic-318:failed,generic-319:failed,generic-444:failed,generic-529:failed,run:softfailed
[2023-04-27T07:33:11.214596Z] [debug] [pid:30538] _carry_over_candidate(3247307): ignoring job 3228736 with repeated problem
[2023-04-27T07:33:11.264477Z] [debug] [pid:30538] Sending AMQP event: opensuse.openqa.job.done
[2023-04-27T07:33:11.265532Z] [debug] [pid:30538] AMQP URL: amqps://USER:PASS@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-04-27T07:33:11.495977Z] [info] Worker 30538 stopped

This should be the request from the Apache logs:

192.168.112.17 - - [27/Apr/2023:07:22:54 +0000] "POST /api/v1/jobs/3247307/set_done?worker_id=520 HTTP/1.1" 406 - "-" "Mojolicious (Perl)" 300020

The picture seems pretty clear: around 07:22:54 an OpenQA::Schema::Result::Jobs::done() call started running SQL queries for job 3247307 and did not finish until 07:33:11. That kept the event loop from sending a heartbeat via its recurring timer for about 11 minutes, far beyond our heartbeat timeout of 400 seconds.
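
To make the mechanism concrete, here is a minimal, self-contained illustration (not openQA code) of a single blocking call starving a recurring Mojo::IOLoop timer; the sleep simply mimics the roughly 11 minute gap seen in the log:

use Mojo::Base -strict;    # enables strict, warnings and say()
use Mojo::IOLoop;

# Stand-in for the prefork heartbeat that is normally written every few seconds
Mojo::IOLoop->recurring(5 => sub { say 'heartbeat' });

# Stand-in for the slow $job->done call: while this sleeps, the event loop is
# blocked and no heartbeat can be written, just like in the log excerpt above
Mojo::IOLoop->timer(1 => sub { sleep 660 });

Mojo::IOLoop->start;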

Actions #7

Updated by kraih over 1 year ago

And for the new case it looks similar:

[2023-05-04T10:06:03.008008Z] [debug] [pid:6195] _carry_over_candidate(3262555): _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,generic-528:failed,run:softfailed
[2023-05-04T10:09:09.560880Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3258496: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,run:softfailed
[2023-05-04T10:11:28.641581Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3256753: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,run:softfailed
[2023-05-04T10:12:42.760284Z] [error] Worker 6195 has no heartbeat (400 seconds), restarting
[2023-05-04T10:12:42.760511Z] [info] Stopping worker 6195 gracefully (800 seconds)
[2023-05-04T10:13:09.021717Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3254957: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,run:softfailed
[2023-05-04T10:13:09.021931Z] [debug] [pid:6195] _carry_over_candidate(3262555): ignoring job 3254957 with repeated problem
[2023-05-04T10:14:21.028710Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3254138: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,run:softfailed
[2023-05-04T10:15:22.415947Z] [debug] [pid:6195] _carry_over_candidate(3262555): checking take over from 3252131: _failure_reason=generic-099:failed,generic-237:failed,generic-258:failed,generic-294:failed,generic-318:failed,generic-444:failed,generic-465:failed,run:softfailed
[2023-05-04T10:15:22.416135Z] [debug] [pid:6195] _carry_over_candidate(3262555): changed state more than 3 (4), aborting search
[2023-05-04T10:15:22.474994Z] [debug] [pid:6195] Sending AMQP event: opensuse.openqa.job.done
[2023-05-04T10:15:22.478586Z] [debug] [pid:6195] AMQP URL: amqps://USER:PASS@rabbit.opensuse.org:5671/?exchange=pubsub
[2023-05-04T10:15:22.690578Z] [info] Worker 6195 stopped
Actions #8

Updated by openqa_review over 1 year ago

  • Due date set to 2023-05-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by okurz over 1 year ago

  • Related to action #121771: openqaworker20 has no heartbeat added

Actions #10

Updated by okurz over 1 year ago

  • Related to action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
Actions #11

Updated by okurz over 1 year ago

  • Priority changed from Normal to Urgent

repeated logreport messages, e.g. from 2023-05-06. I suggest we handle the urgency of repeated alert messages as priority.

Actions #12

Updated by kraih over 1 year ago

okurz wrote:

repeated logreport messages, e.g. from 2023-05-06. I suggest we handle the urgency of repeated alert messages as priority.

Ok, a quick fix should be easy enough by increasing the timeout. I'll prepare a patch.

Actions #14

Updated by okurz over 1 year ago

As you wrote in https://github.com/os-autoinst/openQA/pull/5124 "Long term we should consider moving such slow code into Minion background jobs though.": How about improving the error message to give that hint directly? And is moving to Minion background jobs what you plan to do in particular for this ticket?

Actions #15

Updated by kraih over 1 year ago

As you wrote in https://github.com/os-autoinst/openQA/pull/5124 "Long term we should consider moving such slow code into Minion background jobs though.": How about improving the error message to give that hint directly?

Yes, that might be worth changing upstream. Not the first time this particular log message has caused some confusion.

And is moving to Minion background jobs what you plan to do in particular for this ticket?

I was pondering how to do it before the priority was raised to urgent. :) But it's not as easy as it seems at first, because $job->done is called from 79 places in different contexts and has a significant return value that a bunch of them depend on (a rough sketch of a Minion-based approach follows below). For this ticket I think we are good with the timeout increase. But in general we shouldn't have any API endpoints that take multiple minutes, so a followup ticket might be in order.
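
For illustration only, here is a minimal Mojolicious::Lite sketch of that idea; the task name, backend choice and route handling are made up and heavily simplified, and this is not how openQA is implemented today:

use Mojolicious::Lite -signatures;

# Backend choice is illustrative; openQA uses its own database setup.
plugin Minion => {SQLite => 'sqlite:minion.db'};

# Hypothetical task that would run the slow carry-over/done logic in a Minion
# worker instead of blocking a web UI prefork worker.
app->minion->add_task(finalize_job => sub ($minion_job, $openqa_job_id) {
    # ... the expensive part of $job->done would go here ...
});

# The API controller would only enqueue the slow work. Caveat from above:
# callers relying on the synchronous return value of done() would need a
# different mechanism, which is why this is not a drop-in change.
post '/api/v1/jobs/:jobid/set_done' => sub ($c) {
    $c->minion->enqueue(finalize_job => [$c->param('jobid')]);
    $c->render(json => {result => 'queued'});
};

app->start;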

Actions #16

Updated by kraih over 1 year ago

The vast majority of /api/v1/jobs/:id/set_done requests are pretty fast, but there always seem to be a few outliers that run really slowly (here are a few from yesterday; a throwaway script for pulling such requests out of the access log is sketched after the excerpt):

192.168.112.18 - - [06/May/2023:01:39:00 +0000] "POST /api/v1/jobs/3267772/set_done?worker_id=519 HTTP/1.1" 200 37 "-" "Mojolicious (Perl)" 101578
192.168.112.7 - - [06/May/2023:01:58:27 +0000] "POST /api/v1/jobs/3267447/set_done?worker_id=93 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 47713
192.168.112.18 - - [06/May/2023:02:22:47 +0000] "POST /api/v1/jobs/3267445/set_done?worker_id=505 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 46972
192.168.112.19 - - [06/May/2023:10:37:47 +0000] "POST /api/v1/jobs/3267905/set_done?worker_id=575 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 215399
192.168.112.19 - - [06/May/2023:11:01:37 +0000] "POST /api/v1/jobs/3267907/set_done?worker_id=571 HTTP/1.1" 406 - "-" "Mojolicious (Perl)" 300104
192.168.112.19 - - [06/May/2023:11:14:54 +0000] "POST /api/v1/jobs/3267955/set_done?worker_id=574 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 110528
192.168.112.17 - - [06/May/2023:14:29:27 +0000] "POST /api/v1/jobs/3268327/set_done?worker_id=540 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 84127
192.168.112.19 - - [06/May/2023:18:24:27 +0000] "POST /api/v1/jobs/3268700/set_done?worker_id=573 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 59262
192.168.112.12 - - [06/May/2023:22:02:48 +0000] "POST /api/v1/jobs/3269647/set_done?worker_id=186 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 50023
192.168.112.17 - - [06/May/2023:22:17:35 +0000] "POST /api/v1/jobs/3269577/set_done?worker_id=538 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 119143
192.168.112.17 - - [06/May/2023:22:41:08 +0000] "POST /api/v1/jobs/3269668/set_done?worker_id=537 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 62961
192.168.112.17 - - [06/May/2023:22:41:41 +0000] "POST /api/v1/jobs/3269675/set_done?worker_id=522 HTTP/1.1" 200 33 "-" "Mojolicious (Perl)" 61466
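
Assuming the trailing number on each access log line is the response time in microseconds (Apache's %D), a throwaway script like the one below can pull out the slow set_done requests; the script name and the 60 second threshold are arbitrary:

#!/usr/bin/perl
# slow_set_done.pl: print set_done requests slower than the threshold, assuming
# the last whitespace-separated field is the request duration in microseconds.
use strict;
use warnings;

my $threshold_us = 60_000_000;    # 60 seconds
while (my $line = <>) {
    next unless $line =~ m{POST /api/v1/jobs/\d+/set_done};
    my $duration_us = (split ' ', $line)[-1];
    print $line if $duration_us > $threshold_us;
}

Usage would be something like: perl slow_set_done.pl access_log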
Actions #17

Updated by kraih over 1 year ago

Created a followup ticket to make the API endpoint faster, in whatever way possible: #128936

Actions #18

Updated by kraih over 1 year ago

  • Related to action #128936: API endpoint /api/v1/jobs/:jobid/set_done can be very slow added
Actions #19

Updated by kraih over 1 year ago

  • Status changed from In Progress to Feedback
Actions #20

Updated by okurz over 1 year ago

  • Priority changed from Urgent to Normal

Haven't seen more logreport messages recently, so the urgency was addressed.

Actions #21

Updated by kraih over 1 year ago

  • Status changed from Feedback to Resolved
Actions #22

Updated by okurz over 1 year ago

The message could include a reference to the corresponding FAQ section, e.g. as a shortened URL: https://docs.mojolicious.org/faq/42
