action #110032

closed

Migration group can not trigger, missing minion jobs? size:M

Added by coolgw about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee: kraih
Category: Regressions/Crashes
Target version: Ready
Start date: 2022-05-03
Due date: 2022-05-10
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Motivation

This issue has already happened twice. After checking the openQA logs (please see the attachments), the log shows that Minion job 4305089 was triggered, but it does not exist in the Minion job list.

Acceptance criteria

  • AC1: There are no new leftover entries with status added in the scheduled_products table.

Suggestions

  • Find a Minion job that's not older than 2 days with the query SELECT id, minion_job_id FROM scheduled_products WHERE status = 'added' AND t_created > NOW() - INTERVAL '2days'; and look at the metadata.
  • Identify a regression around March 18.

Files

trigger2.png (182 KB) coolgw, 2022-04-18 01:56
trigger1.png (216 KB) coolgw, 2022-04-18 01:56

Subtasks 1 (0 open, 1 closed)

action #110566: [tools] Problem with ISOS Post command from OBS sync (Resolved, kraih, 2022-05-03)

Related issues 1 (1 open, 0 closed)

Related to openQA Project - action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule (New, 2022-03-25)

Actions #1

Updated by okurz almost 2 years ago

  • Subject changed from Migration group can not trigger to Migration group can not trigger, missing minion jobs?
  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #2

Updated by kraih almost 2 years ago

  • Assignee set to kraih

For context, the async ISO POST results in a schedule_iso Minion job, executing this code. It does not look like the Minion job would delete itself, but the job id is stored in the database, so maybe it gets removed as a side effect of something else.
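As a rough sketch of that flow (the helper, $app, $schema and the resultset/column names are simplified assumptions based on this ticket, not the actual openQA code): the POST enqueues the schedule_iso job and stores the returned Minion job id in the scheduled_products row.

use Mojo::Base -strict, -signatures;

# Hypothetical helper; the real implementation lives in openQA's Gru plugin
# and ISO scheduling code.
sub schedule_product ($app, $schema, $product_id, $params) {

    # Enqueue the schedule_iso Minion job; enqueue() returns the new job id.
    my $minion_job_id = $app->minion->enqueue(schedule_iso => [$params]);

    # Store the job id with the scheduled product so the two can be matched
    # later; the row stays in "added" until the job actually runs.
    $schema->resultset('ScheduledProducts')->find($product_id)
      ->update({minion_job_id => $minion_job_id, status => 'added'});

    return $minion_job_id;
}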

Actions #3

Updated by kraih almost 2 years ago

The entry from the scheduled_products table is interesting; it's still in the added state:

   id   | distri | build | status | results | gru_task_id | minion_job_id |      t_created      |      t_updated
--------+--------+-------+--------+---------+-------------+---------------+---------------------+---------------------
 898954 | SLE    | 129.1 | added  |         |    31418883 |       4305089 | 2022-04-15 04:17:10 | 2022-04-15 04:17:10
Actions #4

Updated by kraih almost 2 years ago

  • Status changed from New to In Progress
Actions #5

Updated by kraih almost 2 years ago

On OSD there are actually 62 of these cases, some pretty old, but it seems to have become more common only recently:

openqa=# SELECT id, distri, arch, build, status, results, gru_task_id, minion_job_id, t_created, t_updated FROM scheduled_products WHERE status = 'added' AND t_updated < NOW() - INTERVAL '7 days' ORDER BY t_created ASC;
   id   | distri |  arch   | build | status | results | gru_task_id | minion_job_id |      t_created      |      t_updated
--------+--------+---------+-------+--------+---------+-------------+---------------+---------------------+---------------------
 459262 | sle    | x86_64  | 4.29  | added  |         |    27353855 |        239470 | 2019-08-06 17:21:59 | 2019-08-06 17:21:59
 459263 | sle    | x86_64  | 4.29  | added  |         |    27353856 |        239471 | 2019-08-06 17:22:00 | 2019-08-06 17:22:00
 459264 | sle    | x86_64  | 4.29  | added  |         |    27353857 |        239472 | 2019-08-06 17:22:02 | 2019-08-06 17:22:02
 884945 | SLE    | aarch64 | 113.1 | added  |         |    31215424 |       4101598 | 2022-03-18 00:45:10 | 2022-03-18 00:45:10
 884946 | SLE    | ppc64le | 113.1 | added  |         |    31215425 |       4101599 | 2022-03-18 00:45:12 | 2022-03-18 00:45:12
 884947 | SLE    | s390x   | 113.1 | added  |         |    31215426 |       4101600 | 2022-03-18 00:45:13 | 2022-03-18 00:45:13
 884948 | SLE    | x86_64  | 113.1 | added  |         |    31215427 |       4101601 | 2022-03-18 00:45:14 | 2022-03-18 00:45:14
 893953 | SLE    | aarch64 | 124.1 | added  |         |    31351834 |       4238032 | 2022-04-07 22:25:06 | 2022-04-07 22:25:06
 893954 | SLE    | ppc64le | 124.1 | added  |         |    31351835 |       4238033 | 2022-04-07 22:25:06 | 2022-04-07 22:25:06
 893955 | SLE    | s390x   | 124.1 | added  |         |    31351836 |       4238034 | 2022-04-07 22:25:07 | 2022-04-07 22:25:07
 893956 | SLE    | x86_64  | 124.1 | added  |         |    31351837 |       4238035 | 2022-04-07 22:25:08 | 2022-04-07 22:25:08
 894522 | sle    | x86_64  | 1.95  | added  |         |    31361288 |       4247487 | 2022-04-08 23:49:34 | 2022-04-08 23:49:34
 894523 | sle    | aarch64 | 1.95  | added  |         |    31361290 |       4247489 | 2022-04-08 23:49:38 | 2022-04-08 23:49:38
 894524 | sle    | x86_64  | 1.95  | added  |         |    31361291 |       4247490 | 2022-04-08 23:49:42 | 2022-04-08 23:49:42
 894525 | sle    | x86_64  | 1.95  | added  |         |    31361292 |       4247491 | 2022-04-08 23:49:46 | 2022-04-08 23:49:46
 894526 | sle    | x86_64  | 1.95  | added  |         |    31361294 |       4247493 | 2022-04-08 23:49:52 | 2022-04-08 23:49:52
 894528 | SLE    | ppc64le | 125.1 | added  |         |    31361366 |       4247565 | 2022-04-08 23:54:45 | 2022-04-08 23:54:45
 894529 | SLE    | s390x   | 125.1 | added  |         |    31361368 |       4247567 | 2022-04-08 23:54:59 | 2022-04-08 23:54:59
 894530 | SLE    | x86_64  | 125.1 | added  |         |    31361372 |       4247571 | 2022-04-08 23:55:12 | 2022-04-08 23:55:13
 894531 | SLE    | aarch64 | 125.1 | added  |         |    31361373 |       4247572 | 2022-04-08 23:55:26 | 2022-04-08 23:55:26
 894532 | SLE    | ppc64le | 125.1 | added  |         |    31361374 |       4247573 | 2022-04-08 23:55:36 | 2022-04-08 23:55:36
 894533 | SLE    | s390x   | 125.1 | added  |         |    31361377 |       4247576 | 2022-04-08 23:55:45 | 2022-04-08 23:55:45
 894534 | SLE    | x86_64  | 125.1 | added  |         |    31361378 |       4247577 | 2022-04-08 23:55:55 | 2022-04-08 23:55:56
 894535 | SLE    | aarch64 | 125.1 | added  |         |    31361382 |       4247581 | 2022-04-08 23:56:07 | 2022-04-08 23:56:07
 894536 | SLE    | ppc64le | 125.1 | added  |         |    31361383 |       4247582 | 2022-04-08 23:56:18 | 2022-04-08 23:56:19
 894537 | SLE    | s390x   | 125.1 | added  |         |    31361384 |       4247583 | 2022-04-08 23:56:30 | 2022-04-08 23:56:30
 894538 | SLE    | x86_64  | 125.1 | added  |         |    31361385 |       4247584 | 2022-04-08 23:56:41 | 2022-04-08 23:56:42
 894539 | SLE    | aarch64 | 125.1 | added  |         |    31361388 |       4247587 | 2022-04-08 23:56:55 | 2022-04-08 23:56:55
 894540 | SLE    | ppc64le | 125.1 | added  |         |    31361390 |       4247589 | 2022-04-08 23:57:08 | 2022-04-08 23:57:08
 894541 | SLE    | s390x   | 125.1 | added  |         |    31361392 |       4247591 | 2022-04-08 23:57:20 | 2022-04-08 23:57:21
 894542 | SLE    | x86_64  | 125.1 | added  |         |    31361393 |       4247592 | 2022-04-08 23:57:32 | 2022-04-08 23:57:32
 894543 | SLE    | aarch64 | 125.1 | added  |         |    31361395 |       4247594 | 2022-04-08 23:57:41 | 2022-04-08 23:57:41
 894545 | SLE    | s390x   | 125.1 | added  |         |    31361399 |       4247598 | 2022-04-08 23:57:57 | 2022-04-08 23:57:57
...
(62 rows)

And all of these cases are clustered together with similar ones within a timespan of a few seconds.

Actions #6

Updated by kraih almost 2 years ago

To get further, I'll need to catch a case that's no older than 2 days, because that's the timespan for which we keep Minion job data in storage. Unfortunately, there is no such case at the moment.

openqa=# SELECT id, minion_job_id FROM scheduled_products WHERE status = 'added' AND t_created > NOW() - INTERVAL '2days';
 id | minion_job_id
----+---------------
(0 rows)

Until then, I can only look for regressions in commits around March 18, or possibly a logic error somewhere in the code that results in Minion jobs being deleted.
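Once a fresh case does turn up, the job metadata could be inspected directly via the Minion API. A minimal sketch, assuming access to the openQA database (the connection string is a placeholder and the job id is just the example from this ticket):

use Minion;
use Mojo::Util qw(dumper);

# Placeholder DSN; on OSD this would be the real openQA PostgreSQL database.
my $minion = Minion->new(Pg => 'postgresql://openqa@localhost/openqa');

# job() returns undef once the job data has been removed from storage.
if (my $job = $minion->job(4305089)) {
    my $info = $job->info;
    print "state: $info->{state}\n", 'notes: ', dumper($info->{notes});
}
else {
    print "Minion job data is already gone\n";
}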

Actions #7

Updated by kraih almost 2 years ago

  • Description updated (diff)
Actions #8

Updated by kraih almost 2 years ago

  • Description updated (diff)
Actions #9

Updated by openqa_review almost 2 years ago

  • Due date set to 2022-05-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by kraih almost 2 years ago

Today I found a case that was newer than 2 days, and the Minion jobs were definitely missing before the automatic cleanup could have triggered:

openqa=# SELECT id, distri, arch, build, status, results, gru_task_id, minion_job_id, t_created, t_updated FROM scheduled_products WHERE status = 'added' AND t_updated > NOW() - INTERVAL '2 days' ORDER BY t_created ASC;
   id   | distri |  arch   | build | status | results | gru_task_id | minion_job_id |      t_created      |      t_updated
--------+--------+---------+-------+--------+---------+-------------+---------------+---------------------+---------------------
 907971 | SLE    | aarch64 | 137.1 | added  |         |    31490948 |       4377167 | 2022-04-26 07:07:08 | 2022-04-26 07:07:08
 907972 | SLE    | ppc64le | 137.1 | added  |         |    31490949 |       4377168 | 2022-04-26 07:07:10 | 2022-04-26 07:07:11
 907973 | SLE    | s390x   | 137.1 | added  |         |    31490950 |       4377169 | 2022-04-26 07:07:13 | 2022-04-26 07:07:13
 907974 | SLE    | x86_64  | 137.1 | added  |         |    31490951 |       4377170 | 2022-04-26 07:07:14 | 2022-04-26 07:07:14
(4 rows)

openqa=# SELECT * FROM minion_jobs WHERE id = 4377167;
 id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)

openqa=# SELECT * FROM minion_jobs WHERE id = 4377168;
 id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)

openqa=# SELECT * FROM minion_jobs WHERE id = 4377169;
 id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)

openqa=# SELECT * FROM minion_jobs WHERE id = 4377170;
 id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)

In the Gru logs I could only find the jobs around these four (not executed in the listed order, see timestamps):

[2022-04-26T09:15:38.563826+02:00] [debug] Process 2071 is performing job "4377161" with task "schedule_iso"
[2022-04-26T09:15:38.669395+02:00] [debug] Process 2074 is performing job "4377162" with task "schedule_iso"
[2022-04-26T09:15:58.444404+02:00] [debug] Process 2273 is performing job "4377163" with task "schedule_iso"
[2022-04-26T09:31:01.834314+02:00] [debug] Process 19127 is performing job "4377164" with task "finalize_job_results"
[2022-04-26T09:16:08.571776+02:00] [debug] Process 2515 is performing job "4377165" with task "schedule_iso"
[2022-04-26T09:16:31.049261+02:00] [debug] Process 2827 is performing job "4377166" with task "schedule_iso"
...
[2022-04-26T09:31:01.912941+02:00] [debug] Process 19130 is performing job "4377171" with task "finalize_job_results"
[2022-04-26T09:31:01.950323+02:00] [debug] Process 19133 is performing job "4377172" with task "finalize_job_results"
Actions #11

Updated by mkittler almost 2 years ago

  • Related to action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule added
Actions #12

Updated by mkittler almost 2 years ago

I've added #108983 as a related ticket because, in theory, such jobs might have been cleaned up by an admin if they ended up with "Job terminated unexpectedly".

Actions #13

Updated by kraih almost 2 years ago

mkittler wrote:

I've added #108983 as a related ticket because, in theory, such jobs might have been cleaned up by an admin if they ended up with "Job terminated unexpectedly".

Manual deletion by a user is my top suspicion right now too. As a "solution" I'll probably add logging of job deletions, so there's at least a trace for auditing. The jobs could not have failed though, or there would be a log message like Process 1234 is performing job "4377167" with task "schedule_iso".
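A minimal sketch of what such deletion logging could look like (an assumption about one possible approach, not necessarily the change that was actually made): wrapping Minion::Job::remove so that every removal leaves a trace in the application log.

use Mojo::Util qw(monkey_patch);
use Minion::Job;

# Sketch only: keep a reference to the original remove() and wrap it so every
# manual or programmatic job removal is logged for auditing.
my $original_remove = \&Minion::Job::remove;
monkey_patch 'Minion::Job', remove => sub {
    my $job = shift;
    $job->minion->app->log->info(
        'Removing Minion job ' . $job->id . ' (task ' . $job->task . ')');
    return $original_remove->($job, @_);
};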

Actions #14

Updated by kraih almost 2 years ago

I asked around to find out whether someone is running an automated script to delete jobs that might be misbehaving, but found nobody so far.

Actions #16

Updated by okurz almost 2 years ago

  • Subject changed from Migration group can not trigger, missing minion jobs? to Migration group can not trigger, missing minion jobs? size:M
Actions #17

Updated by kraih almost 2 years ago

Today we've had a lot of cases:

openqa=# SELECT id, distri, arch, build, status, results, gru_task_id, minion_job_id, t_created, t_updated FROM scheduled_products WHERE status = 'added' AND t_updated > NOW() - INTERVAL '2 days' ORDER BY t_created ASC;
   id   | distri |  arch   | build | status | results | gru_task_id | minion_job_id |      t_created      |      t_updated
--------+--------+---------+-------+--------+---------+-------------+---------------+---------------------+---------------------
 910510 | sle    | ppc64le | 24.52 | added  |         |    31531556 |       4417783 | 2022-05-03 00:30:31 | 2022-05-03 00:30:31
 910511 | sle    | s390x   | 24.52 | added  |         |    31531557 |       4417784 | 2022-05-03 00:30:33 | 2022-05-03 00:30:33
 910512 | sle    | x86_64  | 24.52 | added  |         |    31531558 |       4417785 | 2022-05-03 00:30:35 | 2022-05-03 00:30:35
 910514 | sle    | x86_64  | 1.134 | added  |         |    31531563 |       4417790 | 2022-05-03 00:31:47 | 2022-05-03 00:31:47
 910515 | sle    | aarch64 | 1.134 | added  |         |    31531564 |       4417791 | 2022-05-03 00:31:51 | 2022-05-03 00:31:51
 910516 | sle    | x86_64  | 1.134 | added  |         |    31531565 |       4417792 | 2022-05-03 00:31:54 | 2022-05-03 00:31:54
 910517 | sle    | x86_64  | 1.134 | added  |         |    31531566 |       4417793 | 2022-05-03 00:31:56 | 2022-05-03 00:31:56
 910518 | sle    | x86_64  | 1.134 | added  |         |    31531567 |       4417794 | 2022-05-03 00:31:59 | 2022-05-03 00:31:59
 910520 | SLE    | ppc64le | 139.4 | added  |         |    31531603 |       4417830 | 2022-05-03 00:36:21 | 2022-05-03 00:36:21
 910521 | SLE    | s390x   | 139.4 | added  |         |    31531605 |       4417832 | 2022-05-03 00:36:27 | 2022-05-03 00:36:27
 910522 | SLE    | x86_64  | 139.4 | added  |         |    31531607 |       4417834 | 2022-05-03 00:36:40 | 2022-05-03 00:36:41
 910523 | SLE    | aarch64 | 139.4 | added  |         |    31531608 |       4417835 | 2022-05-03 00:36:50 | 2022-05-03 00:36:50
 910524 | SLE    | ppc64le | 139.4 | added  |         |    31531611 |       4417838 | 2022-05-03 00:37:00 | 2022-05-03 00:37:00
 910525 | SLE    | s390x   | 139.4 | added  |         |    31531615 |       4417842 | 2022-05-03 00:37:13 | 2022-05-03 00:37:13
 910526 | SLE    | x86_64  | 139.4 | added  |         |    31531618 |       4417845 | 2022-05-03 00:37:21 | 2022-05-03 00:37:21
 910527 | SLE    | aarch64 | 139.4 | added  |         |    31531621 |       4417848 | 2022-05-03 00:37:36 | 2022-05-03 00:37:36
 910528 | SLE    | ppc64le | 139.4 | added  |         |    31531623 |       4417850 | 2022-05-03 00:37:44 | 2022-05-03 00:37:44
 910529 | SLE    | s390x   | 139.4 | added  |         |    31531624 |       4417851 | 2022-05-03 00:37:56 | 2022-05-03 00:37:56
 910530 | SLE    | x86_64  | 139.4 | added  |         |    31531629 |       4417856 | 2022-05-03 00:38:09 | 2022-05-03 00:38:09
 910531 | SLE    | aarch64 | 139.4 | added  |         |    31531632 |       4417859 | 2022-05-03 00:38:20 | 2022-05-03 00:38:20
 910532 | SLE    | ppc64le | 139.4 | added  |         |    31531633 |       4417860 | 2022-05-03 00:38:29 | 2022-05-03 00:38:29
 910533 | SLE    | s390x   | 139.4 | added  |         |    31531634 |       4417861 | 2022-05-03 00:38:37 | 2022-05-03 00:38:37
 910534 | SLE    | x86_64  | 139.4 | added  |         |    31531636 |       4417863 | 2022-05-03 00:38:48 | 2022-05-03 00:38:48
 910535 | SLE    | aarch64 | 139.4 | added  |         |    31531637 |       4417864 | 2022-05-03 00:39:08 | 2022-05-03 00:39:08
 910536 | SLE    | ppc64le | 139.4 | added  |         |    31531639 |       4417866 | 2022-05-03 00:39:16 | 2022-05-03 00:39:17
 910537 | SLE    | s390x   | 139.4 | added  |         |    31531641 |       4417868 | 2022-05-03 00:39:25 | 2022-05-03 00:39:25
 910538 | SLE    | x86_64  | 139.4 | added  |         |    31531642 |       4417869 | 2022-05-03 00:39:31 | 2022-05-03 00:39:31
 910539 | SLE    | aarch64 | 139.4 | added  |         |    31531644 |       4417871 | 2022-05-03 00:39:39 | 2022-05-03 00:39:39
 910540 | SLE    | ppc64le | 139.4 | added  |         |    31531646 |       4417873 | 2022-05-03 00:39:46 | 2022-05-03 00:39:47
 910541 | SLE    | s390x   | 139.4 | added  |         |    31531647 |       4417874 | 2022-05-03 00:39:55 | 2022-05-03 00:39:55
 910542 | SLE    | x86_64  | 139.4 | added  |         |    31531649 |       4417876 | 2022-05-03 00:40:16 | 2022-05-03 00:40:16
(31 rows)

The new Minion version is deployed, but so far there are no manual removal messages in the logs.

Actions #18

Updated by kraih almost 2 years ago

I think I found the cause: it's actually the Gru ttl feature, which now maps to the Minion expires feature. Theoretically, that can result in schedule_iso jobs being deleted if they have not been processed within 10 minutes.
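A minimal illustration of that mechanism (placeholder connection string, task body and values; not the actual openQA code): a job enqueued with Minion's expire option is silently removed if no worker dequeues it in time, which would match the disappearing schedule_iso jobs above.

use Mojolicious::Lite -signatures;

# Placeholder database and task body, for illustration only.
plugin Minion => {Pg => 'postgresql://openqa@localhost/openqa'};

app->minion->add_task(schedule_iso => sub ($job, $params) {
    $job->app->log->info("Scheduling ISO for build $params->{BUILD}");
});

my $params = {DISTRI => 'SLE', BUILD => '139.4'};

# A Gru ttl of 600 seconds maps to Minion's "expire" option: if all Gru
# workers are busy for more than 10 minutes, the job expires and is removed,
# leaving only the "added" scheduled_products row behind.
app->minion->enqueue(schedule_iso => [$params], {expire => 600});

# Without "expire" the job stays enqueued until a worker picks it up.
app->minion->enqueue(schedule_iso => [$params]);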

Actions #20

Updated by kraih almost 2 years ago

Waiting for the change to be deployed; then I will check for new cases.

Actions #21

Updated by kraih almost 2 years ago

  • Status changed from In Progress to Feedback
Actions #22

Updated by kraih almost 2 years ago

No new cases since deployment.

Actions #23

Updated by kraih almost 2 years ago

And no new cases again today. I'll take another look next week, but I'm cautiously optimistic that the problem is resolved now.

Actions #24

Updated by okurz almost 2 years ago

Sounds great! Keep in mind that you added #110566 as a subtask of this ticket, so you would need to resolve #110566 first before resolving this one.

Actions #25

Updated by kraih almost 2 years ago

  • Status changed from Feedback to Resolved

No new cases over the weekend; I think we can consider this resolved.

Actions #26

Updated by leli almost 2 years ago

kraih wrote:

No new cases over the weekend; I think we can consider this resolved.

Thanks, we will check it on the next build, which seems to be the GMC candidate. :)
