action #110032

Migration group can not trigger, missing minion jobs? size:M

Added by coolgw 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2022-05-03
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Difficulty:

Description

Motivation

This issue has already happened twice. After checking the openQA logs (please check the attachments), the log shows that Minion job 4305089 was triggered, but it does not exist in the Minion job list.

Acceptance criteria

  • AC1: There are no new leftover entries with status added in the scheduled_products table.

Suggestions

  • Find a Minion job that's not older than 2 days with the query SELECT id, minion_job_id FROM scheduled_products WHERE status = 'added' AND t_created > NOW() - INTERVAL '2days'; and look at the metadata.
  • Identify a regression around March 18.
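The query from the first suggestion can be exercised against a minimal schema. This is an illustrative sketch only: production runs PostgreSQL, but SQLite is close enough to demonstrate the semantics of filtering for leftover `added` rows inside the 2-day Minion retention window (all table contents here are made up for the demo):

```python
import sqlite3
from datetime import datetime, timedelta

# In-memory stand-in for the scheduled_products table (PostgreSQL in production).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE scheduled_products (
    id INTEGER PRIMARY KEY, minion_job_id INTEGER,
    status TEXT, t_created TEXT)""")

now = datetime(2022, 4, 27, 12, 0, 0)
rows = [
    (898954, 4305089, "added", now - timedelta(days=1)),    # recent leftover
    (459262, 239470,  "added", now - timedelta(days=900)),  # ancient leftover
    (907970, 4377160, "scheduled", now - timedelta(hours=2)),
]
db.executemany("INSERT INTO scheduled_products VALUES (?, ?, ?, ?)",
               [(i, m, s, t.isoformat(" ")) for i, m, s, t in rows])

# Equivalent of: WHERE status = 'added' AND t_created > NOW() - INTERVAL '2 days'
cutoff = (now - timedelta(days=2)).isoformat(" ")
leftovers = db.execute(
    "SELECT id, minion_job_id FROM scheduled_products"
    " WHERE status = 'added' AND t_created > ?", (cutoff,)).fetchall()
print(leftovers)  # only the fresh leftover qualifies
```

Only rows young enough that the corresponding Minion job data should still exist are returned, which is what makes their metadata inspectable.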
trigger2.png (182 KB) coolgw, 2022-04-18 01:56
trigger1.png (216 KB) coolgw, 2022-04-18 01:56

Subtasks

action #110566: [tools] Problem with ISOS Post command from OBS sync (Resolved, kraih)


Related issues

Related to openQA Project - action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule (New, 2022-03-25)

History

#1 Updated by okurz 2 months ago

  • Subject changed from Migration group can not trigger to Migration group can not trigger, missing minion jobs?
  • Category set to Concrete Bugs
  • Target version set to Ready

#2 Updated by kraih 2 months ago

  • Assignee set to kraih

For context, the async ISO POST results in a schedule_iso Minion job executing this code. It does not look like the Minion job would delete itself, but the job id is stored in the database, so maybe it gets removed as a side effect of something else.
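Roughly, the flow described here is: enqueue a schedule_iso Minion job and persist its id on the scheduled product row. A minimal sketch of that pattern (Python stand-in for the actual Perl/Minion code; all names and structures are illustrative):

```python
import itertools

# Toy stand-ins for the Minion queue and the scheduled_products table;
# the real code is Perl (Mojolicious/Minion), this only mirrors the flow.
job_ids = itertools.count(4305089)
minion_jobs = {}          # job id -> job data
scheduled_products = {}   # product id -> row

def enqueue(task, args):
    """Enqueue a Minion job and return its id."""
    job_id = next(job_ids)
    minion_jobs[job_id] = {"task": task, "args": args, "state": "inactive"}
    return job_id

def schedule_iso_async(product_id, args):
    """Async ISO POST: create the product row, enqueue the job, store its id."""
    job_id = enqueue("schedule_iso", args)
    scheduled_products[product_id] = {"status": "added", "minion_job_id": job_id}
    return job_id

jid = schedule_iso_async(898954, {"DISTRI": "SLE", "BUILD": "129.1"})
# If the job later vanishes from the queue without ever running, the product
# row is left behind in the "added" state -- exactly the observed symptom.
del minion_jobs[jid]
print(scheduled_products[898954]["status"])  # still "added"
```

This is why an externally removed job leaves a dangling minion_job_id behind: nothing ever updates the product row once the job is gone.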

#3 Updated by kraih 2 months ago

The entry from the scheduled_products table is interesting: it's still in the added state.

   id   | distri | build | status | results | gru_task_id | minion_job_id |      t_created      |      t_updated
--------+--------+-------+--------+---------+-------------+---------------+---------------------+---------------------
 898954 | SLE    | 129.1 | added  |         |    31418883 |       4305089 | 2022-04-15 04:17:10 | 2022-04-15 04:17:10

#4 Updated by kraih 2 months ago

  • Status changed from New to In Progress

#5 Updated by kraih 2 months ago

On OSD there are actually 62 of these cases, some pretty old, but it seems to have become more common only recently:

openqa=# SELECT id, distri, arch, build, status, results, gru_task_id, minion_job_id, t_created, t_updated FROM scheduled_products WHERE status = 'added' AND t_updated < NOW() - INTERVAL '7 days' ORDER BY t_created ASC;
   id   | distri |  arch   | build | status | results | gru_task_id | minion_job_id |      t_created      |      t_updated
--------+--------+---------+-------+--------+---------+-------------+---------------+---------------------+---------------------
 459262 | sle    | x86_64  | 4.29  | added  |         |    27353855 |        239470 | 2019-08-06 17:21:59 | 2019-08-06 17:21:59
 459263 | sle    | x86_64  | 4.29  | added  |         |    27353856 |        239471 | 2019-08-06 17:22:00 | 2019-08-06 17:22:00
 459264 | sle    | x86_64  | 4.29  | added  |         |    27353857 |        239472 | 2019-08-06 17:22:02 | 2019-08-06 17:22:02
 884945 | SLE    | aarch64 | 113.1 | added  |         |    31215424 |       4101598 | 2022-03-18 00:45:10 | 2022-03-18 00:45:10
 884946 | SLE    | ppc64le | 113.1 | added  |         |    31215425 |       4101599 | 2022-03-18 00:45:12 | 2022-03-18 00:45:12
 884947 | SLE    | s390x   | 113.1 | added  |         |    31215426 |       4101600 | 2022-03-18 00:45:13 | 2022-03-18 00:45:13
 884948 | SLE    | x86_64  | 113.1 | added  |         |    31215427 |       4101601 | 2022-03-18 00:45:14 | 2022-03-18 00:45:14
 893953 | SLE    | aarch64 | 124.1 | added  |         |    31351834 |       4238032 | 2022-04-07 22:25:06 | 2022-04-07 22:25:06
 893954 | SLE    | ppc64le | 124.1 | added  |         |    31351835 |       4238033 | 2022-04-07 22:25:06 | 2022-04-07 22:25:06
 893955 | SLE    | s390x   | 124.1 | added  |         |    31351836 |       4238034 | 2022-04-07 22:25:07 | 2022-04-07 22:25:07
 893956 | SLE    | x86_64  | 124.1 | added  |         |    31351837 |       4238035 | 2022-04-07 22:25:08 | 2022-04-07 22:25:08
 894522 | sle    | x86_64  | 1.95  | added  |         |    31361288 |       4247487 | 2022-04-08 23:49:34 | 2022-04-08 23:49:34
 894523 | sle    | aarch64 | 1.95  | added  |         |    31361290 |       4247489 | 2022-04-08 23:49:38 | 2022-04-08 23:49:38
 894524 | sle    | x86_64  | 1.95  | added  |         |    31361291 |       4247490 | 2022-04-08 23:49:42 | 2022-04-08 23:49:42
 894525 | sle    | x86_64  | 1.95  | added  |         |    31361292 |       4247491 | 2022-04-08 23:49:46 | 2022-04-08 23:49:46
 894526 | sle    | x86_64  | 1.95  | added  |         |    31361294 |       4247493 | 2022-04-08 23:49:52 | 2022-04-08 23:49:52
 894528 | SLE    | ppc64le | 125.1 | added  |         |    31361366 |       4247565 | 2022-04-08 23:54:45 | 2022-04-08 23:54:45
 894529 | SLE    | s390x   | 125.1 | added  |         |    31361368 |       4247567 | 2022-04-08 23:54:59 | 2022-04-08 23:54:59
 894530 | SLE    | x86_64  | 125.1 | added  |         |    31361372 |       4247571 | 2022-04-08 23:55:12 | 2022-04-08 23:55:13
 894531 | SLE    | aarch64 | 125.1 | added  |         |    31361373 |       4247572 | 2022-04-08 23:55:26 | 2022-04-08 23:55:26
 894532 | SLE    | ppc64le | 125.1 | added  |         |    31361374 |       4247573 | 2022-04-08 23:55:36 | 2022-04-08 23:55:36
 894533 | SLE    | s390x   | 125.1 | added  |         |    31361377 |       4247576 | 2022-04-08 23:55:45 | 2022-04-08 23:55:45
 894534 | SLE    | x86_64  | 125.1 | added  |         |    31361378 |       4247577 | 2022-04-08 23:55:55 | 2022-04-08 23:55:56
 894535 | SLE    | aarch64 | 125.1 | added  |         |    31361382 |       4247581 | 2022-04-08 23:56:07 | 2022-04-08 23:56:07
 894536 | SLE    | ppc64le | 125.1 | added  |         |    31361383 |       4247582 | 2022-04-08 23:56:18 | 2022-04-08 23:56:19
 894537 | SLE    | s390x   | 125.1 | added  |         |    31361384 |       4247583 | 2022-04-08 23:56:30 | 2022-04-08 23:56:30
 894538 | SLE    | x86_64  | 125.1 | added  |         |    31361385 |       4247584 | 2022-04-08 23:56:41 | 2022-04-08 23:56:42
 894539 | SLE    | aarch64 | 125.1 | added  |         |    31361388 |       4247587 | 2022-04-08 23:56:55 | 2022-04-08 23:56:55
 894540 | SLE    | ppc64le | 125.1 | added  |         |    31361390 |       4247589 | 2022-04-08 23:57:08 | 2022-04-08 23:57:08
 894541 | SLE    | s390x   | 125.1 | added  |         |    31361392 |       4247591 | 2022-04-08 23:57:20 | 2022-04-08 23:57:21
 894542 | SLE    | x86_64  | 125.1 | added  |         |    31361393 |       4247592 | 2022-04-08 23:57:32 | 2022-04-08 23:57:32
 894543 | SLE    | aarch64 | 125.1 | added  |         |    31361395 |       4247594 | 2022-04-08 23:57:41 | 2022-04-08 23:57:41
 894545 | SLE    | s390x   | 125.1 | added  |         |    31361399 |       4247598 | 2022-04-08 23:57:57 | 2022-04-08 23:57:57
...
(62 rows)

And all cases are clustered with similar ones within a timespan of a few seconds.

#6 Updated by kraih 2 months ago

To get further I'll need to catch a case that's no older than 2 days, because that's how long we keep Minion job data in storage. Unfortunately there is no such case at the moment.

openqa=# SELECT id, minion_job_id FROM scheduled_products WHERE status = 'added' AND t_created > NOW() - INTERVAL '2days';
 id | minion_job_id
----+---------------
(0 rows)

Until then I can only look for regressions in commits around March 18, or possibly a logic error somewhere in the code that results in the deletion of Minion jobs.

#7 Updated by kraih 2 months ago

  • Description updated (diff)

#8 Updated by kraih 2 months ago

  • Description updated (diff)

#9 Updated by openqa_review 2 months ago

  • Due date set to 2022-05-10

Setting due date based on mean cycle time of SUSE QE Tools

#10 Updated by kraih 2 months ago

Today I found a case that was newer than 2 days, and the Minion jobs were definitely missing before the automatic cleanup could have triggered:

openqa=# SELECT id, distri, arch, build, status, results, gru_task_id, minion_job_id, t_created, t_updated FROM scheduled_products WHERE status = 'added' AND t_updated > NOW() - INTERVAL '2 days' ORDER BY t_created ASC;
   id   | distri |  arch   | build | status | results | gru_task_id | minion_job_id |      t_created      |      t_updated
--------+--------+---------+-------+--------+---------+-------------+---------------+---------------------+---------------------
 907971 | SLE    | aarch64 | 137.1 | added  |         |    31490948 |       4377167 | 2022-04-26 07:07:08 | 2022-04-26 07:07:08
 907972 | SLE    | ppc64le | 137.1 | added  |         |    31490949 |       4377168 | 2022-04-26 07:07:10 | 2022-04-26 07:07:11
 907973 | SLE    | s390x   | 137.1 | added  |         |    31490950 |       4377169 | 2022-04-26 07:07:13 | 2022-04-26 07:07:13
 907974 | SLE    | x86_64  | 137.1 | added  |         |    31490951 |       4377170 | 2022-04-26 07:07:14 | 2022-04-26 07:07:14
(4 rows)

openqa=# SELECT * FROM minion_jobs WHERE id = 4377167;
 id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)

openqa=# SELECT * FROM minion_jobs WHERE id = 4377168;
 id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)

openqa=# SELECT * FROM minion_jobs WHERE id = 4377169;
 id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)

openqa=# SELECT * FROM minion_jobs WHERE id = 4377170;
 id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)

In the gru logs I could only find the jobs around these four (not executed in the listed order; see timestamps):

[2022-04-26T09:15:38.563826+02:00] [debug] Process 2071 is performing job "4377161" with task "schedule_iso"
[2022-04-26T09:15:38.669395+02:00] [debug] Process 2074 is performing job "4377162" with task "schedule_iso"
[2022-04-26T09:15:58.444404+02:00] [debug] Process 2273 is performing job "4377163" with task "schedule_iso"
[2022-04-26T09:31:01.834314+02:00] [debug] Process 19127 is performing job "4377164" with task "finalize_job_results"
[2022-04-26T09:16:08.571776+02:00] [debug] Process 2515 is performing job "4377165" with task "schedule_iso"
[2022-04-26T09:16:31.049261+02:00] [debug] Process 2827 is performing job "4377166" with task "schedule_iso"
...
[2022-04-26T09:31:01.912941+02:00] [debug] Process 19130 is performing job "4377171" with task "finalize_job_results"
[2022-04-26T09:31:01.950323+02:00] [debug] Process 19133 is performing job "4377172" with task "finalize_job_results"

#11 Updated by mkittler 2 months ago

  • Related to action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule added

#12 Updated by mkittler 2 months ago

I've added #108983 as a related ticket because, in theory, such jobs might have been cleaned up by an admin if they ended up as "Job terminated unexpectedly".

#13 Updated by kraih 2 months ago

mkittler wrote:

I've added #108983 as a related ticket because, in theory, such jobs might have been cleaned up by an admin if they ended up as "Job terminated unexpectedly".

Manual deletion by a user is my top suspicion right now too. As a "solution" I'll probably add logging of job deletions, so there's at least a trace for auditing. The jobs could not have failed, though, or there would be a log message like Process 1234 is performing job "4377167" with task "schedule_iso".
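The auditing idea mentioned here could look roughly like the sketch below (illustrative Python; the real change would live in the Minion/openQA Perl code, and the function and message format are assumptions, not the actual implementation):

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("minion")

# Toy job store; in reality this is the minion_jobs table.
jobs = {4377167: {"task": "schedule_iso", "state": "inactive"}}
deletions = []  # collected here only so the sketch is self-checking

def remove_job(job_id, reason="api"):
    """Delete a job, leaving an audit trail of which job was removed and why."""
    job = jobs.pop(job_id, None)
    if job is None:
        return False
    log.info('Job "%s" with task "%s" removed (%s)', job_id, job["task"], reason)
    deletions.append((job_id, reason))
    return True

removed = remove_job(4377167, reason="manual")
```

With such a log line in place, a vanished schedule_iso job would at least leave a trace distinguishing manual removal from any other cause.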

#14 Updated by kraih 2 months ago

I asked around to see whether someone is running an automated script to delete jobs that might be misbehaving, but have found nobody so far.

#16 Updated by okurz about 2 months ago

  • Subject changed from Migration group can not trigger, missing minion jobs? to Migration group can not trigger, missing minion jobs? size:M

#17 Updated by kraih about 2 months ago

Today we've had a lot of cases:

openqa=# SELECT id, distri, arch, build, status, results, gru_task_id, minion_job_id, t_created, t_updated FROM scheduled_products WHERE status = 'added' AND t_updated > NOW() - INTERVAL '2 days' ORDER BY t_created ASC;
   id   | distri |  arch   | build | status | results | gru_task_id | minion_job_id |      t_created      |      t_updated
--------+--------+---------+-------+--------+---------+-------------+---------------+---------------------+---------------------
 910510 | sle    | ppc64le | 24.52 | added  |         |    31531556 |       4417783 | 2022-05-03 00:30:31 | 2022-05-03 00:30:31
 910511 | sle    | s390x   | 24.52 | added  |         |    31531557 |       4417784 | 2022-05-03 00:30:33 | 2022-05-03 00:30:33
 910512 | sle    | x86_64  | 24.52 | added  |         |    31531558 |       4417785 | 2022-05-03 00:30:35 | 2022-05-03 00:30:35
 910514 | sle    | x86_64  | 1.134 | added  |         |    31531563 |       4417790 | 2022-05-03 00:31:47 | 2022-05-03 00:31:47
 910515 | sle    | aarch64 | 1.134 | added  |         |    31531564 |       4417791 | 2022-05-03 00:31:51 | 2022-05-03 00:31:51
 910516 | sle    | x86_64  | 1.134 | added  |         |    31531565 |       4417792 | 2022-05-03 00:31:54 | 2022-05-03 00:31:54
 910517 | sle    | x86_64  | 1.134 | added  |         |    31531566 |       4417793 | 2022-05-03 00:31:56 | 2022-05-03 00:31:56
 910518 | sle    | x86_64  | 1.134 | added  |         |    31531567 |       4417794 | 2022-05-03 00:31:59 | 2022-05-03 00:31:59
 910520 | SLE    | ppc64le | 139.4 | added  |         |    31531603 |       4417830 | 2022-05-03 00:36:21 | 2022-05-03 00:36:21
 910521 | SLE    | s390x   | 139.4 | added  |         |    31531605 |       4417832 | 2022-05-03 00:36:27 | 2022-05-03 00:36:27
 910522 | SLE    | x86_64  | 139.4 | added  |         |    31531607 |       4417834 | 2022-05-03 00:36:40 | 2022-05-03 00:36:41
 910523 | SLE    | aarch64 | 139.4 | added  |         |    31531608 |       4417835 | 2022-05-03 00:36:50 | 2022-05-03 00:36:50
 910524 | SLE    | ppc64le | 139.4 | added  |         |    31531611 |       4417838 | 2022-05-03 00:37:00 | 2022-05-03 00:37:00
 910525 | SLE    | s390x   | 139.4 | added  |         |    31531615 |       4417842 | 2022-05-03 00:37:13 | 2022-05-03 00:37:13
 910526 | SLE    | x86_64  | 139.4 | added  |         |    31531618 |       4417845 | 2022-05-03 00:37:21 | 2022-05-03 00:37:21
 910527 | SLE    | aarch64 | 139.4 | added  |         |    31531621 |       4417848 | 2022-05-03 00:37:36 | 2022-05-03 00:37:36
 910528 | SLE    | ppc64le | 139.4 | added  |         |    31531623 |       4417850 | 2022-05-03 00:37:44 | 2022-05-03 00:37:44
 910529 | SLE    | s390x   | 139.4 | added  |         |    31531624 |       4417851 | 2022-05-03 00:37:56 | 2022-05-03 00:37:56
 910530 | SLE    | x86_64  | 139.4 | added  |         |    31531629 |       4417856 | 2022-05-03 00:38:09 | 2022-05-03 00:38:09
 910531 | SLE    | aarch64 | 139.4 | added  |         |    31531632 |       4417859 | 2022-05-03 00:38:20 | 2022-05-03 00:38:20
 910532 | SLE    | ppc64le | 139.4 | added  |         |    31531633 |       4417860 | 2022-05-03 00:38:29 | 2022-05-03 00:38:29
 910533 | SLE    | s390x   | 139.4 | added  |         |    31531634 |       4417861 | 2022-05-03 00:38:37 | 2022-05-03 00:38:37
 910534 | SLE    | x86_64  | 139.4 | added  |         |    31531636 |       4417863 | 2022-05-03 00:38:48 | 2022-05-03 00:38:48
 910535 | SLE    | aarch64 | 139.4 | added  |         |    31531637 |       4417864 | 2022-05-03 00:39:08 | 2022-05-03 00:39:08
 910536 | SLE    | ppc64le | 139.4 | added  |         |    31531639 |       4417866 | 2022-05-03 00:39:16 | 2022-05-03 00:39:17
 910537 | SLE    | s390x   | 139.4 | added  |         |    31531641 |       4417868 | 2022-05-03 00:39:25 | 2022-05-03 00:39:25
 910538 | SLE    | x86_64  | 139.4 | added  |         |    31531642 |       4417869 | 2022-05-03 00:39:31 | 2022-05-03 00:39:31
 910539 | SLE    | aarch64 | 139.4 | added  |         |    31531644 |       4417871 | 2022-05-03 00:39:39 | 2022-05-03 00:39:39
 910540 | SLE    | ppc64le | 139.4 | added  |         |    31531646 |       4417873 | 2022-05-03 00:39:46 | 2022-05-03 00:39:47
 910541 | SLE    | s390x   | 139.4 | added  |         |    31531647 |       4417874 | 2022-05-03 00:39:55 | 2022-05-03 00:39:55
 910542 | SLE    | x86_64  | 139.4 | added  |         |    31531649 |       4417876 | 2022-05-03 00:40:16 | 2022-05-03 00:40:16
(31 rows)

The new Minion version is deployed, but so far there are no manual removal messages in the logs.

#18 Updated by kraih about 2 months ago

I think I found the cause: it's actually the gru ttl feature, which now maps to the Minion expires feature. Theoretically that can result in schedule_iso jobs being deleted if they have not been processed within 10 minutes.
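The suspected mechanism, as I read the description above: with the gru ttl mapped to Minion's expires option, an unprocessed job past its expiry is removed by cleanup before any worker picks it up, so the scheduled_products row never leaves the added state. A minimal simulation (Python, illustrative names; the real implementation is Perl/Minion and may differ in detail):

```python
from datetime import datetime, timedelta

TTL = timedelta(minutes=10)  # the 10-minute ttl mentioned above

def enqueue(jobs, job_id, task, now):
    # The ttl becomes an absolute expiry time, like Minion's "expires" option.
    jobs[job_id] = {"task": task, "state": "inactive", "expires": now + TTL}

def cleanup(jobs, now):
    """Delete expired jobs that were never started."""
    expired = [jid for jid, job in jobs.items()
               if job["state"] == "inactive" and job["expires"] <= now]
    for jid in expired:
        del jobs[jid]

jobs = {}
t0 = datetime(2022, 5, 3, 0, 30, 31)
enqueue(jobs, 4417783, "schedule_iso", t0)

# Workers are busy for 15 minutes; the job is still inactive when cleanup
# runs after the 10-minute ttl has elapsed, so it is silently deleted.
cleanup(jobs, t0 + timedelta(minutes=15))
print(4417783 in jobs)  # False: the job is gone without ever running
```

This matches the observed pattern: no "performing job" log line ever appears for the missing ids, because the jobs expired before any worker dequeued them.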

#20 Updated by kraih about 2 months ago

Waiting for the change to be deployed; then I will check for new cases.

#21 Updated by kraih about 2 months ago

  • Status changed from In Progress to Feedback

#22 Updated by kraih about 2 months ago

No new cases since deployment.

#23 Updated by kraih about 2 months ago

And no new cases again today. I'll take another look next week, but I'm cautiously optimistic that the problem is resolved now.

#24 Updated by okurz about 2 months ago

Sounds great! Keep in mind that you added #110566 as a subtask of this ticket, so you would need to resolve #110566 first before resolving this one.

#25 Updated by kraih about 2 months ago

  • Status changed from Feedback to Resolved

No new cases over the weekend; I think we can consider this resolved.

#26 Updated by leli about 2 months ago

kraih wrote:

No new cases over the weekend; I think we can consider this resolved.

Thanks, we will check it on the next build, which seems to be the GMC candidate. :)
