action #110032
closed
Migration group can not trigger, missing minion jobs? size:M
Added by coolgw over 2 years ago. Updated over 2 years ago.
Description
Motivation
This issue has already happened twice. After checking the openQA log (see the attached screenshots), the log shows that job 4305089 was triggered, but it does not exist in the Minion job list.
Acceptance criteria
- AC1: There are no new leftover entries with status "added" in the scheduled_products table.
Suggestions
- Find a Minion job that's not older than 2 days with the query SELECT id, minion_job_id FROM scheduled_products WHERE status = 'added' AND t_created > NOW() - INTERVAL '2 days'; and look at the metadata.
- Identify a regression around March 18.
Files
trigger2.png (182 KB), coolgw, 2022-04-18 01:56
trigger1.png (216 KB), coolgw, 2022-04-18 01:56
Updated by okurz over 2 years ago
- Subject changed from Migration group can not trigger to Migration group can not trigger, missing minion jobs?
- Category set to Regressions/Crashes
- Target version set to Ready
Updated by kraih over 2 years ago
- Assignee set to kraih
For context, the async ISO POST would result in a schedule_iso Minion job, executing this code. It does not look like the Minion job would delete itself, but the job id is stored in the database, so maybe it gets removed as a side effect of something else.
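Roughly, the flow looks like this; a simplified sketch based on the table columns shown further down in this ticket, not the actual openQA code (the schema helper, resultset name and parameter names are assumptions):

# Simplified sketch of the scheduling flow, not the actual openQA code.
sub schedule_product {
    my ($app, $params) = @_;

    # Create the scheduled_products row in state "added"
    my $product = $app->schema->resultset('ScheduledProducts')->create({
        distri => $params->{DISTRI},
        build  => $params->{BUILD},
        status => 'added',
    });

    # Enqueue the schedule_iso Minion job and remember its id; the job is
    # expected to move the product out of the "added" state once it runs
    my $job_id = $app->minion->enqueue(
        schedule_iso => [{scheduled_product_id => $product->id}]);
    $product->update({minion_job_id => $job_id});

    return $job_id;
}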
Updated by kraih over 2 years ago
The entry from the scheduled_products table is interesting: it's still in the "added" state.
id | distri | build | status | results | gru_task_id | minion_job_id | t_created | t_updated
--------+--------+-------+--------+---------+-------------+---------------+---------------------+---------------------
898954 | SLE | 129.1 | added | | 31418883 | 4305089 | 2022-04-15 04:17:10 | 2022-04-15 04:17:10
Updated by kraih over 2 years ago
On OSD there are actually 62 of these cases, some pretty old, but it seems to have become more common only recently:
openqa=# SELECT id, distri, arch, build, status, results, gru_task_id, minion_job_id, t_created, t_updated FROM scheduled_products WHERE status = 'added' AND t_updated < NOW() - INTERVAL '7 days' ORDER BY t_created ASC;
id | distri | arch | build | status | results | gru_task_id | minion_job_id | t_created | t_updated
--------+--------+---------+-------+--------+---------+-------------+---------------+---------------------+---------------------
459262 | sle | x86_64 | 4.29 | added | | 27353855 | 239470 | 2019-08-06 17:21:59 | 2019-08-06 17:21:59
459263 | sle | x86_64 | 4.29 | added | | 27353856 | 239471 | 2019-08-06 17:22:00 | 2019-08-06 17:22:00
459264 | sle | x86_64 | 4.29 | added | | 27353857 | 239472 | 2019-08-06 17:22:02 | 2019-08-06 17:22:02
884945 | SLE | aarch64 | 113.1 | added | | 31215424 | 4101598 | 2022-03-18 00:45:10 | 2022-03-18 00:45:10
884946 | SLE | ppc64le | 113.1 | added | | 31215425 | 4101599 | 2022-03-18 00:45:12 | 2022-03-18 00:45:12
884947 | SLE | s390x | 113.1 | added | | 31215426 | 4101600 | 2022-03-18 00:45:13 | 2022-03-18 00:45:13
884948 | SLE | x86_64 | 113.1 | added | | 31215427 | 4101601 | 2022-03-18 00:45:14 | 2022-03-18 00:45:14
893953 | SLE | aarch64 | 124.1 | added | | 31351834 | 4238032 | 2022-04-07 22:25:06 | 2022-04-07 22:25:06
893954 | SLE | ppc64le | 124.1 | added | | 31351835 | 4238033 | 2022-04-07 22:25:06 | 2022-04-07 22:25:06
893955 | SLE | s390x | 124.1 | added | | 31351836 | 4238034 | 2022-04-07 22:25:07 | 2022-04-07 22:25:07
893956 | SLE | x86_64 | 124.1 | added | | 31351837 | 4238035 | 2022-04-07 22:25:08 | 2022-04-07 22:25:08
894522 | sle | x86_64 | 1.95 | added | | 31361288 | 4247487 | 2022-04-08 23:49:34 | 2022-04-08 23:49:34
894523 | sle | aarch64 | 1.95 | added | | 31361290 | 4247489 | 2022-04-08 23:49:38 | 2022-04-08 23:49:38
894524 | sle | x86_64 | 1.95 | added | | 31361291 | 4247490 | 2022-04-08 23:49:42 | 2022-04-08 23:49:42
894525 | sle | x86_64 | 1.95 | added | | 31361292 | 4247491 | 2022-04-08 23:49:46 | 2022-04-08 23:49:46
894526 | sle | x86_64 | 1.95 | added | | 31361294 | 4247493 | 2022-04-08 23:49:52 | 2022-04-08 23:49:52
894528 | SLE | ppc64le | 125.1 | added | | 31361366 | 4247565 | 2022-04-08 23:54:45 | 2022-04-08 23:54:45
894529 | SLE | s390x | 125.1 | added | | 31361368 | 4247567 | 2022-04-08 23:54:59 | 2022-04-08 23:54:59
894530 | SLE | x86_64 | 125.1 | added | | 31361372 | 4247571 | 2022-04-08 23:55:12 | 2022-04-08 23:55:13
894531 | SLE | aarch64 | 125.1 | added | | 31361373 | 4247572 | 2022-04-08 23:55:26 | 2022-04-08 23:55:26
894532 | SLE | ppc64le | 125.1 | added | | 31361374 | 4247573 | 2022-04-08 23:55:36 | 2022-04-08 23:55:36
894533 | SLE | s390x | 125.1 | added | | 31361377 | 4247576 | 2022-04-08 23:55:45 | 2022-04-08 23:55:45
894534 | SLE | x86_64 | 125.1 | added | | 31361378 | 4247577 | 2022-04-08 23:55:55 | 2022-04-08 23:55:56
894535 | SLE | aarch64 | 125.1 | added | | 31361382 | 4247581 | 2022-04-08 23:56:07 | 2022-04-08 23:56:07
894536 | SLE | ppc64le | 125.1 | added | | 31361383 | 4247582 | 2022-04-08 23:56:18 | 2022-04-08 23:56:19
894537 | SLE | s390x | 125.1 | added | | 31361384 | 4247583 | 2022-04-08 23:56:30 | 2022-04-08 23:56:30
894538 | SLE | x86_64 | 125.1 | added | | 31361385 | 4247584 | 2022-04-08 23:56:41 | 2022-04-08 23:56:42
894539 | SLE | aarch64 | 125.1 | added | | 31361388 | 4247587 | 2022-04-08 23:56:55 | 2022-04-08 23:56:55
894540 | SLE | ppc64le | 125.1 | added | | 31361390 | 4247589 | 2022-04-08 23:57:08 | 2022-04-08 23:57:08
894541 | SLE | s390x | 125.1 | added | | 31361392 | 4247591 | 2022-04-08 23:57:20 | 2022-04-08 23:57:21
894542 | SLE | x86_64 | 125.1 | added | | 31361393 | 4247592 | 2022-04-08 23:57:32 | 2022-04-08 23:57:32
894543 | SLE | aarch64 | 125.1 | added | | 31361395 | 4247594 | 2022-04-08 23:57:41 | 2022-04-08 23:57:41
894545 | SLE | s390x | 125.1 | added | | 31361399 | 4247598 | 2022-04-08 23:57:57 | 2022-04-08 23:57:57
...
(62 rows)
And all cases are clustered together with similar ones within a timespan of a few seconds.
Updated by kraih over 2 years ago
To get further i'll need to catch a case that's no older than 2 days, because that's the timespan we keep Minion job data in storage. Unfortunately there is no such case at the moment.
openqa=# SELECT id, minion_job_id FROM scheduled_products WHERE status = 'added' AND t_created > NOW() - INTERVAL '2days';
id | minion_job_id
----+---------------
(0 rows)
Until then i can only look for regressions in commits around March 18, or possibly a logic error somewhere in the code that results in the deletion of Minion jobs.
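A quick way to catch such orphans while the data is still fresh would be to cross-check the scheduled_products rows against the Minion backend, e.g. with a small one-off script (a sketch; the schema helper and resultset name are assumptions, the Minion calls are standard API):

# Sketch: list scheduled products in state "added" whose Minion job is gone.
sub find_orphaned_products {
    my $app      = shift;    # the openQA web application
    my $minion   = $app->minion;
    my $products = $app->schema->resultset('ScheduledProducts')
        ->search({status => 'added'});

    my @orphans;
    while (my $product = $products->next) {
        # $minion->job($id) returns undef once the Minion job no longer exists
        push @orphans, $product->id
            unless defined $minion->job($product->minion_job_id);
    }
    return \@orphans;
}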
Updated by openqa_review over 2 years ago
- Due date set to 2022-05-10
Setting due date based on mean cycle time of SUSE QE Tools
Updated by kraih over 2 years ago
Today i found a case that was newer than 2 days, and the Minion jobs were definitely missing before the automatic cleanup could trigger:
openqa=# SELECT id, distri, arch, build, status, results, gru_task_id, minion_job_id, t_created, t_updated FROM scheduled_products WHERE status = 'added' AND t_updated > NOW() - INTERVAL '2 days' ORDER BY t_created ASC;
id | distri | arch | build | status | results | gru_task_id | minion_job_id | t_created | t_updated
--------+--------+---------+-------+--------+---------+-------------+---------------+---------------------+---------------------
907971 | SLE | aarch64 | 137.1 | added | | 31490948 | 4377167 | 2022-04-26 07:07:08 | 2022-04-26 07:07:08
907972 | SLE | ppc64le | 137.1 | added | | 31490949 | 4377168 | 2022-04-26 07:07:10 | 2022-04-26 07:07:11
907973 | SLE | s390x | 137.1 | added | | 31490950 | 4377169 | 2022-04-26 07:07:13 | 2022-04-26 07:07:13
907974 | SLE | x86_64 | 137.1 | added | | 31490951 | 4377170 | 2022-04-26 07:07:14 | 2022-04-26 07:07:14
(4 rows)
openqa=# SELECT * FROM minion_jobs WHERE id = 4377167;
id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)
openqa=# SELECT * FROM minion_jobs WHERE id = 4377168;
id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)
openqa=# SELECT * FROM minion_jobs WHERE id = 4377169;
id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)
openqa=# SELECT * FROM minion_jobs WHERE id = 4377170;
id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)
In the gru logs i could only find the jobs around these four (not executed in the listed order, see timestamps):
[2022-04-26T09:15:38.563826+02:00] [debug] Process 2071 is performing job "4377161" with task "schedule_iso"
[2022-04-26T09:15:38.669395+02:00] [debug] Process 2074 is performing job "4377162" with task "schedule_iso"
[2022-04-26T09:15:58.444404+02:00] [debug] Process 2273 is performing job "4377163" with task "schedule_iso"
[2022-04-26T09:31:01.834314+02:00] [debug] Process 19127 is performing job "4377164" with task "finalize_job_results"
[2022-04-26T09:16:08.571776+02:00] [debug] Process 2515 is performing job "4377165" with task "schedule_iso"
[2022-04-26T09:16:31.049261+02:00] [debug] Process 2827 is performing job "4377166" with task "schedule_iso"
...
[2022-04-26T09:31:01.912941+02:00] [debug] Process 19130 is performing job "4377171" with task "finalize_job_results"
[2022-04-26T09:31:01.950323+02:00] [debug] Process 19133 is performing job "4377172" with task "finalize_job_results"
Updated by mkittler over 2 years ago
- Related to action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule added
Updated by mkittler over 2 years ago
I've added #108983 as a related ticket because in theory such jobs might have been cleaned up by an admin if they ended up with "Job terminated unexpectedly".
Updated by kraih over 2 years ago
mkittler wrote:
I've added #108983 as a related ticket because in theory such jobs might have been cleaned up by an admin if they ended up with "Job terminated unexpectedly".
Manual deletion by a user is my top suspicion right now too. As a "solution" i'll probably add logging of job deletions, so there's at least a trace for auditing. The jobs could not have failed though, or there would be a log message like Process 1234 is performing job "4377167" with task "schedule_iso".
Updated by kraih over 2 years ago
Asked around if someone is running an automated script to delete jobs that might be misbehaving, but found nobody so far.
Updated by kraih over 2 years ago
Added the log message upstream in Minion. https://github.com/mojolicious/minion/commit/69c38bcf1e87d8ff604197a03195827ca67f8179
Updated by okurz over 2 years ago
- Subject changed from Migration group can not trigger, missing minion jobs? to Migration group can not trigger, missing minion jobs? size:M
Updated by kraih over 2 years ago
Today we've had a lot of cases:
openqa=# SELECT id, distri, arch, build, status, results, gru_task_id, minion_job_id, t_created, t_updated FROM scheduled_products WHERE status = 'added' AND t_updated > NOW() - INTERVAL '2 days' ORDER BY t_created ASC;
id | distri | arch | build | status | results | gru_task_id | minion_job_id | t_created | t_updated
--------+--------+---------+-------+--------+---------+-------------+---------------+---------------------+---------------------
910510 | sle | ppc64le | 24.52 | added | | 31531556 | 4417783 | 2022-05-03 00:30:31 | 2022-05-03 00:30:31
910511 | sle | s390x | 24.52 | added | | 31531557 | 4417784 | 2022-05-03 00:30:33 | 2022-05-03 00:30:33
910512 | sle | x86_64 | 24.52 | added | | 31531558 | 4417785 | 2022-05-03 00:30:35 | 2022-05-03 00:30:35
910514 | sle | x86_64 | 1.134 | added | | 31531563 | 4417790 | 2022-05-03 00:31:47 | 2022-05-03 00:31:47
910515 | sle | aarch64 | 1.134 | added | | 31531564 | 4417791 | 2022-05-03 00:31:51 | 2022-05-03 00:31:51
910516 | sle | x86_64 | 1.134 | added | | 31531565 | 4417792 | 2022-05-03 00:31:54 | 2022-05-03 00:31:54
910517 | sle | x86_64 | 1.134 | added | | 31531566 | 4417793 | 2022-05-03 00:31:56 | 2022-05-03 00:31:56
910518 | sle | x86_64 | 1.134 | added | | 31531567 | 4417794 | 2022-05-03 00:31:59 | 2022-05-03 00:31:59
910520 | SLE | ppc64le | 139.4 | added | | 31531603 | 4417830 | 2022-05-03 00:36:21 | 2022-05-03 00:36:21
910521 | SLE | s390x | 139.4 | added | | 31531605 | 4417832 | 2022-05-03 00:36:27 | 2022-05-03 00:36:27
910522 | SLE | x86_64 | 139.4 | added | | 31531607 | 4417834 | 2022-05-03 00:36:40 | 2022-05-03 00:36:41
910523 | SLE | aarch64 | 139.4 | added | | 31531608 | 4417835 | 2022-05-03 00:36:50 | 2022-05-03 00:36:50
910524 | SLE | ppc64le | 139.4 | added | | 31531611 | 4417838 | 2022-05-03 00:37:00 | 2022-05-03 00:37:00
910525 | SLE | s390x | 139.4 | added | | 31531615 | 4417842 | 2022-05-03 00:37:13 | 2022-05-03 00:37:13
910526 | SLE | x86_64 | 139.4 | added | | 31531618 | 4417845 | 2022-05-03 00:37:21 | 2022-05-03 00:37:21
910527 | SLE | aarch64 | 139.4 | added | | 31531621 | 4417848 | 2022-05-03 00:37:36 | 2022-05-03 00:37:36
910528 | SLE | ppc64le | 139.4 | added | | 31531623 | 4417850 | 2022-05-03 00:37:44 | 2022-05-03 00:37:44
910529 | SLE | s390x | 139.4 | added | | 31531624 | 4417851 | 2022-05-03 00:37:56 | 2022-05-03 00:37:56
910530 | SLE | x86_64 | 139.4 | added | | 31531629 | 4417856 | 2022-05-03 00:38:09 | 2022-05-03 00:38:09
910531 | SLE | aarch64 | 139.4 | added | | 31531632 | 4417859 | 2022-05-03 00:38:20 | 2022-05-03 00:38:20
910532 | SLE | ppc64le | 139.4 | added | | 31531633 | 4417860 | 2022-05-03 00:38:29 | 2022-05-03 00:38:29
910533 | SLE | s390x | 139.4 | added | | 31531634 | 4417861 | 2022-05-03 00:38:37 | 2022-05-03 00:38:37
910534 | SLE | x86_64 | 139.4 | added | | 31531636 | 4417863 | 2022-05-03 00:38:48 | 2022-05-03 00:38:48
910535 | SLE | aarch64 | 139.4 | added | | 31531637 | 4417864 | 2022-05-03 00:39:08 | 2022-05-03 00:39:08
910536 | SLE | ppc64le | 139.4 | added | | 31531639 | 4417866 | 2022-05-03 00:39:16 | 2022-05-03 00:39:17
910537 | SLE | s390x | 139.4 | added | | 31531641 | 4417868 | 2022-05-03 00:39:25 | 2022-05-03 00:39:25
910538 | SLE | x86_64 | 139.4 | added | | 31531642 | 4417869 | 2022-05-03 00:39:31 | 2022-05-03 00:39:31
910539 | SLE | aarch64 | 139.4 | added | | 31531644 | 4417871 | 2022-05-03 00:39:39 | 2022-05-03 00:39:39
910540 | SLE | ppc64le | 139.4 | added | | 31531646 | 4417873 | 2022-05-03 00:39:46 | 2022-05-03 00:39:47
910541 | SLE | s390x | 139.4 | added | | 31531647 | 4417874 | 2022-05-03 00:39:55 | 2022-05-03 00:39:55
910542 | SLE | x86_64 | 139.4 | added | | 31531649 | 4417876 | 2022-05-03 00:40:16 | 2022-05-03 00:40:16
(31 rows)
The new Minion version is deployed, but so far no manual removal messages in the logs.
Updated by kraih over 2 years ago
I think i found the cause: it's actually the gru ttl feature, which now maps to the Minion expires feature. Theoretically that can result in schedule_iso jobs being deleted if they have not been processed within 10 minutes.
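In other words, with a ttl of 10 minutes the enqueue ends up roughly like this (a sketch with an assumed value standing in for the gru ttl; the expire option itself is standard Minion):

# Sketch of the suspected mechanism: a gru ttl mapped to Minion's expire option.
my $job_id = $app->minion->enqueue(
    schedule_iso => [\%scheduling_params] => {expire => 600});

# If no Minion worker picks the job up within those 600 seconds, the expired
# job is removed from the minion_jobs table entirely. The scheduled_products
# row keeps its "added" status and minion_job_id, but $app->minion->job($job_id)
# now returns undef, which is exactly the pattern observed above.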
Updated by kraih over 2 years ago
Possible fix: https://github.com/os-autoinst/openQA/pull/4637
Updated by kraih over 2 years ago
Waiting for the change to be deployed, then will check for new cases.
Updated by kraih over 2 years ago
And no new cases again today. I'll take another look next week, but i'm cautiously optimistic that the problem is resolved now.
Updated by kraih over 2 years ago
- Status changed from Feedback to Resolved
No new cases over the weekend, i think we can consider this resolved.
Updated by leli over 2 years ago
kraih wrote:
No new cases over the weekend, i think we can consider this resolved.
Thanks, we will check it with the next build, which seems to be the GMC candidate. :)