action #111152

closed

Investigation jobs run because of the lack of automatic takeover size:M

Added by tinita over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date: 2022-06-03
% Done: 0%
Estimated time:

Description

Observation

Several investigation jobs were scheduled for a test that was already known to fail because of a particular issue:

https://openqa.suse.de/tests/8739190#next_previous

The "Next & Previous" tab shows that an issue had already been linked on earlier jobs.

Acceptance criteria

  • AC1: Minion jobs are kept long enough to be able to investigate this should it happen again
  • AC2: Logs for carryover are verbose enough to be able to investigate this should it happen again

Suggestions

  • Investigate this quickly while the logs are fresh and hot
  • The bug reference from the existing job wasn't taken over
  • Look into the audit table to find out if a takeover comment was deleted
  • Increase the time that Minion jobs are kept in the database. Until then, if such a case occurs again, look for the Minion entry as quickly as possible and copy it into the ticket

Related issues 1 (0 open, 1 closed)

Copied from openQA Project (public) - action #110881: Investigation jobs run because of the lack of automatic takeover size:S (Resolved, okurz, 2022-05-11)

Actions #1

Updated by tinita over 2 years ago

  • Copied from action #110881: Investigation jobs run because of the lack of automatic takeover size:S added
Actions #2

Updated by tinita over 2 years ago

  • Subject changed from Investigation jobs run because of the lack of automatic takeover to Investigation jobs run because of the lack of automatic carryover
  • Description updated (diff)
Actions #3

Updated by tinita over 2 years ago

  • Description updated (diff)
Actions #4

Updated by tinita over 2 years ago

  • Subject changed from Investigation jobs run because of the lack of automatic carryover to Investigation jobs run because of the lack of automatic takeover
  • Description updated (diff)

I looked into the minion_jobs table, but the job was already too old and its entry had been deleted.
I also wondered why the investigation comment was made 75 minutes after the job had finished:
https://openqa.suse.de/tests/8739190#comments

@kraih suggested increasing the time before Minion jobs are deleted. The default is 2 days. We should add a setting for it:

Probably one line in WebAPI.pm to assign the setting to $self->minion->remove_after, plus the setting itself in the settings module.
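
The conversion behind such a setting can be sketched as follows. This is an illustration in Python, not the actual openQA change (which would be Perl). The section name `misc_limits` and the key `minion_job_max_age_days` are hypothetical placeholders. The sketch assumes that Minion's `remove_after` expects a value in seconds and falls back to the 2-day default mentioned above when nothing is configured.

```python
import configparser

SECONDS_PER_DAY = 24 * 60 * 60
DEFAULT_DAYS = 2  # default retention mentioned in the ticket


def remove_after_seconds(ini_text: str) -> int:
    """Return the retention period for finished Minion jobs in seconds.

    The section and key names are hypothetical, chosen for illustration only.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    days = parser.getint(
        "misc_limits", "minion_job_max_age_days", fallback=DEFAULT_DAYS
    )
    return days * SECONDS_PER_DAY


print(remove_after_seconds(""))  # no setting: 2 days -> 172800
print(remove_after_seconds("[misc_limits]\nminion_job_max_age_days = 7\n"))  # 604800
```

Keeping the default in one place means an unconfigured instance behaves exactly as before.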

SQL for searching for a certain job in the minion table:

select id, args,
       notes->'hook_rc' as hook_rc,
       notes->'hook_result' as hook_result,
       created, finished
from minion_jobs
where jsonb_typeof(args->0) = 'number'
  and cast(args->0 as int) = 8739190;
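
For repeated lookups, the job id in the query above can be passed as a parameter instead of being edited into the SQL text. The following is a minimal sketch. The helper name `minion_job_query` is hypothetical, and the `%s` placeholder assumes a Python DB-API driver that uses the `format` parameter style (psycopg2, for example). The sketch only builds the statement and does not connect to a database.

```python
# Same query as in the ticket, with the job id replaced by a placeholder.
MINION_JOB_QUERY = """
select id, args,
       notes->'hook_rc' as hook_rc,
       notes->'hook_result' as hook_result,
       created, finished
from minion_jobs
where jsonb_typeof(args->0) = 'number'
  and cast(args->0 as int) = %s
"""


def minion_job_query(job_id):
    """Return (sql, params) for finding Minion jobs whose first argument is job_id.

    Hypothetical helper: int() rejects anything that is not a plain job id.
    """
    return MINION_JOB_QUERY, (int(job_id),)


sql, params = minion_job_query(8739190)
print(params)
```

The returned pair can be handed to a cursor as `cursor.execute(sql, params)`.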
Actions #5

Updated by tinita over 2 years ago

I checked the audit log to see whether the takeover comment was ever created (and then possibly deleted).
It's a bit hard to tell because comment events have only been logged for a few days, and the oldest comment event in the audit log is, I believe, from "2022-05-12 05:47:14" CEST.
The job in question finished at 05:25 UTC, which is 07:25 CEST, so the comment event should have been logged.
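
The time zone comparison can be checked with a few lines of Python. The ticket gives only the time of day for the job, so the date used for the finish time below is an assumption (the same day as the oldest audit event). CEST is modelled as a fixed UTC+2 offset.

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2), "CEST")

# Oldest comment event in the audit log, as quoted in the ticket (CEST).
oldest_event = datetime(2022, 5, 12, 5, 47, 14, tzinfo=CEST)

# Job finish time from the ticket (05:25 UTC).
# ASSUMPTION: the date is not stated in the ticket; 2022-05-12 is used here.
job_finished = datetime(2022, 5, 12, 5, 25, tzinfo=timezone.utc)

finished_cest = job_finished.astimezone(CEST)
print(finished_cest.strftime("%H:%M %Z"))  # 07:25 CEST
print(job_finished > oldest_event)         # True: the job finished after logging began
```

Under that assumption, the job finished after the first logged comment event, which matches the conclusion that a takeover comment would have appeared in the audit log.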

One problem is that comment events are logged with their own id but without the job (or job group) id.
So if a comment is deleted, we can never connect the comment audit entry to a job.
That will be fixed by https://github.com/os-autoinst/openQA/pull/4655 once it is merged.

Takeover comments, however, are logged in the audit table with an additional entry, taken_over_from_job_id.
For the job in question that should have been 8292431:
https://openqa.suse.de/tests/8686481#comments
So I looked for that id:

select * from audit_events where event = 'comment_create' and event_data like '%8292431%';

but got nothing.

So my conclusion is the takeover comment was never created.

But it should have been. I confirmed this locally by putting the carry_over_bugrefs call into a normal job view for the job in question: it reached the code that would have created the comment, so all conditions for a carryover candidate were fulfilled.

Actions #6

Updated by tinita over 2 years ago

  • Description updated (diff)
Actions #7

Updated by mkittler over 2 years ago

  • Subject changed from Investigation jobs run because of the lack of automatic takeover to Investigation jobs run because of the lack of automatic takeover size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by tinita over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita
Actions #9

Updated by tinita over 2 years ago

https://github.com/os-autoinst/openQA/pull/4662 Add configuration for expiring minion jobs

Actions #10

Updated by openqa_review over 2 years ago

  • Due date set to 2022-06-03

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by tinita over 2 years ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/4664 Improve debugging of _carry_over_candidate

Actions #14

Updated by okurz over 2 years ago

OK. That should suffice to cover the ACs, so I think we can resolve this ticket. We don't necessarily need to wait for the problem to be observed again; whenever that happens, we can look at the logs. WDYT?

Actions #15

Updated by tinita over 2 years ago

Well, I was waiting until it is deployed. That's necessary for resolving.
I checked that it is deployed on o3, but not on osd yet.

Actions #16

Updated by tinita over 2 years ago

  • Status changed from Feedback to Resolved

Deployed on osd as well
