action #111152

Investigation jobs run because of the lack of automatic takeover size:M

Added by tinita about 1 month ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
Due date:
2022-06-03
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

Several investigation jobs were scheduled for a test that was already known to fail because of a particular issue:

https://openqa.suse.de/tests/8739190#next_previous

The "Next & Previous" jobs show that an issue was linked before.

Acceptance criteria

  • AC1: Minion jobs are kept long enough to be able to investigate this should it happen again
  • AC2: Logs for carryover are verbose enough to be able to investigate this should it happen again

Suggestions

  • Investigate this quickly while the logs are fresh and hot
  • The bug reference from the existing job wasn't taken over
  • Look into the audit table to find out if a takeover comment was deleted
  • Increase the time to keep minion jobs in the database. Until then, if we have such a case again, look for the minion entry as fast as possible and copy the minion entry into the ticket

Related issues

Copied from openQA Project - action #110881: Investigation jobs run because of the lack of automatic takeover size:S (Resolved, 2022-05-11)

History

#1 Updated by tinita about 1 month ago

  • Copied from action #110881: Investigation jobs run because of the lack of automatic takeover size:S added

#2 Updated by tinita about 1 month ago

  • Subject changed from Investigation jobs run because of the lack of automatic takeover to Investigation jobs run because of the lack of automatic carryover
  • Description updated (diff)

#3 Updated by tinita about 1 month ago

  • Description updated (diff)

#4 Updated by tinita about 1 month ago

  • Subject changed from Investigation jobs run because of the lack of automatic carryover to Investigation jobs run because of the lack of automatic takeover
  • Description updated (diff)

I looked into the minion_jobs table, but the job was already too old and its entry had been deleted.
I also wondered why the investigation comment was made 75 minutes after the job finished:
https://openqa.suse.de/tests/8739190#comments

kraih suggested increasing the time before minion jobs are deleted. The default is 2 days. We should add a setting for it:

This probably needs one line in WebAPI.pm to assign the setting to $self->minion->remove_after, plus the setting itself in the settings module.

SQL for searching for a certain job in the minion table:

select id, args, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished
from minion_jobs
where jsonb_typeof(args->0) = 'number' and cast(args->0 as int) = 8739190;
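For repeated lookups, the same search can be wrapped so that the job ID is passed as a bound parameter instead of being edited into the SQL text. This is only a minimal sketch: the helper name `minion_job_query` is hypothetical, and the `%s` placeholder assumes a psycopg-style PostgreSQL driver, which the ticket does not mention.

```python
def minion_job_query(job_id: int):
    """Build the minion_jobs lookup for jobs whose first argument is job_id.

    Hypothetical helper; the SQL mirrors the query quoted in the ticket.
    """
    sql = (
        "select id, args, notes->'hook_rc' as hook_rc, "
        "notes->'hook_result' as hook_result, created, finished "
        "from minion_jobs "
        # Guard first: only numeric first arguments can be cast to int.
        "where jsonb_typeof(args->0) = 'number' "
        # Assumption: psycopg-style placeholder for the openQA job ID.
        "and cast(args->0 as int) = %s"
    )
    return sql, (int(job_id),)


sql, params = minion_job_query(8739190)
print(sql)
print(params)
```

With such a driver, the pair would be passed on as `cursor.execute(sql, params)`.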

#5 Updated by tinita about 1 month ago

I checked the audit log to see whether the takeover comment was ever created (and then possibly deleted).
It's a bit hard to tell because comment events have only been logged for a few days, and the oldest comment event in the audit log is, I believe, from "2022-05-12 05:47:14" CEST.
The job in question finished at 05:25 UTC (07:25 CEST), so the comment event should have been logged.
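The time comparison above can be checked with a few lines of Python. This is a sketch under one assumption: the ticket gives only the time of day for the job, so the calendar date 2022-05-12 used for the job's finish time is a guess. CEST is modelled as a fixed UTC+2 offset.

```python
from datetime import datetime, timedelta, timezone

# CEST as a fixed offset of UTC+2.
CEST = timezone(timedelta(hours=2), "CEST")

# Oldest comment event in the audit log, as quoted in the ticket.
oldest_audit_event = datetime(2022, 5, 12, 5, 47, 14, tzinfo=CEST)

# Job finish time from the ticket; the date is an assumption.
job_finished = datetime(2022, 5, 12, 5, 25, tzinfo=timezone.utc)

# Convert to CEST for display; aware datetimes compare correctly
# across time zones, so no manual conversion is needed for the check.
job_finished_cest = job_finished.astimezone(CEST)
covered_by_audit_log = job_finished >= oldest_audit_event

print(job_finished_cest.strftime("%H:%M %Z"))
print(covered_by_audit_log)
```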

One problem is that comment events are logged with their ID, but without the job or job group ID.
So if a comment is deleted, we can never connect the comment audit entry to a job.
That will be fixed by https://github.com/os-autoinst/openQA/pull/4655 once it is merged.

Takeover comments, however, are logged in the audit table with an additional entry, taken_over_from_job_id.
For the job in question that should have been 8292431:
https://openqa.suse.de/tests/8686481#comments
So I looked for that id:

select * from audit_events where event = 'comment_create' and event_data like '%8292431%';

but got nothing.

So my conclusion is the takeover comment was never created.

But it should have been. I confirmed this locally by putting the carry_over_bugrefs call into a normal job view for the job in question. It reached the code that would have created the comment, so all conditions for a carryover candidate were fulfilled.
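The audit_events search above matches the job ID as a substring, which can also hit unrelated numbers that happen to contain it. A stricter check parses event_data and compares the taken_over_from_job_id entry directly. This is a sketch only: it assumes event_data holds a JSON object, the function name is hypothetical, and the sample rows are invented for illustration, not real audit data.

```python
import json


def takeover_events(rows, source_job_id):
    """Return comment_create rows that name source_job_id as takeover source."""
    matches = []
    for row in rows:
        if row["event"] != "comment_create":
            continue
        try:
            data = json.loads(row["event_data"])
        except (TypeError, ValueError):
            # Skip rows whose event_data is missing or not valid JSON.
            continue
        if isinstance(data, dict) and data.get("taken_over_from_job_id") == source_job_id:
            matches.append(row)
    return matches


# Invented sample rows, for illustration only.
rows = [
    {"id": 1, "event": "comment_create",
     "event_data": json.dumps({"id": 10, "taken_over_from_job_id": 8292431})},
    # Contains the digits 8292431 only as part of another number.
    {"id": 2, "event": "comment_create",
     "event_data": json.dumps({"id": 18292431})},
    {"id": 3, "event": "comment_delete",
     "event_data": json.dumps({"id": 10})},
]

print([row["id"] for row in takeover_events(rows, 8292431)])
```

In this sample, a substring match on 8292431 would also return row 2, while the parsed comparison returns only row 1.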

#6 Updated by tinita about 1 month ago

  • Description updated (diff)

#7 Updated by mkittler about 1 month ago

  • Subject changed from Investigation jobs run because of the lack of automatic takeover to Investigation jobs run because of the lack of automatic takeover size:M
  • Description updated (diff)
  • Status changed from New to Workable

#8 Updated by tinita about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita

#9 Updated by tinita about 1 month ago

https://github.com/os-autoinst/openQA/pull/4662 Add configuration for expiring minion jobs

#10 Updated by openqa_review about 1 month ago

  • Due date set to 2022-06-03

Setting due date based on mean cycle time of SUSE QE Tools

#12 Updated by tinita about 1 month ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/4664 Improve debugging of _carry_over_candidate

#14 Updated by okurz about 1 month ago

OK. That should suffice to cover the ACs, so I think we can resolve this ticket. We don't necessarily need to wait for the problem to be observed again. Whenever that happens, we can look at the logs. WDYT?

#15 Updated by tinita about 1 month ago

Well, I was waiting for it to be deployed. That's necessary for resolving.
I checked that it is deployed on o3, but not on osd yet.

#16 Updated by tinita about 1 month ago

  • Status changed from Feedback to Resolved

Deployed on osd as well
