action #111152 (closed)
Investigation jobs run because of the lack of automatic takeover size:M
Description
Observation
Several investigation jobs were scheduled for a test that was known to fail because of a particular issue:
https://openqa.suse.de/tests/8739190#next_previous
The "Next & Previous" jobs show that an issue was linked before.
Acceptance criteria
- AC1: Minion jobs are kept long enough to be able to investigate this should it happen again
- AC2: Logs for carryover are verbose enough to be able to investigate this should it happen again
Suggestions
- Investigate this quickly while the logs are fresh and hot
- The bug reference from the existing job wasn't taken over
- Look into the audit table to find out if a takeover comment was deleted
- Increase the time to keep minion jobs in the database. Until then, if we have such a case again, look for the minion entry as fast as possible and copy the minion entry into the ticket
Updated by tinita over 2 years ago
- Copied from action #110881: Investigation jobs run because of the lack of automatic takeover size:S added
Updated by tinita over 2 years ago
- Subject changed from Investigation jobs run because of the lack of automatic takeover to Investigation jobs run because of the lack of automatic carryover
- Description updated (diff)
Updated by tinita over 2 years ago
- Subject changed from Investigation jobs run because of the lack of automatic carryover to Investigation jobs run because of the lack of automatic takeover
- Description updated (diff)
I looked into the minion_jobs table, but the job was already too old and its entry had been deleted.
I was wondering why the investigation comment was made 75 minutes after the job had finished:
https://openqa.suse.de/tests/8739190#comments
@kraih suggested increasing the time after which minion jobs are deleted. The default is 2 days. We should add a setting for it:
probably one line in WebAPI.pm to assign the setting to $self->minion->remove_after, plus the setting itself in the settings module
SQL for searching for a certain job in the minion table:
select id, args, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished from minion_jobs where jsonb_typeof(args->0) = 'number' and cast(args->0 as int) = 8739190
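To make it quicker to look up the minion entry the next time this happens, the query above could be wrapped in a small helper. This is only a sketch: the names MINION_JOB_QUERY and minion_job_query are hypothetical, and the %(job_id)s placeholder assumes a DB-API driver with pyformat parameters (psycopg2, for example).

```python
# Hypothetical helper; not part of openQA.
# The SQL is the query from this comment with the job id turned into a parameter.
# ASSUMPTION: the driver accepts pyformat placeholders like %(job_id)s.
MINION_JOB_QUERY = """
select id, args, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result,
       created, finished
from minion_jobs
where jsonb_typeof(args->0) = 'number'
  and cast(args->0 as int) = %(job_id)s
"""


def minion_job_query(job_id):
    """Return (sql, params) for finding minion entries of one openQA job."""
    job_id = int(job_id)  # raises ValueError for non-numeric strings
    return MINION_JOB_QUERY, {"job_id": job_id}
```

The returned pair can be passed to cursor.execute(sql, params) so the job id is never pasted into the SQL text by hand.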
Updated by tinita over 2 years ago
I checked the audit log to see whether the takeover comment was ever created (and then possibly deleted).
It's a bit hard to tell because comment events have only been logged for a few days, and the oldest comment event in the audit log is from "2022-05-12 05:47:14" CEST, I believe.
The job in question finished at 05:25 UTC -> 07:25 CEST, which means the comment event should have been logged.
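The time zone comparison above can be double-checked with a few lines of Python. This is a sketch under one loud assumption: the ticket does not state the calendar date on which the job finished, so 2022-05-12 is used here purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# Central European Summer Time is UTC+2.
CEST = timezone(timedelta(hours=2), "CEST")

# Oldest comment event in the audit log, as quoted above.
oldest_audit_event = datetime(2022, 5, 12, 5, 47, 14, tzinfo=CEST)

# ASSUMPTION: the date is not given in the ticket; only 05:25 UTC is.
job_finished = datetime(2022, 5, 12, 5, 25, tzinfo=timezone.utc)

# Convert to CEST so both timestamps can be read side by side.
job_finished_cest = job_finished.astimezone(CEST)

# Aware datetimes compare correctly across time zones.
covered_by_audit_log = job_finished >= oldest_audit_event
```

Under that assumption the job finished after the first logged comment event, so a takeover comment would have shown up in the audit log.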
Now one problem is, comment events are logged with their id, but without the job(group) id.
So if a comment is deleted, we can never connect the comment audit entry to a job.
That will be fixed by https://github.com/os-autoinst/openQA/pull/4655 once it is merged.
But takeover comments are actually logged in the audit table with an additional entry taken_over_from_job_id.
For the job in question that should have been 8292431:
https://openqa.suse.de/tests/8686481#comments
So I looked for that id:
select * from audit_events where event = 'comment_create' and event_data like '%8292431%';
but got nothing.
So my conclusion is the takeover comment was never created.
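A substring match with like can also hit unrelated numbers that merely contain the id, so an exact comparison on the taken_over_from_job_id entry is a useful cross-check. The sketch below assumes that event_data holds a JSON object stored as text; the function name find_takeover_events and the row shape are hypothetical.

```python
import json


def find_takeover_events(rows, origin_job_id):
    """Return comment_create rows whose taken_over_from_job_id matches exactly.

    rows: iterable of (event, event_data) tuples, e.g. fetched from audit_events.
    ASSUMPTION: event_data is a JSON object serialized as text.
    """
    matches = []
    for event, event_data in rows:
        if event != "comment_create":
            continue
        try:
            data = json.loads(event_data)
        except (TypeError, ValueError):
            continue  # skip rows whose event_data is not valid JSON
        if not isinstance(data, dict):
            continue
        # Compare as strings so numeric and quoted ids are treated alike.
        if str(data.get("taken_over_from_job_id")) == str(origin_job_id):
            matches.append((event, event_data))
    return matches
```

An empty result from this check would support the same conclusion as the query above.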
But it should have been. I confirmed that by trying it out locally: I put the carry_over_bugrefs call into a normal job view for the job in question, and it reached the code that would have created the comment, so all conditions for a carryover candidate were fulfilled.
Updated by mkittler over 2 years ago
- Subject changed from Investigation jobs run because of the lack of automatic takeover to Investigation jobs run because of the lack of automatic takeover size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by tinita over 2 years ago
- Status changed from Workable to In Progress
- Assignee set to tinita
Updated by tinita over 2 years ago
https://github.com/os-autoinst/openQA/pull/4662 Add configuration for expiring minion jobs
Updated by openqa_review over 2 years ago
- Due date set to 2022-06-03
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita over 2 years ago
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/4664 Improve debugging of _carry_over_candidate
Updated by okurz over 2 years ago
OK. That should suffice to cover the ACs, so I think we can resolve this ticket. We don't necessarily need to wait for the problem to be observed again; whenever that happens, we can look at the logs. WDYT?
Updated by tinita over 2 years ago
Well, I was waiting until it is deployed. That's necessary for resolving.
I checked that it is deployed on o3, but not osd yet.