action #111152
Investigation jobs run because of the lack of automatic takeover size:M
Description
Observation
Several investigation jobs were scheduled for a test that was already known to fail because of a particular issue:
https://openqa.suse.de/tests/8739190#next_previous
The "Next & Previous" jobs show that an issue was linked before.
Acceptance criteria
- AC1: Minion jobs are kept long enough to be able to investigate this should it happen again
- AC2: Logs for carryover are verbose enough to be able to investigate this should it happen again
Suggestions
- Investigate this quickly while the logs are fresh and hot
- The bug reference from the existing job wasn't taken over
- Look into the audit table to find out if a takeover comment was deleted
- Increase the time to keep minion jobs in the database. Until then, if we have such a case again, look for the minion entry as fast as possible and copy the minion entry into the ticket
History
#1
Updated by tinita about 1 month ago
- Copied from action #110881: Investigation jobs run because of the lack of automatic takeover size:S added
#2
Updated by tinita about 1 month ago
- Subject changed from Investigation jobs run because of the lack of automatic takeover to Investigation jobs run because of the lack of automatic carryover
- Description updated (diff)
#3
Updated by tinita about 1 month ago
- Description updated (diff)
#4
Updated by tinita about 1 month ago
- Subject changed from Investigation jobs run because of the lack of automatic carryover to Investigation jobs run because of the lack of automatic takeover
- Description updated (diff)
I looked into the minion_jobs table, but the job was already too old and its entry had been deleted.
I was also wondering why the investigation comment was made 75 minutes after the job finished:
https://openqa.suse.de/tests/8739190#comments
kraih suggested increasing the time before minion jobs are deleted. The default is 2 days. We should add a setting for it:
probably one line in WebAPI.pm to assign the setting to $self->minion->remove_after, plus the setting itself in the settings module.
SQL for searching for a certain job in the minion table:
select id, args, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished from minion_jobs where jsonb_typeof(args->0) = 'number' and cast(args->0 as int) = 8739190
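To illustrate why the entry was already gone, here is a minimal Python sketch of the retention arithmetic. It assumes the retention period is expressed in seconds, so the 2-day default mentioned above corresponds to 172800. The function name, the 14-day value and the timestamps are hypothetical and are not openQA settings.

```python
from datetime import datetime, timedelta, timezone

# 2 days in seconds, matching the default mentioned above
DEFAULT_REMOVE_AFTER = 2 * 24 * 60 * 60


def still_in_minion_table(finished, now, remove_after=DEFAULT_REMOVE_AFTER):
    """Return True if a finished job is younger than the retention period."""
    return now - finished < timedelta(seconds=remove_after)


# Hypothetical timestamps, for illustration only
finished = datetime(2022, 5, 12, 5, 25, tzinfo=timezone.utc)
three_days_later = finished + timedelta(days=3)

# With the default retention the entry is gone after three days
print(still_in_minion_table(finished, three_days_later))  # False

# With a (hypothetical) 14-day retention it would still be available
print(still_in_minion_table(finished, three_days_later, remove_after=14 * 86400))  # True
```

A longer retention only helps if someone looks at the entry within that window, which is why copying the minion entry into the ticket early is still worthwhile.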
#5
Updated by tinita about 1 month ago
I checked the audit log to see whether the takeover comment was ever created (and then possibly deleted).
It's a bit hard to tell because comment events have only been logged for a few days, and I believe the oldest comment event in the audit log is from "2022-05-12 05:47:14" CEST.
The job in question finished at 05:25 UTC -> 07:25 CEST, that means the comment event should have been logged.
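The time comparison above can be double-checked with a short Python sketch. It assumes the job finished on the same day as the oldest audit entry (the date is not stated explicitly here) and models CEST as a fixed UTC+2 offset.

```python
from datetime import datetime, timedelta, timezone

# CEST modelled as a fixed UTC+2 offset
CEST = timezone(timedelta(hours=2), "CEST")

# Assumption: the job finished on the same day as the oldest audit entry
job_finished_utc = datetime(2022, 5, 12, 5, 25, tzinfo=timezone.utc)
oldest_comment_event = datetime(2022, 5, 12, 5, 47, 14, tzinfo=CEST)

job_finished_cest = job_finished_utc.astimezone(CEST)
print(job_finished_cest.strftime("%H:%M %Z"))  # 07:25 CEST

# True: the job finished after comment events started being logged
print(job_finished_utc > oldest_comment_event)
```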
Now one problem is, comment events are logged with their id, but without the job(group) id.
So if a comment is deleted, we can never connect the comment audit entry to a job.
That will be fixed by https://github.com/os-autoinst/openQA/pull/4655 once it is merged.
But for takeover comments, they are actually logged in the audit table with an additional entry taken_over_from_job_id.
For the job in question that should have been 8292431:
https://openqa.suse.de/tests/8686481#comments
So I looked for that id:
select * from audit_events where event = 'comment_create' and event_data like '%8292431%';
but got nothing.
So my conclusion is the takeover comment was never created.
But it should have been. I confirmed that locally by putting the carry_over_bugrefs call into a normal job view for the job in question: it reached the code that would have created the comment, so all conditions for a carryover candidate were fulfilled.
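The LIKE pattern used in the audit query matches any event whose data contains the digits 8292431 anywhere, so a stricter check could decode the event data and compare the field itself. This is a minimal Python sketch, assuming event_data is stored as JSON text and that the key is taken_over_from_job_id as described above. The function name and the sample rows are made up for illustration.

```python
import json


def takeover_events(rows, from_job_id):
    """Return comment_create rows whose taken_over_from_job_id matches exactly."""
    matches = []
    for row in rows:
        if row["event"] != "comment_create":
            continue
        try:
            data = json.loads(row["event_data"])
        except (TypeError, ValueError):
            # Skip rows whose event data is missing or not valid JSON
            continue
        if isinstance(data, dict) and data.get("taken_over_from_job_id") == from_job_id:
            matches.append(row)
    return matches


# Hypothetical rows: the second one would match a plain substring search
# for 8292431 but is not a takeover from that job
rows = [
    {"event": "comment_create", "event_data": '{"id": 1, "taken_over_from_job_id": 8292431}'},
    {"event": "comment_create", "event_data": '{"id": 182924310}'},
    {"event": "comment_delete", "event_data": '{"id": 1}'},
]

print(len(takeover_events(rows, 8292431)))  # 1
```

Since the substring search already returned nothing, the stricter check would not change the conclusion here; it only avoids false positives in future searches.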
#6
Updated by tinita about 1 month ago
- Description updated (diff)
#7
Updated by mkittler about 1 month ago
- Subject changed from Investigation jobs run because of the lack of automatic takeover to Investigation jobs run because of the lack of automatic takeover size:M
- Description updated (diff)
- Status changed from New to Workable
#8
Updated by tinita about 1 month ago
- Status changed from Workable to In Progress
- Assignee set to tinita
#9
Updated by tinita about 1 month ago
https://github.com/os-autoinst/openQA/pull/4662 Add configuration for expiring minion jobs
#10
Updated by openqa_review about 1 month ago
- Due date set to 2022-06-03
Setting due date based on mean cycle time of SUSE QE Tools
#11
Updated by tinita about 1 month ago
#12
Updated by tinita about 1 month ago
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/4664 Improve debugging of _carry_over_candidate
#13
Updated by tinita about 1 month ago
#14
Updated by okurz about 1 month ago
OK. That should suffice to cover the ACs, so I think we can resolve this ticket. We don't necessarily need to wait for the problem to be observed again. Whenever that happens, we can look at the logs. WDYT?
#15
Updated by tinita about 1 month ago
Well, I was waiting until it is deployed. That's necessary for resolving.
I checked that it is deployed on o3, but not osd yet.
#16
Updated by tinita about 1 month ago
- Status changed from Feedback to Resolved
Deployed on osd as well