action #96959
closedJob hooks trigger investigate jobs for passed/soft-failed size:M
Description
Observation¶
Investigate jobs are being scheduled for soft-failed and passed jobs. There's no comment there but OPENQA_INVESTIGATE_ORIGIN
can be seen from the job settings when looking at scheduled jobs to trace it to the affected job.
passed
- https://openqa.suse.de/tests/6859615 for https://openqa.suse.de/tests/6812082
- https://openqa.suse.de/tests/6859614 for https://openqa.suse.de/tests/6812082
soft-failed
- https://openqa.suse.de/tests/6859608 for https://openqa.suse.de/tests/6851676
- https://openqa.suse.de/tests/6859607 for https://openqa.suse.de/tests/6851676
Suggestion¶
- Verify if openQA correctly reads config for hooks (config in production looks fine)
- Check if OPENQA_INVESTIGATE_ORIGIN was set correctly, considering those jobs have no comment
- (Why) are comments not being posted by https://github.com/os-autoinst/scripts/blob/master/openqa-investigate ?
- Check gru logs
Updated by livdywan over 3 years ago
- Subject changed from Job hooks trigger investigate jobs for passed/soft-failed to Job hooks trigger investigate jobs for passed/soft-failed size:M
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from High to Urgent
Updated by livdywan over 3 years ago
- Assignee set to livdywan
I'll take a look what's going on there.
Updated by livdywan over 3 years ago
Another instance I found:
passed
- Verify if openQA correctly reads config for hooks (config in production looks fine)
my $key = 'job_done_hook_' . $openqa_job->result
seems solid. We're not somehow setting the env variable either (grep -R OPENQA_ /etc/systemd
).
- Check if OPENQA_INVESTIGATE_ORIGIN was set correctly, considering those jobs have no comment
This is set in one place, clone_settings+=("OPENQA_INVESTIGATE_ORIGIN=$host_url/t$id")
, where $id
is the job to clone. I can't see how that could be the wrong job.
- (Why) are comments not being posted by https://github.com/os-autoinst/scripts/blob/master/openqa-investigate ?
I manually called investigate on those jobs and it got to the comment code path just fine.
- Check gru logs
- Nothing in /var/log/openqa_gru
- Going by minion jobs I couldn't easily identify any cases. There doesn't seem to be a way to search by job state or settings.
So I'm thinking we should record the hook to be able to verify the minion job: https://github.com/os-autoinst/openQA/pull/4126
Updated by livdywan over 3 years ago
- Status changed from Workable to In Progress
cdywan wrote:
- Check if OPENQA_INVESTIGATE_ORIGIN was set correctly, considering those jobs have no comment This is set in one place,
clone_settings+=("OPENQA_INVESTIGATE_ORIGIN=$host_url/t$id")
, where$id
is the job to clone. I can't see how that could be the wrong job.
We need Ensure OPENQA_INVESTIGATE_ORIGIN
gets the original ID, not the id to be cloned - as @tinita found out we're actually seeing the last good here, which is expected, and the script is passing that instead. I'm going to propose a PR for that.
Updated by tinita over 3 years ago
- Copied to action #97028: openqa-clone-job: Pass id of cloned job as setting added
Updated by livdywan over 3 years ago
- Status changed from In Progress to Feedback
cdywan wrote:
We need Ensure
OPENQA_INVESTIGATE_ORIGIN
gets the original ID, not the id to be cloned - as @tinita found out we're actually seeing the last good here, which is expected, and the script is passing that instead. I'm going to propose a PR for that.
Updated by livdywan over 3 years ago
- Priority changed from Urgent to Normal
Lowering the priority since it was found there were no extra jobs, just jobs with faulty settings.
cdywan wrote:
The PR was merged.
Updated by livdywan over 3 years ago
cdywan wrote:
Lowering the priority since it was found there were no extra jobs, just jobs with faulty settings.
cdywan wrote:
The PR was merged.
So I verified 20 investigate jobs for their OPENQA_INVESTIGATE_ORIGIN
and found this: https://openqa.suse.de/tests/6947401#settings
It's incomplete, and the origin has no comment 🤔
Updated by livdywan over 3 years ago
cdywan wrote:
So I verified 20 investigate jobs for their
OPENQA_INVESTIGATE_ORIGIN
and found this: https://openqa.suse.de/tests/6947401#settings
It's incomplete, and the origin has no comment 🤔
$ sudo grep -E 'invest|hook|6947401|6946641' /var/log/apache2/error_log
$ sudo grep -E '6947401|6946641' /var/log/openqa{,_gru}
Nothing relevant in the logs...
Updated by livdywan over 3 years ago
cdywan wrote:
Nothing relevant in the logs...
Checking some more recent occurences:
- https://openqa.suse.de/tests/6996738#settings w/o comment on https://openqa.suse.de/tests/6994712
- https://openqa.suse.de/tests/7003101#settings w/o comment on https://openqa.suse.de/tests/7002447
- https://openqa.suse.de/tests/7003107#settings w/o comment on https://openqa.suse.de/tests/7002448
$ sudo grep --color=always -E '[^0-9](7003107|7002448)' /var/log/apache2/error_log /var/log/openqa{,_gru}
We seem to have an increasing number of investigate jobs w/o a comment on the original job, but I still can't find anything in the logs... 🤔️
Updated by okurz over 3 years ago
Keep in mind that /var/log/openqa_gru does not go that far. I checked the first entry and it's from 2021-09-02 2300 but https://openqa.suse.de/tests/6994712 is from 2021-09-01 so will likely not show up there. You should try xzgrep --color=always -E '[^0-9](7003107|7002448)' /var/log/openqa_gru.*.xz
which at least goes back to 2021-08-30. Also not that long but enough in this case. Still, the search term does not show anything.
Updated by livdywan over 3 years ago
@tinita discovered some POST /api/v1/jobs/[0-9]*/comments
requests with status 404 in Apache logs. I'm wondering why we 're not seeing errors elsewhere if the post coming from investigate hooks is failing i.e.
$ openqa-cli api --osd jobs/5990783/comments
404 Not Found
{"error":"Job 5990783 does not exist","error_status":404}`
Normally this gives you output on both stderr and stdout yet we seem to get nothing?
Updated by tinita over 3 years ago
PR for adding User-Agent to openqa-cli calls: https://github.com/os-autoinst/scripts/pull/105
Updated by tinita over 3 years ago
Bugfix for 105: https://github.com/os-autoinst/scripts/pull/106
Updated by tinita over 3 years ago
- Status changed from Feedback to In Progress
- Assignee changed from livdywan to tinita
Since I'm actually working on this I assign it to me.
I'm working on another bugfix for my previous PR
Updated by VANASTASIADIS over 3 years ago
- Due date set to 2021-09-23
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita over 3 years ago
Updated by tinita over 3 years ago
- Status changed from In Progress to Feedback
Ok, everything seems to be working again.
I see both openqa-label-known-issues and openqa-investigate useragents in both o3 and osd apache logs and none of the previous error messages in the gru logs anymore.
@cdywan, if you find more jobs that don't have comments, we can at least check the apache logs.
Updated by tinita about 3 years ago
- Assignee changed from tinita to livdywan
Assigning to @cdywan because of my vacation.
Updated by livdywan about 3 years ago
I tried to spot more instances by looking at Last 500 finished jobs with investigate
typed in the filter, and could only find correct jobs with comments in there. Pondering if I should bump the due date or consider this resolved 🤔️
Updated by livdywan about 3 years ago
- Status changed from Feedback to Resolved
I'm thinking a follow-up could be an alert for buggy investigate jobs. For now I'll consider this resolved.