Project

General

Profile

action #96959

Job hooks trigger investigate jobs for passed/soft-failed size:M

Added by cdywan 5 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-08-16
Due date:
2021-09-23
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

Investigate jobs are being scheduled for soft-failed and passed jobs. There's no comment there but OPENQA_INVESTIGATE_ORIGIN can be seen from the job settings when looking at scheduled jobs to trace it to the affected job.

passed

soft-failed

Suggestion


Related issues

Copied to openQA Project - action #97028: openqa-clone-job: Pass id of cloned job as settingResolved2021-08-16

History

#1 Updated by cdywan 5 months ago

  • Subject changed from Job hooks trigger investigate jobs for passed/soft-failed to Job hooks trigger investigate jobs for passed/soft-failed size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from High to Urgent

#2 Updated by cdywan 5 months ago

  • Assignee set to cdywan

I'll take a look what's going on there.

#3 Updated by cdywan 5 months ago

Another instance I found:

passed

  • Verify if openQA correctly reads config for hooks (config in production looks fine)

my $key = 'job_done_hook_' . $openqa_job->result seems solid. We're not somehow setting the env variable either (grep -R OPENQA_ /etc/systemd).

  • Check if OPENQA_INVESTIGATE_ORIGIN was set correctly, considering those jobs have no comment

This is set in one place, clone_settings+=("OPENQA_INVESTIGATE_ORIGIN=$host_url/t$id"), where $id is the job to clone. I can't see how that could be the wrong job.

I manually called investigate on those jobs and it got to the comment code path just fine.

  • Check gru logs
  • Nothing in /var/log/openqa_gru
  • Going by minion jobs I couldn't easily identify any cases. There doesn't seem to be a way to search by job state or settings.

So I'm thinking we should record the hook to be able to verify the minion job: https://github.com/os-autoinst/openQA/pull/4126

#4 Updated by cdywan 5 months ago

  • Status changed from Workable to In Progress

cdywan wrote:

  • Check if OPENQA_INVESTIGATE_ORIGIN was set correctly, considering those jobs have no comment This is set in one place, clone_settings+=("OPENQA_INVESTIGATE_ORIGIN=$host_url/t$id"), where $id is the job to clone. I can't see how that could be the wrong job.

We need Ensure OPENQA_INVESTIGATE_ORIGIN gets the original ID, not the id to be cloned - as tinita found out we're actually seeing the last good here, which is expected, and the script is passing that instead. I'm going to propose a PR for that.

#5 Updated by tinita 5 months ago

  • Copied to action #97028: openqa-clone-job: Pass id of cloned job as setting added

#6 Updated by cdywan 5 months ago

  • Status changed from In Progress to Feedback

cdywan wrote:

We need Ensure OPENQA_INVESTIGATE_ORIGIN gets the original ID, not the id to be cloned - as tinita found out we're actually seeing the last good here, which is expected, and the script is passing that instead. I'm going to propose a PR for that.

https://github.com/os-autoinst/scripts/pull/102

#7 Updated by cdywan 5 months ago

  • Priority changed from Urgent to Normal

Lowering the priority since it was found there were no extra jobs, just jobs with faulty settings.

cdywan wrote:

https://github.com/os-autoinst/scripts/pull/102

The PR was merged.

#8 Updated by cdywan 5 months ago

cdywan wrote:

Lowering the priority since it was found there were no extra jobs, just jobs with faulty settings.

cdywan wrote:

https://github.com/os-autoinst/scripts/pull/102

The PR was merged.

So I verified 20 investigate jobs for their OPENQA_INVESTIGATE_ORIGIN and found this: https://openqa.suse.de/tests/6947401#settings
It's incomplete, and the origin has no comment 🤔

#9 Updated by cdywan 5 months ago

cdywan wrote:

So I verified 20 investigate jobs for their OPENQA_INVESTIGATE_ORIGIN and found this: https://openqa.suse.de/tests/6947401#settings
It's incomplete, and the origin has no comment 🤔

$ sudo grep -E 'invest|hook|6947401|6946641' /var/log/apache2/error_log
$ sudo grep -E '6947401|6946641' /var/log/openqa{,_gru}

Nothing relevant in the logs...

#10 Updated by cdywan 5 months ago

cdywan wrote:

Nothing relevant in the logs...

Checking some more recent occurences:

$ sudo grep --color=always -E '[^0-9](7003107|7002448)' /var/log/apache2/error_log /var/log/openqa{,_gru}

We seem to have an increasing number of investigate jobs w/o a comment on the original job, but I still can't find anything in the logs... 🤔️

#11 Updated by okurz 5 months ago

Keep in mind that /var/log/openqa_gru does not go that far. I checked the first entry and it's from 2021-09-02 2300 but https://openqa.suse.de/tests/6994712 is from 2021-09-01 so will likely not show up there. You should try xzgrep --color=always -E '[^0-9](7003107|7002448)' /var/log/openqa_gru.*.xz which at least goes back to 2021-08-30. Also not that long but enough in this case. Still, the search term does not show anything.

#12 Updated by cdywan 4 months ago

tinita discovered some POST /api/v1/jobs/[0-9]*/comments requests with status 404 in Apache logs. I'm wondering why we 're not seeing errors elsewhere if the post coming from investigate hooks is failing i.e.

$ openqa-cli api --osd jobs/5990783/comments
404 Not Found
{"error":"Job 5990783 does not exist","error_status":404}`

Normally this gives you output on both stderr and stdout yet we seem to get nothing?

#13 Updated by tinita 4 months ago

PR for adding User-Agent to openqa-cli calls: https://github.com/os-autoinst/scripts/pull/105

#15 Updated by tinita 4 months ago

  • Status changed from Feedback to In Progress
  • Assignee changed from cdywan to tinita

Since I'm actually working on this I assign it to me.

I'm working on another bugfix for my previous PR

#16 Updated by VANASTASIADIS 4 months ago

  • Due date set to 2021-09-23

Setting due date based on mean cycle time of SUSE QE Tools

#18 Updated by tinita 4 months ago

  • Status changed from In Progress to Feedback

Ok, everything seems to be working again.

I see both openqa-label-known-issues and openqa-investigate useragents in both o3 and osd apache logs and none of the previous error messages in the gru logs anymore.

cdywan, if you find more jobs that don't have comments, we can at least check the apache logs.

#19 Updated by tinita 4 months ago

  • Assignee changed from tinita to cdywan

Assigning to cdywan because of my vacation.

#20 Updated by cdywan 4 months ago

I tried to spot more instances by looking at Last 500 finished jobs with investigate typed in the filter, and could only find correct jobs with comments in there. Pondering if I should bump the due date or consider this resolved 🤔️

#21 Updated by cdywan 4 months ago

  • Status changed from Feedback to Resolved

I'm thinking a follow-up could be an alert for buggy investigate jobs. For now I'll consider this resolved.

Also available in: Atom PDF