Project

General

Profile

Actions

action #96959

closed

Job hooks trigger investigate jobs for passed/soft-failed size:M

Added by livdywan almost 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-08-16
Due date:
2021-09-23
% Done:

0%

Estimated time:

Description

Observation

Investigate jobs are being scheduled for soft-failed and passed jobs. There's no comment there but OPENQA_INVESTIGATE_ORIGIN can be seen from the job settings when looking at scheduled jobs to trace it to the affected job.

passed

soft-failed

Suggestion


Related issues 1 (0 open1 closed)

Copied to openQA Project - action #97028: openqa-clone-job: Pass id of cloned job as settingResolvedosukup2021-08-16

Actions
Actions #1

Updated by livdywan almost 3 years ago

  • Subject changed from Job hooks trigger investigate jobs for passed/soft-failed to Job hooks trigger investigate jobs for passed/soft-failed size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from High to Urgent
Actions #2

Updated by livdywan almost 3 years ago

  • Assignee set to livdywan

I'll take a look what's going on there.

Actions #3

Updated by livdywan almost 3 years ago

Another instance I found:

passed

  • Verify if openQA correctly reads config for hooks (config in production looks fine)

my $key = 'job_done_hook_' . $openqa_job->result seems solid. We're not somehow setting the env variable either (grep -R OPENQA_ /etc/systemd).

  • Check if OPENQA_INVESTIGATE_ORIGIN was set correctly, considering those jobs have no comment

This is set in one place, clone_settings+=("OPENQA_INVESTIGATE_ORIGIN=$host_url/t$id"), where $id is the job to clone. I can't see how that could be the wrong job.

I manually called investigate on those jobs and it got to the comment code path just fine.

  • Check gru logs
  • Nothing in /var/log/openqa_gru
  • Going by minion jobs I couldn't easily identify any cases. There doesn't seem to be a way to search by job state or settings.

So I'm thinking we should record the hook to be able to verify the minion job: https://github.com/os-autoinst/openQA/pull/4126

Actions #4

Updated by livdywan almost 3 years ago

  • Status changed from Workable to In Progress

cdywan wrote:

  • Check if OPENQA_INVESTIGATE_ORIGIN was set correctly, considering those jobs have no comment This is set in one place, clone_settings+=("OPENQA_INVESTIGATE_ORIGIN=$host_url/t$id"), where $id is the job to clone. I can't see how that could be the wrong job.

We need Ensure OPENQA_INVESTIGATE_ORIGIN gets the original ID, not the id to be cloned - as @tinita found out we're actually seeing the last good here, which is expected, and the script is passing that instead. I'm going to propose a PR for that.

Actions #5

Updated by tinita almost 3 years ago

  • Copied to action #97028: openqa-clone-job: Pass id of cloned job as setting added
Actions #6

Updated by livdywan almost 3 years ago

  • Status changed from In Progress to Feedback

cdywan wrote:

We need Ensure OPENQA_INVESTIGATE_ORIGIN gets the original ID, not the id to be cloned - as @tinita found out we're actually seeing the last good here, which is expected, and the script is passing that instead. I'm going to propose a PR for that.

https://github.com/os-autoinst/scripts/pull/102

Actions #7

Updated by livdywan almost 3 years ago

  • Priority changed from Urgent to Normal

Lowering the priority since it was found there were no extra jobs, just jobs with faulty settings.

cdywan wrote:

https://github.com/os-autoinst/scripts/pull/102

The PR was merged.

Actions #8

Updated by livdywan almost 3 years ago

cdywan wrote:

Lowering the priority since it was found there were no extra jobs, just jobs with faulty settings.

cdywan wrote:

https://github.com/os-autoinst/scripts/pull/102

The PR was merged.

So I verified 20 investigate jobs for their OPENQA_INVESTIGATE_ORIGIN and found this: https://openqa.suse.de/tests/6947401#settings
It's incomplete, and the origin has no comment 🤔

Actions #9

Updated by livdywan over 2 years ago

cdywan wrote:

So I verified 20 investigate jobs for their OPENQA_INVESTIGATE_ORIGIN and found this: https://openqa.suse.de/tests/6947401#settings
It's incomplete, and the origin has no comment 🤔

$ sudo grep -E 'invest|hook|6947401|6946641' /var/log/apache2/error_log
$ sudo grep -E '6947401|6946641' /var/log/openqa{,_gru}

Nothing relevant in the logs...

Actions #10

Updated by livdywan over 2 years ago

cdywan wrote:

Nothing relevant in the logs...

Checking some more recent occurences:

$ sudo grep --color=always -E '[^0-9](7003107|7002448)' /var/log/apache2/error_log /var/log/openqa{,_gru}

We seem to have an increasing number of investigate jobs w/o a comment on the original job, but I still can't find anything in the logs... 🤔️

Actions #11

Updated by okurz over 2 years ago

Keep in mind that /var/log/openqa_gru does not go that far. I checked the first entry and it's from 2021-09-02 2300 but https://openqa.suse.de/tests/6994712 is from 2021-09-01 so will likely not show up there. You should try xzgrep --color=always -E '[^0-9](7003107|7002448)' /var/log/openqa_gru.*.xz which at least goes back to 2021-08-30. Also not that long but enough in this case. Still, the search term does not show anything.

Actions #12

Updated by livdywan over 2 years ago

@tinita discovered some POST /api/v1/jobs/[0-9]*/comments requests with status 404 in Apache logs. I'm wondering why we 're not seeing errors elsewhere if the post coming from investigate hooks is failing i.e.

$ openqa-cli api --osd jobs/5990783/comments
404 Not Found
{"error":"Job 5990783 does not exist","error_status":404}`

Normally this gives you output on both stderr and stdout yet we seem to get nothing?

Actions #13

Updated by tinita over 2 years ago

PR for adding User-Agent to openqa-cli calls: https://github.com/os-autoinst/scripts/pull/105

Actions #15

Updated by tinita over 2 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from livdywan to tinita

Since I'm actually working on this I assign it to me.

I'm working on another bugfix for my previous PR

Actions #16

Updated by VANASTASIADIS over 2 years ago

  • Due date set to 2021-09-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #18

Updated by tinita over 2 years ago

  • Status changed from In Progress to Feedback

Ok, everything seems to be working again.

I see both openqa-label-known-issues and openqa-investigate useragents in both o3 and osd apache logs and none of the previous error messages in the gru logs anymore.

@cdywan, if you find more jobs that don't have comments, we can at least check the apache logs.

Actions #19

Updated by tinita over 2 years ago

  • Assignee changed from tinita to livdywan

Assigning to @cdywan because of my vacation.

Actions #20

Updated by livdywan over 2 years ago

I tried to spot more instances by looking at Last 500 finished jobs with investigate typed in the filter, and could only find correct jobs with comments in there. Pondering if I should bump the due date or consider this resolved 🤔️

Actions #21

Updated by livdywan over 2 years ago

  • Status changed from Feedback to Resolved

I'm thinking a follow-up could be an alert for buggy investigate jobs. For now I'll consider this resolved.

Actions

Also available in: Atom PDF