action #80736

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd

Trigger 'auto-review' from within openQA when jobs incomplete (or fail) , for testing: auto_review:"tests died: unable to load main.pm, check the log for the cause"

Added by okurz 6 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Motivation

auto-review does a good job but we could benefit from running auto-review more often. We should trigger it directly when openQA jobs incomplete or fail

Acceptance criteria

  • AC1: auto-review is triggered on o3 when jobs incomplete

Suggestions


Related issues

Copied from QA - action #77944: Run "auto-review" more often but alarm less (Resolved, 2020-11-14)

Copied to openQA Project - action #80826: Trigger 'auto-review' from within openQA when jobs incomplete on osd as well (Resolved, 2020-12-08)

History

#1 Updated by okurz 6 months ago

  • Copied from action #77944: Run "auto-review" more often but alarm less added

#2 Updated by okurz 6 months ago

  • Start date deleted (2020-11-14)

#3 Updated by okurz 6 months ago

  • Subject changed from Trigger "auto-review" from within openQA when jobs incomplete (or fail) to Trigger "auto-review" from within openQA when jobs incomplete (or fail) , for testing: auto_review:"tests died: unable to load main.pm, check the log for the cause"

On o3 as root:

cd /opt/
mkdir os-autoinst-scripts
chown geekotest os-autoinst-scripts
sudo -u geekotest git clone git@github.com:os-autoinst/scripts.git os-autoinst-scripts

In /etc/cron.d/os-autoinst-scripts-update-git:

-*/3    * * * *  geekotest     git -C /opt/os-autoinst-scripts pull --quiet --rebase origin master

In /etc/openqa/openqa.ini:

[hooks]
job_done_hook_incomplete = /opt/os-autoinst-scripts/openqa-label-known-issues-hook

and

systemctl restart openqa-webui openqa-gru

Triggered a job that would incomplete with openqa-cli api --o3 -X post jobs test=okurz_poo80736 and found the minion job result https://openqa.opensuse.org/minion/jobs?id=335327, which shows:

{
  "args" => [
    1495066,
    undef
  ],
  "attempts" => 1,
  "children" => [],
  "created" => "2020-12-04T19:33:19.6905Z",
  "delayed" => "2020-12-04T19:33:19.6905Z",
  "expires" => undef,
  "finished" => "2020-12-04T19:33:19.77125Z",
  "id" => 335327,
  "lax" => 0,
  "notes" => {
    "gru_id" => 17288276
  },
  "parents" => [],
  "priority" => 0,
  "queue" => "default",
  "result" => "Job successfully executed",
  "retried" => undef,
  "retries" => 0,
  "started" => "2020-12-04T19:33:19.6963Z",
  "state" => "finished",
  "task" => "finalize_job_results",
  "time" => "2020-12-04T19:35:14.04025Z",
  "worker" => 471
}

So nothing is seen from the hook. Calling the script manually with a job ID as argument reveals the problem:

/opt/os-autoinst-scripts/openqa-label-known-issues-hook 1495074

This shows "Connection refused" because the script tries https://openqa.opensuse.org, which does not work when called from within the o3 network; http has to be used. env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook 1495074 works, so we can set that in the config as well.
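To illustrate why prefixing the call with env scheme=http helps, here is a minimal sketch of how a script could derive its base URL from optional environment overrides. The function name base_url and the variable name host are assumptions for illustration only and are not taken from the real hook script; only scheme appears in the command above.

```shell
#!/bin/sh
# Hypothetical helper (not copied from openqa-label-known-issues-hook):
# build the openQA base URL from optional environment overrides.
base_url() {
    # "scheme" is the variable used in the command above; "host" is an
    # assumed name. Defaults point at the public https address.
    printf '%s://%s\n' "${scheme:-https}" "${host:-openqa.opensuse.org}"
}

# Default case, with neither variable set in the environment
base_url

# Override in a subshell, as needed for calls from inside the o3 network
( scheme=http; base_url )
```

With a pattern like this, the default stays correct for external callers while a single variable switches the protocol for internal ones.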

Setting a custom auto_review regex in the subject line of this ticket which should label the job accordingly.

#4 Updated by okurz 6 months ago

  • Subject changed from Trigger "auto-review" from within openQA when jobs incomplete (or fail) , for testing: auto_review:"tests died: unable to load main.pm, check the log for the cause" to Trigger 'auto-review' from within openQA when jobs incomplete (or fail) , for testing: auto_review:"tests died: unable to load main.pm, check the log for the cause"

#5 Updated by okurz 6 months ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback

At first nothing happened. Then, on o3, I added some live debug statements to /usr/share/openqa/lib/OpenQA/Task/Job/FinalizeResults.pm. My hypothesis is that the config variables are not read by the gru service, so I tried environment variables instead. I put the following into /etc/systemd/system/openqa-gru.service.d/override.conf:

[Service]
Environment="OPENQA_JOB_DONE_HOOK_INCOMPLETE=env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook"

This yields "Permission denied" in journalctl -u openqa-gru, due to apparmor. To test the following steps I used the minion dashboard to retrigger the same minion job by selecting it and clicking "Retry", instead of posting a new openQA job every time. So I ran aa-complain /etc/apparmor.d/usr.share.openqa.script.openqa and then aa-logprof /var/log/audit/audit.log, which suggested the following additions to /etc/apparmor.d/usr.share.openqa.script.openqa:

  /opt/os-autoinst-scripts/** rix,
  /usr/bin/cat rix,
  /usr/bin/curl rix,
  /usr/bin/jq rix,
  /usr/bin/mktemp rix,
  /usr/share/openqa/script/client rix,

and then ran aa-enforce /etc/apparmor.d/usr.share.openqa.script.openqa again.
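The systemd drop-in described above can be sketched as a small script. The helper name write_gru_override is hypothetical and takes the target directory as an argument so it can be tried against a scratch directory first; the file content is the override quoted in this comment. The daemon-reload step is standard systemd behaviour after changing a drop-in and is not stated explicitly in the notes above.

```shell
#!/bin/sh
# Hypothetical helper: write the openqa-gru drop-in into the given directory.
write_gru_override() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir"
    # Quoted heredoc delimiter so nothing inside is expanded by the shell
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
Environment="OPENQA_JOB_DONE_HOOK_INCOMPLETE=env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook"
EOF
}

# Assumed usage on the real host, as root:
#   write_gru_override /etc/systemd/system/openqa-gru.service.d
#   systemctl daemon-reload        # make systemd pick up the drop-in
#   systemctl restart openqa-gru
#   journalctl -u openqa-gru       # check for "Permission denied" from apparmor
```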

https://openqa.opensuse.org/tests/1495076#comments is the job showing the results of my experiments in job labels.

The strange thing is that with just a config variable I see nothing from my temporary debugging statements, nor any Perl warning or error, but with an environment variable in place all the output is there. Using that, I could find out that the complete config is not read at all. Likely we just never read the config from the gru service process.

I also checked on progress.i.o.o with htop and in /opt/redmine/log/production.log that the repeated requests for tickets over the redmine API do no harm.

I have created a hook wrapper script for "label-known+investigate-unknown":
https://github.com/os-autoinst/scripts/pull/53

After merge this will be automatically deployed on o3, and we can then enable it as well with an env variable for the GRU service for failed (or even incomplete) jobs.

Updated description with suggestions on what to do next.

#6 Updated by okurz 6 months ago

  • Description updated (diff)

#7 Updated by okurz 6 months ago

  • Description updated (diff)

#8 Updated by okurz 6 months ago

  • Description updated (diff)

#9 Updated by okurz 6 months ago

  • Copied to action #80826: Trigger 'auto-review' from within openQA when jobs incomplete on osd as well added

#10 Updated by okurz 6 months ago

  • Copied to coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd added

#11 Updated by okurz 6 months ago

  • Description updated (diff)
  • Status changed from Feedback to Resolved
  • Parent task changed from #39719 to #80828

Split out other specific tickets and created parent epic #80828.

auto-review is now triggered on o3 directly from hook scripts for incomplete jobs, calling "openqa-label-known-issues" near-immediately when a job incompletes.
