action #80736
coordination #39719 (closed): [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd
Trigger 'auto-review' from within openQA when jobs incomplete (or fail) , for testing: auto_review:"tests died: unable to load main.pm, check the log for the cause"
Description
Motivation
auto-review does a good job, but we could benefit from running it more often. We should trigger it directly when openQA jobs incomplete or fail.
Acceptance criteria
- AC1: auto-review is triggered on o3 when jobs incomplete
Suggestions
- DONE: After some days check if this still works fine on o3
- DONE: Fix reading from config for gru service -> https://github.com/os-autoinst/openQA/pull/3622
- DONE: Enable for failed as well (with label-known+investigate)
- DONE: Review openQA documentation for the current support, e.g. to consider apparmor, config vs. env, etc. -> http://open.qa/docs/#_enable_custom_hook_scripts_on_job_done_based_on_result
- monitor impact on o3
- Think about a better approach for apparmor
Updated by okurz almost 4 years ago
- Copied from action #77944: Run "auto-review" more often but alarm less added
Updated by okurz almost 4 years ago
- Subject changed from Trigger "auto-review" from within openQA when jobs incomplete (or fail) to Trigger "auto-review" from within openQA when jobs incomplete (or fail) , for testing: auto_review:"tests died: unable to load main.pm, check the log for the cause"
On o3 as root:
cd /opt/
mkdir os-autoinst-scripts
chown geekotest os-autoinst-scripts
sudo -u geekotest git clone git@github.com:os-autoinst/scripts.git os-autoinst-scripts
In /etc/cron.d/os-autoinst-scripts-update-git:
-*/3 * * * * geekotest git -C /opt/os-autoinst-scripts pull --quiet --rebase origin master
In /etc/openqa/openqa.ini:
[hooks]
job_done_hook_incomplete = /opt/os-autoinst-scripts/openqa-label-known-issues-hook
and
systemctl restart openqa-webui openqa-gru
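Before restarting, it can help to confirm what the [hooks] section actually contains and that the configured script is executable. The following is only a minimal sketch: hook_from_ini is a hypothetical helper written for this ticket, not part of openQA, and it only shows what is in the file, not what the services actually read.

```shell
#!/bin/sh
# Sketch only: hook_from_ini is a hypothetical helper, not part of openQA.
# Usage: hook_from_ini INI_FILE KEY
# Prints the value of KEY from the [hooks] section of INI_FILE.
hook_from_ini() {
    awk -v key="$2" '
        # A section header starts with "["; remember whether it is [hooks]
        /^\[/ { in_hooks = ($0 == "[hooks]"); next }
        in_hooks {
            pos = index($0, "=")          # split on the first "=" only
            if (pos == 0) next
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)  # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)  # trim whitespace around the value
            if (k == key) print v
        }
    ' "$1"
}

# Example on o3 (paths as used in this ticket):
#   hook=$(hook_from_ini /etc/openqa/openqa.ini job_done_hook_incomplete)
#   [ -x "$hook" ] && echo "hook is executable: $hook"
```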
Triggered a job that would incomplete with openqa-cli api --o3 -X post jobs test=okurz_poo80736
and found minion job result https://openqa.opensuse.org/minion/jobs?id=335327 that shows
{
"args" => [
1495066,
undef
],
"attempts" => 1,
"children" => [],
"created" => "2020-12-04T19:33:19.6905Z",
"delayed" => "2020-12-04T19:33:19.6905Z",
"expires" => undef,
"finished" => "2020-12-04T19:33:19.77125Z",
"id" => 335327,
"lax" => 0,
"notes" => {
"gru_id" => 17288276
},
"parents" => [],
"priority" => 0,
"queue" => "default",
"result" => "Job successfully executed",
"retried" => undef,
"retries" => 0,
"started" => "2020-12-04T19:33:19.6963Z",
"state" => "finished",
"task" => "finalize_job_results",
"time" => "2020-12-04T19:35:14.04025Z",
"worker" => 471
}
so nothing seen from the hook. Calling the script with a job as argument manually reveals the problem:
/opt/os-autoinst-scripts/openqa-label-known-issues-hook 1495074
this shows "Connection refused" because https://openqa.opensuse.org is tried, which does not work when called from within the o3 network; http has to be used. Calling env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook 1495074 works, so we can set that in the config as well.
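To illustrate why setting scheme in the environment changes the behavior, here is a rough sketch of how a hook script could build the job URL from the job id it receives as first argument. This is an assumption about the mechanism, not the actual content of openqa-label-known-issues-hook; the function name job_url is made up for this example.

```shell
#!/bin/sh
# Sketch only: job_url is a hypothetical function, not the real hook script.
# Usage: job_url JOB_ID
# Builds the openQA job URL, taking the URL scheme from the environment
# variable "scheme" and falling back to https when it is unset or empty.
job_url() {
    url_scheme="${scheme:-https}"
    url_host="openqa.opensuse.org"
    printf '%s://%s/tests/%s\n' "$url_scheme" "$url_host" "$1"
}
```

With this sketch, job_url 1495074 would produce the https URL that fails from within the o3 network, while scheme=http job_url 1495074 would produce the http variant.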
Setting a custom auto_review regex in the subject line of this ticket which should label the job accordingly.
Updated by okurz almost 4 years ago
- Subject changed from Trigger "auto-review" from within openQA when jobs incomplete (or fail) , for testing: auto_review:"tests died: unable to load main.pm, check the log for the cause" to Trigger 'auto-review' from within openQA when jobs incomplete (or fail) , for testing: auto_review:"tests died: unable to load main.pm, check the log for the cause"
Updated by okurz almost 4 years ago
- Description updated (diff)
- Status changed from In Progress to Feedback
First nothing happened. Then on o3 I added some live debug statements to /usr/share/openqa/lib/OpenQA/Task/Job/FinalizeResults.pm . I have the hypothesis that the config variables are not read by the gru service, so trying with env variables. I put into /etc/systemd/system/openqa-gru.service.d/override.conf
[Service]
Environment="OPENQA_JOB_DONE_HOOK_INCOMPLETE=env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook"
which yields "Permission denied" in journalctl -u openqa-gru, due to apparmor. For testing the following steps I used the minion dashboard to retrigger the same minion job by selecting it and then clicking "Retry", instead of posting a new openQA job every time. So I ran aa-complain /etc/apparmor.d/usr.share.openqa.script.openqa and then aa-logprof -f /var/log/audit/audit.log, which suggested the following additions for /etc/apparmor.d/usr.share.openqa.script.openqa:
/opt/os-autoinst-scripts/** rix,
/usr/bin/cat rix,
/usr/bin/curl rix,
/usr/bin/jq rix,
/usr/bin/mktemp rix,
/usr/share/openqa/script/client rix,
and then again aa-enforce /etc/apparmor.d/usr.share.openqa.script.openqa.
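The systemd drop-in used above could also be written by a small script, which makes it easier to repeat for other job results. This is a sketch under assumptions: write_hook_override is a hypothetical helper, and the idea that results other than "incomplete" follow the same OPENQA_JOB_DONE_HOOK_<RESULT> naming is taken from the linked openQA documentation, not verified here.

```shell
#!/bin/sh
# Sketch only: write_hook_override is a hypothetical helper.
# Usage: write_hook_override DROPIN_DIR RESULT HOOK_COMMAND
# Writes DROPIN_DIR/override.conf setting OPENQA_JOB_DONE_HOOK_<RESULT>.
write_hook_override() {
    dropin_dir="$1"
    # Assumption: the variable name uses the upper-cased job result
    result_upper=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
    hook_cmd="$3"
    mkdir -p "$dropin_dir" || return 1
    printf '[Service]\nEnvironment="OPENQA_JOB_DONE_HOOK_%s=%s"\n' \
        "$result_upper" "$hook_cmd" > "$dropin_dir/override.conf"
}

# Example as root on o3 (paths as used in this ticket):
#   write_hook_override /etc/systemd/system/openqa-gru.service.d incomplete \
#       "env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook"
#   systemctl daemon-reload && systemctl restart openqa-gru
```

The daemon-reload is needed so that systemd picks up the changed drop-in before the restart.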
https://openqa.opensuse.org/tests/1495076#comments is the job showing the results of my experiments in job labels.
The strange thing is that when using just a config variable I do not see anything from my temporary debugging statements nor any perl warning or error or anything but with an environment variable in place all the output is there. However, using that I could find out that the complete config is not read at all. Likely we just never read the config from the gru service process.
I also checked on progress.i.o.o with htop and in /opt/redmine/log/production.log that the repeated requests for tickets over the redmine API do no harm.
I have created a hook wrapper script for "label-known+investigate-unknown":
https://github.com/os-autoinst/scripts/pull/53
After merge this will be automatically deployed on o3 and we can then enable it as well with an env variable for the GRU service for failed (or even incomplete) jobs.
Updated description with suggestions what to do next.
Updated by okurz almost 4 years ago
- Description updated (diff)
- https://github.com/os-autoinst/openQA/pull/3622 to remove the faulty config reading.
- Checked on o3 and everything still seems to work fine, e.g. https://openqa.opensuse.org/tests/1497351#comments although I wonder if I can make the comments show up from "auto-review" instead of "geekotest". Would need to use another user for the openQA client likely.
Updated by okurz almost 4 years ago
- Copied to action #80826: Trigger 'auto-review' from within openQA when jobs incomplete on osd as well added
Updated by okurz almost 4 years ago
- Copied to coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd added
Updated by okurz almost 4 years ago
- Description updated (diff)
- Status changed from Feedback to Resolved
- Parent task changed from #39719 to #80828
Split out other specific tickets and created parent epic #80828.
auto-review is now triggered on o3 directly from hook scripts for incomplete jobs, calling "openqa-label-known-issues" near-immediately when a job incompletes.