action #80830
closedcoordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd
Trigger 'openqa-investigate' from within openQA when jobs fail on o3
Added by okurz about 4 years ago. Updated about 4 years ago.
Description
Updated by okurz about 4 years ago
- Copied from coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd added
Updated by okurz about 4 years ago
Added the hook for failed jobs on o3 as well now, in /etc/openqa/openqa.ini:
job_done_hook_failed = env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
and the systemd service override:
Environment="OPENQA_JOB_DONE_HOOK_FAILED=env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook"
and apparmor addition:
/usr/bin/openqa-cli rix,
Monitoring with journalctl -f.
EDIT: It seems that apparmor still complains about openqa-cli. I also found that investigation jobs are triggered even for investigation jobs themselves, which we should prevent, maybe by filtering out jobs with ":investigate:" in the test name.
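A minimal sketch of such a name-based filter, assuming the hook has the test name available as a plain string. The function names `is_investigation_job` and `investigation_action` are hypothetical and not taken from os-autoinst/scripts:

```shell
# Hypothetical helper: succeeds when the test name marks an investigation job,
# i.e. when it contains ":investigate:".
is_investigation_job() {
    case "$1" in
        *:investigate:*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: prints what the hook would do for a given test name.
investigation_action() {
    if is_investigation_job "$1"; then
        echo "skip"
    else
        echo "investigate"
    fi
}
```

Called with a name like "okurz_poo80830:investigate:retry" this prints "skip", so investigation jobs would not spawn further investigation jobs.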
Updated by okurz about 4 years ago
I merged it myself without review now.
On o3 I have enabled the config in /etc/openqa/openqa.ini again and removed it from /etc/systemd/system/openqa-gru.service.d/override.conf after reading from the config was fixed:
job_done_hook_incomplete = env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook
job_done_hook_failed = env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook
then did
systemctl daemon-reload && systemctl restart openqa-webui openqa-gru
I guess we could add more filter options, e.g. a regex blocklist for group/parent_group or a check whether a job belongs to any group. For common candidates to ignore see also https://github.com/os-autoinst/scripts/pull/55/files , e.g. "%Kernel%|%Development%|%Staging%|%MicroOS%". Additionally we could exclude jobs outside any group. That would implicitly cover the ":investigate:" jobs, which we already exclude, and also something like "openqa-clone-custom-git-refspec" jobs.
Updated by okurz about 4 years ago
https://github.com/os-autoinst/scripts/pull/60 to allow filtering by job group, parent job group and "no group" in openqa-investigate
Updated by okurz about 4 years ago
- Due date set to 2020-12-15
- Status changed from In Progress to Feedback
merged, was automatically deployed to o3, now enabled again in /etc/openqa/openqa.ini with:
job_done_hook_failed = env scheme=http exclude_group_regex='(Development|Open Build Service|Others).*/.*(Kernel|Development|Staging|MicroOS)' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
Updated apparmor with aa-logprof. Found the problem that openqa-cli is by default called with https, causing "Connection refused" messages. I fixed that with https://github.com/os-autoinst/scripts/pull/61 and locally on o3 already. TODO: After merge of the PR check that a clean state is checked out in /opt/os-autoinst-scripts. In the journal I still see:
Dec 12 11:04:39 ariel openqa-gru[1606]: /opt/os-autoinst-scripts/openqa-investigate: line 100: /usr/bin/openqa-cli: Permission denied
Apparently I missed something in apparmor. The latest run of aa-logprof suggested to add #include <abstractions/openssl> to the apparmor profile as well. But still I see:
Dec 12 15:04:44 ariel openqa-gru[1606]: /opt/os-autoinst-scripts/openqa-investigate: line 100: /usr/bin/openqa-cli: Permission denied
Dec 12 15:04:44 ariel openqa-gru[1606]: unable to query job data for 1504727:
With this, no investigation jobs are actually triggered, so I now trigger a failure intentionally, based on opensuse-Tumbleweed-NET-aarch64-textmode@aarch64 :
openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP=0 BUILD= _SKIP_POST_FAIL_HOOKS=1
Created job #1504805: opensuse-Tumbleweed-NET-aarch64-Build20201211-textmode@aarch64 -> https://openqa.opensuse.org/t1504805
This should trigger an early failure in "isosize" as the only module.
This triggered the error again after about 30s of test execution. Trying with aa-disable /etc/apparmor.d/usr.share.openqa.script.openqa
This worked now: https://openqa.opensuse.org/tests/1504807#comment-100753 but there were error messages:
Dec 12 22:10:35 ariel openqa-gru[1606]: jq: error (at <stdin>:1): Cannot index string with string "text"
Dec 12 22:10:35 ariel openqa-gru[1606]: parse error: Invalid numeric literal at line 1, column 10
Updated https://github.com/os-autoinst/scripts/pull/61 accordingly and applied the patch on o3 temporarily.
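As an illustration of why "env scheme=http" in the hook line matters, here is a rough sketch of how a script could build its base URL with https as the default. This assumes the script reads the environment variables `scheme` and `host`. The function name `host_url` and the default host are hypothetical and not confirmed by the PR:

```shell
# Hypothetical sketch: default to https unless the caller overrides "scheme",
# e.g. via "env scheme=http" as in the openqa.ini hook line.
# The default host used here is an assumption for illustration only.
host_url() {
    printf '%s://%s\n' "${scheme:-https}" "${host:-openqa.opensuse.org}"
}
```

With such a default, a hook running on the web UI host itself would need scheme=http whenever the local service only answers plain http, which would match the "Connection refused" messages seen here.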
Enforced apparmor again. Triggering a failed job again created an investigation job just fine, with no entry in /var/log/audit/audit.log and no error entry in the journal.
Time to wait for review of my PR and see how it behaves over time.
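To watch how it behaves over time, the journal could be filtered for the error patterns seen so far. This is only a sketch: the function name `hook_errors` is hypothetical, and the pattern list just mirrors the messages quoted in this ticket:

```shell
# Hypothetical helper: reads journal lines on stdin and prints only those
# openqa-gru lines that match error messages observed in this ticket.
hook_errors() {
    grep -E 'openqa-gru\[[0-9]+\]: .*(Permission denied|unable to query job data|jq: error|parse error)' || true
}
```

One possible usage on o3 would be "journalctl -u openqa-gru --since today | hook_errors". Empty output would mean none of the known error patterns showed up.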
Updated by okurz about 4 years ago
- Due date changed from 2020-12-15 to 2020-12-22
No review of https://github.com/os-autoinst/scripts/pull/61 for 3 days, so I self-merged it now.
On o3 I ran
sudo -u geekotest git -C /opt/os-autoinst-scripts reset --hard origin/master
to get rid of the local patch.
With the SQL query on o3
select jobs.id, result_dir from comments,jobs,users where comments.user_id = users.id and comments.job_id = jobs.id and nickname ~ 'geekotest' and text ~ 'Automatic investigation jobs';
I could find that the latest job where automatic investigation jobs were triggered was 2020-12-1 01:13:49 +0000, but "auto-review" from the gitlab CI pipeline did that as recently as 2020-12-15, so maybe something broke.
Did
openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP=0 BUILD= _SKIP_POST_FAIL_HOOKS=1
which triggered https://openqa.opensuse.org/tests/1509703# and the minion job https://openqa.opensuse.org/minion/jobs?id=350616 which shows in "hook_result" of the minion job output details: "Job w/o job group, \$exclude_no_group is set, skipping investigation\n". Not beautiful but working :)
Next test case with "development job group" which should be excluded as well:
openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP_ID=38 BUILD= _SKIP_POST_FAIL_HOOKS=1
Created job #1509704: opensuse-Tumbleweed-NET-aarch64-Build20201211-textmode@aarch64 -> https://openqa.opensuse.org/t1509704
where group id "38" is Development / Development Tumbleweed.
This yields https://openqa.opensuse.org/minion/jobs?id=350617 with "Job group 'Development Tumbleweed / Development' matches \$exclude_group_regex '(Development|Open Build Service|Others).*/.*(Kernel|Development|Staging|MicroOS)', skipping investigation\n". Also nice, as expected.
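The two skip decisions seen so far (job without group, job group matching the regex) can be sketched roughly as follows. The regex is the one from the openqa.ini line above. The function name `investigation_decision` and the convention of passing an empty string for "no group" are assumptions for illustration, not the actual openqa-investigate code:

```shell
# Regex copied from the job_done_hook_failed line in openqa.ini.
exclude_group_regex='(Development|Open Build Service|Others).*/.*(Kernel|Development|Staging|MicroOS)'
# Assumed switch mirroring the "$exclude_no_group is set" message.
exclude_no_group=true

# Hypothetical helper: prints "skip" or "investigate" for a job group string.
# An empty argument stands for a job outside any group.
investigation_decision() {
    group=$1
    if [ -z "$group" ]; then
        if [ "$exclude_no_group" = true ]; then
            echo "skip"
        else
            echo "investigate"
        fi
        return
    fi
    if printf '%s\n' "$group" | grep -Eq "$exclude_group_regex"; then
        echo "skip"
    else
        echo "investigate"
    fi
}
```

With this sketch, 'Development Tumbleweed / Development' and an empty group both yield "skip", while a group string without a matching parent such as 'openQA' yields "investigate".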
Next test in a valid job group that is not ignored:
openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP_ID=24 BUILD= _SKIP_POST_FAIL_HOOKS=1
Created job #1509705: opensuse-Tumbleweed-NET-aarch64-Build20201211-textmode@aarch64 -> https://openqa.opensuse.org/t1509705
where group id "24" is openQA with the minion details https://openqa.opensuse.org/minion/jobs?id=350618 "hook_result" => "{\n \"id\" : 101174\n}\n"
which is the comment id within https://openqa.opensuse.org/tests/1509705#comment-101174 with
Automatic investigation jobs:
okurz_poo80830:investigate:retry: No last good recorded, skipping regression investigation jobs
also nice. I have deleted the latest job again for cleanup so that it does not pollute the job group.
I am not yet fully convinced that this all works reliably, so I want to monitor the situation further.
TODO:
- Monitor after some days if jobs are also triggered for example in Tumbleweed aarch64, e.g. https://openqa.opensuse.org/tests/latest?arch=aarch64&distri=opensuse&flavor=DVD&machine=aarch64&test=desktopapps-documentation-x11&version=Tumbleweed
- With the SQL query on o3
select jobs.id, result_dir,t_finished from comments,jobs,users where comments.user_id = users.id and comments.job_id = jobs.id and nickname ~ 'geekotest' and text ~ 'Automatic investigation jobs' order by id DESC limit 10;
check if new comments are posted
Updated by okurz about 4 years ago
And now I see in o3 system journal:
Dec 16 12:42:05 ariel openqa-gru[1784]: /opt/os-autoinst-scripts/openqa-investigate: line 27: /usr/bin/openqa-clone-job: Permission denied
Dec 16 12:42:06 ariel openqa-gru[1784]: ERROR: 404 - Not Found
I thought I had apparmor covered :D
aa-logprof suggested the following changes:
--- /etc/apparmor.d/usr.share.openqa.script.openqa 2020-12-15 19:50:56.466691761 +0000
+++ /tmp/tmpxwssfplu 2020-12-16 12:45:02.175185619 +0000
@@ -1,4 +1,4 @@
-# Last Modified: Tue Dec 15 19:50:56 2020
+# Last Modified: Wed Dec 16 12:45:02 2020
#include <tunables/global>
# ------------------------------------------------------------------
@@ -42,6 +42,7 @@
/usr/bin/mktemp rix,
/usr/bin/openqa-cli rix,
/usr/bin/perl ix,
+ /usr/bin/perl ix,
/usr/bin/rm rix,
/usr/bin/rsync mrix,
/usr/bin/ssh rcx,
@@ -61,6 +62,8 @@
/usr/share/openqa/script/client rix,
/usr/share/openqa/script/openqa r,
/usr/share/openqa/script/openqa-cli px,
+ /usr/share/openqa/script/openqa-clone-job mrix,
+ /usr/share/openqa/script/openqa-clone-job r,
/usr/share/openqa/templates/ r,
/usr/share/openqa/templates/** r,
/var/lib/openqa/.config/openqa/client.conf r,
= Changed Local Profiles =
The following local profiles were changed. Would you like to save them?
[1 - /usr/share/openqa/script/openqa]
2 - /usr/share/openqa/script/openqa-cli
which look a bit weird but ok. Saved and will check further.
Updated by okurz about 4 years ago
- Copied to action #81206: Trigger 'openqa-investigate' from within openQA when jobs fail on osd added
Updated by okurz about 4 years ago
/etc/apparmor.d/local/usr.share.openqa.script.openqa has more changes that are specific to the instance, e.g. including changes for openqa-trigger-from-obs. Not sure if we want to include them all in our basic profile but maybe in the "site-specific" extension profile:
Updated by livdywan about 4 years ago
- Due date changed from 2020-12-22 to 2021-01-08
Bumping due date to account for holidays
Updated by okurz about 4 years ago
- Status changed from Feedback to Resolved
Actually with the PR merged everything is done :)