action #80830

closed

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd

Trigger 'openqa-investigate' from within openQA when jobs fail on o3

Added by okurz about 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-12-08
Due date:
% Done:

0%

Estimated time:

Description

Acceptance criteria

  • AC1: openqa-investigate is triggered on o3 when jobs fail

Suggestions

  • Same as in #80828 but not only "label-known" but also "openqa-investigate"
  • monitor impact on o3
  • Think about a better approach for apparmor

Related issues 1 (0 open, 1 closed)

Copied to openQA Project (public) - action #81206: Trigger 'openqa-investigate' from within openQA when jobs fail on osd | Resolved | okurz

Actions #1

Updated by okurz about 4 years ago

  • Copied from coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd added
Actions #2

Updated by okurz about 4 years ago

  • Parent task changed from #39719 to #80828
Actions #3

Updated by okurz about 4 years ago

Added the hook for failed jobs on o3 as well now, in /etc/openqa/openqa.ini:

job_done_hook_failed = env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook

and the systemd service override:

Environment="OPENQA_JOB_DONE_HOOK_FAILED=env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook"

and apparmor addition:

  /usr/bin/openqa-cli rix,

Now monitoring with journalctl -f.

EDIT: It seems that apparmor still complains about openqa-cli. I also found that investigation jobs are triggered even for investigation jobs themselves, which we should prevent. Maybe filter out jobs with ":investigate:" in the test name.
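
The suggested filter on ":investigate:" in the test name could look roughly like the following. This is a minimal sketch, not the code from os-autoinst-scripts; the function names is_investigation_job and investigate_decision are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the test name marks an
# investigation job, i.e. it contains ":investigate:".
is_investigation_job() {
    case "$1" in
        *:investigate:*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: print what the hook should do for a failed job
# with the given test name.
investigate_decision() {
    if is_investigation_job "$1"; then
        echo "skip"
    else
        echo "investigate"
    fi
}

investigate_decision "okurz_poo80830:investigate:retry"   # prints "skip"
investigate_decision "textmode"                           # prints "investigate"
```

Matching on the test name keeps the check cheap, as it needs no additional API query beyond the job data the hook already has.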

Actions #4

Updated by okurz about 4 years ago

  • Target version changed from future to Ready
Actions #6

Updated by okurz about 4 years ago

I merged it myself without review now.

On o3 I have enabled the config in /etc/openqa/openqa.ini again and removed it from /etc/systemd/system/openqa-gru.service.d/override.conf after reading from config was fixed:

job_done_hook_incomplete = env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook
job_done_hook_failed = env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook

then did

systemctl daemon-reload && systemctl restart openqa-webui openqa-gru

I guess we could add more filter options, e.g. a regex blocklist for group/parent_group or a check if a job belongs to any group. For common candidates to ignore also see https://github.com/os-autoinst/scripts/pull/55/files , e.g. "%Kernel%|%Development%|%Staging%|%MicroOS%". Additionally we could exclude jobs outside any group. This would implicitly also exclude the ":investigate:" jobs, which we already exclude, and also something like "openqa-clone-custom-git-refspec" jobs.

Actions #7

Updated by okurz about 4 years ago

https://github.com/os-autoinst/scripts/pull/60 to allow filtering by job group, parent job group, or no group in openqa-investigate

Actions #8

Updated by okurz almost 4 years ago

  • Due date set to 2020-12-15
  • Status changed from In Progress to Feedback

merged, was automatically deployed to o3, now enabled again in /etc/openqa/openqa.ini with:

job_done_hook_failed = env scheme=http exclude_group_regex='(Development|Open Build Service|Others).*/.*(Kernel|Development|Staging|MicroOS)' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
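
How such a group filter could decide whether to skip a job can be sketched as follows. This is only an illustration under assumptions: the function should_skip_investigation is hypothetical, the parameter names mirror the settings named in this ticket, and the actual logic in openqa-investigate may differ.

```shell
#!/bin/sh
# Hypothetical sketch, not the real openqa-investigate code.
# $1: group string as reported for the job, empty if the job has no group
# $2: extended regex of groups to exclude (may be empty)
# $3: non-empty to also exclude jobs without any group
# Succeeds (exit 0) if the investigation should be skipped.
should_skip_investigation() {
    group=$1
    exclude_group_regex=$2
    exclude_no_group=$3
    if [ -z "$group" ]; then
        # Job outside any group: skip only if requested
        [ -n "$exclude_no_group" ] && return 0
        return 1
    fi
    # Without a regex nothing is excluded
    [ -n "$exclude_group_regex" ] || return 1
    printf '%s\n' "$group" | grep -Eq -- "$exclude_group_regex"
}

re='(Development|Open Build Service|Others).*/.*(Kernel|Development|Staging|MicroOS)'
if should_skip_investigation 'Development Tumbleweed / Development' "$re" 1; then
    echo "skipping investigation"
fi
```

Because the regex requires a "/" between the two alternations, a group string without a parent part, such as a plain "openQA", is not excluded.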

Updated apparmor with aa-logprof. Found the problem that openqa-cli is called with https by default, causing "Connection refused" messages. I fixed that with https://github.com/os-autoinst/scripts/pull/61 and locally on o3 already. TODO: After merge of the PR check that a clean state is checked out in /opt/os-autoinst-scripts.

In the journal I still see:

Dec 12 11:04:39 ariel openqa-gru[1606]: /opt/os-autoinst-scripts/openqa-investigate: line 100: /usr/bin/openqa-cli: Permission denied

Apparently I missed something in apparmor. The latest run of aa-logprof suggested adding #include <abstractions/openssl> to the apparmor profile as well. But still I see:

Dec 12 15:04:44 ariel openqa-gru[1606]: /opt/os-autoinst-scripts/openqa-investigate: line 100: /usr/bin/openqa-cli: Permission denied
Dec 12 15:04:44 ariel openqa-gru[1606]: unable to query job data for 1504727:

With this no investigation jobs are actually triggered, so I am triggering one intentionally now, based on opensuse-Tumbleweed-NET-aarch64-textmode@aarch64:

openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP=0 BUILD= _SKIP_POST_FAIL_HOOKS=1

Created job #1504805: opensuse-Tumbleweed-NET-aarch64-Build20201211-textmode@aarch64 -> https://openqa.opensuse.org/t1504805

This should trigger an early failure in "isosize" as the only module.

This triggered the error again after about 30s of test execution. Trying with aa-disable /etc/apparmor.d/usr.share.openqa.script.openqa

This worked now: https://openqa.opensuse.org/tests/1504807#comment-100753 but there had been error messages:

Dec 12 22:10:35 ariel openqa-gru[1606]: jq: error (at <stdin>:1): Cannot index string with string "text"
Dec 12 22:10:35 ariel openqa-gru[1606]: parse error: Invalid numeric literal at line 1, column 10

updated https://github.com/os-autoinst/scripts/pull/61 accordingly and applied the patch on o3 temporarily.

Enforced apparmor again. Triggering again a failed job triggered an investigation job just fine, no entry in /var/log/audit/audit.log, no error entry in journal.

Time to wait for review of my PR and see how it behaves over time.
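
The "Connection refused" problem above came from the scheme defaulting to https while the hook on o3 is configured with "env scheme=http". A minimal sketch of such a default with an override could look like this; the function name build_host_url and the default host are assumptions for illustration, not taken from the scripts.

```shell
#!/bin/sh
# Hypothetical helper: build the base URL used for API calls.
# $1: scheme, defaults to https when empty or unset
# $2: host, defaults to openqa.opensuse.org (illustrative assumption)
build_host_url() {
    url_scheme=${1:-https}
    url_host=${2:-openqa.opensuse.org}
    printf '%s://%s\n' "$url_scheme" "$url_host"
}

# In a hook the values would come from the environment,
# e.g. as set via "env scheme=http" in openqa.ini.
build_host_url "${scheme:-}" "${host:-}"

# Explicit override, as needed when the web UI is only reachable via http
build_host_url http localhost   # prints "http://localhost"
```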

Actions #9

Updated by okurz almost 4 years ago

  • Due date changed from 2020-12-15 to 2020-12-22

No review of https://github.com/os-autoinst/scripts/pull/61 for 3 days, self-merged now.

On o3 I ran sudo -u geekotest git -C /opt/os-autoinst-scripts reset --hard origin/master to get rid of the local patch.

With the following SQL query on o3:

select jobs.id, result_dir from comments,jobs,users where comments.user_id = users.id and comments.job_id = jobs.id and nickname ~ 'geekotest' and text ~ 'Automatic investigation jobs';

I could find that the latest job where automatic investigation jobs were triggered was 2020-12-1 01:13:49 +0000, but "auto-review" from the gitlab CI pipeline did that as recently as 2020-12-15, so maybe something broke.

I ran

openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP=0 BUILD= _SKIP_POST_FAIL_HOOKS=1

which triggered https://openqa.opensuse.org/tests/1509703# and the minion job https://openqa.opensuse.org/minion/jobs?id=350616, which shows in "hook_result" of the minion job output details: Job w/o job group, \$exclude_no_group is set, skipping investigation\n. Not beautiful but working :)

Next test case with "development job group" which should be excluded as well:

openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP_ID=38 BUILD= _SKIP_POST_FAIL_HOOKS=1

Created job #1509704: opensuse-Tumbleweed-NET-aarch64-Build20201211-textmode@aarch64 -> https://openqa.opensuse.org/t1509704
where group id "38" is Development / Development Tumbleweed.

This yields https://openqa.opensuse.org/minion/jobs?id=350617 with "Job group 'Development Tumbleweed / Development' matches \$exclude_group_regex '(Development|Open Build Service|Others).*/.*(Kernel|Development|Staging|MicroOS)', skipping investigation\n". Also nice, as expected.

Next test in a valid job group that is not ignored:

openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP_ID=24 BUILD= _SKIP_POST_FAIL_HOOKS=1

Created job #1509705: opensuse-Tumbleweed-NET-aarch64-Build20201211-textmode@aarch64 -> https://openqa.opensuse.org/t1509705
where group id "24" is openQA. The minion details https://openqa.opensuse.org/minion/jobs?id=350618 show "hook_result" => "{\n \"id\" : 101174\n}\n", which is the comment id within https://openqa.opensuse.org/tests/1509705#comment-101174 with:

Automatic investigation jobs:
    okurz_poo80830:investigate:retry: No last good recorded, skipping regression investigation jobs

Also nice. I have deleted the latest job again for cleanup so that it does not pollute the job group.

I am not yet fully convinced that this all works stably, so I want to monitor the situation further.

TODO:

Actions #10

Updated by okurz almost 4 years ago

And now I see in o3 system journal:

Dec 16 12:42:05 ariel openqa-gru[1784]: /opt/os-autoinst-scripts/openqa-investigate: line 27: /usr/bin/openqa-clone-job: Permission denied
Dec 16 12:42:06 ariel openqa-gru[1784]: ERROR: 404 - Not Found

I thought I had apparmor covered :D

logprof suggested the following changes:

--- /etc/apparmor.d/usr.share.openqa.script.openqa      2020-12-15 19:50:56.466691761 +0000
+++ /tmp/tmpxwssfplu    2020-12-16 12:45:02.175185619 +0000
@@ -1,4 +1,4 @@
-# Last Modified: Tue Dec 15 19:50:56 2020
+# Last Modified: Wed Dec 16 12:45:02 2020
 #include <tunables/global>

 # ------------------------------------------------------------------
@@ -42,6 +42,7 @@
   /usr/bin/mktemp rix,
   /usr/bin/openqa-cli rix,
   /usr/bin/perl ix,
+  /usr/bin/perl ix,
   /usr/bin/rm rix,
   /usr/bin/rsync mrix,
   /usr/bin/ssh rcx,
@@ -61,6 +62,8 @@
   /usr/share/openqa/script/client rix,
   /usr/share/openqa/script/openqa r,
   /usr/share/openqa/script/openqa-cli px,
+  /usr/share/openqa/script/openqa-clone-job mrix,
+  /usr/share/openqa/script/openqa-clone-job r,
   /usr/share/openqa/templates/ r,
   /usr/share/openqa/templates/** r,
   /var/lib/openqa/.config/openqa/client.conf r,

= Changed Local Profiles =

The following local profiles were changed. Would you like to save them?

 [1 - /usr/share/openqa/script/openqa]
  2 - /usr/share/openqa/script/openqa-cli

which looks a bit weird but ok. Saved and will check further.
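
To spot executables still blocked by apparmor, the "Permission denied" journal entries shown above can be reduced to a list of paths. The following is a rough sketch; the helper name denied_paths is hypothetical and the pattern only covers the message format seen in this ticket.

```shell
#!/bin/sh
# Hypothetical helper: read journal lines on stdin and print each
# executable path that was rejected with "Permission denied", once.
denied_paths() {
    sed -n 's|.*: line [0-9][0-9]*: \(/[^:]*\): Permission denied$|\1|p' | sort -u
}

# Typical use on the host (not run here):
#   journalctl -u openqa-gru --since today | denied_paths
printf '%s\n' \
    'Dec 16 12:42:05 ariel openqa-gru[1784]: /opt/os-autoinst-scripts/openqa-investigate: line 27: /usr/bin/openqa-clone-job: Permission denied' \
    'Dec 16 12:42:06 ariel openqa-gru[1784]: ERROR: 404 - Not Found' |
    denied_paths   # prints "/usr/bin/openqa-clone-job"
```

Each printed path is a candidate for a rule in the apparmor profile, to be reviewed with aa-logprof rather than added blindly.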

Actions #11

Updated by okurz almost 4 years ago

  • Copied to action #81206: Trigger 'openqa-investigate' from within openQA when jobs fail on osd added
Actions #12

Updated by okurz almost 4 years ago

/etc/apparmor.d/local/usr.share.openqa.script.openqa has more changes that are specific to the instance, e.g. including changes for openqa-trigger-from-obs. Not sure if we want to include them all in our basic profile but maybe in the "site-specific" extension profile:

https://github.com/os-autoinst/openQA/pull/3651

Actions #13

Updated by livdywan almost 4 years ago

  • Due date changed from 2020-12-22 to 2021-01-08

Bumping due date to account for holidays

Actions #14

Updated by okurz almost 4 years ago

  • Status changed from Feedback to Resolved

Actually with the PR merged everything is done :)

Actions #15

Updated by okurz almost 4 years ago

  • Due date deleted (2021-01-08)