https://progress.opensuse.org/https://progress.opensuse.org/themes/openSUSE/favicon/favicon.ico?15829177842020-12-08T08:46:31ZopenSUSE Project Management ToolopenQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3568322020-12-08T08:46:31Zokurzokurz@suse.com
<ul><li><strong>Copied from</strong> <i><a class="issue tracker-6 status-3 priority-4 priority-default closed child parent" href="/issues/80828">coordination #80828</a>: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd</i> added</li></ul> openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3568362020-12-08T08:48:42Zokurzokurz@suse.com
<ul><li><strong>Parent task</strong> changed from <i>#39719</i> to <i>#80828</i></li></ul> openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3568422020-12-08T08:52:43Zokurzokurz@suse.com
<ul></ul><p>Added the hook for failed on o3 as well now, in /etc/openqa/openqa.ini:</p>
<pre><code>job_done_hook_failed = env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
</code></pre>
<p>and the systemd service override:</p>
<pre><code>Environment="OPENQA_JOB_DONE_HOOK_FAILED=env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook"
</code></pre>
<p>and apparmor addition:</p>
<pre><code> /usr/bin/openqa-cli rix,
</code></pre>
<p>monitoring <code>journalctl -f</code>.</p>
<p>EDIT: AppArmor still seems to complain about openqa-cli. I also found that investigation jobs are triggered even for <em>investigation</em> jobs themselves, which we should prevent, maybe by filtering out jobs with ":investigate:" in the test name.</p>
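<p>A minimal sketch of the name-based filter suggested above, assuming the hook has the test name available as a plain string. The function names <code>is_investigation_job</code> and <code>maybe_investigate</code> are hypothetical and not taken from os-autoinst/scripts.</p>

```shell
# Hypothetical helper, not taken from os-autoinst/scripts.
# Succeeds (exit status 0) when the test name marks an investigation job.
is_investigation_job() {
    case "$1" in
        *:investigate:*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints "skip" for investigation jobs and "investigate" for everything else,
# so that investigation jobs never spawn further investigation jobs.
maybe_investigate() {
    if is_investigation_job "$1"; then
        echo "skip"
    else
        echo "investigate"
    fi
}
```

<p>For example, <code>maybe_investigate "okurz_poo80830:investigate:retry"</code> would print <code>skip</code>, while a plain test name such as <code>textmode</code> would print <code>investigate</code>.</p>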
openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3568442020-12-08T08:53:33Zokurzokurz@suse.com
<ul><li><strong>Target version</strong> changed from <i>future</i> to <i>Ready</i></li></ul> openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3569402020-12-08T14:49:11Zokurzokurz@suse.com
<ul></ul><p><a href="https://github.com/os-autoinst/scripts/pull/58" class="external">https://github.com/os-autoinst/scripts/pull/58</a></p>
openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3576562020-12-10T15:36:06Zokurzokurz@suse.com
<ul></ul><p>I merged it myself without review now.</p>
<p>On o3 I have enabled the config in /etc/openqa/openqa.ini again and removed from /etc/systemd/system/openqa-gru.service.d/override.conf after reading from config was fixed:</p>
<pre><code>job_done_hook_incomplete = env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook
job_done_hook_failed = env scheme=http /opt/os-autoinst-scripts/openqa-label-known-issues-hook
</code></pre>
<p>then did</p>
<pre><code>systemctl daemon-reload && systemctl restart openqa-webui openqa-gru
</code></pre>
<p>I guess we could add more filter options, e.g. a regex blocklist for group/parent_group or a check whether a job belongs to any group. For common candidates to ignore see also <a href="https://github.com/os-autoinst/scripts/pull/55/files" class="external">https://github.com/os-autoinst/scripts/pull/55/files</a> , e.g. "%Kernel%|%Development%|%Staging%|%MicroOS%". Additionally we could exclude jobs outside any group. That would implicitly exclude the ":investigate:" jobs, which we already exclude, and also something like "openqa-clone-custom-git-refspec" jobs.</p>
openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3577362020-12-11T08:32:57Zokurzokurz@suse.com
<ul></ul><p><a href="https://github.com/os-autoinst/scripts/pull/60" class="external">https://github.com/os-autoinst/scripts/pull/60</a> to allow filtering job group, parent job group, no group in openqa-investigate</p>
openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3578462020-12-12T22:42:40Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> set to <i>2020-12-15</i></li><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p>merged, was automatically deployed to o3, now enabled again in /etc/openqa/openqa.ini with:</p>
<pre><code>job_done_hook_failed = env scheme=http exclude_group_regex='(Development|Open Build Service|Others).*/.*(Kernel|Development|Staging|MicroOS)' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
</code></pre>
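<p>The <code>env scheme=http</code> prefix in the hook line passes a setting to the script through its environment. As a rough sketch of how a script could consume such a variable with a fallback (the function name <code>build_host_url</code> and the default of <code>https</code> in this sketch are assumptions for illustration, not the actual script code):</p>

```shell
# Hypothetical sketch: only the variable name "scheme" comes from the hook line,
# everything else is an assumption.
# $1: host name; the scheme is read from the environment, defaulting to https.
build_host_url() {
    echo "${scheme:-https}://$1"
}
```

<p>With <code>scheme</code> unset, <code>build_host_url localhost</code> would print <code>https://localhost</code>. With <code>scheme=http</code> in the environment it would print <code>http://localhost</code>.</p>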
<p>Updated apparmor with <code>aa-logprof</code>. Found the problem that openqa-cli is by default called with https causing "Connection refused" messages. I fixed that with <a href="https://github.com/os-autoinst/scripts/pull/61">https://github.com/os-autoinst/scripts/pull/61</a> and locally on o3 already. TODO: After merge of PR check that clean state is checked out in /opt/os-autoinst-scripts. In journal I still see <code>Dec 12 11:04:39 ariel openqa-gru[1606]: /opt/os-autoinst-scripts/openqa-investigate: line 100: /usr/bin/openqa-cli: Permission denied</code>. Apparently I missed something in apparmor. The latest change in aa-logprof suggested to include <code>#include <abstractions/openssl></code> in the apparmor profile as well. But still I see:</p>
<pre><code>Dec 12 15:04:44 ariel openqa-gru[1606]: /opt/os-autoinst-scripts/openqa-investigate: line 100: /usr/bin/openqa-cli: Permission denied
Dec 12 15:04:44 ariel openqa-gru[1606]: unable to query job data for 1504727:
</code></pre>
<p>Because of this, no investigation jobs are actually triggered, so I am triggering one intentionally now. Based on <a href="https://openqa.opensuse.org/tests/latest?arch=aarch64&distri=opensuse&flavor=NET&machine=aarch64&test=textmode&version=Tumbleweed" class="external">opensuse-Tumbleweed-NET-aarch64-textmode@aarch64</a> :</p>
<pre><code>openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP=0 BUILD= _SKIP_POST_FAIL_HOOKS=1
</code></pre>
<p>Created job #1504805: opensuse-Tumbleweed-NET-aarch64-Build20201211-textmode@aarch64 -> <a href="https://openqa.opensuse.org/t1504805">https://openqa.opensuse.org/t1504805</a></p>
<p>This should trigger an early failure in "isosize" as only module.</p>
<p>This triggered the error again after about 30s of test execution. Trying with <code>aa-disable /etc/apparmor.d/usr.share.openqa.script.openqa</code></p>
<p>This worked now: <a href="https://openqa.opensuse.org/tests/1504807#comment-100753">https://openqa.opensuse.org/tests/1504807#comment-100753</a> but there had been error messages:</p>
<pre><code>Dec 12 22:10:35 ariel openqa-gru[1606]: jq: error (at <stdin>:1): Cannot index string with string "text"
Dec 12 22:10:35 ariel openqa-gru[1606]: parse error: Invalid numeric literal at line 1, column 10
</code></pre>
<p>updated <a href="https://github.com/os-autoinst/scripts/pull/61">https://github.com/os-autoinst/scripts/pull/61</a> accordingly and applied the patch on o3 temporarily.</p>
<p>Enforced apparmor again. Triggering a failed job again created an investigation job just fine, with no entry in /var/log/audit/audit.log and no error entry in the journal.</p>
<p>Time to wait for review of my PR and see how it behaves over time.</p>
openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3583982020-12-15T20:17:03Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> changed from <i>2020-12-15</i> to <i>2020-12-22</i></li></ul><p>no review of <a href="https://github.com/os-autoinst/scripts/pull/61">https://github.com/os-autoinst/scripts/pull/61</a> for 3 days, self-merged now.</p>
<p>Did on o3 <code>sudo -u geekotest git -C /opt/os-autoinst-scripts reset --hard origin/master</code> to get rid of the local patch.</p>
<p>With the SQL query on o3 <code>select jobs.id, result_dir from comments,jobs,users where comments.user_id = users.id and comments.job_id = jobs.id and nickname ~ 'geekotest' and text ~ 'Automatic investigation jobs';</code> I found that the latest job for which automatic investigation jobs were triggered was from 2020-12-1 01:13:49 +0000, whereas "auto-review" from the GitLab CI pipeline did that as recently as 2020-12-15, so maybe something broke.</p>
<p>Did <code>openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP=0 BUILD= _SKIP_POST_FAIL_HOOKS=1</code> which triggered <a href="https://openqa.opensuse.org/tests/1509703#">https://openqa.opensuse.org/tests/1509703#</a> and the minion job <a href="https://openqa.opensuse.org/minion/jobs?id=350616">https://openqa.opensuse.org/minion/jobs?id=350616</a> which shows in "hook_result" of the minion job output details: <code>Job w/o job group, \$exclude_no_group is set, skipping investigation\n</code>. Not beautiful but working :)</p>
<p>Next test case with "development job group" which should be excluded as well:</p>
<pre><code>openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP_ID=38 BUILD= _SKIP_POST_FAIL_HOOKS=1
</code></pre>
<p>Created job #1509704: opensuse-Tumbleweed-NET-aarch64-Build20201211-textmode@aarch64 -> <a href="https://openqa.opensuse.org/t1509704">https://openqa.opensuse.org/t1509704</a><br>
where group id "38" is <a href="https://openqa.opensuse.org/group_overview/38" class="external">Development / Development Tumbleweed</a>.</p>
<p>This yields <a href="https://openqa.opensuse.org/minion/jobs?id=350617">https://openqa.opensuse.org/minion/jobs?id=350617</a> with "Job group 'Development Tumbleweed / Development' matches \$exclude_group_regex '(Development|Open Build Service|Others).*/.*(Kernel|Development|Staging|MicroOS)', skipping investigation\n". Also nice, as expected.</p>
<p>Next test in a valid job group that is not ignored:</p>
<pre><code>openqa_clone_job_o3 --skip-chained-deps 1503691 INCLUDE_MODULES=isosize ISO_MAXSIZE=42 TEST=okurz_poo80830 _GROUP_ID=24 BUILD= _SKIP_POST_FAIL_HOOKS=1
</code></pre>
<p>Created job #1509705: opensuse-Tumbleweed-NET-aarch64-Build20201211-textmode@aarch64 -> <a href="https://openqa.opensuse.org/t1509705">https://openqa.opensuse.org/t1509705</a><br>
where group id "24" is <a href="https://openqa.opensuse.org/group_overview/24" class="external">openQA</a> with the minion details <a href="https://openqa.opensuse.org/minion/jobs?id=350618">https://openqa.opensuse.org/minion/jobs?id=350618</a> <code>"hook_result" => "{\n \"id\" : 101174\n}\n"</code> which is the comment id within <a href="https://openqa.opensuse.org/tests/1509705#comment-101174">https://openqa.opensuse.org/tests/1509705#comment-101174</a> with</p>
<pre><code>Automatic investigation jobs:
okurz_poo80830:investigate:retry: No last good recorded, skipping regression investigation jobs
</code></pre>
<p>also nice. I have deleted the latest job again for cleanup so that it does not pollute the job group.</p>
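<p>The three test cases above (no group, excluded group, allowed group) can be summarized in a small sketch. It assumes the job group is available as a single "group / parent group" string, as in the minion output quoted above. The function name <code>investigation_decision</code> and its output strings are hypothetical; only <code>exclude_no_group</code> and <code>exclude_group_regex</code> appear in the hook messages.</p>

```shell
# Hypothetical sketch; function name, argument and output strings are assumptions.
# $1: job group description, empty when the job has no group.
# Reads exclude_no_group and exclude_group_regex from the environment.
investigation_decision() {
    group="$1"
    if [ -z "$group" ] && [ -n "${exclude_no_group:-}" ]; then
        # Job outside any group and exclude_no_group is set
        echo "skip: job without job group"
    elif [ -n "$group" ] && [ -n "${exclude_group_regex:-}" ] &&
        printf '%s\n' "$group" | grep -Eq -- "$exclude_group_regex"; then
        # Group description matches the extended regex blocklist
        echo "skip: job group matches exclude_group_regex"
    else
        echo "investigate"
    fi
}
```

<p>With the regex configured in /etc/openqa/openqa.ini, a group description of <code>Development Tumbleweed / Development</code> would be skipped, while <code>openQA</code> would be investigated because it contains no "/" and so cannot match.</p>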
<p>I am not yet fully convinced that this works reliably, so I want to monitor the situation further.</p>
<p>TODO:</p>
<ul>
<li>Monitor after some days if jobs are also triggered for example in Tumbleweed aarch64, e.g. <a href="https://openqa.opensuse.org/tests/latest?arch=aarch64&distri=opensuse&flavor=DVD&machine=aarch64&test=desktopapps-documentation-x11&version=Tumbleweed">https://openqa.opensuse.org/tests/latest?arch=aarch64&distri=opensuse&flavor=DVD&machine=aarch64&test=desktopapps-documentation-x11&version=Tumbleweed</a></li>
<li>With the SQL query on o3 <code>select jobs.id, result_dir,t_finished from comments,jobs,users where comments.user_id = users.id and comments.job_id = jobs.id and nickname ~ 'geekotest' and text ~ 'Automatic investigation jobs' order by id DESC limit 10;</code> check if new comments are posted</li>
</ul>
openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3585762020-12-16T12:46:07Zokurzokurz@suse.com
<ul></ul><p>And now I see in o3 system journal:</p>
<pre><code>Dec 16 12:42:05 ariel openqa-gru[1784]: /opt/os-autoinst-scripts/openqa-investigate: line 27: /usr/bin/openqa-clone-job: Permission denied
Dec 16 12:42:06 ariel openqa-gru[1784]: ERROR: 404 - Not Found
</code></pre>
<p>I thought I had apparmor covered :D</p>
<p>logprof suggested the following changes:</p>
<pre><code>--- /etc/apparmor.d/usr.share.openqa.script.openqa 2020-12-15 19:50:56.466691761 +0000
+++ /tmp/tmpxwssfplu 2020-12-16 12:45:02.175185619 +0000
@@ -1,4 +1,4 @@
-# Last Modified: Tue Dec 15 19:50:56 2020
+# Last Modified: Wed Dec 16 12:45:02 2020
#include <tunables/global>
# ------------------------------------------------------------------
@@ -42,6 +42,7 @@
/usr/bin/mktemp rix,
/usr/bin/openqa-cli rix,
/usr/bin/perl ix,
+ /usr/bin/perl ix,
/usr/bin/rm rix,
/usr/bin/rsync mrix,
/usr/bin/ssh rcx,
@@ -61,6 +62,8 @@
/usr/share/openqa/script/client rix,
/usr/share/openqa/script/openqa r,
/usr/share/openqa/script/openqa-cli px,
+ /usr/share/openqa/script/openqa-clone-job mrix,
+ /usr/share/openqa/script/openqa-clone-job r,
/usr/share/openqa/templates/ r,
/usr/share/openqa/templates/** r,
/var/lib/openqa/.config/openqa/client.conf r,
= Changed Local Profiles =
The following local profiles were changed. Would you like to save them?
[1 - /usr/share/openqa/script/openqa]
2 - /usr/share/openqa/script/openqa-cli
</code></pre>
<p>which look a bit weird but ok. Saved and will check further.</p>
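<p>To double-check what AppArmor actually denied, independent of the logprof suggestions, the denied paths could be listed from the audit log. This is only a sketch: the helper name <code>denied_paths</code> is hypothetical, and it assumes the usual audit record layout with <code>apparmor="DENIED"</code> and a <code>name="..."</code> field.</p>

```shell
# Hypothetical helper: prints the unique paths that AppArmor denied,
# reading audit log lines from standard input.
# Assumes each denial record contains apparmor="DENIED" and name="<path>".
denied_paths() {
    grep 'apparmor="DENIED"' |
        sed -n 's/.* name="\([^"]*\)".*/\1/p' |
        sort -u
}
```

<p>A possible invocation, typically as root, would be <code>denied_paths &lt; /var/log/audit/audit.log</code>.</p>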
openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3589802020-12-20T10:21:32Zokurzokurz@suse.com
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/81206">action #81206</a>: Trigger 'openqa-investigate' from within openQA when jobs fail on osd</i> added</li></ul> openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3591342020-12-21T07:23:32Zokurzokurz@suse.com
<ul></ul><p>/etc/apparmor.d/local/usr.share.openqa.script.openqa has more changes that are specific to the instance, e.g. including changes for openqa-trigger-from-obs. Not sure if we want to include them all in our basic profile but maybe in the "site-specific" extension profile:</p>
<p><a href="https://github.com/os-autoinst/openQA/pull/3651" class="external">https://github.com/os-autoinst/openQA/pull/3651</a></p>
openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3599602020-12-28T15:15:38Zlivdywanliv.dywan@suse.com
<ul><li><strong>Due date</strong> changed from <i>2020-12-22</i> to <i>2021-01-08</i></li></ul><p>Bumping <em>due date</em> to account for holidays</p>
openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3599842020-12-28T17:57:36Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>Actually with the PR merged everything is done :)</p>
openQA Project - action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3https://progress.opensuse.org/issues/80830?journal_id=3613582021-01-08T10:48:47Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> deleted (<del><i>2021-01-08</i></del>)</li></ul>