action #19720

Simplify investigation of job failures

Added by okurz almost 3 years ago. Updated 18 days ago.

Status: Workable
Priority: Normal
Assignee: -
Category: Feature requests
Target version: -
Start date: 2019-12-17
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Difficulty:
Duration:

Description

motivation

Make job failure investigation easier to save time and ensure we do not miss failures

ideas


Subtasks

action #61103: Use CodeMirror to render diffs in the Investigation tab (Rejected, okurz)


Related issues

Related to openQA Project - action #41057: [epic] Make reviewing results easier (New, 2018-09-14)

Related to openQA Project - action #60560: Self-investigate potential reasons for failures in openQA (Resolved, 2019-12-03)

Related to openQA Project - action #39719: [saga][epic] Detect "known failures" and mark jobs as such (Blocked, 2018-04-16, 2020-06-09)

History

#1 Updated by okurz almost 3 years ago

  • Priority changed from Normal to Low

The needle git hash is available in vars.json as well as in the output of autoinst-log.txt. Also, https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3141 is a nice example of how more information can be extracted from within the SUT: it looks into the details of y2log and surfaces them with a record_info, so that one does not need to transcribe screenshots or download the log files to look into the details.

The following can then help to get further details:

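import requests

# Job whose soft-failure details we want to read
test_url = 'https://openqa.suse.de/tests/1033327'
# Per-module result details of the job
details = requests.get(test_url + '/file/details-yast2_lan_restart.json').json()
# The 'text' field of the first "Soft Fail" step names the file holding the message
soft_failed_name = [i['text'] for i in details if 'title' in i and 'Soft Fail' in i['title']][0]
out = requests.get(test_url + '/file/' + soft_failed_name).content
print(out.decode('utf-8'))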

-> # Soft Failure:
bsc#992113
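
In the same spirit, the git hashes recorded in vars.json could be pulled out with a few lines. This is only a sketch under assumptions that this ticket does not confirm: that vars.json is served under /file/vars.json like the details file above, and that the relevant keys end in _GIT_HASH (for example NEEDLES_GIT_HASH). The helper names git_hashes and fetch_vars are made up for illustration.

```python
import json
from urllib.request import urlopen


def git_hashes(job_vars):
    """Return all settings whose key ends in _GIT_HASH (assumed naming)."""
    return {k: v for k, v in job_vars.items() if k.endswith('_GIT_HASH')}


def fetch_vars(test_url):
    """Download vars.json of a job; the /file/vars.json path is an assumption."""
    with urlopen(test_url + '/file/vars.json') as response:
        return json.load(response)


# Made-up settings for illustration, not taken from a real job
sample = {'ARCH': 'x86_64', 'NEEDLES_GIT_HASH': 'abc123', 'TEST_GIT_HASH': 'def456'}
print(git_hashes(sample))
```

Against a real job one would call git_hashes(fetch_vars(test_url)) and compare the result between the "last good" and the "first bad" job.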

#2 Updated by okurz almost 3 years ago

asmorodskyi and I used that to get parsing of soft-failure details into openqa-review; see https://github.com/okurz/openqa_review/commit/44119b13454e5bdb8609f99a21426434134035d0 and https://github.com/okurz/openqa_review/commit/4c4bc937fbf6e85e31eb1f2ca42a34e9603e26d3 for details.

#3 Updated by okurz over 2 years ago

  • Assignee deleted (okurz)

Since the last update, openqa-review has improved: it now parses soft-fail info boxes as well as soft-fail needles, and it posts reminder comments on tickets for these cases. Some further ideas are mentioned in the description for anyone to pick up.

#4 Updated by okurz 10 months ago

  • Status changed from In Progress to Workable

Back to Workable as it is not really "In Progress" now.

#5 Updated by okurz 7 months ago

  • Related to action #41057: [epic] Make reviewing results easier added

#6 Updated by okurz 6 months ago

  • Description updated (diff)

#7 Updated by okurz 6 months ago

  • Related to action #60560: Self-investigate potential reasons for failures in openQA added

#8 Updated by okurz 6 months ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz
  • Target version set to Current Sprint

#9 Updated by okurz 6 months ago

  • Related to action #39719: [saga][epic] Detect "known failures" and mark jobs as such added

#10 Updated by okurz 6 months ago

  • Description updated (diff)

#11 Updated by okurz 4 months ago

  • Description updated (diff)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Target version deleted (Current Sprint)

Idea from #61103 included. The last change was the HTML table. Some things are crossed off the list. With this I unassign myself again.

#12 Updated by okurz 4 months ago

  • Description updated (diff)

#13 Updated by okurz 4 months ago

  • Description updated (diff)

#14 Updated by okurz 3 months ago

From an investigation session today about a test failure in https://openqa.suse.de/tests/3950573 :

https://openqa.suse.de/tests/3940013#investigation [the previous job that already failed in the same module with same symptoms] shows me https://gitlab.suse.de/snippets/312 from where maybe 9d9f0b2fa or ca801c541 or 716d23041 look suspicious. But keep in mind that build 148 already failed the same, so not a regression in 150. You could crosscheck last good build 146 with last good test code, last good build with current test and vice versa, i.e. last build with last good test code. If all fail, then blame a qemu update on arm workers if there was any. […] Btw, no qemu update on worker: https://gitlab.suse.de/snippets/313
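
The crosscheck suggested in the quote can be written down as a small matrix. The following is a rough sketch, not an existing tool: the function names, the dict keys 'build' and 'test_code', and the wording of the verdicts are all assumptions, and the mapping from results to causes is only a heuristic reading of the advice above.

```python
from itertools import product


def crosscheck_jobs(last_good, first_bad):
    """List every (build, test code) combination worth retriggering.

    Both arguments are dicts with the hypothetical keys 'build' and
    'test_code'. The unchanged failing combination comes last (plain retry).
    """
    builds = (last_good['build'], first_bad['build'])
    codes = (last_good['test_code'], first_bad['test_code'])
    return list(product(builds, codes))


def interpret(results, last_good, first_bad):
    """Heuristically map crosscheck results to a likely cause.

    `results` maps (build, test_code) to True (passed) or False (failed).
    """
    good_b, bad_b = last_good['build'], first_bad['build']
    good_c, bad_c = last_good['test_code'], first_bad['test_code']
    if results[(bad_b, bad_c)]:
        return 'sporadic: the plain retry passed'
    if not any(results.values()):
        return 'environment: all combinations fail, check worker package updates'
    if results[(bad_b, good_c)]:
        return 'test regression: current build passes with last good test code'
    if results[(good_b, bad_c)]:
        return 'product regression: last good build passes with current test code'
    return 'unclear: only last good build with last good test code passes'


# Build numbers from the quote; the test code identifiers are placeholders
last_good = {'build': '146', 'test_code': 'good_sha'}
first_bad = {'build': '150', 'test_code': 'bad_sha'}
for combination in crosscheck_jobs(last_good, first_bad):
    print(combination)
```

The sketch assumes that the two builds and the two test code revisions differ; otherwise combinations collapse and the result dict has fewer than four entries.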

steps conducted:

  • followed "next & previous" results to find "first failed", found previous job on previous build
  • clicked "investigation" tab on failed job, identified that settings diff and needle changes look unrelated, identified some potentially related test changes -> https://gitlab.suse.de/snippets/312
  • suggestion in general to bisect: "crosscheck last good build […] with last good test code, last good build with current test and vice versa, i.e. last build with last good test code"
  • identified date of "last good" from https://openqa.suse.de/tests/3940013#next_previous -> title="2020-02-26T16:29:53Z"
  • logged into the assigned worker machine, openqaworker-arm-2, called sudo grep -A 100000 '2020-02-26' /var/log/zypp/history | sed -n 's@^.*install|\([^|]*\)|.*$@\1@p' | sort to identify changed packages.
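
The last step could also be done with a small script instead of grep and sed. This is a sketch that assumes the history file is pipe-separated with the timestamp in the first field, the action in the second and the package name in the third; installed_since is a hypothetical helper, and the sample lines are made up. Unlike the grep call, it compares dates instead of relying on a line with exactly that date being present, and it removes duplicate names.

```python
def installed_since(history_lines, since):
    """Return sorted names of packages installed on or after `since`.

    `since` is an ISO date string such as '2020-02-26'; ISO dates compare
    correctly as plain strings.
    """
    packages = set()
    for line in history_lines:
        if line.startswith('#'):
            continue  # skip comment lines
        fields = line.rstrip('\n').split('|')
        if len(fields) < 3:
            continue  # not a record in the assumed layout
        timestamp, action, name = fields[0], fields[1].strip(), fields[2]
        if action == 'install' and timestamp[:10] >= since:
            packages.add(name)
    return sorted(packages)


# Made-up sample lines in the pipe-separated layout assumed above
sample = [
    '# 2020-02-20 09:00:00 example comment line',
    '2020-02-20 09:00:01|install|kernel-default|5.3.18-1.1|aarch64|root@worker|repo|',
    '2020-02-27 03:10:42|install|qemu|4.2.0-2.1|aarch64|root@worker|repo|',
    '2020-02-27 03:10:45|remove |qemu-old|4.1.0-1.1|aarch64|root@worker|',
]
print(installed_since(sample, '2020-02-26'))
```

On a worker one would pass the open file instead of the sample list, e.g. installed_since(open('/var/log/zypp/history'), '2020-02-26').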

Idea for the "packages on worker": Have a hook that instance admins can define in the worker conf and fill with custom commands, e.g. rpm -qa, whose output is collected in log files and uploaded to the jobs. The investigation can then do a diff between these logs.
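
The diff part of this idea is small. Below is a minimal sketch assuming that the output of rpm -qa from the "last good" and the "first bad" job is already available as two strings; how the hook uploads them is left open, and package_diff is a hypothetical name. The lines are sorted first so that ordering differences do not show up as changes.

```python
import difflib


def package_diff(last_good_log, first_bad_log):
    """Return unified-diff lines between two `rpm -qa` outputs."""
    before = sorted(last_good_log.splitlines())
    after = sorted(first_bad_log.splitlines())
    diff = difflib.unified_diff(
        before, after, fromfile='last good', tofile='first bad', lineterm='', n=0
    )
    return list(diff)


# Made-up package lists for illustration
good = 'qemu-4.1.0-1.1.aarch64\nbash-5.0.11-1.1.aarch64\n'
bad = 'bash-5.0.11-1.1.aarch64\nqemu-4.2.0-2.1.aarch64\n'
for line in package_diff(good, bad):
    print(line)
```

With n=0 the diff carries no context lines, so only changed packages appear, which keeps the output short enough for an investigation tab.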

#15 Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from Workable to Feedback
  • Assignee set to okurz

https://github.com/os-autoinst/scripts/pull/23 to cover "See https://github.com/okurz/openqa_review/tree/feature/investigate especially https://github.com/okurz/openqa_review/blob/feature/investigate/openqa_review/investigate.py"

crossed off two other points and deleted "* os-autoinst version in vars.json" as this should either not be necessary considering that the version is in the log file or implicitly covered by "all package changes, e.g. save rpm -qa in file and provide diff and/or changelog" which one can do on the worker just as well as within the SUT.

I have tested the call to rpm -qa on all osd workers:

# salt -l error --no-color -C  'G@roles:worker' --state-output=changes cmd.run "time rpm -qa > /tmp/rpm_qa; ls -lh /tmp/rpm_qa; rm /tmp/rpm_qa"
openqaworker3.suse.de: 0m1.187s, 68K
openqaworker2.suse.de: 0m1.191s, 64K
openqaworker6.suse.de: 0m1.210s, 67K
openqaworker8.suse.de: 0m1.210s, 63K
openqaworker5.suse.de: 0m1.226s, 67K
openqaworker9.suse.de: 0m1.510s, 63K
malbec.arch.suse.de: 0m1.730s, 53K
openqaworker10.suse.de: 0m1.715s, 60K
QA-Power8-5-kvm.qa.suse.de: 0m1.826s, 65K
openqaworker13.suse.de: 0m1.765s, 64K
QA-Power8-4-kvm.qa.suse.de: 0m1.884s, 67K
powerqaworker-qam-1: 0m2.244s, 93K
openqaworker-arm-1.suse.de: 0m4.072s, 63K
openqaworker-arm-2.suse.de: 0m4.076s, 62K
openqaworker-arm-3.suse.de: 0m4.639s, 72K
grenache-1.qa.suse.de: 0m4.859s, 92K

So the call takes 1-5 s and records roughly 50-100 kB per worker. We could record this in every job, but that does not feel efficient. Parsing the zypper log only in case of a failure, with the diff computed between the dates of "last good" and "first bad", feels smarter, albeit of course a little more complicated :) The philosophy to follow would be to make the happy path (test passes) as efficient as possible and to treat failures as exceptions. Hence I do not consider recording rpm -qa in every job, regardless of its status, the right option.

#16 Updated by okurz 3 months ago

  • Description updated (diff)

#17 Updated by okurz 2 months ago

  • Status changed from Feedback to Workable

os-autoinst/os-autoinst#1367 merged, can now continue with https://github.com/os-autoinst/scripts/pull/23

#18 Updated by okurz 2 months ago

merged. Next step: Run it somewhere, e.g.

  • Set up a GitHub Actions CI job
  • query for failed jobs
    • without any comment on openQA -> openqa-client --json-output --host https://openqa.opensuse.org jobs/$id/comment == []
    • e.g. start with job group 1 (Tumbleweed)
    • Or based on AMQP event if "bugref" is "null"
    • and, via the openQA API, "origin_id" is "null", so that jobs that are themselves retriggered jobs are not retriggered again
  • for each identified job, run investigate and write output, i.e. link to triggered jobs to comment on job
  • sleep in loop with amqp rx client, wait for new jobs and repeat
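
The selection logic from the list above could look roughly like this. It is a sketch, not the script that was later written: the job dict keys 'id' and 'result' are assumptions ('origin_id' is the field named above), and get_comments stands for any callable that returns the comments of a job, for example a wrapper around the openqa-client call.

```python
def investigation_candidates(jobs, get_comments):
    """Yield IDs of failed jobs that nobody has looked at yet.

    `jobs` is an iterable of job dicts; `get_comments` is a callable that
    returns the list of comments for a job ID.
    """
    for job in jobs:
        if job.get('result') != 'failed':
            continue
        if job.get('origin_id') is not None:
            continue  # already a retriggered job, do not retrigger again
        if get_comments(job['id']):
            continue  # somebody already commented on this job
        yield job['id']


# Made-up jobs and comments for illustration
jobs = [
    {'id': 1, 'result': 'failed', 'origin_id': None},
    {'id': 2, 'result': 'failed', 'origin_id': 1},
    {'id': 3, 'result': 'passed', 'origin_id': None},
    {'id': 4, 'result': 'failed', 'origin_id': None},
]
comments = {4: ['label:known_issue']}
print(list(investigation_candidates(jobs, lambda job_id: comments.get(job_id, []))))
```

Passing the comment lookup in as a callable keeps the filter independent of how the data is fetched, be it the command line client, the HTTP API or an AMQP event.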

The proof-of-concept script still seems like a good idea, but more and more it feels like reading all these details from the outside, finding a place to run the script, and triggering it accordingly is more effort than retriggering directly from within openQA. E.g. based on my branch "feature/known_issue_detection" I could have a branch "feature/investigate" with a method "trigger_investigation_jobs".

#19 Updated by okurz 2 months ago

  • Status changed from Workable to Feedback

With https://github.com/os-autoinst/scripts/pull/24 I added a script to look for "investigation candidates" and added a feature to "openqa-investigate" to post automatic comments to openQA with links to the triggered investigation jobs. With https://gitlab.suse.de/openqa/auto-review/-/merge_requests/3 I run this daily (for now), so automatic investigation jobs now get triggered for all failed openQA jobs that do not have a label, e.g. new failures: https://openqa.opensuse.org/tests/1218545#comments . Based on the results one should be able to determine more easily whether issues are sporadic, test regressions, or product regressions. I await any feedback on that. We can also continue the work in openQA itself.

#20 Updated by okurz 2 months ago

  • Description updated (diff)

#21 Updated by cdywan 21 days ago

  • all package changes, e.g. save rpm -qa in file
  • diff of test schedule
  • if best needle candidate matches 0% it is most likely not a trivial needle issue
  • Make "last good" a link to a job instead of plain job ID
  • collapse content of initial rows in investigation tab when content becomes too big, e.g. more than 10 lines
  • In settings table mark origin of settings and changed settings, e.g. for setting "foo" instead of the table row "foo | 1" one could have [...]

These seem to be the outstanding items from the ideas. Are you still planning to look into these? Do you want to discuss these more (since you set it to Feedback)? I'd volunteer to help out.

#22 Updated by okurz 18 days ago

  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

I guess by now I can update the bug as I have not received further feedback lately. Results and observations so far:

  • Some people like the "Investigation" tab, some have not even found it and got lost in their own investigation :(
  • The script solution in #19720#note-19 kind of works and can be replaced by a solution within openQA, for which I have prepared some steps but have not managed to continue so far
  • The other mentioned ideas are also still valid and can be done anytime
