Simplify investigation of job failures
Make job failure investigation easier to save time and ensure we do not miss failures
DONE: See https://github.com/okurz/openqa_review/tree/feature/investigate especially https://github.com/okurz/openqa_review/blob/feature/investigate/openqa_review/investigate.py -> https://github.com/os-autoinst/scripts/pull/23 -> gh#os-autoinst/scripts#24
all package changes, e.g. save rpm -qa in file and provide diff and/or changelog for the worker and the SUT. For openSUSE e.g. read out changelog diff from https://openqa.opensuse.org/snapshot-changes/opensuse/Tumbleweed/ , for SLE from http://xcdchk.suse.de/
diff of test schedule
if best needle candidate matches 0% it is most likely not a trivial needle issue
Make "last good" a link to a job instead of plain job ID
collapse content of initial rows in investigation tab when content becomes too big, e.g. more than 10 lines
In the settings table, mark the origin of settings and changed settings, e.g. for setting "foo" instead of the table row "foo | 1" one could have
- "foo | 1 (testsuites table)" when the setting comes from the test suites database table, e.g. compared to job templates, machines, etc. This would also help when we allow even more sources for settings, e.g. load job templates from test distributions in parallel to database tables
- update the settings table from vars.json after the job run to include changes, but then show which settings changed since the job was initially created
- "foo | 1 (+)" when the setting is new in the scenario, with the table row and/or "(+)" in green (as in common colored diffs) and on hover it shows the explanation that this was added, linked to the commit, showing which job it compares against
- "foo | 1 (<->)" or similar when the setting changed against "last good" where it was e.g. 0, with "(<->)" being a link to the "last good" job, with the table row in different color
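The "(+)" and "(<->)" markers proposed above could be computed with a plain dictionary comparison. A minimal sketch, assuming the settings of the failed job and of "last good" are both available as dictionaries; the function name annotate_settings and the example values are hypothetical and not part of openQA:

```python
def annotate_settings(current, last_good):
    """Return table rows "key | value (marker)" for the current job settings.

    "(+)" marks settings that are new compared to "last good",
    "(<->)" marks settings whose value changed; unchanged rows get no marker.
    """
    rows = []
    for key in sorted(current):
        value = current[key]
        if key not in last_good:
            marker = " (+)"
        elif last_good[key] != value:
            marker = " (<->)"
        else:
            marker = ""
        rows.append(f"{key} | {value}{marker}")
    return rows


# Hypothetical example settings
last_good = {"foo": "0", "ARCH": "x86_64"}
current = {"foo": "1", "ARCH": "x86_64", "bar": "1"}
for row in annotate_settings(current, last_good):
    print(row)
```

Settings that were removed since "last good" do not show up in this sketch; they would need a second pass over the keys of last_good.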
DONE: provide more data in the job logs itself for https://progress.opensuse.org/projects/openqav3/wiki#Further-decision-steps-working-on-test-issues
DONE: provide diff of failed job vs. "last good"
- list of changed files
DONE: exclude context in vars.json diff, distinguish change and add/remove -> gh#os-autoinst/openQA#2625
DONE: exclude merges from test git log -> gh#os-autoinst/openQA#2625
DONE: The Investigation tab should use CodeMirror to render diffs like we do for test sources or the YAML editor (from #61103) -> Using nicer table instead should be good enough gh#os-autoinst/openQA#2644
#1 Updated by okurz almost 3 years ago
- Priority changed from Normal to Low
The needle git hash is available in vars.json as well as in the output of autoinst-log.txt. Also https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3141 can serve as a nice example of how more information can be extracted from within the SUT by looking into details of y2log and popping up a record_info with more details, so that one does not need to transcribe screenshots or download the logfiles to look into details.
The following can help further to get details then:
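import requests

test_url = 'https://openqa.suse.de/tests/1033327'
# Step details of the test module as uploaded by the job
j = requests.get(test_url + '/file/details-yast2_lan_restart.json').json()
# File names of all steps whose title marks them as a soft failure
soft_failed_names = [i['text'] for i in j if 'title' in i and 'Soft Fail' in i['title']]
# The list can hold several entries, so fetch and print each one
for name in soft_failed_names:
    out = requests.get(test_url + '/file/' + name).content
    print(out.decode('utf-8'))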
-> # Soft Failure:
#2 Updated by okurz almost 3 years ago
asmorodskyi and I used that to get parsing of soft-failure details into openqa-review, see https://github.com/okurz/openqa_review/commit/44119b13454e5bdb8609f99a21426434134035d0 and https://github.com/okurz/openqa_review/commit/4c4bc937fbf6e85e31eb1f2ca42a34e9603e26d3 for details
#3 Updated by okurz over 2 years ago
- Assignee deleted (
Since the last update openqa-review improved by parsing soft-fail info boxes as well as soft-fail needles including reminder comments on tickets for these cases as well. Some further ideas are mentioned in the description for anyone to pick up.
From an investigation session today about a test failure in https://openqa.suse.de/tests/3950573 :
https://openqa.suse.de/tests/3940013#investigation [the previous job that already failed in the same module with same symptoms] shows me https://gitlab.suse.de/snippets/312 from where maybe 9d9f0b2fa or ca801c541 or 716d23041 look suspicious. But keep in mind that build 148 already failed the same, so not a regression in 150. You could crosscheck last good build 146 with last good test code, last good build with current test and vice versa, i.e. last build with last good test code. if all fail then blame a qemu update on arm workers if there was any. […] btw, no qemu update on worker: https://gitlab.suse.de/snippets/313
- followed "next & previous" results to find "first failed", found previous job on previous build
- clicked "investigation" tab on failed job, identify that settings diff and needle changes look unrelated, identified some potentially related test changes -> https://gitlab.suse.de/snippets/312
- suggestion in general to bisect: "crosscheck last good build […] with last good test code, last good build with current test and vice versa, i.e. last build with last good test code"
- identified the date of "last good" from https://openqa.suse.de/tests/3940013#next_previous -> title="2020-02-26T16:29:53Z"
- logged into the assigned worker machine, openqaworker-arm-2, and called
sudo grep -A 100000 '2020-02-26' /var/log/zypp/history | sed -n 's@^.*install|\([^|]*\)|.*$@\1@p' | sort
to identify changed packages.
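The same package lookup can be sketched in Python so that it compares timestamps instead of relying on a line with the exact date being present in the log. This is only a sketch: it assumes pipe-separated history lines that start with a "YYYY-MM-DD HH:MM:SS" timestamp followed by the action and the package name, as the sed expression above implies. The function name installed_since and the sample lines are made up for illustration.

```python
from datetime import datetime


def installed_since(history_lines, since):
    """Return sorted names of packages installed at or after `since`.

    Assumes lines of the form "YYYY-MM-DD HH:MM:SS|action|name|...".
    """
    names = set()
    for line in history_lines:
        fields = line.strip().split("|")
        if len(fields) < 3 or fields[1] != "install":
            continue  # comments, other actions, malformed lines
        try:
            stamp = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # first field is not a timestamp
        if stamp >= since:
            names.add(fields[2])
    return sorted(names)


# Made-up sample lines, not taken from a real worker
sample = [
    "# 2020-02-25 08:00:00 some comment",
    "2020-02-25 08:00:01|install|vim|8.2-1.1|aarch64|root@worker|repo-oss|abc|",
    "2020-02-27 09:30:00|install|qemu|4.2.0-2.1|aarch64|root@worker|repo-oss|def|",
    "2020-02-27 09:31:00|remove|nano|4.8-1.1|aarch64|root@worker|",
]
print(installed_since(sample, datetime(2020, 2, 26, 16, 29, 53)))
```

On a worker one would pass the lines of /var/log/zypp/history and the timestamp of "last good" instead of the sample data.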
Idea for the "packages on worker": Have a hook that can be defined by instance admins in worker conf and be filled with custom commands that are collected in log files and uploaded to jobs, e.g. rpm -qa. Then investigate can do diff between logs.
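Once such logs exist for both jobs, the diff itself is a set comparison. A minimal sketch, assuming each log is the plain output of rpm -qa with one package per line; the function name diff_package_lists and the package strings are hypothetical:

```python
def diff_package_lists(last_good_log, failed_log):
    """Compare two `rpm -qa` outputs given as text, one package per line."""
    before = {line.strip() for line in last_good_log.splitlines() if line.strip()}
    after = {line.strip() for line in failed_log.splitlines() if line.strip()}
    return {
        "removed": sorted(before - after),
        "added": sorted(after - before),
    }


# Made-up package lists for illustration
last_good_log = "qemu-4.1.0-1.1.aarch64\nvim-8.2-1.1.aarch64\n"
failed_log = "qemu-4.2.0-2.1.aarch64\nvim-8.2-1.1.aarch64\n"
print(diff_package_lists(last_good_log, failed_log))
```

An updated package shows up twice, once under "removed" with the old version and once under "added" with the new one.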
- Description updated (diff)
- Status changed from Workable to Feedback
- Assignee set to okurz
https://github.com/os-autoinst/scripts/pull/23 to cover "See https://github.com/okurz/openqa_review/tree/feature/investigate especially https://github.com/okurz/openqa_review/blob/feature/investigate/openqa_review/investigate.py"
Crossed off two other points and deleted "* os-autoinst version in vars.json" as this should either not be necessary considering that the version is in the log file or implicitly covered by "all package changes, e.g. save rpm -qa in file and provide diff and/or changelog" which one can do on the worker just as well as within the SUT.
I have tested the call to rpm -qa on all osd workers:
# salt -l error --no-color -C 'G@roles:worker' --state-output=changes cmd.run "time rpm -qa > /tmp/rpm_qa; ls -lh /tmp/rpm_qa; rm /tmp/rpm_qa"
openqaworker3.suse.de: 0m1.187s, 68K
openqaworker2.suse.de: 0m1.191s, 64K
openqaworker6.suse.de: 0m1.210s, 67K
openqaworker8.suse.de: 0m1.210s, 63K
openqaworker5.suse.de: 0m1.226s, 67K
openqaworker9.suse.de: 0m1.510s, 63K
malbec.arch.suse.de: 0m1.730s, 53K
openqaworker10.suse.de: 0m1.715s, 60K
QA-Power8-5-kvm.qa.suse.de: 0m1.826s, 65K
openqaworker13.suse.de: 0m1.765s, 64K
QA-Power8-4-kvm.qa.suse.de: 0m1.884s, 67K
powerqaworker-qam-1: 0m2.244s, 93K
openqaworker-arm-1.suse.de: 0m4.072s, 63K
openqaworker-arm-2.suse.de: 0m4.076s, 62K
openqaworker-arm-3.suse.de: 0m4.639s, 72K
grenache-1.qa.suse.de: 0m4.859s, 92K
So the call takes 1-5 s and records 60-100 kB. We could record this in every job but that does not feel efficient. Parsing the zypper log only in case of an error, with the diff computed between the date of "last good" and "first bad", feels smarter albeit of course a little bit more complicated :) The philosophy to follow would be to make the happy path (test passes) as efficient as possible and treat failures as exceptions, hence I consider recording rpm -qa in every job regardless of the status as not the right option.
merged. Next step: Run it somewhere, e.g.
- Setup github actions CI job
- query for failed jobs
- without any comment on openQA ->
openqa-client --json-output --host https://openqa.opensuse.org jobs/$id/comment == 
- e.g. start with job group 1 (Tumbleweed)
- Or based on AMQP event if "bugref" is "null"
- and – over openQA API – "origin_id" is "null" to not retrigger jobs that are already retriggered jobs
- for each identified job, run investigate and write output, i.e. link to triggered jobs to comment on job
- sleep in loop with amqp rx client, wait for new jobs and repeat
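The selection step in the list above can be sketched as a small filter. This is an illustration under assumptions, not the merged script: jobs are taken to be dictionaries with "id", "result" and "origin_id" keys, and the comment lookup is passed in as a callable (in practice it could wrap the openqa-client call shown above). The names investigation_candidates and has_comments are hypothetical.

```python
def investigation_candidates(jobs, has_comments):
    """Yield ids of failed jobs that are worth investigating.

    `jobs` is an iterable of dicts, `has_comments` a callable taking a job id.
    """
    for job in jobs:
        if job.get("result") != "failed":
            continue
        if job.get("origin_id") is not None:
            continue  # already a retriggered job, do not retrigger again
        if has_comments(job["id"]):
            continue  # job already carries a comment
        yield job["id"]


# Made-up jobs and comment state for illustration
jobs = [
    {"id": 1, "result": "failed", "origin_id": None},
    {"id": 2, "result": "passed", "origin_id": None},
    {"id": 3, "result": "failed", "origin_id": 1},
    {"id": 4, "result": "failed", "origin_id": None},
]
commented = {4}
print(list(investigation_candidates(jobs, lambda job_id: job_id in commented)))
```

Passing the comment lookup as a callable keeps the filter testable without a running openQA instance.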
The proof-of-concept script seems still like a good idea but more and more it feels like trying to read all these details from the outside, finding a place to run and also trigger accordingly is more effort than retriggering directly from within openQA. E.g. based on my branch "feature/known_issue_detection" I could have "feature/investigate" with a method "trigger_investigation_jobs"
- Status changed from Workable to Feedback
With https://github.com/os-autoinst/scripts/pull/24 I added a script to look for "investigation candidates" and added a feature to "openqa-investigate" to post automatic comments to openQA with links to triggered investigation jobs. With https://gitlab.suse.de/openqa/auto-review/-/merge_requests/3 I run this daily (for now) so automatic investigation jobs get triggered now for all failed openQA jobs that do not have a label, e.g. new failures: https://openqa.opensuse.org/tests/1218545#comments . Based on the results one should be able to more easily determine if issues are sporadic, test regressions, product regressions. Based on that I await any feedback. We can also continue the work in openQA itself.
- all package changes, e.g. save rpm -qa in file
- diff of test schedule
- if best needle candidate matches 0% it is most likely not a trivial needle issue
- Make "last good" a link to a job instead of plain job ID
- collapse content of initial rows in investigation tab when content becomes too big, e.g. more than 10 lines
- In settings table mark origin of settings and changed settings, e.g. for setting "foo" instead of the table row "foo | 1" one could have [...]
These seem to be the outstanding items from the ideas. Are you still planning to look into these? Do you want to discuss these more (since you set it to Feedback)? I'd volunteer to help out.
- Status changed from Feedback to Workable
- Assignee deleted (
I guess by now I can update the bug as I have not received further feedback lately. Results and observations so far:
- Some people like the "Investigation" tab, some have not even found it and got lost in their own investigation :(
- The script solution in #19720#note-19 kinda works and can be replaced by a solution within openQA which I have prepared some steps for but did not manage to continue further so far
- The other mentioned ideas are also still valid and can be done anytime