coordination #19720
closedcoordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
[epic] Simplify investigation of job failures
Description
motivation
Make job failure investigation easier to save time and ensure we do not miss failures
ideas
- DONE: See https://github.com/okurz/openqa_review/tree/feature/investigate especially https://github.com/okurz/openqa_review/blob/feature/investigate/openqa_review/investigate.py -> https://github.com/os-autoinst/scripts/pull/23 -> gh#os-autoinst/scripts#24
- all package changes, e.g. save rpm -qa in file and provide diff and/or changelog for the worker and the SUT. For openSUSE e.g. read out changelog diff from https://openqa.opensuse.org/snapshot-changes/opensuse/Tumbleweed/ , for SLE from http://xcdchk.suse.de/
  - changes on worker #69088
- diff of test schedule
- if best needle candidate matches 0% it is most likely not a trivial needle issue
- DONE: Make "last good" a link to a job instead of plain job ID #69085
- DONE: collapse content of initial rows in investigation tab when content becomes too big, e.g. more than 10 lines -> gh#os-autoinst/openQA#3257
- In settings table mark origin of settings and changed settings, e.g. for setting "foo" instead of the table row "foo | 1" one could have
  - "foo | 1 (testsuites table)" when the setting comes from the test suites database table, e.g. compared to job templates, machines, etc. This would also help when we allow even more sources for settings, e.g. load job templates from test distributions in parallel to database tables
  - update the settings table from vars.json after job run to include changes but then show which settings changed since the job was initially created
  - "foo | 1 (+)" when the setting is new in the scenario, with the table row and/or "(+)" in green (as in common colored diffs) and on hover it shows the explanation that this was added, linked to the commit, showing which job it compares against
  - "foo | 1 (<->)" or similar when the setting changed against "last good" where it was e.g. 0, with "(<->)" being a link to the "last good" job, with the table row in different color
- DONE: provide more data in the job logs itself for https://progress.opensuse.org/projects/openqav3/wiki#Further-decision-steps-working-on-test-issues e.g. gh#os-autoinst/os-autoinst#805 -> providing also git hash for needles repo so we could also compare the differences in needles -> gh#os-autoinst/openQA#2625
- DONE: provide diff of failed job vs. "last good"
- DONE: git log or diff for test+needle changes -> gh#os-autoinst/openQA/2566 and gh#os-autoinst/openQA/2609 for test diff, gh#os-autoinst/openQA/2625 for needles
- list of changed files
- DONE: exclude context in vars.json diff, distinguish change and add/remove -> gh#os-autoinst/openQA#2625
- DONE: exclude merges from test git log -> gh#os-autoinst/openQA#2625
- DONE: The Investigation tab should use CodeMirror to render diffs like we do for test sources or the YAML editor (from #61103) -> Using nicer table instead should be good enough gh#os-autoinst/openQA#2644
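The settings-table idea from the list (marking settings that are new or changed against "last good") could be sketched roughly as follows. This is only an illustration, not how openQA implements anything: the function name `annotate_settings` and the plain-dict inputs are assumptions, and the sample values are made up.

```python
def annotate_settings(current, last_good):
    """Return table rows "key | value (marker)" for the current job settings.

    "(+)" marks a setting that is missing in the "last good" job,
    "(<->)" marks a setting whose value differs from "last good",
    unchanged settings get no marker.
    """
    rows = []
    for key in sorted(current):
        value = current[key]
        if key not in last_good:
            rows.append(f"{key} | {value} (+)")
        elif last_good[key] != value:
            rows.append(f"{key} | {value} (<->)")
        else:
            rows.append(f"{key} | {value}")
    return rows


# Made-up example settings: "foo" changed from 0 to 1, "baz" is new
rows = annotate_settings(
    {"foo": "1", "bar": "x", "baz": "2"},
    {"foo": "0", "bar": "x"},
)
print("\n".join(rows))
```

Marking the origin of a setting (test suites table, job template, machine) would need that information stored per setting, which this sketch does not cover.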
Updated by okurz over 7 years ago
- Priority changed from Normal to Low
The needle git hash is available in vars.json as well as in the output of autoinst-log.txt. Also https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3141 can serve as a nice example of how more information can be extracted from within the SUT by looking into details of y2log and popping up with a record_info and more details so that one does not need to transcribe screenshots or download the logfiles to look into details.
The following can then help to get further details:
import requests

test_url = 'https://openqa.suse.de/tests/1033327'
# Fetch the result details of the test module and pick the first "Soft Fail" entry
j = requests.get(test_url + '/file/details-yast2_lan_restart.json').json()
soft_failed_name = [i['text'] for i in j if 'title' in i and 'Soft Fail' in i['title']][0]
# Download the text file referenced by that entry and print its content
out = requests.get(test_url + '/file/' + soft_failed_name).content
print(out.decode('utf-8'))
# -> # Soft Failure:
# bsc#992113
Updated by okurz over 7 years ago
asmorodskyi and I used that to get parsing of soft-failure details into openqa-review, see https://github.com/okurz/openqa_review/commit/44119b13454e5bdb8609f99a21426434134035d0 and https://github.com/okurz/openqa_review/commit/4c4bc937fbf6e85e31eb1f2ca42a34e9603e26d3 for details
Updated by okurz about 7 years ago
- Assignee deleted (okurz)
Since the last update openqa-review has improved: it now parses soft-fail info boxes as well as soft-fail needles and posts reminder comments on tickets for these cases. Some further ideas are mentioned in the description for anyone to pick up.
Updated by okurz over 5 years ago
- Status changed from In Progress to Workable
back to workable as it's not really "In Progress" now.
Updated by okurz about 5 years ago
- Related to coordination #41057: [epic] Make reviewing results easier added
Updated by okurz about 5 years ago
- Related to action #60560: Self-investigate potential reasons for failures in openQA added
Updated by okurz about 5 years ago
- Status changed from Workable to Feedback
- Assignee set to okurz
- Target version set to Current Sprint
Updated by okurz about 5 years ago
- Related to coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues added
Updated by okurz almost 5 years ago
- Description updated (diff)
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
- Target version deleted (Current Sprint)
Idea from #61103 included. Last change was the HTML table. Some things crossed off the list. With this I unassign again.
Updated by okurz almost 5 years ago
From an investigation session today about a test failure in https://openqa.suse.de/tests/3950573 :
https://openqa.suse.de/tests/3940013#investigation [the previous job that already failed in the same module with same symptoms] shows me https://gitlab.suse.de/snippets/312 from where maybe 9d9f0b2fa or ca801c541 or 716d23041 look suspicious. But keep in mind that build 148 already failed the same, so not a regression in 150. You could crosscheck last good build 146 with last good test code, last good build with current test and vice versa, i.e. last build with last good test code. if all fail then blame a qemu update on arm workers if there was any. […] btw, no qemu update on worker: https://gitlab.suse.de/snippets/313
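The crosscheck suggested here (every combination of last good or current build with last good or current test code) can be enumerated mechanically. A minimal sketch, assuming hypothetical setting names `BUILD` and `TEST_REF` and placeholder git refs, since the actual refs are not part of this ticket:

```python
from itertools import product


def crosscheck_matrix(last_good_build, current_build, last_good_tests, current_tests):
    """Enumerate every build/test-code combination for a manual bisect."""
    return [
        {"BUILD": build, "TEST_REF": ref}  # hypothetical setting names
        for build, ref in product(
            (last_good_build, current_build),
            (last_good_tests, current_tests),
        )
    ]


# Builds taken from the comment above, git refs are placeholders
matrix = crosscheck_matrix("146", "150", "last_good_ref", "current_ref")
for combination in matrix:
    print(combination)
```

If all four combinations fail, neither the product nor the test code explains the failure on its own, which is the point at which the comment suggests looking at the worker.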
steps conducted:
- followed "next & previous" results to find "first failed", found previous job on previous build
- clicked "investigation" tab on failed job, identified that settings diff and needle changes look unrelated, identified some potentially related test changes -> https://gitlab.suse.de/snippets/312
- suggestion in general to bisect: "crosscheck last good build […] with last good test code, last good build with current test and vice versa, i.e. last build with last good test code"
- identified date of "last good" from https://openqa.suse.de/tests/3940013#next_previous -> title="2020-02-26T16:29:53Z"
- logged into the assigned worker machine, openqaworker-arm-2, called
sudo grep -A 100000 '2020-02-26' /var/log/zypp/history | sed -n 's@^.*install|\([^|]*\)|.*$@\1@p' | sort
to identify changed packages.
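The same lookup could be done without grep and sed, for example in a script that already knows the "last good" timestamp. A rough sketch, assuming the pipe-separated layout "timestamp|action|name|version|..." of /var/log/zypp/history; the function name and the sample lines are made up, and time zones are ignored. Unlike the shell pipeline it compares timestamps instead of relying on a line matching the date.

```python
from datetime import datetime


def installed_since(history_text, since):
    """Return sorted names of packages installed at or after `since`."""
    names = set()
    for line in history_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip empty lines and comment lines
        fields = line.split("|")
        if len(fields) < 3 or fields[1] != "install":
            continue  # only "install" entries, like the sed expression
        try:
            stamp = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # ignore lines without a parsable timestamp
        if stamp >= since:
            names.add(fields[2])
    return sorted(names)


# Made-up history excerpt, not taken from a real worker
sample = """\
# example comment line
2020-02-25 09:00:01|install|pkg-a|1.0-1.1|aarch64|root@worker|repo|
2020-02-26 17:00:00|install|pkg-b|2.0-1.1|aarch64|root@worker|repo|
2020-02-27 08:30:00|remove|pkg-d|1.0-1.1|aarch64|root@worker|
2020-02-27 08:31:00|install|pkg-c|3.0-1.1|aarch64|root@worker|repo|
"""
changed = installed_since(sample, datetime(2020, 2, 26, 16, 29, 53))
print(changed)
```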
Idea for the "packages on worker": Have a hook that can be defined by instance admins in worker conf and be filled with custom commands that are collected in log files and uploaded to jobs, e.g. rpm -qa. Then investigate can do diff between logs.
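A diff between two such uploaded logs could be as simple as a set comparison. A minimal sketch, assuming both logs are plain rpm -qa output with one package per line; the function name and the package strings are made up for illustration:

```python
def package_diff(last_good, failed):
    """Compare two `rpm -qa` listings (one package per line)."""
    before = set(last_good.split())
    after = set(failed.split())
    return {
        "removed": sorted(before - after),  # only in the "last good" job
        "added": sorted(after - before),  # only in the failed job
    }


# Made-up package lists: pkg-b was updated between the two jobs
last_good_log = "pkg-a-1.0-1.1.noarch\npkg-b-2.0-1.1.noarch\n"
failed_log = "pkg-a-1.0-1.1.noarch\npkg-b-2.1-1.1.noarch\n"
diff = package_diff(last_good_log, failed_log)
print(diff)
```

An updated package shows up once under "removed" with the old version and once under "added" with the new one, which is usually enough to spot a suspicious update.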
Updated by okurz almost 5 years ago
- Description updated (diff)
- Status changed from Workable to Feedback
- Assignee set to okurz
https://github.com/os-autoinst/scripts/pull/23 to cover "See https://github.com/okurz/openqa_review/tree/feature/investigate especially https://github.com/okurz/openqa_review/blob/feature/investigate/openqa_review/investigate.py"
crossed off two other points and deleted "* os-autoinst version in vars.json" as this should either not be necessary considering that the version is in the log file or implicitly covered by "all package changes, e.g. save rpm -qa in file and provide diff and/or changelog" which one can do on the worker just as well as within the SUT.
I have tested the call to rpm -qa on all osd workers:
# salt -l error --no-color -C 'G@roles:worker' --state-output=changes cmd.run "time rpm -qa > /tmp/rpm_qa; ls -lh /tmp/rpm_qa; rm /tmp/rpm_qa"
openqaworker3.suse.de: 0m1.187s, 68K
openqaworker2.suse.de: 0m1.191s, 64K
openqaworker6.suse.de: 0m1.210s, 67K
openqaworker8.suse.de: 0m1.210s, 63K
openqaworker5.suse.de: 0m1.226s, 67K
openqaworker9.suse.de: 0m1.510s, 63K
malbec.arch.suse.de: 0m1.730s, 53K
openqaworker10.suse.de: 0m1.715s, 60K
QA-Power8-5-kvm.qa.suse.de: 0m1.826s, 65K
openqaworker13.suse.de: 0m1.765s, 64K
QA-Power8-4-kvm.qa.suse.de: 0m1.884s, 67K
powerqaworker-qam-1: 0m2.244s, 93K
openqaworker-arm-1.suse.de: 0m4.072s, 63K
openqaworker-arm-2.suse.de: 0m4.076s, 62K
openqaworker-arm-3.suse.de: 0m4.639s, 72K
grenache-1.qa.suse.de: 0m4.859s, 92K
So roughly 1-5s, recording about 50-100kB. We could record this in every job but that feels inefficient. Parsing the zypper log only in case of an error, with the diff computed between the date of "last good" and "first bad", feels smarter albeit of course a little bit more complicated :) The philosophy to follow would be to make the happy path (test passes) as efficient as possible and treat failures as exceptions, hence I consider recording rpm -qa in every job regardless of the status as not the right option.
Updated by okurz almost 5 years ago
- Status changed from Feedback to Workable
os-autoinst/os-autoinst#1367 merged, can now continue with https://github.com/os-autoinst/scripts/pull/23
Updated by okurz almost 5 years ago
merged. Next step: Run it somewhere, e.g.
- Setup github actions CI job
- query for failed jobs
  - without any comment on openQA -> openqa-client --json-output --host https://openqa.opensuse.org jobs/$id/comment == []
  - e.g. start with job group 1 (Tumbleweed)
  - Or based on AMQP event if "bugref" is "null"
  - and – over openQA API – "origin_id" is "null" to not retrigger jobs that are already retriggered jobs
- for each identified job, run investigate and write output, i.e. link to triggered jobs to comment on job
- sleep in loop with amqp rx client, wait for new jobs and repeat
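The selection criteria from the list above could be expressed as a small filter over job records. This is only a sketch under assumptions: the jobs are taken to be already fetched as dicts, and the keys "result" and "comments" are chosen for illustration ("origin_id" is the field named above); it does not talk to openQA or AMQP.

```python
def investigation_candidates(jobs):
    """Select failed jobs without comments that are not retriggered jobs."""
    return [
        job["id"]
        for job in jobs
        if job.get("result") == "failed"
        and not job.get("comments")  # no comment yet, i.e. nobody labeled it
        and job.get("origin_id") is None  # skip jobs that are retriggers
    ]


# Made-up job records for illustration
jobs = [
    {"id": 1, "result": "failed", "comments": [], "origin_id": None},
    {"id": 2, "result": "failed", "comments": ["label:known"], "origin_id": None},
    {"id": 3, "result": "failed", "comments": [], "origin_id": 1},
    {"id": 4, "result": "passed", "comments": [], "origin_id": None},
]
candidates = investigation_candidates(jobs)
print(candidates)
```

Each returned job ID would then be handed to the investigate script, with the links to the triggered jobs written back as a comment on the job.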
The proof-of-concept script still seems like a good idea, but more and more it feels like reading all these details from the outside, finding a place to run it and triggering accordingly is more effort than retriggering directly from within openQA. E.g. based on my branch "feature/known_issue_detection" I could have "feature/investigate" with a method "trigger_investigation_jobs"
Updated by okurz over 4 years ago
- Status changed from Workable to Feedback
With https://github.com/os-autoinst/scripts/pull/24 I added a script to look for "investigation candidates" and added a feature to "openqa-investigate" to post automatic comments to openQA with links to triggered investigation jobs. With https://gitlab.suse.de/openqa/auto-review/-/merge_requests/3 I run this daily (for now) so automatic investigation jobs get triggered now for all failed openQA jobs that do not have a label, e.g. new failures: https://openqa.opensuse.org/tests/1218545#comments . Based on the results one should be able to more easily determine if issues are sporadic, test regressions, product regressions. Based on that I await any feedback. We can also continue the work in openQA itself.
Updated by livdywan over 4 years ago
- all package changes, e.g. save rpm -qa in file
- diff of test schedule
- if best needle candidate matches 0% it is most likely not a trivial needle issue
- Make "last good" a link to a job instead of plain job ID
- collapse content of initial rows in investigation tab when content becomes too big, e.g. more than 10 lines
- In settings table mark origin of settings and changed settings, e.g. for setting "foo" instead of the table row "foo | 1" one could have [...]
These seem to be the outstanding items from the ideas. Are you still planning to look into these? Do you want to discuss these more (since you set it to Feedback)? I'd volunteer to help out.
Updated by okurz over 4 years ago
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
I guess by now I can update the bug as I have not received further feedback lately. Results and observations so far:
- Some people like the "Investigation" tab, some have not even found it and got lost in their own investigation :(
- The script solution in #19720#note-19 kinda works and can be replaced by a solution within openQA which I have prepared some steps for but did not manage to continue further so far
- The other mentioned ideas are also still valid and can be done anytime
Updated by ilausuch over 4 years ago
For collapse content of initial rows in investigation tab when content becomes too big, e.g. more than 10 lines
https://github.com/os-autoinst/openQA/pull/3257
Updated by okurz over 4 years ago
- Status changed from Workable to Feedback
- Assignee set to ilausuch
@ilausuch assigning to you. Even if you don't plan to cover all ideas you can track this ticket and after being done with the parts that you planned you can set back to "Workable" and unassign again
Updated by ilausuch over 4 years ago
Make "last good" a link to a job instead of plain job ID
https://github.com/os-autoinst/openQA/pull/3261
Updated by okurz over 4 years ago
- Description updated (diff)
discussed with ilausuch, clarified the approach regarding saving changes from worker/SUT, updated description.
Updated by okurz over 4 years ago
- Due date set to 2020-07-17
due to changes in a related task: #69085
Updated by okurz over 4 years ago
- Due date set to 2020-07-17
due to changes in a related task: #69088
Updated by okurz over 4 years ago
- Subject changed from Simplify investigation of job failures to [epic] Simplify investigation of job failures
- Description updated (diff)
- Status changed from Feedback to Blocked
- Target version set to Ready
Updated description with links to individual PRs for already covered items, reworked to be "epic" with subtasks for current WIP items
Updated by okurz about 4 years ago
- Assignee set to okurz
fine. For blocked tickets however we should have an assignee. In this case the ticket was assigned to you as you are assigned to the subtask #69088 . For now I can do the same myself of course.
Updated by szarate about 4 years ago
- Tracker changed from action to coordination
- Status changed from Blocked to New
Updated by szarate about 4 years ago
See for the reason of tracker change: http://mailman.suse.de/mailman/private/qa-sle/2020-October/002722.html
Updated by okurz almost 4 years ago
- Target version changed from Ready to future
The only remaining subtask is targeted for "future", so moving this epic there as well
Updated by livdywan over 3 years ago
- Related to deleted (coordination #41057: [epic] Make reviewing results easier)
Updated by livdywan over 3 years ago
- Blocks coordination #41057: [epic] Make reviewing results easier added
Updated by okurz over 3 years ago
Please don't overuse "blocking" relations. Mostly we go for a "soft-block", i.e. just a decision by us what we want to do before looking into other issues, in contrast to "hard-blocks" where due to technical reasons one can not really work on the blocked ticket.
Updated by okurz about 3 years ago
- Copied to coordination #102912: [epic] Simplify investigation of job failures - 2nd added
Updated by okurz about 3 years ago
- Status changed from Blocked to Resolved
All subtasks done, with two remotely related ones moved to #102912 which covers the rest. I checked the description and we have covered most of it, so we can call this epic done. Yay! Thanks to all contributors!