coordination #19720

closed

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

[epic] Simplify investigation of job failures

Added by okurz over 7 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2019-12-17
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

motivation

Make job failure investigation easier to save time and ensure we do not miss failures

ideas


Subtasks 14 (0 open, 14 closed)

action #61103: Use CodeMirror to render diffs in the Investigation tab | Rejected | okurz | 2019-12-17
action #69085: Make "last good" a link to a job instead of plain job ID | Resolved | okurz | 2020-07-17
action #69088: Present changes between packages on openQA worker machines in "investigation" | Resolved | ilausuch | 2020-07-17
coordination #91518: [epic] Provide 'first bad' vs. 'last good' difference in investigation info | Resolved | okurz | 2021-04-21
action #91521: link to "first bad" in investigation tab | Resolved | osukup | 2021-04-21
action #92188: test reviewers are pointed to the "first bad vs. last good" comparison if current job is not already the first bad | Resolved | tinita
action #91527: Cleanup logging in autoinst-log.txt | Resolved | ilausuch | 2021-04-21
action #91878: Improve git log entries in failed test investigation | Resolved | ybonatakis | 2021-04-27
action #92731: clickable git log entries in investigation tab | Resolved | ybonatakis
action #92746: Log viewer in openQA webUI with color parsing | Resolved | mkittler | 2021-05-17
action #93940: text thumbnail preview feels inconsistent to other screenshots size:M | Resolved | osukup | 2021-06-14
action #95581: ci: Use a git commit message style checker size:S | Resolved | VANASTASIADIS | 2021-07-16
action #101533: Make text thumbnails easily distinguishable from info thumbnails | Resolved | mkittler | 2021-10-27
action #101725: Improve text result preview font size in chromium based browsers | Resolved | dheidler | 2021-10-29

Related issues 3 (2 open, 1 closed)

Related to openQA Project (public) - action #60560: Self-investigate potential reasons for failures in openQA | Resolved | okurz | 2019-12-03
Blocks openQA Project (public) - coordination #41057: [epic] Make reviewing results easier | New | 2018-04-16
Copied to openQA Project (public) - coordination #102912: [epic] Simplify investigation of job failures - 2nd | New | 2018-04-16

Actions #1

Updated by okurz over 7 years ago

  • Priority changed from Normal to Low

The needle git hash is available in vars.json as well as in the output of autoinst-log.txt. Also, https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3141 can serve as a nice example of how more information can be extracted from within the SUT by looking into the details of y2log and showing them in a record_info box, so that one does not need to transcribe screenshots or download the log files to look into details.

The following can then help to get further details:

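import requests

# Job whose results should be inspected
test_url = 'https://openqa.suse.de/tests/1033327'
# Per-module details file listing all steps of the test module
details = requests.get(test_url + '/file/details-yast2_lan_restart.json').json()
# Name of the text file attached to the first step titled "Soft Fail"
soft_failed_name = [i['text'] for i in details if 'title' in i and 'Soft Fail' in i['title']][0]
# Download and print the content of that text file
out = requests.get(test_url + '/file/' + soft_failed_name).content
print(out.decode('utf-8'))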

-> # Soft Failure:
bsc#992113

Actions #2

Updated by okurz over 7 years ago

asmorodskyi and I used that to get parsing of soft-failure details into openqa-review, see https://github.com/okurz/openqa_review/commit/44119b13454e5bdb8609f99a21426434134035d0 and https://github.com/okurz/openqa_review/commit/4c4bc937fbf6e85e31eb1f2ca42a34e9603e26d3 for details.

Actions #3

Updated by okurz almost 7 years ago

  • Assignee deleted (okurz)

Since the last update, openqa-review has improved: it now parses soft-fail info boxes as well as soft-fail needles and includes reminder comments on tickets for these cases. Some further ideas are mentioned in the description for anyone to pick up.

Actions #4

Updated by okurz over 5 years ago

  • Status changed from In Progress to Workable

back to workable as it's not really "In Progress" now.

Actions #5

Updated by okurz about 5 years ago

Actions #6

Updated by okurz almost 5 years ago

  • Description updated (diff)
Actions #7

Updated by okurz almost 5 years ago

  • Related to action #60560: Self-investigate potential reasons for failures in openQA added
Actions #8

Updated by okurz almost 5 years ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz
  • Target version set to Current Sprint
Actions #9

Updated by okurz almost 5 years ago

  • Related to coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues added
Actions #10

Updated by okurz almost 5 years ago

  • Description updated (diff)
Actions #11

Updated by okurz almost 5 years ago

  • Description updated (diff)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Target version deleted (Current Sprint)

Idea from #61103 included. The last change was the HTML table. Some things crossed off the list. With this I unassign again.

Actions #12

Updated by okurz almost 5 years ago

  • Description updated (diff)
Actions #13

Updated by okurz almost 5 years ago

  • Description updated (diff)
Actions #14

Updated by okurz almost 5 years ago

From an investigation session today about a test failure in https://openqa.suse.de/tests/3950573 :

https://openqa.suse.de/tests/3940013#investigation [the previous job that already failed in the same module with same symptoms] shows me https://gitlab.suse.de/snippets/312 from where maybe 9d9f0b2fa or ca801c541 or 716d23041 look suspicious. But keep in mind that build 148 already failed the same, so not a regression in 150. You could crosscheck last good build 146 with last good test code, last good build with current test and vice versa, i.e. last build with last good test code. if all fail then blame a qemu update on arm workers if there was any. […] btw, no qemu update on worker: https://gitlab.suse.de/snippets/313
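The crosscheck suggested above can be enumerated mechanically. A minimal sketch, assuming a failure is characterized by just two inputs, the product build and the test code revision; the function name, the dictionary keys and the example revisions are hypothetical and not part of openQA:

```python
from itertools import product


def crosscheck_matrix(last_good, first_bad):
    """Return all (build, test_code) combinations worth retriggering.

    `last_good` and `first_bad` are hypothetical dicts with the keys
    "build" and "test_code"; the failing combination itself is skipped
    because its result is already known.
    """
    builds = [last_good["build"], first_bad["build"]]
    revisions = [last_good["test_code"], first_bad["test_code"]]
    known_bad = (first_bad["build"], first_bad["test_code"])
    return [
        (build, revision)
        for build, revision in product(builds, revisions)
        if (build, revision) != known_bad
    ]


# Placeholder values for illustration only
last_good = {"build": "146", "test_code": "good_rev"}
first_bad = {"build": "148", "test_code": "bad_rev"}
for build, revision in crosscheck_matrix(last_good, first_bad):
    print(f"build {build} with test code {revision}")
```

If all three combinations fail, neither the build nor the test code alone explains the failure, which is when something outside both, such as a package update on the worker, becomes the next suspect.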

steps conducted:

  • followed "next & previous" results to find "first failed", found previous job on previous build
  • clicked "investigation" tab on failed job, identified that settings diff and needle changes look unrelated, identified some potentially related test changes -> https://gitlab.suse.de/snippets/312
  • suggestion in general to bisect: "crosscheck last good build […] with last good test code, last good build with current test and vice versa, i.e. last build with last good test code"
  • identified date of "last good" from https://openqa.suse.de/tests/3940013#next_previous -> title="2020-02-26T16:29:53Z"
  • logged in to the assigned worker machine, openqaworker-arm-2, and called sudo grep -A 100000 '2020-02-26' /var/log/zypp/history | sed -n 's@^.*install|\([^|]*\)|.*$@\1@p' | sort to identify changed packages.
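The shell pipeline in the last step above can also be written as a small script. A sketch, assuming the pipe-separated layout of /var/log/zypp/history with timestamp, action and package name as the first three fields; the function name and the sample lines are made up for illustration. Unlike the grep call, it compares dates instead of relying on the first matching line:

```python
def installed_since(history_lines, since):
    """Return sorted names of packages installed on or after `since`.

    `since` is a date string like "2020-02-26"; comparing the leading
    "YYYY-MM-DD" part of each timestamp as text is sufficient because
    that format sorts chronologically.
    """
    packages = set()
    for line in history_lines:
        if line.startswith("#"):
            continue  # skip comment lines
        fields = line.rstrip("\n").split("|")
        if len(fields) < 3 or fields[1].strip() != "install":
            continue  # only installation entries are of interest
        if fields[0][:10] >= since:
            packages.add(fields[2])
    return sorted(packages)


# Made-up sample data, not taken from a real worker
sample = [
    "# comment line",
    "2020-02-25 10:00:00|install|old-package|1.0-1|noarch||repo|checksum|",
    "2020-02-26 17:00:00|install|qemu-arm|4.2.0-1|aarch64||repo|checksum|",
    "2020-02-27 09:30:00|remove|some-package|1.0-1|noarch|root@host|",
    "2020-02-27 09:31:00|install|libzypp|17.0-1|aarch64||repo|checksum|",
]
print(installed_since(sample, "2020-02-26"))
```

On a worker the same function could be fed with the opened history file instead of the sample list, which typically requires root permissions.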

Idea for the "packages on worker": Have a hook that instance admins can define in the worker configuration and fill with custom commands, e.g. rpm -qa, whose output is collected in log files and uploaded to jobs. The investigation can then diff the logs.
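A diff between two such uploaded package lists could look like the following sketch. It assumes each log is plain rpm -qa output with one package per line; the function name and the sample package strings are hypothetical:

```python
def package_diff(last_good_log, first_bad_log):
    """Compare two `rpm -qa` style listings given as strings.

    Returns a tuple (removed, added): entries only present in the
    "last good" listing and entries only present in the "first bad" one.
    An updated package shows up in both lists with different versions.
    """
    good = {line.strip() for line in last_good_log.splitlines() if line.strip()}
    bad = {line.strip() for line in first_bad_log.splitlines() if line.strip()}
    return sorted(good - bad), sorted(bad - good)


# Made-up listings for illustration
last_good_log = "bash-4.4-1.x86_64\nqemu-4.1.0-1.x86_64\n"
first_bad_log = "bash-4.4-1.x86_64\nqemu-4.2.0-1.x86_64\n"
removed, added = package_diff(last_good_log, first_bad_log)
for entry in removed:
    print("-" + entry)
for entry in added:
    print("+" + entry)
```

Using sets makes the result independent of the order in which rpm -qa happens to list the packages.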

Actions #15

Updated by okurz almost 5 years ago

  • Description updated (diff)
  • Status changed from Workable to Feedback
  • Assignee set to okurz

https://github.com/os-autoinst/scripts/pull/23 to cover "See https://github.com/okurz/openqa_review/tree/feature/investigate especially https://github.com/okurz/openqa_review/blob/feature/investigate/openqa_review/investigate.py"

crossed off two other points and deleted "* os-autoinst version in vars.json" as this should either not be necessary considering that the version is in the log file or implicitly covered by "all package changes, e.g. save rpm -qa in file and provide diff and/or changelog" which one can do on the worker just as well as within the SUT.

I have tested the call to rpm -qa on all osd workers:

# salt -l error --no-color -C  'G@roles:worker' --state-output=changes cmd.run "time rpm -qa > /tmp/rpm_qa; ls -lh /tmp/rpm_qa; rm /tmp/rpm_qa"
openqaworker3.suse.de: 0m1.187s, 68K
openqaworker2.suse.de: 0m1.191s, 64K
openqaworker6.suse.de: 0m1.210s, 67K
openqaworker8.suse.de: 0m1.210s, 63K
openqaworker5.suse.de: 0m1.226s, 67K
openqaworker9.suse.de: 0m1.510s, 63K
malbec.arch.suse.de: 0m1.730s, 53K
openqaworker10.suse.de: 0m1.715s, 60K
QA-Power8-5-kvm.qa.suse.de: 0m1.826s, 65K
openqaworker13.suse.de: 0m1.765s, 64K
QA-Power8-4-kvm.qa.suse.de: 0m1.884s, 67K
powerqaworker-qam-1: 0m2.244s, 93K
openqaworker-arm-1.suse.de: 0m4.072s, 63K
openqaworker-arm-2.suse.de: 0m4.076s, 62K
openqaworker-arm-3.suse.de: 0m4.639s, 72K
grenache-1.qa.suse.de: 0m4.859s, 92K

So 1-5s, recording 60-100kB. We could record this in every job but that does not feel efficient. Parsing the zypper log only in case of an error, with the diff computed between the dates of "last good" and "first bad", feels smarter, albeit of course a little bit more complicated :) The philosophy to follow would be to make the happy path (test passes) as efficient as possible and treat failures as exceptions, hence I do not consider recording rpm -qa in every job regardless of the status to be the right option.

Actions #16

Updated by okurz over 4 years ago

  • Description updated (diff)
Actions #17

Updated by okurz over 4 years ago

  • Status changed from Feedback to Workable

os-autoinst/os-autoinst#1367 merged, can now continue with https://github.com/os-autoinst/scripts/pull/23

Actions #18

Updated by okurz over 4 years ago

merged. Next step: Run it somewhere, e.g.

  • Set up GitHub Actions CI job
  • query for failed jobs
    • without any comment on openQA -> openqa-client --json-output --host https://openqa.opensuse.org jobs/$id/comment == []
    • e.g. start with job group 1 (Tumbleweed)
    • Or based on AMQP event if "bugref" is "null"
    • and – over openQA API – "origin_id" is "null" to not retrigger jobs that are already retriggered jobs
  • for each identified job, run investigate and write output, i.e. link to triggered jobs to comment on job
  • sleep in loop with amqp rx client, wait for new jobs and repeat
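The selection criteria in the list above can be expressed as a simple filter. A sketch operating on already fetched data; the dictionary layout (keys "id", "result", "origin_id") and the comments_by_job mapping are assumptions for illustration and not a confirmed openQA API schema:

```python
def investigation_candidates(jobs, comments_by_job):
    """Select failed jobs that nobody has looked at yet.

    A job qualifies if it failed, has no comments and is not itself a
    retriggered job, i.e. its "origin_id" is None.
    """
    return [
        job["id"]
        for job in jobs
        if job["result"] == "failed"
        and not comments_by_job.get(job["id"])
        and job.get("origin_id") is None
    ]


# Made-up example data
jobs = [
    {"id": 1, "result": "failed", "origin_id": None},
    {"id": 2, "result": "failed", "origin_id": 1},
    {"id": 3, "result": "passed", "origin_id": None},
    {"id": 4, "result": "failed", "origin_id": None},
]
comments_by_job = {4: ["label:known_issue"]}
print(investigation_candidates(jobs, comments_by_job))
```

Keeping the filter free of network calls means it can be reused whether the job data comes from polling the API or from AMQP events.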

The proof-of-concept script still seems like a good idea, but more and more it feels like trying to read all these details from the outside, finding a place to run it and also triggering accordingly is more effort than retriggering directly from within openQA. E.g. based on my branch "feature/known_issue_detection" I could have "feature/investigate" with a method "trigger_investigation_jobs".

Actions #19

Updated by okurz over 4 years ago

  • Status changed from Workable to Feedback

With https://github.com/os-autoinst/scripts/pull/24 I added a script to look for "investigation candidates" and added a feature to "openqa-investigate" to post automatic comments to openQA with links to triggered investigation jobs. With https://gitlab.suse.de/openqa/auto-review/-/merge_requests/3 I run this daily (for now), so automatic investigation jobs now get triggered for all failed openQA jobs that do not have a label, e.g. new failures: https://openqa.opensuse.org/tests/1218545#comments . Based on the results one should be able to more easily determine whether issues are sporadic, test regressions or product regressions. I await any feedback on that. We can also continue the work in openQA itself.

Actions #20

Updated by okurz over 4 years ago

  • Description updated (diff)
Actions #21

Updated by livdywan over 4 years ago

  • all package changes, e.g. save rpm -qa in file
  • diff of test schedule
  • if best needle candidate matches 0% it is most likely not a trivial needle issue
  • Make "last good" a link to a job instead of plain job ID
  • collapse content of initial rows in investigation tab when content becomes too big, e.g. more than 10 lines
  • In settings table mark origin of settings and changed settings, e.g. for setting "foo" instead of the table row "foo | 1" one could have [...]

These seem to be the outstanding items from the ideas. Are you still planning to look into these? Do you want to discuss these more (since you set it to Feedback)? I'd volunteer to help out.

Actions #22

Updated by okurz over 4 years ago

  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

I guess by now I can update the bug as I have not received further feedback lately. Results and observations so far:

  • Some people like the "Investigation" tab, some have not even found it and got lost in their own investigation :(
  • The script solution in #19720#note-19 kinda works and can be replaced by a solution within openQA which I have prepared some steps for but did not manage to continue further so far
  • The other mentioned ideas are also still valid and can be done anytime
Actions #23

Updated by ilausuch over 4 years ago

For "collapse content of initial rows in investigation tab when content becomes too big, e.g. more than 10 lines":
https://github.com/os-autoinst/openQA/pull/3257

Actions #24

Updated by okurz over 4 years ago

  • Status changed from Workable to Feedback
  • Assignee set to ilausuch

@ilausuch assigning to you. Even if you don't plan to cover all ideas you can track this ticket and after being done with the parts that you planned you can set back to "Workable" and unassign again

Actions #25

Updated by ilausuch over 4 years ago

For "Make "last good" a link to a job instead of plain job ID":
https://github.com/os-autoinst/openQA/pull/3261

Actions #26

Updated by okurz over 4 years ago

  • Description updated (diff)

discussed with ilausuch, clarified the approach regarding saving changes from worker/SUT, updated description.

Actions #27

Updated by okurz over 4 years ago

  • Due date set to 2020-07-17

due to changes in a related task: #69085

Actions #28

Updated by okurz over 4 years ago

  • Due date set to 2020-07-17

due to changes in a related task: #69088

Actions #29

Updated by okurz over 4 years ago

  • Subject changed from Simplify investigation of job failures to [epic] Simplify investigation of job failures
  • Description updated (diff)
  • Status changed from Feedback to Blocked
  • Target version set to Ready

Updated description with links to individual PRs for already covered items, reworked to be "epic" with subtasks for current WIP items

Actions #30

Updated by ilausuch over 4 years ago

  • Description updated (diff)
Actions #31

Updated by okurz over 4 years ago

  • Description updated (diff)
Actions #32

Updated by ilausuch about 4 years ago

  • Assignee deleted (ilausuch)

I release this ticket for now

Actions #33

Updated by okurz about 4 years ago

  • Assignee set to okurz

fine. For blocked tickets however we should have an assignee. In this case the ticket was assigned to you as you are assigned to the subtask #69088 . For now I can do the same myself of course.

Actions #34

Updated by szarate about 4 years ago

  • Tracker changed from action to coordination
  • Status changed from Blocked to New
Actions #36

Updated by okurz about 4 years ago

  • Status changed from New to Blocked
Actions #37

Updated by okurz about 4 years ago

  • Parent task set to #39719
Actions #38

Updated by okurz almost 4 years ago

  • Target version changed from Ready to future

The only remaining subtask is for "future", so moving this epic there as well.

Actions #39

Updated by okurz over 3 years ago

  • Target version changed from future to Ready
Actions #40

Updated by livdywan over 3 years ago

Actions #41

Updated by livdywan over 3 years ago

Actions #42

Updated by okurz over 3 years ago

Please don't overuse "blocking" relations. Mostly we go for a "soft-block", i.e. just a decision by us what we want to do before looking into other issues, in contrast to "hard-blocks" where due to technical reasons one can not really work on the blocked ticket.

Actions #43

Updated by okurz about 3 years ago

Actions #44

Updated by okurz about 3 years ago

  • Status changed from Blocked to Resolved

All subtasks are done, with two remotely related ones moved to #102912 which covers the rest. I checked the description and we have really covered most of it, so we can call this epic done. Yay! Thanks to all contributors!
