coordination #19720: [epic] Simplify investigation of job failures - openQA Project (public) - openSUSE Project Management Tool

coordination #19720

closed

#1

Updated by okurz almost 8 years ago

Priority changed from Normal to Low

The needle git hash is available in vars.json as well as in the output of autoinst-log.txt. Also, https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3141 serves as a nice example of how more information can be extracted from within the SUT: it looks into the details of y2log and shows them in a record_info box, so that one does not need to transcribe screenshots or download the log files to look into details.

The following can then help to get further details:

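import requests

test_url = 'https://openqa.suse.de/tests/1033327'
# Fetch the step details of the test module "yast2_lan_restart"
details = requests.get(test_url + '/file/details-yast2_lan_restart.json').json()
# Take the text file of the first step whose title marks a soft failure
soft_failed_name = [i['text'] for i in details if 'title' in i and 'Soft Fail' in i['title']][0]
# Download that text file and print its content
out = requests.get(test_url + '/file/' + soft_failed_name).content
print(out.decode('utf-8'))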

-> # Soft Failure:
bsc#992113
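
The list comprehension above takes only the first soft failure and raises an IndexError when a module has none. A slightly more defensive variant could look like the following sketch. It works on already downloaded details JSON, and the helper name `soft_failure_files` as well as the sample data are made up for illustration; only the `title` and `text` keys are taken from the snippet above.

```python
def soft_failure_files(details):
    """Return the 'text' file names of all steps titled as soft failures."""
    return [
        step["text"]
        for step in details
        # Steps without a title (or without a text file) are skipped
        if "Soft Fail" in (step.get("title") or "") and "text" in step
    ]


# Hypothetical sample mimicking the structure the snippet above relies on
sample = [
    {"screenshot": "yast2_lan_restart-1.png"},
    {"title": "Soft Failed", "text": "yast2_lan_restart-2.txt"},
    {"title": "wait_serial", "text": "yast2_lan_restart-3.txt"},
]
print(soft_failure_files(sample))
```

An empty result then simply means that the module recorded no soft failure.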

#2

Updated by okurz over 7 years ago

asmorodskyi and I used that to get parsing of soft-failure details into openqa-review, see https://github.com/okurz/openqa_review/commit/44119b13454e5bdb8609f99a21426434134035d0 and https://github.com/okurz/openqa_review/commit/4c4bc937fbf6e85e31eb1f2ca42a34e9603e26d3 for details.

#3

Updated by okurz over 7 years ago

Assignee deleted (~~okurz~~)

Since the last update, openqa-review has improved: it now parses soft-fail info boxes as well as soft-fail needles and also posts reminder comments on tickets for these cases. Some further ideas are mentioned in the description for anyone to pick up.

#4

Updated by okurz over 5 years ago

Status changed from In Progress to Workable

back to workable as it's not really "In Progress" now.

#5

Updated by okurz over 5 years ago

Related to coordination #41057: [epic] Make reviewing results easier added

#6

Updated by okurz over 5 years ago

Description updated (diff)

#7

Updated by okurz over 5 years ago

Related to action #60560: Self-investigate potential reasons for failures in openQA added

#8

Updated by okurz over 5 years ago

Status changed from Workable to Feedback
Assignee set to okurz
Target version set to Current Sprint

https://github.com/os-autoinst/openQA/pull/2612

#9

Updated by okurz over 5 years ago

Related to coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues added

#10

Updated by okurz over 5 years ago

Description updated (diff)

#11

Updated by okurz about 5 years ago

Description updated (diff)
Status changed from Feedback to Workable
Assignee deleted (~~okurz~~)
Target version deleted (~~Current Sprint~~)

Idea from #61103 included. The last change was the HTML table. Some things are crossed off the list. With this I unassign again.

#12

Updated by okurz about 5 years ago

Description updated (diff)

#13

Updated by okurz about 5 years ago

Description updated (diff)

#14

Updated by okurz about 5 years ago

From an investigation session today about a test failure in https://openqa.suse.de/tests/3950573 :

https://openqa.suse.de/tests/3940013#investigation [the previous job that already failed in the same module with the same symptoms] shows me https://gitlab.suse.de/snippets/312 from where maybe 9d9f0b2fa or ca801c541 or 716d23041 look suspicious. But keep in mind that build 148 already failed the same way, so it is not a regression in 150. You could crosscheck last good build 146 with last good test code, last good build with current test and vice versa, i.e. last build with last good test code. If all fail, then blame a qemu update on the arm workers if there was any. […] btw, no qemu update on the worker: https://gitlab.suse.de/snippets/313
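
The crosscheck suggested in that message can be read as a small decision table. The sketch below is only one interpretation of it, not what any openQA tool implements: the function name `classify` and the result labels are made up, and each argument is a boolean that says whether the corresponding investigation job passed.

```python
def classify(good_build_good_test, good_build_new_test, new_build_good_test):
    """Interpret the three crosscheck jobs; True means the job passed."""
    results = (good_build_good_test, good_build_new_test, new_build_good_test)
    if not any(results):
        # Even the formerly good combination fails: look at the environment,
        # e.g. package updates on the worker
        return "environment"
    if good_build_good_test and not good_build_new_test and new_build_good_test:
        # Only the job that uses the current test code fails
        return "test regression"
    if good_build_good_test and good_build_new_test and not new_build_good_test:
        # Only the job that uses the current build fails
        return "product regression"
    return "inconclusive"


print(classify(False, False, False))
print(classify(True, False, True))
print(classify(True, True, False))
```

Any other combination, for example all three jobs passing, is reported as inconclusive and needs a manual look, e.g. for a sporadic issue.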

steps conducted:

followed "next & previous" results to find "first failed", found previous job on previous build
clicked the "investigation" tab on the failed job, identified that the settings diff and needle changes look unrelated, identified some potentially related test changes -> https://gitlab.suse.de/snippets/312
suggestion in general to bisect: "crosscheck last good build […] with last good test code, last good build with current test and vice versa, i.e. last build with last good test code"
identified the date of "last good" from https://openqa.suse.de/tests/3940013#next_previous -> title="2020-02-26T16:29:53Z"
logged into the assigned worker machine, openqaworker-arm-2, and called sudo grep -A 100000 '2020-02-26' /var/log/zypp/history | sed -n 's@^.*install|\([^|]*\)|.*$@\1@p' | sort to identify changed packages.
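
The last step can also be sketched in Python. The following assumes the usual pipe-separated layout of /var/log/zypp/history (timestamp, action, package name, ...); the function name `installed_since` and the sample lines are invented for illustration. Unlike the grep call it compares full timestamps and returns each package name only once.

```python
from datetime import datetime


def installed_since(history_lines, since, until=None):
    """Return sorted unique names of packages installed in [since, until]."""
    names = set()
    for line in history_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comments
        fields = line.split("|")
        if len(fields) < 3 or fields[1] != "install":
            continue  # only "install" entries are of interest here
        try:
            stamp = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # ignore lines with an unexpected timestamp
        if stamp < since or (until is not None and stamp > until):
            continue
        names.add(fields[2])
    return sorted(names)


# Made-up history excerpt, not taken from a real worker
sample = """\
# 2020-02-20 08:00:00 hypothetical comment line
2020-02-25 09:00:00|install|vim|8.0-1.1|aarch64|root@worker|repo-oss|abc|
2020-02-26 17:00:00|install|qemu|4.2.0-1.1|aarch64|root@worker|repo-update|def|
2020-02-27 03:00:00|remove |oldpkg|1.0-1.1|aarch64|root@worker|
2020-02-27 03:10:00|install|kernel-default|5.3.18-1.1|aarch64|root@worker|repo-update|ghi|
""".splitlines()

print(installed_since(sample, datetime(2020, 2, 26, 16, 29, 53)))
```

Passing the date of "first bad" as `until` would limit the result to the window between the two jobs.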

Idea for the "packages on worker": Have a hook that instance admins can define in the worker conf and fill with custom commands, e.g. rpm -qa, whose output is collected in log files and uploaded to jobs. The investigation can then diff these logs between jobs.
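
Assuming such a hook uploaded the output of rpm -qa for every job, the diff between "last good" and "first bad" could be as simple as a set comparison. This is a minimal sketch; the function name `package_changes` and the package lists are hypothetical.

```python
def package_changes(last_good, first_bad):
    """Compare two package listings (one 'name-version-release.arch' per entry)."""
    good = {line.strip() for line in last_good if line.strip()}
    bad = {line.strip() for line in first_bad if line.strip()}
    return {"removed": sorted(good - bad), "added": sorted(bad - good)}


# Made-up listings as they could come from two uploaded log files
last_good = ["qemu-4.2.0-1.1.aarch64", "vim-8.0-1.1.aarch64"]
first_bad = ["qemu-4.2.0-2.1.aarch64", "vim-8.0-1.1.aarch64"]
print(package_changes(last_good, first_bad))
```

An updated package shows up once under "removed" with its old version and once under "added" with its new version.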

#15

Updated by okurz about 5 years ago

Description updated (diff)
Status changed from Workable to Feedback
Assignee set to okurz

https://github.com/os-autoinst/scripts/pull/23 to cover "See https://github.com/okurz/openqa_review/tree/feature/investigate especially https://github.com/okurz/openqa_review/blob/feature/investigate/openqa_review/investigate.py"

crossed off two other points and deleted "* os-autoinst version in vars.json" as this should either not be necessary considering that the version is in the log file or implicitly covered by "all package changes, e.g. save rpm -qa in file and provide diff and/or changelog" which one can do on the worker just as well as within the SUT.

I have tested the call to rpm -qa on all osd workers:

# salt -l error --no-color -C  'G@roles:worker' --state-output=changes cmd.run "time rpm -qa > /tmp/rpm_qa; ls -lh /tmp/rpm_qa; rm /tmp/rpm_qa"
openqaworker3.suse.de: 0m1.187s, 68K
openqaworker2.suse.de: 0m1.191s, 64K
openqaworker6.suse.de: 0m1.210s, 67K
openqaworker8.suse.de: 0m1.210s, 63K
openqaworker5.suse.de: 0m1.226s, 67K
openqaworker9.suse.de: 0m1.510s, 63K
malbec.arch.suse.de: 0m1.730s, 53K
openqaworker10.suse.de: 0m1.715s, 60K
QA-Power8-5-kvm.qa.suse.de: 0m1.826s, 65K
openqaworker13.suse.de: 0m1.765s, 64K
QA-Power8-4-kvm.qa.suse.de: 0m1.884s, 67K
powerqaworker-qam-1: 0m2.244s, 93K
openqaworker-arm-1.suse.de: 0m4.072s, 63K
openqaworker-arm-2.suse.de: 0m4.076s, 62K
openqaworker-arm-3.suse.de: 0m4.639s, 72K
grenache-1.qa.suse.de: 0m4.859s, 92K

So 1-5s, recording roughly 50-100kB. We could record this in every job, but that does not feel efficient. Parsing the zypper log only in case of an error, with the diff computed between the dates of "last good" and "first bad", feels smarter, albeit of course a little bit more complicated :) The philosophy to follow would be to make the happy path (test passes) as efficient as possible and treat failures as exceptions. Hence I do not consider recording rpm -qa in every job, regardless of the status, the right option.

#16

Updated by okurz about 5 years ago

Description updated (diff)

#17

Updated by okurz about 5 years ago

Status changed from Feedback to Workable

os-autoinst/os-autoinst#1367 merged, can now continue with https://github.com/os-autoinst/scripts/pull/23

#18

Updated by okurz about 5 years ago

merged. Next step: Run it somewhere, e.g.

Setup github actions CI job
query for failed jobs
- without any comment on openQA -> openqa-client --json-output --host https://openqa.opensuse.org jobs/$id/comment == []
- e.g. start with job group 1 (Tumbleweed)
- Or based on AMQP event if "bugref" is "null"
- and – over openQA API – "origin_id" is "null" to not retrigger jobs that are already retriggered jobs
for each identified job, run investigate and write output, i.e. link to triggered jobs to comment on job
sleep in loop with amqp rx client, wait for new jobs and repeat
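
The selection criteria in this plan can be sketched as a pure filter. Everything here is an assumption for illustration: the function name `investigation_candidates`, the `comments_for` callback (which would wrap the openQA API call in a real setup) and the sample jobs. Only the `origin_id` field and the "no comment yet" condition come from the list above.

```python
def investigation_candidates(failed_jobs, comments_for):
    """Return IDs of failed jobs that are worth an automatic investigation."""
    candidates = []
    for job in failed_jobs:
        if job.get("origin_id") is not None:
            continue  # already a retriggered job, do not retrigger again
        if comments_for(job["id"]):
            continue  # somebody (or something) already commented
        candidates.append(job["id"])
    return candidates


# Made-up data standing in for API responses
jobs = [
    {"id": 101, "origin_id": None},
    {"id": 102, "origin_id": 101},
    {"id": 103, "origin_id": None},
]
comments = {103: [{"text": "known issue"}]}
print(investigation_candidates(jobs, lambda job_id: comments.get(job_id, [])))
```

Keeping the API access behind a callback makes the filter easy to try out without a running openQA instance.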

The proof-of-concept script still seems like a good idea, but more and more it feels like reading all these details from the outside, finding a place to run it and triggering accordingly is more effort than retriggering directly from within openQA. E.g. based on my branch "feature/known_issue_detection" I could have "feature/investigate" with a method "trigger_investigation_jobs".

#19

Updated by okurz about 5 years ago

Status changed from Workable to Feedback

With https://github.com/os-autoinst/scripts/pull/24 I added a script to look for "investigation candidates" and added a feature to "openqa-investigate" to post automatic comments to openQA with links to the triggered investigation jobs. With https://gitlab.suse.de/openqa/auto-review/-/merge_requests/3 I run this daily (for now), so automatic investigation jobs now get triggered for all failed openQA jobs that do not have a label, e.g. new failures: https://openqa.opensuse.org/tests/1218545#comments . Based on the results one should be able to determine more easily whether issues are sporadic, test regressions or product regressions. I await any feedback on that. We can also continue the work in openQA itself.

#20

Updated by okurz about 5 years ago

Description updated (diff)

#21

Updated by livdywan almost 5 years ago

all package changes, e.g. save rpm -qa in file

diff of test schedule

if best needle candidate matches 0% it is most likely not a trivial needle issue

Make "last good" a link to a job instead of plain job ID

collapse content of initial rows in investigation tab when content becomes too big, e.g. more than 10 lines

In settings table mark origin of settings and changed settings, e.g. for setting "foo" instead of the table row "foo | 1" one could have [...]

These seem to be the outstanding items from the ideas. Are you still planning to look into these? Do you want to discuss these more (since you set it to Feedback)? I'd volunteer to help out.

#22

Updated by okurz almost 5 years ago

Status changed from Feedback to Workable
Assignee deleted (~~okurz~~)

I guess by now I can update the bug as I have not received further feedback lately. Results and observations so far:

Some people like the "Investigation" tab, some have not even found it and got lost in their own investigation :(
The script solution in #19720#note-19 kinda works and can be replaced by a solution within openQA which I have prepared some steps for but did not manage to continue further so far
The other mentioned ideas are also still valid and can be done anytime

#23

Updated by ilausuch almost 5 years ago

For collapse content of initial rows in investigation tab when content becomes too big, e.g. more than 10 lines
https://github.com/os-autoinst/openQA/pull/3257

#24

Updated by okurz almost 5 years ago

Status changed from Workable to Feedback
Assignee set to ilausuch

@ilausuch assigning to you. Even if you don't plan to cover all ideas you can track this ticket and after being done with the parts that you planned you can set back to "Workable" and unassign again

#25

Updated by ilausuch almost 5 years ago

Make "last good" a link to a job instead of plain job ID
https://github.com/os-autoinst/openQA/pull/3261

#26

Updated by okurz almost 5 years ago

Description updated (diff)

discussed with ilausuch, clarified the approach regarding saving changes from worker/SUT, updated description.

#27

Updated by okurz almost 5 years ago

Due date set to 2020-07-17

due to changes in a related task: #69085

#28

Updated by okurz almost 5 years ago

Due date set to 2020-07-17

due to changes in a related task: #69088

#29

Updated by okurz almost 5 years ago

Subject changed from Simplify investigation of job failures to [epic] Simplify investigation of job failures
Description updated (diff)
Status changed from Feedback to Blocked
Target version set to Ready

Updated description with links to individual PRs for already covered items, reworked to be "epic" with subtasks for current WIP items

#30

Updated by ilausuch almost 5 years ago

Description updated (diff)

#31

Updated by okurz almost 5 years ago

Description updated (diff)

#32

Updated by ilausuch over 4 years ago

Assignee deleted (~~ilausuch~~)

I release this ticket for now

#33

Updated by okurz over 4 years ago

Assignee set to okurz

fine. For blocked tickets however we should have an assignee. In this case the ticket was assigned to you as you are assigned to the subtask #69088 . For now I can do the same myself of course.

#34

Updated by szarate over 4 years ago

Tracker changed from action to coordination
Status changed from Blocked to New

#35

Updated by szarate over 4 years ago

See for the reason of tracker change: http://mailman.suse.de/mailman/private/qa-sle/2020-October/002722.html

#36

Updated by okurz over 4 years ago

Status changed from New to Blocked

#37

Updated by okurz over 4 years ago

Parent task set to #39719

#38

Updated by okurz about 4 years ago

Target version changed from Ready to future

The only remaining subtask is targeted for "future", so moving this epic there as well.

#39

Updated by okurz almost 4 years ago

Target version changed from future to Ready

#40

Updated by livdywan almost 4 years ago

Related to deleted (coordination #41057: [epic] Make reviewing results easier)

#41

Updated by livdywan almost 4 years ago

Blocks coordination #41057: [epic] Make reviewing results easier added

#42

Updated by okurz almost 4 years ago

Please don't overuse "blocking" relations. Mostly we go for a "soft-block", i.e. just a decision by us what we want to do before looking into other issues, in contrast to "hard-blocks" where due to technical reasons one can not really work on the blocked ticket.

#43

Updated by okurz over 3 years ago

Copied to coordination #102912: [epic] Simplify investigation of job failures - 2nd added

#44

Updated by okurz over 3 years ago

Status changed from Blocked to Resolved

All subtasks are done, with two remotely related ones moved to #102912, which covers the rest. I checked the description and we have covered most of it, so we can call this epic done. Yay! Thanks to all contributors!
