coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release
5 whys follow-up to Missing (re-)schedules of SLE maintenance tests size:M
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
- Why did it take 3 days for someone to notice?
- Because we were not alerted automatically about the error. The GitLab CI pipeline was actually running and "successful", but it ran with "--dry-run" for testing purposes -> We could monitor the freshness of "incidents updated from smelt", e.g. show it in the dashboard UI. The dashboard already has a "last updated" timestamp, which can be misleading.
- Because the reviewers of SLE incident tests are not notified automatically about such problems and they only review every couple of days
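A concrete follow-up could be a freshness check that the dashboard (or an alerting job) runs against the timestamp of the last smelt sync. This is a minimal sketch; the 6-hour threshold and the function name are assumptions for illustration, not existing dashboard code:

```python
# Hypothetical freshness check for the "incidents updated from smelt" data;
# MAX_AGE and is_stale are assumed names, not part of any existing component.
import datetime

MAX_AGE = datetime.timedelta(hours=6)  # assumed acceptable staleness

def is_stale(last_updated, now):
    """Return True if the last smelt sync is older than MAX_AGE."""
    return now - last_updated > MAX_AGE
```

Wired into the dashboard or a CI check, this would have flagged the dry-run pipeline within hours instead of days.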
- Why did we think this might even be a feature request?
- Expected behavior (vs. actual behavior) was very unclear.
- Why could we not easily pinpoint what the source of the problem is, smelt, syncing, openQA scheduling?
- Because the logging of qem-bot is not good enough
- Why don't we have an architecture diagram or architecture description of the involved components?
- No one created one. Likely not even a QAM architect was involved: coolo created a proof of concept, the current project grew out of it, and everything turned into a mess.
- Why did we just not read the logs, which said "dry-run" in the first line(s)?
- At first we started to suspect one of the "scheduling" jobs but we needed to look at 5 different gitlab CI jobs to find the "sync-smelt" job
-> link to the gitlab CI pipeline logs from the dashboard (https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules)
-> Now we know better and feel more confident looking into the logs :) Can we do a walkthrough to understand the current logs and what is going on?
- Because we were not proficient with the log files and the meanings of the messages
-> conducting a log file walkthrough together with the team
-> https://build.suse.de/project/show/SUSE:Maintenance:12265 is likely obsolete and should be removed, we could point that out to maintenance coordination engineers
-> many log lines with "ERROR" about missing repomd.xml -> turn to INFO
-> log message "WARNING: Missing product in /etc/openqabot" -> "DEBUG: Skipping obsolete openQABot config /etc/openqabot/bot.yml"
-> log message "DEBUG: Incident … does not have x86_64 arch in 12-SP3" -> so what? -> maybe we can simply remove that message and ignore that, or move to TRACE
-> log message "DEBUG: No channels in … for …" -> Can we put some hints to the readers there what it means or what they could check, e.g. is there no valid smelt_channel to openQA product mapping in metadata? Maybe incident is obsolete and should be closed, removed, etc.?
-> log message "NOT SCHEDULE:" -> lowercase and use "not scheduling"
-> log message "Project ... can't calculate repohash" -> would be useful to have a timestamp of the last update from OBS
-> log message for aggregates "Posting ... jobs" is ambiguous or wrong, should be more like "Triggering ... openQA products" or similar, or "openqa isos post calls"
-> we found a problem with an exception: the openQA API returns 404 on "post isos" because a product is missing in openQA. This error is ignored and the job continues. We should handle that better.
-> add a concluding log message after triggering tests, like "Triggering done ... jobs" or similar
-> in "inc-approve" there is "ERROR: Job ... not found" -> how can that happen and what does it mean?
-> "inc-approve" ends with an exception on Forbidden 403 and then the job still succeeds -> could be a regression from the retrying approach -> DONE: https://github.com/openSUSE/qem-bot/pull/10
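For the silently ignored 404 on "post isos", a minimal sketch of how the bot could classify the response instead of continuing as if nothing happened. The function name and return values are illustrative only, not qem-bot's actual API layer:

```python
# Sketch: map an openQA "isos post" response code to an explicit action,
# instead of swallowing the 404 that indicates a missing product in openQA.
# handle_isos_post is a hypothetical name for illustration.
def handle_isos_post(status_code):
    """Decide how the scheduling job should react to the openQA response."""
    if status_code == 200:
        return "scheduled"
    if status_code == 404:
        # Product missing in openQA: fail visibly rather than continuing.
        return "fail: product missing in openQA"
    return "fail: unexpected status %d" % status_code
```

Anything starting with "fail:" would then set a non-zero exit status so the GitLab CI job turns red.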
- Why is it so hard to find out, starting from an openQA job details page, why that job was created?
- The user used for scheduling is always "qa-maintenance-automation".
- Scheduling settings don't contain a URL to e.g. the GitLab pipeline that did the scheduling. -> helpful in general, but in this specific case we were encountering "missing jobs" rather than "wrongly scheduled jobs"
-> When scheduling openQA tests from bot-ng, add two more settings: the URL pointing to the incident as shown on the dashboard, e.g. https://dashboard.qam.suse.de/incident/23309, and as a second setting the GitLab CI job URL that triggered the scheduling
-> make the URL clickable on https://openqa.suse.de/admin/productlog?id=887145, same as on the job settings page
-> to not overcrowd job details settings pages we should invent a special prefix which only applies to schedule variables, like the _OBSOLETE and _ONLY_OBSOLETE_SAME_BUILD variables we already use, and not forward such variables to the job. This could even help us streamline the two variables mentioned above. Suggestions:
-> Reading the logs we found out that the openQA schedule settings are provided to both the dashboard and openQA, so we can add the useful URLs, e.g. the GitLab job, to both and display them in both the dashboard and openQA
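The schedule-only prefix idea above could look roughly like this. The "__" prefix, the variable names and the helper functions are assumptions to illustrate the mechanism, not an agreed convention in qem-bot or openQA:

```python
# Sketch of a schedule-only settings prefix: such variables stay visible in
# the scheduled-products log but are not forwarded to the openQA jobs.
# All names here are hypothetical.
SCHEDULE_ONLY_PREFIX = "__"

def add_trace_settings(settings, incident_url, ci_job_url):
    """Attach traceability URLs as schedule-only variables."""
    traced = dict(settings)
    traced[SCHEDULE_ONLY_PREFIX + "DASHBOARD_INCIDENT_URL"] = incident_url
    traced[SCHEDULE_ONLY_PREFIX + "GITLAB_CI_JOB_URL"] = ci_job_url
    return traced

def job_settings(settings):
    """Drop schedule-only variables before they reach the job itself."""
    return {k: v for k, v in settings.items()
            if not k.startswith(SCHEDULE_ONLY_PREFIX)}
```

Filtering on a single prefix would also cover _OBSOLETE-style variables if they were migrated to the same convention.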
Next steps: Turn into actionable individual tickets, link here and then resolve.
-> "inc-approve" ends with Exception on Forbidden 403 and then the job succeeds -> could be regression from the retrying approach
It is not a regression; it turned out we do not set the exit status correctly (it is always 0). I proposed a quick fix for this in https://github.com/openSUSE/qem-bot/pull/10
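The idea behind such a fix, in a minimal sketch: count failures and return a non-zero exit status instead of always exiting 0. The actual change in the PR may differ; this only shows the pattern:

```python
# Sketch: propagate step failures to the process exit status, so a job that
# hits e.g. a Forbidden 403 no longer reports success. Names are illustrative.
import sys

def run(steps):
    """Run the given callables; return 1 if any of them raised, else 0."""
    failed = 0
    for step in steps:
        try:
            step()
        except Exception as exc:  # e.g. a Forbidden 403 from the API
            print("ERROR: %s" % exc, file=sys.stderr)
            failed += 1
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(run([]))
```

With this pattern, GitLab CI marks the job red as soon as any step fails, instead of requiring someone to read the logs.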