action #108944
closed
coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release
5 whys follow-up to Missing (re-)schedules of SLE maintenance tests size:M
Added by livdywan over 2 years ago. Updated over 2 years ago.
Description
Motivation
See #108869#note-6
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
Five Whys
- Why...?
- ...
- Why...?
- ...
- Why...?
- ...
- Why...?
- ...
- Why...?
- ...
Updated by livdywan over 2 years ago
- Copied from action #108869: Missing (re-)schedules of SLE maintenance tests size:M added
Updated by livdywan over 2 years ago
- Priority changed from Urgent to High
Setting this to High (not Urgent) since it should be conducted soon while memory is fresh. I went ahead and made it workable based on how we've conducted previous ones.
Updated by okurz over 2 years ago
- Why did it take 3 days for someone to notice?
- Because we were not alerted automatically about the error. The gitlab CI pipeline was actually running and "successful", but it was running with "--dry-run" for testing purposes -> We could monitor the freshness of "incidents updated from smelt", e.g. show it in the dashboard UI; we already have a "last updated" in the dashboard, which can be misleading (see the sketch after this block).
- Because the reviewers of SLE incident tests are not notified automatically about such problems and they only review every couple of days
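A minimal sketch of such a freshness check, which could run as a scheduled CI job and alert when the smelt sync goes stale. The endpoint path and JSON field name are assumptions for illustration, not the real qem-dashboard API:

```python
# Sketch only: the endpoint "/api/last_smelt_sync" and the "last_update"
# field are assumptions, not the real qem-dashboard API.
import sys
from datetime import datetime, timedelta, timezone

import requests

MAX_AGE = timedelta(hours=6)  # assumed acceptable staleness threshold

def check_smelt_freshness(dashboard="https://dashboard.qam.suse.de"):
    resp = requests.get(f"{dashboard}/api/last_smelt_sync", timeout=30)
    resp.raise_for_status()
    stamp = resp.json()["last_update"].replace("Z", "+00:00")
    age = datetime.now(timezone.utc) - datetime.fromisoformat(stamp)
    if age > MAX_AGE:
        print(f"ALERT: smelt sync is stale, last update {age} ago", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(check_smelt_freshness())
```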
- Why did we think this might even be a feature request?
- Expected behavior (vs. actual behavior) was very unclear.
- Why could we not easily pinpoint the source of the problem: smelt, syncing, or openQA scheduling?
- Because the logging of qem-bot is not good enough
- Why don't we have an architecture diagram or an architecture description of the involved components?
- No one created one. Likely not even a QAM architect was involved: coolo created a proof of concept, the current project grew out of it, and everything turned into a mess.
- Why did we not simply read the logs, which said "dry-run" in the first line(s)?
- At first we started to suspect one of the "scheduling" jobs but we needed to look at 5 different gitlab CI jobs to find the "sync-smelt" job
-> link to the gitlab CI pipeline logs from the dashboard (https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules)
-> Now we just know better and feel more secure about looking into the logs :) Can we do a walkthrough to understand the current logs and what is going on?
- Because we were not proficient with the log files and the meanings of their messages
-> conducting a log file walkthrough together with the team
-> https://build.suse.de/project/show/SUSE:Maintenance:12265 is likely obsolete and should be removed; we could point that out to maintenance coordination engineers
-> many log lines with "ERROR" about missing repomd.xml -> turn to INFO
-> log message "WARNING: Missing product in /etc/openqabot" -> "DEBUG: Skipping obsolete openQABot config /etc/openqabot/bot.yml"
-> log message "DEBUG: Incident … does not have x86_64 arch in 12-SP3" -> so what? -> maybe we can simply remove that message and ignore that, or move to TRACE
-> log message "DEBUG: No channels in … for …" -> Can we put some hints to the readers there what it means or what they could check, e.g. is there no valid smelt_channel to openQA product mapping in metadata? Maybe incident is obsolete and should be closed, removed, etc.?
-> log message "NOT SCHEDULE:" -> lowercase and use "not scheduling" -> log message "Project ... can't calculate repohash" -> would be useful to have a timestamp of last update from OBS -> log message for aggregates "Posting ... jobs" is ambiguous or wrong, should be more like "Triggering ... openQA products" or similar, or "openqa isos post calls" -> we found a problem with an exception as openQA API returns with 404 on post isos as a product is missing in openQA. This error is ignored and we continue the job. We should handle that better. -> add a concluding log message after triggering tests, like "Triggering done ... jobs" or so. -> in "inc-approve" there is "ERROR: Job ... not found", how can that happen and what does that mean? -> "inc-approve" ends with Exception on Forbidden 403 and then the job succeeds -> could be regression from the retrying approach -> DONE: https://github.com/openSUSE/qem-bot/pull/10
- Why is it so hard to find out, starting from an openQA job details page, why that job was created?
- The user used for scheduling is always "qa-maintenance-automation".
- Scheduling settings don't contain a URL to e.g. the GitLab pipeline that did the scheduling. -> Helpful in general, but in this specific case we were encountering "missing jobs" rather than "wrongly scheduled jobs"
-> When scheduling openQA tests from bot-ng, add two more settings: the URL pointing to the incident as shown on the dashboard, e.g. https://dashboard.qam.suse.de/incident/23309, and as a second setting the gitlab CI job URL that triggered the scheduling (see the sketch after this list)
-> make the URL clickable on https://openqa.suse.de/admin/productlog?id=887145, the same as on the job settings page
-> To not overcrowd the job details settings page we should invent a special prefix which only applies to scheduling variables, similar to how we already use _OBSOLETE and _ONLY_OBSOLETE_SAME_BUILD, and not forward such variables to the job. This could even help us to streamline the two variables mentioned above. Suggestions:
- __...
- OPENQA_SCHEDULE...
- OPENQA_COMMENT...
- ¯\_(ツ)_/¯
-> Reading the logs we found out that the openQA scheduling settings are provided both to the dashboard and to openQA, so we can just put the useful URLs, e.g. the gitlab job, into both and display them in both the dashboard and openQA
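A minimal sketch of how the bot side could look, assuming the "__" prefix suggestion were adopted. CI_JOB_URL is GitLab CI's predefined variable; the setting names and the filter function are only illustrations of the idea, not an existing convention:

```python
# Sketch only: the "__" prefix and the setting names are proposals from
# this ticket, not an implemented convention in qem-bot or openQA.
import os

def with_schedule_only_settings(settings: dict, incident_id: int) -> dict:
    """Add traceability URLs meant for the scheduled products log and
    the dashboard, but not for the job settings page."""
    out = dict(settings)
    out["__DASHBOARD_URL"] = f"https://dashboard.qam.suse.de/incident/{incident_id}"
    if "CI_JOB_URL" in os.environ:  # set automatically in GitLab CI jobs
        out["__GITLAB_JOB_URL"] = os.environ["CI_JOB_URL"]
    return out

def job_settings(settings: dict) -> dict:
    """What the openQA side would do: drop "__"-prefixed settings
    before forwarding the rest to the individual jobs."""
    return {k: v for k, v in settings.items() if not k.startswith("__")}
```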
Next steps: turn these findings into actionable individual tickets, link them here and then resolve.
Updated by jbaier_cz over 2 years ago
okurz wrote:
-> "inc-approve" ends with Exception on Forbidden 403 and then the job succeeds -> could be regression from the retrying approach
It is not a regression; it turned out we do not set the exit status correctly (it is always 0). I proposed a quick fix for this in https://github.com/openSUSE/qem-bot/pull/10. A minimal sketch of the pattern is below.
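For illustration, a sketch of the fixed control flow, not the actual qem-bot code (see the linked PR for the real change): before the fix, the except branch fell through to an exit status of 0, so the GitLab CI job looked successful despite the 403:

```python
# Sketch only: post_approval() stands in for the real API call that
# ended with "Forbidden 403".
import logging
import sys

log = logging.getLogger("bot.sketch")

def post_approval():
    raise PermissionError("403 Forbidden")  # stand-in for the API call

def approve_incidents() -> int:
    try:
        post_approval()
    except Exception:
        log.exception("inc-approve failed")
        return 1  # the fix: report failure instead of falling through to 0
    return 0

if __name__ == "__main__":
    sys.exit(approve_incidents())
```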
Updated by openqa_review over 2 years ago
- Due date set to 2022-04-16
Setting due date based on mean cycle time of SUSE QE Tools
Updated by osukup over 2 years ago
-> add a concluding log message after triggering tests, like "Triggering done ... jobs" or so.
https://github.com/openSUSE/qem-bot/pull/11
Updated by osukup over 2 years ago
- Related to action #109488: qem-bot - better logging added
Updated by osukup over 2 years ago
- Copied to action #109491: Flow diagram for Maintenance jobs scheduling added
Updated by osukup over 2 years ago
- Copied to action #109512: qem-bot - add vars with GitlabCI job link and qem-dashboard link added
Updated by osukup over 2 years ago
- Status changed from In Progress to Resolved
Follow-up action items created.
Updated by okurz over 2 years ago
- Related to action #109623: Allow adding scheduling settings for informal purposes that are not added to openQA jobs added