coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release
5 whys follow-up to Missing (re-)schedules of SLE maintenance tests size:M
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
- Why did it take 3 days for someone to notice?
- Because we were not alerted automatically about the error. The GitLab CI pipeline was actually running and "successful", but it ran with "--dry-run" for testing purposes -> We could monitor the freshness of "incidents updated from smelt", e.g. show it in the dashboard UI. The dashboard already has a "last updated" timestamp, which can be misleading.
- Because the reviewers of SLE incident tests are not notified automatically about such problems and they only review every couple of days
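A concrete follow-up could be a freshness check that the dashboard (or an alerting job) runs against the timestamp of the last smelt sync. This is a minimal sketch; the 6-hour threshold and the function name are assumptions for illustration, not existing dashboard code:

```python
# Hypothetical freshness check for the "incidents updated from smelt" data;
# MAX_AGE and is_stale are assumed names, not part of any existing component.
import datetime

MAX_AGE = datetime.timedelta(hours=6)  # assumed acceptable staleness

def is_stale(last_updated, now):
    """Return True if the last smelt sync is older than MAX_AGE."""
    return now - last_updated > MAX_AGE
```

Wired into the dashboard or a CI check, this would have flagged the dry-run pipeline within hours instead of days.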
- Why did we think this might even be a feature request?
- Expected behavior (vs. actual behavior) was very unclear.
- Why could we not easily pinpoint what the source of the problem is, smelt, syncing, openQA scheduling?
- Because the logging of qem-bot is not good enough
- Why don't we have an architecture diagram or architecture description of the involved components?
- No one created one. Likely not even a QAM architect was involved: coolo created a proof of concept, the current project grew out of it, and everything turned into a mess.
- Why did we just not read the logs, which said "dry-run" in the first line(s)?
- At first we started to suspect one of the "scheduling" jobs but we needed to look at 5 different gitlab CI jobs to find the "sync-smelt" job
-> link to the gitlab CI pipeline logs from the dashboard (https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules)
-> Now we know better and feel more confident looking into the logs :) Can we do a walkthrough to understand the current logs and what is going on?
- Because we were not proficient with the log files and the meanings of the messages
-> conducting a log file walkthrough together with the team
-> https://build.suse.de/project/show/SUSE:Maintenance:12265 is likely obsolete and should be removed, we could point that out to maintenance coordination engineers
-> many log lines with "ERROR" about missing repomd.xml -> turn to INFO
-> log message "WARNING: Missing product in /etc/openqabot" -> "DEBUG: Skipping obsolete openQABot config /etc/openqabot/bot.yml"
-> log message "DEBUG: Incident … does not have x86_64 arch in 12-SP3" -> so what? -> maybe we can simply remove that message and ignore that, or move to TRACE
-> log message "DEBUG: No channels in … for …" -> Can we put some hints to the readers there what it means or what they could check, e.g. is there no valid smelt_channel to openQA product mapping in metadata? Maybe incident is obsolete and should be closed, removed, etc.?
-> log message "NOT SCHEDULE:" -> lowercase and use "not scheduling"
-> log message "Project ... can't calculate repohash" -> would be useful to have a timestamp of the last update from OBS
-> log message for aggregates "Posting ... jobs" is ambiguous or wrong, should be more like "Triggering ... openQA products" or similar, or "openqa isos post calls"
-> we found a problem with an exception: the openQA API returns 404 on "post isos" because a product is missing in openQA. This error is ignored and the job continues. We should handle that better.
-> add a concluding log message after triggering tests, like "Triggering done ... jobs" or similar
-> in "inc-approve" there is "ERROR: Job ... not found" -> how can that happen and what does it mean?
-> "inc-approve" ends with an exception on Forbidden 403 and then the job still succeeds -> could be a regression from the retrying approach -> DONE: https://github.com/openSUSE/qem-bot/pull/10
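For the silently ignored 404 on "post isos", a minimal sketch of how the bot could classify the response instead of continuing as if nothing happened. The function name and return values are illustrative only, not qem-bot's actual API layer:

```python
# Sketch: map an openQA "isos post" response code to an explicit action,
# instead of swallowing the 404 that indicates a missing product in openQA.
# handle_isos_post is a hypothetical name for illustration.
def handle_isos_post(status_code):
    """Decide how the scheduling job should react to the openQA response."""
    if status_code == 200:
        return "scheduled"
    if status_code == 404:
        # Product missing in openQA: fail visibly rather than continuing.
        return "fail: product missing in openQA"
    return "fail: unexpected status %d" % status_code
```

Anything starting with "fail:" would then set a non-zero exit status so the GitLab CI job turns red.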
- Why is it so hard to find out, starting from an openQA job details page, why that job was created?
- The user used for scheduling is always "qa-maintenance-automation".
- Scheduling settings don't contain a URL to e.g. the GitLab pipeline that did the scheduling. -> helpful in general, but in this specific case we were encountering "missing jobs" rather than "wrongly scheduled jobs"
-> When scheduling openQA tests from bot-ng, add two more settings: the URL pointing to the incident as shown on the dashboard, e.g. https://dashboard.qam.suse.de/incident/23309, and as a second setting the GitLab CI job URL that triggered the scheduling
-> make the URL clickable on https://openqa.suse.de/admin/productlog?id=887145, same as on the job settings page
-> to not overcrowd job details settings pages we should invent a special prefix which only applies to schedule variables, like the _OBSOLETE and _ONLY_OBSOLETE_SAME_BUILD variables we already use, and not forward such variables to the job. This could even help us streamline the two variables mentioned above. Suggestions:
-> Reading the logs we found out that the openQA schedule settings are provided to both the dashboard and openQA, so we can add the useful URLs, e.g. the GitLab job, to both and display them in both the dashboard and openQA
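The schedule-only prefix idea above could look roughly like this. The "__" prefix, the variable names and the helper functions are assumptions to illustrate the mechanism, not an agreed convention in qem-bot or openQA:

```python
# Sketch of a schedule-only settings prefix: such variables stay visible in
# the scheduled-products log but are not forwarded to the openQA jobs.
# All names here are hypothetical.
SCHEDULE_ONLY_PREFIX = "__"

def add_trace_settings(settings, incident_url, ci_job_url):
    """Attach traceability URLs as schedule-only variables."""
    traced = dict(settings)
    traced[SCHEDULE_ONLY_PREFIX + "DASHBOARD_INCIDENT_URL"] = incident_url
    traced[SCHEDULE_ONLY_PREFIX + "GITLAB_CI_JOB_URL"] = ci_job_url
    return traced

def job_settings(settings):
    """Drop schedule-only variables before they reach the job itself."""
    return {k: v for k, v in settings.items()
            if not k.startswith(SCHEDULE_ONLY_PREFIX)}
```

Filtering on a single prefix would also cover _OBSOLETE-style variables if they were migrated to the same convention.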
Next steps: Turn into actionable individual tickets, link here and then resolve.
-> "inc-approve" ends with Exception on Forbidden 403 and then the job succeeds -> could be regression from the retrying approach
It is not a regression; it turned out we do not set the exit status correctly (it is always 0). I proposed a quick fix for this in https://github.com/openSUSE/qem-bot/pull/10
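The idea behind such a fix, in a minimal sketch: count failures and return a non-zero exit status instead of always exiting 0. The actual change in the PR may differ; this only shows the pattern:

```python
# Sketch: propagate step failures to the process exit status, so a job that
# hits e.g. a Forbidden 403 no longer reports success. Names are illustrative.
import sys

def run(steps):
    """Run the given callables; return 1 if any of them raised, else 0."""
    failed = 0
    for step in steps:
        try:
            step()
        except Exception as exc:  # e.g. a Forbidden 403 from the API
            print("ERROR: %s" % exc, file=sys.stderr)
            failed += 1
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(run([]))
```

With this pattern, GitLab CI marks the job red as soon as any step fails, instead of requiring someone to read the logs.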