action #108944

coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

5 whys follow-up to Missing (re-)schedules of SLE maintenance tests size:M

Added by cdywan 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2022-03-24
Due date:
% Done: 0%
Estimated time:
Description

Motivation

See #108869#note-6

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Five Whys

  1. Why...?
    • ...
  2. Why...?
    • ...
  3. Why...?
    • ...
  4. Why...?
    • ...
  5. Why...?
    • ...

Related issues

Related to QA - action #109488: qem-bot - better logging (Resolved, 2022-04-05)

Related to QA - action #109623: Allow adding scheduling settings for informal purposes that are not added to openQA jobs (Resolved, 2022-03-24)

Copied from QA - action #108869: Missing (re-)schedules of SLE maintenance tests size:M (Resolved, 2022-03-24)

Copied to QA - action #109491: Flow diagram for Maintenance jobs scheduling (New, 2022-03-24)

Copied to QA - action #109512: qem-bot - add vars with GitlabCI job link and qem-dashboard link (Resolved)

History

#1 Updated by cdywan 3 months ago

  • Copied from action #108869: Missing (re-)schedules of SLE maintenance tests size:M added

#2 Updated by cdywan 3 months ago

  • Priority changed from Urgent to High

Setting this to High (not Urgent): it should be conducted soon while memory is fresh. I went ahead and made it workable based on how we've conducted previous ones.

#3 Updated by osukup 3 months ago

  • Assignee set to osukup

#4 Updated by okurz 3 months ago

  • Why did it take 3 days for someone to notice?
    • Because we have not been alerted automatically about an error. The GitLab CI pipeline was actually running and "successful", but it was running with "--dry-run" for testing purposes. -> We could monitor the freshness of "incidents updated from smelt", e.g. show it in the dashboard UI. We already have a "last updated" in the dashboard, which can be misleading.
    • Because the reviewers of SLE incident tests are not notified automatically about such problems and they only review every couple of days.
  • Why did we think this might even be a feature request?
    • Expected behavior (vs. actual behavior) was very unclear.
  • Why could we not easily pinpoint the source of the problem: smelt, syncing, or openQA scheduling?
    • Because the logging of qem-bot is not good enough.
  • Why don't we have an architecture diagram or architecture description of the involved components?
    • No one did that. Likely not even a QAM architect was involved. coolo created a proof of concept, then the current project was generated from it and everything turned into a mess.
  • Why did we just not read the logs which said "dry-run" in the first line(s)?

    • At first we started to suspect one of the "scheduling" jobs but we needed to look at 5 different gitlab CI jobs to find the "sync-smelt" job

    -> link to the gitlab CI pipeline logs from the dashboard (https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules)

    -> Now we just know better and feel more secure about looking into the logs :) Can we do a walkthrough to understand the current logs and what is going on?

    • Because we were not proficient with the log files and the meaning of the messages
      -> conduct a log file walkthrough together with the team

      -> https://build.suse.de/project/show/SUSE:Maintenance:12265 is likely obsolete and should be removed; we could point that out to the maintenance coordination engineers
      

    -> many log lines with "ERROR" about missing repomd.xml -> turn to INFO

    -> log message "WARNING: Missing product in /etc/openqabot" -> "DEBUG: Skipping obsolete openQABot config /etc/openqabot/bot.yml"

    -> log message "DEBUG: Incident … does not have x86_64 arch in 12-SP3" -> so what? -> maybe we can simply remove that message and ignore that, or move to TRACE

    -> log message "DEBUG: No channels in … for …" -> Can we put some hints to the readers there what it means or what they could check, e.g. is there no valid smelt_channel to openQA product mapping in metadata? Maybe incident is obsolete and should be closed, removed, etc.?

    -> log message "NOT SCHEDULE:" -> lowercase and use "not scheduling"
    
    -> log message "Project ... can't calculate repohash" -> would be useful to have a timestamp of last update from OBS
    
    -> log message for aggregates "Posting ... jobs" is ambiguous or wrong, should be more like "Triggering ... openQA products" or similar, or "openqa isos post calls"
    
    -> We found a problem: the openQA API returns 404 on "isos post" when a product is missing in openQA, which raises an exception. This error is ignored and the job continues. We should handle that better.
    
    -> add a concluding log message after triggering tests, like "Triggering done ... jobs" or so.
    
    -> in "inc-approve" there is "ERROR: Job ... not found". How can that happen and what does it mean?
    
    -> "inc-approve" ends with Exception on Forbidden 403 and then the job succeeds -> could be regression from the retrying approach -> DONE: https://github.com/openSUSE/qem-bot/pull/10
    
  • Why is it so hard to find out starting from an openQA job details page why that job was created?

    • The user used for scheduling is always "qa-maintenance-automation".
    • Scheduling settings don't contain a URL to e.g. the GitLab pipeline that did the scheduling. -> helpful in general, but in this specific case we were encountering "missing jobs" rather than "wrongly scheduled jobs"

      -> When scheduling openQA tests from bot-ng, add two more settings: first, the URL pointing to the incident as shown on the dashboard, e.g. https://dashboard.qam.suse.de/incident/23309 ; second, the URL of the GitLab CI job that triggered the scheduling

    -> make URL clickable on https://openqa.suse.de/admin/productlog?id=887145 same as for the job settings page

    -> To not overcrowd the job details settings page, we should invent a special prefix which only applies to schedule variables, like we already use _OBSOLETE and _ONLY_OBSOLETE_SAME_BUILD, and not forward such variables to the job. This could even help us to streamline the two variables mentioned above. Suggestions:

    • __...
    • OPENQA_SCHEDULE...
    • OPENQA_COMMENT...
    • ¯_(ツ)_/¯

    -> Reading the logs, we found out that the openQA schedule settings are provided to the dashboard as well as to openQA, so we can just put the useful URLs, e.g. the GitLab job, into both and display them in both the dashboard and openQA
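
The prefix idea above could look roughly like the following sketch. It is only an illustration, not how openQA or qem-bot actually handle settings: the function name split_schedule_settings is hypothetical, the "__" prefix is simply the first suggestion from the list, and the example setting names and the GitLab URL are made up.

```python
from typing import Dict, Tuple

# Hypothetical prefix marking settings that are informational only
# (first suggestion from the list above, not a confirmed convention).
SCHEDULE_ONLY_PREFIX = "__"


def split_schedule_settings(
    settings: Dict[str, str], prefix: str = SCHEDULE_ONLY_PREFIX
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split settings into (forwarded to jobs, kept only with the schedule request)."""
    job_settings: Dict[str, str] = {}
    schedule_only: Dict[str, str] = {}
    for key, value in settings.items():
        if key.startswith(prefix):
            schedule_only[key] = value
        else:
            job_settings[key] = value
    return job_settings, schedule_only


# Setting names and the GitLab URL below are placeholders for illustration.
example = {
    "DISTRI": "sle",
    "__DASHBOARD_INCIDENT_URL": "https://dashboard.qam.suse.de/incident/23309",
    "__CI_JOB_URL": "https://gitlab.example/job/1",
}
job_settings, schedule_only = split_schedule_settings(example)
```

With such a split, the informational URLs stay visible where the schedule request is logged, while the job settings page only shows what the test actually uses.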

Next steps: Turn into actionable individual tickets, link here and then resolve.

#5 Updated by jbaier_cz 3 months ago

okurz wrote:

-> "inc-approve" ends with Exception on Forbidden 403 and then the job succeeds -> could be regression from the retrying approach

It is not a regression; it turned out we do not set the exit status correctly (it is always 0). I proposed a quick fix for this in https://github.com/openSUSE/qem-bot/pull/10
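
For readers unfamiliar with the failure mode, here is a minimal sketch of how an exception can be turned into a non-zero exit status so that the CI job fails. This is not the content of the pull request above; run_command and approve_forbidden are hypothetical names used only for illustration.

```python
import logging

log = logging.getLogger("bot")


def run_command(command) -> int:
    """Run a callable and translate an exception into a non-zero exit status."""
    try:
        command()
    except Exception as error:  # broad on purpose: any failure should fail the CI job
        log.error("Command failed: %s", error)
        return 1
    return 0


def approve_forbidden():
    # Stand-in for an approval request that is answered with 403 Forbidden
    raise PermissionError("403 Forbidden")


# A CLI entry point would pass the result on to the shell, e.g.:
#     sys.exit(run_command(main))
status = run_command(approve_forbidden)
```

The important part is that the status returned here actually reaches sys.exit; otherwise the process ends with 0 and GitLab CI reports the job as successful.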

#6 Updated by okurz 3 months ago

  • Status changed from Workable to In Progress

#7 Updated by openqa_review 3 months ago

  • Due date set to 2022-04-16

Setting due date based on mean cycle time of SUSE QE Tools

#8 Updated by osukup 3 months ago

-> add a concluding log message after triggering tests, like "Triggering done ... jobs" or so.
https://github.com/openSUSE/qem-bot/pull/11
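
A rough sketch of what a concluding message could look like, combined with the idea from note #4 of not silently ignoring a failed post. This does not reproduce the pull request above; trigger_all and fake_post are hypothetical, and the real openQA call is replaced by a stand-in.

```python
import logging
from typing import Callable, Dict, Iterable, List

log = logging.getLogger("bot.trigger")


def trigger_all(
    products: Iterable[Dict[str, str]],
    post: Callable[[Dict[str, str]], None],
) -> List[Dict[str, str]]:
    """Call post() for each product, collect failures and log a concluding summary."""
    failed: List[Dict[str, str]] = []
    total = 0
    for settings in products:
        total += 1
        try:
            post(settings)
        except Exception as error:
            # Record the failure instead of ignoring it, then continue with the rest
            log.error("Triggering failed for %s: %s", settings, error)
            failed.append(settings)
    log.info(
        "Triggering done: %d of %d products posted, %d failed",
        total - len(failed), total, len(failed),
    )
    return failed


def fake_post(settings: Dict[str, str]) -> None:
    # Stand-in for the openQA "isos post" call; raises as if openQA answered 404
    if settings.get("FLAVOR") == "Missing":
        raise LookupError("404 Not Found")


failed = trigger_all([{"FLAVOR": "Server-DVD"}, {"FLAVOR": "Missing"}], fake_post)
```

The caller can then decide to exit with a non-zero status when the returned list is not empty.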

#9 Updated by osukup 3 months ago

#10 Updated by osukup 3 months ago

  • Copied to action #109491: Flow diagram for Maintenance jobs scheduling added

#11 Updated by osukup 3 months ago

  • Copied to action #109512: qem-bot - add vars with GitlabCI job link and qem-dashboard link added

#13 Updated by osukup 3 months ago

  • Status changed from In Progress to Resolved

Followup action items created.

#14 Updated by mkittler 3 months ago

Yes, looks like all ideas were covered.

#15 Updated by okurz 3 months ago

  • Parent task set to #91646

#16 Updated by okurz 3 months ago

  • Related to action #109623: Allow adding scheduling settings for informal purposes that are not added to openQA jobs added

#17 Updated by okurz 2 months ago

  • Due date deleted (2022-04-16)
