action #136007
closed
Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S
Added by okurz 11 months ago. Updated 10 months ago.
% Done: 0%
Description
Observation
See #134282
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
Updated by okurz 11 months ago
- Copied from action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Updated by livdywan 10 months ago
- Subject changed from Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP to Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S
- Status changed from New to Feedback
I'll grab it and fill in the template as well as check that we have it in the wiki
Updated by tinita 10 months ago
- Status changed from Feedback to In Progress
- Assignee set to tinita
1. Why was it reported?
- 2023-08-15 "There is something wrong with multimachine network when tests are run across different workers. If is multimachine job forced to run on same worker, it is fine."
- Because emiura was doing SLE maintenance test review, we should have known that SLE maintenance updates were blocked until mitigations were applied.
- ACTION Add "Impact" to the ticket template, maybe with an example, e.g. "SLE Maintenance updates are blocked" -> #137837
2. Why was the ticket neglected for so long?
- Ticket became Urgent on 2023-08-16
- Was picked up by Liv on 2023-08-17
- Blocked on SD ticket 2023-08-17
- Back and forth in SD ticket about which team is responsible until 2023-08-22
- Ticket went from Blocked to Feedback on 2023-08-22 and was investigated
- We should avoid relying on SD tickets for urgency mitigation.
- Out-of-band conversations, e.g. in Slack or Jitsi, were not reflected in comments.
- We don't always discuss blocked tickets.
- ACTION Urgent/Immediate tickets can only be in new/workable/progress/resolved - this needs to be mentioned in the wiki and also reflected in the backlog status -> #137825
- #135647 would have helped with that. The ticket would have been flagged in the backlog status.
- ACTION Write email to our Slack or o3-admins from backlogger if one of the queries is red -> #137828
3. Why did nobody work with the assignee to ensure we meet the SLO?
- Did test developers look into the issues?
- Was it really important? Feedback on open questions was spotty.
- ACTION Reduce limit on feedback tickets to 10 -> #137831
- Do we know whether mitigations were applied? The comments do not state that clearly.
- There was disagreement on approaches to mitigation.
4. After the SD ticket discussion, why did it still take so long?
- Investigation was done on the day the SD ticket was closed
- ACTION Introduce a rule or guideline how to communicate clearly who is on the next steps -> #137834
5. Why did no one try to create a reproducer early?
- Nobody asked the testers for a reproducer
- Most people just assumed "somebody else" would do that
- We did not ensure there was a reproducer retroactively
- ACTION Come up with a ticket template extension that strongly encourages reproducers and Impact be included -> #137837
- ACTION Implement automation to ensure templates are used -> #137840
  - Extend our due-date automation or another script to comment on tickets not following the ticket template and suggest using the "report issue" button from openQA
  - Research if we can provide templates via a RedMine plugin https://www.redmine.org/plugins/redmine_issue_templates
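For the template automation (#137840), the check itself could be as small as the sketch below. It only covers the detection step: fetching tickets and posting the comment are left out. The section names in `REQUIRED_SECTIONS` and the function name `missing_sections` are hypothetical placeholders, not the agreed template, and the accepted heading styles (Markdown `##` and Textile `h2.`) are an assumption about how tickets are written.

```python
import re

# Hypothetical section names; the real ticket template may use different ones.
REQUIRED_SECTIONS = ("Observation", "Impact", "Steps to reproduce")


def missing_sections(description, required=REQUIRED_SECTIONS):
    """Return the required section headings that do not appear in the description."""
    found = set()
    for line in description.splitlines():
        # Strip a leading Markdown ("## ") or Textile ("h2. ") heading marker, if any,
        # so that a bare "Observation" line is accepted as well.
        heading = re.sub(r"^(#+|h\d\.)\s*", "", line.strip()).lower()
        found.add(heading)
    return [section for section in required if section.lower() not in found]


if __name__ == "__main__":
    ticket = "## Observation\nSee #134282\n## Acceptance criteria\n- AC1"
    print(missing_sections(ticket))
```

A script such as the due-date automation could call this per ticket and only comment when the returned list is non-empty, naming the missing sections and pointing to the "report issue" button from openQA.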
Updated by tinita 10 months ago
- Copied to action #137825: Urgent/Immediate tickets can only be in new/workable/progress/resolved - this needs to be mentioned in the wiki and also reflected in the backlog status added
Updated by tinita 10 months ago
- Copied to action #137828: [spike solution][timeboxed:10h] Notification if one of the queries on https://os-autoinst.github.io/qa-tools-backlog-assistant/ is red, e.g. write email to our Slack or o3-admins from backlogger size:S added
Updated by tinita 10 months ago
- Copied to action #137831: Reduce limit on feedback tickets to 10 added
Updated by tinita 10 months ago
- Copied to action #137834: Introduce a rule or guideline how to communicate clearly who is on the next steps size:S added
Updated by tinita 10 months ago
- Copied to action #137837: [spike solution][timeboxed:10h] Come up with a ticket template extension that strongly encourages reproducers and impact be included size:S added
Updated by tinita 10 months ago
- Copied to action #137840: Implement automation to ensure templates are used added