action #136007
closed
Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S
Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
See #134282
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
Updated by okurz about 1 year ago
- Copied from action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Updated by okurz about 1 year ago
- Target version changed from Tools - Next to Ready
Updated by livdywan about 1 year ago
- Subject changed from Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP to Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S
- Status changed from New to Feedback
I'll grab it and fill in the template as well as check that we have it in the wiki
Updated by tinita about 1 year ago
- Status changed from Feedback to In Progress
- Assignee set to tinita
1. Why was it reported?
- 2023-08-15 "There is something wrong with multimachine network when tests are run across different workers. If is multimachine job forced to run on same worker, it is fine."
- emiura was doing SLE maintenance test review, so we should have known that SLE maintenance updates were blocked until mitigations were applied
- ACTION Add "Impact" to the ticket template, maybe with an example, e.g. "SLE Maintenance updates are blocked" -> #137837
2. Why was the ticket neglected for so long?
- Ticket became Urgent on 2023-08-16
- Was picked up by Liv on 2023-08-17
- Blocked on SD ticket 2023-08-17
- Back and forth in SD ticket about which team is responsible until 2023-08-22
- Ticket went from Blocked to Feedback on 2023-08-22 and was investigated
- We should avoid relying on SD tickets for urgency mitigation.
- Out of band conversations e.g. in Slack or Jitsi not reflected in comments.
- We don't always discuss blocked tickets.
- ACTION Urgent/Immediate tickets can only be in new/workable/progress/resolved - this needs to be mentioned in the wiki and also reflected in the backlog status -> #137825
- #135647 would have helped with that. The ticket would have been flagged in the backlog status.
- ACTION Write email to our Slack or o3-admins from backlogger if one of the queries is red -> #137828
3. Why did nobody work with the assignee to ensure we meet the SLO?
- Did test developers look into the issues?
- Was it really important? Feedback on open questions was spotty.
- ACTION Reduce limit on feedback tickets to 10 -> #137831
- Do we know whether mitigations were applied? The comments do not state that clearly.
- There was disagreement on approaches to mitigation.
4. After the SD ticket discussion, why did it still take so long?
- Investigation was done on the day the SD ticket was closed
- ACTION Introduce a rule or guideline how to communicate clearly who is on the next steps -> #137834
5. Why did no one try to create a reproducer early?
- Nobody asked the testers for a reproducer
- Most people just assumed "somebody else" would do that
- We did not ensure there was a reproducer retroactively
- ACTION Come up with a ticket template extension that strongly encourages including reproducers and impact -> #137837
- ACTION Implement automation to ensure templates are used -> #137840
- Extend our due-date automation or another script to comment on tickets not following the ticket template, suggesting use of the "report issue" button from openQA
- Research if we can provide templates via a Redmine plugin https://www.redmine.org/plugins/redmine_issue_templates
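Two of the ACTION items above (the status rule from #137825 and the template check from #137840) could be sketched as plain checks like the following. This is only an illustration, not the automation the team actually implemented: the function names are hypothetical, the section headings are taken from this ticket's description plus the proposed "Impact" section, and the status names are assumed spellings of "new/workable/progress/resolved".

```python
import re

# Assumption: headings as used in this ticket's description, plus the "Impact"
# section proposed in #137837. The real ticket template may differ.
REQUIRED_SECTIONS = ("Observation", "Impact", "Acceptance criteria", "Suggestions")

# Assumption: tracker status and priority names, compared in lowercase.
ALLOWED_URGENT_STATUSES = {"new", "workable", "in progress", "resolved"}
URGENT_PRIORITIES = {"urgent", "immediate"}


def missing_template_sections(description, required=REQUIRED_SECTIONS):
    """Return the required section headings that do not appear in description."""
    found = set()
    for line in description.splitlines():
        # Accept plain headings as well as Markdown ("## X") or Textile ("h2. X").
        heading = re.sub(r"^(#+\s*|h\d\.\s*)", "", line.strip()).lower()
        found.add(heading)
    return [name for name in required if name.lower() not in found]


def violates_urgent_status_rule(priority, status):
    """Return True if an Urgent/Immediate ticket is in a status outside the allowed set."""
    if priority.strip().lower() not in URGENT_PRIORITIES:
        return False
    return status.strip().lower() not in ALLOWED_URGENT_STATUSES
```

With these assumptions, a ticket that is Urgent and Blocked, as described under question 2, would be flagged by `violates_urgent_status_rule("Urgent", "Blocked")`, and a description lacking an "Impact" heading would be reported by `missing_template_sections`. A script could then post a comment on such tickets, which is left out here.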
Updated by tinita about 1 year ago
- Copied to action #137825: Urgent/Immediate tickets can only be in new/workable/progress/resolved - this needs to be mentioned in the wiki and also reflected in the backlog status added
Updated by tinita about 1 year ago
- Copied to action #137828: [spike solution][timeboxed:10h] Notification if one of the queries on https://os-autoinst.github.io/qa-tools-backlog-assistant/ is red, e.g. write email to our Slack or o3-admins from backlogger size:S added
Updated by tinita about 1 year ago
- Copied to action #137831: Reduce limit on feedback tickets to 10 added
Updated by tinita about 1 year ago
- Copied to action #137834: Introduce a rule or guideline how to communicate clearly who is on the next steps size:S added
Updated by tinita about 1 year ago
- Copied to action #137837: [spike solution][timeboxed:10h] Come up with a ticket template extension that strongly encourages reproducers and impact be included size:S added
Updated by tinita about 1 year ago
- Copied to action #137840: Implement automation to ensure templates are used added
Updated by tinita about 1 year ago
- Status changed from In Progress to Resolved
Follow-up tickets created