action #136007

closed

Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S

Added by okurz 8 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

See #134282

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Related issues 7 (1 open, 6 closed)

  • Copied from openQA Infrastructure - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry (Resolved, nicksinger, 2023-08-15)
  • Copied to openQA Project - action #137825: Urgent/Immediate tickets can only be in new/workable/progress/resolved - this needs to be mentioned in the wiki and also reflected in the backlog status (Resolved, okurz, 2023-10-12)
  • Copied to openQA Project - action #137828: [spike solution][timeboxed:10h] Notification if one of the queries on https://os-autoinst.github.io/qa-tools-backlog-assistant/ is red, e.g. write email to our Slack or o3-admins from backlogger size:S (Resolved, jbaier_cz, 2023-10-12, 2024-01-20)
  • Copied to openQA Project - action #137831: Reduce limit on feedback tickets to 10 (Resolved, okurz, 2023-10-12)
  • Copied to openQA Project - action #137834: Introduce a rule or guideline how to communicate clearly who is on the next steps size:S (Resolved, livdywan, 2023-10-12)
  • Copied to openQA Project - action #137837: [spike solution][timeboxed:10h] Come up with a ticket template extension that strongly encourages reproducers and impact be included size:S (Resolved, livdywan, 2023-10-12, 2024-02-21)
  • Copied to openQA Project - action #137840: Implement automation to ensure templates are used (New, 2023-10-12)

Actions #1

Updated by okurz 8 months ago

  • Copied from action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Actions #2

Updated by okurz 7 months ago

  • Target version changed from Tools - Next to Ready
Actions #3

Updated by livdywan 7 months ago

  • Subject changed from Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP to Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S
  • Status changed from New to Feedback

I'll grab it and fill in the template as well as check that we have it in the wiki

Actions #4

Updated by okurz 7 months ago

  • Description updated (diff)
Actions #5

Updated by okurz 7 months ago

  • Tags changed from infra to infra, mob
Actions #6

Updated by tinita 7 months ago

  • Status changed from Feedback to In Progress
  • Assignee set to tinita

  1. Why was it reported?
    • 2023-08-15 "There is something wrong with multimachine network when tests are run across different workers. If is multimachine job forced to run on same worker, it is fine."
    • Because emiura was doing SLE maintenance test review, so we should have known that SLE maintenance updates were blocked until mitigations were applied.
    • ACTION Add "Impact" to the ticket template, maybe with an example, e.g. SLE Maintenance updates are blocked -> #137837

  2. Why was the ticket neglected for so long?
    • Ticket became Urgent on 2023-08-16
    • Was picked up by Liv on 2023-08-17
    • Blocked on SD ticket on 2023-08-17
    • Back and forth in the SD ticket about which team is responsible until 2023-08-22
    • Ticket went from Blocked to Feedback on 2023-08-22 and was investigated
    • We should avoid relying on SD tickets for urgency mitigation.
    • Out-of-band conversations, e.g. in Slack or Jitsi, were not reflected in comments.
    • We don't always discuss blocked tickets.
      • ACTION Urgent/Immediate tickets can only be in new/workable/progress/resolved - this needs to be mentioned in the wiki and also reflected in the backlog status -> #137825
      • #135647 would have helped with that. The ticket would have been flagged in the backlog status.
      • ACTION Write email to our Slack or o3-admins from backlogger if one of the queries is red -> #137828

  3. Why did nobody work with the assignee to ensure we meet the SLO?
    • Did test developers look into the issues?
    • Was it really important? Feedback on open questions was spotty.
    • ACTION Reduce limit on feedback tickets to 10 -> #137831
    • Do we know whether mitigations have been applied? The comments do not state that clearly.
    • There was disagreement on approaches to mitigation.

  4. After the SD ticket discussion, why did it still take so long?
    • Investigation was done on the day the SD ticket was closed
    • ACTION Introduce a rule or guideline how to communicate clearly who is on the next steps -> #137834

  5. Why did no one try to create a reproducer early?
    • Nobody asked the testers for a reproducer
    • Most people just assumed "somebody else" would do that
    • We did not ensure there was a reproducer retroactively
    • ACTION Come up with a ticket template extension that strongly encourages reproducers and impact be included -> #137837
    • ACTION Implement automation to ensure templates are used -> #137840
Actions #7

Updated by tinita 7 months ago

  • Copied to action #137825: Urgent/Immediate tickets can only be in new/workable/progress/resolved - this needs to be mentioned in the wiki and also reflected in the backlog status added
Actions #8

Updated by tinita 7 months ago

  • Copied to action #137828: [spike solution][timeboxed:10h] Notification if one of the queries on https://os-autoinst.github.io/qa-tools-backlog-assistant/ is red, e.g. write email to our Slack or o3-admins from backlogger size:S added
Actions #9

Updated by tinita 7 months ago

  • Copied to action #137831: Reduce limit on feedback tickets to 10 added
Actions #10

Updated by tinita 7 months ago

  • Copied to action #137834: Introduce a rule or guideline how to communicate clearly who is on the next steps size:S added
Actions #11

Updated by tinita 7 months ago

  • Copied to action #137837: [spike solution][timeboxed:10h] Come up with a ticket template extension that strongly encourages reproducers and impact be included size:S added
Actions #12

Updated by tinita 7 months ago

  • Copied to action #137840: Implement automation to ensure templates are used added
Actions #13

Updated by tinita 7 months ago

  • Status changed from In Progress to Resolved

Follow-up tickets created.
