action #106538

closed

coordination #102882: [epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service

lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:S

Added by okurz about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-02-10
Due date:
% Done: 0%
Estimated time:

Description

Acceptance criteria

  • AC1: A "five whys" session has been conducted and the results recorded in this ticket
  • AC2: Actionable follow-up tasks have been defined in new tickets

Suggestions

  • Ask the team to join a "five whys" session, e.g. in the "extension on-demand" meeting on Thursday morning
  • Record results in the ticket
  • Define follow-up tasks in new tickets
Actions #1

Updated by livdywan about 2 years ago

  • Subject changed from lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" to lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by okurz about 2 years ago

  • Assignee set to livdywan
  • Priority changed from Normal to High

If we don't do this soon, everybody will forget what happened :) I suggest using tomorrow's extensions meeting slot at 0900Z to cover this. Assigning to @cdywan as discussed in the daily.

Actions #3

Updated by livdywan about 2 years ago

  • Status changed from Workable to In Progress

I sent out invitation emails and a message in the team Slack for tomorrow at 10:00 CET

Actions #4

Updated by openqa_review about 2 years ago

  • Due date set to 2022-03-03

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz about 2 years ago

  • Assignee changed from livdywan to okurz

We had the meeting on 2022-02-17 and conducted a "Five Whys" analysis. We collected the results on https://etherpad.opensuse.org/p/suse_qe_tools and they are as follows:

  1. Why did it take so long for infra to perform a firmware upgrade?

  2. Why can EngInfra state that they cannot help with the issue when it concerns a rack in SRV2 that members of SUSE QE Tools cannot access without EngInfra?

  3. Why was there no follow-up in https://sd.suse.com/servicedesk/customer/portal/1/SD-67703 between 2021-11-24 11:18 and 2021-12-01 17:13 (that's 7 days, from one Wednesday to the next)?

    • Because we tried to work on this on our own. nsinger gained access, but because of the COVID-19 pandemic and remote work, physical access had to be planned and did not happen immediately
  4. Why did nothing happen in the progress ticket for 2 weeks after 2021-12-01?

    • Information was mostly flowing in the EngInfra ticket. But a "High" ticket in our backlog that is assigned to someone and actively being worked on should receive more frequent updates
    • -> Ensure more frequent updates in tickets that are being worked on in our backlog. We need a daily sync; in the meantime we are trying this with the "moderation duty". The only alternative is a mandatory daily meeting
  5. Why did people from EngInfra repeatedly only "plug cables" instead of looking into the clear problem hypothesis we made repeatedly about "unintended traffic" and "broadcast traffic"?

    • Maybe the relevant network-related technical competence is missing within EngInfra. It seems we made the hypothesis pretty clear
    • -> Raise that with the department directors. Do we need to learn about maintaining Cisco switches?
    • -> Move more to other datacenters or the cloud? => We already have corresponding tickets on our backlog. This just gives us additional motivation to work on those

I will follow up on the remaining tasks, e.g. coordinating with managers regarding responsibility

EDIT: Asked mgriessmeier about the open points

Actions #6

Updated by okurz about 2 years ago

  • Status changed from In Progress to Feedback

Meeting with mgriessmeier scheduled for tomorrow

Actions #7

Updated by okurz about 2 years ago

Why did EngInfra not have credentials for the switches?

It seems this was missed during the handover that should have happened when the current SLA went into effect.

https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=992 states that "qa-infra@suse.de" is the contact person. We can't find that mailing list, if it is one. Who is that? Is there anyone on that mailing list?

The mailing list is obsolete and no longer exists. It formerly consisted of admins within "QA SLE". Nowadays the official responsibility is with qa-team@suse.de . De facto, the current SLA is unclear regarding network responsibility: it clearly states that EngInfra is responsible, but in practice they cannot currently fulfill that.

SRV1 is considered production infrastructure and is clearly maintained by SUSE IT EngInfra. Only a limited number of people have access, fewer than 10. SRV2 is considered a "testing lab" by some and production grade by others. https://racktables.nue.suse.com/index.php?page=location&tab=edit&location_id=5 does not even list a contact person (unlike SRV1). Many people (maybe around 60-80 according to mgriessmeier) have access, so it is certainly less structured. We can still consider SUSE IT EngInfra the main contact, but that depends, mostly on the rack.

The switches within the racks are "QA" switches, so we should have access. The SLA says EngInfra has responsibility.
-> 1. TODO: Make sure that we know all those switches and have them marked properly => DONE: as decided with mgriessmeier, okurz updated all entries of https://racktables.nue.suse.com/index.php?page=search&last_page=object&last_tab=default&q=qa-infra%40suse.de (12 as of 2022-02-18) to point to "qa-team@suse.de" instead of the obsolete "qa-infra@suse.de"
-> 2. TODO: SUSE QE Tools team must learn about switch administration and get access => #107083
-> 3. TODO: Ask for volunteers in SUSE QE Tools who would be able to visit the Nbg server rooms, e.g. as a second person accompanying nsinger or any potential new admin => #107086
-> 4. TODO: Make SUSE QE Tools team aware that we need to support EngInfra due to limited capacity => #107089

EDIT: I pinged @channel in https://suse.slack.com/archives/C02AJ1E568M/p1645197538775129 with a copy of this message

Actions #8

Updated by okurz about 2 years ago

  • Due date deleted (2022-03-03)

I pinged @channel in https://suse.slack.com/archives/C02AJ1E568M/p1645197538775129 with a copy of the last message. All necessary points are either already resolved or covered in specific tickets.

Actions #9

Updated by okurz about 2 years ago

  • Status changed from Feedback to Resolved

I added the remaining information regarding server rooms and network infrastructure on https://confluence.suse.com/pages/diffpagesbyversion.action?pageId=192544831&selectedPageVersions=82&selectedPageVersions=81 . Now we are done here.
