action #106538: lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #106538

closed

coordination #102882: [epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service

lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:S

Added by okurz over 3 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Target version:

openQA Project (public) - Ready

Start date:

2022-02-10

Due date:

% Done:

Estimated time:

Description

Acceptance criteria

AC1: A "Five Whys" session has been conducted and the results are recorded in this ticket
AC2: Actionable follow-up tasks have been defined in new tickets

Suggestions

Ask the team to join a "Five Whys" session, e.g. in the "extension on-demand" meeting on Thursday morning
Record the results in the ticket
Define follow-up tasks in new tickets

Updated by livdywan over 3 years ago

Subject changed from lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" to lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:S
Description updated (diff)
Status changed from New to Workable

Updated by okurz over 3 years ago

Assignee set to livdywan
Priority changed from Normal to High

If we don't do this soon everybody will lose their memory of what happened :) I suggest using tomorrow's extensions meeting slot at 0900Z to cover this. Assigning to @cdywan as discussed in the daily.

Updated by livdywan over 3 years ago

Status changed from Workable to In Progress

I sent out invitation emails and a message in the team Slack for tomorrow at 10 CET.

Updated by openqa_review over 3 years ago

Due date set to 2022-03-03

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz over 3 years ago

Assignee changed from livdywan to okurz

We held the meeting on 2022-02-17 and conducted a "Five Whys" analysis. We collected the results on https://etherpad.opensuse.org/p/suse_qe_tools and they are as follows:

Why did it take so long for infra to perform a firmware upgrade?

Because they did not take the issue seriously enough
mdoucha reported ridiculous data loss
marius ran iperf and reported high bandwidth
nick pointed to the difference between upload and download
we discussed performance data and the broken MAC address table
-> Make sure that the severity and impact are obvious when we create EngInfra tickets => DONE: https://progress.opensuse.org/projects/qa/wiki/Tools/diff?utf8=%E2%9C%93&version=26&version_from=25&commit=View+differences

Why can EngInfra state that they cannot help with the issue when it concerns a rack in SRV2 to which members of SUSE QE Tools have no access without EngInfra?

Who else could have access? Obviously we assumed that we don't
-> Crosscheck with relevant managers, e.g. the department lead, who can drive such issues and who has access
-> How should we be able to work on such issues if we don't have access?
-> Do we need more people with access? nsinger had access (https://progress.opensuse.org/issues/102882#note-19)
-> Why did EngInfra not have credentials for the switches? https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=992 states that "qa-infra@suse.de" is the contact person. We can't find that mailing list, if it is one. Who is that? Is there anyone on that mailing list?

Why was there no follow-up in https://sd.suse.com/servicedesk/customer/portal/1/SD-67703 between 2021-11-24 11:18 and 2021-12-01 17:13 (that's 7 days, from one Wednesday to the next)?

Because we tried to work on this on our own. nsinger gained access, but due to the Covid pandemic and remote work, physical access had to be planned and did not happen immediately
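
The interval quoted in the question can be double-checked with a few lines of Python. This is only a sketch: the two timestamps are copied from the question above, it assumes both are given in the same time zone, and the variable names are purely illustrative.

```python
from datetime import datetime

# Timestamps as quoted in the question (assumed to share one time zone).
last_update = datetime(2021, 11, 24, 11, 18)
next_update = datetime(2021, 12, 1, 17, 13)

# Subtracting two datetimes yields a timedelta.
gap = next_update - last_update

# weekday() counts from Monday == 0, so Wednesday == 2.
both_wednesday = last_update.weekday() == 2 and next_update.weekday() == 2

print(f"gap: {gap.days} days, {gap.seconds // 3600} h {gap.seconds % 3600 // 60} min")
print(f"both timestamps fall on a Wednesday: {both_wednesday}")
```

Under these assumptions the gap is slightly more than seven full days, which matches the "Wednesday to next Wednesday" remark.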

Why did nothing happen in the progress ticket for 2 weeks after 2021-12-01?

Information was mostly flowing in the EngInfra ticket. But for a "High" ticket in our backlog that is assigned to someone and actively being worked on, there should be more frequent updates
-> Ensure more frequent updates in tickets that are being worked on in our backlog. We need a daily sync; in the meantime we are trying this with the "moderation duty". The only alternative is a mandatory daily meeting

Why did people from EngInfra repeatedly only "plug cables" instead of looking into the clear problem hypothesis we made repeatedly about "unintended traffic" and "broadcast traffic"?

Maybe the relevant network-related technical competence is missing within EngInfra. It seems like we made the hypothesis pretty clear
-> Raise that to department directors. Do we need to learn about maintaining Cisco switches?
-> Move more to other datacenters or the cloud? => We already have corresponding tickets on our backlog. This just gives us additional motivation to work on those

I will follow up on the remaining tasks, e.g. coordinating with managers regarding responsibility

EDIT: Asked mgriessmeier about the open points

Updated by okurz over 3 years ago

Status changed from In Progress to Feedback

Meeting with mgriessmeier scheduled for tomorrow

Updated by okurz over 3 years ago

Why did EngInfra not have credentials for the switches?

It seems this was missed during the handover that should have happened when the current SLA went into effect.

https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=992 states that "qa-infra@suse.de" is the contact person. We can't find that mailing list, if it is one. Who is that? Is there anyone on that mailing list?

The mailing list is obsolete and no longer exists. It formerly consisted of admins within "QA SLE". Nowadays the official responsibility lies with qa-team@suse.de. De facto the current SLA is unspecified regarding network responsibility: the SLA clearly states that EngInfra is responsible, but in fact they currently cannot fulfill that.

SRV1 is considered production infrastructure and is clearly maintained by SUSE IT EngInfra. Only a limited number of people have access, fewer than 10. SRV2 is considered a "testing lab" by some and production grade by others. https://racktables.nue.suse.com/index.php?page=location&tab=edit&location_id=5 does not even list a contact person (unlike SRV1). Many people (maybe around 60-80 according to mgriessmeier) have access, so it is certainly less structured. We can still consider SUSE IT EngInfra the main contact, but it depends, mostly on the rack.

The switches within the racks are "QA" switches, so we should have access. The SLA says EngInfra has responsibility.
-> 1. TODO: Make sure that we know all those switches and have them marked properly => DONE: okurz updated all entries of https://racktables.nue.suse.com/index.php?page=search&last_page=object&last_tab=default&q=qa-infra%40suse.de (12 as of 2022-02-18) to point to "qa-team@suse.de" instead of the obsolete "qa-infra@suse.de", as decided with mgriessmeier
-> 2. TODO: SUSE QE Tools team must learn about switch administration and get access => #107083
-> 3. TODO: Ask for volunteers in SUSE QE Tools that would be able to visit the Nbg server rooms, e.g. as second person accompanying nsinger or any potential new admin => #107086
-> 4. TODO: Make SUSE QE Tools team aware that we need to support EngInfra due to limited capacity => #107089

EDIT: I pinged @channel in https://suse.slack.com/archives/C02AJ1E568M/p1645197538775129 with a replication of this message

Updated by okurz over 3 years ago

Due date deleted (~~2022-03-03~~)

I pinged @channel in https://suse.slack.com/archives/C02AJ1E568M/p1645197538775129 with a replication of the last message. All necessary points are covered: each is either already resolved or tracked in a specific ticket.

Updated by okurz over 3 years ago

Status changed from Feedback to Resolved

I added the remaining information regarding server rooms and network infrastructure on https://confluence.suse.com/pages/diffpagesbyversion.action?pageId=192544831&selectedPageVersions=82&selectedPageVersions=81 . Now we are done here.
