action #106538
closed coordination #102882: [epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service
lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:S
Description
Acceptance criteria
- AC1: A "five why" session has been conducted and results recorded in this ticket
- AC2: Actionable followup tasks have been defined in new tickets
Suggestions
- Ask the team to join a five why session, e.g. in the "extension on-demand" meeting Thursday morning
- Record results in the ticket
- Define followup tasks in new tickets
Updated by livdywan over 2 years ago
- Subject changed from lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" to lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 2 years ago
- Assignee set to livdywan
- Priority changed from Normal to High
If we don't do this soon everybody will lose memory of what happened :) I suggest using tomorrow's extensions meeting slot 0900Z to cover this. Assigning to @cdywan as discussed in the daily.
Updated by livdywan over 2 years ago
- Status changed from Workable to In Progress
I sent out invitation emails and a message in the team Slack for tomorrow at 10 CET.
Updated by openqa_review over 2 years ago
- Due date set to 2022-03-03
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 2 years ago
- Assignee changed from livdywan to okurz
We had the meeting 2022-02-17 and conducted a "Five Whys" analysis. We collected results on https://etherpad.opensuse.org/p/suse_qe_tools and they are as follows:
Why did it take so long for infra to perform a firmware upgrade?
- Because they did not take the issue seriously enough: mdoucha reported ridiculous data loss, marius ran iperf and reported high bandwidth, nick pointed to the difference between upload and download, and we discussed performance data and the broken MAC address table (see the bandwidth-check sketch below)
- -> Make sure when we create EngInfra tickets that the severity and impact are obvious => DONE: https://progress.opensuse.org/projects/qa/wiki/Tools/diff?utf8=%E2%9C%93&version=26&version_from=25&commit=View+differences
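For reference, a minimal sketch of how such an upload/download asymmetry could be measured with iperf3. This is illustrative only, not what was actually run at the time: the server hostname below is a placeholder and iperf3 is assumed to be installed on both ends.

```python
#!/usr/bin/env python3
"""Compare upload vs. download throughput with iperf3 (hedged sketch).

Assumptions: iperf3 is installed locally and an iperf3 server is listening
on the (hypothetical) host below.
"""
import json
import subprocess

SERVER = "iperf-server.example.suse.de"  # placeholder, not a real host


def measure_mbps(reverse: bool) -> float:
    """Run one iperf3 test; reverse=True measures download (server -> client)."""
    cmd = ["iperf3", "-c", SERVER, "-J"]
    if reverse:
        cmd.append("-R")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    data = json.loads(result.stdout)
    # throughput actually delivered to the receiving side, in Mbit/s
    return data["end"]["sum_received"]["bits_per_second"] / 1e6


upload = measure_mbps(reverse=False)
download = measure_mbps(reverse=True)
print(f"upload: {upload:.1f} Mbit/s, download: {download:.1f} Mbit/s")
if download < upload / 10:
    print("suspicious asymmetry: download is less than 10% of upload")
```

Comparing both directions in one run is what makes the asymmetry visible; a single-direction iperf test can look perfectly healthy while the other direction is collapsing.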
Why can EngInfra state that they cannot help with the issue when it's about a rack in SRV2 to which members of SUSE QE Tools don't have access without EngInfra?
- Who else could have access? Obviously we assumed we don't
- -> Crosscheck with relevant managers, e.g. department lead, who can drive such issues and who has access
- -> How should we be able to work on such issues if we don't have access?
- -> Do we need more people with access? nsinger had access (https://progress.opensuse.org/issues/102882#note-19)
- -> Why did EngInfra not have credentials for the switches? https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=992 states that "qa-infra@suse.de" is the contact person. We can't find that mailing list, if it is one. Who is that? Or is there anyone on that mailing list?
Why was there no follow-up in https://sd.suse.com/servicedesk/customer/portal/1/SD-67703 between 2021-11-24 11:18 and 2021-12-01 17:13 (that's 7 days, from one Wednesday to the next)?
- Because we tried to work on this on our own. nsinger gained access, but of course, due to the Covid pandemic and remote work, physical access had to be planned and did not happen immediately
Why did nothing happen in the progress ticket for 2 weeks after 2021-12-01?
- There was mostly information flowing in the EngInfra ticket. But for a "High" priority ticket in our backlog that is assigned to someone and actively being worked on, there should be more frequent updates
- -> Ensure more frequent updates in tickets that are actively being worked on in our backlog. We need a daily sync; we are currently trying this with the "moderation duty". The only alternative is a mandatory daily meeting
Why did people from EngInfra repeatedly only "plug cables" instead of looking into the clear problem hypothesis we made repeatedly about "unintended traffic" and "broadcast traffic"?
- Maybe the corresponding network-related technical competence is missing within EngInfra. It seems we made the hypothesis pretty clear (see the broadcast-traffic sketch after this list)
- -> Raise that to department directors. Do we need to learn about maintaining Cisco switches?
- -> Move more to other datacenters or the cloud? => We already have corresponding tickets on our backlog. This just gives us additional motivation to work on them
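As a reference for the "unintended traffic" / "broadcast traffic" hypothesis, here is a minimal sketch of the kind of check we had in mind. It is illustrative only: the interface name and the 10-second capture window are assumptions, and tcpdump needs root (or CAP_NET_RAW) to run.

```python
#!/usr/bin/env python3
"""Count broadcast/multicast frames on an interface for a short window (hedged sketch).

Assumptions: tcpdump is installed, the script runs with sufficient privileges,
and the interface name below matches the worker's uplink.
"""
import subprocess

IFACE = "eth0"      # assumed interface name, adjust per worker
DURATION = 10       # capture window in seconds, arbitrary choice

# tcpdump writes one line per captured frame to stdout; statistics go to stderr.
proc = subprocess.run(
    ["timeout", str(DURATION), "tcpdump", "-i", IFACE, "-nn", "-q",
     "ether broadcast or ether multicast"],
    capture_output=True, text=True,
)
frames = [line for line in proc.stdout.splitlines() if line.strip()]
print(f"{len(frames)} broadcast/multicast frames in {DURATION}s on {IFACE}")
```

A persistently high rate here, e.g. thousands of frames per second, would support the hypothesis that a flooding switch with a broken MAC address table, rather than the workers themselves, was saturating the link.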
I will follow up on the remaining tasks, e.g. coordinating with managers regarding responsibility
EDIT: Asked mgriessmeier about the open points
Updated by okurz over 2 years ago
- Status changed from In Progress to Feedback
Meeting scheduled with mgriessmeier tomorrow
Updated by okurz over 2 years ago
Why did EngInfra not have credentials for the switches?
It seems this was missed during the handover that should have happened when the current SLA came into effect.
https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=992 states that "qa-infra@suse.de" is the contact person. We can't find that mailing list, if it is one. Who is that? Or is there anyone on that mailing list?
The mailing list is obsolete and does not exist anymore. It formerly consisted of admins within "QA SLE". Nowadays the official responsibility is with qa-team@suse.de. De facto the current SLA is unclear on the point of network responsibility: the SLA clearly states that EngInfra is responsible, but in fact they currently cannot fulfill that.
SRV1 is considered production infrastructure and is clearly maintained by SUSE IT EngInfra. Only a limited number of people, fewer than 10, have access. SRV2 is considered a "testing lab" by some and production grade by others. https://racktables.nue.suse.com/index.php?page=location&tab=edit&location_id=5 does not even list a contact person (unlike SRV1). Many people (maybe around 60-80 according to mgriessmeier) have access, so it is certainly less structured. We can still consider SUSE IT EngInfra the main contact, but it depends, mostly per rack.
The switches within the racks are "QA" switches so we should have access. SLA says EngInfra has responsibility.
-> 1. TODO: Make sure that we know all those switches and have them marked properly => DONE: okurz updated all entries of https://racktables.nue.suse.com/index.php?page=search&last_page=object&last_tab=default&q=qa-infra%40suse.de (12 as of 2022-02-18) to point to "qa-team@suse.de" instead of the obsolete "qa-infra@suse.de", as decided with mgriessmeier
-> 2. TODO: SUSE QE Tools team must learn about switch administration and get access => #107083
-> 3. TODO: Ask for volunteers in SUSE QE Tools that would be able to visit the Nbg server rooms, e.g. as second person accompanying nsinger or any potential new admin => #107086
-> 4. TODO: Make SUSE QE Tools team aware that we need to support EngInfra due to limited capacity => #107089
EDIT: I pinged @channel in https://suse.slack.com/archives/C02AJ1E568M/p1645197538775129 with a copy of this message
Updated by okurz over 2 years ago
- Due date deleted (2022-03-03)
I pinged @channel in https://suse.slack.com/archives/C02AJ1E568M/p1645197538775129 with a copy of the last message. We have all necessary points covered, either already resolved or tracked in specific tickets.
Updated by okurz over 2 years ago
- Status changed from Feedback to Resolved
I added the remaining information regarding server rooms and network infrastructure on https://confluence.suse.com/pages/diffpagesbyversion.action?pageId=192544831&selectedPageVersions=82&selectedPageVersions=81 . Now we are done here.