action #164481
coordination #164466 (closed): [saga][epic] Scale up: Hyper-responsive openQA webUI
openQA Infrastructure (public) - coordination #164469: [epic] Better tools team incident handling
[tools] Discuss benefits vs. drawbacks about applying mitigations as early as possible vs. keeping system in broken state to ease investigation. Also do industry standards best practice research size:S
Description
Motivation
See #163610-4 (HTTP Response alert Salt)
Acceptance criteria
- AC1: The tools team knows (and the wiki explains) which approach or methodology to follow in general
Suggestions
- Do a quick industry standards best practice research
- Consider reading https://sre.google/sre-book/table-of-contents/ or ask AI
- Draft an extension to https://progress.opensuse.org/projects/qa/wiki/Tools#Alert-handling
- Consider the example of the "last person standing in the team": what should we do?
- Consider a team session to discuss this topic and come up with decisions
Updated by okurz 5 months ago
- Copied from action #164478: Monitoring of idle/busy webUI/liveview handler workers added
Updated by okurz 5 months ago
- Copied to action #164484: [tools] Investigation helper, e.g. commands in a bash script to collect useful logs, systemd journal, etc. added
Updated by okurz 5 months ago
- Subject changed from [tools] Discuss benefits vs. drawbacks about applying mitigations as early as possible vs. keeping system in broken state to ease investigation. Also do industry standards best practice research to [tools] Discuss benefits vs. drawbacks about applying mitigations as early as possible vs. keeping system in broken state to ease investigation. Also do industry standards best practice research size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by kraih 5 months ago
I asked Mixtral about this:
Yes, there are industry standards and best practices that organizations can follow when deciding whether to apply mitigations as early as possible or keep a system in a broken state to ease investigation. Here are some examples:
1. ITIL (Information Technology Infrastructure Library): ITIL is a framework for managing IT services. It includes a process for incident management that recommends restoring normal service operation as quickly as possible while also minimizing the impact of the incident. This means that organizations should apply mitigations as soon as possible, while also conducting a thorough investigation to understand the root cause of the issue.
2. NIST (National Institute of Standards and Technology): NIST provides guidelines for incident response and recovery. Its guidelines recommend that organizations apply mitigations as soon as possible to reduce the impact of an incident, while also conducting a thorough investigation to understand the cause and prevent future incidents.
3. ISO/IEC 27001: This is an international standard for information security management systems (ISMS). It recommends that organizations establish processes for incident management, including the application of mitigations as soon as possible to reduce the impact of an incident.
4. COBIT (Control Objectives for Information and Related Technologies): COBIT is a framework for IT management and IT governance. It includes a process for incident management that recommends restoring normal service operation as quickly as possible while also minimizing the impact of the incident.
By following these industry standards and best practices, organizations can make informed decisions about how to balance the need to apply mitigations as early as possible with the need to conduct a thorough investigation. It's important to note that each situation is unique, and organizations should consider the specific circumstances of each incident when making a decision.
Updated by livdywan 3 months ago
- Status changed from Workable to In Progress
- Draft an extension to https://progress.opensuse.org/projects/qa/wiki/Tools#Alert-handling
- Consider the example of the "last person standing in the team": what should we do?
- Consider a team session to discuss this topic and come up with decisions
After briefly checking with others what they think, I've decided to review what we already implement and what we actually document, as the wiki is not as clear as I thought, and to propose something we can use as a basis for a conversation.
Updated by openqa_review 3 months ago
- Due date set to 2024-10-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan 3 months ago
- Draft an extension to https://progress.opensuse.org/projects/qa/wiki/Tools#Alert-handling
- If users report outages of components of our infrastructure
  - Consider teaming up and assigning individual tasks to focus on
  - Inform affected users about the impact and ETA via Slack channels, ticket updates and mailing list posts
  - Look into mitigations and short-term workarounds such as a hotpatch in production or a revert to an older release
  - Investigate a proper solution with a conservative estimate on the effort involved
  - Set a time limit to ensure either a workaround or a solution is available within 4 hours
  - Join an ad-hoc video call to discuss further steps
  - Keep a record of what was discussed and investigated to allow for a later analysis (a rough collection sketch follows below)
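Not meant as part of the wiki text, just to make the last point concrete: a minimal sketch of what a small helper for keeping such a record could look like, in the spirit of the investigation-helper idea split out to #164484. The service names, time window and output location are assumptions, not an agreed interface.

```python
# Rough sketch: collect service status and recent journal entries into a
# timestamped directory so that a record exists for later analysis.
# The unit names below are examples/assumptions, adjust for the affected host.
import datetime
import pathlib
import subprocess

SERVICES = ["openqa-webui", "openqa-scheduler", "openqa-websockets"]  # assumed unit names


def collect(outdir: pathlib.Path) -> None:
    outdir.mkdir(parents=True, exist_ok=True)
    for unit in SERVICES:
        status = subprocess.run(
            ["systemctl", "status", "--no-pager", unit],
            capture_output=True, text=True, check=False,
        )
        journal = subprocess.run(
            ["journalctl", "-u", unit, "--since", "2 hours ago", "--no-pager"],
            capture_output=True, text=True, check=False,
        )
        (outdir / f"{unit}.status.txt").write_text(status.stdout + status.stderr)
        (outdir / f"{unit}.journal.txt").write_text(journal.stdout + journal.stderr)


if __name__ == "__main__":
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    collect(pathlib.Path(f"/tmp/incident-{stamp}"))
```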
Updated by okurz 3 months ago
livdywan wrote in #note-11:
- Draft an extension to https://progress.opensuse.org/projects/qa/wiki/Tools#Alert-handling
- If users report outages of components of our infrastructure […]
Very good suggestions. I like them. But the early mitigations should be applicable regardless of whether users report outages. So I would just add the points at the top level, not within the point "If users report …"
- Inform affected users about the impact and ETA via Slack channels, ticket updates and mailing list posts
I suggest using "chat channels" instead of "Slack channels" to be more generic and applicable to openSUSE-focused users as well.
- Set a time limit to ensure either a workaround or a solution is available within 4 hours
Depending on the severity of the issue, 4 hours might be too long or too short. I would not mention a specific time. We could say "Set a time limit to ensure either a workaround or a solution is available within a reasonable time depending on the severity"
Updated by livdywan 3 months ago
- Set a time limit to ensure either a workaround or a solution is available within 4 hours
Depending on the severity of the issue, 4 hours might be too long or too short. I would not mention a specific time. We could say "Set a time limit to ensure either a workaround or a solution is available within a reasonable time depending on the severity"
My thought here is to give examples to make it clear that this should be a short timeframe. I phrased it more explicitly now.
Updated draft:
- If users report outages of components of our infrastructure
  - Ensure there is a ticket on the backlog tracking the issue (see the sketch below)
- For any user-facing outages
  - Consider teaming up and assigning individual tasks to focus on
  - Inform affected users about the impact and ETA via chat channels, ticket updates and mailing list posts
  - Look into mitigations and short-term workarounds such as a hotpatch in production or a revert to an older release
  - Investigate a proper solution with a conservative estimate on the effort involved
  - Set a time limit to ensure either a workaround or a solution is available within a reasonable amount of time (for example 4 hours or end of working day of the person communicating the changes)
  - Join an ad-hoc video call to discuss further steps
  - Keep a record of what was discussed and investigated to allow for a later analysis
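Again not meant as wiki text, just to make the "ensure there is a ticket" point concrete: a minimal sketch of filing such a tracking ticket via the Redmine REST API of progress.opensuse.org. The project identifier, API key and example subject are placeholders/assumptions, and duplicate detection is deliberately left out.

```python
# Rough sketch: file a backlog ticket that tracks an ongoing outage so that
# follow-up work and later analysis have a home. Checking whether a ticket
# already exists is left out of this sketch.
import requests

REDMINE = "https://progress.opensuse.org"
API_KEY = "..."        # placeholder, personal Redmine API key
PROJECT = "openqav3"   # assumed project identifier, adjust as needed


def file_tracking_ticket(subject: str, description: str) -> int:
    resp = requests.post(
        f"{REDMINE}/issues.json",
        headers={"X-Redmine-API-Key": API_KEY},
        json={"issue": {"project_id": PROJECT, "subject": subject, "description": description}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["issue"]["id"]


if __name__ == "__main__":
    print(file_tracking_ticket(
        "webUI outage (example)",
        "Tracking ticket for the ongoing outage, details to follow in comments.",
    ))
```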
Updated by livdywan 3 months ago
- Do a quick industry standards best practice research
- Consider reading https://sre.google/sre-book/table-of-contents/ or ask AI
I also took these into account; they are reflected in my draft. We could potentially make it even more verbose and explicitly discuss severity assessment, conducting a root cause analysis and regular practice runs. However, that would make it less of a checklist and more of a manual. I'm trying to keep it concise here.
Updated by livdywan 3 months ago
While discussing #167335 we realized there's an important step missing here, which I'm adding to my draft now:
- If users report outages of components of our infrastructure
  - Ensure there is a ticket on the backlog tracking the issue
- For any user-facing outages
  - Consider teaming up and assigning individual tasks to focus on
  - Inform affected users about the impact and ETA via chat channels, ticket updates and mailing list posts
  - Look into mitigations and short-term workarounds such as a hotpatch in production or a revert to an older release
  - Investigate a proper solution with a conservative estimate on the effort involved
  - Set a time limit to ensure either a workaround or a solution is available within a reasonable amount of time (for example 4 hours or end of working day of the person communicating the changes)
  - Join an ad-hoc video call to discuss further steps
  - Keep a record of what was discussed and investigated to allow for a later analysis
  - Look into symptoms such as restarting incomplete jobs (a rough sketch follows below)
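To illustrate the newly added point about symptoms, a minimal sketch of restarting recently incompleted jobs via the openQA REST API. The host and filter parameters are assumptions, and authentication is omitted; in practice openqa-cli with a configured API key would be the more convenient route.

```python
# Rough sketch: find jobs that finished as "incomplete" on an openQA instance
# and trigger a restart for each of them. Authentication (API key/secret) is
# omitted here, so real use needs credentials or openqa-cli instead.
import requests

HOST = "https://openqa.opensuse.org"  # example instance


def restart_incomplete_jobs(limit: int = 20) -> None:
    resp = requests.get(
        f"{HOST}/api/v1/jobs",
        params={"state": "done", "result": "incomplete", "limit": limit},  # assumed filters
        timeout=30,
    )
    resp.raise_for_status()
    for job in resp.json().get("jobs", []):
        job_id = job["id"]
        r = requests.post(f"{HOST}/api/v1/jobs/{job_id}/restart", timeout=30)
        print(f"job {job_id}: HTTP {r.status_code}")


if __name__ == "__main__":
    restart_incomplete_jobs()
```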