action #164481

closed

coordination #164466: [saga][epic] Scale up: Hyper-responsive openQA webUI

openQA Infrastructure - coordination #164469: [epic] Better tools team incident handling

[tools] Discuss benefits vs. drawbacks about applying mitigations as early as possible vs. keeping system in broken state to ease investigation. Also do industry standards best practice research size:S

Added by okurz 4 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
livdywan
Target version:
Ready
Start date:
2024-07-26
Due date:
2024-10-02
% Done:

0%

Estimated time:

Description

Motivation

See #163610-4 (HTTP Response alert Salt)

Acceptance criteria

  • AC1: The tools team knows (and the wiki explains) which approach or methodology to follow in general

Suggestions


Related issues (2 open, 0 closed)

Copied from openQA Infrastructure - action #164478: Monitoring of idle/busy webUI/liveview handler workers (New, 2024-07-26)

Copied to openQA Infrastructure - action #164484: [tools] Investigation helper, e.g. commands in a bash script to collect useful logs, systemd journal, etc. (New, 2024-07-26)

Actions #1

Updated by okurz 4 months ago

  • Copied from action #164478: Monitoring of idle/busy webUI/liveview handler workers added
Actions #2

Updated by okurz 4 months ago

  • Copied to action #164484: [tools] Investigation helper, e.g. commands in a bash script to collect useful logs, systemd journal, etc. added
Actions #3

Updated by okurz 4 months ago

  • Target version changed from future to Ready
Actions #4

Updated by okurz 4 months ago

  • Subject changed from [tools] Discuss benefits vs. drawbacks about applying mitigations as early as possible vs. keeping system in broken state to ease investigation. Also do industry standards best practice research to [tools] Discuss benefits vs. drawbacks about applying mitigations as early as possible vs. keeping system in broken state to ease investigation. Also do industry standards best practice research size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by kraih 4 months ago

I asked Mixtral about this:

Yes, there are industry standards and best practices that organizations can follow when deciding whether to apply mitigations as early as possible or keep a system in a broken state to ease investigation. Here are some examples:

1. ITIL (Information Technology Infrastructure Library): ITIL is a framework for managing IT services. It includes a process for incident management that recommends restoring normal service operation as quickly as possible while also minimizing the impact of the incident. This means that organizations should apply mitigations as soon as possible, while also conducting a thorough investigation to understand the root cause of the issue.

2. NIST (National Institute of Standards and Technology): NIST provides guidelines for incident response and recovery. Its guidelines recommend that organizations apply mitigations as soon as possible to reduce the impact of an incident, while also conducting a thorough investigation to understand the cause and prevent future incidents.

3. ISO/IEC 27001: This is an international standard for information security management systems (ISMS). It recommends that organizations establish processes for incident management, including the application of mitigations as soon as possible to reduce the impact of an incident.

4. COBIT (Control Objectives for Information and Related Technologies): COBIT is a framework for IT management and IT governance. It includes a process for incident management that recommends restoring normal service operation as quickly as possible while also minimizing the impact of the incident.

By following these industry standards and best practices, organizations can make informed decisions about how to balance the need to apply mitigations as early as possible with the need to conduct a thorough investigation. It's important to note that each situation is unique, and organizations should consider the specific circumstances of each incident when making a decision.
Actions #6

Updated by livdywan 3 months ago

  • Assignee set to livdywan
Actions #7

Updated by livdywan 2 months ago

  • Description updated (diff)
Actions #8

Updated by livdywan 2 months ago

  • Status changed from Workable to In Progress

After briefly checking with others what they think, I've decided to review what we already implement and what we actually document (the wiki is not as clear as I thought) and to propose something we can use as a basis for a conversation.

Actions #9

Updated by openqa_review 2 months ago

  • Due date set to 2024-10-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by livdywan 2 months ago

  • Status changed from In Progress to Workable

Putting this aside as I need to focus on #166313 first.

Actions #11

Updated by livdywan about 2 months ago

  • If users report outages of components of our infrastructure
    • Consider teaming up and assigning individual tasks to focus on
    • Inform affected users about the impact and ETA via Slack channels, ticket updates and mailing list posts
    • Look into mitigations and short-term workarounds such as a hotpatch in production or a revert to an older release
    • Investigate a proper solution with a conservative estimate on the effort involved
    • Set a time limit to ensure either a workaround or a solution is available within 4 hours
    • Join an ad-hoc video call to discuss further steps
    • Keep a record of what was discussed and investigated to allow for a later analysis
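
To make the point about mitigations and short-term workarounds a bit more concrete, here is a purely illustrative shell sketch of both variants; the file path, package name, version and service names are assumptions based on a typical openSUSE openQA deployment and would need to be verified on the actual host:

```bash
# Variant A (hypothetical): hot-patch a single file on the webUI host and restart the service.
# Assumption: the openQA package installs its code under /usr/share/openqa on openSUSE.
sudo cp /usr/share/openqa/lib/OpenQA/WebAPI.pm /usr/share/openqa/lib/OpenQA/WebAPI.pm.bak
sudo patch /usr/share/openqa/lib/OpenQA/WebAPI.pm < /tmp/fix.patch  # /tmp/fix.patch is a placeholder
sudo systemctl restart openqa-webui

# Variant B (hypothetical): revert the package to the previously installed release.
# Look up the previous version first, e.g. in /var/log/zypp/history.
old_version="4.6.example"  # placeholder, replace with the real previous version
sudo zypper install --oldpackage "openQA-$old_version"
sudo systemctl restart openqa-webui openqa-gru
```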
Actions #12

Updated by okurz about 2 months ago

livdywan wrote in #note-11:

  • If users report outages of components of our infrastructure […]

Very good suggestions. I like them. But the early mitigations should be applicable regardless of whether users report outages. So I would add these points at the top level, not within the point "If users report …"

  • Inform affected users about the impact and ETA via Slack channels, ticket updates and mailing list posts

I suggest using "chat channels" instead of "Slack channels" to be more generic and applicable to openSUSE users as well.

  • Set a time limit to ensure either a workaround or a solution is available within 4 hours

Depending on the severity of the issue, 4 hours might be too long or too short. I would not mention a specific time. We could say "Set a time limit to ensure either a workaround or a solution is available within a reasonable time depending on the severity"

Actions #13

Updated by livdywan about 2 months ago

okurz wrote in #note-12:

  • Set a time limit to ensure either a workaround or a solution is available within 4 hours

Depending on the severity of the issue, 4 hours might be too long or too short. I would not mention a specific time. We could say "Set a time limit to ensure either a workaround or a solution is available within a reasonable time depending on the severity"

My thought here is to give examples to make it clear that this should be a short timeframe. I phrased it more explicitly now.

Updated draft:

  • If users report outages of components of our infrastructure
    • Ensure there is a ticket on the backlog tracking the issue
  • For any user-facing outages
    • Consider teaming up and assigning individual tasks to focus on
    • Inform affected users about the impact and ETA via chat channels, ticket updates and mailing list posts
    • Look into mitigations and short-term workarounds such as a hotpatch in production or a revert to an older release
    • Investigate a proper solution with a conservative estimate on the effort involved
    • Set a time limit to ensure either a workaround or a solution is available within a reasonable amount of time (for example 4 hours or end of working day of the person communicating the changes)
    • Join an ad-hoc video call to discuss further steps
    • Keep a record of what was discussed and investigated to allow for a later analysis
Actions #14

Updated by livdywan about 2 months ago

I also took these into account; they are reflected in my draft. We could potentially make it even more verbose and explicitly discuss severity assessment, conducting a root cause analysis and regular practice runs. However, that would make it less of a checklist and more of a manual. I'm trying to keep it concise here.

Actions #15

Updated by livdywan about 2 months ago

While discussing #167335 we realized there's an important step missing here, which I'm adding to my draft now:

  • If users report outages of components of our infrastructure
    • Ensure there is a ticket on the backlog tracking the issue
  • For any user-facing outages
    • Consider teaming up and assigning individual tasks to focus on
    • Inform affected users about the impact and ETA via chat channels, ticket updates and mailing list posts
    • Look into mitigations and short-term workarounds such as a hotpatch in production or a revert to an older release
    • Investigate a proper solution with a conservative estimate on the effort involved
    • Set a time limit to ensure either a workaround or a solution is available within a reasonable amount of time (for example 4 hours or end of working day of the person communicating the changes)
    • Join an ad-hoc video call to discuss further steps
    • Keep a record of what was discussed and investigated to allow for a later analysis
    • Look into symptoms such as restarting incomplete jobs
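
For the last point, here is a minimal sketch of how restarting incomplete jobs could be scripted, assuming the openQA client (openqa-cli) and jq are installed, API credentials for the target host are configured and the host is adjusted to the affected instance; this is only an illustration, not an agreed-upon tool:

```bash
#!/bin/bash
# Sketch only: restart jobs that ended up incomplete during an outage window.
# Assumes openqa-cli and jq are available and that API credentials are set up
# in ~/.config/openqa/client.conf for the chosen host.
set -euo pipefail

host="https://openqa.opensuse.org"  # adjust to the affected instance

# Collect the ids of the latest incomplete jobs
ids=$(openqa-cli api --host "$host" jobs result=incomplete latest=1 | jq -r '.jobs[].id')

for id in $ids; do
    echo "Restarting job $id"
    openqa-cli api --host "$host" -X POST "jobs/$id/restart"
done
```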
Actions #16

Updated by livdywan about 2 months ago

  • Status changed from Workable to Resolved

No further feedback. I updated the Process based on what was discussed.
