action #181184

open

Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" size:S

Added by okurz 3 days ago. Updated about 5 hours ago.

Status: Workable
Priority: High
Assignee: -
Category: Organisational
Start date: 2025-04-20
Due date:
% Done: 0%
Estimated time:

Description

Motivation

See #181175. I assume the cron mail archived at https://mailman.suse.de/mlarch/SuSE/osd-admins/2025/osd-admins.2025.04/msg00343.html was the original event that triggered the manual action:

Subject: Cron <postgres@openqa> backup_dir="/var/lib/openqa/backup"; date=$(date -Idate); bf="$backup_dir/$date.dump"; test -e "$bf" || ionice -c3 nice -n19 pg_dump -Fc openqa -f "$bf"; find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v
From: "(Cron Daemon)" <postgres@openqa.oqa.prg2.suse.org>
Date: Fri, 18 Apr 2025 23:40:01 +0000 (UTC)
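
For readability, the one-liner from the Subject above can be unrolled into a small script. This is only a sketch under stated assumptions, not the deployed setup: the function name `backup_and_prune`, the overridable `DUMP_CMD` variable, the guards against an empty or root directory argument, and the restriction of the cleanup to `*.dump` files are all additions for illustration and do not appear in the original cron entry. `date -Idate`, `xargs -r` and `find -maxdepth` assume GNU tools.

```shell
#!/bin/bash
# Sketch only: the crontab one-liner unrolled into a function with guards.
set -euo pipefail

# Assumption: overridable dump command so the sketch can be tried without a database.
DUMP_CMD=${DUMP_CMD:-pg_dump}

backup_and_prune() {
    # Abort with a message if no directory (or an empty one) is passed.
    local backup_dir=${1:?backup directory must be given}

    # Added guards: never operate on a missing directory or on "/".
    [ -d "$backup_dir" ] || { echo "no such directory: $backup_dir" >&2; return 1; }
    [ "$(cd "$backup_dir" && pwd -P)" != "/" ] || { echo "refusing to operate on /" >&2; return 1; }

    local bf
    bf="$backup_dir/$(date -Idate).dump"

    # As in the original: only create today's dump if it does not exist yet.
    if [ ! -e "$bf" ]; then
        ionice -c3 nice -n19 "$DUMP_CMD" -Fc openqa -f "$bf"
    fi

    # Same retention as the original (-mtime +7), but quoted and limited to
    # regular *.dump files directly inside the backup directory.
    find "$backup_dir" -maxdepth 1 -type f -name '*.dump' -mtime +7 -print0 \
        | xargs -0 -r rm -v --
}
```

With a final line such as `backup_and_prune /var/lib/openqa/backup` appended, the crontab entry could shrink to a single call of the script; where such a script would live is left open here. Quoting the directory and refusing an empty value means that a copy-pasted fragment without `backup_dir` set fails loudly instead of expanding to `find / ...`.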

Background

Questions

  1. Why do we have such a long command as a crontab entry and not in a script?
    • A1-1: ...
    • => I1-1-1: ...
  2. ...
    • A2-1: ...
    • => I2-1-1: ...
  3. Why ...
    • A3-1: ...
    • => I3-1-1: ...
  4. Why ...
    • A4-1: ...
    • => I4-1-1: ...
  5. Why ...
    • A5-1: ...
    • => I5-1-1: ...

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys

Related issues 2 (2 open, 0 closed)

Related to openQA Infrastructure (public) - action #181175: OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem size:M | In Progress | ybonatakis | 2025-04-19 | 2025-05-07
Related to openQA Infrastructure (public) - action #181301: Dangerous cleanup of OSD database dumps | New

Actions #1

Updated by okurz 3 days ago

  • Copied from action #180863: Conduct lessons learned "Five Why" analysis for "Gracious handling of longer remote git clones outages" size:S added
Actions #2

Updated by okurz 3 days ago

  • Parent task deleted (#162131)
Actions #3

Updated by okurz 3 days ago

  • Related to action #181175: OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem size:M added
Actions #4

Updated by okurz 3 days ago

  • Copied from deleted (action #180863: Conduct lessons learned "Five Why" analysis for "Gracious handling of longer remote git clones outages" size:S)
Actions #5

Updated by okurz 1 day ago

  • Description updated (diff)
Actions #6

Updated by tinita 1 day ago

In order to not forget I want to note one question beforehand:

  • Why do we have such a long command as a crontab entry and not in a script?
Actions #7

Updated by livdywan 1 day ago

  • Description updated (diff)
Actions #8

Updated by livdywan 1 day ago

  • Subject changed from Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" to Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" size:S
  • Status changed from New to Workable
Actions #9

Updated by livdywan 1 day ago

  • Description updated (diff)

tinita wrote in #note-6:

In order to not forget I want to note one question beforehand:

  • Why do we have such a long command as a crontab entry and not in a script?

I put it in the template so we don't overlook it during the discussion.

Actions #10

Updated by livdywan 1 day ago

  • Priority changed from Normal to High

Also, this should be High so we do it soon while our memory is fresh.

Actions #11

Updated by okurz about 6 hours ago

  • Parent task set to #181298
Actions #12

Updated by livdywan about 5 hours ago

  • Related to action #181301: Dangerous cleanup of OSD database dumps added
Actions #13

Updated by livdywan about 5 hours ago

  • Tags changed from infra, lessons learned, reactive work to infra, lessons learned, reactive work, collaborative-session