action #181184
Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" size:S
Status: Workable
Priority: High
Assignee: -
Category: Organisational
Target version:
Start date: 2025-04-20
Due date:
% Done: 0%
Estimated time:
Description
Motivation
See #181175. I assume the cron mail archived at https://mailman.suse.de/mlarch/SuSE/osd-admins/2025/osd-admins.2025.04/msg00343.html was what originally triggered the manual action:
Subject: Cron <postgres@openqa> backup_dir="/var/lib/openqa/backup"; date=$(date -Idate); bf="$backup_dir/$date.dump"; test -e "$bf" || ionice -c3 nice -n19 pg_dump -Fc openqa -f "$bf"; find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v
From: "(Cron Daemon)" <postgres@openqa.oqa.prg2.suse.org>
Date: Fri, 18 Apr 2025 23:40:01 +0000 (UTC)
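As input for the analysis, the one-liner from the cron mail above could be moved into a standalone script. The following is only a minimal sketch, not the deployed configuration: the function names, the `run` argument and the `BACKUP_DIR`/`KEEP_DAYS` overrides are hypothetical, and restricting the cleanup to regular `*.dump` files directly inside the backup directory is an assumption about what the cleanup is meant to delete. It relies on GNU `date` and `find`.

```shell
#!/bin/bash
# Sketch only: a script form of the crontab one-liner quoted above.
# Function names, the "run" argument and the env overrides are hypothetical.
set -euo pipefail

backup_dir="${BACKUP_DIR:-/var/lib/openqa/backup}"
keep_days="${KEEP_DAYS:-7}"

# Create today's dump unless it already exists (same check as the one-liner).
create_dump() {
    local bf
    bf="$backup_dir/$(date -Idate).dump"
    test -e "$bf" || ionice -c3 nice -n19 pg_dump -Fc openqa -f "$bf"
}

# Delete only regular *.dump files directly inside the backup directory that
# are older than $keep_days days. Using find's own -delete means rm is never
# invoked without operands (GNU xargs without -r runs the command once even
# when its input is empty).
prune_old_dumps() {
    # Refuse to run against an empty or missing path.
    [ -n "$backup_dir" ] && [ -d "$backup_dir" ] || {
        echo "no such directory: $backup_dir" >&2
        return 1
    }
    find "$backup_dir" -mindepth 1 -maxdepth 1 -type f -name '*.dump' \
        -mtime "+$keep_days" -print -delete
}

# Only act when called with "run", so the functions can be sourced and tested.
if [ "${1:-}" = "run" ]; then
    create_dump
    prune_old_dumps
fi
```

With such a script installed, the crontab entry would shrink to a single call of the script with the `run` argument, and the cleanup logic could be reviewed and tested outside of cron.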
Background
Questions
- Why do we have such a long command as a crontab entry and not in a script?
- A1-1: ...
- => I1-1-1: ...
- ...
- A2-1: ...
- => I2-1-1: ...
- Why ...
- A1-1: ...
- => I1-1-1: ...
- Why ...
- A1-1: ...
- => I1-1-1: ...
- Why ...
- A1-1: ...
- => I1-1-1: ...
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys
Updated by okurz 3 days ago
- Copied from action #180863: Conduct lessons learned "Five Why" analysis for "Gracious handling of longer remote git clones outages" size:S added
Updated by okurz 3 days ago
- Related to action #181175: OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem size:M added
Updated by okurz 3 days ago
- Copied from deleted (action #180863: Conduct lessons learned "Five Why" analysis for "Gracious handling of longer remote git clones outages" size:S)
Updated by livdywan 1 day ago
- Subject changed from Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" to Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" size:S
- Status changed from New to Workable
Updated by livdywan about 5 hours ago
- Related to action #181301: Dangerous cleanup of OSD database dumps added
Updated by livdywan about 5 hours ago
- Tags changed from infra, lessons learned, reactive work to infra, lessons learned, reactive work, collaborative-session