action #181175

closed

OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem size:M

Added by ybonatakis 26 days ago. Updated 17 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2025-04-19
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

After manually running the following command the system became unusable:

sudo -u postgres backup_dir="/var/lib/openqa/backup"; date=$(date -Idate); bf="$backup_dir/$date.dump"; test -e "$bf" || ionice -c3 nice -n19 pg_dump -Fc openqa -f "$bf"; find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v
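The likely failure mode (a hedged interpretation, see also the copied ticket #181301 about the dangerous cleanup of database dumps): sudo -u postgres backup_dir=... does not set backup_dir in the invoking shell, so the trailing find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v presumably expanded to find / -mtime +7 ... and deleted files older than 7 days across the root filesystem. A minimal safer sketch, assuming the same backup location and a root shell; this is for illustration only, not an agreed-upon fix:

# abort on unset variables and errors instead of silently expanding $backup_dir to "/"
set -eu
backup_dir="/var/lib/openqa/backup"
bf="$backup_dir/$(date -I).dump"
test -e "$bf" || sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f "$bf"
# restrict the cleanup to dump files directly below the backup directory
find "$backup_dir" -maxdepth 1 -name '*.dump' -mtime +7 -print0 | xargs -0 -r rm -v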

I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldn't.

Acceptance Criteria

  • AC1: openqa.suse.de is accessible in a web browser
  • AC2: NFS mount and all related filesystems are back and working as previously
  • AC3: All workers are connected and accessible via salt
  • AC4: The web UI looks sensible

Suggestions

  • Conduct a 5 WHYs PLANNED #181184
  • Set the autoincrement value of the jobs primary key to the highest job id in qem-dashboard and / or the latest id in the testresults directory to avoid reusing job ids
  • Possibly cancel/restart any jobs still in the running status (though stale job detection should cover that)
  • Use the openqa-advanced-retrigger script
  • File a follow-up ticket about availability of osd snapshots (apparently we only have 2 daily snapshots going back a week?)

Rollback steps


Related issues: 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #181184: Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" size:S (Resolved, okurz, 2025-04-20)

Blocks openQA Infrastructure (public) - action #180926: openqa.suse.de: Cron <root@openqa> touch /var/lib/openqa/factory/repo/cvd/* size:S (Resolved, mkittler, 2025-04-08)

Copied to openQA Infrastructure (public) - action #181301: Dangerous cleanup of OSD database dumps size:S (Resolved, nicksinger, 2025-05-10)

Actions #2

Updated by tinita 25 days ago

  • Target version set to Ready
Actions #3

Updated by okurz 25 days ago

  • Tags changed from infra to infra, alert
  • Subject changed from OSD is down and broken for good to OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem
  • Assignee set to okurz
  • Priority changed from Urgent to Immediate

ybonatakis wrote:

I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldnt.

That's nonsense. Those are only instructions for how one could set up an alternative infrastructure if the original one is not usable for a long time; they are not related to recovering the original instance.

Actions #4

Updated by okurz 25 days ago

  • Related to action #181184: Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" size:S added
Actions #5

Updated by ybonatakis 25 days ago

okurz wrote in #note-3:

ybonatakis wrote:

I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldnt.

That's nonsense. Those are only instructions for how one could set up an alternative infrastructure if the original one is not usable for a long time; they are not related to recovering the original instance.

I had no idea where to go for some action. I tried to find something and this seemed like a thing to try out.

Actions #6

Updated by okurz 24 days ago

  • Assignee deleted (okurz)

That's what I wrote in the ticket:

I assume part of the root filesystem, potentially more, was removed by accidental user action. If you can confirm that the system is inoperable please recover snapshots of the filesystem images attached to openqa.suse.de to the most recent state before 2025-04-19 05:00 UTC

No response for some hours. Since I was involved in multiple urgent mitigations here, I would prefer if someone else picks this up and cleans up the mess :)

Actions #7

Updated by ybonatakis 24 days ago

  • Assignee set to ybonatakis
Actions #8

Updated by tinita 24 days ago · Edited

  • Description updated (diff)

I disabled osd-deployment and salt-states-openqa pipelines, so when the VM is back, we can check everything before running deployment and salt.

edit: and also salt-pillars-openqa

Actions #9

Updated by tinita 24 days ago

  • Description updated (diff)
Actions #10

Updated by tinita 24 days ago

  • Description updated (diff)

Also disabled backup for now

Actions #11

Updated by tinita 24 days ago

  • Description updated (diff)

Also disabled fetch_openqa_bugs

Actions #12

Updated by tinita 24 days ago

  • Description updated (diff)

Also disabled the other backup on backup.qe.prg2.suse.org

Actions #13

Updated by ybonatakis 24 days ago

  • Description updated (diff)

Silenced alerts for now:
qesapworker-prg6 hostup
worker33
schort-server
worker29
worker31
worker-arm1
worker-arm2
tumblesle
backup
netboot.qe.prg2.suse.org
worker30
storage
backup-vm
worker34
diesel
petrol
osiris-1
backup-qam
worker35
s390zl12
monitor
worker36
sapworker1
grenache-1
qamaster
unreal6
baremetal-support
baremetal-support-prg2
jenkins
netboot
qesapworker-prg7
worker32
openqaw5-xen
mania
ada

Actions #14

Updated by tinita 24 days ago

All those silences from comment 13 expired after two hours btw.

I now silenced the actually annoying ones that keep resolving and firing.

https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana

  • Systemd services
  • web UI: Too many 5xx HTTP responses
  • External http responses (2 different ones)
Actions #15

Updated by tinita 23 days ago

  • Description updated (diff)

I disabled salt-minion.service on both backup hosts now, as apparently it was still running at least on backup.qe.prg2.suse.org and had overwritten the rsnapshot.conf.
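For reference, this roughly corresponds to the following commands (a sketch only, assuming a root shell on each backup host; the exact invocations are not recorded in this ticket):

# stop the Salt minion and keep it from starting again, so it cannot overwrite rsnapshot.conf
systemctl disable --now salt-minion.service
# verify it is inactive and disabled
systemctl status salt-minion.service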

Actions #16

Updated by livdywan 23 days ago · Edited

I'll leave it to others to identify more follow-up points I guess, see internal team chat

Actions #17

Updated by okurz 23 days ago

  • Status changed from New to In Progress
  • Priority changed from Immediate to Urgent

This is now being worked on by Ignacio Torres, and I asked him in https://suse.slack.com/archives/C029APBKLGK/p1745311147032479 to continue in a group chat.

Actions #18

Updated by mkittler 23 days ago

  • Blocks action #180926: openqa.suse.de: Cron <root@openqa> touch /var/lib/openqa/factory/repo/cvd/* size:S added
Actions #19

Updated by livdywan 22 days ago

  • Subject changed from OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem to OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem size:M
  • Description updated (diff)
Actions #20

Updated by tinita 22 days ago

For the record:
The maximum job id in qem-dashboard was 17390726, so I set the autoincrement sequence like this:

openqa=> ALTER SEQUENCE jobs_id_seq RESTART WITH 17400000;
ALTER SEQUENCE
openqa=> SELECT nextval('jobs_id_seq');                                                                                                                                                                
 nextval  
----------
 17400000
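A quick way to double-check the result (a hedged sketch, assuming psql access as the postgres user on OSD; not taken from the ticket):

# compare the highest existing job id with the current sequence position
sudo -u postgres psql openqa -c "SELECT max(id) FROM jobs;"
sudo -u postgres psql openqa -c "SELECT last_value FROM jobs_id_seq;"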
Actions #21

Updated by okurz 22 days ago

  • Description updated (diff)

For a bit more context from a conversation with Aziz Rozyev and Ignacio Torres from IT:
There are only 2 daily snapshots recorded, so the most recent we have for the root disk are:

                  weekly.2025-04-13_0015                 237.0GB     1%    4%
                  weekly.2025-04-20_0015                      0B     0%    0%
                  daily.2025-04-21_0010                       0B     0%    0%

so we went for weekly.2025-04-13_0015. All 5 storage volumes were recovered. Ignacio first booted the system as we requested with systemd.unit=emergency.target. I provided the root password and Ignacio could log in and mask+disable openqa-scheduler and openqa-webui. After that the VM was rebooted and we could log in over ssh and continue. I have now enabled osd-deployment again and triggered it, also scripts-ci, salt-states-openqa and salt-pillars-openqa. tinita has bumped the auto-increment id for openQA jobs to prevent conflicts, based on the maximum recorded in http://dashboard.qam.suse.de/. New jobs have since been triggered and show the new ids 17400000+. Enabled openqa-webui and openqa-scheduler again now. All looks good so far.
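The service handling above roughly corresponds to the following commands (a sketch for illustration, assuming a root shell on openqa.suse.de; the exact commands used during recovery are not recorded here):

# during recovery: keep the openQA services from starting until the restored state is verified
systemctl mask openqa-scheduler openqa-webui
systemctl disable openqa-scheduler openqa-webui
# after verification: bring the services back
systemctl unmask openqa-scheduler openqa-webui
systemctl enable --now openqa-scheduler openqa-webui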

Actions #22

Updated by ybonatakis 22 days ago

Enabled the backup in /etc/rsnapshot.conf on backup.qe.prg2.suse.org again:

backup  root@openqa.suse.de:/etc/       openqa.suse.de/
backup_exec     ssh root@openqa.suse.de "cd /tmp; sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
backup  root@openqa.suse.de:/var/lib/openqa/SQL-DUMPS/  openqa.suse.de/
backup  root@openqa.suse.de:/var/log/zypp/      openqa.suse.de/
Actions #23

Updated by ybonatakis 22 days ago

  • Description updated (diff)
Actions #24

Updated by ybonatakis 22 days ago

  • Description updated (diff)

backup-vm.qe.nue2.suse.org:

backup  root@localhost:/etc/    openqa.suse.de/ ssh_args=-p2222
backup_exec     ssh -p 2222 root@localhost "cd /tmp; sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
backup  root@localhost:/var/lib/openqa/SQL-DUMPS/       openqa.suse.de/ ssh_args=-p2222
backup  root@localhost:/var/log/zypp/   openqa.suse.de/ ssh_args=-p2222
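After restoring either file it is worth validating the configuration before the next scheduled run, e.g. (assuming rsnapshot is installed on the backup hosts):

# check /etc/rsnapshot.conf for syntax errors without performing a backup
rsnapshot configtest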
Actions #25

Updated by tinita 22 days ago

  • Description updated (diff)
Actions #26

Updated by livdywan 22 days ago

  • Description updated (diff)
Actions #27

Updated by ybonatakis 22 days ago

  • Description updated (diff)

The only rollback step still missing is removing the alert silences.

Actions #28

Updated by okurz 22 days ago

  • Priority changed from Urgent to High

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs?statuses=WAITING_FOR_RESOURCE is showing 300+ jobs, meaning there is a longer backlog. You can cancel some "schedule incident" jobs, which should clean up the queue a bit.
I wrote in #eng-testing https://suse.slack.com/archives/C02CANHLANP/p1745331064968439?thread_ts=1745043127.805829&cid=C02CANHLANP

Greetings from the past! https://openqa.suse.de is back in operation based on a state from 2025-04-13 which was the most recent consistent snapshot state that the backup system has. We are carefully monitoring the system and retriggering builds and jobs as applicable. Feel welcome to also trigger according products yourself as needed.

Actions #29

Updated by openqa_review 22 days ago

  • Due date set to 2025-05-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #31

Updated by okurz 22 days ago

  • Parent task set to #162350
Actions #32

Updated by okurz 22 days ago

  • Parent task changed from #162350 to #181298
Actions #33

Updated by okurz 22 days ago

  • Copied to action #181301: Dangerous cleanup of OSD database dumps size:S added
Actions #35

Updated by livdywan 21 days ago

  • File a follow-up ticket about availability of osd snapshots (apparently we only have 2 daily snapshots going back a week?)

See #181334

Actions #36

Updated by ybonatakis 21 days ago

  • Status changed from In Progress to Feedback

I removed the silences, and I will put this ticket in Feedback as we are still monitoring the server in one way or another.

Actions #37

Updated by ybonatakis 21 days ago

  • Description updated (diff)
  • Status changed from Feedback to Resolved
  • AC1: openqa.suse.de is accessible in a web browser
  • AC2: NFS mount and all related filesystems are back and working as previously
  • AC3: All workers are connected and accessible via salt
  • AC4: The web UI looks sensible

I consider all the ACs done. Closing this one and following up in the subtasks that other colleagues have created.

Actions #38

Updated by okurz 20 days ago

  • Due date deleted (2025-05-07)
Actions #39

Updated by okurz 20 days ago

  • Status changed from Resolved to Feedback
Actions #40

Updated by okurz 20 days ago

  • Status changed from Feedback to Resolved

We fixed that. Probably I messed up.

Actions #41

Updated by nicksinger 17 days ago

monitor.qe.nue2.suse.org was still disabled in salt on OSD and was therefore missing newly merged changes like https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1450. I accepted the key again and restarted the minion (it was stuck trying to kill some process). With that, a highstate applied cleanly again:

Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 458 (changed=6)
Failed:      0
--------------
Total states run:     458
Total run time:    83.411 s
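The steps above map roughly to these commands (a sketch, assuming OSD is the salt master and root access on both hosts; the exact invocations were not recorded):

# on OSD (salt master): re-accept the minion's key
salt-key --accept monitor.qe.nue2.suse.org
# on monitor.qe.nue2.suse.org: restart the stuck minion
systemctl restart salt-minion.service
# on OSD: apply the highstate again and check the summary
salt 'monitor.qe.nue2.suse.org' state.apply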