action #181175


OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem size:M

Added by ybonatakis 4 days ago. Updated about 11 hours ago.

Status: In Progress
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2025-04-19
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

After manually running sudo -u postgres backup_dir="/var/lib/openqa/backup"; date=$(date -Idate); bf="$backup_dir/$date.dump"; test -e "$bf" || ionice -c3 nice -n19 pg_dump -Fc openqa -f "$bf"; find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v the system became unusable.
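
A plausible reading of the failure (an assumption, to be confirmed in the "Five Why" analysis #181184): the leading sudo -u postgres backup_dir=... only passes backup_dir as an environment assignment to sudo and never sets it in the calling shell, so the final find $backup_dir/ -mtime +7 ... expands to find / -mtime +7 ... and the xargs -0 rm -v then deletes week-old files across the root filesystem. A minimal, safer sketch of the same backup step (paths taken from the original command):

```
#!/bin/bash
# Safer variant of the manual backup one-liner - a sketch, not the official procedure.
set -euo pipefail                     # abort on errors and on unset variables

backup_dir="/var/lib/openqa/backup"   # set in the current shell, not via sudo
date=$(date -I)
bf="$backup_dir/$date.dump"

# refuse to continue if the backup directory does not exist
[[ -d "$backup_dir" ]] || { echo "backup dir $backup_dir missing" >&2; exit 1; }

# run only the dump itself as the postgres user
test -e "$bf" || sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f "$bf"

# delete old dumps strictly inside the backup directory, never on a bad expansion
find "$backup_dir/" -maxdepth 1 -name '*.dump' -mtime +7 -print0 \
  | xargs -0 --no-run-if-empty rm -v
```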

I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldn't.

Acceptance Criteria

  • AC1: openqa.suse.de is accessible in a web browser (see the verification sketch after this list)
  • AC2: NFS mount and all related filesystems are back and working as previously
  • AC3: All workers are connected and accessible via salt
  • AC4: The web UI looks sensible
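
A minimal sketch of how AC1-AC3 could be checked from a shell; the host name is from this ticket, while the mount point and the choice of curl/mountpoint/salt for the checks are assumptions:

```
# AC1: the web UI answers over HTTPS
curl -sSf -o /dev/null https://openqa.suse.de && echo "web UI reachable"

# AC2: the shared assets/results filesystem is mounted (mount point is an assumption)
mountpoint -q /var/lib/openqa/share && df -h /var/lib/openqa/share

# AC3: all salt minions respond (run on the salt master, i.e. openqa.suse.de)
sudo salt '*' test.ping
```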

Suggestions

  • Conduct a 5 WHYs analysis: PLANNED as #181184
  • Set the autoincrement value of the jobs primary key to the highest job id in qem-dashboard and/or the latest id in the testresults directory to avoid reusing job ids (see the sketch after this list)
  • Possibly cancel/restart any jobs still in the running status (though stale job detection should cover that)
  • Use the openqa-advanced-retrigger script
  • File a follow-up ticket about availability of osd snapshots (apparently we only have 2 daily snapshots going back a week?)
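
A minimal sketch of that sequence bump, assuming the sequence name jobs_id_seq that comment #20 later confirms; the rounding margin is arbitrary:

```
# pick a restart value comfortably above the highest known job id
highest_known_id=17390726                                     # value from qem-dashboard in this incident
restart_with=$(( (highest_known_id / 100000 + 1) * 100000 ))  # rounds up to 17400000
sudo -u postgres psql openqa \
  -c "ALTER SEQUENCE jobs_id_seq RESTART WITH $restart_with;" \
  -c "SELECT nextval('jobs_id_seq');"
```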

Rollback steps


Related issues (2 open, 0 closed)

Related to openQA Infrastructure (public) - action #181184: Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" size:S (Workable, start date 2025-04-20)

Blocks openQA Infrastructure (public) - action #180926: openqa.suse.de: Cron <root@openqa> touch /var/lib/openqa/factory/repo/cvd/* size:S (Blocked, mkittler, start date 2025-04-08)

Actions #2

Updated by tinita 3 days ago

  • Target version set to Ready
Actions #3

Updated by okurz 3 days ago

  • Tags changed from infra to infra, alert
  • Subject changed from OSD is down and broken for good to OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem
  • Assignee set to okurz
  • Priority changed from Urgent to Immediate

ybonatakis wrote:

I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldn't.

That's nonsense. Those are only instructions on how one could set up an alternative infrastructure if the original one is not usable for a longer time; they are not about recovering the original instance.

Actions #4

Updated by okurz 3 days ago

  • Related to action #181184: Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" size:S added
Actions #5

Updated by ybonatakis 3 days ago

okurz wrote in #note-3:

ybonatakis wrote:

I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldn't.

That's nonsense. Those are only instructions on how one could set up an alternative infrastructure if the original one is not usable for a longer time; they are not about recovering the original instance.

I had no idea what action to take. I tried to find something to do and this seemed like a thing to try out.

Actions #6

Updated by okurz 2 days ago

  • Assignee deleted (okurz)

That's what I wrote in the ticket:

I assume part of the root filesystem, potentially more, was removed by accidental user action. If you can confirm that the system is inoperable, please recover snapshots of the filesystem images attached to openqa.suse.de to the most recent state before 2025-04-19 05:00 UTC.

No response for some hours. Since I was involved in multiple urgent mitigations here, I would prefer that someone else picks this up and cleans up the mess :)

Actions #7

Updated by ybonatakis 2 days ago

  • Assignee set to ybonatakis
Actions #8

Updated by tinita 2 days ago · Edited

  • Description updated (diff)

I disabled osd-deployment and salt-states-openqa pipelines, so when the VM is back, we can check everything before running deployment and salt.

edit: and also salt-pillars-openqa

Actions #9

Updated by tinita 2 days ago

  • Description updated (diff)
Actions #10

Updated by tinita 2 days ago

  • Description updated (diff)

Also disabled backup for now

Actions #11

Updated by tinita 2 days ago

  • Description updated (diff)

Also disabled fetch_openqa_bugs

Actions #12

Updated by tinita 2 days ago

  • Description updated (diff)

Also disabled the other backup on backup.qe.prg2.suse.org

Actions #13

Updated by ybonatakis 2 days ago

  • Description updated (diff)

Silenced alerts for now:
  • qesapworker-prg6 hostup
  • worker33
  • schort-server
  • worker29
  • worker31
  • worker-arm1
  • worker-arm2
  • tumblesle
  • backup
  • netboot.qe.prg2.suse.org
  • worker30
  • storage
  • backup-vm
  • worker34
  • diesel
  • petrol
  • osiris-1
  • backup-qam
  • worker35
  • s390zl12
  • monitor
  • worker36
  • sapworker1
  • grenache-1
  • qamaster
  • unreal6
  • baremetal-support
  • baremetal-support-prg2
  • jenkins
  • netboot
  • qesapworker-prg7
  • worker32
  • openqaw5-xen
  • mania
  • ada

Actions #14

Updated by tinita 2 days ago

All those silences from comment 13 expired after two hours btw.

I have now silenced the actually annoying ones that keep resolving and firing (one possible CLI approach is sketched after the list below).

https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana

  • Systemd services
  • web UI: Too many 5xx HTTP responses
  • External http responses (2 different ones)
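
A sketch of one possible CLI way to add such a silence; the team used the Grafana UI linked above, and the use of amtool, the exact alert name and the Alertmanager URL path are all assumptions here:

```
# create a 2h silence for a flapping alert via an Alertmanager-compatible API
# (alert name and URL path are placeholders, not verified against monitor.qa.suse.de)
amtool silence add alertname="Systemd services" \
  --alertmanager.url=https://monitor.qa.suse.de/api/alertmanager/grafana \
  --duration=2h --author=tinita --comment="flapping during OSD recovery, see poo#181175"
```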
Actions #15

Updated by tinita 1 day ago

  • Description updated (diff)

I disabled salt-minion.service on both backup hosts now, as apparently at least on backup.qe.prg2.suse.org it was still running somehow and overwrote the rsnapshot.conf

Actions #16

Updated by livdywan about 17 hours ago · Edited

I'll leave it to others to identify more follow-up points, I guess; see the internal team chat.

Actions #17

Updated by okurz about 16 hours ago

  • Status changed from New to In Progress
  • Priority changed from Immediate to Urgent

This is now being worked on by Ignacio Torres, and I asked him in https://suse.slack.com/archives/C029APBKLGK/p1745311147032479 to continue in a group chat.

Actions #18

Updated by mkittler about 16 hours ago

  • Blocks action #180926: openqa.suse.de: Cron <root@openqa> touch /var/lib/openqa/factory/repo/cvd/* size:S added
Actions #19

Updated by livdywan about 13 hours ago

  • Subject changed from OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem to OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem size:M
  • Description updated (diff)
Actions #20

Updated by tinita about 12 hours ago

For the record:
The maximum job id in qem-dashboard was 17390726, so I set autoincrement like this:

openqa=> ALTER SEQUENCE jobs_id_seq RESTART WITH 17400000;
ALTER SEQUENCE
openqa=> SELECT nextval('jobs_id_seq');
 nextval
----------
 17400000
Actions #21

Updated by okurz about 12 hours ago

  • Description updated (diff)

For a bit more context from the conversation with Aziz Rozyev and Ignacio Torres from IT:
There are only 2 daily snapshots recorded, so the most recent snapshots we have for the root disk are:

                  weekly.2025-04-13_0015                 237.0GB     1%    4%
                  weekly.2025-04-20_0015                      0B     0%    0%
                  daily.2025-04-21_0010                       0B     0%    0%

so we went for the weekly.2025-04-13_0015. All 5 storage volumes were recovered. Ignacio first booted the system, as we requested, with systemd.unit=emergency.target. I provided the root password and Ignacio could log in and mask+disable openqa-scheduler and openqa-webui. After that the VM was rebooted and we could log in over ssh and continue.

I have now enabled osd-deployment again and triggered it, as well as scripts-ci, salt-states-openqa and salt-pillars-openqa. tinita has bumped the auto-increment id for openQA jobs to prevent conflicts, based on the maximum recorded in http://dashboard.qam.suse.de/. Then new jobs have been triggered and show the new ids 17400000+. openqa-webui and openqa-scheduler are enabled again now. All looks good so far.
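
For reference, a sketch of the mask/disable and later re-enable steps mentioned above, using the service names from this comment (to be run as root on openqa.suse.de):

```
# while recovering, keep openQA from scheduling or serving anything
systemctl mask --now openqa-scheduler openqa-webui

# once the recovered state looks sane, bring the services back
systemctl unmask openqa-scheduler openqa-webui
systemctl enable --now openqa-scheduler openqa-webui
```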

Actions #22

Updated by ybonatakis about 12 hours ago

Enabled the backup in /etc/rsnapshot.conf on backup.qe.prg2.suse.org again:

```
# osd
backup root@openqa.suse.de:/etc/ openqa.suse.de/
backup_exec ssh root@openqa.suse.de "cd /tmp; sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
backup root@openqa.suse.de:/var/lib/openqa/SQL-DUMPS/ openqa.suse.de/
backup root@openqa.suse.de:/var/log/zypp/ openqa.suse.de/
```
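
A quick way to sanity-check the re-enabled configuration before the next scheduled run; this uses rsnapshot's own config check and dry run, where the interval name "daily" is an assumption:

```
# validate the syntax of /etc/rsnapshot.conf
rsnapshot configtest

# print the commands a daily run would execute without running them
rsnapshot -t daily
```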

Actions #23

Updated by ybonatakis about 12 hours ago

  • Description updated (diff)
Actions #24

Updated by ybonatakis about 12 hours ago

  • Description updated (diff)

backup-vm.qe.nue2.suse.org

```
# osd
backup root@localhost:/etc/ openqa.suse.de/ ssh_args=-p2222
backup_exec ssh -p 2222 root@localhost "cd /tmp; sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
backup root@localhost:/var/lib/openqa/SQL-DUMPS/ openqa.suse.de/ ssh_args=-p2222
backup root@localhost:/var/log/zypp/ openqa.suse.de/ ssh_args=-p2222
```

Actions #25

Updated by tinita about 12 hours ago

  • Description updated (diff)
Actions #26

Updated by livdywan about 11 hours ago

  • Description updated (diff)
Actions #27

Updated by ybonatakis about 11 hours ago

  • Description updated (diff)

The only rollback step still missing is the silenced alerts.

Actions #28

Updated by okurz about 11 hours ago

  • Priority changed from Urgent to High

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs?statuses=WAITING_FOR_RESOURCE is showing 300+ jobs, meaning that there is a longer backlog. You can cancel some "schedule incident" jobs, which should clean up the queue a bit (see the sketch below).
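
A hypothetical sketch of doing that via the GitLab API; the URL-encoded project path, the token variable and the job-name filter "incident" are assumptions:

```
# list queued jobs and cancel those whose name looks like a "schedule incident" job
GITLAB=https://gitlab.suse.de
PROJECT=qa-maintenance%2Fbot-ng   # URL-encoded project path (assumption)
curl -sg --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "$GITLAB/api/v4/projects/$PROJECT/jobs?scope[]=waiting_for_resource&per_page=100" |
  jq -r '.[] | select(.name | test("incident")) | .id' |
  while read -r id; do
    curl -s --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
      "$GITLAB/api/v4/projects/$PROJECT/jobs/$id/cancel" > /dev/null
  done
```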
I wrote in #eng-testing https://suse.slack.com/archives/C02CANHLANP/p1745331064968439?thread_ts=1745043127.805829&cid=C02CANHLANP

Greetings from the past! https://openqa.suse.de is back in operation based on a state from 2025-04-13, which was the most recent consistent snapshot state that the backup system has. We are carefully monitoring the system and retriggering builds and jobs as applicable. Feel welcome to also trigger the corresponding products yourself as needed.
