action #181175
OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem size:M
Description
Observation
After manually running `sudo -u postgres backup_dir="/var/lib/openqa/backup"; date=$(date -Idate); bf="$backup_dir/$date.dump"; test -e "$bf" || ionice -c3 nice -n19 pg_dump -Fc openqa -f "$bf"; find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v` the system became unusable.
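The destructive effect presumably comes from the first statement: `sudo -u postgres backup_dir=...` only passes the variable to that (command-less) sudo invocation, so `backup_dir` stays empty in the calling shell and the final `find $backup_dir/ ... | xargs -0 rm -v` effectively runs against `/`. A minimal sketch of a safer variant, assuming the same paths and that the whole sequence is meant to run as the postgres user:

```
# Hypothetical safer variant: run everything in one shell as postgres so the
# variable is actually set where it is used, and abort instead of expanding
# an empty backup_dir into "/".
sudo -u postgres bash -eu -c '
  backup_dir="/var/lib/openqa/backup"
  date=$(date -Idate)
  bf="$backup_dir/$date.dump"
  test -e "$bf" || ionice -c3 nice -n19 pg_dump -Fc openqa -f "$bf"
  find "${backup_dir:?must not be empty}/" -mtime +7 -print0 | xargs -0 --no-run-if-empty rm -v
'
```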
I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldn't.
Acceptance Criteria
- AC1: openqa.suse.de is accessible in a web browser
- AC2: NFS mount and all related filesystems are back and working as previously
- AC3: All workers are connected and accessible via salt
- AC4: The web UI looks sensible
Suggestions
- Conduct a 5 Whys analysis: PLANNED in #181184
- Set the autoincrement value of the jobs primary key to the highest job id in qem-dashboard and / or the latest id in the testresults directory to avoid reusing job ids
- Possibly cancel/restart any jobs still in the running state (though stale job detection should cover that); see the sketch after this list
- Use the openqa-advanced-retrigger script
- File a follow-up ticket about availability of osd snapshots (apparently we only have 2 daily snapshots going back a week?)
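A minimal sketch of what cancelling still-running jobs could look like, assuming `openqa-cli` is configured with API credentials for openqa.suse.de (whether to restart afterwards is left open):

```
# Hypothetical: list jobs openQA still considers running and cancel them;
# restart selected ones afterwards if they are still needed.
host=https://openqa.suse.de
for id in $(openqa-cli api --host "$host" jobs state=running | jq -r '.jobs[].id'); do
    openqa-cli api --host "$host" -X POST "jobs/$id/cancel"
    # openqa-cli api --host "$host" -X POST "jobs/$id/restart"
done
```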
Rollback steps
- DONE Enable "Automatic OSD deployment" pipeline https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules
- DONE Enable CI/CD in https://gitlab.suse.de/openqa/salt-states-openqa/edit#js-shared-permissions
- DONE Enable CI/CD in https://gitlab.suse.de/openqa/salt-pillars-openqa/edit#js-shared-permissions
- DONE Enable backup on backup-vm.qe.nue2.suse.org in /etc/rsnapshot.conf again
- DONE Enable backup on backup.qe.prg2.suse.org in /etc/rsnapshot.conf again
- DONE Enable fetch_openqa_bugs on openqa-service.qe.suse.de in /etc/crontab again
- Remove silent alerts
- DONE Enable salt-minion.service on backup.qe.prg2.suse.org and backup-vm.qe.nue2.suse.org
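For reference regarding the salt-minion and salt-connectivity rollback steps above (and AC3), a minimal sketch of what re-enabling and verifying could look like, assuming openqa.suse.de acts as the salt master:

```
# On each backup host: bring the salt-minion back up
systemctl enable --now salt-minion.service

# On the salt master (assumed to be openqa.suse.de): check minion keys and reachability
salt-key -L
salt 'backup*' test.ping
```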
Updated by ybonatakis 3 days ago
Updated by okurz 3 days ago
- Tags changed from infra to infra, alert
- Subject changed from OSD is down and broken for good to OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem
- Assignee set to okurz
- Priority changed from Urgent to Immediate
ybonatakis wrote:
I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldn't.
That's nonsense. Those are only instructions on how one could set up an alternative infrastructure if the original one is unusable for a long time; they are not related to recovering the original instance.
Updated by okurz 3 days ago
- Related to action #181184: Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" size:S added
Updated by ybonatakis 3 days ago
okurz wrote in #note-3:
ybonatakis wrote:
I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldn't.
That's nonsense. Those are only instructions on how one could set up an alternative infrastructure if the original one is unusable for a long time; they are not related to recovering the original instance.
I had no idea where to go or what action to take. I tried to find something and this seemed like a thing to try out.
Updated by okurz 2 days ago
- Assignee deleted (okurz)
That's what I wrote in the ticket:
I assume part of the root filesystem, potentially more, was removed by accidental user action. If you can confirm that the system is inoperable please recover snapshots of the filesystem images attached to openqa.suse.de to the most recent state before 2025-04-19 05:00 UTC
No response for some hours now. Since I was involved in multiple urgent mitigations here, I would prefer if someone else picks this up and cleans up the mess :)
Updated by ybonatakis 2 days ago
- Description updated (diff)
Silenced alerts as of now:
qesapworker-prg6 hostup
worker33
schort-server
worker29
worker31
worker-arm1
worker-arm2
tumblesle
backup
netboot.qe.prg2.suse.org
worker30
storage
backup-vm
worker34
diesel
petrol
osiris-1
backup-qam
worker35
s390zl12
monitor
worker36
sapworker1
grenache-1
qamaster
unreal6
baremetal-support
baremetal-support-prg2
jenkins
netboot
qesapworker-prg7
worker32
openqaw5-xen
mania
ada
Updated by tinita 2 days ago
All those silences from comment 13 expired after two hours btw.
I now silenced the actually annoying ones that keep resolving and firing.
https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana
- Systemd services
- web UI: Too many 5xx HTTP responses
- External http responses (2 different ones)
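For reference, the active silences can also be inspected from the command line; a sketch assuming monitor.qa.suse.de exposes Grafana's Alertmanager-compatible API and that a service-account token with alerting permissions is available in GRAFANA_TOKEN:

```
# Hypothetical: list active silences on the Grafana-managed Alertmanager
curl -sS -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences" \
  | jq -r '.[] | select(.status.state == "active") | "\(.id)\t\(.comment)"'
```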
Updated by livdywan about 17 hours ago · Edited
I'll leave it to others to identify more follow-up points I guess, see internal team chat
Updated by okurz about 16 hours ago
- Status changed from New to In Progress
- Priority changed from Immediate to Urgent
This is now being worked on by Ignacio Torres, and I asked him in https://suse.slack.com/archives/C029APBKLGK/p1745311147032479 to continue in a group chat.
Updated by mkittler about 16 hours ago
- Blocks action #180926: openqa.suse.de: Cron <root@openqa> touch /var/lib/openqa/factory/repo/cvd/* size:S added
Updated by livdywan about 13 hours ago
- Subject changed from OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem to OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem size:M
- Description updated (diff)
Updated by tinita about 12 hours ago
For the record:
The maximum job id in qem-dashboard was 17390726, so I set autoincrement like this:
openqa=> ALTER SEQUENCE jobs_id_seq RESTART WITH 17400000;
ALTER SEQUENCE
openqa=> SELECT nextval('jobs_id_seq');
nextval
----------
17400000
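A quick sanity check after such a bump could look like this (a hypothetical sketch; it just verifies that the sequence is now ahead of the highest existing job id):

```
# Hypothetical check on openqa.suse.de: the sequence value should be larger
# than the highest job id currently stored.
sudo -u postgres psql openqa -c "SELECT max(id) FROM jobs;" \
                             -c "SELECT last_value FROM jobs_id_seq;"
```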
Updated by okurz about 12 hours ago
- Description updated (diff)
For a bit more context from conversation with Aziz Rozyev and Ignacio Torres from IT:
There are only 2 daily snapshots recorded, so the most recent we have for the root disk are:
weekly.2025-04-13_0015 237.0GB 1% 4%
weekly.2025-04-20_0015 0B 0% 0%
daily.2025-04-21_0010 0B 0% 0%
So we went for the weekly.2025-04-13_0015. All 5 storage volumes were recovered. Ignacio first booted the system as we requested with systemd.unit=emergency.target. I provided the root password and Ignacio could log in and mask+disable openqa-scheduler and openqa-webui. After that the VM was rebooted and we could log in over ssh and continue.
I have now enabled osd-deployment again and triggered it, also scripts-ci, salt-states-openqa and salt-pillars-openqa. tinita has bumped the auto-increment id for openQA jobs to prevent conflicts, based on the maximum recorded in http://dashboard.qam.suse.de/. Then new jobs have been triggered and show the new ids 17400000+. openqa-webui and openqa-scheduler are enabled again now. All looks good so far.
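For reference, a minimal sketch of the service handling described above, assuming the standard unit names openqa-scheduler.service and openqa-webui.service:

```
# In the emergency shell: keep openQA from starting before the recovery is verified
systemctl mask --now openqa-scheduler.service openqa-webui.service

# After reboot and verification: bring the services back
systemctl unmask openqa-scheduler.service openqa-webui.service
systemctl enable --now openqa-scheduler.service openqa-webui.service
```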
Updated by ybonatakis about 12 hours ago
Enable backup on backup.qe.prg2.suse.org in /etc/rsnapshot.conf again:
```
# osd
backup root@openqa.suse.de:/etc/ openqa.suse.de/
backup_exec ssh root@openqa.suse.de "cd /tmp; sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
backup root@openqa.suse.de:/var/lib/openqa/SQL-DUMPS/ openqa.suse.de/
backup root@openqa.suse.de:/var/log/zypp/ openqa.suse.de/
```
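To check the re-enabled entries without waiting for the next cron run, something along these lines could be used on the backup host (a sketch; replace `daily` with whatever retain/interval name the config actually defines):

```
# Verify the rsnapshot configuration syntax after re-adding the osd entries
rsnapshot configtest
# Dry run: print the commands the next backup would execute, without running them
rsnapshot -t daily
```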
Updated by ybonatakis about 12 hours ago
- Description updated (diff)
The same for backup-vm.qe.nue2.suse.org in /etc/rsnapshot.conf:
# osd
backup root@localhost:/etc/ openqa.suse.de/ ssh_args=-p2222
backup_exec ssh -p 2222 root@localhost "cd /tmp; sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
backup root@localhost:/var/lib/openqa/SQL-DUMPS/ openqa.suse.de/ ssh_args=-p2222
backup root@localhost:/var/log/zypp/ openqa.suse.de/ ssh_args=-p2222
Updated by ybonatakis about 11 hours ago
- Description updated (diff)
The only step missing from the rollback steps is removing the silent alerts.
Updated by okurz about 11 hours ago
- Priority changed from Urgent to High
https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs?statuses=WAITING_FOR_RESOURCE is showing 300+ jobs, meaning there is a longer backlog. You can cancel some "schedule incident" jobs, which should clean up the queue a bit.
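A sketch of how those queued "schedule incident" jobs could be cancelled in bulk via the GitLab API, assuming a token with api scope in GITLAB_TOKEN (project path and job-name filter as above):

```
# Hypothetical: cancel bot-ng jobs that are waiting for a resource and whose
# name matches "schedule ... incident"
host=https://gitlab.suse.de
project=qa-maintenance%2Fbot-ng
curl -sSg -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "$host/api/v4/projects/$project/jobs?scope[]=waiting_for_resource&per_page=100" \
  | jq -r '.[] | select(.name | test("schedule.*incident")) | .id' \
  | while read -r id; do
      curl -sS -X POST -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
        "$host/api/v4/projects/$project/jobs/$id/cancel" > /dev/null
    done
```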
I wrote in #eng-testing https://suse.slack.com/archives/C02CANHLANP/p1745331064968439?thread_ts=1745043127.805829&cid=C02CANHLANP
Greetings from the past! https://openqa.suse.de is back in operation based on a state from 2025-04-13, which was the most recent consistent snapshot state that the backup system has. We are carefully monitoring the system and retriggering builds and jobs as applicable. Feel welcome to also trigger the corresponding products yourself as needed.