action #181175


OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem size:M

Added by ybonatakis 4 days ago. Updated about 11 hours ago.

Status: In Progress
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2025-04-19
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

After manually running sudo -u postgres backup_dir="/var/lib/openqa/backup"; date=$(date -Idate); bf="$backup_dir/$date.dump"; test -e "$bf" || ionice -c3 nice -n19 pg_dump -Fc openqa -f "$bf"; find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v the system became unusable.
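
A plausible reading of the failure (an assumption, to be confirmed in the "Five Why" analysis #181184): the leading sudo -u postgres backup_dir=... only passes backup_dir as an environment assignment to sudo and never sets it in the calling shell, so the final find $backup_dir/ -mtime +7 ... expands to find / -mtime +7 ... and the xargs -0 rm -v then deletes week-old files across the root filesystem. A minimal, safer sketch of the same backup step (paths taken from the original command):

```
#!/bin/bash
# Safer variant of the manual backup one-liner - a sketch, not the official procedure.
set -euo pipefail                     # abort on errors and on unset variables

backup_dir="/var/lib/openqa/backup"   # set in the current shell, not via sudo
date=$(date -I)
bf="$backup_dir/$date.dump"

# refuse to continue if the backup directory does not exist
[[ -d "$backup_dir" ]] || { echo "backup dir $backup_dir missing" >&2; exit 1; }

# run only the dump itself as the postgres user
test -e "$bf" || sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f "$bf"

# delete old dumps strictly inside the backup directory, never on a bad expansion
find "$backup_dir/" -maxdepth 1 -name '*.dump' -mtime +7 -print0 \
  | xargs -0 --no-run-if-empty rm -v
```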

I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldn't.

Acceptance Criteria

  • AC1: openqa.suse.de is accessible in a web browser (see the verification sketch after this list)
  • AC2: NFS mount and all related filesystems are back and working as previously
  • AC3: All workers are connected and accessible via salt
  • AC4: The web UI looks sensible
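
A minimal sketch of how AC1-AC3 could be checked from a shell; the host name is from this ticket, while the mount point and the choice of curl/mountpoint/salt for the checks are assumptions:

```
# AC1: the web UI answers over HTTPS
curl -sSf -o /dev/null https://openqa.suse.de && echo "web UI reachable"

# AC2: the shared assets/results filesystem is mounted (mount point is an assumption)
mountpoint -q /var/lib/openqa/share && df -h /var/lib/openqa/share

# AC3: all salt minions respond (run on the salt master, i.e. openqa.suse.de)
sudo salt '*' test.ping
```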

Suggestions

  • Conduct a 5 WHYs analysis: PLANNED as #181184
  • Set the autoincrement value of the jobs primary key to the highest job id in qem-dashboard and/or the latest id in the testresults directory to avoid reusing job ids (see the sketch after this list)
  • Possibly cancel/restart any jobs still in the running status (though stale job detection should cover that)
  • Use the openqa-advanced-retrigger script
  • File a follow-up ticket about availability of osd snapshots (apparently we only have 2 daily snapshots going back a week?)
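
A minimal sketch of that sequence bump, assuming the sequence name jobs_id_seq that comment #20 later confirms; the rounding margin is arbitrary:

```
# pick a restart value comfortably above the highest known job id
highest_known_id=17390726                                     # value from qem-dashboard in this incident
restart_with=$(( (highest_known_id / 100000 + 1) * 100000 ))  # rounds up to 17400000
sudo -u postgres psql openqa \
  -c "ALTER SEQUENCE jobs_id_seq RESTART WITH $restart_with;" \
  -c "SELECT nextval('jobs_id_seq');"
```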

Rollback steps


Related issues (2 open, 0 closed)

Related to openQA Infrastructure (public) - action #181184: Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" size:S (Workable, start date 2025-04-20)

Blocks openQA Infrastructure (public) - action #180926: openqa.suse.de: Cron <root@openqa> touch /var/lib/openqa/factory/repo/cvd/* size:S (Blocked, mkittler, start date 2025-04-08)

Actions #2

Updated by tinita 3 days ago

  • Target version set to Ready
Actions #3

Updated by okurz 3 days ago

  • Tags changed from infra to infra, alert
  • Subject changed from OSD is down and broken for good to OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem
  • Assignee set to okurz
  • Priority changed from Urgent to Immediate

ybonatakis wrote:

I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldn't.

That's nonsense. Those are only instructions on how one could set up an alternative infrastructure if the original one is not usable for a longer time; they are not about recovering the original instance.

Actions #4

Updated by okurz 3 days ago

  • Related to action #181184: Conduct lessons learned "Five Why" analysis for "Lessons learned for "OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem" size:S added
Actions #5

Updated by ybonatakis 3 days ago

okurz wrote in #note-3:

ybonatakis wrote:

I tried to access AWS following https://progress.opensuse.org/projects/openqav3/wiki#Fallback-deployment-on-AWS but I couldn't.

That's nonsense. Those are only instructions on how one could set up an alternative infrastructure if the original one is not usable for a longer time; they are not about recovering the original instance.

I had no idea what action to take. I tried to find something to do and this seemed like a thing to try out.

Actions #6

Updated by okurz 2 days ago

  • Assignee deleted (okurz)

That's what I wrote in the ticket:

I assume part of the root filesystem, potentially more, was removed by accidental user action. If you can confirm that the system is inoperable, please recover snapshots of the filesystem images attached to openqa.suse.de to the most recent state before 2025-04-19 05:00 UTC.

No response for some hours. Since I was involved in multiple urgent mitigations here, I would prefer that someone else picks this up and cleans up the mess :)

Actions #7

Updated by ybonatakis 2 days ago

  • Assignee set to ybonatakis
Actions #8

Updated by tinita 2 days ago · Edited

  • Description updated (diff)

I disabled osd-deployment and salt-states-openqa pipelines, so when the VM is back, we can check everything before running deployment and salt.

edit: and also salt-pillars-openqa

Actions #9

Updated by tinita 2 days ago

  • Description updated (diff)
Actions #10

Updated by tinita 2 days ago

  • Description updated (diff)

Also disabled backup for now

Actions #11

Updated by tinita 2 days ago

  • Description updated (diff)

Also disabled fetch_openqa_bugs

Actions #12

Updated by tinita 2 days ago

  • Description updated (diff)

Also disabled the other backup on backup.qe.prg2.suse.org

Actions #13

Updated by ybonatakis 2 days ago

  • Description updated (diff)

Silenced alerts for now:
  • qesapworker-prg6 hostup
  • worker33
  • schort-server
  • worker29
  • worker31
  • worker-arm1
  • worker-arm2
  • tumblesle
  • backup
  • netboot.qe.prg2.suse.org
  • worker30
  • storage
  • backup-vm
  • worker34
  • diesel
  • petrol
  • osiris-1
  • backup-qam
  • worker35
  • s390zl12
  • monitor
  • worker36
  • sapworker1
  • grenache-1
  • qamaster
  • unreal6
  • baremetal-support
  • baremetal-support-prg2
  • jenkins
  • netboot
  • qesapworker-prg7
  • worker32
  • openqaw5-xen
  • mania
  • ada

Actions #14

Updated by tinita 2 days ago

All those silences from comment 13 expired after two hours btw.

I have now silenced the actually annoying ones that keep resolving and firing (one possible CLI approach is sketched after the list below).

https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana

  • Systemd services
  • web UI: Too many 5xx HTTP responses
  • External http responses (2 different ones)
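
A sketch of one possible CLI way to add such a silence; the team used the Grafana UI linked above, and the use of amtool, the exact alert name and the Alertmanager URL path are all assumptions here:

```
# create a 2h silence for a flapping alert via an Alertmanager-compatible API
# (alert name and URL path are placeholders, not verified against monitor.qa.suse.de)
amtool silence add alertname="Systemd services" \
  --alertmanager.url=https://monitor.qa.suse.de/api/alertmanager/grafana \
  --duration=2h --author=tinita --comment="flapping during OSD recovery, see poo#181175"
```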
Actions #15

Updated by tinita 1 day ago

  • Description updated (diff)

I disabled salt-minion.service on both backup hosts now, as apparently at least on backup.qe.prg2.suse.org it was still running somehow and overwrote the rsnapshot.conf

Actions #16

Updated by livdywan about 17 hours ago · Edited

I'll leave it to others to identify more follow-up points, I guess; see the internal team chat.

Actions #17

Updated by okurz about 16 hours ago

  • Status changed from New to In Progress
  • Priority changed from Immediate to Urgent

This is now being worked on by Ignacio Torres, and I asked him in https://suse.slack.com/archives/C029APBKLGK/p1745311147032479 to continue in a group chat.

Actions #18

Updated by mkittler about 16 hours ago

  • Blocks action #180926: openqa.suse.de: Cron <root@openqa> touch /var/lib/openqa/factory/repo/cvd/* size:S added
Actions #19

Updated by livdywan about 13 hours ago

  • Subject changed from OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem to OSD is down since 2025-04-19 due to accidental user actions removing parts of the root filesystem size:M
  • Description updated (diff)
Actions #20

Updated by tinita about 12 hours ago

For the record:
The maximum job id in qem-dashboard was 17390726, so I set autoincrement like this:

openqa=> ALTER SEQUENCE jobs_id_seq RESTART WITH 17400000;
ALTER SEQUENCE
openqa=> SELECT nextval('jobs_id_seq');
 nextval
----------
 17400000
Actions #21

Updated by okurz about 12 hours ago

  • Description updated (diff)

For a bit more context from the conversation with Aziz Rozyev and Ignacio Torres from IT:
There are only 2 daily snapshots recorded, so the most recent snapshots we have for the root disk are:

                  weekly.2025-04-13_0015                 237.0GB     1%    4%
                  weekly.2025-04-20_0015                      0B     0%    0%
                  daily.2025-04-21_0010                       0B     0%    0%

so we went for the weekly.2025-04-13_0015. All 5 storage volumes were recovered. Ignacio first booted the system, as we requested, with systemd.unit=emergency.target. I provided the root password and Ignacio could log in and mask+disable openqa-scheduler and openqa-webui. After that the VM was rebooted and we could log in over ssh and continue.

I have now enabled osd-deployment again and triggered it, as well as scripts-ci, salt-states-openqa and salt-pillars-openqa. tinita has bumped the auto-increment id for openQA jobs to prevent conflicts, based on the maximum recorded in http://dashboard.qam.suse.de/. Then new jobs have been triggered and show the new ids 17400000+. openqa-webui and openqa-scheduler are enabled again now. All looks good so far.
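
For reference, a sketch of the mask/disable and later re-enable steps mentioned above, using the service names from this comment (to be run as root on openqa.suse.de):

```
# while recovering, keep openQA from scheduling or serving anything
systemctl mask --now openqa-scheduler openqa-webui

# once the recovered state looks sane, bring the services back
systemctl unmask openqa-scheduler openqa-webui
systemctl enable --now openqa-scheduler openqa-webui
```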

Actions #22

Updated by ybonatakis about 12 hours ago

Enabled the backup in /etc/rsnapshot.conf on backup.qe.prg2.suse.org again:

```
# osd
backup root@openqa.suse.de:/etc/ openqa.suse.de/
backup_exec ssh root@openqa.suse.de "cd /tmp; sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
backup root@openqa.suse.de:/var/lib/openqa/SQL-DUMPS/ openqa.suse.de/
backup root@openqa.suse.de:/var/log/zypp/ openqa.suse.de/
```
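
A quick way to sanity-check the re-enabled configuration before the next scheduled run; this uses rsnapshot's own config check and dry run, where the interval name "daily" is an assumption:

```
# validate the syntax of /etc/rsnapshot.conf
rsnapshot configtest

# print the commands a daily run would execute without running them
rsnapshot -t daily
```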

Actions #23

Updated by ybonatakis about 12 hours ago

  • Description updated (diff)
Actions #24

Updated by ybonatakis about 12 hours ago

  • Description updated (diff)

backup-vm.qe.nue2.suse.org

```
# osd
backup root@localhost:/etc/ openqa.suse.de/ ssh_args=-p2222
backup_exec ssh -p 2222 root@localhost "cd /tmp; sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
backup root@localhost:/var/lib/openqa/SQL-DUMPS/ openqa.suse.de/ ssh_args=-p2222
backup root@localhost:/var/log/zypp/ openqa.suse.de/ ssh_args=-p2222
```

Actions #25

Updated by tinita about 12 hours ago

  • Description updated (diff)
Actions #26

Updated by livdywan about 11 hours ago

  • Description updated (diff)
Actions #27

Updated by ybonatakis about 11 hours ago

  • Description updated (diff)

The only rollback step still missing is the silenced alerts.

Actions #28

Updated by okurz about 11 hours ago

  • Priority changed from Urgent to High

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs?statuses=WAITING_FOR_RESOURCE is showing 300+ jobs, meaning that there is a longer backlog. You can cancel some "schedule incident" jobs, which should clean up the queue a bit (see the sketch below).
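
A hypothetical sketch of doing that via the GitLab API; the URL-encoded project path, the token variable and the job-name filter "incident" are assumptions:

```
# list queued jobs and cancel those whose name looks like a "schedule incident" job
GITLAB=https://gitlab.suse.de
PROJECT=qa-maintenance%2Fbot-ng   # URL-encoded project path (assumption)
curl -sg --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "$GITLAB/api/v4/projects/$PROJECT/jobs?scope[]=waiting_for_resource&per_page=100" |
  jq -r '.[] | select(.name | test("incident")) | .id' |
  while read -r id; do
    curl -s --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
      "$GITLAB/api/v4/projects/$PROJECT/jobs/$id/cancel" > /dev/null
  done
```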
I wrote in #eng-testing https://suse.slack.com/archives/C02CANHLANP/p1745331064968439?thread_ts=1745043127.805829&cid=C02CANHLANP

Greetings from the past! https://openqa.suse.de is back in operation based on a state from 2025-04-13, which was the most recent consistent snapshot state that the backup system has. We are carefully monitoring the system and retriggering builds and jobs as applicable. Feel welcome to also trigger the corresponding products yourself as needed.
