action #156535
closed
Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01
0%
Description
Observation
See #156460-8
I'm checking results in the maintenance dashboard and I can see on http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either still running or not finished. But the corresponding job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 and https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434.
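To make the two overview links easier to compare with the dashboard entry, the following minimal sketch decodes the query string of the first openQA URL. It uses only the URL quoted above; the reading of the build name as `:<incident>:<package>` is an assumption inferred from this single example, not a documented format.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from the observation above.
url = "https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364"

# parse_qs percent-decodes the values, so %3A becomes ":".
params = parse_qs(urlsplit(url).query)
build = params["build"][0]
group_id = params["groupid"][0]

# Assumption: the build name has the shape ":<incident>:<package>".
_, incident, package = build.split(":", 2)
print(build, incident, package, group_id)
```

The decoded incident number is the same one used in the `incident=` parameter of the dashboard link.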
Suggestions
- Possibly an easy workaround is to retrigger the build of the corresponding release requests
- Check audit logs for a trace of how incident tests are scheduled in general (not the specific ones we lost), or ask in #discuss-qa-maintenance, or ask the maintenance coordinators
- Messages like https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2340034#L2244 suggest we have some unmatched jobs
- Remove the corresponding inconsistent results from the qem-dashboard database
- Trigger the corresponding jobs to sync and schedule pending data on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules
- Also trigger aggregate tests so that we don't need to wait until the end of the day for tests to start
- Monitor the execution of jobs and the presentation on the dashboard
- Re-enable scheduling of aggregates on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules
- Again monitor the execution of jobs and the presentation on the dashboard
Rollback actions
- Remove silence "alertname=Queue: State (SUSE) alert" from https://stats.openqa-monitor.qa.suse.de/alerting/silences
- Reactivate "Schedule updates/aggregates (0 20 * * 1-5,7)" at https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules on 2024-03-05
Out of scope
- Fix the problematic design of qem-dashboard
Updated by okurz 11 months ago
- Copied from action #156460: Potential FS corruption on osd due to 2 VMs accessing the same disk added
Updated by dheidler 11 months ago
Steps done (see https://gitlab.suse.de/qa-maintenance/bot-ng/#cleanup-of-unwanted-test-results)
ssh root@qam.suse.de
machinectl shell postgresql
sudo -u postgres pg_dump dashboard_db > dashboard_db_backup_20240304.sql
sudo -u postgres psql dashboard_db
dashboard_db=# TRUNCATE incidents CASCADE;
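As a rough illustration of the backup-then-truncate steps above, the sketch below only assembles the two command lines and prints them; it does not run anything. The function name `cleanup_commands` is hypothetical, the database and table names are taken from the steps above, and it assumes the commands would be run inside the postgresql container as in the original session.

```python
import datetime
import shlex


def cleanup_commands(day: datetime.date, db: str = "dashboard_db") -> list[str]:
    """Return the backup and cleanup command lines without executing them."""
    # Dated backup file name, e.g. dashboard_db_backup_20240304.sql
    backup = f"{db}_backup_{day:%Y%m%d}.sql"
    truncate = "TRUNCATE incidents CASCADE;"
    return [
        # Step 1: dump the whole database before touching it.
        f"sudo -u postgres pg_dump {shlex.quote(db)} > {shlex.quote(backup)}",
        # Step 2: drop all incidents and, via CASCADE, rows referencing them.
        f"sudo -u postgres psql -c {shlex.quote(truncate)} {shlex.quote(db)}",
    ]


for line in cleanup_commands(datetime.date(2024, 3, 4)):
    print(line)
```

Printing the commands first keeps the destructive `TRUNCATE` a deliberate manual step after the backup has been checked.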
Manually triggered:
- Synchronize SMELT to QEM Dashboard (3,33 * * * *)
- Schedule incidents (0 * * * *)
- Schedule updates/aggregates (0 20 * * 1-5,7)
- Synchronize inc. results into QEM Dashboard (8,38 * * * *)
- Synchronize aggr. results into QEM Dashboard (3,33 * * * *)
Deactivated "Schedule updates/aggregates (0 20 * * 1-5,7)" so that it won't run twice.
Updated by openqa_review 11 months ago
- Due date set to 2024-03-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler 11 months ago
- Priority changed from Urgent to High
- Still 15k tests in queue: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=now-24h&to=now
- Re-enabled "Schedule updates/aggregates (0 20 * * 1-5,7)"
Updated by okurz 11 months ago
Apparently there was still a second aggregate tests build yesterday, as visible on https://openqa.suse.de/group_overview/427 . Do you know what happened?
Updated by dheidler 11 months ago
I guess when discussing the issue with you I had misread the cron string 0 20 * * 1-5,7 as 0:20, but it is 20:00.
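The misreading is easy to make because the first cron field is the minute and the second is the hour. A minimal sketch, assuming the standard five-field order (minute, hour, day of month, month, day of week) and handling only `*`, ranges and comma lists; the helper `expand` is hypothetical and not part of bot-ng:

```python
def expand(field: str, lo: int, hi: int) -> list[int]:
    """Expand one cron field with '*', 'a-b' ranges and comma lists (no steps)."""
    if field == "*":
        return list(range(lo, hi + 1))
    values = set()
    for part in field.split(","):
        # "1-5" -> start="1", end="5"; a single value like "7" has an empty end.
        start, _, end = part.partition("-")
        values.update(range(int(start), int(end or start) + 1))
    return sorted(values)


minute, hour, dom, month, dow = "0 20 * * 1-5,7".split()
print("minutes:", expand(minute, 0, 59))
print("hours:", expand(hour, 0, 23))
print("weekdays:", expand(dow, 0, 7))
```

So the schedule fires at minute 0 of hour 20, that is 20:00, on weekdays 1 to 5 and 7 (7 is commonly treated as Sunday, which is an assumption about the cron dialect in use).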
Meanwhile openQA made good progress running jobs: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=now-2d&to=now
We're down to ~4k scheduled jobs.