action #156535
closed
Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01
See #156460-8
I'm checking the results in the maintenance dashboard and I can see that jobs are either running or not finished. But the job groups in openQA are green and empty.
- Possibly an easy workaround is to retrigger the build of the corresponding release requests
- Check audit logs for a trace of how incident tests are scheduled in general (not the specific ones we lost), or ask in #discuss-qa-maintenance, or ask the maintenance coordinators
- messages like suggests we have some unmatched jobs
- Remove the corresponding inconsistent results from the qem-dashboard database
- Trigger the corresponding jobs to sync and schedule pending data on
- Also trigger aggregate tests so that tests start without waiting until the end of the day
- Monitor the execution of jobs and the presentation on the dashboard
- Re-enable scheduling aggregates again on
- Again monitor the execution of jobs and the presentation on the dashboard
Rollback actions
- Remove silence alertname=Queue: State (SUSE) alert from
- Reactivate Schedule updates/aggregates (0 20 * * 1-5,7) at on 2024-03-05
Out of scope
- Fix the problematic design of qem-dashboard
Updated by okurz 10 months ago
- Copied from action #156460: Potential FS corruption on osd due to 2 VMs accessing the same disk added
Updated by dheidler 10 months ago
Steps done (see
machinectl shell postgresql
sudo -u postgres pg_dump dashboard_db > dashboard_db_backup_20240304.sql
sudo -u postgres psql dashboard_db
dashboard_db=# TRUNCATE incidents CASCADE;
Manually triggered:
- Synchronize SMELT to QEM Dashboard (3,33 * * * *)
- Schedule incidents (0 * * * *)
- Schedule updates/aggregates (0 20 * * 1-5,7)
- Synchronize inc. results into QEM Dashboard (8,38 * * * *)
- Synchronize aggr. results into QEM Dashboard (3,33 * * * *)
Deactivated Schedule updates/aggregates (0 20 * * 1-5,7) so that it won't run twice.
Updated by openqa_review 10 months ago
- Due date set to 2024-03-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler 10 months ago
- Priority changed from Urgent to High
- Still 15k tests in queue:
- Re-enabled Schedule updates/aggregates (0 20 * * 1-5,7)
Updated by okurz 10 months ago
Apparently there was still a second aggregated tests build already yesterday as visible on . Do you know what happened?
Updated by dheidler 10 months ago
I guess when discussing the issue with you I had misread the cron string 0 20 * * 1-5,7 as 0:20, but it is 20:00.
Meanwhile openQA made good progress running jobs:
We're down to ~4k of scheduled jobs.
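For reference, the cron string discussed above can be decoded field by field: the first field is the minute and the second is the hour, so 0 20 means 20:00 and not 0:20. The following is only a minimal sketch in Python, not how the scheduler itself evaluates the expression. The helper names expand, parse and next_run are made up for illustration, the example times are hypothetical, time zones are ignored, and only lists and ranges are handled (no step values such as */5).

```python
from datetime import datetime, timedelta


def expand(field, lo, hi):
    """Expand one cron field ('*', single values, lists, ranges) into a set of ints."""
    values = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(lo, hi + 1))
        elif "-" in part:
            start, end = part.split("-")
            values.update(range(int(start), int(end) + 1))
        else:
            values.add(int(part))
    return values


def parse(expr):
    """Split a five-field cron expression: minute hour day-of-month month day-of-week."""
    minute, hour, dom, month, dow = expr.split()
    return {
        "minute": expand(minute, 0, 59),
        "hour": expand(hour, 0, 23),
        "dom": expand(dom, 1, 31),
        "month": expand(month, 1, 12),
        # cron accepts both 0 and 7 for Sunday; normalise to ISO weekdays 1..7
        "dow": {7 if d == 0 else d for d in expand(dow, 0, 7)},
    }


def next_run(expr, after):
    """Return the first matching minute strictly after 'after'.

    Simplification: all fields must match. This is fine here because
    day-of-month is '*', but it differs from cron when both day fields are set.
    """
    f = parse(expr)
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):  # search at most one year ahead
        if (t.minute in f["minute"] and t.hour in f["hour"]
                and t.day in f["dom"] and t.month in f["month"]
                and t.isoweekday() in f["dow"]):
            return t
        t += timedelta(minutes=1)
    raise ValueError("no matching time within one year")


schedule = "0 20 * * 1-5,7"
fields = parse(schedule)
print("minute:", sorted(fields["minute"]), "hour:", sorted(fields["hour"]))
print("weekdays (ISO, 7 = Sunday):", sorted(fields["dow"]))
# Hypothetical manual trigger around midday on Monday 2024-03-04
print("next run:", next_run(schedule, datetime(2024, 3, 4, 12, 0)))
```

Under these assumptions, a manual trigger during the day on 2024-03-04 would be followed by a scheduled run at 20:00 the same evening, which matches the second aggregate build that was observed. The weekday field 1-5,7 covers every day except Saturday.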