action #156535
Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01
Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-03-01
Due date: 2024-03-19
% Done: 0%
Estimated time:
Tags:
Description
Observation
See #156460-8
While checking results in the maintenance dashboard, I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are shown as either still running or not finished. However, the corresponding job groups in openQA are green and empty ( https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 and https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434 ).
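To compare the dashboard against openQA for other job groups, the overview links above can be taken apart and rebuilt. The following is a minimal sketch using only the Python standard library. It assumes the build string follows the pattern ":<incident>:<package>", which is inferred from the single URL above and may not hold for every build; the helper name overview_url is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# One of the overview URLs quoted in the observation above.
url = "https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364"

# Decode the query string; parse_qs undoes the percent-encoding (%3A -> ":").
params = parse_qs(urlsplit(url).query)
build = params["build"][0]

# Assumption: the build looks like ":<incident>:<package>".
_, incident, package = build.split(":", 2)


def overview_url(build, groupid):
    """Rebuild an openQA overview link for the given build and job group (hypothetical helper)."""
    query = urlencode({"build": build, "groupid": groupid})
    return "https://openqa.suse.de/tests/overview?" + query


print(build, incident, package)
print(overview_url(build, 434))
```

Rebuilding the link for group 434 yields the second URL from the observation, which gives a quick check that the encoding round-trips.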
Suggestions
- Possibly an easy workaround is to retrigger the build of the corresponding release requests
- Check audit logs for a trace of how incident tests are scheduled in general (not the specific ones we lost), or ask in #discuss-qa-maintenance or the maintenance coordinators
- Messages like https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2340034#L2244 suggest we have some unmatched jobs
- Remove the corresponding inconsistent results from the qem-dashboard database
- Trigger the corresponding jobs to sync and schedule pending data on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules
- Also trigger aggregate tests so that they do not have to wait until the end of the day to start
- Monitor the execution of jobs and the presentation on the dashboard
- Re-enable scheduling of aggregates on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules
- Again monitor the execution of jobs and the presentation on the dashboard
Rollback actions
- Remove silence "alertname=Queue: State (SUSE) alert" from https://stats.openqa-monitor.qa.suse.de/alerting/silences
- Reactivate "Schedule updates/aggregates (0 20 * * 1-5,7)" at https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules on 2024-03-05
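For reference when reactivating the schedule, the cron expression "0 20 * * 1-5,7" can be read as "20:00 on every day except Saturday". The sketch below expands the day-of-week field to show this. It handles only comma lists, simple ranges and "*", not step values, and treats both 0 and 7 as Sunday as standard cron does. The function name expand_dow is hypothetical, and the time zone depends on the pipeline schedule settings, which are not stated here.

```python
DAY_NAMES = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def expand_dow(field):
    """Expand a cron day-of-week field such as '1-5,7' into sorted day numbers (0 = Sunday)."""
    days = set()
    for part in field.split(","):
        if part == "*":
            days.update(range(7))
        elif "-" in part:
            start, end = (int(x) for x in part.split("-"))
            # Modulo 7 maps the alternative Sunday value 7 onto 0.
            days.update(d % 7 for d in range(start, end + 1))
        else:
            days.add(int(part) % 7)
    return sorted(days)


# The five cron fields: minute, hour, day of month, month, day of week.
minute, hour, dom, month, dow = "0 20 * * 1-5,7".split()
run_days = [DAY_NAMES[d] for d in expand_dow(dow)]
print(f"{hour}:{minute.zfill(2)} on {', '.join(run_days)}")
```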
Out of scope
- Fix the problematic design of qem-dashboard