action #156535: Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01 - openQA Infrastructure (public) - openSUSE Project Management Tool

action #156535

closed

Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01

Added by okurz about 1 year ago. Updated about 1 year ago.

Status:

Resolved

Priority:

High

Assignee:

dheidler

Category:

Regressions/Crashes

Target version:

openQA Project (public) - Ready

Start date:

2024-03-01

Due date:

2024-03-19

Tags:

reactive work, qem-bot, qem-dashboard

Description

Observation¶

See #156460-8

While checking results in the maintenance dashboard, I can see on http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either running or not finished. However, the corresponding job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 and https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434.

Suggestions¶

Possibly an easy workaround is to retrigger the build of the corresponding release requests
Check audit logs for a trace of how incident tests are scheduled in general (not the specific ones we lost), or ask in #discuss-qa-maintenance or ask the maintenance coordinators
Messages like https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2340034#L2244 suggest that we have some unmatched jobs
Remove the corresponding inconsistent results from the qem-dashboard database
Trigger the corresponding jobs to sync and schedule pending data on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules
Also trigger aggregate tests so that we do not need to wait until the end of the day for tests to start
Monitor the execution of jobs and the presentation on the dashboard
Re-enable scheduling of aggregates on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules
Again monitor the execution of jobs and the presentation on the dashboard

Rollback actions¶

Remove the silence for the alert alertname=Queue: State (SUSE) from https://stats.openqa-monitor.qa.suse.de/alerting/silences
Reactivate Schedule updates/aggregates (0 20 * * 1-5,7) at https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules on 2024-03-05

Out of scope¶

Fix the problematic design of qem-dashboard

Related issues 1 (0 open — 1 closed)

Updated by okurz about 1 year ago

Copied from action #156460: Potential FS corruption on osd due to 2 VMs accessing the same disk added

Updated by okurz about 1 year ago

Description updated (diff)

Updated by okurz about 1 year ago

Description updated (diff)
Status changed from New to In Progress
Assignee set to dheidler

As discussed with the team after the daily, dheidler is working on it.

Updated by dheidler about 1 year ago

Description updated (diff)

Updated by dheidler about 1 year ago

Steps done (see https://gitlab.suse.de/qa-maintenance/bot-ng/#cleanup-of-unwanted-test-results)

ssh root@qam.suse.de
machinectl shell postgresql
sudo -u postgres pg_dump dashboard_db > dashboard_db_backup_20240304.sql
sudo -u postgres psql dashboard_db
dashboard_db=# TRUNCATE incidents CASCADE;

Manually triggered:

Synchronize SMELT to QEM Dashboard (3,33 * * * *)
Schedule incidents (0 * * * *)
Schedule updates/aggregates (0 20 * * 1-5,7)
Synchronize inc. results into QEM Dashboard (8,38 * * * *)
Synchronize aggr. results into QEM Dashboard (3,33 * * * *)

Deactivated Schedule updates/aggregates (0 20 * * 1-5,7) so that it won't run twice.

Updated by openqa_review about 1 year ago

Due date set to 2024-03-19

Setting due date based on mean cycle time of SUSE QE Tools

Updated by dheidler about 1 year ago

Priority changed from Urgent to High

Still 15k tests in the queue: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=now-24h&to=now
Re-enabled Schedule updates/aggregates (0 20 * * 1-5,7)

Updated by okurz about 1 year ago

Apparently there was already a second aggregate tests build yesterday, as visible on https://openqa.suse.de/group_overview/427. Do you know what happened?

Updated by dheidler about 1 year ago

I guess when discussing the issue with you I misread the cron string 0 20 * * 1-5,7 as 0:20, but it actually means 20:00.
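
To make the field order concrete, here is a minimal sketch in Python that splits a five-field cron expression into named fields. The helper names describe_cron and expand_field are hypothetical, and the sketch only supports the subset of cron syntax used in this ticket (*, plain numbers, comma lists and ranges). It is an illustration only, not how bot-ng or GitLab evaluate pipeline schedules.

```python
# Standard order of the five cron fields.
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def expand_field(field):
    """Expand a cron field such as '1-5,7' into a sorted list of ints.

    Returns None for '*', meaning "every value". Hypothetical helper that
    only supports plain numbers, comma lists and ranges.
    """
    if field == "*":
        return None
    values = set()
    for part in field.split(","):
        if "-" in part:
            start, end = part.split("-")
            values.update(range(int(start), int(end) + 1))
        else:
            values.add(int(part))
    return sorted(values)


def describe_cron(expression):
    """Map the five cron fields to their names, in standard cron order."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    return {name: expand_field(field) for name, field in zip(FIELD_NAMES, fields)}


schedule = describe_cron("0 20 * * 1-5,7")
# The first field is the minute and the second is the hour, so this is 20:00.
print("%02d:%02d" % (schedule["hour"][0], schedule["minute"][0]))  # prints 20:00
print(schedule["day_of_week"])  # prints [1, 2, 3, 4, 5, 7]
```

In common cron implementations a day-of-week value of 7 means Sunday, so under that assumption the schedule would fire at 20:00 on Monday to Friday and on Sunday.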

Meanwhile openQA made good progress running jobs: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=now-2d&to=now
We're down to ~4k scheduled jobs.

Updated by dheidler about 1 year ago

Status changed from In Progress to Resolved

All rollback steps performed.
