action #156535

closed

Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01

Added by okurz about 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-03-01
Due date:
2024-03-19
% Done:

0%

Estimated time:

Description

Observation

See #156460-8

I'm checking results in the maintenance dashboard and I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either still running or not finished. But the corresponding job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 and https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434.

Suggestions

Rollback actions

Out of scope

  • Fix the problematic design of qem-dashboard

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #156460: Potential FS corruption on osd due to 2 VMs accessing the same disk (Resolved, nicksinger, 2024-03-01)

Actions #1

Updated by okurz about 2 months ago

  • Copied from action #156460: Potential FS corruption on osd due to 2 VMs accessing the same disk added
Actions #2

Updated by okurz about 2 months ago

  • Description updated (diff)
Actions #3

Updated by okurz about 2 months ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to dheidler

As discussed with the team after the daily, dheidler is working on it.

Actions #4

Updated by dheidler about 2 months ago

  • Description updated (diff)
Actions #5

Updated by dheidler about 2 months ago

Steps done (see https://gitlab.suse.de/qa-maintenance/bot-ng/#cleanup-of-unwanted-test-results)

ssh root@qam.suse.de
machinectl shell postgresql
sudo -u postgres pg_dump dashboard_db > dashboard_db_backup_20240304.sql
sudo -u postgres psql dashboard_db
dashboard_db=# TRUNCATE incidents CASCADE;
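
Since TRUNCATE ... CASCADE cannot be undone without the backup, it can be worth sanity-checking the dump file before running it. The following is only a minimal sketch in Python and not part of the documented cleanup procedure. It assumes the plain-text pg_dump format as produced by the command above, which starts with a "PostgreSQL database dump" comment and ends with a "PostgreSQL database dump complete" comment. The function name dump_looks_complete is hypothetical.

```python
from pathlib import Path

# Comment markers that plain-text pg_dump output carries at its start and end.
HEADER_MARKER = "-- PostgreSQL database dump"
FOOTER_MARKER = "-- PostgreSQL database dump complete"


def dump_looks_complete(path, window=4096):
    """Return True if the file at `path` looks like a finished plain-text dump.

    Hypothetical helper: it only inspects the first and last `window` bytes,
    so it detects empty or truncated dumps but does not validate the SQL.
    """
    dump = Path(path)
    if not dump.is_file():
        return False
    size = dump.stat().st_size
    if size == 0:
        return False
    with dump.open("rb") as handle:
        head = handle.read(window).decode("utf-8", errors="replace")
        handle.seek(max(0, size - window))
        tail = handle.read().decode("utf-8", errors="replace")
    return HEADER_MARKER in head and FOOTER_MARKER in tail


if __name__ == "__main__":
    # File name taken from the steps above.
    print(dump_looks_complete("dashboard_db_backup_20240304.sql"))
```

A result of False (missing, empty or truncated file) would be a reason to repeat the pg_dump step before truncating anything.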

Manually triggered:

  • Synchronize SMELT to QEM Dashboard (3,33 * * * *)
  • Schedule incidents (0 * * * *)
  • Schedule updates/aggregates (0 20 * * 1-5,7)
  • Synchronize inc. results into QEM Dashboard (8,38 * * * *)
  • Synchronize aggr. results into QEM Dashboard (3,33 * * * *)

Deactivated "Schedule updates/aggregates" (0 20 * * 1-5,7) so that it won't run twice.

Actions #6

Updated by openqa_review about 2 months ago

  • Due date set to 2024-03-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by dheidler about 2 months ago

  • Priority changed from Urgent to High
Actions #8

Updated by okurz about 2 months ago

Apparently there was already a second aggregated tests build yesterday after all, as visible on https://openqa.suse.de/group_overview/427. Do you know what happened?

Actions #9

Updated by dheidler about 2 months ago

I guess when discussing the issue with you I had misread the cron string 0 20 * * 1-5,7 as 00:20, but it actually means 20:00.
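
For reference, the five cron fields are minute, hour, day of month, month and day of week, in that order, so "0 20" means minute 0 of hour 20. The following is only a small illustrative sketch in Python, not the scheduler's own parser. It handles just "*", single numbers, ranges and comma lists (no step values or month/day names) and assumes the common convention that 0 and 7 both stand for Sunday. The function names are hypothetical.

```python
# Field order of a standard five-field cron expression with allowed ranges.
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 7),  # assumption: 0 and 7 both mean Sunday
]


def expand_field(spec, low, high):
    """Expand one cron field such as '1-5,7' or '*' into a sorted list of ints."""
    values = set()
    for part in spec.split(","):
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if not low <= start <= end <= high:
            raise ValueError(f"field value {part!r} outside {low}-{high}")
        values.update(range(start, end + 1))
    return sorted(values)


def parse_cron(expression):
    """Map each of the five cron fields to the list of values it matches."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected exactly five cron fields")
    return {
        name: expand_field(spec, low, high)
        for (name, low, high), spec in zip(FIELDS, parts)
    }


def times_of_day(expression):
    """List the HH:MM times at which the expression fires on a matching day."""
    fields = parse_cron(expression)
    return [f"{h:02d}:{m:02d}" for h in fields["hour"] for m in fields["minute"]]


if __name__ == "__main__":
    print(times_of_day("0 20 * * 1-5,7"))  # the schedule as configured
    print(times_of_day("20 0 * * 1-5,7"))  # what a 00:20 schedule would look like
```

With these assumptions the configured string yields 20:00 on weekdays 1 to 5 and 7, while a 00:20 schedule would need the first two fields swapped.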

Meanwhile openQA made good progress running jobs: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=now-2d&to=now
We're down to ~4k scheduled jobs.

Actions #10

Updated by dheidler about 2 months ago

  • Status changed from In Progress to Resolved

All rollback steps performed.
