action #103524

closed

OW1: performance loss size:M

Added by dimstar about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2021-12-06
Due date:
% Done: 100%
Estimated time:

Description

Observation

Many tests have been failing since Friday, even the rather 'simple' RAID install tests. The failure very often happens in 'await_install', after the job has spent 1 hour installing RPMs. Those tests generally get through the installer in 20 minutes.

This started at the same time in the product under test and in stagings, which makes a product-related issue less likely than an issue with ow1 (openqaworker1).

Acceptance criteria

  • AC1: openqaworker1 is back in production with a stable set of worker instances

Suggestions

  • Reduce the number of openQA worker instances until failing or incomplete jobs are no longer common
  • If recurring I/O errors show up in the logs, consider replacing the hardware again, or replacing the other NVMe if that one is faulty
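
One way to check for the recurring I/O errors mentioned above is to count matching lines in the kernel log. This is only a minimal sketch: the helper name `count_io_errors` and the search pattern `i/o error` are assumptions for illustration, not something taken from this ticket.

```shell
#!/bin/sh
# Hypothetical helper: count lines on stdin that mention an I/O error
# (case-insensitive). The pattern is an assumption; adjust it to the
# messages actually seen on the worker.
count_io_errors() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when there are none, so "|| true" keeps the function's status at 0.
    grep -ci 'i/o error' || true
}

# Possible usage on the worker (not executed here):
#   journalctl -k --since "7 days ago" | count_io_errors
```

A count that keeps growing between checks would point towards faulty hardware rather than a one-off glitch.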

Rollback actions

  • Use a proper RAID of all NVMe devices for /var/lib/openqa again (favogt is currently using /dev/nvme0n1 directly for /var/lib/openqa, which is unusual)
  • Enable additional worker instances again after hardware replacement: systemctl unmask --now openqa-worker-auto-restart@{16..20}
  • Increase WORKERCACHE in openqaworker1:/etc/openqa/workers.ini to a higher value again, e.g. 400GB, once the NVMe has been replaced and more space is available
  • Increase the number of worker instances again, i.e. enable worker instances [7..16]
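
The re-enable steps above can be previewed before running anything. The following is a minimal dry-run sketch: the function name `print_unmask_cmds` is hypothetical, and it only prints the `systemctl unmask --now` command quoted in the rollback list for each instance in a range, without executing it.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not run) the unmask command
# for every worker instance from FIRST to LAST inclusive.
print_unmask_cmds() {
    first=$1
    last=$2
    i=$first
    while [ "$i" -le "$last" ]; do
        echo "systemctl unmask --now openqa-worker-auto-restart@$i"
        i=$((i + 1))
    done
}

# Preview the commands for instances 16..20 as named in the rollback list.
print_unmask_cmds 16 20
```

Printing first keeps the range easy to review; piping the output to `sh` as root would be one way to apply it, assuming the unit names match those on the worker.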

Related issues 3 (0 open, 3 closed)

Related to openQA Project (public) - action #103581: Many jobs on openqa.opensuse.org incomplete in ' timeout: setup exceeded MAX_SETUP_TIME' (Resolved, favogt, 2021-12-07)

Has duplicate openQA Infrastructure (public) - action #107017: Random asset download (cache service) failures on openqaworker1 (Resolved, kraih)

Copied to openQA Infrastructure (public) - action #107017: Random asset download (cache service) failures on openqaworker1 (Resolved, kraih)