action #103524

Updated by okurz 4 months ago

## Observation
A lot of tests have been failing since Friday, even the rather 'simple' raid install tests. The failure very often happens in 'await_install' after the job has already spent an hour installing RPMs; those tests normally get through the installer in 20 minutes.

This started happening at the same time both for the product to be tested and in stagings, which makes product-related issues less probable than openqaworker1 (ow1) related issues.

## Acceptance criteria
* **AC1:** openqaworker1 is back in production with a stable set of worker instances

## Suggestions
* Reduce the number of openQA worker instances until failing or incompleting jobs are no longer common (see the sketch after this list)
* If recurring I/O errors show up in the logs, consider replacing hardware again, or the other NVMe device if that one turns out to be faulty as well
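
A minimal sketch of both suggestions, assuming the systemd unit name and instance range from the rollback list below; the log query window and the nvme-cli usage are illustrative assumptions:

```
# Take the upper worker instances out of production; {16..20} mirrors the
# rollback item below, adjust the range to what is actually running:
sudo systemctl mask --now openqa-worker-auto-restart@{16..20}

# Check the kernel log for recurring I/O errors on the NVMe devices:
sudo journalctl -k --since "7 days ago" | grep -iE 'nvme|i/o error'

# Inspect the drive's own health counters (requires nvme-cli):
sudo nvme smart-log /dev/nvme0n1
```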

## Rollback actions

* Use a proper RAID of all NVMe devices for /var/lib/openqa again (fvogdt has put /var/lib/openqa directly on /dev/nvme0n1 for now, which is unusual); see the first sketch after this list
* Enable additional worker instances again after hardware replacement: `systemctl unmask --now openqa-worker-auto-restart@{16..20}`
* Increase WORKERCACHE in openqaworker1:/etc/openqa/workers.ini back to a higher value, e.g. 400GB, after the NVMe replacement and the resulting increase in space; see the second sketch after this list
* Increase the number of worker instances again, i.e. enable worker instances [7..16]
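
A sketch of the first rollback item, assuming two NVMe namespaces (/dev/nvme0n1, /dev/nvme1n1) and a RAID0 layout; device names, RAID level, and filesystem are assumptions, and the mkfs step is destructive:

```
# Stop all workers so the pool filesystem can be replaced:
sudo systemctl stop openqa-worker-auto-restart@{1..20}
sudo umount /var/lib/openqa

# Create a RAID0 across all NVMe devices and put a fresh filesystem on it
# (device names and level are assumptions, adapt to the actual hardware):
sudo mdadm --create /dev/md/openqa --level=0 --raid-devices=2 \
  /dev/nvme0n1 /dev/nvme1n1
sudo mkfs.ext4 /dev/md/openqa
sudo mount /dev/md/openqa /var/lib/openqa
# Remember to update /etc/fstab accordingly before restarting the workers.
```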
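For the last two rollback items, a sketch assuming WORKERCACHE is the actual key used in workers.ini on this host (as named above) and that the cache service picks up the change on restart:

```
# Raise the cache limit back to 400GB (verify the key name in the file first):
sudo sed -i 's/^WORKERCACHE *=.*/WORKERCACHE = 400/' /etc/openqa/workers.ini
sudo systemctl restart openqa-worker-cacheservice

# Bring worker instances [7..16] back into production:
sudo systemctl enable --now openqa-worker-auto-restart@{7..16}
```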