coordination #101048
Updated by mkittler over 2 years ago
## Observation
According to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now (sort by "avg" in the table on the right-hand side), openqaworker-arm-4/5 have a fail-ratio of 33-36%, compared to 15-17% for openqaworker-arm-1/2/3.
## Acceptance criteria
* **AC1:** openqaworker-arm-4/5 have a fail-ratio less than or equal to that of arm-1/2/3
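
A minimal sketch of how AC1 could be checked from per-worker job counts. The `WorkerStats` class, the `meets_ac1` helper and the counts are hypothetical and only illustrate the comparison; they are neither an openQA API nor real dashboard data. The sketch also assumes that "less than or equal to arm-1/2/3" means "not worse than the worst of arm-1/2/3".

```python
from dataclasses import dataclass


@dataclass
class WorkerStats:
    """Job counts for one worker host (hypothetical structure, not an openQA API)."""
    failed: int
    total: int

    @property
    def fail_ratio(self) -> float:
        # Treat a worker without any jobs as having no failures.
        return self.failed / self.total if self.total else 0.0


def meets_ac1(stats, candidates, reference):
    """Return True if every candidate's fail ratio is <= the worst reference ratio.

    Assumption: AC1 compares each candidate against the highest fail ratio
    among the reference workers.
    """
    threshold = max(stats[name].fail_ratio for name in reference)
    return all(stats[name].fail_ratio <= threshold for name in candidates)


# Made-up counts for illustration only; not measurements from the dashboard.
example = {
    "openqaworker-arm-1": WorkerStats(failed=15, total=100),
    "openqaworker-arm-2": WorkerStats(failed=16, total=100),
    "openqaworker-arm-3": WorkerStats(failed=17, total=100),
    "openqaworker-arm-4": WorkerStats(failed=35, total=100),
    "openqaworker-arm-5": WorkerStats(failed=12, total=100),
}

REFERENCE = ["openqaworker-arm-1", "openqaworker-arm-2", "openqaworker-arm-3"]
CANDIDATES = ["openqaworker-arm-4", "openqaworker-arm-5"]

if __name__ == "__main__":
    print("AC1 met:", meets_ac1(example, CANDIDATES, REFERENCE))
```

With real job counts substituted for the made-up ones, the same check could be repeated after each change to the workers.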
## Additional information and ideas from the hardware comparison between arm-1/2/3 and arm-4/5
* The CPU of arm-4/5 (Cavium ThunderX2, in this specific model and version) is known to behave badly for our use case. This is the difference from the older arm workers, which have the previous version of that CPU model installed.
* Disabling CPU control and CPU frequency scaling in the firmware environment didn't make a difference.
* Before that, we had already tried reducing the number of worker slots considerably, and that didn't help either.
* There are still a few ideas to consider (see #109232#note-5).
* There are also more variables in the firmware environment (see #109232#note-20) we can play with.
* Next time we should buy different hardware (see private comment #109232#note-11).
* See the full ticket #109232 for more context about these findings.
## Suggestions
- Confirm whether typing issues cause the failures (look for timeouts, observe additional or missing characters in typed commands)
- Upgrade arm3 to Leap 15.3 and compare failure rate -> #101265 => Leap 15.3 behaves similarly to Leap 15.2
- Consider switching to kernel-stable or kernel-head -> #101271 => "kernel-default" from Kernel:stable behaves the same as the one from openSUSE:Leap:15.3
- ~~Consider downgrading kernel to what's used in 15.2~~ -> same upstream version is running on most
- Bring back arm 4 and 5 after verifying stability
- Run [typing.pm](https://github.com/os-autoinst/os-autoinst/blob/master/t/data/tests/tests/typing.pm) from os-autoinst as test in production -> #101262
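
For the first suggestion (checking for additional or missing characters in typed commands), a comparison could be sketched as follows. This assumes the text that actually arrived in the SUT can be captured somehow, e.g. from a serial log; the `typing_diff` helper is hypothetical and not part of os-autoinst.

```python
from difflib import SequenceMatcher


def typing_diff(expected: str, observed: str) -> dict:
    """Compare the command that should have been typed with what arrived.

    Returns the characters that are missing from and additional in the
    observed text (hypothetical helper, not part of os-autoinst).
    """
    missing, additional = [], []
    matcher = SequenceMatcher(None, expected, observed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "delete"/"replace": characters of the expected text that never arrived.
        if tag in ("delete", "replace"):
            missing.append(expected[i1:i2])
        # "insert"/"replace": characters that arrived but were not expected.
        if tag in ("insert", "replace"):
            additional.append(observed[j1:j2])
    return {"missing": "".join(missing), "additional": "".join(additional)}


if __name__ == "__main__":
    print(typing_diff("hello", "helo"))   # a dropped key press
    print(typing_diff("abc", "abbc"))     # a repeated key press
```

An empty result for both keys would suggest the command was typed correctly, so a failure of that job is more likely to have a different cause.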