coordination #161735

Updated by mkittler 7 months ago

## Observation
 See #160646 and #161381.

 From https://suse.slack.com/archives/C02CANHLANP/p1717381703517509
 > (Lili Zhao) Hi, multi machine issues found today, for example: https://openqa.suse.de/tests/14504387#step/iscsi_client/8 (ping with packet size 100 failed, problems with MTU size are expected) and https://openqa.suse.de/tests/14504397#step/suseconnect_scc/25 (curl: (7) Couldn't connect to server) 

 Possibly related: https://suse.slack.com/archives/C02CANHLANP/p1717400281975529
 > (Anton Smorodskyi) when I see such error https://openqa.suse.de/tests/14492957#step/prepare_instance/27 `No route to host at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Transaction.pm line 54.` I conclude that worker's network is down. Is my assumption correct?

 Also, https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1717347718902&to=1717408634010 shows a significantly higher ratio of multi-machine test failures in that time frame.

 ## Acceptance criteria 
 * **AC1:** The original issue is understood and resolved 
 * **AC2:** The backend and/or test code can point better to likely causes of an error 
 * **AC3:** The multi-machine test failure ratio on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test is back to sane levels
 * **AC4:** Similar future issues are prevented with better CI checks 

 ## Suggestions 
 * Monitor contents of the mine to better understand when it breaks and why 
 * Implement sanity checks on the worker to check for proper peer configuration 
 * Change the MTU-size check in the test distribution to make the error message clearer in case not even the smallest MTU size works (e.g. "The network connection within the SUT does not work at all." and maybe, for tap-based tests, "Check the MM-setup, e.g. GRE tunnels")
 * Get rid of the mine completely for "workername" <-> IP lookup 
   * Problem: Currently the pillar-data does not contain the FQDN of the other workers. 
   * We already have "## FQDN: …" in many cases so it would be easy to make that a mandatory key for all, at least the ones where we expect that the tap class should be usable 
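
 The suggested MTU-size check could distinguish "nothing works" from "only larger packets fail" roughly as sketched below. This is only an illustration in Python, not the actual test distribution code: the function name `check_mtu_sizes`, the injected `ping` callable and the example packet sizes are hypothetical, and only the two error messages are taken from the suggestion above.

```python
from typing import Callable, Iterable


def check_mtu_sizes(
    ping: Callable[[int], bool],
    sizes: Iterable[int],
    tap_based: bool = False,
) -> str:
    """Return a diagnostic message for a series of ping attempts.

    `ping` is a hypothetical callable that sends one ping with the given
    packet size and returns True on success. It is injected so the logic
    can be exercised without a real network.
    """
    ordered = sorted(sizes)
    if not ordered:
        raise ValueError("at least one packet size is required")

    # Try every size once and split the sizes into working and failing ones.
    results = {size: ping(size) for size in ordered}
    working = [size for size in ordered if results[size]]
    failing = [size for size in ordered if not results[size]]

    if not working:
        # Not even the smallest packet got through: this is not an MTU problem.
        message = "The network connection within the SUT does not work at all."
        if tap_based:
            message += " Check the MM-setup, e.g. GRE tunnels."
        return message

    if failing:
        # Some sizes work and others do not: an MTU problem is likely.
        return (
            f"Ping failed for packet sizes {failing} but worked for {working}; "
            "problems with MTU size are likely."
        )

    return "Ping worked for all packet sizes."
```

 For example, `check_mtu_sizes(lambda size: False, [100, 1350], tap_based=True)` yields the "does not work at all" message with the GRE tunnel hint, whereas a `ping` that only succeeds for small packets yields the MTU hint.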

 ## WARNING 
 * Do not touch the key of a worker in workerconf.sls - a lot of other states depend on it!
