action #157534
closedcoordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines
Description
Observation
openQA test in scenario sle-15-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_rolling_update_node01@64bit fails in suseconnect_scc.
It fails while attempting to call script_output, which runs a curl command against 10.0.2.2 to download the script.
Test suite description
Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.
rolling_update tests take a working cluster and perform a migration on each node while the node is in maintenance.
Reproducible
Fails sporadically since (at least) Build :32868:expat
The majority of the failures have been seen on worker40.
Expected result
Last good: :32996:sed (or more recent)
Further details
Always latest result in this scenario: latest
Updated by okurz 7 months ago
- Related to action #151382: [qe-sap] test fails in iscsi_client with unclear error message, please add "ping_size_check" from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17817 to rule out MTU problems added
Updated by acarvajal 7 months ago
Recent failure: https://openqa.suse.de/tests/13825910#step/suseconnect_scc/25
Updated by acarvajal 7 months ago
Another recent failure, this time in the ping_size_check added to the ha/iscsi_client test module: https://openqa.suse.de/tests/13826037#step/iscsi_client/8
Both recent failures were in worker40
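For readers unfamiliar with why a ping size check helps rule out MTU problems, here is a minimal sketch of the arithmetic involved. It does not reproduce what ping_size_check actually does. The function names are hypothetical, and the 24-byte tunnel overhead is only an assumption (outer IPv4 header plus a basic GRE header without optional fields); the real overhead depends on the encapsulation used on the workers.

```python
# Sketch of MTU arithmetic; not taken from the ping_size_check implementation.
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def inner_mtu(physical_mtu: int, tunnel_overhead: int) -> int:
    """MTU left for traffic inside a tunnel that adds tunnel_overhead bytes."""
    return physical_mtu - tunnel_overhead


# Assumption for illustration: 20 bytes outer IPv4 + 4 bytes basic GRE header.
ASSUMED_GRE_OVERHEAD = 24

direct = max_ping_payload(1500)
tunnelled = max_ping_payload(inner_mtu(1500, ASSUMED_GRE_OVERHEAD))
print(f"direct: {direct} bytes, through assumed tunnel: {tunnelled} bytes")
```

A ping with the "don't fragment" flag and a payload above the computed limit is expected to fail, which is what makes a size sweep useful for spotting MTU mismatches between workers.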
Updated by nicksinger 7 months ago
That's pretty easy: The MM tests cannot work like this on worker40 (cross worker MM tests at least).
/etc/wicked/scripts/gre_tunnel_preup.sh is missing on worker40.oqa.prg2.suse.org
we should check why https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls#L96-128 is not applied on that machine
Updated by okurz 7 months ago · Edited
- Subject changed from Multi-Machine Job fails in suseconnect_scc to Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines
- https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/748 should fix that because I actually made a mistake when setting up those new worker instances.
- Mitigation of impact, e.g. retrigger and monitor impacted tests. -> called
host=openqa.suse.de failed_since=2024-03-18 WORKER=worker40 result="result='failed'" openqa-advanced-retrigger-jobs
but also acarvajal took care of jobs
- Make region,datacenter,location consistent -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/749
- How can we prevent such a situation in the future in our salt states? -> #157606
- Should we have an actual openQA worker at the location prg2e? -> That's something we can consider for the future. I wouldn't do that for now. Should be good enough.
Updated by nicksinger 7 months ago
I want to add how I found this:
- Render our workerconf.sls to plain yaml with:
salt-call --local cp.get_template /srv/pillar/openqa/workerconf.sls /tmp/test.output
- Create a short python script in our "_modules" directory of salt-states because it contains the code for "gre_peers.compute". I used that script:
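import yaml
from gre_peers import compute

# Load the rendered workerconf from the previous step
# (adjust the path to wherever the rendered output was saved)
with open("workerconf_rendered.yaml", "r") as fi:
    y = yaml.safe_load(fi)

# Show what gre_peers.compute yields for worker40
print(compute("worker40", y["workerconf"]))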
I then placed "import pdb; pdb.set_trace()" at strategic places in gre_peers.py to understand how this happens.
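The kind of check this debugging points towards can be sketched as follows. This is not the real gre_peers logic. It is a hypothetical, simplified model in which two workers are peers when both carry a "tap" class and share a location class. The class names "location-prg2" and "location-prg2e", the helper names, and the hosts worker29 and worker30 are all assumptions for illustration.

```python
from typing import Dict, List, Set


def find_peers(host: str, workers: Dict[str, Set[str]]) -> List[str]:
    """Hypothetical peer rule: other tap hosts sharing a location-* class."""
    own = workers[host]
    if "tap" not in own:
        return []
    own_locations = {c for c in own if c.startswith("location-")}
    return sorted(
        other
        for other, classes in workers.items()
        if other != host
        and "tap" in classes
        and own_locations & {c for c in classes if c.startswith("location-")}
    )


def hosts_without_peers(workers: Dict[str, Set[str]]) -> List[str]:
    """Tap-enabled hosts that end up with no peer at all."""
    return sorted(
        host
        for host, classes in workers.items()
        if "tap" in classes and not find_peers(host, workers)
    )


# Made-up example data; only the name worker40 comes from this ticket.
example = {
    "worker29": {"tap", "location-prg2"},
    "worker30": {"tap", "location-prg2"},
    "worker40": {"tap", "location-prg2e"},
}
print(hosts_without_peers(example))
```

Under these assumptions, a host whose location class differs from all other tap-enabled workers is flagged, which mirrors the symptom of a worker offering multi-machine jobs without any tunnel to its peers.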
Updated by okurz 7 months ago
- Copied to action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration added
Updated by openqa_review 7 months ago
- Due date set to 2024-04-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 7 months ago
Mitigation complete. https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_rolling_update_node01&version=15-SP5#next_previous ok again.
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/749 merged. Waiting for that to be deployed. If that shows no problems I will resolve.
Updated by okurz 7 months ago
- Due date deleted (2024-04-04)
- Status changed from In Progress to Resolved
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/749 deployed, showing up with new classes on workers, no problems observed. Resolving.
Updated by okurz 7 months ago
- Related to action #157147: Documentation for OSD worker region, location, datacenter keys in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls size:S added