action #157534: Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines - openQA Project (public) - openSUSE Project Management Tool

action #157534

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines

Added by acarvajal about 1 year ago. Updated about 1 year ago.

Status:

Resolved

Priority:

Normal

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2024-03-19

Due date:

% Done:

Estimated time:

Description

Observation¶

openQA test in scenario sle-15-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_rolling_update_node01@64bit fails in
suseconnect_scc

It fails while calling script_output, which runs a curl command against 10.0.2.2 to download the script.

Test suite description¶

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.

rolling_update tests take a working cluster and perform a migration on each node while the node is in maintenance.

Reproducible¶

Fails sporadically since (at least) Build :32868:expat

The majority of the failures have been seen on worker40.

Expected result¶

Last good: :32996:sed (or more recent)

Further details¶

Always latest result in this scenario: latest

Related issues 3 (2 open — 1 closed)

Updated by acarvajal about 1 year ago

There are other failures in different tests, always in multi-machine scenarios and mostly on worker40:

Updated by okurz about 1 year ago

Related to action #151382: [qe-sap] test fails in iscsi_client with unclear error message, please add "ping_size_check" from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17817 to rule out MTU problems added

Updated by okurz about 1 year ago

Category set to Regressions/Crashes
Target version set to Ready

Updated by acarvajal about 1 year ago

We also got some failures outside of worker40:

https://openqa.suse.de/tests/13823681#step/iscsi_client/32 (worker29)
https://openqa.suse.de/tests/13823680#step/iscsi_client/32 (worker34)
https://openqa.suse.de/tests/13823698#step/iscsi_client/32 (worker39)
https://openqa.suse.de/tests/13823697#step/iscsi_client/32 (worker30)

Updated by acarvajal about 1 year ago

Recent failure: https://openqa.suse.de/tests/13825910#step/suseconnect_scc/25

Updated by acarvajal about 1 year ago

Another recent failure, this time in the ping_size_check added to the ha/iscsi_client test module: https://openqa.suse.de/tests/13826037#step/iscsi_client/8

Both recent failures were on worker40.
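
For context on what a ping size check can rule out: a minimal sketch, assuming a plain IPv4 path and the iputils `ping` flags `-M do` (forbid fragmentation) and `-s` (payload size). This is not the code of the actual ping_size_check from the linked pull request, and the helper names `ping_payload_size` and `ping_command` are made up for illustration. The arithmetic is simply MTU minus the 20-byte IPv4 header and the 8-byte ICMP header.

```python
import shlex

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_payload_size(mtu: int) -> int:
    """Largest ICMP echo payload that fits into one unfragmented IPv4 packet."""
    if mtu <= IPV4_HEADER + ICMP_HEADER:
        raise ValueError(f"MTU {mtu} is too small to carry an ICMP echo")
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(target: str, mtu: int, count: int = 3) -> str:
    """Build a ping command line that fails if the path MTU is below `mtu`."""
    size = ping_payload_size(mtu)
    return shlex.join(
        ["ping", "-c", str(count), "-M", "do", "-s", str(size), target]
    )


if __name__ == "__main__":
    # prints: ping -c 3 -M do -s 1472 10.0.2.2
    print(ping_command("10.0.2.2", 1500))
```

If such a ping fails while a small default ping succeeds, the path MTU is lower than expected. If both fail, as one would expect with a missing tunnel, the problem is basic connectivity rather than MTU.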

Updated by nicksinger about 1 year ago

From https://suse.slack.com/archives/C02CANHLANP/p1710931704905159?thread_ts=1710757297.684389&cid=C02CANHLANP:

  That's pretty easy: The MM tests cannot work like this on worker40 (cross worker MM tests at least).
/etc/wicked/scripts/gre_tunnel_preup.sh is missing on worker40.oqa.prg2.suse.org

we should check why https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls#L96-128 is not applied on that machine

Updated by okurz about 1 year ago

Status changed from New to In Progress
Assignee set to okurz

Updated by okurz about 1 year ago · Edited

Subject changed from Multi-Machine Job fails in suseconnect_scc to Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/748 should fix this. I made a mistake when setting up the new worker instances.
Mitigation of impact, e.g. retrigger and monitor impacted tests. -> Called host=openqa.suse.de failed_since=2024-03-18 WORKER=worker40 result="result='failed'" openqa-advanced-retrigger-jobs, and acarvajal also took care of jobs.
Make region,datacenter,location consistent -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/749
How can we prevent such a situation in the future in our salt states? -> #157606
Should we have an actual openQA worker at the location prg2e? -> That's something we can consider for the future. I wouldn't do that for now. The current setup should be good enough.
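
As a rough illustration of the prevention question above, here is a sketch of a check that flags workers that are alone at their configured location. It assumes, without confirmation from this ticket, that tunnel peers are only picked among workers sharing a location value. The flat `{worker: location}` mapping, the function name `lonely_workers` and the worker names are hypothetical, and the real layout of workerconf.sls is not shown here.

```python
from collections import defaultdict


def lonely_workers(locations: dict[str, str]) -> dict[str, str]:
    """Return {worker: location} for every worker that is alone at its location.

    `locations` is a hypothetical flat mapping of worker name to location,
    not the actual structure of workerconf.sls.
    """
    by_location = defaultdict(list)
    for worker, location in locations.items():
        by_location[location].append(worker)
    return {
        members[0]: location
        for location, members in sorted(by_location.items())
        if len(members) == 1
    }


if __name__ == "__main__":
    # Made-up example data, for illustration only
    example = {"wA": "prg2", "wB": "prg2", "wC": "prg2e"}
    # prints: {'wC': 'prg2e'}
    print(lonely_workers(example))
```

A check along these lines could run in CI for the pillar repository and fail the pipeline whenever the result is non-empty for multi-machine workers.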

#10

Updated by nicksinger about 1 year ago

I want to add how I found this:

Render our workerconf.sls to plain yaml with: salt-call --local cp.get_template /srv/pillar/openqa/workerconf.sls /tmp/test.output
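Create a short Python script in the "_modules" directory of salt-states, because that directory contains the code for "gre_peers.compute". I used this script:

import yaml
# Resolves only when run from the "_modules" directory of salt-states
from gre_peers import compute

# Presumably the rendered pillar from the previous step (written there as /tmp/test.output)
with open("workerconf_rendered.yaml", "r") as fi:
    y = yaml.safe_load(fi)

# Print what gre_peers.compute returns for worker40
print(compute("worker40", y["workerconf"]))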

I then placed "import pdb; pdb.set_trace()" at strategic places in gre_peers.py to understand how this happens.
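
The same approach could be extended to scan every worker instead of just worker40. The sketch below takes the compute function as a parameter so that it runs without salt-states. It assumes that the worker configuration is a mapping keyed by worker name and that the result is empty (falsy) when a worker has no peers. Neither assumption is confirmed in this ticket, and `workers_without_peers` and the stub are hypothetical names.

```python
from typing import Any, Callable, Mapping


def workers_without_peers(
    workerconf: Mapping[str, Any],
    compute_fn: Callable[[str, Mapping[str, Any]], Any],
) -> list[str]:
    """Return the workers for which compute_fn yields an empty (falsy) result.

    compute_fn is called like gre_peers.compute in the script above:
    compute_fn(worker_name, workerconf).
    """
    return sorted(w for w in workerconf if not compute_fn(w, workerconf))


if __name__ == "__main__":
    # Stub standing in for gre_peers.compute, with made-up data
    fake_peers = {"wA": ["wB"], "wB": ["wA"], "wC": []}

    def fake_compute(host, conf):
        return fake_peers[host]

    conf = {name: {} for name in fake_peers}
    # prints: ['wC']
    print(workers_without_peers(conf, fake_compute))
```

With the real module this would be called as workers_without_peers(y["workerconf"], compute), provided the assumptions above hold.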

#11

Updated by okurz about 1 year ago

Parent task set to #111929

#12

Updated by okurz about 1 year ago

Copied to action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration added

#13

Updated by openqa_review about 1 year ago

Due date set to 2024-04-04

Setting due date based on mean cycle time of SUSE QE Tools

#14

Updated by okurz about 1 year ago

Mitigation complete. https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_rolling_update_node01&version=15-SP5#next_previous ok again.

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/749 merged. Waiting for that to be deployed. If that shows no problems I will resolve.

#15

Updated by okurz about 1 year ago

Due date deleted (~~2024-04-04~~)
Status changed from In Progress to Resolved

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/749 deployed, showing up with new classes on workers, no problems observed. Resolving.

#16

Updated by okurz about 1 year ago

Related to action #157147: Documentation for OSD worker region, location, datacenter keys in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls size:S added
