action #135773

closed

[tools] many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers size:M

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2023-08-15
Due date: 2023-10-07
% Done: 0%
Estimated time:
Tags:

Description

Observation

See #134282-1

There is something wrong with the multi-machine network when tests are run across different workers. If a multi-machine job is forced to run on the same worker, it is fine.

There are failures in the core group: https://openqa.suse.de/tests/11843205#next_previous
Kernel group: https://openqa.suse.de/tests/11846943#next_previous
HPC: https://openqa.suse.de/tests/11845897#next_previous

The scenario is https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ovs-client&version=15-SP5

Acceptance criteria

  • AC1: The "ovs-client+ovs-server" test scenario passes consistently when running on multiple OSD workers with "tap" class

Suggestions

Out of scope

  • Anything that already fails when the multi-machine cluster runs on a single physical host
  • #135035 "Pin multimachine jobs to a single worker"
  • Any other test than "ovs-client+server"
  • Try to minimize the reproducer, e.g. skip test modules in openQA -> #135818

Workaround

Pin to a single physical machine


Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry - Resolved - nicksinger - 2023-08-15

Actions #1

Updated by okurz over 1 year ago

  • Copied from action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Actions #2

Updated by okurz over 1 year ago

  • Subject changed from [tools] many multi-machine test failures when tests are run across different workers to [tools] many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by livdywan about 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan
Actions #4

Updated by openqa_review about 1 year ago

  • Due date set to 2023-10-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by livdywan about 1 year ago

livdywan wrote in #note-3:

The scenario is https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ovs-client&version=15-SP5

Jobs are looking good currently e.g. on worker30+29 or worker30+37 (including on worker31 #135407#note-15).

There's one job scheduled specifically for worker31 https://openqa.suse.de/tests/12268287 which won't work because tap was taken out, hence I'm cloning that w/o the worker class.
Edit: https://openqa.suse.de/tests/12268107

Actions #6

Updated by livdywan about 1 year ago

Running ovs-client and ovs-server on production multi-machine workers and worker31, respectively, to test stability. The second build has the worker selection reversed in case it makes a difference:

Running this was a bit tedious. What I ended up doing was using --export-command to prepare a -X POST job command, and then setting especially these variables:

'BUILD=poo135773_prod_spare_flip'
"TEST:12253260=ovs-server-$i"
"TEST:12253438=ovs-client-$i"
'WORKER_CLASS:12253260=qemu_x86_64,tap_poo135407,worker31'
'WORKER_CLASS:12253438=qemu_x86_64,tap'
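The overrides above can be generated per run index instead of being typed by hand. This is a minimal sketch, assuming the job IDs and worker classes quoted in this comment. The helper name `mm_overrides` is hypothetical, and it only prints the settings; it does not post anything to openQA.

```shell
#!/bin/sh
# Hypothetical helper: print the per-job setting overrides for one run index.
# Job IDs and worker classes are copied from the comment above. Nothing is
# sent to openQA; the output is meant to be appended to the exported
# "-X POST" command by hand.
mm_overrides() {
    idx=$1
    printf '%s\n' \
        'BUILD=poo135773_prod_spare_flip' \
        "TEST:12253260=ovs-server-$idx" \
        "TEST:12253438=ovs-client-$idx" \
        'WORKER_CLASS:12253260=qemu_x86_64,tap_poo135407,worker31' \
        'WORKER_CLASS:12253438=qemu_x86_64,tap'
}

# Example: show the overrides for runs 1 to 3.
for i in 1 2 3; do
    mm_overrides "$i"
done
```

Printing one setting per line keeps the output easy to review before it is pasted into the real command.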
Actions #7

Updated by livdywan about 1 year ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

I'm scheduling another batch (same group, higher indices). Let's see if jobs continue to be fine.

Actions #8

Updated by livdywan about 1 year ago

livdywan wrote in #note-7:

I'm scheduling another batch (same group, higher indices). Let's see if jobs continue to be fine.

Well, it turns out some of the repos were removed, which renders the batch pointless: all jobs fail to install from those repos. This is at best checked when initially cloning, which I was not doing due to my use of --export-command.

Actions #9

Updated by okurz about 1 year ago

So we are good regarding worker31. Yes, triggering those tests and tweaking them is ugly. What I did now is use a SLE15-SP6 base, which is less likely to need changes regarding repo settings and might be a good variant to test as well. What I did:

openqa-clone-job --export-command --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12207584

then copy-paste the command into the command line, delete all variables "BUILD", "_GROUP_ID", "TEST", "WORKER_CLASS" and replace those accordingly with

'TEST:12207583=ovs-client-poo135773' 'TEST:12207584=ovs-server-poo135773' 'WORKER_CLASS:12207583=qemu_x86_64,tap' 'WORKER_CLASS:12207584=worker31' _GROUP=0 BUILD=poo135773_prod_spare_flip

forming the complete command.
This triggered https://openqa.suse.de/tests/12343491
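The delete-and-replace step described above can be sketched as a small filter. This is only an illustration under assumptions: the helper name `retarget_settings` is hypothetical, it expects one setting per line on stdin, and it assumes that exported settings may carry a `:JOBID` suffix as in the replacements above. The sample input in the example is made up.

```shell
#!/bin/sh
# Hypothetical helper: read one setting per line on stdin, drop the settings
# the comment says to delete (BUILD, _GROUP_ID, TEST, WORKER_CLASS, with or
# without a ":JOBID" suffix), then append the replacements from the comment.
retarget_settings() {
    # grep exits non-zero when every line is filtered out; ignore that.
    grep -v -E '^(BUILD|_GROUP_ID|TEST|WORKER_CLASS)(:[0-9]+)?=' || true
    printf '%s\n' \
        'TEST:12207583=ovs-client-poo135773' \
        'TEST:12207584=ovs-server-poo135773' \
        'WORKER_CLASS:12207583=qemu_x86_64,tap' \
        'WORKER_CLASS:12207584=worker31' \
        '_GROUP=0' \
        'BUILD=poo135773_prod_spare_flip'
}

# Example with made-up input: only the DISTRI line survives the filter,
# followed by the six replacement settings.
printf '%s\n' \
    'BUILD:12207584=old-build' \
    'TEST:12207583=ovs-client' \
    'DISTRI:12207583=sle' | retarget_settings
```

The filtered list still has to be turned back into quoted arguments of the exported command; the sketch covers only the selection of settings.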

@livdywan please ensure that the workaround of "pin to individual machines" is removed everywhere and reference in #134282 how to verify other workers.

Actions #10

Updated by livdywan about 1 year ago

@livdywan please ensure that the workaround of "pin to individual machines" is removed everywhere and reference in #134282 how to verify other workers.

There's no remaining pinning for this ticket in pillars/workerconf.sls (there is pinning for #136130 and #135407, which is expected) and none in job groups (note that GitLab can't search this repo; it needs to be cloned with subrepos).

Actions #11

Updated by livdywan about 1 year ago

  • Status changed from Feedback to Resolved

livdywan wrote in #note-10:

@livdywan please ensure that the workaround of "pin to individual machines" is removed everywhere and reference in #134282 how to verify other workers.

I added a reference to #135773-9
