action #132827

open

[tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests size:M

Added by rfan1 about 1 year ago. Updated 8 months ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
2023-07-17
Due date:
% Done:

0%

Estimated time:

Description

Observation

I can see that some tests are failing due to a DNS resolve issue on workers "sapworker*", especially on multi-machine tests. Can someone help check?

Some error messages as below:
https://openqa.suse.de/tests/11593878#step/salt_master/15
http://openqa.suse.de/tests/11594635#step/rsync_client/12

Reproducible

Failed test links

Expected result

I tried with another worker to run the rsync tests without any issue: http://openqa.suse.de/tests/11594925#dependencies

Rollback steps

Further details

There may be some network problems with workers "sapworker*". Based on my tests [at least for the rsync test result], the same test can pass with "worker5" but fail with "sapworker1".

Suggestions

  • First ensure that all openQA workers have the salt state applied cleanly, e.g. sudo salt --no-color -C 'G@roles:worker' state.apply
  • Maybe the failure can be improved on the os-autoinst side, like a better "die" message/reason
  • As a temporary measure, consider disabling the "tap" class from affected workers, e.g. make it tap_pooXXX
  • Debug multi-machine capabilities according to http://open.qa/docs/#_verify_the_setup
  • Ensure that our salt states provide everything that is needed to run stable multi-machine tests
  • Add back production worker classes for all affected machines openqaworker1, sapworker{1-7}
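
The "verify the setup" step from the suggestions can be partly scripted. A minimal sketch, assuming the multi-machine bridge is named br1 (an assumption based on the usual setup described at http://open.qa/docs/#_verify_the_setup); it parses a captured `ip -o -4 addr` dump here so it can be tried anywhere:

```shell
#!/bin/sh
# Check that the multi-machine bridge (assumed name: br1) exists and has an
# IPv4 address in a captured "ip -o -4 addr" dump. On a real worker you would
# feed it live output instead of the sample file below.
check_bridge() {
    # $1: file containing "ip -o -4 addr" output
    if grep -q '^[0-9]*: br1 .*inet ' "$1"; then
        echo "br1: OK"
    else
        echo "br1: MISSING or no IPv4 address"
    fi
}

# Hypothetical sample dump standing in for live output:
cat > /tmp/ipaddr.sample <<'EOF'
1: lo    inet 127.0.0.1/8 scope host lo
2: eth0    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
3: br1    inet 10.0.2.2/15 brd 10.1.255.255 scope global br1
EOF

check_bridge /tmp/ipaddr.sample  # prints: br1: OK
```

The interface name and addresses in the sample are placeholders; on an actual worker the check boils down to `ip -o -4 addr | grep br1`.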

Related issues 4 (1 open, 3 closed)

Related to openQA Tests - action #132932: [qe-core] test fails in t01_basic - eth0 stays in setup-in-progres (New, 2023-07-18)

Related to openQA Project - action #133025: Configure Virtual Interfaces instructions do not work on Leap 15.5 size:M (Resolved, dheidler, 2023-07-19 – 2023-10-31)

Related to openQA Infrastructure - action #132137: Setup new PRG2 openQA worker for osd size:M (Resolved, mkittler, 2023-06-29)

Blocked by openQA Infrastructure - action #133127: Frankencampus network broken + GitlabCi failed --> uploading artefacts (Resolved, okurz, 2023-07-20)

Actions #1

Updated by rfan1 about 1 year ago

  • Description updated (diff)
Actions #2

Updated by rfan1 about 1 year ago

  • Subject changed from [qe-tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" to [qe-tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests
Actions #3

Updated by rfan1 about 1 year ago

  • Description updated (diff)
Actions #4

Updated by okurz about 1 year ago

  • Subject changed from [qe-tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests to [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #6

Updated by okurz about 1 year ago

  • Description updated (diff)
Actions #7

Updated by okurz about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to okurz

Retriggering the corresponding failures:

for i in 1 2 3; do WORKER=sapworker$i host=openqa.suse.de result="result='failed'" failed_since="2023-07-15" openqa-advanced-retrigger-jobs ; done
Actions #8

Updated by okurz about 1 year ago

  • Status changed from In Progress to New
  • Assignee deleted (okurz)

Retriggered all tests. Now we should look into why those tests could not execute on those workers.

Actions #9

Updated by rfan1 about 1 year ago

I tried to compare sapworker1 and worker5, and I found a clue now.

GRE tunnels do not show up in the ip a output on sapworker1 [they do show up after I restart the wicked service]. Hopefully my finding helps.
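
The finding above (GRE tunnels absent until wicked is restarted) can be turned into a quick check. A sketch, assuming the tunnel interface names start with "gre" (an assumption based on common naming); it parses a captured `ip -o link` dump so it is runnable anywhere:

```shell
#!/bin/sh
# Count GRE tunnel interfaces in "ip -o link" output. On an affected worker
# they only appeared after: sudo systemctl restart wicked
gre_count() {
    # $1: file containing "ip -o link" output; matches names like "4: gre_x@..."
    grep -c ': gre' "$1"
}

# Hypothetical dump from a worker where the tunnels never came up:
cat > /tmp/iplink.bad <<'EOF'
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
3: br1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1458
EOF

if [ "$(gre_count /tmp/iplink.bad)" -eq 0 ]; then
    echo "no GRE tunnels found - consider: sudo systemctl restart wicked"
fi
```

On a live worker the equivalent one-liner would be `ip -o link | grep -c ': gre'` followed by the wicked restart if the count is zero.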

Actions #10

Updated by rfan1 about 1 year ago

https://openqa.suse.de/tests/11606807#dependencies The test passed on sapworker1 now.

I did below operation on this host:
sudo systemctl restart wicked

Actions #11

Updated by rfan1 about 1 year ago

One more worker seems to hit the same issue: openqaworker1
https://openqa.suse.de/tests/11603529#step/rsync_client/12

Actions #12

Updated by mloviska about 1 year ago

  • Related to action #132932: [qe-core] test fails in t01_basic - eth0 stays in setup-in-progres added
Actions #13

Updated by emiura about 1 year ago

I had one test failing on openqaworker1, logged in there, GRE tunnels were down, restarted wicked as suggested. Tunnels appeared in "ip a s", but some tests still failed on network issues:

https://openqa.suse.de/tests/11614943

Actions #14

Updated by rfan1 about 1 year ago

emiura wrote:

I had one test failing on openqaworker1, logged in there, GRE tunnels were down, restarted wicked as suggested. Tunnels appeared in "ip a s", but some tests still failed on network issues:

https://openqa.suse.de/tests/11614943

Not sure if this command can help:

sudo ip link set gre0 up

Actions #15

Updated by rfan1 about 1 year ago

There are some differences in the firewall configuration:

openqaworker1:

> sudo grep -v ^#  /etc/firewalld/firewalld.conf

DefaultZone=trusted
MinimalMark=100
CleanupOnExit=yes
Lockdown=no
IPv6_rpfilter=yes
IndividualCalls=no
LogDenied=off
FirewallBackend=nftables
FlushAllOnReload=no
RFC3964_IPv4=yes
AllowZoneDrifting=no

worker5:

> sudo grep -v ^#  /etc/firewalld/firewalld.conf

DefaultZone=trusted
MinimalMark=100
CleanupOnExit=yes
Lockdown=no
IPv6_rpfilter=yes
IndividualCalls=no
LogDenied=off
AutomaticHelpers=system
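
The comparison above can be reproduced as a plain diff of the non-comment lines. A self-contained sketch using inline copies of the two configs as posted (on the workers themselves you would diff the live output of `grep -v '^#' /etc/firewalld/firewalld.conf` instead):

```shell
#!/bin/sh
# Diff the firewalld settings of the two workers; the files below reproduce
# the values posted in this comment.
cat > /tmp/fw.openqaworker1 <<'EOF'
DefaultZone=trusted
MinimalMark=100
CleanupOnExit=yes
Lockdown=no
IPv6_rpfilter=yes
IndividualCalls=no
LogDenied=off
FirewallBackend=nftables
FlushAllOnReload=no
RFC3964_IPv4=yes
AllowZoneDrifting=no
EOF

cat > /tmp/fw.worker5 <<'EOF'
DefaultZone=trusted
MinimalMark=100
CleanupOnExit=yes
Lockdown=no
IPv6_rpfilter=yes
IndividualCalls=no
LogDenied=off
AutomaticHelpers=system
EOF

# diff exits non-zero when the files differ, hence "|| true"
diff /tmp/fw.openqaworker1 /tmp/fw.worker5 || true
```

The interesting deltas are FirewallBackend=nftables (only on openqaworker1) versus AutomaticHelpers=system (only on worker5).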

Actions #16

Updated by emiura about 1 year ago

This test ended up with a name resolution failure, even with what appears to be a correct network setup:
https://openqa.suse.de/tests/11623555#step/hostname/25

Actions #18

Updated by livdywan about 1 year ago

  • Subject changed from [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests to [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #19

Updated by mkittler about 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

For the SAP workers the bridge device in workerconf.sls was never set to something that would actually work. I've just noticed that, so I'll assign this ticket to myself and fix the problem.

Actions #20

Updated by mkittler about 1 year ago

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/568 for the SAP workers. I will apply this change manually tomorrow and re-run my test jobs.

Not sure why openqaworker1 and worker5 were problematic. My SSH connection to openqaworker1 is so slow that I couldn't investigate much. Maybe the slow connection is actually also the source of the problem. On worker5 the correct bridge device is configured, so it is not just an obvious mistake as on the SAP workers. However, after reading all the comments it seems that worker5 is actually not problematic anyway and was just mentioned as a reference. So the suggestion "for all affected machines openqaworker1, worker5, sapworker{1-7}" from the ticket description is likely just misleading. Thus I will focus on the SAP workers for now.

Actions #21

Updated by rfan1 about 1 year ago

mkittler wrote:

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/568 for the SAP workers. I will apply this change manually tomorrow and re-run my test jobs.

Not sure why openqaworker1 and worker5 were problematic. My SSH connection to openqaworker1 is so slow that I couldn't investigate much. Maybe the slow connection is actually also the source of the problem. On worker5 the correct bridge device is configured, so it is not just an obvious mistake as on the SAP workers. However, after reading all the comments it seems that worker5 is actually not problematic anyway and was just mentioned as a reference. So the suggestion "for all affected machines openqaworker1, worker5, sapworker{1-7}" from the ticket description is likely just misleading. Thus I will focus on the SAP workers for now.

@mkittler, so far I didn't hit any issues on worker5. Let me remove it from the description since it may cause some confusion.

Actions #22

Updated by rfan1 about 1 year ago

  • Description updated (diff)
Actions #23

Updated by openqa_review about 1 year ago

  • Due date set to 2023-08-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #24

Updated by mkittler about 1 year ago

Thanks for the clarification.

I followed up on the Slack chats and apparently there are networking problems (it works but with huge packet loss). That would perfectly explain the poor SSH connection I had to openqaworker1 yesterday. I will only be able to investigate the situation with openqaworker1 once that problem has been resolved.

The same unfortunately applies to the SAP workers as well because those are also in the FC basement and at this point equally badly affected.

Actions #25

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Blocked
Actions #26

Updated by mkittler about 1 year ago

  • Blocked by action #133127: Frankencampus network broken + GitlabCi failed --> uploading artefacts added
Actions #27

Updated by okurz 12 months ago

  • Status changed from Blocked to Workable

#133127 was resolved. There might still be problems in the network due to this, as I did not get any proper "problem resolution" message about the network problems. When we have specific problems we should just report them in corresponding tickets+SD-tickets and block on those. Right now sapworker1… are in salt and responsive; ping results also look stable and good.

So you can continue the work here.

Actions #28

Updated by mkittler 12 months ago

  • Status changed from Workable to In Progress

The networking problems seem to be resolved (or it is at least better).

I could check the bridge interface on openqaworker1 and it is configured correctly. I'm not sure yet why https://openqa.suse.de/tests/11665440#step/iscsi_client/93 still fails.

I'll try cloning some MM tests on the Nürnberg-located SAP workers and possibly enable them again.

Actions #29

Updated by livdywan 12 months ago

As Marius mentioned being a little stuck in the daily, just as an idea, #132137 or #132134 might be useful to look into as a reference since they are also about multi-machine tests.

Actions #30

Updated by emiura 12 months ago

  • Due date deleted (2023-08-04)
  • Assignee deleted (mkittler)

mkittler wrote:

The networking problems seem to be resolved (or it is at least better).

I could check the bridge interface on openqaworker1 and it is configured correctly. I'm not sure yet why https://openqa.suse.de/tests/11665440#step/iscsi_client/93 still fails.

I'll try cloning some MM tests on the Nürnberg-located SAP workers and possibly enable them again.

We are also seeing a ton of failures in iscsi tests on SAP MM tests, like this one:
https://openqa.suse.de/tests/11674211

Actions #31

Updated by mkittler 12 months ago

Maybe it would be best for now to remove the tap worker class of openqaworker1. I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/571 for that.

Yesterday I investigated the situation on the SAP workers but could not find anything useful. Here are the findings nevertheless:

  • iptables-save does not show masquerading enabled, but that's also how it is on worker8 and worker10 and likely ok considering that firewalld is actually using nft under the hood.
  • nft list ruleset shows masquerading-related rules. I computed a diff of that command's output between sapworker1, worker8, and worker10, and besides the name of the ethernet interface there is no difference at all.
  • The SUTs can ping each other and the gateway (the bridge interface on sapworker1) but cannot reach any other hosts (which should supposedly work via the masquerading but does not).
  • Enabling masquerading via iptables, so that iptables-save shows -A POSTROUTING -j MASQUERADE followed by COMMIT, doesn't change anything.
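
The masquerading check from the findings can be scripted. A sketch that parses a captured `nft list ruleset` dump (the ruleset below is a hypothetical sample; on a worker you would run `sudo nft list ruleset | grep -i masquerade` directly):

```shell
#!/bin/sh
# Check a captured "nft list ruleset" dump for a masquerade rule, as done in
# the findings above. The table/chain names here are illustrative only.
cat > /tmp/nft.sample <<'EOF'
table ip firewalld {
    chain nat_POSTROUTING {
        type nat hook postrouting priority srcnat + 10; policy accept;
        oifname != "br1" masquerade
    }
}
EOF

if grep -qi 'masquerade' /tmp/nft.sample; then
    echo "masquerade rule present"
else
    echo "masquerade rule MISSING"
fi
```

Note that, per the findings, the rule being present does not by itself prove masquerading works; the SUT-to-outside ping test is still the decisive check.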
Actions #34

Updated by mkittler 12 months ago

  • Status changed from In Progress to Feedback

I've just tried a basic test scenario from o3 and the MM-SUTs also cannot reach outside hosts like openqa.opensuse.org (e.g. https://openqa.suse.de/tests/11677551#step/before_test/24). So the problems I have reproduced so far on sapworker1 don't seem to be specific to the test scenario.

Actions #35

Updated by okurz 12 months ago

  • Due date set to 2023-08-04
  • Assignee set to mkittler

Adding back the assignee and due date that emiura likely removed by accident.

Actions #36

Updated by okurz 12 months ago

  • Description updated (diff)
Actions #37

Updated by mkittler 12 months ago

  • Priority changed from Urgent to Normal

Currently worked around by disabling the tap worker class.

Actions #38

Updated by livdywan 12 months ago

  • Due date deleted (2023-08-04)
  • Status changed from Feedback to Workable
  • Assignee deleted (mkittler)

So current situation:

The workers are enabled in production but they don't do multi-machine tests, and the tap worker class needs to be added again. Check https://github.com/os-autoinst/scripts/ for the helper script.

Actions #39

Updated by okurz 12 months ago

  • Related to action #133025: Configure Virtual Interfaces instructions do not work on Leap 15.5 size:M added
Actions #40

Updated by okurz 12 months ago

  • Related to action #132137: Setup new PRG2 openQA worker for osd size:M added
Actions #41

Updated by okurz 12 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

livdywan wrote:

Check https://github.com/os-autoinst/scripts/ for the helper script

I guess you mean https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine

We can wait for #133025 and #132137 where we try to understand better how to do a consistent multi-machine setup and debugging approach

Actions #42

Updated by livdywan 9 months ago

We can wait for #133025 and #132137 where we try to understand better how to do a consistent multi-machine setup and debugging approach

#133025 to be resolved soon. So we can probably unblock? Maybe re-estimate since some time has passed?

Actions #43

Updated by okurz 9 months ago

  • Project changed from openQA Tests to openQA Infrastructure
  • Category deleted (Bugs in existing tests)
  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

Unblock, yes, but I think the suggestions from the ticket description are still valid and don't need re-estimation.

Actions #44

Updated by okurz 8 months ago

  • Target version changed from Ready to future

Sorry, the tools team cannot currently help with this.
