action #132827: [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

action #132827

closed

[tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests size:M

Added by rfan1 almost 2 years ago. Updated 7 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

ybonatakis

Category:

Regressions/Crashes

Target version:

openQA Project (public) - Ready

Start date:

2023-07-17

Due date:

% Done:

Estimated time:

Tags:

infra

Description

Observation¶

I can see that some tests are failing due to DNS resolve issue on workers "sapworker*", especially on multi-machine tests.can someone help check?

Some error messages as below:
https://openqa.suse.de/tests/11593878#step/salt_master/15
http://openqa.suse.de/tests/11594635#step/rsync_client/12

Reproducible¶

Failed test links

Expected result¶

I Tried with another worker to run the rsync tests without any issue: http://openqa.suse.de/tests/11594925#dependencies

Rollback steps¶

Add back production worker class on all OSD machines mentioning #132827

Further details¶

May be some network problems with workers "sapworker*", based on my tests [at least for rsync test result], the same test can pass with "worker5" but fail with "sapworker1"

Suggestions¶

First ensure that all openQA workers have the salt state applied cleanly, e.g. sudo salt --no-color -C 'G@roles:worker' state.apply
Maybe the failure can be improved on the os-autoinst side, like a better "die"message/reason
~~As temporary measure consider disabling the "tap" class from affected workers, e.g. make it tap_pooXXX~~
~~Debug multi-machine capabilities according to http://open.qa/docs/#_verify_the_setup~~
~~Ensure that our salt states ensure all what is needed to run stable multi-machine tests~~
Add back production worker classes for all affected machines ~~openqaworker1, sapworker{1-7}~~, e.g. qesapworker-prg1-5

Related issues 4 (1 open — 3 closed)

Actions

Copy link

Updated by rfan1 almost 2 years ago

Description updated (diff)

Actions

Copy link

Updated by rfan1 almost 2 years ago

Subject changed from [qe-tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" to [qe-tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests

Actions

Copy link

Updated by rfan1 almost 2 years ago

Description updated (diff)

Actions

Copy link

Updated by okurz almost 2 years ago

Subject changed from [qe-tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests to [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests
Priority changed from Normal to Urgent
Target version set to Ready

Actions

Copy link

Updated by jbaier_cz almost 2 years ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/564

Actions

Copy link

Updated by okurz almost 2 years ago

Description updated (diff)

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/564 merged, added rollback steps

Actions

Copy link

Updated by okurz almost 2 years ago

Status changed from New to In Progress
Assignee set to okurz

Retriggering according failures:

for i in 1 2 3; do WORKER=sapworker$i host=openqa.suse.de result="result='failed'" failed_since="2023-07-15" openqa-advanced-retrigger-jobs ; done

Actions

Copy link

Updated by okurz almost 2 years ago

Status changed from In Progress to New
Assignee deleted (~~okurz~~)

Retriggered all tests. Now we should look into why those tests could not execute on those workers.

Actions

Copy link

Updated by rfan1 almost 2 years ago

I tried to compare sapworker1 and worker5! and I can find some clue now.

GRE tunnels are not shown up in ip a command on sapworker1. [they can show up after I tried to restart wicked service]. hopefully my finding can help something.

Actions

Copy link

#10

Updated by rfan1 almost 2 years ago

https://openqa.suse.de/tests/11606807#dependencies The test passed on sapworker1 now.

I did below operation on this host:
sudo systemctl restart wicked

Actions

Copy link

#11

Updated by rfan1 almost 2 years ago

One more worker seems hit the same issue:openqaworker1
https://openqa.suse.de/tests/11603529#step/rsync_client/12

Actions

Copy link

#12

Updated by mloviska almost 2 years ago

Related to action #132932: [qe-core] test fails in t01_basic - eth0 stays in setup-in-progres added

Actions

Copy link

#13

Updated by emiura almost 2 years ago

I had one test failing on openqaworker1, logged in there, gre tunnels were down, did restart wicked as suggested. Tunnels appeared on "ip a s", but had some tests failing on network issues:

https://openqa.suse.de/tests/11614943

Actions

Copy link

#14

Updated by rfan1 almost 2 years ago

emiura wrote:

I had one test failing on openqaworker1, logged in there, gre tunnels were down, did restart wicked as suggested. Tunnels appeared on "ip a s", but had some tests failing on network issues:

https://openqa.suse.de/tests/11614943

Not sure the command can help:

sudo ip link set gre0 up

Actions

Copy link

#15

Updated by rfan1 almost 2 years ago

There is some difference for firewall configuration:

openqaworker1:

> sudo grep -v ^#  /etc/firewalld/firewalld.conf

DefaultZone=trusted

MinimalMark=100

CleanupOnExit=yes

Lockdown=no

IPv6_rpfilter=yes

IndividualCalls=no

LogDenied=off

FirewallBackend=nftables

FlushAllOnReload=no

RFC3964_IPv4=yes

AllowZoneDrifting=no

worker5:

> sudo grep -v ^#  /etc/firewalld/firewalld.conf

DefaultZone=trusted

MinimalMark=100

CleanupOnExit=yes

Lockdown=no

IPv6_rpfilter=yes

IndividualCalls=no

LogDenied=off

AutomaticHelpers=system

Actions

Copy link

#16

Updated by emiura almost 2 years ago

This test ended up with name resolution failure, even with what appears to be a correct network setup:
https://openqa.suse.de/tests/11623555#step/hostname/25

Actions

Copy link

#18

Updated by livdywan almost 2 years ago

Subject changed from [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests to [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

#19

Updated by mkittler almost 2 years ago

Status changed from Workable to In Progress
Assignee set to mkittler

For sap workers the bridge device in workerconf.sls was never set to something that will actually work. I've just noticed that so I'll assign this ticket to myself and fix this problem.

Actions

Copy link

#20

Updated by mkittler almost 2 years ago

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/568 for the SAP workers. I will apply this change manually tomorrow and re-run my test jobs again.

Not sure why openqaworker1 and worker5 were problematic. My SSH connection to openqaworker1 is so slow that I couldn't investigate much. Maybe the slow connection is actually also the source of the problem. On worker5 the correct bridge device is configured so it is not just an obvious mistake as on the SAP workers. However, after reading all the comments is seems that worker5 is actually not problematic anyways and was just mentioned as reference. So the suggestion "for all affected machines openqaworker1, worker5, sapworker{1-7}" from the ticket description is likely just misleading. Thus I will focus on the SAP workers for now.

Actions

Copy link

#21

Updated by rfan1 almost 2 years ago

mkittler wrote:

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/568 for the SAP workers. I will apply this change manually tomorrow and re-run my test jobs again.

Not sure why openqaworker1 and worker5 were problematic. My SSH connection to openqaworker1 is so slow that I couldn't investigate much. Maybe the slow connection is actually also the source of the problem. On worker5 the correct bridge device is configured so it is not just an obvious mistake as on the SAP workers. However, after reading all the comments is seems that worker5 is actually not problematic anyways and was just mentioned as reference. So the suggestion "for all affected machines openqaworker1, worker5, sapworker{1-7}" from the ticket description is likely just misleading. Thus I will focus on the SAP workers for now.

@mkittler, so far, I didn't hit any issues on worker5. let me remove it in description part. since it may lead some confustion.

Actions

Copy link

#22

Updated by rfan1 almost 2 years ago

Description updated (diff)

Actions

Copy link

#23

Updated by openqa_review almost 2 years ago

Due date set to 2023-08-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#24

Updated by mkittler almost 2 years ago

Thanks for the clarification.

I followed-up the Slack chats and apparently there are networking problems (it works but with huge package loss). That would explain the poor SSH connection I had to openqaworker1 perfectly yesterday perfectly. I will only be able to investigate the situation with openqaworker1 once that problem has been resolved.

The same counts unfortunately for the SAP workers as well because those are also in FC basement and at this point also affected equally bad.

Actions

Copy link

#25

Updated by mkittler almost 2 years ago

Status changed from In Progress to Blocked

Actions

Copy link

#26

Updated by mkittler almost 2 years ago

Blocked by action #133127: Frankencampus network broken + GitlabCi failed --> uploading artefacts added

Actions

Copy link

#27

Updated by okurz over 1 year ago

Status changed from Blocked to Workable

#133127 was resolved. There might still be problems in the network due to this as I did not get any proper "problem resolution" message about the network problems. When we have specific problems we should just report them in according tickets+SD-tickets and block on those. Right now sapworker1… are in salt and responsive, ping results also look stable and good.#133127 was resolved. There might still be problems in the network due to this as I did not get any proper "problem resolution" message about the network problems. When we have specific problems we should just report them in according tickets+SD-tickets and block on those. Right now sapworker1… are in salt and responsive, ping results also look stable and good.

So you can continue the work here.

Actions

Copy link

#28

Updated by mkittler over 1 year ago

Status changed from Workable to In Progress

The networking problems seem to be resolved (or it is at least better).

I could check the bride interface on openqaworker1 and it is configured correctly. I'm not sure yet why https://openqa.suse.de/tests/11665440#step/iscsi_client/93 still fails.

I'll try cloning some MM tests on the Nürnberg-located SAP workers and possibly enable them again.

Actions

Copy link

#29

Updated by livdywan over 1 year ago

As Marius mentioned being a little stuck in the daily, just as an idea #132137 or #132134 might useful to look into as a reference since it's also about multi-machine tests.

Actions

Copy link

#30

Updated by emiura over 1 year ago

Due date deleted (~~2023-08-04~~)
Assignee deleted (~~mkittler~~)

I had one test failing on openqaworker1, logged in there, gre tunnels were down, did restart wicked as suggested. Tunnels appeared on "ip a s", but had some tests failing on network issues:

https://openqa.suse.de/tests/11614943

mkittler wrote:

The networking problems seem to be resolved (or it is at least better).

I could check the bride interface on openqaworker1 and it is configured correctly. I'm not sure yet why https://openqa.suse.de/tests/11665440#step/iscsi_client/93 still fails.

I'll try cloning some MM tests on the Nürnberg-located SAP workers and possibly enable them again.

We are also seen a ton of failure on iscsi tests on SAP MM machine tests, like this one:
https://openqa.suse.de/tests/11674211

mkittler wrote:

The networking problems seem to be resolved (or it is at least better).

I could check the bride interface on openqaworker1 and it is configured correctly. I'm not sure yet why https://openqa.suse.de/tests/11665440#step/iscsi_client/93 still fails.

I'll try cloning some MM tests on the Nürnberg-located SAP workers and possibly enable them again.

Actions

Copy link

#31

Updated by mkittler over 1 year ago

Maybe it would be best for now to remove the tap worker class of openqaworker1. I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/571 for that.

Yesterday I investigated the situation on the SAP workers but could not find out anything useful. Here are nevertheless the findings:

iptables-save does not show masquerading enabled but that's also how it is on worker8 and 10 and likely ok considering that firewalld is actually using ntf under the hood.
nft list ruleset shows masquerading-related rules. I computed a diff of the output of that command between sapworker1 worker8 and 10 and besides the name of the ethernet interface there is no difference at all.
The SUTs can ping each other and the gateway (the bridge interface on sapworker1) but cannot reach any other hosts (which should supposedly work via the masquerading but does not).
Enable masquerading via iptables so that iptables-save shows -A POSTROUTING -j MASQUERADE followed by COMMIT doesn't change anything.

Actions

Copy link

#33

Updated by mkittler over 1 year ago

Same proposal for the SAP workers: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/572

Actions

Copy link

#34

Updated by mkittler over 1 year ago

Status changed from In Progress to Feedback

I've just tried a basic test scenario from o3 and the MM-SUTs also cannot reach outside hosts like openqa.opensuse.org (e.g. https://openqa.suse.de/tests/11677551#step/before_test/24). So the problems I have reproduced so far on sapworker1 don't seem to be specific to the test scenario.

Actions

Copy link

#35

Updated by okurz over 1 year ago

Due date set to 2023-08-04
Assignee set to mkittler

Adding back the assignee and due-date that emiura removed likely by accident.

Actions

Copy link

#36

Updated by okurz over 1 year ago

Description updated (diff)

Actions

Copy link

#37

Updated by mkittler over 1 year ago

Priority changed from Urgent to Normal

Currently worked around by disabling the tap worker class.

Actions

Copy link

#38

Updated by livdywan over 1 year ago

Due date deleted (~~2023-08-04~~)
Status changed from Feedback to Workable
Assignee deleted (~~mkittler~~)

So current situation:

The workers are enabled in production but they don't do multi machine and the tap worker class needs to be added again. Check https://github.com/os-autoinst/scripts/ for the helper script

Actions

Copy link

#39

Updated by okurz over 1 year ago

Related to action #133025: Configure Virtual Interfaces instructions do not work on Leap 15.5 size:M added

Actions

Copy link

#40

Updated by okurz over 1 year ago

Related to action #132137: Setup new PRG2 openQA worker for osd size:M added

Actions

Copy link

#41

Updated by okurz over 1 year ago

Status changed from Workable to Blocked
Assignee set to okurz

livdywan wrote:

Check https://github.com/os-autoinst/scripts/ for the helper script

I guess you mean https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine

We can wait for #133025 and #132137 where we try to understand better how to do a consistent multi-machine setup and debugging approach

Actions

Copy link

#42

Updated by livdywan over 1 year ago

We can wait for #133025 and #132137 where we try to understand better how to do a consistent multi-machine setup and debugging approach

#133025 to be resolved soon. So we can probably unblock? Maybe re-estimate since some time has passed?

Actions

Copy link

#43

Updated by okurz over 1 year ago

Project changed from openQA Tests (public) to openQA Infrastructure (public)
Category deleted (~~Bugs in existing tests~~)
Status changed from Blocked to Workable
Assignee deleted (~~okurz~~)