action #132827
closed: [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests size:M
Description
Observation
I can see that some tests are failing due to a DNS resolution issue on workers "sapworker*", especially on multi-machine tests. Can someone help check?
Some error messages as below:
https://openqa.suse.de/tests/11593878#step/salt_master/15
http://openqa.suse.de/tests/11594635#step/rsync_client/12
Reproducible
Expected result
I tried running the rsync tests on another worker without any issue: http://openqa.suse.de/tests/11594925#dependencies
Rollback steps
- Add back production worker class on all OSD machines mentioning #132827
Further details
There may be some network problems with workers "sapworker*"; based on my tests (at least for the rsync test), the same test passes on "worker5" but fails on "sapworker1".
Suggestions
- First ensure that all openQA workers have the salt state applied cleanly, e.g.
sudo salt --no-color -C 'G@roles:worker' state.apply
- Maybe the failure handling can be improved on the os-autoinst side, e.g. a better "die" message/reason
- As a temporary measure consider disabling the "tap" class on affected workers, e.g. make it tap_pooXXX (a sketch of a manual stop-gap follows below)
- Debug multi-machine capabilities according to http://open.qa/docs/#_verify_the_setup
- Ensure that our salt states provide everything that is needed to run stable multi-machine tests
- Add back production worker classes for all affected machines: openqaworker1, sapworker{1-7}, e.g. qesapworker-prg1-5
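A minimal sketch of such a manual stop-gap, assuming WORKER_CLASS contains ",tap" in /etc/openqa/workers.ini on the worker host (on OSD the file is normally managed via salt pillars, so this is only for quick experiments):
# inspect the currently configured worker classes
sudo grep WORKER_CLASS /etc/openqa/workers.ini
# drop the "tap" class (keeps a backup of the original file)
sudo sed -i.bak 's/,tap//' /etc/openqa/workers.ini
# restart the worker slots so they re-register with the new class
# (assumption: the usual openqa-worker-auto-restart@N units are in use)
sudo systemctl restart 'openqa-worker-auto-restart@*'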
Updated by rfan1 over 1 year ago
- Subject changed from [qe-tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" to [qe-tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests
Updated by okurz over 1 year ago
- Subject changed from [qe-tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests to [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by jbaier_cz over 1 year ago
Updated by okurz over 1 year ago
- Description updated (diff)
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/564 merged, added rollback steps
Updated by okurz over 1 year ago
- Status changed from New to In Progress
- Assignee set to okurz
Retriggering the corresponding failures:
for i in 1 2 3; do WORKER=sapworker$i host=openqa.suse.de result="result='failed'" failed_since="2023-07-15" openqa-advanced-retrigger-jobs ; done
Updated by okurz over 1 year ago
- Status changed from In Progress to New
- Assignee deleted (okurz)
Retriggered all tests. Now we should look into why those tests could not execute on those workers.
Updated by rfan1 over 1 year ago
I tried to compare sapworker1 and worker5, and I can find some clues now.
The GRE tunnels do not show up in the "ip a" output on sapworker1 (they do show up after I restarted the wicked service). Hopefully my finding helps.
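A quick way to check this; just a sketch, assuming the tunnels are regular GRE links managed by wicked (and possibly Open vSwitch ports):
# list GRE interfaces, if any are present
ip -br link show type gre
# how wicked sees the interfaces
sudo wicked ifstatus all | grep -i gre
# if the tunnels are Open vSwitch GRE ports, they show up here
sudo ovs-vsctl show
# workaround that brought them back in this case
sudo systemctl restart wicked
ip a | grep -i gre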
Updated by rfan1 over 1 year ago
https://openqa.suse.de/tests/11606807#dependencies The test passed on sapworker1 now.
I ran the following command on this host:
sudo systemctl restart wicked
Updated by rfan1 over 1 year ago
One more worker seems to hit the same issue: openqaworker1
https://openqa.suse.de/tests/11603529#step/rsync_client/12
Updated by mloviska over 1 year ago
- Related to action #132932: [qe-core] test fails in t01_basic - eth0 stays in setup-in-progres added
Updated by emiura over 1 year ago
I had one test failing on openqaworker1, logged in there, the GRE tunnels were down, so I restarted wicked as suggested. The tunnels appeared in "ip a s", but some tests still failed on network issues:
Updated by rfan1 over 1 year ago
emiura wrote:
I had one test failing on openqaworker1, logged in there, the GRE tunnels were down, so I restarted wicked as suggested. The tunnels appeared in "ip a s", but some tests still failed on network issues:
Not sure whether this command can help:
sudo ip link set gre0 up
Updated by rfan1 over 1 year ago
There are some differences in the firewall configuration (a sketch for diffing it across hosts follows after the dumps):
openqaworker1:
> sudo grep -v ^# /etc/firewalld/firewalld.conf
DefaultZone=trusted
MinimalMark=100
CleanupOnExit=yes
Lockdown=no
IPv6_rpfilter=yes
IndividualCalls=no
LogDenied=off
FirewallBackend=nftables
FlushAllOnReload=no
RFC3964_IPv4=yes
AllowZoneDrifting=no
worker5:
> sudo grep -v ^# /etc/firewalld/firewalld.conf
DefaultZone=trusted
MinimalMark=100
CleanupOnExit=yes
Lockdown=no
IPv6_rpfilter=yes
IndividualCalls=no
LogDenied=off
AutomaticHelpers=system
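One way to collect and compare such dumps in one go; a sketch with the two host names from above, assuming SSH access and passwordless sudo on the workers:
# fetch the effective firewalld configuration from both workers and diff it
for h in sapworker1 worker5; do
  ssh "$h" 'sudo grep -v "^#" /etc/firewalld/firewalld.conf' > "/tmp/firewalld.$h"
done
diff -u /tmp/firewalld.sapworker1 /tmp/firewalld.worker5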
Updated by emiura over 1 year ago
This test ended up with name resolution failure, even with what appears to be a correct network setup:
https://openqa.suse.de/tests/11623555#step/hostname/25
Updated by livdywan over 1 year ago
- Subject changed from [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests to [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
For the SAP workers the bridge device in workerconf.sls was never set to something that will actually work. I've just noticed that, so I'll assign this ticket to myself and fix this problem.
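For cross-checking on a worker whether the configured bridge actually exists, a sketch assuming the usual br1 Open vSwitch bridge used for tap/multi-machine setups:
# does the bridge device exist at all?
ip link show br1
# if it is an Open vSwitch bridge, its tap/gre ports show up here
sudo ovs-vsctl show
# wicked configuration for the bridge, if present
cat /etc/sysconfig/network/ifcfg-br1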
Updated by mkittler over 1 year ago
I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/568 for the SAP workers. I will apply this change manually tomorrow and re-run my test jobs again.
Not sure why openqaworker1 and worker5 were problematic. My SSH connection to openqaworker1 is so slow that I couldn't investigate much. Maybe the slow connection is actually also the source of the problem. On worker5 the correct bridge device is configured, so it is not just an obvious mistake as on the SAP workers. However, after reading all the comments it seems that worker5 is actually not problematic anyway and was just mentioned as a reference. So the suggestion "for all affected machines openqaworker1, worker5, sapworker{1-7}" from the ticket description is likely just misleading. Thus I will focus on the SAP workers for now.
Updated by rfan1 over 1 year ago
mkittler wrote:
I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/568 for the SAP workers. I will apply this change manually tomorrow and re-run my test jobs again.
Not sure why openqaworker1 and worker5 were problematic. My SSH connection to openqaworker1 is so slow that I couldn't investigate much. Maybe the slow connection is actually also the source of the problem. On worker5 the correct bridge device is configured, so it is not just an obvious mistake as on the SAP workers. However, after reading all the comments it seems that worker5 is actually not problematic anyway and was just mentioned as a reference. So the suggestion "for all affected machines openqaworker1, worker5, sapworker{1-7}" from the ticket description is likely just misleading. Thus I will focus on the SAP workers for now.
@mkittler, so far I haven't hit any issues on worker5. Let me remove it from the description part, since it may lead to some confusion.
Updated by openqa_review over 1 year ago
- Due date set to 2023-08-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 1 year ago
Thanks for the clarification.
I followed up on the Slack chats and apparently there are networking problems (it works but with huge packet loss). That would perfectly explain the poor SSH connection I had to openqaworker1 yesterday. I will only be able to investigate the situation with openqaworker1 once that problem has been resolved.
The same unfortunately applies to the SAP workers as well because those are also in the FC basement and at this point are affected equally badly.
Updated by mkittler over 1 year ago
- Blocked by action #133127: Frankencampus network broken + GitlabCi failed --> uploading artefacts added
Updated by okurz over 1 year ago
- Status changed from Blocked to Workable
#133127 was resolved. There might still be problems in the network due to this as I did not get any proper "problem resolution" message about the network problems. When we have specific problems we should just report them in corresponding tickets+SD-tickets and block on those. Right now sapworker1… are in salt and responsive, and ping results also look stable and good.
So you can continue the work here.
Updated by mkittler over 1 year ago
- Status changed from Workable to In Progress
The networking problems seem to be resolved (or it is at least better).
I could check the bridge interface on openqaworker1 and it is configured correctly. I'm not sure yet why https://openqa.suse.de/tests/11665440#step/iscsi_client/93 still fails.
I'll try cloning some MM tests on the Nürnberg-located SAP workers and possibly enable them again.
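For reference, cloning such an MM scenario onto a specific worker class typically looks roughly like this; a sketch rather than the exact command used here (the job ID is the one linked above, the worker class is a hypothetical placeholder):
# clone the failed MM job cluster within OSD, pinned to a specific worker class
openqa-clone-job --within-instance https://openqa.suse.de \
  --parental-inheritance 11665440 \
  _GROUP=0 WORKER_CLASS=sapworker1   # hypothetical worker class, adjust to the target worker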
Updated by emiura over 1 year ago
- Due date deleted (2023-08-04)
- Assignee deleted (mkittler)
I had one test failing on openqaworker1, logged in there, the GRE tunnels were down, so I restarted wicked as suggested. The tunnels appeared in "ip a s", but some tests still failed on network issues:
https://openqa.suse.de/tests/11614943
mkittler wrote:
The networking problems seem to be resolved (or it is at least better).
I could check the bridge interface on openqaworker1 and it is configured correctly. I'm not sure yet why https://openqa.suse.de/tests/11665440#step/iscsi_client/93 still fails.
I'll try cloning some MM tests on the Nürnberg-located SAP workers and possibly enable them again.
We are also seeing a ton of failures in iscsi tests on the SAP MM machine tests, like this one:
https://openqa.suse.de/tests/11674211
Updated by mkittler over 1 year ago
Maybe it would be best for now to remove the tap worker class of openqaworker1. I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/571 for that.
Yesterday I investigated the situation on the SAP workers but could not find out anything useful. Here are nevertheless the findings, with a condensed sketch after the list:
- "iptables-save" does not show masquerading enabled, but that's also how it is on worker8 and 10 and likely ok considering that firewalld is actually using nft under the hood.
- "nft list ruleset" shows masquerading-related rules. I computed a diff of the output of that command between sapworker1, worker8 and 10, and besides the name of the ethernet interface there is no difference at all.
- The SUTs can ping each other and the gateway (the bridge interface on sapworker1) but cannot reach any other hosts (which should supposedly work via the masquerading but does not).
- Enabling masquerading via iptables so that "iptables-save" shows "-A POSTROUTING -j MASQUERADE" followed by "COMMIT" doesn't change anything.
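The checks above, condensed into a sketch; the firewall-cmd query is an additional assumption about how masquerading would normally be inspected with firewalld (the "trusted" zone matches the DefaultZone shown earlier):
# is masquerading visible on the iptables level?
sudo iptables-save | grep -i masquerade
# masquerading-related rules on the nft level
sudo nft list ruleset | grep -i -B2 -A2 masquerade
# what firewalld itself thinks about masquerading in the active zone
sudo firewall-cmd --zone=trusted --query-masquerade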
Updated by mkittler over 1 year ago
Same proposal for the SAP workers: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/572
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
I've just tried a basic test scenario from o3 and the MM-SUTs also cannot reach outside hosts like openqa.opensuse.org (e.g. https://openqa.suse.de/tests/11677551#step/before_test/24). So the problems I have reproduced so far on sapworker1 don't seem to be specific to the test scenario.
Updated by okurz over 1 year ago
- Due date set to 2023-08-04
- Assignee set to mkittler
Adding back the assignee and due-date that emiura removed likely by accident.
Updated by mkittler over 1 year ago
- Priority changed from Urgent to Normal
Currently worked around by disabling the tap worker class.
Updated by livdywan over 1 year ago
- Due date deleted (2023-08-04)
- Status changed from Feedback to Workable
- Assignee deleted (mkittler)
So the current situation: the workers are enabled in production but they don't run multi-machine jobs, and the tap worker class needs to be added again. Check https://github.com/os-autoinst/scripts/ for the helper script
Updated by okurz over 1 year ago
- Related to action #133025: Configure Virtual Interfaces instructions do not work on Leap 15.5 size:M added
Updated by okurz over 1 year ago
- Related to action #132137: Setup new PRG2 openQA worker for osd size:M added
Updated by okurz over 1 year ago
- Status changed from Workable to Blocked
- Assignee set to okurz
livdywan wrote:
Check https://github.com/os-autoinst/scripts/ for the helper script
I guess you mean https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine
We can wait for #133025 and #132137, where we try to better understand how to achieve a consistent multi-machine setup and debugging approach
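For reference, that helper can be fetched and run directly on a worker host; just a sketch, and the script itself should be reviewed first for tunables such as the number of worker instances:
# fetch and run the multi-machine setup helper (review it before running)
wget https://raw.githubusercontent.com/os-autoinst/os-autoinst/master/script/os-autoinst-setup-multi-machine
sudo bash -ex os-autoinst-setup-multi-machine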
Updated by okurz about 1 year ago
- Project changed from openQA Tests (public) to openQA Infrastructure (public)
- Category deleted (Bugs in existing tests)
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
Unblocking, yes, but I think the suggestions from the ticket description are still valid and don't need re-estimation.
Updated by okurz about 1 year ago
- Target version changed from Ready to future
Sorry, tools-team can not currently help with that.
Updated by ybonatakis 2 months ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/911 Enable qesapworker-prg5 workers on OSD
Updated by ybonatakis 2 months ago
https://progress.opensuse.org/issues/132827#note-47 merged
Supplementary commit for the rest of the SAP workers: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/912
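After merging such pillar changes the state still needs to be applied and the result verified on the workers; a sketch reusing the command style from the suggestions above:
# apply the updated pillar on all workers
sudo salt --no-color -C 'G@roles:worker' state.apply
# verify that the SAP workers picked up the intended worker classes
sudo salt 'qesapworker-prg*' cmd.run 'grep WORKER_CLASS /etc/openqa/workers.ini'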
Updated by ybonatakis 2 months ago
- Status changed from Workable to Feedback
Note to self: check the workers over the next couple of days
Updated by ybonatakis 2 months ago
I found qesapworker-prg5:10 running some jobs since yesterday.
https://openqa.suse.de/admin/workers/2428
Updated by ybonatakis 2 months ago
- Status changed from Feedback to Resolved
The workers (qesapworker-prg{4,5,6,7}) look pretty good, picking up and running jobs. Verified in the meeting today, so I am closing this as resolved.