action #136130
closed
test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M
Added by okurz over 1 year ago. Updated about 1 year ago.
0%
Description
Observation
openQA test in scenario sle-15-SP5-Online-QR-HA-ppc64le-ha_ctdb_node02@ppc64le-2g fails in iscsi_client with
"Test died: command 'curl --form upload=@/var/log/zypper.log --form upname=iscsi_client-zypper.log http://10.0.2.2:20033/yu0FjwLwsVCGi_Fj/uploadlog/zypper.log' failed at /usr/lib/os-autoinst/testapi.pm line 926."
running on malbec and parallel jobs run on petrol or diesel. I suspect incomplete GRE tunnel config for those machine cluster definitions. Likely due to salt state not completely applied or the "diesel/diesel-1" mismatch as in #134864
Reproducible
Fails since (at least) Build 118.2 (current job)
Steps to reproduce
Find jobs referencing this ticket with the help of https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label, call openqa-query-for-job-label poo#136130
Expected result
Last good: 118.2 (or more recent)
Steps
- Fix the lookup, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/997 (merged): done
- Ensure that transient + static hostname + minion_id are for the actual host, not the interface, so without the "-1" suffix: done
- Remove the "-1" suffix in workerconf accordingly
- Ensure that salt mine and grains are up-to-date and salt high state is applied: done
- Ensure that both diesel+petrol can work on multi-machine jobs
Rollback steps
Further details
Always latest result in this scenario: latest
Updated by okurz over 1 year ago
- Description updated (diff)
- Status changed from New to In Progress
Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/619 to exclude petrol from tap-class. diesel does not have the tap class. I could not find the petrol IP mentioned in malbec:/etc/wicked/scripts/gre_tunnel_preup.sh . That might explain it.
Updated by okurz over 1 year ago
I suspect that the mistake in the reverse PTR that should be fixed by https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4043 is at fault here, so let's wait for that MR. In the meantime I am trying to handle related openQA job failures.
Trying to mitigate with
for i in malbec powerqaworker-qam-1 petrol; do env WORKER=$i result="result='failed'" failed_since=2023-09-18 host=openqa.suse.de bash -ex openqa-advanced-retrigger-jobs; done
Updated by nicksinger over 1 year ago
Not sure if this is the only reason. I ran a state.highstate on malbec and can confirm that after a successful run petrol is missing in the gre_tunnel_preup.sh script. We just check for "tap" in WORKER_CLASS so even with your changes it still should show up. So I suspect some issue in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls?ref_type=heads#L36-38
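A quick way to repeat that check after each highstate could look like the following sketch. It assumes the generated script references each peer as "remote_ip=<IP>" (as the ovs-vsctl GRE calls do); the function name missing_gre_peers and the example IPs are hypothetical.

```shell
# Hypothetical helper: print every given peer IP that is NOT referenced as a
# GRE remote_ip in the given tunnel script.
# Assumption: peers appear in the script as "remote_ip=<IP>".
missing_gre_peers() (
    script=$1; shift
    for ip in "$@"; do
        # Escape the dots so they only match literal dots in the regex
        pattern=$(printf '%s' "$ip" | sed 's/\./\\./g')
        # Require a non-address character (or end of line) after the IP so
        # that 192.0.2.25 does not match 192.0.2.252
        if ! grep -Eq "remote_ip=${pattern}([^0-9.]|\$)" "$script"; then
            echo "$ip"
        fi
    done
)
```

Example call with placeholder addresses: missing_gre_peers /etc/wicked/scripts/gre_tunnel_preup.sh 192.0.2.10 192.0.2.11. No output means all listed peers are present.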
Updated by okurz over 1 year ago
Updated by nicksinger over 1 year ago
okurz wrote in #note-4:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/620
I've fixed the underlying problem in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/997
Updated by openqa_review over 1 year ago
- Due date set to 2023-10-05
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 1 year ago
- Subject changed from test fails in iscsi_client auto_review:"(?s)ppc64le.*Test died: command.*curl":retry to test fails in iscsi_client auto_review:"(?s)ppc64le.*Test died: command.*curl":retry due to salt 'host'/'nodename' confusion size:M
- Description updated (diff)
Updated by okurz over 1 year ago
- Due date deleted (2023-10-05)
- Status changed from In Progress to Workable
- Assignee deleted (okurz)
- Priority changed from Urgent to High
After https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/997 we should continue with the next suggestions. Unassigning due to multiple new other tickets after the urgency was mitigated.
Updated by ggardet_arm about 1 year ago
It makes some aarch64 tests to restart, such as: https://openqa.opensuse.org/tests/3616202#comments
Updated by nicksinger about 1 year ago
- Status changed from Workable to In Progress
Updated by nicksinger about 1 year ago
- Subject changed from test fails in iscsi_client auto_review:"(?s)ppc64le.*Test died: command.*curl":retry due to salt 'host'/'nodename' confusion size:M to test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M
ggardet_arm wrote in #note-11:
It makes some aarch64 tests to restart, such as: https://openqa.opensuse.org/tests/3616202#comments
Interesting. Definitely not related but I wonder why the regex matches. I'll remove it for now.
Updated by nicksinger about 1 year ago
Cloned some tests with openqa-clone-job --within-instance https://openqa.suse.de --apikey D3576CEBF3529E39 --apisecret 63CF71588E55E56B 12230681 _GROUP=0 WORKER_CLASS=tap_poo136130
Unfortunately I messed up and _GROUP=0 still assigned a job-group. https://openqa.suse.de/tests/12376246 is now a validation for "all mm test on petrol-1".
Updated by openqa_review about 1 year ago
- Due date set to 2023-10-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 1 year ago
nicksinger wrote in #note-14:
Cloned some tests with
openqa-clone-job --within-instance https://openqa.suse.de --apikey D3576CEBF3529E39 --apisecret 63CF71588E55E56B 12230681 _GROUP=0 WORKER_CLASS=tap_poo136130
Unfortunately I messed up and _GROUP=0 still assigned a job-group. https://openqa.suse.de/tests/12376246 is now a validation for "all mm test on petrol-1".
use openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.suse.de 12230681 _GROUP=0 WORKER_CLASS=tap_poo136130 TEST+=-poo136130 BUILD=poo136130
Updated by nicksinger about 1 year ago
Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/645 and https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4160 to get rid of the suffix in the hostname and openqa
Updated by okurz about 1 year ago
- Related to action #137603: [alert] Queue: State (SUSE) - too few jobs executed alert size:S added
Updated by nicksinger about 1 year ago
The host diesel can in theory complete jobs while called diesel, which can be seen here: https://openqa.suse.de/tests/12443043. But as visible in https://openqa.suse.de/admin/workers/3388 I stumbled over an odd behavior where the host changes its hostname back to diesel-1 after some time (therefore the incomplete: the name changed -> salt changed the config -> worker restarted mid job run to reload changes). I tried to set DHCLIENT_SET_HOSTNAME='no' in /etc/sysconfig/network/ifcfg-eth3 and /etc/sysconfig/network/dhcp but still observed the changing hostname after some longer time period (some hours), which is odd. I found some hints in the wicked man pages regarding a wicked extension but I'm not sure if this is related to the sysconfig options or where else to configure it, so I asked in https://suse.slack.com/archives/C02D92APKNU/p1697550296052419 for some help on what I could be missing.
I verified all settings again and rebooted the machine, which has been running for 1h now without changing back. I started a tcpdump for DHCP packets now to check if I can correlate the hostname change with a DHCP packet coming in (e.g. after lease expiration). The command used is: tcpdump -i eth3 port 67 or port 68 -e -n -vvvv -w /root/dhcp_hostname_change
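To correlate the capture with the actual rename, it helps to record when /etc/hostname changes. The following is only a sketch of such a poller; the function name log_content_changes and its arguments (file, interval in seconds, number of polls) are hypothetical. It tells when the content changed, not which process changed it.

```shell
# Hypothetical helper: poll a file and print a timestamped line whenever its
# content differs from the previous poll.
# Arguments: file, poll interval in seconds (default 60), number of polls
# (default 10).
log_content_changes() (
    file=$1; interval=${2:-60}; polls=${3:-10}
    last=$(cat "$file" 2>/dev/null)
    i=0
    while [ "$i" -lt "$polls" ]; do
        sleep "$interval"
        current=$(cat "$file" 2>/dev/null)
        if [ "$current" != "$last" ]; then
            # Timestamp format chosen to be easy to compare with tcpdump output
            echo "$(date '+%Y-%m-%dT%H:%M:%S%z') $file: '$last' -> '$current'"
            last=$current
        fi
        i=$((i + 1))
    done
)
```

For example, log_content_changes /etc/hostname 60 1440 >> /root/hostname_changes.log would poll once a minute for a day; the logged timestamps can then be compared with the packet times shown by tcpdump -r /root/dhcp_hostname_change.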
Updated by okurz about 1 year ago
- Due date changed from 2023-10-19 to 2023-10-24
We discussed this in the weekly unblock and, given that we struggle to find a conclusion, we (in particular dheidler and okurz) suggest in general to go with a mixed approach of defining a static DHCP lease for machines where only a single system interface is connected, to avoid all those problems and CNAME redirection and such. That means we should not change all other machine entries according to explicit interface numbering. So what is left to do is to "just get diesel+petrol working", so less work. Exceptionally bumping the due date due to spontaneous vacation.
Updated by nicksinger about 1 year ago
I ran another test by masking systemd-hostnamed but still the contents of /etc/hostname changed, so apparently some tool writes to that file directly. Anyhow, guess this completes my experiments and I will just try to get the workers back online by changing our DHCP/DNS entries to refer only to the host itself without any interface enumeration. I will prepare a MR in the ops repo.
Updated by nicksinger about 1 year ago
- Status changed from In Progress to Blocked
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4234 created. I also reverted my changes back on diesel and unmasked systemd-hostnamed again. While we have aliases for the worker config available I don't think it makes sense to run validation runs now until the ops MR is in place because the whole MM-setup heavily depends on these names and it would just require new validation runs after the rename gets into production. Therefore "Blocked" until it is merged.
Updated by nicksinger about 1 year ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines/845191 failed due to petrol not having the correct highstate. I fixed it manually by running an explicit highstate and the pipeline/job succeeded: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1914672
Updated by livdywan about 1 year ago
- Status changed from Blocked to In Progress
- Priority changed from High to Urgent
nicksinger wrote in #note-23:
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4234 created
The MR was merged yesterday, and we're now getting Queue: State (SUSE) alerts - maybe because of this? The relation wasn't/isn't clear to me from the alerts. Not entirely sure yet if this was a consequence of the MR being merged or Nick's working on it. Let's treat it as Urgent, though, since we have no other machines currently
Updated by nicksinger about 1 year ago
DNS/DHCP for both diesel and petrol are now changed back to without any number suffix. I tried to validate MM test yesterday and unfortunately they fail cross-machine: https://openqa.suse.de/tests/12639783#step/multipath_iscsi/26 - not sure what this is caused by. Single machine tests however complete successfully: https://openqa.suse.de/tests/12638911 and therefore I will add the normal qemu_ppc64 class into production while keeping MM disabled for now.
Updated by nicksinger about 1 year ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/659 should help the ppc64le queue for single machine tests
Updated by livdywan about 1 year ago
- Due date changed from 2023-10-24 to 2023-10-27
Note I'm updating the due date accordingly
Updated by nicksinger about 1 year ago
My changes in the OPS salt didn't apply, now petrol and diesel get random IPs assigned causing the pipeline to fail: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1929411
I created https://sd.suse.com/servicedesk/customer/portal/1/SD-136220 for help
Updated by nicksinger about 1 year ago
I had to manually apply the salt-states on walter1 and walter2 by issuing salt-call state.apply over ssh on them. With that I was able to recover both workers and they worked flawlessly on single-machine jobs in OSD. However, diesel seems to be very unstable and experiences random crashes. Petrol seems stable at least.
Updated by okurz about 1 year ago
yes, diesel is running an updated kernel when it shouldn't. running the revert now
zypper rm -u kernel-default-extra-5.14.21-150500.55.31.1.ppc64le kernel-default-5.3.18-150300.59.93.1.ppc64le kernel-default-5.14.21-150500.55.31.1.ppc64le kernel-default-optional-5.14.21-150500.55.31.1.ppc64le && zypper al -m "poo#119008, kernel regression boo#1202138" kernel* && sync && reboot
Updated by nicksinger about 1 year ago
machine was unable to boot because no kernel and no initrd were installed. I used the petitboot rescue shell to kexec into a leap live system and chrooted into the system. Afterwards I added the Leap15.3 repos with lower prio:
zypper ar -p 105 http://download.opensuse.org/distribution/leap/15.3/repo/oss/ repo-oss-15.3
zypper ref repo-oss-15.3
after that I was able to force install the old kernel from 15.3 with:
zypper in -f kernel-default-5.14.21-150500.55.31.1
which regenerated all necessary files again. After a reboot petitboot was able to find the system again and I could successfully start the normal system which is online again and already assigned all resources to pending jobs. I keep an eye on the stability of that system.
Updated by okurz about 1 year ago
I checked that both diesel+petrol execute tests just fine. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 Shows a critically long ppc64le job queue but openQA is also working down the queue.
Updated by nicksinger about 1 year ago
- Priority changed from Urgent to High
Machine seems to be pretty stable now. I consider this and the slowly falling amount of ppc64le tickets enough to reduce priority for now. However, I now need to start at square one and figure out why cross-machine MM tests don't work as expected.
Updated by livdywan about 1 year ago
- Due date changed from 2023-10-27 to 2023-11-10
nicksinger wrote in #note-34:
Machine seems to be pretty stable now. I consider this and the slowly falling amount of ppc64le tickets enough to reduce priority for now. However, I now need to start at square one and figure out why cross-machine MM tests don't work as expected.
Let's be realistic about the due date then
Updated by okurz about 1 year ago
- Copied to coordination #139010: [epic] Long OSD ppc64le job queue added
Updated by okurz about 1 year ago
Next steps:
- Update description to not include malbec+powerqaworker-qam-1 anymore as they are offline and will stay offline
- Ensure multi-machine tests are working on at least one of diesel+petrol and enable multi-machine production use
- Check multi-machine tests across diesel and petrol and ensure multi-machine tests work using GRE tunnels
- Continue with the other steps in ticket description suggestion section
Updated by livdywan about 1 year ago
- Description updated (diff)
Let's call these steps, since you are checking them off one by one.
Updated by okurz about 1 year ago
- Copied to action #139136: Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M added
Updated by livdywan about 1 year ago
- Due date changed from 2023-11-10 to 2023-11-17
Not expecting this to be worked on this week, hence bumping the due date
Updated by mkittler about 1 year ago
I scheduled some test jobs to check the current state: https://openqa.suse.de/tests/12804036#dependencies
Updated by livdywan about 1 year ago
Notes from our collaborative session:
- Ticket about ensuring salt-based setup works https://progress.opensuse.org/issues/136013
- First firewall fix https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987
- Explicit net.ipv4 fix https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1004
- Confirm our docs https://docs.openvswitch.org/en/latest/howto/tunneling/#testing
- Prepare a simplified test e.g.
- based on wicked involving its own dhcp and clear checking of IPv4 and IPv6 setup
- extend openqa_worker tests to check that os-autoinst-setup-multi-machine actually created a working network setup, by adding a second worker and checking that one can talk to the other https://github.com/os-autoinst/os-autoinst-distri-openQA/blob/979ea1766de9722a6fa1603c6a2a15c1c108c4ea/tests/install/openqa_worker.pm#L19
- See how we can test the same case on x86_64 and ppc - in practice many if not most tests are never run on power
- Come up with a verifier based on Open vSwitch troubleshooting guidelines
- Consider double-checking a possible relation to other mm issues #139070 #138707 #139154
- Check if the -1 suffix in workerconf is (ir)relevant after all
- Verify with https://openqa.suse.de/tests/12188617#step/iscsi_client/10
- Implement openqa-cli support for triggering of jobs on one or multiple specific workers
Updated by livdywan about 1 year ago
- Related to action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M added
Updated by okurz about 1 year ago
- Assignee changed from nicksinger to mkittler
As discussed in the daily infra call, mkittler will try to reproduce the multi-machine test scenarios. A more generalized "link to latest", independent of "QR" assets which might be missing by now, is
https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&machine=ppc64le-2g&test=ha_ctdb_node02
Updated by okurz about 1 year ago
- Related to action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added
Updated by mkittler about 1 year ago
I cloned the mentioned scenario with worker classes for diesel/petrol: https://openqa.suse.de/tests/12810996
By the way, despite the tap worker class being removed from petrol it looks like the MM configuration is up-to-date enough on that machine. (The IPs of petrol/diesel are configured correctly on either end.)
EDIT: The jobs have finished now. The server ran on petrol and the other client job that ran there as well did not fail in iscsi_client, which seems to be the critical module. The other client job that ran on diesel failed in that module (and the other finished as parallel_failed). So the issue is definitely still reproducible.
Just for the record, I cloned the jobs by running sudo openqa-clone-job --skip-download --export-command --skip-chained-deps https://openqa.suse.de/tests/12799476 {TEST,BUILD}+='test-ppc-mm' _GROUP=0 on OSD and then executing the returned command with worker classes replaced accordingly.
Updated by mkittler about 1 year ago
I invoked the ovs-vsctl commands on mania/petrol/diesel manually to allow traffic in all directions.
I scheduled one cluster between diesel and mania (https://openqa.suse.de/tests/12816967#dependencies) and one between petrol and mania (https://openqa.suse.de/tests/12816974#dependencies).
Updated by mkittler about 1 year ago
The diesel/mania jobs have already finished, reproducing the issue. The petrol/mania jobs have already passed the critical module, not reproducing the issue.
That means diesel is the problem.
Note that the "roles" of the different hosts were identical. So this could not have made a difference. (I just replaced "diesel" with "petrol" in the worker class assignments when invoking the 2nd API call for petrol.)
Or I messed up the ovs-vsctl commands for the connection between diesel/mania. Just for the record, I invoked the following commands:
martchus@mania:~> sudo ovs-vsctl --may-exist add-port br1 gre20 -- set interface gre20 type=gre options:remote_ip=10.168.192.252 # mania -> diesel
martchus@mania:~> sudo ovs-vsctl --may-exist add-port br1 gre21 -- set interface gre21 type=gre options:remote_ip=10.168.192.254 # mania -> petrol
martchus@diesel:~> sudo ovs-vsctl --may-exist add-port br1 gre21 -- set interface gre21 type=gre options:remote_ip=10.168.192.108 # diesel -> mania
martchus@petrol:~> sudo ovs-vsctl --may-exist add-port br1 gre21 -- set interface gre21 type=gre options:remote_ip=10.168.192.108 # petrol -> mania
(The connection between diesel and petrol was already in place. I reused gre20/gre21 for simplicity because we don't need a connection to the x86_64 worker using this gre interface anyways. I have also checked that the IPs show up in the output of ovs-vsctl show.)
Considering I haven't done anything differently on diesel I don't think that's the case, though.
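To compare the tunnel endpoints across hosts without reading the full ovs-vsctl show output by eye, a small filter like the following sketch can help. It assumes the usual output layout, where an 'Interface NAME' line is followed by an 'options: {remote_ip="IP"}' line; the function name gre_remote_ips is hypothetical.

```shell
# Hypothetical helper: read `ovs-vsctl show` output on stdin and print one
# "<interface> <remote_ip>" pair per GRE endpoint found.
# Assumption: 'Interface NAME' precedes the matching 'options: {remote_ip=...}'.
gre_remote_ips() {
    awk '
        # Remember the most recent interface name, without quotes
        $1 == "Interface" { iface = $2; gsub(/"/, "", iface) }
        # Extract the address from remote_ip="a.b.c.d"
        /remote_ip=/ {
            if (match($0, /remote_ip="?[0-9.]+/)) {
                ip = substr($0, RSTART, RLENGTH)
                sub(/remote_ip="?/, "", ip)
                print iface, ip
            }
        }
    '
}
```

Usage on each host: sudo ovs-vsctl show | gre_remote_ips. The resulting lists from mania, diesel and petrol can then be checked against each other.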
Updated by mkittler about 1 year ago
All petrol/mania jobs have now finished successfully. As a first improvement I can create a MR to enable the tap worker class on petrol and mania.
Updated by mkittler about 1 year ago
I restarted the petrol/mania cluster to see whether it wasn't just luck: https://openqa.suse.de/tests/12817482
I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/677 to enable the tap worker class on petrol.
Updated by mkittler about 1 year ago
- Status changed from In Progress to Feedback
Updated by mkittler about 1 year ago
I made a diff of sysctl -a | sort on petrol and diesel. The most interesting output of colordiff -Naur sysctl-petrol sysctl-diesel:
-kernel.hostname = petrol
+kernel.hostname = diesel
-kernel.osrelease = 5.3.18-150300.59.93-default
+kernel.osrelease = 5.3.18-57-default
-net.ipv4.conf.erspan0.arp_notify = 0
+net.ipv4.conf.erspan0.arp_notify = 1
-net.ipv4.tcp_mem = 195231 260310 390462
+net.ipv4.tcp_mem = 195273 260364 390546
-net.netfilter.nf_conntrack_tcp_ignore_invalid_rst = 0
Otherwise the diff mainly shows hardware differences (like a different number of CPU cores) and different UUIDs and a little bit lower limits like:
-kernel.pid_max = 114688
+kernel.pid_max = 65536
The output of ip r, ip a and ovs-vsctl show also doesn't show any relevant differences (but maybe Dirk has more input on that).
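For repeating that comparison, the following sketch diffs two saved sysctl dumps while dropping keys that are expected to differ between hosts. The function name sysctl_diff and the ignore pattern are assumptions for illustration, not an exhaustive list of host-specific keys.

```shell
# Hypothetical helper: unified diff of two saved `sysctl -a` dumps, ignoring
# keys that are expected to differ per host (illustrative pattern only).
sysctl_diff() (
    ignore='^(kernel\.hostname|kernel\.random\.|fs\.(dentry-state|file-nr|inode-(nr|state)))'
    a=$(mktemp); b=$(mktemp)
    # Remove the temporary files when the subshell exits
    trap 'rm -f "$a" "$b"' EXIT
    grep -Ev "$ignore" "$1" | sort > "$a"
    grep -Ev "$ignore" "$2" | sort > "$b"
    # Exit status is that of diff: 0 if identical, 1 if different
    diff -u "$a" "$b"
)
```

The dumps can be created on each host with sysctl -a > sysctl-$(hostname) and then compared with sysctl_diff sysctl-petrol sysctl-diesel.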
Updated by okurz about 1 year ago
mkittler wrote in #note-56:
I made a diff of sysctl -a | sort on petrol and diesel. The most interesting output of colordiff -Naur sysctl-petrol sysctl-diesel:
-kernel.hostname = petrol
+kernel.hostname = diesel
-kernel.osrelease = 5.3.18-150300.59.93-default
+kernel.osrelease = 5.3.18-57-default
The kernel version 5.3.18 is correct as both should be downgraded due to #119008, kernel regression boo#1202138. However, the kernel version on diesel looks like it's the outdated GM version and should be the same as on petrol. Please install 5.3.18-150300.59.93-default and check multi-machine tests again.
Updated by mkittler about 1 year ago
Updated by okurz about 1 year ago
- Due date changed from 2023-11-17 to 2023-11-22
- Status changed from Feedback to In Progress
- Priority changed from High to Urgent
mkittler will schedule the scenario again and test. We discussed the topic in our tools team meeting. We should handle this with higher priority, hence bumping to "Urgent". @mkittler please also
- Check if the rollback steps have been conducted
- Ensure stability of the originally failing scenario
- Verify that we don't have any workers with interface-number-suffix like "-1" in our list of openQA workers
- Optionally create a separate ticket about bringing back diesel into across-host multi-machine tests as long as that is disabled
Updated by mkittler about 1 year ago
After installing the other kernel package and rebooting the diff in sysctl is gone. I scheduled https://openqa.suse.de/tests/12840274#dependencies. Let's see whether it works.
Updated by mkittler about 1 year ago
I removed all remaining diesel-1:* slots from OSD. All were offline so I guess the "-1" problem is resolved. salt-key -L and sudo salt openqa.suse.de mine.get roles:worker nodename grain also don't show any names with "-1" anymore.
https://stats.openqa-monitor.qa.suse.de/d/WDdiesel-1 also shows no data (as opposed to https://stats.openqa-monitor.qa.suse.de/d/WDdiesel). Maybe I should clean up the old dashboard (and check why it doesn't happen automatically).
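To repeat this check later, a filter like the following sketch can flag interface-numbered names. It assumes one name per input line and only looks at the base host names passed as arguments, so legitimate names such as powerqaworker-qam-1 are not reported. The function name suffixed_names is hypothetical.

```shell
# Hypothetical helper: from names on stdin (one per line), print those that
# look like "<base>-<number>", optionally followed by ":<slot>", for the
# base host names given as arguments.
suffixed_names() (
    # Build an alternation like "diesel|petrol" from the arguments
    bases=$(printf '%s|' "$@"); bases=${bases%|}
    # "|| true" keeps the exit status 0 when nothing matches
    grep -E "^(${bases})-[0-9]+(:|\$)" || true
)
```

Example: printf '%s\n' diesel petrol diesel-1 | suffixed_names diesel petrol prints only diesel-1. Empty output means no suffixed names are left for the given hosts.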
Updated by mkittler about 1 year ago
The new cluster is already at the point where we can call it a success. So the different kernel version might have surprisingly made a difference. Of course I'll retry the jobs a few times to be sure.
About the dashboards: They are not provisioned anymore by salt but one still has to remove them manually as non-provisioned dashboards (which I've just done). I think we actually have already established that this is how it works at some point before (and changing this would be out of scope for this ticket anyways).
Updated by mkittler about 1 year ago
- Description updated (diff)
Looks like the "-1" suffix has already been deleted from workerconf.
Updated by mkittler about 1 year ago
It still doesn't work after all. Before, the test scenario always failed in iscsi_client and now that module softfailed (which seems generally as good as it gets for this module at this point) and watchdog passed but then ha_cluster_join failed¹. So not so successful after all. I'm restarting the cluster and will investigate a little bit as I'm wondering what the difference is now.
¹ test-ppc-mm-ha_ctdb_node02test-ppc-mm (test/VM on diesel) cannot ping test-ppc-mm-ha_ctdb_node01test-ppc-mm (test/VM on mania) via ping -c1 ctdb-node01 due to ping: ctdb-node01: Temporary failure in name resolution.
Updated by mkittler about 1 year ago
I've got the same outcome again. I've restarted the cluster once more to be sure the test outcome has changed in a stable way: https://openqa.suse.de/tests/12841664#dependencies
Updated by mkittler about 1 year ago
- Status changed from In Progress to Resolved
It failed again in the same way. So I created a follow-up ticket (#150995) and consider this one resolved. This also means I'm no re-enabling the tap worker class as part of the rollback steps.
Updated by okurz about 1 year ago
- Related to action #150995: Fix MM setup on diesel so test scenarios like ha_ctdb_supportservertest-ppc-mm work added