action #136130

closed

test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M

Added by okurz almost 1 year ago. Updated 10 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Bugs in existing tests
Target version:
Start date: 2023-09-20
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP5-Online-QR-HA-ppc64le-ha_ctdb_node02@ppc64le-2g fails in
iscsi_client
with "
Test died: command 'curl --form upload=@/var/log/zypper.log --form upname=iscsi_client-zypper.log http://10.0.2.2:20033/yu0FjwLwsVCGi_Fj/uploadlog/zypper.log' failed at /usr/lib/os-autoinst/testapi.pm line 926."
running on malbec and parallel jobs run on petrol or diesel. I suspect incomplete GRE tunnel config for those machine cluster definitions. Likely due to salt state not completely applied or the "diesel/diesel-1" mismatch as in #134864

Reproducible

Fails since (at least) Build 118.2 (current job)

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#136130

Expected result

Last good: 118.2 (or more recent)

Steps

  • Fix the lookup, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/997 (merged) done
  • Ensure that transient + static hostname + minion_id are for the actual host, not the interface, so without the "-1" suffix done
  • Remove the "-1" suffix in workerconf accordingly
  • Ensure that salt mine and grains are up-to-date and salt high state is applied done
  • Ensure that both diesel+petrol can work on multi-machine jobs
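
A minimal sketch of how the second step could be checked on a single worker. It is not part of the salt states; the function name `check_host_names` is made up for illustration, and it assumes that the salt minion ID is stored in `/etc/salt/minion_id` and may be an FQDN, so only the short name is compared.

```shell
# Report which of the three names differ from the expected short host name.
# Arguments: expected static transient minion_id
check_host_names() {
    local expected="$1" rc=0 entry label value
    for entry in "static=$2" "transient=$3" "minion_id=$4"; do
        label=${entry%%=*}
        value=${entry#*=}
        # Compare only the part before the first dot so that an FQDN passes.
        if [ "${value%%.*}" != "$expected" ]; then
            echo "$label: '$value' != '$expected'"
            rc=1
        fi
    done
    return "$rc"
}

# Possible usage on a worker:
# check_host_names diesel "$(hostnamectl --static)" "$(hostnamectl --transient)" "$(cat /etc/salt/minion_id)"
```

The names are passed in as arguments so that the comparison itself can be tried out without touching a real host.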

Rollback steps

Further details

Always latest result in this scenario: latest


Related issues 6 (2 open, 4 closed)

Related to openQA Infrastructure - action #137603: [alert] Queue: State (SUSE) - too few jobs executed alert size:S (Resolved, okurz, 2023-10-09)
Related to openQA Project - action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M (Resolved, dheidler)
Related to openQA Infrastructure - action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M (Resolved, okurz, 2023-09-20)
Related to openQA Infrastructure - action #150995: Fix MM setup on diesel so test scenarios like ha_ctdb_supportservertest-ppc-mm work (New, 2023-11-17)
Copied to openQA Project - coordination #139010: [epic] Long OSD ppc64le job queue (Blocked, okurz, 2023-11-04)
Copied to openQA Project - action #139136: Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M (Resolved, okurz)
Actions #1

Updated by okurz almost 1 year ago

  • Description updated (diff)
  • Status changed from New to In Progress

Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/619 to exclude petrol from tap-class. diesel does not have the tap class. I could not find the petrol IP mentioned in malbec:/etc/wicked/scripts/gre_tunnel_preup.sh . That might explain it.

Actions #2

Updated by okurz almost 1 year ago

I suspect that the mistake in the reverse PTR that should be fixed by https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4043 is at fault here, so let's wait for that MR. In the meantime I am trying to handle related openQA job failures.

Trying to mitigate with

for i in malbec powerqaworker-qam-1 petrol; do env WORKER=$i result="result='failed'" failed_since=2023-09-18 host=openqa.suse.de bash -ex openqa-advanced-retrigger-jobs; done
Actions #3

Updated by nicksinger almost 1 year ago

Not sure if this is the only reason. I ran a state.highstate on malbec and can confirm that after a successful run petrol is missing in the gre_tunnel_preup.sh script. We just check for "tap" in WORKER_CLASS so even with your changes it should still show up. So I suspect some issue in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls?ref_type=heads#L36-38
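
To make the selection logic easier to reason about, here is a hedged sketch of how tunnel commands could be derived from a worker list. It is not the actual salt template behind gre_tunnel_preup.sh; the function name, the input format and the host names and IPs in the usage example are hypothetical, and only the `ovs-vsctl` call follows the form used elsewhere in this ticket.

```shell
# Print one ovs-vsctl call per remote worker whose worker class list contains "tap".
# stdin: lines of "<host> <ip> <comma-separated worker classes>"
# $1: name of the local host, which must not get a tunnel to itself
gre_tunnel_commands() {
    local self="$1" host ip classes n=1
    while read -r host ip classes; do
        [ "$host" = "$self" ] && continue
        # Match "tap" only as a whole class, not as part of a longer class name.
        case ",$classes," in
            *,tap,*) ;;
            *) continue ;;
        esac
        echo "ovs-vsctl --may-exist add-port br1 gre$n -- set interface gre$n type=gre options:remote_ip=$ip"
        n=$((n + 1))
    done
}

# Possible usage with a hypothetical worker list:
# printf '%s\n' 'hostA 192.0.2.1 qemu_ppc64le,tap' 'hostB 192.0.2.2 qemu_ppc64le,tap' | gre_tunnel_commands hostA
```

The sketch deliberately matches whole class names. A plain substring check for "tap" would also select hosts that only carry a class such as tap_poo136130, which may or may not be intended.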

Actions #6

Updated by openqa_review almost 1 year ago

  • Due date set to 2023-10-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by livdywan 12 months ago

  • Subject changed from test fails in iscsi_client auto_review:"(?s)ppc64le.*Test died: command.*curl":retry to test fails in iscsi_client auto_review:"(?s)ppc64le.*Test died: command.*curl":retry due to salt 'host'/'nodename' confusion size:M
  • Description updated (diff)
Actions #8

Updated by okurz 12 months ago

  • Due date deleted (2023-10-05)
  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)
  • Priority changed from Urgent to High

After https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/997 we should continue with the next suggestions. Unassigning due to multiple new other tickets after the urgency was mitigated.

Actions #9

Updated by nicksinger 12 months ago

  • Assignee set to nicksinger
Actions #11

Updated by ggardet_arm 12 months ago

It causes some aarch64 tests to restart, such as: https://openqa.opensuse.org/tests/3616202#comments

Actions #12

Updated by nicksinger 12 months ago

  • Status changed from Workable to In Progress
Actions #13

Updated by nicksinger 12 months ago

  • Subject changed from test fails in iscsi_client auto_review:"(?s)ppc64le.*Test died: command.*curl":retry due to salt 'host'/'nodename' confusion size:M to test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M

ggardet_arm wrote in #note-11:

It causes some aarch64 tests to restart, such as: https://openqa.opensuse.org/tests/3616202#comments

Interesting. Definitely not related but I wonder why the regex matches. I removed it for now.

Actions #14

Updated by nicksinger 12 months ago

Cloned some tests with openqa-clone-job --within-instance https://openqa.suse.de --apikey D3576CEBF3529E39 --apisecret 63CF71588E55E56B 12230681 _GROUP=0 WORKER_CLASS=tap_poo136130. Unfortunately I messed up and _GROUP=0 still assigned a job-group. https://openqa.suse.de/tests/12376246 is now a validation for "all mm test on petrol-1".

Actions #15

Updated by openqa_review 12 months ago

  • Due date set to 2023-10-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #16

Updated by okurz 12 months ago

nicksinger wrote in #note-14:

Cloned some tests with openqa-clone-job --within-instance https://openqa.suse.de --apikey D3576CEBF3529E39 --apisecret 63CF71588E55E56B 12230681 _GROUP=0 WORKER_CLASS=tap_poo136130. Unfortunately I messed up and _GROUP=0 still assigned a job-group. https://openqa.suse.de/tests/12376246 is now a validation for "all mm test on petrol-1".

use openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.suse.de 12230681 _GROUP=0 WORKER_CLASS=tap_poo136130 TEST+=-poo136130 BUILD=poo136130

Actions #18

Updated by okurz 11 months ago

Both merged, what's next?

Actions #19

Updated by okurz 11 months ago

  • Related to action #137603: [alert] Queue: State (SUSE) - too few jobs executed alert size:S added
Actions #20

Updated by nicksinger 11 months ago

In theory the host diesel can complete jobs while named diesel, as can be seen here: https://openqa.suse.de/tests/12443043. But as visible in https://openqa.suse.de/admin/workers/3388 I stumbled over an odd behavior where the host changes its hostname back to diesel-1 after some time (hence the incomplete job: the name changed -> salt changed the config -> worker restarted mid job run to reload changes). I tried to set DHCLIENT_SET_HOSTNAME='no' in /etc/sysconfig/network/ifcfg-eth3 and /etc/sysconfig/network/dhcp but still observed the changing hostname after a longer time period (some hours), which is odd. I found some hints in the wicked man pages regarding a wicked extension but I'm not sure if this is related to the sysconfig options or where else to configure it, so I asked in https://suse.slack.com/archives/C02D92APKNU/p1697550296052419 for some help on what I could be missing.

I verified all settings again and rebooted the machine, which has been running for 1h now without changing back. I started a tcpdump for DHCP packets now to check if I can correlate the hostname change with a DHCP packet coming in (e.g. after lease expiration). The command used is: tcpdump -i eth3 port 67 or port 68 -e -n -vvvv -w /root/dhcp_hostname_change
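
For the correlation, the moment of the hostname change needs a timestamp as well. The following is a rough sketch of a polling watcher, assuming the change is visible in a file such as /etc/hostname; the function name and its parameters are made up for illustration.

```shell
# Print a timestamped line whenever the content of the watched file changes.
# $1: file to watch, $2: poll interval in seconds, $3: number of polls
watch_hostname_file() {
    local file="$1" interval="$2" polls="$3" last current i=0
    last=$(cat "$file")
    while [ "$i" -lt "$polls" ]; do
        sleep "$interval"
        current=$(cat "$file")
        if [ "$current" != "$last" ]; then
            echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $file: '$last' -> '$current'"
            last=$current
        fi
        i=$((i + 1))
    done
}

# Possible usage, polling once a minute for a day:
# watch_hostname_file /etc/hostname 60 1440 >> /root/hostname_changes.log
```

The logged UTC timestamps can then be compared with the packet timestamps in the capture file. Polling only narrows the change down to one interval, so a shorter interval gives a tighter correlation.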

Actions #21

Updated by okurz 11 months ago

  • Due date changed from 2023-10-19 to 2023-10-24

We discussed this in the weekly unblock. Given that we struggle to find a conclusion, we, in particular dheidler and okurz, suggest in general to go with a mixed approach: define a static DHCP lease for machines where only a single system interface is connected, to avoid all those problems, CNAME redirection and such. That means we should not change all other machine entries to explicit interface numbering. So what is left to do is to "just get diesel+petrol working", so less work. Extraordinarily bumping due date due to spontaneous vacation taking.

Actions #22

Updated by nicksinger 11 months ago

I ran another test by masking systemd-hostnamed but still the contents of /etc/hostname changed so apparently some tool writes to that file directly. Anyhow, guess this completes my experiments and I will just try to get the workers back online by changing our DHCP/DNS entries to refer only to the host itself without any interface enumeration. I will prepare a MR in the ops repo.

Actions #23

Updated by nicksinger 11 months ago

  • Status changed from In Progress to Blocked

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4234 created. I also reverted my changes back on diesel and unmasked systemd-hostnamed again. While we have aliases for the worker config available I don't think it makes sense to run validation runs now until the ops MR is in place because the whole MM-setup heavily depends on these names and it would just require new validation runs after the rename gets into production. Therefore "Blocked" until it is merged.

Actions #24

Updated by nicksinger 11 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines/845191 failed due to petrol not having the correct highstate. I fixed it manually by running an explicit highstate and the pipeline/job succeeded: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1914672

Actions #25

Updated by livdywan 11 months ago

  • Status changed from Blocked to In Progress
  • Priority changed from High to Urgent

nicksinger wrote in #note-23:

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4234 created

The MR was merged yesterday, and we're now getting Queue: State (SUSE) alerts - maybe because of this? The relation wasn't/isn't clear to me from the alerts. Not entirely sure yet if this was a consequence of the MR being merged or of Nick working on it. Let's treat it as Urgent, tho, since we have no other machines currently.

Actions #26

Updated by nicksinger 11 months ago

DNS/DHCP for both diesel and petrol are now changed back to names without any number suffix. I tried to validate MM tests yesterday and unfortunately they fail cross-machine: https://openqa.suse.de/tests/12639783#step/multipath_iscsi/26 - not sure what this is caused by. Single-machine tests however complete successfully: https://openqa.suse.de/tests/12638911 and therefore I will add the normal qemu_ppc64 class into production while keeping MM disabled for now.

Actions #27

Updated by nicksinger 11 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/659 should help the ppc64le queue for single machine tests

Actions #28

Updated by livdywan 11 months ago

  • Due date changed from 2023-10-24 to 2023-10-27

Note I'm updating the due date accordingly

Actions #29

Updated by nicksinger 11 months ago

My changes in the OPS salt didn't apply, now petrol and diesel get random IPs assigned causing the pipeline to fail: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1929411
I created https://sd.suse.com/servicedesk/customer/portal/1/SD-136220 for help

Actions #30

Updated by nicksinger 11 months ago

I had to manually apply the salt-states on walter1 and walter2 by issuing salt-call state.apply over ssh on them. With that I was able to recover both workers and they worked flawlessly on single-machine jobs in OSD. However, diesel seems to be very unstable and experiences random crashes. Petrol seems stable at least.

Actions #31

Updated by okurz 11 months ago

yes, diesel is running an updated kernel when it shouldn't. running the revert now

zypper rm -u kernel-default-extra-5.14.21-150500.55.31.1.ppc64le kernel-default-5.3.18-150300.59.93.1.ppc64le kernel-default-5.14.21-150500.55.31.1.ppc64le kernel-default-optional-5.14.21-150500.55.31.1.ppc64le && zypper al -m "poo#119008, kernel regression boo#1202138" kernel* && sync && reboot
Actions #32

Updated by nicksinger 11 months ago

machine was unable to boot because no kernel and no initrd were installed. I used the petitboot rescue shell to kexec into a leap live system and chrooted into the system. Afterwards I added the Leap15.3 repos with lower prio:

zypper ar -p 105 http://download.opensuse.org/distribution/leap/15.3/repo/oss/ repo-oss-15.3
zypper ref repo-oss-15.3

after that I was able to force install the old kernel from 15.3 with:

zypper in -f kernel-default-5.14.21-150500.55.31.1

which regenerated all necessary files again. After a reboot petitboot was able to find the system again and I could successfully start the normal system which is online again and already assigned all resources to pending jobs. I keep an eye on the stability of that system.

Actions #33

Updated by okurz 11 months ago

I checked that both diesel+petrol execute tests just fine. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 shows a critically long ppc64le job queue but openQA is also working down the queue.

Actions #34

Updated by nicksinger 11 months ago

  • Priority changed from Urgent to High

Machine seems to be pretty stable now. I consider this and the slowly falling amount of ppc64le tickets enough to reduce priority for now. However, I now need to start at square one and figure out why cross-machine MM tests don't work as expected.

Actions #35

Updated by livdywan 11 months ago

  • Due date changed from 2023-10-27 to 2023-11-10

nicksinger wrote in #note-34:

Machine seems to be pretty stable now. I consider this and the slowly falling amount of ppc64le tickets enough to reduce priority for now. However, I now need to start at square one and figure out why cross-machine MM tests don't work as expected.

Let's be realistic about the due date then

Actions #36

Updated by livdywan 11 months ago

Related SD-137152

Actions #37

Updated by okurz 11 months ago

Actions #38

Updated by okurz 11 months ago

livdywan wrote in #note-36:

Related SD-137152

For that and the general problem of longer ppc64le job queue I created #139010.

Actions #39

Updated by okurz 11 months ago

Next steps:

  1. Update description to not include malbec+powerqaworker-qam-1 anymore as they are offline and will stay offline
  2. Ensure multi-machine tests are working on at least one of diesel+petrol and enable multi-machine production use
  3. Check multi-machine tests across diesel and petrol and ensure multi-machine tests work using GRE tunnels
  4. Continue with the other steps in ticket description suggestion section
Actions #40

Updated by nicksinger 11 months ago

  • Description updated (diff)
Actions #41

Updated by livdywan 11 months ago

  • Description updated (diff)

Let's call these steps, since you are checking them off one by one.

Actions #42

Updated by okurz 11 months ago

  • Copied to action #139136: Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M added
Actions #43

Updated by livdywan 11 months ago

  • Due date changed from 2023-11-10 to 2023-11-17

Not expecting this to be worked on this week, hence bumping the due date

Actions #44

Updated by mkittler 10 months ago

I scheduled some test jobs to check the current state: https://openqa.suse.de/tests/12804036#dependencies

Actions #45

Updated by livdywan 10 months ago

Notes from our collaborative session:

Actions #46

Updated by livdywan 10 months ago

  • Related to action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M added
Actions #47

Updated by okurz 10 months ago

  • Assignee changed from nicksinger to mkittler

As discussed in the daily infra call, mkittler will try to reproduce the multi-machine test scenarios. A more generalized "link to latest", independent of "QR" assets which might be missing by now, is
https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&machine=ppc64le-2g&test=ha_ctdb_node02

Actions #48

Updated by okurz 10 months ago

  • Related to action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added
Actions #49

Updated by mkittler 10 months ago

I cloned the mentioned scenario with worker classes for diesel/petrol: https://openqa.suse.de/tests/12810996

By the way, despite the tap worker class being removed from petrol it looks like the MM configuration is up-to-date enough on that machine. (The IPs of petrol/diesel are configured correctly on either end.)


EDIT: The jobs have finished now. The server ran on petrol and the other client job that ran there as well did not fail in iscsi_client which seems to be the critical module. The other client job that ran on diesel failed in that module (and the other finished as parallel_failed). So the issue is definitely still reproducible.

Just for the record, I cloned the jobs by running sudo openqa-clone-job --skip-download --export-command --skip-chained-deps https://openqa.suse.de/tests/12799476 {TEST,BUILD}+='test-ppc-mm' _GROUP=0 on OSD and then executing the returned command with worker classes replaced accordingly.

Actions #50

Updated by mkittler 10 months ago

I invoked the ovs-vsctl commands on mania/petrol/diesel manually to allow traffic in all directions.

I scheduled one cluster between diesel and mania (https://openqa.suse.de/tests/12816967#dependencies) and one between petrol and mania (https://openqa.suse.de/tests/12816974#dependencies).

Actions #51

Updated by mkittler 10 months ago

The diesel/mania jobs have already finished, reproducing the issue. The petrol/mania jobs have already passed the critical module, not reproducing the issue.

That means diesel is the problem.

Note that the "roles" of the different hosts were identical. So this could not have made a difference. (I just replaced "diesel" with "petrol" in the worker class assignments when invoking the 2nd API call for petrol.)

Or I messed up the ovs-vsctl commands for the connection between diesel/mania. Just for the record, I invoked the following commands:

martchus@mania:~> sudo ovs-vsctl --may-exist add-port br1 gre20 -- set interface gre20 type=gre options:remote_ip=10.168.192.252 # mania -> diesel
martchus@mania:~> sudo ovs-vsctl --may-exist add-port br1 gre21 -- set interface gre21 type=gre options:remote_ip=10.168.192.254 # mania -> petrol
martchus@diesel:~> sudo ovs-vsctl --may-exist add-port br1 gre21 -- set interface gre21 type=gre options:remote_ip=10.168.192.108 # diesel -> mania
martchus@petrol:~> sudo ovs-vsctl --may-exist add-port br1 gre21 -- set interface gre21 type=gre options:remote_ip=10.168.192.108 # petrol -> mania

(The connection between diesel and petrol was already in place. I reused gre20/gre21 for simplicity because we don't need a connection to the x86_64 worker using this gre interface anyways. I have also checked that the IPs show up in the output of ovs-vsctl show.)

Considering I haven't done anything differently on diesel I don't think that's the case, though.
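
One way to rule out a mistake in such manually entered tunnel definitions is to check that every tunnel has a counterpart in the opposite direction. This is a small sketch under the assumption that the local/remote IP pairs have been collected by hand (for example from the `ovs-vsctl show` output on each host); the function name and input format are made up for illustration.

```shell
# stdin: lines of "<local_ip> <remote_ip>", one per configured GRE port
# Prints every tunnel for which the opposite direction is not listed.
missing_reverse_tunnels() {
    local pairs local_ip remote_ip
    # Normalize whitespace and drop incomplete lines.
    pairs=$(awk 'NF >= 2 {print $1, $2}')
    printf '%s\n' "$pairs" | while read -r local_ip remote_ip; do
        [ -n "$local_ip" ] || continue
        if ! printf '%s\n' "$pairs" | grep -qxF -- "$remote_ip $local_ip"; then
            echo "$local_ip -> $remote_ip has no reverse tunnel"
        fi
    done
}

# Possible usage with the four definitions listed above:
# printf '%s\n' '10.168.192.108 10.168.192.252' '10.168.192.108 10.168.192.254' \
#     '10.168.192.252 10.168.192.108' '10.168.192.254 10.168.192.108' | missing_reverse_tunnels
```

For the four commands listed above the check prints nothing, which would be consistent with the tunnel definitions themselves being symmetric. It says nothing about whether traffic actually passes.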

Actions #52

Updated by okurz 10 months ago

ok, so what's your plan for a next step?

Actions #53

Updated by mkittler 10 months ago

All petrol/mania jobs have now finished successfully. As a first improvement I can create a MR to enable the tap worker class on petrol and mania.

Actions #54

Updated by mkittler 10 months ago

I restarted the petrol/mania cluster to see whether it wasn't just luck: https://openqa.suse.de/tests/12817482

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/677 to enable the tap worker class on petrol.

Actions #55

Updated by mkittler 10 months ago

  • Status changed from In Progress to Feedback
Actions #56

Updated by mkittler 10 months ago

I made a diff of sysctl -a | sort on petrol and diesel. The most interesting output of colordiff -Naur sysctl-petrol sysctl-diesel:

-kernel.hostname = petrol
+kernel.hostname = diesel
-kernel.osrelease = 5.3.18-150300.59.93-default
+kernel.osrelease = 5.3.18-57-default
-net.ipv4.conf.erspan0.arp_notify = 0
+net.ipv4.conf.erspan0.arp_notify = 1
-net.ipv4.tcp_mem = 195231      260310  390462
+net.ipv4.tcp_mem = 195273      260364  390546
-net.netfilter.nf_conntrack_tcp_ignore_invalid_rst = 0

Otherwise the diff mainly shows hardware differences (like a different number of CPU cores) and different UUIDs and a little bit lower limits like:

-kernel.pid_max = 114688
+kernel.pid_max = 65536

The output of ip r, ip a and ovs-vsctl show also doesn't show any relevant differences (but maybe Dirk has more input on that).
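
To repeat this comparison after changes, the expected differences can be filtered out up front. A minimal sketch, assuming both dumps were created with `sysctl -a | sort` and use the usual `key = value` format; the function name and the ignore patterns in the usage line are only examples.

```shell
# Compare two "sysctl -a | sort" dumps and hide keys that are expected to differ.
# $1, $2: dump files; remaining arguments: extended regexes matching keys to ignore
sysctl_diff() {
    local a="$1" b="$2" pattern tmp_a tmp_b rc
    shift 2
    # Join all ignore patterns into one alternation anchored at the key.
    pattern="^($(IFS='|'; printf '%s' "$*")) = "
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    grep -Ev "$pattern" "$a" > "$tmp_a"
    grep -Ev "$pattern" "$b" > "$tmp_b"
    diff "$tmp_a" "$tmp_b"
    rc=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$rc"
}

# Possible usage:
# sysctl_diff sysctl-petrol sysctl-diesel 'kernel\.hostname' 'kernel\.pid_max'
```

Like plain `diff`, the function returns 1 when differences remain, so it can also serve as a quick check that two workers are configured alike.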

Actions #57

Updated by okurz 10 months ago

mkittler wrote in #note-56:

I made a diff of sysctl -a | sort on petrol and diesel. The most interesting output of colordiff -Naur sysctl-petrol sysctl-diesel:

-kernel.hostname = petrol
+kernel.hostname = diesel
-kernel.osrelease = 5.3.18-150300.59.93-default
+kernel.osrelease = 5.3.18-57-default

The kernel version 5.3.18 is correct as both should be downgraded due to #119008, kernel regression boo#1202138. However, the kernel version on diesel looks like it's the outdated GM version and should be the same as on petrol. Please install 5.3.18-150300.59.93-default and check multi-machine tests again.

Actions #59

Updated by okurz 10 months ago

  • Due date changed from 2023-11-17 to 2023-11-22
  • Status changed from Feedback to In Progress
  • Priority changed from High to Urgent

mkittler will schedule the scenario again and test. We discussed the topic in our tools team meeting. We should handle this with higher priority, hence bumping to "Urgent". @mkittler please also

  1. Check if the rollback steps have been conducted
  2. Ensure stability of the originally failing scenario
  3. Verify that we don't have any workers with interface-number-suffix like "-1" in our list of openQA workers
  4. Optionally create a separate ticket about bringing back diesel into across-host multi-machine tests as long as that is disabled
Actions #60

Updated by mkittler 10 months ago

After installing the other kernel package and rebooting the diff in sysctl is gone. I scheduled https://openqa.suse.de/tests/12840274#dependencies. Let's see whether it works.

Actions #61

Updated by mkittler 10 months ago

I removed all remaining diesel-1:* slots from OSD. All were offline so I guess the "-1" problem is resolved. salt-key -L and sudo salt openqa.suse.de mine.get roles:worker nodename grain also don't show any names with "-1" anymore.
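
A small filter along these lines could make that check repeatable. It is only a sketch: the function name is made up, and it flags every name whose short host part ends in a dash followed by digits, so hosts that legitimately carry such a name (powerqaworker-qam-1, for example) show up as well and need to be reviewed by hand.

```shell
# stdin: one minion or worker name per line (short name or FQDN)
# Prints the names whose short host part ends in "-<digits>".
names_with_interface_suffix() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E -- '^[^.]*-[0-9]+(\.|$)' || true
}

# Possible usage:
# sudo salt-key -L | names_with_interface_suffix
```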

https://stats.openqa-monitor.qa.suse.de/d/WDdiesel-1 also shows no data (as opposed to https://stats.openqa-monitor.qa.suse.de/d/WDdiesel). Maybe I should clean up the old dashboard (and check why it doesn't happen automatically).

Actions #62

Updated by mkittler 10 months ago

The new cluster is already at the point where we can call it a success. So the different kernel version might have surprisingly made a difference. Of course I'll retry the jobs a few times to be sure.


About the dashboards: They are not provisioned anymore by salt but one still has to remove them manually as non-provisioned dashboards (which I've just done). I think we have already established at some point before that this is how it works (and changing this would be out of scope for this ticket anyways).

Actions #63

Updated by mkittler 10 months ago

  • Description updated (diff)

Looks like the "-1" suffix has already been deleted from workerconf.

Actions #64

Updated by mkittler 10 months ago

It still doesn't work after all. Before, the test scenario always failed in iscsi_client; now that module softfailed (which seems generally as good as it gets for this module at this point) and watchdog passed, but then ha_cluster_join failed¹. So not so successful after all. I'm restarting the cluster and will investigate a little bit as I'm wondering what the difference is now.


¹ test-ppc-mm-ha_ctdb_node02test-ppc-mm (test/VM on diesel) cannot ping test-ppc-mm-ha_ctdb_node01test-ppc-mm (test/VM on mania) via ping -c1 ctdb-node01 due to ping: ctdb-node01: Temporary failure in name resolution.

Actions #65

Updated by mkittler 10 months ago

I've got the same outcome again. I've restarted the cluster once more to be sure the test outcome has changed in a stable way: https://openqa.suse.de/tests/12841664#dependencies

Actions #66

Updated by mkittler 10 months ago

  • Status changed from In Progress to Resolved

It failed again in the same way. So I created a follow-up ticket (#150995) and consider this one resolved. This also means I'm no re-enabling the tap worker class as part of the rollback steps.

Actions #67

Updated by okurz 10 months ago

  • Due date deleted (2023-11-22)
Actions #68

Updated by okurz 10 months ago

  • Related to action #150995: Fix MM setup on diesel so test scenarios like ha_ctdb_supportservertest-ppc-mm work added