QA (public) &raquo; openQA Project (public) &raquo; openQA Tests (public)

2023-10-09

Related to openQA Project (public) - action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M

Resolved

dheidler

Related to openQA Infrastructure (public) - action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M

Resolved

2023-09-20

Related to openQA Infrastructure (public) - action #150995: Fix MM setup on diesel so test scenarios like ha_ctdb_supportservertest-ppc-mm work

New

2023-11-17

Copied to openQA Project (public) - coordination #139010: [epic] Long OSD ppc64le job queue

Blocked

2023-11-04

Copied to openQA Project (public) - action #139136: Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M

Resolved

Updated by okurz over 1 year ago

Description updated (diff)
Status changed from New to In Progress

Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/619 to exclude petrol from tap-class. diesel does not have the tap class. I could not find the petrol IP mentioned in malbec:/etc/wicked/scripts/gre_tunnel_preup.sh . That might explain it.

Actions

Updated by okurz over 1 year ago

I suspect that the mistake in the reverse PTR that should be fixed by https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4043 are at fault here so let's wait for that MR. In the meantime I am trying to handle related openQA job failures.

Trying to mitigate with

for i in malbec powerqaworker-qam-1 petrol; do env WORKER=$i result="result='failed'" failed_since=2023-09-18 host=openqa.suse.de bash -ex openqa-advanced-retrigger-jobs; done

Actions

Updated by nicksinger over 1 year ago

not sure if this is the only reason. I ran a state.highstate on malbec and can confirm that after a successful run petrol is missing in the gre_tunnel_preup.sh-script. We just check for "tap" in WORKER_CLASS so even with your changes it still should so up. So I suspect some issue in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls?ref_type=heads#L36-38

Actions

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/620

Updated by okurz over 1 year ago

Actions

Updated by nicksinger over 1 year ago

okurz wrote in #note-4:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/620

I've fixed the underlying problem in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/997

Actions

Updated by openqa_review over 1 year ago

Due date set to 2023-10-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Updated by livdywan over 1 year ago

Subject changed from test fails in iscsi_client auto_review:"(?s)ppc64le.*Test died: command.*curl":retry to test fails in iscsi_client auto_review:"(?s)ppc64le.*Test died: command.*curl":retry due to salt 'host'/'nodename' confusion size:M
Description updated (diff)

Actions

Updated by okurz over 1 year ago

Due date deleted (~~2023-10-05~~)
Status changed from In Progress to Workable
Assignee deleted (~~okurz~~)
Priority changed from Urgent to High

After https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/997 we should continue with the next suggestions. Unassigning due to multiple new other tickets after the urgency was mitigated.

Actions

Updated by nicksinger over 1 year ago

Assignee set to nicksinger

Actions

#11

Updated by ggardet_arm over 1 year ago

It makes some aarch64 tests to restart, such as: https://openqa.opensuse.org/tests/3616202#comments

Actions

#12

Updated by nicksinger over 1 year ago

Status changed from Workable to In Progress

Actions

#13

Updated by nicksinger over 1 year ago

Subject changed from test fails in iscsi_client auto_review:"(?s)ppc64le.*Test died: command.*curl":retry due to salt 'host'/'nodename' confusion size:M to test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M

ggardet_arm wrote in #note-11:

It makes some aarch64 tests to restart, such as: https://openqa.opensuse.org/tests/3616202#comments

Interesting. Definitely not related but I wonder why the regex matches. I remove it for now.

Actions

#14

Updated by nicksinger over 1 year ago

Cloned some tests with openqa-clone-job --within-instance https://openqa.suse.de --apikey D3576CEBF3529E39 --apisecret 63CF71588E55E56B 12230681 _GROUP=0 WORKER_CLASS=tap_poo136130. Unfortunately I messed up and _GROUP=0 still assigned a job-group. https://openqa.suse.de/tests/12376246 is now a validation for "all mm test on petrol-1".

Actions

#15

Updated by openqa_review over 1 year ago

Due date set to 2023-10-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions

#16

Updated by okurz over 1 year ago

nicksinger wrote in #note-14:

Cloned some tests with openqa-clone-job --within-instance https://openqa.suse.de --apikey D3576CEBF3529E39 --apisecret 63CF71588E55E56B 12230681 _GROUP=0 WORKER_CLASS=tap_poo136130. Unfortunately I messed up and _GROUP=0 still assigned a job-group. https://openqa.suse.de/tests/12376246 is now a validation for "all mm test on petrol-1".

use openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.suse.de 12230681 _GROUP=0 WORKER_CLASS=tap_poo136130 TEST+=-poo136130 BUILD=poo136130

Actions

#17

Updated by nicksinger over 1 year ago

Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/645 and https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4160 to get rid of the suffix in the hostname and openqa

Actions

#18

Updated by okurz over 1 year ago

Both merged, what's next?

Actions

#19

Updated by okurz over 1 year ago

Related to action #137603: [alert] Queue: State (SUSE) - too few jobs executed alert size:S added

Actions

#20

Updated by nicksinger over 1 year ago

The host diesel can complete jobs while called diesel in theory which can be seen here: https://openqa.suse.de/tests/12443043. But as visible in https://openqa.suse.de/admin/workers/3388 I stumbled over an odd behavior where the host changes its hostname back to diesel-1 after some time (therefore the incomplete, the name changed -> salt changed the config -> worker restarted mid job run to reload changes). I tried to set DHCLIENT_SET_HOSTNAME='no' in /etc/sysconfig/network/ifcfg-eth3 and /etc/sysconfig/network/dhcp but still observed the changing hostname after some longer time period (some hours) which is odd. I found some hints in the wicked man-pages regarding a wicked extension but I'm not sure if this is related to the sysconfig options or where else to configured it so I asked in https://suse.slack.com/archives/C02D92APKNU/p1697550296052419 for some help what I could be missing.

I verified all settings again and rebooted the machine which is running for 1h now without changing back. I started a tcpdump for dhcp packages now to check if I can correlate the hostname change with a dhcp package coming in (e.g. after lease expiration), the command used is: tcpdump -i eth3 port 67 or port 68 -e -n -vvvv -w /root/dhcp_hostname_change

Actions

#21

Updated by okurz over 1 year ago

Due date changed from 2023-10-19 to 2023-10-24

We discussed this in the weekly unblock and given that we struggle to find a conclusion we, in particular dheidler and okurz, suggest in general to go with a mixed approach of defining a static DHCP lease for machines where only a single system interface is connected to avoid all those problems and CNAME redirection and such. If That means we should not change all other machine entries according to explicit interface numbering. So what is left to do is to "just get diesel+petrol working" so less work. Extraordinarily bumping due date due to spontaneous vacation taking.

Actions

#22

Updated by nicksinger over 1 year ago

I ran another test by masking systemd-hostnamed but still the contents of /etc/hostname changed so apparently some tool writes to that file directly. Anyhow, guess this completes my experiments and I will just try to get the workers back online by changing our DHCP/DNS entries to refer only to the host itself without any interface enumeration. I will prepare a MR in the ops repo.

Actions

#23

Updated by nicksinger over 1 year ago

Status changed from In Progress to Blocked

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4234 created. I also reverted my changes back on diesel and unmasked systemd-hostnamed again. While we have aliases for the worker config available I don't think it makes sense to run validation runs now until the ops MR is in place because the whole MM-setup heavily depends on these names and it would just require new validation runs after the rename gets into production. Therefore "Blocked" until it is merged.

Actions

#24

Updated by nicksinger over 1 year ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines/845191 failed to to petrol not having the correct highstate. I fixed it manually by running an explicit highstate and the pipeline/job succeeded: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1914672

Actions

#25

Updated by livdywan over 1 year ago

Status changed from Blocked to In Progress
Priority changed from High to Urgent

nicksinger wrote in #note-23:

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4234 created

THe MR was merged yesterday, and we're now getting Queue: State (SUSE) alerts - maybe because of this? The relation wasn't/isn't clear to me from the alerts. Not entirely sure yet if this was a consequence of the MR being merged or Nick's working on it. Let's treat it as Urgent, tho, since we have no other machines currently

Actions

#26

Updated by nicksinger over 1 year ago

DNS/DHCP for both diesel and petrol are now changed back to without any number suffix. I tried to validate MM test yesterday and unfortunately they fail cross-machine: https://openqa.suse.de/tests/12639783#step/multipath_iscsi/26 - not sure what this is caused by. Single machine tests however complete successfully: https://openqa.suse.de/tests/12638911 and therefore I will add the normal qemu_ppc64 class into production while keeping MM disabled for now.

Actions

#27

Updated by nicksinger over 1 year ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/659 should help the ppc64le queue for single machine tests

Actions

#28

Updated by livdywan over 1 year ago

Due date changed from 2023-10-24 to 2023-10-27

Note I'm update the due date accordingly

Actions

#29

Updated by nicksinger over 1 year ago

My changes in the OPS salt didn't apply, now petrol and diesel get random IPs assigned causing the pipeline to fail: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1929411
I created https://sd.suse.com/servicedesk/customer/portal/1/SD-136220 for help

Actions

#30

Updated by nicksinger over 1 year ago

I had to manually apply the salt-states on walter1 and walter2 by issuing salt-call state.apply over ssh on them. With that I was able to recover both workers and they worked flawlessly on single-machine jobs in OSD. However, diesel seems to be very unstable and experiences random crashes. Petrol seems stable at least.

Actions

#31

Updated by okurz over 1 year ago

yes, diesel is running an updated kernel when it shouldn't. running the revert now

zypper rm -u kernel-default-extra-5.14.21-150500.55.31.1.ppc64le kernel-default-5.3.18-150300.59.93.1.ppc64le kernel-default-5.14.21-150500.55.31.1.ppc64le kernel-default-optional-5.14.21-150500.55.31.1.ppc64le && zypper al -m "poo#119008, kernel regression boo#1202138" kernel* && sync && reboot

Actions

#32

Updated by nicksinger over 1 year ago

machine was unable to boot because no kernel and no initrd were installed. I used the petitboot rescue shell to kexec into a leap live system and chrooted into the system. Afterwards I added the Leap15.3 repos with lower prio:

zypper ar -p 105 http://download.opensuse.org/distribution/leap/15.3/repo/oss/ repo-oss-15.3
zypper ref repo-oss-15.3

after that I was able to force install the old kernel from 15.3 with:

zypper in -f kernel-default-5.14.21-150500.55.31.1

which regenerated all necessary files again. After a reboot petitboot was able to find the system again and I could successfully start the normal system which is online again and already assigned all resources to pending jobs. I keep an eye on the stability of that system.

Actions

#33

Updated by okurz over 1 year ago

I checked that both diesel+petrol execute tests just fine. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 Shows a critically long ppc64le job queue but openQA is also working down the queue.

Actions

#34

Updated by nicksinger over 1 year ago

Priority changed from Urgent to High

Machine seems to be pretty stable now. I consider this and the slowly falling amount of ppc64le tickets enough to reduce priority for now. However, I now need to start at square one and figure out why cross-machine MM tests don't work as expected.

Actions

#35

Updated by livdywan over 1 year ago

Due date changed from 2023-10-27 to 2023-11-10

nicksinger wrote in #note-34:

Machine seems to be pretty stable now. I consider this and the slowly falling amount of ppc64le tickets enough to reduce priority for now. However, I now need to start at square one and figure out why cross-machine MM tests don't work as expected.

Let's be realistic about the due date then

Actions

#36

Updated by livdywan over 1 year ago

Related SD-137152

Actions

#37

Updated by okurz over 1 year ago

Copied to coordination #139010: [epic] Long OSD ppc64le job queue added

Actions

#38

Updated by okurz over 1 year ago

livdywan wrote in #note-36:

Related SD-137152

For that and the general problem of longer ppc64le job queue I created #139010.

Actions

#39

Updated by okurz over 1 year ago

Next steps:

Update description to not include malbec+powerqaworker-qam-1 anymore as they are offline and will stay offline
Ensure multi-machine tests are working on at least one of diesel+petrol and enable multi-machine production use
Check multi-machine tests across diesel and petrol and ensure multi-machin tests work using GRE tunnels
Continue with the other steps in ticket description suggestion section

Actions

#40

Updated by nicksinger over 1 year ago

Description updated (diff)

Actions

#41

Updated by livdywan over 1 year ago

Description updated (diff)

Let's call these steps, since you are checking them off one by one.

Actions

#42

Updated by okurz over 1 year ago

Copied to action #139136: Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M added

Actions

#43

Updated by livdywan over 1 year ago

Due date changed from 2023-11-10 to 2023-11-17

Not expecting this to be worked on this week, hence bumping the due date

Actions

#44

Updated by mkittler over 1 year ago

I scheduled some test jobs to check the current state: https://openqa.suse.de/tests/12804036#dependencies

Actions

#45

Updated by livdywan over 1 year ago

Notes from our collaborative session:

Ticket about ensuring salt-based setup works https://progress.opensuse.org/issues/136013
First firewall fix https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987
Explicit net.ipv4 fix https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1004
Confirm our docs https://docs.openvswitch.org/en/latest/howto/tunneling/#testing
Prepare a simplified test e.g.
- based on wicked involving its own dhcp and clear checking of IPv4 and IPv6 setup
- extend openqa_worker tests to check that os-autoinst-setup-multi-machine actually created a working network setup, by adding a second worker and checking that one can talk to the other https://github.com/os-autoinst/os-autoinst-distri-openQA/blob/979ea1766de9722a6fa1603c6a2a15c1c108c4ea/tests/install/openqa_worker.pm#L19
- See how we can test the same case on x86_64 and ppc - in practice many if not most tests are never run on power
Come up with a verifier based on Open vSwitch troubleshooting guidelines
Consider double-checking a possible relation to other mm issues #139070 #138707 #139154
Check if the -1 suffix in workerconf is (ir)relevant afterall
Verify with https://openqa.suse.de/tests/12188617#step/iscsi_client/10
Implement openqa-cli support for triggering of jobs on one or multiple specific workers

Actions

#46

Updated by livdywan over 1 year ago

Related to action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M added

Actions

#47

Updated by okurz over 1 year ago

Assignee changed from nicksinger to mkittler

as discussed in daily infra call mkittler will try to reproduce the multi-machine test scenarios. A more generalized "link to latest" independant of "QR" assets which might be missing by now is
https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&machine=ppc64le-2g&test=ha_ctdb_node02

Actions

#48

Updated by okurz over 1 year ago

Related to action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added

Actions

#49

Updated by mkittler over 1 year ago

I cloned the mentioned scenario with worker classes for diesel/petrol: https://openqa.suse.de/tests/12810996

By the way, despite the tap worker class being removed from petrol it looks like the MM configuration is up-to-date enough on that machine. (The IPs of petrol/diesel are configured correctly on either end.)

EDIT: The jobs have finished now. The server ran on petrol and the other client job that ran there as well did not fail in iscsi_client which seems to be the critical module. The other client job that ran on diesel failed in that module (and the other finished as parallel_failed). So the issue is definitely still reproducible.

Just for the record, I cloned the jobs by running sudo openqa-clone-job --skip-download --export-command --skip-chained-deps https://openqa.suse.de/tests/12799476 {TEST,BUILD}+='test-ppc-mm' _GROUP=0 on OSD and then executing the returned command with worker classes replaced accordingly.

Actions

#50

Updated by mkittler over 1 year ago

I invoked the ovs-vsctl commands on mania/petrol/diesel manually to allow traffic in all directions.

I scheduled one cluster between diesel and mania (https://openqa.suse.de/tests/12816967#dependencies) and one between petrol and mania (https://openqa.suse.de/tests/12816974#dependencies).

Actions

#51

Updated by mkittler over 1 year ago

The diesel/mania jobs have already finished reproducing the issue. The petrol/mania jobs are already passed the critical module not reproducing the issue.

That means diesel is the problem.

Note that the "roles" of the different hosts were identical. So this could not have made a difference. (I just replaced "diesel" with "petrol" in the worker class assignments when invoking the 2nd API call for petrol.)

Or I messed up the ovs-vctl commands for the connection between diesel/mania. Just for the record, I invoked the following commands:

martchus@mania:~> sudo ovs-vsctl --may-exist add-port br1 gre20 -- set interface gre20 type=gre options:remote_ip=10.168.192.252 # mania -> diesel
martchus@mania:~> sudo ovs-vsctl --may-exist add-port br1 gre21 -- set interface gre21 type=gre options:remote_ip=10.168.192.254 # mania -> petrol
martchus@diesel:~> sudo ovs-vsctl --may-exist add-port br1 gre21 -- set interface gre21 type=gre options:remote_ip=10.168.192.108 # diesel -> mania
martchus@petrol:~> sudo ovs-vsctl --may-exist add-port br1 gre21 -- set interface gre21 type=gre options:remote_ip=10.168.192.108 # petrol -> mania

(The connection between diesel and petrol was already in place. I reused gre20/gre21 for simplicity because we don't need a connection to the x86_64 worker using this gre interface anyways. I have also checked that the IPs show up in the output of ovs-vsctl show.)

Considering I haven't done anything differently on diesel I don't think that's the case, though.

Actions

#52

Updated by okurz over 1 year ago

ok, so what's your plan for a next step?

Actions

#53

Updated by mkittler over 1 year ago

All petrol/mania jobs have now finished successfully. As a first improvement I can create a MR to enable the tap worker class on petrol and mania.

Actions

#54

Updated by mkittler over 1 year ago

I restarted the petrol/mania cluster to see whether it wasn't just luck: https://openqa.suse.de/tests/12817482

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/677 to enable the tap worker class on petrol.

Actions

#55

Updated by mkittler over 1 year ago

Status changed from In Progress to Feedback

Actions

#56

Updated by mkittler over 1 year ago

I made a diff of sysctl -a | sort on petrol and diesel. The most interesting output of colordiff -Naur sysctl-petrol sysctl-diesel:

-kernel.hostname = petrol
+kernel.hostname = diesel
-kernel.osrelease = 5.3.18-150300.59.93-default
+kernel.osrelease = 5.3.18-57-default
-net.ipv4.conf.erspan0.arp_notify = 0
+net.ipv4.conf.erspan0.arp_notify = 1
-net.ipv4.tcp_mem = 195231      260310  390462
+net.ipv4.tcp_mem = 195273      260364  390546
-net.netfilter.nf_conntrack_tcp_ignore_invalid_rst = 0

Otherwise the diff mainly shows hardware differences (like a different number of CPU cores) and different UUIDs and a little bit lower limits like:

-kernel.pid_max = 114688
+kernel.pid_max = 65536

The output of ip r, ip a and ovs-vsctl show also doesn't show any relevant differences (but maybe Dirk as more input on that).

Actions

#57

Updated by okurz over 1 year ago

mkittler wrote in #note-56:

I made a diff of sysctl -a | sort on petrol and diesel. The most interesting output of colordiff -Naur sysctl-petrol sysctl-diesel:
-kernel.hostname = petrol
+kernel.hostname = diesel
-kernel.osrelease = 5.3.18-150300.59.93-default
+kernel.osrelease = 5.3.18-57-default

the kernel version 5.3.18 is correct as both should be downgraded due to #119008, kernel regression boo#1202138, however the kernel version on diesel looks like its the outdated GM version and should be the same as on petrol. Please install 5.3.18-150300.59.93-default and check multi-machine tests again.

Actions

#58

Updated by mkittler over 1 year ago

Ok, I installed http://download.opensuse.org/update/leap/15.3/sle/ppc64le/kernel-default-5.3.18-150300.59.93.1.ppc64le.rpm.

Actions

#59

Updated by okurz over 1 year ago

Due date changed from 2023-11-17 to 2023-11-22
Status changed from Feedback to In Progress
Priority changed from High to Urgent

mkittler will schedule the scenario again and test. We discussed the topic in our tools team meeting. We should handle this with higher priority, hence bumping to "Urgent". @mkittler please also

Check if the rollback steps have been conducted
Ensure stability of the originally failing scenario
Verify that we don't have any workers with interface-number-suffix like "-1" in our list of openQA workers
Optionally create a separate ticket about bringing back diesel into across-host multi-machine tests as long as that is disabled

Actions

#60

I've got the same outcome again. I've restarted the cluster once more to be sure the test outcome has changed in a stable way: https://openqa.suse.de/tests/12841664#dependencies

Actions

#66

Updated by mkittler over 1 year ago

Status changed from In Progress to Resolved

It failed again in the same way. So I created a follow-up ticket (#150995) and consider this one resolved. This also means I'm no re-enabling the tap worker class as part of the rollback steps.

Actions

#67

Updated by okurz over 1 year ago

Due date deleted (~~2023-11-22~~)

Actions