action #107470
closed
[openqa][ipmi][worker][sut][needle matching] 'sshd-server-started' needle matching has been continuously failing on some workers/SUTs size:M
Added by waynechen55 almost 3 years ago.
Updated over 2 years ago.
Description
Observation
QE Virtualization has an openQA test suite, prj2_host_upgrade_sles12sp5_to_developing_xen, which automates the host upgrade procedure from a SLES 12-SP5 Xen host to a SLES 15-SP4 Xen host. Needle matching has been continuously failing at the reboot_and_wait_up_upgrade step, as shown below:
# Test died: no candidate needle with tag(s) 'sshd-server-started' matched
I have kept creating a new 'sshd-server-started' needle after each failure. Unfortunately, the same failure still happened at the same step every time the test was triggered by a newly released daily build.
openqaworker-2:18/gonzo-1:
prj2_host_upgrade_sles12sp5_to_developing_xen Build101.1
prj2_host_upgrade_sles12sp5_to_developing_xen Build99.1
prj2_host_upgrade_sles12sp5_to_developing_xen Build98.1
openqaworker-2:19/fozzie-1:
prj2_host_upgrade_sles12sp5_to_developing_xen Build99.1
prj2_host_upgrade_sles12sp5_to_developing_xen Build98.1
prj2_host_upgrade_sles12sp5_to_developing_xen Build97.1
prj2_host_upgrade_sles12sp5_to_developing_xen Build91.2
Steps to reproduce
- Trigger an openQA test run with a new daily build and ensure the test is assigned to openqaworker-2:18 or openqaworker-2:19. For example: openqa-client --host xxxxx isos post BUILD=xxxxx DISTRI=sle VERSION=15-SP4 FLAVOR=Online ARCH=x86_64 TEST=prj2_host_upgrade_sles12sp5_to_developing_xen
- The automated host upgrade procedure is as follows:
  - Install the host with the base product sles12sp5 and MainUpdate. Do the registration during installation.
  - Perform the offline upgrade automatically by adding the following entry to the grub config:
    menuentry SLE-15-SP4-Full-x86_64-Buildxxxxx-Media1-012422075306 {
      insmod gzio
      insmod part_msdos
      insmod btrfs
      search --no-floppy --fs-uuid --set=root c911bf44-435b-4b62-a856-e5a0fcc20e8e
      linux /boot/loader-qloTWw/linux autoupgrade=1 console=ttyS1,115200 console=tty vga=791 Y2DEBUG=1 xvideo=1024x768 ssh=1 sshpassword=nots3cr3t install=http://openqa.suse.de/assets/repo/SLE-15-SP4-Full-x86_64-Buildxxxxx-Media1
      initrd /boot/loader-qloTWw/initrd
    }
  - Boot into the above grub entry and wait for the ssh daemon to be up and running.
  - ssh to the host and run yast.ssh to perform the automatic offline upgrade.
Problem
- Initially I thought this might be caused by usb-storage driver loading, which can be seen here. I did some experiments with disabling the usb-storage driver (passing brokenmodules=usb-storage to the kernel), which gave me the impression that the 'sshd-server-started' needle hit rate could be increased. But this is hard to explain and does not make sense to others. I also do not think the usb-storage driver changes in every new daily build, so once a new 'sshd-server-started' needle is captured, it should match afterwards.
- It is more realistic to approach the issue from the openQA engine perspective.
- There is also another progress ticket, poo#106056, that is related to an ipmi backend issue. I do not think the two correlate directly, except that the issue in this ticket depends on the ipmi backend.
Suggestion
- Check needle matching criteria and mechanism
- Fix the issue from the openQA engine perspective
Workaround
Capture a new needle and retrigger the job
And here is what a successful 'sshd-server-started' needle matching looks like.
- Category set to Support
- Priority changed from Normal to High
- Target version set to Ready
My intuition is, maybe the needle matching could be replaced with another approach since the output on the console shifts around a lot. I'll talk to Wayne.
- Subject changed from [openqa][ipmi][worker][sut][needle matching] 'sshd-server-started' needle matching has been continously failing on some workers/SUTs to [openqa][ipmi][worker][sut][needle matching] 'sshd-server-started' needle matching has been continously failing on some workers/SUTs size:M
- Status changed from New to Workable
- Category deleted (Support)
- Assignee deleted (livdywan)
- Target version deleted (Ready)
cdywan wrote:
My intuition is, maybe the needle matching could be replaced with another approach since the output on the console shifts around a lot. I'll talk to Wayne.
So what is the alternative to needle matching? I would be interested to know.
- Subject changed from [openqa][ipmi][worker][sut][needle matching] 'sshd-server-started' needle matching has been continously failing on some workers/SUTs size:M to [openqa][ipmi][worker][sut][needle matching] 'sshd-server-started' needle matching has been continuously failing on some workers/SUTs size:M
- Category set to Support
- Assignee set to livdywan
- Target version set to Ready
@cdywan Is there any update on this issue? Are you going to fix this?
- Status changed from Workable to Feedback
Progress is barely usable, but I'll try to reflect what's being discussed in Slack.
waynechen55 wrote:
@cdywan Is there any update on this issue? Are you going to fix this?
Note that this is a "support" ticket. I'm not planning to take over the test, and there's no bug here AFAIR.
I noticed the test is waiting for sshd-server-started and also logging things like SSH connection to .* established. Since the needles involve wrapping and repeating output, maybe it's better to check this on the console rather than graphically?
@waynechen55 pointed out that installation and upgrade rely on sshd-server-started and that the failure on boot_from_pxe is rare. And ssh should only be connected after we know the host is up.
@cdywan I have some new findings regarding this issue and the failed test runs. The failed test runs always tried to match an inferior needle, 'boot_from_pxe-sshd-server-started-20171030', instead of the newly created ones that have the same tag 'sshd-server-started'. By contrast, the successful test run matched the newly created needle 'reboot_and_wait_up_upgrade-grub2-openqawoker2-20-20211231'. So my questions are:
- Why does the test still try to match the inferior needle when a newly captured needle already exists?
- How can I work around this and make the test pick up the newly created needle instead of the old one?
I don't think the timestamp in the filename or when the file was created would be taken into account here. All of them have the same tag and similar match regions.
How about using more distinct tags? Like sshd-via-yast.
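The point about tags can be pictured with a small Python model. This is a simplified illustration under the assumption stated above (candidates are chosen by tag, and the date in the file name plays no role); it is not openQA's actual matching code. The Needle class, the function name, the tag assigned to the second needle and the third needle are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Needle:
    name: str        # needle file name without extension
    tags: List[str]  # tags stored with the needle


def candidates_for(tag: str, needles: List[Needle]) -> List[Needle]:
    """Select every needle carrying the tag; name and date are ignored."""
    return [n for n in needles if tag in n.tags]


needles = [
    Needle("boot_from_pxe-sshd-server-started-20171030", ["sshd-server-started"]),
    # Assumed to carry the same tag, as described in this ticket.
    Needle("reboot_and_wait_up_upgrade-grub2-openqawoker2-20-20211231", ["sshd-server-started"]),
    # Hypothetical needle using the more distinct tag suggested above.
    Needle("hypothetical-sshd-via-yast-needle", ["sshd-via-yast"]),
]

if __name__ == "__main__":
    for tag in ("sshd-server-started", "sshd-via-yast"):
        print(tag, [n.name for n in candidates_for(tag, needles)])
```

In this model the old and the new needle are both candidates for 'sshd-server-started', while a step that asks for 'sshd-via-yast' only ever sees needles created for that tag.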
So if multiple needles have the same tag and similar match regions, which one will be used for matching? May I know how the engine chooses it? Thanks. @cdywan
In other words, I want to know how I can make the test choose the recent needles instead of those from the distant past if they have the same tag.
About your suggestion "How about using more distinct tags? Like sshd-via-yast.":
This looks feasible at first glance. But on second thought, I think the same issue may also happen with a new distinct tag. For example, if a test run fails to match a needle with the new distinct tag, a new needle is captured and created. How can you guarantee that the test will start looking for the new needle? It may still look for old ones, so the same problem comes back, right? I think it may be worth a deeper look into the openQA engine, IMHO. @cdywan
- Assignee changed from livdywan to okurz
- Priority changed from High to Low
Please remove older needles which are wrongly matching or are not as specific as newer needles that you created. That's the right approach.
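To find the older needles that share a tag before deciding which ones to remove, a small helper can list them. The following Python sketch assumes the needles are stored as JSON files with a "tags" array next to their images; the function name and the command-line usage are made up for this example, and the script only lists files, it does not delete anything.

```python
import json
import sys
from pathlib import Path
from typing import List


def needles_with_tag(needles_dir: Path, tag: str) -> List[Path]:
    """List needle JSON files below needles_dir whose "tags" array contains tag."""
    matches = []
    for json_file in sorted(needles_dir.rglob("*.json")):
        try:
            data = json.loads(json_file.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # skip unreadable or malformed files
        if isinstance(data, dict) and tag in data.get("tags", []):
            matches.append(json_file)
    return matches


if __name__ == "__main__":
    # Hypothetical usage: python find_needles.py <needles directory> <tag>
    if len(sys.argv) == 3:
        for path in needles_with_tag(Path(sys.argv[1]), sys.argv[2]):
            print(path)
```

The resulting list still needs a manual review, since only a person can judge which needles match wrongly or are less specific than the newer ones.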
- Status changed from Feedback to Resolved
@waynechen55 I assume you managed to follow the suggestions and remove old, invalid needles. I assume this support task can be resolved.
okurz wrote:
@waynechen55 I assume you managed to follow the suggestions and remove old, invalid needles. I assume this support task can be resolved.
Agree. It is done.