action #107470
closed[openqa][ipmi][worker][sut][needle matching] 'sshd-server-started' needle matching has been continuously failing on some workers/SUTs size:M
Description
Observation
QE Virtualization has an openQA test suite, prj2_host_upgrade_sles12sp5_to_developing_xen, which automates the host upgrade procedure from a SLES 12-SP5 Xen host to a SLES 15-SP4 Xen host. Needle matching has been failing continuously at the reboot_and_wait_up_upgrade step, as below:
# Test died: no candidate needle with tag(s) 'sshd-server-started' matched
I have been creating a new 'sshd-server-started' needle after each failure. Unfortunately, the same failure still happens at the same step every time the test is triggered by a newly released daily build.
openqaworker-2:18/gonzo-1:
prj2_host_upgrade_sles12sp5_to_developing_xen Build101.1
prj2_host_upgrade_sles12sp5_to_developing_xen Build99.1
prj2_host_upgrade_sles12sp5_to_developing_xen Build98.1
openqaworker-2:19/fozzie-1:
prj2_host_upgrade_sles12sp5_to_developing_xen Build99.1
prj2_host_upgrade_sles12sp5_to_developing_xen Build98.1
prj2_host_upgrade_sles12sp5_to_developing_xen Build97.1
prj2_host_upgrade_sles12sp5_to_developing_xen Build91.2
Steps to reproduce
- Trigger an openQA test run with a new daily build and ensure the test is assigned to openqaworker-2:18 or openqaworker-2:19. For example, openqa-client --host xxxxx isos post BUILD=xxxxx DISTRI=sle VERSION=15-SP4 FLAVOR=Online ARCH=x86_64 TEST=prj2_host_upgrade_sles12sp5_to_developing_xen
- The automated host upgrade procedure is as follows:
  * Install the host as base product sles12sp5 with MainUpdate. Do registration during installation.
  * Perform the offline upgrade automatically by adding the following to the grub config:
    menuentry SLE-15-SP4-Full-x86_64-Buildxxxxx-Media1-012422075306 {
    insmod gzio
    insmod part_msdos
    insmod btrfs
    search --no-floppy --fs-uuid --set=root c911bf44-435b-4b62-a856-e5a0fcc20e8e
    linux /boot/loader-qloTWw/linux autoupgrade=1 console=ttyS1,115200 console=tty vga=791 Y2DEBUG=1 xvideo=1024x768 ssh=1 sshpassword=nots3cr3t install=http://openqa.suse.de/assets/repo/SLE-15-SP4-Full-x86_64-Buildxxxxx-Media1
    initrd /boot/loader-qloTWw/initrd
    }
  * Boot into the above grub entry and wait for the ssh daemon to be up and running
  * ssh to the host and run yast.ssh to perform the automatic offline upgrade
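For anyone scripting the trigger step, here is a minimal sketch that assembles the openqa-client call from the first step out of a settings dictionary. The helper name build_isos_post is made up for illustration, and the xxxxx placeholders are kept exactly as they appear in the ticket.

```python
import shlex


def build_isos_post(host, settings):
    """Return the argv list for an `openqa-client --host HOST isos post` call.

    Hypothetical helper: it only assembles arguments, it does not run anything.
    """
    argv = ["openqa-client", "--host", host, "isos", "post"]
    # Job settings are passed as KEY=VALUE arguments, as in the ticket.
    argv += [f"{key}={value}" for key, value in settings.items()]
    return argv


settings = {
    "BUILD": "xxxxx",  # placeholder from the ticket, not a real build number
    "DISTRI": "sle",
    "VERSION": "15-SP4",
    "FLAVOR": "Online",
    "ARCH": "x86_64",
    "TEST": "prj2_host_upgrade_sles12sp5_to_developing_xen",
}

command = build_isos_post("xxxxx", settings)
print(shlex.join(command))
```

Keeping the settings in one dictionary makes it easier to retrigger the same job against a different build by changing a single value.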
Problem
- Initially I thought this might be caused by usb-storage driver loading, which can be seen here. I did some experiments with disabling the usb-storage driver (passing brokenmodules=usb-storage to the kernel), which gave me the impression that the 'sshd-server-started' needle hit rate could be increased. But this is hard to explain and does not make sense to others. I also do not think the usb-storage driver changes in every new daily build, so once a new 'sshd-server-started' needle is captured, it should match afterwards.
- It is more realistic to approach the issue from the openQA engine perspective.
- There is also another progress ticket, poo#106056, that is related to an ipmi backend issue. I do not think the two correlate directly, except that the issue in this ticket depends on the ipmi backend.
Suggestion
- Check the needle matching criteria and mechanism
- Fix the issue from the openQA engine perspective
Workaround
Capture a new needle and retrigger the job
Updated by waynechen55 over 2 years ago
Steps to reproduce:
- xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
- The automated host upgrade procedure is as follows:
  * Install the host as base product sles12sp5 with MainUpdate. Do registration during installation.
  * Perform the offline upgrade automatically by adding the following to the grub config:
    menuentry SLE-15-SP4-Full-x86_64-Buildxxxxx-Media1-012422075306 {
    insmod gzio
    insmod part_msdos
    insmod btrfs
    search --no-floppy --fs-uuid --set=root c911bf44-435b-4b62-a856-e5a0fcc20e8e
    linux /boot/loader-qloTWw/linux autoupgrade=1 console=ttyS1,115200 console=tty vga=791 Y2DEBUG=1 xvideo=1024x768 ssh=1 sshpassword=nots3cr3t install=http://openqa.suse.de/assets/repo/SLE-15-SP4-Full-x86_64-Buildxxxxx-Media1
    initrd /boot/loader-qloTWw/initrd
    }
  * Boot into the above grub entry and wait for the ssh daemon to be up and running
  * ssh to the host and run yast.ssh to perform the automatic offline upgrade
And here is what a successful 'sshd-server-started' needle matching looks like.
Updated by okurz over 2 years ago
- Category set to Support
- Priority changed from Normal to High
- Target version set to Ready
Updated by livdywan over 2 years ago
- Assignee set to livdywan
My intuition is, maybe the needle matching could be replaced with another approach since the output on the console shifts around a lot. I'll talk to Wayne.
Updated by livdywan over 2 years ago
- Subject changed from [openqa][ipmi][worker][sut][needle matching] 'sshd-server-started' needle matching has been continously failing on some workers/SUTs to [openqa][ipmi][worker][sut][needle matching] 'sshd-server-started' needle matching has been continously failing on some workers/SUTs size:M
- Status changed from New to Workable
Updated by waynechen55 over 2 years ago
- Category deleted (Support)
- Assignee deleted (livdywan)
- Target version deleted (Ready)
cdywan wrote:
My intuition is, maybe the needle matching could be replaced with another approach since the output on the console shifts around a lot. I'll talk to Wayne.
So what is the alternative way to do needle matching? I would be interested to know.
Updated by waynechen55 over 2 years ago
- Subject changed from [openqa][ipmi][worker][sut][needle matching] 'sshd-server-started' needle matching has been continously failing on some workers/SUTs size:M to [openqa][ipmi][worker][sut][needle matching] 'sshd-server-started' needle matching has been continuously failing on some workers/SUTs size:M
- Category set to Support
- Assignee set to livdywan
- Target version set to Ready
Updated by waynechen55 over 2 years ago
@cdywan May I know any update on this issue ? Are you going to fix this ?
Updated by livdywan over 2 years ago
- Status changed from Workable to Feedback
Progress is barely usable, but I'll try to reflect what's being discussed in Slack.
waynechen55 wrote:
@cdywan May I know any update on this issue ? Are you going to fix this ?
Note that this is a "support" ticket, I'm not planning to take over the test and there's no bug here afair.
I noticed the test is waiting for sshd-server-started and also logging things like SSH connection to .* established. Since the needles involve wrapping and repeating output, maybe it's better to check this on the console rather than graphically?
@waynechen55 pointed out that installation and upgrade rely on sshd-server-started and that the failure on boot_from_pxe is rare. Also, ssh should only be connected after we know the host is up.
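To illustrate the console-based idea, here is a minimal sketch that scans text output for the logged pattern instead of comparing screenshots. The regular expression comes from the log message mentioned above. The sample lines and the helper name first_match are made up for illustration and are not taken from a real job.

```python
import re

# Pattern quoted in this ticket's discussion.
SSH_READY = re.compile(r"SSH connection to .* established")


def first_match(lines, pattern):
    """Return the index of the first line matching pattern, or None."""
    for index, line in enumerate(lines):
        if pattern.search(line):
            return index
    return None


# Made-up console output: the position and repetition of lines does not
# matter to a text search, unlike a screenshot comparison.
sample_log = [
    "usb-storage: device scan complete",
    "usb-storage: device scan complete",
    "SSH connection to 192.0.2.10 established",
]

print(first_match(sample_log, SSH_READY))
```

A text check like this is indifferent to how far the console has scrolled, which is the property the screenshot needles lack here.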
Updated by waynechen55 over 2 years ago
@cdywan I have some new findings regarding this issue and the failed test runs. The failed test runs were always trying to match an inferior needle, 'boot_from_pxe-sshd-server-started-20171030', instead of the newly created ones that have the same tag 'sshd-server-started'. In contrast, the successful test run matched the newly created needle 'reboot_and_wait_up_upgrade-grub2-openqawoker2-20-20211231'. So my questions are:
- Why does the test still try to match the inferior needle when newly captured needles already exist?
- How can I work around the issue and make the test match the newly created needle instead of the old one?
Updated by livdywan over 2 years ago
waynechen55 wrote:
@cdywan I have some new findings regarding this issue and the failed test runs. The failed test runs were always trying to match an inferior needle, 'boot_from_pxe-sshd-server-started-20171030', instead of the newly created ones that have the same tag 'sshd-server-started'. In contrast, the successful test run matched the newly created needle 'reboot_and_wait_up_upgrade-grub2-openqawoker2-20-20211231'. So my questions are:
- Why does the test still try to match the inferior needle when newly captured needles already exist?
- How can I work around the issue and make the test match the newly created needle instead of the old one?
I don't think the timestamp in the filename or when the file was created would be taken into account here. All of them have the same tag and similar match regions.
How about using more distinct tags? Like sshd-via-yast.
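To make the point about tags concrete, here is a toy model (not the os-autoinst implementation) in which every needle carrying the requested tag is a candidate and the date in the file name is never consulted. The similarity numbers are invented, and the 0.96 threshold is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needle:
    name: str
    tags: frozenset
    similarity: float  # invented stand-in for an image comparison score


def candidates(needles, tag):
    """All needles carrying the tag are candidates, regardless of age."""
    return [needle for needle in needles if tag in needle.tags]


def best_match(needles, tag, threshold=0.96):
    """Return the most similar candidate at or above threshold, else None."""
    passing = [n for n in candidates(needles, tag) if n.similarity >= threshold]
    return max(passing, key=lambda n: n.similarity, default=None)


TAG = "sshd-server-started"
needles = [
    Needle("boot_from_pxe-sshd-server-started-20171030", frozenset({TAG}), 0.91),
    Needle("reboot_and_wait_up_upgrade-grub2-openqawoker2-20-20211231",
           frozenset({TAG}), 0.98),
]

best = best_match(needles, TAG)
print(best.name if best else "no candidate needle matched")
```

In this model the only ways to influence which needle is used are to change the set of candidates (different tags, or deleting needles) or to change what is on the screen. Creating a newer file with the same tag only adds one more candidate.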
Updated by waynechen55 over 2 years ago
cdywan wrote:
I don't think the timestamp in the filename or when the file was created would be taken into account here. All of them have the same tag and similar match regions.
How about using more distinct tags? Like sshd-via-yast.
So if there are multiple needles with the same tag and similar match regions, which one will be used for matching? How does the engine choose the one to use? Thanks. @cdywan
In other words, I want to know how I can make the test choose the recent needles instead of much older ones when they have the same tag.
Updated by waynechen55 over 2 years ago
cdywan wrote:
I don't think the timestamp in the filename or when the file was created would be taken into account here. All of them have the same tag and similar match regions.
How about using more distinct tags? Like sshd-via-yast.
About your suggestion "How about using more distinct tags? Like sshd-via-yast.":
This looks feasible at first glance. But on second thought, I think the same issue may also happen with a new distinct tag. For example, if a test run fails to match a needle with the new distinct tag, a new needle is captured and created. How can you guarantee that the test will start looking for the new needle? It may still look for the old ones, so the same problem comes back, right? IMHO it may be worth a deeper look into the openQA engine. @cdywan
Updated by okurz over 2 years ago
- Assignee changed from livdywan to okurz
- Priority changed from High to Low
waynechen55 wrote:
@cdywan I have some new findings regarding this issue and the failed test runs. The failed test runs were always trying to match an inferior needle, 'boot_from_pxe-sshd-server-started-20171030', instead of the newly created ones that have the same tag 'sshd-server-started'. In contrast, the successful test run matched the newly created needle 'reboot_and_wait_up_upgrade-grub2-openqawoker2-20-20211231'. So my questions are:
- Why does the test still try to match the inferior needle when newly captured needles already exist?
- How can I work around the issue and make the test match the newly created needle instead of the old one?
Please remove older needles which are wrongly matching or are not as specific as newer needles that you created. That's the right approach.
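As a starting point for that cleanup, here is a small sketch that lists which needle files in a directory declare a given tag, so the old ones can be reviewed before deleting them. It assumes needles are stored as JSON files with a "tags" list. The helper name needles_with_tag is hypothetical, and the demo runs against throwaway files in a temporary directory rather than a real needles repository.

```python
import json
import tempfile
from pathlib import Path


def needles_with_tag(needle_dir, tag):
    """Return sorted names of needle JSON files that declare the given tag."""
    found = []
    for path in Path(needle_dir).glob("*.json"):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed files
        if isinstance(data, dict) and tag in data.get("tags", []):
            found.append(path.stem)
    return sorted(found)


# Demo with throwaway files; the needle names are the two from this ticket.
with tempfile.TemporaryDirectory() as tmp:
    for name, tags in [
        ("boot_from_pxe-sshd-server-started-20171030", ["sshd-server-started"]),
        ("reboot_and_wait_up_upgrade-grub2-openqawoker2-20-20211231",
         ["sshd-server-started"]),
        ("unrelated-needle", ["something-else"]),
    ]:
        (Path(tmp) / f"{name}.json").write_text(json.dumps({"tags": tags}))
    listed = needles_with_tag(tmp, "sshd-server-started")

print("\n".join(listed))
```

The script only reports. Deciding which of the listed needles is wrongly matching or too unspecific still needs a look at the images.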
Updated by okurz over 2 years ago
After the workshop my primary suggestion is to replace the multiple check_screen calls in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/virt_autotest/login_console.pm#L91 and below with assert_screen. Also see https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/CONTRIBUTING.md?plain=1#L114 . In particular, using a check_screen without checking the return code should be considered an error.
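The difference can be shown with a toy model. The two functions below are simplified Python stand-ins named after the test API calls, not the real implementations, and the screen is reduced to a set of tags that would match.

```python
class NeedleNotFound(Exception):
    """Raised by the toy assert_screen when nothing matches."""


def check_screen(screen_tags, wanted):
    """Toy stand-in: return the matched tag, or None when nothing matches."""
    wanted = [wanted] if isinstance(wanted, str) else list(wanted)
    for tag in wanted:
        if tag in screen_tags:
            return tag
    return None


def assert_screen(screen_tags, wanted):
    """Toy stand-in: like check_screen, but a miss stops the test."""
    matched = check_screen(screen_tags, wanted)
    if matched is None:
        raise NeedleNotFound(f"no candidate needle with tag(s) {wanted!r} matched")
    return matched


screen = {"grub2"}  # the expected sshd screen never appeared

# Return value ignored: the miss goes unnoticed and the test carries on.
check_screen(screen, "sshd-server-started")

# With an assertion, the miss is reported at the step where it happened.
try:
    assert_screen(screen, "sshd-server-started")
except NeedleNotFound as error:
    print(f"Test died: {error}")
```

With the ignored return value, the failure only surfaces later, at a step that may have nothing to do with the real cause.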
Updated by waynechen55 over 2 years ago
okurz wrote:
After the workshop my primary suggestion is to replace the multiple check_screen calls in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/virt_autotest/login_console.pm#L91 and below with assert_screen. Also see https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/CONTRIBUTING.md?plain=1#L114 . In particular, using a check_screen without checking the return code should be considered an error.
I think people choose check_screen with match_has_tag because it lets them handle a failed match more flexibly instead of failing the whole test. Sometimes failing the test is unnecessary because the checkpoint is not fatal, and at the same time different operations need to be performed depending on which needle matched.
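A sketch of that pattern, with the no-match case handled explicitly rather than falling through, might look like this. The function, the action names, and the 'grub2' tag are hypothetical. Only 'sshd-server-started' is taken from this ticket.

```python
def handle_screen(matched_tag):
    """Map the outcome of a multi-tag screen check to an explicit action.

    matched_tag is the tag that matched, or None when nothing matched.
    """
    if matched_tag == "sshd-server-started":
        return "connect-ssh"
    if matched_tag == "grub2":  # hypothetical second tag
        return "select-boot-entry"
    if matched_tag is None:
        # Non-fatal checkpoint: the miss is still a deliberate decision.
        return "record-soft-failure-and-retry"
    raise ValueError(f"unexpected tag: {matched_tag!r}")


for outcome in ("sshd-server-started", "grub2", None):
    print(outcome, "->", handle_screen(outcome))
```

The branch for None is what keeps this compatible with the suggestion above: the return value is always inspected, even when a miss is tolerated.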
Updated by okurz over 2 years ago
- Status changed from Feedback to Resolved
@waynechen55 I assume you managed to follow the suggestions and remove old, invalid needles. I assume this support task can be resolved.
Updated by waynechen55 over 2 years ago
okurz wrote:
@waynechen55 I assume you managed to follow the suggestions and remove old, invalid needles. I assume this support task can be resolved.
Agree. It is done.