action #18936

[tools][sles][functional] Enable 3 stress acceptance on s390x

Added by yosun over 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: High
Assignee:
Category: New test
Start date: 2017-05-04
Due date: 2017-11-08
% Done: 0%
Estimated time:
Difficulty:

Description

At the moment, we have acceptance tests running on x86_64 and ppc64le in openQA: https://openqa.suse.de/tests/overview?distri=sle&version=12-SP3&build=0365&groupid=55
This saves a lot of time in the whole process of milestone candidate build acceptance testing. Before x86_64 and ppc64le ran the stress acceptance tests, we needed 1 day to finish all tests. Now it takes only half a day.
If s390x can enable the following stress tests, it will help to reduce the time further in each product milestone test.


Related issues

Related to openQA Tests - action #18938: [tools]Enable 3 stress acceptance on aarch64 | Resolved | 2017-05-04

Related to openQA Tests - action #25638: [sles][functional][s390x] test fails in shutdown: VNC stall detected, needs to be investigated | Resolved | 2017-09-28 | 2017-10-25

Related to openQA Tests - action #26022: [sle][functional] Make sure that s390x use comatible qcow2 image created by parent job | Resolved | 2017-10-12 | 2017-11-08

Blocked by openQA Tests - action #19862: [sle][functional][s390x] test fails in addon_products_sle | Resolved | 2017-06-15

Copied to openQA Tests - action #20830: [tools][sles][functional]Get rid of version specific prefix in test suite names when the test suites also apply for other versions | Resolved | 2017-06-11

History

#1 Updated by okurz over 4 years ago

  • Category set to New test

#2 Updated by okurz over 4 years ago

  • Related to action #18938: [tools]Enable 3 stress acceptance on aarch64 added

#3 Updated by RBrownSUSE over 4 years ago

  • Subject changed from Enable 3 stress acceptance on s390x to [tools]Enable 3 stress acceptance on s390x
  • Assignee set to RBrownSUSE
  • Target version set to Milestone 8

At the request of Sero, taken as a QA Tools milestone 8 task for me to discuss with Matthias whether or not hardware is available for this

#4 Updated by RBrownSUSE over 4 years ago

  • Status changed from New to Resolved

Setup on zkvm, enjoy :)

#5 Updated by okurz over 4 years ago

  • Subject changed from [tools]Enable 3 stress acceptance on s390x to [tools][sles][functional]Enable 3 stress acceptance on s390x
  • Status changed from Resolved to In Progress
  • Assignee changed from RBrownSUSE to okurz

Seems like that never worked. I see that the tests always fail as incomplete trying to access an asset which is not generated, e.g.: https://openqa.suse.de/tests/993148
It can't work like this because assets are generated but the job does not have a dependency on any parent. Also the worker must be zkvm-image because of https://progress.opensuse.org/projects/openqav3/wiki/Wiki#on-zKVM . The test suites have a correct START_AFTER_TEST=sles12_minimal_base+sdk_create_hdd but that is not fulfilled by any job on s390x-zkvm. So I did the following steps now:

  • Moved scenarios to "test development job group"
  • Changed worker class to "zkvm-image"
  • Added scenario "sles12_minimal_base+sdk_create_hdd@zkvm-image"
  • Triggered tests for build 0420 with
$ openqa_client_osd isos post _NOOBSOLETEBUILD=1 BUILD=0420 BUILD_HA=0179 BUILD_HA_GEO=0138 BUILD_SDK=0230 BUILD_SLE=0420 BUILD_WE=0139 DISTRI=sle FLAVOR=Server-DVD ISO=SLE-12-SP3-Server-DVD-s390x-Build0420-Media1.iso ISO_1=SLE-12-SP3-SDK-DVD-s390x-Build0230-Media1.iso REPO_0=SLE-12-SP3-Server-DVD-s390x-Build0420-Media1 VERSION=12-SP3 ARCH=s390x TEST=sles12_qa_acceptance_fs_stress,sles12_qa_acceptance_process_stress,sles12_qa_acceptance_sched_stress,sles12_minimal_base+sdk_create_hdd
{ count => 4, failed => [], ids => [993908 .. 993911] }

-> https://openqa.suse.de/t993911, https://openqa.suse.de/t993910, https://openqa.suse.de/t993909, https://openqa.suse.de/t993908
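
The missing-parent problem described above can be illustrated with a small pre-check. This is only a sketch under assumptions: the helper `parent_is_scheduled` is hypothetical and not part of openQA or its client; it merely tests whether the suite named in START_AFTER_TEST appears in the comma-separated TEST list that gets posted.

```shell
#!/bin/sh
# Sketch only: a hypothetical pre-check, not part of openQA.
set -eu

# Return 0 if the parent suite appears in the comma-separated list of scheduled suites.
parent_is_scheduled() {
    parent=$1 scheduled=$2
    case ",$scheduled," in
        *",$parent,"*) return 0 ;;
        *) return 1 ;;
    esac
}

parent="sles12_minimal_base+sdk_create_hdd"
# Scheduling only the three stress suites leaves the dependency unfulfilled.
before="sles12_qa_acceptance_fs_stress,sles12_qa_acceptance_process_stress,sles12_qa_acceptance_sched_stress"
# Adding the image creation suite, as in the command above, fulfils it.
after="$before,$parent"

if parent_is_scheduled "$parent" "$before"; then before_ok=yes; else before_ok=no; fi
if parent_is_scheduled "$parent" "$after"; then after_ok=yes; else after_ok=no; fi
echo "parent scheduled before: $before_ok, after: $after_ok"
```

The check only compares names; it says nothing about whether the parent runs on a matching machine or worker class, which also matters here.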

#6 Updated by okurz over 4 years ago

  • Status changed from In Progress to Feedback

was cancelled, waiting for newly scheduled jobs, e.g. https://openqa.suse.de/tests/994655 and https://openqa.suse.de/tests/995815

#7 Updated by okurz over 4 years ago

https://openqa.suse.de/tests/995815#step/addon_products_sle/11 failed because the job is within the test development group, which has a more restricted quota, and the worker is very busy, so the job started late. By then the SDK had already been deleted by gru cleanup because the quota was exceeded. It is correct that the asset got deleted because when we have a scheduled job we never know if it will ever be triggered. So the image creation job should rather go into the sle functional group where I moved it now.

#8 Updated by okurz over 4 years ago

  • Blocked by action #19862: [sle][functional][s390x] test fails in addon_products_sle added

#9 Updated by okurz over 4 years ago

  • Assignee deleted (okurz)

blocked by #19862 , unassigning for now

#10 Updated by okurz about 4 years ago

  • Assignee set to okurz
  • Target version changed from Milestone 8 to Milestone 9

#11 Updated by riafarov about 4 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from okurz to riafarov

#12 Updated by riafarov about 4 years ago

  • Status changed from In Progress to Feedback

Four test suites created:
create_hdd_minimal_base+sdk_s390x
sles12_qa_acceptance_fs_stress_s390x
sles12_qa_acceptance_process_stress_s390x
sles12_qa_acceptance_sched_stress_s390x

As of now they are part of the SLE12 SP3 development group. Tests won't work and may need some changes, but they run, so we can pass them to the responsible team.

PR required for tests to be able to run: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3339

#13 Updated by riafarov about 4 years ago

  • Assignee changed from riafarov to okurz

Oliver, I've verified that all 3 jobs work fine (at least they launch; I haven't waited for execution results, as the timeout is 10800 seconds).
sles12_qa_acceptance_fs_stress_s390x: https://openqa.suse.de/tests/1079813
sles12_qa_acceptance_process_stress_s390x: https://openqa.suse.de/tests/1080172
sles12_qa_acceptance_sched_stress_s390x: https://openqa.suse.de/tests/1080176

Please, review and inform me if anything else is required in this ticket. As of now, I leave test suites in dev job group for sle 12 sp3.

#14 Updated by okurz about 4 years ago

To have a good template for SLE 12SP4 as well as SLE15 I removed the scheduling from the test dev group and added it to SLE 12 SP3 functional: https://openqa.suse.de/admin/job_templates/55 . We won't have a new build triggered there but it serves as a template to mark the jobs that are supposed to work.

The jobs you referenced were obsoleted (don't know why, maybe sle15?) so I triggered new ones: https://openqa.suse.de/tests/1080197 , https://openqa.suse.de/tests/1080198, https://openqa.suse.de/tests/1080199 that might complete but will take some time. I will keep this ticket on feedback for now to monitor.

I will create another ticket to rename some scenarios: I don't think the "sles12_qa" prefix makes sense for test suites.

#15 Updated by okurz about 4 years ago

  • Copied to action #20830: [tools][sles][functional]Get rid of version specific prefix in test suite names when the test suites also apply for other versions added

#16 Updated by okurz about 4 years ago

passed fs: https://openqa.suse.de/tests/1080198
passed sched: https://openqa.suse.de/tests/1080199

missing process: https://openqa.suse.de/t1080236, retriggered as https://openqa.suse.de/tests/1080236#step/acceptance_process_stress/24, failing to connect to the root terminal it seems.

mgriessmeier any idea?

#17 Updated by mgriessmeier about 4 years ago

as agreed with riafarov, we suggest running these tests only on the zkvm-image worker class.
It might work on others too if we recreate the image, but this is untested.

#18 Updated by mgriessmeier about 4 years ago

we should improve the error handling in cases of select_console failing on zkvm, e.g. execute commands over the svirt connection and check for the existence of corresponding ssh connections for each console (and potentially xterm-console processes as well on the worker)

#19 Updated by okurz about 4 years ago

  • Target version changed from Milestone 9 to Milestone 10

blocked by s390x builds not available at all now for SLE15

#20 Updated by okurz almost 4 years ago

  • Status changed from Feedback to In Progress
  • Priority changed from Normal to High
  • Target version changed from Milestone 10 to Milestone 11

need to check if we have proper s390x image generation jobs that we can rely on by now. If not, we are blocked by that.

#21 Updated by okurz almost 4 years ago

  • Due date set to 2017-10-11

#22 Updated by okurz almost 4 years ago

SLE15 scenario create_hdd_minimal_base+sdk is fine so we can use it. I added "create_hdd_minimal_base+sdk@s390x-kvm" to SLE 15 Functional and sched_stress@s390x-kvm, process_stress@s390x-kvm, fs_stress@s390x-kvm to the test development job group.

Triggered one test manually with

openqa_clone_job_osd --skip-deps 1186961 PUBLISH_HDD_1= INSTALLONLY= _GROUP="Test Development: SLE 15" BOOT_HDD_IMAGE=1 HDD_1=SLES-15-s390x-278.1-minimal_with_sdk278.1_installed.qcow2 MAX_JOB_TIME=7200 QA_HEAD_REPO=http://149.44.176.2/ibs/QA:/Head/SLE-15/ QA_TESTSET=acceptance_fs_stress VIRTIO_CONSOLE=0 TEST=fs_stress

-> https://openqa.suse.de/tests/1190743#live

failed because couldn't find a pty device?!?

#23 Updated by okurz almost 4 years ago

  • Assignee deleted (okurz)

unassigning for holiday.

mgriessmeier, riafarov maybe you can continue?

#24 Updated by riafarov almost 4 years ago

mgriessmeier and I have fixed the initial issue: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3676 . The problem was that the domain XML was not defined when we tried to grep for the pty device.
Now it fails to switch to the root console: http://opeth.suse.de/tests/5709#step/acceptance_fs_stress/30

#25 Updated by okurz almost 4 years ago

  • Assignee set to okurz

Discussed with riafarov. PR merged. Same problem as on opeth now reproduced on osd:

https://openqa.suse.de/tests/1197620/file/serial0.txt explains the problem with s390x stress tests, see at the end the errors about network config rewriting. With no network it is understandable why ssh can not be reached:

susetest:~ # sed -i "s:IPADDR='[0-9.]*/\([0-9]*\)':IPADDR='10.161.145.16/\1':" /etc/sysconfig/network/ifcfg-*
susetest:~ # cat /etc/sysconfig/network/ifcfg-eth2 /etc/sysconfig/network/ifcfg-eth2.old /etc/sysconfig/network/ifcfg-lo
cat: /etc/sysconfig/network/ifcfg-eth2: No such file or directory
cat: /etc/sysconfig/network/ifcfg-eth2.old: No such file or directory
# Loopback (lo) configuration
IPADDR=127.0.0.1/8
NETMASK=255.0.0.0
NETWORK=127.0.0.0
STARTMODE=nfsroot
BOOTPROTO=static
USERCONTROL=no
FIREWALL=no

The same is also visible in https://openqa.suse.de/tests/1197620#step/acceptance_sched_stress/24

It looks like the wrong names of ethernet devices are evaluated with the globbing *.
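
To illustrate how a bare `ifcfg-*` glob can pick up more than the real ethernet device, here is a minimal sketch in a scratch directory. The file names, the address and the filter rule (skip `lo` and `*.old`) are assumptions for illustration only; this is not the code used by the stress test modules.

```shell
#!/bin/sh
# Sketch only: file names and addresses are made up for illustration.
set -eu
cfgdir=$(mktemp -d)
trap 'rm -rf "$cfgdir"' EXIT

# A scratch directory standing in for /etc/sysconfig/network.
printf "IPADDR='10.0.0.5/24'\n" > "$cfgdir/ifcfg-eth0"
printf "IPADDR='10.0.0.5/24'\n" > "$cfgdir/ifcfg-eth0.old"
printf "IPADDR=127.0.0.1/8\n" > "$cfgdir/ifcfg-lo"

# The plain glob matches all three files, not just the real interface.
set -- "$cfgdir"/ifcfg-*
match_count=$#

# Skipping loopback and backup files leaves only the interface to rewrite.
real_ifaces=""
for f in "$cfgdir"/ifcfg-*; do
    name=${f##*/ifcfg-}
    case "$name" in
        lo|*.old) continue ;;
    esac
    real_ifaces="${real_ifaces:+$real_ifaces }$name"
done

echo "glob matched $match_count files; real interfaces: $real_ifaces"
```

Any name derived from the unfiltered glob would include `lo` and `eth0.old`, which are not devices that should be reconfigured.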

#26 Updated by okurz almost 4 years ago

#27 Updated by okurz almost 4 years ago

  • Subject changed from [tools][sles][functional]Enable 3 stress acceptance on s390x to [tools][sles][functional][BLOCKED]Enable 3 stress acceptance on s390x
  • Status changed from In Progress to Feedback

Discussed with riafarov and we decided that we are blocked here by #19362 . The test modules "acceptance_sched_stress" and such are not easy to understand, they do some custom stuff about booting the images and probably are missing some parts for s390x. So the approach should be to split the test modules accordingly into booting existing disk images which should also work for s390x and then we can see what's missing about s390x in this specific scenario.

#28 Updated by okurz almost 4 years ago

  • Blocked by action #13216: [sles][functional][s390x] Run extratest on s390x added

#29 Updated by okurz almost 4 years ago

  • Target version changed from Milestone 11 to Milestone 13

Also blocked by #13216 because we still have no extra tests on s390x, which should be done first to ensure we have any test on s390x that can show we can boot a pre-created disk image.

#30 Updated by cachen almost 4 years ago

It looks like booting the pre-created hdd works fine, but the test failed in the select_console('root-console') step. When the test is running on s390 zvm with svirt, what kind of console should be used?

qa_run.pm:
sub system_login {
    my $self = shift;
    $self->wait_boot;
    if (get_var('VIRTIO_CONSOLE')) {
        select_console('root-virtio-terminal');
    }
    else {
        select_console('root-console');
    }
}

09:04:51.5209 15757 activate_console, console: root-console, type: console
09:04:51.5210 15757 backend s390x || zkvm
09:04:51.5213 15757 <<< testapi::assert_screen(mustmatch='password-prompt', timeout=30)
09:04:51.5849 15759 MATCH(password-prompt-20170927:0.00)
09:04:51.5984 15759 MATCH(password-prompt-20150730:0.00)
09:04:51.6143 15759 MATCH(password-prompt-20141126:0.00)
09:04:51.9707 15759 MATCH(password-prompt-20160405:0.00)
09:04:51.9857 15759 MATCH(password-prompt-20160414:0.00)
09:04:52.0012 15759 MATCH(password-prompt-bsc965787-20160216:0.00)
09:04:52.0210 15759 MATCH(password-prompt-ipmi-20170619:0.00)
09:04:52.0367 15759 MATCH(password-prompt-ipmi-20170627:0.00)
09:04:52.0629 15759 MATCH(password-prompt-xterm-20150818:0.00)
09:04:52.0635 15759 WARNING: check_asserted_screen took 0.52 seconds - make your needles more specific
09:04:52.0639 15759 DEBUG_IO:

#31 Updated by okurz almost 4 years ago

cachen wrote:

It looks like booting the pre-created hdd works fine, but the test failed in the select_console('root-console') step. When the test is running on s390 zvm with svirt, what kind of console should be used?
[…]

Nothing is wrong about the console but there is a preparatory step missing. This is why I stated in #18936#note-27 that #19362 should come first by splitting qa_run and using the standard test modules to boot to a complete system in the proper way. Trying to patch the custom stress test code is IMHO not efficient.

#32 Updated by cachen almost 4 years ago

okurz wrote:

cachen wrote:

It looks like booting the pre-created hdd works fine, but the test failed in the select_console('root-console') step. When the test is running on s390 zvm with svirt, what kind of console should be used?
[…]

Nothing is wrong about the console but there is a preparatory step missing. This is why I stated in #18936#note-27 that #19362 should come first by splitting qa_run and using the standard test modules to boot to a complete system in the proper way. Trying to patch the custom stress test code is IMHO not efficient.

Clearly, system_login is the first step of this test in qa_run.pm. It looks like this first step needs some specific setup for the s390 zvm host connection, which can be defined in qa_run.pm.
Although splitting the workload in #19362 is an enhancement we would like to make, I don't think that is the root cause. The key point now is that we need to know how to connect to and reuse the s390_created_hdd as the testing host in openQA. In other words, which step is missing before "select_console('root-console')"? Does it need reconnect_s390?

[...]
sub run {
    my $self = shift;
    $self->system_login();
    $self->prepare_repos();
[...]

#33 Updated by okurz almost 4 years ago

cachen wrote:

The key point now is that we need to know how to connect to and reuse the s390_created_hdd as the testing host in openQA. In other words, which step is missing before "select_console('root-console')"? Does it need reconnect_s390?

and to solve that question I proposed to work on #19362 first

#34 Updated by cachen almost 4 years ago

okurz wrote:

cachen wrote:

The key point now is that we need to know how to connect to and reuse the s390_created_hdd as the testing host in openQA. In other words, which step is missing before "select_console('root-console')"? Does it need reconnect_s390?

and to solve that question I proposed to work on #19362 first

My understanding is that splitting the workload in #19362 is an enhancement of the test structure, better for debugging and reviewing, but it does not fix this main issue. The urgent thing is to learn which step is missing. Whether to put this step into the qa_run.pm structure or into a single *.pm to be called, I would leave to the tester to decide. #19362 was created many months ago. I want to enhance the structure, but Sero and Nathan have been taking on many other high-priority tasks in project testing and test suite fixing, so we have to give those 'enhancement' tasks lower priority.

If you don't mind, I will ask Nathan to help look at this issue together with you, since our other test suites in the 'Kernel' and 'Userspace' groups will see the same problem when they are enabled on the s390 arch. On the other hand, when Sero finishes his high-priority tasks, he will keep working on the splitting task. Is that fine with you? :)

#35 Updated by okurz almost 4 years ago

  • Due date changed from 2017-10-11 to 2017-10-25
  • Assignee changed from okurz to riafarov
  • Target version changed from Milestone 13 to Milestone 11

I understand how you see the priority. Maybe riafarov and I can already solve it soon on our own after experimenting with what is necessary for s390x. We plan to solve this issue, and therefore all blocking tasks, in our next QA SLE functional sprint starting tomorrow.

#36 Updated by riafarov almost 4 years ago

https://openqa.suse.de/tests/1204159# fs_stress test run on s390x

#37 Updated by okurz almost 4 years ago

that looks very promising.

#38 Updated by cachen almost 4 years ago

wow, how can I express my big thanks to you and riafarov for your great activity and support in openQA? Much appreciated! We will follow up and learn more about openQA usage.

#39 Updated by riafarov almost 4 years ago

  • Blocked by deleted (action #19362: [userspace] split qa_run.pm)

#40 Updated by riafarov almost 4 years ago

  • Subject changed from [tools][sles][functional][BLOCKED]Enable 3 stress acceptance on s390x to [tools][sles][functional] Enable 3 stress acceptance on s390x

Not blocked anymore. Changes to qa_run merged.

#41 Updated by okurz almost 4 years ago

So https://openqa.suse.de/tests/1204159# for build 278.1 was fine but build 300.3 failed to boot: https://openqa.suse.de/tests/1210591#step/boot_to_desktop/20
ideas?

#42 Updated by okurz almost 4 years ago

  • Status changed from Feedback to In Progress

So we had the following finding: the switch to 'root-console' does not work. Most likely the problem is that we run the image creation job on s390x-kvm as well as on zkvm, and both publish the image under the same name. So whichever job completes last generates the image that is kept. If a test booting that image is then triggered, it will most likely fail if the machine of the image generation job differs from the one booting that image.

I see the following options:

  • get rid of the zkvm jobs and replace all downstream jobs by s390x-kvm ones as we want to phase out zkvm anyway. This could be done by "dump_templates", replace, "load_templates" or manually in the job group schedule assignment.
  • add %MACHINE% to PUBLISH_HDD_1 to be machine specific. That's already used in many test suites so why not?
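
The name collision and the second option can be sketched as follows. Only the %MACHINE% idea comes from the ticket; the helper `image_name` and the exact name pattern are hypothetical and chosen for illustration, not taken from any openQA test suite setting.

```shell
#!/bin/sh
# Sketch only: the name pattern and the helper are hypothetical, not openQA code.
set -eu

# Build a published image name; the machine part is optional.
image_name() {
    version=$1 arch=$2 build=$3 machine=${4:-}
    if [ -n "$machine" ]; then
        echo "SLES-$version-$arch-$build@$machine-minimal_with_sdk.qcow2"
    else
        echo "SLES-$version-$arch-$build-minimal_with_sdk.qcow2"
    fi
}

# Without the machine, both creation jobs publish the same name: the last writer wins.
shared_kvm=$(image_name 15 s390x 278.1)
shared_zkvm=$(image_name 15 s390x 278.1)

# With the machine in the name, each job publishes its own image.
own_kvm=$(image_name 15 s390x 278.1 s390x-kvm)
own_zkvm=$(image_name 15 s390x 278.1 zkvm)

echo "shared: $shared_kvm"
echo "per machine: $own_kvm / $own_zkvm"
```

With machine-specific names, every chained job also has to reference the name for its own machine, otherwise it would look for an image that is never published.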

#43 Updated by riafarov almost 4 years ago

  • Status changed from In Progress to Feedback

We can see that we don't reach the SUT: X is running, but we are not able to switch to the root console. Even on the same worker instance the test succeeds and fails randomly. Debugging is complicated, as we have to kill the virsh connection from openQA to connect, and then we need a second connection to actually send the commands.

See runs here: https://openqa.suse.de/tests/1212861#live
We've used zkvm, as x-kvm fails to create the qcow image.
Here is an example of the same worker instance which once failed to connect and once passed:
https://openqa.suse.de/tests/1212516
https://openqa.suse.de/tests/1212522

#44 Updated by riafarov almost 4 years ago

  • Blocked by deleted (action #13216: [sles][functional][s390x] Run extratest on s390x)

#45 Updated by riafarov almost 4 years ago

  • Related to action #25638: [sles][functional][s390x] test fails in shutdown: VNC stall detected, needs to be investigated added

#46 Updated by riafarov almost 4 years ago

  • Related to action #26022: [sle][functional] Make sure that s390x use comatible qcow2 image created by parent job added

#47 Updated by riafarov almost 4 years ago

Enabled stress tests in the functional job group for zkvm, as they work fine there: https://openqa.suse.de/tests/1227017#previous

For x-kvm we need to add %MACHINE% to the created qcow2 image name. Workaround for shutdown is implemented.

#48 Updated by riafarov almost 4 years ago

With the workaround fix for x-kvm, we are now able to create the image using the create_hdd_minimal_base+sdk job and trigger stress tests there. But before that we also need to add the %MACHINE% variable to the published image name and to all chained jobs, so that we use the relevant qcow image. This should be done here: https://progress.opensuse.org/issues/26022

#49 Updated by okurz almost 4 years ago

  • Due date changed from 2017-10-25 to 2017-11-08

carry over to sprint 3 because we should be unblocked again within that sprint.

#50 Updated by riafarov almost 4 years ago

Enabled qcow2 creation for zkvm and x-kvm. Stress tests for x-kvm are enabled in the dev group only atm to prove that they work.
Here are the runs on x-kvm:
https://openqa.suse.de/tests/1233848
https://openqa.suse.de/tests/1234671
https://openqa.suse.de/tests/1235017

#51 Updated by riafarov almost 4 years ago

  • Status changed from Feedback to Resolved

#52 Updated by okurz almost 4 years ago

wow, good one :-)
