Project

General

Profile

Actions

action #60662

closed

[SLE][Migration][SLE15SP2][Regression] Some regression tests fails on ppc64le - /dev/hvc0: permission denied

Added by tinawang123 over 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Bugs in existing tests
Target version:
-
Start date:
2019-12-05
Due date:
% Done:

100%

Estimated time:
16.00 h
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP2-Regression-on-Migration-from-SLE15-SPX-to-SLE15-SP2-ppc64le-online_sles15_pscc_Addons-basesys-srv-desk-dev-contm-lgm-pcm-wsm_all_full@ppc64le fails in
curl_https

Test suite description

Reproducible

Fails since (at least) Build 80.5

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open1 closed)

Related to openQA Tests - action #60935: [SLE][Migration][SLE15SP2][Regression]test fails in salt - salt-run timeoutResolvedcoolgw2019-12-12

Actions
Actions #1

Updated by coolgw over 4 years ago

dev/hvc0 not permission

Actions #2

Updated by hjluo over 4 years ago

  • Assignee set to hjluo
Actions #3

Updated by leli over 4 years ago

  • Estimated time set to 16.00 h
Actions #4

Updated by hjluo over 4 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 20

rerun it in 108.1 to check it it still hit this issue.
https://openqa.suse.de/tests/3739548 this case failed at zypper migration
reason: proxy scc is not working during zypper migration

Actions #5

Updated by hjluo over 4 years ago

  • Status changed from In Progress to Blocked
  • % Done changed from 20 to 30

migration was not completed and can't reach curl_https module.

Actions #7

Updated by hjluo over 4 years ago

skip module install_service and check_upgraded_service and rerun https://openqa.suse.de/tests/375091

Actions #8

Updated by hjluo over 4 years ago

  • % Done changed from 30 to 40

it looks like the connection that curl to https://eu.httpbin.org was timedout sometimes.

curl -f -v https://eu.httpbin.org/get failed: script timeout: curl -f -v https://httpbin.org/get 2>&1 at /usr/lib/os-autoinst/testapi.pm line 1091.

[2019-12-04T08:45:27.131 CET] [debug] Maybe the network is busy. Retry: 7 of 7
[2019-12-04T08:45:27.131 CET] [debug] curl failed (with retries) at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/console/curl_https.pm line 36.

curl_https-dmesg.log:[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinux-5.3.13-1-default root=UUID=567b5484-b2b4-40b4-85a1-b19d7550fc9a plymouth.ignore-serial-consoles console=hvc0 console=tty splash=silent showopts crashkernel=640M

Actions #9

Updated by hjluo over 4 years ago

Actions #10

Updated by hjluo over 4 years ago

try console=tty console=hvc0 in kernel boot parameters

Actions #12

Updated by hjluo over 4 years ago

  • Status changed from Blocked to In Progress
Actions #14

Updated by hjluo over 4 years ago

  • Status changed from In Progress to Blocked
Actions #15

Updated by leli over 4 years ago

  • Subject changed from [SLE][Migration][SLE15SP2]test fails in curl_https - permission denied to [SLE][Migration][SLE15SP2][Regression]test fails in curl_https - permission denied
Actions #16

Updated by hjluo about 4 years ago

the test case name is online_sles15_pscc_lp-basesys-srv-desk-dev-contm-lgm-wsm_all_full on ppc64le. On latest build 132.1 I saw this case passed.
and in 132.1 blocked at zypper migration for poo#63184

Actions #17

Updated by hjluo about 4 years ago

  • Status changed from Blocked to Resolved
  • % Done changed from 50 to 100

till 142.1 didn't hit this issue so we'll close this by now and will reopen it once the test case hit this case again.

Actions #19

Updated by hjluo about 4 years ago

  • % Done changed from 100 to 60

actually it's not the permission issue, just because the shell output is not fully returned.

Actions #20

Updated by hjluo about 4 years ago

verify run with enlarging timeout https://openqa.nue.suse.com/tests/3975555 passed.
and the previous failed one is: http://openqa.suse.de/tests/3971265

Actions #21

Updated by leli about 4 years ago

  • Priority changed from Normal to Urgent
Actions #22

Updated by hjluo about 4 years ago

  • Subject changed from [SLE][Migration][SLE15SP2][Regression]test fails in curl_https - permission denied to [SLE][Migration][SLE15SP2][Regression] Some regression tests fails on ppc64le - /dev/hvc0: permission denied

This kind of issue includes curl_https,salt,mtab, zypper_lifecycle,keymap_or_locale_x11,sshxterm

Actions #23

Updated by coolgw about 4 years ago

  • Related to action #60935: [SLE][Migration][SLE15SP2][Regression]test fails in salt - salt-run timeout added
Actions #24

Updated by okurz about 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_sles15sp1_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full
https://openqa.suse.de/tests/4042033

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #25

Updated by hjluo about 4 years ago

  • % Done changed from 60 to 70
Actions #26

Updated by hjluo about 4 years ago

in the logfile we can see the console wait time of 3o is not enough
https://openqa.nue.suse.com/tests/4042033/file/autoinst-log.txt

[2020-03-26T16:02:31.936 CET] [debug] tests/console/curl_https.pm:24 called testapi::select_console -> lib/susedistribution.pm:884 called testapi::assert_screen
[2020-03-26T16:02:31.936 CET] [debug] <<< testapi::assert_screen(mustmatch="user-console", no_wait=1, timeout=30)
[2020-03-26T16:02:32.050 CET] [debug] >>> testapi::_handle_found_needle: found user-console-20191207, similarity 1.00 @ 145/210
[2020-03-26T16:02:32.050 CET] [debug] tests/console/curl_https.pm:29 called (eval) -> tests/console/curl_https.pm:30 called testapi::validate_script_output
[2020-03-26T16:02:32.050 CET] [debug] <<< testapi::type_string(string="cat - > /tmp/script4V9zF.sh;\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2020-03-26T16:02:35.132 CET] [debug] tests/console/curl_https.pm:29 called (eval) -> tests/console/curl_https.pm:30 called testapi::validate_script_output
[2020-03-26T16:02:35.133 CET] [debug] <<< testapi::type_string(string="curl -f -v https://httpbin.org/get 2>&1\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2020-03-26T16:02:39.327 CET] [debug] tests/console/curl_https.pm:29 called (eval) -> tests/console/curl_https.pm:30 called testapi::validate_script_output
[2020-03-26T16:02:39.328 CET] [debug] <<< testapi::send_key(key="ctrl-d", wait_screen_change=0, do_wait=0)
[2020-03-26T16:02:39.929 CET] [debug] tests/console/curl_https.pm:29 called (eval) -> tests/console/curl_https.pm:30 called testapi::validate_script_output
[2020-03-26T16:02:39.929 CET] [debug] <<< testapi::type_string(string="(echo 4V9zF; bash -eox pipefail /tmp/script4V9zF.sh ; echo SCRIPT_FINISHED4V9zF-\$?-) | tee /dev/hvc0\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2020-03-26T16:02:51.402 CET] [debug] tests/console/curl_https.pm:29 called (eval) -> tests/console/curl_https.pm:30 called testapi::validate_script_output
[2020-03-26T16:02:51.403 CET] [debug] <<< testapi::wait_serial(quiet=undef, regexp="SCRIPT_FINISHED4V9zF-\d+-", timeout=30, expect_not_found=0, no_regex=0, buffer_size=undef, record_output=1)
[2020-03-26T16:03:22.536 CET] [debug] >>> testapi::wait_serial: SCRIPT_FINISHED4V9zF-\d+-: fail
[2020-03-26T16:03:22.537 CET] [debug] curl -f -v https://eu.httpbin.org/get failed: script timeout: curl -f -v https://httpbin.org/get 2>&1 at /usr/lib/os-autoinst/testapi.pm line 1104.

[2020-03-26T16:03:22.537 CET] [debug] Maybe the network is busy. Retry: 1 of 7
[2020-03-26T16:03:22.537 CET] [debug] tests/console/curl_https.pm:29 called (eval) -> tests/console/curl_https.pm:30 called testapi::validate_script_output
[2020-03-26T16:03:22.537 CET] [debug] <<< testapi::type_string(string="cat - > /tmp/script4V9zF.sh;\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2020-03-26T16:03:25.606 CET] [debug] tests/console/curl_https.pm:29 called (eval) -> tests/console/curl_https.pm:30 called testapi::validate_script_output
[2020-03-26T16:03:25.607 CET] [debug] <<< testapi::type_string(string="curl -f -v https://httpbin.org/get 2>&1\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2020-03-26T16:03:29.782 CET] [debug] tests/console/curl_https.pm:29 called (eval) -> tests/console/curl_https.pm:30 called testapi::validate_script_output
[2020-03-26T16:03:29.782 CET] [debug] <<< testapi::send_key(key="ctrl-d", wait_screen_change=0, do_wait=0)
[2020-03-26T16:03:30.383 CET] [debug] tests/console/curl_https.pm:29 called (eval) -> tests/console/curl_https.pm:30 called testapi::validate_script_output
[2020-03-26T16:03:30.383 CET] [debug] <<< testapi::type_string(string="(echo 4V9zF; bash -eox pipefail /tmp/script4V9zF.sh ; echo SCRIPT_FINISHED4V9zF-\$?-) | tee /dev/hvc0\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2020-03-26T16:03:41.870 CET] [debug] tests/console/curl_https.pm:29 called (eval) -> tests/console/curl_https.pm:30 called testapi::validate_script_output
[2020-03-26T16:03:41.870 CET] [debug] <<< testapi::wait_serial(quiet=undef, timeout=30, regexp="SCRIPT_FINISHED4V9zF-\d+-", record_output=1, buffer_size=undef, expect_not_found=0, no_regex=0)
[2020-03-26T16:04:12.999 CET] [debug] >>> testapi::wait_serial: SCRIPT_FINISHED4V9zF-\d+-: fail
[2020-03-26T16:04:13.000 CET] [debug] curl -f -v https://eu.httpbin.org/get failed: script timeout: curl -f -v https://httpbin.org/get 2>&1 at /usr/lib/os-autoinst/testapi.pm line 1104.

Actions #27

Updated by hjluo about 4 years ago

  • % Done changed from 70 to 80

now we use the script_run to avoid the console output which may be caused by the valide_script_output. with new fix it will works even
we hit the /dev/hvc0 permission denied.
http://149.44.176.58/tests/4072395#step/curl_https/8 ==>permission denied
http://149.44.176.58/tests/4072613#step/curl_https/2 ==>no permission denied

Actions #29

Updated by hjluo about 4 years ago

Oliver asked for a change request
'There are already many test modules that call ensure_serialdev_permissions, not every unrelated test module should do the same. In your test run already https://openqa.nue.suse.com/tests/4079380#step/system_prepare/8 does the necessary parts. If something has gone missing than this is most likely about either loosing the group for the user or using a different user. This should be fixed rather than putting architecture specific single fixes in arbitrary test modules.'

we need check where we 'loosing the group for the user or using a different user' after the system_prepare.

Actions #30

Updated by leli about 4 years ago

  • Priority changed from Urgent to High
Actions #31

Updated by hjluo about 4 years ago

clone and run without all service_check to see if our services have missed up the permission on ppc64le.
https://openqa.nue.suse.com/tests/4114809.

this case passed without /dev/hvc0 permission denied when select_console 'user-console'.

Actions #32

Updated by hjluo about 4 years ago

clone without install_service and check_upgraded_service and it passed without permission denied issue in /dev/hvc0, so we think the
issue was triggered by install_service and check_upgraded_service, then we re-clone it to just include install_service to see the result
the job is: http://openqa.suse.de/t4116232 and it passed.

Actions #33

Updated by hjluo about 4 years ago

run with both install_service and check_upgraded_service added https://openqa.nue.suse.com/tests/4116299
result curl_https and matab failed with /dev/hvc0 permission denied.

Actions #34

Updated by hjluo about 4 years ago

run with delete select_console 'root-console' in apparmor.pm in service check.
https://openqa.nue.suse.com/tests/4116300 FAILED.

Actions #35

Updated by hjluo about 4 years ago

run with delete of ensure_serialdev_permissions in keymap_or_locale.pm http://openqa.suse.de/t4117050, still failed.

Actions #36

Updated by hjluo about 4 years ago

Run with remove select_console 'root-console' in check_upgraded_service and INCLUDE_SERVICES=apache2 http://openqa.suse.de/t4122658

Actions #37

Updated by hjluo about 4 years ago

run with exclude_services=kdump http://openqa.suse.de/t4122649 PASSED,
rerun another one with EXCLUDE_SERVICES=kdump http://openqa.suse.de/t4123407 PASSED

so the kdump in check_upgraded_service was the culprit.

EXCLUDE_SERVICES=yast2-nfs-server,kdump http://openqa.suse.de/t4123042 PASSED

RUN with fix in lib/kdump_utils.pm http://openqa.suse.de/t4123495 failed at system_prepare.
http://openqa.suse.de/t4127984 FAILED

Actions #38

Updated by hjluo about 4 years ago

Now rerun with fix in kdump
http://openqa.nue.suse.com/t4159973

http://openqa.nue.suse.com/t4160279 PASSED which restore the serialdev_permission after panic reboot.
check_function();
if ($stage ne 'before') {
ensure_serialdev_permissions;
}

http://openqa.nue.suse.com/t4165107 PASSED

http://openqa.nue.suse.com/t4165108 PASSED

Actions #39

Updated by hjluo about 4 years ago

AS Gaowei suggested to run just kdump in service_check to verify that kdump is the culprit of this mess.
barry:~/:[0]# /usr/share/openqa/script/clone_job.pl --from http://openqa.suse.de --host http://openqa.suse.de 4152563 --apikey XXXX --apisecret XXXX _GROUP=0 --skip-download INCLUDE_SERVICES=kdump
Created job #4159854: sle-15-SP2-Regression-on-Migration-from-SLE15-SPX-to-SLE15-SP2-ppc64le-Build181.4-offline_sles15sp1_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full@ppc64le -> http://openqa.suse.de/t4159854
barry:~/:[0]# /usr/share/openqa/script/clone_job.pl --from http://openqa.suse.de --host http://openqa.suse.de 4154324 --apikey XXXX --apisecret XXXXX _GROUP=0 --skip-download INCLUDE_SERVICES=kdump
Created job #4159855: sle-15-SP2-Regression-on-Migration-from-SLE15-SPX-to-SLE15-SP2-ppc64le-Build181.4-offline_sles15_pscc_lp-basesys-srv-desk-dev-contm-lgm-wsm_all_full@ppc64le -> http://openqa.suse.de/t4159855

and the results VERIFIED that kdump is the culprit for the permission denied for /dev/hvc0
http://149.44.176.58/tests/4159854#step/curl_https/5
http://149.44.176.58/tests/4159855#step/curl_https/5

Actions #40

Updated by hjluo about 4 years ago

Actions #41

Updated by hjluo about 4 years ago

rerun with debugging code:

  • script_run 'ls -l /dev/hvc*'; check_function(); +
  • if ($stage ne 'before') {
  • script_run 'ls -l /dev/hvc*';
  • select_console 'user-console';

http://openqa.nue.suse.com/t4165135

http://openqa.nue.suse.com/t4165136

you can see from http://149.44.176.58/tests/4165136#step/check_upgraded_service/5, ->right owner
before the system panic which was issued by kdump, before system crash it's owner is right.
but after the reboot the owner back to root. which caused select_console 'user-console' failed.
this debugging code reproduced the issue.

http://149.44.176.58/tests/4165136#step/check_upgraded_service/31 -->root /dev/hvc0

http://149.44.176.58/tests/4165136#step/check_upgraded_service/37 -->select_console 'user-console'

Actions #43

Updated by hjluo almost 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 80 to 100

PR merged.

Actions

Also available in: Atom PDF