Project

General

Profile

Actions

action #42038

closed

[sle][functional[u] test fails in shutdown - add post_fail_hook for shutdown module if possible

Added by mloviska over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Bugs in existing tests
Target version:
SUSE QA - Milestone 23
Start date:
2018-10-05
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-12-SP4-Server-DVD-s390x-create_hdd_pcm_googlecloud@s390x-kvm-sle12 fails in
shutdown

Two issues with shutdown module have appeared in last 3 builds of sle12sp4 (0396, 0405, 0410) on s38:

However it is not really clear what has happened with the system. It would be nice to have some data to analyze gathered by post_fail_hook.

Reproducible

Fails since (at least) Build 0410 (current job)

Expected result

Last good: 0405 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 3 (0 open3 closed)

Related to openQA Tests - action #43376: [functional][u] Adapt opensusebasetest to provide dmesg and journal logResolvedSLindoMansilla2018-11-05

Actions
Related to openQA Tests - action #42863: [sle][functional][u] test fails in first_boot - hard to see what is exactly the cause for timeout issueResolvedokurz2018-10-24

Actions
Related to openQA Tests - action #43880: [functional][u][s390x][sporadic] test fails in shutdown on s390xResolvedzluo2018-11-16

Actions
Actions #1

Updated by okurz over 5 years ago

  • Related to action #43376: [functional][u] Adapt opensusebasetest to provide dmesg and journal log added
Actions #2

Updated by okurz over 5 years ago

  • Related to action #42863: [sle][functional][u] test fails in first_boot - hard to see what is exactly the cause for timeout issue added
Actions #3

Updated by okurz over 5 years ago

  • Related to action #43880: [functional][u][s390x][sporadic] test fails in shutdown on s390x added
Actions #4

Updated by SLindoMansilla over 5 years ago

  • Target version set to Milestone 22

I would say this is important to debug failures for shutdown issues.

Actions #5

Updated by okurz about 5 years ago

  • Priority changed from Normal to High
Actions #6

Updated by zluo about 5 years ago

  • Status changed from New to In Progress
  • Assignee set to zluo

take over.

Actions #7

Updated by zluo about 5 years ago

http://f40.suse.de/tests/884 shows problem with clone_job atm.

Actions #8

Updated by zluo about 5 years ago

trying directly on osd:
http://openqa.suse.de/tests/2408946

Actions #9

Updated by zluo about 5 years ago

Actions #10

Updated by zluo about 5 years ago

if shutdown is triggered and if it failed, there is no way to execute post_fail_hook.

http://f40.suse.de/tests/899#step/shutdown/

Actions #11

Updated by zluo about 5 years ago

of course we can collect some logs before power_action, but is this what we want to do really?

Actions #12

Updated by okurz about 5 years ago

The idea is to call the post_fail_hook in any case as the system might still be stuck before the actual power down. On a high level you can describe the states "shutdown" as such:

  • before shutdown was triggered
  • system is shutting down
  • system is powered off

The post_fail_hook should be able to identify the states and provide information suitable for every state.

Actions #13

Updated by zluo about 5 years ago

use DEBUG_SHUTDOWN to change log level so that we can call later /usr/lib/systemd/system-shutdown/debug.sh in post_fail_hook:

http://f40.suse.de/tests/904

Actions #14

Updated by zluo about 5 years ago

http://f40.suse.de/tests/927#step/shutdown/7 shows problem with

assert_script_run("journalctl --no-pager -b 0 > /tmp/full_journal.log");

typing issue happens again and again.

Actions #15

Updated by zluo about 5 years ago

http://f40.suse.de/tests/928#step/shutdown/9

with a empty space before journalctl it seems better but upload_logs failed because of curl server connection issue.

Actions #16

Updated by zluo about 5 years ago

http://f40.suse.de/tests/930#step/shutdown/16 resolved the issue with typing, it requires select_console('log-console');

but connection issue for uploading logs is still there...

Actions #17

Updated by zluo about 5 years ago

I found the reason why upload is not working, because cleanup_before_shutdown stop network.service:

  # Proceed with dhcp cleanup on qemu backend only.
    # Cleanup is made, because if same hdd image used in multimachine scenario
    # on several nodes, the dhcp clients use same id and cause conflicts on dhcpd server.
    if (check_var('BACKEND', 'qemu')) {
    my $network_status = script_output('systemctl status network');
    # Do dhcp cleanup for wicked
    if ($network_status =~ /wicked/) {
    systemctl 'stop network.service';
    systemctl 'stop wickedd.service';
    assert_script_run('ls /var/lib/wicked/');
    save_screenshot;
    assert_script_run('rm -f /var/lib/wicked/*.xml');
    }
    }

Actions #18

Updated by zluo about 5 years ago

changes in cleanup_before_shutdown.pm

# Proceed with dhcp cleanup on qemu backend only.
# Cleanup is made, because if same hdd image used in multimachine scenario
# on several nodes, the dhcp clients use same id and cause conflicts on dhcpd server.
# network.service is needed for uploading log for DEBUG_SHUTDOWN 
if (check_var('BACKEND', 'qemu') && !get_var('DEBUG_SHUTDOWN')){
my $network_status = script_output('systemctl status network');
# Do dhcp cleanup for wicked
if ($network_status =~ /wicked/) {
systemctl 'stop network.service';
systemctl 'stop wickedd.service';
assert_script_run('ls /var/lib/wicked/');
save_screenshot;
assert_script_run('rm -f /var/lib/wicked/*.xml');
}
}

http://f40.suse.de/tests/936#step/shutdown/21 shows successful uploading of logs.

Next step should be check this on s390 remote worker and think about what kind of additional logs are helpful.

Actions #19

Updated by zluo about 5 years ago

Actions #21

Updated by zluo about 5 years ago

  • Status changed from In Progress to Feedback
Actions #22

Updated by zluo about 5 years ago

  • Status changed from Feedback to In Progress

need to think about DEBUG_SHUTDOWN, since it already provide dmesg to serial0.txt.

Actions #23

Updated by zluo about 5 years ago

cleanup_before_shutdown before shutdown collects already dmesg logs to serial0.txt, so add journal log extra to serial0.txt with comment "--- dmesg ---" and "journactl log".

http://f40.suse.de/tests/1076/file/serial0.txt shows now expected logs.

Actions #24

Updated by okurz about 5 years ago

I like the idea of the PR but please keep in mind that the goal is to have a helpful post_fail_hook ready in all cases and not just on request when DEBUG_SHUTDOWN=1 is set

Actions #25

Updated by zluo about 5 years ago

The Ticket is about to provide useful logs for shutdown, and if DEBUG_SHUTDONW is not set, then we cannot get full logs. This can be done before shutdown (when shutdown got triggered, the network is in most cases down for example.) So I think this is what we can achieve atm. The goal you mean is different then it here.

Actions #26

Updated by okurz about 5 years ago

zluo wrote:

The Ticket is about to provide useful logs for shutdown, and if DEBUG_SHUTDONW is not set, then we cannot get full logs.

Please remember #42038#note-12 which we also discussed in person. The idea is if the system is actually not shut down so when we can still access a console. So I think in the post_fail_hook we should be able to try that.

Actions #27

Updated by mgriessmeier about 5 years ago

  • Status changed from In Progress to Feedback
Actions #28

Updated by zluo about 5 years ago

Note:

before shutdown was triggered

  • this is no problem, we can catch the logs in cleanup_before_shutdown.

system is shutting down

  • this is complicated, we need to define what we need to check and what is possible to check during system's shutting down

system is powered off

  • this is already covered in the test.
Actions #29

Updated by okurz about 5 years ago

  • Target version changed from Milestone 22 to Milestone 23

alright, so what do you think we still need to be able to close this ticket or do you think we are done already?

Actions #30

Updated by zluo about 5 years ago

  • Status changed from Feedback to Resolved

http://openqa.suse.de/tests/2470815/file/serial0.txt shows expected logs.

So we are done for now, however we should check the logs which provide useful information for the shutdown issue in future.

Actions

Also available in: Atom PDF