action #153057

closed

[tools] test fails in bootloader_start because openQA can not boot for s390x size:M

Added by AdaLovelace 11 months ago. Updated 7 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Infrastructure
Target version:
Start date: 2024-01-03
Due date:
% Done: 0%
Estimated time:
Difficulty:
Tags:

Description

Observation

We got a successful build of openSUSE Tumbleweed for s390x over the Christmas period.
openQA fails in the first steps, while preparing the bootloader and selecting the s3270 console:

Test died: expected command exit status ok, got error at /usr/lib/os-autoinst/consoles/s3270.pm line 75, <$fh> line 22.

    consoles::s3270::send_3270(consoles::s3270=HASH(0x556b400f2428), "Connect(s390zl11.openqanet.opensuse.org)") called at /usr/lib/os-autoinst/consoles/s3270.pm line 315
    consoles::s3270::_connect_3270(consoles::s3270=HASH(0x556b400f2428), "s390zl11.openqanet.opensuse.org") called at /usr/lib/os-autoinst/consoles/s3270.pm line 360
    consoles::s3270::connect_and_login(consoles::s3270=HASH(0x556b400f2428)) called at /usr/lib/os-autoinst/consoles/s3270.pm line 424
    consoles::s3270::activate(consoles::s3270=HASH(0x556b400f2428)) called at /usr/lib/os-autoinst/consoles/console.pm line 55
    consoles::console::select(consoles::s3270=HASH(0x556b400f2428)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 660
    backend::baseclass::try {...} () called at /usr/lib/perl5/vendor_perl/5.26.1/Try/Tiny.pm line 100
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Try/Tiny.pm line 93
    Try::Tiny::try(CODE(0x556b4000b430), Try::Tiny::Catch=REF(0x556b408595e8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 664
    backend::baseclass::select_console(backend::s390x=HASH(0x556b40393d78), HASH(0x556b3fdf2ea8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 79
    backend::baseclass::handle_command(backend::s390x=HASH(0x556b40393d78), HASH(0x556b4089ff50)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 616
    backend::baseclass::check_socket(backend::s390x=HASH(0x556b40393d78), IO::Handle=GLOB(0x556b40b26c28), 0) called at /usr/lib/os-autoinst/backend/s390x.pm line 41
    backend::s390x::check_socket(backend::s390x=HASH(0x556b40393d78), IO::Handle=GLOB(0x556b40b26c28), 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 284
    backend::baseclass::do_capture(backend::s390x=HASH(0x556b40393d78), undef, 1704270564.33364) called at /usr/lib/os-autoinst/backend/baseclass.pm line 311
    eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 311
    backend::baseclass::run_capture_loop(backend::s390x=HASH(0x556b40393d78)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 133
    backend::baseclass::run(backend::s390x=HASH(0x556b40393d78), 14, 17) called at /usr/lib/os-autoinst/backend/driver.pm line 68
    backend::driver::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x556b408938c8)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 329
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 329
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x556b408938c8), CODE(0x556b3f715d50)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 492
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x556b408938c8)) called at /usr/lib/os-autoinst/backend/driver.pm line 72
    backend::driver::start(backend::driver=HASH(0x556b40a88648)) called at /usr/lib/os-autoinst/backend/driver.pm line 37
    backend::driver::new("backend::driver", "s390x") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 14
    OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 100
    OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x556b3b1b8a38)) called at /usr/bin/isotovideo line 134

openQA test in scenario opensuse-Tumbleweed-DVD-s390x-autoyast_zvm@s390x-zVM-vswitch-l2 fails in
bootloader_start

Have there been any openQA changes over the Christmas period that could affect the boot process for s390x?

Test suite description

Create HDD for s390x textmode

Reproducible

Fails since (at least) Build 20231228

Expected result

Last good: 20231115 (or more recent)

Suggestions

  • See related ticket #137408 about the recent o3 setup
  • Confirm what s390zl11.openqanet.opensuse.org is and where it should be reachable from
    • Check the worker config
    • Lookup ipmi config? There is no ipmi
    • Check if there's an entry in pillars / Add a new entry
  • Verify if this is a regression or a product issue
  • Optional: Try to login to the machine manually with x3270, same as openQA tests do
  • Read https://progress.opensuse.org/projects/openqav3/wiki/#o3-s390-workers about the setup
  • Ask Oliver for details about the worker, since he might have been the last to work on it (if he still remembers), and mgriessmeier and nicksinger
  • Ask Ada
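
The reachability check suggested above can be sketched in a few lines of Python. This is a minimal sketch, not part of openQA: the helper name `can_connect` is hypothetical, and port 23 (the usual telnet port that x3270 connects to by default) is an assumption, since the ticket does not state which port the s3270 console uses.

```python
import socket


def can_connect(host: str, port: int = 23, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 23 is an assumption (default telnet port); adjust it to whatever
    the worker configuration actually uses.
    """
    try:
        # create_connection resolves the name and opens a TCP connection;
        # DNS errors, refusals and timeouts are all subclasses of OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from the worker host, `can_connect("s390zl11.openqanet.opensuse.org")` returning False would point at the network path or name resolution rather than at os-autoinst itself.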

Further details

Always latest result in this scenario: latest


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #137408: Support move of s390x mainframe(s) to PRG2 - o3 size:M (Resolved, mgriessmeier, 2023-06-29)
Related to openQA Tests - action #162101: [s390x] timeouts on s390x openQA Workers size:M (Resolved, okurz, 2024-06-11)

Actions #1

Updated by AdaLovelace 11 months ago

  • Category changed from Bugs in existing tests to Infrastructure
Actions #2

Updated by okurz 11 months ago

  • Subject changed from test fails in bootloader_start because openQA can not boot for s390x to [tools] test fails in bootloader_start because openQA can not boot for s390x
Actions #3

Updated by livdywan 11 months ago

  • Priority changed from Normal to High
  • Target version set to Ready

All of the investigation jobs are failing and it's not sporadic. So I'm guessing it's either a regression in os-autoinst or a change in our production infrastructure - this would have to have been introduced between 2023-12-23 and 2024-01-01.

Actions #4

Updated by AdaLovelace 11 months ago

When can we receive a working openQA environment?

Actions #5

Updated by okurz 11 months ago

Likely the same issue was brought up in https://suse.slack.com/archives/C02CANHLANP/p1704826925398509?thread_ts=1704826925.398509&cid=C02CANHLANP .

AdaLovelace wrote in #note-4:

When can we receive a working openQA environment?

Likely in the next few weeks. The ticket was only set to High a few days ago, and our usual reaction time for High tickets leaves about 25 days before we would remind the team to look into it again.

Actions #6

Updated by okurz 11 months ago

  • Priority changed from High to Urgent
Actions #7

Updated by okurz 11 months ago

  • Related to action #137408: Support move of s390x mainframe(s) to PRG2 - o3 size:M added
Actions #8

Updated by livdywan 11 months ago

  • Subject changed from [tools] test fails in bootloader_start because openQA can not boot for s390x to [tools] test fails in bootloader_start because openQA can not boot for s390x size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #9

Updated by okurz 11 months ago

  • Tags set to infra
Actions #10

Updated by mkittler 11 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #11

Updated by okurz 11 months ago · Edited

https://openqa.opensuse.org/tests/3860852 passed again. mkittler asked mgriessmeier in https://suse.slack.com/archives/C02CANHLANP/p1704884374336609, mgriessmeier asked gschlotter, who asked astalker, who apparently unplugged and replugged an SFP+ connection, which fixed the issue.

From o3 I can also now ping s390zl11:

okurz@new-ariel:~> ping s390zl11.openqanet.opensuse.org 
PING s390zl11.openqanet.opensuse.org (10.150.1.41) 56(84) bytes of data.
64 bytes from s390zl11.openqanet.opensuse.org (10.150.1.41): icmp_seq=1 ttl=60 time=0.597 ms

I retriggered the other failures in https://openqa.opensuse.org/tests/overview?build=20240107&groupid=34&version=Tumbleweed&distri=opensuse as well. If they pass, you can resolve the ticket.

It seems on w23 services like container-openqaworker23_container_102.service are disabled. I suggest you check 102-104
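
A check of the worker slot services could look roughly like the following Python sketch. The unit name pattern is taken from the one service named above; that slots 103 and 104 follow the same pattern is an assumption, and the helper names `unit_names` and `is_enabled` are hypothetical.

```python
import subprocess

# Pattern derived from container-openqaworker23_container_102.service;
# assuming the other slots are named the same way.
UNIT_TEMPLATE = "container-openqaworker23_container_{slot}.service"


def unit_names(slots):
    """Build the systemd unit name for each worker slot number."""
    return [UNIT_TEMPLATE.format(slot=slot) for slot in slots]


def is_enabled(unit: str) -> bool:
    """Return True if `systemctl is-enabled` exits with status 0 for the unit."""
    result = subprocess.run(
        ["systemctl", "is-enabled", "--quiet", unit], check=False
    )
    return result.returncode == 0
```

On w23, something like `[u for u in unit_names(range(102, 105)) if not is_enabled(u)]` would then list the slots that still need to be enabled.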

Actions #12

Updated by mkittler 11 months ago · Edited

  • Status changed from In Progress to Feedback

I enabled/started the other worker slots as well and it looks like it works, e.g. https://openqa.opensuse.org/tests/3860905 is currently running.

I also changed the hostnames from openqaworker1 to openqaworker23 in the Wiki.

Actions #13

Updated by AdaLovelace 11 months ago

Thank you!
Interesting that an SFTP connection was the issue. It is working again so far, that we can release in the future again.
I will adapt a needle today for a successful Tumbleweed release.

Actions #14

Updated by okurz 11 months ago

  • Status changed from Feedback to Resolved

AdaLovelace wrote in #note-13:

Thank you!
Interesting that an SFTP connection was the issue. It is working again so far, that we can release in the future again.

No, not SFTP. SFP+ as in https://en.wikipedia.org/wiki/Small_Form-factor_Pluggable :)

With that we can resolve

Actions #15

Updated by openqa_review 7 months ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: autoyast_zvm
https://openqa.opensuse.org/tests/4091874#step/bootloader_start/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #16

Updated by livdywan 7 months ago

  • Status changed from Feedback to Workable

I guess @openqa_review is not aware that we don't want urgent tickets in Feedback 😜

I suppose there were no mitigations here?

{
  "args" => [
              "output_delim",
              "(?^:Loading Installation System)",
              "timeout",
              300
            ],
  "cmd" => "backend_proxy_console_call",
  "console" => "x3270",
  "function" => "expect_3270",
  "wantarray" => "",
  "json_cmd_token" => "vEMHEmMg"
}
expect_3270: timed out.
  waiting for $VAR1 = {
          'clear_buffer' => 0,
          'timeout' => 300,
          'expected_status' => qr/RUNNING/u,
          'buffer_full' => qr/MORE\.\.\./u,
          'delete_lines' => qr/^ +$/u,
          'buffer_ready' => $VAR1->{'expected_status'},
          'output_delim' => '(?^:Loading Installation System)'
        };
[...]
consoles::s3270::expect_3270(consoles::s3270=HASH(0x5601d825c718), "output_delim", "(?^:Loading Installation System)", "timeout", 300) called at /usr/lib/os-autoinst/backend/baseclass.pm line 821
Actions #17

Updated by mkittler 7 months ago

  • Status changed from Workable to Resolved

Those are different types of failures now. So I deleted all irrelevant comments referencing this ticket. I suppose reviewers have to look at it and create a new ticket. This time it doesn't look like there's something fundamentally broken with the containerized setup of s390x workers. It rather looks like the SUT is misbehaving on boot.
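
To illustrate the distinction between the two failure types, here is a small Python sketch that labels a log excerpt by the error strings quoted in this ticket. The labels and the function name `classify_failure` are made up for illustration; only the two matched strings come from the logs above.

```python
import re

# Signatures copied from the two log excerpts in this ticket.
# The labels are hypothetical names chosen for this sketch.
SIGNATURES = [
    ("connect_failed", re.compile(r"expected command exit status ok, got error")),
    ("expect_timeout", re.compile(r"expect_3270: timed out")),
]


def classify_failure(log_text: str) -> str:
    """Return the label of the first signature found in the log text."""
    for label, pattern in SIGNATURES:
        if pattern.search(log_text):
            return label
    return "unknown"
```

The original report would be labelled `connect_failed` (the Connect() call to the mainframe failed), while the later jobs would be labelled `expect_timeout` (the console connected but the expected output never appeared).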

Actions #18

Updated by mgriessmeier 7 months ago

There seems to be an issue with retrieving the AutoYaST file from the worker, which should be investigated. Based on the results of other jobs, I don't see a "general" network issue.

            'Downloading AutoYaST file: http://openqaworker23:21013/Cgdb_ENScohOIvfp/files/zv',
            'm.xml                                                                           ',
            '[   13.591189][ T1165] qeth: register layer 3 discipline                        ',
            '[   13.597000][ T1156] qeth 0.0.0a00: CHID: ff00 CHPID: 0                       ',
            '[   13.597678][ T1156] qeth 0.0.0a02: qdio: OSA on SC 2 using AI:1 QEBSM:0 PRI:1',
            ' TDD:1 SIGA:RW                                                                  ',
            '[   13.610181][ T1156] qeth 0.0.0a00: Device is a Virtual NIC QDIO card (level: ',
            'V730)                                                                           ',
            '[   13.610181][ T1156] with link type Virt.NIC QDIO.                            ',
            '[   13.610218][ T1156] qeth 0.0.0a00: Inbound source MAC-address not supported o',
            'n (unnamed net_device)                                                          ',
            '[   13.610237][ T1156] qeth 0.0.0a00: VLAN enabled                              ',
            '[   13.610247][ T1156] qeth 0.0.0a00: Multicast enabled                         ',
            '[   13.610272][ T1156] qeth 0.0.0a00: IPV6 enabled                              ',
            '[   13.610298][ T1156] qeth 0.0.0a00: Broadcast enabled                         ',
            '[   13.612387][ T1168] qeth 0.0.0a00 enca00: renamed from eth0                  ',
            'enca00: network config created                                                  ',
            'Loading http://openqaworker23:21013/Cgdb_ENScohOIvfp/files/zvm.xml -            '
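
The console dump above wraps every screen row at 80 columns, which splits the AutoYaST URL across two rows. A rough Python sketch for rejoining such rows and pulling out the URL follows; the function names are hypothetical, and the heuristic (a row continues if its last column is not a space) is an assumption that will miss rows wrapping exactly at a space.

```python
import re


def unwrap_screen(rows, width=80):
    """Rejoin screen rows that were wrapped at a fixed column width.

    A row is treated as continued on the next row when it fills the whole
    width and its last column is not whitespace.
    """
    lines = []
    continued = False
    for row in rows:
        if continued and lines:
            lines[-1] += row
        else:
            lines.append(row)
        continued = len(row) >= width and not row[width - 1].isspace()
    # Drop the padding that fills each row up to the screen width.
    return [line.rstrip() for line in lines]


def find_urls(lines):
    """Return all http(s) URLs found in the unwrapped lines."""
    return [url for line in lines for url in re.findall(r"https?://\S+", line)]
```

Applied to the first two rows of the dump, this yields the full URL ending in `zvm.xml`, which can then be requested from a machine in the same network to see whether the worker serves the file.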
Actions #19

Updated by okurz 5 months ago

  • Related to action #162101: [s390x] timeouts on s390x openQA Workers size:M added