action #69328

closed

[o3][s390x] Early fail on s390x workers: connection refused

Added by dimstar over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Infrastructure
Target version:
Start date: 2020-07-24
Due date: 2020-11-13
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario opensuse-Tumbleweed-DVD-s390x-textmode@s390x fails in bootloader_s390

# Test died: expected command exit status ok, got error at /usr/lib/os-autoinst/consoles/s3270.pm line 102.
    consoles::s3270::send_3270(consoles::s3270=HASH(0x564b2c90c3e8), "Connect(192.168.112.9)") called at /usr/lib/os-autoinst/consoles/s3270.pm line 375
    consoles::s3270::_connect_3270(consoles::s3270=HASH(0x564b2c90c3e8), "192.168.112.9") called at /usr/lib/os-autoinst/consoles/s3270.pm line 438
    consoles::s3270::connect_and_login(consoles::s3270=HASH(0x564b2c90c3e8)) called at /usr/lib/os-autoinst/consoles/s3270.pm line 506
    consoles::s3270::activate(consoles::s3270=HASH(0x564b2c90c3e8)) called at /usr/lib/os-autoinst/consoles/console.pm line 97
    consoles::console::select(consoles::s3270=HASH(0x564b2c90c3e8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 612
    backend::baseclass::try {...} () called at /usr/lib/perl5/vendor_perl/5.26.1/Try/Tiny.pm line 100
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Try/Tiny.pm line 93
    Try::Tiny::try(CODE(0x564b2b529808), Try::Tiny::Catch=REF(0x564b2bc1a8a0)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 616
    backend::baseclass::select_console(backend::s390x=HASH(0x564b2b9c3998), HASH(0x564b2c4dec60)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 89
    backend::baseclass::handle_command(backend::s390x=HASH(0x564b2b9c3998), HASH(0x564b2ca34000)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 570
    backend::baseclass::check_socket(backend::s390x=HASH(0x564b2b9c3998), IO::Handle=GLOB(0x564b2b8f8bf0), 0) called at /usr/lib/os-autoinst/backend/s390x.pm line 69
    backend::s390x::check_socket(backend::s390x=HASH(0x564b2b9c3998), IO::Handle=GLOB(0x564b2b8f8bf0), 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 276
    eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 191
    backend::baseclass::run_capture_loop(backend::s390x=HASH(0x564b2b9c3998)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 146
    backend::baseclass::run(backend::s390x=HASH(0x564b2b9c3998), 13, 16) called at /usr/lib/os-autoinst/backend/driver.pm line 88
    backend::driver::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x564b2c846f38)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x564b2c846f38), CODE(0x564b2694a508)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x564b2c846f38)) called at /usr/lib/os-autoinst/backend/driver.pm line 90
    backend::driver::start(backend::driver=HASH(0x564b2b806dd0)) called at /usr/lib/os-autoinst/backend/driver.pm line 54
    backend::driver::new("backend::driver", "s390x") called at /usr/bin/isotovideo line 233
    main::init_backend() called at /usr/bin/isotovideo line 284
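
The backtrace shows the failure at the s3270 `Connect(192.168.112.9)` step, so one quick way to tell a network problem from a test problem is to check from the worker host whether the z/VM host accepts TCP connections at all. The following is only a minimal sketch and not part of os-autoinst: the function name `can_connect` is hypothetical, and port 23 (the usual telnet/TN3270 port) is an assumption, since the ticket does not state which port is used.

```python
import socket


def can_connect(host: str, port: int = 23, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 23 is an assumed default (telnet/TN3270); adjust it if the
    z/VM host listens elsewhere.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable networks.
        return False
```

Calling `can_connect("192.168.112.9")` on the worker would return `False` for a refused or unreachable connection, which would point at the network or the z/VM guest rather than at the test code.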

This looks very similar to an issue from 2017 - https://progress.opensuse.org/issues/25662

The community interested in s390x Tumbleweed is currently completely blocked by this issue, as nothing can be said about the actual quality of the distro.

See also https://lists.opensuse.org/opensuse-project/2020-07/msg00127.html

Test suite description

Maintainer: okurz

Installation in textmode and selecting the textmode "desktop" during installation.

Reproducible

Fails since (at least) Build 20200428

Expected result

Last good: 20200425 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues: 2 (0 open, 2 closed)

Related to openQA Tests - action #77116: test fails in bootloader_s390 - ftp installation media directory repo is too long for using in parmfile - linux144, linux145, linux146, linux147 (rebel) (Resolved, SLindoMansilla, 2020-11-08)
Related to openQA Infrastructure - action #77209: workers on o3 machine rebel provide no "WORKER_HOSTNAME" value anymore but it shows up in journal of worker service (Resolved, SLindoMansilla, 2020-11-09)

Actions #1

Updated by SLindoMansilla over 3 years ago

  • Subject changed from Early fail on s390x workers: connection refused to [o3][s390x] Early fail on s390x workers: connection refused
Actions #2

Updated by SLindoMansilla over 3 years ago

I talked to Mike Friesenegger and found out that the old linux128 was moved from zvm63b.openqanet.opensuse.org (192.168.112.9) to s390zlp1.suse.de (10.161.159.101).
Note that these two addresses are on different networks.

We believe that zlp1 does not have a vswitch attached to the 192.168.112.0 network.
We will need to confirm with Ihno and Gerhard. If they agree then changes may need to be made to s390zlp1 or linux128 guest.
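
To illustrate the "different networks" point, here is a small sketch using Python's standard `ipaddress` module. It assumes that the "192.168.112.0 network" is a /24; the ticket does not state the prefix length. The helper name `in_openqa_net` is hypothetical.

```python
import ipaddress

# Assumption: the "192.168.112.0 network" is a /24 (prefix length not stated in the ticket).
OPENQA_NET = ipaddress.ip_network("192.168.112.0/24")


def in_openqa_net(address: str) -> bool:
    """Return True if the address lies inside the assumed openQA z/VM network."""
    return ipaddress.ip_address(address) in OPENQA_NET


# Old and new location of the linux128 guest, as given in the comment above.
hosts = [
    ("zvm63b.openqanet.opensuse.org", "192.168.112.9"),
    ("s390zlp1.suse.de", "10.161.159.101"),
]
for name, addr in hosts:
    print(f"{name} ({addr}) in {OPENQA_NET}: {in_openqa_net(addr)}")
```

Under that assumption the old address is inside the network and the new one is not, which fits the suspicion that the guest is no longer reachable via 192.168.112.9 unless a vswitch for that network is attached.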

FYI: Berthold set up the old machine network.

Actions #3

Updated by azouhr over 3 years ago

So, we did not have a single openQA run in more than half a year. Could you please come up with a solution NOW? The whole distribution seems to deteriorate right now, and there is nothing I can do about it without openQA.

This should not take years or weeks, but maybe days or hours.

Actions #4

Updated by SLindoMansilla over 3 years ago

azouhr wrote:

So, we did not have a single openQA run in more than half a year. Could you please come up with a solution NOW? The whole distribution seems to deteriorate right now, and there is nothing I can do about it without openQA.

This should not take years or weeks, but maybe days or hours.

Hi azouhr,

I am really sorry, but there is nothing I can do until Ihno sets up the machines, and Ihno does not seem to be following this ticket. I have asked him so many times that I have grown tired of it. Maybe we can ask Mike to raise it with Ihno again? I will send another email and add you in CC.

Actions #5

Updated by AdaLovelace over 3 years ago

I have been working for IBM since October and received approval for openSUSE contributions at the end of the month. This week we had a short discussion about how to proceed.
IBM sees having an openSUSE Member as an employee as a win-win situation. We want to improve the cooperation in this way.

A non-working mainframe does not affect only openSUSE. You cannot test SLES on Z either.
We test all the products we develop for Z (for all distribution partners). Do you want to improve the partnership, too?
Then you should have a working mainframe.

Actions #6

Updated by okurz over 3 years ago

  • Due date set to 2020-11-13
  • Status changed from New to In Progress
  • Assignee changed from mgriessmeier to okurz
  • Target version set to Ready

I participated in an "ad-hoc openSUSE s390x meeting" called by lkocman; see the minutes at
https://lists.opensuse.org/opensuse-factory/2020-11/msg00067.html
@lkocman thanks for taking care. This also triggered AdaLovelace's reaction ;)

Ihno (SUSE) looked into the problem on the side of SUSE mainframe VM network configuration. I provided him "operator" permissions on o3 so he could retrigger the openQA test scenario https://openqa.opensuse.org/tests/latest?arch=s390x&distri=opensuse&flavor=DVD&machine=s390x&test=textmode&version=Tumbleweed and we have some progress as visible in https://openqa.opensuse.org/tests/1464182#step/bootloader_s390/1 . https://openqa.opensuse.org/tests/1464182#step/bootloader_s390/29 shows that we have entered some necessary details in x3270 to end up on a repo URL selection screen.

Ihno asked me to configure "LINUX144, LINUX145, LINUX146, LINUX147" instead of LINUX128.

I have asked him to clarify whether I should enable all of these and remove linux128. For now I have just added the additional hosts and not removed linux128 yet. The corresponding content of rebel:/etc/openqa/workers.ini is:

[1]
WORKER_CLASS = s390x,s390x-rebel-1-linux128
BACKEND=s390x
S390_HOST=128
ZVM_GUEST=linux128
ZVM_PASSWORD=lin390
ZVM_HOST=192.168.112.9

[2]
WORKER_CLASS = s390x,s390x-rebel-1-linux144
BACKEND=s390x
S390_HOST=144
ZVM_GUEST=linux144
ZVM_PASSWORD=lin390
ZVM_HOST=192.168.112.9

[3]
WORKER_CLASS = s390x,s390x-rebel-1-linux145
BACKEND=s390x
S390_HOST=145
ZVM_GUEST=linux145
ZVM_PASSWORD=lin390
ZVM_HOST=192.168.112.9

[4]
WORKER_CLASS = s390x,s390x-rebel-1-linux146
BACKEND=s390x
S390_HOST=146
ZVM_GUEST=linux146
ZVM_PASSWORD=lin390
ZVM_HOST=192.168.112.9

[5]
WORKER_CLASS = s390x,s390x-rebel-1-linux147
BACKEND=s390x
S390_HOST=147
ZVM_GUEST=linux147
ZVM_PASSWORD=lin390
ZVM_HOST=192.168.112.9
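
Since the five sections above differ only in the instance number and the guest number, they could also be generated rather than written by hand. The following is a sketch only, not how the file on rebel was actually produced: the function names `worker_section` and `workers_ini` are hypothetical, and the password is passed in as a parameter (a placeholder is used in the example call).

```python
def worker_section(instance: int, guest_number: int, password: str,
                   zvm_host: str = "192.168.112.9") -> str:
    """Build one workers.ini section in the same layout as quoted above."""
    guest = f"linux{guest_number}"
    lines = [
        f"[{instance}]",
        f"WORKER_CLASS = s390x,s390x-rebel-1-{guest}",
        "BACKEND=s390x",
        f"S390_HOST={guest_number}",
        f"ZVM_GUEST={guest}",
        f"ZVM_PASSWORD={password}",
        f"ZVM_HOST={zvm_host}",
    ]
    return "\n".join(lines)


def workers_ini(guest_numbers, password: str) -> str:
    """Join one section per guest, numbering worker instances from 1."""
    sections = [
        worker_section(instance, number, password)
        for instance, number in enumerate(guest_numbers, start=1)
    ]
    return "\n\n".join(sections) + "\n"


# Placeholder password; the real value belongs in the worker configuration only.
print(workers_ini([128, 144, 145, 146, 147], password="<password>"))
```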

Called on rebel:

systemctl restart openqa-worker@1
systemctl enable --now openqa-worker@{2..5}

and then from my machine:

cnt=0; for i in 128 144 145 146 147 ; do cnt=$((cnt+1)); openqa-clone-job --within-instance https://openqa.opensuse.org 1464182 WORKER_CLASS=s390x-rebel-$cnt-linux$i; done

resulting in

Created job #1464467: opensuse-Tumbleweed-DVD-s390x-Build20201106-textmode@s390x -> https://openqa.opensuse.org/t1464467
Created job #1464468: opensuse-Tumbleweed-DVD-s390x-Build20201106-textmode@s390x -> https://openqa.opensuse.org/t1464468
Created job #1464469: opensuse-Tumbleweed-DVD-s390x-Build20201106-textmode@s390x -> https://openqa.opensuse.org/t1464469
Created job #1464470: opensuse-Tumbleweed-DVD-s390x-Build20201106-textmode@s390x -> https://openqa.opensuse.org/t1464470
Created job #1464471: opensuse-Tumbleweed-DVD-s390x-Build20201106-textmode@s390x -> https://openqa.opensuse.org/t1464471
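
For readers who want to see what the shell loop above expands to before running it, here is a dry-run sketch that only builds the command lines and does not call `openqa-clone-job`. The function name `clone_commands` is hypothetical; the job ID, host and guest numbers are taken from the loop.

```python
def clone_commands(job_id: int, guest_numbers,
                   host: str = "https://openqa.opensuse.org"):
    """Return one openqa-clone-job command line per guest, mirroring the shell loop.

    The counter starts at 1 and increases with each guest, like $cnt in the loop.
    """
    return [
        f"openqa-clone-job --within-instance {host} {job_id} "
        f"WORKER_CLASS=s390x-rebel-{cnt}-linux{guest}"
        for cnt, guest in enumerate(guest_numbers, start=1)
    ]


for command in clone_commands(1464182, [128, 144, 145, 146, 147]):
    print(command)
```

Note that from the second iteration on, the loop produces values such as `s390x-rebel-2-linux144`, whereas the workers.ini quoted above uses `rebel-1` in every WORKER_CLASS entry. The ticket does not say which of the two matches the deployed configuration.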

All jobs seem to reach at least the same state. After clarification with Ihno (SUSE) I have removed the configuration for linux128 now and adjusted rebel:/etc/openqa/workers.ini accordingly to have four instances, linux144, linux145, linux146, linux147. I will document accordingly in our internal openQA infrastructure document https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls as well.

Actions #7

Updated by SLindoMansilla over 3 years ago

AdaLovelace wrote:

I have been working for IBM since October and received approval for openSUSE contributions at the end of the month. This week we had a short discussion about how to proceed.
IBM sees having an openSUSE Member as an employee as a win-win situation. We want to improve the cooperation in this way.

A non-working mainframe does not affect only openSUSE. You cannot test SLES on Z either.
We test all the products we develop for Z (for all distribution partners). Do you want to improve the partnership, too?
Then you should have a working mainframe.

Hi, just to clarify: the setups for openSUSE tests and SLE tests are separate. Tests for SLE were working; only the openSUSE setup was not.

Actions #8

Updated by okurz over 3 years ago

  • Status changed from In Progress to Resolved

Meanwhile, late yesterday evening tests progressed further; the best example is https://openqa.opensuse.org/tests/1464522#step/await_install/68 which is clearly a product bug that should be reported as such. I had a 1.5h session with "AdaLovelace", Sarah from IBM. I gave her a crash course in "openSUSE s390x release management – the review part". She will keep up with this topic, so hopefully we will see s390x in a better state in the near future.

I have updated the internal reference for the o3 reserved z/VM guests in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/271 , which is merged. With this I regard this specific ticket as "Resolved". For any other problems in the s390x test scenarios we should have separate, specific tickets. E.g. currently I see the problem that tests end up incomplete, being unable to resolve the test variable "WORKER_CLASS".

Actions #9

Updated by SLindoMansilla over 3 years ago

okurz wrote:

Meanwhile, late yesterday evening tests progressed further; the best example is https://openqa.opensuse.org/tests/1464522#step/await_install/68 which is clearly a product bug that should be reported as such.

https://bugzilla.opensuse.org/show_bug.cgi?id=1178557

Actions #10

Updated by SLindoMansilla over 3 years ago

  • Related to action #77116: test fails in bootloader_s390 - ftp installation media directory repo is too long for using in parmfile - linux144, linux145, linux146, linux147 (rebel) added
Actions #11

Updated by okurz over 3 years ago

  • Related to action #77209: workers on o3 machine rebel provide no "WORKER_HOSTNAME" value anymore but it shows up in journal of worker service added