action #37820

closed

[functional][sle][u][hard][ipmi][sporadic] test fails in first_boot - Lost connection to SUT on SLE12-SP4

Added by SLindoMansilla over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Bugs in existing tests
Target version: SUSE QA (private) - Milestone 17
Start date: 2018-06-26
Due date: 2018-07-17
% Done: 0%
Estimated time: 8.00 h
Difficulty: hard

Description

Scenario sle-12-SP4-Server-DVD-x86_64-Build0238-default@64bit-ipmi

The connection to the SUT was lost.

Reproducible

Fails since Build 0238 osd#1659807#step/first_boot/2

Expected result

Acceptance criteria

  • AC1: SLE12SP4 ipmi tests pass first_boot consistently (could be with soft-fail and workaround) -> AT1: >1 job on SLE12SP4 on the ipmi backend passed first_boot

Suggestions

  • Trigger a job (could even be on osd) for the last SLE12SP3 milestone ISO/repo we still have around that worked to crosscheck if it still works with the current tests
  • Work around sporadic errors by triggering multiple jobs in all cases
  • Trigger a job for the current SLE12SP4 build but with the last good test git commit
  • Try out manually, both current SLE12SP4 as well as e.g. SLE12SP3B1 (where it still worked)
  • Based on previous steps fix the test regression or report the product bug and work around it

Further details

Latest result: sle-12-SP4-Server-DVD-x86_64-Latest-default@64bit-ipmi


Related issues 4 (0 open, 4 closed)

Related to openQA Tests (public) - action #32089: [sle][functional][u][ipmi][easy] test fails in first_boot - abort the test early so that we at least test the installation (Resolved, SLindoMansilla, 2018-02-05)

Related to openQA Tests (public) - action #34402: [functional][u][s390x][medium] Revisit extra_tests_on_gnome@s390x (was: Do no run extra_tests_on_gnome on s390x) (Resolved, zluo, 2018-04-06, 2018-11-20)

Related to openQA Tests (public) - action #38267: [functional][u][hard] ipmi test fails in consoletest_setup to write to the serial device - regression! (Resolved, zluo, 2018-07-06, 2018-09-11)

Related to openQA Tests (public) - action #38423: [sle][functional][u][hard] Refactor first_boot to unify duplicated behavior for remote backend (Rejected, zluo, 2018-07-16)

Actions #1

Updated by SLindoMansilla over 6 years ago

  • Related to action #32089: [sle][functional][u][ipmi][easy] test fails in first_boot - abort the test early so that we at least test the installation added
Actions #2

Updated by SLindoMansilla over 6 years ago

  • Subject changed from [functional][sle][ipmi] test fails in first_boot - Lost connection to SUT to [functional][sle][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4
Actions #3

Updated by okurz over 6 years ago

  • Subject changed from [functional][sle][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4 to [functional][sle][u][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4
  • Description updated (diff)
  • Due date set to 2018-07-17
  • Status changed from New to Workable
  • Priority changed from Normal to High
  • Target version set to Milestone 17

We really should act fast so that we do not lose older references, e.g. the last "green" SLE 12 SP3 job has the following content of vars.json:

{
   "ARCH" : "x86_64",
   "ASSETDIR" : "/var/lib/openqa/cache/openqa.suse.de/factory",
   "BACKEND" : "ipmi",
   "BETA" : "1",
   "BETA_SDK" : "1",
   "BETA_WE" : "1",
   "BUILD" : "0314",
   "BUILD_HA" : "0119",
   "BUILD_HA_GEO" : "0088",
   "BUILD_SDK" : "0152",
   "BUILD_SLE" : "0314",
   "BUILD_WE" : "0087",
   "CACHEDIRECTORY" : "/var/lib/openqa/cache",
   "CASEDIR" : "/var/lib/openqa/cache/openqa.suse.de/tests/sle",
   "DESKTOP" : "gnome",
   "DISTRI" : "sle",
   "DVD" : 1,
   "FLAVOR" : "Server-DVD",
   "GNOME" : 1,
   "HASLICENSE" : 1,
   "INSTLANG" : "en_US",
   "IPMI_HOSTNAME" : "10.162.28.200",
   "IPMI_PASSWORD" : "qatesting",
   "IPMI_USER" : "admin",
   "ISO" : "/var/lib/openqa/pool/1/SLE-12-SP3-Server-DVD-x86_64-Build0314-Media1.iso",
   "ISO_MAXSIZE" : "4700372992",
   "JOBTOKEN" : "vojxI6rLeezvh85q",
   "MACHINE" : "64bit-ipmi",
   "MAX_JOB_TIME" : "32000",
   "NAME" : "00857605-sle-12-SP3-Server-DVD-x86_64-Build0314-gnome@64bit-ipmi",
   "NOAUTOLOGIN" : 1,
   "OPENQA_HOSTNAME" : "openqa.suse.de",
   "OPENQA_URL" : "http://openqa.suse.de",
   "PACKAGETOINSTALL" : "x3270",
   "PRJDIR" : "/var/lib/openqa/cache/openqa.suse.de",
   "PRODUCTDIR" : "/var/lib/openqa/cache/openqa.suse.de/tests/sle/products/sle",
   "QA_HEAD_REPO" : "http://dist.nue.suse.com/ibs/QA:/Head/SLE-12-SP3",
   "QA_WEB_REPO" : "http://dist.suse.de/install/SLP/SLE-12-Module-Web-Scripting-LATEST/x86_64/CD1/",
   "QEMUPORT" : "20012",
   "REPO_0" : "SLE-12-SP3-Server-DVD-x86_64-Build0314-Media1",
   "SCC_REGCODE" : "30452ce234918d23",
   "SCC_URL" : "http://Server-0314.proxy.scc.suse.de",
   "SERIALDEV" : "ttyS1",
   "SHUTDOWN_NEEDS_AUTH" : "1",
   "SLENKINS_TESTSUITES_REPO" : "http://download.suse.de/ibs/Devel:/SLEnkins:/testsuites/SLE_12_SP3/",
   "SP2ORLATER" : 1,
   "SP3ORLATER" : 1,
   "TEST" : "gnome",
   "TEST_GIT_HASH" : "749297d010f39e18c741cc3ed0c633b03445f647",
   "TIMEOUT_SCALE" : "3",
   "VERSION" : "12-SP3",
   "VNC" : "91",
   "WALLPAPER" : "/usr/share/wallpapers/SLEdefault/contents/images/1280x1024.jpg",
   "WORKER_CLASS" : "64bit-ipmi",
   "WORKER_HOSTNAME" : "10.162.0.12",
   "WORKER_ID" : 368,
   "WORKER_INSTANCE" : "1"
}

So with this it might be possible to reproduce the working setup first, and then we can plan further steps in detail. SLE12SP4 should not behave exactly the same as SLE15, as the SLE15 installer behaves differently in no longer starting a VNC server in the target system … unless this change was already pushed to SLE12SP4.
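
As a rough sketch of how the settings above could be reused to reproduce the working setup, the snippet below filters a vars.json document down to the settings that do not look tied to one particular worker run. Which keys count as worker-specific is an assumption made here for illustration, and `WORKER_SPECIFIC`, `reusable_settings` and `as_key_value_args` are hypothetical names, not part of openQA.

```python
import json

# Assumption: these keys are filled in per run by the worker and should not
# be copied into a new job. The selection is illustrative, not authoritative.
WORKER_SPECIFIC = {
    "ASSETDIR", "CACHEDIRECTORY", "CASEDIR", "PRJDIR", "PRODUCTDIR",
    "JOBTOKEN", "NAME", "QEMUPORT", "VNC",
    "WORKER_HOSTNAME", "WORKER_ID", "WORKER_INSTANCE",
}


def reusable_settings(vars_json: str) -> dict:
    """Parse a vars.json document and drop the worker-specific keys."""
    settings = json.loads(vars_json)
    # Values are normalized to strings because vars.json mixes 1 and "1".
    return {k: str(v) for k, v in settings.items() if k not in WORKER_SPECIFIC}


def as_key_value_args(settings: dict) -> list:
    """Render settings as sorted KEY=value strings."""
    return [f"{key}={value}" for key, value in sorted(settings.items())]
```

The resulting KEY=value strings could then be passed to whatever job-posting tool is in use; the exact command is left open here.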

Actions #4

Updated by okurz over 6 years ago

  • Subject changed from [functional][sle][u][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4 to [functional][sle][u][hard][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4
  • Description updated (diff)
Actions #5

Updated by zluo over 6 years ago

  • Status changed from Workable to In Progress
  • Assignee set to zluo

take over

Actions #6

Updated by okurz over 6 years ago

  • Description updated (diff)

Hi, I tried to clarify the ACs by specifying them explicitly. Please let me know if this conflicts with your previous expectations regarding this ticket, because you picked it up before.

Actions #7

Updated by SLindoMansilla over 6 years ago

FYI: On SLE15 we were not able to make the test modules after first_boot pass, so we decided to move the default scenario to the respective development group and schedule a single install-only test suite (see btrfs@ipmi osd#1772527, which has shown green results since we fixed the test modules up to first_boot).

Actions #8

Updated by okurz over 6 years ago

  • Related to action #34402: [functional][u][s390x][medium] Revisit extra_tests_on_gnome@s390x (was: Do no run extra_tests_on_gnome on s390x) added
Actions #9

Updated by riafarov over 6 years ago

  • Estimated time set to 8.00 h
Actions #10

Updated by zluo over 6 years ago

  • Status changed from In Progress to Rejected

I checked on my openQA server and tried to reproduce it more than 20 times, but I cannot reproduce this issue.
The latest test runs on osd don't show this issue anymore.
Please re-open the ticket if this happens again.

Actions #11

Updated by zluo over 6 years ago

Actions #12

Updated by okurz over 6 years ago

  • Subject changed from [functional][sle][u][hard][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4 to [functional][sle][u][hard][ipmi][sporadic] test fails in first_boot - Lost connection to SUT on SLE12-SP4
  • Status changed from Rejected to In Progress

Yes, interesting. It seems that the scenario has become more stable, but the last failure in first_boot was just yesterday: https://openqa.suse.de/tests/1810540 so I doubt we are done here, but I will mark the issue as "sporadic" now. Interesting to see that you cannot reproduce the issue locally. Can you please check which workers you use and which are used in production? Maybe there is one production worker that is not stable?

Actions #13

Updated by zluo over 6 years ago

openqaw1:1 and openqaw1:2 show the failed status.
I use loewe remote-worker:200. The ipmi machine used for my tests belongs to our QA; it is the only one at the moment and it is not stable, but it doesn't show any issue in my tests.

Actions #14

Updated by okurz over 6 years ago

Keep in mind that for IPMI we would rather need the actual machines on which the tests have been executed, not the workers just delegating to the machines connected over IPMI

Actions #15

Updated by okurz over 6 years ago

Retriggered test on production also failed again in first_boot: https://openqa.suse.de/tests/1814410#step/first_boot/3

Actions #16

Updated by zluo over 6 years ago

I checked SLES 12 SP3 on osd; the ipmi test scenario has never worked before:
http://openqa.suse.de/tests/1123607#

Actions #17

Updated by zluo over 6 years ago

/usr/lib/os-autoinst/consoles/vnc_base.pm:71:{
'password' => 'nots3cr3t',
'port' => 5901,
'hostname' => '10.162.2.75'
}

The last test run shows the same IP address:

# Test died: Error connecting to host <10.162.2.75>: IO::Socket::INET: connect: No route to host at /usr/lib/os-autoinst/testapi.pm line 1385.

So the above shows that booting had not completed. Maybe we should increase the timeout for this. The question is where to change it for a trial run.

This means that this ipmi machine might be unstable or might not meet the requirements for the ipmi test scenario. Please check this machine.
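
To illustrate the idea of a longer timeout, here is a minimal sketch of retrying a TCP connection until a deadline instead of failing on the first "No route to host". It is not how os-autoinst implements its VNC console; `wait_for_port` and its default timeout and interval values are assumptions chosen for the example.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Retry a TCP connection until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt is itself bounded by `interval` seconds.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers "No route to host", "Connection refused" and timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the values from the dump above, a call would look like `wait_for_port("10.162.2.75", 5901)`; a `False` result after the full timeout would point at the machine rather than at slow booting.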

Actions #18

Updated by zluo over 6 years ago

For the record:

Adding a check for text_login works:
http://e13.suse.de/tests/6171

I will try with wait_boot now.

Actions #19

Updated by zluo over 6 years ago

http://e13.suse.de/tests/6175#step/first_boot/9

wait_boot does not seem to work.

Actions #20

Updated by okurz over 6 years ago

  • Related to action #38267: [functional][u][hard] ipmi test fails in consoletest_setup to write to the serial device - regression! added
Actions #22

Updated by SLindoMansilla over 6 years ago

Merged, waiting for verification run on OSD:

  • At least 10 jobs pass.
Actions #24

Updated by SLindoMansilla over 6 years ago

10 jobs scheduled for default@ipmi: osd#0282-poo37820

Actions #25

Updated by okurz over 6 years ago

Good job! All 10 end up in "consoletest_setup" – see #38267 – so they passed first_boot. Please make sure to clean up first_boot as discussed before closing this ticket.
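
The check described here (how many of the scheduled jobs got past first_boot) could be sketched as follows. The input shape, a list of per-job module lists with `name` and `result` fields, is a hypothetical structure for illustration and not the response format of the openQA API.

```python
def passed_module(job_modules: list, module: str) -> bool:
    """True if `module` appears in the job's module list with result 'passed'."""
    return any(m["name"] == module and m["result"] == "passed" for m in job_modules)


def count_passing(jobs: list, module: str = "first_boot") -> int:
    """Count the jobs in which `module` passed, regardless of later failures."""
    return sum(1 for modules in jobs if passed_module(modules, module))
```

A job that fails later, for example in consoletest_setup, still counts as long as first_boot itself passed.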

Actions #26

Updated by SLindoMansilla over 6 years ago

  • Related to action #38423: [sle][functional][u][hard] Refactor first_boot to unify duplicated behavior for remote backend added
Actions #28

Updated by SLindoMansilla over 6 years ago

  • Status changed from In Progress to Resolved