action #37820
closed[functional][sle][u][hard][ipmi][sporadic] test fails in first_boot - Lost connection to SUT on SLE12-SP4
0%
Description
Scenario sle-12-SP4-Server-DVD-x86_64-Build0238-default@64bit-ipmi
The connection to the SUT was lost.
Reproducible
Fails since Build 0238 osd#1659807#step/first_boot/2
Expected result
- SLE15 GM: osd#1772527#step/first_boot/1
- Last good from SLE12SP3 (pre-GM): https://openqa.suse.de/tests/992659 at least reaching one test module further
- Last "passed" (yes, green) job SLE12SP3 build 0314
Acceptance criteria
- AC1: SLE12SP4 ipmi tests pass first_boot consistently (could be with soft-fail and workaround) -> AT1: >1 job on SLE12SP4 on the ipmi backend passed first_boot
Suggestions
- Trigger a job (could even be on osd) for the last SLE12SP3 milestone ISO/repo we still have around that worked to crosscheck if it still works with the current tests
- Workaround sporadic errors by triggering multiple jobs in all cases
- Trigger a job for the current SLE12SP4 build but with the last good test git commit
- Try out manually, both current SLE12SP4 as well as e.g. SLE12SP3B1 (where it still worked)
- Based on previous steps fix the test regression or report the product bug and work around it
Further details
Latest result: sle-12-SP4-Server-DVD-x86_64-Latest-default@64bit-ipmi
Updated by SLindoMansilla over 6 years ago
- Related to action #32089: [sle][functional][u][ipmi][easy] test fails in first_boot - abort the test early so that we at least test the installation added
Updated by SLindoMansilla over 6 years ago
- Subject changed from [functional][sle][ipmi] test fails in first_boot - Lost connection to SUT to [functional][sle][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4
Updated by okurz over 6 years ago
- Subject changed from [functional][sle][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4 to [functional][sle][u][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4
- Description updated (diff)
- Due date set to 2018-07-17
- Status changed from New to Workable
- Priority changed from Normal to High
- Target version set to Milestone 17
We really should act fast so that we do not lose older references, e.g. the last "green" SLE 12 SP3 job has the following content of vars.json:
{
"ARCH" : "x86_64",
"ASSETDIR" : "/var/lib/openqa/cache/openqa.suse.de/factory",
"BACKEND" : "ipmi",
"BETA" : "1",
"BETA_SDK" : "1",
"BETA_WE" : "1",
"BUILD" : "0314",
"BUILD_HA" : "0119",
"BUILD_HA_GEO" : "0088",
"BUILD_SDK" : "0152",
"BUILD_SLE" : "0314",
"BUILD_WE" : "0087",
"CACHEDIRECTORY" : "/var/lib/openqa/cache",
"CASEDIR" : "/var/lib/openqa/cache/openqa.suse.de/tests/sle",
"DESKTOP" : "gnome",
"DISTRI" : "sle",
"DVD" : 1,
"FLAVOR" : "Server-DVD",
"GNOME" : 1,
"HASLICENSE" : 1,
"INSTLANG" : "en_US",
"IPMI_HOSTNAME" : "10.162.28.200",
"IPMI_PASSWORD" : "qatesting",
"IPMI_USER" : "admin",
"ISO" : "/var/lib/openqa/pool/1/SLE-12-SP3-Server-DVD-x86_64-Build0314-Media1.iso",
"ISO_MAXSIZE" : "4700372992",
"JOBTOKEN" : "vojxI6rLeezvh85q",
"MACHINE" : "64bit-ipmi",
"MAX_JOB_TIME" : "32000",
"NAME" : "00857605-sle-12-SP3-Server-DVD-x86_64-Build0314-gnome@64bit-ipmi",
"NOAUTOLOGIN" : 1,
"OPENQA_HOSTNAME" : "openqa.suse.de",
"OPENQA_URL" : "http://openqa.suse.de",
"PACKAGETOINSTALL" : "x3270",
"PRJDIR" : "/var/lib/openqa/cache/openqa.suse.de",
"PRODUCTDIR" : "/var/lib/openqa/cache/openqa.suse.de/tests/sle/products/sle",
"QA_HEAD_REPO" : "http://dist.nue.suse.com/ibs/QA:/Head/SLE-12-SP3",
"QA_WEB_REPO" : "http://dist.suse.de/install/SLP/SLE-12-Module-Web-Scripting-LATEST/x86_64/CD1/",
"QEMUPORT" : "20012",
"REPO_0" : "SLE-12-SP3-Server-DVD-x86_64-Build0314-Media1",
"SCC_REGCODE" : "30452ce234918d23",
"SCC_URL" : "http://Server-0314.proxy.scc.suse.de",
"SERIALDEV" : "ttyS1",
"SHUTDOWN_NEEDS_AUTH" : "1",
"SLENKINS_TESTSUITES_REPO" : "http://download.suse.de/ibs/Devel:/SLEnkins:/testsuites/SLE_12_SP3/",
"SP2ORLATER" : 1,
"SP3ORLATER" : 1,
"TEST" : "gnome",
"TEST_GIT_HASH" : "749297d010f39e18c741cc3ed0c633b03445f647",
"TIMEOUT_SCALE" : "3",
"VERSION" : "12-SP3",
"VNC" : "91",
"WALLPAPER" : "/usr/share/wallpapers/SLEdefault/contents/images/1280x1024.jpg",
"WORKER_CLASS" : "64bit-ipmi",
"WORKER_HOSTNAME" : "10.162.0.12",
"WORKER_ID" : 368,
"WORKER_INSTANCE" : "1"
}
So with this it might be possible to reproduce the working setup first, and then we can plan further steps in detail. SLE12SP4 should not be exactly the same as SLE15, as the SLE15 installer behaves differently in not starting a VNC server anymore in the target system … unless this change was already pushed to SLE12SP4.
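To reproduce the working setup, the settings that pin a job to the old product build and test code could be extracted from vars.json and passed along when re-triggering. The following is a minimal sketch in Python; the choice of keys in `PINNED_KEYS` and the helper name `pinned_settings` are assumptions made for illustration and are not prescribed by openQA.

```python
import json

# Assumption: these are the settings worth carrying over when re-triggering
# the last green SLE 12 SP3 job; adjust the list as needed.
PINNED_KEYS = (
    "DISTRI", "VERSION", "FLAVOR", "ARCH",
    "BUILD", "MACHINE", "TEST", "TEST_GIT_HASH",
)


def pinned_settings(vars_json_text):
    """Return KEY=VALUE strings for the pinned keys found in a vars.json document."""
    settings = json.loads(vars_json_text)
    return [f"{key}={settings[key]}" for key in PINNED_KEYS if key in settings]


if __name__ == "__main__":
    # Shortened excerpt of the vars.json quoted above.
    sample = """
    {
      "BUILD": "0314",
      "VERSION": "12-SP3",
      "TEST_GIT_HASH": "749297d010f39e18c741cc3ed0c633b03445f647",
      "VNC": "91"
    }
    """
    for line in pinned_settings(sample):
        print(line)
```

Worker-specific values such as WORKER_ID or JOBTOKEN are deliberately left out of the list, since they describe the old run rather than the scenario.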
Updated by okurz over 6 years ago
- Subject changed from [functional][sle][u][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4 to [functional][sle][u][hard][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4
- Description updated (diff)
Updated by zluo over 6 years ago
- Status changed from Workable to In Progress
- Assignee set to zluo
Taking over.
Updated by okurz over 6 years ago
- Description updated (diff)
Hi, I tried to clarify the ACs by specifying them explicitly. Please let me know if this conflicts with your previous expectations regarding this ticket, because you picked it up before.
Updated by SLindoMansilla over 6 years ago
FYI: On SLE15 we were not able to make the test pass the modules after first_boot, so we decided to move the default scenario to the respective development group and schedule one install-only test suite (see btrfs@ipmi osd#1772527, which has been showing green results since we fixed the test modules up to first_boot).
Updated by okurz over 6 years ago
- Related to action #34402: [functional][u][s390x][medium] Revisit extra_tests_on_gnome@s390x (was: Do no run extra_tests_on_gnome on s390x) added
Updated by zluo over 6 years ago
- Status changed from In Progress to Rejected
I checked on my openQA server and tried to reproduce it more than 20 times, but I cannot reproduce this issue.
The latest test runs on osd don't show this issue anymore.
Please re-open the ticket if this happens again.
Updated by zluo over 6 years ago
For example, my test run:
http://e13.suse.de/tests/6160#step/first_boot
Updated by okurz over 6 years ago
- Subject changed from [functional][sle][u][hard][ipmi] test fails in first_boot - Lost connection to SUT on SLE12-SP4 to [functional][sle][u][hard][ipmi][sporadic] test fails in first_boot - Lost connection to SUT on SLE12-SP4
- Status changed from Rejected to In Progress
Yes, interesting. It seems that the scenario has become more stable, but the last failure in first_boot was just yesterday: https://openqa.suse.de/tests/1810540 so I doubt we are done here, but I will mark the issue as "sporadic" now. Interesting to see that you cannot reproduce the issue locally. Can you please check which workers you use and which are used in production? Maybe there is one production worker that is not stable?
Updated by zluo over 6 years ago
openqaw1:1 and openqaw1:2 show the failed status.
I use loewe remote-worker:200. The IPMI machine used for my tests belongs to our QA; it is the only one at the moment and it is not stable, but it doesn't show any issue in my tests.
Updated by okurz over 6 years ago
Keep in mind that for IPMI we would rather need the actual machines on which the tests have been executed, not the workers that merely delegate to the machines connected over IPMI.
Updated by okurz over 6 years ago
Retriggered test on production also failed again in first_boot: https://openqa.suse.de/tests/1814410#step/first_boot/3
Updated by zluo over 6 years ago
I checked for SLES 12 SP3 on osd; the ipmi test scenario has never worked before:
http://openqa.suse.de/tests/1123607#
Updated by zluo over 6 years ago
/usr/lib/os-autoinst/consoles/vnc_base.pm:71:{
'password' => 'nots3cr3t',
'port' => 5901,
'hostname' => '10.162.2.75'
}
The last test run shows the same IP address:
# Test died: Error connecting to host <10.162.2.75>: IO::Socket::INET: connect: No route to host at /usr/lib/os-autoinst/testapi.pm line 1385.
So the above shows that booting has not completed. Maybe we should increase the timeout for this. The question is where to change it for a trial run.
It could also mean that this IPMI machine is not stable or does not meet the requirements of the IPMI test scenario. Please check this machine.
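The idea of waiting longer before giving up on the VNC connection can be illustrated with a small retry loop. This is a minimal sketch in Python and not the os-autoinst implementation; the function name `wait_for_port` and its default timeout and interval are assumptions for illustration. Where the real timeout lives in os-autoinst is not stated in this ticket.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt is itself bounded by `interval` seconds.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers "No route to host", "Connection refused" and timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Host and port as shown in the console dump above.
    print(wait_for_port("10.162.2.75", 5901, timeout=60.0))
```

A check like this only tells whether the port becomes reachable at all; it does not distinguish a slow boot from an unstable machine.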
Updated by zluo over 6 years ago
For the record:
Adding a check for text_login works:
http://e13.suse.de/tests/6171
I will try with wait_boot now.
Updated by zluo over 6 years ago
http://e13.suse.de/tests/6175#step/first_boot/9
wait_boot does not seem to work.
Updated by okurz over 6 years ago
- Related to action #38267: [functional][u][hard] ipmi test fails in consoletest_setup to write to the serial device - regression! added
Updated by zluo over 6 years ago
Updated by SLindoMansilla over 6 years ago
Merged, waiting for verification run on OSD:
- At least 10 jobs pass.
Updated by zluo over 6 years ago
Updated by SLindoMansilla over 6 years ago
10 jobs scheduled for default@ipmi: osd#0282-poo37820
Updated by okurz over 6 years ago
Good job! All 10 end up in "consoletest_setup" – see #38267 – so they passed first_boot. Please make sure to clean up first_boot as discussed before closing this ticket.
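The reasoning here is that a job failing in a later module must have passed first_boot, and AT1 asks for more than one such job. A minimal sketch of that check in Python follows; the shortened `MODULE_ORDER` and the helper names are hypothetical, and the real schedule contains more modules than the two named in this ticket.

```python
# Assumption: a shortened module order holding only the two modules named in
# this ticket, in the order in which they run.
MODULE_ORDER = ["first_boot", "consoletest_setup"]


def passed_first_boot(failed_module):
    """failed_module is the module a job failed in, or None if the job passed."""
    if failed_module is None:
        return True
    if failed_module not in MODULE_ORDER:
        raise ValueError(f"unknown module: {failed_module}")
    # Failing in a module that runs after first_boot implies first_boot passed.
    return MODULE_ORDER.index(failed_module) > MODULE_ORDER.index("first_boot")


def at1_met(failed_modules, minimum=2):
    """AT1: more than one job passed first_boot, i.e. at least `minimum` jobs."""
    return sum(passed_first_boot(m) for m in failed_modules) >= minimum


if __name__ == "__main__":
    # The 10 verification jobs all ended in consoletest_setup.
    print(at1_met(["consoletest_setup"] * 10))
```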
Updated by SLindoMansilla over 6 years ago
- Related to action #38423: [sle][functional][u][hard] Refactor first_boot to unify duplicated behavior for remote backend added
Updated by SLindoMansilla over 6 years ago
Waiting to verify: https://openqa.suse.de/tests/1832261
Updated by SLindoMansilla over 6 years ago
- Status changed from In Progress to Resolved
Verified on OSD: https://openqa.suse.de/tests/1832261#step/consoletest_setup/1