Project

General

Profile

Actions

action #16512

closed

timeout while uploading the logs - test fails in install_and_reboot

Added by dgutu about 7 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Infrastructure
Target version:
-
Start date:
2017-02-06
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-12-SP3-Server-DVD-HA-s390x-ha-ftp@s390x-zVM-vswitch-l3 fails in
install_and_reboot

  • ON the step of uploading the logs after installation the test dies because of timeout. From console its visible that there is some connectivity issue.

Reproducible

Fails since (at least) Build 0064@0234

Expected result

Last good: 0063@0232 (or more recent)

Problem

H1. ACCEPTED: worker specific problem -> E1-1
H2. REJECTED: generic problem in the product failing every time -> E2-1
H3. REJECTED: specific problem in the product failing sometimes under special circumstances
H4. REJECTED: os-autoinst problem

Suggestions

E1-1. check for differences on worker -> O1-1: openqaworker2 ran all jobs that seem to have this problem, and its network seems to have problem -> ACCEPT
E2-1. reproduce locally -> job passed, http://lord.arch/tests/5619#step/install_and_reboot/17 -> REJECT
E3-1. start job of old build in an environment where it fails often -> REJECT
E4-1. check os-autoinst version -> REJECT

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open1 closed)

Related to openQA Project - action #12912: [tools]monitoring of o3/osdResolvedokurz2016-07-28

Actions
Actions #1

Updated by okurz about 7 years ago

  • Description updated (diff)
  • Assignee set to okurz
  • Priority changed from Normal to Urgent

staging tests are also affected:
install_and_reboot (https://v.gd/t760741, https://v.gd/t760742, https://v.gd/t760778, https://v.gd/t761038, https://v.gd/t760800)

Cloning latest failing staging job to local does not fail in the same step:
http://lord.arch/tests/5619#step/install_and_reboot/17

Actions #2

Updated by okurz about 7 years ago

  • Description updated (diff)
  • Status changed from New to In Progress

originally mentioned job fails: https://openqa.suse.de/tests/760747#previous
and two other jobs of the same build also fail so it is reproducible
-> E3-1: Triggered s390x job https://openqa.suse.de/tests/761579#live of an old build where it was working before. Let's wait for the result and see if it fails as well (supporting H1+H4) or if it passes (supporting H3)

And openqaworker2 is not even reachable over ping. mgriessmeier and me try https://wiki.microfocus.net/index.php/OpenQA#IPMI.2FSerial_Connections

Actions #3

Updated by okurz about 7 years ago

  • Description updated (diff)

updated my os-autoinst on lord.arch 14dc132..945efee and rerunning again: http://lord.arch/tests/5627#live

Also, openqaworker2 seems to be the culprit here, discussion in #infra ongoing…

Actions #4

Updated by okurz about 7 years ago

Actions #5

Updated by mgriessmeier about 7 years ago

  • Description updated (diff)
  • Status changed from In Progress to Resolved

Problem was that infra changed the network config of all workers (removing bond) but our salt recipes were not adjusted for all of our workers and did overwrite the ifcfg to use bond and therefore openqaworker2 got a new dhcp name and wasn't able any longer to be reachable and to upload logs

fixed staging test on openqaworker2: https://openqa.suse.de/tests/761627#

Actions

Also available in: Atom PDF