action #28325

closed

[tools][sprint 201712.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h

Added by okurz over 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Feature requests
Target version:
Start date: 2017-11-24
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://openqa.suse.de/tests/1268572/file/autoinst-log.txt

shows:

16:24:03.5060 1116 <<< testapi::assert_screen(mustmatch=[
  'bootloader-shim-import-prompt',
  'bootloader-grub2'
], timeout=30)
01:16:43.0065 1109 signalhandler got TERM - loop 1

so there is a gap of almost nine hours between the two entries, with nothing logged in between
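The size of the gap between the two autoinst-log.txt entries can be worked out with a short script. This is only a sketch: the helper name `gap_between` is hypothetical, and it assumes that the four digits after the seconds are a fractional part and that the log wrapped past midnight exactly once.

```python
from datetime import datetime, timedelta


def gap_between(earlier: str, later: str) -> timedelta:
    """Elapsed time between two HH:MM:SS.ffff stamps (hypothetical helper).

    Assumption: the log lines carry no date, so a 'later' stamp that is
    smaller than 'earlier' is taken to be on the following day.
    """
    fmt = "%H:%M:%S.%f"
    start = datetime.strptime(earlier, fmt)
    end = datetime.strptime(later, fmt)
    if end < start:
        end += timedelta(days=1)  # single midnight wrap assumed
    return end - start


# Timestamps copied from the two log entries quoted above.
gap = gap_between("16:24:03.5060", "01:16:43.0065")
print(gap)
```

Under those assumptions the result is a little under 8 hours 53 minutes, which matches the "more than 8h" in the subject.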

The journal on the worker does not tell much either:

Nov 23 17:23:22 seattle14 worker[18759]: [Thu Nov 23 17:23:22 2017] [worker:info] 1109: WORKING 1268572
Nov 24 02:16:42 seattle14 worker[18759]: [Fri Nov 24 02:16:42 2017] [worker:warn] max job time exceeded, aborting 01268572-sle-15-Installer-DVD-aarch64-Build3
Nov 24 02:16:43 seattle14 worker[18759]: killed 1109
Nov 24 02:17:24 seattle14 worker[18759]: [Fri Nov 24 02:17:24 2017] [worker:warn] job 1268572 spent more time than MAX_JOB_TIME
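To see how long the worker actually held the job, the bracketed timestamps in the journal lines can be subtracted. The following is a minimal sketch, not part of openQA: `bracket_time` and `BRACKET_STAMP` are hypothetical names, and parsing the weekday and month abbreviations assumes an English (or C) locale.

```python
import re
from datetime import datetime

# Two lines copied from the journal excerpt above.
JOURNAL_LINES = [
    "Nov 23 17:23:22 seattle14 worker[18759]: [Thu Nov 23 17:23:22 2017] [worker:info] 1109: WORKING 1268572",
    "Nov 24 02:16:42 seattle14 worker[18759]: [Fri Nov 24 02:16:42 2017] [worker:warn] max job time exceeded, aborting 01268572-sle-15-Installer-DVD-aarch64-Build3",
]

# Matches the bracketed stamp that includes the year, e.g. [Thu Nov 23 17:23:22 2017].
BRACKET_STAMP = re.compile(r"\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\]")


def bracket_time(line):
    """Return the bracketed timestamp of a journal line, or None if absent."""
    match = BRACKET_STAMP.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")


started = next(bracket_time(l) for l in JOURNAL_LINES if "WORKING" in l)
aborted = next(bracket_time(l) for l in JOURNAL_LINES if "max job time exceeded" in l)
runtime = aborted - started
print(runtime)
```

For the two lines shown, this gives a runtime of 8:53:20 between the job being picked up and the abort.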

Reproducible

First occurrence in some time

Problem

  • Can we find out what the worker was doing during the gap? Do we need more logging here?
Actions #1

Updated by okurz over 6 years ago

  • Subject changed from [aarch64]worker incompletes job after ~9h on seattle14:2 - no action in log for more than 8h to [aarch64]worker incompletes job with timeout after ~9h on seattle14:2 - no action in log for more than 8h
  • Priority changed from Normal to Urgent

This seems to be very severe because we see it reproducing, e.g. see https://openqa.suse.de/tests/1282870/file/autoinst-log.txt

Last time it happened on seattle14:2, now on seattle14:1, so I assume it is specific to the worker host. Could it be something about the network there?

Actions #2

Updated by okurz over 6 years ago

  • Subject changed from [aarch64]worker incompletes job with timeout after ~9h on seattle14:2 - no action in log for more than 8h to [aarch64]worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
Actions #3

Updated by coolo over 6 years ago

  • Project changed from openQA Project to openQA Tests
  • Subject changed from [aarch64]worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [aarch64] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
  • Category changed from Regressions/Crashes to Infrastructure

take down that worker - this hardware is no good

Actions #4

Updated by okurz over 6 years ago

  • Status changed from New to In Progress
  • Assignee set to szarate

I checked on seattle14 and no openQA worker has been running there since 15:00 today. I assume someone stopped it for that reason?

@foursixnine you have the orthos machine reserved. How do you want to proceed, unreserve?

Actions #5

Updated by szarate over 6 years ago

  • Project changed from openQA Tests to openQA Project
  • Subject changed from [aarch64] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
  • Category changed from Infrastructure to 168
  • Status changed from In Progress to Resolved
  • Target version set to Current Sprint

All of the orthos machines I have reserved ATM will move to staging, but this is not planned yet.

The seattle machines were removed from production on osd and the workers have been stopped.

Actions #6

Updated by szarate over 6 years ago

  • Subject changed from [aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][Sprint 201711.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
Actions #7

Updated by szarate over 6 years ago

  • Subject changed from [tools][Sprint 201711.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][sprint 201712.2] aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
Actions #8

Updated by szarate over 6 years ago

  • Subject changed from [tools][sprint 201712.2] aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][sprint 201712.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
Actions #9

Updated by szarate about 6 years ago

  • Target version changed from Current Sprint to Done