action #28325

[tools][sprint 201712.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h

Added by okurz over 2 years ago. Updated almost 2 years ago.

Status: Resolved
Start date: 24/11/2017
Priority: Urgent
Due date:
Assignee: szarate
% Done: 0%
Category: Feature requests
Target version: Done
Difficulty:
Duration:

Description

Observation

https://openqa.suse.de/tests/1268572/file/autoinst-log.txt

shows:

16:24:03.5060 1116 <<< testapi::assert_screen(mustmatch=[
  'bootloader-shim-import-prompt',
  'bootloader-grub2'
], timeout=30)
01:16:43.0065 1109 signalhandler got TERM - loop 1

so there is a gap of almost 9 hours between the two entries
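To spot such silent stretches without reading the whole log, the time between consecutive stamped lines can be computed. The following is only a minimal sketch, not part of openQA or os-autoinst: `find_gaps` is a hypothetical helper, and it assumes every log entry starts with an `HH:MM:SS.ffff` stamp as in the excerpt above and that a stamp jumping backwards means the log crossed midnight exactly once.

```python
import re
from datetime import timedelta

# Leading "HH:MM:SS.ffff" stamp, as seen in the autoinst-log.txt excerpt.
STAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d+)\s")


def find_gaps(lines, threshold=timedelta(hours=1)):
    """Yield (gap, previous_line, current_line) for each pause longer than threshold.

    Hypothetical helper for illustration only.
    """
    prev_time = None
    prev_line = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            # Continuation lines such as "  'bootloader-grub2'" carry no stamp.
            continue
        hours, minutes, seconds, fraction = match.groups()
        current = timedelta(
            hours=int(hours),
            minutes=int(minutes),
            seconds=int(seconds),
            # Pad or cut the fractional part to microseconds.
            microseconds=int(fraction.ljust(6, "0")[:6]),
        )
        if prev_time is not None:
            gap = current - prev_time
            if gap < timedelta(0):
                # Assumption: the log carries no dates, so a backwards
                # jump is treated as a single midnight rollover.
                gap += timedelta(days=1)
            if gap > threshold:
                yield gap, prev_line, line.rstrip()
        prev_time, prev_line = current, line.rstrip()


if __name__ == "__main__":
    excerpt = [
        "16:24:03.5060 1116 <<< testapi::assert_screen(mustmatch=[",
        "  'bootloader-shim-import-prompt',",
        "  'bootloader-grub2'",
        "], timeout=30)",
        "01:16:43.0065 1109 signalhandler got TERM - loop 1",
    ]
    for gap, before, after in find_gaps(excerpt):
        print(f"{gap} between:\n  {before}\n  {after}")
```

Run on the excerpt above, this reports a single gap of 8:52:39.500500 between the `assert_screen` call and the `signalhandler got TERM` line.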

The journal on the worker does not tell much either:

Nov 23 17:23:22 seattle14 worker[18759]: [Thu Nov 23 17:23:22 2017] [worker:info] 1109: WORKING 1268572
Nov 24 02:16:42 seattle14 worker[18759]: [Fri Nov 24 02:16:42 2017] [worker:warn] max job time exceeded, aborting 01268572-sle-15-Installer-DVD-aarch64-Build3
Nov 24 02:16:43 seattle14 worker[18759]: killed 1109
Nov 24 02:17:24 seattle14 worker[18759]: [Fri Nov 24 02:17:24 2017] [worker:warn] job 1268572 spent more time than MAX_JOB_TIME

Reproducible

First occurrence in some time

Problem

  • Can we know? Do we need more logging here?

History

#1 Updated by okurz about 2 years ago

  • Subject changed from [aarch64]worker incompletes job after ~9h on seattle14:2 - no action in log for more than 8h to [aarch64]worker incompletes job with timeout after ~9h on seattle14:2 - no action in log for more than 8h
  • Priority changed from Normal to Urgent

This seems to be very severe because we see it reproducing, e.g. see https://openqa.suse.de/tests/1282870/file/autoinst-log.txt

Last time it happened on seattle14:2, now on seattle14:1, so I assume it is specific to the worker host. Could it be something about the network there?

#2 Updated by okurz about 2 years ago

  • Subject changed from [aarch64]worker incompletes job with timeout after ~9h on seattle14:2 - no action in log for more than 8h to [aarch64]worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h

#3 Updated by coolo about 2 years ago

  • Project changed from openQA Project to openQA Tests
  • Subject changed from [aarch64]worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [aarch64] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
  • Category changed from Concrete Bugs to Infrastructure

take down that worker - this hardware is no good

#4 Updated by okurz about 2 years ago

  • Status changed from New to In Progress
  • Assignee set to szarate

I checked on seattle14 and no openQA worker has been running there since 15:00 today. I assume someone stopped it for that reason?

@foursixnine you have the orthos machine reserved. How do you want to proceed, unreserve?

#5 Updated by szarate about 2 years ago

  • Project changed from openQA Tests to openQA Project
  • Subject changed from [aarch64] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
  • Category changed from Infrastructure to 168
  • Status changed from In Progress to Resolved
  • Target version set to Current Sprint

All of the orthos machines I have reserved ATM will move to staging, but this is not planned yet.

The Seattles were removed from production on osd, and the workers have been stopped.

#6 Updated by szarate about 2 years ago

  • Subject changed from [aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][Sprint 201711.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h

#7 Updated by szarate about 2 years ago

  • Subject changed from [tools][Sprint 201711.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][sprint 201712.2] aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h

#8 Updated by szarate about 2 years ago

  • Subject changed from [tools][sprint 201712.2] aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][sprint 201712.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h

#9 Updated by szarate almost 2 years ago

  • Target version changed from Current Sprint to Done
