action #28325
closed
[tools][sprint 201712.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
Description
Observation
https://openqa.suse.de/tests/1268572/file/autoinst-log.txt
shows:
16:24:03.5060 1116 <<< testapi::assert_screen(mustmatch=[
'bootloader-shim-import-prompt',
'bootloader-grub2'
], timeout=30)
01:16:43.0065 1109 signalhandler got TERM - loop 1
So there is a gap of almost nine hours between the last logged action and the TERM signal.
The journal on the worker does not tell much either:
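To put a number on that gap, here is a minimal Python sketch that scans log lines for the largest pause between two consecutive timestamps. The function names `parse_ts` and `largest_gap` are hypothetical, and the timestamp layout (`HH:MM:SS.ffff` at the start of a line) is inferred only from the excerpt above. Because the log lines carry no date, the sketch assumes that a timestamp going backwards means the log crossed midnight once.

```python
import re
from datetime import timedelta

# Timestamp layout assumed from the excerpt: "16:24:03.5060 1116 <<< ..."
TS = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d+)\s")


def parse_ts(line):
    """Return the time of day as a timedelta, or None for continuation lines."""
    m = TS.match(line)
    if m is None:
        return None
    hours, minutes, seconds, frac = m.groups()
    # Pad or cut the fractional part to microseconds.
    micros = int(frac.ljust(6, "0")[:6])
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(seconds), microseconds=micros)


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest pause found."""
    best = (timedelta(0), None, None)
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # e.g. the wrapped arguments of assert_screen
        if prev_ts is not None:
            gap = ts - prev_ts
            if gap < timedelta(0):
                # Assumption: a backwards jump is a single midnight rollover.
                gap += timedelta(days=1)
            if gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best


SAMPLE = """\
16:24:03.5060 1116 <<< testapi::assert_screen(mustmatch=[
'bootloader-shim-import-prompt',
'bootloader-grub2'
], timeout=30)
01:16:43.0065 1109 signalhandler got TERM - loop 1
""".splitlines()
```

Calling `largest_gap(SAMPLE)` on the excerpt gives a gap of 8:52:39.500500, which lies between the `assert_screen` call and the TERM signal.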
Nov 23 17:23:22 seattle14 worker[18759]: [Thu Nov 23 17:23:22 2017] [worker:info] 1109: WORKING 1268572
Nov 24 02:16:42 seattle14 worker[18759]: [Fri Nov 24 02:16:42 2017] [worker:warn] max job time exceeded, aborting 01268572-sle-15-Installer-DVD-aarch64-Build3
Nov 24 02:16:43 seattle14 worker[18759]: killed 1109
Nov 24 02:17:24 seattle14 worker[18759]: [Fri Nov 24 02:17:24 2017] [worker:warn] job 1268572 spent more time than MAX_JOB_TIME
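The run time implied by the journal can be checked the same way. The following Python sketch reads the bracketed worker timestamps and computes the time between the `WORKING` line and the `max job time exceeded` line. The helper names `worker_stamp` and `elapsed_between` are hypothetical, and the bracket format is taken only from the journal lines quoted above.

```python
import re
from datetime import datetime

# Bracketed worker timestamp as quoted above: "[Thu Nov 23 17:23:22 2017]"
STAMP = re.compile(r"\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\]")


def worker_stamp(line):
    """Parse the bracketed timestamp of a worker log line, if present."""
    m = STAMP.search(line)
    if m is None:
        return None
    # %a and %b expect English day and month names (default C locale).
    return datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")


def elapsed_between(lines, start_marker, end_marker):
    """Time between the first start_marker line and the next end_marker line."""
    start = None
    for line in lines:
        ts = worker_stamp(line)
        if ts is None:
            continue  # lines such as "killed 1109" carry no bracketed stamp
        if start is None and start_marker in line:
            start = ts
        elif start is not None and end_marker in line:
            return ts - start
    return None


JOURNAL = [
    "Nov 23 17:23:22 seattle14 worker[18759]: [Thu Nov 23 17:23:22 2017] "
    "[worker:info] 1109: WORKING 1268572",
    "Nov 24 02:16:42 seattle14 worker[18759]: [Fri Nov 24 02:16:42 2017] "
    "[worker:warn] max job time exceeded, aborting "
    "01268572-sle-15-Installer-DVD-aarch64-Build3",
    "Nov 24 02:16:43 seattle14 worker[18759]: killed 1109",
]
```

For the lines above, `elapsed_between(JOURNAL, "WORKING", "max job time exceeded")` yields 8:53:20, i.e. 32000 seconds between the job start and the abort.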
Reproducible
First occurrence in some time
Problem
- Can we find out what happened? Do we need more logging here?
Updated by okurz about 7 years ago
- Subject changed from [aarch64]worker incompletes job after ~9h on seattle14:2 - no action in log for more than 8h to [aarch64]worker incompletes job with timeout after ~9h on seattle14:2 - no action in log for more than 8h
- Priority changed from Normal to Urgent
This seems to be very severe because we see it reproducing, e.g. see https://openqa.suse.de/tests/1282870/file/autoinst-log.txt
Last time it happened on seattle14:2, now on seattle14:1, so I assume it is specific to the worker host. Could it be something about the network?
Updated by okurz about 7 years ago
- Subject changed from [aarch64]worker incompletes job with timeout after ~9h on seattle14:2 - no action in log for more than 8h to [aarch64]worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
Updated by coolo about 7 years ago
- Project changed from openQA Project (public) to openQA Tests (public)
- Subject changed from [aarch64]worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [aarch64] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
- Category changed from Regressions/Crashes to Infrastructure
Take down that worker, this hardware is no good.
Updated by okurz about 7 years ago
- Status changed from New to In Progress
- Assignee set to szarate
I checked on seattle14 and no openQA worker has been running there since 15:00 today. I assume someone stopped it for that reason?
@foursixnine you have the orthos machine reserved. How do you want to proceed, unreserve?
Updated by szarate about 7 years ago
- Project changed from openQA Tests (public) to openQA Project (public)
- Subject changed from [aarch64] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
- Category changed from Infrastructure to 168
- Status changed from In Progress to Resolved
- Target version set to Current Sprint
All of the orthos machines I have reserved at the moment will move to staging, but this is not planned yet.
The Seattles were removed from production on osd and their workers have been stopped.
Updated by szarate almost 7 years ago
- Subject changed from [aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][Sprint 201711.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
Updated by szarate almost 7 years ago
- Subject changed from [tools][Sprint 201711.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][sprint 201712.2] aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
Updated by szarate almost 7 years ago
- Subject changed from [tools][sprint 201712.2] aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][sprint 201712.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
Updated by szarate almost 7 years ago
- Target version changed from Current Sprint to Done