action #28325
closed
[tools][sprint 201712.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
Added by okurz over 7 years ago.
Updated about 7 years ago.
Category:
Feature requests
Description
Observation
https://openqa.suse.de/tests/1268572/file/autoinst-log.txt
shows:
16:24:03.5060 1116 <<< testapi::assert_screen(mustmatch=[
'bootloader-shim-import-prompt',
'bootloader-grub2'
], timeout=30)
01:16:43.0065 1109 signalhandler got TERM - loop 1
so there is a gap of almost nine hours between the two entries, with nothing logged in between.
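To put a number on the gap, here is a minimal sketch that subtracts the two autoinst-log timestamps quoted above. The helper name `gap_between` is made up for illustration, and it assumes the later entry falls on the following day whenever its time of day is smaller than the earlier one (the log lines carry no date).

```python
from datetime import datetime, timedelta


def gap_between(earlier: str, later: str) -> timedelta:
    """Return the time elapsed between two HH:MM:SS.ffff log timestamps.

    Assumption: the log has no date, so if `later` looks smaller than
    `earlier` we treat it as belonging to the next day (one midnight wrap).
    """
    fmt = "%H:%M:%S.%f"
    start = datetime.strptime(earlier, fmt)
    end = datetime.strptime(later, fmt)
    if end < start:
        end += timedelta(days=1)
    return end - start


# Timestamps copied from the autoinst-log.txt excerpt above.
gap = gap_between("16:24:03.5060", "01:16:43.0065")
print(gap)  # 8:52:39.500500
```

So the test was silent for roughly 8 h 53 min after the `assert_screen` call, even though that call had a 30 second timeout.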
The journal on the worker does not tell much either:
Nov 23 17:23:22 seattle14 worker[18759]: [Thu Nov 23 17:23:22 2017] [worker:info] 1109: WORKING 1268572
Nov 24 02:16:42 seattle14 worker[18759]: [Fri Nov 24 02:16:42 2017] [worker:warn] max job time exceeded, aborting 01268572-sle-15-Installer-DVD-aarch64-Build3
Nov 24 02:16:43 seattle14 worker[18759]: killed 1109
Nov 24 02:17:24 seattle14 worker[18759]: [Fri Nov 24 02:17:24 2017] [worker:warn] job 1268572 spent more time than MAX_JOB_TIME
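The same check can be done on the journal side. The following is only a sketch: `worker_timestamp` is a hypothetical helper that pulls the bracketed worker timestamp (the one that includes the year) out of a journal line, assuming English month and day names as in the excerpt above.

```python
import re
from datetime import datetime, timedelta

# Matches the worker's own timestamp, e.g. "[Thu Nov 23 17:23:22 2017]".
# It does not match "worker[18759]" or "[worker:info]".
STAMP = re.compile(r"\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\]")


def worker_timestamp(line: str) -> datetime:
    """Extract the bracketed worker timestamp from one journal line."""
    match = STAMP.search(line)
    if match is None:
        raise ValueError("no worker timestamp found in line")
    return datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")


# Lines copied from the journal excerpt above.
started = worker_timestamp(
    "Nov 23 17:23:22 seattle14 worker[18759]: "
    "[Thu Nov 23 17:23:22 2017] [worker:info] 1109: WORKING 1268572"
)
aborted = worker_timestamp(
    "Nov 24 02:16:42 seattle14 worker[18759]: "
    "[Fri Nov 24 02:16:42 2017] [worker:warn] max job time exceeded, "
    "aborting 01268572-sle-15-Installer-DVD-aarch64-Build3"
)

runtime: timedelta = aborted - started
print(runtime)  # 8:53:20
```

The worker therefore held the job for 8 h 53 min 20 s before aborting it with "max job time exceeded", which matches the gap seen in autoinst-log.txt.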
Reproducible
First occurrence in some time
Problem
- Can we find out what happened during the gap? Do we need more logging here?
- Subject changed from [aarch64]worker incompletes job after ~9h on seattle14:2 - no action in log for more than 8h to [aarch64]worker incompletes job with timeout after ~9h on seattle14:2 - no action in log for more than 8h
- Priority changed from Normal to Urgent
- Subject changed from [aarch64]worker incompletes job with timeout after ~9h on seattle14:2 - no action in log for more than 8h to [aarch64]worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
- Project changed from openQA Project (public) to openQA Tests (public)
- Subject changed from [aarch64]worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [aarch64] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
- Category changed from Regressions/Crashes to Infrastructure
take down that worker - this hardware is no good
- Status changed from New to In Progress
- Assignee set to szarate
I checked on seattle14 and no openQA worker has been running there since 15:00 today. I assume someone stopped it for that reason?
@foursixnine you have the orthos machine reserved. How do you want to proceed, unreserve?
- Project changed from openQA Tests (public) to openQA Project (public)
- Subject changed from [aarch64] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
- Category changed from Infrastructure to 168
- Status changed from In Progress to Resolved
- Target version set to Current Sprint
All of the orthos machines I have reserved ATM will move to staging, but this is not planned yet.
The seattle machines were removed from production on osd, and their workers have been stopped.
- Subject changed from [aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][Sprint 201711.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
- Subject changed from [tools][Sprint 201711.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][sprint 201712.2] aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
- Subject changed from [tools][sprint 201712.2] aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h to [tools][sprint 201712.2][aarch64][bonus] worker incompletes job with timeout after ~9h on seattle14 - no action in log for more than 8h
- Target version changed from Current Sprint to Done