action #44162

closed

Various tests stayed 'running' for ~ 4 hours or longer

Added by dimstar about 6 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
-
Start date:
2018-11-21
Due date:
% Done:

0%

Estimated time:

Description

Sample test:

https://openqa.opensuse.org/tests/801664 (raid10/TW) and https://openqa.opensuse.org/tests/801657 (cryptlvm/leap15.1)

It was a clone of an earlier job that finished as incomplete.

The restarted job stayed around for 4 hours but did not really make progress. The test usually finishes much quicker: the normal runtime of RAID10/TW is 30 to 45 minutes.


Related issues 1 (0 open, 1 closed)

Related to openQA Project (public) - action #44105: if workercache dies, we get *tons* of incompletes (Resolved, mkittler, 2018-11-21)

Actions #1

Updated by okurz about 6 years ago

  • Related to action #44105: if workercache dies, we get *tons* of incompletes added
Actions #2

Updated by okurz about 6 years ago

  • Subject changed from Various tests staid 'running' for ~ 4 hours to Various tests stayed 'running' for ~ 4 hours
Actions #3

Updated by okurz over 5 years ago

  • Category set to Regressions/Crashes
Actions #4

Updated by okurz over 5 years ago

  • Subject changed from Various tests stayed 'running' for ~ 4 hours to Various tests stayed 'running' for ~ 4 hours or longer

Let me hijack this ticket to reference the most recent examples on OSD, which have been running for > 6 days (!):

All of them have already been cloned to newer jobs, and they no longer even block the workers, as the assigned worker has since executed other jobs just fine. Cancelling the job over the web UI does not work, and restarting the worker instance systemd job was not successful either. Manually deleting the job does work, but I haven't done that on the above jobs, just on an older one last week.

Actions #5

Updated by okurz almost 5 years ago

  • Status changed from New to Resolved
  • Assignee set to okurz

Since then we have improved stall detection and worker handling, and refactored the worker-websockets-webui connection. I haven't observed this lately, so I think we actually have it covered.
