action #177366

closed

coordination #161414: [epic] Improved salt based infrastructure management

osd deployment "test.ping" check runs into gitlab CI timeout

Added by okurz 14 days ago. Updated 6 days ago.

Status: Resolved
Priority: Low
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Observation

From https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3824667#L36

++ retry ssh openqa.suse.de 'sudo salt \* test.ping'
$ retry ssh $TARGET "sudo salt \* test.ping"
...[…]..Terminated
.Retrying up to 3 more times after sleeping 3s …
.Terminated
+++ kill %1
WARNING: step_script could not run to completion because the timeout was exceeded. For more control over job and script timeouts see: https://docs.gitlab.com/ee/ci/runners/configure_runners.html#set-script-and-after_script-timeouts
ERROR: Job failed: execution took longer than 2h0m0s seconds

In #175407 we raised the global salt timeout on the salt master to a much higher value. As a result, even a simple test.ping waits for that very long timeout whenever a minion does not reply. Combined with up to 3 retries, the job exceeds the 2h gitlab CI timeout, and we get no feedback on which hosts did not respond.
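The arithmetic behind this can be sketched roughly as follows. The per-attempt timeout of 2400s is purely hypothetical (the value set in #175407 is not stated here), and the sketch assumes a constant sleep between attempts and that each attempt runs into the full timeout.

```shell
#!/bin/sh
# Worst-case wall time of a retried command, in seconds.
# Attempts = 1 initial try + retries, with one sleep before each retry.
worst_case_seconds() {
    timeout=$1   # per-attempt timeout in seconds (hypothetical value)
    retries=$2   # "up to 3 more times" in the log above
    pause=$3     # sleep between attempts, 3s in the log above
    echo $(( (retries + 1) * timeout + retries * pause ))
}

ci_limit=7200  # 2h gitlab CI job timeout
total=$(worst_case_seconds 2400 3 3)
if [ "$total" -gt "$ci_limit" ]; then
    echo "worst case ${total}s exceeds CI limit ${ci_limit}s"
fi
```

With a short timeout such as 5s, the same formula gives well under a minute, so the job would fail fast instead of hitting the CI limit.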

Acceptance criteria

  • AC1: test.ping returns reasonably fast
  • AC2: state.apply still uses the much longer timeout

Suggestions

  • We could explicitly apply a shorter timeout on test.ping, e.g. salt -t 5, or keep the global timeout short and apply longer timeouts only on state.apply. Maybe it's also possible to configure custom timeouts per command?
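A minimal sketch of the per-command idea, assuming the timeout is chosen on the command line via salt's -t option. The helper name salt_cmd is hypothetical and not part of the osd-deployment scripts. It only prints the command line instead of running it, and it is not necessarily what was eventually merged.

```shell
#!/bin/sh
# Build a salt command line with a per-function timeout.
# salt_cmd is a hypothetical helper for illustration only.
salt_cmd() {
    fn=$1
    case "$fn" in
        # AC1: a ping should fail fast; 5s is the example value from the suggestion.
        test.ping) printf '%s\n' "sudo salt -t 5 \\* $fn" ;;
        # AC2: everything else keeps the master's global (long) timeout.
        *)         printf '%s\n' "sudo salt \\* $fn" ;;
    esac
}

salt_cmd test.ping
salt_cmd state.apply
```

The printed string could then be passed to the existing call, e.g. retry ssh $TARGET "$(salt_cmd test.ping)", so that only the ping is bounded while state.apply keeps the long timeout.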

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S (Resolved, okurz)

#1

Updated by okurz 14 days ago

  • Copied from action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S added
#2

Updated by okurz 11 days ago

  • Priority changed from Normal to Low
#4

Updated by okurz 6 days ago

  • Status changed from Feedback to Resolved

Merged and effective. The deployment should be more stable again.
