action #175407

closed

coordination #161414: [epic] Improved salt based infrastructure management

salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S

Added by nicksinger 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

While looking into https://progress.opensuse.org/issues/174985, @nicksinger ran salt 'monitor.qe.nue2.suse.org' state.apply, which timed out. Running the mentioned salt-run jobs.lookup_jid 20250114115421987682 showed:

[…]
----------
          ID: dehydrated.timer
    Function: service.running
      Result: True
     Comment: The service dehydrated.timer is already running
     Started: 12:56:56.183595
    Duration: 126.445 ms
     Changes:
----------
          ID: systemctl start dehydrated
    Function: cmd.run
      Result: False
     Comment: The following requisites were not found:
                                 require:
                                     id: webserver_config
     Started: 12:56:56.324904
    Duration: 0.008 ms
     Changes:

Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed:      1
--------------
Total states run:     457
Total run time:   122.178 s

Apparently this broke with https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/6cd458a57044b50f60df936ac04b1033534c7a9d#e83a0785fb156ea339320f4bf1d083717c84a2ba_11_13, but we never noticed. Do we truncate our salt logs too much?
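One way to keep a failure like the one above from going unnoticed would be to have the pipeline inspect the structured return instead of relying on (possibly truncated) text logs. The following is only a minimal sketch, not what salt-states-openqa actually does. It assumes the state run was captured as a single JSON document (for example with salt's --out=json --static options) that maps each minion to a dict of state results, each carrying a "result" and a "comment" field. The helper name failed_states and the SAMPLE data, including the exact state keys, are hypothetical and only modelled on the output shown above.

```python
import json


def failed_states(salt_json: str) -> dict[str, list[str]]:
    """Map each minion to the problems found in its state.apply return."""
    problems: dict[str, list[str]] = {}
    for minion, ret in json.loads(salt_json).items():
        if not isinstance(ret, dict):
            # Assumption: anything other than a dict of states (an error
            # string, a list of errors, False) means the run itself failed.
            problems[minion] = [f"no state results: {ret!r}"]
            continue
        # Collect every state whose result is explicitly False.
        failed = [
            f"{state_id}: {state.get('comment', '')}"
            for state_id, state in ret.items()
            if isinstance(state, dict) and state.get("result") is False
        ]
        if failed:
            problems[minion] = failed
    return problems


# Hypothetical sample modelled on the output in this ticket.
SAMPLE = json.dumps({
    "monitor.qe.nue2.suse.org": {
        "service_|-dehydrated.timer_|-dehydrated.timer_|-running": {
            "result": True,
            "comment": "The service dehydrated.timer is already running",
        },
        "cmd_|-systemctl start dehydrated_|-systemctl start dehydrated_|-run": {
            "result": False,
            "comment": "The following requisites were not found: "
                       "require: id: webserver_config",
        },
    }
})
```

A CI job could call failed_states() on the captured output and exit non-zero whenever the returned dict is not empty, so that a single failed state out of 457 fails the pipeline.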

Acceptance criteria

  • AC1: We have a stable and clean salt-states-openqa pipeline again
  • AC2: A pipeline only succeeds if all currently salt-controlled hosts responded
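AC2 could be checked along the following lines. This is a sketch under stated assumptions, not the implemented solution: the set of expected hosts has to come from somewhere (for instance the accepted keys on the salt master), and a "usable answer" is assumed to be True for a test.ping or a dict of state results for a state.apply. The function name unresponsive_hosts and the example.org host names are hypothetical.

```python
def unresponsive_hosts(expected, returns):
    """Return the sorted list of expected minions without a usable answer."""
    missing = []
    for minion in sorted(expected):
        ret = returns.get(minion)
        # Assumption: True (test.ping) or a dict of state results
        # (state.apply) counts as a response; everything else, including
        # a missing entry, False, or an error string, does not.
        if ret is True or isinstance(ret, dict):
            continue
        missing.append(minion)
    return missing


# Hypothetical example: one host answered, one is not connected,
# and one is absent from the return altogether.
EXPECTED = {
    "monitor.qe.nue2.suse.org",
    "host-a.example.org",
    "host-b.example.org",
}
RETURNS = {
    "monitor.qe.nue2.suse.org": True,
    "host-a.example.org": False,
}
```

With these inputs, unresponsive_hosts(EXPECTED, RETURNS) lists the two example.org hosts, and the pipeline would fail as long as that list is not empty.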

Suggestions


Related issues 6 (0 open, 6 closed)

Related to openQA Infrastructure (public) - action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S (Rejected, nicksinger, 2025-01-03)

Related to openQA Infrastructure (public) - action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 (Resolved, okurz, 2024-07-10)

Related to openQA Infrastructure (public) - action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S (Resolved, jbaier_cz, 2025-01-22)

Blocks openQA Infrastructure (public) - action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" (Resolved, okurz, 2025-01-16)

Copied to openQA Infrastructure (public) - action #175629: diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S (Resolved, nicksinger, 2025-01-16)

Copied to openQA Infrastructure (public) - action #177366: osd deployment "test.ping" check runs into gitlab CI timeout (Resolved, okurz)
