action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #175407

closed

coordination #161414: [epic] Improved salt based infrastructure management

salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S

Added by nicksinger 4 months ago. Updated 3 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

openQA Project (public) - Ready

Tags:

osd, salt, infra, bug, reactive work, salt-states-openqa

Description

Observation¶

While looking into https://progress.opensuse.org/issues/174985, @nicksinger ran salt 'monitor.qe.nue2.suse.org' state.apply, which timed out. The output of the corresponding salt-run jobs.lookup_jid 20250114115421987682 showed:

[…]
----------
          ID: dehydrated.timer
    Function: service.running
      Result: True
     Comment: The service dehydrated.timer is already running
     Started: 12:56:56.183595
    Duration: 126.445 ms
     Changes:
----------
          ID: systemctl start dehydrated
    Function: cmd.run
      Result: False
     Comment: The following requisites were not found:
                                 require:
                                     id: webserver_config
     Started: 12:56:56.324904
    Duration: 0.008 ms
     Changes:

Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed:      1
--------------
Total states run:     457
Total run time:   122.178 s

Apparently this broke with https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/6cd458a57044b50f60df936ac04b1033534c7a9d#e83a0785fb156ea339320f4bf1d083717c84a2ba_11_13 but we never noticed. Do we truncate our salt logs too much?
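To illustrate how such a failure could be surfaced instead of being lost in truncated logs, here is a minimal sketch that scans the default highstate text output for states with "Result: False". The function name find_failed_states and the SAMPLE constant are hypothetical; the sample text is taken from the output above, and the parsing assumes the layout shown there (state blocks separated by lines of dashes).

```python
import re

# Excerpt of the highstate output shown above, used as sample input.
SAMPLE = """\
----------
          ID: dehydrated.timer
    Function: service.running
      Result: True
     Comment: The service dehydrated.timer is already running
----------
          ID: systemctl start dehydrated
    Function: cmd.run
      Result: False
     Comment: The following requisites were not found:
                                 require:
                                     id: webserver_config
"""

# Matches the first line of the fields we care about within one state block.
FIELD_RE = re.compile(
    r"^[ \t]*(ID|Function|Result|Comment):[ \t]*(.*?)[ \t]*$", re.MULTILINE
)


def find_failed_states(output: str) -> list[dict]:
    """Hypothetical helper: return the fields of every state whose Result is False."""
    failed = []
    # Assumption: state blocks are separated by lines consisting only of dashes.
    for block in re.split(r"^-{10,}[ \t]*$", output, flags=re.MULTILINE):
        fields = dict(FIELD_RE.findall(block))
        if fields.get("Result") == "False":
            failed.append(fields)
    return failed


if __name__ == "__main__":
    for state in find_failed_states(SAMPLE):
        print(f"FAILED: {state['ID']} ({state['Function']}): {state['Comment']}")
```

A pipeline step could exit non-zero whenever this list is non-empty, so that a single broken state is reported even if the rest of the output is hidden.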

Acceptance criteria¶

AC1: We have a stable and clean salt-states-openqa pipeline again
AC2: A pipeline only succeeds if all currently salt controlled hosts responded
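One way AC2 could be checked is sketched below, under the assumption that the pipeline has a list of expected hosts and a mapping from host name to the response each host gave (for example True for a successful ping, or an error string otherwise). The names missing_hosts and pipeline_ok and the example data are hypothetical and not part of salt-states-openqa.

```python
def missing_hosts(expected: set[str], responses: dict[str, object]) -> list[str]:
    """Return the expected hosts that did not answer with a plain True."""
    # A host counts as missing if it is absent from the responses or if it
    # returned anything other than True (e.g. an error message string).
    return sorted(host for host in expected if responses.get(host) is not True)


def pipeline_ok(expected: set[str], responses: dict[str, object]) -> bool:
    """The pipeline only succeeds if every expected host responded."""
    return not missing_hosts(expected, responses)


if __name__ == "__main__":
    # Illustrative data only: host names as mentioned in this ticket,
    # response values are assumptions.
    expected = {"monitor.qe.nue2.suse.org", "diesel", "petrol"}
    responses = {"monitor.qe.nue2.suse.org": True, "diesel": "Not connected"}
    print(missing_hosts(expected, responses))
    raise SystemExit(0 if pipeline_ok(expected, responses) else 1)
```

Comparing against an explicit list of expected hosts, rather than only inspecting the hosts that answered, is what makes a silent host visible.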

Suggestions¶

Does the same happen with other hosts?
Check the job artifacts for salt deployments in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3609342/artifacts/browse
Check what was done in the past to hide more of the salt output on pipeline runs, e.g. cross-check the execution line https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3439811#L42
Not failing everything because of a small issue is good, but how do we make sure "small issues" don't get out of hand?
- Can we somehow extract metrics from salt? From our pipelines? (e.g. "success percentage" of workers per pipeline, etc.)
- Think about oqa-minion-jobs: they can fail but we only care if the failures exceed a certain threshold
Research why we put "hide-timeout" into the salt state call, see commit c825939 from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/254, which unfortunately contains no further explanation
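The "success percentage" and threshold idea from the suggestions above could look roughly like the following sketch. It assumes a per-pipeline mapping from host name to a success flag; the function names and the default threshold of 90 percent are hypothetical and chosen only for illustration.

```python
def success_percentage(results: dict[str, bool]) -> float:
    """Percentage of hosts whose salt run succeeded; 0.0 if there are no results."""
    if not results:
        return 0.0
    return 100.0 * sum(results.values()) / len(results)


def should_alert(results: dict[str, bool], threshold_percent: float = 90.0) -> bool:
    """Alert only when the success percentage drops below the threshold."""
    return success_percentage(results) < threshold_percent


if __name__ == "__main__":
    # Illustrative data only: host names as mentioned in this ticket,
    # success flags are assumptions.
    results = {
        "monitor.qe.nue2.suse.org": False,
        "diesel": True,
        "petrol": True,
        "s390zl12": True,
    }
    print(success_percentage(results), should_alert(results))
```

Such a value could also be exported as a metric per pipeline run, so that a single host failing for weeks shows up as a persistent dip rather than going unnoticed.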

Related issues 6 (0 open — 6 closed)

Related to openQA Infrastructure (public) - action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S (Rejected, nicksinger, 2025-01-03)

Related to openQA Infrastructure (public) - action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 (Resolved, okurz, 2024-07-10)

Related to openQA Infrastructure (public) - action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S (Resolved, jbaier_cz, 2025-01-22)

Blocks openQA Infrastructure (public) - action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" (Resolved, okurz, 2025-01-16)

Copied to openQA Infrastructure (public) - action #175629: diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S (Resolved, nicksinger, 2025-01-16)

Copied to openQA Infrastructure (public) - action #177366: osd deployment "test.ping" check runs into gitlab CI timeout (Resolved, okurz)
