action #175407

coordination #161414: [epic] Improved salt based infrastructure management

salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S

Added by nicksinger 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

While looking into https://progress.opensuse.org/issues/174985, @nicksinger ran salt 'monitor.qe.nue2.suse.org' state.apply, which timed out. The corresponding salt-run jobs.lookup_jid 20250114115421987682 showed:

[…]
----------
          ID: dehydrated.timer
    Function: service.running
      Result: True
     Comment: The service dehydrated.timer is already running
     Started: 12:56:56.183595
    Duration: 126.445 ms
     Changes:
----------
          ID: systemctl start dehydrated
    Function: cmd.run
      Result: False
     Comment: The following requisites were not found:
                                 require:
                                     id: webserver_config
     Started: 12:56:56.324904
    Duration: 0.008 ms
     Changes:

Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed:      1
--------------
Total states run:     457
Total run time:   122.178 s

Apparently this broke with https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/6cd458a57044b50f60df936ac04b1033534c7a9d#e83a0785fb156ea339320f4bf1d083717c84a2ba_11_13 but we never realized it. Are we truncating our salt logs too much?
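
For illustration, a minimal hypothetical SLS sketch of how such a failure comes about (the state IDs mirror the output above, but file names and paths are invented and not the actual salt-states-openqa content): a require entry only resolves against a state ID that still exists in the compiled highstate, so once the target ID is renamed or removed, every state referencing it fails exactly like this.

webserver_config:                             # if this state ID is renamed or removed ...
  file.managed:
    - name: /etc/dehydrated/config            # illustrative path only
    - source: salt://monitor/dehydrated.conf  # illustrative source only

systemctl start dehydrated:
  cmd.run:                                    # with no explicit name, cmd.run executes the state ID as the command
    - require:
      - webserver_config                      # ... this reference (matched by state ID) can no longer be resolved
                                              # and the state fails with "The following requisites were not found"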

Acceptance criteria

  • AC1: We have a stable and clean salt-states-openqa pipeline again
  • AC2: A pipeline only succeeds if all currently salt controlled hosts responded

Suggestions


Related issues: 6 (0 open, 6 closed)

Related to openQA Infrastructure (public) - action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S (Rejected, nicksinger, 2025-01-03)

Related to openQA Infrastructure (public) - action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 (Resolved, okurz, 2024-07-10)

Related to openQA Infrastructure (public) - action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S (Resolved, jbaier_cz, 2025-01-22)

Blocks openQA Infrastructure (public) - action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" (Resolved, okurz, 2025-01-16)

Copied to openQA Infrastructure (public) - action #175629: diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S (Resolved, nicksinger, 2025-01-16)

Copied to openQA Infrastructure (public) - action #177366: osd deployment "test.ping" check runs into gitlab CI timeout (Resolved, okurz)

Actions #1

Updated by nicksinger 4 months ago

  • Related to action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S added
Actions #2

Updated by okurz 4 months ago

  • Tags changed from infra, bug, salt-states-openqa, salt, osd to infra, bug, salt-states-openqa, salt, osd, reactive work
Actions #3

Updated by robert.richardson 4 months ago

  • Subject changed from salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us to salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size: S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 4 months ago

  • Description updated (diff)
Actions #5

Updated by okurz 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #6

Updated by okurz 3 months ago

  • Subject changed from salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size: S to salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S
Actions #7

Updated by okurz 3 months ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by nicksinger 3 months ago

okurz wrote in #note-7:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1336

Merged. Interestingly, in your deployment job, monitor is missing again. However, the associated salt_highstate.log artifact somehow does mention it:

2025-01-14 16:41:51,723 [salt.client      :1167][DEBUG   ][12951] get_iter_returns for jid 20250114164151566706 sent to {'…-prg4.qa.suse.cz', 'grenache-1.oqa.prg2.suse.org', 'baremetal-support.qe.nue2.suse.org', 'monitor.qe.nue2.suse.org', 'storage.qe.prg2.suse.org', …} will timeout at 16:43:21.723622

In a subsequent MR deployment job, monitor shows up with the expected issue. However, for some reason OSD timed out. I looked up the job (with salt-run jobs.lookup_jid 20250114193508323824) and found:

Summary for openqa.suse.de
--------------
Succeeded: 629 (changed=117)
Failed:      0
--------------
Total states run:     629
Total run time:   172.592 s
[…]
Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed:      1
--------------
Total states run:     457
Total run time:   109.676 s
[…]
Summary for backup-qam.qe.nue2.suse.org
--------------
Succeeded: 286
Failed:      0
--------------
Total states run:     286
Total run time:   161.726 s

So either your change did not apply correctly, or we need to bump the timeout even more because the full run includes some additional overhead.

For the initial issue of the missing salt state requirement, I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1338

Actions #9

Updated by nicksinger 3 months ago

I wanted to check what would be a good value to bump it to and looked into https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3666125

$ salt-run jobs.lookup_jid 20250116085031017432
Summary for petrol.qe.nue2.suse.org
--------------
Succeeded: 423 (changed=2)
Failed:      1
--------------
Total states run:     424
Total run time:  3219.013 s
[…]
Summary for diesel.qe.nue2.suse.org
--------------
Succeeded: 422 (changed=2)
Failed:      1
--------------
Total states run:     423
Total run time:  1795.992 s

I guess only raising timeouts won't cut it.

Actions #10

Updated by okurz 3 months ago

  • Parent task set to #161414
Actions #11

Updated by okurz 3 months ago

  • Copied to action #175629: diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S added
Actions #12

Updated by okurz 3 months ago

I also observed that diesel+petrol are excessively slow. Reported #175629 for this specific issue.

Actions #13

Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved

Now all salt minions are again included in each run, with no ignoring of timeouts. I triggered multiple deployment jobs which ended in stable, successful salt state CI pipelines, e.g.

Actions #14

Updated by okurz 3 months ago

  • Status changed from Resolved to Workable

The situation is not stable, see https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3677962. Various hosts can run into timeouts. We need to think harder about what to do, or work with even bigger timeouts.

Actions #15

Updated by okurz 3 months ago

  • Status changed from Workable to Feedback
Actions #16

Updated by okurz 3 months ago

ybonatakis suggested looking into "gather_job_timeout" as well; however, based on https://docs.saltproject.io/en/latest/ref/configuration/master.html#gather-job-timeout this does not sound related.
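
For reference, a hedged sketch of the two master options involved (values are illustrative, not necessarily OSD's actual /etc/salt/master settings): gather_job_timeout only controls the short "find_job" polls the master uses to check whether minions are still busy, while the overall wait for returns is governed by timeout or the -t flag on the salt CLI.

# /etc/salt/master -- illustrative values only
timeout: 60             # seconds the CLI waits for minion returns; overridable per call with "salt -t <seconds>"
gather_job_timeout: 10  # seconds to wait for the periodic "find_job" queries that check if minions are still working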

I tried

sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "G@roles:worker and G@osarch:ppc64le" state.apply queue=True | grep -v 'Result.*Clean'

but that failed. nicksinger is currently looking into syntax errors for #162296; waiting for his progress before trying again.

Actions #17

Updated by okurz 3 months ago

Merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1343

I will merge this now but will still test different sets of parameters after current syntax issues are fixed by @nicksinger

Actions #18

Updated by okurz 3 months ago

  • Related to action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" added
Actions #19

Updated by okurz 3 months ago

  • Related to deleted (action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response")
Actions #20

Updated by okurz 3 months ago

  • Blocks action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" added
Actions #21

Updated by okurz 3 months ago

  • Status changed from Feedback to In Progress

I called sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "G@roles:worker and G@osarch:ppc64le" state.apply queue=True | grep -v 'Result.*Clean' and got

…
petrol.qe.nue2.suse.org:
----------
…
Total states run:     428
Total run time:    59.859 s
diesel.qe.nue2.suse.org:
----------
…
Total run time:    66.725 s
grenache-1.oqa.prg2.suse.org:
----------
…
Total run time:    75.593 s
mania.qe.nue2.suse.org:

Summary for mania.qe.nue2.suse.org
--------------
…
Total run time:    75.174 s
## count-fail-ratio: Run: 30. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 10.00%
## mean runtime: 91588±5128.76 ms

So each machine takes 60-75 s, one overall state.apply run takes about 92 s, and there are no failures. Running again with all nodes: sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "*" state.apply queue=True | grep -v 'Result.*Clean'

Actions #22

Updated by openqa_review 3 months ago

  • Due date set to 2025-02-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #23

Updated by okurz 3 months ago

So far many failures; the current status:

## count-fail-ratio: Run: 12. Fails: 12. Fail ratio 100.00±0%
## mean runtime: 3498995±415554.03 ms

That is 3,499 s, which is nearly 1 h. Looking into individual failures.

Actions #24

Updated by okurz 3 months ago

  • Status changed from In Progress to Blocked

The problem is actually mostly #175629, which today hit sapworker1.

Actions #25

Updated by okurz 3 months ago

  • Status changed from Blocked to In Progress

The end of the European business day is approaching, so I am running another set of tests. First, sudo nice env runs=300 count-fail-ratio salt -C \* test.true yields:

## count-fail-ratio: Run: 300. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 1.00%
## mean runtime: 1648±1226.41 ms

So stable and no problem there. Running the full state again.

Actions #26

Updated by okurz 3 months ago

  • Status changed from In Progress to Workable

Results from the overnight run:

## count-fail-ratio: Run: 300. Fails: 16. Fail ratio 5.33±2.54%
## mean runtime: 192065±41227.61 ms

Due to limited screen scrollback I did not record the individual failures. Still, with a fail ratio of 5% I declare this as "stable" enough. AC2 is also covered. The rest is to be followed up, for example in #175629.

I want to do another big run over the next night but with a lower timeout, e.g. salt -t 180 …

Actions #27

Updated by okurz 3 months ago

  • Status changed from Workable to In Progress

Running overnight again, with -t 180.

Actions #28

Updated by okurz 3 months ago

  • Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Actions #29

Updated by okurz 3 months ago

  • Due date deleted (2025-02-04)
  • Status changed from In Progress to Resolved

## count-fail-ratio: Run: 262. Fails: 166. Fail ratio 63.35±5.83%
## mean runtime: 183087±7143.24 ms

So, very bad. But all machines are still reachable, so I think my approach of increasing the timeout was good.

We again observed corruption, see #175710. I think that, one way or another, me running salt state.apply in a loop might help to more easily reproduce whatever underlying sporadic issue we might have.

Actions #30

Updated by jbaier_cz 3 months ago

  • Related to action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S added
Actions #31

Updated by okurz 2 months ago

  • Copied to action #177366: osd deployment "test.ping" check runs into gitlab CI timeout added