action #175407 (closed)

Parent: coordination #161414: [epic] Improved salt based infrastructure management

salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S

Added by nicksinger 24 days ago. Updated 15 days ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

While looking into https://progress.opensuse.org/issues/174985, @nicksinger did a salt 'monitor.qe.nue2.suse.org' state.apply which timed out. The mentioned salt-run jobs.lookup_jid 20250114115421987682 showed:

[…]
----------
          ID: dehydrated.timer
    Function: service.running
      Result: True
     Comment: The service dehydrated.timer is already running
     Started: 12:56:56.183595
    Duration: 126.445 ms
     Changes:
----------
          ID: systemctl start dehydrated
    Function: cmd.run
      Result: False
     Comment: The following requisites were not found:
                                 require:
                                     id: webserver_config
     Started: 12:56:56.324904
    Duration: 0.008 ms
     Changes:

Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed:      1
--------------
Total states run:     457
Total run time:   122.178 s

Apparently this broke with https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/6cd458a57044b50f60df936ac04b1033534c7a9d#e83a0785fb156ea339320f4bf1d083717c84a2ba_11_13 but we never realized it. Do we truncate our salt logs too much?
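
For context, the failing state is a cmd.run state whose require points to a state ID webserver_config that no longer exists after the linked commit. Once the requisite is corrected (as later done in MR 1338), a dry run can confirm that the requisite resolves without applying anything; a minimal sketch using only the standard salt CLI:

# Dry-run the highstate on the affected minion; test=True reports what would
# change without applying it, and an unresolvable requisite should still
# surface as "The following requisites were not found".
sudo salt 'monitor.qe.nue2.suse.org' state.apply test=True

# Optionally hide clean results, as done elsewhere in this ticket:
sudo salt --state-output=changes 'monitor.qe.nue2.suse.org' state.apply test=True | grep -v 'Result.*Clean'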

Acceptance criteria

  • AC1: We have a stable and clean salt-states-openqa pipeline again
  • AC2: A pipeline only succeeds if all currently salt controlled hosts responded

Suggestions
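
For AC2, one possible pipeline check (a rough sketch only, with an assumed artifact name, not an existing script) would be to compare the accepted minions against the hosts that actually produced a state summary in the captured deployment log:

# Sketch: fail the job if any accepted minion did not return a state summary.
# Assumes the state.apply output was captured in salt_highstate.log.
expected=$(sudo salt-key -l acc --out=newline_values_only | sort)
responded=$(grep -oP '^Summary for \K\S+' salt_highstate.log | sort -u)
missing=$(comm -23 <(echo "$expected") <(echo "$responded"))
if [ -n "$missing" ]; then
  echo "No state results from: $missing" >&2
  exit 1
fi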


Related issues 5 (1 open, 4 closed)

  • Related to openQA Infrastructure (public) - action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S (Rejected, nicksinger, 2025-01-03)
  • Related to openQA Infrastructure (public) - action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 (Blocked, okurz, 2024-07-10)
  • Related to openQA Infrastructure (public) - action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S (Resolved, jbaier_cz, 2025-01-22)
  • Blocks openQA Infrastructure (public) - action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" (Resolved, okurz, 2025-01-16)
  • Copied to openQA Infrastructure (public) - action #175629: diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S (Resolved, nicksinger, 2025-01-16)

Actions #1

Updated by nicksinger 24 days ago

  • Related to action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S added
Actions #2

Updated by okurz 24 days ago

  • Tags changed from infra, bug, salt-states-openqa, salt, osd to infra, bug, salt-states-openqa, salt, osd, reactive work
Actions #3

Updated by robert.richardson 24 days ago

  • Subject changed from salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us to salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size: S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 24 days ago

  • Description updated (diff)
Actions #5

Updated by okurz 24 days ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #6

Updated by okurz 24 days ago

  • Subject changed from salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size: S to salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S
Actions #7

Updated by okurz 24 days ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by nicksinger 23 days ago

okurz wrote in #note-7:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1336

Merged. Interestingly, in your deployment job monitor is missing again. The associated salt_highstate.log artifact does mention it, however:

2025-01-14 16:41:51,723 [salt.client      :1167][DEBUG   ][12951] get_iter_returns for jid 20250114164151566706 sent to {'…-prg4.qa.suse.cz', 'grenache-1.oqa.prg2.suse.org', 'baremetal-support.qe.nue2.suse.org', 'monitor.qe.nue2.suse.org', 'storage.qe.prg2.suse.org', …} will timeout at 16:43:21.723622

In a subsequent MR deployment job, monitor shows up with the expected issue. However, for some reason OSD timed out. I looked up the job (with salt-run jobs.lookup_jid 20250114193508323824) and found:

Summary for openqa.suse.de
--------------
Succeeded: 629 (changed=117)
Failed:      0
--------------
Total states run:     629
Total run time:   172.592 s
[…]
Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed:      1
--------------
Total states run:     457
Total run time:   109.676 s
[…]
Summary for backup-qam.qe.nue2.suse.org
--------------
Succeeded: 286
Failed:      0
--------------
Total states run:     286
Total run time:   161.726 s

So either your change did not apply correctly, or we need to bump it even more because the full run includes some additional overhead.

For the initial issue of the missing salt state requirement, I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1338
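
The timeout being bumped here is presumably the salt client timeout used by the deployment job; for illustration only (the values below are made up), it can be raised per invocation or persistently on the master:

# Illustrative only: raise the client timeout for a single run
sudo salt -t 300 '*' state.apply queue=True

# or persistently in /etc/salt/master (the default is much lower):
#   timeout: 300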

Actions #9

Updated by nicksinger 22 days ago

I wanted to check what would be a good value to bump it to and looked into https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3666125

$ salt-run jobs.lookup_jid 20250116085031017432
Summary for petrol.qe.nue2.suse.org
--------------
Succeeded: 423 (changed=2)
Failed:      1
--------------
Total states run:     424
Total run time:  3219.013 s
[…]
Summary for diesel.qe.nue2.suse.org
--------------
Succeeded: 422 (changed=2)
Failed:      1
--------------
Total states run:     423
Total run time:  1795.992 s

Guess only raising timeouts won't cut it: petrol alone took 3219 s, i.e. roughly 54 minutes, for a single state apply.

Actions #10

Updated by okurz 22 days ago

  • Parent task set to #161414
Actions #11

Updated by okurz 22 days ago

  • Copied to action #175629: diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S added
Actions #12

Updated by okurz 22 days ago

I also observed that diesel+petrol are excessively slow. Reported #175629 for this specific issue.

Actions #13

Updated by okurz 21 days ago

  • Status changed from Feedback to Resolved

Now all salt minions are included in each run again, without ignoring timeouts. I triggered multiple deployment jobs which ended in stable, successful salt state CI pipelines, e.g.

Actions #14

Updated by okurz 21 days ago

  • Status changed from Resolved to Workable

The situation is not stable, see https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3677962. Various hosts can run into timeouts. We need to think harder about what to do, or work with even much bigger timeouts.

Actions #15

Updated by okurz 20 days ago

  • Status changed from Workable to Feedback
Actions #16

Updated by okurz 18 days ago

ybonatakis suggested looking into "gather_job_timeout" as well; however, judging from https://docs.saltproject.io/en/latest/ref/configuration/master.html#gather-job-timeout this does not sound related.

I tried

sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "G@roles:worker and G@osarch:ppc64le" state.apply queue=True | grep -v 'Result.*Clean'

but that failed. nicksinger is currently looking into syntax errors for #162296. Waiting for his progress before trying again.
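
For reference, count-fail-ratio is not part of salt itself; judging from its output elsewhere in this ticket, it repeatedly runs the given command and reports the number of runs, failures and mean runtime. A heavily simplified sketch of the idea (not the actual script):

#!/bin/bash
# Simplified sketch: run the given command $runs times and report the fail ratio.
runs="${runs:-30}"
fails=0
for i in $(seq "$runs"); do
  "$@" || fails=$((fails + 1))
done
echo "## Run: $runs. Fails: $fails. Fail ratio $((100 * fails / runs))%"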

Actions #17

Updated by okurz 18 days ago

merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1343

I will merge this now but will still test different sets of parameters after current syntax issues are fixed by @nicksinger

Actions #18

Updated by okurz 18 days ago

  • Related to action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" added
Actions #19

Updated by okurz 18 days ago

  • Related to deleted (action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response")
Actions #20

Updated by okurz 18 days ago

  • Blocks action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" added
Actions #21

Updated by okurz 18 days ago

  • Status changed from Feedback to In Progress

I called sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "G@roles:worker and G@osarch:ppc64le" state.apply queue=True | grep -v 'Result.*Clean' and got

…
petrol.qe.nue2.suse.org:
----------
…
Total states run:     428
Total run time:    59.859 s
diesel.qe.nue2.suse.org:
----------
…
Total run time:    66.725 s
grenache-1.oqa.prg2.suse.org:
----------
…
Total run time:    75.593 s
mania.qe.nue2.suse.org:

Summary for mania.qe.nue2.suse.org
--------------
…
Total run time:    75.174 s
## count-fail-ratio: Run: 30. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 10.00%
## mean runtime: 91588±5128.76 ms

So each machine takes 60-75 s, one full state apply run takes about 92 s overall, and there are no failures. Running again with all nodes: sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "*" state.apply queue=True | grep -v 'Result.*Clean'
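
As a side note, the reported "computed failure probability < 10.00%" is consistent with the rule of three for zero failures in n runs, an approximate 95% upper bound of 3/n (the later runs=300 attempt accordingly reports < 1.00%):

\[ \hat{p}_{\text{upper}} \approx \frac{3}{n} = \frac{3}{30} = 10\% , \qquad \frac{3}{300} = 1\% \]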

Actions #22

Updated by openqa_review 17 days ago

  • Due date set to 2025-02-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #23

Updated by okurz 17 days ago

So far many failures; current state:

## count-fail-ratio: Run: 12. Fails: 12. Fail ratio 100.00±0%
## mean runtime: 3498995±415554.03 ms

That is 3,499 s, which is nearly 1 h. Looking into individual failures.

Actions #24

Updated by okurz 17 days ago

  • Status changed from In Progress to Blocked

The problem is actually mostly #175629 which today hit sapworker1.

Actions #25

Updated by okurz 17 days ago

  • Status changed from Blocked to In Progress

The end of the European business day is approaching, so I am running another set of tests. First, sudo nice env runs=300 count-fail-ratio salt -C \* test.true yields

## count-fail-ratio: Run: 300. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 1.00%
## mean runtime: 1648±1226.41 ms

So that is stable and no problem there. Running the full state apply again.

Actions #26

Updated by okurz 16 days ago

  • Status changed from In Progress to Workable

Results from the overnight run:

## count-fail-ratio: Run: 300. Fails: 16. Fail ratio 5.33±2.54%
## mean runtime: 192065±41227.61 ms

Due to limited screen scrollback I did not record the individual failures. Still, with a fail ratio of about 5% I declare this "stable" enough, and AC2 is also covered. The rest is to be followed up, for example in #175629.
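
For what it is worth, the reported ±2.54% is consistent with a 95% binomial confidence interval around the observed fail ratio:

\[ 1.96 \sqrt{\frac{p(1-p)}{n}} = 1.96 \sqrt{\frac{0.0533 \cdot 0.9467}{300}} \approx 2.54\% \]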

I want to run another big run over the next night, but with a lower timeout, e.g. salt -t 180 …

Actions #27

Updated by okurz 16 days ago

  • Status changed from Workable to In Progress

Running overnight again, with -t 180.

Actions #28

Updated by okurz 15 days ago

  • Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Actions #29

Updated by okurz 15 days ago

  • Due date deleted (2025-02-04)
  • Status changed from In Progress to Resolved
## count-fail-ratio: Run: 262. Fails: 166. Fail ratio 63.35±5.83%
## mean runtime: 183087±7143.24 ms

So with the lower timeout this is very bad, but all machines are still reachable; this supports my earlier approach of increasing the timeout.

We again observed corruption, see #175710. I think that running salt state.apply in a loop might, one way or another, help to reproduce more easily whatever underlying sporadic issue we have.

Actions #30

Updated by jbaier_cz 3 days ago

  • Related to action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S added