action #175407

coordination #161414: [epic] Improved salt based infrastructure management

salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S

Added by nicksinger 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

While looking into https://progress.opensuse.org/issues/174985, @nicksinger ran salt 'monitor.qe.nue2.suse.org' state.apply, which timed out. The corresponding salt-run jobs.lookup_jid 20250114115421987682 showed:

[…]
----------
          ID: dehydrated.timer
    Function: service.running
      Result: True
     Comment: The service dehydrated.timer is already running
     Started: 12:56:56.183595
    Duration: 126.445 ms
     Changes:
----------
          ID: systemctl start dehydrated
    Function: cmd.run
      Result: False
     Comment: The following requisites were not found:
                                 require:
                                     id: webserver_config
     Started: 12:56:56.324904
    Duration: 0.008 ms
     Changes:

Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed:      1
--------------
Total states run:     457
Total run time:   122.178 s

Apparently this broke with https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/6cd458a57044b50f60df936ac04b1033534c7a9d#e83a0785fb156ea339320f4bf1d083717c84a2ba_11_13 but we never realized it. Are we truncating our salt logs too much?
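
For illustration, a minimal hypothetical SLS sketch of how such a failure comes about (the state IDs mirror the output above, but file names and paths are invented and not the actual salt-states-openqa content): a require entry only resolves against a state ID that still exists in the compiled highstate, so once the target ID is renamed or removed, every state referencing it fails exactly like this.

webserver_config:                             # if this state ID is renamed or removed ...
  file.managed:
    - name: /etc/dehydrated/config            # illustrative path only
    - source: salt://monitor/dehydrated.conf  # illustrative source only

systemctl start dehydrated:
  cmd.run:                                    # with no explicit name, cmd.run executes the state ID as the command
    - require:
      - webserver_config                      # ... this reference (matched by state ID) can no longer be resolved
                                              # and the state fails with "The following requisites were not found"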

Acceptance criteria

  • AC1: We have a stable and clean salt-states-openqa pipeline again
  • AC2: A pipeline only succeeds if all currently salt controlled hosts responded

Suggestions


Related issues: 6 (0 open, 6 closed)

Related to openQA Infrastructure (public) - action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S (Rejected, nicksinger, 2025-01-03)

Related to openQA Infrastructure (public) - action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 (Resolved, okurz, 2024-07-10)

Related to openQA Infrastructure (public) - action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S (Resolved, jbaier_cz, 2025-01-22)

Blocks openQA Infrastructure (public) - action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" (Resolved, okurz, 2025-01-16)

Copied to openQA Infrastructure (public) - action #175629: diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S (Resolved, nicksinger, 2025-01-16)

Copied to openQA Infrastructure (public) - action #177366: osd deployment "test.ping" check runs into gitlab CI timeout (Resolved, okurz)

Actions #1

Updated by nicksinger 4 months ago

  • Related to action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S added
Actions #2

Updated by okurz 4 months ago

  • Tags changed from infra, bug, salt-states-openqa, salt, osd to infra, bug, salt-states-openqa, salt, osd, reactive work
Actions #3

Updated by robert.richardson 4 months ago

  • Subject changed from salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us to salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size: S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 4 months ago

  • Description updated (diff)
Actions #5

Updated by okurz 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #6

Updated by okurz 3 months ago

  • Subject changed from salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size: S to salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S
Actions #7

Updated by okurz 3 months ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by nicksinger 3 months ago

okurz wrote in #note-7:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1336

Merged. Interestingly, in your deployment job, monitor is missing again. However, the associated salt_highstate.log artifact somehow does mention it:

2025-01-14 16:41:51,723 [salt.client      :1167][DEBUG   ][12951] get_iter_returns for jid 20250114164151566706 sent to {'…-prg4.qa.suse.cz', 'grenache-1.oqa.prg2.suse.org', 'baremetal-support.qe.nue2.suse.org', 'monitor.qe.nue2.suse.org', 'storage.qe.prg2.suse.org', …} will timeout at 16:43:21.723622

In a subsequent MR deployment job, monitor shows up with the expected issue. However, for some reason OSD timed out. I looked up the job (with salt-run jobs.lookup_jid 20250114193508323824) and found:

Summary for openqa.suse.de
--------------
Succeeded: 629 (changed=117)
Failed:      0
--------------
Total states run:     629
Total run time:   172.592 s
[…]
Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed:      1
--------------
Total states run:     457
Total run time:   109.676 s
[…]
Summary for backup-qam.qe.nue2.suse.org
--------------
Succeeded: 286
Failed:      0
--------------
Total states run:     286
Total run time:   161.726 s

So either your change did not apply correctly, or we need to bump the timeout even more because the full run includes some additional overhead.

For the initial issue of the missing salt state requirement, I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1338

Actions #9

Updated by nicksinger 3 months ago

I wanted to check what would be a good value to bump it to and looked into https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3666125

$ salt-run jobs.lookup_jid 20250116085031017432
Summary for petrol.qe.nue2.suse.org
--------------
Succeeded: 423 (changed=2)
Failed:      1
--------------
Total states run:     424
Total run time:  3219.013 s
[…]
Summary for diesel.qe.nue2.suse.org
--------------
Succeeded: 422 (changed=2)
Failed:      1
--------------
Total states run:     423
Total run time:  1795.992 s

I guess only raising timeouts won't cut it.

Actions #10

Updated by okurz 3 months ago

  • Parent task set to #161414
Actions #11

Updated by okurz 3 months ago

  • Copied to action #175629: diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S added
Actions #12

Updated by okurz 3 months ago

I also observed that diesel+petrol are excessively slow. Reported #175629 for this specific issue.

Actions #13

Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved

Now all salt minions are again included in each run, with no ignoring of timeouts. I triggered multiple deployment jobs which ended in stable, successful salt state CI pipelines, e.g.

Actions #14

Updated by okurz 3 months ago

  • Status changed from Resolved to Workable

The situation is not stable, see https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3677962. Various hosts can run into timeouts. We need to think harder about what to do, or work with even bigger timeouts.

Actions #15

Updated by okurz 3 months ago

  • Status changed from Workable to Feedback
Actions #16

Updated by okurz 3 months ago

ybonatakis suggested looking into "gather_job_timeout" as well; however, based on https://docs.saltproject.io/en/latest/ref/configuration/master.html#gather-job-timeout this does not sound related.
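
For reference, a hedged sketch of the two master options involved (values are illustrative, not necessarily OSD's actual /etc/salt/master settings): gather_job_timeout only controls the short "find_job" polls the master uses to check whether minions are still busy, while the overall wait for returns is governed by timeout or the -t flag on the salt CLI.

# /etc/salt/master -- illustrative values only
timeout: 60             # seconds the CLI waits for minion returns; overridable per call with "salt -t <seconds>"
gather_job_timeout: 10  # seconds to wait for the periodic "find_job" queries that check if minions are still working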

I tried

sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "G@roles:worker and G@osarch:ppc64le" state.apply queue=True | grep -v 'Result.*Clean'

but that failed. nicksinger is currently looking into syntax errors for #162296; waiting for his progress before trying again.

Actions #17

Updated by okurz 3 months ago

Merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1343

I will merge this now but will still test different sets of parameters after current syntax issues are fixed by @nicksinger

Actions #18

Updated by okurz 3 months ago

  • Related to action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" added
Actions #19

Updated by okurz 3 months ago

  • Related to deleted (action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response")
Actions #20

Updated by okurz 3 months ago

  • Blocks action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" added
Actions #21

Updated by okurz 3 months ago

  • Status changed from Feedback to In Progress

I called sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "G@roles:worker and G@osarch:ppc64le" state.apply queue=True | grep -v 'Result.*Clean' and got

…
petrol.qe.nue2.suse.org:
----------
…
Total states run:     428
Total run time:    59.859 s
diesel.qe.nue2.suse.org:
----------
…
Total run time:    66.725 s
grenache-1.oqa.prg2.suse.org:
----------
…
Total run time:    75.593 s
mania.qe.nue2.suse.org:

Summary for mania.qe.nue2.suse.org
--------------
…
Total run time:    75.174 s
## count-fail-ratio: Run: 30. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 10.00%
## mean runtime: 91588±5128.76 ms

So each machine takes 60-75 s, one overall state.apply run takes about 92 s, and there are no failures. Running again with all nodes: sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "*" state.apply queue=True | grep -v 'Result.*Clean'

Actions #22

Updated by openqa_review 3 months ago

  • Due date set to 2025-02-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #23

Updated by okurz 3 months ago

So far many failures; the current status:

## count-fail-ratio: Run: 12. Fails: 12. Fail ratio 100.00±0%
## mean runtime: 3498995±415554.03 ms

That is 3,499 s, which is nearly 1 h. Looking into individual failures.

Actions #24

Updated by okurz 3 months ago

  • Status changed from In Progress to Blocked

The problem is actually mostly #175629, which today hit sapworker1.

Actions #25

Updated by okurz 3 months ago

  • Status changed from Blocked to In Progress

The end of the European business day is approaching, so I am running another set of tests. First, sudo nice env runs=300 count-fail-ratio salt -C \* test.true yields:

## count-fail-ratio: Run: 300. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 1.00%
## mean runtime: 1648±1226.41 ms

So stable and no problem there. Running the full state again.

Actions #26

Updated by okurz 3 months ago

  • Status changed from In Progress to Workable

Results from the overnight run:

## count-fail-ratio: Run: 300. Fails: 16. Fail ratio 5.33±2.54%
## mean runtime: 192065±41227.61 ms

Due to limited screen scrollback I did not record the individual failures. Still, with a fail ratio of 5% I declare this as "stable" enough. AC2 is also covered. The rest is to be followed up, for example in #175629.

I want to do another big run over the next night but with a lower timeout, e.g. salt -t 180 …

Actions #27

Updated by okurz 3 months ago

  • Status changed from Workable to In Progress

Running overnight again, with -t 180.

Actions #28

Updated by okurz 3 months ago

  • Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Actions #29

Updated by okurz 3 months ago

  • Due date deleted (2025-02-04)
  • Status changed from In Progress to Resolved

## count-fail-ratio: Run: 262. Fails: 166. Fail ratio 63.35±5.83%
## mean runtime: 183087±7143.24 ms

So, very bad. But all machines are still reachable, so I think my approach of increasing the timeout was good.

We again observed corruption, see #175710. I think that, one way or another, me running salt state.apply in a loop might help to more easily reproduce whatever underlying sporadic issue we might have.

Actions #30

Updated by jbaier_cz 3 months ago

  • Related to action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S added
Actions #31

Updated by okurz 2 months ago

  • Copied to action #177366: osd deployment "test.ping" check runs into gitlab CI timeout added