action #134906

closed

osd-deployment failed due to openqaworker1 showing "No response" in salt size:M

Added by okurz 8 months ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-08-31
Due date: 2023-09-23
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1794346#L9197 shows

Minions returned with non-zero exit code
openqaworker1.qe.nue2.suse.org:
    Minion did not return. [No response]
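
Failing minions can be pulled out of such output mechanically. A minimal sketch, assuming the two-line layout shown above (an unindented minion ID ending in a colon, followed by indented result lines); the function name `unresponsive_minions` is hypothetical and not part of salt:

```shell
#!/bin/sh
# Print the IDs of minions whose result is "Minion did not return".
# Reads salt CLI output on stdin. Assumes each minion ID sits on an
# unindented line ending in ":" and its result on the indented lines below.
unresponsive_minions() {
    awk '
        /^[^[:space:]].*:$/ { host = substr($0, 1, length($0) - 1); next }
        /Minion did not return/ && host != "" { print host }
    '
}
```

It could be used as `salt \* test.ping | unresponsive_minions` on the salt master.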

Acceptance criteria

Suggestions

  • Research how to backport + package lock in salt recipes, e.g. start with https://docs.saltproject.io/en/latest/ref/modules/all/salt.modules.zypperpkg.html or ask experts in chat (but be careful not to be drawn into a "just install SUSE Manager" discussion)
  • Add instructions to salt to ensure the salt-minion package is backported and package locked
  • As an alternative, consider a separate repo that has the backported/fixed version and is applied to all salt-controlled machines (not devel:openQA, as this is a salt problem, not specific to openQA machines)

Related issues 4 (0 open, 4 closed)

Related to openQA Infrastructure - action #134132: Bare-metal control openQA worker in NUE2 size:M (Resolved, okurz)
Related to openQA Infrastructure - action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M (Resolved, okurz, 2023-06-22)
Related to openQA Infrastructure - action #135404: openqaworker-arm-2.suse.de minion not returning (Resolved, nicksinger, 2023-09-08, 2023-09-23)
Related to openQA Infrastructure - action #136325: salt deploy fails due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org (Resolved, okurz, 2023-09-22)

Actions #1

Updated by okurz 8 months ago

  • Related to action #134132: Bare-metal control openQA worker in NUE2 size:M added
Actions #2

Updated by dheidler 8 months ago

  • Status changed from New to In Progress
  • Assignee set to dheidler

Tried restarting salt-minion on openqaworker1, but that didn't help.
I had to kill one of the processes (with a plain kill, not -9), which helped.

Now seems to be working again:

openqa:~ # salt openqaworker1.qe.nue2.suse.org test.ping
openqaworker1.qe.nue2.suse.org:
    True
Actions #3

Updated by okurz 8 months ago

  • Related to action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M added
Actions #4

Updated by okurz 8 months ago

Likely the same issue as in #131249. Please apply the backport and package lock, and add both to the salt repos.

Actions #5

Updated by okurz 8 months ago

  • Subject changed from osd-deployment failed due to openqaworker1 showing "No response" in salt to osd-deployment failed due to openqaworker1 showing "No response" in salt size:M
  • Description updated (diff)
Actions #6

Updated by okurz 8 months ago

openqaworker1.qe.nue2.suse.org is not responsive again. Please expedite applying the backport+lock

Actions #7

Updated by dheidler 8 months ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by dheidler 8 months ago

  • Status changed from Feedback to Resolved

Looks good.

openqaworker1:~ # zypper ll

# | Name             | Type    | Repository | Comment
--+------------------+---------+------------+-----------
1 | qemu-ovmf-x86_64 | package | (any)      | poo#116914
2 | salt             | package | (any)      | poo#131249
openqaworker1:~ # zypper se -s --match-exact salt | grep ^i
il | salt | package    | 3004-150400.8.25.1   | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
Actions #9

Updated by okurz 8 months ago

  • Status changed from Resolved to Feedback

Wait. The problem was with salt-minion. If you only lock the package "salt", I assume salt-minion might still be pulled in with the wrong version, or the update services might fail while trying to resolve that conflict.

# salt \* cmd.run 'rpm -q salt-minion | grep 3006'
worker40.oqa.prg2.suse.org:
worker34.oqa.prg2.suse.org:
backup-qam.qe.nue2.suse.org:
worker29.oqa.prg2.suse.org:
worker35.oqa.prg2.suse.org:
worker37.oqa.prg2.suse.org:
worker38.oqa.prg2.suse.org:
worker39.oqa.prg2.suse.org:
worker31.oqa.prg2.suse.org:
worker36.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
worker32.oqa.prg2.suse.org:
worker30.oqa.prg2.suse.org:
worker-arm1.oqa.prg2.suse.org:
sapworker2.qe.nue2.suse.org:
worker-arm2.oqa.prg2.suse.org:
sapworker1.qe.nue2.suse.org:
sapworker3.qe.nue2.suse.org:
storage.oqa.suse.de:
openqaworker16.qa.suse.cz:
openqaworker18.qa.suse.cz:
openqaworker17.qa.suse.cz:
worker2.oqa.suse.de:
openqa.suse.de:
openqaworker1.qe.nue2.suse.org:
worker9.oqa.suse.de:
worker8.oqa.suse.de:
worker5.oqa.suse.de:
qesapworker-prg5.qa.suse.cz:
openqaworker14.qa.suse.cz:
qesapworker-prg7.qa.suse.cz:
openqaw5-xen.qa.suse.de:
qamasternue.qa.suse.de:
powerqaworker-qam-1.qa.suse.de:
worker3.oqa.suse.de:
jenkins.qa.suse.de:
QA-Power8-5-kvm.qa.suse.de:
qesapworker-prg6.qa.suse.cz:
worker13.oqa.suse.de:
QA-Power8-4-kvm.qa.suse.de:
qesapworker-prg4.qa.suse.cz:
backup.qa.suse.de:
tumblesle.qa.suse.de:
schort-server.qa.suse.de:
worker10.oqa.suse.de:
malbec.arch.suse.de:
openqa-piworker.qa.suse.de:
    salt-minion-3006.0-150500.4.12.2.aarch64
openqaworker-arm-2.suse.de:
openqaworker-arm-3.suse.de:

Shouldn't openqa-piworker also have 3004?
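
With this many minions, the one non-empty answer is easy to miss. A minimal sketch that keeps only the minions that returned something, assuming the same layout as the output above (unindented minion ID ending in a colon, indented result lines); `minions_with_output` is a hypothetical helper name:

```shell
#!/bin/sh
# Print "minion<TAB>result line" for every minion that returned a
# non-empty result. Reads salt CLI output on stdin.
minions_with_output() {
    awk '
        /^[^[:space:]].*:$/ { host = substr($0, 1, length($0) - 1); next }
        /^[[:space:]]+[^[:space:]]/ && host != "" {
            line = $0
            sub(/^[[:space:]]+/, "", line)   # drop the indentation
            printf "%s\t%s\n", host, line
        }
    '
}
```

For example: `salt \* cmd.run 'rpm -q salt-minion | grep 3006' | minions_with_output`.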

Actions #10

Updated by dheidler 8 months ago

  • Status changed from Feedback to Resolved

salt-minion requires the exact same version of salt.

openqaworker1:~ # zypper info --requires salt-minion
Loading repository data...
Reading installed packages...


Information for package salt-minion:
------------------------------------
Repository     : Update repository with updates from SUSE Linux Enterprise 15
Name           : salt-minion
Version        : 3006.0-150400.8.37.2
Arch           : x86_64
Vendor         : SUSE LLC <https://www.suse.com/>
Installed Size : 43.2 KiB
Installed      : Yes
Status         : out-of-date (version 3004-150400.8.25.1 installed)
Source package : salt-3006.0-150400.8.37.2.src
Upstream URL   : https://saltproject.io/
Summary        : The client component for Saltstack
Description    :
    Salt minion is queried and controlled from the master.
    Listens to the salt master and execute the commands.
Requires       : [9]
    /usr/bin/python3
    salt = 3006.0-150400.8.37.2
    (salt-transactional-update = 3006.0-150400.8.37.2 if read-only-root-fs)
    /bin/sh
    coreutils
    systemd
    grep
    diffutils
    fillup
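
Because of the exact `salt = <version>` requirement, the salt packages on a host have to share one version-release. A minimal sketch of such a consistency check, assuming input lines of the form "name version"; `same_version` is a hypothetical helper name:

```shell
#!/bin/sh
# Succeed only if every "name version" line on stdin carries the same
# version string. Fails on empty input or when versions differ.
same_version() {
    awk '
        NF >= 2 { seen[$2] = 1; n++ }
        END {
            count = 0
            for (v in seen) count++
            if (n == 0 || count != 1) exit 1
        }
    '
}
```

On a worker it could be fed from rpm, which reports the installed packages rather than the newest available ones: `rpm -q --qf '%{NAME} %{VERSION}-%{RELEASE}\n' salt salt-minion python3-salt | same_version`.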
Actions #11

Updated by dheidler 8 months ago

  • Status changed from Resolved to In Progress

Piworker seems to have some issues even with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/968

Actions #12

Updated by dheidler 8 months ago

  • Status changed from In Progress to Resolved

Seems to actually be fine on piworker. I just got confused because "zypper info" shows the newest available package, not the installed one (as rpm -qi does).

Actions #13

Updated by okurz 8 months ago

  • Status changed from Resolved to Feedback

You seem to have overlooked my comment. I checked on an arbitrary worker, worker39 in this case:

$ sudo journalctl -u auto-update

shows

Sep 01 02:14:37 worker39 sh[34892]: Loading repository data...
Sep 01 02:14:37 worker39 sh[34892]: Reading installed packages...
Sep 01 02:14:38 worker39 sh[34892]: Resolving package dependencies...
Sep 01 02:14:38 worker39 sh[34892]: 2 Problems:
Sep 01 02:14:38 worker39 sh[34892]: Problem: the to be installed patch:openSUSE-SLE-15.5-2023-3139-1.noarch conflicts with 'python3-salt.x86_64 < 3006.0-150500.4.12.2' provided by the installed python3-salt-3004-150400.8.25.1.x86_64
Sep 01 02:14:38 worker39 sh[34892]: Problem: the to be installed patch:openSUSE-SLE-15.5-2023-2582-1.noarch conflicts with 'salt.x86_64 < 3006.0-150500.4.9.2' provided by the installed salt-3004-150400.8.25.1.x86_64
Sep 01 02:14:38 worker39 sh[34892]: Problem: the to be installed patch:openSUSE-SLE-15.5-2023-3139-1.noarch conflicts with 'python3-salt.x86_64 < 3006.0-150500.4.12.2' provided by the installed python3-salt-3004-150400.8.25.1.x86_64
Sep 01 02:14:38 worker39 sh[34892]:  Solution 1: Following actions will be done:
Sep 01 02:14:38 worker39 sh[34892]:   remove lock to allow removal of python3-salt-3004-150400.8.25.1.x86_64
Sep 01 02:14:38 worker39 sh[34892]:   remove lock to allow removal of salt-minion-3004-150400.8.25.1.x86_64
Sep 01 02:14:38 worker39 sh[34892]:  Solution 2: do not install patch:openSUSE-SLE-15.5-2023-3139-1.noarch
Sep 01 02:14:38 worker39 sh[34892]: Choose from above solutions by number or skip, retry or cancel [1/2/s/r/c/d/?] (c): c
Sep 01 02:14:38 worker39 systemd[1]: auto-update.service: Deactivated successfully.

I stated multiple times: please follow the suggestions written down here and in the linked ticket, in particular the lock on multiple packages:

sudo zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion salt-bash-completion python3-salt

as stated in #131249-35
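
A minimal sketch of wrapping that lock into a reviewable helper. The package list and comment are taken from the command above; the function name `lock_salt_packages` and the `DRY_RUN` switch are assumptions made for illustration:

```shell
#!/bin/sh
# Packages that must stay on the same version and are locked together.
SALT_PKGS="salt salt-minion salt-bash-completion python3-salt"
LOCK_COMMENT="poo#131249 - potential salt regression, unresponsive salt-minion"

# Lock all packages with a single zypper call. With DRY_RUN=1 the command
# is only printed, so it can be reviewed before touching a host.
lock_salt_packages() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'zypper al --comment "%s" %s\n' "$LOCK_COMMENT" "$SALT_PKGS"
    else
        # $SALT_PKGS is intentionally unquoted so it splits into arguments.
        zypper al --comment "$LOCK_COMMENT" $SALT_PKGS
    fi
}
```

Running it with `DRY_RUN=1` first prints the exact command; `zypper ll` afterwards lists the resulting locks.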

Actions #14

Updated by okurz 8 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (dheidler)
Actions #15

Updated by nicksinger 8 months ago

  • Assignee set to nicksinger
Actions #16

Updated by nicksinger 8 months ago

I ran the following loop (I was not sure whether it works as one big command):

for i in salt salt-minion salt-bash-completion python3-salt; do zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" $i; done

but I still face the same problem according to the auto-update log:

Sep 08 10:15:15 openqaworker1 systemd[1]: Started Automatically patch system packages..
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'SUSE_CA' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'devel_openQA' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'devel:openQA:Leap:15.4 (openSUSE_Leap_15.4)' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'devel_openQA_Modules' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'openSUSE-Leap-15.4-1' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'Update repository of openSUSE Backports' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'openSUSE-Leap-15.4-Non-Oss' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'Update repository with updates from SUSE Linux Enterprise 15' is up to date.
Sep 08 10:15:16 openqaworker1 sh[23373]: Repository 'openSUSE-Leap-15.4-Update' is up to date.
Sep 08 10:15:16 openqaworker1 sh[23373]: Repository 'openSUSE-Leap-15.4-Update-Non-Oss' is up to date.
Sep 08 10:15:16 openqaworker1 sh[23373]: Repository 'Server Monitoring Software' is up to date.
Sep 08 10:15:16 openqaworker1 sh[23373]: All repositories have been refreshed.
Sep 08 10:15:16 openqaworker1 sh[23408]: Loading repository data...
Sep 08 10:15:17 openqaworker1 sh[23408]: Reading installed packages...
Sep 08 10:15:19 openqaworker1 sh[23408]: Resolving package dependencies...
Sep 08 10:15:19 openqaworker1 sh[23408]: 2 Problems:
Sep 08 10:15:19 openqaworker1 sh[23408]: Problem: the to be installed patch:openSUSE-SLE-15.4-2023-3145-1.noarch conflicts with 'python3-salt.x86_64 < 3006.0-150400.8.37.2' provided by the installed python3-salt-3004-150400.8.25.1.x86_64
Sep 08 10:15:19 openqaworker1 sh[23408]: Problem: the to be installed patch:openSUSE-SLE-15.4-2023-2571-1.noarch conflicts with 'salt.x86_64 < 3006.0-150400.8.34.2' provided by the installed salt-3004-150400.8.25.1.x86_64
Sep 08 10:15:19 openqaworker1 sh[23408]: Problem: the to be installed patch:openSUSE-SLE-15.4-2023-3145-1.noarch conflicts with 'python3-salt.x86_64 < 3006.0-150400.8.37.2' provided by the installed python3-salt-3004-150400.8.25.1.x86_64
Sep 08 10:15:19 openqaworker1 sh[23408]:  Solution 1: Following actions will be done:
Sep 08 10:15:19 openqaworker1 sh[23408]:   remove lock to allow removal of python3-salt-3004-150400.8.25.1.x86_64
Sep 08 10:15:19 openqaworker1 sh[23408]:   remove lock to allow removal of salt-minion-3004-150400.8.25.1.x86_64
Sep 08 10:15:19 openqaworker1 sh[23408]:  Solution 2: do not install patch:openSUSE-SLE-15.4-2023-3145-1.noarch
Sep 08 10:15:19 openqaworker1 sh[23408]: Choose from above solutions by number or skip, retry or cancel [1/2/s/r/c/d/?] (c): c
Sep 08 10:15:19 openqaworker1 systemd[1]: auto-update.service: Deactivated successfully.
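
The patches that collide with the locks can be read off such a log. A minimal sketch, assuming the "the to be installed patch:... conflicts with" wording shown above; `conflicting_patches` is a hypothetical helper name:

```shell
#!/bin/sh
# Print each patch identifier named in a
# "the to be installed patch:<id>.<arch> conflicts with" line once,
# without the trailing architecture suffix (e.g. ".noarch").
conflicting_patches() {
    sed -n 's/.*the to be installed patch:\([^ ]*\)\.[^. ]* conflicts with.*/\1/p' | sort -u
}
```

For example: `journalctl -u auto-update | conflicting_patches`. The identifiers are printed as they appear in the log, including the trailing patch version.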
Actions #17

Updated by nicksinger 8 months ago

  • Status changed from Workable to In Progress

I asked in #discuss-zypp (https://suse.slack.com/archives/C02CL8FJ8UF/p1694162093543369) and it seems there is currently no way to tell zypper to ignore patches that cannot be applied because of locks. One suggested workaround was to lock the patches that would affect the locked package (e.g. salt); these can be found with zypper se --conflicts-pkg salt | grep "^!". After adding several more locks (also for ovmf), I now run into bigger problems:

openqaworker1:~ # zypper -n --no-refresh --non-interactive-include-reboot-patches patch --replacefiles --auto-agree-with-licenses --download-in-advance
Loading repository data...
Reading installed packages...
Patch 'openSUSE-SLE-15.4-2023-3145-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2023-3145' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2023-3145'.
Patch 'openSUSE-SLE-15.4-2023-2571-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2023-2571' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2023-2571'.
Patch 'openSUSE-SLE-15.4-2023-3145-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2023-3145' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2023-3145'.
Patch 'openSUSE-SLE-15.4-2023-2571-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2023-2571' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2023-2571'.
Patch 'openSUSE-SLE-15.4-2023-2234-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2023-2234' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2023-2234'.
Patch 'openSUSE-SLE-15.4-2022-3811-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2022-3811' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2022-3811'.
Resolving package dependencies...

Problem: the installed WebKit2GTK-4.1-lang-2.38.6-150400.4.42.4.noarch requires 'WebKit2GTK-4.1 = 2.38.6', but this requirement cannot be provided
  not installable providers: libwebkit2gtk-4_1-0-2.36.0-150400.2.13.x86_64[openSUSE-Leap-$releasever-1]
                   libwebkit2gtk-4_1-0-2.36.3-150400.4.3.1.x86_64[repo-sle-update]
                   libwebkit2gtk-4_1-0-2.36.4-150400.4.6.2.x86_64[repo-sle-update]
                   libwebkit2gtk-4_1-0-2.36.5-150400.4.9.1.x86_64[repo-sle-update]
                   libwebkit2gtk-4_1-0-2.36.7-150400.4.12.1.x86_64[repo-sle-update]
                   libwebkit2gtk-4_1-0-2.36.8-150400.4.15.1.x86_64[repo-sle-update]
                   libwebkit2gtk-4_1-0-2.38.2-150400.4.22.1.x86_64[repo-sle-update]
                   libwebkit2gtk-4_1-0-2.38.3-150400.4.25.1.x86_64[repo-sle-update]
                   libwebkit2gtk-4_1-0-2.38.5-150400.4.34.2.x86_64[repo-sle-update]
                   libwebkit2gtk-4_1-0-2.38.6-150400.4.39.1.x86_64[repo-sle-update]
 Solution 1: Following actions will be done:
  deinstallation of WebKit2GTK-4.1-lang-2.38.6-150400.4.42.4.noarch
  deinstallation of WebKit2GTK-4.0-lang-2.38.6-150400.4.42.4.noarch
 Solution 2: do not install patch:openSUSE-SLE-15.4-2023-3419-1.noarch
 Solution 3: break WebKit2GTK-4.1-lang-2.38.6-150400.4.42.4.noarch by ignoring some of its dependencies

Choose from above solutions by number or cancel [1/2/3/c/d/?] (c): c

So I'm currently not sure how to fix this.

Actions #18

Updated by okurz 8 months ago

I think the WebKit thingy is actually a real big issue in general Leap maintenance and the update is blocked by openQA tests, see https://suse.slack.com/archives/C02CLB8TZP1/p1694159637322819?thread_ts=1694159637.322819&cid=C02CLB8TZP1

Actions #19

Updated by nicksinger 8 months ago

ah, good you mention it. This means I can implement a workaround in salt to apply patch-locks as well
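
A minimal sketch of what such a patch-lock workaround could look like outside of salt. It assumes that `zypper se --conflicts-pkg` prints a "|"-separated table with the status in the first column and the patch name in the second; the helper names and the lock comment text are hypothetical:

```shell
#!/bin/sh
# Print the Name column of every row whose status column starts with "!".
# Reads a zypper search table on stdin.
needed_patch_names() {
    awk -F'|' '
        /^!/ {
            name = $2
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", name)   # trim padding
            if (name != "") print name
        }
    '
}

# Lock every patch that conflicts with the given package (default: salt).
lock_conflicting_patches() {
    zypper se --conflicts-pkg "${1:-salt}" | needed_patch_names |
    while IFS= read -r patch; do
        # The comment text is an assumption, not taken from the ticket.
        zypper al -t patch -m "poo#131249 - conflicts with locked salt packages" "$patch"
    done
}
```

Only `lock_conflicting_patches` touches the system; `needed_patch_names` is a pure text filter and can be tried on saved output first.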

Actions #20

Updated by openqa_review 8 months ago

  • Due date set to 2023-09-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #21

Updated by okurz 8 months ago

  • Related to action #135404: openqaworker-arm-2.suse.de minion not returning added
Actions #22

Updated by nicksinger 8 months ago

  • Priority changed from Urgent to Normal

Reducing priority as the issue described above (failing pipelines) is resolved but auto-update needs some further refinement to not break

Actions #25

Updated by nicksinger 8 months ago

  • Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/984 has been finalized and is waiting for review and merging

Actions #26

Updated by nicksinger 8 months ago

My changes worked but the previous logic has an issue if a lock is already present which can be seen here: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1828601#L99 (sapworker1). We now have two options:

  1. Remove the check for the ticket reference in the list of locks. This would keep old locks, which are most likely in place for a reason. However, it means we never get rid of old locks that might no longer be needed and that were added without a proper ticket reference, which would make it easy to check whether they are still required
  2. Add a zypper rl {{ pkg_name }} before executing the install command. This ensures that the latest known-good version gets installed and is then locked again with a proper bug reference
Actions #27

Updated by nicksinger 8 months ago

As discussed we go with suggestion 2. Since gitlab is currently down I will paste my patch here for the time being:

diff --git a/salt/minion.sls b/salt/minion.sls
index 497f9c9..075ea3b 100644
--- a/salt/minion.sls
+++ b/salt/minion.sls
@@ -33,8 +33,8 @@ speedup_minion:
 {% for pkg_name in ['salt', 'salt-bash-completion', 'salt-minion'] %}
 lock_{{ pkg_name }}_pkg:
   cmd.run:
-    - unless: 'zypper ll | grep -qE "{{ pkg_name }}.*\| poo#131249"'
-    - name: "(zypper -n in --oldpackage --allow-downgrade '{{ pkg_name }}<=3005' || zypper -n in --oldpackage --allow-downgrade '{{ pkg_name }}<=3005.1') && zypper al -m 'poo#131249 - potential salt regression, unresponsive salt-minion' {{ pkg_name }}"
+    - unless: 'zypper ll | grep -qE "{{ pkg_name }}.*\| poo#[0-9]{6,}\s"'
+    - name: "zypper rl {{ pkg_name }}; (zypper -n in --oldpackage --allow-downgrade '{{ pkg_name }}<=3005' || zypper -n in --oldpackage --allow-downgrade '{{ pkg_name }}<=3005.1') && zypper al -m 'poo#131249 - potential salt regression, unresponsive salt-minion' {{ pkg_name }}"

 {% set unlocked_conflicting_patches = salt['cmd.shell']('zypper se --conflicts-pkg salt-minion | grep -P \'^!\s.*?\|\' | cut -d "|" -f 2 | awk \'{$1=$1;print}\'').split("\n") %}
 {% if unlocked_conflicting_patches[0] != "" %}

It expands the "existing lock check" to allow every valid bug reference. If none is present in the description of an existing lock we assume it is invalid and replace it with a proper one pointing to this bug here.
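
The lock check can also be tried outside of salt. Note that a pattern of the form `{{ pkg_name }}.*` matches by substring, so a check for "salt" would also be satisfied by a "salt-minion" row. A minimal sketch that compares the package name exactly, assuming the `zypper ll` column layout shown earlier in this ticket (number, name, type, repository, comment); `has_ticket_lock` is a hypothetical helper name:

```shell
#!/bin/sh
# Succeed if the lock table on stdin contains a lock for exactly package $1
# whose comment carries a ticket reference of the form poo#<6 or more digits>.
has_ticket_lock() {
    awk -F'|' -v pkg="$1" '
        {
            name = $2; comment = $5
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", name)   # trim padding
            if (name == pkg && comment ~ /poo#[0-9][0-9][0-9][0-9][0-9][0-9]/)
                found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

For example: `zypper ll | has_ticket_lock salt-minion || echo "no valid lock"`.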

Actions #29

Updated by nicksinger 8 months ago

  • Status changed from Feedback to In Progress

Pipeline failed again because salt-bash-completion is not installed on some workers. Zypper then tries to install it with a lower version (as per salt recipe) but this is blocked because it would pull in a higher version of salt-minion, salt and python3-salt which are all locked. Working on a possible solution now
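
The ticket does not record which fix was chosen. One conceivable approach, sketched here purely as an assumption, is to restrict the downgrade and lock to packages that are actually installed; `installed_only` is a hypothetical helper that takes the probe command as its first argument:

```shell
#!/bin/sh
# Print only those package names for which the probe command succeeds.
# Usage: installed_only PROBE pkg...   (e.g. PROBE = "rpm -q")
installed_only() {
    probe=$1
    shift
    for pkg in "$@"; do
        # $probe is intentionally unquoted so "rpm -q" splits into words.
        if $probe "$pkg" >/dev/null 2>&1; then
            printf '%s\n' "$pkg"
        fi
    done
}
```

For example: `installed_only "rpm -q" salt salt-minion salt-bash-completion python3-salt` would skip salt-bash-completion on workers where it is missing, since `rpm -q` exits non-zero for packages that are not installed.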

Actions #30

Updated by okurz 7 months ago

@nicksinger based on #131249-43 my suspicion is that salt-minion-3005 is also affected by https://bugzilla.opensuse.org/show_bug.cgi?id=1212816 so I suggest we install http://download.opensuse.org/update/leap/15.4/sle/x86_64/salt-3004-150400.8.25.1.x86_64.rpm on Leap 15.5 machines as well. Can you do that?

Actions #31

Updated by okurz 7 months ago

  • Priority changed from Normal to High

Also recently multiple people were hit by the problem, e.g. in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/612#note_539012 , raising prio accordingly.

Actions #32

Updated by nicksinger 7 months ago

  • Status changed from In Progress to Resolved

No. Pipelines that ran after my manual fixes on Friday (around 2pm) have succeeded since then: https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/pipelines. The failure mentioned in Julie's MR happened before my manual cleanup of the locks and doesn't show any symptom of a hanging minion.

Actions #33

Updated by okurz 7 months ago

  • Related to action #136325: salt deploy fails due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org added