action #134906
closed: osd-deployment failed due to openqaworker1 showing "No response" in salt size:M
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1794346#L9197 shows
Minions returned with non-zero exit code
openqaworker1.qe.nue2.suse.org:
Minion did not return. [No response]
Acceptance criteria
- AC1: All OSD salt-controlled machines are ensured not to be affected by the unresponsive salt-minion bug https://bugzilla.opensuse.org/show_bug.cgi?id=1212816, i.e. the salt-minion backport + package lock is applied to all salt-controlled machines
Suggestions
- Research how to backport + package lock in salt recipes, e.g. start with https://docs.saltproject.io/en/latest/ref/modules/all/salt.modules.zypperpkg.html or ask experts in chat (but be careful not to be drawn into a "just install SUSE Manager" discussion)
- Add instructions to salt to ensure the salt-minion package is backported and package locked (see the sketch after this list)
- As an alternative, consider a separate repo that has the backported/fixed version and is applied to all salt-controlled machines (not devel:openQA, as this is a salt problem, not specific to openQA machines)
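A minimal sketch of the manual equivalent of the backport + lock (a salt recipe would wrap the same commands); the package set, version constraint and lock comment are assumptions based on the linked salt-minion bug, not a definitive implementation:
for pkg in salt salt-minion salt-bash-completion python3-salt; do
  # downgrade to the last known-good pre-3006 release from the update repository ...
  zypper -n in --oldpackage --allow-downgrade "$pkg<=3005"
  # ... and lock the package so the next auto-update does not pull the broken version back in
  zypper al --comment "bsc#1212816 - unresponsive salt-minion, keep backported version" "$pkg"
done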
Updated by okurz over 1 year ago
- Related to action #134132: Bare-metal control openQA worker in NUE2 size:M added
Updated by dheidler over 1 year ago
- Status changed from New to In Progress
- Assignee set to dheidler
Tried to restart salt-minion on openqaworker1, but didn't help.
Had to kill (not -9) one of the processes, which helped.
Now seems to be working again:
openqa:~ # salt openqaworker1.qe.nue2.suse.org test.ping
openqaworker1.qe.nue2.suse.org:
True
Updated by okurz over 1 year ago
- Related to action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M added
Updated by okurz over 1 year ago
Likely the same issue as in #131249. Please apply the backport and package lock, and add both to the salt repos.
Updated by okurz over 1 year ago
- Subject changed from osd-deployment failed due to openqaworker1 showing "No response" in salt to osd-deployment failed due to openqaworker1 showing "No response" in salt size:M
- Description updated (diff)
Updated by okurz over 1 year ago
openqaworker1.qe.nue2.suse.org is not responsive again. Please expedite applying the backport+lock
Updated by dheidler over 1 year ago
- Status changed from In Progress to Feedback
Updated by dheidler over 1 year ago
- Status changed from Feedback to Resolved
Looks good.
openqaworker1:~ # zypper ll
# | Name | Type | Repository | Comment
--+------------------+---------+------------+-----------
1 | qemu-ovmf-x86_64 | package | (any) | poo#116914
2 | salt | package | (any) | poo#131249
openqaworker1:~ # zypper se -s --match-exact salt | grep ^i
il | salt | package | 3004-150400.8.25.1 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
Updated by okurz over 1 year ago
- Status changed from Resolved to Feedback
Wait. The problem was with salt-minion. If you just apply a lock for the package "salt", then I assume salt-minion might still be pulled in with the wrong version … or the update service might fail trying to resolve that conflict.
# salt \* cmd.run 'rpm -q salt-minion | grep 3006'
worker40.oqa.prg2.suse.org:
worker34.oqa.prg2.suse.org:
backup-qam.qe.nue2.suse.org:
worker29.oqa.prg2.suse.org:
worker35.oqa.prg2.suse.org:
worker37.oqa.prg2.suse.org:
worker38.oqa.prg2.suse.org:
worker39.oqa.prg2.suse.org:
worker31.oqa.prg2.suse.org:
worker36.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
worker32.oqa.prg2.suse.org:
worker30.oqa.prg2.suse.org:
worker-arm1.oqa.prg2.suse.org:
sapworker2.qe.nue2.suse.org:
worker-arm2.oqa.prg2.suse.org:
sapworker1.qe.nue2.suse.org:
sapworker3.qe.nue2.suse.org:
storage.oqa.suse.de:
openqaworker16.qa.suse.cz:
openqaworker18.qa.suse.cz:
openqaworker17.qa.suse.cz:
worker2.oqa.suse.de:
openqa.suse.de:
openqaworker1.qe.nue2.suse.org:
worker9.oqa.suse.de:
worker8.oqa.suse.de:
worker5.oqa.suse.de:
qesapworker-prg5.qa.suse.cz:
openqaworker14.qa.suse.cz:
qesapworker-prg7.qa.suse.cz:
openqaw5-xen.qa.suse.de:
qamasternue.qa.suse.de:
powerqaworker-qam-1.qa.suse.de:
worker3.oqa.suse.de:
jenkins.qa.suse.de:
QA-Power8-5-kvm.qa.suse.de:
qesapworker-prg6.qa.suse.cz:
worker13.oqa.suse.de:
QA-Power8-4-kvm.qa.suse.de:
qesapworker-prg4.qa.suse.cz:
backup.qa.suse.de:
tumblesle.qa.suse.de:
schort-server.qa.suse.de:
worker10.oqa.suse.de:
malbec.arch.suse.de:
openqa-piworker.qa.suse.de:
salt-minion-3006.0-150500.4.12.2.aarch64
openqaworker-arm-2.suse.de:
openqaworker-arm-3.suse.de:
shouldn't openqa-piworker also have 3004?
Updated by dheidler over 1 year ago
- Status changed from Feedback to Resolved
The salt-minion package requires the exact matching salt version.
openqaworker1:~ # zypper info --requires salt-minion
Loading repository data...
Reading installed packages...
Information for package salt-minion:
------------------------------------
Repository     : Update repository with updates from SUSE Linux Enterprise 15
Name           : salt-minion
Version        : 3006.0-150400.8.37.2
Arch           : x86_64
Vendor         : SUSE LLC <https://www.suse.com/>
Installed Size : 43.2 KiB
Installed      : Yes
Status         : out-of-date (version 3004-150400.8.25.1 installed)
Source package : salt-3006.0-150400.8.37.2.src
Upstream URL   : https://saltproject.io/
Summary        : The client component for Saltstack
Description    :
Salt minion is queried and controlled from the master.
Listens to the salt master and execute the commands.
Requires       : [9]
/usr/bin/python3
salt = 3006.0-150400.8.37.2
(salt-transactional-update = 3006.0-150400.8.37.2 if read-only-root-fs)
/bin/sh
coreutils
systemd
grep
diffutils
fillup
Updated by dheidler over 1 year ago
- Status changed from Resolved to In Progress
Piworker seems to have some issues even with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/968
Updated by dheidler over 1 year ago
- Status changed from In Progress to Resolved
Seems to actually be fine on piworker - I just got confused because "zypper info" doesn't show the installed package (like rpm -qi does) but the newest available one.
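For reference, a quick way to see both versions side by side (the installed one and the update candidate zypper advertises); output omitted here:
rpm -q salt-minion                                   # the actually installed package, e.g. the locked 3004 build
zypper info salt-minion | grep -E 'Version|Status'   # the newest available candidate plus the "out-of-date" note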
Updated by okurz over 1 year ago
- Status changed from Resolved to Feedback
You seem to have overlooked my comment. I checked on an arbitrary worker, worker39 in this case:
$ sudo journalctl -u auto-update
shows
Sep 01 02:14:37 worker39 sh[34892]: Loading repository data...
Sep 01 02:14:37 worker39 sh[34892]: Reading installed packages...
Sep 01 02:14:38 worker39 sh[34892]: Resolving package dependencies...
Sep 01 02:14:38 worker39 sh[34892]: 2 Problems:
Sep 01 02:14:38 worker39 sh[34892]: Problem: the to be installed patch:openSUSE-SLE-15.5-2023-3139-1.noarch conflicts with 'python3-salt.x86_64 < 3006.0-150500.4.12.2' provided by the installed python3-salt-3004-150400.8.25.1.x86_64
Sep 01 02:14:38 worker39 sh[34892]: Problem: the to be installed patch:openSUSE-SLE-15.5-2023-2582-1.noarch conflicts with 'salt.x86_64 < 3006.0-150500.4.9.2' provided by the installed salt-3004-150400.8.25.1.x86_64
Sep 01 02:14:38 worker39 sh[34892]: Problem: the to be installed patch:openSUSE-SLE-15.5-2023-3139-1.noarch conflicts with 'python3-salt.x86_64 < 3006.0-150500.4.12.2' provided by the installed python3-salt-3004-150400.8.25.1.x86_64
Sep 01 02:14:38 worker39 sh[34892]: Solution 1: Following actions will be done:
Sep 01 02:14:38 worker39 sh[34892]: remove lock to allow removal of python3-salt-3004-150400.8.25.1.x86_64
Sep 01 02:14:38 worker39 sh[34892]: remove lock to allow removal of salt-minion-3004-150400.8.25.1.x86_64
Sep 01 02:14:38 worker39 sh[34892]: Solution 2: do not install patch:openSUSE-SLE-15.5-2023-3139-1.noarch
Sep 01 02:14:38 worker39 sh[34892]: Choose from above solutions by number or skip, retry or cancel [1/2/s/r/c/d/?] (c): c
Sep 01 02:14:38 worker39 systemd[1]: auto-update.service: Deactivated successfully.
I stated multiple times to please follow the suggestions that are written down here and in the linked ticket. Please do that, i.e. in particular apply the lock on multiple packages:
sudo zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion salt-bash-completion python3-salt
as stated in #131249-35
Updated by okurz over 1 year ago
- Status changed from Feedback to Workable
- Assignee deleted (dheidler)
Updated by nicksinger over 1 year ago
I did a
for i in salt salt-minion salt-bash-completion python3-salt; do zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" $i; done
(I was not sure if it would work as one big command) but still face the same problem according to the auto-update log:
Sep 08 10:15:15 openqaworker1 systemd[1]: Started Automatically patch system packages..
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'SUSE_CA' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'devel_openQA' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'devel:openQA:Leap:15.4 (openSUSE_Leap_15.4)' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'devel_openQA_Modules' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'openSUSE-Leap-15.4-1' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'Update repository of openSUSE Backports' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'openSUSE-Leap-15.4-Non-Oss' is up to date.
Sep 08 10:15:15 openqaworker1 sh[23373]: Repository 'Update repository with updates from SUSE Linux Enterprise 15' is up to date.
Sep 08 10:15:16 openqaworker1 sh[23373]: Repository 'openSUSE-Leap-15.4-Update' is up to date.
Sep 08 10:15:16 openqaworker1 sh[23373]: Repository 'openSUSE-Leap-15.4-Update-Non-Oss' is up to date.
Sep 08 10:15:16 openqaworker1 sh[23373]: Repository 'Server Monitoring Software' is up to date.
Sep 08 10:15:16 openqaworker1 sh[23373]: All repositories have been refreshed.
Sep 08 10:15:16 openqaworker1 sh[23408]: Loading repository data...
Sep 08 10:15:17 openqaworker1 sh[23408]: Reading installed packages...
Sep 08 10:15:19 openqaworker1 sh[23408]: Resolving package dependencies...
Sep 08 10:15:19 openqaworker1 sh[23408]: 2 Problems:
Sep 08 10:15:19 openqaworker1 sh[23408]: Problem: the to be installed patch:openSUSE-SLE-15.4-2023-3145-1.noarch conflicts with 'python3-salt.x86_64 < 3006.0-150400.8.37.2' provided by the installed python3-salt-3004-150400.8.25.1.x86_64
Sep 08 10:15:19 openqaworker1 sh[23408]: Problem: the to be installed patch:openSUSE-SLE-15.4-2023-2571-1.noarch conflicts with 'salt.x86_64 < 3006.0-150400.8.34.2' provided by the installed salt-3004-150400.8.25.1.x86_64
Sep 08 10:15:19 openqaworker1 sh[23408]: Problem: the to be installed patch:openSUSE-SLE-15.4-2023-3145-1.noarch conflicts with 'python3-salt.x86_64 < 3006.0-150400.8.37.2' provided by the installed python3-salt-3004-150400.8.25.1.x86_64
Sep 08 10:15:19 openqaworker1 sh[23408]: Solution 1: Following actions will be done:
Sep 08 10:15:19 openqaworker1 sh[23408]: remove lock to allow removal of python3-salt-3004-150400.8.25.1.x86_64
Sep 08 10:15:19 openqaworker1 sh[23408]: remove lock to allow removal of salt-minion-3004-150400.8.25.1.x86_64
Sep 08 10:15:19 openqaworker1 sh[23408]: Solution 2: do not install patch:openSUSE-SLE-15.4-2023-3145-1.noarch
Sep 08 10:15:19 openqaworker1 sh[23408]: Choose from above solutions by number or skip, retry or cancel [1/2/s/r/c/d/?] (c): c
Sep 08 10:15:19 openqaworker1 systemd[1]: auto-update.service: Deactivated successfully.
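As a side note: zypper al accepts several package names in a single invocation (as in the command quoted from #131249-35 above), so the per-package loop should not be necessary:
zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion salt-bash-completion python3-salt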
Updated by nicksinger over 1 year ago
- Status changed from Workable to In Progress
I asked in #discuss-zypp (https://suse.slack.com/archives/C02CL8FJ8UF/p1694162093543369) and it seems there is currently no way to tell zypper to ignore non-applying patches due to locks. One suggested workaround was to lock the patches which would affect the locked package (e.g. salt); these can be found with zypper se --conflicts-pkg salt | grep "^!". After adding several more locks (also for ovmf) I now run into some bigger problems:
openqaworker1:~ # zypper -n --no-refresh --non-interactive-include-reboot-patches patch --replacefiles --auto-agree-with-licenses --download-in-advance
Loading repository data...
Reading installed packages...
Patch 'openSUSE-SLE-15.4-2023-3145-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2023-3145' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2023-3145'.
Patch 'openSUSE-SLE-15.4-2023-2571-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2023-2571' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2023-2571'.
Patch 'openSUSE-SLE-15.4-2023-3145-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2023-3145' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2023-3145'.
Patch 'openSUSE-SLE-15.4-2023-2571-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2023-2571' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2023-2571'.
Patch 'openSUSE-SLE-15.4-2023-2234-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2023-2234' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2023-2234'.
Patch 'openSUSE-SLE-15.4-2022-3811-1' is locked. Use 'zypper in --force patch:openSUSE-SLE-15.4-2022-3811' to install it, or unlock it using 'zypper rl patch:openSUSE-SLE-15.4-2022-3811'.
Resolving package dependencies...
Problem: the installed WebKit2GTK-4.1-lang-2.38.6-150400.4.42.4.noarch requires 'WebKit2GTK-4.1 = 2.38.6', but this requirement cannot be provided
not installable providers: libwebkit2gtk-4_1-0-2.36.0-150400.2.13.x86_64[openSUSE-Leap-$releasever-1]
libwebkit2gtk-4_1-0-2.36.3-150400.4.3.1.x86_64[repo-sle-update]
libwebkit2gtk-4_1-0-2.36.4-150400.4.6.2.x86_64[repo-sle-update]
libwebkit2gtk-4_1-0-2.36.5-150400.4.9.1.x86_64[repo-sle-update]
libwebkit2gtk-4_1-0-2.36.7-150400.4.12.1.x86_64[repo-sle-update]
libwebkit2gtk-4_1-0-2.36.8-150400.4.15.1.x86_64[repo-sle-update]
libwebkit2gtk-4_1-0-2.38.2-150400.4.22.1.x86_64[repo-sle-update]
libwebkit2gtk-4_1-0-2.38.3-150400.4.25.1.x86_64[repo-sle-update]
libwebkit2gtk-4_1-0-2.38.5-150400.4.34.2.x86_64[repo-sle-update]
libwebkit2gtk-4_1-0-2.38.6-150400.4.39.1.x86_64[repo-sle-update]
Solution 1: Following actions will be done:
deinstallation of WebKit2GTK-4.1-lang-2.38.6-150400.4.42.4.noarch
deinstallation of WebKit2GTK-4.0-lang-2.38.6-150400.4.42.4.noarch
Solution 2: do not install patch:openSUSE-SLE-15.4-2023-3419-1.noarch
Solution 3: break WebKit2GTK-4.1-lang-2.38.6-150400.4.42.4.noarch by ignoring some of its dependencies
Choose from above solutions by number or cancel [1/2/3/c/d/?] (c): c
So I'm not sure currently how to fix this.
Updated by okurz over 1 year ago
I think the WebKit thingy is actually a real big issue in general Leap maintenance and the update is blocked by openQA tests, see https://suse.slack.com/archives/C02CLB8TZP1/p1694159637322819?thread_ts=1694159637.322819&cid=C02CLB8TZP1
Updated by nicksinger over 1 year ago
Ah, good that you mention it. This means I can implement a workaround in salt to apply patch locks as well.
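A rough sketch of what such a patch-lock workaround could look like on a single host, assuming the patch names can be taken from the name column of the zypper se --conflicts-pkg output as suggested in #discuss-zypp (the lock comment is only illustrative, not necessarily what goes into salt):
zypper se --conflicts-pkg salt-minion | grep '^!' | cut -d '|' -f 2 | tr -d ' ' | while read -r patch; do
  # lock every patch that conflicts with the pinned salt packages so auto-update stops offering it
  zypper al --comment "poo#134906 - lock patch conflicting with pinned salt" "patch:$patch"
done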
Updated by openqa_review over 1 year ago
- Due date set to 2023-09-23
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
- Related to action #135404: openqaworker-arm-2.suse.de minion not returning added
Updated by nicksinger about 1 year ago
- Priority changed from Urgent to Normal
Reducing priority as the issue described above (failing pipelines) is resolved, but auto-update needs some further refinement to not break.
Updated by livdywan about 1 year ago
Updated by okurz about 1 year ago
I guess you meant https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/984
Updated by nicksinger about 1 year ago
- Status changed from In Progress to Feedback
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/984 has been finalized and is waiting for review and merging
Updated by nicksinger about 1 year ago
My changes worked but the previous logic has an issue if a lock is already present, which can be seen here: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1828601#L99 (sapworker1). We now have two options:
- Remove the check for the ticket reference in the list of locks. It would keep old locks, which are most likely in place for a reason. However, this means we never get rid of old locks which might not be needed any longer and were implemented without a proper ticket reference that would make it easy to check if they are still required.
- Add a zypper rl {{ pkg_name }} before executing the install command. This ensures the latest known-good version gets installed, with the lock re-added afterwards carrying a proper bug reference.
Updated by nicksinger about 1 year ago
As discussed we go with suggestion 2. Since gitlab is currently down I will paste my patch here for the time being:
diff --git a/salt/minion.sls b/salt/minion.sls
index 497f9c9..075ea3b 100644
--- a/salt/minion.sls
+++ b/salt/minion.sls
@@ -33,8 +33,8 @@ speedup_minion:
{% for pkg_name in ['salt', 'salt-bash-completion', 'salt-minion'] %}
lock_{{ pkg_name }}_pkg:
cmd.run:
- - unless: 'zypper ll | grep -qE "{{ pkg_name }}.*\| poo#131249"'
- - name: "(zypper -n in --oldpackage --allow-downgrade '{{ pkg_name }}<=3005' || zypper -n in --oldpackage --allow-downgrade '{{ pkg_name }}<=3005.1') && zypper al -m 'poo#131249 - potential salt regression, unresponsive salt-minion' {{ pkg_name }}"
+ - unless: 'zypper ll | grep -qE "{{ pkg_name }}.*\| poo#[0-9]{6,}\s"'
+ - name: "zypper rl {{ pkg_name }}; (zypper -n in --oldpackage --allow-downgrade '{{ pkg_name }}<=3005' || zypper -n in --oldpackage --allow-downgrade '{{ pkg_name }}<=3005.1') && zypper al -m 'poo#131249 - potential salt regression, unresponsive salt-minion' {{ pkg_name }}"
{% set unlocked_conflicting_patches = salt['cmd.shell']('zypper se --conflicts-pkg salt-minion | grep -P \'^!\s.*?\|\' | cut -d "|" -f 2 | awk \'{$1=$1;print}\'').split("\n") %}
{% if unlocked_conflicting_patches[0] != "" %}
It expands the "existing lock check" to allow every valid bug reference. If none is present in the description of an existing lock we assume it is invalid and replace it with a proper one pointing to this bug here.
Updated by nicksinger about 1 year ago
Updated by nicksinger about 1 year ago
- Status changed from Feedback to In Progress
Pipeline failed again because salt-bash-completion is not installed on some workers. Zypper then tries to install it with a lower version (as per the salt recipe), but this is blocked because it would pull in a higher version of salt-minion, salt and python3-salt, which are all locked. Working on a possible solution now.
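To see which workers are affected by the missing package, the same salt cmd.run pattern as used earlier in this ticket should work (hypothetical check, output omitted):
salt \* cmd.run 'rpm -q salt-bash-completion' | grep -B1 'not installed'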
Updated by okurz about 1 year ago
@nicksinger based on #131249-43 my suspicion is that salt-minion-3005 is also affected by https://bugzilla.opensuse.org/show_bug.cgi?id=1212816 so I suggest we install http://download.opensuse.org/update/leap/15.4/sle/x86_64/salt-3004-150400.8.25.1.x86_64.rpm on Leap 15.5 machines as well. Can you do that?
Updated by okurz about 1 year ago
- Priority changed from Normal to High
Also, recently multiple people were hit by the problem, e.g. in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/612#note_539012, raising prio accordingly.
Updated by nicksinger about 1 year ago
- Status changed from In Progress to Resolved
No, pipeline runs after my manual fixes on Friday (around 2pm) have succeeded since then: https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/pipelines - the mentioned failure in Julie's MR was before my manual cleanup of the locks and doesn't show any symptom of a hanging minion.
Updated by okurz about 1 year ago
- Related to action #136325: salt deploy fails due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org added