action #131249 (closed)

[alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 2023-06-22
Due date: -
% Done: 0%
Estimated time: -

Description

Observation

From https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1648467 and reproduced locally with sudo salt --no-color -C 'G@roles:worker' test.ping:

grenache-1.qa.suse.de:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230622084232610255
worker5.oqa.suse.de:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230622084232610255
worker2.oqa.suse.de:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230622084232610255
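To follow up on the reported job and look at the unresponsive minions directly, something like the following could be used. This is only a minimal sketch, assuming SSH access to the affected workers and that the job ID from the output above is still known to the salt master:

    # Check whether the minions ever returned a result for the job ID above
    sudo salt-run jobs.lookup_jid 20230622084232610255

    # Inspect the salt-minion service on one of the unresponsive workers
    ssh worker5.oqa.suse.de 'sudo systemctl status salt-minion; sudo journalctl -u salt-minion --since today'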

Acceptance criteria

  • AC1: salt \* test.ping and salt \* state.apply succeed consistently for more than one day (a rough verification loop is sketched after this list)
  • AC2: our salt states, pillars and osd deployment pipelines are green and stable again
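One way to check AC1 is a simple loop on the salt master that runs both commands repeatedly and logs failures. This is only a rough sketch; the interval and the plain echo-based logging are arbitrary assumptions, not part of this ticket:

    # Run test.ping and state.apply once per hour and record any failures
    while true; do
        date
        sudo salt --no-color '*' test.ping || echo "test.ping failed"
        sudo salt --no-color '*' state.apply || echo "state.apply failed"
        sleep 3600
    done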

Suggestions

  • DONE It seems we might have had this problem for a while but never this severely. Now it looks like those machines can end up with "No response" again even if we trigger a reboot and restart salt-minion. Maybe we can revert some recent package updates (see the downgrade sketch after this list)? From /var/log/zypp/history there is
2023-06-22 03:01:12|install|python3-pyzmq|17.1.2-150000.3.5.2|x86_64||repo-sle-update|e2d9d07654cffc31e5199f40aa1ba9fee1e114c4ca5abd78f7fdc78b2e6cc21a|
  • DONE Debug the actual problem of the hanging salt-minion. Maybe we can actually try to trigger the problem more reliably rather than just prevent it?
  • DONE Research upstream, apply workarounds, potentially try upgrading to Leap 15.5 if that might fix something
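The package revert mentioned in the first suggestion could look roughly like the following. This is a hedged sketch only: the previous python3-pyzmq version is not recorded in this ticket, so OLD_VERSION below is a placeholder that would need to be filled from the repository first:

    # Show what the nightly auto-update installed right before the hangs started
    sudo grep '2023-06-22' /var/log/zypp/history | grep '|install|'
    # List installed and available versions to pick a previous one
    zypper search -s python3-pyzmq
    # Downgrade to the chosen previous version (OLD_VERSION is a placeholder) and
    # lock the package so the next auto-update does not pull the new version back in
    sudo zypper install --oldpackage python3-pyzmq="$OLD_VERSION"
    sudo zypper addlock python3-pyzmq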

Rollback steps

  • DONE on worker2,worker3,worker5,grenache-1,openqaworker-arm-2,openqaworker-arm-3 run sudo mv /etc/systemd/system/auto-update.$i{.disabled_poo131249,} && sudo systemctl enable --now auto-update.timer && sudo systemctl start auto-update, remove the manual override /etc/systemd/system/auto-update.service.d/override.conf, wait for the upgrade to complete and reboot (see the sketch after this list)
  • DONE re-enable osd-deployment https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit
  • DONE remove silence https://stats.openqa-monitor.qa.suse.de/alerting/silences "alertname=Failed systemd services alert (except openqa.suse.de)"
  • DONE remove package locks for anything related to salt

Related issues 10 (0 open, 10 closed)

  • Related to openQA Infrastructure - action #130835: salt high state fails after recent merge requests in salt pillars size:M (Resolved, okurz, 2023-06-14)
  • Related to openQA Project - action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines (Resolved, kraih, 2023-06-27)
  • Related to openQA Infrastructure - action #107932: Handling broken RPM databases does not handle certain cases (Resolved, mkittler, 2022-03-07)
  • Related to openQA Infrastructure - action #102942: Failed systemd services alert: snapper-cleanup on QA-Power8-4-kvm fails size:M (Resolved, mkittler, 2021-11-24)
  • Related to openQA Infrastructure - action #132137: Setup new PRG2 openQA worker for osd size:M (Resolved, mkittler, 2023-06-29)
  • Related to openQA Infrastructure - action #134906: osd-deployment failed due to openqaworker1 showing "No response" in salt size:M (Resolved, nicksinger, 2023-08-31 to 2023-09-23)
  • Related to openQA Infrastructure - action #136325: salt deploy fails due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org (Resolved, okurz, 2023-09-22)
  • Related to openQA Infrastructure - action #150965: At least diesel+petrol+mania fail to auto-update due to kernel locks preventing patches size:M (Resolved, dheidler, 2023-11-16 to 2023-12-22)
  • Copied to openQA Infrastructure - action #131540: openqa-piworker fails to upgrade many packages, vendor change is not enabled as our salt states so far only do that for openQA machines, not generic machines size:M (Resolved, mkittler)
  • Copied to openQA Infrastructure - action #131543: We have machines with both auto-update&auto-upgrade deployed, we should have only one at a time size:M (Resolved, okurz)
