coordination #102882

closed

[epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service

Added by okurz about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2022-02-10
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

User report: https://suse.slack.com/archives/C02CANHLANP/p1637666699462700
mdoucha: "All jobs are stuck downloading assets until they time out. OSD dashboard shows that the workers are downloading ridiculous amounts of data all the time since yesterday."

Suggestions

  • Find corresponding monitoring data on https://monitor.qa.suse.de/ that can be used to visualize the problem and to verify any potential fix
  • Identify what might have caused such problems "since yesterday", i.e. since 2021-11-22

Rollback steps (to be done once the actual issue has been resolved)

powerqaworker-qam-1 # systemctl unmask openqa-worker-auto-restart@{3..6} openqa-reload-worker-auto-restart@{3..6}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{3..6} openqa-reload-worker-auto-restart@{3..6}.{service,timer}
QA-Power8-4-kvm # systemctl unmask openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer}
QA-Power8-5-kvm # systemctl unmask openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer}
  • Add qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de back to salt and ensure all services are running again.
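
To help with the "ensure all services are running again" part, the following is a minimal sketch of a check that could be run on each host after the rollback. The helper names `list_rollback_units` and `check_rollback_units` are hypothetical and not part of openQA; the unit names are taken from the commands above. The sketch assumes that only the worker services are expected to be continuously active, so the reload services and timers are checked only for still being masked.

```shell
# Hypothetical helper (not part of openQA): print the units the rollback
# touches for worker instances FIRST..LAST, one per line.
list_rollback_units() {
    i=$1
    while [ "$i" -le "$2" ]; do
        printf '%s\n' \
            "openqa-worker-auto-restart@$i.service" \
            "openqa-reload-worker-auto-restart@$i.service" \
            "openqa-reload-worker-auto-restart@$i.timer"
        i=$((i + 1))
    done
}

# Hypothetical helper: report units in the range that are still masked and
# worker services that are not active. Returns 1 if anything was reported.
check_rollback_units() {
    rc=0
    for unit in $(list_rollback_units "$1" "$2"); do
        # "systemctl is-enabled" prints "masked" for masked units.
        state=$(systemctl is-enabled "$unit" 2>/dev/null)
        case $state in
            masked*) echo "still masked: $unit"; rc=1 ;;
        esac
        # Assumption: only the worker services must be active at all times.
        case $unit in
            openqa-worker-auto-restart@*)
                systemctl is-active --quiet "$unit" || { echo "not active: $unit"; rc=1; }
                ;;
        esac
    done
    return "$rc"
}
```

With the instance ranges from the rollback commands, this would be `check_rollback_units 3 6` on powerqaworker-qam-1 and `check_rollback_units 4 8` on QA-Power8-4-kvm and QA-Power8-5-kvm.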

Subtasks 6 (0 open, 6 closed)

action #106538: lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:S | Resolved | okurz | 2022-02-10
action #106540: Mitigate/resolve All OSD PPC64LE workers except malbec appear to have horribly broken cache service | Resolved | kraih | 2022-02-10
action #106543: Conduct rollback steps and check impact for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:M | Resolved | kraih | 2022-02-10
action #107083: SUSE QE Tools team must learn about switch administration and get access | Resolved | okurz | 2022-02-18
action #107086: Ask for volunteers in SUSE QE Tools that would be able to visit the Nbg server rooms, e.g. as second person accompanying nsinger or any potential new admin | Resolved | okurz | 2022-02-18
action #107089: Make SUSE QE Tools team aware that we need to support EngInfra due to limited capacity | Resolved | okurz | 2022-02-18

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #104106: [qe-core] test fails in await_install - Network peformace for ppc installations is decreasing size:S | Resolved | mkittler | 2021-12-16
Related to openQA Project (public) - action #105804: Job age (scheduled) (median) alert size:S | Resolved | mkittler | 2022-02-01
Copied to openQA Project (public) - coordination #102951: [epic] Better network performance monitoring | Resolved | okurz | 2021-11-24