action #132383

closed

FC Basement OSD hosts not reachable since 2023-07-06 01:50 CEST

Added by okurz 10 months ago. Updated 10 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2023-07-06
Due date:
% Done: 0%
Estimated time:

Description

Observation

Multiple FC Basement OSD hosts have been unreachable since 2023-07-06 01:50 CEST, as visible on https://monitor.qa.suse.de/alerting/list?search=state:firing mentioning

plus the "Packet loss" alert

Suggestions


Related issues 1 (0 open, 1 closed)

Related to QA - action #132452: Bring seth+osiris up-to-date | Resolved | okurz | 2023-06-28

Actions #1

Updated by okurz 10 months ago

  • Status changed from New to Blocked
Actions #2

Updated by okurz 10 months ago

  • Description updated (diff)
Actions #3

Updated by mkittler 10 months ago

It looks like the outage made the NFS mount totally unresponsive, so the workers got stuck trying to access it during initialization (and thus stayed offline). Manual filesystem commands also get stuck; I could not even unmount the filesystem again. It is exactly the same on all 3 sap workers. I'll have a look at the other machines to see whether they are equally badly affected.
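
As an illustration of how a hung mount could be detected without the checking process getting stuck itself, here is a minimal sketch in Python. It is not something used in this ticket: the function name path_is_responsive, the timeout value and the example path /mnt/nfs-share are all assumptions made for illustration. The stat call runs in a child process, so a blocked mount only blocks the child.

```python
import subprocess
import sys
import time


def path_is_responsive(path: str, timeout: float = 5.0) -> bool:
    """Return True if stat() on `path` succeeds within `timeout` seconds."""
    # Run the stat in a child process so a hung mount cannot block this script.
    probe = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe.poll() is not None:
            # The child finished: exit code 0 means stat() succeeded.
            return probe.returncode == 0
        time.sleep(0.05)
    # Best effort only: a process blocked in uninterruptible I/O may ignore the
    # kill until the mount recovers, so we deliberately do not wait() for it.
    probe.kill()
    return False


if __name__ == "__main__":
    # Hypothetical mount point, replace with the real NFS mount path.
    mount_point = "/mnt/nfs-share"
    state = "responsive" if path_is_responsive(mount_point) else "stuck or missing"
    print(f"{mount_point}: {state}")
```

The sketch polls instead of using subprocess.run(timeout=...) because the latter waits for the child after killing it, which could itself hang if the child is blocked on the mount.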

Actions #4

Updated by mkittler 10 months ago

The other hosts (piworker and openqaworker1) were not affected. That's strange but also good.

The stuck NFS mount was also the reason why zypper was stuck on the following command:

rpm --root / --dbpath /usr/lib/sysimage/rpm -U --percent --noglob --force --nodeps -- /var/cache/zypp/packages/devel_openQA/x86_64/openQA-common-4.6.1688565452.efc15ea-lp155.5933.1.x86_64.rpm

As a result the zypper lock was stuck as well, and so salt was not able to apply states. It would be nice if this chain of problems leading to other problems was at least shorter…

Actions #5

Updated by okurz 10 months ago

  • Status changed from Blocked to In Progress

The SD ticket was resolved, so the network is back. Regarding the NFS mount: we have tried to phase out its use for years, for good reasons. I think we should try again, e.g. by only providing the mount on a limited set of workers with a special worker class such as "deprecated-nfs".

Actions #6

Updated by openqa_review 10 months ago

  • Due date set to 2023-07-21

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by livdywan 10 months ago

Actions #8

Updated by okurz 10 months ago

  • Due date deleted (2023-07-21)
  • Status changed from In Progress to Resolved