action #132383
closed
FC Basement OSD hosts not reachable since 2023-07-06 01:50 CEST
Description
Observation
Multiple FC Basement OSD hosts have not been reachable since 2023-07-06 01:50 CEST, as visible on https://monitor.qa.suse.de/alerting/list?search=state:firing mentioning
plus the "Packet loss" alert
Suggestions
- Wait for https://sd.suse.com/servicedesk/customer/portal/1/SD-126191 resolution
- Ensure all relevant machines in FC Basement are reachable again
- Retry https://gitlab.suse.de/openqa/osd-deployment/-/pipelines and other gitlab CI pipelines where failed
- Ensure alerts are ok again
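The "ensure all relevant machines are reachable again" step could be checked with something like the following. This is only a minimal sketch: the helper names `tcp_probe` and `unreachable` are made up for illustration, and probing the SSH port (22) is an assumption about what counts as "reachable" here, not how the monitoring alerts are defined.

```python
import socket


def tcp_probe(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def unreachable(hosts, probe=tcp_probe):
    """Return the hosts for which probe(host) is false, preserving order."""
    return [host for host in hosts if not probe(host)]
```

Calling `unreachable([...])` with the list of FC Basement hosts should return an empty list once everything is back. The `probe` parameter is there so the check can be swapped, e.g. for a ping-based one.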
Updated by okurz over 1 year ago
- Status changed from New to Blocked
Updated by mkittler over 1 year ago
It looks like the network outage caused the NFS mount to become totally unresponsive. So the workers got stuck trying to access it during initialization (and thus stayed offline). Manual filesystem commands also get stuck. I could not even unmount the filesystem again. It is exactly the same on all 3 sap workers. I'll have a look at the other machines to see whether they are equally badly affected.
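To check for a hung mount without getting stuck oneself, the filesystem access can be moved into a child process with a timeout. A minimal sketch, assuming a POSIX host with `stat` available; the function names are hypothetical and the mount path has to be supplied by the caller, since it is not stated in this ticket.

```python
import subprocess


def responds_within(cmd, timeout):
    """Return True if cmd exits with status 0 within timeout seconds."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Send SIGKILL but deliberately do not wait for the child: a process
        # blocked on an unresponsive NFS server may take a long time to exit.
        proc.kill()
        return False


def mount_responds(path, timeout=5.0):
    """Return True if `stat` on path succeeds within timeout seconds."""
    return responds_within(["stat", "--", path], timeout)
```

`mount_responds(path)` returns `False` both when the access hangs and when `stat` fails outright, so a `False` result only says "not usable", not why.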
Updated by mkittler over 1 year ago
The other hosts (piworker and openqaworker1) were not affected. That's strange but also good.
The stuck NFS mount was also the reason why zypper was stuck on
rpm --root / --dbpath /usr/lib/sysimage/rpm -U --percent --noglob --force --nodeps -- /var/cache/zypp/packages/devel_openQA/x86_64/openQA-common-4.6.1688565452.efc15ea-lp155.5933.1.x86_64.rpm
As a result the zypper lock was never released, so salt was not able to apply states. It would be nice if this chain of problems leading to other problems was at least shorter…
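For diagnosing this kind of chain, a small helper can show which process holds the zypper lock and what state it is in. This is a sketch under assumptions: `/run/zypp.pid` is assumed to be where the lock holder's PID is recorded, and the function names are made up. State `D` in `/proc/<pid>/stat` means uninterruptible sleep, which is typical for a process blocked on I/O.

```python
def lock_holder_pid(pid_file="/run/zypp.pid"):
    """Return the PID recorded in the lock file, or None if unavailable."""
    # The default path is an assumption, not taken from this ticket.
    try:
        with open(pid_file) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def process_state(stat_line):
    """Extract the one-letter state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the last ')'.
    return stat_line.rsplit(")", 1)[1].split()[0]


def state_of(pid):
    """Return the state letter of a running process, or None if it is gone."""
    try:
        with open(f"/proc/{pid}/stat") as handle:
            return process_state(handle.read())
    except OSError:
        return None
```

Note that the lock holder itself may just be sleeping while waiting for a child. In the case above the blocked process was the rpm call, so the children of the lock holder need to be looked at as well.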
Updated by okurz over 1 year ago
- Status changed from Blocked to In Progress
The SD ticket was resolved, so the network is back. Regarding the NFS mount: We have been trying to phase out its use for years, for good reasons. I think we should try again, e.g. only provide the mount on a limited set of workers with a special worker class such as "deprecated-nfs".
Updated by openqa_review over 1 year ago
- Due date set to 2023-07-21
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 1 year ago
- Related to action #132452: Bring seth+osiris up-to-date added
Updated by okurz over 1 year ago
- Due date deleted (2023-07-21)
- Status changed from In Progress to Resolved
https://gitlab.suse.de/openqa/osd-deployment/-/pipelines and all hosts are good