action #99288
closed [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M
0%
Description
Observation
Connecting to arm-4 with ipmitool -I lanplus -C 3 -H ipmi.openqaworker-arm-4.qa.suse.de -U admin -P password sol activate
shows nothing.
Connecting to openqaworker-arm-5 with ipmitool shows:
Give root password for maintenance
(or press Control-D to continue): sulogin: cannot read /dev/ttyAMA0: Operation not permitted
[ 9426.877540] systemd-fstab-generator[3152]: x-systemd.device-timeout ignored for openqa.suse.de:/var/lib/openqa/share
Acceptance criteria
- AC1: openqaworker-arm-4 is up and can be rebooted without problems multiple times
- AC2: Same for openqaworker-arm-5
Suggestions
- DONE: Pause alerts
- Investigate problem
- Ensure reboot-stability with multiple reboots
Rollback steps
- Unpause alerts for arm-4 and arm-5
Workaround
Used ipmitool power cycle to reboot arm-4 and arm-5.
Updated by Xiaojing_liu about 3 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
Updated by okurz about 3 years ago
- Description updated (diff)
- Priority changed from Normal to Urgent
Xiaojing_liu wrote:
Workaround
Used ipmitool power cycle to reboot arm-4 and arm-5.
I cannot confirm this worked. The alerts in OSD are still active. ssh openqaworker-arm-4.qa asks me for a password, which it should not. I updated the description with suggestions and rollback steps and paused the alerts in Grafana.
I assume that in #90275 the intended target state was never verified, so likely no reboot was conducted after a salt high state. Right now on openqaworker-arm-5.qa it seems that /var/lib/openqa cannot be mounted correctly because the mount expects /dev/md/openqa, which does not exist.
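A mismatch like the one above (an fstab entry pointing at a device that is not there) can be spotted before rebooting. The following is only a minimal sketch: missing_fstab_devices is a hypothetical helper, not something from salt-states-openqa, and it only looks at entries whose first field is an absolute path, so NFS shares and UUID=/LABEL= entries are skipped.

```shell
#!/bin/sh
# Hypothetical helper: print "<device> <mountpoint>" for every fstab entry
# whose device is given as an absolute path that does not exist.
# Assumption: fields are whitespace-separated as described in fstab(5).
missing_fstab_devices() {
    while read -r dev mp rest; do
        case $dev in
            # Only absolute paths are checked; comments, NFS shares and
            # UUID=/LABEL= entries do not start with "/" and are skipped.
            /*) [ -e "$dev" ] || printf '%s %s\n' "$dev" "$mp" ;;
        esac
    done < "$1"
}
```

Calling missing_fstab_devices /etc/fstab on a host in the state described above would be expected to list the /dev/md/openqa entry.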
Updated by okurz about 3 years ago
- Related to action #90275: Replacement openQA OSD aarch64 hardware (was: Dedicated non-rpi aarch64 hardware for manual testing) added
Updated by okurz about 3 years ago
openqaworker-arm-4 seems to repeatedly boot into the installer medium.
openqaworker-arm-5 could be temporarily recovered by disabling the line
/dev/md/openqa /var/lib/openqa ext2 defaults 0 0
in /etc/fstab, calling mount -a and logging out of the recovery shell, which continued the boot.
@nicksinger didn't you mention necessary changes to support the filesystem setup slightly differing from other hosts?
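For reference, the temporary recovery on arm-5 could be scripted roughly as follows. This is a sketch under assumptions: disable_fstab_entry is a hypothetical helper name, and it comments the entry out rather than deleting it so the change is easy to undo.

```shell
#!/bin/sh
# Hypothetical helper: comment out the fstab entry for a given mount point.
# Usage: disable_fstab_entry /etc/fstab /var/lib/openqa
disable_fstab_entry() {
    fstab=$1
    mountpoint=$2
    tmp=$(mktemp) || return 1
    # Field 2 of an fstab line is the mount point; lines that are already
    # comments are passed through unchanged.
    if awk -v mp="$mountpoint" '
        $0 !~ /^[[:space:]]*#/ && $2 == mp { print "#" $0; next }
        { print }
    ' "$fstab" > "$tmp"; then
        # Write back in place so ownership and permissions are preserved.
        cat "$tmp" > "$fstab"
        rc=$?
    else
        rc=1
    fi
    rm -f "$tmp"
    return "$rc"
}
```

In the recovery shell this would be followed by mount -a, for example: disable_fstab_entry /etc/fstab /var/lib/openqa && mount -a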
Updated by okurz about 3 years ago
- Subject changed from [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 to [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger about 3 years ago
- Assignee set to nicksinger
- Target version deleted (Ready)
okurz wrote:
openqaworker-arm-5 could be temporarily recovered by disabling the line
/dev/md/openqa /var/lib/openqa ext2 defaults 0 0
in /etc/fstab, calling mount -a
and logging out of the recovery shell which continued the boot.
@nicksinger didn't you mention necessary changes to support the filesystem setup slightly differing from other hosts?
Yes, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/572 - but I had manually reverted the changes on arm-5 and overlooked https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/nvme_store/init.sls#L35 which creates an entry in /etc/fstab. I have now deleted it manually on arm-5.
Updated by livdywan about 3 years ago
- Description updated (diff)
- Status changed from Workable to In Progress
- Target version set to Ready
Updated by nicksinger about 3 years ago
- Target version deleted (Ready)
Set the bootdev for openqaworker-arm-4 to persistent with: ipmitool -I lanplus -C 3 -H ipmi.openqaworker-arm-4.qa.suse.de chassis bootdev disk options=persistent
Reboot worked 4/4 times. Another highstate also confirmed that the fstab entry does not reappear. Doing the reboot test with arm-5 now.
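The repeated reboot check could be automated along these lines. This is only a sketch, not what was actually run here: trigger_reboot, host_is_up and SETTLE_SECONDS are hypothetical names, passwordless root ssh is assumed, and a fixed wait is used instead of polling.

```shell
#!/bin/sh
# Hypothetical placeholders; replace with whatever fits the setup
# (for example an ipmitool power cycle instead of ssh reboot).
trigger_reboot() { ssh "root@$1" reboot; }
host_is_up() { ssh -o ConnectTimeout=10 "root@$1" true; }

# Reboot a host N times and report how often it came back, e.g. "4/4".
# Returns non-zero if any reboot did not bring the host back.
reboot_test() {
    host=$1
    runs=$2
    ok=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        i=$((i + 1))
        # The ssh session usually drops during reboot, so the exit
        # status of the trigger is deliberately ignored.
        trigger_reboot "$host"
        # Crude fixed wait for the machine to boot (assumed default: 300 s).
        sleep "${SETTLE_SECONDS:-300}"
        if host_is_up "$host"; then
            ok=$((ok + 1))
        fi
    done
    echo "$ok/$runs"
    [ "$ok" -eq "$runs" ]
}
```

For example, reboot_test openqaworker-arm-4.qa.suse.de 4 would print the success ratio after four reboots; running a salt highstate between iterations would additionally cover the case of the fstab entry being recreated.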
Updated by nicksinger about 3 years ago
- Status changed from In Progress to Resolved
- Target version set to Ready
gnah sorry for deleting "Ready" once again… openqaworker-arm-5 also survived 3/3 reboots. I consider this ticket covered then.
Updated by okurz about 3 years ago
- Status changed from Resolved to Feedback
waaait … what happens if salt overwrites fstab changes again? I think the AC should be "survives multiple reboot+salt+reboot cycles" :) Or is the above really that you reverted parts but never rebooted so missed the fstab part?
Updated by nicksinger about 3 years ago
- Status changed from Feedback to Resolved
okurz wrote:
waaait … what happens if salt overwrites fstab changes again? I think the AC should be "survives multiple reboot+salt+reboot cycles" :) Or is the above really that you reverted parts but never rebooted so missed the fstab part?
Right, I reverted the changes manually after applying a proper fix in salt (the addition of another grain) and never rebooted afterwards. In https://progress.opensuse.org/issues/99288#note-10 I mentioned another highstate to confirm this is working. Another way to view this: arm-5 was deployed first and showed problems while arm-4 didn't have the fstab problems (it was deployed with the grain already present) but just a wrong boot order :)
So your newly formulated AC is met as well