action #99288

[Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M

Added by Xiaojing_liu 2 months ago. Updated 2 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2021-09-26
Due date:
% Done: 0%
Estimated time:

Description

Observation

Connecting to arm-4 with ipmitool -I lanplus -C 3 -H ipmi.openqaworker-arm-4.qa.suse.de -U admin -P password sol activate shows nothing.
Connecting to openqaworker-arm-5 with ipmitool shows:

Give root password for maintenance
(or press Control-D to continue): sulogin: cannot read /dev/ttyAMA0: Operation not permitted

[ 9426.877540] systemd-fstab-generator[3152]: x-systemd.device-timeout ignored for openqa.suse.de:/var/lib/openqa/share

Acceptance criteria

  • AC1: openqaworker-arm-4 is up and can be rebooted without problems multiple times
  • AC2: Same for openqaworker-arm-5

Suggestions

  • DONE: Pause alerts
  • Investigate problem
  • Ensure reboot-stability with multiple reboots

Rollback steps

  • Unpause alerts for arm-4 and arm-5

Workaround

Used ipmitool power cycle to reboot arm-4 and arm-5.


Related issues

Related to openQA Infrastructure - action #90275: Replacement openQA OSD aarch64 hardware (was: Dedicated non-rpi aarch64 hardware for manual testing) Resolved 2021-09-17

History

#1 Updated by Xiaojing_liu 2 months ago

  • Project changed from openQA Project to openQA Infrastructure

#2 Updated by okurz 2 months ago

  • Target version set to Ready

#3 Updated by okurz 2 months ago

  • Description updated (diff)
  • Priority changed from Normal to Urgent

Xiaojing_liu wrote:

Workaround

Used ipmitool power cycle to reboot arm-4 and arm-5.

I cannot confirm this worked. The alerts in OSD are still active. ssh openqaworker-arm-4.qa asks me for a password, which it should not. I updated the description with suggestions and rollback steps and paused the alerts in Grafana.

I assume that in #90275 the intended target state was never verified, so likely no reboot was conducted after a salt high state. Right now on openqaworker-arm-5.qa it seems that /var/lib/openqa cannot be mounted correctly because /etc/fstab expects /dev/md/openqa, which does not exist.
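A quick way to spot this kind of problem before a reboot is to list fstab entries whose device node is missing. The following is only a sketch: the function name missing_fstab_devices is hypothetical, and it only checks plain /dev/... paths (not UUID=, LABEL= or NFS entries).

```shell
#!/bin/sh
# Hypothetical helper (not part of the ticket): print "<device> <mountpoint>"
# for every /dev/... entry in an fstab-style file whose device does not exist.
missing_fstab_devices() {
    fstab=${1:-/etc/fstab}
    # The "|| [ -n "$dev" ]" part also handles a last line without newline.
    while read -r dev mountpoint rest || [ -n "$dev" ]; do
        case $dev in
            # Commented lines start with '#', so they never match /dev/*.
            /dev/*) [ -e "$dev" ] || echo "$dev $mountpoint" ;;
        esac
    done < "$fstab"
}
```

Called without arguments it reads /etc/fstab; on arm-5 in the state described above it would presumably have reported /dev/md/openqa /var/lib/openqa.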

#4 Updated by okurz 2 months ago

  • Related to action #90275: Replacement openQA OSD aarch64 hardware (was: Dedicated non-rpi aarch64 hardware for manual testing) added

#5 Updated by okurz 2 months ago

openqaworker-arm-4 seems to repeatedly boot into installer medium.

openqaworker-arm-5 could be temporarily recovered by disabling the line /dev/md/openqa /var/lib/openqa ext2 defaults 0 0 in /etc/fstab, calling mount -a, and logging out of the recovery shell, which continued the boot.
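The fstab part of that recovery can be sketched as a small function. This is an illustration, not the exact commands that were typed in the recovery shell: the name disable_fstab_entry and the .bak backup file are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: comment out every active fstab line whose first field
# is the given device. A copy of the original is kept as "<fstab>.bak".
disable_fstab_entry() {
    device=$1
    fstab=${2:-/etc/fstab}
    cp "$fstab" "$fstab.bak"
    # Lines that are already commented have a first field starting with '#',
    # so they do not match and are printed unchanged.
    awk -v dev="$device" '$1 == dev { print "#" $0; next } { print }' \
        "$fstab.bak" > "$fstab"
}
```

In the situation above this would be called as disable_fstab_entry /dev/md/openqa, followed by mount -a and leaving the recovery shell.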

nicksinger, didn't you mention that changes are necessary to support the filesystem setup, which differs slightly from other hosts?

#6 Updated by okurz 2 months ago

  • Description updated (diff)

#7 Updated by okurz 2 months ago

  • Subject changed from [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 to [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M
  • Description updated (diff)
  • Status changed from New to Workable

#8 Updated by nicksinger 2 months ago

  • Assignee set to nicksinger
  • Target version deleted (Ready)

openqaworker-arm-5 could be temporarily recovered by disabling the line /dev/md/openqa /var/lib/openqa ext2 defaults 0 0 in /etc/fstab, calling mount -a, and logging out of the recovery shell, which continued the boot.

nicksinger, didn't you mention that changes are necessary to support the filesystem setup, which differs slightly from other hosts?

Yes, see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/572 but I had to manually revert the changes on arm-5 and overlooked https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/nvme_store/init.sls#L35 which creates an entry in /etc/fstab. I have now deleted that entry manually on arm-5.

#9 Updated by cdywan 2 months ago

  • Description updated (diff)
  • Status changed from Workable to In Progress
  • Target version set to Ready

#10 Updated by nicksinger 2 months ago

  • Target version deleted (Ready)

I set the bootdev for openqaworker-arm-4 to persistent with: ipmitool -I lanplus -C 3 -H ipmi.openqaworker-arm-4.qa.suse.de chassis bootdev disk options=persistent. The reboot worked 4/4 times. Another highstate also confirmed that the fstab entry does not reappear. Doing the reboot test with arm-5 now.

#11 Updated by nicksinger 2 months ago

  • Status changed from In Progress to Resolved
  • Target version set to Ready

gnah sorry for deleting "Ready" once again… openqaworker-arm-5 also survived 3/3 reboots. I consider this ticket covered then.

#12 Updated by okurz 2 months ago

  • Status changed from Resolved to Feedback

waaait … what happens if salt overwrites fstab changes again? I think the AC should be "survives multiple reboot+salt+reboot cycles" :) Or is the above really that you reverted parts but never rebooted so missed the fstab part?
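Such a reboot+salt+reboot cycle could be scripted roughly as follows. This is a sketch under assumptions: passwordless ssh and sudo on the worker, salt-call state.highstate as the way to apply the high state, guessed sleep intervals, and hypothetical function names. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN is 1 (the default here).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Give the host time to go down, then poll until ssh answers again.
# The 60s/10s intervals are guesses; there is no upper bound in this sketch.
wait_for_ssh() {
    run sleep 60
    until run ssh -o ConnectTimeout=10 "$1" true; do
        run sleep 10
    done
}

# One cycle is: reboot, apply the salt high state, reboot again.
reboot_salt_reboot() {
    host=$1
    cycles=$2
    i=1
    while [ "$i" -le "$cycles" ]; do
        run ssh "$host" sudo reboot
        wait_for_ssh "$host"
        run ssh "$host" sudo salt-call state.highstate
        run ssh "$host" sudo reboot
        wait_for_ssh "$host"
        i=$((i + 1))
    done
}
```

For example, reboot_salt_reboot openqaworker-arm-5.qa 3 prints the planned commands for three cycles; setting DRY_RUN=0 would actually execute them.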

#13 Updated by nicksinger 2 months ago

  • Status changed from Feedback to Resolved

okurz wrote:

waaait … what happens if salt overwrites fstab changes again? I think the AC should be "survives multiple reboot+salt+reboot cycles" :) Or is the above really that you reverted parts but never rebooted so missed the fstab part?

Right, I reverted the changes manually after applying a proper fix in salt (the addition of another grain) and never rebooted afterwards. In https://progress.opensuse.org/issues/99288#note-10 I mentioned another highstate to confirm this is working. Another way to view this: arm-5 was deployed first and showed problems while arm-4 didn't have the fstab problems (it was deployed with the grain already present) but just a wrong boot order :)

So your newly formulated AC is met as well
