action #136133

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters

Migrate aarch64.openqanet.opensuse.org to FC Basement size:M

Added by okurz 8 months ago. Updated 6 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Extracted from #134864

Acceptance criteria

  • AC1: aarch64.openqanet.opensuse.org up and running in FC Basement again
  • AC2: Inventory management systems, i.e. racktables and netbox, reflect the plan and current situation at all times
  • AC3: On top of AC1 machine is successfully working on o3 jobs

Suggestions

Rollback actions


Related issues 1 (1 open, 0 closed)

Copied to openQA Infrastructure - action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M - Blocked - mkittler

Actions #2

Updated by okurz 7 months ago

  • Project changed from 46 to openQA Infrastructure
  • Category deleted (Infrastructure)
Actions #3

Updated by okurz 7 months ago

  • Subject changed from Migrate aarch64.openqanet.opensuse.org to FC Basement to Migrate aarch64.openqanet.opensuse.org to FC Basement size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 7 months ago

  • Target version changed from Tools - Next to Ready
Actions #5

Updated by nicksinger 7 months ago

  • Assignee set to nicksinger
Actions #6

Updated by nicksinger 7 months ago

  • Assignee deleted (nicksinger)

Putting back as I'm currently too involved with #136130

Actions #7

Updated by nicksinger 7 months ago

  • Tags changed from infra, nue1, nue2, nue3 to infra, nue1, nue2, nue3, next-frankencampus-visit

SP cannot be reached at the moment. Resetting power to socket 9 on http://epdu-b4.qe.nue2.suse.org/ did not work. Adding next-frankencampus-visit to physically check connections and power cycle the machine if needed.

Actions #8

Updated by dheidler 7 months ago

  • Related to action #138326: Extend openQA-in-openQA test to do a real multimachine run. added
Actions #9

Updated by dheidler 7 months ago

  • Related to deleted (action #138326: Extend openQA-in-openQA test to do a real multimachine run.)
Actions #10

Updated by dheidler 7 months ago

  • Assignee set to dheidler

Will have a look on Wednesday. If anyone gets to it earlier, please reassign.

Actions #11

Updated by okurz 7 months ago

  • Status changed from Workable to In Progress
  • Assignee changed from dheidler to okurz
Actions #12

Updated by okurz 7 months ago

I physically checked the machine. After connecting a display I could see that it was stuck, as reported before, trying and failing to boot, but it was at least powered on. Using the PDU's local control panel I powered off outlet A9 and the correct machine went down. After powering outlet A9 back on, the management interface's Ethernet port showed activity again. The power supply stayed off for multiple minutes, but eventually the green LED came back on and the display showed this message:

W E L C O M E
CPU Type:
…
System will boot soon
Actions #13

Updated by okurz 7 months ago

nicksinger wrote in #note-7:

SP cannot be reached at the moment.

Which address did you try? workerconf.sls seems to reference an outdated entry that cannot work.

The SP address should be aarch64-o3-sp.qe.nue2.suse.org, but on walter1 I could not find any matching entries with

journalctl -u dhcpd --since "1h ago" | grep -i '\(00:18\|00:00\)'

I tried to enter the setup, but the BIOS setup is password protected and requires a reboot after every three failed password attempts. I tried multiple usual passwords and all of them failed, argh.

Maybe one can try from the grub menu "Enter firmware settings". I tried to add console=tty0 to the linux command line arguments but still nothing shown on the display after "EFI stub: Exiting boot services..."

Actions #14

Updated by okurz 7 months ago

  • Status changed from In Progress to Workable
  • Assignee changed from okurz to dheidler

Maybe one can try from the grub menu "Enter firmware settings". I tried to add console=tty0 to the linux command line arguments but still nothing shown on the display after "EFI stub: Exiting boot services..."

@dheidler back to you. I suggest trying the above with a local display and keyboard, experimenting with the installed grub menu, and trying to boot over the network or from a local USB thumb drive.

Actions #17

Updated by dheidler 6 months ago

  • Status changed from Workable to In Progress
Actions #18

Updated by dheidler 6 months ago

  • Status changed from In Progress to Resolved

https://openqa.opensuse.org/tests/3680520

This is what I did after installing the OS:

zypper ar -p 95 -f 'http://download.opensuse.org/repositories/devel:openQA/$releasever' devel_openQA
zypper ar -p 90 -f 'http://download.opensuse.org/repositories/devel:openQA:Leap:$releasever/$releasever' devel_openQA_Leap
zypper --gpg-auto-import-keys ref
echo "requires:openQA-worker" > /etc/zypp/systemCheck.d/openqa.check
zypper -n in openQA-worker openQA-auto-update openQA-continuous-update os-autoinst-distri-opensuse-deps swtpm # openQA worker services plus dependencies for openSUSE distri or development repo if added previously
zypper -n in ffmpeg-4  # for using external video encoder as it is already configured on some machines like ow19, ow20 and power8
zypper -n in nfs-client  # For /var/lib/openqa/share
zypper -n in bash-completion vim htop strace systemd-coredump iputils tcpdump bind-utils  # for general tinkering
zypper -n in qemu-uefi-aarch32

# for use of nfs
# echo "openqa1-opensuse:/ /var/lib/openqa/share nfs4 noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m 0 0" >> /etc/fstab

# for use of cacheservice
# if connecting via ssh-rsync to o3
mkdir -p /var/lib/openqa/ssh/
chown _openqa-worker: /var/lib/openqa/ssh/
sudo -u _openqa-worker ssh-keygen -t ed25519 -N '' -f /var/lib/openqa/ssh/id_ed25519
# append /var/lib/openqa/ssh/id_ed25519.pub to ariel:/var/lib/openqa/.ssh/authorized_keys
sudo -u _openqa-worker ssh -i /var/lib/openqa/ssh/id_ed25519 -p 2214 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts" geekotest@gate.opensuse.org whoami
# note that ariel can be reached directly via ariel.dmz-prg2.suse.org from suse network

sed -i 's/\(solver.dupAllowVendorChange = \)false/\1true/' /etc/zypp/zypp.conf

instances=30 bash -x $(which os-autoinst-setup-multi-machine)

#[global]
#HOST = https://openqa.opensuse.org
#CACHEDIRECTORY = /var/lib/openqa/cache
##WORKER_CLASS = foo
#EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 0
#[https://openqa.opensuse.org]
##TESTPOOLSERVER = -e 'ssh -p 2214 -i /var/lib/openqa/ssh/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts"' geekotest@gate.opensuse.org:/var/lib/openqa/tests
##TESTPOOLSERVER = -e 'ssh -i /var/lib/openqa/ssh/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts"' geekotest@ariel.dmz-prg2.suse.org:/var/lib/openqa/tests
#TESTPOOLSERVER = rsync://openqa.opensuse.org/tests

zypper -n in rsync sshpass
rsync -rP --rsh="sshpass -p opensuse ssh -o StrictHostKeyChecking=no" root@openqaworker24:/etc/openqa/ /etc/openqa/
rsync -rP --rsh="sshpass -p opensuse ssh -o StrictHostKeyChecking=no" root@openqaworker24:/etc/wicked/scripts/gre_tunnel_preup.sh /etc/wicked/scripts/
sed -i "s/openqaworker24/$HOSTNAME/g" /etc/openqa/workers.ini
# comment out the worker that we are on and enable all other entries
vi /etc/wicked/scripts/gre_tunnel_preup.sh
# use only hostname as worker class for testing
vi /etc/openqa/workers.ini

chown _openqa-worker /etc/openqa/client.conf
chmod 640 /etc/openqa/client.conf

# configure /etc/openqa/client.conf and /etc/openqa/workers.ini, then enable the desired number of worker slots, e.g.:
systemctl enable --now openqa-worker-auto-restart@{1..30}.service openqa-reload-worker-auto-restart@{1..30}.path openqa-auto-update.timer openqa-continuous-update.timer openqa-worker-cacheservice.service openqa-worker-cacheservice-minion.service

# on aarch64 do something about hugepages permissions (see https://progress.opensuse.org/issues/53234)
echo "hugetlbfs /dev/hugepages hugetlbfs defaults,gid=kvm,mode=1775,pagesize=1G" >> /etc/fstab

# add this to kernel cmdline via /etc/default/grub (& update-bootloader) on aarch64:
# showopts nospec spectre_v2=off pti=off kpti=off mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 crashkernel=167M enforcing=0
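
To confirm after a reboot that the kernel parameters from the line above are actually active, a check along these lines could be used. This is only a sketch: the helper name `missing_kernel_params` is made up for illustration, and the expected list is reduced to the hugepage-related parameters.

```shell
# Print every expected parameter that is absent from the given command line.
# $1 is the full kernel command line, the remaining arguments are the
# parameters to look for (hypothetical helper, not part of openQA).
missing_kernel_params() {
    cmdline="$1"
    shift
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;                  # exact, space-delimited match
            *) printf '%s\n' "$param" ;;
        esac
    done
}

# Compare the running kernel against the hugepage settings listed above.
missing=$(missing_kernel_params "$(cat /proc/cmdline 2>/dev/null)" \
    default_hugepagesz=1G hugepagesz=1G hugepages=64)
if [ -n "$missing" ]; then
    echo "missing from running kernel:" $missing
else
    echo "all expected hugepage parameters are active"
fi
```

If parameters are reported as missing even though /etc/default/grub contains them, the bootloader configuration was most likely not regenerated or the machine was not rebooted yet.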
Actions #19

Updated by okurz 6 months ago

  • Status changed from Resolved to Workable

Please ensure "AC3: On top of AC1 machine is successfully working on o3 jobs" is covered. https://openqa.opensuse.org/tests/3680520#step/patch_agama/6 looks like tests are executing too slowly, maybe due to non-optimized virtualization or emulation. If in doubt, please ask ggardet in irc://libera.chat/opensuse-factory

dheidler wrote in #note-18:

https://openqa.opensuse.org/tests/3680520

This is what I did after installing the OS:
[…]

Did you actually reinstall the system? If yes, why?

And I see that the system has worker classes "WORKER_CLASS = aarch64-o3,qemu_aarch64,qemu_aarch64_lse". Didn't it have 32-bit ones before? See for example https://openqa.opensuse.org/admin/workers/158 , one instance from the same host before the move.

Actions #20

Updated by ggardet_arm 6 months ago

okurz wrote in #note-19:

And I see that the system has worker classes "WORKER_CLASS = aarch64-o3,qemu_aarch64,qemu_aarch64_lse". Didn't it have 32-bit ones before? See for example https://openqa.opensuse.org/admin/workers/158 , one instance from the same host before the move.

That indeed looks wrong. If this is the D05 machine (as I understand it), it should have: d05,qemu_aarch64,qemu_aarch64_no_lse,qemu_aarch32,tap

Actions #21

Updated by dheidler 6 months ago

okurz wrote in #note-19:

Did you actually reinstall the system? If yes, why?

Because the unresponsive Num Lock LED indicated that the kernel was not booting up.

And I see that the system has worker classes "WORKER_CLASS = aarch64-o3,qemu_aarch64,qemu_aarch64_lse". Didn't it have 32-bit ones before? See for example https://openqa.opensuse.org/admin/workers/158 , one instance from the same host before the move.

Installed qemu-uefi-aarch32 and added the worker class for 32-bit, so 32-bit jobs should run now. Not adding tap due to networking constraints.
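
Whether a given class is really configured can be checked with something like the following. It is a rough sketch only: the helper names are made up, it looks at the first uncommented WORKER_CLASS line and ignores per-slot sections, and the class names are the ones mentioned in this ticket.

```shell
# Print the comma-separated WORKER_CLASS value from a workers.ini-style file.
# Simplification: only the first uncommented WORKER_CLASS line is considered.
worker_classes() {
    sed -n 's/^[[:space:]]*WORKER_CLASS[[:space:]]*=[[:space:]]*//p' "$1" \
        | head -n 1 | tr -d ' \t'
}

# Succeed if class $2 is one of the classes configured in file $1.
# Wrapping both sides in commas avoids matching qemu_aarch64 against
# qemu_aarch64_no_lse.
has_worker_class() {
    case ",$(worker_classes "$1")," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

ini=/etc/openqa/workers.ini
if [ -r "$ini" ]; then
    for class in qemu_aarch64 qemu_aarch32; do
        if has_worker_class "$ini" "$class"; then
            echo "$class: configured"
        else
            echo "$class: missing"
        fi
    done
fi
```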

Actions #22

Updated by dheidler 6 months ago

  • Status changed from Workable to Resolved

okurz wrote in #note-19:

Please ensure "AC3: On top of AC1 machine is successfully working on o3 jobs" is covered. https://openqa.opensuse.org/tests/3680520#step/patch_agama/6 looks like tests are executing too slowly, maybe due to non-optimized virtualization or emulation. If in doubt, please ask ggardet in irc://libera.chat/opensuse-factory

https://openqa.opensuse.org/tests/3677478#step/agama/3 already had the same error, and that machine is only twice as fast at executing that test. Without KVM we would be much slower than just 50% of that speed.

This machine might have been overloaded as 30 worker slots were enabled.
I reduced the number of slots to 16, as was the case before on this machine.
I also did some checks and KVM seems to work. QEMU doesn't complain about a missing /dev/kvm either, because it is present.
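
For reference, going from 30 to 16 slots could look roughly like this. It is a sketch under the assumption that the slots were enabled with the unit names from the `systemctl enable` command in #note-18. The helper `worker_units` is made up for illustration and the actual `systemctl` call is left commented out.

```shell
# Print the per-slot unit names for worker slots $1..$2, one per line.
# Unit name patterns are taken from the enable command in #note-18.
worker_units() {
    slot="$1"
    while [ "$slot" -le "$2" ]; do
        printf 'openqa-worker-auto-restart@%s.service\n' "$slot"
        printf 'openqa-reload-worker-auto-restart@%s.path\n' "$slot"
        slot=$((slot + 1))
    done
}

# Dry run: list the units belonging to the surplus slots 17..30.
worker_units 17 30

# To apply as root (not executed here):
# systemctl disable --now $(worker_units 17 30)
```

Printing the unit list first makes it easy to review what would be stopped before anything is changed on the worker.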

Actions #23

Updated by ggardet_arm 6 months ago

Despite KVM and huge pages, the system looks quite slow.
There are lots of failures on 32-bit tests, such as https://openqa.opensuse.org/tests/3687863#step/bootloader_uefi/3
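
A quick way to double-check the two prerequisites named above is sketched below. The helper names are hypothetical, and the script only reports whether /dev/kvm is accessible and what the kernel says about huge pages. It cannot tell whether QEMU actually uses them.

```shell
# Succeed if $1 is a character device that the current user can read and write.
kvm_device_ok() {
    [ -c "$1" ] && [ -r "$1" ] && [ -w "$1" ]
}

# Print "<total> <free> <page size in kB>" from a meminfo-style file.
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free = $2 }
        /^Hugepagesize:/    { size = $2 }
        END { print total, free, size }
    ' "$1"
}

if kvm_device_ok /dev/kvm; then
    echo "/dev/kvm is usable by $(id -un)"
else
    echo "/dev/kvm is missing or not accessible for $(id -un)"
fi

if [ -r /proc/meminfo ]; then
    echo "hugepages (total free size_kB): $(hugepage_summary /proc/meminfo)"
fi
```

Running it as the `_openqa-worker` user rather than root is more telling, since that is the account the worker slots run under.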

Actions #24

Updated by ggardet_arm 6 months ago

  • Status changed from Resolved to Workable
Actions #25

Updated by dheidler 6 months ago

Removed preempt=full from kernel cmdline - let's see what will happen.
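
One possible way to script such a removal is sketched below. It assumes the parameter lives in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and that GNU sed is available. The helper name and the sample command line are made up, and the demo only touches a temporary file.

```shell
# Remove one parameter ($1) from the GRUB_CMDLINE_LINUX_DEFAULT line of file $2.
# Uses GNU sed extensions (-i, -E, \b) and matches on word boundaries.
remove_kernel_param() {
    sed -i -E "/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/[[:space:]]*\\b$1\\b//" "$2"
}

# Demo on a temporary file with a hypothetical command line.
demo=$(mktemp)
printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="showopts preempt=full mitigations=off"' > "$demo"
remove_kernel_param preempt=full "$demo"
cat "$demo"
rm -f "$demo"

# To apply for real as root (not executed here), followed by a reboot:
# remove_kernel_param preempt=full /etc/default/grub && update-bootloader
```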

Actions #26

Updated by okurz 6 months ago

@dheidler it seems you have overlooked some comments and open questions as well as the ticket description. Please only resolve after covering all points, e.g. https://progress.opensuse.org/issues/136133#Rollback-actions

Actions #27

Updated by okurz 6 months ago

  • Tags changed from infra, nue1, nue2, nue3, next-frankencampus-visit to infra, nue1, nue2, nue3
Actions #28

Updated by dheidler 6 months ago

  • Status changed from Workable to Feedback
Actions #29

Updated by dheidler 6 months ago

  • Status changed from Feedback to Resolved
Actions #30

Updated by okurz 6 months ago

  • Status changed from Resolved to Feedback

Please show successful aarch32 jobs for verification. And https://gitlab.suse.de/openqa/monitor-o3/-/pipeline_schedules shows that the last job failed.

Actions #31

Updated by dheidler 6 months ago

  • Status changed from Feedback to Resolved
Actions #32

Updated by okurz 6 months ago

  • Copied to action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M added