action #90275

Replacement openQA OSD aarch64 hardware (was: Dedicated non-rpi aarch64 hardware for manual testing)

Added by tjyrinki_suse 7 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date:
Due date: 2021-09-17
% Done: 0%
Estimated time:

Description

Motivation

We have had a machine from orthos, but we need to renew the lease every now and then or lose it. It would be easier to have a more permanent solution, e.g. for snapshot validations.

openQA tests aarch64 on KVM; this ticket is about improving bare-metal aarch64 testing. We were not able to test RC1 on bare-metal (non-rpi) aarch64 due to lack of access to hardware (e.g. problems on thunderx10).

Acceptance criteria

  • AC1: DONE Make new arm machines available
  • AC2: openqaworker-arm-4 is salted and processing jobs
  • AC3: openqaworker-arm-5 is salted and processing jobs

Related issues

Related to openQA Infrastructure - action #98661: Tweak worker numbers for openqaworker-arm-4 and arm-5 (New, 2021-09-15)

Related to openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M (Resolved, 2021-09-26)

History

#1 Updated by tjyrinki_suse 7 months ago

Two arm64 hardware machine orders are already in the pipeline.

One has been ordered as a replacement for arm1 and arm2 that were used by QAM (but are no longer available).
The second one is for openQA.
There would also be a need for this kind of validation work, with a machine more closely resembling what customers would use.

Heiko will talk to Andreas about the status of plans for orthos as well.

#2 Updated by tjyrinki_suse 7 months ago

  • Status changed from New to In Progress

#3 Updated by tjyrinki_suse 7 months ago

  • Status changed from In Progress to New
  • Assignee set to hrommel1

#4 Updated by okurz 7 months ago

As stated in the meeting I suggest not aiming for "permanent reservations" as this tends to cause forgotten, unused hardware. Either ask the arch-team for a dedicated, special reservation without expiration or use automation to extend reservations. See https://github.com/okurz/scripts/blob/master/extend-orthos-reserve for a way to automate orthos reservations.
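One way to follow that suggestion without a permanent reservation would be a cron job calling the linked script; a minimal sketch, assuming the script was installed to /usr/local/bin (path and schedule are assumptions, not from this ticket):

```
# hypothetical crontab entry: re-extend the orthos reservation every night
# so the lease never silently expires (install path is an assumption)
0 4 * * * /usr/local/bin/extend-orthos-reserve
```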

#5 Updated by okurz 7 months ago

  • Status changed from New to In Progress

"in progress", waiting for feedback from hrommel1

#6 Updated by pcervinka 6 months ago

tjyrinki_suse: This is about manual testing and I'm not sure what its purpose is, as the description doesn't contain specific details, but we have one arm machine for kernel testing. We do automatic installation, basic LTP and kdump (in development). The machine is not ready and never will be ready for stable validation (hardware instability, firmware issues, a CPU already unsupported by the vendor...), but it is good enough and maybe better than a random orthos machine for our purpose.

Also I'm not sure what exactly you want to test for 15-SP3 RCx; the arm stakeholders interested in bare-metal testing wanted installation and simple smoke tests, which we already do.

If you want, we can agree on a time and you can try to test what you want. The machine usually contains the latest SLE 15-SP3 build, so we can start it up and you can connect.

#7 Updated by maritawerner 6 months ago

3 new machines have arrived: one for manual tests, two for openQA. Heiko will ask Calen to set up the machines; FC is a problem here.

#8 Updated by okurz 5 months ago

maritawerner wrote:

3 new machines have arrived: one for manual tests, two for openQA. Heiko will ask Calen to set up the machines; FC is a problem here.

  1. can you provide more details and an updated status? I am especially interested in "two for openQA".
  2. who takes care that the machines end up in server rooms?
  3. is there an EngInfra ticket that can be referenced?

The "SUSE QE Tools" team can take over as soon as the according machines are accessible over IPMI+SSH

#9 Updated by maritawerner 5 months ago

@heiko? Could you please add the info?

#10 Updated by okurz 5 months ago

nicksinger you are involved with FSP+ and something like 10G switch for ARM machines, right? I guess this ticket is about the same. If yes, can you link a ticket if you have one or take this ticket and update it?

#11 Updated by okurz 4 months ago

mgriessmeier can you help here what's the status and estimates about the ARM machines, FSP+ 10G switches, etc.?

#12 Updated by mgriessmeier 4 months ago

The machines are in the Nuremberg office, conference room London, and switch space is clarified. What still needs to be done: order the FSP adapters (a quote has already been requested from Delta) and open a ticket with EngInfra for mounting the machines.

#13 Updated by mgriessmeier 4 months ago

I just opened the infra ticket, so at least the machines could be already moved: https://infra.nue.suse.com/Ticket/Display.html?id=191515

#14 Updated by okurz 4 months ago

mgriessmeier wrote:

"quote is already requested from Delta"

I assume that means that either you or nsinger would get a response from them and then order, right?

#15 Updated by mgriessmeier 4 months ago

okurz wrote:

mgriessmeier wrote:

"quote is already requested from Delta"

I assume that means that either you or nsinger would get a response from them and then order, right?

yes.
However, meanwhile I got a response in the EngInfra ticket stating that they cannot move the servers there in the next two weeks because they are overloaded with urgent inventory work in SRV1 as well as ongoing network issues.

#16 Updated by okurz 3 months ago

  • Due date set to 2021-08-31
  • Status changed from In Progress to Blocked
  • Assignee changed from hrommel1 to mgriessmeier

The situation is more severe: as updated in https://infra.nue.suse.com/SelfService/Display.html?id=191515#txn-2941755, one server room now has an AC problem which will delay other work.

mgriessmeier, as you reported https://infra.nue.suse.com/SelfService/Display.html?id=191515 you will receive notifications if there are updates. Hence I am assigning this ticket to you; please update us if there is something new. Thanks in advance :)

#17 Updated by mgriessmeier 3 months ago

sorry - I forgot to update the ticket, so let me reflect the current situation as of yesterday:

  • due to the AC failure in SRV2, the machines could not be mounted there. There is no ETA for when this can happen
  • Yesterday (August 28th) I was in the office with nsinger to mitigate this and we set up a temporary solution for these and potentially more servers to come.
  • so both machines are currently mounted in the big QA Lab, connected to the network, and waiting to be installed, which Nick planned to do today. As soon as this is done we can connect them to openQA

#18 Updated by jlausuch 3 months ago

I have a question: are these machines to be used only for on-demand manual testing, or can we automate jobs on them?

#19 Updated by maritawerner 3 months ago

These two machines will be added to openQA and can then be used for all jobs. But in addition to the two machines, Heiko has a third aarch64 machine that he has reserved for manual tests. I am not very familiar with manual tests in QAM, but I think there is a "pool of HW for manual tests" where this machine is/will be added.

#20 Updated by jlausuch 3 months ago

maritawerner wrote:

These two machines will be added to openQA and can then be used for all jobs. But in addition to the two machines, Heiko has a third aarch64 machine that he has reserved for manual tests.

Ok, thanks for the update. Anyway, we will keep using the qemu_aarch64 worker class, so if they are added to that pool we won't need to do anything for JeOS.

#21 Updated by okurz 3 months ago

  • Status changed from Blocked to Feedback

https://infra.nue.suse.com/SelfService/Display.html?id=191515 is resolved so now it's waiting for installation

#22 Updated by mgriessmeier 3 months ago

  • Assignee changed from mgriessmeier to nicksinger

Hi, so the machines are sitting in a rack in the big QA Lab and are ready to be installed.

Details:

      # eth0
      host openqaworker-arm-4; 18:C0:4D:8C:82:8E; 10.162.6.200
      host openqaworker-arm-5; 18:C0:4D:06:CE:57; 10.162.6.201

      # IPMI
      host openqaworker-arm-4-sp; 18:C0:4D:8C:82:90; 10.162.6.210
      host openqaworker-arm-5-sp; 18:C0:4D:06:CE:59; 10.162.6.211

#23 Updated by cdywan 3 months ago

  • Project changed from qam-qasle-collaboration to openQA Infrastructure
  • Target version set to Ready

Moving the ticket so that it shows up on the Tools backlog

#24 Updated by cdywan 3 months ago

  • Status changed from Feedback to Workable

#25 Updated by cdywan 3 months ago

  • Description updated (diff)

#26 Updated by nicksinger 2 months ago

Trying to boot the machine over PXE shows that the qanet PXE setup for aarch64 is outdated and not working. I added the following config to the grub.cfg on qanet:

submenu 'Leap 15.3' {
  menuentry 'Leap-15.3-installer' {
    echo 'Setting append...'
    set append='usessh=1 sshpassword=linux network=1 install=http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/ console=ttyAMA0,115200n8'
    echo 'Done!'
    echo 'Loading kernel...'
    linux (http,download.opensuse.org)/ports/aarch64/distribution/leap/15.3/repo/oss/boot/aarch64/linux $append
    echo 'Done!'
    echo 'Loading initrd...'
    initrd (http,download.opensuse.org)/ports/aarch64/distribution/leap/15.3/repo/oss/boot/aarch64/initrd
    echo 'Done!'
  }
}

#27 Updated by nicksinger 2 months ago

  • Status changed from Workable to In Progress

Loading the files over network/HTTP does not seem to work; grub complains that it doesn't receive a DNS reply to resolve download.opensuse.org. I found https://bugzilla.redhat.com/show_bug.cgi?id=860829 - it hinted that grub got some patches in the meantime, so I loaded the most recent grub binary from http://download.opensuse.org/distribution/leap/15.3/repo/oss/noarch/grub2-arm64-efi-2.04-20.4.noarch.rpm and tried that version. Unfortunately the error still persists despite the network configuration being successful.

#28 Updated by nicksinger 2 months ago

I now downloaded the kernel+initrd on our tftp server directly and changed grub.cfg to:

    echo 'Loading kernel...'
    linux aarch64/leap15.3/linux $append
    echo 'Done!'
    echo 'Loading initrd...'
    initrd aarch64/leap15.3/initrd
    echo 'Done!'

The kernel loads successfully. However the initrd times out - most likely because it is too big. This is what grub tells me at boot:

Setting append...
Done!
Loading kernel...
Done!
Loading initrd...
error: ../../grub-core/net/net.c:1716:timeout reading
`aarch64/leap15.3/initrd'.
Done!

looking for a way to increase that timeout now

#29 Updated by nicksinger 2 months ago

Okay, I can't find anywhere this timeout can be adjusted. I also tried several other methods of loading the initrd, e.g. supplying the IP of download.opensuse.org directly, but no success. I also tried to "mount" an ISO directly over the web UI of the BMC, but this also just failed with error messages that the ISO could not be mounted.

I'm a little bit out of ideas now as to what could be done to boot these systems. Maybe I'll contact somebody from buildops or EngInfra to ask how they boot aarch64 machines. Another idea would be to compile iPXE for aarch64. For this I would need access to a functional aarch64 machine, so I might hijack one of our openQA workers to do this.

#30 Updated by nicksinger about 2 months ago

I was able to cross-compile a working version of iPXE with make ARCH=arm64 CROSS_COMPILE=aarch64-unknown-linux-gnu- -j$(nproc) bin-arm64-efi/ipxe.efi EMBED=../myscript.ipxe. myscript.ipxe contains the following:

#!ipxe
dhcp

kernel http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/boot/aarch64/linux usessh=1 sshpassword=linux network=1 install=http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/ console=ttyAMA0,115200n8 root=/dev/ram0 initrd=initrd textmode=1
initrd http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/boot/aarch64/initrd
boot
shell

This allowed me to boot and install Leap 15.3 (currently running) on openqaworker-arm-5.
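For reference, handing the cross-compiled binary to the machines over PXE would typically need a DHCP/TFTP rule along these lines. This is a sketch assuming a dnsmasq-based server, which is an assumption about qanet; the TFTP root and file path are also assumptions:

```
# hypothetical dnsmasq fragment: serve the aarch64 iPXE build to
# UEFI ARM64 PXE clients (DHCP client architecture 11, per RFC 4578)
enable-tftp
tftp-root=/srv/tftp
dhcp-match=set:efi-arm64,option:client-arch,11
dhcp-boot=tag:efi-arm64,aarch64/ipxe.efi
```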

#31 Updated by okurz about 2 months ago

And where did you put these files? Can you also say where the grub.cfg file on qanet resides? Is this something that could/should also be handled in salt?

#32 Updated by nicksinger about 2 months ago

Everything related to these experiments is located on qanet in the folder /srv/tftp/aarch64. And yes, a proper PXE setup for all arches should be covered by salt. However, I don't think that manually tinkering with bootloaders and having years-old scripts to generate their configs is the right approach. I have yet to find a tool which can handle all this more smoothly. Maybe cobbler is an option, whose configs could then be managed by salt.

#33 Updated by nicksinger about 2 months ago

There was an IP collision with another host making arm-4's BMC unreachable. It was resolved by https://gitlab.suse.de/qa-sle/qanet-configs/-/commit/139354ea37851407deb0570362d96ed87ef8f1d9, and overnight the BMC requested another lease and is reachable again now. I have started an installation on it.
In the meantime I am trying to set up salt on arm-5 but am currently struggling with salt detecting the host as "QA-Power8-4-kvm.qa.suse.de" (which is obviously wrong). I tried to add a proper reverse entry for those arms with https://gitlab.suse.de/qa-sle/qanet-configs/-/commit/0b12e2ab0eb025514e14612963a22e90a1ec12fd but it is still wrongly reported in salt-key on OSD.

#34 Updated by nicksinger about 2 months ago

Apparently the minion_id on the minion was wrongly generated due to some cached DNS entries. I removed the wrong key in /etc/salt/pki/master/minions_denied on OSD, deleted /etc/salt/minion_id on arm-5 and started the minion again. This time it correctly connects to OSD as openqaworker-arm-5.qa.suse.de.

#35 Updated by nicksinger about 2 months ago

The highstate into the worker role seems to have worked - almost. One issue:

----------
          ID: firewalld_zones
    Function: file.managed
        Name: /etc/firewalld/zones/trusted.xml
      Result: False
     Comment: Unable to manage file: Jinja variable 'dict object' has no attribute 'openqaworker-arm-5'
     Started: 14:19:43.186399
    Duration: 136.135 ms
     Changes:
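A common way to make a state like this survive hosts that have no entry yet is a guarded pillar lookup in the template. A sketch only, with assumed key names - this is not the actual salt-states-openqa code:

```
{# hypothetical guard: fall back to an empty dict instead of failing the
   render with "dict object has no attribute '<host>'" #}
{% set host_cfg = salt['pillar.get']('workerconf:' ~ grains['host'], {}) %}
```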

#36 Updated by cdywan about 2 months ago

  • Due date changed from 2021-08-31 to 2021-09-03

Since you're clearly working on this and providing incremental updates, I'm just going to bump the due date to the end of the week

#37 Updated by nicksinger about 2 months ago

  • Due date changed from 2021-09-03 to 2021-09-11

Yesterday I realized that salt needs to be fixed to properly deploy new servers and get working alerts. Also, I just finished installing arm-4, so I need some more time on this.

#38 Updated by nicksinger about 2 months ago

Work has been done to fix salt in e.g. https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/572, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/569, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/570 and https://progress.opensuse.org/issues/98243 - this allows our highstate to fully complete again, hence making telegraf+monitoring work properly on new hosts. I am now looking into the final deployment of openqaworker-arm-4.

#39 Updated by cdywan about 1 month ago

  • Due date changed from 2021-09-11 to 2021-09-17

Next step: put a low number of workers in salt, see that at least one job succeeds, and file a follow-up ticket for further refinement.

#40 Updated by nicksinger about 1 month ago

I added the workerconf in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/353 with a really small number of workers for now to see if these machines work as expected. Further tweaking of these instance numbers should be handled in https://progress.opensuse.org/issues/98661 #98661

#41 Updated by nicksinger about 1 month ago

  • Status changed from In Progress to Feedback

#42 Updated by okurz about 1 month ago

I suggest always referencing other tickets with the format #<id> for better linking and preview. If you would like both a valid URL in email updates and useful text on the web interface, you can write both the URL and #<id>.

#43 Updated by okurz about 1 month ago

  • Related to action #98661: Tweak worker numbers for openqaworker-arm-4 and arm-5 added

#45 Updated by maritawerner about 1 month ago

Thanks a lot! That is good news!

#47 Updated by okurz about 1 month ago

  • Subject changed from Dedicated non-rpi aarch64 hardware for manual testing to Replacement openQA OSD aarch64 hardware (was: Dedicated non-rpi aarch64 hardware for manual testing)
  • Status changed from Feedback to Resolved

Added https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/581 for full integration in monitoring following https://progress.opensuse.org/projects/openqav3/wiki/#Setup-guide-for-new-machines

Now I would like to come back to the original request about "manual testing hardware". For that I created #98859 within the "qam-qasle-collaboration" project. So let's focus this ticket here on the new OSD workers and #98859 on any potential manual hardware needs.

For OSD aarch64, workers openqaworker-arm-1, openqaworker-arm-2 and openqaworker-arm-3 are still up and running, though with the known instabilities and limitations. So we could benefit from at least one more aarch64 machine, but I recommend not replacing them all at once, hence not ordering a new aarch64 replacement right now (though it would likely take multiple months anyway to receive anything). With this I see all ACs fulfilled and the ticket as resolved.

#48 Updated by okurz about 1 month ago

  • Status changed from Resolved to Feedback

Wait, sorry, this was premature. The machines do not yet use production worker classes. I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/357 now

#49 Updated by okurz about 1 month ago

  • Status changed from Feedback to Resolved

tests passed on production worker classes, e.g. https://openqa.suse.de/tests/7181326 on openqaworker-arm-4:1

#50 Updated by okurz about 1 month ago

  • Related to action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M added
