action #62162

closed

Move one openqa worker machine from osd to o3

Added by okurz almost 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 2020-01-15
Due date:
% Done: 0%
Estimated time:

Description

okurz: @coolo The day before yesterday I learned from Marita that hmuelle originally ordered new openQA machines with the plan to give three to openSUSE, but only two ended up there. Given that, according to my observation, x86_64 workers in osd are mostly idling while o3 is not, I would ask EngInfra to move one machine to o3. WDYT?
coolo: it might be there before August then. But I agree, better put 'the old ones' to better use.


Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #62849: broken NVMe on openqaworker4 auto_review:"No space left on device" (Status: Resolved, Assignee: okurz, Start date: 2020-01-31, Due date: 2020-03-10)

Actions #1

Updated by okurz almost 5 years ago

  • Related to action #62849: broken NVMe on openqaworker4 auto_review:"No space left on device" added
Actions #2

Updated by okurz almost 5 years ago

  • Subject changed from "Move on openqaworker from osd to o3" to "Move one openqa worker machine from osd to o3"

This gets a bit more pressing with #62849. It is not exactly clear what needs to be done to move machines between o3 and osd, but physically the machines live in the same racks, mainly https://racktables.nue.suse.com/index.php?page=rack&rack_id=193, so it potentially needs some switch configuration changes.

Actions #3

Updated by okurz over 4 years ago

  • Status changed from New to Feedback
  • Assignee set to okurz

Created https://infra.nue.suse.com/SelfService/Display.html?id=16458 to have the machine openqaworker7 assigned to o3.

Actions #4

Updated by okurz over 4 years ago

Actually the move has been conducted by mmaher and should be revertible with a simple switch update. https://racktables.nue.suse.com/index.php?page=object&object_id=4994 does not reflect this yet and probably needs a manual update.

Remove from osd:

salt-key -y -d openqaworker7.suse.de

IPMI access to the machine still works as before. I connected and rebooted the machine. At this point it still needs an updated network config though. The osd worker machines are configured for DHCP. On o3, openqaworker1, openqaworker4, aarch64 and rebel have statically configured IPv4 addresses, but I would prefer to use DHCP, which we also use for power8 and imagetester.

  • Added an entry on o3 to /etc/dnsmasq.d/openqa.conf:
dhcp-host=54:ab:3a:24:34:b8,openqaworker7
  • Added an entry to /etc/hosts on o3, which dnsmasq should pick up to hand out a DHCP lease:
192.168.112.12   openqaworker7.openqanet.opensuse.org openqaworker7
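  • Restarted dnsmasq with systemctl restart dnsmasq
  • Restarted the network on openqaworker7 (over IPMI) using systemctl restart network while monitoring journalctl -f -u dnsmasq on o3. The first request for the old OSD address 10.160.1.101 is rejected as "wrong network", then the new address is assigned: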
Feb 29 10:48:30 ariel dnsmasq[28105]: read /etc/hosts - 30 addresses
Feb 29 10:48:54 ariel dnsmasq-dhcp[28105]: DHCPREQUEST(eth1) 10.160.1.101 54:ab:3a:24:34:b8
Feb 29 10:48:54 ariel dnsmasq-dhcp[28105]: DHCPNAK(eth1) 10.160.1.101 54:ab:3a:24:34:b8 wrong network
Feb 29 10:49:10 ariel dnsmasq-dhcp[28105]: DHCPDISCOVER(eth1) 54:ab:3a:24:34:b8
Feb 29 10:49:10 ariel dnsmasq-dhcp[28105]: DHCPOFFER(eth1) 192.168.112.12 54:ab:3a:24:34:b8
Feb 29 10:49:10 ariel dnsmasq-dhcp[28105]: DHCPREQUEST(eth1) 192.168.112.12 54:ab:3a:24:34:b8
Feb 29 10:49:10 ariel dnsmasq-dhcp[28105]: DHCPACK(eth1) 192.168.112.12 54:ab:3a:24:34:b8 openqaworker7
  • Changed the root password to the o3 one
  • Added my ssh key to openqaworker7:/root/.ssh/authorized_keys
  • Updated /etc/openqa/client.conf with the same key as used on other workers for "openqa1-opensuse"
  • Updated /etc/openqa/workers.ini with a config similar to the one used on other workers, e.g. openqaworker4:
# diff -Naur /etc/openqa/workers.ini{.osd,}
--- /etc/openqa/workers.ini.osd 2020-02-29 15:21:47.737998821 +0100
+++ /etc/openqa/workers.ini     2020-02-29 15:22:53.334464958 +0100
@@ -1,17 +1,10 @@
-# This file is generated by salt - don't touch
-# Hosted on https://gitlab.suse.de/openqa/salt-pillars-openqa
-# numofworkers: 10
-
 [global]
-HOST=openqa.suse.de
-CACHEDIRECTORY=/var/lib/openqa/cache
-LOG_LEVEL=debug
-WORKER_CLASS=qemu_x86_64,qemu_x86_64_staging,tap,openqaworker7
-WORKER_HOSTNAME=10.160.1.101
-
-[1]
-WORKER_CLASS=qemu_x86_64,qemu_x86_64_staging,tap,qemu_x86_64_ibft,openqaworker7
+HOST=http://openqa1-opensuse
+WORKER_HOSTNAME=192.168.112.12
+CACHEDIRECTORY = /var/lib/openqa/cache
+CACHELIMIT = 50
+WORKER_CLASS = openqaworker7,qemu_x86_64

-[openqa.suse.de]
-TESTPOOLSERVER = rsync://openqa.suse.de/tests
+[http://openqa1-opensuse]
+TESTPOOLSERVER = rsync://openqa1-opensuse/tests
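The o3-side DHCP entries and the new worker config above follow a fixed pattern per machine. Per the dnsmasq man page, an address from /etc/hosts is only used for a DHCP lease if a dhcp-host entry with the same name exists, which is why both entries are needed. A minimal sketch that prints all three fragments from a hostname, MAC and IP, assuming the o3 conventions shown in this comment (domain openqanet.opensuse.org, webUI alias openqa1-opensuse, CACHELIMIT 50); the function name render_o3_worker_config is hypothetical and it only prints text, it does not edit any files:

```shell
#!/bin/sh
# Hypothetical helper: print the dnsmasq, /etc/hosts and workers.ini
# fragments for one o3 worker. Values mirror the openqaworker7 example
# above; nothing is written to disk.
render_o3_worker_config() {
    name=$1   # short hostname, e.g. openqaworker7
    mac=$2    # MAC address dnsmasq should match
    ip=$3     # address dnsmasq hands out via the /etc/hosts entry
    webui=http://openqa1-opensuse

    echo "# o3: /etc/dnsmasq.d/openqa.conf"
    echo "dhcp-host=$mac,$name"
    echo "# o3: /etc/hosts"
    printf '%s   %s.openqanet.opensuse.org %s\n' "$ip" "$name" "$name"
    echo "# worker: /etc/openqa/workers.ini"
    cat <<EOF
[global]
HOST=$webui
WORKER_HOSTNAME=$ip
CACHEDIRECTORY = /var/lib/openqa/cache
CACHELIMIT = 50
WORKER_CLASS = $name,qemu_x86_64

[$webui]
TESTPOOLSERVER = rsync://openqa1-opensuse/tests
EOF
}

render_o3_worker_config openqaworker7 54:ab:3a:24:34:b8 192.168.112.12
```

The output can be compared against the live files by hand before copying anything over; for another machine only the three arguments change.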

Unlike openqaworker1 and openqaworker4, the machine is not a transactional-server, so there are the following options: (1) keep it as is and handle it like power8 (also not transactional), (2) enable transactional updates without the root filesystem being read-only, (3) switch the root filesystem to read-only on the fly, (4) reinstall as transactional-server. I will try option 2 first. So what I did:

  • Remove OSD specifics
systemctl disable --now auto-update.timer salt-minion telegraf
for i in  NPI SUSE_CA telegraf-monitoring; do zypper rr $i; done
zypper -n dup --force-resolution --allow-vendor-change
  • enable transactional updates
zypper -n in transactional-update
systemctl enable --now transactional-update.timer rebootmgr
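The two groups of commands above can be collected into one repeatable script for the next machine. A minimal sketch that wraps them in a dry-run helper; the function names run and osd_to_o3_cleanup and the DRY_RUN switch are hypothetical additions, and the repository aliases NPI, SUSE_CA and telegraf-monitoring are the ones present on this machine, so other workers may differ:

```shell
#!/bin/sh
# Sketch of the "remove OSD specifics" and "enable transactional updates"
# steps. With DRY_RUN=1 (the default) each command is only printed;
# set DRY_RUN=0 to execute them as root on the worker.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

osd_to_o3_cleanup() {
    # Remove OSD specifics: salt/telegraf services and OSD-only repositories
    run systemctl disable --now auto-update.timer salt-minion telegraf
    for repo in NPI SUSE_CA telegraf-monitoring; do
        run zypper rr "$repo"
    done
    run zypper -n dup --force-resolution --allow-vendor-change

    # Option 2: transactional updates while root stays read-write
    run zypper -n in transactional-update
    run systemctl enable --now transactional-update.timer rebootmgr
}

osd_to_o3_cleanup
```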

As the latest openQA job finished successfully, I enabled more worker instances:

systemctl unmask openqa-worker@{2..14} && systemctl enable --now openqa-worker@{2..14}
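The argument openqa-worker@{2..14} relies on bash brace expansion and addresses the 13 instances 2 to 14, leaving instance 1 untouched. A POSIX sh sketch that builds the same unit list, e.g. to review which units will be touched before running systemctl; the helper name worker_units is hypothetical:

```shell
#!/bin/sh
# Print the systemd unit names for openQA worker instances FIRST..LAST,
# equivalent to bash's openqa-worker@{FIRST..LAST}.
worker_units() {
    i=$1
    while [ "$i" -le "$2" ]; do
        printf 'openqa-worker@%s\n' "$i"
        i=$((i + 1))
    done
}

worker_units 2 14
# To act on them (as root), e.g.:
#   systemctl unmask $(worker_units 2 14) && systemctl enable --now $(worker_units 2 14)
```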

I will monitor whether jobs run fine and whether the nightly update works.

EDIT: Some tests ended up incomplete due to missing staging UEFI images again, see #63382.

EDIT: 2020-03-01: Automatic updates+reboot worked:

Mar 01 00:08:26 openqaworker7 transactional-update[10933]: Calling zypper up
…
Mar 01 00:08:51 openqaworker7 transactional-update[10933]: transactional-update finished - informed rebootmgr
Mar 01 00:08:51 openqaworker7 systemd[1]: Started Update the system.
…
Mar 01 03:30:00 openqaworker7 rebootmgrd[40760]: rebootmgr: reboot triggered now!
…
Mar 01 03:36:32 openqaworker7 systemd[1]: Reached target openQA Worker.
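To confirm on later nights that the update and reboot cycle completed, one can look for the three milestones in the log excerpt above: transactional-update finishing, rebootmgr triggering the reboot, and the openQA Worker target being reached again. A minimal sketch that scans journal text on stdin for them in that order; the function name check_update_cycle is hypothetical and the match strings are assumptions taken from the excerpt above:

```shell
#!/bin/sh
# Read journal lines on stdin and check that one nightly cycle completed,
# in order: update finished -> reboot triggered -> openQA Worker target
# reached. A later "finished" line starts a new cycle, so the most recent
# update is what gets judged. Exit status 0 means complete.
check_update_cycle() {
    awk '
        /transactional-update finished/                { stage = 1 }
        stage == 1 && /reboot triggered/               { stage = 2 }
        stage == 2 && /Reached target openQA Worker/   { stage = 3 }
        END {
            if (stage == 3) { print "update cycle complete"; exit 0 }
            print "update cycle incomplete (stage " (stage + 0) ")"
            exit 1
        }
    '
}
```

On the worker it could be fed with e.g. journalctl --since yesterday | check_update_cycle.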

Keep in mind: the system is still configured for multi-machine tests from its OSD time, but I have neither tested nor enabled the worker classes with "tap". Also the firewall and AppArmor configuration might still differ.

  • Enable apparmor
zypper -n in apparmor-utils
systemctl unmask apparmor
systemctl enable --now apparmor
  • Switch firewall from SuSEfirewall2 to firewalld
zypper -n in firewalld && zypper -n rm SuSEfirewall2
systemctl enable --now firewalld
firewall-cmd --zone=trusted --add-interface=br1
firewall-cmd --set-default-zone trusted
firewall-cmd --zone=trusted --add-masquerade
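Note that firewall-cmd without --permanent only changes the runtime configuration, so as written the interface binding and masquerading above would not survive a firewalld reload or reboot, while --set-default-zone is stored persistently on its own. A sketch of a persisted variant using firewall-cmd --runtime-to-permanent, assuming the same trusted zone and bridge br1; the function names run and setup_firewalld and the DRY_RUN switch are hypothetical, and by default it only prints the commands:

```shell
#!/bin/sh
# Apply the firewalld setup from above and persist it.
# Prints the commands unless DRY_RUN=0 is set (then run as root).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_firewalld() {
    bridge=${1:-br1}   # bridge name as used above
    run firewall-cmd --zone=trusted --add-interface="$bridge"
    run firewall-cmd --set-default-zone=trusted
    run firewall-cmd --zone=trusted --add-masquerade
    # Copy the runtime settings above into the permanent configuration
    run firewall-cmd --runtime-to-permanent
}

setup_firewalld br1
```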

EDIT: Created a new section in https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Infrastructure-setup-for-o3-openqaopensuseorg with all necessary instructions for moving from osd to o3, based on the comments above.

Actions #5

Updated by okurz over 4 years ago

  • Status changed from Feedback to Resolved

Working fine: the machine is properly removed from OSD, and generalized instructions for future reference are available in https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Infrastructure-setup-for-o3-openqaopensuseorg
