action #119443

closed

coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones

Conduct the migration of SUSE openQA systems from Nbg SRV1 to new security zones size:M

Added by okurz about 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Start date: 2022-11-17
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

See parent #116623

Acceptance criteria

  • AC1: All openQA machines in Nbg SRV1 are in new security zones
  • AC2: All openQA machines in Nbg SRV1 are fully usable in production

Suggestions

Open points

  • DONE Failed to connect to gitlab.suse.de port 443 from both worker11.oqa.suse.de and worker12.oqa.suse.de
  • DONE https://openqa.suse.de/tests/9870589#step/suseconnect_scc/23 failed trying to access scc.suse.com . I thought there would be no restrictions contacting services outside the network zones. What are the actual rules applied? --> Specific rules are managed within the firewall by lhaleplidis and will be documented later on wiki but unfortunately can not currently be dynamically visible to users
  • DONE https://openqa.suse.de/tests/9870976#step/sys_param_check/19 fails to curl -f -v "qa-css-hq.qa.suse.de/robot.tar.gz" --> see https://progress.opensuse.org/issues/119443?issue_count=97&issue_position=19&next_issue_id=118660&prev_issue_id=81192#note-17
  • DONE I try to access VNC services on the hosts. That seems to be blocked as well.
  • DONE Where can we see which services are blocked ourselves? --> Specific rules are managed within the firewall by lhaleplidis and will be documented later on wiki but unfortunately can not currently be dynamically visible to users. Not really done though. Extracted into a new ticket #120145
  • DONE hosts within the new domain .oqa.suse.de. should search for matches within that domain so that nslookup $(hostname) works, e.g. nslookup worker13 should work. I assume that salt is relying on that to return a proper match for grains.fqdn
  • DONE worker13 back in production
  • DONE worker10 back in production
  • DONE worker3 back in production
  • DONE worker5 back in production
  • DONE worker6 back in production
  • DONE worker8 back in production
  • DONE worker9 back in production
  • DONE Unpause "Packet loss between worker hosts and other hosts alert"
  • DONE worker2 back in production
  • Unpause "job age (scheduled) (max)" and "job age (scheduled) (median)"

Out-of-scope

  • This is not including o3 (openqa.opensuse.org) machines as they are in a dedicated network already
  • Not including non-openQA systems, see #120264 about that

Related issues 13 (1 open, 12 closed)

  • Related to openQA Infrastructure (public) - action #109241: Prefer to use domain names rather than IPv4 in salt pillars size:M (Resolved, okurz)
  • Related to openQA Infrastructure (public) - action #120025: [openQA][ipmi][worker] Worker host hostname changed and broken networking connection (Resolved, okurz, 2022-11-07)
  • Related to openQA Infrastructure (public) - action #120112: worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out":retry size:M (Resolved, okurz, 2022-11-08)
  • Related to openQA Infrastructure (public) - action #120261: tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get 'worker2' or something auto_review:".*curl.*worker\d+:.*failed at.*":retry size:meow (Resolved, mkittler, 2022-11-10)
  • Related to openQA Infrastructure (public) - action #113701: [qe-core] Move workers back to grenache (New)
  • Related to openQA Infrastructure (public) - action #120339: QEMU DNS fails to resolve openqa.suse.de via IP address (Resolved, okurz, 2022-11-11)
  • Copied from QA (public) - action #116629: Preparation planning for migration of SUSE openQA+QA systems to new security zones size:M (Resolved, okurz, 2022-09-15)
  • Copied to QA (public) - action #119446: Conduct the migration of SUSE openQA+QA systems from Nbg SRV2 to new security zones (Resolved, okurz, 2022-09-15)
  • Copied to QA (public) - action #119638: Ensure every physical machine within .qam.suse.de has an IPMI+eth L2 address entry in racktables size:M (Resolved, okurz)
  • Copied to openQA Infrastructure (public) - action #120163: Use salt grains instead of manually specifying IPs in "bridge_ip" size:M (Resolved, mkittler)
  • Copied to QA (public) - action #120264: Conduct the migration of SUSE QA systems (non-tools-team maintained) from Nbg SRV1 to new security zones size:M (Resolved, okurz, 2022-09-15)
  • Copied to openQA Infrastructure (public) - action #120270: Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones size:M (Resolved, mkittler)
  • Copied to openQA Infrastructure (public) - action #120807: [alert] openqa.suse.de - worker12.oqa.suse.de 100% packet loss due to outdated AAAA record (Resolved, okurz, 2022-11-17)

Actions #5

Updated by openqa_review about 2 years ago

  • Due date set to 2022-11-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz about 2 years ago

Discussed in Slack huddle 2022-10-31. We have about 20 openQA machines. We should create one openQA zone covering everything which is now in .suse.de domain, e.g. openqaworker1 through openqaworker20, also qa-power8, qa-power8-4, qa-power8-5, malbec.arch, grenache, …. And another QA zone combining everything from .qa.suse.de and .qam.suse.de. mcaj will present DNS options shortly.

EDIT: In a huddle with mcaj+lhaleplidis. mcaj can take https://gitlab.suse.de/qa-sle/qanet-configs/ as reference and build a salt managed DHCP+DNS structure based on that. New DNS domain for openQA machines, with CNAME entries for the old names pointing to the new entries. mcaj offered he could take over maintainership for the qa domain by having control over the machine qanet and migrate the dhcp+dns config to salt control and then migrate machines to the new zone one by one. We can define a new name scheme for the network and hosts, e.g. .oqa.suse.de with hostname CNAME openqaworker11.suse.de -> A worker11.oqa.suse.de. Regarding remote control there are two ways in the separate zone: /etc/hosts or internal DNS. We prefer internal DNS.
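
The name scheme above can be sketched as a small shell helper. This is only an illustration: the function name `new_fqdn` is hypothetical, and it assumes the old name follows the `openqaworkerN.suse.de` pattern, which does not hold for hosts like qa-power8 or grenache. Such names are passed through unchanged.

```shell
#!/bin/sh
# Hypothetical helper: map an old worker name to the proposed .oqa.suse.de name.
# Only names of the form openqaworker<digits>.suse.de are rewritten;
# everything else is printed unchanged.
new_fqdn() {
    printf '%s\n' "$1" |
        sed -E 's/^openqaworker([0-9]+)\.suse\.de$/worker\1.oqa.suse.de/'
}

new_fqdn openqaworker11.suse.de
```

Called with `openqaworker11.suse.de` this prints `worker11.oqa.suse.de`, the same pair used in the CNAME example.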

Every employee can request to have their ssh keys added over gitlab merge requests as well as propose config changes. Same goes for automated accounts.

lhaleplidis mentioned that multiple machines miss their L2 addresses in racktables. I went over all machines in https://racktables.nue.suse.com/index.php?page=rack&rack_id=193 and completed L2 addresses using both sudo salt \* cmd.run 'ipmitool lan print | grep "MAC Address"' on OSD as well as logging into the according machines directly and running ip link commands. For aarch64.openqanet.opensuse.org racktables was missing the entry for IPMI completely. ipmitool lan print on the machine returned only Set in Progress : Set Complete and no more useful information. Then from login.suse.de I retrieved an L2 address with ip=$(dig openqa-aarch64-ipmi.suse.de A +short); ping -c1 $ip && ip neighbour | grep $ip. I reviewed the complete list of all QA machines as far as I could and added L2 addresses as far as I managed. In the case of QAM machines I could not find useful responses and I don't know the config so I did not follow up with that but created a specific ticket.
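
The `ip neighbour | grep $ip` step can be made a bit stricter. A minimal sketch, assuming the usual `ip neighbour` line layout (`ADDRESS dev IFACE lladdr MAC STATE`): the helper name `l2_for_ip` is hypothetical and the sample addresses below are made up. Unlike a plain grep, it matches the address exactly, so 192.0.2.5 does not also match 192.0.2.50.

```shell
#!/bin/sh
# l2_for_ip IP: read `ip neighbour` output on stdin and print the
# link-layer (L2) address recorded for exactly that IP, if any.
l2_for_ip() {
    awk -v ip="$1" '
        $1 == ip {
            for (i = 2; i < NF; i++)
                if ($i == "lladdr") { print $(i + 1); exit }
        }'
}

# Made-up sample data standing in for real `ip neighbour` output.
printf '%s\n' \
    '192.0.2.50 dev eth0 lladdr 00:00:5e:00:53:01 STALE' \
    '192.0.2.5 dev eth0 lladdr 00:00:5e:00:53:02 REACHABLE' |
    l2_for_ip 192.0.2.5
```

On a real host the call would be `ip neighbour | l2_for_ip "$ip"` after the ping has populated the neighbour table.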

lhaleplidis also mentioned that there is a design plan from Cyber Security to only enable SSH access to machines over a jump host as well. I doubted the RoI is worth enough for that. mcaj+lhaleplidis noted that they are merely executors here. I might need to raise that issue.

Actions #8

Updated by okurz about 2 years ago

I can confirm that openqaworker11 has received a new IPv4 address 10.137.10.11 and I can't ping it from my home network. DNS resolution of worker11.oqa.suse.de seems to work although the CNAME redirect openqaworker11.suse.de -> worker11.oqa.suse.de does not work. openqaworker11-ipmi.suse.de is still reachable as before.

ping -c1 worker11.oqa.suse.de works now. next step could be ssh. yes, works. DNS AAAA is missing

(Martin Caj) Can you please meanwhile check jobs on worker11 to see if all is working there fine ?

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/9825072 _GROUP=0 BUILD= TEST=salt-minion-test-migration-poo119443-w11-okurz WORKER_CLASS=openqaworker11

->
Cloning parents of sle-15-SP5-Online-x86_64-Build32.1-salt-minion@64bit
Cloning children of sle-15-SP5-Online-x86_64-Build32.1-salt-master@64bit
Created job #9854444: sle-15-SP5-Online-x86_64-Build32.1-salt-master@64bit -> https://openqa.suse.de/t9854444
Created job #9854443: sle-15-SP5-Online-x86_64-Build32.1-salt-minion@64bit -> https://openqa.suse.de/t9854443

EDIT: It turns out OSD tries to reach worker11.oqa.suse.de over the second interface with IP

3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:37:af:f4 brd ff:ff:ff:ff:ff:ff
    altname enp0s4
    altname ens4
    inet 149.44.176.58/21 brd 149.44.183.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe37:aff4/64 scope link 
       valid_lft forever preferred_lft forever

so we agree to use a route setting

sudo ip route add 10.137.10.0/24 via 10.160.255.254
sudo ip route add 2a07:de40:a203:12::/64 via 2620:113:80c0:8080::1

I can confirm that I can ping the host over 4+6 from my home, A+AAAA works

$ for i in 4 6; do ping -c1 -$i worker11.oqa.suse.de; done
PING  (10.137.10.11) 56(84) bytes of data.
64 bytes from worker11.oqa.suse.de (10.137.10.11): icmp_seq=1 ttl=61 time=21.8 ms

---  ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 21.794/21.794/21.794/0.000 ms
PING worker11.oqa.suse.de(2a07:de40:a203:12:ec4:7aff:fe7a:7896 (2a07:de40:a203:12:ec4:7aff:fe7a:7896)) 56 data bytes
64 bytes from 2a07:de40:a203:12:ec4:7aff:fe7a:7896 (2a07:de40:a203:12:ec4:7aff:fe7a:7896): icmp_seq=1 ttl=61 time=22.0 ms

--- worker11.oqa.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 21.989/21.989/21.989/0.000 ms

also A+AAAA lookup over the old hostname seems to work out fine:

$ nslookup openqaworker11.suse.de
Server:     127.0.0.1
Address:    127.0.0.1#53

openqaworker11.suse.de  canonical name = worker11.oqa.suse.de.
Name:   worker11.oqa.suse.de
Address: 10.137.10.11
Name:   worker11.oqa.suse.de
Address: 2a07:de40:a203:12:ec4:7aff:fe7a:7896

EDIT: lhaleplidis did further (unknown) changes. Now I can confirm ping 4&6 from openqa.suse.de to worker11.oqa.suse.de works. https://openqa.suse.de/admin/workers/2014 shows worker11 to be online and idle

Now I did a simpler test, no multi-machine test:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/9825060 _GROUP=0 BUILD= TEST=memtest-test-migration-poo119443-w11-okurz WORKER_CLASS=openqaworker11

https://openqa.suse.de/tests/9854942 scheduled and running. The test is running now

I triggered another test

https://openqa.suse.de/tests/9855171 based on rescue_system, passed, so all good for single-machine tests.

I think I understand why the multi-machine tests can't run. I only started a single worker instance so the parallel cluster can't start.

Let's try again:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/9825072 _GROUP=0 BUILD= TEST=salt-minion-test-migration-poo119443-w11-okurz WORKER_CLASS=openqaworker11,tap

-> Created job #9854940: sle-15-SP5-Online-x86_64-Build32.1-salt-minion@64bit -> https://openqa.suse.de/t9854940
-> Created job #9854941: sle-15-SP5-Online-x86_64-Build32.1-salt-master@64bit -> https://openqa.suse.de/t9854941

On worker11

for i in 2 3; do systemctl unmask openqa-worker-auto-restart@$i && systemctl start openqa-worker-auto-restart@$i; done
openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/9825072 _GROUP=0 BUILD= TEST+=test-migration-poo119443-w11-okurz WORKER_CLASS=openqaworker11,tap

Cloning parents of sle-15-SP5-Online-x86_64-Build32.1-salt-minion@64bit
Cloning children of sle-15-SP5-Online-x86_64-Build32.1-salt-master@64bit
Created job #9855288: sle-15-SP5-Online-x86_64-Build32.1-salt-master@64bit -> https://openqa.suse.de/t9855288
Created job #9855289: sle-15-SP5-Online-x86_64-Build32.1-salt-minion@64bit -> https://openqa.suse.de/t9855289

but do those jobs actually wait for one another? Let's see. Both passed, all good.

  • DONE: @Lazaros Haleplidis I observed multiple times that the ssh connection from my workstation to worker11.oqa.suse.de stopped working whereas a ssh connection to openqa.suse.de stays responsive. After forcefully disconnecting with ~. I can reconnect but the unstable connection is a problem. 2022-11-03 Did not happen anymore.

  • TODO: memtest repeatedly fails in https://openqa.suse.de/tests/9855169#step/memtest/12 with memtest not continuing, asking for confirmation regarding multi-threaded operation?

Actions #10

Updated by okurz about 2 years ago

Continuing the migration. lhaleplidis also migrated eth1 on openqaworker11. mcaj suggested for the updated network:

(Martin Caj) I was thinking about the new QA vlan / dns DHCP and how to do it without any big downtime. Since we are going to merge QA and QAM into one vlan/subnet. I have a suggestion to use new domain name. My suggestion is use: qe.suse.de
That will allow me to create a new DHCP/DNS server (based on your data in gitlab) and we can do a host by host migration and also old machines can stay in the old domains (qam.suse.de and qa.suse.de) without any issue.

I approved that plan.

openqaworker11 currently has a 3 hop route to OSD over SUSE's NUE Nexus Core 5696Q switch. I am a bit concerned if that will stay for too long until we have OSD migrated. lhaleplidis plans to migrate a VM like OSD last.

Regarding SSH keys on the IPMI access jump host I suggested to mcaj:

I suggest to use the SSH key list from https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/sshd/users.sls . As alternative you can reference the git repo that you like to have keys in and we can each explicitly request our keys to be added as necessary.

Regarding plans for today:

(Lazaros Haleplidis) do you have a list of the machines, and their preferred order?
(Oliver Kurz) ow12 can be done immediately, no problem. After that my preferred order is that the documentation in rack tables and a generic documentation exists before touching the other production machines. That I consider crucial. For the actual order there is no strong preference so go in any order just please announce it here and update racktables with the according config. I have seen such rush often enough mentioning "we will update later" ending with "we will not document, we have forgotten the details by now". So I can't give green light to go on with production systems if we don't manage to have five lines updated in a wiki page and rather simple updates to racktables. We selected w11 as the first system to conduct the migration so I would like to see the process working there including the update. I hope you understand what I mean.

worker12 was also migrated. Next is openqaworker13, the first production machine. I followed https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production

worker11:/srv/salt # git pull --rebase origin master
fatal: unable to access 'https://gitlab.suse.de/openqa/salt-states-openqa.git/': Failed to connect to gitlab.suse.de port 443 after 129559 ms: Connection timed out

mentioned in https://suse.slack.com/archives/C0488BZNA5S/p1667488049879659

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/2781 is adding all users from the OSD infrastructure to the Eng-Infra maintained jump hosts.

> sudo salt --no-color 'openqaworker11*' state.apply
openqaworker11.suse.de:
    Data failed to compile:
----------
    Rendering SLS 'base:openqa.worker' failed: Jinja variable list object has no element 0

same if I call salt-call locally. Not sure what this means. I already tried to use a different name "worker11" instead of "openqaworker11" in /srv/pillar/openqa/workerconf.sls assuming that maybe the host does not match anymore but that does not have an effect. Same is reproduced by sudo salt --no-color 'openqaworker11*' state.apply test=True but not reproduced for sudo salt --no-color 'openqaworker10*' state.apply test=True. But the error is also reproduced for salt-call --local --no-color state.apply test=True. So something is fishy about our machines worker11+worker12.

Actions #12

Updated by okurz about 2 years ago

  • Description updated (diff)

Continuing migration. I provided a list in order in https://suse.slack.com/archives/C0488BZNA5S/p1667553148815839

sure. I suggest the following order:
openqaworker10.suse.de
openqaworker2.suse.de
openqaworker3.suse.de
openqaworker5.suse.de
openqaworker6.suse.de
openqaworker8.suse.de
openqaworker9.suse.de

worker13 is migrated now so I am re-enabling production use again with systemctl enable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d "." -f 1 | xargs) and triggering a specific test job:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/9870221 _GROUP=0 TEST=test-migration-worker13-okurz BUILD= WORKER_CLASS=openqaworker13

Created job #9870589: sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-Build20221103-1-jeos-extratest@uefi-virtio-vga -> https://openqa.suse.de/t9870589

Also added to salt on OSD again. Doing a salt high state after migration on worker13 as test first:

sudo salt --no-color 'openqaworker13*' state.apply test=True

this fails with

    Data failed to compile:
----------
    Rendering SLS 'base:openqa.worker' failed: Jinja variable list object has no element 0

same as we had on the other machines.

I paused "Packet loss between worker hosts and other hosts alert" as it was showing problems with worker13 accessing download.opensuse.org

Maybe the salt problem is due to salt referencing the hostname inconsistently. Ok, on worker11 I updated /etc/hostname to worker11 and /etc/salt/minion_id to worker11.oqa.suse.de. Then I triggered a reboot but the machine did not come up. I could connect using IPMI and then I saw that the RAID which we expected as /dev/md/openqa was suddenly called /dev/md/openqaworker11:openqa so our prepare script failed to proceed.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/764 to address renamed hosts.

Continuing with openqaworker10:

hostname='openqaworker10.suse.de'
ssh osd "sudo salt-key -y -d $hostname"
ssh $hostname "sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d "." -f 1 | xargs)"
Removed /etc/systemd/system/multi-user.target.wants/telegraf.service.
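
One caveat about the ssh call above: because `$(...)` sits inside double quotes, the local shell evaluates the command substitution before ssh starts, so the unit list comes from the machine where the command is typed and not from the worker. A minimal sketch of deferring the expansion to the remote side, assuming that was the intent, is to single-quote the remote command. The variable name `remote_cmd` is only for illustration and the actual ssh call is left commented out.

```shell
#!/bin/sh
hostname='openqaworker10.suse.de'

# Single quotes keep $(...) literal here, so the remote shell evaluates it
# and the unit list is taken from the worker itself.
remote_cmd='sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs)'

# Show what would be sent to the remote host.
printf '%s\n' "$remote_cmd"

# ssh "$hostname" "$remote_cmd"
```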

There were host up alerts so I paused the according alerts as well.

Due to the not yet resolved problem that worker13 can not yet fully communicate with download.opensuse.org I also changed the worker classes on worker13 for now by adding the suffix "_debug_119443" to prevent further production jobs from failing.

I also called

WORKER=worker13 result="result='failed'" failed_since=2022-11-04 host=openqa.suse.de openqa-advanced-retrigger-jobs

complete output:

{"result":[{"9877164":9877577}],"test_url":[{"9877164":"\/tests\/9877577"}]}
{"result":[{"9870535":9877578,"9870536":9877579}],"test_url":[{"9870535":"\/tests\/9877578","9870536":"\/tests\/9877579"}]}
{"result":[{"9873665":9877580}],"test_url":[{"9873665":"\/tests\/9877580"}]}
{"result":[{"9870512":9877582,"9870513":9877583,"9870514":9877584}],"test_url":[{"9870512":"\/tests\/9877582","9870513":"\/tests\/9877583","9870514":"\/tests\/9877584"}]}
{"result":[{"9870515":9877585}],"test_url":[{"9870515":"\/tests\/9877585"}]}
{"result":[{"9873702":9877586}],"test_url":[{"9873702":"\/tests\/9877586"}]}
{"result":[{"9870411":9877587}],"test_url":[{"9870411":"\/tests\/9877587"}]}
{"result":[{"9870784":9877588}],"test_url":[{"9870784":"\/tests\/9877588"}]}
{"result":[{"9875506":9877589}],"test_url":[{"9875506":"\/tests\/9877589"}]}
{"result":[{"9870809":9877590}],"test_url":[{"9870809":"\/tests\/9877590"}]}
{"result":[{"9872470":9877591}],"test_url":[{"9872470":"\/tests\/9877591"}]}
{"result":[{"9875255":9877592}],"test_url":[{"9875255":"\/tests\/9877592"}]}
{"result":[{"9870238":9877593}],"test_url":[{"9870238":"\/tests\/9877593"}]}
{"result":[{"9870547":9877594}],"test_url":[{"9870547":"\/tests\/9877594"}]}
{"result":[{"9870558":9877595}],"test_url":[{"9870558":"\/tests\/9877595"}]}
{"result":[{"9870412":9877596}],"test_url":[{"9870412":"\/tests\/9877596"}]}
{"result":[{"9870682":9877597}],"test_url":[{"9870682":"\/tests\/9877597"}]}
{"result":[{"9871233":9877598}],"test_url":[{"9871233":"\/tests\/9877598"}]}
{"result":[{"9870510":9877599}],"test_url":[{"9870510":"\/tests\/9877599"}]}
{"result":[{"9870280":9877600}],"test_url":[{"9870280":"\/tests\/9877600"}]}
{"result":[{"9870517":9877601}],"test_url":[{"9870517":"\/tests\/9877601"}]}
{"result":[{"9870966":9877602}],"test_url":[{"9870966":"\/tests\/9877602"}]}
{"result":[{"9871124":9877603}],"test_url":[{"9871124":"\/tests\/9877603"}]}
{"result":[{"9873334":9877604,"9873337":9877605,"9873343":9877606,"9873347":9877607,"9873360":9877608,"9874014":9877609}],"test_url":[{"9873334":"\/tests\/9877604","9873337":"\/tests\/9877605","9873343":"\/tests\/9877606","9873347":"\/tests\/9877607","9873360":"\/tests\/9877608","9874014":"\/tests\/9877609"}]}
{"result":[{"9872651":9877610}],"test_url":[{"9872651":"\/tests\/9877610"}]}
{"result":[{"9870282":9877611}],"test_url":[{"9870282":"\/tests\/9877611"}]}
{"result":[{"9870526":9877612}],"test_url":[{"9870526":"\/tests\/9877612"}]}
{"result":[{"9873590":9877613}],"test_url":[{"9873590":"\/tests\/9877613"}]}
{"result":[{"9873811":9877614}],"test_url":[{"9873811":"\/tests\/9877614"}]}
{"result":[{"9876682":9877615}],"test_url":[{"9876682":"\/tests\/9877615"}]}
{"result":[{"9870668":9877616}],"test_url":[{"9870668":"\/tests\/9877616"}]}
{"result":[{"9870976":9877617}],"test_url":[{"9870976":"\/tests\/9877617"}]}
{"result":[{"9871126":9877618}],"test_url":[{"9871126":"\/tests\/9877618"}]}
{"result":[{"9870816":9877619}],"test_url":[{"9870816":"\/tests\/9877619"}]}
{"result":[{"9872136":9877620}],"test_url":[{"9872136":"\/tests\/9877620"}]}
{"result":[{"9870532":9877621}],"test_url":[{"9870532":"\/tests\/9877621"}]}
{"result":[{"9873977":9877622}],"test_url":[{"9873977":"\/tests\/9877622"}]}
{"result":[{"9870283":9877623}],"test_url":[{"9870283":"\/tests\/9877623"}]}
{"result":[{"9870511":9877624}],"test_url":[{"9870511":"\/tests\/9877624"}]}
{"result":[{"9870805":9877625}],"test_url":[{"9870805":"\/tests\/9877625"}]}
{"result":[{"9873969":9877626}],"test_url":[{"9873969":"\/tests\/9877626"}]}
{"result":[{"9870563":9877627}],"test_url":[{"9870563":"\/tests\/9877627"}]}
{"result":[{"9870589":9877628}],"test_url":[{"9870589":"\/tests\/9877628"}]}
{"result":[{"9870785":9877629}],"test_url":[{"9870785":"\/tests\/9877629"}]}
{"result":[{"9871347":9877630}],"test_url":[{"9871347":"\/tests\/9877630"}]}
{"result":[{"9872681":9877631}],"test_url":[{"9872681":"\/tests\/9877631"}]}
{"result":[{"9870533":9877632,"9870539":9877633}],"test_url":[{"9870533":"\/tests\/9877632","9870539":"\/tests\/9877633"}]}
{"result":[{"9870961":9877634}],"test_url":[{"9870961":"\/tests\/9877634"}]}
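
To turn the output above into clickable links, something like the following could be used. This is a rough sketch: the helper name `new_test_urls` is hypothetical, it relies on the `\/tests\/<id>` strings seen in the output rather than on a real JSON parser, and it assumes the same host as passed via `host=` above.

```shell
#!/bin/sh
# new_test_urls: read openqa-advanced-retrigger-jobs output on stdin and
# print one full URL per newly created job.
new_test_urls() {
    grep -oE '\\/tests\\/[0-9]+' |
        sed -e 's#\\/#/#g' -e 's#^#https://openqa.suse.de#'
}

# One line copied from the output above as sample input.
printf '%s\n' '{"result":[{"9870533":9877632,"9870539":9877633}],"test_url":[{"9870533":"\/tests\/9877632","9870539":"\/tests\/9877633"}]}' |
    new_test_urls
```

For the sample line this prints the URLs of jobs 9877632 and 9877633, one per line.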

Also openqaworker2 was planned to be migrated but that was not done yet after encountering the problem to access download.opensuse.org from worker13.

mdoucha mentioned openQA jobs failing to connect to qam.suse.de downloading files from there, see https://suse.slack.com/archives/C02CANHLANP/p1667580960422459?thread_ts=1667580491.234429&cid=C02CANHLANP but on worker13 the command

curl -sS https://qam.suse.de/media/downloads/ltp_known_issues.json -o /dev/null; echo $?

works fine. So maybe something about qemu tests or bridging?

I triggered

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/9871117 _GROUP=0 BUILD= TEST=ltp_crashme_okurz_poo119443 WORKER_CLASS=openqaworker13

cloning the job that mdoucha mentioned as problematic. This will take about 5 minutes.

Created job #9877664: sle-15-Server-DVD-Incidents-Kernel-KOTD-x86_64-Build4.12.14-150000.293.1.g99629c6-ltp_crashme@64bit -> https://openqa.suse.de/t9877664

I try to access VNC services on the host. That seems to be blocked as well.

So I did

qemu-system-x86_64 -vnc :42 -cdrom Core-current.iso

on worker13.oqa.suse.de and built a bridge using ssh with

ssh -L 5942:localhost:5942 worker13.oqa.suse.de

and then connected using

vncviewer -Shared localhost:5942

booted into TinyCore and executed

wget http://qam.suse.de/media/downloads/ltp_known_issues.json

which worked fine. I can not test https within TinyCore with wget as openssl seems to be missing. So if that works but openQA tests continue to fail, the problem might be about the specific bridge config or similar that we use with the custom qemu command line?
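
For reference, the port in the ssh and vncviewer calls follows from the display number: QEMU's `-vnc :N` listens on TCP port 5900+N, so display :42 ends up on 5942. A small sketch of that arithmetic, with `vnc_port` as a hypothetical helper name:

```shell
#!/bin/sh
# vnc_port DISPLAY: TCP port that a VNC server on display :DISPLAY listens on.
vnc_port() {
    echo $((5900 + $1))
}

display=42
port=$(vnc_port "$display")

# Print the matching tunnel and viewer commands instead of running them.
echo "ssh -L ${port}:localhost:${port} worker13.oqa.suse.de"
echo "vncviewer -Shared localhost:${port}"
```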

Actions #16

Updated by okurz about 2 years ago

  • Due date changed from 2022-11-10 to 2022-12-01

This is actively being worked on in cooperation with SUSE-IT. As noted we encountered some problems which need feedback and resolution from SUSE-IT, so I am bumping the due date.

Actions #17

Updated by okurz about 2 years ago

Reconducting tests on worker13:

ping -6 download.opensuse.org is ok now. Cloning a reference job that failed before

openqa-clone-job --within-instance https://openqa.suse.de/tests/9887122 _GROUP=0 TEST=test-migration-worker13-okurz BUILD= WORKER_CLASS=qemu_x86_64_debug_119443

Created job #9890472: sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-Build20221106-1-jeos-extratest@uefi-virtio-vga -> https://openqa.suse.de/t9890472

passed the previously failing suseconnect_scc so that part of communication works as well now.

worker10 was migrated so checking that

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/9871117 _GROUP=0 BUILD= TEST=ltp_crashme_okurz_poo119443 WORKER_CLASS=openqaworker10

Created job #9890500: sle-15-Server-DVD-Incidents-Kernel-KOTD-x86_64-Build4.12.14-150000.293.1.g99629c6-ltp_crashme@64bit -> https://openqa.suse.de/t9890500

test passed so that's good. The developer mode could not connect because the DNS entry for worker10.oqa.suse.de is not complete for both IPv4 and IPv6.

EDIT: 2022-11-07: 14:08Z this was now fixed by mcaj

Testing one more scenario mentioned in open points:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/9890596 _GROUP=0 TEST=test-mau-sles-sys-param-check-migration-worker10-okurz BUILD= WORKER_CLASS=openqaworker10

Created job #9891294: sle-15-SP4-Server-DVD-Incidents-x86_64-Build:26478:samba-mau-sles-sys-param-check@64bit-2gbram -> https://openqa.suse.de/t9891294

job passed so likely the access to qa-css-hq.qa.suse.de also works now.

Discussed with Lazaros about allowing/disallowing traffic in particular from within openQA tests. All outbound traffic to the public internet is allowed. Inbound traffic is restricted and needs to be whitelisted case by case. okurz considers we are on a good track regarding critical machines first and then continue with more machines. We will restrict the outbound public internet traffic later to progress with migrating more machines. I again urged to have a dynamic, live overview of what the current rules are. Lazaros has understood and agrees with that request but can not provide that right now but forwarded it to the cyber security team.

Temporarily over the weekend I had worker13 disabled to prevent unforeseen problems. After we have worked to whitelist multiple connections I have now enabled production use of worker13 again. Same will be done for worker10, worker2 and then the others

the fqdn as returned by salt was still worker13 so I did sudo salt --no-color --state-output=changes -C 'G@roles:worker' saltutil.refresh_grains,grains.get ,fqdn, this worked to update it. Same problem on worker2. I struggled to get the fqdn to update. Locally salt-call grains.get fqdn returned worker2.oqa.suse.de fine. Calling the refresh and get command multiple times worked. Also it seems like get is called first, then the refresh. Well, good now.
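
Since the grain only updated after calling refresh and get several times, a retry wrapper could make that less manual. A rough sketch only: `retry_until` is a hypothetical helper and not part of salt, and the demonstration uses `echo` as a stand-in for the real query, whose output would likely need trimming before comparison.

```shell
#!/bin/sh
# retry_until EXPECTED TRIES CMD...: run CMD up to TRIES times until its
# output equals EXPECTED. Returns 0 on a match, 1 otherwise.
retry_until() {
    expected=$1
    tries=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        out=$("$@" 2>/dev/null) || true
        if [ "$out" = "$expected" ]; then
            return 0
        fi
        i=$((i + 1))
        # a short sleep between attempts could go here
    done
    return 1
}

# Stand-in command for demonstration; the real call would be the salt query.
retry_until worker2.oqa.suse.de 3 echo worker2.oqa.suse.de && echo "fqdn is up to date"
```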

The migrated machines in salt pruned the openQA worker config as the existing entries do not match. So for now I removed openqaworker5, openqaworker6, openqaworker8, openqaworker9 from salt keys to prevent the worker config from being deleted until we have found a better way.

Actions #21

Updated by okurz about 2 years ago

Removed openqaworker5 and openqaworker6 from production for now, the next candidates to migrate.

Actions #22

Updated by okurz about 2 years ago

I need to make route entries persistent

ip route add 10.137.10.0/24 via 10.160.255.254
ip route add 2a07:de40:a203:12::/64 via 2620:113:80c0:8080::1
ip route add 10.136.0.0/14 via 10.160.255.254

In /etc/sysconfig/network/routes I added

10.136.0.0/14 10.160.255.254 - -
10.137.10.0/24 10.160.255.254 - -
2a07:de40:a203:12::/64 2620:113:80c0:8080::1 - -

and then did wicked ifup eth0
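
The mapping from the `ip route add` calls to the lines in /etc/sysconfig/network/routes is mechanical: destination and gateway are kept, and the netmask and interface columns are set to "-". A minimal sketch of that conversion, with `routes_from_cmds` as a hypothetical helper name that only understands the simple `ip route add DEST via GATEWAY` form used here:

```shell
#!/bin/sh
# routes_from_cmds: read "ip route add DEST via GATEWAY" lines on stdin and
# print matching lines for /etc/sysconfig/network/routes
# (columns: destination gateway netmask interface, "-" meaning unset).
routes_from_cmds() {
    awk '$1 == "ip" && $2 == "route" && $3 == "add" && $5 == "via" {
        print $4, $6, "-", "-"
    }'
}

routes_from_cmds <<'EOF'
ip route add 10.137.10.0/24 via 10.160.255.254
ip route add 2a07:de40:a203:12::/64 via 2620:113:80c0:8080::1
ip route add 10.136.0.0/14 via 10.160.255.254
EOF
```

The output can be compared against the file content above before running wicked ifup.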

On openqaworker2 /etc/openqa/workers.ini was missing the entry for the FQDN hostname so tests couldn't upload logs properly. Retriggered all according incomplete and failed jobs from yesterday

WORKER=worker2 result="result='failed'" failed_since=2022-11-07 host=openqa.suse.de openqa-advanced-retrigger-jobs | tee -a worker2_restart_$(date +%F).log
WORKER=worker2 result="result='incomplete'" failed_since=2022-11-07 host=openqa.suse.de openqa-advanced-retrigger-jobs | tee -a worker2_restart_$(date +%F).log

Actions #25

Updated by martinsmac about 2 years ago

During my openqa-review week, not specific to one squad, I observed tests failing today with network errors when trying to connect/get/send with curl, on s390 and x86_64. Below are examples to verify

https://openqa.suse.de/tests/9902465#step/install_updates/11 - worker2 - curl timeout
https://openqa.suse.de/tests/9901371#step/iscsi_client/33 - worker10 - curl timeout to http://10.0.2.2
https://openqa.suse.de/tests/9904917#step/bind/6 - worker2 - curl timeout
https://openqa.suse.de/tests/9904919#step/prepare_test_data/9 - worker2 - curl timeout
https://openqa.suse.de/tests/9906141#step/yast2_nfs4_client/91 - worker5 - ping timeout
https://openqa.suse.de/tests/9902287#step/iscsi_client/33 - worker9 - curl timeout

Actions #27

Updated by okurz about 2 years ago

Multi-machine tests need updated bridge_ip settings in salt pillars, https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/456, merged, updated on OSD, then applied the high state on all workers.

Actions #29

Updated by livdywan about 2 years ago

osd-deployment is now failing because the name no longer matches, so I'm proposing a trivial fix which works when I manually run this on osd.

Actions #34

Updated by okurz about 2 years ago

cdywan wrote:

osd-deployment is now failing because the name no longer matches, so I'm proposing a trivial fix which works when I manually run this on osd.

thanks. merged https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/50 . I triggered a new deployment and monitored it and it went fine.

Actions #35

Updated by okurz about 2 years ago

Looking at monitoring data I find worker3 was missing data on https://monitor.qa.suse.de/d/WDworker3/worker-dashboard-worker3?orgId=1 , restarted the telegraf service as the host was not rebooted nor was telegraf restarted after the migration. Now data appeared again just fine. Also the dashboards for former openqaworkers are still there and now empty as the names have changed to the format "worker*" so eventually we can delete the old dashboards. For now I have moved the according dashboards to a separate "folder" in grafana, out of the "salt" folder.

Actions #36

Updated by okurz about 2 years ago

  • Description updated (diff)

I confirmed that all openQA machines in Nbg SRV1 are in new security zones and all openQA machines in Nbg SRV1 are fully usable in production with the sole exception of worker2, see in particular #120261

Actions #38

Updated by okurz about 2 years ago

  • Due date deleted (2022-12-01)
  • Status changed from In Progress to Blocked

Yesterday we looked into #120261 and confirmed that over the weekend there was no further problem so at least we have viable workarounds. The "job age" alerts are still paused. Checking https://openqa.suse.de/tests/ I find some tests scheduled for multiple days for the worker class "s390-kvm-sle12". I confirmed that worker instances are up but #120261 could be the cause of some bigger backlog. This is also showing the importance of #113701. Blocking on one of those … or both :)

Actions #42

Updated by okurz about 2 years ago

  • Description updated (diff)
  • Category set to Infrastructure
  • Status changed from Blocked to Resolved

okurz wrote:

Motivation

See parent #116623

Acceptance criteria

  • AC1: All openQA machines in Nbg SRV1 are in new security zones
  • AC2: All openQA machines in Nbg SRV1 are fully usable in production

Both ACs fulfilled. One exception is storage.qa.suse.de currently down, to be handled in #121282 including the according "packet loss" alert.
