action #81192
closed coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters
coordination #37910: [tools][epic] Migration of or away from qanet.qa.suse.de
[tools] Migrate (upgrade or replace) qanet.qa.suse.de to a supported, current OS size:M
Description
Observation
# cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3
The services that we rely upon on this host:
- named
- dhcpd
- ipxe server (http+tftp)
- cscreen (which is basically /usr/bin/SCREEN -d -m -S console -c /etc/cscreenrc)
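A quick remote spot-check of these services could look like this (a sketch; the tftp file name is a guess, not taken from the actual setup):
dig +short openqa.suse.de @qanet.qa.suse.de                                  # named answers queries
curl -sfI http://qanet.qa.suse.de/ >/dev/null && echo "http OK"              # http part of the ipxe setup
curl -sf tftp://qanet.qa.suse.de/pxelinux.0 -o /dev/null && echo "tftp OK"   # tftp part, file name guessed
screen -ls                                                                    # on qanet itself as root: the cscreen session "console" should be listed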
Services that are currently running at the time of writing (excluding obvious system services):
root 2906 0.0 0.4 50704 24920 ? Ss Nov18 0:43 /usr/bin/SCREEN -d -m -S console -c /etc/cscreenrc
root 6668 0.0 0.0 20444 1660 pts/4 Ss+ Nov18 0:42 \_ ipmitool -H ia64mm1001.qa.suse.de -P shell
root 6669 0.0 0.0 20444 1660 pts/5 Ss+ Nov18 0:43 \_ ipmitool -H ia64ph1002.qa.suse.de -P shell
root 6670 0.0 0.0 20444 1660 pts/6 Ss+ Nov18 0:43 \_ ipmitool -H ia64mm1006.qa.suse.de -P shell
root 6671 0.0 0.0 20444 1664 pts/7 Ss+ Nov18 0:42 \_ ipmitool -H ia64mm1007.qa.suse.de -P shell
root 6672 0.0 0.0 20444 1660 pts/8 Ss+ Nov18 0:42 \_ ipmitool -H ia64mm1008.qa.suse.de -P shell
root 6673 0.0 0.0 20444 1660 pts/9 Ss+ Nov18 0:43 \_ ipmitool -H ia64mm1011.qa.suse.de -P XXXXXXXX shell
root 3135 0.0 0.0 68708 1036 ? Ss Nov18 0:03 /usr/sbin/lldpd
_lldpd 3158 0.0 0.0 68708 504 ? S Nov18 0:15 \_ /usr/sbin/lldpd
root 3174 0.0 0.0 27136 500 ? Ss Nov18 0:00 /usr/sbin/mcelog --daemon --config-file /etc/mcelog/mcelog.conf
root 3254 0.0 0.0 11324 1412 ? S Nov18 0:00 /bin/sh /usr/bin/mysqld_safe --mysqld=mysqld --user=mysql --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql
mysql 3596 0.0 0.7 406024 45536 ? Sl Nov18 31:06 \_ /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --log-error=/
named 3619 0.1 1.5 258384 90172 ? Ssl Nov18 74:51 /usr/sbin/named -t /var/lib/named -u named
root 3884 0.0 0.0 22844 1188 ? S Nov18 0:00 /usr/sbin/vsftpd
icinga 4268 0.0 0.0 39736 172 ? Ss Nov18 0:00 /usr/sbin/ido2db -c /etc/icinga/ido2db.cfg
root 4323 0.0 0.0 89532 1440 ? Ss Nov18 0:00 /usr/sbin/smbd -D -s /etc/samba/smb.conf
root 4428 0.0 0.0 89636 980 ? S Nov18 0:05 \_ /usr/sbin/smbd -D -s /etc/samba/smb.conf
root 4330 0.0 0.0 61656 848 ? Sl Nov18 0:12 /usr/sbin/ypbind
root 4432 0.0 0.0 34984 928 ? Ssl Nov18 0:00 /usr/sbin/automount -p /var/run/automount.pid
root 4464 0.0 0.0 23768 864 ? Ss Nov18 0:00 /usr/sbin/rpc.mountd
root 4620 0.0 0.3 388132 19828 ? Ss Nov18 0:56 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf -DSSL -DICINGA -DICINGAWEB
root 4663 0.0 0.0 15748 772 ? Ss Nov18 0:00 /usr/sbin/xinetd -pidfile /var/run/xinetd.init.pid
nagios 4692 0.0 0.0 19040 948 ? Ss Nov18 0:00 /usr/sbin/nrpe -c /etc/nrpe.cfg -d
nobody 23063 0.1 0.0 179336 920 ? Ss Dec04 39:26 /usr/sbin/atftpd --pidfile /var/run/atftpd/pid --daemon --verbose=7 /srv/tftp
dhcpd 22267 0.0 0.1 38808 8224 ? Ss Dec10 4:37 /usr/sbin/dhcpd6 -6 -cf /etc/dhcpd6.conf -pf /var/run/dhcpd6.pid -chroot /var/lib/dhcp6 -lf /db/dhcpd6.leases -user dhcpd -g
dhcpd 20606 0.1 0.1 39076 6744 ? Ss Dec17 1:04 /usr/sbin/dhcpd -4 -cf /etc/dhcpd.conf -pf /var/run/dhcpd.pid -chroot /var/lib/dhcp -lf /db/dhcpd.leases -user dhcpd -group
Acceptance criteria
- AC1: qanet.qa is upgraded to a currently supported OS
Suggestion
- I suggest creating a full system backup and then just live-migrating to a more recent version of SLE.
- The storage system could be used to store the backup
Updated by livdywan almost 4 years ago
Two questions:
- Is there an existing workflow to create backups? Snapshots? rsync? Something else?
- I can't seem to log in as a user or root - how does one get SSH access to this machine?
Updated by okurz almost 4 years ago
cdywan wrote:
Two questions:
- Is there an existing workflow to create backups? Snapshots? rsync? Something else?
if by "snapshots" you mean btrfs or LVM snapshots that is not possible. Rsync or something else is suggested here.
- I can't seem to log in as a user or root - how does one get SSH access to this machine?
You ask an existing user with root access to add your key to /root/.ssh/authorized_keys. We can discuss this together with nsinger in a few days.
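In practice that boils down to something like this, run by the existing root user (the key file name is a placeholder):
cat livdywan_id_ed25519.pub >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys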
Updated by okurz almost 4 years ago
- Related to action #81200: [tools][labs] some partitions on qanet are 100% full, seems like /data/backups has no new archives since 20201009 due to that added
Updated by okurz almost 4 years ago
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler almost 4 years ago
- Assignee set to mkittler
I'd try this if @nicksinger is available again since I don't know much about this specific system.
Would you recommend updating to SLE 15 right away? And about the backup: where should I store it? Maybe the new storage server?
Updated by okurz almost 4 years ago
mkittler wrote:
I'd try this if @nicksinger is available again since I don't know much about this specific system.
Would you recommend updating to SLE 15 right away?
Yes, I suggest you coordinate with nicksinger. Also, maybe he is already in the process of preparing a complete replacement machine, unless I am confusing something.
And about the backup: where should I store it? Maybe the new storage server?
No, we have "backup.qa.suse.de" which we can use, unless it is TBs of data as for openQA, where we need a special solution.
Updated by okurz almost 4 years ago
- Assignee set to nicksinger
mkittler has unassigned himself but without a comment. I can only assume this is based on a chat with nicksinger, so assigning to "nicksinger" to clarify and follow up :)
Updated by okurz over 3 years ago
- Due date set to 2021-03-31
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 3 years ago
- Due date deleted (2021-03-31)
okurz wrote:
mkittler has unassigned himself but without a comment. I can only assume this is based on a chat with nicksinger, so assigning to "nicksinger" to clarify and follow up :)
@nicksinger @mkittler Are you guys still planning to work on this together? Or one of you? 🤔️
I would also generally consider taking it, assuming rsync to backup.qa.suse.de is an okay approach going by the comments above. That was why I was hesitating to do it before.
Updated by openqa_review over 3 years ago
- Due date set to 2021-04-22
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 3 years ago
- Due date deleted (2021-04-22)
For now no due date on "Workable", see https://github.com/os-autoinst/scripts/pull/71
Updated by okurz over 3 years ago
- Priority changed from Normal to Low
Discussed with nicksinger: we plan to follow up here but do not necessarily need to act that soon.
Updated by nicksinger over 3 years ago
- Status changed from Workable to In Progress
Updated by nicksinger over 3 years ago
I tried the whole day to boot an EFI-compatible ISO image over HTTP but failed. Nothing I tried was accepted by the server: no openSUSE ISO, no iPXE payload, nothing.
I now tried the "CD-ROM Image" option in the BMC. It requires a Windows (Samba) share, with a protocol version from the "Windows NT" era. This is absolutely ridiculous but seems to have worked with the following smb.conf on my workstation:
[global]
workgroup = WORKGROUP
passdb backend = tdbsam
map to guest = Bad User
usershare allow guests = Yes
log level = 3
log file = /var/log/samba/%m.log
min protocol = NT1
max protocol = SMB3
[boot]
comment = boot
path = /home/nsinger/Downloads/opensuse
public = yes
read only = no
force user = nsinger
With this I was able to mount the ISO in the BMC, which then created a "Virtual CD drive" on the server. I could boot from it and see GRUB. Next will be the installation of the base system.
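For reference, such a share can be checked locally before (or after) pointing the BMC at it; a minimal sketch assuming the [boot] share from the config above and the openSUSE service names smb/nmb:
systemctl restart smb nmb
smbclient -L //localhost -N              # anonymous listing, the "boot" share should show up
smbclient //localhost/boot -N -c ls      # the ISO file should be visible here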
Updated by nicksinger over 3 years ago
Base system is installed now. I will now deploy a basic Salt infrastructure based on what we already have in https://gitlab.suse.de/qa-sle/qanet-salt
Updated by nicksinger over 3 years ago
User creation, SSH key management and PostgreSQL installation are done now.
Updated by nicksinger over 3 years ago
Added PowerDNS packages and some basic PostgreSQL configuration for now. Currently struggling to get Salt to create the psql user with the right password hash.
Updated by nicksinger over 3 years ago
Last Friday I figured out that this is caused by an outdated version of Salt in openSUSE. I've opened https://bugzilla.opensuse.org/show_bug.cgi?id=1186500 to address that issue and went with md5 encryption for now (https://gitlab.suse.de/qa-sle/qanet-salt/-/commit/6fe695c82527110acaa61e6f1d4391bdf99943b1)
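For reference, the md5 hash format that PostgreSQL expects can also be produced by hand, which is essentially what such a workaround relies on (credentials below are placeholders, not the real ones):
user=powerdns; password=changeme          # placeholders
echo "md5$(printf '%s%s' "$password" "$user" | md5sum | cut -d' ' -f1)"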
Updated by nicksinger over 3 years ago
Initial database initialization as well as the authoritative and recursive PowerDNS config were added to Salt.
With this the server is now a slave for the current qanet and delivers the same results:
selenium ~ » dig holmes-4.qa.suse.de @qanet.qa.suse.de +short
10.162.2.104
selenium ~ » dig holmes-4.qa.suse.de @qanet2.qa.suse.de +short
10.162.2.104
Next I need to figure out which parts of the old configuration are still valid:
allow-recursion { localnets; localhost; 10.120.0.40; 10.120.0.41; 10.120.0.44; 10.120.0.45 ; 149.44.176.22; 10.160.0.40; 10.160.0.41; 10.160.0.44; 10.160.0.45; 149.44.176.36; 149.44.176.37; 149.44.176.22; 10.162.64.10; 10.0.0.0/8; };
also-notify { 149.44.160.72; 10.160.0.1; 10.160.2.88; 10.100.2.8; 10.100.2.10; 10.162.0.2; };
allow-transfer { 149.44.160.1; 149.44.160.160; 149.44.160.72; 10.120.0.1; 10.120.0.150; 10.120.2.88; 10.160.0.1; 10.160.0.150; 10.160.2.88; 10.100.2.8; 10.100.2.10; 10.162.0.2; };
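A quick reachability check could help narrow down which of these entries still point to live hosts (a sketch using the also-notify addresses above; reachability alone of course does not prove an entry is still needed):
for ip in 149.44.160.72 10.160.0.1 10.160.2.88 10.100.2.8 10.100.2.10 10.162.0.2; do
    ping -c1 -W1 "$ip" >/dev/null 2>&1 && echo "$ip reachable" || echo "$ip unreachable"
done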
Updated by okurz over 3 years ago
- Status changed from Workable to New
moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size
Updated by ilausuch over 3 years ago
- Subject changed from [tools] Upgrade qanet.qa.suse.de to a supported, current OS to [tools] Upgrade qanet.qa.suse.de to a supported, current OS size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 3 years ago
- Subject changed from [tools] Upgrade qanet.qa.suse.de to a supported, current OS size:M to [tools] Migrate (upgrade or replace) qanet.qa.suse.de to a supported, current OS size:M
Updated by okurz almost 3 years ago
- Status changed from Workable to In Progress
- Assignee changed from nicksinger to okurz
I found that multiple partitions had been 100% full. I added myself to group "wheel" and allowed wheel users sudo without password so that I can log in as my own user and others can see that I logged in. I then deleted some old stuff from the full partitions, e.g. a lot of old automatic intermediate backup directories which had likely been preventing new backups for years already. I created a new SSH keypair on qanet as otherwise I wouldn't be able to access a remote backup location anyway. So I ran ssh-keygen and copied the public key into backup.qa.suse.de:/root/.ssh/authorized_keys. Then I created a directory /home/backup/qanet/ on backup.qa and called on qanet:
for i in / /srv/ /var/ /data/; do rsync -aHP --one-file-system $i backup.qa:/home/backup/qanet$i; done
By the way, sudo du -x --max-depth 1 -BM / | sort -n shows what we primarily need to care about from the root filesystem when trying an upgrade:
0M ./dev
0M ./mounts
0M ./proc
0M ./suse
0M ./sys
1M ./boot
1M ./csv
1M ./data
1M ./dist
1M ./img
1M ./lost+found
1M ./media
1M ./mnt
1M ./secret
1M ./selinux
1M ./srv
1M ./tftproot
1M ./tmp
1M ./var
10M ./bin
15M ./sbin
20M ./lib64
28M ./etc
152M ./lib
319M ./opt
2544M ./home
3598M ./usr
5880M ./root
12562M .
So most is in /root and also a lot in /home. I suggest that after the backup is complete we replicate the environment into a VM, or anywhere we can run a chroot or container environment, excluding /root/* and /home/*, and experiment with live upgrades.
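A sketch of such an experiment on backup.qa, assuming the rsync backup under /home/backup/qanet/ is used as the source (target directory and exact steps are assumptions, not something already done):
mkdir -p /srv/qanet-chroot
rsync -aHP --exclude='/root/*' --exclude='/home/*' /home/backup/qanet/ /srv/qanet-chroot/
for fs in proc sys dev; do mount --bind /$fs /srv/qanet-chroot/$fs; done
chroot /srv/qanet-chroot /bin/bash    # then experiment with repos and zypper dup in here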
Updated by okurz almost 3 years ago
When trying to conduct the backup I noticed quite slow transfer speeds.
I did a benchmark with
qanetnue:/suse/okurz # dd bs=100M count=20 if=/dev/zero | nc -l 42420
and
backup-vm:/home/okurz # nc qanet.qa 42420 | dd of=/dev/null status=progress
and the result is
116733440 bytes (117 MB, 111 MiB) copied, 1948 s, 59.9 kB/s
so abysmally slow network speed.
-> #107437
Updated by okurz almost 3 years ago
With #107437 resolved I can continue. Speed looks much better now.
EDIT: Backup complete.
Updated by okurz almost 3 years ago
- Status changed from In Progress to Workable
I would like to pick up the work again in a mob-session, e.g. together with nsinger.
Updated by okurz almost 3 years ago
Now I know why backups filled up our root partition on qanet.qa in the past years. /etc/cron.weekly/removeoldqabackups.sh has:
rm /backups/qa.suse.de_201*.tar.gz-* > /tmp/cronout-$mydate
Guess why the problem started in the year 2020 :facepalm: . Fixed by replacing it with
find /backups/ -mtime 30 -delete > /tmp/cronout-$mydate
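For reference, a slightly more targeted variant of that cleanup; note that find's -mtime 30 only matches files last modified roughly exactly 30 days ago, while -mtime +30 matches everything older than that ($mydate comes from the original script):
find /backups/ -name 'qa.suse.de_*.tar.gz-*' -mtime +30 -delete > /tmp/cronout-$mydate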
Updated by okurz over 2 years ago
Created a full root partition image backup with command on backup.qa
backup-vm:/home/backup/qanet # nc qanet 42420 | pv > sda2_root-$(date +%F).img
and from qanet
dd bs=1M if=/dev/sda2 | nc backup 42420
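A quick integrity check of such a transfer could look like this (a sketch, not something recorded here; it is only meaningful while nothing writes to the partition in between, and the two digests should match):
# on qanet
dd bs=1M if=/dev/sda2 | sha256sum
# on backup.qa
sha256sum sda2_root-*.img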
Updated by okurz over 2 years ago
- Status changed from Workable to In Progress
We managed to upgrade from SLE11SP3 to SLE11SP4, but without any maintenance updates yet. It would be interesting to see if we can find maintenance update repos. Removed a lot of packages for services that were not running anymore, also the X11 stack. We restarted dhcpd and named and everything is fine there. Next step: trying to upgrade to SLE12.
Updated by okurz over 2 years ago
Trying zypper dup from a mounted SLE12GM ISO image yields
qanetnue:/tmp # zypper dup -r sle12gm
Loading repository data...
Reading installed packages...
Computing distribution upgrade...
29 Problems:
Problem: solvable libgcc_s1-4.8.3+r212056-6.3.x86_64 conflicts with libgcc_s1 provided by itself
Problem: solvable libgcc_s1-32bit-4.8.3+r212056-6.3.x86_64 conflicts with libgcc_s1-32bit provided by itself
Problem: solvable libgfortran3-4.8.3+r212056-6.3.x86_64 conflicts with libgfortran3 provided by itself
Problem: solvable libgomp1-4.8.3+r212056-6.3.x86_64 conflicts with libgomp1 provided by itself
Problem: solvable libquadmath0-4.8.3+r212056-6.3.x86_64 conflicts with libquadmath0 provided by itself
Problem: solvable libstdc++6-4.8.3+r212056-6.3.x86_64 conflicts with libstdc++6 provided by itself
Problem: solvable libstdc++6-32bit-4.8.3+r212056-6.3.x86_64 conflicts with libstdc++6-32bit provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-32bit-4.8.3+r212056-6.3.x86_64 conflicts with libffi4-32bit provided by itself
Problem: solvable libtsan0-4.8.3+r212056-6.3.x86_64 conflicts with libtsan0 provided by itself
Problem: solvable libtsan0-4.8.3+r212056-6.3.x86_64 conflicts with libtsan0 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: ca-certificates-mozilla-1.97-4.5.noarch requires ca-certificates, but this requirement cannot be provided
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libgcc_s1-4.8.3+r212056-6.3.x86_64 conflicts with libgcc_s1 provided by itself
So quite problematic. Trying to find and add update repositories instead.
zypper ar http://dist.suse.de/updates/repo/\$RCE/SLES11-SP4-LTSS-Updates/sle-11-x86_64/SUSE:Updates:SLE-SERVER:11-SP4-LTSS:x86_64.repo
seems to work, mind the escaped \$RCE. It might be that we need the non-LTSS directories in parallel.
EDIT: Added the non-LTSS updates repo:
zypper ar http://dist.suse.de/updates/repo/\$RCE/SLES11-SP4-Updates/sle-11-x86_64/SUSE:Updates:SLE-SERVER:11-SP4:x86_64.repo
Now zypper patch looks sane. Calling it once updated zypper, calling it a second time installs much more. Now zypper patch is clean, so is zypper up. But zypper dup is stuck on conflicts:
3 Problems:
Problem: nothing provides libaudit.so.0 needed by pam-32bit-1.1.5-0.17.2.x86_64
Problem: nothing provides libgdbm.so.3 needed by perl-32bit-5.10.0-64.80.1.x86_64
Problem: nothing provides libaudit.so.0 needed by pam-32bit-1.1.5-0.17.2.x86_64
Likely zypper rm -u pam-32bit could work. We shouldn't need that many 32bit packages, if any. Done that.
The following packages are going to be REMOVED:
ConsoleKit-32bit cryptconfig-32bit pam-32bit pam-modules-32bit pam_mount-32bit samba-32bit samba-winbind-32bit sssd-32bit
I restarted named and dhcpd and they seem to be still working fine.
Updated by okurz over 2 years ago
- Status changed from Workable to In Progress
Currently re-running an image-based backup of the root partition as preparation for the next upgrade steps. I compressed the previous image on backup.qa. Now on qanet
dd bs=1M if=/dev/sda2 | nc -l 42420
on backup.qa
nc qanet 42420 | pv | xz -c - > sda2_root-$(date +%F).img.xz
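For completeness, restoring such an image would be the reverse path and only makes sense with qanet booted from a rescue system (a sketch mirroring the commands above; DATE is a placeholder):
# on qanet (rescue system)
nc -l 42420 | dd bs=1M of=/dev/sda2
# on backup.qa
xz -dc sda2_root-DATE.img.xz | nc qanet 42420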
Updated by okurz over 2 years ago
nicksinger and I yesterday added the SLE11SP4 ISO back as a repo and then conducted a zypper dup, ending up in a consistent state. Then we tried to do an online upgrade with SLE12GM repos but ran again into the problems shown in #81192#note-34. We plan to continue on Tuesday next week with a medium-based migration.
Updated by okurz over 2 years ago
- Status changed from Workable to In Progress
In SRV2 with nicksinger. I put SLES12SP3 on a USB thumbdrive. We connected the thumbdrive to qanet, first in the back, then in the front. We found that the USB device is not detected for booting, likely not supported. So we booted the original system again, mounted the thumbdrive and executed kexec, along the lines of:
mount /dev/sdc2 /mnt/iso
kexec --initrd=/mnt/iso/boot/loader/x86_64/initrd --command-line="upgrade=1 textmode=1 ssh=1 sshpassword=XXX ifcfg=eth0=10.162.0.1/18,10.162.163.254 nameserver=10.162.163.254"
We used a local VGA monitor and a PS/2 keyboard, as USB is not supported in the BIOS menu, but we could have done the kexec remotely as well. We just used the local VGA connection to be able to monitor the boot process.
Updated by okurz over 2 years ago
A problem seems to be that the one monitor we have in SRV2 does not support the detected resolution (1050 vertical), so soon into boot the monitor does not show anything anymore. Still, the system booted and was reachable over SSH. The DNS server named and dhcpd are running fine, so core services are available. Some failed services, e.g. apache2, all seem to be non-critical and can be cared about later. Trying to add update repos we hit a problem that curl could not initialize. Calling ldd $(which curl) revealed that some /usr/lib/vmware directories were in the list, which sounded fishy. We renamed that vmware folder with the extension .old and curl and zypper were running fine, so we could add https://updates.suse.de/SUSE/Updates/SLE-SERVER/12-SP3/x86_64/update/SUSE:Updates:SLE-SERVER:12-SP3:x86_64.repo and https://updates.suse.de/SUSE/Updates/SLE-SERVER/12-SP3-LTSS/x86_64/update/SUSE:Updates:SLE-SERVER:12-SP3-LTSS:x86_64.repo and https://ca.suse.de/ and call zypper dup to bring the system into a properly updated state. After that we rebooted two times to check. The initial GRUB screen looks weird and no real menu shows up, at least not on VGA, but eventually the system boots fine, so good enough.
Next tasks:
- DONE: Check all failed systemd services
- Configure automatic updates and reboots, e.g. use salt same as for backup.qa and alike?
- Requires prior update to SLE12SP5 (see https://software.opensuse.org/package/salt)
- Upgrade to more recent versions of SLE, then sidegrade to Leap. Or just go to Leap 42.3 now and then upgrade, assuming we can still find the corresponding repos
- Clean up more old cruft, like apache2, 32bit libraries, etc.
- Consider repartitioning with optional move of / from ext3 to btrfs
- Review all .rpm* files in /etc
Updated by okurz over 2 years ago
I would say the next step is that we visit Maxtorhof again, side-grade to Leap and then upgrade all the way to 15.4. As http://download.opensuse.org/distribution/leap/ goes all the way down to 42.3 I see it as easiest to go to Leap first. Bonus points for building a poor man's KVM with a Raspberry Pi, or we just connect the serial port to another machine and plug the power into a remote-controlled PDU :) We can use qamaster as serial host. PDUs are already connected but need cable tracing. A further idea: a backup qanet as a VM on qamaster.
Updated by okurz over 2 years ago
- Related to action #113357: UEFI PXE or "network boot" support within .qa.suse.de size:M added
Updated by nicksinger over 2 years ago
The serial port is now connected to qamaster and a console is reachable. I also re-plugged one of the Y-power-connectors so all 4 PSUs are now connected to qaps09 - see the port documentation in https://racktables.suse.de/index.php?page=object&tab=ports&object_id=1610 or in the web interface of qaps09.
Updated by nicksinger over 2 years ago
Online migration from 12SP3->12SP4->12SP5 is done now. The system works fine after a reboot. According to https://documentation.suse.com/sles/15-SP4/html/SLES-all/cha-upgrade-paths.html#sec-upgrade-paths-supported the upgrade path to SLE15 is once again only supported offline, so we might consider going to Leap straight away.
Updated by okurz over 2 years ago
Thank you. The next step can again be the "kexec into the downloaded iso" approach, and we can dare to do this remotely with serial and remote power control :)
Updated by okurz over 2 years ago
At around 1600 CEST a problem was reported that DNS resolution on grenache-1 does not work.
grenache-1:~ # host openqa.suse.de
;; connection timed out; no servers could be reached
ping -c 1 -4 10.160.0.207, the IPv4 address of OSD, is fine, same as ping -c 1 -6 2620:113:80c0:8080:10:160:0:207. Works now after I restarted named on qanet, not sure why.
Logs of named on qanet:
okurz@qanet:~ 0 (master) $ sudo systemctl status named.service
● named.service - LSB: Domain Name System (DNS) server, named
Loaded: loaded (/etc/init.d/named; bad; vendor preset: disabled)
Active: active (exited) since Wed 2022-07-13 15:05:41 CEST; 1h 13min ago
Docs: man:systemd-sysv-generator(8)
Process: 1798 ExecStart=/etc/init.d/named start (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 512)
Jul 13 15:05:40 qanet named[1875]: automatic empty zone: HOME.ARPA
Jul 13 15:05:40 qanet named[1875]: none:104: 'max-cache-size 90%' - setting to 5356MB (out of 5951MB)
Jul 13 15:05:40 qanet named[1875]: configuring command channel from '/etc/rndc.key'
Jul 13 15:05:40 qanet named[1875]: command channel listening on 127.0.0.1#953
Jul 13 15:05:40 qanet named[1875]: configuring command channel from '/etc/rndc.key'
Jul 13 15:05:40 qanet named[1875]: command channel listening on ::1#953
Jul 13 15:05:41 qanet named[1875]: zone qa.suse.de/IN: cloud2.qa.suse.de/NS 'crowbar.cloud2adm.qa.suse.de' has no SIBLING GLUE address records (A or AAAA)
Jul 13 15:05:41 qanet named[1875]: zone qa.suse.de/IN: cloud3.qa.suse.de/NS 'crowbar.cloud3adm.qa.suse.de' has no SIBLING GLUE address records (A or AAAA)
Jul 13 15:05:41 qanet named[1798]: Starting name server BIND ..done
Jul 13 15:05:41 qanet systemd[1]: Started LSB: Domain Name System (DNS) server, named.
okurz@qanet:~ 3 (master) $ sudo systemctl restart named
okurz@qanet:~ 0 (master) $ sudo journalctl -f -u named
-- Logs begin at Wed 2022-07-13 15:04:39 CEST. --
Jul 13 16:19:31 qanet named[5334]: automatic empty zone: HOME.ARPA
Jul 13 16:19:31 qanet named[5334]: none:104: 'max-cache-size 90%' - setting to 5356MB (out of 5951MB)
Jul 13 16:19:31 qanet named[5334]: configuring command channel from '/etc/rndc.key'
Jul 13 16:19:31 qanet named[5334]: command channel listening on 127.0.0.1#953
Jul 13 16:19:31 qanet named[5334]: configuring command channel from '/etc/rndc.key'
Jul 13 16:19:31 qanet named[5334]: command channel listening on ::1#953
Jul 13 16:19:31 qanet named[5334]: zone qa.suse.de/IN: cloud2.qa.suse.de/NS 'crowbar.cloud2adm.qa.suse.de' has no SIBLING GLUE address records (A or AAAA)
Jul 13 16:19:31 qanet named[5334]: zone qa.suse.de/IN: cloud3.qa.suse.de/NS 'crowbar.cloud3adm.qa.suse.de' has no SIBLING GLUE address records (A or AAAA)
Jul 13 16:19:31 qanet named[5268]: Starting name server BIND - Warning: /var/lib/named/var/run/named/named.pid exists! ..done
Jul 13 16:19:31 qanet systemd[1]: Started LSB: Domain Name System (DNS) server, named.
Handling some restarts:
host=openqa.suse.de failed_since="2022-07-13 13:00" openqa-advanced-retrigger-jobs
result="result='failed'" host=openqa.suse.de failed_since="2022-07-13 13:00" openqa-advanced-retrigger-jobs
{"result":[{"9116842":9124646}],"test_url":[{"9116842":"\/tests\/9124646"}]}
{"enforceable":1,"errors":["Job 9124626 misses the following mandatory assets: hdd\/SLES-15-x86_64-mru-install-minimal-with-addons-Build20220617-1-Server-DVD-Updates-64bit.qcow2\nEnsure to provide mandatory assets and\/or force retriggering if necessary."],"result":[],"test_url":[]}
{"enforceable":1,"errors":["Job 9124628 misses the following mandatory assets: hdd\/SLES-12-SP4-x86_64-mru-install-minimal-with-addons-Build20220627-1-Server-DVD-Updates-64bit.qcow2\nEnsure to provide mandatory assets and\/or force retriggering if necessary."],"result":[],"test_url":[]}
{"result":[{"9122209":9124648}],"test_url":[{"9122209":"\/tests\/9124648"}]}
{"result":[{"9122212":9124649}],"test_url":[{"9122212":"\/tests\/9124649"}]}
{"result":[{"9122215":9124650}],"test_url":[{"9122215":"\/tests\/9124650"}]}
{"result":[{"9122214":9124651}],"test_url":[{"9122214":"\/tests\/9124651"}]}
{"result":[{"9116915":9124652}],"test_url":[{"9116915":"\/tests\/9124652"}]}
{"result":[{"9116910":9124653}],"test_url":[{"9116910":"\/tests\/9124653"}]}
{"result":[{"9116877":9124654}],"test_url":[{"9116877":"\/tests\/9124654"}]}
{"result":[{"9116913":9124655}],"test_url":[{"9116913":"\/tests\/9124655"}]}
{"result":[{"9116919":9124656}],"test_url":[{"9116919":"\/tests\/9124656"}]}
{"result":[{"9116925":9124657}],"test_url":[{"9116925":"\/tests\/9124657"}]}
{"result":[{"9116928":9124658}],"test_url":[{"9116928":"\/tests\/9124658"}]}
{"result":[{"9116902":9124659}],"test_url":[{"9116902":"\/tests\/9124659"}]}
{"result":[{"9116901":9124660}],"test_url":[{"9116901":"\/tests\/9124660"}]}
{"result":[{"9116914":9124661}],"test_url":[{"9116914":"\/tests\/9124661"}]}
{"result":[{"9116917":9124662}],"test_url":[{"9116917":"\/tests\/9124662"}]}
{"result":[{"9116918":9124663}],"test_url":[{"9116918":"\/tests\/9124663"}]}
{"result":[{"9116926":9124664}],"test_url":[{"9116926":"\/tests\/9124664"}]}
{"result":[{"9116920":9124665}],"test_url":[{"9116920":"\/tests\/9124665"}]}
{"result":[{"9116891":9124666}],"test_url":[{"9116891":"\/tests\/9124666"}]}
{"result":[{"9116922":9124667}],"test_url":[{"9116922":"\/tests\/9124667"}]}
{"result":[{"9116911":9124668}],"test_url":[{"9116911":"\/tests\/9124668"}]}
{"result":[{"9116907":9124669}],"test_url":[{"9116907":"\/tests\/9124669"}]}
{"result":[{"9116921":9124670}],"test_url":[{"9116921":"\/tests\/9124670"}]}
{"result":[{"9116927":9124671}],"test_url":[{"9116927":"\/tests\/9124671"}]}
{"result":[{"9116924":9124672}],"test_url":[{"9116924":"\/tests\/9124672"}]}
{"result":[{"9116895":9124673}],"test_url":[{"9116895":"\/tests\/9124673"}]}
{"result":[{"9116903":9124674}],"test_url":[{"9116903":"\/tests\/9124674"}]}
{"result":[{"9116897":9124675}],"test_url":[{"9116897":"\/tests\/9124675"}]}
{"result":[{"9116896":9124676}],"test_url":[{"9116896":"\/tests\/9124676"}]}
{"result":[{"9116876":9124677}],"test_url":[{"9116876":"\/tests\/9124677"}]}
{"result":[{"9116893":9124678}],"test_url":[{"9116893":"\/tests\/9124678"}]}
{"result":[{"9116909":9124679}],"test_url":[{"9116909":"\/tests\/9124679"}]}
{"result":[{"9116912":9124680}],"test_url":[{"9116912":"\/tests\/9124680"}]}
{"result":[{"9116916":9124681}],"test_url":[{"9116916":"\/tests\/9124681"}]}
{"result":[{"9116929":9124682}],"test_url":[{"9116929":"\/tests\/9124682"}]}
{"result":[{"9116931":9124683}],"test_url":[{"9116931":"\/tests\/9124683"}]}
{"result":[{"9116933":9124684}],"test_url":[{"9116933":"\/tests\/9124684"}]}
{"result":[{"9116934":9124685}],"test_url":[{"9116934":"\/tests\/9124685"}]}
{"result":[{"9116935":9124686}],"test_url":[{"9116935":"\/tests\/9124686"}]}
{"result":[{"9123817":9124687}],"test_url":[{"9123817":"\/tests\/9124687"}]}
{"result":[{"9123818":9124688}],"test_url":[{"9123818":"\/tests\/9124688"}]}
{"result":[{"9124251":9124689}],"test_url":[{"9124251":"\/tests\/9124689"}]}
{"result":[{"9124252":9124690}],"test_url":[{"9124252":"\/tests\/9124690"}]}
{"result":[{"9124253":9124691}],"test_url":[{"9124253":"\/tests\/9124691"}]}
{"result":[{"9124393":9124692}],"test_url":[{"9124393":"\/tests\/9124692"}]}
{"result":[{"9124375":9124693,"9124389":9124694,"9124390":9124695,"9124391":9124696,"9124392":9124697}],"test_url":[{"9124375":"\/tests\/9124693","9124389":"\/tests\/9124694","9124390":"\/tests\/9124695","9124391":"\/tests\/9124696","9124392":"\/tests\/9124697"}]}
{"result":[{"9122213":9124698}],"test_url":[{"9122213":"\/tests\/9124698"}]}
{"result":[{"9124451":9124699}],"test_url":[{"9124451":"\/tests\/9124699"}]}
{"result":[{"9124441":9124700}],"test_url":[{"9124441":"\/tests\/9124700"}]}
{"result":[{"9124442":9124701}],"test_url":[{"9124442":"\/tests\/9124701"}]}
{"result":[{"9124487":9124702}],"test_url":[{"9124487":"\/tests\/9124702"}]}
{"result":[{"9124520":9124703}],"test_url":[{"9124520":"\/tests\/9124703"}]}
{"result":[{"9124536":9124705}],"test_url":[{"9124536":"\/tests\/9124705"}]}
{"result":[{"9124401":9124706}],"test_url":[{"9124401":"\/tests\/9124706"}]}
{"result":[{"9124597":9124707}],"test_url":[{"9124597":"\/tests\/9124707"}]}
{"result":[{"9124380":9124708}],"test_url":[{"9124380":"\/tests\/9124708"}]}
{"result":[{"9124623":9124709}],"test_url":[{"9124623":"\/tests\/9124709"}]}
{"result":[{"9124629":9124710}],"test_url":[{"9124629":"\/tests\/9124710"}]}
{"result":[{"9116923":9124711}],"test_url":[{"9116923":"\/tests\/9124711"}]}
Updated by okurz over 2 years ago
I ran while sleep 10; do date && pgrep -a named; done. Starting:
Wed Jul 13 17:41:09 CEST 2022
…
Wed Jul 13 17:51:20 CEST 2022
6269 /usr/sbin/named -t /var/lib/named -u named
Wed Jul 13 17:51:30 CEST 2022
6269 /usr/sbin/named -t /var/lib/named -u named
Wed Jul 13 17:51:41 CEST 2022
7460 /bin/sh /etc/init.d/named stop
Wed Jul 13 17:51:51 CEST 2022
7554 /usr/sbin/named -t /var/lib/named -u named
so something/someone started/stopped/restarted named but a new instance was running. Then later
7554 /usr/sbin/named -t /var/lib/named -u named
Wed Jul 13 18:30:06 CEST 2022
7554 /usr/sbin/named -t /var/lib/named -u named
Wed Jul 13 18:30:16 CEST 2022
Wed Jul 13 18:30:26 CEST 2022
Wed Jul 13 18:30:36 CEST 2022
Wed Jul 13 18:30:46 CEST 2022
Wed Jul 13 18:30:56 CEST 2022
Wed Jul 13 18:31:06 CEST 2022
Maybe a conflict with /etc/init.d. So I did:
mkdir /etc/init.d/old_okurz_20220713
mv /etc/init.d/named /etc/init.d/old_okurz_20220713/
but then I got
qanet:/suse/okurz # systemctl start named
Warning: named.service changed on disk. Run 'systemctl daemon-reload' to reload units.
qanet:/suse/okurz # systemctl daemon-reload
You have new mail in /var/mail/root
qanet:/suse/okurz # systemctl start named
Failed to start named.service: Unit named.service failed to load: No such file or directory.
so the file is actually necessary. Reverted. In journalctl --since=today -u named I saw:
Jul 13 18:30:06 qanet named[7554]: mem.c:906: fatal error:
Jul 13 18:30:06 qanet named[7554]: malloc failed: Cannot allocate memory
Jul 13 18:30:06 qanet named[7554]: exiting (due to fatal error in library)
…
Jul 13 18:30:53 qanet systemd-coredump[8941]: Process 7554 (named) of user 44 dumped core.
Stack trace of thread 7556:
#0 0x00007fc030f390d7 raise (libc.so.6)
#1 0x00007fc030f3a4aa abort (libc.so.6)
#2 0x0000557772ca812f n/a (named)
#3 0x00007fc0330c8fe3 isc_error_fatal (libisc.so.1107)
#4 0x00007fc0330d9293 n/a (libisc.so.1107)
#5 0x00007fc0330d75b9 n/a (libisc.so.1107)
#6 0x00007fc0330d993d isc___mem_allocate (libisc.so.1107)
#7 0x00007fc0330dcbe3 isc___mem_strdup (libisc.so.1107)
#8 0x00007fc033ab08e1 n/a (libdns.so.1110)
#9 0x00007fc033ab3015 dns_resolver_createfetch3 (libdns.so.1110)
#10 0x0000557772caed13 n/a (named)
#11 0x0000557772cbc11b n/a (named)
That can certainly explain the problem of the disappearing named. Maybe a memory leak?
EDIT: According to https://flylib.com/books/en/2.684.1/limiting_the_memory_a_name_server_uses.html the configuration option datasize 200M; which we have in /etc/named.conf might be the problem. So I commented out that option and restarted the service again. I assume it's enough to rely on max-cache-size (https://www.zytrax.com/books/dns/ch7/hkpng.html#max-cache-size). Maybe ages ago "datasize" was by default much lower than 200M.
EDIT: 2022-07-13 19:00Z, named still running, so this seems to be better.
EDIT: 2022-07-14 07:00Z, the same named process is still running.
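A sketch of how such a change could be verified (a suggestion, not a record of what was done; the grep just confirms which memory-related options remain active):
grep -nE 'datasize|max-cache-size' /etc/named.conf    # datasize should now be commented out
named-checkconf /etc/named.conf && systemctl restart named
journalctl -u named --since=today | tail              # the 'max-cache-size 90%' line from the logs above should still appear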
Updated by okurz over 2 years ago
- Status changed from In Progress to Workable
Something that mkittler/nsinger/okurz can follow up with after the summer vacations.
Updated by okurz about 2 years ago
- Related to action #117043: Request DHCP+DNS services for new QE network zones, same as already provided for .qam.suse.de and .qa.suse.cz added
Updated by okurz almost 2 years ago
- Category set to Infrastructure
- Target version changed from Ready to future
I will track this outside our backlog. I assume that within 2023 we will clarify whether we will still use that installation or will have moved out of the corresponding server rooms and migrated to other services, which is likely.
Updated by okurz about 1 year ago
#117043 resolved. With https://gitlab.suse.de/qa-sle/qanet-configs/-/commit/6246fc46224606ba5932f6da6e6d6b87cbc722c5 qanet is still running a very limited DHCP server but forwards DNS to dns1.suse.de, dns1.prg2.suse.org and dns2.suse.de, which serve qa.suse.de now from https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4247. Blocked on #134051.
Updated by okurz about 1 year ago
- Related to action #132623: Decommissioning of selected selected LSQ QE machines from NUE1-SRV2 added
Updated by okurz about 1 year ago
- Tags changed from next-office-day, infra to infra
- Target version changed from future to Tools - Next
#134051 resolved. We still have https://gitlab.suse.de/qa-sle/qanet-configs/ which is needed for DHCP of the last machines in SRV2. Waiting for #132623. After those are gone we will mark the GitLab repo as archived.
Updated by okurz about 1 year ago
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4447 to remove DNS entries for the decommissioned qanet.
Updated by okurz about 1 year ago
- Status changed from Blocked to Resolved
- Target version changed from Tools - Next to Ready
Archived https://gitlab.suse.de/qa-sle/qanet-configs now. I updated https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs accordingly. Found no references in our team wiki or other wiki pages needing updates.