action #81192

closed

coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters

coordination #37910: [tools][epic] Migration of or away from qanet.qa.suse.de

[tools] Migrate (upgrade or replace) qanet.qa.suse.de to a supported, current OS size:M

Added by okurz over 3 years ago. Updated 5 months ago.

Status: Resolved
Priority: Low
Assignee:
Target version:
Start date: 2020-12-18
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

# cat /etc/SuSE-release 
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3

The services on this host that we rely upon:

  • named
  • dhcpcd
  • iPXE server (HTTP+TFTP)
  • cscreen (which is basically /usr/bin/SCREEN -d -m -S console -c /etc/cscreenrc); see the attach example below
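For reference, attaching to that shared cscreen session would typically be done like this (standard GNU screen usage, not a command documented in this ticket):

screen -x -S console   # multi-display attach to the session named "console", run as root on qanet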

Services that are currently running at the time of writing (excluding obvious system services):

root      2906  0.0  0.4  50704 24920 ?        Ss   Nov18   0:43 /usr/bin/SCREEN -d -m -S  console -c /etc/cscreenrc
root      6668  0.0  0.0  20444  1660 pts/4    Ss+  Nov18   0:42  \_ ipmitool -H ia64mm1001.qa.suse.de -P  shell
root      6669  0.0  0.0  20444  1660 pts/5    Ss+  Nov18   0:43  \_ ipmitool -H ia64ph1002.qa.suse.de -P  shell
root      6670  0.0  0.0  20444  1660 pts/6    Ss+  Nov18   0:43  \_ ipmitool -H ia64mm1006.qa.suse.de -P  shell
root      6671  0.0  0.0  20444  1664 pts/7    Ss+  Nov18   0:42  \_ ipmitool -H ia64mm1007.qa.suse.de -P  shell
root      6672  0.0  0.0  20444  1660 pts/8    Ss+  Nov18   0:42  \_ ipmitool -H ia64mm1008.qa.suse.de -P  shell
root      6673  0.0  0.0  20444  1660 pts/9    Ss+  Nov18   0:43  \_ ipmitool -H ia64mm1011.qa.suse.de -P XXXXXXXX shell
root      3135  0.0  0.0  68708  1036 ?        Ss   Nov18   0:03 /usr/sbin/lldpd
_lldpd    3158  0.0  0.0  68708   504 ?        S    Nov18   0:15  \_ /usr/sbin/lldpd
root      3174  0.0  0.0  27136   500 ?        Ss   Nov18   0:00 /usr/sbin/mcelog --daemon --config-file /etc/mcelog/mcelog.conf
root      3254  0.0  0.0  11324  1412 ?        S    Nov18   0:00 /bin/sh /usr/bin/mysqld_safe --mysqld=mysqld --user=mysql --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql
mysql     3596  0.0  0.7 406024 45536 ?        Sl   Nov18  31:06  \_ /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --log-error=/
named     3619  0.1  1.5 258384 90172 ?        Ssl  Nov18  74:51 /usr/sbin/named -t /var/lib/named -u named
root      3884  0.0  0.0  22844  1188 ?        S    Nov18   0:00 /usr/sbin/vsftpd
icinga    4268  0.0  0.0  39736   172 ?        Ss   Nov18   0:00 /usr/sbin/ido2db -c /etc/icinga/ido2db.cfg
root      4323  0.0  0.0  89532  1440 ?        Ss   Nov18   0:00 /usr/sbin/smbd -D -s /etc/samba/smb.conf
root      4428  0.0  0.0  89636   980 ?        S    Nov18   0:05  \_ /usr/sbin/smbd -D -s /etc/samba/smb.conf
root      4330  0.0  0.0  61656   848 ?        Sl   Nov18   0:12 /usr/sbin/ypbind
root      4432  0.0  0.0  34984   928 ?        Ssl  Nov18   0:00 /usr/sbin/automount -p /var/run/automount.pid
root      4464  0.0  0.0  23768   864 ?        Ss   Nov18   0:00 /usr/sbin/rpc.mountd
root      4620  0.0  0.3 388132 19828 ?        Ss   Nov18   0:56 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf -DSSL -DICINGA -DICINGAWEB
root      4663  0.0  0.0  15748   772 ?        Ss   Nov18   0:00 /usr/sbin/xinetd -pidfile /var/run/xinetd.init.pid
nagios    4692  0.0  0.0  19040   948 ?        Ss   Nov18   0:00 /usr/sbin/nrpe -c /etc/nrpe.cfg -d
nobody   23063  0.1  0.0 179336   920 ?        Ss   Dec04  39:26 /usr/sbin/atftpd --pidfile /var/run/atftpd/pid --daemon --verbose=7 /srv/tftp
dhcpd    22267  0.0  0.1  38808  8224 ?        Ss   Dec10   4:37 /usr/sbin/dhcpd6 -6 -cf /etc/dhcpd6.conf -pf /var/run/dhcpd6.pid -chroot /var/lib/dhcp6 -lf /db/dhcpd6.leases -user dhcpd -g
dhcpd    20606  0.1  0.1  39076  6744 ?        Ss   Dec17   1:04 /usr/sbin/dhcpd -4 -cf /etc/dhcpd.conf -pf /var/run/dhcpd.pid -chroot /var/lib/dhcp -lf /db/dhcpd.leases -user dhcpd -group

Acceptance criteria

  • AC1: qanet.qa is upgraded to a currently supported OS

Suggestion

  • I suggest creating a full system backup and then just live-migrating to a more recent version of SLE (a sketch follows below).
  • The storage system could be used to store the backup.
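A minimal sketch of such a backup, assuming rsync over SSH to a directory on backup.qa.suse.de (the filesystem list and target path are illustrative; a later comment shows what was eventually run):

for i in / /srv/ /var/ /data/; do rsync -aHP --one-file-system "$i" backup.qa.suse.de:/home/backup/qanet"$i"; done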

Related issues: 4 (1 open, 3 closed)

Related to QA - action #81200: [tools][labs] some partitions on qanet are 100% full, seems like /data/backups has no new archives since 20201009 due to that (Resolved, okurz, 2020-12-18)

Related to openQA Infrastructure - action #113357: UEFI PXE or "network boot" support within .qa.suse.de size:M (Workable, 2022-07-07)

Related to QA - action #117043: Request DHCP+DNS services for new QE network zones, same as already provided for .qam.suse.de and .qa.suse.cz (Resolved, okurz)

Related to QA - action #132623: Decommissioning of selected LSQ QE machines from NUE1-SRV2 (Resolved, okurz, 2023-07-12)

Actions #1

Updated by livdywan over 3 years ago

Two questions:

  • Is there an existing workflow to create backups? Snapshots? rsync? Something else?
  • I can't seem to log in as a user or root - how does one get SSH access on this machine?
Actions #2

Updated by okurz over 3 years ago

cdywan wrote:

Two questions:

  • Is there an existing workflow to create backups? Snapshots? rsync? Something else?

if by "snapshots" you mean btrfs or LVM snapshots that is not possible. Rsync or something else is suggested here.

  • I can't seem to log in as a user or root - how does one get SSH access on this machine?

You ask an existing user with root access to add your key to /root/.ssh/authorized_keys (see the illustration below). We can discuss this together with nsinger in a few days.
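Purely for illustration (the key file name is a placeholder), an existing root user would run something like:

cat /tmp/your_key.pub >> /root/.ssh/authorized_keys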

Actions #3

Updated by okurz over 3 years ago

  • Related to action #81200: [tools][labs] some partitions on qanet are 100% full, seems like /data/backups has no new archives since 20201009 due to that added
Actions #4

Updated by okurz over 3 years ago

  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler about 3 years ago

  • Assignee set to mkittler

I'd try this if @nicksinger is available again since I don't know much about the concrete system.

Would you recommend updating to SLE 15 right away? And about the backup: Where should I store it? Maybe the new storage server?

Actions #6

Updated by okurz about 3 years ago

mkittler wrote:

I'd try this if @nicksinger is available again since I don't know much about the concrete system.

Would you recommend updating to SLE 15 right away?

Yes, I suggest you coordinate with nicksinger. Also, maybe he is already in the process of preparing a complete replacement machine, unless I am confusing something.

And about the backup: Where should I store it? Maybe the new storage server?

No, we have "backup.qa.suse.de" which we can use unless it TBs of data as for openQA where we need a special solution.

Actions #7

Updated by mkittler about 3 years ago

  • Assignee deleted (mkittler)
Actions #8

Updated by okurz about 3 years ago

  • Assignee set to nicksinger

mkittler has unassigned himself but without comments. I can only assume this is based on a chat with nicksinger. So assigning to "nicksinger" to clarify and follow up :)

Actions #9

Updated by okurz about 3 years ago

  • Due date set to 2021-03-31

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by livdywan about 3 years ago

  • Due date deleted (2021-03-31)

okurz wrote:

mkittler has unassigned himself but without comments. I can only assume this is based on a chat with nicksinger. So assigning to "nicksinger" to clarify and follow up :)

@nicksinger @mkittler Are you guys still planning to work on this together? Or one of you? 🤔️

I would also generally consider taking it, assuming rsync to backup.qa.suse.de is an okay approach going by the comments above. That was why I was hesitating to do it before.

Actions #11

Updated by openqa_review about 3 years ago

  • Due date set to 2021-04-22

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by okurz about 3 years ago

  • Due date deleted (2021-04-22)

For now no due date on "Workable", see https://github.com/os-autoinst/scripts/pull/71

Actions #13

Updated by okurz almost 3 years ago

  • Priority changed from Normal to Low

Discussed with nicksinger: we plan to follow up here but don't necessarily need to act that soon.

Actions #14

Updated by nicksinger almost 3 years ago

  • Status changed from Workable to In Progress
Actions #15

Updated by nicksinger almost 3 years ago

I tried the whole day to boot an EFI-compatible ISO image over HTTP but failed. Nothing I tried was accepted by the server: no openSUSE ISO, no iPXE payload, nothing.
I now tried the "CD-ROM Image" option in the BMC. It requires a Windows (Samba) share with a protocol version from the "Windows NT" era. This is absolutely ridiculous but seems to have worked with the following smb.conf on my workstation:

[global]
    workgroup = WORKGROUP
    passdb backend = tdbsam
    map to guest = Bad User
    usershare allow guests = Yes
    log level = 3
    log file = /var/log/samba/%m.log
    min protocol = NT1
    max protocol = SMB3

[boot]
    comment = boot
    path = /home/nsinger/Downloads/opensuse
    public = yes
    read only = no
    force user = nsinger

With this I was able to mount the ISO in the BMC which then created a "Virtual CD drive" on the server. I could boot from it and see GRUB. Next will be an installation of the base system.
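To verify that such a share is reachable before pointing the BMC at it, something along these lines could be used (the hostname is a placeholder, not taken from this ticket):

smbclient -L //workstation -N          # list shares anonymously
smbclient //workstation/boot -N -c ls  # list the contents of the "boot" share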

Actions #16

Updated by nicksinger almost 3 years ago

Base system is installed now. I will now deploy a basic salt infra based on what we already have in https://gitlab.suse.de/qa-sle/qanet-salt

Actions #17

Updated by nicksinger almost 3 years ago

User creation, SSH key management and PostgreSQL installation are done now.

Actions #18

Updated by nicksinger almost 3 years ago

Added PowerDNS packages and some basic PostgreSQL configuration for now. Currently struggling to get Salt to create the psql user with the right password hash.

Actions #19

Updated by nicksinger almost 3 years ago

Last Friday I figured out that this is caused by an outdated version of Salt in openSUSE. I've opened https://bugzilla.opensuse.org/show_bug.cgi?id=1186500 to address that issue and went with md5 encryption for now (https://gitlab.suse.de/qa-sle/qanet-salt/-/commit/6fe695c82527110acaa61e6f1d4391bdf99943b1)

Actions #20

Updated by nicksinger almost 3 years ago

Initial database initialization as well as the authoritative and recursive PowerDNS config were added to Salt.
With this the server is now a slave for the current qanet and delivers the same results:

selenium ~ » dig holmes-4.qa.suse.de @qanet.qa.suse.de +short
10.162.2.104
selenium ~ » dig holmes-4.qa.suse.de @qanet2.qa.suse.de +short
10.162.2.104

Next I need to figure out which parts of the old configuration are still valid:

allow-recursion { localnets; localhost; 10.120.0.40; 10.120.0.41; 10.120.0.44; 10.120.0.45 ; 149.44.176.22; 10.160.0.40; 10.160.0.41; 10.160.0.44; 10.160.0.45; 149.44.176.36; 149.44.176.37; 149.44.176.22; 10.162.64.10; 10.0.0.0/8; };
also-notify { 149.44.160.72; 10.160.0.1; 10.160.2.88; 10.100.2.8; 10.100.2.10; 10.162.0.2; };
allow-transfer { 149.44.160.1; 149.44.160.160; 149.44.160.72; 10.120.0.1; 10.120.0.150; 10.120.2.88; 10.160.0.1; 10.160.0.150; 10.160.2.88; 10.100.2.8; 10.100.2.10; 10.162.0.2; };
Actions #21

Updated by okurz almost 3 years ago

  • Status changed from In Progress to Workable
Actions #22

Updated by okurz almost 3 years ago

  • Status changed from Workable to New

Moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size.

Actions #23

Updated by ilausuch almost 3 years ago

  • Subject changed from [tools] Upgrade qanet.qa.suse.de to a supported, current OS to [tools] Upgrade qanet.qa.suse.de to a supported, current OS size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #24

Updated by okurz over 2 years ago

  • Subject changed from [tools] Upgrade qanet.qa.suse.de to a supported, current OS size:M to [tools] Migrate (upgrade or replace) qanet.qa.suse.de to a supported, current OS size:M
Actions #25

Updated by okurz about 2 years ago

  • Status changed from Workable to In Progress
  • Assignee changed from nicksinger to okurz

I found that multiple partitions had been 100% full. I added myself to group "wheel" and allowed wheel users sudo without password so that I can log in as my own user and others can see that I logged in. I then deleted some old stuff from the full partitions, e.g. a lot of old automatic intermediate backup directories which had likely prevented new backups for years already. I created a new SSH keypair on qanet as otherwise I wouldn't be able to access a remote backup location anyway. So I ran ssh-keygen and copied the public key into backup.qa.suse.de:/root/.ssh/authorized_keys. Then I created a directory /home/backup/qanet/ on backup.qa and on qanet called:

for i in / /srv/ /var/ /data/; do rsync -aHP --one-file-system $i backup.qa:/home/backup/qanet$i; done

By the way, sudo du -x --max-depth 1 -BM / | sort -n shows what we need to care about primarily from the root filesystem when trying an upgrade:

0M      ./dev
0M      ./mounts
0M      ./proc
0M      ./suse
0M      ./sys
1M      ./boot
1M      ./csv
1M      ./data
1M      ./dist
1M      ./img
1M      ./lost+found
1M      ./media
1M      ./mnt
1M      ./secret
1M      ./selinux
1M      ./srv
1M      ./tftproot
1M      ./tmp
1M      ./var
10M     ./bin
15M     ./sbin
20M     ./lib64
28M     ./etc
152M    ./lib
319M    ./opt
2544M   ./home
3598M   ./usr
5880M   ./root
12562M  .

So most is in /root and also a lot in /home. I suggest that after the backup is complete we replicate the environment into a VM, or anywhere we can run a chroot or container environment, excluding /root/* and /home/*, and experiment with live upgrades (a sketch follows below).
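A rough sketch of such a replication into a directory usable for chroot-based upgrade experiments, assuming a scratch location on some other host (host and paths are placeholders):

mkdir -p /var/tmp/qanet-root
rsync -aHP --one-file-system --exclude='/root/*' --exclude='/home/*' qanet.qa:/ /var/tmp/qanet-root/
chroot /var/tmp/qanet-root /bin/bash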

Actions #26

Updated by okurz about 2 years ago

When trying to conduct the backup I noticed quite slow transfer speeds.

I did a benchmark with

qanetnue:/suse/okurz # dd bs=100M count=20 if=/dev/zero | nc -l 42420

and

backup-vm:/home/okurz # nc qanet.qa 42420 | dd of=/dev/null status=progress

and the result is

116733440 bytes (117 MB, 111 MiB) copied, 1948 s, 59.9 kB/s

so abysmally slow network speed.

-> #107437

Actions #27

Updated by okurz about 2 years ago

With #107437 resolved I can continue. Speed looks much better now.

EDIT: Backup complete.

Actions #28

Updated by okurz about 2 years ago

  • Status changed from In Progress to Workable

I would like to pick up the work again in a mob-session, e.g. together with nsinger.

Actions #29

Updated by okurz about 2 years ago

Now I know why backups filled up our root partition on qanet.qa in the past years. /etc/cron.weekly/removeoldqabackups.sh has:

rm /backups/qa.suse.de_201*.tar.gz-* > /tmp/cronout-$mydate

Guess why the problem started in the year 2020 :facepalm:. Fixed by replacing it with:

find /backups/ -mtime 30 -delete > /tmp/cronout-$mydate
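Note that with GNU find -mtime 30 matches files last modified exactly 30 days ago (rounded down to whole days); if the intent is "older than 30 days", the test would rather be -mtime +30, i.e. something like:

find /backups/ -mtime +30 -delete > /tmp/cronout-$mydate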
Actions #30

Updated by okurz about 2 years ago

  • Parent task set to #109743
Actions #31

Updated by okurz almost 2 years ago

  • Tags set to next-office-day
Actions #32

Updated by okurz almost 2 years ago

Created a full root partition image backup with this command on backup.qa:

backup-vm:/home/backup/qanet # nc qanet 42420 | pv > sda2_root-$(date +%F).img

and from qanet

dd bs=1M if=/dev/sda2 | nc backup 42420
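A quick sanity check of the transferred image could be to compare checksums on both ends (just a suggestion, not part of what was done here):

# on qanet
dd bs=1M if=/dev/sda2 | sha256sum
# on backup.qa
sha256sum sda2_root-*.img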
Actions #33

Updated by okurz almost 2 years ago

  • Status changed from Workable to In Progress

We managed to upgrade from sle11sp3 to sle11sp4 but without any maintenance updates yet. It would be interesting to see if we can find maintenance update repos. Removed a lot of packages for services that were not running anymore, also the X11 stack. We restarted dhcpd and named and everything is fine there. Next step: trying to upgrade to SLE12.

Actions #34

Updated by okurz almost 2 years ago

Trying zypper dup from a mounted SLE12GM ISO image yields

qanetnue:/tmp # zypper dup -r sle12gm
Loading repository data...
Reading installed packages...
Computing distribution upgrade...
29 Problems:
Problem: solvable libgcc_s1-4.8.3+r212056-6.3.x86_64 conflicts with libgcc_s1 provided by itself
Problem: solvable libgcc_s1-32bit-4.8.3+r212056-6.3.x86_64 conflicts with libgcc_s1-32bit provided by itself
Problem: solvable libgfortran3-4.8.3+r212056-6.3.x86_64 conflicts with libgfortran3 provided by itself
Problem: solvable libgomp1-4.8.3+r212056-6.3.x86_64 conflicts with libgomp1 provided by itself
Problem: solvable libquadmath0-4.8.3+r212056-6.3.x86_64 conflicts with libquadmath0 provided by itself
Problem: solvable libstdc++6-4.8.3+r212056-6.3.x86_64 conflicts with libstdc++6 provided by itself
Problem: solvable libstdc++6-32bit-4.8.3+r212056-6.3.x86_64 conflicts with libstdc++6-32bit provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-32bit-4.8.3+r212056-6.3.x86_64 conflicts with libffi4-32bit provided by itself
Problem: solvable libtsan0-4.8.3+r212056-6.3.x86_64 conflicts with libtsan0 provided by itself
Problem: solvable libtsan0-4.8.3+r212056-6.3.x86_64 conflicts with libtsan0 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: ca-certificates-mozilla-1.97-4.5.noarch requires ca-certificates, but this requirement cannot be provided
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself
Problem: solvable libffi4-4.8.3+r212056-6.3.x86_64 conflicts with libffi4 provided by itself

Problem: solvable libgcc_s1-4.8.3+r212056-6.3.x86_64 conflicts with libgcc_s1 provided by itself

so quite problematic.

So trying to find and add update repositories.

zypper ar http://dist.suse.de/updates/repo/\$RCE/SLES11-SP4-LTSS-Updates/sle-11-x86_64/SUSE:Updates:SLE-SERVER:11-SP4-LTSS:x86_64.repo

Seems to work; mind the escaped \$RCE. We might need the non-LTSS directories in parallel.

EDIT: Added the non-LTSS updates repo:

zypper ar http://dist.suse.de/updates/repo/\$RCE/SLES11-SP4-Updates/sle-11-x86_64/SUSE:Updates:SLE-SERVER:11-SP4:x86_64.repo

Now zypper patch looks sane. Calling it once updated zypper; calling it a second time installs much more. Now zypper patch is clean, as is zypper up. But zypper dup is stuck on conflicts:

3 Problems:
Problem: nothing provides libaudit.so.0 needed by pam-32bit-1.1.5-0.17.2.x86_64
Problem: nothing provides libgdbm.so.3 needed by perl-32bit-5.10.0-64.80.1.x86_64
Problem: nothing provides libaudit.so.0 needed by pam-32bit-1.1.5-0.17.2.x86_64

Likely zypper rm -u pam-32bit could work. We shouldn't need that many 32-bit packages, if any. Done that.

The following packages are going to be REMOVED:
  ConsoleKit-32bit cryptconfig-32bit pam-32bit pam-modules-32bit pam_mount-32bit samba-32bit samba-winbind-32bit sssd-32bit 

I restarted named and dhcpd and they seem to be still working fine.

Actions #35

Updated by okurz almost 2 years ago

  • Status changed from In Progress to Workable
Actions #36

Updated by okurz almost 2 years ago

  • Status changed from Workable to In Progress

Currently re-running an image-based backup of the root partition as preparation for the next upgrade steps. I compressed the previous image on backup.qa. Now on qanet:

dd bs=1M if=/dev/sda2 | nc -l 42420

on backup.qa

nc qanet 42420 | pv | xz -c - > sda2_root-$(date +%F).img.xz
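For completeness, restoring such an image would be roughly the reverse (a sketch only, not performed in this ticket; qanet would have to be booted into a rescue system):

# on qanet, booted into a rescue system
nc -l 42420 | dd bs=1M of=/dev/sda2
# on backup.qa
xzcat sda2_root-*.img.xz | nc qanet 42420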
Actions #37

Updated by okurz almost 2 years ago

  • Status changed from In Progress to Workable
Actions #38

Updated by okurz almost 2 years ago

Yesterday nicksinger and I added the sle11sp4 ISO back as a repo and then conducted a zypper dup, ending up in a consistent state. Then we tried to do an online upgrade with sle12gm repos but ran again into the problems shown in #81192#note-34. We plan to continue next week Tuesday with a medium-based migration.

Actions #39

Updated by okurz almost 2 years ago

  • Status changed from Workable to In Progress

In SRV2 with nicksinger. I put SLES12SP3 on a USB thumbdrive. We connected the thumbdrive to qanet, first in the back, then in the front. We found that the USB device is not offered for booting, likely not supported. So we booted the original system again, mounted the thumbdrive and executed kexec, along the lines of:

mount /dev/sdc2 /mnt/iso
kexec --initrd=/mnt/iso/boot/loader/x86_64/initrd --command-line="upgrade=1 textmode=1 ssh=1 sshpassword=XXX ifcfg=eth0=10.162.0.1/18,10.162.163.254 nameserver=10.162.163.254"

We used a local VGA monitor and a PS/2 keyboard as USB is not supported in the BIOS menu, but we could have done the kexec remotely as well. We just used the local VGA connection to be able to monitor the boot process.

Actions #40

Updated by okurz almost 2 years ago

A problem seems to be that the one monitor we have in SRV2 does not support the resolution chosen at boot (1050 vertical), so soon after boot the monitor does not show anything anymore. Still, the system booted and was reachable over SSH. The DNS server named and dhcpd are running fine, so core services are available. Some failed services, e.g. apache2, all seem to be non-critical and can be cared about later. Trying to add update repos we hit a problem that curl could not initialize. Calling ldd $(which curl) revealed that some /usr/lib/vmware directories were in the list, which sounded fishy. We renamed that vmware folder with the extension .old and curl and zypper were running fine, so we could add https://updates.suse.de/SUSE/Updates/SLE-SERVER/12-SP3/x86_64/update/SUSE:Updates:SLE-SERVER:12-SP3:x86_64.repo and https://updates.suse.de/SUSE/Updates/SLE-SERVER/12-SP3-LTSS/x86_64/update/SUSE:Updates:SLE-SERVER:12-SP3-LTSS:x86_64.repo and https://ca.suse.de/ and call zypper dup to bring the system into a properly updated state (the commands are sketched below). After that we rebooted twice to check. The initial GRUB screen looks weird and no real menu shows up, at least not on VGA, but eventually the system boots fine, so good enough.
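Reconstructed from the description above, the repo/upgrade steps were along these lines (the exact rename target of the vmware folder is an assumption):

mv /usr/lib/vmware /usr/lib/vmware.old
zypper ar https://updates.suse.de/SUSE/Updates/SLE-SERVER/12-SP3/x86_64/update/SUSE:Updates:SLE-SERVER:12-SP3:x86_64.repo
zypper ar https://updates.suse.de/SUSE/Updates/SLE-SERVER/12-SP3-LTSS/x86_64/update/SUSE:Updates:SLE-SERVER:12-SP3-LTSS:x86_64.repo
zypper dup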

Next tasks:

  • DONE: Check all failed systemd services
  • Configure automatic updates and reboots, e.g. use salt same as for backup.qa and alike?
  • Upgrade to more recent versions of SLE, then sidegrade to Leap. Or just go to Leap 42.3 now and then upgrade, assuming we can still find the corresponding repos
  • Cleanup more old cruft, like apache2, 32bit libraries, etc.
  • Consider repartitioning with optional move of / from ext3 to btrfs
  • Review all .rpm* files in /etc
Actions #41

Updated by okurz almost 2 years ago

I would say the next step is that we visit Maxtorhof again, side-grade to Leap and then upgrade all the way to 15.4. As http://download.opensuse.org/distribution/leap/ goes all the way down to 42.3 I see it as easiest to go to Leap first. Bonus points for building a poor man's KVM with a Raspberry Pi, or we just connect the serial port to another machine and plug the power into a remote-controlled PDU :) We can use qamaster as serial host. The PDUs are already connected but need cable tracing. A further idea: a backup qanet as a VM on qamaster.

Actions #42

Updated by okurz almost 2 years ago

  • Related to action #113357: UEFI PXE or "network boot" support within .qa.suse.de size:M added
Actions #43

Updated by nicksinger almost 2 years ago

The serial port is now connected to qamaster and a console is reachable. I also re-plugged one of the Y-power-connectors so all 4 PSUs are now connected to qaps09 - see the port documentation in https://racktables.suse.de/index.php?page=object&tab=ports&object_id=1610 or in the web interface of qaps09.

Actions #44

Updated by nicksinger almost 2 years ago

The online migration from 12SP3->12SP4->12SP5 is done now. The system works fine after a reboot. According to https://documentation.suse.com/sles/15-SP4/html/SLES-all/cha-upgrade-paths.html#sec-upgrade-paths-supported the upgrade path to SLE15 is again only supported offline, so we might consider going to Leap straight away.

Actions #45

Updated by okurz almost 2 years ago

Thank you. Next step can be again the "kexec into the downloaded iso" approach and we can dare to do this remotely with serial and remote power control :)

Actions #46

Updated by okurz almost 2 years ago

At around 1600 CEST a problem was reported that DNS resolution on grenache-1 does not work.

grenache-1:~ # host openqa.suse.de
;; connection timed out; no servers could be reached

ping -c 1 -4 10.160.0.207, the IPv4 address of OSD, is fine, same as ping -c 1 -6 2620:113:80c0:8080:10:160:0:207. DNS resolution works now after I restarted named on qanet, not sure why.

Logs of named on qanet:

okurz@qanet:~ 0 (master) $ sudo systemctl status named.service
● named.service - LSB: Domain Name System (DNS) server, named
   Loaded: loaded (/etc/init.d/named; bad; vendor preset: disabled)
   Active: active (exited) since Wed 2022-07-13 15:05:41 CEST; 1h 13min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 1798 ExecStart=/etc/init.d/named start (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 512)

Jul 13 15:05:40 qanet named[1875]: automatic empty zone: HOME.ARPA
Jul 13 15:05:40 qanet named[1875]: none:104: 'max-cache-size 90%' - setting to 5356MB (out of 5951MB)
Jul 13 15:05:40 qanet named[1875]: configuring command channel from '/etc/rndc.key'
Jul 13 15:05:40 qanet named[1875]: command channel listening on 127.0.0.1#953
Jul 13 15:05:40 qanet named[1875]: configuring command channel from '/etc/rndc.key'
Jul 13 15:05:40 qanet named[1875]: command channel listening on ::1#953
Jul 13 15:05:41 qanet named[1875]: zone qa.suse.de/IN: cloud2.qa.suse.de/NS 'crowbar.cloud2adm.qa.suse.de' has no SIBLING GLUE address records (A or AAAA)
Jul 13 15:05:41 qanet named[1875]: zone qa.suse.de/IN: cloud3.qa.suse.de/NS 'crowbar.cloud3adm.qa.suse.de' has no SIBLING GLUE address records (A or AAAA)
Jul 13 15:05:41 qanet named[1798]: Starting name server BIND ..done
Jul 13 15:05:41 qanet systemd[1]: Started LSB: Domain Name System (DNS) server, named.

okurz@qanet:~ 3 (master) $ sudo systemctl restart named
okurz@qanet:~ 0 (master) $ sudo journalctl -f -u named
-- Logs begin at Wed 2022-07-13 15:04:39 CEST. --
Jul 13 16:19:31 qanet named[5334]: automatic empty zone: HOME.ARPA
Jul 13 16:19:31 qanet named[5334]: none:104: 'max-cache-size 90%' - setting to 5356MB (out of 5951MB)
Jul 13 16:19:31 qanet named[5334]: configuring command channel from '/etc/rndc.key'
Jul 13 16:19:31 qanet named[5334]: command channel listening on 127.0.0.1#953
Jul 13 16:19:31 qanet named[5334]: configuring command channel from '/etc/rndc.key'
Jul 13 16:19:31 qanet named[5334]: command channel listening on ::1#953
Jul 13 16:19:31 qanet named[5334]: zone qa.suse.de/IN: cloud2.qa.suse.de/NS 'crowbar.cloud2adm.qa.suse.de' has no SIBLING GLUE address records (A or AAAA)
Jul 13 16:19:31 qanet named[5334]: zone qa.suse.de/IN: cloud3.qa.suse.de/NS 'crowbar.cloud3adm.qa.suse.de' has no SIBLING GLUE address records (A or AAAA)
Jul 13 16:19:31 qanet named[5268]: Starting name server BIND - Warning: /var/lib/named/var/run/named/named.pid exists! ..done
Jul 13 16:19:31 qanet systemd[1]: Started LSB: Domain Name System (DNS) server, named.

Handling some restarts:

host=openqa.suse.de failed_since="2022-07-13 13:00" openqa-advanced-retrigger-jobs 
result="result='failed'" host=openqa.suse.de failed_since="2022-07-13 13:00" openqa-advanced-retrigger-jobs
{"result":[{"9116842":9124646}],"test_url":[{"9116842":"\/tests\/9124646"}]}
{"enforceable":1,"errors":["Job 9124626 misses the following mandatory assets: hdd\/SLES-15-x86_64-mru-install-minimal-with-addons-Build20220617-1-Server-DVD-Updates-64bit.qcow2\nEnsure to provide mandatory assets and\/or force retriggering if necessary."],"result":[],"test_url":[]}
{"enforceable":1,"errors":["Job 9124628 misses the following mandatory assets: hdd\/SLES-12-SP4-x86_64-mru-install-minimal-with-addons-Build20220627-1-Server-DVD-Updates-64bit.qcow2\nEnsure to provide mandatory assets and\/or force retriggering if necessary."],"result":[],"test_url":[]}
{"result":[{"9122209":9124648}],"test_url":[{"9122209":"\/tests\/9124648"}]}
{"result":[{"9122212":9124649}],"test_url":[{"9122212":"\/tests\/9124649"}]}
{"result":[{"9122215":9124650}],"test_url":[{"9122215":"\/tests\/9124650"}]}
{"result":[{"9122214":9124651}],"test_url":[{"9122214":"\/tests\/9124651"}]}
{"result":[{"9116915":9124652}],"test_url":[{"9116915":"\/tests\/9124652"}]}
{"result":[{"9116910":9124653}],"test_url":[{"9116910":"\/tests\/9124653"}]}
{"result":[{"9116877":9124654}],"test_url":[{"9116877":"\/tests\/9124654"}]}
{"result":[{"9116913":9124655}],"test_url":[{"9116913":"\/tests\/9124655"}]}
{"result":[{"9116919":9124656}],"test_url":[{"9116919":"\/tests\/9124656"}]}
{"result":[{"9116925":9124657}],"test_url":[{"9116925":"\/tests\/9124657"}]}
{"result":[{"9116928":9124658}],"test_url":[{"9116928":"\/tests\/9124658"}]}
{"result":[{"9116902":9124659}],"test_url":[{"9116902":"\/tests\/9124659"}]}
{"result":[{"9116901":9124660}],"test_url":[{"9116901":"\/tests\/9124660"}]}
{"result":[{"9116914":9124661}],"test_url":[{"9116914":"\/tests\/9124661"}]}
{"result":[{"9116917":9124662}],"test_url":[{"9116917":"\/tests\/9124662"}]}
{"result":[{"9116918":9124663}],"test_url":[{"9116918":"\/tests\/9124663"}]}
{"result":[{"9116926":9124664}],"test_url":[{"9116926":"\/tests\/9124664"}]}
{"result":[{"9116920":9124665}],"test_url":[{"9116920":"\/tests\/9124665"}]}
{"result":[{"9116891":9124666}],"test_url":[{"9116891":"\/tests\/9124666"}]}
{"result":[{"9116922":9124667}],"test_url":[{"9116922":"\/tests\/9124667"}]}
{"result":[{"9116911":9124668}],"test_url":[{"9116911":"\/tests\/9124668"}]}
{"result":[{"9116907":9124669}],"test_url":[{"9116907":"\/tests\/9124669"}]}
{"result":[{"9116921":9124670}],"test_url":[{"9116921":"\/tests\/9124670"}]}
{"result":[{"9116927":9124671}],"test_url":[{"9116927":"\/tests\/9124671"}]}
{"result":[{"9116924":9124672}],"test_url":[{"9116924":"\/tests\/9124672"}]}
{"result":[{"9116895":9124673}],"test_url":[{"9116895":"\/tests\/9124673"}]}
{"result":[{"9116903":9124674}],"test_url":[{"9116903":"\/tests\/9124674"}]}
{"result":[{"9116897":9124675}],"test_url":[{"9116897":"\/tests\/9124675"}]}
{"result":[{"9116896":9124676}],"test_url":[{"9116896":"\/tests\/9124676"}]}
{"result":[{"9116876":9124677}],"test_url":[{"9116876":"\/tests\/9124677"}]}
{"result":[{"9116893":9124678}],"test_url":[{"9116893":"\/tests\/9124678"}]}
{"result":[{"9116909":9124679}],"test_url":[{"9116909":"\/tests\/9124679"}]}
{"result":[{"9116912":9124680}],"test_url":[{"9116912":"\/tests\/9124680"}]}
{"result":[{"9116916":9124681}],"test_url":[{"9116916":"\/tests\/9124681"}]}
{"result":[{"9116929":9124682}],"test_url":[{"9116929":"\/tests\/9124682"}]}
{"result":[{"9116931":9124683}],"test_url":[{"9116931":"\/tests\/9124683"}]}
{"result":[{"9116933":9124684}],"test_url":[{"9116933":"\/tests\/9124684"}]}
{"result":[{"9116934":9124685}],"test_url":[{"9116934":"\/tests\/9124685"}]}
{"result":[{"9116935":9124686}],"test_url":[{"9116935":"\/tests\/9124686"}]}
{"result":[{"9123817":9124687}],"test_url":[{"9123817":"\/tests\/9124687"}]}
{"result":[{"9123818":9124688}],"test_url":[{"9123818":"\/tests\/9124688"}]}
{"result":[{"9124251":9124689}],"test_url":[{"9124251":"\/tests\/9124689"}]}
{"result":[{"9124252":9124690}],"test_url":[{"9124252":"\/tests\/9124690"}]}
{"result":[{"9124253":9124691}],"test_url":[{"9124253":"\/tests\/9124691"}]}
{"result":[{"9124393":9124692}],"test_url":[{"9124393":"\/tests\/9124692"}]}
{"result":[{"9124375":9124693,"9124389":9124694,"9124390":9124695,"9124391":9124696,"9124392":9124697}],"test_url":[{"9124375":"\/tests\/9124693","9124389":"\/tests\/9124694","9124390":"\/tests\/9124695","9124391":"\/tests\/9124696","9124392":"\/tests\/9124697"}]}
{"result":[{"9122213":9124698}],"test_url":[{"9122213":"\/tests\/9124698"}]}
{"result":[{"9124451":9124699}],"test_url":[{"9124451":"\/tests\/9124699"}]}
{"result":[{"9124441":9124700}],"test_url":[{"9124441":"\/tests\/9124700"}]}
{"result":[{"9124442":9124701}],"test_url":[{"9124442":"\/tests\/9124701"}]}
{"result":[{"9124487":9124702}],"test_url":[{"9124487":"\/tests\/9124702"}]}
{"result":[{"9124520":9124703}],"test_url":[{"9124520":"\/tests\/9124703"}]}
{"result":[{"9124536":9124705}],"test_url":[{"9124536":"\/tests\/9124705"}]}
{"result":[{"9124401":9124706}],"test_url":[{"9124401":"\/tests\/9124706"}]}
{"result":[{"9124597":9124707}],"test_url":[{"9124597":"\/tests\/9124707"}]}
{"result":[{"9124380":9124708}],"test_url":[{"9124380":"\/tests\/9124708"}]}
{"result":[{"9124623":9124709}],"test_url":[{"9124623":"\/tests\/9124709"}]}
{"result":[{"9124629":9124710}],"test_url":[{"9124629":"\/tests\/9124710"}]}
{"result":[{"9116923":9124711}],"test_url":[{"9116923":"\/tests\/9124711"}]}
Actions #47

Updated by okurz almost 2 years ago

I ran while sleep 10; do date && pgrep -a named; done. The output starts with:

Wed Jul 13 17:41:09 CEST 2022
…
Wed Jul 13 17:51:20 CEST 2022
6269 /usr/sbin/named -t /var/lib/named -u named
Wed Jul 13 17:51:30 CEST 2022
6269 /usr/sbin/named -t /var/lib/named -u named
Wed Jul 13 17:51:41 CEST 2022
7460 /bin/sh /etc/init.d/named stop
Wed Jul 13 17:51:51 CEST 2022
7554 /usr/sbin/named -t /var/lib/named -u named

So something/someone started/stopped/restarted named but a new instance was running. Then later:

7554 /usr/sbin/named -t /var/lib/named -u named
Wed Jul 13 18:30:06 CEST 2022
7554 /usr/sbin/named -t /var/lib/named -u named
Wed Jul 13 18:30:16 CEST 2022
Wed Jul 13 18:30:26 CEST 2022
Wed Jul 13 18:30:36 CEST 2022
Wed Jul 13 18:30:46 CEST 2022
Wed Jul 13 18:30:56 CEST 2022
Wed Jul 13 18:31:06 CEST 2022

Maybe a conflict with /etc/init.d. So I did:

mkdir /etc/init.d/old_okurz_20220713
mv /etc/init.d/named /etc/init.d/old_okurz_20220713/

but then I got

qanet:/suse/okurz # systemctl start named
Warning: named.service changed on disk. Run 'systemctl daemon-reload' to reload units.
qanet:/suse/okurz # systemctl daemon-reload
You have new mail in /var/mail/root
qanet:/suse/okurz # systemctl start named
Failed to start named.service: Unit named.service failed to load: No such file or directory.

So the file is actually necessary. Reverted. In journalctl --since=today -u named I saw:

Jul 13 18:30:06 qanet named[7554]: mem.c:906: fatal error:
Jul 13 18:30:06 qanet named[7554]: malloc failed: Cannot allocate memory
Jul 13 18:30:06 qanet named[7554]: exiting (due to fatal error in library)
…
Jul 13 18:30:53 qanet systemd-coredump[8941]: Process 7554 (named) of user 44 dumped core.

                                              Stack trace of thread 7556:
                                              #0  0x00007fc030f390d7 raise (libc.so.6)
                                              #1  0x00007fc030f3a4aa abort (libc.so.6)
                                              #2  0x0000557772ca812f n/a (named)
                                              #3  0x00007fc0330c8fe3 isc_error_fatal (libisc.so.1107)
                                              #4  0x00007fc0330d9293 n/a (libisc.so.1107)
                                              #5  0x00007fc0330d75b9 n/a (libisc.so.1107)
                                              #6  0x00007fc0330d993d isc___mem_allocate (libisc.so.1107)
                                              #7  0x00007fc0330dcbe3 isc___mem_strdup (libisc.so.1107)
                                              #8  0x00007fc033ab08e1 n/a (libdns.so.1110)
                                              #9  0x00007fc033ab3015 dns_resolver_createfetch3 (libdns.so.1110)
                                              #10 0x0000557772caed13 n/a (named)
                                              #11 0x0000557772cbc11b n/a (named)

That can certainly explain the problem of the disappearing named. Maybe a memory leak?

EDIT: According to https://flylib.com/books/en/2.684.1/limiting_the_memory_a_name_server_uses.html the configuration option datasize 200M; which we have in /etc/named.conf might be the problem. So I commented out that option and restarted the service again. I assume it's enough to rely on https://www.zytrax.com/books/dns/ch7/hkpng.html#max-cache-size. Maybe ages ago "datasize" defaulted to much less than 200M.
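For reference, the relevant fragment of /etc/named.conf after this change might look like the following sketch (surrounding options elided):

options {
        // datasize 200M;    // commented out; relying on named's max-cache-size handling instead
};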

EDIT: 2022-07-13 19:00Z, named still running so seems to be better.

EDIT: 2022-07-14 07:00Z, same named process still running.

Actions #48

Updated by okurz over 1 year ago

  • Status changed from In Progress to Workable

Something that mkittler/nsinger/okurz can follow up with after the summer vacations.

Actions #52

Updated by okurz over 1 year ago

  • Related to action #117043: Request DHCP+DNS services for new QE network zones, same as already provided for .qam.suse.de and .qa.suse.cz added
Actions #53

Updated by okurz over 1 year ago

  • Status changed from Workable to Blocked

In #117043 we plan to migrate our network infrastructure to an Eng-Infra maintained DHCP+DNS service which should replace qanet, blocking on #117043

Actions #54

Updated by okurz about 1 year ago

  • Category set to Infrastructure
  • Target version changed from Ready to future

I will track this outside our backlog. I assume that within 2023 we will clarify whether we will still use that installation or will have moved out of the corresponding server rooms and migrated to other services, which is likely.

Actions #55

Updated by okurz 11 months ago

  • Parent task changed from #109743 to #37910
Actions #56

Updated by okurz 6 months ago

#117043 is resolved. With https://gitlab.suse.de/qa-sle/qanet-configs/-/commit/6246fc46224606ba5932f6da6e6d6b87cbc722c5 qanet is still running a very limited DHCP server but forwards DNS to dns1.suse.de, dns1.prg2.suse.org and dns2.suse.de, which serve qa.suse.de now from https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4247. Blocked on #134051.

Actions #57

Updated by okurz 6 months ago

  • Related to action #132623: Decommissioning of selected LSQ QE machines from NUE1-SRV2 added
Actions #58

Updated by okurz 6 months ago

  • Tags changed from next-office-day, infra to infra
  • Target version changed from future to Tools - Next

#134051 is resolved. We still have https://gitlab.suse.de/qa-sle/qanet-configs/, which is needed for DHCP for the last machines in SRV2. Waiting for #132623. After those are gone we will mark the GitLab repo as archived.

Actions #59

Updated by okurz 5 months ago

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4447 to remove DNS entries for the decommissioned qanet.

Actions #60

Updated by okurz 5 months ago

  • Status changed from Blocked to Resolved
  • Target version changed from Tools - Next to Ready

Archived https://gitlab.suse.de/qa-sle/qanet-configs now. I updated https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs accordingly. Found no other references needing updates in our team wiki or other wiki pages.
