action #54128
closed[tools] openqaworker-arm-3 is broken
0%
Description
https://openqa.suse.de/tests/3050403
w¶
10:01:25 up 14:18, 2 users, load average: 0.99, 1.85, 1.71
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
nsinger pts/0 10.162.29.251 Wed19 13:44m 0.69s 0.11s sshd: nsinger [priv]
Updated by okurz over 5 years ago
- Assignee set to nicksinger
[2019-07-11T09:59:39.359 CEST] [debug] qemu-img: Could not open '/var/lib/openqa/pool/10/SLES-15-SP1-aarch64-Installtest.qcow2': Could not open '/var/lib/openqa/pool/10/SLES-15-SP1-aarch64-Installtest.qcow2': No such file or directory
but has been downloaded by the test.
@nicksinger as you were just working on this machine I assume(?) please take a look. I guess disabling the machine temporarily might be helpful to prevent further tests running into this.
Updated by nicksinger over 5 years ago
- Status changed from New to Blocked
I've removed the worker again from the OSD pool and will have to debug why the cacheservice behaves like this. I hope this is not blocking our leap15.1 migration on the other workers :/
Updated by okurz about 5 years ago
- Assignee changed from nicksinger to okurz
closely related to #56447 then I will consolidate.
Updated by okurz about 5 years ago
- Related to action #56447: openqaworker-arm-2 is out-of-space on /was: openQA on osd fails with empty logs added
Updated by okurz about 5 years ago
- Status changed from Blocked to In Progress
what I see on arm-* is that there is always the same 745.2GiB NVME device but on arm-3 it's split into two partitions, one for cache and one for pool. for arm-1 and arm-2 it's a single partition. I prefer a single partition.
The nvme symlinks are not consistent so either root can run full because of pool not ending up on nvme or the cache service thinking it has 0 bytes occupied because of symlinks. So I suggest to
- change partitions to single for all
and then continue in #56447 , clarified with nsinger as well.
Aligning partitions on arm-3:
sudo parted /dev/nvme0n1
(parted) mklabel GPT
(parted) mkpart primary ext4 2048s 100%
then realized that openqaworker-arm-3 is not properly upgraded to Leap 15.1, has left-overs from SLE12SP3, needs SUSEConnect -d || SUSEConnect --cleanup
first and a zypper dup
. This installed the new kernel 4.12.14-lp151.28.7-default which seems to fail to boot. From ipmitool -I lanplus -H openqaworker-arm-3-ipmi.suse.de -U ADMIN -P ADMIN sol activate
I get as last messages:
…
BL31: node: 1, core: 46, AP_CVM_POWER_EL1:0xa51cff3002ffff00
BL31: node: 1, core: 46, AP_CVM_POWER_EL1:0xa51cff3002a5ff00
BL31: node: 1, core: 47, AP_CVM_POWER_EL1:0xa51cff3002ffff00
BL31: node: 1, core: 47, AP_CVM_POWER_EL1:0xa51cff3002a5ff00
and then nothing. Trying once more with forced power off and on again.
No avail. Also reproduced the same problem when hitting "esc" and "left" until the grub menu shows up on serial and I selected the previous kernel 4.12.14-lp151.27-default . openqaworker-arm-1 shows in the same step:
…
BL31: node: 0, core: 47, AP_CVM_POWER_EL1:0xa516ff3002a5ff00
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 4.12.14-lp151.27-default (geeko@buildhost) (gcc version 7.4.0 (SUSE Linux) ) #1 SMP Fri May 10 14:13:15 UTC 2019 (862c838)
…
which we don't see so I assume something wrong with initrd or boot parameters which are on openqaworker-arm-1: BOOT_IMAGE=/boot/Image-4.12.14-lp151.27-default root=/dev/mapper/system-root nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 . Managed to boot again by overwriting the kernel command line in grub with /boot/Image-4.12.14-lp151.27-default root=/dev/sda3 nospec
. Adjusted boot parameters in /etc/default/grub:
-GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS1,115200 nospec"
+GRUB_CMDLINE_LINUX_DEFAULT=" nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1"
and did grub2-mkconfig -o /boot/grub2/grub.cfg
. Also adjusted UUID parameters in /etc/fstab. Rebooted and now it looks ok again. Need to continue or crosscheck later after progressing with #56447
Updated by okurz about 5 years ago
- Status changed from Blocked to In Progress
Some services failed, e.g. postfix. This seems to be due no IPv6 being available as configured in
# cat /etc/sysctl.d/disable_ipv6.conf
net.ipv6.conf.all.disable_ipv6 = 1
I removed this file which unfortunately does not have a comment. It was probably created due to some problems some time ago but as we moved to Leap 15.1 now let's try again :)
Also os-autoinst-openvswitch could fail for the same reason:
Sep 06 11:35:39 openqaworker-arm-3 os-autoinst-openvswitch[2444]: can't parse bridge local port IP at /usr/lib/os-autoinst/os-autoinst-openvswitch line 39.
Restarted, fine.
The telegraf package is installed on many but not all workers, e.g. not on openqaworker-arm-3 and a reason is also that it is not available in the OBS repo, see https://build.opensuse.org/package/show/devel:languages:go/telegraf . I downloaded the package for Leap 15.0 from https://build.opensuse.org/package/binaries/devel:languages:go/telegraf/openSUSE_Leap_15.0 in particular https://build.opensuse.org/package/binary/download/devel:languages:go/telegraf/openSUSE_Leap_15.0/aarch64/telegraf-1.8.3-lp150.7.9.aarch64.rpm and force installed it on openqaworker-arm-3 and wrote an email to the project maintainers of devel:languages:go to please enable aarch64 and ppc64le on Leap 15.1 and also created a submission of telegraf from Factory to Leap: https://build.opensuse.org/request/show/729120
Updated by okurz about 5 years ago
- Status changed from In Progress to Feedback
Triggered a job for verification
openqa_clone_job_osd --skip-chained-deps 3335386 WORKER_CLASS=openqaworker-arm-3 _GROUP=0 TEST=poo54128_extra_tests_on_gnome
Created job #3337989: sle-12-SP5-Server-DVD-aarch64-Build0307-extra_tests_on_gnome@aarch64 -> https://openqa.suse.de/t3337989
this looks wrong, did not properly start. A retriggered one looks better though.
Let's try more:
for i in {001..040}; do openqa_clone_job_osd --skip-chained-deps 3335386 WORKER_CLASS=openqaworker-arm-3 _GROUP=0 TEST=poo54128_extra_tests_on_gnome_$i BUILD=poo54128 ; done
-> https://openqa.suse.de/tests/overview?build=poo54128&distri=sle&version=12-SP5
4 passed, 16 soft-failed, 24 failed, mainly due to #56585 so not worker specific. However now I see that openqaworker-arm-3 is completely down. chassis power is reported as "on", triggered a power cycle. Booting now. It was online 16h ago, that is around 2019-09-17 00:00 UTC, incompleted some jobs, e.g. https://openqa.suse.de/tests/3366879 . As the journal is not persistent I created /var/log/journal/ which should be enough to support a persistent journal.
Retriggering again a job multiple times, this time a shorter LTP test:
for i in {001..040}; do openqa_clone_job_osd --skip-chained-deps 3368517 WORKER_CLASS=openqaworker-arm-3 _GROUP=0 TEST=poo54128_ltp_sched_$i BUILD=poo54128-$(date +%F) ; done
-> https://openqa.suse.de/tests/overview?distri=sle&version=12-SP5&build=poo54128-2019-09-17
Did not behave well. I triggered a reboot again and found more problems:
# journalctl -u os-autoinst-openvswitch.service
-- Logs begin at Tue 2019-09-17 17:38:09 CEST, end at Tue 2019-09-17 22:22:55 CEST. --
Sep 17 17:39:20 openqaworker-arm-3 systemd[1]: Starting os-autoinst openvswitch helper...
Sep 17 17:39:21 openqaworker-arm-3 systemd[1]: Started os-autoinst openvswitch helper.
-- Reboot --
Sep 17 20:20:38 openqaworker-arm-3 os-autoinst-openvswitch[2569]: can't parse bridge local port IP at /usr/lib/os-autoinst/os-autoinst-openvswitch line 39.
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Starting os-autoinst openvswitch helper...
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Started os-autoinst openvswitch helper.
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Main process exited, code=exited, status=255/n/a
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Unit entered failed state.
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Sep 17 21:41:06 openqaworker-arm-3 systemd[1]: Starting os-autoinst openvswitch helper...
Sep 17 21:41:07 openqaworker-arm-3 systemd[1]: Started os-autoinst openvswitch helper.
the last two lines show that a simple restart of the service helped. It seems that the network is not completely up before os-autoinst-openvswitch is started. So the line After=network.target
seems to be not enough, but look at the timestamps:
openqaworker-arm-3:/home/okurz # journalctl -u os-autoinst-openvswitch.service
-- Logs begin at Tue 2019-09-17 17:38:09 CEST, end at Tue 2019-09-17 22:22:55 CEST. --
Sep 17 20:20:38 openqaworker-arm-3 os-autoinst-openvswitch[2569]: can't parse bridge local port IP at /usr/lib/os-autoinst/os-autoinst-openvswitch line 39.
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Starting os-autoinst openvswitch helper...
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Started os-autoinst openvswitch helper.
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Main process exited, code=exited, status=255/n/a
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Unit entered failed state.
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
openqaworker-arm-3:/home/okurz # systemctl status network.target
…
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Reached target Network.
openqaworker-arm-3:/home/okurz # systemctl status network-online.target
…
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Reached target Network is Online.
so both "network.target" and "network-online.target" are up at the same time the helper starts. The problem is that we are looking for "inet" and a valid adress on the bridge device, br1 in this case. journalctl --since=today | grep 'br1'
shows that clearly the device is not yet read at 20:19:29.
I also just see that there is still SuSEfirewall2 on that machine, not firewalld. That should be fixed. The same is true on openqaworker-arm-1 and …-arm-2 though. Probably should check all workers for this. Hm, actually all workers have the firewall enabled. Probably should be firewalld, properly salted, ready for MM testing.
Machine was down again. We have persistent log enabled now but all I see is:
Sep 18 21:10:55 openqaworker-arm-3 worker[46659]: [debug] [pid:46659] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3375050/status
-- Reboot --
Sep 19 12:10:58 openqaworker-arm-3 kernel: Booting Linux on physical CPU 0x0
Also I see errors about the cache service in the journal and:
_openqa+ 21081 0.1 0.0 62784 56544 ? Ss 12:40 0:12 /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
_openqa+ 22211 0.0 0.0 78808 67984 ? S 13:02 0:01 \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
_openqa+ 22212 0.0 0.0 30276 7668 ? S 13:02 0:02 \_ rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
_openqa+ 22213 1.2 0.0 39024 5536 ? D 13:02 1:35 \_ rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
as if the NVME device is broken.
# lsof -p 22213
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
rsync 22213 _openqa-worker cwd DIR 0,45 474 617752 /var/lib/openqa/cache/openqa.suse.de/tests
rsync 22213 _openqa-worker rtd DIR 0,45 156 256 /
rsync 22213 _openqa-worker txt REG 0,45 540744 493427 /usr/bin/rsync
rsync 22213 _openqa-worker DEL REG 0,23 89150 /run/nscd/dbYvaSrr
rsync 22213 _openqa-worker mem REG 0,45 330604 341410 /usr/lib/locale/en_US.utf8/LC_CTYPE
rsync 22213 _openqa-worker mem REG 0,45 72848 337246 /lib64/libdl-2.26.so
rsync 22213 _openqa-worker mem REG 0,45 177464 337264 /lib64/libpthread-2.26.so
rsync 22213 _openqa-worker mem REG 0,45 2535584 337381 /usr/lib64/libcrypto.so.1.1
rsync 22213 _openqa-worker mem REG 0,45 67904 337372 /lib64/libattr.so.1.1.0
rsync 22213 _openqa-worker mem REG 0,45 1827616 337240 /lib64/libc-2.26.so
rsync 22213 _openqa-worker mem REG 0,45 134112 362392 /usr/lib64/libslp.so.1.0.0
rsync 22213 _openqa-worker mem REG 0,45 68712 345879 /usr/lib64/libpopt.so.0.0.0
rsync 22213 _openqa-worker mem REG 0,45 133112 337300 /lib64/libz.so.1.2.11
rsync 22213 _openqa-worker mem REG 0,45 68088 337394 /lib64/libacl.so.1.1.2253
rsync 22213 _openqa-worker mem REG 0,45 160688 337233 /lib64/ld-2.26.so
rsync 22213 _openqa-worker mem REG 0,45 26244 341411 /usr/lib64/gconv/gconv-modules.cache
rsync 22213 _openqa-worker 0r CHR 1,3 0t0 1030 /dev/null
rsync 22213 _openqa-worker 1w FIFO 0,12 0t0 569602 pipe
rsync 22213 _openqa-worker 2u unix 0xffff810d62ba8c00 0t0 143073 type=STREAM
rsync 22213 _openqa-worker 3u IPv6 454473 0t0 TCP openqaworker-arm-3.suse.de:58886->openqa.suse.de:rsync (ESTABLISHED)
rsync 22213 _openqa-worker 4u REG 0,45 27623424 691229 /var/lib/openqa/cache/openqa.suse.de/tests/sle/products/sle/needles/.git/objects/pack/.pack-2cbeca3341d5195c120b3033811b10f4f35b6997.pack.KCDTn3
rsync 22213 _openqa-worker 5u unix 0xffff810f69a8b800 0t0 564570 type=STREAM
# strace -f -p 22213 2>&1 | less
strace: Process 22213 attached
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999995070})
read(3, "\0\200\0\0\4\200\0\7\366\315\257w\353T7`P\215GI\30\370[\335*\240\2749<\37L\322"..., 32768) = 32768
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999998070})
read(3, "+G\203\232\246ng{", 8) = 8
ioctl(1, TIOCGPGRP, 0xffffe9a14c44) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "\r 7,602,176 2% 256.00kB/"..., 46) = 46
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999997410})
read(3, "\0\200\0\0\4\200\0\7\353\374\342\264\t\\\364\372\210\26\200\2145\3367\245W0\311\336\376\335\371\242"..., 32768) = 32768
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999998600})
read(3, "p\345+D\f\207g=", 8) = 8
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999998540})
read(3, "\0\200\0\0\4\200\0\7fI)eUe\2\266\214T,DeCrh/\220\f\212\3060\261Q"..., 32768) = 32768
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999997410})
read(3, "\307x\34'1 \0\333", 8) = 8
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999996970})
read(3, "\0\200\0\0\4\200\0\7\333\343O~\362O\277\370\362\342z\273cf\343\340/\376\342G\327\327\373/"..., 32768) = 32768
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999997590})
read(3, "\327\0\334\241\313\34<\237", 8) = 8
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999997169})
read(3, "\0\200\0\0\4\200\0\7/\303]z\365\277\243\245}2\264\216\267\351\2574\317\226\316\203\37\35\324\364"..., 32768) = 32768
so it's stuck but I don't know why. Stopped openqa-worker.target and the cache services
, cleaned /var/lib/openqa/cache/* and restarted. Looks fine now.
sudo systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● systemd-coredump@0-56201-0.service loaded failed failed Process Core Dump
● systemd-coredump@10-58906-0.service loaded failed failed Process Core Dump
● systemd-coredump@11-58908-0.service loaded failed failed Process Core Dump
● systemd-coredump@12-58914-0.service loaded failed failed Process Core Dump
● systemd-coredump@13-58917-0.service loaded failed failed Process Core Dump
● systemd-coredump@15-58927-0.service loaded failed failed Process Core Dump
● systemd-coredump@18-58971-0.service loaded failed failed Process Core Dump
● systemd-coredump@4-58878-0.service loaded failed failed Process Core Dump
● systemd-coredump@5-58888-0.service loaded failed failed Process Core Dump
● systemd-coredump@6-58892-0.service loaded failed failed Process Core Dump
● systemd-coredump@8-58897-0.service loaded failed failed Process Core Dump
● systemd-coredump@9-58903-0.service loaded failed failed Process Core Dump
I saw that again some time ago.
sudo journalctl -u systemd-coredump@0-56201-0.service
-- Logs begin at Tue 2019-09-17 17:38:09 CEST, end at Fri 2019-09-20 21:41:40 CEST. --
Sep 19 16:32:41 openqaworker-arm-3 systemd[1]: Started Process Core Dump (PID 56201/UID 0).
Sep 19 16:37:41 openqaworker-arm-3 systemd[1]: systemd-coredump@0-56201-0.service: Service reached runtime time limit. Stopping.
Sep 19 16:37:41 openqaworker-arm-3 systemd[1]: systemd-coredump@0-56201-0.service: Unit entered failed state.
Sep 19 16:37:41 openqaworker-arm-3 systemd[1]: systemd-coredump@0-56201-0.service: Failed with result 'timeout'.
okurz@openqaworker-arm-3:~> sudo systemctl status systemd-coredump@9-58903-0.service
● systemd-coredump@9-58903-0.service - Process Core Dump
Loaded: loaded (/usr/lib/systemd/system/systemd-coredump@.service; static; vendor preset: disabled)
Active: failed (Result: timeout) since Thu 2019-09-19 20:36:19 CEST; 1 day 1h ago
Docs: man:systemd-coredump(8)
Main PID: 58904 (code=killed, signal=TERM)
Sep 19 20:31:15 openqaworker-arm-3 systemd[1]: Started Process Core Dump (PID 58903/UID 0).
Sep 19 20:36:15 openqaworker-arm-3 systemd[1]: systemd-coredump@9-58903-0.service: Service reached runtime time limit. Stopping.
Sep 19 20:36:19 openqaworker-arm-3 systemd[1]: systemd-coredump@9-58903-0.service: Unit entered failed state.
Sep 19 20:36:19 openqaworker-arm-3 systemd[1]: systemd-coredump@9-58903-0.service: Failed with result 'timeout'.
so processes die but coredump takes longer than 5 minutes and is aborted. But sudo coredumpctl info
shows me it's probably just the "usual suspect".
EDIT: Created new grafana dashboard https://stats.openqa-monitor.qa.suse.de/d/XLzVbupZk/worker-dashoard-openqaworker-arm-3?from=now-2d&to=now
Updated by okurz about 5 years ago
- Has duplicate action #57434: openqaworker-arm-3 keeps running jobs for 18 hours added
Updated by okurz about 5 years ago
- Status changed from Feedback to Resolved
I assume we brought openqaworker-arm-3 on the same state as arm-1 and arm-2 again.