Actions

Copy link

action #54128

closed

[tools] openqaworker-arm-3 is broken

Added by dzedro almost 6 years ago. Updated over 5 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

okurz

Category:

Target version:

Start date:

2019-07-11

Due date:

% Done:

Estimated time:

Description

https://openqa.suse.de/tests/3050403

w¶

10:01:25 up 14:18, 2 users, load average: 0.99, 1.85, 1.71
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
nsinger pts/0 10.162.29.251 Wed19 13:44m 0.69s 0.11s sshd: nsinger [priv]

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by okurz almost 6 years ago

Assignee set to nicksinger

[2019-07-11T09:59:39.359 CEST] [debug] qemu-img: Could not open '/var/lib/openqa/pool/10/SLES-15-SP1-aarch64-Installtest.qcow2': Could not open '/var/lib/openqa/pool/10/SLES-15-SP1-aarch64-Installtest.qcow2': No such file or directory

but has been downloaded by the test.

@nicksinger as you were just working on this machine I assume(?) please take a look. I guess disabling the machine temporarily might be helpful to prevent further tests running into this.

Actions

Copy link

Updated by nicksinger almost 6 years ago

Status changed from New to Blocked

I've removed the worker again from the OSD pool and will have to debug why the cacheservice behaves like this. I hope this is not blocking our leap15.1 migration on the other workers :/

Actions

Copy link

Updated by okurz over 5 years ago

Assignee changed from nicksinger to okurz

closely related to #56447 then I will consolidate.

Actions

Copy link

Updated by okurz over 5 years ago

Related to action #56447: openqaworker-arm-2 is out-of-space on /was: openQA on osd fails with empty logs added

Actions

Copy link

Updated by okurz over 5 years ago

Status changed from Blocked to In Progress

what I see on arm-* is that there is always the same 745.2GiB NVME device but on arm-3 it's split into two partitions, one for cache and one for pool. for arm-1 and arm-2 it's a single partition. I prefer a single partition.
The nvme symlinks are not consistent so either root can run full because of pool not ending up on nvme or the cache service thinking it has 0 bytes occupied because of symlinks. So I suggest to

change partitions to single for all

and then continue in #56447 , clarified with nsinger as well.

Aligning partitions on arm-3:

sudo parted /dev/nvme0n1
(parted) mklabel GPT
(parted) mkpart primary ext4 2048s 100%

then realized that openqaworker-arm-3 is not properly upgraded to Leap 15.1, has left-overs from SLE12SP3, needs SUSEConnect -d || SUSEConnect --cleanup first and a zypper dup. This installed the new kernel 4.12.14-lp151.28.7-default which seems to fail to boot. From ipmitool -I lanplus -H openqaworker-arm-3-ipmi.suse.de -U ADMIN -P ADMIN sol activate I get as last messages:

…
BL31: node: 1, core: 46, AP_CVM_POWER_EL1:0xa51cff3002ffff00
BL31: node: 1, core: 46, AP_CVM_POWER_EL1:0xa51cff3002a5ff00
BL31: node: 1, core: 47, AP_CVM_POWER_EL1:0xa51cff3002ffff00
BL31: node: 1, core: 47, AP_CVM_POWER_EL1:0xa51cff3002a5ff00

and then nothing. Trying once more with forced power off and on again.

No avail. Also reproduced the same problem when hitting "esc" and "left" until the grub menu shows up on serial and I selected the previous kernel 4.12.14-lp151.27-default . openqaworker-arm-1 shows in the same step:

…
BL31: node: 0, core: 47, AP_CVM_POWER_EL1:0xa516ff3002a5ff00
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.12.14-lp151.27-default (geeko@buildhost) (gcc version 7.4.0 (SUSE Linux) ) #1 SMP Fri May 10 14:13:15 UTC 2019 (862c838)
…

which we don't see so I assume something wrong with initrd or boot parameters which are on openqaworker-arm-1: BOOT_IMAGE=/boot/Image-4.12.14-lp151.27-default root=/dev/mapper/system-root nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 . Managed to boot again by overwriting the kernel command line in grub with /boot/Image-4.12.14-lp151.27-default root=/dev/sda3 nospec. Adjusted boot parameters in /etc/default/grub:

-GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS1,115200 nospec"
+GRUB_CMDLINE_LINUX_DEFAULT=" nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1"

and did grub2-mkconfig -o /boot/grub2/grub.cfg . Also adjusted UUID parameters in /etc/fstab. Rebooted and now it looks ok again. Need to continue or crosscheck later after progressing with #56447

Actions

Copy link

Updated by okurz over 5 years ago

Status changed from In Progress to Blocked

Actions

Copy link

Updated by okurz over 5 years ago

Status changed from Blocked to In Progress

Some services failed, e.g. postfix. This seems to be due no IPv6 being available as configured in

# cat /etc/sysctl.d/disable_ipv6.conf 
net.ipv6.conf.all.disable_ipv6 = 1

I removed this file which unfortunately does not have a comment. It was probably created due to some problems some time ago but as we moved to Leap 15.1 now let's try again :)

Also os-autoinst-openvswitch could fail for the same reason:

Sep 06 11:35:39 openqaworker-arm-3 os-autoinst-openvswitch[2444]: can't parse bridge local port IP at /usr/lib/os-autoinst/os-autoinst-openvswitch line 39.

Restarted, fine.

The telegraf package is installed on many but not all workers, e.g. not on openqaworker-arm-3 and a reason is also that it is not available in the OBS repo, see https://build.opensuse.org/package/show/devel:languages:go/telegraf . I downloaded the package for Leap 15.0 from https://build.opensuse.org/package/binaries/devel:languages:go/telegraf/openSUSE_Leap_15.0 in particular https://build.opensuse.org/package/binary/download/devel:languages:go/telegraf/openSUSE_Leap_15.0/aarch64/telegraf-1.8.3-lp150.7.9.aarch64.rpm and force installed it on openqaworker-arm-3 and wrote an email to the project maintainers of devel:languages:go to please enable aarch64 and ppc64le on Leap 15.1 and also created a submission of telegraf from Factory to Leap: https://build.opensuse.org/request/show/729120

Actions

Copy link

Updated by okurz over 5 years ago

Status changed from In Progress to Feedback

Triggered a job for verification

openqa_clone_job_osd --skip-chained-deps 3335386 WORKER_CLASS=openqaworker-arm-3 _GROUP=0 TEST=poo54128_extra_tests_on_gnome

Created job #3337989: sle-12-SP5-Server-DVD-aarch64-Build0307-extra_tests_on_gnome@aarch64 -> https://openqa.suse.de/t3337989

this looks wrong, did not properly start. A retriggered one looks better though.

Let's try more:

for i in {001..040}; do openqa_clone_job_osd --skip-chained-deps 3335386 WORKER_CLASS=openqaworker-arm-3 _GROUP=0 TEST=poo54128_extra_tests_on_gnome_$i BUILD=poo54128 ; done

-> https://openqa.suse.de/tests/overview?build=poo54128&distri=sle&version=12-SP5

4 passed, 16 soft-failed, 24 failed, mainly due to #56585 so not worker specific. However now I see that openqaworker-arm-3 is completely down. chassis power is reported as "on", triggered a power cycle. Booting now. It was online 16h ago, that is around 2019-09-17 00:00 UTC, incompleted some jobs, e.g. https://openqa.suse.de/tests/3366879 . As the journal is not persistent I created /var/log/journal/ which should be enough to support a persistent journal.

Retriggering again a job multiple times, this time a shorter LTP test:

for i in {001..040}; do openqa_clone_job_osd --skip-chained-deps 3368517 WORKER_CLASS=openqaworker-arm-3 _GROUP=0 TEST=poo54128_ltp_sched_$i BUILD=poo54128-$(date +%F) ; done

-> https://openqa.suse.de/tests/overview?distri=sle&version=12-SP5&build=poo54128-2019-09-17

Did not behave well. I triggered a reboot again and found more problems:

# journalctl -u os-autoinst-openvswitch.service 
-- Logs begin at Tue 2019-09-17 17:38:09 CEST, end at Tue 2019-09-17 22:22:55 CEST. --
Sep 17 17:39:20 openqaworker-arm-3 systemd[1]: Starting os-autoinst openvswitch helper...
Sep 17 17:39:21 openqaworker-arm-3 systemd[1]: Started os-autoinst openvswitch helper.
-- Reboot --
Sep 17 20:20:38 openqaworker-arm-3 os-autoinst-openvswitch[2569]: can't parse bridge local port IP at /usr/lib/os-autoinst/os-autoinst-openvswitch line 39.
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Starting os-autoinst openvswitch helper...
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Started os-autoinst openvswitch helper.
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Main process exited, code=exited, status=255/n/a
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Unit entered failed state.
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Sep 17 21:41:06 openqaworker-arm-3 systemd[1]: Starting os-autoinst openvswitch helper...
Sep 17 21:41:07 openqaworker-arm-3 systemd[1]: Started os-autoinst openvswitch helper.

the last two lines show that a simple restart of the service helped. It seems that the network is not completely up before os-autoinst-openvswitch is started. So the line After=network.target seems to be not enough, but look at the timestamps:

openqaworker-arm-3:/home/okurz # journalctl -u os-autoinst-openvswitch.service 
-- Logs begin at Tue 2019-09-17 17:38:09 CEST, end at Tue 2019-09-17 22:22:55 CEST. --
Sep 17 20:20:38 openqaworker-arm-3 os-autoinst-openvswitch[2569]: can't parse bridge local port IP at /usr/lib/os-autoinst/os-autoinst-openvswitch line 39.
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Starting os-autoinst openvswitch helper...
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Started os-autoinst openvswitch helper.
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Main process exited, code=exited, status=255/n/a
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Unit entered failed state.
Sep 17 20:19:29 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.

openqaworker-arm-3:/home/okurz # systemctl status network.target
…
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Reached target Network.

openqaworker-arm-3:/home/okurz # systemctl status network-online.target
…
Sep 17 20:19:28 openqaworker-arm-3 systemd[1]: Reached target Network is Online.

so both "network.target" and "network-online.target" are up at the same time the helper starts. The problem is that we are looking for "inet" and a valid adress on the bridge device, br1 in this case. journalctl --since=today | grep 'br1' shows that clearly the device is not yet read at 20:19:29.

I also just see that there is still SuSEfirewall2 on that machine, not firewalld. That should be fixed. The same is true on openqaworker-arm-1 and …-arm-2 though. Probably should check all workers for this. Hm, actually all workers have the firewall enabled. Probably should be firewalld, properly salted, ready for MM testing.

Machine was down again. We have persistent log enabled now but all I see is:

Sep 18 21:10:55 openqaworker-arm-3 worker[46659]: [debug] [pid:46659] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3375050/status
-- Reboot --
Sep 19 12:10:58 openqaworker-arm-3 kernel: Booting Linux on physical CPU 0x0

Also I see errors about the cache service in the journal and:

_openqa+ 21081  0.1  0.0  62784 56544 ?        Ss   12:40   0:12 /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
_openqa+ 22211  0.0  0.0  78808 67984 ?        S    13:02   0:01  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
_openqa+ 22212  0.0  0.0  30276  7668 ?        S    13:02   0:02      \_ rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
_openqa+ 22213  1.2  0.0  39024  5536 ?        D    13:02   1:35          \_ rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/

as if the NVME device is broken.

# lsof -p 22213
COMMAND   PID           USER   FD   TYPE             DEVICE SIZE/OFF   NODE NAME
rsync   22213 _openqa-worker  cwd    DIR               0,45      474 617752 /var/lib/openqa/cache/openqa.suse.de/tests
rsync   22213 _openqa-worker  rtd    DIR               0,45      156    256 /
rsync   22213 _openqa-worker  txt    REG               0,45   540744 493427 /usr/bin/rsync
rsync   22213 _openqa-worker  DEL    REG               0,23           89150 /run/nscd/dbYvaSrr
rsync   22213 _openqa-worker  mem    REG               0,45   330604 341410 /usr/lib/locale/en_US.utf8/LC_CTYPE
rsync   22213 _openqa-worker  mem    REG               0,45    72848 337246 /lib64/libdl-2.26.so
rsync   22213 _openqa-worker  mem    REG               0,45   177464 337264 /lib64/libpthread-2.26.so
rsync   22213 _openqa-worker  mem    REG               0,45  2535584 337381 /usr/lib64/libcrypto.so.1.1
rsync   22213 _openqa-worker  mem    REG               0,45    67904 337372 /lib64/libattr.so.1.1.0
rsync   22213 _openqa-worker  mem    REG               0,45  1827616 337240 /lib64/libc-2.26.so
rsync   22213 _openqa-worker  mem    REG               0,45   134112 362392 /usr/lib64/libslp.so.1.0.0
rsync   22213 _openqa-worker  mem    REG               0,45    68712 345879 /usr/lib64/libpopt.so.0.0.0
rsync   22213 _openqa-worker  mem    REG               0,45   133112 337300 /lib64/libz.so.1.2.11
rsync   22213 _openqa-worker  mem    REG               0,45    68088 337394 /lib64/libacl.so.1.1.2253
rsync   22213 _openqa-worker  mem    REG               0,45   160688 337233 /lib64/ld-2.26.so
rsync   22213 _openqa-worker  mem    REG               0,45    26244 341411 /usr/lib64/gconv/gconv-modules.cache
rsync   22213 _openqa-worker    0r   CHR                1,3      0t0   1030 /dev/null
rsync   22213 _openqa-worker    1w  FIFO               0,12      0t0 569602 pipe
rsync   22213 _openqa-worker    2u  unix 0xffff810d62ba8c00      0t0 143073 type=STREAM
rsync   22213 _openqa-worker    3u  IPv6             454473      0t0    TCP openqaworker-arm-3.suse.de:58886->openqa.suse.de:rsync (ESTABLISHED)
rsync   22213 _openqa-worker    4u   REG               0,45 27623424 691229 /var/lib/openqa/cache/openqa.suse.de/tests/sle/products/sle/needles/.git/objects/pack/.pack-2cbeca3341d5195c120b3033811b10f4f35b6997.pack.KCDTn3
rsync   22213 _openqa-worker    5u  unix 0xffff810f69a8b800      0t0 564570 type=STREAM

# strace -f -p 22213 2>&1 | less
strace: Process 22213 attached
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999995070})
read(3, "\0\200\0\0\4\200\0\7\366\315\257w\353T7`P\215GI\30\370[\335*\240\2749<\37L\322"..., 32768) = 32768
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999998070})
read(3, "+G\203\232\246ng{", 8)         = 8
ioctl(1, TIOCGPGRP, 0xffffe9a14c44)     = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "\r      7,602,176   2%  256.00kB/"..., 46) = 46
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999997410})
read(3, "\0\200\0\0\4\200\0\7\353\374\342\264\t\\\364\372\210\26\200\2145\3367\245W0\311\336\376\335\371\242"..., 32768) = 32768
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999998600})
read(3, "p\345+D\f\207g=", 8)           = 8
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999998540})
read(3, "\0\200\0\0\4\200\0\7fI)eUe\2\266\214T,DeCrh/\220\f\212\3060\261Q"..., 32768) = 32768
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999997410})
read(3, "\307x\34'1 \0\333", 8)         = 8
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999996970})
read(3, "\0\200\0\0\4\200\0\7\333\343O~\362O\277\370\362\342z\273cf\343\340/\376\342G\327\327\373/"..., 32768) = 32768
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999997590})
read(3, "\327\0\334\241\313\34<\237", 8) = 8
pselect6(4, [3], [], [3], {tv_sec=60, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=59, tv_nsec=999997169})
read(3, "\0\200\0\0\4\200\0\7/\303]z\365\277\243\245}2\264\216\267\351\2574\317\226\316\203\37\35\324\364"..., 32768) = 32768

so it's stuck but I don't know why. Stopped openqa-worker.target and the cache services, cleaned /var/lib/openqa/cache/* and restarted. Looks fine now.

sudo systemctl --failed
  UNIT                                LOAD   ACTIVE SUB    DESCRIPTION                                                                                                                                                          
● systemd-coredump@0-56201-0.service  loaded failed failed Process Core Dump                                                                                                                                                    
● systemd-coredump@10-58906-0.service loaded failed failed Process Core Dump                                                                                                                                                    
● systemd-coredump@11-58908-0.service loaded failed failed Process Core Dump                                                                                                                                                    
● systemd-coredump@12-58914-0.service loaded failed failed Process Core Dump                                                                                                                                                    
● systemd-coredump@13-58917-0.service loaded failed failed Process Core Dump                                                                                                                                                    
● systemd-coredump@15-58927-0.service loaded failed failed Process Core Dump                                                                                                                                                    
● systemd-coredump@18-58971-0.service loaded failed failed Process Core Dump                                                                                                                                                    
● systemd-coredump@4-58878-0.service  loaded failed failed Process Core Dump                                                                                                                                                    
● systemd-coredump@5-58888-0.service  loaded failed failed Process Core Dump                                                                                                                                                    
● systemd-coredump@6-58892-0.service  loaded failed failed Process Core Dump                                                                                                                                                    
● systemd-coredump@8-58897-0.service  loaded failed failed Process Core Dump                                                                                                                                                    
● systemd-coredump@9-58903-0.service  loaded failed failed Process Core Dump

I saw that again some time ago.

sudo journalctl -u systemd-coredump@0-56201-0.service
-- Logs begin at Tue 2019-09-17 17:38:09 CEST, end at Fri 2019-09-20 21:41:40 CEST. --
Sep 19 16:32:41 openqaworker-arm-3 systemd[1]: Started Process Core Dump (PID 56201/UID 0).
Sep 19 16:37:41 openqaworker-arm-3 systemd[1]: systemd-coredump@0-56201-0.service: Service reached runtime time limit. Stopping.
Sep 19 16:37:41 openqaworker-arm-3 systemd[1]: systemd-coredump@0-56201-0.service: Unit entered failed state.
Sep 19 16:37:41 openqaworker-arm-3 systemd[1]: systemd-coredump@0-56201-0.service: Failed with result 'timeout'.
okurz@openqaworker-arm-3:~> sudo systemctl status systemd-coredump@9-58903-0.service
● systemd-coredump@9-58903-0.service - Process Core Dump
   Loaded: loaded (/usr/lib/systemd/system/systemd-coredump@.service; static; vendor preset: disabled)
   Active: failed (Result: timeout) since Thu 2019-09-19 20:36:19 CEST; 1 day 1h ago
     Docs: man:systemd-coredump(8)
 Main PID: 58904 (code=killed, signal=TERM)

Sep 19 20:31:15 openqaworker-arm-3 systemd[1]: Started Process Core Dump (PID 58903/UID 0).
Sep 19 20:36:15 openqaworker-arm-3 systemd[1]: systemd-coredump@9-58903-0.service: Service reached runtime time limit. Stopping.
Sep 19 20:36:19 openqaworker-arm-3 systemd[1]: systemd-coredump@9-58903-0.service: Unit entered failed state.
Sep 19 20:36:19 openqaworker-arm-3 systemd[1]: systemd-coredump@9-58903-0.service: Failed with result 'timeout'.

so processes die but coredump takes longer than 5 minutes and is aborted. But sudo coredumpctl info shows me it's probably just the "usual suspect".

EDIT: Created new grafana dashboard https://stats.openqa-monitor.qa.suse.de/d/XLzVbupZk/worker-dashoard-openqaworker-arm-3?from=now-2d&to=now

Actions

Copy link