action #81046

openqaworker-arm-2.suse.de unreachable

Added by cdywan 7 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2020-12-15
Due date:
% Done:

0%

Estimated time:

Description

The connection dropped while I had an SSH session open to openqaworker-arm-2.suse.de. Shortly afterwards the machine appeared to have rebooted but got stuck at "Press ^D to continue". When it did not reach a working state, I triggered another reboot.
The machine responds to SSH now, but rejects my login or doesn't seem to recognize my SSH key. It responds to SSH but appears as broken in the web UI.


Related issues

Related to openQA Infrastructure - action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 (Resolved)

Related to openQA Project - action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) (Resolved, 2021-01-18)

History

#1 Updated by cdywan 7 months ago

  • Related to action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 added

#2 Updated by cdywan 7 months ago

  • Related to action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) added

#3 Updated by cdywan 7 months ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to cdywan

So this "Cache service info error: can't connect" looks like #78390, and indeed running sudo systemctl restart openqa-worker@{1..20} did the trick.
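The {1..20} in the command above relies on bash brace expansion. A minimal sketch of the same restart for a plain POSIX shell, with the instance count as a parameter, could look like this. The helper name worker_units is hypothetical and only prints unit names; the actual restart is left as a comment because it needs root and systemd.

```shell
# Hypothetical helper: print one openqa-worker unit name per instance, 1..N.
worker_units() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'openqa-worker@%s\n' "$i"
        i=$((i + 1))
    done
}

# Dry run: list the units that would be restarted for three instances.
worker_units 3

# To actually restart them (requires root and systemd), something like:
#   sudo systemctl restart $(worker_units 20)
```

Printing the unit names first makes it easy to check the list before anything is restarted.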

#4 Updated by cdywan 7 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

Worker instances seem to be working again

#5 Updated by cdywan 7 months ago

cdywan wrote:

Worker instances seem to be working again

Somehow openqaworker-arm-1:11 to openqaworker-arm-1:20 are in fact offline, not seen for 7 months. I guess this might be expected? Should these be pruned from the web UI? Or reinstated?

#6 Updated by okurz 7 months ago

  • Target version set to Ready

cdywan wrote:

cdywan wrote:

Worker instances seem to be working again

Somehow openqaworker-arm-1:11 to openqaworker-arm-1:20 are in fact offline, not seen for 7 months. I guess this might be expected?

Yes, expected. See https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls where you can find that the number of expected worker services is 10 (reduced from the previous 20).

Should these be pruned from the web UI? Or reinstated?

Not reinstated, could be pruned based on your personal preference.

#7 Updated by dzedro 7 months ago

openqaworker-arm-2 is down, many aarch64 jobs are waiting.

#8 Updated by dzedro 7 months ago

Serial console output on openqaworker-arm-2:

[66892.118949] usb 1-1.3: new high-speed USB device number 4 using xhci_hcd
[66892.329874] usb 1-1.3: New USB device found, idVendor=0624, idProduct=0251, bcdDevice= 0.00
[66892.340863] usb 1-1.3: New USB device strings: Mfr=4, Product=5, SerialNumber=6
[66892.350807] usb 1-1.3: Product: Mass Storage Function
[66892.358703] usb 1-1.3: Manufacturer: Avocent
[66892.365819] usb 1-1.3: SerialNumber: 20120731-1
[66892.470479] ------------[ cut here ]------------
[66892.477962] coherent pool not initialised!
[66892.485012] WARNING: CPU: 0 PID: 5 at kernel/dma/remap.c:190 dma_alloc_from_pool+0xa4/0xb0
[66892.496233] Modules linked in: btrfs blake2b_generic libcrc32c xor xor_neon hid_generic usbhid nicvf cavium_ptp ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm nvme gpio_keys nvme_core raid6_pq nicpf xhci_pci xhci_hcd usbcore thunderx_mmc mmc_core thunder_bgx thunder_xcv i2c_thunderx mdio_thunder mdio_cavium sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[66892.549095] CPU: 0 PID: 5 Comm: kworker/0:0 Tainted: G        W         5.7.12-1.g9c98feb-default #1 openSUSE Tumbleweed (unreleased)
[66892.567733] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
[66892.578764] Workqueue: usb_hub_wq hub_event [usbcore]
[66892.587284] pstate: 40000005 (nZcv daif -PAN -UAO)
[66892.595566] pc : dma_alloc_from_pool+0xa4/0xb0
[66892.603524] lr : dma_alloc_from_pool+0xa4/0xb0
[66892.611474] sp : ffff800010073510
[66892.618290] x29: ffff800010073510 x28: 0000000000000002 
[66892.627119] x27: 0000000000000001 x26: ffff0000f62b9000 
[66892.635941] x25: 0000000000000000 x24: ffff8000101adaf8 
[66892.644753] x23: ffff0000fb9970b0 x22: 0000000000001000 
[66892.653489] x21: ffff80001128a000 x20: ffff0000ff9c0000 
[66892.662133] x19: 0000000000000000 x18: 0000000000000030 
[66892.670695] x17: 0000000000000000 x16: 0000000000000000 
[66892.679151] x15: ffff0000ff9c0550 x14: 0000000000000001 
[66892.687535] x13: 0000000000000020 x12: ffff0000f6af9010 
[66892.695863] x11: 0000000000000008 x10: 00000000000007ed 
[66892.704143] x9 : ffff80001028300c x8 : 0000000000000001 
[66892.712343] x7 : 0000000000000000 x6 : 0000000000000000 
[66892.720449] x5 : 0000000000000000 x4 : ffff0000ffd34310 
[66892.728501] x3 : ffff0000ffd444c0 x2 : ffff0000ffd34310 
[66892.736519] x1 : 16d97a2aa875f600 x0 : 0000000000000000 
[66892.744482] Call trace:
[66892.749515]  dma_alloc_from_pool+0xa4/0xb0
[66892.756220]  dma_direct_alloc_pages+0x1f0/0x220
[66892.763317]  dma_direct_alloc+0x18/0x24
[66892.769651]  dma_alloc_attrs+0x88/0xfc
[66892.775827]  pool_alloc_page+0x74/0x110
[66892.781995]  dma_pool_alloc+0xe4/0x1c0
[66892.788128]  xhci_alloc_container_ctx+0xac/0x120 [xhci_hcd]
[66892.795969]  xhci_alloc_command_with_ctx+0x40/0x80 [xhci_hcd]
[66892.803966]  xhci_endpoint_reset+0xec/0x3d0 [xhci_hcd]
[66892.811379]  usb_hcd_reset_endpoint+0x2c/0xa0 [usbcore]
[66892.818890]  usb_enable_interface+0x104/0x110 [usbcore]
[66892.826364]  usb_set_configuration+0x2b4/0x970 [usbcore]
[66892.833902]  usb_generic_driver_probe+0x6c/0xb0 [usbcore]
[66892.841517]  usb_probe_device+0x48/0x130 [usbcore]
[66892.848481]  really_probe+0xe8/0x49c
[66892.854215]  driver_probe_device+0xec/0x140
[66892.860564]  __device_attach_driver+0x94/0x11c
[66892.867177]  bus_for_each_drv+0x80/0xd0
[66892.873172]  __device_attach+0x10c/0x1a0
[66892.879209]  device_initial_probe+0x1c/0x30
[66892.885433]  bus_probe_device+0xa4/0xb0
[66892.891225]  device_add+0x390/0x5e0
[66892.896614]  usb_new_device+0x1e0/0x424 [usbcore]
[66892.903154]  hub_port_connect+0x484/0x9d0 [usbcore]
[66892.909793]  hub_port_connect_change+0xd4/0x2f0 [usbcore]
[66892.916890]  port_event+0x448/0x5e0 [usbcore]
[66892.922907]  hub_event+0x148/0x4b0 [usbcore]
[66892.928774]  process_one_work+0x1e4/0x490
[66892.934362]  worker_thread+0x170/0x440
[66892.939685]  kthread+0x11c/0x120
[66892.944496]  ret_from_fork+0x10/0x18
[66892.949644] ---[ end trace abde0b1c25a58b7b ]---
[66892.955890] ------------[ cut here ]------------
[66892.962117] coherent pool not initialised!
[66892.967834] WARNING: CPU: 0 PID: 5 at kernel/dma/remap.c:190 dma_alloc_from_pool+0xa4/0xb0
[66892.977794] Modules linked in: btrfs blake2b_generic libcrc32c xor xor_neon hid_generic usbhid nicvf cavium_ptp ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm nvme gpio_keys nvme_core raid6_pq nicpf xhci_pci xhci_hcd usbcore thunderx_mmc mmc_core thunder_bgx thunder_xcv i2c_thunderx mdio_thunder mdio_cavium sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[66893.025981] CPU: 0 PID: 5 Comm: kworker/0:0 Tainted: G        W         5.7.12-1.g9c98feb-default #1 openSUSE Tumbleweed (unreleased)
[66893.042303] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
[66893.051996] Workqueue: usb_hub_wq hub_event [usbcore]
[66893.059335] pstate: 80000005 (Nzcv daif -PAN -UAO)
[66893.066433] pc : dma_alloc_from_pool+0xa4/0xb0
[66893.073200] lr : dma_alloc_from_pool+0xa4/0xb0
[66893.079940] sp : ffff800010073510
[66893.085530] x29: ffff800010073510 x28: 0000000000000003 
[66893.093132] x27: 0000000000000001 x26: ffff0000f57c6680 
[66893.100734] x25: 0000000000000000 x24: ffff8000101adaf8 
[66893.108333] x23: ffff0000fb9970b0 x22: 0000000000001000 
[66893.115938] x21: ffff80001128a000 x20: ffff0000ff9c0000 
[66893.123542] x19: 0000000000000000 x18: 0000000000000030 
[66893.131147] x17: 0000000000000000 x16: 0000000000000000 
[66893.138717] x15: ffff0000ff9c0550 x14: 0000000000000001 
[66893.146266] x13: 0000000000000020 x12: ffff0000f6af9db0 
[66893.153803] x11: 0000000000000008 x10: 0000000000000828 
[66893.161319] x9 : ffff800010186bfc x8 : 0000000000000001 
[66893.168831] x7 : 0000000000000000 x6 : 0000000000000000 
[66893.176321] x5 : 0000000000000000 x4 : 0000000000000000 
[66893.183778] x3 : ffff800010073300 x2 : 0000000000000001 
[66893.191211] x1 : 16d97a2aa875f600 x0 : 0000000000000000 
[66893.198628] Call trace:
[66893.203166]  dma_alloc_from_pool+0xa4/0xb0
[66893.209370]  dma_direct_alloc_pages+0x1f0/0x220
[66893.216009]  dma_direct_alloc+0x18/0x24
[66893.221944]  dma_alloc_attrs+0x88/0xfc
[66893.227784]  pool_alloc_page+0x74/0x110
[66893.233703]  dma_pool_alloc+0xe4/0x1c0
[66893.239552]  xhci_alloc_container_ctx+0xac/0x120 [xhci_hcd]
[66893.247256]  xhci_alloc_command_with_ctx+0x40/0x80 [xhci_hcd]
[66893.255133]  xhci_endpoint_reset+0xec/0x3d0 [xhci_hcd]
[66893.262426]  usb_hcd_reset_endpoint+0x2c/0xa0 [usbcore]
[66893.269803]  usb_enable_interface+0x104/0x110 [usbcore]
[66893.277173]  usb_set_configuration+0x2b4/0x970 [usbcore]
[66893.284637]  usb_generic_driver_probe+0x6c/0xb0 [usbcore]
[66893.292201]  usb_probe_device+0x48/0x130 [usbcore]
[66893.299137]  really_probe+0xe8/0x49c
[66893.304857]  driver_probe_device+0xec/0x140
[66893.311180]  __device_attach_driver+0x94/0x11c
[66893.317770]  bus_for_each_drv+0x80/0xd0
[66893.323743]  __device_attach+0x10c/0x1a0
[66893.329757]  device_initial_probe+0x1c/0x30
[66893.335960]  bus_probe_device+0xa4/0xb0
[66893.341730]  device_add+0x390/0x5e0
[66893.347095]  usb_new_device+0x1e0/0x424 [usbcore]
[66893.353612]  hub_port_connect+0x484/0x9d0 [usbcore]
[66893.360228]  hub_port_connect_change+0xd4/0x2f0 [usbcore]
[66893.367303]  port_event+0x448/0x5e0 [usbcore]
[66893.373295]  hub_event+0x148/0x4b0 [usbcore]
[66893.379134]  process_one_work+0x1e4/0x490
[66893.384700]  worker_thread+0x170/0x440
[66893.389993]  kthread+0x11c/0x120
[66893.394778]  ret_from_fork+0x10/0x18
[66893.399902] ---[ end trace abde0b1c25a58b7c ]---
[66894.454806] usb-storage 1-1.3:1.0: USB Mass Storage device detected
[66894.479144] scsi host20: usb-storage 1-1.3:1.0
[66894.490550] usbcore: registered new interface driver usb-storage
[66894.563171] usbcore: registered new interface driver uas
[66895.569951] scsi 20:0:0:0: CD-ROM            MP EMS   Virtual Media    0326 PQ: 0 ANSI: 0
[66895.639441] scsi 20:0:0:1: Direct-Access     MP EMS   Virtual Media    0326 PQ: 0 ANSI: 0 CCS
[66895.659763] scsi 20:0:0:2: Direct-Access     MP EMS   Virtual Media    0326 PQ: 0 ANSI: 0 CCS
[66895.690556] scsi 20:0:0:0: Attached scsi generic sg2 type 5
[66895.710789] sd 20:0:0:1: Power-on or device reset occurred
[66895.717955] sd 20:0:0:1: Attached scsi generic sg3 type 0
[66895.741025] sd 20:0:0:1: [sdc] Attached SCSI removable disk
[66895.772898] systemd-udevd: page allocation failure: order:0, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
[66895.788730] CPU: 0 PID: 681 Comm: systemd-udevd Tainted: G        W         5.7.12-1.g9c98feb-default #1 openSUSE Tumbleweed (unreleased)
[66895.805248] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
[66895.814803] Call trace:
[66895.819396]  dump_backtrace+0x0/0x1dc
[66895.825221]  show_stack+0x20/0x30
[66895.830701]  dump_stack+0xb8/0x110
[66895.836257]  warn_alloc+0xfc/0x170
[66895.841798]  __alloc_pages_slowpath.constprop.0+0x788/0x7a4
[66895.849534]  __alloc_pages_nodemask+0x2bc/0x31c
[66895.856215]  alloc_pages_current+0x90/0x14c
[66895.862535]  alloc_slab_page+0x1bc/0x360
[66895.868576]  allocate_slab+0x46c/0x4a0
[66895.874437]  new_slab+0x60/0xc0
[66895.879671]  new_slab_objects+0x90/0x150
[66895.885675]  ___slab_alloc+0x230/0x360
[66895.891490]  __slab_alloc+0x28/0x5c
[66895.897030]  kmem_cache_alloc_trace+0x26c/0x280
[66895.903632]  get_capabilities+0x4c/0x370 [sr_mod]
[66895.910414]  sr_probe+0x220/0x39c [sr_mod]
[66895.916583]  really_probe+0xe8/0x49c
[66895.922229]  driver_probe_device+0xec/0x140
[66895.928487]  device_driver_attach+0xc8/0xd0
[66895.934741]  __driver_attach+0xac/0x180
[66895.940647]  bus_for_each_dev+0x78/0xcc
[66895.946545]  driver_attach+0x2c/0x40
[66895.952171]  bus_add_driver+0x150/0x244
[66895.958059]  driver_register+0x80/0x13c
[66895.963932]  scsi_register_driver+0x28/0x3c
[66895.970147]  init_sr+0x44/0x1000 [sr_mod]
[66895.976187]  do_one_initcall+0x50/0x230
[66895.982054]  do_init_module+0x60/0x290
[66895.987830]  load_module+0x624/0x914
[66895.993433]  __do_sys_init_module+0xe8/0x16c
[66895.999733]  __arm64_sys_init_module+0x24/0x30
[66896.006185]  el0_svc_common.constprop.0+0x84/0x230
[66896.012970]  do_el0_svc+0x2c/0x94
[66896.018254]  el0_svc+0x18/0x50
[66896.023280]  el0_sync_handler+0x120/0x1b4
[66896.029227]  el0_sync+0x158/0x180
[66896.034491] Mem-Info:
[66896.038710] active_anon:2638 inactive_anon:846 isolated_anon:0
[66896.038710]  active_file:186 inactive_file:118 isolated_file:0
[66896.038710]  unevictable:13427 dirty:0 writeback:0 unstable:0
[66896.038710]  slab_reclaimable:2982 slab_unreclaimable:4418
[66896.038710]  mapped:3702 shmem:891 pagetables:119 bounce:0
[66896.038710]  free:855 free_pcp:21 free_cma:0
[66896.082846] Node 0 active_anon:10552kB inactive_anon:3384kB active_file:744kB inactive_file:472kB unevictable:53708kB isolated(anon):0kB isolated(file):0kB mapped:14808kB dirty:0kB writeback:0kB shmem:3564kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[66896.115909] Node 0 DMA free:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:8192kB managed:0kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[66896.148875] lowmem_reserve[]: 0 102 102 102 102
[66896.155548] Node 0 DMA32 free:3420kB min:1296kB low:1620kB high:1944kB reserved_highatomic:0KB active_anon:10552kB inactive_anon:3384kB active_file:744kB inactive_file:472kB unevictable:53708kB writepending:0kB present:215132kB managed:129984kB mlocked:0kB kernel_stack:2288kB pagetables:476kB bounce:0kB free_pcp:84kB local_pcp:84kB free_cma:0kB
[66896.193122] lowmem_reserve[]: 0 0 0 0 0
[66896.199380] Node 0 Normal free:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:12824kB managed:0kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[66896.234420] lowmem_reserve[]: 0 0 0 0 0
[66896.240967] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[66896.254080] Node 0 DMA32: 29*4kB (ME) 3*8kB (UME) 1*16kB (M) 4*32kB (ME) 5*64kB (ME) 4*128kB (UE) 3*256kB (UE) 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3420kB
[66896.274435] Node 0 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[66896.288022] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[66896.299807] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
[66896.311398] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[66896.322883] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
[66896.334176] 14622 total pagecache pages
[66896.341038] 0 pages in swap cache
[66896.347362] Swap cache stats: add 0, delete 0, find 0/0
[66896.355637] Free swap  = 0kB
[66896.361555] Total swap = 0kB
[66896.367456] 59037 pages RAM
[66896.373264] 0 pages HighMem/MovableOnly
[66896.380122] 26541 pages reserved
[66896.386361] 0 pages cma reserved
[66896.392596] 0 pages hwpoisoned
[66896.398642] SLUB: Unable to allocate memory on node -1, gfp=0xcc1(GFP_KERNEL|GFP_DMA)
[66896.409547]   cache: dma-kmalloc-512, object size: 512, buffer size: 512, default order: 0, min order: 0
[66896.422175]   node 0: slabs: 0, objs: 0, free: 0
[66896.429941] sr 20:0:0:0: [sr0] out of memory.
[66896.437433] cdrom: Uniform CD-ROM driver Revision: 3.20
[66896.449836] scsi 20:0:0:2: Attached scsi generic sg4 type 0
[66896.467333] sd 20:0:0:2: Power-on or device reset occurred
[66896.538535] sd 20:0:0:2: [sdd] Attached SCSI removable disk
[66897.405495] sr 20:0:0:0: Power-on or device reset occurred
[66898.720850] usb 1-1.3: USB disconnect, device number 4
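For triage, a long console dump like the one above can be condensed to its warning and allocation-failure lines. The following is a rough sketch, not an established tool: summarize_console_log is a hypothetical helper name, and the pattern list is an assumption based only on the messages visible in this log. It strips the kernel timestamps so that repeated messages collapse into one counted line.

```shell
# Hypothetical helper: count warning and allocation-failure messages in a
# kernel log. Reads the files given as arguments, or stdin if none are given.
summarize_console_log() {
    grep -hE 'WARNING:|page allocation failure|Unable to allocate memory|out of memory' "$@" |
        sed -E 's/^\[ *[0-9]+\.[0-9]+\] //' |   # drop the "[seconds.micros] " prefix
        sort | uniq -c | sort -rn                # most frequent message first
}

# Demo with three lines copied from the console output above.
printf '%s\n' \
    '[66892.485012] WARNING: CPU: 0 PID: 5 at kernel/dma/remap.c:190 dma_alloc_from_pool+0xa4/0xb0' \
    '[66892.967834] WARNING: CPU: 0 PID: 5 at kernel/dma/remap.c:190 dma_alloc_from_pool+0xa4/0xb0' \
    '[66896.429941] sr 20:0:0:0: [sr0] out of memory.' |
    summarize_console_log
```

In the demo, the two dma_alloc_from_pool warnings differ only in their timestamps, so they are reported as a single message with a count of 2, followed by the sr0 out-of-memory line.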

#9 Updated by cdywan 7 months ago

nicksinger came up with this magical remedy:

ipmitool sol set volatile-bit-rate 115.2 1
ipmitool sol set non-volatile-bit-rate 115.2 1
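For reference, ipmitool's sol set takes a parameter name, a value and an optional channel, so the commands above set the serial-over-LAN bit rate to 115.2 kbps on channel 1. A small dry-run sketch that generates both commands for a given rate and channel follows. The helper name sol_bitrate_cmds is hypothetical, and it only prints the commands rather than running them.

```shell
# Hypothetical helper: print the ipmitool commands that set the volatile and
# non-volatile serial-over-LAN bit rate for a given rate (kbps) and channel.
sol_bitrate_cmds() {
    rate="$1"
    channel="$2"
    for param in volatile-bit-rate non-volatile-bit-rate; do
        printf 'ipmitool sol set %s %s %s\n' "$param" "$rate" "$channel"
    done
}

# Dry run for the values used in this ticket.
sol_bitrate_cmds 115.2 1

# The current settings can be inspected afterwards with:
#   ipmitool sol info 1
```

Setting both the volatile and the non-volatile value means the rate applies immediately and is also kept as the stored default.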

#10 Updated by cdywan 7 months ago

  • Status changed from Feedback to Resolved

I suppose the machine is fine now
