action #59858

"Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." broken NVMe on openqaworker13, jobs incomplete trying to save snapshots

Added by okurz 3 months ago. Updated 17 days ago.

Status: Feedback
Priority: High
Assignee: okurz
Category: -
Target version: openQA Project - Current Sprint
Start date: 14/11/2019
Due date:
% Done: 0%
Duration:

Description

Observation

https://openqa.suse.de/tests/3594947/file/autoinst-log.txt on openqaworker13 shows:

[2019-11-14T20:32:28.969 CET] [debug] EVENT {"data":{"status":"active"},"event":"MIGRATION","timestamp":{"microseconds":476972,"seconds":1573759948}}
[2019-11-14T20:32:28.969 CET] [debug] Migrating total bytes:        1078796288
…
[2019-11-14T20:36:28.601 CET] [debug] Migrating total bytes:        1078796288
[2019-11-14T20:36:28.601 CET] [debug] Migrating remaining bytes:    337879040
[2019-11-14T20:36:28.602 CET] [debug] EVENT {"data":{"status":"cancelling"},"event":"MIGRATION","timestamp":{"microseconds":602550,"seconds":1573760188}}
[2019-11-14T20:36:28.603 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260.

From dmesg on openqaworker13 I can see:

[Thu Nov 14 21:03:15 2019] nvme nvme0: Abort status: 0x0
…
[Thu Nov 14 21:03:15 2019] nvme nvme0: Abort status: 0x0
[Thu Nov 14 21:03:24 2019] nvme nvme0: I/O 451 QID 3 timeout, aborting
…
[Thu Nov 14 21:03:24 2019] nvme nvme0: I/O 458 QID 3 timeout, aborting
[Thu Nov 14 21:03:28 2019] nvme nvme0: Abort status: 0x0
…
[Thu Nov 14 21:03:28 2019] nvme nvme0: Abort status: 0x0

so it looks like the NVMe device is in use but is either not providing the expected performance or is faulty.
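
To re-check the symptoms on the worker later, roughly the following should do (a minimal sketch; the device path /dev/nvme0 is taken from the dmesg output above):

# recent NVMe abort/timeout messages from the kernel log
ssh openqaworker13 "sudo dmesg -T | grep -E 'nvme.*(Abort|timeout)' | tail"
# SMART/health data of the suspected device
ssh openqaworker13 "sudo smartctl -a /dev/nvme0"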


Related issues

Related to openQA Tests - action #60743: [qam] Migrate to file failed, it has been running for mor... Resolved 05/12/2019
Blocked by openQA Project - action #62108: auto_review:"Can.t fcntl.*Operation not permitted at .*vi... Resolved 14/01/2020

History

#1 Updated by okurz 3 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Current Sprint

This seems to produce some incompletes which are not automatically retriggered. I triggered a reboot of openqaworker13. I have not (yet) looked for all related incompletes but restarted all incompletes which were running on this machine with:

env worker=openqaworker13 host=openqa.suse.de failed_since=2019-11-10 ~/local/os-autoinst/scripts/openqa-restart-incompletes-on-worker-instance

#2 Updated by okurz 3 months ago

  • Status changed from In Progress to Feedback

Machine is back up and working on jobs already. Do we have a monitoring alert that could inform us about related problems in the future?
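
A check along these lines could probably be hooked into our monitoring (a minimal sketch, not an existing alert; the time window and exit codes are assumptions):

#!/bin/bash
# alert when recent kernel messages contain NVMe abort/timeout errors
set -euo pipefail
if journalctl -k --since "1 hour ago" | grep -Eq 'nvme.*(Abort status|timeout, aborting)'; then
    echo "CRITICAL: recent NVMe abort/timeout messages in kernel log"
    exit 2
fi
echo "OK: no recent NVMe abort/timeout messages"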

#3 Updated by okurz 3 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

Seeing this again:

$ ssh openqaworker13 "sudo dmesg -T | grep 'nvme.*Abort' | tail | uniq"
[Fri Nov 22 13:22:20 2019] nvme nvme0: Abort status: 0x0
[Wed Nov 27 01:33:20 2019] nvme nvme0: Abort status: 0x0

This only happens on openqaworker13:

$ sudo salt -l error -C 'G@roles:worker' cmd.run 'sudo dmesg -T | grep "nvme.*Abort" | tail | uniq'
QA-Power8-4-kvm.qa.suse.de:
openqaworker2.suse.de:
openqaworker10.suse.de:
openqaworker8.suse.de:
openqaworker13.suse.de:
    [Fri Nov 22 13:22:20 2019] nvme nvme0: Abort status: 0x0
    [Wed Nov 27 01:33:20 2019] nvme nvme0: Abort status: 0x0
openqaworker7.suse.de:
openqaworker9.suse.de:
openqaworker5.suse.de:
powerqaworker-qam-1:
QA-Power8-5-kvm.qa.suse.de:
openqaworker6.suse.de:
malbec.arch.suse.de:
openqaworker-arm-2.suse.de:
openqaworker-arm-1.suse.de:
openqaworker3.suse.de:
grenache-1.qa.suse.de:

I suspect a loose connection or broken device which needs replacement.

#4 Updated by okurz 3 months ago

  • Related to action #60743: [qam] Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 263. added

#5 Updated by okurz 3 months ago

Disabled openQA worker instances by worker class with https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/202
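
The idea behind that change is to restrict the machine's worker class so it only picks up jobs that explicitly target it. A quick way to verify the effect on the worker itself (a sketch, assuming the usual workers.ini setup; the actual pillar change lives in the MR above):

# the instances should now only advertise a host-specific class
# instead of the generic qemu_x86_64 one
ssh openqaworker13 "grep WORKER_CLASS /etc/openqa/workers.ini"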

@nicksinger can you take over, check NVMe physically, replace it, rub it on your t-shirt or something?

#6 Updated by okurz 3 months ago

  • Subject changed from jobs incomplete on openqaworker13 trying to save snapshots with "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." to broken NVMe on openqaworker13, jobs incomplete trying to save snapshots with "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260."

#7 Updated by okurz 3 months ago

  • Subject changed from broken NVMe on openqaworker13, jobs incomplete trying to save snapshots with "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." to "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." broken NVMe on openqaworker13, jobs incomplete trying to save snapshots

#8 Updated by okurz 24 days ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

I understand that the machine is within SRV2 EXT, which is maintained by Engineering Infrastructure, and the NVMes are also part of the ordered hardware as closed cases with a warranty seal, i.e. we probably do not want to open the case and check the seating of the hardware ourselves. So I created https://infra.nue.suse.com/SelfService/Display.html?id=162199 shared with #62849

#9 Updated by okurz 20 days ago

  • Status changed from Blocked to Feedback

Not much news. gschlotter reported he "did some things" but it's not clear whether he did that for both openqaworker4 and openqaworker13. In any case, in #46742#note-33 I already ran tests fine on openqaworker13 with a different setup of how the NVMe device is used.

smartctl -a /dev/nvme0 reports no problems.
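
For reference, the quick health check amounts to roughly this (the nvme smart-log call is an addition on my side and assumes nvme-cli is installed on the worker):

sudo smartctl -a /dev/nvme0        # overall SMART health, error counters
sudo nvme smart-log /dev/nvme0     # media errors, error log entries, temperature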

So trying harder with currently 16 worker instances running:

for i in {001..100}; do openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de 3853128 _GROUP=0 BUILD=poo59858 WORKER_CLASS=openqaworker13 TEST=wicked_advanced_sut_poo59858_$i; done

Cloning dependencies of sle-15-SP2-Online-x86_64-Build134.1-wicked_advanced_sut@64bit
Created job #3860458: sle-15-SP2-Online-x86_64-Build134.1-wicked_advanced_ref@64bit -> https://openqa.suse.de/t3860458
Created job #3860459: sle-15-SP2-Online-x86_64-Build134.1-wicked_advanced_sut@64bit -> https://openqa.suse.de/t3860459

This looks a bit weird: because I use --parental-inheritance but also want to tailor the name with TEST=…, both the "ref" and the "sut" job end up with the same name, and the build overview shows only the sut as the "later" job within the build:

build: https://openqa.suse.de/tests/overview?build=poo59858&distri=sle&version=15-SP2

Many jobs incomplete with Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 147. It seems the installed os-autoinst has a bug that is already fixed in current os-autoinst master. rpm -q --changelog os-autoinst shows:

* Mon Jan 20 2020 okurz@suse.com
- Update to version 4.6.1579522398.2afcda12:

Doing zypper dup. Updated os-autoinst, restarted openqa-worker@{1..16} and retriggered incomplete tests.
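
Roughly the sequence used on the worker (a sketch; the retrigger uses the same script as in comment #1 with an adjusted failed_since):

sudo zypper dup
sudo systemctl restart openqa-worker@{1..16}
# retrigger incompletes as in comment #1, with failed_since adjusted
env worker=openqaworker13 host=openqa.suse.de failed_since=<date> ~/local/os-autoinst/scripts/openqa-restart-incompletes-on-worker-instance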

#10 Updated by okurz 20 days ago

  • Status changed from Feedback to Blocked

No luck, e.g. https://openqa.suse.de/tests/3860547/file/autoinst-log.txt shows Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 128. Blocked on #62108

#11 Updated by cfconrad 20 days ago

  • Blocked by action #62108: auto_review:"Can.t fcntl.*Operation not permitted at .*virtio_terminal.pm" test incompletes added

#12 Updated by okurz 17 days ago

  • Status changed from Blocked to Feedback

A new change by cfconrad, https://github.com/os-autoinst/os-autoinst/pull/1348, has been merged and deployed on openqaworker13. Trying again:

for i in {001..100}; do openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de 3867902 _GROUP=0 BUILD=poo59858:2 WORKER_CLASS=openqaworker13 TEST=wicked_advanced_sut_poo59858_$i; done

-> https://openqa.suse.de/tests/overview?distri=sle&build=poo59858%3A2&version=15-SP2

EDIT: Hm, this did not yield any I/O errors. Maybe I "fixed" the NVMe by recreating the filesystem and using it as shared for cache+pool or something?

EDIT: 2020-02-13: Still no new NVMe errors with sudo dmesg -T | grep 'nvme.*Abort' | tail | uniq. I have re-enabled the default worker class for this worker.
