action #59858

"Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." broken NVMe on openqaworker13, jobs incomplete trying to save snapshots

Added by okurz 5 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee: okurz
Category: -
Target version: openQA Project - Current Sprint
Start date: 14/11/2019
Due date: -
% Done: 0%
Duration: -

Description

Observation

https://openqa.suse.de/tests/3594947/file/autoinst-log.txt on openqaworker13 shows:

[2019-11-14T20:32:28.969 CET] [debug] EVENT {"data":{"status":"active"},"event":"MIGRATION","timestamp":{"microseconds":476972,"seconds":1573759948}}
[2019-11-14T20:32:28.969 CET] [debug] Migrating total bytes:        1078796288
…
[2019-11-14T20:36:28.601 CET] [debug] Migrating total bytes:        1078796288
[2019-11-14T20:36:28.601 CET] [debug] Migrating remaining bytes:    337879040
[2019-11-14T20:36:28.602 CET] [debug] EVENT {"data":{"status":"cancelling"},"event":"MIGRATION","timestamp":{"microseconds":602550,"seconds":1573760188}}
[2019-11-14T20:36:28.603 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260.
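As a side note, the "240" in the error message is the migration timeout in seconds, and the two MIGRATION event timestamps above bracket exactly that window. A minimal sketch of the arithmetic (the epoch values are copied from the "seconds" fields in the log excerpt):

```shell
# The "seconds" fields from the two MIGRATION events above:
active=1573759948      # status "active": migration started
cancelling=1573760188  # status "cancelling": the backend gave up
echo $((cancelling - active))  # prints 240, matching the backend timeout
```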

From dmesg on openqaworker13 I can see:

[Thu Nov 14 21:03:15 2019] nvme nvme0: Abort status: 0x0
…
[Thu Nov 14 21:03:15 2019] nvme nvme0: Abort status: 0x0
[Thu Nov 14 21:03:24 2019] nvme nvme0: I/O 451 QID 3 timeout, aborting
…
[Thu Nov 14 21:03:24 2019] nvme nvme0: I/O 458 QID 3 timeout, aborting
[Thu Nov 14 21:03:28 2019] nvme nvme0: Abort status: 0x0
…
[Thu Nov 14 21:03:28 2019] nvme nvme0: Abort status: 0x0

so it looks like the NVMe device is in use but is either not providing the expected performance or is faulty.
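To get a quick per-message-type count of such controller errors, the dmesg lines can be summarized as below. This is only a sketch: the heredoc stands in for the excerpt above, and on a real worker the input would come from `sudo dmesg -T` instead.

```shell
# Summarize NVMe error messages by type. The heredoc mimics the dmesg
# excerpt above; on a worker, feed it `sudo dmesg -T` instead.
cat <<'EOF' |
[Thu Nov 14 21:03:15 2019] nvme nvme0: Abort status: 0x0
[Thu Nov 14 21:03:24 2019] nvme nvme0: I/O 451 QID 3 timeout, aborting
[Thu Nov 14 21:03:24 2019] nvme nvme0: I/O 458 QID 3 timeout, aborting
[Thu Nov 14 21:03:28 2019] nvme nvme0: Abort status: 0x0
EOF
grep -oE 'Abort status|timeout, aborting' | sort | uniq -c
# prints a count per message type, e.g. "2 Abort status" and
# "2 timeout, aborting" for the sample input
```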


Related issues

Related to openQA Tests - action #60743: [qam] Migrate to file failed, it has been running for mor... Resolved 05/12/2019
Blocked by openQA Project - action #62108: auto_review:"Can.t fcntl.*Operation not permitted at .*vi... Resolved 14/01/2020

History

#1 Updated by okurz 5 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Current Sprint

This seems to produce some incompletes which are not automatically retriggered. I triggered a reboot of openqaworker13. I have not (yet) looked for all related incompletes but restarted all incompletes that were running on this machine with

env worker=openqaworker13 host=openqa.suse.de failed_since=2019-11-10 ~/local/os-autoinst/scripts/openqa-restart-incompletes-on-worker-instance

#2 Updated by okurz 5 months ago

  • Status changed from In Progress to Feedback

The machine is back up and already working on jobs. Do we have a monitoring alert that could inform us about related problems in the future?
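One option would be a simple check that an alerting system could run on every worker. This is only a sketch under the assumption that any "nvme … Abort" line in dmesg is worth alerting on; the function name is made up for illustration:

```shell
# Hypothetical check for a monitoring alert: return nonzero if the given
# dmesg-style input contains NVMe abort messages. On a real worker the
# input would be the output of `dmesg -T`.
check_nvme_aborts() {
    if printf '%s\n' "$1" | grep -q 'nvme.*Abort'; then
        echo "ALERT: NVMe abort messages found"
        return 1
    fi
    echo "OK: no NVMe abort messages"
}

check_nvme_aborts "[Thu Nov 14 21:03:15 2019] nvme nvme0: Abort status: 0x0" || true
check_nvme_aborts "[Thu Nov 14 21:00:00 2019] usb 1-1: new device"
```

The nonzero return code is what an alerting pipeline (e.g. a periodic systemd service feeding the monitoring) would act on.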

#3 Updated by okurz 4 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

Seeing this again:

$ ssh openqaworker13 "sudo dmesg -T | grep 'nvme.*Abort' | tail | uniq"
[Fri Nov 22 13:22:20 2019] nvme nvme0: Abort status: 0x0
[Wed Nov 27 01:33:20 2019] nvme nvme0: Abort status: 0x0

It only happens on openqaworker13:

$ sudo salt -l error -C 'G@roles:worker' cmd.run 'sudo dmesg -T | grep "nvme.*Abort" | tail | uniq'
QA-Power8-4-kvm.qa.suse.de:
openqaworker2.suse.de:
openqaworker10.suse.de:
openqaworker8.suse.de:
openqaworker13.suse.de:
    [Fri Nov 22 13:22:20 2019] nvme nvme0: Abort status: 0x0
    [Wed Nov 27 01:33:20 2019] nvme nvme0: Abort status: 0x0
openqaworker7.suse.de:
openqaworker9.suse.de:
openqaworker5.suse.de:
powerqaworker-qam-1:
QA-Power8-5-kvm.qa.suse.de:
openqaworker6.suse.de:
malbec.arch.suse.de:
openqaworker-arm-2.suse.de:
openqaworker-arm-1.suse.de:
openqaworker3.suse.de:
grenache-1.qa.suse.de:

I suspect a loose connection or broken device which needs replacement.

#4 Updated by okurz 4 months ago

  • Related to action #60743: [qam] Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 263. added

#5 Updated by okurz 4 months ago

Disabled openQA worker instances by worker class with https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/202

@nicksinger can you take over, check NVMe physically, replace it, rub it on your t-shirt or something?

#6 Updated by okurz 4 months ago

  • Subject changed from jobs incomplete on openqaworker13 trying to save snapshots with "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." to broken NVMe on openqaworker13, jobs incomplete trying to save snapshots with "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260."

#7 Updated by okurz 4 months ago

  • Subject changed from broken NVMe on openqaworker13, jobs incomplete trying to save snapshots with "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." to "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." broken NVMe on openqaworker13, jobs incomplete trying to save snapshots

#8 Updated by okurz 2 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

I understood that the machine is within SRV2 EXT, which is maintained by Engineering Infrastructure, and that the NVMes are part of the ordered hardware as closed cases with a warranty seal, i.e. we probably do not want to open the case and check the seating of the hardware ourselves. So I created https://infra.nue.suse.com/SelfService/Display.html?id=162199, shared with #62849

#9 Updated by okurz 2 months ago

  • Status changed from Blocked to Feedback

Not much news. gschlotter reported he "did some things", but it is not clear whether he did that for both openqaworker4 and openqaworker13. In any case, in #46742#note-33 I already ran tests fine on openqaworker13 with a different setup of how the NVMe device is used.

smartctl -a /dev/nvme0 reports no problems.

So trying harder with currently 16 worker instances running:

for i in {001..100}; do openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de 3853128 _GROUP=0 BUILD=poo59858 WORKER_CLASS=openqaworker13 TEST=wicked_advanced_sut_poo59858_$i; done

Cloning dependencies of sle-15-SP2-Online-x86_64-Build134.1-wicked_advanced_sut@64bit
Created job #3860458: sle-15-SP2-Online-x86_64-Build134.1-wicked_advanced_ref@64bit -> https://openqa.suse.de/t3860458
Created job #3860459: sle-15-SP2-Online-x86_64-Build134.1-wicked_advanced_sut@64bit -> https://openqa.suse.de/t3860459

This looks a bit weird: because I use --parental-inheritance but also want to tailor the name with TEST=…, both the "ref" and the "sut" job end up with the same name, and the build overview shows only the sut as the "later" one within the build:

build: https://openqa.suse.de/tests/overview?build=poo59858&distri=sle&version=15-SP2

Many jobs incomplete with Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 147. Seems like os-autoinst has a bug that has already been fixed in current os-autoinst master. rpm -q --changelog os-autoinst shows:

* Mon Jan 20 2020 okurz@suse.com
- Update to version 4.6.1579522398.2afcda12:

Did zypper dup, which updated os-autoinst, then restarted openqa-worker@{1..16} and retriggered the incomplete tests.

#10 Updated by okurz 2 months ago

  • Status changed from Feedback to Blocked

No luck, e.g. https://openqa.suse.de/tests/3860547/file/autoinst-log.txt shows Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 128. Blocked on #62108

#11 Updated by cfconrad 2 months ago

  • Blocked by action #62108: auto_review:"Can.t fcntl.*Operation not permitted at .*virtio_terminal.pm" test incompletes added

#12 Updated by okurz about 1 month ago

  • Status changed from Blocked to Feedback

A new change by cfconrad, https://github.com/os-autoinst/os-autoinst/pull/1348, has been merged and deployed on openqaworker13; trying again:

for i in {001..100}; do openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de 3867902 _GROUP=0 BUILD=poo59858:2 WORKER_CLASS=openqaworker13 TEST=wicked_advanced_sut_poo59858_$i; done

-> https://openqa.suse.de/tests/overview?distri=sle&build=poo59858%3A2&version=15-SP2

EDIT: Hm, this did not yield any I/O errors. Maybe I "fixed" the NVMe by recreating the filesystem and using it as shared storage for cache+pool or something?

EDIT: 2020-02-13: Still no new NVMe errors with sudo dmesg -T | grep 'nvme.*Abort' | tail | uniq. I have re-enabled the default worker class for this worker.

#13 Updated by okurz about 1 month ago

  • Status changed from Feedback to Resolved

gschlotter informed me that he has replaced the NVMe on openqaworker13. I assume the machine always had only a single NVMe, as he could not find a second one. I accepted the salt key for the machine again, applied the current worker highstate and rebooted. Jobs should be picked up as usual after the reboot. As the previous NVMe installation also did not show any problems anymore after I recreated the filesystem to use it as shared storage for cache+pool, I suggest not declaring the previous NVMe device faulty (yet). I have informed gschlotter.

I have monitored for 1 h now and jobs are being executed fine so far.
