action #59858 (closed)

"Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." broken NVMe on openqaworker13, jobs incomplete trying to save snapshots

Added by okurz over 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2019-11-14
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://openqa.suse.de/tests/3594947/file/autoinst-log.txt on openqaworker13 shows the QEMU migration to file (the snapshot save) being cancelled after the 240 s timeout:

[2019-11-14T20:32:28.969 CET] [debug] EVENT {"data":{"status":"active"},"event":"MIGRATION","timestamp":{"microseconds":476972,"seconds":1573759948}}
[2019-11-14T20:32:28.969 CET] [debug] Migrating total bytes:        1078796288
…
[2019-11-14T20:36:28.601 CET] [debug] Migrating total bytes:        1078796288
[2019-11-14T20:36:28.601 CET] [debug] Migrating remaining bytes:    337879040
[2019-11-14T20:36:28.602 CET] [debug] EVENT {"data":{"status":"cancelling"},"event":"MIGRATION","timestamp":{"microseconds":602550,"seconds":1573760188}}
[2019-11-14T20:36:28.603 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260.

From dmesg on openqaworker13 I can see:

[Thu Nov 14 21:03:15 2019] nvme nvme0: Abort status: 0x0
…
[Thu Nov 14 21:03:15 2019] nvme nvme0: Abort status: 0x0
[Thu Nov 14 21:03:24 2019] nvme nvme0: I/O 451 QID 3 timeout, aborting
…
[Thu Nov 14 21:03:24 2019] nvme nvme0: I/O 458 QID 3 timeout, aborting
[Thu Nov 14 21:03:28 2019] nvme nvme0: Abort status: 0x0
…
[Thu Nov 14 21:03:28 2019] nvme nvme0: Abort status: 0x0

so it looks like the NVMe device is in use but either not delivering the expected performance or faulty.
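
A quick way to confirm on the worker whether the device is actually degrading (rough sketch; the device name /dev/nvme0 is taken from the messages above, and smartctl from smartmontools is assumed to be installed):

# kernel log: aborted commands and I/O timeouts on the NVMe controller
sudo dmesg -T | grep -E 'nvme.*(Abort status|timeout, aborting)' | tail
# drive self-reported health: media errors, spare capacity, temperature
sudo smartctl -a /dev/nvme0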


Related issues (2): 0 open, 2 closed

Related to openQA Tests - action #60743: [qam] Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 263. (Resolved, okurz, 2019-12-05)

Blocked by openQA Project - action #62108: auto_review:"Can.t fcntl.*Operation not permitted at .*virtio_terminal.pm" test incompletes (Resolved, cfconrad, 2020-01-14)

Actions #1

Updated by okurz over 4 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Current Sprint

This seems to produce some incompletes which are not automatically retriggered. I triggered a reboot of openqaworker13. I have not (yet) looked for all related incompletes but restarted all incompletes that were running on this machine with:

env worker=openqaworker13 host=openqa.suse.de failed_since=2019-11-10 ~/local/os-autoinst/scripts/openqa-restart-incompletes-on-worker-instance
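
The script above is a local helper; roughly, it looks up incomplete jobs that ran on the given worker since failed_since and restarts them via the openQA API. A hypothetical manual equivalent for a single job id, assuming the openqa-client CLI from the openQA-client package:

# restart one job by id (POST /api/v1/jobs/<id>/restart)
openqa-client --host openqa.suse.de jobs/3594947/restart post
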
Actions #2

Updated by okurz over 4 years ago

  • Status changed from In Progress to Feedback

Machine is back up and working on jobs already. Do we have a monitoring alert that could inform us about such problems in the future?
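
One simple option would be a periodic check on every worker that fails as soon as NVMe aborts or I/O timeouts show up in the kernel log; a minimal sketch (the script name is hypothetical, to be wired into whatever check mechanism the monitoring on the workers already offers):

#!/bin/sh
# check_nvme_errors (hypothetical): exit non-zero if the kernel log contains
# NVMe aborts or I/O timeouts, so a cron job or monitoring agent can alert.
# Needs to run as root for dmesg access.
if dmesg -T | grep -qE 'nvme.*(Abort status|timeout, aborting)'; then
    echo "CRITICAL: NVMe errors found in dmesg on $(hostname)"
    exit 2
fi
echo "OK: no NVMe errors in dmesg"
exit 0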

Actions #3

Updated by okurz over 4 years ago

  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

Seeing this again:

$ ssh openqaworker13 "sudo dmesg -T | grep 'nvme.*Abort' | tail | uniq"
[Fri Nov 22 13:22:20 2019] nvme nvme0: Abort status: 0x0
[Wed Nov 27 01:33:20 2019] nvme nvme0: Abort status: 0x0

This only happens on openqaworker13:

$ sudo salt -l error -C 'G@roles:worker' cmd.run 'sudo dmesg -T | grep "nvme.*Abort" | tail | uniq'
QA-Power8-4-kvm.qa.suse.de:
openqaworker2.suse.de:
openqaworker10.suse.de:
openqaworker8.suse.de:
openqaworker13.suse.de:
    [Fri Nov 22 13:22:20 2019] nvme nvme0: Abort status: 0x0
    [Wed Nov 27 01:33:20 2019] nvme nvme0: Abort status: 0x0
openqaworker7.suse.de:
openqaworker9.suse.de:
openqaworker5.suse.de:
powerqaworker-qam-1:
QA-Power8-5-kvm.qa.suse.de:
openqaworker6.suse.de:
malbec.arch.suse.de:
openqaworker-arm-2.suse.de:
openqaworker-arm-1.suse.de:
openqaworker3.suse.de:
grenache-1.qa.suse.de:

I suspect a loose connection or broken device which needs replacement.

Actions #4

Updated by okurz over 4 years ago

  • Related to action #60743: [qam] Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 263. added
Actions #5

Updated by okurz over 4 years ago

Disabled openQA worker instances by worker class with https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/202
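
For reference, a sketch of what that change amounts to on the worker side, assuming the pillars end up in /etc/openqa/workers.ini: the generic worker class is replaced by a dedicated one so that the scheduler only assigns jobs which explicitly request it, as the verification jobs later in this ticket do with WORKER_CLASS=openqaworker13.

# /etc/openqa/workers.ini on openqaworker13 (sketch, not the actual MR content)
[global]
WORKER_CLASS = openqaworker13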

@nicksinger can you take over, check NVMe physically, replace it, rub it on your t-shirt or something?

Actions #6

Updated by okurz over 4 years ago

  • Subject changed from jobs incomplete on openqaworker13 trying to save snapshots with "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." to broken NVMe on openqaworker13, jobs incomplete trying to save snapshots with "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260."
Actions #7

Updated by okurz over 4 years ago

  • Subject changed from broken NVMe on openqaworker13, jobs incomplete trying to save snapshots with "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." to "Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 260." broken NVMe on openqaworker13, jobs incomplete trying to save snapshots
Actions #8

Updated by okurz about 4 years ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

I understood that the machine is within SRV2 EXT, which is maintained by Engineering Infrastructure, and also that the NVMes are part of the ordered hardware as closed cases with a warranty seal, i.e. we probably do not want to open the case and check the seating of the hardware ourselves. So I created https://infra.nue.suse.com/SelfService/Display.html?id=162199, shared with #62849.

Actions #9

Updated by okurz about 4 years ago

  • Status changed from Blocked to Feedback

Not much news. gschlotter reported he "did some things" but it's not clear whether he did that for both openqaworker4 and openqaworker13. In any case, in #46742#note-33 I already ran tests fine on openqaworker13 with a different setup of how the NVMe device is used.

smartctl -a /dev/nvme0 reports no problems.

So trying harder with currently 16 worker instances running:

for i in {001..100}; do openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de 3853128 _GROUP=0 BUILD=poo59858 WORKER_CLASS=openqaworker13 TEST=wicked_advanced_sut_poo59858_$i; done

Cloning dependencies of sle-15-SP2-Online-x86_64-Build134.1-wicked_advanced_sut@64bit
Created job #3860458: sle-15-SP2-Online-x86_64-Build134.1-wicked_advanced_ref@64bit -> https://openqa.suse.de/t3860458
Created job #3860459: sle-15-SP2-Online-x86_64-Build134.1-wicked_advanced_sut@64bit -> https://openqa.suse.de/t3860459

This looks a bit weird: because I use --parental-inheritance but also want to tailor the name with TEST=…, both the "ref" and the "sut" job end up with the same name, and the build overview shows only the "sut" job as the "later" one within the build:

build: https://openqa.suse.de/tests/overview?build=poo59858&distri=sle&version=15-SP2

Many jobs are incomplete with Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 147. It seems os-autoinst has a bug that we have fixed in current os-autoinst master. rpm -q --changelog os-autoinst shows:

* Mon Jan 20 2020 okurz@suse.com
- Update to version 4.6.1579522398.2afcda12:

Doing zypper dup. Updated os-autoinst, restarted openqa-worker@{1..16} and retriggered incomplete tests.
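
Roughly, on the worker, something like the following (the exact package set pulled in by the dup is whatever the repositories currently provide):

# update the worker, including the fixed os-autoinst, then restart all 16 worker instances
sudo zypper dup
sudo systemctl restart openqa-worker@{1..16}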

Actions #10

Updated by okurz about 4 years ago

  • Status changed from Feedback to Blocked

No luck, e.g. https://openqa.suse.de/tests/3860547/file/autoinst-log.txt shows Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 128. Blocked on #62108.

Actions #11

Updated by cfconrad about 4 years ago

  • Blocked by action #62108: auto_review:"Can.t fcntl.*Operation not permitted at .*virtio_terminal.pm" test incompletes added
Actions #12

Updated by okurz about 4 years ago

  • Status changed from Blocked to Feedback

A new change by cfconrad, https://github.com/os-autoinst/os-autoinst/pull/1348, has been merged and deployed on openqaworker13, trying again:

for i in {001..100}; do openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de 3867902 _GROUP=0 BUILD=poo59858:2 WORKER_CLASS=openqaworker13 TEST=wicked_advanced_sut_poo59858_$i; done

-> https://openqa.suse.de/tests/overview?distri=sle&build=poo59858%3A2&version=15-SP2

EDIT: Hm, this did not yield any I/O errors. Maybe I "fixed" the NVMe by recreating the filesystem and using it as shared storage for cache+pool or something?
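
For completeness, a rough sketch of what "recreating the filesystem and using it as shared storage for cache+pool" means here; the filesystem type, device and mount point are assumptions, the usual layout being a single NVMe mounted under /var/lib/openqa with the cache and pool directories on top of it:

# WARNING: destroys all data on the device; sketch only
sudo mkfs.ext4 /dev/nvme0n1
sudo mount /dev/nvme0n1 /var/lib/openqa
sudo mkdir -p /var/lib/openqa/cache /var/lib/openqa/pool
# ownership as expected by the openQA worker packages (user name is an assumption)
sudo chown -R _openqa-worker /var/lib/openqa/cache /var/lib/openqa/pool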

EDIT: 2020-02-13: Still no new NVMe errors with sudo dmesg -T | grep 'nvme.*Abort' | tail | uniq. I have re-enabled the default worker class for this worker.

Actions #13

Updated by okurz about 4 years ago

  • Status changed from Feedback to Resolved

gschlotter informed me that he has changed the NVMe on openqaworker13. I assume the machine always had only a single NVMe as he could not find a second one. I accepted the salt key for the machine again, applied the current worker highstate and rebooted. Jobs should be picked up as usual after the reboot. As the previous NVMe installation also did not show any problems anymore after I recreated the filesystem to use it as shared storage for cache+pool, I suggest not declaring the previous NVMe device faulty (yet). I have informed gschlotter.
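
The salt-side steps boil down to something like the following, run on the salt master (minion id assumed to be the FQDN):

# re-accept the minion key, apply the worker highstate, reboot
sudo salt-key -y -a openqaworker13.suse.de
sudo salt 'openqaworker13*' state.apply
sudo salt 'openqaworker13*' system.reboot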

I have monitored for 1h now and jobs are being executed fine so far.
