action #43874

HDD in kermit.qa.suse.de has problems in combination with SAS-slot

Added by xlai about 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
Start date: 2018-11-16
Due date:
% Done: 0%
Estimated time:

Description

OpenQA worker openqaworker2:23 relevant SUT, configuration:
"IPMI_HOSTNAME" : "sp.kermit.qa.suse.de",
"IPMI_PASSWORD" : "ADMIN",
"IPMI_USER" : "ADMIN",

mkfs and fdisk operations on this disk fail (on sle12sp3 and 15sp1 systems), showing errors when writing superblocks.

Please help to fix it.

SI0041665.pdf (326 KB) SI0041665.pdf sebchlad, 2018-11-22 10:25

History

#1 Updated by nicksinger about 4 years ago

  • Assignee changed from nicksinger to xlai

Please first make sure the machine is not used in production anymore by removing it from our salt configuration (you can also change the WORKER_CLASS to something else so no job gets assigned anymore):
https://gitlab.suse.de/openqa/salt-pillars-openqa/blob/master/openqa/workerconf.sls#L186

After you did this, I can start investigating. I'd run a badblocks test for the disk in question because the hardware reports no problem yet. Maybe it is also related to the controller or something else…
In any case, please expect this host to be down for a while.

#3 Updated by nicksinger about 4 years ago

  • Assignee changed from xlai to nicksinger

Thanks, I'll soon take a look at the machine and check what is broken and needs to be sent back - I hope it's only the HDD itself…

#4 Updated by nicksinger about 4 years ago

  • Status changed from New to Workable

#5 Updated by nicksinger about 4 years ago

So just to clarify, the broken disk in question has the UUID 3b2c3c5a-560e-487c-80ea-e77eb6ff1f04 and is roughly 600GB in size.
The problems were reproducible with a recent systemrescuecd. After swapping the slot of the disk, the IO-messages in dmesg went away. Also the write speed went back to the expected ~80MB/s.
I'll swap the disk back into the old slot after "dd"ing it once with data and do the same there. This should provide us with better insights if maybe the whole SAS controller is broken.
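
For reference, the write-speed check can be repeated without dd using a small script. This is only a rough sketch, not the procedure used above: the function name, the default sizes and the scratch-file target are assumptions. It writes zeros to a file on the filesystem under test and calls fsync so that the page cache does not hide a slow device.

```python
import os
import tempfile
import time


def measure_write_throughput(path, total_bytes=64 * 1024 * 1024, block_size=1024 * 1024):
    """Write total_bytes of zeros to path and return the rate in MB/s (10^6 bytes/s)."""
    block = bytes(block_size)
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())  # force the data to the device before stopping the clock
    elapsed = time.perf_counter() - start
    return written / elapsed / 1e6


if __name__ == "__main__":
    # Hypothetical target: a scratch file; point this at a mount backed by the disk under test.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        target = tmp.name
    try:
        print(f"{measure_write_throughput(target):.1f} MB/s")
    finally:
        os.remove(target)
```

A value far below the expected ~80MB/s on the suspect slot, together with IO-messages in dmesg, would match the behaviour described above.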

I've also just checked my records and I have no invoice for that machine at all. Could you please talk to the person who ordered this machine to get us that invoice? Otherwise we don't know where it was bought or whether it is still under warranty, and we can't ship anything back.

#6 Updated by nicksinger about 4 years ago

  • Status changed from Workable to Feedback
  • Assignee changed from nicksinger to xlai

Assigning back to clarify the invoice question. I'll continue to debug further in the meantime.

#7 Updated by nicksinger about 4 years ago

HDD is back in the old slot. Trying now with another disk in the "broken slot" to narrow it down to the controller.

#8 Updated by nicksinger about 4 years ago

Now I swapped a known, working SSD into the slot where the HDD produces errors. Both devices work fine in this configuration.
This would mean we've some funny combination where it only produces issues with "this specific HDD" in "this specific slot".
We've two options now:

  1. Send the whole machine back with an issue-description (invoice needed)
  2. Just take another slot for the HDD and monitor the machine further if the issue appears again (invoice not needed)

Up to you to decide now. I can, of course, help in packaging and shipping the hardware.
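
The swap tests can be written down as a small decision table. The sketch below is only an illustration of that reasoning, not a tool that was used here; the function name and the device and slot labels are made up. It takes the outcome of each tested (device, slot) pair and reports whether a failure points at the device, the slot, or only the combination.

```python
def classify_faults(results):
    """Classify each failing (device, slot) pair.

    results maps (device, slot) -> True if that pairing worked, False if it showed errors.
    Returns a dict mapping each failing pair to 'device', 'slot', 'combination'
    or 'undetermined'.
    """
    working = {combo for combo, ok in results.items() if ok}
    failing = {combo for combo, ok in results.items() if not ok}
    verdicts = {}
    for device, slot in failing:
        # Did the same device work in some other slot?
        device_ok_elsewhere = any(d == device and s != slot for d, s in working)
        # Did the same slot work with some other device?
        slot_ok_with_other = any(s == slot and d != device for d, s in working)
        if device_ok_elsewhere and slot_ok_with_other:
            verdicts[(device, slot)] = "combination"
        elif device_ok_elsewhere:
            verdicts[(device, slot)] = "slot"
        elif slot_ok_with_other:
            verdicts[(device, slot)] = "device"
        else:
            verdicts[(device, slot)] = "undetermined"
    return verdicts


if __name__ == "__main__":
    # Labels are hypothetical; the outcomes mirror the tests described in this ticket.
    observed = {
        ("hdd", "slot_a"): False,  # HDD in the original slot: IO errors
        ("hdd", "slot_b"): True,   # same HDD in another slot: fine
        ("ssd", "slot_a"): True,   # known-good SSD in the original slot: fine
    }
    print(classify_faults(observed))
```

With these three observations the only failing pair is classified as "combination", which is why neither the disk nor the slot alone can be blamed.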

#9 Updated by xlai about 4 years ago

nicksinger wrote:

Now I swapped a known, working SSD into the slot where the HDD produces errors. Both devices work fine in this configuration.
This would mean we've some funny combination where it only produces issues with "this specific HDD" in "this specific slot".
We've two options now:

  1. Send the whole machine back with an issue-description (invoice needed)
  2. Just take another slot for the HDD and monitor the machine further if the issue appears again (invoice not needed)

Up to you to decide now. I can, of course, help in packaging and shipping the hardware.

Thank you, Nick, for making the issue so clear. Now let me involve calen for a decision from our side.

#10 Updated by cachen about 4 years ago

Thank you, Nick, for helping with the debugging and the conclusion!

To be honest, I don't have a clear idea of the equipment repair process in Nuremberg, and I don't even know where to find the invoice (what I heard is that the vendor usually sends the invoice to the finance department). In China we don't need an invoice when calling for online/onsite service; we only give the vendor the server's serial number.
It's not easy for me to handle the issue remotely, so Ralf will help me ask around for the invoice, but I'm not sure when we will have a result.

Another suggestion from Ralf: how about creating a ticket for infra or MF-IT to ask for repair service? I'm just guessing that they may be able to contact the vendor and call for 24h/4h online/onsite service. If you already did this, please ignore this suggestion.

The server is definitely still under warranty and registered under Sebastian's name (the exact purchase date can be found in the asset system by Tag: MF16690 SN: E15749528301238).

Otherwise, if we can't find the invoice by the end of next week, we may need your help to go with option 2 or to pull out the HDD and use only the SSD.

Again, I appreciate your help.

#11 Updated by nicksinger about 4 years ago

cachen wrote:

Thank you, Nick, for helping with the debugging and the conclusion!

To be honest, I don't have a clear idea of the equipment repair process in Nuremberg, and I don't even know where to find the invoice (what I heard is that the vendor usually sends the invoice to the finance department). In China we don't need an invoice when calling for online/onsite service; we only give the vendor the server's serial number.

I don't know if this would work with our vendors. However, for now we don't even know the exact vendor (just that it is a Supermicro product).

It's not easy for me to handle the issue remotely, so Ralf will help me ask around for the invoice, but I'm not sure when we will have a result.

Another suggestion from Ralf: how about creating a ticket for infra or MF-IT to ask for repair service? I'm just guessing that they may be able to contact the vendor and call for 24h/4h online/onsite service. If you already did this, please ignore this suggestion.

Infra will close the ticket since they can't do anything without knowing the vendor. MF-IT might do the same but it could be worth a shot.
But after all I'm pretty sure this case will not be handled by on-site service and the server needs to be shipped to a repair service.

The server is definitely still under warranty and registered under Sebastian's name (the exact purchase date can be found in the asset system by Tag: MF16690 SN: E15749528301238).

Otherwise, if we can't find the invoice by the end of next week, we may need your help to go with option 2 or to pull out the HDD and use only the SSD.

IMHO no need to completely remove the HDD just yet. It still works in other slots and the server has plenty of them.

Again, I appreciate your help.

You're welcome, really nothing special.

#12 Updated by cachen about 4 years ago

Just found the PO number: 4100021076, RIO number: REQ_053177. Trying to find the invoice...

#13 Updated by sebchlad about 4 years ago

Seems I located the invoice thanks to Natalia from the Finance department.

#14 Updated by sebchlad about 4 years ago

Please see the attached invoice for SN: E15749528301238

#15 Updated by cachen about 4 years ago

@Sebastian, I just saw the update here, that's awesome, thanks a lot!

@Nick, I have a question for you now: is it possible to estimate how long it will take to repair this server (including round-trip shipping and the repair)? That is to say, is it possible to get it back before 15SP1 Beta1 (Dec 12)?

If we can't get it back soon, that will be a challenge for 15SP1 testing. So my other question is: based on your experience, how serious do you think this issue is? Will it become worse and worse? If it's not that serious so far, we can keep using the server via option 2, since it is still new and has a 3-year warranty period. Your suggestion?

#16 Updated by cachen about 4 years ago

  • Assignee changed from xlai to nicksinger

Reassigning to Nick for a suggestion!

#17 Updated by nicksinger about 4 years ago

cachen wrote:

@Sebastian, I just saw the update here, that's awesome, thanks a lot!

Indeed, hero of the day - thanks a lot :)

cachen wrote:

@Nick, I have a question for you now: is it possible to estimate how long it will take to repair this server (including round-trip shipping and the repair)? That is to say, is it possible to get it back before 15SP1 Beta1 (Dec 12)?

If we can't get it back soon, that will be a challenge for 15SP1 testing. So my other question is: based on your experience, how serious do you think this issue is? Will it become worse and worse? If it's not that serious so far, we can keep using the server via option 2, since it is still new and has a 3-year warranty period. Your suggestion?

If the testing for Beta1 seems at risk, I'd definitely postpone the repair. I still need to figure out the details of the process, which I could do in these two weeks.
Unfortunately I have zero experience with such a weird issue-combination, so I really can't give any guarantee. But given the fact that this HDD works fine in other bays, I'd go that route until after Beta1 and ship the server at a quieter time.

#18 Updated by cachen about 4 years ago

@Nick, thanks for the suggestion! If we can't get it back within 2 weeks, it will definitely put Beta1 at risk, and it will actually be very hard to find a quiet time before the 15sp1 project GMC. The good thing is that the server isn't fully broken, so let's take your option 2 to get it back running for openQA and see whether any other issue occurs. Take your time working out the details; the mgmt team also needs more time to figure out the formal process for HW after-sales service. Thanks a lot for your great support!

#19 Updated by nicksinger about 4 years ago

Alright, so let us bring the worker back for now. As a reference, this is how the disk-setup looks now (on my recovery system, names might change):

NAME   MAJ:MIN RM   SIZE RO TYPE UUID
loop0    7:0    0 493.1M  1 loop
sdb      8:16   0 223.6G  0 disk
├─sdb1   8:17   0     2G  0 part 1f931c2d-0c95-4d54-8b86-b015c6f12df9
├─sdb2   8:18   0    77G  0 part 180b5c68-4902-4bd3-bf0e-8a06dce93b52
├─sdb3   8:19   0  59.9G  0 part d44ba653-e081-4c4b-b40b-5606084e2b59
└─sdb4   8:20   0  84.7G  0 part b3acbedc-d45a-4ade-85c0-e8de594188c2
sdc      8:32   0 223.6G  0 disk c36bc5c1-8a4b-4f7d-b715-6d70c06ff7b2
sdd      8:48   0 223.6G  0 disk
├─sdd1   8:49   0     8M  0 part
├─sdd2   8:50   0    40G  0 part ad3edc2c-b4d5-4ebf-8531-8e02bf9b397b
├─sdd3   8:51   0 120.8G  0 part 515bc23f-6d1d-4e39-9f3c-b34fc7d27356
└─sdd4   8:52   0  62.8G  0 part 65522a7a-6c62-45f6-922d-b2d61082a9f7
sdf      8:80   0 558.9G  0 disk
└─sdf1   8:81   0 558.9G  0 part
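
Since the kernel names might change between boots, it can help to look devices up by UUID rather than by name. The following is a minimal sketch, assuming the JSON form of the listing as produced by `lsblk --json -o NAME,UUID`; the helper name `find_by_uuid` is made up, and the sample data is a shortened copy of the listing above.

```python
import json


def find_by_uuid(lsblk_json, uuid):
    """Return the name of the block device carrying uuid, or None if absent.

    lsblk_json is the text printed by `lsblk --json -o NAME,UUID`.
    """
    def walk(devices):
        for dev in devices:
            if dev.get("uuid") == uuid:
                return dev["name"]
            # Partitions are nested under their parent disk in "children".
            found = walk(dev.get("children") or [])
            if found is not None:
                return found
        return None

    return walk(json.loads(lsblk_json).get("blockdevices", []))


if __name__ == "__main__":
    # Shortened sample built from the listing above.
    sample = json.dumps({
        "blockdevices": [
            {"name": "sdb", "uuid": None, "children": [
                {"name": "sdb1", "uuid": "1f931c2d-0c95-4d54-8b86-b015c6f12df9"},
            ]},
            {"name": "sdc", "uuid": "c36bc5c1-8a4b-4f7d-b715-6d70c06ff7b2"},
        ]
    })
    print(find_by_uuid(sample, "c36bc5c1-8a4b-4f7d-b715-6d70c06ff7b2"))
```

On a live system the same function could be fed the output of `subprocess.run(["lsblk", "--json", "-o", "NAME,UUID"], capture_output=True, text=True).stdout`.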

#20 Updated by nicksinger about 4 years ago

  • Subject changed from openqaworker2:23 related SUT kermit.qa.suse.de /dev/sdd has HW problem to format. to HDD in kermit.qa.suse.de has problems in combination with SAS-slot

https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/144 enables the worker to be back in production again and it already takes jobs again: https://openqa.suse.de/admin/workers/1089
I'll collect some experience with shipping back hardware even though I doubt that we will ever manage to repair this machine within 2 weeks. It was bought in Ireland and I guess would therefore need non-domestic shipping. But I'll try to get an expected repair time from the vendor if possible.
In the meantime I'd kindly ask you to keep an eye on that machine and if it happens again we need to rethink our solution (therefore leaving the ticket on feedback and on me)…

#21 Updated by xlai about 4 years ago

nicksinger wrote:

https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/144 enables the worker to be back in production again and it already takes jobs again: https://openqa.suse.de/admin/workers/1089
I'll collect some experience with shipping back hardware even though I doubt that we will ever manage to repair this machine within 2 weeks. It was bought in Ireland and I guess would therefore need non-domestic shipping. But I'll try to get an expected repair time from the vendor if possible.
In the meantime I'd kindly ask you to keep an eye on that machine and if it happens again we need to rethink our solution (therefore leaving the ticket on feedback and on me)…

Sure, I will keep an eye on the machine and update info if HW goes wrong again. Thank you very much for the help and effort!

#22 Updated by nicksinger almost 4 years ago

  • Status changed from Feedback to Resolved

Let us close this ticket here and, in case of a new problem, open a new ticket referencing this one.
