action #103736

closed

Make aarch64 machine chan-1 and chow up and running after it is broken size:M

Added by waynechen55 over 2 years ago. Updated 11 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2021-12-09
Due date:
2023-04-30
% Done:

0%

Estimated time:

Description

Motivation

Jira service desk tickets tracking this: https://sd.suse.com/servicedesk/customer/portal/1/SD-69653 and https://sd.suse.com/servicedesk/customer/portal/1/SD-70018

We will follow the tickets and also use automation runs to verify the status of chan-1.

Acceptance criteria

  • AC1: Machine is usable again in SRV2 (or reimbursed)

Suggestions

  • Get access to the invoice of the machine, e.g. contact people from above SD tickets
  • Contact the hardware vendor by phone to clarify how to continue, as they don't seem to react to tickets
  • Get the machine replaced by vendor
  • Have new machine put back into Nbg Maxtorhof SRV2 and provide remote control options in the ticket

Further details

Current machine details: https://racktables.nue.suse.com/index.php?page=object&object_id=13554

Actions #1

Updated by waynechen55 over 2 years ago

  • Assignee set to waynechen55
Actions #2

Updated by waynechen55 over 2 years ago

  • Target version changed from QE-VT Sprint 86 to QE-VT Sprint 87
Actions #4

Updated by waynechen55 about 2 years ago

Suggest infra guys to contact vendor for further support.

Actions #5

Updated by waynechen55 about 2 years ago

  • Target version changed from QE-VT Sprint 87 to QE-VT Sprint 88
Actions #6

Updated by waynechen55 about 2 years ago

This issue has associated bugzilla
https://bugzilla.suse.com/show_bug.cgi?id=1194105

Actions #7

Updated by waynechen55 about 2 years ago

  • Status changed from In Progress to Blocked
  • Target version changed from QE-VT Sprint 88 to QE-VT Sprint 89

No further update.

Actions #8

Updated by waynechen55 about 2 years ago

  • Target version changed from QE-VT Sprint 89 to QE-VT Sprint 90
Actions #9

Updated by waynechen55 about 2 years ago

  • Target version changed from QE-VT Sprint 90 to QE-VT Sprint 91
Actions #10

Updated by waynechen55 about 2 years ago

  • Target version changed from QE-VT Sprint 91 to QE-VT Sprint 92
Actions #11

Updated by waynechen55 about 2 years ago

  • Target version changed from QE-VT Sprint 92 to QE-VT Sprint 93
Actions #12

Updated by waynechen55 almost 2 years ago

  • Target version changed from QE-VT Sprint 93 to QE-VT Sprint 94
Actions #13

Updated by waynechen55 almost 2 years ago

  • Target version changed from QE-VT Sprint 94 to QE-VT Sprint 95
Actions #14

Updated by waynechen55 almost 2 years ago

  • Target version changed from QE-VT Sprint 95 to QE-VT Sprint 96
Actions #15

Updated by waynechen55 almost 2 years ago

  • Target version changed from QE-VT Sprint 96 to QE-VT Sprint 97
Actions #16

Updated by nicksinger almost 2 years ago

  • Tags set to next-office-day
  • Project changed from 204 to openQA Infrastructure
  • Assignee changed from waynechen55 to nicksinger
  • Target version deleted (QE-VT Sprint 97)
Actions #17

Updated by nicksinger almost 2 years ago

  • Description updated (diff)

@waynechen55 since I'm mainly working on the debugging and communication with the vendor I take this ticket over into our backlog. Hope you're fine with this

Actions #18

Updated by okurz almost 2 years ago

  • Status changed from Blocked to New
  • Priority changed from Normal to Low
  • Target version set to Ready

the ticket was blocked on https://bugzilla.suse.com/show_bug.cgi?id=1194105 which is VERIFIED FIXED so we can move (back) to "New" and clarify what needs to be done next.

@nicksinger https://racktables.nue.suse.com/?page=search&last_page=index&last_tab=default&q=chan-1 does not resolve anything. What is this machine?

Actions #19

Updated by nicksinger almost 2 years ago

okurz wrote:

the ticket was blocked on https://bugzilla.suse.com/show_bug.cgi?id=1194105 which is VERIFIED FIXED so we can move (back) to "New" and clarify what needs to be done next.

@nicksinger https://racktables.nue.suse.com/?page=search&last_page=index&last_tab=default&q=chan-1 does not resolve anything. What is this machine?

I previously updated the description to include the most recent and relevant ticket. It also describes what was done and what we're waiting for (feedback from the vendor). The machine in question is https://racktables.nue.suse.com/index.php?page=object&object_id=13554

Actions #20

Updated by okurz almost 2 years ago

nicksinger and I checked the machine together. Using ssh qanet14nue.qa.suse.de and disconnecting and reconnecting the BMC LAN cable, we verified that the machine's BMC is connected to gi17 on qanet14nue.qa.suse.de. show mac address-table interface GigabitEthernet 17 showed the MAC address to be e0:d5:5e:a7:e8:34. ssh qanet and tail -n 10000 /var/log/dhcpd.log | grep -i e0:d5:5e:a7:e8:34 revealed the BMC's IPv4 address, 10.162.3.68.

Via a browser we could connect to the BMC, which reported the machine to be generally OK. The presence of two CPUs was reported. Surprisingly, "Hardware > Memory" reported DIMM slots as occupied even though we had removed all DIMMs. Maybe the BMC only learns this from at least one successful POST of the system. We then tried all the different memory configurations. In no case could we hear any tones from the on-board buzzer, and nothing showed on the VGA output. The BMC seems to be fully operational. We unscrewed the heatsink on CPU1 (secondary) again and could confirm that the CPU is soldered into the socket and cannot be unplugged. Support by the manufacturer is advised.
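The tail/grep lookup in the DHCP log can also be scripted. The following is only a minimal sketch: it assumes the log uses the usual ISC dhcpd line shape "DHCPACK on <ip> to <mac> ... via <iface>", and the function name, timestamps, host name and interface in the sample lines are made up for illustration. Only the MAC and IP values are taken from this comment.

```python
import re

# Assumption: ISC dhcpd style lines such as
#   "... dhcpd: DHCPACK on 10.162.3.68 to e0:d5:5e:a7:e8:34 via eth0"
ACK_RE = re.compile(r"DHCPACK on (\d{1,3}(?:\.\d{1,3}){3}) to ([0-9a-fA-F:]{17})")


def latest_lease_ip(log_lines, mac):
    """Return the IP of the last DHCPACK seen for `mac`, or None if there is none."""
    wanted = mac.lower()
    found = None
    for line in log_lines:
        match = ACK_RE.search(line)
        if match and match.group(2).lower() == wanted:
            found = match.group(1)  # later lines override earlier ones
    return found


# Illustrative sample lines (not copied from the real log).
sample = [
    "Mar  1 10:00:01 qanet dhcpd: DHCPDISCOVER from e0:d5:5e:a7:e8:34 via eth0",
    "Mar  1 10:00:02 qanet dhcpd: DHCPACK on 10.162.3.68 to e0:d5:5e:a7:e8:34 via eth0",
]
print(latest_lease_ip(sample, "E0:D5:5E:A7:E8:34"))
```

On the real host one would pass the lines of /var/log/dhcpd.log instead of the sample list. The MAC comparison is case-insensitive, like grep -i above.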

Actions #21

Updated by okurz over 1 year ago

  • Tags deleted (next-office-day)
Actions #22

Updated by okurz over 1 year ago

  • Priority changed from Low to High

setting to "High" as blocking #111578. nicksinger to ask for fuze VoIP account to contact hardware supplier.

Actions #23

Updated by livdywan over 1 year ago

  • Subject changed from Make aarch64 machine chan-1 up and running after it is broken to Make aarch64 machine chan-1 up and running after it is broken size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #24

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #26

Updated by okurz over 1 year ago

I asked for further information in racktables by email, see email on osd-admins@suse.de

Actions #27

Updated by nanzhang over 1 year ago

Server information from invoice:
PH-R150-T62
Phoenics 1U 8-bay Server
S/N :GIG3N7612A0015
A/T :SU10468

Detailed specifications from order:
Motherboard
System on Chip (SoC)
Cavium® ThunderX™ ARM processor
64bit ARMv8 architecture, BGA 2601, 28nm
8 DIMM Slots (max. 1024GB DDR4)
1 x 40GbE QSFP+ LAN Port and
4 x 10GbE SFP+ LAN Ports
4x SATA3 (6Gb/s) Ports (RAID 0,1,5,10)

PCI Slot:
1x PCI-E 3.0 x8 Slot (FHHL)
Integrated IPMI 2.0 with Dedicated LAN
Integrated Aspeed AST2400 BMC
2x USB 3.0

CPU
1 x Cavium ARM ThunderX
48 cores per CPU, 2.0 GHz

RAM
4 x Micron MTA9ASF1G72PZ-2G9
= 64 GB, 8 x 8 GB

Disk
2 x Intel S4610 SSDSC2KG480G801
480 GB, SSD, 3,42 DWPD

Official website for this model:
https://www.gigabyte.com/Enterprise/ARM-Server/R150-T62-rev-100

Actions #28

Updated by nicksinger over 1 year ago

  • Status changed from Workable to Blocked

I requested a FUZE account in https://sd.suse.com/servicedesk/customer/portal/1/SD-91608. Unfortunately this requires quite an approval chain, so let's see how long it will take. Blocked by SD-91608.

Actions #29

Updated by okurz over 1 year ago

  • Tags set to reactive work
Actions #30

Updated by waynechen55 over 1 year ago

Actually both chan-1 and chow-1 are down now in NUE. Probably due to the same issue. I already added this information in SD-70018 months ago. So I think @nsinger is already aware of this.

Actions #31

Updated by okurz over 1 year ago

The ticket came up in our "SLO high" query as it was not updated within 30 days. It is still important to handle one or both machines, but tasks like the labs move are more important for now.

Actions #32

Updated by okurz over 1 year ago

  • Priority changed from High to Normal
Actions #33

Updated by nicksinger over 1 year ago

  • Status changed from Blocked to Feedback

I gave codec a call and they unfortunately just told me to write yet another e-mail. I just mailed the more general info@codec.ie asking for details on the process of sending back the machine as any attempts to further debug the system didn't yield any reasonable response from them.

Actions #34

Updated by okurz over 1 year ago

  • Status changed from Feedback to Workable

codec responded so you need to ping gigabyte directly

Actions #35

Updated by waynechen55 over 1 year ago

okurz wrote:

codec responded so you need to ping gigabyte directly

It is pleasing to hear this good news. Could you share more info if there is any?

And, by the way, the ticket https://sd.suse.com/servicedesk/customer/portal/1/SD-70018 has been closed because there is already an internal discussion in the meantime. I think it points to the discussion here within QA.

Actions #36

Updated by nicksinger over 1 year ago

  • Status changed from Workable to Blocked

I've created an account on the gigabyte "eSupport" website and raised our problems there and mentioned that codec cannot provide us help.
The link to the ticket is https://esupport.gigabyte.com/Default/#id:1374724 but most likely you cannot see this without being logged in. I will keep it monitored.

Actions #37

Updated by waynechen55 over 1 year ago

nicksinger wrote:

I've created an account on the gigabyte "eSupport" website and raised our problems there and mentioned that codec cannot provide us help.
The link to the ticket is https://esupport.gigabyte.com/Default/#id:1374724 but most likely you cannot see this without being logged in. I will keep it monitored.

I cannot see it even when logged in. "No record".

Actions #38

Updated by okurz over 1 year ago

@nicksinger, as discussed, just send the hardware to "them" – I don't know who "them" is, you decide :)

Actions #39

Updated by nicksinger over 1 year ago

  • Status changed from Blocked to Workable

Gigabyte rejected my request and told us to send back to the vendor. I'm not sure yet what to do but I think we can continue.

Actions #40

Updated by okurz over 1 year ago

  • Due date set to 2022-12-16
  • Priority changed from Normal to Urgent

nsinger talked to "the IT office" in Prg office. They often have contacts with codec and good experiences. nsinger will create an SD ticket to the ladies from IT and they should be able to help us, e.g. where to send the server

Actions #41

Updated by nicksinger over 1 year ago

  • Status changed from Workable to Blocked
Actions #42

Updated by okurz over 1 year ago

  • Due date deleted (2022-12-16)
  • Priority changed from Urgent to Normal

That's great, let's see

Actions #43

Updated by okurz about 1 year ago

From https://sd.suse.com/servicedesk/customer/portal/1/SD-106628
Martina Markova 2023-01-26 16:03 wrote

Follow up email sent to Nick.

Actions #44

Updated by okurz about 1 year ago

  • Status changed from Blocked to In Progress

Nick Singer needs to answer by email to a request.

Actions #45

Updated by openqa_review about 1 year ago

  • Due date set to 2023-02-22

Setting due date based on mean cycle time of SUSE QE Tools

Actions #46

Updated by nicksinger about 1 year ago

  • Status changed from In Progress to Blocked

After the escalation, CODEC requested some of the already sent information again. I collected old descriptions of what we already did and sent them, with Martina (SUSE IT), Aine (CODEC) and osd-admins on CC. I already got a swift reply that my contact at CODEC moved positions, but GIGABYTE was contacted nevertheless:

Hi Nick,

Aisling has recently moved roles in the company, and I am now looking after the SUSE account.

Thanks for reaching out and advising on the below.

I will contact the supplier of this machine and hope we can reach a solution.

I will keep you updated.

Kind regards,

Aine
Actions #47

Updated by livdywan about 1 year ago

  • Due date changed from 2023-02-22 to 2023-03-03

Conversation is continuing in email. Nick is likely able to be in the office next week. Bumping the due date accordingly.

Actions #48

Updated by nicksinger about 1 year ago

Today I plugged my notebook, running a DHCP server, directly into the BMC and was able to access the web interface as well as SOL. However, the machine still does not boot, reports very little over SOL and shows no output on the connected VGA screen. I sent this information to codec by mail (visible on the osd-admins ML) today.

Actions #49

Updated by okurz about 1 year ago

  • Due date changed from 2023-03-03 to 2023-03-10

nicksinger forwarded a message and received a first-level response, pending second-level response from Gigabyte

Actions #50

Updated by okurz about 1 year ago

I mounted the machine in NUE-FC-B:5, connected to PDU outlet A9 and mgmt ethernet on switch port 22. As we don't know the MAC address, because the information in racktables for the machine was incomplete so far, I created https://sd.suse.com/servicedesk/customer/portal/1/SD-113959 for Eng-Infra to provide us log or admin access to walter1.qe.nue2.suse.org

As a workaround, I ran nmap -p 443 10.168.192.0/22 with the chan mgmt patch cable not plugged in, then again with it plugged in, and diffed the two outputs. Result:

+Nmap scan report for 10.168.193.122
+Host is up (0.00051s latency).
+
+PORT    STATE SERVICE
+443/tcp open  https
+MAC Address: E0:D5:5E:A7:E8:34 (Giga-byte Technology)
+
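The before/after comparison of the two scans can be sketched in a few lines. This is a rough illustration rather than the procedure actually used: it assumes plain (default) nmap text output with "Nmap scan report for ..." and "MAC Address: ..." lines as shown above, and the function names are hypothetical.

```python
REPORT_PREFIX = "Nmap scan report for "
MAC_PREFIX = "MAC Address: "


def hosts_with_mac(nmap_output):
    """Map each reported host to the text of its MAC line (None if nmap printed none)."""
    hosts = {}
    current = None
    for raw in nmap_output.splitlines():
        line = raw.strip()
        if line.startswith(REPORT_PREFIX):
            current = line[len(REPORT_PREFIX):]
            hosts[current] = None
        elif line.startswith(MAC_PREFIX) and current is not None:
            hosts[current] = line[len(MAC_PREFIX):]
    return hosts


def new_hosts(before, after):
    """Return the hosts that appear only in the second scan."""
    old = hosts_with_mac(before)
    return {h: mac for h, mac in hosts_with_mac(after).items() if h not in old}


# Second scan: the block from the diff above, without the leading "+".
after_scan = """Nmap scan report for 10.168.193.122
Host is up (0.00051s latency).

PORT    STATE SERVICE
443/tcp open  https
MAC Address: E0:D5:5E:A7:E8:34 (Giga-byte Technology)
"""
print(new_hosts("", after_scan))
```

Comparing by host rather than diffing raw text avoids noise from lines that differ between runs, such as latency values.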
Actions #51

Updated by okurz about 1 year ago

We attempted a firmware upgrade as suggested. This succeeded but no change in behaviour in the boot process. Ongoing email conversation, CC'ed to osd-admins@suse.de. Latest message at time of writing https://mailman.suse.de/mlarch/SuSE/osd-admins/2023/osd-admins.2023.03/msg00043.html

Actions #52

Updated by okurz about 1 year ago

  • Subject changed from Make aarch64 machine chan-1 up and running after it is broken size:M to Make aarch64 machine chan-1 and chow up and running after it is broken size:M

From email:


From: qa-apac2 on behalf of Oliver Kurz okurz@suse.de
Sent: Saturday, March 4, 2023 5:19 AM
To: qa-apac2 qa-apac2@suse.de
Subject: How is machine "chow" used? Is racktables correct?

Hi, I found the machine chow.qa.suse.de under
https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=13550
but neither

ping -c1 chow-sp.qa.suse.de

nor

ping -c1 chow-1.qa.suse.de

responded. Is the information in racktables correct?

Both chan and chow are down due to the same or similar issue, also see #103736#note-30
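The same reachability check can be repeated for several host names at once. The snippet below is a minimal sketch under stated assumptions: the helper name and the injectable `run` parameter are hypothetical, and the demo uses a stand-in runner instead of sending real pings, so it does not reflect the actual state of any host.

```python
import subprocess


def unreachable(hosts, run=subprocess.run):
    """Return the hosts for which `ping -c1 <host>` exits with a non-zero status."""
    down = []
    for host in hosts:
        result = run(
            ["ping", "-c1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            down.append(host)
    return down


class FakeResult:
    """Stand-in for subprocess.CompletedProcess, used only in the demo."""

    def __init__(self, returncode):
        self.returncode = returncode


def fake_run(cmd, **kwargs):
    # Pretend that every host fails to answer, as in the email above.
    return FakeResult(1)


hosts = ["chow-sp.qa.suse.de", "chow-1.qa.suse.de"]
print(unreachable(hosts, run=fake_run))
```

Calling unreachable(hosts) without the run argument would invoke the real ping binary. A non-zero exit status covers both unresolvable names and hosts that do not answer.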

Actions #53

Updated by livdywan about 1 year ago

  • Due date changed from 2023-03-10 to 2023-03-17

okurz wrote:

nicksinger forwarded a message and received a first-level response, pending second-level response from Gigabyte

Still being discussed. I also raised this on Slack to clarify if they have the information they need from our side.

Actions #54

Updated by livdywan 12 months ago

  • Due date changed from 2023-03-17 to 2023-03-31

Quick update. Ordering of the replacement machine is on-going.

Actions #55

Updated by livdywan 12 months ago

  • Due date changed from 2023-03-31 to 2023-04-30

cdywan wrote:

Quick update. Ordering of the replacement machine is on-going.

Still pending feedback on getting the new machine.

Actions #56

Updated by okurz 11 months ago

The last email I saw was https://mailman.suse.de/mlarch/SuSE/osd-admins/2023/osd-admins.2023.03/msg00185.html on 21 Mar 2023 . No related message in https://mailman.suse.de/mlarch/SuSE/osd-admins/2023/osd-admins.2023.04/threads.html

@nicksinger please make sure that our relevant communication partners do the next step and then bump the due date accordingly.

Actions #57

Updated by nicksinger 11 months ago

  • Status changed from Blocked to Resolved

I wrote another e-mail today asking if they still need something from us. We don't expect a free replacement for the broken machines. If we get an answer now, we can consider reopening again.
