action #103736
Make aarch64 machine chan-1 and chow up and running after it is broken size:M
0%
Description
Motivation
Jira service desk tickets tracking this: https://sd.suse.com/servicedesk/customer/portal/1/SD-69653 and https://sd.suse.com/servicedesk/customer/portal/1/SD-70018
We will follow the tickets and also use automation runs to verify the status of chan-1.
Acceptance criteria
- AC1: Machine is usable again in SRV2 (or reimbursed)
Suggestions
- Get access to the invoice of the machine, e.g. contact people from above SD tickets
- Contact the hardware vendor by phone to ask how to continue, as they do not seem to react to tickets
- Get the machine replaced by vendor
- Have new machine put back into Nbg Maxtorhof SRV2 and provide remote control options in the ticket
Further details
Current machine details: https://racktables.nue.suse.com/index.php?page=object&object_id=13554
History
#1
Updated by waynechen55 over 1 year ago
- Assignee set to waynechen55
#2
Updated by waynechen55 over 1 year ago
- Target version changed from QE-VT Sprint 86 to QE-VT Sprint 87
#3
Updated by waynechen55 over 1 year ago
chan-1 is not recovered yet.
Ticket https://sd.suse.com/servicedesk/customer/portal/1/SD-70018
The corresponding worker is disabled
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/374
#4
Updated by waynechen55 over 1 year ago
I suggest that the infra team contact the vendor for further support.
#5
Updated by waynechen55 over 1 year ago
- Target version changed from QE-VT Sprint 87 to QE-VT Sprint 88
#6
Updated by waynechen55 over 1 year ago
This issue has associated bugzilla
https://bugzilla.suse.com/show_bug.cgi?id=1194105
#7
Updated by waynechen55 over 1 year ago
- Status changed from In Progress to Blocked
- Target version changed from QE-VT Sprint 88 to QE-VT Sprint 89
No further update.
#8
Updated by waynechen55 over 1 year ago
- Target version changed from QE-VT Sprint 89 to QE-VT Sprint 90
#9
Updated by waynechen55 over 1 year ago
- Target version changed from QE-VT Sprint 90 to QE-VT Sprint 91
#10
Updated by waynechen55 over 1 year ago
- Target version changed from QE-VT Sprint 91 to QE-VT Sprint 92
#11
Updated by waynechen55 about 1 year ago
- Target version changed from QE-VT Sprint 92 to QE-VT Sprint 93
#12
Updated by waynechen55 about 1 year ago
- Target version changed from QE-VT Sprint 93 to QE-VT Sprint 94
#13
Updated by waynechen55 about 1 year ago
- Target version changed from QE-VT Sprint 94 to QE-VT Sprint 95
#14
Updated by waynechen55 about 1 year ago
- Target version changed from QE-VT Sprint 95 to QE-VT Sprint 96
#15
Updated by waynechen55 about 1 year ago
- Target version changed from QE-VT Sprint 96 to QE-VT Sprint 97
#16
Updated by nicksinger about 1 year ago
- Tags set to next-office-day
- Project changed from QE-Virtualization to openQA Infrastructure
- Assignee changed from waynechen55 to nicksinger
- Target version deleted (QE-VT Sprint 97)
#17
Updated by nicksinger about 1 year ago
- Description updated (diff)
waynechen55, since I'm mainly working on the debugging and the communication with the vendor, I'm taking this ticket over into our backlog. I hope you're fine with this.
#18
Updated by okurz about 1 year ago
- Status changed from Blocked to New
- Priority changed from Normal to Low
- Target version set to Ready
the ticket was blocked on https://bugzilla.suse.com/show_bug.cgi?id=1194105 which is VERIFIED FIXED so we can move (back) to "New" and clarify what needs to be done next.
nicksinger https://racktables.nue.suse.com/?page=search&last_page=index&last_tab=default&q=chan-1 does not resolve anything. What is this machine?
#19
Updated by nicksinger about 1 year ago
okurz wrote:
the ticket was blocked on https://bugzilla.suse.com/show_bug.cgi?id=1194105 which is VERIFIED FIXED so we can move (back) to "New" and clarify what needs to be done next.
nicksinger https://racktables.nue.suse.com/?page=search&last_page=index&last_tab=default&q=chan-1 does not resolve anything. What is this machine?
I previously updated the description to include the most recent and relevant ticket. It also describes what was done and what we are waiting for (feedback from the vendor). The machine in question is https://racktables.nue.suse.com/index.php?page=object&object_id=13554
#20
Updated by okurz about 1 year ago
nicksinger and I checked the machine together. Using "ssh qanet14nue.qa.suse.de" while disconnecting and reconnecting the BMC LAN cable, we could verify that the machine's BMC is connected to gi17 on qanet14nue.qa.suse.de. On the switch, "show mac address-table interface GigabitEthernet 17" showed the MAC address to be e0:d5:5e:a7:e8:34. Then "ssh qanet" and "tail -n 10000 /var/log/dhcpd.log | grep -i e0:d5:5e:a7:e8:34" revealed the IPv4 address of the BMC, 10.162.3.68.
Over a browser we could connect to the BMC and found the machine reported as generally OK. The presence of two CPUs was reported. Surprisingly, "Hardware > Memory" reported DIMM slots as occupied even though we had removed all DIMMs. Maybe the BMC needs at least one successful POST of the system to obtain such information. We then checked all different kinds of memory configurations. In no case could we hear any tones from the on-board buzzer, and nothing showed on the VGA output. The BMC seems to be fully operational. We unscrewed the heatsink on CPU1 (secondary) again and could confirm that the CPU is soldered into the socket and not unpluggable. Support by the manufacturer is advised.
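The MAC-to-IP lookup above (tail and grep on the DHCP log) can also be sketched as a small script. This is only an illustration: it assumes the log contains ISC-dhcpd-style "DHCPACK on IP to MAC" entries, and the sample log lines and the helper name latest_ip_for_mac are made up for the example.

```python
import re

# Assumption: the DHCP server logs acknowledgements in the ISC dhcpd style,
# e.g. "... dhcpd: DHCPACK on 10.162.3.68 to e0:d5:5e:a7:e8:34 via eth0".
ACK_RE = re.compile(
    r"DHCPACK on (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"to (?P<mac>[0-9a-f]{2}(?::[0-9a-f]{2}){5})",
    re.IGNORECASE,
)


def latest_ip_for_mac(log_lines, mac):
    """Return the IP of the last DHCPACK logged for `mac`, or None if absent."""
    wanted = mac.lower()
    found = None
    for line in log_lines:
        match = ACK_RE.search(line)
        if match and match.group("mac").lower() == wanted:
            found = match.group("ip")  # later entries override earlier ones
    return found


if __name__ == "__main__":
    # Hypothetical log excerpt; timestamps, interface and the second host are invented.
    sample = [
        "Mar  1 10:00:00 qanet dhcpd: DHCPACK on 10.162.3.10 to 00:11:22:33:44:55 via eth0",
        "Mar  1 10:00:05 qanet dhcpd: DHCPACK on 10.162.3.68 to e0:d5:5e:a7:e8:34 via eth0",
    ]
    print(latest_ip_for_mac(sample, "E0:D5:5E:A7:E8:34"))
```

On a real system the lines would come from reading /var/log/dhcpd.log instead of the sample list.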
#26
Updated by okurz 12 months ago
I asked for further information in racktables by email, see email on osd-admins@suse.de
#27
Updated by nanzhang 12 months ago
Server information from invoice:
PH-R150-T62
Phoenics 1U 8-bay Server
S/N :GIG3N7612A0015
A/T :SU10468
Detailed specifications from order:
Motherboard
System on Chip (SoC)
Cavium® ThunderX™ ARM processor
64bit ARMv8 architecture, BGA 2601, 28nm
8 DIMM Slots (max. 1024GB DDR4)
1 x 40GbE QSFP+ LAN Port and
4 x 10GbE SFP+ LAN Ports
4x SATA3 (6Gb/s) Ports (RAID 0,1,5,10)
PCI Slot:
1x PCI-E 3.0 x8 Slot (FHHL)
Integrated IPMI 2.0 with Dedicated LAN
Integrated Aspeed AST2400 BMC
2x USB 3.0
CPU
1 x Cavium ARM ThunderX
48 cores per CPU, 2.0 GHz
RAM
4 x Micron MTA9ASF1G72PZ-2G9
= 64 GB, 8 x 8 GB
Disk
2 x Intel S4610 SSDSC2KG480G801
480 GB, SSD, 3.42 DWPD
Official website for this model:
https://www.gigabyte.com/Enterprise/ARM-Server/R150-T62-rev-100
#28
Updated by nicksinger 11 months ago
- Status changed from Workable to Blocked
I requested a FUZE account in https://sd.suse.com/servicedesk/customer/portal/1/SD-91608 - unfortunately this requires quite a long approval chain, so let's see how long it will take. Blocked by SD-91608
#30
Updated by waynechen55 11 months ago
Actually both chan-1 and chow-1 are down now in NUE. Probably due to the same issue. I already added this information in SD-70018 months ago. So I think @nsinger is already aware of this.
#33
Updated by nicksinger 8 months ago
- Status changed from Blocked to Feedback
I gave codec a call and they unfortunately just told me to write yet another e-mail. I just mailed the more general info@codec.ie asking for details on the process of sending back the machine as any attempts to further debug the system didn't yield any reasonable response from them.
#35
Updated by waynechen55 7 months ago
okurz wrote:
codec responded so you need to ping gigabyte directly
It is pleasing to hear this good news. Could you share more info if there is any?
By the way, the ticket https://sd.suse.com/servicedesk/customer/portal/1/SD-70018 has been closed because there is already an internal discussion in the meantime. I think that refers to the discussion here within QA.
#36
Updated by nicksinger 7 months ago
- Status changed from Workable to Blocked
I've created an account on the gigabyte "eSupport" website and raised our problems there and mentioned that codec cannot provide us help.
The link to the ticket is https://esupport.gigabyte.com/Default/#id:1374724 but most likely you cannot see this without being logged in. I will keep it monitored.
#37
Updated by waynechen55 7 months ago
nicksinger wrote:
I've created an account on the gigabyte "eSupport" website and raised our problems there and mentioned that codec cannot provide us help.
The link to the ticket is https://esupport.gigabyte.com/Default/#id:1374724 but most likely you cannot see this without being logged in. I will keep it monitored.
I cannot see it even when logged in: "No record".
#38
Updated by okurz 6 months ago
nicksinger, as discussed, just send the hardware to "them" – I don't know who "them" is, you decide :)
#39
Updated by nicksinger 6 months ago
- Status changed from Blocked to Workable
Gigabyte rejected my request and told us to send back to the vendor. I'm not sure yet what to do but I think we can continue.
#40
Updated by okurz 6 months ago
- Due date set to 2022-12-16
- Priority changed from Normal to Urgent
nsinger talked to "the IT office" in Prg office. They often have contacts with codec and good experiences. nsinger will create an SD ticket to the ladies from IT and they should be able to help us, e.g. where to send the server
#41
Updated by nicksinger 6 months ago
- Status changed from Workable to Blocked
Request created: https://sd.suse.com/servicedesk/customer/portal/1/SD-106628
#43
Updated by okurz 4 months ago
From https://sd.suse.com/servicedesk/customer/portal/1/SD-106628
Martina Markova 2023-01-26 16:03 wrote
Follow up email sent to Nick.
#45
Updated by openqa_review 4 months ago
- Due date set to 2023-02-22
Setting due date based on mean cycle time of SUSE QE Tools
#46
Updated by nicksinger 4 months ago
- Status changed from In Progress to Blocked
After the escalation, CODEC requested some of the already sent information again. I collected old descriptions of what we already did and sent them with Martina (SUSE IT), Aine (CODEC) and osd-admins on CC. I already got a swift reply that my contact at CODEC moved positions but GIGABYTE was contacted nevertheless:
Hi Nick, Aisling has recently moved roles in the company, and I am now looking after the SUSE account. Thanks for reaching out and advising on the below. I will contact the supplier of this machine and hope we can reach a solution. I will keep you updated. Kind regards, Aine
#48
Updated by nicksinger 3 months ago
Today I connected my notebook, running a DHCP server, directly to the BMC and was able to access the web interface as well as SOL. However, the machine still does not boot, reports very little over SOL and shows no output on the connected VGA screen. I sent this information to codec by mail (visible on the osd-admins ML) today.
#50
Updated by okurz 3 months ago
I mounted the machine in NUE-FC-B:5, connected to PDU outlet A9 and with mgmt ethernet on switch port 22. As we don't know the MAC address, because the information on racktables for the machine was incomplete so far, I created https://sd.suse.com/servicedesk/customer/portal/1/SD-113959 for Eng-Infra to provide us log or admin access to walter1.qe.nue2.suse.org
As a workaround I ran nmap -p 443 10.168.192.0/22 with the output redirected to a file, first with the chan mgmt patch cable not plugged in, then again with it plugged in, and diffed the two outputs. Result:
+Nmap scan report for 10.168.193.122
+Host is up (0.00051s latency).
+
+PORT    STATE SERVICE
+443/tcp open  https
+MAC Address: E0:D5:5E:A7:E8:34 (Giga-byte Technology)
+
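The before/after comparison of the two nmap runs can be sketched in a few lines. This is a minimal illustration that assumes both scans were saved as normal nmap text output; the function names and the sample scan text are hypothetical, and it reports only hosts whose "Nmap scan report for" line is new, not changed ports.

```python
import ipaddress
import re

# Matches "Nmap scan report for 10.0.0.1" and "Nmap scan report for name (10.0.0.1)".
REPORT_RE = re.compile(
    r"^Nmap scan report for (?:\S+ \()?(\d{1,3}(?:\.\d{1,3}){3})\)?\s*$"
)


def hosts_in_scan(scan_text):
    """Collect the IPv4 addresses that have a scan report in nmap's text output."""
    hosts = set()
    for line in scan_text.splitlines():
        match = REPORT_RE.match(line.strip())
        if match:
            hosts.add(match.group(1))
    return hosts


def new_hosts(before_text, after_text):
    """Return addresses present in the second scan but not in the first, sorted."""
    added = hosts_in_scan(after_text) - hosts_in_scan(before_text)
    return sorted(added, key=ipaddress.ip_address)


if __name__ == "__main__":
    # Hypothetical scan excerpts; only 10.168.193.122 is taken from the note above.
    unplugged = "Nmap scan report for 10.168.192.1\nHost is up (0.00040s latency).\n"
    plugged = unplugged + "Nmap scan report for 10.168.193.122\nHost is up (0.00051s latency).\n"
    print(new_hosts(unplugged, plugged))
```

Compared to a plain diff, this is less sensitive to latency values changing between runs, which would otherwise show up as noise.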
#51
Updated by okurz 3 months ago
We attempted a firmware upgrade as suggested. This succeeded but no change in behaviour in the boot process. Ongoing email conversation, CC'ed to osd-admins@suse.de. Latest message at time of writing https://mailman.suse.de/mlarch/SuSE/osd-admins/2023/osd-admins.2023.03/msg00043.html
#52
Updated by okurz 3 months ago
- Subject changed from Make aarch64 machine chan-1 up and running after it is broken size:M to Make aarch64 machine chan-1 and chow up and running after it is broken size:M
From email:
From: qa-apac2 on behalf of Oliver Kurz okurz@suse.de
Sent: Saturday, March 4, 2023 5:19 AM
To: qa-apac2 qa-apac2@suse.de
Subject: How is machine "chow" used? Is racktables correct?
Hi, I found the machine chow.qa.suse.de under https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=13550 but neither ping -c1 chow-sp.qa.suse.de nor ping -c1 chow-1.qa.suse.de responded. Is the information in racktables correct?
…
Both chan and chow are down due to the same or similar issue, also see #103736#note-30
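The reachability check from the email (one ping per hostname) could be scripted roughly as follows. This is a sketch under assumptions: it relies on a Linux-style ping that accepts -c1, and the helper names ping_once and unreachable are invented for the example.

```python
import subprocess


def ping_once(host, timeout_s=5):
    """Return True if a single ping to `host` succeeds (assumes Linux-style `ping -c1`)."""
    try:
        result = subprocess.run(
            ["ping", "-c1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # No reply within the timeout, or no ping binary available.
        return False
    return result.returncode == 0


def unreachable(hosts, probe=ping_once):
    """Return the hosts for which `probe` reports no response, in input order."""
    return [host for host in hosts if not probe(host)]


if __name__ == "__main__":
    # Hostnames taken from the email above.
    print(unreachable(["chow-sp.qa.suse.de", "chow-1.qa.suse.de"]))
```

The probe is passed in as a parameter so the list logic can be exercised without network access.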
#53
Updated by cdywan 3 months ago
- Due date changed from 2023-03-10 to 2023-03-17
okurz wrote:
nicksinger forwarded a message and received a first-level response, pending second-level response from Gigabyte
Still being discussed. I also raised this on Slack to clarify if they have the information they need from our side.
#56
Updated by okurz about 1 month ago
The last email I saw was https://mailman.suse.de/mlarch/SuSE/osd-admins/2023/osd-admins.2023.03/msg00185.html on 21 Mar 2023. There is no related message in https://mailman.suse.de/mlarch/SuSE/osd-admins/2023/osd-admins.2023.04/threads.html
nicksinger please make sure that our relevant communication partners do the next step and then bump the due date accordingly.
#57
Updated by nicksinger about 1 month ago
- Status changed from Blocked to Resolved
I wrote another e-mail today asking if they still need something from us. We don't expect a free replacement for the broken machines. If we get an answer now, we can consider reopening again.