action #91530
closed: Severe performance problems on malbec
Added by okurz over 3 years ago. Updated over 3 years ago.
Description
@MDoucha wrote:
@okurz grenache-1 is either seriously overloaded or one of the disk drives is about to die. I'm getting lots of tests failed due to wait_serial timeout (zypper dup, waiting for the login prompt after boot, etc.), all on grenache.
There's nothing obviously wrong with the test or the VM, it's just that something that'd normally take 30 seconds times out after 30 minutes.
Updated by nicksinger over 3 years ago
Looking at https://stats.openqa-monitor.qa.suse.de/d/WDgrenache-1/worker-dashboard-grenache-1?orgId=1&refresh=1m&from=now-90d&to=now everything seems "normal". Slightly increased CPU usage but nothing really concerning. I will crosscheck if the most recent "fix" for PXE booting is somehow related (thinking of ipv6 problems). But I'm quite certain that this is not to blame here. Checking anyway.
Updated by nicksinger over 3 years ago
nicksinger wrote:
Looking at https://stats.openqa-monitor.qa.suse.de/d/WDgrenache-1/worker-dashboard-grenache-1?orgId=1&refresh=1m&from=now-90d&to=now everything seems "normal". Slightly increased CPU usage but nothing really concerning. I will crosscheck if the most recent "fix" for PXE booting is somehow related (thinking of ipv6 problems). But I'm quite certain that this is not to blame here. Checking anyway.
silly me… looking at grenache while malbec is in question m(
Updated by nicksinger over 3 years ago
- Status changed from New to In Progress
- Assignee set to nicksinger
So, https://stats.openqa-monitor.qa.suse.de/d/WDmalbec/worker-dashboard-malbec?orgId=1&from=now-90d&to=now&refresh=1m is interesting. There was something going on with the disk sdf between 04-01 and 04-25. Given this ticket is from the 21st, this might be related. Do we still see increased issues on the machine, as sdf seems to have settled again?
I will try to dig a little further and see if I can somehow get SMART reports from these disks.
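A minimal sketch of how that could look, assuming smartmontools is installed and the drives are reachable as /dev/sg* SCSI pass-through devices behind the ipr controller (the device range is an assumption):
# query SMART health/error data for each pass-through device
for dev in /dev/sg{1..8}; do
    echo "=== $dev ==="
    smartctl -d scsi -a "$dev" || echo "no SMART data via $dev"
done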
Updated by nicksinger over 3 years ago
iprconfig shows the RAID0 of the machine as "Degraded". Currently I can't figure out which disk is causing the issues, but I think this is the right track.
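A possible non-interactive way to get the same overview, under the assumption that this iprconfig version supports these subcommands (dump and query-raid-consistency-check also show up later in this ticket):
# adapters, arrays and member disks with their status
iprconfig -c show-config
# full report; search it for anything that does not look healthy
iprconfig -c dump | grep -iE "degraded|failed"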
Updated by openqa_review over 3 years ago
- Due date set to 2021-05-14
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger over 3 years ago
Last week I saw a "Permanent IOA failure" in that machine's dmesg. According to IBM docs this means the adapter should be exchanged (https://www.ibm.com/docs/en/power8?topic=recovery-unit-reference-code-tables). But they also write something like "If two errors have occurred for the same I/O adapter in 24 hours, exchange the failing items in the Failing Items list one at a time.". After some days of not touching the machine the error didn't come up again. Also the iprconfig utility reports both RAIDs as "degraded". After all, this "IOA failure" might be totally unrelated to the "degraded" state of the RAID. I forced a "RAID consistency check" now to see if a more specific message comes up. So far I could not figure out which disk is failing, just that the whole RAID is unhealthy.
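To see whether the IOA error recurs, a check like the following could be run from time to time (a sketch, not from the original ticket; the grep pattern is an assumption):
# kernel messages of the last week, filtered for ipr/IOA related lines
journalctl -k --since "7 days ago" | grep -iE "ipr|ioa" | tail -n 50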
Updated by nicksinger over 3 years ago
- Priority changed from Urgent to High
Given that the IO performance on grafana looks good again and we didn't hear further complaints about the machine, I will lower the prio a little. However, it is still important that we fix the RAID.
Updated by okurz over 3 years ago
Executed iprconfig together with nsinger. Via "Display SAS Path Status" we found:
Type option, press Enter.
1=Display SAS Path routing details
OPT Name PCI/SCSI Location Description Status
--- ------ ------------------------- ---------------------------- -----------------
sg46 0005:04:00.0/1:0:7:0 RAID 0 Array Member Redundant
sg51 0005:04:00.0/1:0:12:0 RAID 0 Array Member Redundant
sg45 0005:04:00.0/1:0:6:0 RAID 10 Array Member Redundant
sg48 0005:04:00.0/1:0:9:0 RAID 0 Array Member Redundant
sg50 0005:04:00.0/1:0:11:0 RAID 0 Array Member Redundant
sg44 0005:04:00.0/1:0:5:0 RAID 10 Array Member Redundant
sg47 0005:04:00.0/1:0:8:0 RAID 0 Array Member Redundant
sg49 0005:04:00.0/1:0:10:0 RAID 0 Array Member Redundant
sg4 0001:08:00.0/0:0:3:0 RAID 0 Array Member Redundant
sg7 0001:08:00.0/0:0:6:0 RAID 0 Array Member Redundant
sg3 0001:08:00.0/0:0:2:0 RAID 0 Array Member Redundant
sg6 0001:08:00.0/0:0:5:0 RAID 0 Array Member Redundant
sg2 0001:08:00.0/0:0:1:0 RAID 10 Array Member Redundant
sg5 0001:08:00.0/0:0:4:0 RAID 0 Array Member Redundant
sg8 0001:08:00.0/0:0:7:0 RAID 0 Array Member Redundant
sg1 0001:08:00.0/0:0:0:0 RAID 10 Array Member Redundant
or, toggled to the alternate view:
OPT Name Resource Path/Address Vendor Product ID Status
--- ------ -------------------------- -------- ------------------- -----------------
sg46 00-0C-02 IBM ST600MP0005 Redundant
sg51 00-0C-07 IBM ST600MP0005 Redundant
sg45 00-0C-01 IBM ST600MP0005 Redundant
sg48 00-0C-04 IBM ST600MP0005 Redundant
sg50 00-0C-06 IBM ST600MP0005 Redundant
sg44 00-0C-00 IBM HUC156060CSS20 Redundant
sg47 00-0C-03 IBM ST600MP0005 Redundant
sg49 00-0C-05 IBM ST600MP0005 Redundant
sg4 00-0C-03 IBM ST600MP0005 Redundant
sg7 00-0C-06 IBM ST600MP0005 Redundant
sg3 00-0C-02 IBM ST600MP0005 Redundant
sg6 00-0C-05 IBM ST600MP0005 Redundant
sg2 00-0C-01 IBM ST600MP0005 Redundant
sg5 00-0C-04 IBM ST600MP0005 Redundant
sg8 00-0C-07 IBM ST600MP0005 Redundant
sg1 00-0C-00 IBM HUC156060CSS20 Redundant
HUC156060CSS20 is likely a "Hitachi HGST Ultrastar C15K600 600GB HDD" hard disk, ST600MP0005 is a Seagate Enterprise Performance 15K SAS 600GB. So it seems we have at least two different HDD models but we do not yet know how many real physical drives there are (just 2?). In "Display Hardware Status" we could find product IDs:
OPT Name Resource Path/Address Vendor Product ID Status
--- ------ -------------------------- -------- ------------------- -----------------
sg25 FE IBM 57D8001SISIOA Operational
sg39 FC-06-00 IBM IPR-0 5ED56800 Degraded
sg48 00-0C-04 IBM ST600MP0005 Remote
sg42 FC-04-00 IBM IPR-0 5ED56800 Degraded
sg50 00-0C-06 IBM ST600MP0005 Remote
sg38 FC-01-00 IBM IPR-0 5ED56800 Degraded
sg46 00-0C-02 IBM ST600MP0005 Remote
sg41 FC-05-00 IBM IPR-0 5ED56800 Degraded
sg49 00-0C-05 IBM ST600MP0005 Remote
sg37 FC-02-00 IBM IPR-10 5ED59900 Degraded
sg45 00-0C-01 IBM ST600MP0005 Remote
sg44 00-0C-00 IBM HUC156060CSS20 Remote
sg40 FC-00-00 IBM IPR-0 5ED56800 Degraded
sg47 00-0C-03 IBM ST600MP0005 Remote
sg43 FC-03-00 IBM IPR-0 5ED56800 Degraded
sg51 00-0C-07 IBM ST600MP0005 Remote
sg29 00-0F IBM. RMBO0140532 Active
sg26 00-14 IBM VSBPD14M1 6GSAS Active
sg28 00-0C-18 IBM PSBPD14M1 6GSAS Active
sg27 00-08-18 IBM PSBPD14M1 6GSAS Active
sg0 FE IBM 57D8001SISIOA Operational
sg17 FC-01-00 IBM IPR-0 5ED56800 Degraded
sg3 00-0C-02 IBM ST600MP0005 Active
sg20 FC-05-00 IBM IPR-0 5ED56800 Degraded
sg6 00-0C-05 IBM ST600MP0005 Active
sg16 FC-02-00 IBM IPR-10 5ED59900 Degraded
sg2 00-0C-01 IBM ST600MP0005 Active
sg1 00-0C-00 IBM HUC156060CSS20 Active
sg19 FC-06-00 IBM IPR-0 5ED56800 Degraded
sg5 00-0C-04 IBM ST600MP0005 Active
sg22 FC-03-00 IBM IPR-0 5ED56800 Degraded
sg8 00-0C-07 IBM ST600MP0005 Active
sg18 FC-00-00 IBM IPR-0 5ED56800 Degraded
sg4 00-0C-03 IBM ST600MP0005 Active
sg21 FC-04-00 IBM IPR-0 5ED56800 Degraded
sg7 00-0C-06 IBM ST600MP0005 Active
sg23 00-08-18 IBM PSBPD14M1 6GSAS Active
sg24 00-0C-18 IBM PSBPD14M1 6GSAS Active
where we can find sg1 and sg44, the Hitachi drive. It seems one entry, sg1, is the direct physical connection, status "Active", and the other entry, sg44, is the same drive connected over multipath, status "Remote". So we find that we have either two devices (one Seagate, one Hitachi) or eight devices, see
Display Device Statistics
Type option, press Enter.
1=Display device statistics
OPT Name Resource Path/Address Vendor Product ID Status
--- ------ -------------------------- -------- ------------------- -----------------
00-0C-04 IBM ST600MP0005 Remote
00-0C-06 IBM ST600MP0005 Remote
00-0C-02 IBM ST600MP0005 Remote
00-0C-05 IBM ST600MP0005 Remote
00-0C-01 IBM ST600MP0005 Remote
00-0C-00 IBM HUC156060CSS20 Remote
00-0C-03 IBM ST600MP0005 Remote
00-0C-07 IBM ST600MP0005 Remote
00-0C-02 IBM ST600MP0005 Active
00-0C-05 IBM ST600MP0005 Active
00-0C-01 IBM ST600MP0005 Active
00-0C-00 IBM HUC156060CSS20 Active
00-0C-04 IBM ST600MP0005 Active
00-0C-07 IBM ST600MP0005 Active
00-0C-03 IBM ST600MP0005 Active
00-0C-06 IBM ST600MP0005 Active
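To confirm that the "Active"/"Remote" pairs are really two paths to the same physical disk, something like the following should help (a sketch; availability of multipath-tools and lsscsi on the host is an assumption):
multipath -ll    # each physical disk should appear once, with its paths listed underneath
lsscsi -g        # maps block devices to their sg names, so sg1 and sg44 can be compared directly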
By the way, the racktables entry is
https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=3052
So we propose the following tasks:
- Research how we can find out faulty RAID members in iprconfig
- Ask mgriessmeier to check if there is any red light on the case of the physical machine
Updated by okurz over 3 years ago
Following https://www.linux.org/docs/man8/iprconfig.html I found iprconfig -c query-raid-consistency-check which yields:
Name PCI/SCSI Location Description Status
------ ------------------------- ------------------------- -----------------
sdb 0001:08:00.0/0:2:0:0 RAID 10 Array Degraded
which seems to correspond to sg16 FC-02-00 IBM IPR-10 5ED59900 Degraded.
Running a consistency check on that array could be a next step :)
Updated by okurz over 3 years ago
- Status changed from In Progress to Resolved
iprconfig -c show-slots should show physical locations:
Name Platform Location Description Status
------ -------------------------- ---------------------------- ------------
sg1 U78CB.001.WZS06YF-P2-D1 RAID 10 Array Member Active
sg2 U78CB.001.WZS06YF-P2-D2 RAID 10 Array Member Active
sg3 U78CB.001.WZS06YF-P2-D3 RAID 0 Array Member Active
sg4 U78CB.001.WZS06YF-P2-D4 RAID 0 Array Member Active
sg5 U78CB.001.WZS06YF-P2-D5 RAID 0 Array Member Active
sg6 U78CB.001.WZS06YF-P2-D6 RAID 0 Array Member Active
sg7 U78CB.001.WZS06YF-P2-D7 RAID 0 Array Member Active
sg8 U78CB.001.WZS06YF-P2-D8 RAID 0 Array Member Active
U78CB.001.WZS06YF-P2-D9 Empty
U78CB.001.WZS06YF-P2-D11 Empty
U78CB.001.WZS06YF-P2-D13 Empty
U78CB.001.WZS06YF-P2-D10 Empty
U78CB.001.WZS06YF-P2-D12 Empty
U78CB.001.WZS06YF-P2-D14 Empty
which makes it more likely that we work with 8 physical drives.
iprconfig -c dump shows a good, complete information set, basically the output of all the interesting commands in a row, and is actually quite readable.
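For later comparison it might be handy to keep a copy of that output around, e.g. (target path and filename are just an example):
iprconfig -c dump > /tmp/iprconfig-dump-$(date +%F).txt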
However, "Degraded" might just mean that we have multiple RAID0 configured with one drive each which is valid and possible and might be seen as "Degraded" but we should not see that as problematic. And the device that was shown as "needing a RAID consistency check" is sdb which according to lsblk
is only used for swap space. I did now
swapoff /dev/disk/by-id/dm-name-1IBM_IPR-10_5ED5990000000020-part1
mkswap /dev/disk/by-id/dm-name-1IBM_IPR-10_5ED5990000000020-part1
swapon /dev/disk/by-id/dm-name-1IBM_IPR-10_5ED5990000000020-part1
just to be sure that the device offers valid swap.
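To double-check that the recreated swap is actually in use again (standard util-linux/procps commands, not part of the original steps):
swapon --show
free -h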
By the way, it seems iprconfig is stripping off some characters from the serial IDs of the devices, but they are actually accessible within Linux, see output of ls -l /dev/disk/by-id showing:
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000020 -> ../../dm-5
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000020-part1 -> ../../dm-7
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000040 -> ../../dm-8
lrwxrwxrwx 1 root root 11 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000040-part1 -> ../../dm-10
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000080 -> ../../dm-6
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000080-part1 -> ../../dm-9
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED56800000000A0 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED56800000000C0 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED56800000000E0 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-10_5ED5990000000020 -> ../../dm-0
lrwxrwxrwx 1 root root 10 May 6 23:14 dm-name-1IBM_IPR-10_5ED5990000000020-part1 -> ../../dm-1
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED5680000000020 -> ../../dm-5
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED5680000000040 -> ../../dm-8
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED5680000000080 -> ../../dm-6
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED56800000000A0 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED56800000000C0 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED56800000000E0 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-10_5ED5990000000020 -> ../../dm-0
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-part1-mpath-1IBM_IPR-0_5ED5680000000020 -> ../../dm-7
lrwxrwxrwx 1 root root 11 Apr 25 03:36 dm-uuid-part1-mpath-1IBM_IPR-0_5ED5680000000040 -> ../../dm-10
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-part1-mpath-1IBM_IPR-0_5ED5680000000080 -> ../../dm-9
lrwxrwxrwx 1 root root 10 May 6 23:14 dm-uuid-part1-mpath-1IBM_IPR-10_5ED5990000000020 -> ../../dm-1
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000020 -> ../../dm-5
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000020-part1 -> ../../dm-7
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000040 -> ../../dm-8
lrwxrwxrwx 1 root root 11 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000040-part1 -> ../../dm-10
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000080 -> ../../dm-6
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000080-part1 -> ../../dm-9
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED56800000000A0 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED56800000000C0 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED56800000000E0 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-10_5ED5990000000020 -> ../../dm-0
lrwxrwxrwx 1 root root 10 May 6 23:14 scsi-1IBM_IPR-10_5ED5990000000020-part1 -> ../../dm-1
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000020 -> ../../dm-5
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000020-part1 -> ../../dm-7
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000040 -> ../../dm-8
lrwxrwxrwx 1 root root 11 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000040-part1 -> ../../dm-10
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000080 -> ../../dm-6
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000080-part1 -> ../../dm-9
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED56800000000A0 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED56800000000C0 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED56800000000E0 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-10_5ED5990000000020 -> ../../dm-0
lrwxrwxrwx 1 root root 10 May 6 23:14 wwn-0xIBM_IPR-10_5ED5990000000020-part1 -> ../../dm-1
See how most devices start with 5ED5680 but end with different numbers, except for 5ED599000, so these are all separate disk drives.
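A quick way to list the distinct IPR volume IDs from those symlinks (a sketch; the grep pattern is just derived from the IDs above):
ls /dev/disk/by-id/scsi-1IBM_IPR-* | grep -oE '5ED5[0-9A-F]+' | sort -u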
Updated by nicksinger over 3 years ago
okurz wrote:
However, "Degraded" might just mean that we have multiple RAID0 configured with one drive each which is valid and possible and might be seen as "Degraded" but we should not see that as problematic.
Could you explain where this assumption comes from? Not saying it is wrong, but for me this doesn't define a "degraded" RAID. Two RAID0 combined in a RAID1 (RAID10) is a quite widespread setup, so why would it be considered degraded?
By the way, it seems iprconfig is stripping off some characters from the serial IDs of the devices, but they are actually accessible within Linux, see output of ls -l /dev/disk/by-id showing:
[…]
See how most devices start with 5ED5680 but end with different numbers, except for 5ED599000, so these are all separate disk drives.
That makes sense, good catch!
Updated by okurz over 3 years ago
nicksinger wrote:
okurz wrote:
However, "Degraded" might just mean that we have multiple RAID0 configured with one drive each which is valid and possible and might be seen as "Degraded" but we should not see that as problematic.
Could you explain where this assumption comes from? Not saying it is wrong, but for me this doesn't define a "degraded" RAID. Two RAID0 combined in a RAID1 (RAID10) is a quite widespread setup, so why would it be considered degraded?
I meant that maybe someone creates a RAID0 with two slots configured but just one drive active, and maybe that shows up as "Degraded" then.