action #91530

closed

Severe performance problems on malbec

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2021-04-21
Due date: 2021-05-14
% Done: 0%
Estimated time:
Description

@MDoucha wrote:
@okurz grenache-1 is either seriously overloaded or one of the disk drives is about to die. I'm getting lots of tests failing due to wait_serial timeouts (zypper dup, waiting for the login prompt after boot, etc.), all on grenache.
There's nothing obviously wrong with the test or the VM, it's just that something that would normally take 30 seconds times out after 30 minutes.

Actions #1

Updated by livdywan over 3 years ago

  • Description updated (diff)
Actions #2

Updated by nicksinger over 3 years ago

Looking at https://stats.openqa-monitor.qa.suse.de/d/WDgrenache-1/worker-dashboard-grenache-1?orgId=1&refresh=1m&from=now-90d&to=now everything seems "normal". There is slightly increased CPU usage but nothing really concerning. I will cross-check whether the most recent "fix" for PXE booting is somehow related (thinking of IPv6 problems), but I'm quite certain that this is not to blame here. Checking anyway.
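
Independent of Grafana, a quick local cross-check of disk latency on the worker in question could look like this (a minimal sketch, assuming the sysstat package is installed):

# sample extended per-device I/O statistics, 3 samples at 5-second intervals;
# consistently high await/%util on one drive would point at a struggling disk
iostat -x 5 3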

Actions #3

Updated by nicksinger over 3 years ago

nicksinger wrote:

Looking at https://stats.openqa-monitor.qa.suse.de/d/WDgrenache-1/worker-dashboard-grenache-1?orgId=1&refresh=1m&from=now-90d&to=now everything seems "normal". There is slightly increased CPU usage but nothing really concerning. I will cross-check whether the most recent "fix" for PXE booting is somehow related (thinking of IPv6 problems), but I'm quite certain that this is not to blame here. Checking anyway.

Silly me… I was looking at grenache while malbec is the machine in question m(

Actions #4

Updated by nicksinger over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger

So, https://stats.openqa-monitor.qa.suse.de/d/WDmalbec/worker-dashboard-malbec?orgId=1&from=now-90d&to=now&refresh=1m is interesting. There was something going on with the disk sdf between 04-01 and 04-25. Given this ticket was created on the 21st, this might be related. Do we still see increased issues on the machine now that sdf seems to have settled again?
I will try to dig a little further and see if I can somehow get SMART reports from these disks.
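
One possible way to pull such reports directly from Linux would be the following (a sketch only, assuming smartmontools is installed and that the ipr adapter passes the relevant SCSI pages through to the sg devices, which is not verified on this machine):

# health status and error counters of one of the SAS drives seen by iprconfig
# (sg1 being the directly attached Hitachi drive; repeat for sg2..sg8)
smartctl -d scsi -a /dev/sg1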

Actions #5

Updated by nicksinger over 3 years ago

iprconfig shows the RAID0 of the machine as "Degraded". Currently I can't figure out which disk is causing issues, but I think this is the right track.
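
For reference, roughly the same overview should also be available without going through the interactive menus (sketch; please double-check the sub-command against iprconfig --help on the machine):

# print adapter, array and disk status non-interactively
iprconfig -c show-config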

Actions #6

Updated by openqa_review over 3 years ago

  • Due date set to 2021-05-14

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by nicksinger over 3 years ago

Last week I saw a "Permanent IOA failure" in that machine's dmesg. According to IBM docs this means the adapter should be exchanged (https://www.ibm.com/docs/en/power8?topic=recovery-unit-reference-code-tables). But they also write something like "If two errors have occurred for the same I/O adapter in 24 hours, exchange the failing items in the Failing Items list one at a time.". After some days of not touching the machine the error didn't come up again. Also, the iprconfig utility reports both RAIDs as "degraded". So this "IOA failure" might be totally unrelated to the "degraded" state of the RAID. I forced a "RAID consistency check" now to see if a more specific message comes up. Until now I could not figure out which disk is failing, just that the whole RAID is unhealthy.
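
To notice whether the IOA failure comes back while the consistency check runs, something like the following could simply be re-run from time to time; it only filters kernel messages, nothing ipr-specific is assumed:

# show kernel messages from the ipr driver/IOA with human-readable timestamps
dmesg -T | grep -i -E 'ipr|ioa|i/o error'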

Actions #8

Updated by nicksinger over 3 years ago

  • Priority changed from Urgent to High

Given that the I/O performance in Grafana looks good again and we didn't hear further complaints about the machine, I will lower the priority a little. However, it is still important that we fix the RAID.

Actions #10

Updated by okurz over 3 years ago

Executed iprconfig together with nsinger. Via "Display SAS Path Status" we found:

Type option, press Enter.
  1=Display SAS Path routing details

OPT Name   PCI/SCSI Location          Description                  Status
--- ------ -------------------------  ---------------------------- -----------------
    sg46   0005:04:00.0/1:0:7:0       RAID 0 Array Member          Redundant
    sg51   0005:04:00.0/1:0:12:0      RAID 0 Array Member          Redundant
    sg45   0005:04:00.0/1:0:6:0       RAID 10 Array Member         Redundant
    sg48   0005:04:00.0/1:0:9:0       RAID 0 Array Member          Redundant
    sg50   0005:04:00.0/1:0:11:0      RAID 0 Array Member          Redundant
    sg44   0005:04:00.0/1:0:5:0       RAID 10 Array Member         Redundant
    sg47   0005:04:00.0/1:0:8:0       RAID 0 Array Member          Redundant
    sg49   0005:04:00.0/1:0:10:0      RAID 0 Array Member          Redundant
    sg4    0001:08:00.0/0:0:3:0       RAID 0 Array Member          Redundant
    sg7    0001:08:00.0/0:0:6:0       RAID 0 Array Member          Redundant
    sg3    0001:08:00.0/0:0:2:0       RAID 0 Array Member          Redundant
    sg6    0001:08:00.0/0:0:5:0       RAID 0 Array Member          Redundant
    sg2    0001:08:00.0/0:0:1:0       RAID 10 Array Member         Redundant
    sg5    0001:08:00.0/0:0:4:0       RAID 0 Array Member          Redundant
    sg8    0001:08:00.0/0:0:7:0       RAID 0 Array Member          Redundant
    sg1    0001:08:00.0/0:0:0:0       RAID 10 Array Member         Redundant

or, toggled to the alternate view:

OPT Name   Resource Path/Address      Vendor   Product ID          Status
--- ------ -------------------------- -------- ------------------- -----------------
    sg46   00-0C-02                   IBM      ST600MP0005         Redundant
    sg51   00-0C-07                   IBM      ST600MP0005         Redundant
    sg45   00-0C-01                   IBM      ST600MP0005         Redundant
    sg48   00-0C-04                   IBM      ST600MP0005         Redundant
    sg50   00-0C-06                   IBM      ST600MP0005         Redundant
    sg44   00-0C-00                   IBM      HUC156060CSS20      Redundant
    sg47   00-0C-03                   IBM      ST600MP0005         Redundant
    sg49   00-0C-05                   IBM      ST600MP0005         Redundant
    sg4    00-0C-03                   IBM      ST600MP0005         Redundant
    sg7    00-0C-06                   IBM      ST600MP0005         Redundant
    sg3    00-0C-02                   IBM      ST600MP0005         Redundant
    sg6    00-0C-05                   IBM      ST600MP0005         Redundant
    sg2    00-0C-01                   IBM      ST600MP0005         Redundant
    sg5    00-0C-04                   IBM      ST600MP0005         Redundant
    sg8    00-0C-07                   IBM      ST600MP0005         Redundant
    sg1    00-0C-00                   IBM      HUC156060CSS20      Redundant

HUC156060CSS20 is likely a "Hitachi HGST Ultrastar C15K600 600GB HDD" hard disk, ST600MP0005 is a Seagate Enterprise Performance 15K SAS 600GB. So it seems we have at least two different drive models, but we do not yet know how many real physical drives there are (just 2?). Under "Display Hardware Status" we could find the product IDs:

OPT Name   Resource Path/Address      Vendor   Product ID          Status
--- ------ -------------------------- -------- ------------------- -----------------
    sg25   FE                         IBM      57D8001SISIOA       Operational
    sg39   FC-06-00                   IBM      IPR-0   5ED56800    Degraded
    sg48   00-0C-04                   IBM      ST600MP0005         Remote
    sg42   FC-04-00                   IBM      IPR-0   5ED56800    Degraded
    sg50   00-0C-06                   IBM      ST600MP0005         Remote
    sg38   FC-01-00                   IBM      IPR-0   5ED56800    Degraded
    sg46   00-0C-02                   IBM      ST600MP0005         Remote
    sg41   FC-05-00                   IBM      IPR-0   5ED56800    Degraded
    sg49   00-0C-05                   IBM      ST600MP0005         Remote
    sg37   FC-02-00                   IBM      IPR-10  5ED59900    Degraded
    sg45   00-0C-01                   IBM      ST600MP0005         Remote
    sg44   00-0C-00                   IBM      HUC156060CSS20      Remote
    sg40   FC-00-00                   IBM      IPR-0   5ED56800    Degraded
    sg47   00-0C-03                   IBM      ST600MP0005         Remote
    sg43   FC-03-00                   IBM      IPR-0   5ED56800    Degraded
    sg51   00-0C-07                   IBM      ST600MP0005         Remote
    sg29   00-0F                      IBM.     RMBO0140532         Active
    sg26   00-14                      IBM      VSBPD14M1 6GSAS     Active
    sg28   00-0C-18                   IBM      PSBPD14M1 6GSAS     Active
    sg27   00-08-18                   IBM      PSBPD14M1 6GSAS     Active
    sg0    FE                         IBM      57D8001SISIOA       Operational
    sg17   FC-01-00                   IBM      IPR-0   5ED56800    Degraded
    sg3    00-0C-02                   IBM      ST600MP0005         Active
    sg20   FC-05-00                   IBM      IPR-0   5ED56800    Degraded
    sg6    00-0C-05                   IBM      ST600MP0005         Active
    sg16   FC-02-00                   IBM      IPR-10  5ED59900    Degraded
    sg2    00-0C-01                   IBM      ST600MP0005         Active
    sg1    00-0C-00                   IBM      HUC156060CSS20      Active
    sg19   FC-06-00                   IBM      IPR-0   5ED56800    Degraded
    sg5    00-0C-04                   IBM      ST600MP0005         Active
    sg22   FC-03-00                   IBM      IPR-0   5ED56800    Degraded
    sg8    00-0C-07                   IBM      ST600MP0005         Active
    sg18   FC-00-00                   IBM      IPR-0   5ED56800    Degraded
    sg4    00-0C-03                   IBM      ST600MP0005         Active
    sg21   FC-04-00                   IBM      IPR-0   5ED56800    Degraded
    sg7    00-0C-06                   IBM      ST600MP0005         Active
    sg23   00-08-18                   IBM      PSBPD14M1 6GSAS     Active
    sg24   00-0C-18                   IBM      PSBPD14M1 6GSAS     Active

where we can find sg1 and sg44, the Hitachi drive. It seems one entry, sg1, is the direct physical connection with status "Active", while the other entry, sg44, reaches the same drive over multipath with status "Remote". So we have either two devices, one Seagate and one Hitachi, or we have 8 devices, see

                                                                                  Display Device Statistics

Type option, press Enter.
  1=Display device statistics

OPT Name   Resource Path/Address      Vendor   Product ID          Status
--- ------ -------------------------- -------- ------------------- -----------------
           00-0C-04                   IBM      ST600MP0005         Remote
           00-0C-06                   IBM      ST600MP0005         Remote
           00-0C-02                   IBM      ST600MP0005         Remote
           00-0C-05                   IBM      ST600MP0005         Remote
           00-0C-01                   IBM      ST600MP0005         Remote
           00-0C-00                   IBM      HUC156060CSS20      Remote
           00-0C-03                   IBM      ST600MP0005         Remote
           00-0C-07                   IBM      ST600MP0005         Remote
           00-0C-02                   IBM      ST600MP0005         Active
           00-0C-05                   IBM      ST600MP0005         Active
           00-0C-01                   IBM      ST600MP0005         Active
           00-0C-00                   IBM      HUC156060CSS20      Active
           00-0C-04                   IBM      ST600MP0005         Active
           00-0C-07                   IBM      ST600MP0005         Active
           00-0C-03                   IBM      ST600MP0005         Active
           00-0C-06                   IBM      ST600MP0005         Active

By the way, the racktables entry is https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=3052
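
The two-paths-per-drive theory could also be cross-checked from within Linux, e.g. (assuming lsscsi and multipath-tools are installed):

# map sg names to SCSI addresses, vendor/model and block devices
lsscsi -g
# show which paths are grouped into one multipath map per physical drive
multipath -ll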

So we propose the following tasks:

  • Research how we can find out faulty RAID members in iprconfig
  • Ask mgriessmeier to check if there is any red light on the case of the physical machine
Actions #11

Updated by okurz over 3 years ago

Following https://www.linux.org/docs/man8/iprconfig.html, I found iprconfig -c query-raid-consistency-check, which yields:

Name   PCI/SCSI Location          Description               Status
------ -------------------------  ------------------------- -----------------
sdb    0001:08:00.0/0:2:0:0       RAID 10 Array                Degraded

which seems to correspond to sg16 FC-02-00 IBM IPR-10 5ED59900 Degraded.

Running that consistency check could be a next step :)
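
Kicking it off could look roughly like this (syntax not verified; the sub-command name is only assumed to exist as the counterpart of query-raid-consistency-check):

# start a consistency check on the degraded RAID 10 array reported above
iprconfig -c raid-consistency-check sdb
# poll its progress afterwards
iprconfig -c query-raid-consistency-check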

Actions #12

Updated by okurz over 3 years ago

  • Status changed from In Progress to Resolved

iprconfig -c show-slots should show physical locations:

Name   Platform Location          Description                  Status
------ -------------------------- ---------------------------- ------------
sg1    U78CB.001.WZS06YF-P2-D1    RAID 10 Array Member         Active
sg2    U78CB.001.WZS06YF-P2-D2    RAID 10 Array Member         Active
sg3    U78CB.001.WZS06YF-P2-D3    RAID 0 Array Member          Active
sg4    U78CB.001.WZS06YF-P2-D4    RAID 0 Array Member          Active
sg5    U78CB.001.WZS06YF-P2-D5    RAID 0 Array Member          Active
sg6    U78CB.001.WZS06YF-P2-D6    RAID 0 Array Member          Active
sg7    U78CB.001.WZS06YF-P2-D7    RAID 0 Array Member          Active
sg8    U78CB.001.WZS06YF-P2-D8    RAID 0 Array Member          Active
       U78CB.001.WZS06YF-P2-D9                                 Empty
       U78CB.001.WZS06YF-P2-D11                                Empty
       U78CB.001.WZS06YF-P2-D13                                Empty
       U78CB.001.WZS06YF-P2-D10                                Empty
       U78CB.001.WZS06YF-P2-D12                                Empty
       U78CB.001.WZS06YF-P2-D14                                Empty

which makes it more likely that we are working with 8 physical drives.

iprconfig -c dump shows a complete set of information, basically the output of all the interesting commands in a row, and it is actually quite readable.

However, "Degraded" might just mean that we have multiple RAID0 configured with one drive each which is valid and possible and might be seen as "Degraded" but we should not see that as problematic. And the device that was shown as "needing a RAID consistency check" is sdb which according to lsblk is only used for swap space. I did now

swapoff /dev/disk/by-id/dm-name-1IBM_IPR-10_5ED5990000000020-part1
mkswap /dev/disk/by-id/dm-name-1IBM_IPR-10_5ED5990000000020-part1
swapon /dev/disk/by-id/dm-name-1IBM_IPR-10_5ED5990000000020-part1

just to be sure that the device offers valid swap.
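
To confirm that the device actually offers working swap again, one can simply check:

# list active swap areas and overall memory/swap usage
swapon --show
free -h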

By the way, it seems iprconfig is stripping off some characters from the serial IDs of the devices, but they are actually accessible within Linux; see the output of ls -l /dev/disk/by-id:

lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000020 -> ../../dm-5
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000020-part1 -> ../../dm-7
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000040 -> ../../dm-8
lrwxrwxrwx 1 root root 11 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000040-part1 -> ../../dm-10
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000080 -> ../../dm-6
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED5680000000080-part1 -> ../../dm-9
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED56800000000A0 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED56800000000C0 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-0_5ED56800000000E0 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-name-1IBM_IPR-10_5ED5990000000020 -> ../../dm-0
lrwxrwxrwx 1 root root 10 May  6 23:14 dm-name-1IBM_IPR-10_5ED5990000000020-part1 -> ../../dm-1
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED5680000000020 -> ../../dm-5
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED5680000000040 -> ../../dm-8
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED5680000000080 -> ../../dm-6
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED56800000000A0 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED56800000000C0 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-0_5ED56800000000E0 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-mpath-1IBM_IPR-10_5ED5990000000020 -> ../../dm-0
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-part1-mpath-1IBM_IPR-0_5ED5680000000020 -> ../../dm-7
lrwxrwxrwx 1 root root 11 Apr 25 03:36 dm-uuid-part1-mpath-1IBM_IPR-0_5ED5680000000040 -> ../../dm-10
lrwxrwxrwx 1 root root 10 Apr 25 03:36 dm-uuid-part1-mpath-1IBM_IPR-0_5ED5680000000080 -> ../../dm-9
lrwxrwxrwx 1 root root 10 May  6 23:14 dm-uuid-part1-mpath-1IBM_IPR-10_5ED5990000000020 -> ../../dm-1
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000020 -> ../../dm-5
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000020-part1 -> ../../dm-7
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000040 -> ../../dm-8
lrwxrwxrwx 1 root root 11 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000040-part1 -> ../../dm-10
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000080 -> ../../dm-6
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED5680000000080-part1 -> ../../dm-9
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED56800000000A0 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED56800000000C0 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-0_5ED56800000000E0 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Apr 25 03:36 scsi-1IBM_IPR-10_5ED5990000000020 -> ../../dm-0
lrwxrwxrwx 1 root root 10 May  6 23:14 scsi-1IBM_IPR-10_5ED5990000000020-part1 -> ../../dm-1
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000020 -> ../../dm-5
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000020-part1 -> ../../dm-7
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000040 -> ../../dm-8
lrwxrwxrwx 1 root root 11 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000040-part1 -> ../../dm-10
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000080 -> ../../dm-6
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED5680000000080-part1 -> ../../dm-9
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED56800000000A0 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED56800000000C0 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-0_5ED56800000000E0 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Apr 25 03:36 wwn-0xIBM_IPR-10_5ED5990000000020 -> ../../dm-0
lrwxrwxrwx 1 root root 10 May  6 23:14 wwn-0xIBM_IPR-10_5ED5990000000020-part1 -> ../../dm-1

Note how most devices start with 5ED5680 but end in different numbers, except for 5ED599000, so these are all separate disk drives.
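
For a quick overview of the distinct serial IDs, a one-liner like this should do (relying only on the naming scheme visible above):

# list the unique IPR serial suffixes behind the by-id entries
ls /dev/disk/by-id/ | grep -oE 'IPR-[0-9]+_[0-9A-F]+' | sort -u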

Actions #13

Updated by nicksinger over 3 years ago

okurz wrote:

However, "Degraded" might just mean that we have multiple RAID0 configured with one drive each which is valid and possible and might be seen as "Degraded" but we should not see that as problematic.

Could you explain where this assumption comes from? I'm not saying it is wrong, but to me this doesn't define a "degraded" RAID. Two RAID0 arrays combined in a RAID1 (RAID10) is a quite widespread setup, so why would it be considered degraded?

By the way, it seems iprconfig is stripping off some characters from the serial IDs of the devices, but they are actually accessible within Linux; see the output of ls -l /dev/disk/by-id:
[…]
Note how most devices start with 5ED5680 but end in different numbers, except for 5ED599000, so these are all separate disk drives.

That makes sense, good catch!

Actions #14

Updated by okurz over 3 years ago

nicksinger wrote:

okurz wrote:

However, "Degraded" might just mean that we have multiple RAID0 configured with one drive each which is valid and possible and might be seen as "Degraded" but we should not see that as problematic.

Could you explain where this assumption comes from? I'm not saying it is wrong, but to me this doesn't define a "degraded" RAID. Two RAID0 arrays combined in a RAID1 (RAID10) is a quite widespread setup, so why would it be considered degraded?

I meant that maybe someone created a RAID0 with two slots configured but just one drive active, and maybe that shows up as "Degraded" then.
