action #110545

open

openQA Project (public) - coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try size:M

Added by okurz over 2 years ago. Updated over 1 year ago.

Status: Workable
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 2022-05-02
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

See parent #101048. In #109232#note-5 ggardet_arm gave some additional hints that we could try. We should try all of them and run tests as mkittler did in #109232.

Acceptance criteria

  • AC1: All concrete ideas have been tried and openQA tests have been executed with a statement regarding stability

Suggestions

  • Remind mkittler that he should always write down the commands he used in tickets as otherwise his colleagues will ask him anyway what he did in #109232 to run openQA tests ;)
  • Change the parameters on the systems as written in #109232#note-5 , one by one or in combination, reconduct tests and gather stability figures
  • Come up with final assessment

Concrete ideas to try out


Related issues 1 (0 open, 1 closed)

Copied to openQA Infrastructure (public) - action #111578: Recover openqaworker-arm-4/5 after "bricking" in #110545 size:M (Resolved, nicksinger)

Actions #1

Updated by okurz over 2 years ago

  • Project changed from openQA Project (public) to openQA Infrastructure (public)
  • Category deleted (Regressions/Crashes)
Actions #2

Updated by livdywan over 2 years ago

  • Subject changed from Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try to Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by mkittler over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #4

Updated by mkittler over 2 years ago

I've now added kernel parameters that we also have on the o3 worker aarch64:

martchus@openqaworker-arm-4:~> cat /proc/cmdline
BOOT_IMAGE=/boot/Image-5.3.18-150300.59.63-default root=UUID=be776b2a-53e6-458c-9ab6-c35b63e4a834 console=tty0 console=ttyAMA0,115200 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 enforcing=0

So now mitigations are disabled and huge pages are enabled similarly to aarch64.
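
A minimal sketch for double-checking such a command line after a reboot. It parses the string shown above and computes how much memory the huge page settings reserve. The function names are made up for illustration; on a real machine the string would come from /proc/cmdline instead of the constant.

```python
# Kernel command line copied from the comment above.
CMDLINE = (
    "BOOT_IMAGE=/boot/Image-5.3.18-150300.59.63-default "
    "root=UUID=be776b2a-53e6-458c-9ab6-c35b63e4a834 console=tty0 "
    "console=ttyAMA0,115200 nospec kvm.nested=1 kvm_intel.nested=1 "
    "kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M mitigations=off "
    "default_hugepagesz=1G hugepagesz=1G hugepages=64 enforcing=0"
)

UNIT_BYTES = {"K": 2**10, "M": 2**20, "G": 2**30}


def parse_cmdline(cmdline):
    """Map each parameter name to the list of values it was given.

    Bare flags such as 'nospec' get an empty string as value.
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params.setdefault(name, []).append(value)
    return params


def reserved_hugepage_bytes(params):
    """Memory reserved at boot: hugepagesz (e.g. '1G') times hugepages."""
    size = params["hugepagesz"][-1]
    count = int(params["hugepages"][-1])
    return int(size[:-1]) * UNIT_BYTES[size[-1].upper()] * count


params = parse_cmdline(CMDLINE)
print(params["mitigations"])                     # ['off']
print(reserved_hugepage_bytes(params) // 2**30)  # 64
```

With these settings 64 GiB of RAM are set aside for 1G huge pages at boot.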

I've been cloning the last 100 passing jobs from OSD to see whether it makes a difference: https://openqa.suse.de/tests/overview?build=test-arm4-3
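
The exact commands used for cloning are not recorded here. As a rough sketch under stated assumptions, command lines of the following shape could be generated with openqa-clone-job. The job IDs are placeholders, and the assumption that the worker can be selected with WORKER_CLASS=openqaworker-arm-4 is not confirmed by this ticket. The helper only builds the command lines and does not run anything.

```python
import shlex


def clone_commands(job_ids, build, worker_class, host="https://openqa.suse.de"):
    """Return one openqa-clone-job command line per job ID (nothing is executed)."""
    commands = []
    for job_id in job_ids:
        argv = [
            "openqa-clone-job",
            "--within-instance", host,  # clone within the same openQA instance
            "--skip-chained-deps",      # do not re-run chained parent jobs
            str(job_id),
            f"BUILD={build}",           # groups the clones on one overview page
            f"WORKER_CLASS={worker_class}",
            "_GROUP=0",                 # keep the clones out of the job groups
        ]
        commands.append(shlex.join(argv))
    return commands


# Placeholder job IDs, not taken from the ticket.
for command in clone_commands([1234567, 1234568], "test-arm4-3", "openqaworker-arm-4"):
    print(command)
```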

Actions #6

Updated by openqa_review over 2 years ago

  • Due date set to 2022-06-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by mkittler over 2 years ago

The overall fail rate is still quite high:

openqa=> with test_jobs as (select distinct id, state, result from jobs where build = 'test-arm4-3') select state, result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.state, test_jobs.result order by ratio desc;
 state |     result      |         ratio          
-------+-----------------+------------------------
 done  | passed          |    59.1836734693877551
 done  | failed          |    24.4897959183673469
 done  | parallel_failed |    15.6462585034013605
 done  | softfailed      | 0.68027210884353741497
(4 rows)

I've updated the previous comment. I guess the number of typing issues is still too high to consider using mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 kernel parameters an improvement.
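
For reference, the ratios above can be reproduced outside the database. The job counts in this sketch are not stated in the ticket; they are inferred from the percentages, which are consistent with 147 jobs in total (for example, 0.680...% is exactly 1/147). Counting failed and parallel_failed together is an assumption about what "fail rate" means here.

```python
# Job counts per result; inferred from the percentages above (147 jobs in
# total), not copied from the database.
COUNTS = {"passed": 87, "failed": 36, "parallel_failed": 23, "softfailed": 1}


def ratios(counts):
    """Percentage of all jobs for each result, like the SQL query computes."""
    total = sum(counts.values())
    return {result: 100.0 * n / total for result, n in counts.items()}


def fail_rate(counts, failing=("failed", "parallel_failed")):
    """Combined percentage of the results treated as failures."""
    total = sum(counts.values())
    return 100.0 * sum(counts[result] for result in failing) / total


for result, ratio in ratios(COUNTS).items():
    print(f"{result:16} {ratio:6.2f}")
print(f"{'failing overall':16} {fail_rate(COUNTS):6.2f}")
```

Under that reading roughly 40% of the cloned jobs did not pass.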

Actions #8

Updated by mkittler over 2 years ago

I have now disabled progdevfreq in the firmware (after previously only disabling progcpufreq). I'm not sure how I'd disable hardware threading in the firmware (as mentioned in the suggestions). The same goes for using a single socket instead of dual sockets.

So that's what the current firmware settings are:

CAVM_CN99xx# env save
drivername snor
snor_erase: off=0x3ff0000, len=0x10000

-----------------------------------
       ENV Variable Settings 
-----------------------------------
Name                  : Value 
-----------------------------------
turbo                 : 2 
smt                   : 4 
corefreq              : 2199 
numcores              : 32 
icispeed              : 1 
socnclk               : 666 
socsclk               : 1199 
memclk                : 2199 
ddrspeed_auto         : 1 
ddrspeed              : 2400 
progcpufreq           : 0 
progdevfreq           : 0 
dmc_node_channel_mask : 0000ffff 
thermcontrol          : 1 
thermlimit            : 110 
enter_debug_shell     : 0 
dbg_speed_up_ddr_lvl  : 0 
enable_dram_scrub     : 0 
ipmbcontrol           : 1
ddr_dmt_advanced      : 0 
cppccontrol           : 0
loglevel              : 0
uart_params           : 115200/8-N-1 none
core_feature_mask     : 0
sys_feature_mask      : 0x00000000
ddr_refresh_rate      : 1
fw_feature_mask       : 0x00000000
dram_ce_threshold     : 1
dram_ce_step_threshold: 0
dram_ce_record_max    : 10
dram_ce_window        : 60 sec
dram_ce_leak_rate     : 2000 msec/error
pcie_ce_threshold     : 1
pcie_ce_window        : 30 sec
pcie_ce_leak_rate     : 15000 msec/error
-----------------------------------
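
When changing settings one by one, it may help to compare such a dump against the values that were intended. The following is a minimal sketch, assuming the dump was captured as text from the serial console; the function names are made up and only an excerpt of the table is embedded.

```python
# Excerpt of the "env save" output shown above.
ENV_DUMP = """\
-----------------------------------
       ENV Variable Settings
-----------------------------------
Name                  : Value
-----------------------------------
turbo                 : 2
smt                   : 4
progcpufreq           : 0
progdevfreq           : 0
uart_params           : 115200/8-N-1 none
dram_ce_step_threshold: 0
-----------------------------------
"""


def parse_env(dump):
    """Turn 'name : value' lines into a dict, skipping rulers and headers."""
    settings = {}
    for line in dump.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if not sep or not name or name == "Name" or name.startswith("-"):
            continue
        settings[name] = value
    return settings


def deviations(settings, wanted):
    """Return {name: (actual, wanted)} for every setting that differs."""
    return {
        name: (settings.get(name), value)
        for name, value in wanted.items()
        if settings.get(name) != value
    }


settings = parse_env(ENV_DUMP)
print(deviations(settings, {"progcpufreq": "0", "progdevfreq": "0"}))  # {}
```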

Btw, I've just found: https://en.opensuse.org/HCL:ThunderX2 - Somehow I doubt these are "the best processors".

Actions #9

Updated by mkittler over 2 years ago

I invoked cvmrundiag in the hope it would maybe print something useful. However, it left the system in a broken state where neither power cycle nor power reset help. I'm currently trying a factory reset (hopefully preserving most of the settings for authentication/IPMI).

Actions #10

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Feedback

The worker is still not working. When resetting the power, only the following is printed:

Rom...
CRC: len=0xf080, cal=0x27ff5de9, img=0x27ff5de9, match!

Loading from boot device SPI NOR 
Header:
  000|0x23ffdc0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  010|0x23ffdd0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  020|0x23ffde0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  030|0x23ffdf0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  040|0x23ffe00:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  050|0x23ffe10:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  060|0x23ffe20:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  070|0x23ffe30:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  080|0x23ffe40:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  090|0x23ffe50:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0A0|0x23ffe60:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0B0|0x23ffe70:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0C0|0x23ffe80:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0D0|0x23ffe90:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0E0|0x23ffea0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0F0|0x23ffeb0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  100|0x23ffec0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  110|0x23ffed0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  120|0x23ffee0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  130|0x23ffef0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  140|0x23fff00:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  150|0x23fff10:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  160|0x23fff20:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  170|0x23fff30:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  180|0x23fff40:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  190|0x23fff50:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1A0|0x23fff60:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1B0|0x23fff70:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1C0|0x23fff80:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1D0|0x23fff90:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1E0|0x23fffa0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1F0|0x23fffb0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 

It likely needs further investigation on-site.
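
To tell this state apart from a dump that contains at least some non-zero bytes, a captured serial log could be checked automatically. This is a minimal sketch, assuming the log is available as text and that dump lines keep the 'OFFSET|ADDRESS: bytes' shape shown above; the function names are made up.

```python
def header_bytes(console_output):
    """Collect the bytes from dump lines of the form 'OFFSET|ADDRESS: xx xx ...'."""
    data = bytearray()
    for line in console_output.splitlines():
        prefix, sep, hexdump = line.partition(":")
        if sep and "|" in prefix:  # skips lines like 'Header:' and 'CRC: ...'
            data += bytes.fromhex(hexdump)
    return bytes(data)


def header_is_blank(console_output):
    """True if a header dump was found and every byte in it is zero."""
    data = header_bytes(console_output)
    return len(data) > 0 and not any(data)


# First two dump lines from the output above.
SAMPLE = """\
Loading from boot device SPI NOR
Header:
  000|0x23ffdc0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
  010|0x23ffdd0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
"""
print(header_is_blank(SAMPLE))  # True
```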

@okurz Maybe you want to take over, at least for trying to recover it on Friday?

Actions #11

Updated by okurz over 2 years ago

  • Tags set to next-office-day
  • Status changed from Feedback to Workable
  • Assignee changed from mkittler to okurz

we will look into this.

Actions #12

Updated by okurz over 2 years ago

found https://www.gigabyte.com/Enterprise/ARM-Server/R181-T92-rev-100#Support-Bios . Downloaded server_system_boot_mt91-fsx_f34.zip, extracted from that "image.RBU" and flashed that over https://ipmi.openqaworker-arm-4.qa.suse.de/#maintenance/firmware_update_wizard selecting "Update Type: BIOS". There could be options for CPDLD and BMC itself it seems.

Updated, system behaves the same as reported in https://progress.opensuse.org/issues/110545#note-10 . I wonder how we can configure boot devices.

I saved all configuration from BMC to a local ZIP file. Now restoring factory defaults. This preserves all entries listed in a checklist, so if this has no effect we likely need to configure the system to not preserve that much, then try again.

No effect, same behaviour. Configured on https://ipmi.openqaworker-arm-4.qa.suse.de/#maintenance/preserve_configuration to not preserve anything except IPMI(+network),Authentication so that our remote access password should be preserved.

Actions #13

Updated by okurz over 2 years ago

Conducted a "factory reset", no difference. Then, with nsinger as my witness, I connected to https://ipmi.openqaworker-arm-5.qa.suse.de/ and opened the remote control "h5viewer" (likely an HTML5 viewer, which actually looks quite nice and usable). It showed a getty session with some Linux messages on the screen, and the serial console showed a getty as well, so the machine was working nicely. Then I triggered a "power reset": the machine started to reboot and the output looked similar to #110545#note-10, but with some "ff ff" or other non-zero bytes included. Then I triggered a "power cycle" while the system was still booting, to resemble what mkittler reported he did on arm-4, and we actually ended up with the same symptoms: the system does not boot anymore and "Header" shows only zeroes.

Actions #14

Updated by okurz over 2 years ago

  • Copied to action #111578: Recover openqaworker-arm-4/5 after "bricking" in #110545 size:M added
Actions #15

Updated by okurz over 2 years ago

  • Status changed from Workable to Blocked

nsinger, mkittler and I tried to recover both openqaworker-arm-4/5 and so far have not succeeded. I don't think there is anything useful we could do by being physically in the server room, but of course we can still try to hook up a local VGA monitor or something. I suggest we continue in a specific "recover" ticket so that we are not polluting this ticket with more recovery-specific information: #111578

Actions #16

Updated by mkittler over 2 years ago

  • Due date deleted (2022-06-04)
Actions #17

Updated by okurz over 2 years ago

  • Tags deleted (next-office-day)
Actions #18

Updated by okurz over 2 years ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

back to try further stuff after we could recover both machines with a complete cold power cycle, see #111578

Actions #19

Updated by okurz over 2 years ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

let's wait for moving those machines to the Nbg TAM lab: #114604

Actions #20

Updated by livdywan over 2 years ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

The blocker is gone (SRV2 suffering from high temperatures)

Actions #21

Updated by livdywan over 2 years ago

  • Description updated (diff)
Actions #22

Updated by okurz over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

Nobody from the team wants to touch these beasts so I will ask OBS team if maybe they want to trade machines.

Actions #23

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #24

Updated by okurz over 2 years ago

  • Status changed from In Progress to Feedback

Asked BuildOPS if they would be interested for a trade, see #110539

Brought up the topic again in https://suse.slack.com/archives/C02CANHLANP/p1663928419759359

Hi. Some months ago we have ordered two ARM machines to be used as openQA workers. Unfortunately for yet unknown reasons these two machines are much less reliable than our older ARM workers. If you are interested or want to help see https://progress.opensuse.org/issues/101048 and all subtasks for the full story. I also asked the BuildOPS team if they would potentially be interested in trading machines. Other than that I see no good path to continue with replacing our aging and unstable ARM workers which are still more stable than the new ones.

EDIT: szarate will have a meeting with afaerber in week 2022-W39 and will bring up this topic as well. Some background info by mawerner: https://confluence.suse.com/display/LEONG/2022-09-19+WG-+ALP%3A+QE+on+Arm

Actions #25

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #26

Updated by okurz about 2 years ago

okurz wrote:

Asked BuildOPS if they would be interested for a trade, see #110539

The answer for #110539 is "No". Awaiting response from szarate in https://suse.slack.com/archives/C02CANHLANP/p1664966773773939

(Oliver Kurz) @Santiago Zarate for https://progress.opensuse.org/issues/110545, what's the result regarding ARM workers discussion with afaerber?

Actions #27

Updated by okurz about 2 years ago

There was no update by szarate yet so I reminded them in
https://suse.slack.com/archives/C02CANHLANP/p1665563709438689?thread_ts=1664966773.773939&cid=C02CANHLANP

@Santiago Zarate still missing the update from above, plz

Actions #29

Updated by okurz about 2 years ago

  • Description updated (diff)
  • Status changed from Feedback to New
  • Assignee deleted (okurz)
  • Priority changed from High to Normal
  • Target version changed from Ready to future

Thanks. So additional information and additional ideas have been provided. I updated the description of the ticket about the suggestions that are still open to be tried. This could be done by anyone with access to the team, i.e. within SUSE. The SUSE QE Tools team does currently not plan to try any further.

@szarate if you or anyone within QE-Core would like to go on testing the stability of the machines I would appreciate that a lot. The invitation is to everybody with access to the machine. We can provide support in getting access and starting experiments.

Actions #31

Updated by okurz almost 2 years ago

  • Tags set to infra
Actions #32

Updated by okurz over 1 year ago

  • Target version changed from future to Ready
Actions #33

Updated by okurz over 1 year ago

  • Status changed from New to Workable
Actions #34

Updated by okurz over 1 year ago

  • Target version changed from Ready to future

I honestly don't remember anymore why two months ago I added the ticket back to the backlog without a comment. It might actually have been a mistake. #110545-29 is still the most recent and valid state. I consider it unfortunate that so far nobody could find clear requirements for what a machine needs to fulfill to be able to run stable openQA tests.
