action #110545

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

openQA Project - coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try size:M

Added by okurz about 2 months ago. Updated 6 days ago.

Status:
Blocked
Priority:
High
Assignee:
Target version:
Start date:
2022-05-02
Due date:
% Done:

0%

Estimated time:

Description

Motivation

See parent #101048. In #109232#note-5 ggardet_arm gave some additional hints that we could try. We should try all of them and run tests as mkittler did in #109232.

Acceptance criteria

  • AC1: All concrete ideas have been tried and openQA tests have been executed with a statement regarding stability

Suggestions

  • Remind mkittler that he should always write down the commands he used in tickets, as otherwise his colleagues will ask him anyway what he did in #109232 to run openQA tests ;)
  • Change the parameters on the systems as written in #109232#note-5, one by one or in combination, rerun the tests and gather stability figures
  • Come up with final assessment

Concrete ideas to try out

  • Disable mitigation (KPTI, etc.)
  • Enable/disable huge pages
  • Disable hardware threading in firmware (it will lower the number of CPUs seen by the kernel)
  • Check actual CPU frequency
  • Check temperature (CPU throttling could lower the CPU frequency and reduce performance)
  • Use single socket instead of dual sockets (may be configurable in the firmware)
  • Use a distribution without LSE-atomics (known to be slow on TX2)
  • You can also run sudo perf stat while the system is busy with openQA tests
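
The frequency and temperature checks from the list above could be scripted roughly as follows. This is a minimal sketch, assuming the workers expose the standard Linux sysfs interfaces (`cpufreq/scaling_cur_freq` in kHz and `thermal_zone*/temp` in millidegrees Celsius); whether the ThunderX2 firmware and kernel actually provide these files on arm-4/5 is not confirmed in this ticket. The function names and the `sysfs_root` parameter are hypothetical.

```python
from pathlib import Path


def read_cpu_freqs_mhz(sysfs_root="/sys"):
    """Return {cpu_name: MHz} read from cpufreq's scaling_cur_freq (kHz)."""
    freqs = {}
    cpu_dir = Path(sysfs_root) / "devices/system/cpu"
    for path in sorted(cpu_dir.glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        try:
            freqs[path.parent.parent.name] = int(path.read_text().strip()) / 1000
        except (OSError, ValueError):
            continue  # skip CPUs whose frequency cannot be read
    return freqs


def read_temperatures_c(sysfs_root="/sys"):
    """Return {zone_name: degrees Celsius} read from thermal zones."""
    temps = {}
    thermal_dir = Path(sysfs_root) / "class/thermal"
    for path in sorted(thermal_dir.glob("thermal_zone*/temp")):
        try:
            temps[path.parent.name] = int(path.read_text().strip()) / 1000
        except (OSError, ValueError):
            continue  # some zones refuse reads; ignore them
    return temps
```

Calling both functions periodically while openQA jobs are running, and comparing the values against the firmware's `corefreq` setting, would show whether throttling occurs under load. Empty dictionaries mean the corresponding interface is not available on that machine.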

Related issues

Copied to openQA Infrastructure - action #111578: Recover openqaworker-arm-4/5 after "bricking" in #110545 (Blocked)

History

#1 Updated by okurz about 2 months ago

  • Project changed from openQA Project to openQA Infrastructure
  • Category deleted (Concrete Bugs)

#2 Updated by cdywan about 2 months ago

  • Subject changed from Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try to Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try size:M
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by mkittler about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

#4 Updated by mkittler about 1 month ago

I've now added kernel parameters that we also have on the o3 worker aarch64:

martchus@openqaworker-arm-4:~> cat /proc/cmdline
BOOT_IMAGE=/boot/Image-5.3.18-150300.59.63-default root=UUID=be776b2a-53e6-458c-9ab6-c35b63e4a834 console=tty0 console=ttyAMA0,115200 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 enforcing=0

So now mitigations are disabled and huge pages are enabled similarly to aarch64.
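
To double-check that the intended parameters are really active after a reboot, the command line can be parsed and compared against an expected set. This is a minimal sketch; `parse_cmdline`, `check_expected` and `EXPECTED` are hypothetical names, and the expected values are simply the ones from the `/proc/cmdline` output above.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        # Repeated names (e.g. console=) keep the last value only.
        params[name] = value if sep else None
    return params


def check_expected(params, expected):
    """Return {name: (expected, actual)} for every parameter that differs."""
    return {
        name: (value, params.get(name))
        for name, value in expected.items()
        if params.get(name) != value
    }


# Assumption: the parameters this comment cares about, taken from the output above.
EXPECTED = {
    "mitigations": "off",
    "default_hugepagesz": "1G",
    "hugepagesz": "1G",
    "hugepages": "64",
}
```

On a worker this could be used as `check_expected(parse_cmdline(open("/proc/cmdline").read()), EXPECTED)`; an empty result means all four parameters are set as intended.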

I've been cloning the last 100 passing jobs from OSD to see whether it makes a difference: https://openqa.suse.de/tests/overview?build=test-arm4-3

#6 Updated by openqa_review about 1 month ago

  • Due date set to 2022-06-04

Setting due date based on mean cycle time of SUSE QE Tools

#7 Updated by mkittler about 1 month ago

The overall fail rate is still quite high:

openqa=> with test_jobs as (select distinct id, state, result from jobs where build = 'test-arm4-3') select state, result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.state, test_jobs.result order by ratio desc;
 state |     result      |         ratio          
-------+-----------------+------------------------
 done  | passed          |    59.1836734693877551
 done  | failed          |    24.4897959183673469
 done  | parallel_failed |    15.6462585034013605
 done  | softfailed      | 0.68027210884353741497
(4 rows)

I've updated the previous comment. I guess the number of typing issues is still too high to consider the kernel parameters mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 an improvement.
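
For comparing several test builds outside of psql, the same ratio can be computed from a plain list of job results. This is a minimal sketch; `result_ratios` is a hypothetical helper, and the sample counts (87, 36, 23 and 1 out of 147 jobs) are back-computed from the percentages printed above, so they are an assumption rather than figures read from the database.

```python
from collections import Counter


def result_ratios(jobs):
    """Return [((state, result), percent), ...] sorted by percent, highest first.

    `jobs` holds one (state, result) tuple per job, i.e. already de-duplicated
    by job id as the SQL query does with `select distinct`.
    """
    counts = Counter(jobs)
    total = sum(counts.values())
    ratios = [(key, 100 * n / total) for key, n in counts.items()]
    return sorted(ratios, key=lambda item: item[1], reverse=True)


# Assumption: counts reconstructed from the percentages, not queried from OSD.
SAMPLE_JOBS = (
    [("done", "passed")] * 87
    + [("done", "failed")] * 36
    + [("done", "parallel_failed")] * 23
    + [("done", "softfailed")] * 1
)
```

With these counts, `result_ratios(SAMPLE_JOBS)` reproduces the table above, and summing the `failed` and `parallel_failed` rows gives the share of jobs that did not pass.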

#8 Updated by mkittler about 1 month ago

I now disabled progdevfreq in the firmware (after previously only disabling progcpufreq). Not sure how I'd disable hardware threading in firmware (as mentioned in suggestions). The same goes for using a single socket instead of dual sockets.

So that's what the current firmware settings are:

CAVM_CN99xx# env save
drivername snor
snor_erase: off=0x3ff0000, len=0x10000

-----------------------------------
       ENV Variable Settings 
-----------------------------------
Name                  : Value 
-----------------------------------
turbo                 : 2 
smt                   : 4 
corefreq              : 2199 
numcores              : 32 
icispeed              : 1 
socnclk               : 666 
socsclk               : 1199 
memclk                : 2199 
ddrspeed_auto         : 1 
ddrspeed              : 2400 
progcpufreq           : 0 
progdevfreq           : 0 
dmc_node_channel_mask : 0000ffff 
thermcontrol          : 1 
thermlimit            : 110 
enter_debug_shell     : 0 
dbg_speed_up_ddr_lvl  : 0 
enable_dram_scrub     : 0 
ipmbcontrol           : 1
ddr_dmt_advanced      : 0 
cppccontrol           : 0
loglevel              : 0
uart_params           : 115200/8-N-1 none
core_feature_mask     : 0
sys_feature_mask      : 0x00000000
ddr_refresh_rate      : 1
fw_feature_mask       : 0x00000000
dram_ce_threshold     : 1
dram_ce_step_threshold: 0
dram_ce_record_max    : 10
dram_ce_window        : 60 sec
dram_ce_leak_rate     : 2000 msec/error
pcie_ce_threshold     : 1
pcie_ce_window        : 30 sec
pcie_ce_leak_rate     : 15000 msec/error
-----------------------------------
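
To compare these settings with those of another worker (for example one of arm-1/2/3), the dump can be parsed into a dictionary and diffed. This is a minimal sketch, assuming the other machine prints its settings in the same "Name : Value" table layout; `parse_env` and `diff_env` are hypothetical helpers, not part of any firmware tooling.

```python
import re

# One "name : value" row; trailing whitespace around the value is dropped.
_ENTRY = re.compile(r"^\s*(\w+)\s*:\s*(.*?)\s*$")


def parse_env(dump):
    """Parse the firmware settings table into {name: value} (values as strings)."""
    settings = {}
    in_table = False
    for line in dump.splitlines():
        match = _ENTRY.match(line)
        if not match:
            continue  # separators, prompts and other non-table lines
        name, value = match.groups()
        if not in_table:
            # Ignore everything before the "Name : Value" header row.
            in_table = (name, value) == ("Name", "Value")
            continue
        settings[name] = value
    return settings


def diff_env(a, b):
    """Return {name: (value_in_a, value_in_b)} for settings that differ."""
    names = sorted(a.keys() | b.keys())
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}
```

Feeding the saved console output of two machines through `parse_env` and then `diff_env` would list only the settings that differ, which keeps the comparison manageable when changing one parameter at a time.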

Btw, I've just found: https://en.opensuse.org/HCL:ThunderX2 - Somehow I doubt these are "the best processors".

#9 Updated by mkittler about 1 month ago

I invoked cvmrundiag in the hope it would maybe print something useful. However, it left the system in a broken state where neither a power cycle nor a power reset helps. I'm currently trying a factory reset (hopefully preserving most of the settings for authentication/IPMI).

#10 Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

The worker is still not working. When resetting the power, only the following is printed:

Rom...
CRC: len=0xf080, cal=0x27ff5de9, img=0x27ff5de9, match!

Loading from boot device SPI NOR 
Header:
  000|0x23ffdc0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  010|0x23ffdd0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  020|0x23ffde0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  030|0x23ffdf0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  040|0x23ffe00:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  050|0x23ffe10:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  060|0x23ffe20:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  070|0x23ffe30:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  080|0x23ffe40:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  090|0x23ffe50:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0A0|0x23ffe60:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0B0|0x23ffe70:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0C0|0x23ffe80:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0D0|0x23ffe90:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0E0|0x23ffea0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  0F0|0x23ffeb0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  100|0x23ffec0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  110|0x23ffed0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  120|0x23ffee0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  130|0x23ffef0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  140|0x23fff00:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  150|0x23fff10:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  160|0x23fff20:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  170|0x23fff30:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  180|0x23fff40:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  190|0x23fff50:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1A0|0x23fff60:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1B0|0x23fff70:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1C0|0x23fff80:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1D0|0x23fff90:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1E0|0x23fffa0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
  1F0|0x23fffb0:   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 

It likely needs further investigation on-site.

okurz Maybe you want to take over, at least for trying to recover it on Friday?

#11 Updated by okurz about 1 month ago

  • Tags set to next-office-day
  • Status changed from Feedback to Workable
  • Assignee changed from mkittler to okurz

We will look into this.

#12 Updated by okurz about 1 month ago

Found https://www.gigabyte.com/Enterprise/ARM-Server/R181-T92-rev-100#Support-Bios . Downloaded server_system_boot_mt91-fsx_f34.zip, extracted "image.RBU" from it and flashed that via https://ipmi.openqaworker-arm-4.qa.suse.de/#maintenance/firmware_update_wizard , selecting "Update Type: BIOS". It seems there could also be options for CPDLD and the BMC itself.

Updated, system behaves the same as reported in https://progress.opensuse.org/issues/110545#note-10 . I wonder how we can configure boot devices.

I saved all configuration from the BMC to a local ZIP file. Now restoring factory defaults. This preserves all entries listed in a checklist, so if this has no effect we likely need to configure the system to not preserve that much, then try again.

No effect, same behaviour. Configured on https://ipmi.openqaworker-arm-4.qa.suse.de/#maintenance/preserve_configuration to not preserve anything except IPMI(+network),Authentication so that our remote access password should be preserved.

#13 Updated by okurz about 1 month ago

Conducted a "factory reset", no difference. Then, with nsinger as my witness, I connected to https://ipmi.openqaworker-arm-5.qa.suse.de/ and opened the remote control "h5viewer" (likely an HTML5 viewer, which actually looks quite nice and usable). It showed a getty session with some Linux messages on the screen, and the serial console showed a getty as well, so the machine was working nicely. Then I triggered a "power reset": the machine started a reboot and the output looked similar to #110545#note-10 but with some "ff ff" or other non-zero values included. Then I triggered a "power cycle" (while the system was still booting) to resemble what mkittler reported he did on arm-4, and we actually ended up with the same symptoms: the system does not boot anymore and "Header" shows only zeroes.

#14 Updated by okurz about 1 month ago

  • Copied to action #111578: Recover openqaworker-arm-4/5 after "bricking" in #110545 added

#15 Updated by okurz about 1 month ago

  • Status changed from Workable to Blocked

nsinger, mkittler and I tried to recover both openqaworker-arm-4/5 and so far have not succeeded. I don't think there is anything useful we could do when being physically in the server room, but of course we can still try to hook up a local VGA monitor or something. I suggest we continue in a specific "recover" ticket so that we are not polluting this ticket with more recovery-specific information: #111578

#16 Updated by mkittler about 1 month ago

  • Due date deleted (2022-06-04)

#17 Updated by okurz 6 days ago

  • Tags deleted (next-office-day)
