action #110545
openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
openQA Project - coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try size:M
0%
Description
Motivation
See parent #101048. In #109232#note-5 ggardet_arm gave some additional hints that we could try. We should try all of them and run tests as mkittler did in #109232.
Acceptance criteria
- AC1: All concrete ideas have been tried and openQA tests have been executed with a statement regarding stability
Suggestions
- Remind mkittler that he should always write down the commands he used in tickets, as otherwise his colleagues will ask him anyway what he did in #109232 to run openQA tests ;)
  - See my notes on exporting job IDs via psql: https://github.com/Martchus/openQA-helper#useful-sql-queries=
- Change the parameters on the systems as written in #109232#note-5 , one by one or in combination, reconduct tests and gather stability figures
- Come up with final assessment
Concrete ideas to try out
- Disable mitigation (KPTI, etc.)
  - Use kernel parameter mitigations=off (see https://www.kernel.org/doc/html/v5.15-rc1/admin-guide/kernel-parameters.html)
- Enable/disable huge pages
- Disable hardware threading in firmware (it will lower the number of CPUs seen by the kernel)
- Check actual CPU frequency
- Check temperature (CPU throttling could lower the CPU frequency and thereby reduce performance)
- Use single socket instead of dual sockets (may be configurable in the firmware)
- Use a distribution without LSE-atomics (known to be slow on TX2)
- You can also run sudo perf stat while the system is busy with openQA tests
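For the "Check actual CPU frequency" and "Check temperature" items, a minimal sketch could read the standard Linux sysfs files. This assumes the worker exposes `cpufreq` and `thermal_zone` entries under `/sys`, which may not be the case with every firmware configuration; the function names and the `sysfs_root` parameter are hypothetical and only there so the code can be tried against a fake directory tree.

```python
from pathlib import Path


def read_cpu_freqs_khz(sysfs_root="/sys"):
    """Return {cpu name: current frequency in kHz} for CPUs exposing cpufreq."""
    freqs = {}
    cpu_dir = Path(sysfs_root) / "devices/system/cpu"
    for path in sorted(cpu_dir.glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        try:
            # path is .../cpuN/cpufreq/scaling_cur_freq, so the CPU name is two levels up
            freqs[path.parent.parent.name] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or malformed entries
    return freqs


def read_temperatures_celsius(sysfs_root="/sys"):
    """Return {thermal zone name: temperature in degrees Celsius}."""
    temps = {}
    thermal_dir = Path(sysfs_root) / "class/thermal"
    for path in sorted(thermal_dir.glob("thermal_zone*/temp")):
        try:
            # the kernel reports thermal zone temperatures in millidegrees Celsius
            temps[path.parent.name] = int(path.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue
    return temps


if __name__ == "__main__":
    for cpu, khz in read_cpu_freqs_khz().items():
        print(f"{cpu}: {khz / 1000:.0f} MHz")
    for zone, celsius in read_temperatures_celsius().items():
        print(f"{zone}: {celsius:.1f} C")
```

Sampling these values repeatedly while openQA jobs are running would show whether the frequency drops as the temperature rises.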
Related issues
History
#1
Updated by okurz about 2 months ago
- Project changed from openQA Project to openQA Infrastructure
- Category deleted (Concrete Bugs)
#2
Updated by cdywan about 2 months ago
- Subject changed from Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try to Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try size:M
- Description updated (diff)
- Status changed from New to Workable
#3
Updated by mkittler about 1 month ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
#4
Updated by mkittler about 1 month ago
I've now added the kernel parameters that we also have on the o3 aarch64 worker:
martchus@openqaworker-arm-4:~> cat /proc/cmdline
BOOT_IMAGE=/boot/Image-5.3.18-150300.59.63-default root=UUID=be776b2a-53e6-458c-9ab6-c35b63e4a834 console=tty0 console=ttyAMA0,115200 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 enforcing=0
So now mitigations are disabled and huge pages are enabled, as on the o3 aarch64 worker.
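To double-check after a reboot that the intended parameters are really active, the command line can be parsed and compared against the expected values. The following is only a sketch; `parse_cmdline`, `missing_params` and the `EXPECTED` mapping are hypothetical names, and repeated keys such as `console=` simply keep the last value.

```python
# Parameters from this comment that are supposed to be active (assumption:
# these four are the ones that matter for the stability comparison).
EXPECTED = {
    "mitigations": "off",
    "default_hugepagesz": "1G",
    "hugepagesz": "1G",
    "hugepages": "64",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to True."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def missing_params(cmdline, expected=EXPECTED):
    """Return the expected parameters that are absent or have another value."""
    params = parse_cmdline(cmdline)
    return {k: v for k, v in expected.items() if params.get(k) != v}


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print(missing_params(f.read()) or "all expected parameters are set")
```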
I've been cloning the last 100 passing jobs from OSD to see whether it makes a difference: https://openqa.suse.de/tests/overview?build=test-arm4-3
#5
Updated by mkittler about 1 month ago
A short review of the test results we've got so far already reveals quite a lot of typing issues.
Typing issues:
- https://openqa.suse.de/tests/8798813#step/setup/82
- https://openqa.suse.de/tests/8798628#step/before_test/18
- https://openqa.suse.de/tests/8799018#step/before_test/10
- https://openqa.suse.de/tests/8799075#step/installation_overview/3
- https://openqa.suse.de/tests/8798877#step/force_scheduled_tasks/8
- https://openqa.suse.de/tests/8798697#step/barrier_init/15
- https://openqa.suse.de/tests/8798644#step/setup/73
- https://openqa.suse.de/tests/8800122#step/snapper_thin_lvm/12
- https://openqa.suse.de/tests/8800121#step/consoletest_setup/71
- https://openqa.suse.de/tests/8799373#step/pam_su/6
Possibly typing issue:
Other failures / not sure about the cause:
- https://openqa.suse.de/tests/8798760#step/consoletest_setup/8
- https://openqa.suse.de/tests/8798752#step/iscsi_client/22
- https://openqa.suse.de/tests/8799090#step/hostname_inst/6
- https://openqa.suse.de/tests/8798979#step/scc_registration/6
- https://openqa.suse.de/tests/8799369#step/command_not_found/18
- https://openqa.suse.de/tests/8799133#step/update_kernel/87
- https://openqa.suse.de/tests/8799127#step/madvise06/7
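When comparing such lists between test runs, it can help to break the step URLs into job ID, test module and step number and count failures per module. This is a rough sketch that only assumes the URL shape visible in the lists above; `FailedStep`, `parse_steps`, `count_by_module` and `job_ids` are hypothetical helpers, not part of openQA.

```python
import re
from collections import Counter
from typing import NamedTuple

# URL shape as it appears in this comment: .../tests/<job id>#step/<module>/<step>
STEP_URL = re.compile(r"https://openqa\.suse\.de/tests/(\d+)#step/([^/\s]+)/(\d+)")


class FailedStep(NamedTuple):
    job_id: int
    module: str
    step: int


def parse_steps(lines):
    """Extract a FailedStep from every line that contains a step URL."""
    steps = []
    for line in lines:
        match = STEP_URL.search(line)
        if match:
            steps.append(FailedStep(int(match.group(1)), match.group(2), int(match.group(3))))
    return steps


def count_by_module(steps):
    """Count how many failed steps fall into each test module."""
    return Counter(step.module for step in steps)


def job_ids(steps):
    """Return the sorted, de-duplicated job IDs, e.g. for re-cloning the jobs."""
    return sorted({step.job_id for step in steps})
```

Feeding it the "Typing issues" list above would, for instance, show that `setup` and `before_test` each appear twice.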
#6
Updated by openqa_review about 1 month ago
- Due date set to 2022-06-04
Setting due date based on mean cycle time of SUSE QE Tools
#7
Updated by mkittler about 1 month ago
The overall fail rate is still quite high:
openqa=> with test_jobs as (select distinct id, state, result from jobs where build = 'test-arm4-3')
         select state, result, count(id) * 100. / (select count(id) from test_jobs) as ratio
         from test_jobs group by test_jobs.state, test_jobs.result order by ratio desc;
 state |     result      |         ratio
-------+-----------------+------------------------
 done  | passed          |    59.1836734693877551
 done  | failed          |    24.4897959183673469
 done  | parallel_failed |    15.6462585034013605
 done  | softfailed      | 0.68027210884353741497
(4 rows)
I've updated the previous comment. I guess the number of typing issues is still too high to consider the mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 kernel parameters an improvement.
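The same percentages can be recomputed offline from exported job data, which makes it easier to compare several builds side by side. The sketch below mirrors the grouping of the query above; it assumes the input already contains exactly one `(state, result)` pair per job, and `result_ratios` is a hypothetical helper name.

```python
from collections import Counter


def result_ratios(jobs):
    """Return (state, result, percentage) tuples sorted by percentage, descending.

    `jobs` is an iterable of (state, result) pairs, one pair per job.
    """
    counts = Counter(jobs)
    total = sum(counts.values())
    if total == 0:
        return []
    rows = [(state, result, 100.0 * n / total) for (state, result), n in counts.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the database.
    sample = [("done", "passed")] * 3 + [("done", "failed")]
    for state, result, ratio in result_ratios(sample):
        print(f"{state:5} | {result:15} | {ratio:.2f}")
```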
#8
Updated by mkittler about 1 month ago
I have now disabled progdevfreq in the firmware (after previously only disabling progcpufreq). I'm not sure how I'd disable hardware threading in firmware (as mentioned in the suggestions). The same goes for using a single socket instead of dual sockets.
So that's what the current firmware settings are:
CAVM_CN99xx# env save
drivername snor
snor_erase: off=0x3ff0000, len=0x10000
-----------------------------------
       ENV Variable Settings
-----------------------------------
Name                   : Value
-----------------------------------
turbo                  : 2
smt                    : 4
corefreq               : 2199
numcores               : 32
icispeed               : 1
socnclk                : 666
socsclk                : 1199
memclk                 : 2199
ddrspeed_auto          : 1
ddrspeed               : 2400
progcpufreq            : 0
progdevfreq            : 0
dmc_node_channel_mask  : 0000ffff
thermcontrol           : 1
thermlimit             : 110
enter_debug_shell      : 0
dbg_speed_up_ddr_lvl   : 0
enable_dram_scrub      : 0
ipmbcontrol            : 1
ddr_dmt_advanced       : 0
cppccontrol            : 0
loglevel               : 0
uart_params            : 115200/8-N-1 none
core_feature_mask      : 0
sys_feature_mask       : 0x00000000
ddr_refresh_rate       : 1
fw_feature_mask        : 0x00000000
dram_ce_threshold      : 1
dram_ce_step_threshold : 0
dram_ce_record_max     : 10
dram_ce_window         : 60 sec
dram_ce_leak_rate      : 2000 msec/error
pcie_ce_threshold      : 1
pcie_ce_window         : 30 sec
pcie_ce_leak_rate      : 15000 msec/error
-----------------------------------
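To compare the firmware settings of several workers (for example arm-4 against arm-1/2/3), such a dump can be turned into a dictionary and diffed. This sketch assumes the dump is captured with one "name : value" pair per line after the "Name : Value" header, as printed on the console; `parse_env_dump` and `diff_settings` are hypothetical helpers, and nothing here talks to the firmware itself.

```python
def parse_env_dump(text):
    """Parse the 'name : value' lines following the 'Name : Value' header."""
    settings = {}
    in_table = False
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if not sep or not name:
            continue  # separator lines, prompts and blank lines carry no setting
        if name == "Name" and value == "Value":
            in_table = True  # everything before the header is command output
            continue
        if in_table:
            settings[name] = value
    return settings


def diff_settings(a, b):
    """Return {name: (value in a, value in b)} for settings that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

A diff between two machines would directly show values such as `smt`, `numcores`, `progcpufreq` or `progdevfreq` when they differ.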
Btw, I've just found: https://en.opensuse.org/HCL:ThunderX2 - Somehow I doubt these are "the best processors".
#9
Updated by mkittler about 1 month ago
I invoked cvmrundiag in the hope that it would print something useful. However, it left the system in a broken state where neither a power cycle nor a power reset helps. I'm currently trying a factory reset (hopefully preserving most of the settings for authentication/IPMI).
#10
Updated by mkittler about 1 month ago
- Status changed from In Progress to Feedback
The worker is still not working. When resetting the power, only the following is printed:
Rom...
CRC: len=0xf080, cal=0x27ff5de9, img=0x27ff5de9, match!
Loading from boot device SPI NOR
Header:
000|0x23ffdc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010|0x23ffdd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020|0x23ffde0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030|0x23ffdf0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
040|0x23ffe00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050|0x23ffe10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060|0x23ffe20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
070|0x23ffe30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
080|0x23ffe40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
090|0x23ffe50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0A0|0x23ffe60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0B0|0x23ffe70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0C0|0x23ffe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0D0|0x23ffe90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0E0|0x23ffea0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0F0|0x23ffeb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
100|0x23ffec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
110|0x23ffed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
120|0x23ffee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
130|0x23ffef0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
140|0x23fff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
150|0x23fff10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
160|0x23fff20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
170|0x23fff30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
180|0x23fff40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
190|0x23fff50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1A0|0x23fff60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1B0|0x23fff70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1C0|0x23fff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1D0|0x23fff90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1E0|0x23fffa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1F0|0x23fffb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
It likely needs further investigation on-site.
okurz Maybe you want to take over, at least for trying to recover it on Friday?
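When watching several boot attempts over the serial console, the "Header" dump can be checked automatically for the all-zero pattern seen here. The sketch below assumes the log is captured with one dump row per line in the `offset|0xaddress: bytes` form shown above; `parse_header_rows` and `header_is_blank` are hypothetical helpers, and the check says nothing about what a valid header would have to contain.

```python
import re

# One dump row: "<offset>|0x<address>: <hex byte> <hex byte> ..."
ROW = re.compile(r"^([0-9A-Fa-f]+)\|0x([0-9A-Fa-f]+):((?:\s+[0-9A-Fa-f]{2})+)$")


def parse_header_rows(text):
    """Return a list of (offset, address, bytes) for every dump row in the log."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            data = bytes.fromhex("".join(match.group(3).split()))
            rows.append((int(match.group(1), 16), int(match.group(2), 16), data))
    return rows


def header_is_blank(text):
    """True if the log contains dump rows and every dumped byte is zero."""
    rows = parse_header_rows(text)
    return bool(rows) and all(byte == 0 for _, _, data in rows for byte in data)
```

Running `header_is_blank` on each captured boot log would separate boots that show only zeroes from those that still show other bytes.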
#11
Updated by okurz about 1 month ago
- Tags set to next-office-day
- Status changed from Feedback to Workable
- Assignee changed from mkittler to okurz
We will look into this.
#12
Updated by okurz about 1 month ago
Found https://www.gigabyte.com/Enterprise/ARM-Server/R181-T92-rev-100#Support-Bios . Downloaded server_system_boot_mt91-fsx_f34.zip, extracted "image.RBU" from it and flashed that via https://ipmi.openqaworker-arm-4.qa.suse.de/#maintenance/firmware_update_wizard , selecting "Update Type: BIOS". It seems there could also be options for CPDLD and the BMC itself.
Updated, system behaves the same as reported in https://progress.opensuse.org/issues/110545#note-10 . I wonder how we can configure boot devices.
I saved all configuration from the BMC to a local ZIP file. Now restoring factory defaults. This preserves all entries listed in a checklist, so if this has no effect we likely need to configure the system to not preserve that much, then try again.
No effect, same behaviour. Configured on https://ipmi.openqaworker-arm-4.qa.suse.de/#maintenance/preserve_configuration to not preserve anything except IPMI (+network) and Authentication, so that our remote access password should be preserved.
#13
Updated by okurz about 1 month ago
Conducted a "factory reset", no difference. Then, with nsinger as my witness, I connected to https://ipmi.openqaworker-arm-5.qa.suse.de/ and opened the remote control "h5viewer" (likely an HTML5 viewer, which actually looks quite nice and usable). It showed a getty session with some Linux messages on the screen, and the serial console showed a getty as well, so everything was working nicely. Then I triggered a "power reset": the machine started to reboot and the output looked similar to #110545#note-10, but with some "ff ff" or other non-zero values included. Then I triggered a "power cycle" (while the system was still booting) to mimic what mkittler reported he did on arm-4, and we actually ended up with the same symptoms: the system does not boot anymore and "Header" shows only zeroes.
#14
Updated by okurz about 1 month ago
- Copied to action #111578: Recover openqaworker-arm-4/5 after "bricking" in #110545 added
#15
Updated by okurz about 1 month ago
- Status changed from Workable to Blocked
nsinger, mkittler and I tried to recover both openqaworker-arm-4/5 and so far have not succeeded. I don't think there is anything useful we could do by being physically in the server room, but of course we can still try to hook up a local VGA monitor or something similar. I suggest we continue in a dedicated "recover" ticket so that we do not pollute this ticket with more recovery-specific information: #111578
#16
Updated by mkittler about 1 month ago
- Due date deleted (2022-06-04)