action #110545
openQA Project (public) - coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try size:M
Description
Motivation
See parent #101048. In #109232#note-5 ggardet_arm gave some additional hints that we could try. We should try all of them and run tests as mkittler did in #109232.
Acceptance criteria
- AC1: All concrete ideas have been tried and openQA tests have been executed with a statement regarding stability
Suggestions
- Remind mkittler that he should always write down the commands he used in tickets as otherwise his colleagues will ask him anyway what he did in #109232 to run openQA tests ;)
- See my notes on exporting job IDs via psql: https://github.com/Martchus/openQA-helper#useful-sql-queries=
- Change the parameters on the systems as written in #109232#note-5, one by one or in combination, rerun the tests and gather stability figures
- Come up with final assessment
Concrete ideas to try out
- DONE Ask Guillaume if we can trade the machine for another one -> nope
- DONE (does not help, see #110545#note-4): Disable mitigation (KPTI, etc.)
- ~Use kernel parameter mitigations=off (see https://www.kernel.org/doc/html/v5.15-rc1/admin-guide/kernel-parameters.html)~
- DONE (does not help, see #110545#note-4): Enable/disable huge pages
- DONE (at least progdevfreq, see #110545#note-8): Disable hardware threading in firmware (it will lower the number of CPUs seen by the kernel)
- Also tried disabling progdevfreq but haven't done any testing after that as the machines broke after that.
- Check actual CPU frequency
- Check temperature (CPU throttling could lower the CPU frequency and thereby the performance)
- Use single socket instead of dual sockets (may be configurable in the firmware)
- Which firmware option (see paste in #110545#note-8 for options) would this correspond to?
- Use a distribution without LSE-atomics (known to be slow on TX2)
- Not sure whether there's a firmware option to disable that support.
- Not sure whether it is enabled in Leap anyways (https://en.opensuse.org/Arm_architecture_support#ARMv8.1_-_LSE_(Large_System_Extension)_atomics only mentions Tumbleweed).
- export GLIBC_TUNABLES="glibc.mem.tagging=X" where X defaults to 0, and we could confirm if 1 has an effect
- See https://www.gnu.org/software/libc/manual/html_node/Memory-Related-Tunables.html for documentation
- You can also run sudo perf stat while the system is busy with openQA tests
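The two open checks "actual CPU frequency" and "temperature" could be scripted roughly as follows. This is only a sketch: it assumes the kernel on the workers exposes the standard cpufreq and thermal-zone sysfs files, which has not been verified on arm-4/5. The function names and the sysfs_root parameter are made up for illustration.

```python
from pathlib import Path


def read_cpu_freqs_mhz(sysfs_root="/sys"):
    """Return {cpu_name: MHz} read from cpufreq's scaling_cur_freq (kHz)."""
    freqs = {}
    pattern = "devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq"
    for path in sorted(Path(sysfs_root).glob(pattern)):
        cpu = path.parent.parent.name  # e.g. "cpu0"
        freqs[cpu] = int(path.read_text().strip()) / 1000
    return freqs


def read_temperatures_celsius(sysfs_root="/sys"):
    """Return {zone_name: degrees Celsius} read from thermal zones (millidegrees)."""
    temps = {}
    pattern = "class/thermal/thermal_zone*/temp"
    for path in sorted(Path(sysfs_root).glob(pattern)):
        temps[path.parent.name] = int(path.read_text().strip()) / 1000
    return temps
```

Calling both functions periodically while openQA jobs are running and logging the minimum frequency and maximum temperature would show whether throttling happens under load. Empty results mean the kernel does not expose these files and another source (e.g. the BMC sensors) is needed.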
Updated by okurz over 2 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Category deleted (Regressions/Crashes)
Updated by livdywan over 2 years ago
- Subject changed from Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try to Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 2 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler over 2 years ago
I've now added kernel parameters that we also have on the o3 worker aarch64:
martchus@openqaworker-arm-4:~> cat /proc/cmdline
BOOT_IMAGE=/boot/Image-5.3.18-150300.59.63-default root=UUID=be776b2a-53e6-458c-9ab6-c35b63e4a834 console=tty0 console=ttyAMA0,115200 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 enforcing=0
So now mitigations are disabled and huge pages are enabled similarly to aarch64.
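To double-check such a command line after a reboot, something like the following sketch could be used. It parses the string from /proc/cmdline shown above; the helper names are hypothetical and only "<N>G" page sizes are handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: [values]}; bare flags map to [None]."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params.setdefault(name, []).append(value if sep else None)
    return params


def hugepage_reservation_gib(params):
    """Return the GiB reserved via hugepagesz=<N>G hugepages=<count>, else 0."""
    size = (params.get("hugepagesz") or [None])[-1]
    count = (params.get("hugepages") or [None])[-1]
    if size is None or count is None or not size.upper().endswith("G"):
        return 0
    return int(size[:-1]) * int(count)


# Command line as pasted in this comment.
CMDLINE = (
    "BOOT_IMAGE=/boot/Image-5.3.18-150300.59.63-default "
    "root=UUID=be776b2a-53e6-458c-9ab6-c35b63e4a834 console=tty0 "
    "console=ttyAMA0,115200 nospec kvm.nested=1 kvm_intel.nested=1 "
    "kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M mitigations=off "
    "default_hugepagesz=1G hugepagesz=1G hugepages=64 enforcing=0"
)

params = parse_cmdline(CMDLINE)
print("mitigations:", params.get("mitigations"))
print("huge pages reserved (GiB):", hugepage_reservation_gib(params))
```

With these parameters 64 pages of 1G each are reserved, i.e. 64 GiB of RAM are set aside for huge pages at boot.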
I've been cloning the last 100 passing jobs from OSD to see whether it makes a difference: https://openqa.suse.de/tests/overview?build=test-arm4-3
Updated by mkittler over 2 years ago
A short review of the test results we've got so far already reveals quite a lot of typing issues.
Typing issues:
- https://openqa.suse.de/tests/8798813#step/setup/82
- https://openqa.suse.de/tests/8798628#step/before_test/18
- https://openqa.suse.de/tests/8799018#step/before_test/10
- https://openqa.suse.de/tests/8799075#step/installation_overview/3
- https://openqa.suse.de/tests/8798877#step/force_scheduled_tasks/8
- https://openqa.suse.de/tests/8798697#step/barrier_init/15
- https://openqa.suse.de/tests/8798644#step/setup/73
- https://openqa.suse.de/tests/8800122#step/snapper_thin_lvm/12
- https://openqa.suse.de/tests/8800121#step/consoletest_setup/71
- https://openqa.suse.de/tests/8799373#step/pam_su/6
Possibly typing issue:
Other failures / not sure about the cause:
- https://openqa.suse.de/tests/8798760#step/consoletest_setup/8
- https://openqa.suse.de/tests/8798752#step/iscsi_client/22
- https://openqa.suse.de/tests/8799090#step/hostname_inst/6
- https://openqa.suse.de/tests/8798979#step/scc_registration/6
- https://openqa.suse.de/tests/8799369#step/command_not_found/18
- https://openqa.suse.de/tests/8799133#step/update_kernel/87
- https://openqa.suse.de/tests/8799127#step/madvise06/7
Updated by openqa_review over 2 years ago
- Due date set to 2022-06-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 2 years ago
The overall fail rate is still quite high:
openqa=> with test_jobs as (select distinct id, state, result from jobs where build = 'test-arm4-3') select state, result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.state, test_jobs.result order by ratio desc;
state | result | ratio
-------+-----------------+------------------------
done | passed | 59.1836734693877551
done | failed | 24.4897959183673469
done | parallel_failed | 15.6462585034013605
done | softfailed | 0.68027210884353741497
(4 rows)
I've updated the previous comment. I guess the number of typing issues is still too high to consider the kernel parameters mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 an improvement.
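For comparing several test builds it may help to compute the same ratios outside of psql, e.g. from exported job data. The sketch below mirrors the grouping of the SQL query. The job counts 87/36/23/1 are not taken from the database; they are merely the smallest counts consistent with the percentages above (147 jobs in total), so treat them as an inference.

```python
from collections import Counter


def result_ratios(jobs):
    """Return (state, result, percentage) rows sorted by percentage, descending."""
    counts = Counter(jobs)
    total = sum(counts.values())
    rows = [(state, result, n * 100 / total) for (state, result), n in counts.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Inferred counts, see the note above; one (state, result) tuple per job.
jobs = (
    [("done", "passed")] * 87
    + [("done", "failed")] * 36
    + [("done", "parallel_failed")] * 23
    + [("done", "softfailed")] * 1
)

for state, result, ratio in result_ratios(jobs):
    print(f"{state:5} | {result:15} | {ratio:.4f}")
```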
Updated by mkittler over 2 years ago
I now disabled progdevfreq in the firmware (after previously only disabling progcpufreq). Not sure how I'd disable hardware threading in firmware (as mentioned in suggestions). The same goes for using a single socket instead of dual sockets.
So that's what the current firmware settings are:
CAVM_CN99xx# env save
drivername snor
snor_erase: off=0x3ff0000, len=0x10000
-----------------------------------
ENV Variable Settings
-----------------------------------
Name : Value
-----------------------------------
turbo : 2
smt : 4
corefreq : 2199
numcores : 32
icispeed : 1
socnclk : 666
socsclk : 1199
memclk : 2199
ddrspeed_auto : 1
ddrspeed : 2400
progcpufreq : 0
progdevfreq : 0
dmc_node_channel_mask : 0000ffff
thermcontrol : 1
thermlimit : 110
enter_debug_shell : 0
dbg_speed_up_ddr_lvl : 0
enable_dram_scrub : 0
ipmbcontrol : 1
ddr_dmt_advanced : 0
cppccontrol : 0
loglevel : 0
uart_params : 115200/8-N-1 none
core_feature_mask : 0
sys_feature_mask : 0x00000000
ddr_refresh_rate : 1
fw_feature_mask : 0x00000000
dram_ce_threshold : 1
dram_ce_step_threshold: 0
dram_ce_record_max : 10
dram_ce_window : 60 sec
dram_ce_leak_rate : 2000 msec/error
pcie_ce_threshold : 1
pcie_ce_window : 30 sec
pcie_ce_leak_rate : 15000 msec/error
-----------------------------------
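To compare such dumps between machines or before/after a change, the table can be turned into a dictionary. A minimal sketch, assuming the output format stays as pasted above (a "Name : Value" header followed by "name : value" lines); parse_env_dump and diff_settings are hypothetical helper names.

```python
def parse_env_dump(text):
    """Parse the firmware 'env' table into {name: value} (values kept as strings)."""
    settings = {}
    in_table = False
    for line in text.splitlines():
        name, sep, value = (part.strip() for part in line.partition(":"))
        if not sep:
            continue  # separator lines and other output without a colon
        if (name, value) == ("Name", "Value"):
            in_table = True  # everything after the header row is a setting
            continue
        if in_table:
            settings[name] = value
    return settings


def diff_settings(old, new):
    """Return {name: (old_value, new_value)} for every setting that differs."""
    names = sorted(old.keys() | new.keys())
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}
```

Saving the console output of each machine to a file and feeding both through diff_settings(parse_env_dump(a), parse_env_dump(b)) would list exactly the settings that differ.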
Btw, I've just found: https://en.opensuse.org/HCL:ThunderX2 - Somehow I doubt these are "the best processors".
Updated by mkittler over 2 years ago
I invoked cvmrundiag in the hope it would maybe print something useful. However, it left the system in a broken state where neither power cycle nor power reset help. I'm currently trying a factory reset (hopefully preserving most of the settings for authentication/IPMI).
Updated by mkittler over 2 years ago
- Status changed from In Progress to Feedback
The worker is still not working. When resetting the power, only the following is printed:
Rom...
CRC: len=0xf080, cal=0x27ff5de9, img=0x27ff5de9, match!
Loading from boot device SPI NOR
Header:
000|0x23ffdc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010|0x23ffdd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020|0x23ffde0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030|0x23ffdf0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
040|0x23ffe00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050|0x23ffe10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060|0x23ffe20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
070|0x23ffe30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
080|0x23ffe40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
090|0x23ffe50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0A0|0x23ffe60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0B0|0x23ffe70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0C0|0x23ffe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0D0|0x23ffe90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0E0|0x23ffea0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0F0|0x23ffeb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
100|0x23ffec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
110|0x23ffed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
120|0x23ffee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
130|0x23ffef0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
140|0x23fff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
150|0x23fff10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
160|0x23fff20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
170|0x23fff30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
180|0x23fff40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
190|0x23fff50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1A0|0x23fff60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1B0|0x23fff70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1C0|0x23fff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1D0|0x23fff90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1E0|0x23fffa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1F0|0x23fffb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
It likely needs further investigation on-site.
@okurz Maybe you want to take over, at least for trying to recover it on Friday?
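When watching the serial console during further recovery attempts, the "Header" dump can be checked automatically instead of by eye. A rough sketch, assuming the dump lines keep the "offset|address: bytes" layout shown above; the function names are hypothetical.

```python
import re

# Matches lines like "0A0|0x23ffe60: 00 00 ... 00" and captures the byte column.
DUMP_LINE = re.compile(
    r"^[0-9a-f]+\|0x[0-9a-f]+:\s+((?:[0-9a-f]{2}\s*)+)$", re.IGNORECASE
)


def header_bytes(console_output):
    """Collect all bytes from the hex dump lines found in the console output."""
    data = bytearray()
    for line in console_output.splitlines():
        match = DUMP_LINE.match(line.strip())
        if match:
            data.extend(bytes.fromhex(match.group(1)))
    return bytes(data)


def classify_header(console_output):
    """Describe whether the dumped header is missing, blank or has content."""
    data = header_bytes(console_output)
    if not data:
        return "no header dump found"
    if not any(data):
        return "all zeroes"
    return "contains non-zero bytes"
```

For the output pasted above (32 lines of 16 bytes, i.e. 512 bytes) this reports "all zeroes".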
Updated by okurz over 2 years ago
- Tags set to next-office-day
- Status changed from Feedback to Workable
- Assignee changed from mkittler to okurz
we will look into this.
Updated by okurz over 2 years ago
found https://www.gigabyte.com/Enterprise/ARM-Server/R181-T92-rev-100#Support-Bios . Downloaded server_system_boot_mt91-fsx_f34.zip, extracted from that "image.RBU" and flashed that over https://ipmi.openqaworker-arm-4.qa.suse.de/#maintenance/firmware_update_wizard selecting "Update Type: BIOS". There could be options for CPDLD and BMC itself it seems.
Updated, system behaves the same as reported in https://progress.opensuse.org/issues/110545#note-10 . I wonder how we can configure boot devices.
I saved all configuration from the BMC to a local ZIP file. Now restoring factory defaults. This preserves all entries listed in a checklist, so if this has no effect we likely need to configure the system to not preserve that much, then try again.
No effect, same behaviour. Configured on https://ipmi.openqaworker-arm-4.qa.suse.de/#maintenance/preserve_configuration to not preserve anything except IPMI(+network),Authentication so that our remote access password should be preserved.
Updated by okurz over 2 years ago
Conducted a "factory reset", no difference. Then, with nsinger as my witness, I connected to https://ipmi.openqaworker-arm-5.qa.suse.de/ and opened the remote control "h5viewer" (likely an HTML5 viewer, which actually looks quite nice and usable). It showed a getty session with some Linux messages on the screen, and the serial console showed a getty as well, so the machine was working nicely. Then I triggered a "power reset": the machine started to reboot and the output looked similar to #110545#note-10, but with some "ff ff" or other non-zero bytes included. Then I triggered a "power cycle" (while the system was still booting) to resemble what mkittler reported he did on arm-4, and we actually ended up with the same symptoms: the system does not boot anymore and "Header" shows only zeroes.
Updated by okurz over 2 years ago
- Copied to action #111578: Recover openqaworker-arm-4/5 after "bricking" in #110545 size:M added
Updated by okurz over 2 years ago
- Status changed from Workable to Blocked
nsinger, mkittler and I tried to recover both openqaworker-arm-4/5 and so far have not succeeded. I don't think there is anything useful we could do when being in the server room physically but of course we can still try to hook up a local VGA monitor or something. I suggest we continue in a specific "recover" ticket so that we are not polluting this ticket with more recovery-specific information: #111578
Updated by okurz over 2 years ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
back to try further stuff after we could recover both machines with a complete cold power cycle, see #111578
Updated by okurz over 2 years ago
- Status changed from Workable to Blocked
- Assignee set to okurz
let's wait for moving those machines to the Nbg TAM lab: #114604
Updated by livdywan over 2 years ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
The blocker is gone (SRV2 suffering from high temperatures)
Updated by okurz over 2 years ago
- Status changed from Workable to In Progress
- Assignee set to okurz
Nobody from the team wants to touch these beasts so I will ask OBS team if maybe they want to trade machines.
Updated by okurz over 2 years ago
- Status changed from In Progress to Feedback
Asked BuildOPS if they would be interested in a trade, see #110539
Brought up the topic again in https://suse.slack.com/archives/C02CANHLANP/p1663928419759359
Hi. Some months ago we have ordered two ARM machines to be used as openQA workers. Unfortunately for yet unknown reasons these two machines are much less reliable than our older ARM workers. If you are interested or want to help see https://progress.opensuse.org/issues/101048 and all subtasks for the full story. I also asked the BuildOPS team if they would potentially be interested in trading machines. Other than that I see no good path to continue with replacing our aging and unstable ARM workers which are still more stable than the new ones.
EDIT: szarate will have a meeting with afaerber in week 2022-W39 bringing up this topic as well. Some background info by mawerner: https://confluence.suse.com/display/LEONG/2022-09-19+WG-+ALP%3A+QE+on+Arm
Updated by okurz over 2 years ago
okurz wrote:
Asked BuildOPS if they would be interested for a trade, see #110539
The answer for #110539 is "No". Awaiting response from szarate in https://suse.slack.com/archives/C02CANHLANP/p1664966773773939
(Oliver Kurz) @Santiago Zarate for https://progress.opensuse.org/issues/110545, what's the result regarding ARM workers discussion with afaerber?
Updated by okurz over 2 years ago
There was no update by szarate yet so I reminded them in
https://suse.slack.com/archives/C02CANHLANP/p1665563709438689?thread_ts=1664966773.773939&cid=C02CANHLANP
@Santiago Zarate still missing the update from above, plz
Updated by okurz over 2 years ago
- Description updated (diff)
- Status changed from Feedback to New
- Assignee deleted (okurz)
- Priority changed from High to Normal
- Target version changed from Ready to future
Thanks. So additional information and additional ideas have been provided. I updated the description of the ticket about the suggestions that are still open to be tried. This could be done by anyone with access to the team, i.e. within SUSE. The SUSE QE Tools team does currently not plan to try any further.
@szarate if you or anyone within QE-Core would like to go on testing the stability of the machines I would appreciate that a lot. The invitation is to everybody with access to the machine. We can provide support in getting access and starting experiments.
Updated by okurz over 1 year ago
- Target version changed from Ready to future
I honestly don't remember anymore why two months ago I added the ticket back to the backlog without a comment. It might actually have been a mistake. #110545-29 is still the most recent and valid state. I consider it unfortunate that so far nobody could find clear requirements for what a machine needs to fulfill to be able to run stable openQA tests.