action #109232
closed coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
Document relevant differences of arm-4/5 vs. arm-1/2/3 and aarch64.o.o, involve domain experts in asking what parameters are important to be able to run openQA tests size:M
Description
Motivation
The last time we spoke about this we came up with the idea of involving ARM experts. okurz asked ggardet_arm in https://app.element.io/#/room/#openqa:opensuse.org (or maybe it was #opensuse-factory) and he offered help but needs more details about the machines.
Acceptance criteria
- AC1: Differences between arm-4/5 and arm-1/2/3 and aarch64.o.o are known (i.e. documented at least in ticket)
- AC2: Domain experts are aware of the problems we face and have an opportunity to take a look at the differences
Suggestions
- As necessary switch on machines, e.g. ipmi-ipmi.openqaworker-arm-4.qa power on
- We suggest to gather details, e.g. log in, call dmesg and dmidecode, provide those details in the ticket and ask ggardet_arm again if he needs more (see the sketch after this list)
- Maybe something about hugepages, cpu flags, some boot kernel parameters to work around I/O quirks, anything like that.
- Optional: Produce "diff" to arm-1/2/3 and aarch64.o.o
- Optional: Send an email to SUSE internal ARM mailing list or Slack channel to ask for help.
- Optional: Provide access to internal machines temporarily over tmate within an investigation session
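For illustration, a minimal sketch of how the suggested details could be gathered and compared (the loop, output file names and the diff invocation are assumptions, not commands from this ticket):
# collect dmidecode and dmesg output from each worker and attach the files to the ticket
for w in openqaworker-arm-1 openqaworker-arm-2 openqaworker-arm-3 openqaworker-arm-4 openqaworker-arm-5; do
    ssh "$w" 'sudo dmidecode' > "dmidecode-$w.txt"
    ssh "$w" 'sudo dmesg' > "dmesg-$w.txt"
done
# optional "diff" between an old and a new worker
diff -u dmidecode-openqaworker-arm-1.txt dmidecode-openqaworker-arm-4.txt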
Files
Updated by mkittler over 2 years ago
- File dmidecode-openqaworker-arm-1.txt dmidecode-openqaworker-arm-1.txt added
- File dmidecode-openqaworker-arm-2.txt dmidecode-openqaworker-arm-2.txt added
- File dmidecode-openqaworker-arm-3.txt dmidecode-openqaworker-arm-3.txt added
I'm still powering on arm 4 and 5 which were apparently completely shut down.
Updated by mkittler over 2 years ago
- File dmidecode-openqaworker-arm-4.txt dmidecode-openqaworker-arm-4.txt added
- File dmidecode-openqaworker-arm-5.txt dmidecode-openqaworker-arm-5.txt added
- File inxi-openqaworker-arm-1.txt inxi-openqaworker-arm-1.txt added
- File inxi-openqaworker-arm-2.txt inxi-openqaworker-arm-2.txt added
- File inxi-openqaworker-arm-3.txt inxi-openqaworker-arm-3.txt added
- File inxi-openqaworker-arm-4.txt inxi-openqaworker-arm-4.txt added
- File inxi-openqaworker-arm-5.txt inxi-openqaworker-arm-5.txt added
I've also generated a more condensed summary with inxi (for w in openqaworker-arm-1 openqaworker-arm-2 openqaworker-arm-3 ; do ssh "$w" sudo inxi -c0 -F -xxx > "inxi-$w.txt" ; done).
On arm 4 and 5 I executed the commands via IPMI because the network isn't working on these machines.
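For reference, running commands over IPMI typically means attaching to the machine's serial console via its BMC, roughly like this (BMC hostname and credentials are placeholders, not the actual values used):
# attach to the serial-over-LAN console of the machine
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> sol activate
# ... log in on the serial console and run dmidecode/inxi there ...
# detach again afterwards
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> sol deactivate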
Updated by mkittler over 2 years ago
- File dmidecode-aarch64.txt dmidecode-aarch64.txt added
- File inxi-aarch64.txt inxi-aarch64.txt added
- Status changed from Workable to Feedback
@ggardet_arm We have problems with the arm workers used in our internal openQA instance. Maybe you have an idea why the fail ratio of openQA tests conducted on the two aarch64 workers arm-4/5 is higher (~ 30 %) than on the three aarch64 workers arm-1/2/3 (~ 15 %)? The fail ratios cover the same time range and jobs were scheduled equally across the workers. I've tried to summarize some differences between these aarch64 machines below and attached dmidecode/inxi output from all of them in previous comments.
Note that arm-4/5 are quite new and were supposed to improve our generally bad situation with arm workers. Our existing workers arm-1/2/3 are actually quite unstable themselves as they randomly crash very often. However, their supposed replacements arm-4/5 are even worse. The new workers don't crash but as mentioned their fail ratio is much higher compared to the old/crashing ones.
At this point both groups of workers use Leap 15.3 and we've tested several kernel versions on arm-4/5 without noticing a difference in the fail ratio. We also reduced the number of worker slots on arm-4/5 to only 4 slots (per worker). So even if they were just quite slow that should have been compensated.
By the way, the only good aarch64 openQA worker we have is the worker "aarch64" used on openqa.opensuse.org. So I've also attached some details about this worker for a comparison.
Unfortunately I cannot give you network access to arm-4/5 at this point. Let me know if you need any further details.
Differences between arm-4 and arm-5: none, they seem to be identical except for serial numbers
So it should be sufficient to look at just one of the specs.
Differences between arm-4/5 and arm-1/2/3:
- The kernel version differs. However, that should not be relevant: we've already tested various kernel versions on arm-4/5 and it didn't make a difference.
- arm-4/5 seem to be a newer version of the same product from the same vendor (R181-T92-00 > R120-T32, both from GIGABYTE)
- The mainboard is newer but from the same vendor and likely similar (MT91-FS4-00 > MT30-GS1).
- The CPU is faster judging by frequencies and core counts (Cavium ThunderX2(R) CPU CN9980 v2.2 @ 2.20GHz > whatever arm-1/2/3 have installed).
- The CPU has more features/characteristics (Hardware Thread, Power/Performance Control).
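For reference, a per-host fail ratio like the one mentioned above could be computed with a query along these lines (a sketch using the jobs/workers columns that also appear in the queries further down; the time range and the result filter are assumptions, not the exact query used for the numbers above):
select w.host,
       round(count(*) filter (where j.result = 'failed') * 100.0 / count(*), 1) as fail_ratio
  from jobs j join workers w on w.id = j.assigned_worker_id
 where j.t_finished >= '2021-08-24T00:00:00'
 group by w.host
 order by fail_ratio desc;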
Updated by ggardet_arm over 2 years ago
It seems to be ThunderX1 and ThunderX2 machines. In my experience they are not good as openQA workers; they are better used to build packages.
You can check/try:
- Disable hardware threading in the firmware (it will lower the number of CPUs seen by the kernel)
- Check the actual CPU frequency
- Check the temperature (CPU throttling could lower the CPU frequency and with that the performance)
- Use a single socket instead of dual sockets (may be configurable in the firmware)
- Use a distribution without LSE atomics (known to be slow on TX2)
- Disable mitigations (KPTI, etc.)
- Enable/disable huge pages
- You can also run sudo perf stat while the system is busy with openQA tests (a sketch of some of these checks follows below)
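A minimal sketch of how some of these points could be checked on a running worker (assuming standard tools; lm_sensors and perf may need to be installed first):
# hardware threads show up as additional CPUs per core
lscpu -e=CPU,CORE,SOCKET,ONLINE
# temperature readings (throttling would lower the effective frequency)
sensors
# which mitigations are active
grep . /sys/devices/system/cpu/vulnerabilities/*
# transparent huge page setting
cat /sys/kernel/mm/transparent_hugepage/enabled
# sample performance counters system-wide for 30 s while openQA jobs are running
sudo perf stat -a sleep 30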
Updated by mkittler over 2 years ago
Thanks. I'll look into these points.
Because it came up as well: The failing tests seem to be a mix of typing issues and connection issues. Here are some examples:
- https://openqa.suse.de/tests/7631673
- https://openqa.suse.de/tests/7631458
- https://openqa.suse.de/tests/7632312
- https://openqa.suse.de/tests/7632463
- https://openqa.suse.de/tests/7632073
- https://openqa.suse.de/tests/7632006
- https://openqa.suse.de/tests/7631965
(via select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqaworker-arm-4' and (result = 'failed') and t_finished >= '2021-08-24T00:00:00')
Updated by ggardet_arm over 2 years ago
You can also play with openQA settings:
- VNC_TYPING_LIMIT
- TIMEOUT_SCALE
- QEMU_COMPRESS_LEVEL
- QEMU_COMPRESS_THREADS
- QEMUCPU (I recommend host)
- QEMUMACHINE (I recommend virt,gic-version=host)
and of course, the number of parallel openQA jobs (a sketch follows below).
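For illustration, such settings could be overridden when cloning one of the failing jobs, roughly like this (the numeric values are made-up starting points; only the QEMUCPU/QEMUMACHINE values are the ones recommended above):
openqa-clone-job --within-instance https://openqa.suse.de 7631673 \
    QEMUCPU=host QEMUMACHINE=virt,gic-version=host \
    VNC_TYPING_LIMIT=15 TIMEOUT_SCALE=2 \
    QEMU_COMPRESS_LEVEL=1 QEMU_COMPRESS_THREADS=4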
Updated by mkittler over 2 years ago
- Blocked by action #109494: Restore network connection of arm-4/5 size:M added
Updated by mkittler over 2 years ago
journalctl | grep stuck prints nothing on arm-4/5 (unlike on aarch64) and the logs reach back to November. So we definitely see a different problem than on aarch64 (where we frequently see messages like kernel:[480826.136444] watchdog: BUG: soft lockup - CPU#49 stuck for 26s! [qemu-system-aar:13803]).
Instead we get lots of Nov 13 22:55:19 openqaworker-arm-5 kernel: ACPI CPPC: PCC check channel failed for ss: 0. ret=-110 and Nov 13 22:55:20 openqaworker-arm-5 kernel: CPPC Cpufreq:cppc_scale_freq_workfn: failed to read perf counters messages (on arm-4 and arm-5).
Apparently we're not the only ones seeing those messages with a ThunderX2 CPU: https://bugzilla.kernel.org/show_bug.cgi?id=208785
@ggardet_arm suggested to disable frequency scaling when I brought up these error messages (also see https://wiki.archlinux.org/title/CPU_frequency_scaling).
Considering https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1763817 the firmware setting cppccontrol=1 is relevant, so we might try disabling this setting.
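A small sketch of how the cpufreq state could be inspected from Linux before/after changing such firmware settings (standard sysfs paths; they only exist while a cpufreq driver is loaded):
# which cpufreq driver and governor are in use (cppc_cpufreq would point to CPPC)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# current frequency in kHz
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
# temporarily rule out scaling by pinning the performance governor (not persistent across reboots)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor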
Updated by okurz over 2 years ago
Independent of fixing the actual performance problems, I think it can help in general for openQA tests if we configure the systems we test in with a disabled key repeat to mitigate the issues, i.e. "xset r off" within X11 sessions or whatever can be seen as equivalent in non-X11 getty terminals. According to fvogt "There's also an ioctl for the vtcon". The command "kbdrate" allows to configure something but says only for "Intel" and does not allow to disable key repeat completely, only to increase the times, which could already help enough. By default the keyboard delay is 250 ms. Any different setting likely needs to be applied per VT. So in a true text tty one could use kbdrate -d 1000 and in X xset r off whenever we start a new terminal.
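Put together, a sketch of what could be run whenever a terminal is started in the system under test (values from the comment above; kbdrate only accepts a limited set of delays, 1000 ms being the maximum):
if [ -n "$DISPLAY" ]; then
    # X11 session: disable key repeat entirely
    xset r off
else
    # text console: slow down key repeat as far as kbdrate allows (may need root)
    kbdrate -d 1000 -r 2.0
fi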
Updated by mkittler over 2 years ago
- Status changed from Blocked to In Progress
Thanks to Nick, arm-4 is back online. I disabled cppccontrol in the shell accessible at boot time, updated arm-4, applied the latest salt-states/pillars with 16 worker slots and scheduled 100 jobs¹: https://openqa.suse.de/tests/overview?build=test-arm4
Not sure whether it generally makes sense to disable frequency scaling. I suppose this suggestion effectively means locking the CPU at a certain frequency which seems like a huge disadvantage. So for now I only tried the cppccontrol approach.
¹ Cloned the 100 most recently passed/softfailed aarch64 jobs on OSD (see /tmp/jobs_to_clone_arm on OSD). So the fail ratio within this set of jobs should be very low.
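Roughly how such a batch of jobs can be cloned to the test build (assuming /tmp/jobs_to_clone_arm contains one job ID per line and that the WORKER_CLASS value pins jobs to arm-4; both are assumptions about the actual setup):
# clone each job within OSD into the dedicated build
while read -r job_id; do
    openqa-clone-job --within-instance https://openqa.suse.de "$job_id" \
        BUILD=test-arm4 WORKER_CLASS=openqaworker-arm-4
done < /tmp/jobs_to_clone_arm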
Updated by openqa_review over 2 years ago
- Due date set to 2022-05-10
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 2 years ago
- Status changed from In Progress to Blocked
I'll have to re-conduct the tests because many failed due to networking problems. Maybe I had the problematic version of libslirp0 installed. I suppose some jobs also failed because their dependencies were not cloned, so I'll need to adjust the clone call as well. Due to general network problems arm-4 isn't reachable, so I'm currently blocked by https://sd.suse.com/servicedesk/customer/portal/1/SD-84633.
Updated by mkittler over 2 years ago
- Status changed from Blocked to In Progress
It indeed had the broken libslirp0 version installed. Newly scheduled jobs: https://openqa.suse.de/tests/overview?build=test-arm4-2
Updated by mkittler over 2 years ago
- Looks like there are nevertheless still problems with the network connection: https://openqa.suse.de/tests/8639166#step/setup/29
- I've also already spotted typing issues: https://openqa.suse.de/tests/8639164#step/setup/27 (character typed too often), https://openqa.suse.de/tests/8639174 (character '"' missing)
Updated by mkittler over 2 years ago
With cppccontrol=0 the ACPI error mentioned in #109232#note-10 isn't occurring anymore. Considering the job results I've seen so far it likely still doesn't help with the typing issues, though. However, I'll wait for all jobs to finish before drawing a conclusion (as the other arm workers also have some typing issues).
Further typing issues:
- https://openqa.suse.de/tests/8639168#step/system_prepare#1/11 (not 100% sure but likely some typing issue)
- https://openqa.suse.de/tests/8639169#step/patch_sle/128 (missing characters)
- https://openqa.suse.de/tests/8639188#step/conman/1 (missing '(')
- https://openqa.suse.de/tests/8639283#step/installation_overview/3 (missing 'o')
- https://openqa.suse.de/tests/8639302#step/hostname_inst/6 (missing 'u')
- https://openqa.suse.de/tests/8639271#step/setup/41 (missing 'd')
Updated by mkittler over 2 years ago
33 % failures is not good (within a set of jobs that passed on other arm workers before):
openqa=> with test_jobs as (select distinct id, result from jobs where build = 'test-arm4-2') select result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.result order by ratio desc;
result | ratio
------------------+------------------------
passed | 40.8510638297872340
failed | 33.6170212765957447
parallel_failed | 14.8936170212765957
incomplete | 8.5106382978723404
timeout_exceeded | 0.85106382978723404255
user_cancelled | 0.85106382978723404255
softfailed | 0.42553191489361702128
(7 rows)
Updated by mkittler over 2 years ago
I now disabled cpu frequency scaling via the firmware menu:
CAVM_CN99xx# env set progcpufreq 0
CPU Freq set by SYS will be used
Env Var progcpufreq set with Value 0
Execute 'env save' Command to make the changes persistent
CAVM_CN99xx# env save
drivername snor
snor_erase: off=0x3ff0000, len=0x10000
-----------------------------------
ENV Variable Settings
-----------------------------------
Name : Value
-----------------------------------
turbo : 2
smt : 4
corefreq : 2199
numcores : 32
icispeed : 1
socnclk : 666
socsclk : 1199
memclk : 2199
ddrspeed_auto : 1
ddrspeed : 2400
progcpufreq : 0
progdevfreq : 1
dmc_node_channel_mask : 0000ffff
thermcontrol : 1
thermlimit : 110
enter_debug_shell : 0
dbg_speed_up_ddr_lvl : 0
enable_dram_scrub : 0
ipmbcontrol : 1
ddr_dmt_advanced : 0
cppccontrol : 0
loglevel : 0
uart_params : 115200/8-N-1 none
core_feature_mask : 0
sys_feature_mask : 0x00000000
ddr_refresh_rate : 1
fw_feature_mask : 0x00000000
dram_ce_threshold : 1
dram_ce_step_threshold: 0
dram_ce_record_max : 10
dram_ce_window : 60 sec
dram_ce_leak_rate : 2000 msec/error
pcie_ce_threshold : 1
pcie_ce_window : 30 sec
pcie_ce_leak_rate : 15000 msec/error
-----------------------------------
Updated by mkittler over 2 years ago
Another round of test jobs: https://openqa.suse.de/tests/overview?build=test-arm4-4
Updated by mkittler over 2 years ago
And it already looks bad, so here are some more typing issues:
Updated by mkittler over 2 years ago
Disabling cpu frequency scaling didn't help much, also considering all the test jobs:
openqa=> with test_jobs as (select distinct id, result from jobs where build = 'test-arm4-4') select result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.result order by ratio desc;
result | ratio
-----------------+------------------------
failed | 45.2054794520547945
passed | 37.8995433789954338
parallel_failed | 11.8721461187214612
incomplete | 4.1095890410958904
softfailed | 0.91324200913242009132
(5 rows)
Updated by mkittler over 2 years ago
- Blocked by deleted (action #109494: Restore network connection of arm-4/5 size:M)
Updated by mkittler over 2 years ago
- Status changed from In Progress to Resolved
I've removed the blocker #109494 because at least arm-4 is online again and that's sufficient.
I'd also like to conclude the issue here. The outcome is that the CPU (the specific model and version, Cavium ThunderX2) is known to behave badly for our use case, and that's the difference from the older arm workers (which have the previous generation of that CPU installed).
I tried disabling CPPC control (cppccontrol) and CPU frequency scaling (progcpufreq) in the firmware environment but it didn't make a difference. Before that we had already reduced the number of worker slots a lot and it didn't help either. There are still a few ideas to consider (see #109232#note-5) but this ticket is mainly about documenting differences, so I'm resolving it now.
Updated by okurz over 2 years ago
- Status changed from Resolved to Feedback
Please make sure that the relevant ideas and suggestions to follow are written down in open tickets, e.g. the parent epic or another subticket of the same epic.
Updated by mkittler over 2 years ago
- Status changed from Feedback to Resolved
Updated the epic (https://progress.opensuse.org/journals/514993/diff?detail_id=486994).
Updated by okurz over 2 years ago
- Category changed from Organisational to Feature requests