action #109232

closed

coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

Document relevant differences of arm-4/5 vs. arm-1/2/3 and aarch64.o.o, involve domain experts in asking what parameters are important to be able to run openQA tests size:M

Added by okurz about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2022-03-30
Due date:
% Done: 0%
Estimated time:
Description

Motivation

The last time we spoke about this we came up with the idea to involve ARM experts. okurz asked ggardet_arm in https://app.element.io/#/room/#openqa:opensuse.org (or maybe it was #opensuse-factory) and he offered help but needs more details about the machines.

Acceptance criteria

  • AC1: Differences between arm-4/5 and arm-1/2/3 and aarch64.o.o are known (i.e. documented at least in ticket)
  • AC2: Domain experts are aware of the problems we face and have an opportunity to take a look at the differences

Suggestions

  • As necessary switch on machines, e.g. ipmi-ipmi.openqaworker-arm-4.qa power on
  • We suggest gathering details, e.g. log in, call dmesg and dmidecode, provide those details in the ticket and ask ggardet_arm again if he needs more (see the collection sketch after this list)
  • Maybe something about hugepages, cpu flags, some boot kernel parameters to work around I/O quirks, anything like that.
  • Optional: Produce "diff" to arm-1/2/3 and aarch64.o.o
  • Optional: Send an email to the SUSE-internal ARM mailing list or Slack channel to ask for help.
  • Optional: Provide access to internal machines temporarily over tmate within an investigation session
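
A minimal collection sketch along those lines, assuming the hosts are reachable over SSH with sudo; for hosts without working network the same commands can be run over the IPMI serial console instead:

for w in openqaworker-arm-1 openqaworker-arm-2 openqaworker-arm-3 openqaworker-arm-4 openqaworker-arm-5 ; do
    ssh "$w" 'sudo dmesg' > "dmesg-$w.txt"            # kernel log, e.g. I/O quirks and firmware messages
    ssh "$w" 'sudo dmidecode' > "dmidecode-$w.txt"    # hardware/firmware inventory
done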

Files

dmidecode-openqaworker-arm-1.txt (16.2 KB) mkittler, 2022-04-05 10:19
dmidecode-openqaworker-arm-2.txt (21.5 KB) mkittler, 2022-04-05 10:19
dmidecode-openqaworker-arm-3.txt (21.5 KB) mkittler, 2022-04-05 10:19
dmidecode-openqaworker-arm-4.txt (31.5 KB) mkittler, 2022-04-05 10:59
dmidecode-openqaworker-arm-5.txt (31.5 KB) mkittler, 2022-04-05 10:59
inxi-openqaworker-arm-1.txt (11.8 KB) mkittler, 2022-04-05 10:59
inxi-openqaworker-arm-2.txt (14 KB) mkittler, 2022-04-05 10:59
inxi-openqaworker-arm-3.txt (15.9 KB) mkittler, 2022-04-05 10:59
inxi-openqaworker-arm-4.txt (5.8 KB) mkittler, 2022-04-05 10:59
inxi-openqaworker-arm-5.txt (5.8 KB) mkittler, 2022-04-05 10:59
dmidecode-aarch64.txt (23 KB) mkittler, 2022-04-05 12:40
inxi-aarch64.txt (3.74 KB) mkittler, 2022-04-05 12:40
Actions #1

Updated by mkittler about 2 years ago

  • Assignee set to mkittler

Updated by mkittler about 2 years ago

I've also generated a more condensed summary with inxi (for w in openqaworker-arm-1 openqaworker-arm-2 openqaworker-arm-3 ; do ssh "$w" sudo inxi -c0 -F -xxx > "inxi-$w.txt" ; done).

On arm 4 and 5 I executed the commands via IPMI because the network isn't working on these machines.

Updated by mkittler about 2 years ago

@ggardet_arm We have problems with the arm workers used in our internal openQA instance. Maybe you have an idea why the fail ratio of openQA tests conducted on the two aarch64 workers arm-4/5 is higher (~ 30 %) compared to the three aarch64 workers arm-1/2/3 (~ 15 %)? The fail ratio was computed over the same time range and jobs were scheduled equally on both groups. I've tried to summarize some differences between these aarch64 machines below and attached dmidecode/inxi output from all of those machines in previous comments.

Note that arm-4/5 are quite new and were supposed to improve our generally bad situation with arm workers. Our existing workers arm-1/2/3 are actually quite unstable themselves as they randomly crash very often. However, their supposed replacements arm-4/5 are even worse. The new workers don't crash but as mentioned their fail ratio is much higher compared to the old/crashing ones.

At this point both groups of workers use Leap 15.3 and we've tested several kernel versions on arm-4/5 without noticing a difference in the fail ratio. We also reduced the number of worker slots on arm-4/5 to only 4 slots per worker. So even if they were just slow, that should have compensated for it.
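
For reference, the slot count corresponds to the number of enabled openQA worker service instances on the host; a rough sketch of what reducing a worker to 4 slots amounts to (unit names assumed; on our hosts this is actually managed via salt pillars):

sudo systemctl disable --now openqa-worker@{5..16}   # stop the extra slots
sudo systemctl enable --now openqa-worker@{1..4}     # keep only 4 worker instances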

By the way, the only good aarch64 openQA worker we have is the worker "aarch64" used on openqa.opensuse.org. So I've also attached some details about this worker for a comparison.

Unfortunately I cannot give you network access to arm-4/5 at this point. Let me know if you need any further details.


Differences between arm-4 and arm-5: none, they seem to be identical except for serial numbers
So it should be sufficient to look at just one of the specs.


Differences between arm-4/5 and arm-1/2/3:

  1. The kernel version differs. However, that should not be relevant. We've already tested various kernel versions on arm-4/5 and it didn't make a difference.
  2. arm-4/5 seem to be a newer version of the same product from the same vendor (R181-T92-00 > R120-T32, both from GIGABYTE)
  3. The mainboard is newer but from the same vendor and likely similar (MT91-FS4-00 > MT30-GS1).
  4. The CPU is faster judging by frequencies and core counts (Cavium ThunderX2(R) CPU CN9980 v2.2 @ 2.20GHz > whatever arm-1/2/3 have installed).
  5. The CPU has more features/characteristics (Hardware Thread, Power/Performance Control).
Actions #5

Updated by ggardet_arm about 2 years ago

These seem to be ThunderX1 and ThunderX2 machines. In my experience they are not good as openQA workers and are better used to build packages.

You can check/try:

  • disable hardware threading in the firmware (it will lower the number of CPUs seen by the kernel)
  • check the actual CPU frequency
  • check the temperature (CPU throttling could lower the CPU frequency and thus performance)
  • use a single socket instead of dual sockets (may be configurable in the firmware)
  • use a distribution without LSE atomics (known to be slow on TX2)
  • disable mitigations (KPTI, etc.)
  • enable/disable huge pages
  • run sudo perf stat while the system is busy with openQA tests (a rough sketch of some of these checks follows this list)
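
A rough sketch of some of those checks, assuming the tools are available on Leap 15.3 (sensors requires the lm_sensors package):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # current frequency of CPU 0 in kHz
sensors                                                     # temperatures, to spot thermal throttling
sudo perf stat -a -- sleep 60                               # system-wide counters while openQA jobs are running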
Actions #6

Updated by mkittler about 2 years ago

Thanks. I'll look into these points.


Because it came up as well: The failing tests seem to be a mix of typing issues and connection issues. Here are some examples:

(via select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqaworker-arm-4' and (result = 'failed') and t_finished >= '2021-08-24T00:00:00')

Actions #7

Updated by ggardet_arm about 2 years ago

You can also play with openQA settings:

  • VNC_TYPING_LIMIT
  • TIMEOUT_SCALE
  • QEMU_COMPRESS_LEVEL
  • QEMU_COMPRESS_THREADS
  • QEMUCPU (I recommend host)
  • QEMUMACHINE (I recommend virt,gic-version=host)

and of course, the number of parallel openQA jobs.
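
A hedged example of how such settings could be overridden on a single cloned job for experimentation (the job id and values are purely illustrative, not recommendations):

openqa-clone-job --within-instance https://openqa.suse.de 1234567 \
    VNC_TYPING_LIMIT=15 TIMEOUT_SCALE=2 \
    QEMU_COMPRESS_LEVEL=1 QEMU_COMPRESS_THREADS=4 \
    QEMUCPU=host QEMUMACHINE=virt,gic-version=host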

Actions #8

Updated by mkittler about 2 years ago

  • Blocked by action #109494: Restore network connection of arm-4/5 size:M added
Actions #9

Updated by mkittler about 2 years ago

  • Status changed from Feedback to Blocked
Actions #10

Updated by mkittler about 2 years ago

journalctl | grep stuck prints nothing on arm-4/5 (unlike on aarch64) and the logs reach back to November. So we definitely see a different problem than on aarch64 (where we frequently see messages like kernel:[480826.136444] watchdog: BUG: soft lockup - CPU#49 stuck for 26s! [qemu-system-aar:13803]).

Instead we get lots of Nov 13 22:55:19 openqaworker-arm-5 kernel: ACPI CPPC: PCC check channel failed for ss: 0. ret=-110 and Nov 13 22:55:20 openqaworker-arm-5 kernel: CPPC Cpufreq:cppc_scale_freq_workfn: failed to read perf counters messages (on arm-4 and arm-5).

Apparently we're not the only ones seeing those messages with a ThunderX2 CPU: https://bugzilla.kernel.org/show_bug.cgi?id=208785

@ggardet_arm suggested disabling frequency scaling when I brought up these error messages (also see https://wiki.archlinux.org/title/CPU_frequency_scaling).
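
As an OS-level variant of that suggestion one could pin the governor instead of changing firmware settings; a sketch, assuming the cpupower tool is installed and a cpufreq driver is active:

sudo cpupower frequency-set -g performance   # pin the performance governor on all CPUs
cpupower frequency-info                      # verify driver, governor and current frequency
# without cpupower, the sysfs interface can be used directly:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor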

Considering https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1763817 the firmware setting cppccontrol=1 is relevant so we might try disabling this setting.

Actions #12

Updated by okurz about 2 years ago

Independent of fixing the actual performance problems, I think it can help in general for openQA tests if we configure the systems we test with key repeat disabled to mitigate the issues, i.e. "xset r off" within X11 sessions or whatever can be seen as the equivalent in non-X11 getty terminals. According to fvogt "There's also an ioctl for the vtcon". The command "kbdrate" allows configuring something but says only for "Intel" and does not allow disabling key repeat completely, only increasing the times, which could already help enough. By default the keyboard delay is 250 ms. Any different setting likely needs to be applied per VT. So in a true text tty one could use kbdrate -d 1000 and in X xset r off whenever we start a new terminal.
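
A rough sketch of both mitigations (the delay value is illustrative):

kbdrate -d 1000   # text VT: raise the repeat delay to 1000 ms (key repeat cannot be disabled entirely)
xset r off        # X11 session: turn keyboard auto-repeat off completely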

Actions #13

Updated by mkittler almost 2 years ago

  • Status changed from Blocked to In Progress

Thanks to Nick, arm-4 is back online. I disabled cppccontrol in the shell accessible at boot time, updated arm-4, applied the latest salt-states/pillars with 16 worker slots and scheduled 100 jobs¹: https://openqa.suse.de/tests/overview?build=test-arm4

Not sure whether it generally makes sense to disable frequency scaling. I suppose this suggestion effectively means locking the CPU at a certain frequency which seems like a huge disadvantage. So for now I only tried the cppccontrol approach.


¹ Cloned the 100 most recently passed/softfailed aarch64 jobs on OSD (see /tmp/jobs_to_clone_arm on OSD). So the fail ratio within this set of jobs should be very low.
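
A hedged sketch of how such a batch could be scheduled from a prepared list of job ids (the file name is the one mentioned above; the options, BUILD and WORKER_CLASS values are assumptions and may need adjusting, e.g. to also clone dependencies):

while read -r id ; do
    openqa-clone-job --within-instance https://openqa.suse.de --skip-download \
        "$id" BUILD=test-arm4 WORKER_CLASS=openqaworker-arm-4
done < /tmp/jobs_to_clone_arm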

Actions #14

Updated by openqa_review almost 2 years ago

  • Due date set to 2022-05-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by mkittler almost 2 years ago

  • Status changed from In Progress to Blocked

I'll have to re-run the tests because many failed due to networking problems. Maybe I had the problematic version of libslirp0 installed. I suppose some jobs also failed because their dependencies were not cloned, so I'll need to adjust the clone call as well. Due to general network problems arm-4 isn't reachable, so I'm currently blocked by https://sd.suse.com/servicedesk/customer/portal/1/SD-84633.

Actions #16

Updated by mkittler almost 2 years ago

  • Status changed from Blocked to In Progress

It indeed had the broken libslirp0 version installed. Newly scheduled jobs: https://openqa.suse.de/tests/overview?build=test-arm4-2
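
For future reference, a quick way to verify which libslirp0 version a worker has installed and to update it (package name taken from this ticket):

rpm -q libslirp0           # show the installed version
sudo zypper up libslirp0   # update to the current repository version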

Actions #17

Updated by mkittler almost 2 years ago

Actions #18

Updated by mkittler almost 2 years ago

With cppccontrol=0 the ACPI error mentioned in #109232#note-10 no longer occurs. Considering the job results I've seen so far it likely still doesn't help with the typing issues, though. However, I'll wait for all jobs to finish before drawing a conclusion (as the other arm workers also have some typing issues).

Further typing issues:

Actions #19

Updated by mkittler almost 2 years ago

33 % failures is not good (within a set of jobs that passed on other arm workers before):

openqa=> with test_jobs as (select distinct id, result from jobs where build = 'test-arm4-2') select result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.result order by ratio desc;
      result      |         ratio          
------------------+------------------------
 passed           |    40.8510638297872340
 failed           |    33.6170212765957447
 parallel_failed  |    14.8936170212765957
 incomplete       |     8.5106382978723404
 timeout_exceeded | 0.85106382978723404255
 user_cancelled   | 0.85106382978723404255
 softfailed       | 0.42553191489361702128
(7 rows)
Actions #20

Updated by mkittler almost 2 years ago

I now disabled cpu frequency scaling via the firmware menu:

CAVM_CN99xx# env set progcpufreq 0
CPU Freq set by SYS will be used 
Env Var progcpufreq set with Value 0 
Execute 'env save' Command to make the changes persistent 
CAVM_CN99xx# env save
drivername snor
snor_erase: off=0x3ff0000, len=0x10000

-----------------------------------
       ENV Variable Settings 
-----------------------------------
Name                  : Value 
-----------------------------------
turbo                 : 2 
smt                   : 4 
corefreq              : 2199 
numcores              : 32 
icispeed              : 1 
socnclk               : 666 
socsclk               : 1199 
memclk                : 2199 
ddrspeed_auto         : 1 
ddrspeed              : 2400 
progcpufreq           : 0 
progdevfreq           : 1 
dmc_node_channel_mask : 0000ffff 
thermcontrol          : 1 
thermlimit            : 110 
enter_debug_shell     : 0 
dbg_speed_up_ddr_lvl  : 0 
enable_dram_scrub     : 0 
ipmbcontrol           : 1
ddr_dmt_advanced      : 0 
cppccontrol           : 0
loglevel              : 0
uart_params           : 115200/8-N-1 none
core_feature_mask     : 0
sys_feature_mask      : 0x00000000
ddr_refresh_rate      : 1
fw_feature_mask       : 0x00000000
dram_ce_threshold     : 1
dram_ce_step_threshold: 0
dram_ce_record_max    : 10
dram_ce_window        : 60 sec
dram_ce_leak_rate     : 2000 msec/error
pcie_ce_threshold     : 1
pcie_ce_window        : 30 sec
pcie_ce_leak_rate     : 15000 msec/error
-----------------------------------
Actions #23

Updated by mkittler almost 2 years ago

Disabling CPU frequency scaling didn't help much either, considering all the test jobs:

openqa=> with test_jobs as (select distinct id, result from jobs where build = 'test-arm4-4') select result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.result order by ratio desc;
     result      |         ratio          
-----------------+------------------------
 failed          |    45.2054794520547945
 passed          |    37.8995433789954338
 parallel_failed |    11.8721461187214612
 incomplete      |     4.1095890410958904
 softfailed      | 0.91324200913242009132
(5 rows)
Actions #24

Updated by mkittler almost 2 years ago

  • Blocked by deleted (action #109494: Restore network connection of arm-4/5 size:M)
Actions #25

Updated by mkittler almost 2 years ago

  • Status changed from In Progress to Resolved

I removed the blocker #109494 because at least arm-4 is online again and that's sufficient.

I'd also like to conclude the issue here. The outcome is that the CPU (the specific model and version, Cavium ThunderX2) is known to behave badly for our use case and that's the main difference from the older arm workers (which have the previous version of that CPU model installed).

I tried disabling CPPC control and CPU frequency scaling in the firmware environment but it didn't make a difference. Before that we had already reduced the number of worker slots a lot and it didn't help either. There are still a few ideas to consider (see #109232#note-5) but this ticket is mainly about documenting differences, so I'm resolving it now.

Actions #26

Updated by okurz almost 2 years ago

  • Status changed from Resolved to Feedback

Please make sure that the relevant ideas and suggestions to follow are written down in open tickets, e.g. the parent epic or another subticket of the same epic.

Actions #27

Updated by mkittler almost 2 years ago

  • Status changed from Feedback to Resolved
Actions #28

Updated by okurz almost 2 years ago

  • Due date deleted (2022-05-10)
Actions #29

Updated by okurz almost 2 years ago

  • Category changed from Organisational to Feature requests