action #109232

closed

coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

Document relevant differences of arm-4/5 vs. arm-1/2/3 and aarch64.o.o, involve domain experts in asking what parameters are important to be able to run openQA tests size:M

Added by okurz almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2022-03-30
Due date:
% Done: 0%
Estimated time:
Description

Motivation

The last time we spoke about this, we came up with the idea of involving ARM experts. okurz asked ggardet_arm in https://app.element.io/#/room/#openqa:opensuse.org (or maybe it was #opensuse-factory) and he offered help but needs more details about the machines.

Acceptance criteria

  • AC1: Differences between arm-4/5 and arm-1/2/3 and aarch64.o.o are known (i.e. documented at least in ticket)
  • AC2: Domain experts are aware of the problems we face and have an opportunity to take a look at the differences

Suggestions

  • As necessary switch on machines, e.g. ipmi-ipmi.openqaworker-arm-4.qa power on
  • We suggest getting details, e.g. log in, call dmesg and dmidecode, provide those details in the ticket and ask ggardet_arm again whether he needs more
  • Maybe something about hugepages, cpu flags, some boot kernel parameters to work around I/O quirks, anything like that.
  • Optional: Produce "diff" to arm-1/2/3 and aarch64.o.o
  • Optional: Send an email to SUSE internal ARM mailing list or Slack channel to ask for help.
  • Optional: Provide access to internal machines temporarily over tmate within an investigation session

Files

dmidecode-openqaworker-arm-1.txt (16.2 KB) mkittler, 2022-04-05 10:19
dmidecode-openqaworker-arm-2.txt (21.5 KB) mkittler, 2022-04-05 10:19
dmidecode-openqaworker-arm-3.txt (21.5 KB) mkittler, 2022-04-05 10:19
dmidecode-openqaworker-arm-4.txt (31.5 KB) mkittler, 2022-04-05 10:59
dmidecode-openqaworker-arm-5.txt (31.5 KB) mkittler, 2022-04-05 10:59
inxi-openqaworker-arm-1.txt (11.8 KB) mkittler, 2022-04-05 10:59
inxi-openqaworker-arm-2.txt (14 KB) mkittler, 2022-04-05 10:59
inxi-openqaworker-arm-3.txt (15.9 KB) mkittler, 2022-04-05 10:59
inxi-openqaworker-arm-4.txt (5.8 KB) mkittler, 2022-04-05 10:59
inxi-openqaworker-arm-5.txt (5.8 KB) mkittler, 2022-04-05 10:59
dmidecode-aarch64.txt (23 KB) mkittler, 2022-04-05 12:40
inxi-aarch64.txt (3.74 KB) mkittler, 2022-04-05 12:40
Actions #1

Updated by mkittler almost 3 years ago

  • Assignee set to mkittler

Updated by mkittler almost 3 years ago

I've also generated a more condensed summary with inxi (for w in openqaworker-arm-1 openqaworker-arm-2 openqaworker-arm-3 ; do ssh "$w" sudo inxi -c0 -F -xxx > "inxi-$w.txt" ; done).

On arm 4 and 5 I executed the commands via IPMI because the network isn't working on these machines.

Updated by mkittler almost 3 years ago

@ggardet_arm We have problems with the arm workers used in our internal openQA instance. Maybe you have an idea why the fail ratio of openQA tests conducted on the two aarch64 workers arm-4/5 is higher (~ 30 %) compared to the three aarch64 workers arm-1/2/3 (~ 15 %)? The fail ratio is within the same time range and jobs were scheduled equally. I've tried to summarize some differences between these aarch64 machines below and attached dmidecode/inxi output from all those machines in previous comments.

Note that arm-4/5 are quite new and were supposed to improve our generally bad situation with arm workers. Our existing workers arm-1/2/3 are actually quite unstable themselves as they randomly crash very often. However, their supposed replacements arm-4/5 are even worse. The new workers don't crash but as mentioned their fail ratio is much higher compared to the old/crashing ones.

At this point both groups of workers use Leap 15.3 and we've tested several kernel versions on arm-4/5 without noticing a difference in the fail ratio. We also reduced the number of worker slots on arm-4/5 to only 4 slots (per worker). So even if they were just quite slow that should have been compensated.

By the way, the only good aarch64 openQA worker we have is the worker "aarch64" used on openqa.opensuse.org. So I've also attached some details about this worker for a comparison.

Unfortunately I cannot give you network access to arm-4/5 at this point. Let me know if you need any further details.


Differences between arm-4 and arm-5: none, they seem to be identical except for serial numbers
So it should be sufficient to look at just one of the specs.
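
A comparison like this can be reproduced from the attached dmidecode dumps. The following is only a sketch in Python and not the method actually used here; the list of ignored fields (Serial Number, UUID, Asset Tag) and the function names are assumptions made for illustration.

```python
import difflib
import re

# Fields expected to differ between otherwise identical machines.
# This list is an assumption; extend it as needed.
IGNORED_FIELDS = re.compile(r"^\s*(Serial Number|UUID|Asset Tag):")


def normalize(dump):
    """Drop per-machine fields and trailing whitespace from a dmidecode dump."""
    return [
        line.rstrip()
        for line in dump.splitlines()
        if not IGNORED_FIELDS.match(line)
    ]


def hardware_diff(dump_a, dump_b):
    """Return the lines that differ between two normalized dumps."""
    a, b = normalize(dump_a), normalize(dump_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Lines only in the first dump are prefixed "- ", lines only in
        # the second dump are prefixed "+ ".
        changes += [f"- {line}" for line in a[i1:i2]]
        changes += [f"+ {line}" for line in b[j1:j2]]
    return changes
```

Feeding it the contents of dmidecode-openqaworker-arm-4.txt and dmidecode-openqaworker-arm-5.txt should return an empty list if the machines really only differ in the ignored fields; comparing against one of the arm-1/2/3 dumps lists the remaining hardware differences.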


Differences between arm-4/5 and arm-1/2/3:

  1. The kernel version differs. However, that should not be relevant. We've already tested various kernel versions on arm-4/5 and it didn't make a difference.
  2. arm-4/5 seem to be a newer version of the same product from the same vendor (R181-T92-00 > R120-T32, both from GIGABYTE)
  3. The mainboard is newer but from the same vendor and likely similar (MT91-FS4-00 > MT30-GS1).
  4. The CPU is faster judging by frequencies and core counts (Cavium ThunderX2(R) CPU CN9980 v2.2 @ 2.20GHz > whatever arm-1/2/3 have installed).
  5. The CPU has more features/characteristics (Hardware Thread, Power/Performance Control).
Actions #5

Updated by ggardet_arm almost 3 years ago

These seem to be ThunderX1 and ThunderX2 machines. From my experience, they are not good as openQA workers and are better used for building packages.

You can check/try:

  • Disable hardware threading in the firmware (this lowers the number of CPUs seen by the kernel)
  • Check the actual CPU frequency
  • Check the temperature (CPU throttling could lower the CPU frequency and reduce performance)
  • Use a single socket instead of dual sockets (may be configurable in the firmware)
  • Use a distribution without LSE atomics (known to be slow on TX2)
  • Disable mitigations (KPTI, etc.)
  • Enable/disable huge pages
  • Run sudo perf stat while the system is busy with openQA tests
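
For the "check actual CPU frequency" point, one possible way is to read the per-CPU values the Linux cpufreq subsystem exposes in sysfs. This is a minimal Python sketch under the assumption that a cpufreq driver is loaded and scaling_cur_freq is present on the worker; if it is not, the function simply returns an empty result. The function names are made up for illustration.

```python
from pathlib import Path


def read_cpu_frequencies_mhz(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: current frequency in MHz} for CPUs exposing cpufreq."""
    frequencies = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_cur_freq"
    for freq_file in sorted(Path(sysfs_root).glob(pattern)):
        cpu_name = freq_file.parent.parent.name
        # The kernel reports the value in kHz.
        frequencies[cpu_name] = int(freq_file.read_text().strip()) / 1000
    return frequencies


def summarize(frequencies):
    """Return (min, max, mean) in MHz, or None if no CPU exposes cpufreq."""
    if not frequencies:
        return None
    values = list(frequencies.values())
    return min(values), max(values), sum(values) / len(values)
```

Calling summarize(read_cpu_frequencies_mhz()) repeatedly while openQA jobs are running would show whether the cores stay near their nominal frequency or get throttled.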
Actions #6

Updated by mkittler almost 3 years ago

Thanks. I'll look into these points.


Because it came up as well: The failing tests seem to be a mix of typing issues and connection issues. Here are some examples:

(via select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqaworker-arm-4' and (result = 'failed') and t_finished >= '2021-08-24T00:00:00')

Actions #7

Updated by ggardet_arm almost 3 years ago

You can also play with openQA settings:

  • VNC_TYPING_LIMIT
  • TIMEOUT_SCALE
  • QEMU_COMPRESS_LEVEL
  • QEMU_COMPRESS_THREADS
  • QEMUCPU (I recommend host)
  • QEMUMACHINE (I recommend virt,gic-version=host)

and, of course, the number of parallel openQA jobs.
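
To try such settings on individual jobs, they can be passed as KEY=VALUE arguments when cloning a job with openqa-clone-job. The Python sketch below only builds the command line. QEMUCPU and QEMUMACHINE use the values recommended above; the TIMEOUT_SCALE and VNC_TYPING_LIMIT values, the job ID and the default host are placeholders chosen for illustration, not tested recommendations.

```python
import shlex

# QEMUCPU and QEMUMACHINE follow the recommendation in this comment.
# The other two values are placeholders (assumptions), not tested settings.
EXAMPLE_SETTINGS = {
    "QEMUCPU": "host",
    "QEMUMACHINE": "virt,gic-version=host",
    "TIMEOUT_SCALE": "2",
    "VNC_TYPING_LIMIT": "10",
}


def clone_command(job_id, settings, host="https://openqa.suse.de"):
    """Build an openqa-clone-job call that overrides the given settings."""
    args = ["openqa-clone-job", "--within-instance", host, str(job_id)]
    args += [f"{key}={value}" for key, value in sorted(settings.items())]
    return args


def as_shell(args):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(args)
```

For example, as_shell(clone_command(1234, EXAMPLE_SETTINGS)) yields a command for the hypothetical job 1234 that can be reviewed before running it.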

Actions #8

Updated by mkittler almost 3 years ago

  • Blocked by action #109494: Restore network connection of arm-4/5 size:M added
Actions #9

Updated by mkittler almost 3 years ago

  • Status changed from Feedback to Blocked
Actions #10

Updated by mkittler almost 3 years ago

journalctl | grep stuck prints nothing on arm-4/5 (unlike on aarch64) and the logs reach back to November. So we definitely see a different problem than on aarch64 (where we frequently see messages like kernel:[480826.136444] watchdog: BUG: soft lockup - CPU#49 stuck for 26s! [qemu-system-aar:13803]).

Instead we get lots of Nov 13 22:55:19 openqaworker-arm-5 kernel: ACPI CPPC: PCC check channel failed for ss: 0. ret=-110 and Nov 13 22:55:20 openqaworker-arm-5 kernel: CPPC Cpufreq:cppc_scale_freq_workfn: failed to read perf counters messages (on arm-4 and arm-5).

Apparently we're not the only ones seeing those messages with a ThunderX2 CPU: https://bugzilla.kernel.org/show_bug.cgi?id=208785

@ggardet_arm suggested disabling frequency scaling when I brought up these error messages (also see https://wiki.archlinux.org/title/CPU_frequency_scaling).

Considering https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1763817 the firmware setting cppccontrol=1 is relevant so we might try disabling this setting.
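
To compare how often each kind of message shows up per host, the journal output could be counted by message type. A small Python sketch, assuming the message texts quoted in this comment; the category names are made up for illustration.

```python
import re
from collections import Counter

# Patterns taken from the kernel messages quoted in this comment.
# The category names on the left are arbitrary labels.
PATTERNS = {
    "soft_lockup": re.compile(r"watchdog: BUG: soft lockup - CPU#\d+ stuck"),
    "cppc_pcc_check_failed": re.compile(r"ACPI CPPC: PCC check channel failed"),
    "cppc_perf_counters": re.compile(
        r"cppc_scale_freq_workfn: failed to read perf counters"
    ),
}


def classify(lines):
    """Count how many log lines match each known message pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The lines can come from any iterable, e.g. sys.stdin when piping journalctl output into a script that calls classify.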

Actions #12

Updated by okurz almost 3 years ago

Independent of fixing the actual performance problems, I think it can help openQA tests in general if we configure the systems we test with key repeat disabled to mitigate the issues, i.e. "xset r off" within X11 sessions or whatever is equivalent in non-X11 getty terminals. According to fvogt "There's also an ioctl for the vtcon". The command "kbdrate" allows configuring something but says it is only for "Intel" and does not allow disabling key repeat completely, only increasing the delay, which could already help enough. By default the keyboard delay is "250 ms". Any different setting likely needs to be applied per VT. So in a true text tty one could use kbdrate -d 1000 and in X xset r off whenever we start a new terminal.

Actions #13

Updated by mkittler over 2 years ago

  • Status changed from Blocked to In Progress

Thanks to Nick arm-4 is back online. I disabled cppccontrol in the shell accessible at boot time, updated arm-4, applied the latest salt-states/pillars with 16 worker slots and scheduled 100 jobs¹: https://openqa.suse.de/tests/overview?build=test-arm4

Not sure whether it generally makes sense to disable frequency scaling. I suppose this suggestion effectively means locking the CPU at a certain frequency, which seems like a huge disadvantage. So for now I only tried the cppccontrol approach.


¹ Cloned the 100 most recently passed/softfailed aarch64 jobs on OSD (see /tmp/jobs_to_clone_arm on OSD). So the fail ratio within this set of jobs should be very low.

Actions #14

Updated by openqa_review over 2 years ago

  • Due date set to 2022-05-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Blocked

I'll have to rerun the tests because many failed due to networking problems. Maybe I had the problematic version of libslirp0 installed. I suppose some jobs also failed because dependencies were not cloned, so I'll need to adjust the clone call too. Due to general network problems arm-4 isn't reachable so I'm currently blocked by https://sd.suse.com/servicedesk/customer/portal/1/SD-84633.

Actions #16

Updated by mkittler over 2 years ago

  • Status changed from Blocked to In Progress

It had indeed the broken libslirp0 version installed. Newly scheduled jobs: https://openqa.suse.de/tests/overview?build=test-arm4-2

Actions #17

Updated by mkittler over 2 years ago

Actions #18

Updated by mkittler over 2 years ago

With cppccontrol=0 the ACPI error mentioned in #109232#note-10 no longer occurs. Considering the job results I've seen so far it likely still doesn't help with the typing issues, though. However, I'll wait for all jobs to finish before drawing a conclusion (as the other arm workers also have some typing issues).

Further typing issues:

Actions #19

Updated by mkittler over 2 years ago

33 % failures is not good (within a set of jobs that passed on other arm workers before):

openqa=> with test_jobs as (select distinct id, result from jobs where build = 'test-arm4-2') select result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.result order by ratio desc;
      result      |         ratio          
------------------+------------------------
 passed           |    40.8510638297872340
 failed           |    33.6170212765957447
 parallel_failed  |    14.8936170212765957
 incomplete       |     8.5106382978723404
 timeout_exceeded | 0.85106382978723404255
 user_cancelled   | 0.85106382978723404255
 softfailed       | 0.42553191489361702128
(7 rows)
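
The same per-result percentages can also be computed outside the database, e.g. from job results fetched some other way. A Python sketch mirroring the grouping of the query; the function name and the input format (a plain list of result strings, one per distinct job) are assumptions.

```python
from collections import Counter


def result_ratios(results):
    """Return [(result, percentage)] sorted by descending percentage."""
    counts = Counter(results)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(
        ((result, 100.0 * count / total) for result, count in counts.items()),
        key=lambda item: item[1],
        reverse=True,
    )
```

Like the query, this reports every result type separately, so "failed", "parallel_failed" and "incomplete" have to be added up by the reader when judging the overall failure rate.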
Actions #20

Updated by mkittler over 2 years ago

I now disabled cpu frequency scaling via the firmware menu:

CAVM_CN99xx# env set progcpufreq 0
CPU Freq set by SYS will be used 
Env Var progcpufreq set with Value 0 
Execute 'env save' Command to make the changes persistent 
CAVM_CN99xx# env save
drivername snor
snor_erase: off=0x3ff0000, len=0x10000

-----------------------------------
       ENV Variable Settings 
-----------------------------------
Name                  : Value 
-----------------------------------
turbo                 : 2 
smt                   : 4 
corefreq              : 2199 
numcores              : 32 
icispeed              : 1 
socnclk               : 666 
socsclk               : 1199 
memclk                : 2199 
ddrspeed_auto         : 1 
ddrspeed              : 2400 
progcpufreq           : 0 
progdevfreq           : 1 
dmc_node_channel_mask : 0000ffff 
thermcontrol          : 1 
thermlimit            : 110 
enter_debug_shell     : 0 
dbg_speed_up_ddr_lvl  : 0 
enable_dram_scrub     : 0 
ipmbcontrol           : 1
ddr_dmt_advanced      : 0 
cppccontrol           : 0
loglevel              : 0
uart_params           : 115200/8-N-1 none
core_feature_mask     : 0
sys_feature_mask      : 0x00000000
ddr_refresh_rate      : 1
fw_feature_mask       : 0x00000000
dram_ce_threshold     : 1
dram_ce_step_threshold: 0
dram_ce_record_max    : 10
dram_ce_window        : 60 sec
dram_ce_leak_rate     : 2000 msec/error
pcie_ce_threshold     : 1
pcie_ce_window        : 30 sec
pcie_ce_leak_rate     : 15000 msec/error
-----------------------------------
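
To keep track of which firmware variables were changed between test runs (or to compare arm-4 with arm-5), the table printed by the firmware shell can be parsed into a dictionary. A Python sketch that assumes the "Name : Value" table layout shown above; the function names are made up for illustration.

```python
def parse_env(console_output):
    """Parse the 'Name : Value' table of the firmware shell into a dict."""
    settings = {}
    in_table = False
    for line in console_output.splitlines():
        name, sep, value = (part.strip() for part in line.partition(":"))
        if not in_table:
            # Skip everything (prompts, flash messages) before the header.
            in_table = bool(sep) and name == "Name" and value == "Value"
            continue
        if sep and name:  # separator lines contain no colon and are skipped
            settings[name] = value
    return settings


def changed_settings(before, after):
    """Return {name: (old, new)} for variables whose value differs."""
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(set(before) | set(after))
        if before.get(name) != after.get(name)
    }
```

Comparing a dump taken before and after a change should then list just the touched variables, e.g. progcpufreq or cppccontrol.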
Actions #23

Updated by mkittler over 2 years ago

Disabling cpu frequency scaling didn't help much, also considering all the test jobs:

openqa=> with test_jobs as (select distinct id, result from jobs where build = 'test-arm4-4') select result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.result order by ratio desc;
     result      |         ratio          
-----------------+------------------------
 failed          |    45.2054794520547945
 passed          |    37.8995433789954338
 parallel_failed |    11.8721461187214612
 incomplete      |     4.1095890410958904
 softfailed      | 0.91324200913242009132
(5 rows)
Actions #24

Updated by mkittler over 2 years ago

  • Blocked by deleted (action #109494: Restore network connection of arm-4/5 size:M)
Actions #25

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Resolved

I've removed the blocker #109494 because at least arm-4 is online again and that's sufficient.

I'd also like to conclude the issue here. The outcome is that the CPU (the specific model and version, Cavium ThunderX2) is known to behave badly for our use-case and that's the difference to the older arm workers (which have the previous version of that CPU model installed).

I tried disabling CPPC control (cppccontrol) and CPU frequency scaling (progcpufreq) in the firmware environment but it didn't make a difference. Before that we had already tried reducing the number of worker slots considerably and it didn't help either. There are still a few ideas to consider (see #109232#note-5) but this ticket is mainly about documenting differences so I'm resolving it now.

Actions #26

Updated by okurz over 2 years ago

  • Status changed from Resolved to Feedback

Please make sure that the relevant ideas and suggestions to follow are written down in open tickets, e.g. the parent epic or another subticket of the same epic.

Actions #27

Updated by mkittler over 2 years ago

  • Status changed from Feedback to Resolved
Actions #28

Updated by okurz over 2 years ago

  • Due date deleted (2022-05-10)
Actions #29

Updated by okurz over 2 years ago

  • Category changed from Organisational to Feature requests