coordination #102882 (closed)

[epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service

Added by okurz about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Start date: 2022-02-10
Due date: -
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

User report: https://suse.slack.com/archives/C02CANHLANP/p1637666699462700
mdoucha: "All jobs are stuck downloading assets until they time out. OSD dashboard shows that the workers are downloading ridiculous amounts of data all the time since yesterday."

Suggestions

  • Find corresponding monitoring data on https://monitor.qa.suse.de/ that can be used to visualize the problem as well as to verify any potential fix
  • Identify what might have caused such problems "since yesterday", i.e. since 2021-11-22

Rollback steps (to be done once the actual issue has been resolved)

powerqaworker-qam-1 # systemctl unmask openqa-worker-auto-restart@{3..6} openqa-reload-worker-auto-restart@{3..6}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{3..6} openqa-reload-worker-auto-restart@{3..6}.{service,timer}
QA-Power8-4-kvm # systemctl unmask openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer}
QA-Power8-5-kvm # systemctl unmask openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer}
  • Add qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de back to salt and ensure all services are running again.
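
A quick sanity check after the rollback could look like this (a sketch, not part of the original steps; assumes root ssh access to the workers):

for h in powerqaworker-qam-1.qa.suse.de qa-power8-4-kvm.qa.suse.de qa-power8-5-kvm.qa.suse.de; do
  echo "== $h"
  # expect no failed units and all worker auto-restart slots loaded again
  ssh root@"$h" "systemctl --failed --no-legend; systemctl list-units 'openqa-worker-auto-restart@*' --no-legend"
done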

Subtasks 6 (0 open, 6 closed)

action #106538: lessons learned "five whys" for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:S (Resolved, okurz, 2022-02-10)

action #106540: Mitigate/resolve All OSD PPC64LE workers except malbec appear to have horribly broken cache service (Resolved, kraih, 2022-02-10)

action #106543: Conduct rollback steps and check impact for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:M (Resolved, kraih, 2022-02-10)

action #107083: SUSE QE Tools team must learn about switch administration and get access (Resolved, okurz, 2022-02-18)

action #107086: Ask for volunteers in SUSE QE Tools that would be able to visit the Nbg server rooms, e.g. as second person accompanying nsinger or any potential new admin (Resolved, okurz, 2022-02-18)

action #107089: Make SUSE QE Tools team aware that we need to support EngInfra due to limited capacity (Resolved, okurz, 2022-02-18)


Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #104106: [qe-core] test fails in await_install - Network performance for ppc installations is decreasing size:S (Resolved, mkittler, 2021-12-16)

Related to openQA Project (public) - action #105804: Job age (scheduled) (median) alert size:S (Resolved, mkittler, 2022-02-01)

Copied to openQA Project (public) - coordination #102951: [epic] Better network performance monitoring (Resolved, okurz, 2021-11-24)

Actions #1

Updated by okurz about 3 years ago

powerqaworker-qam-1:/home/okurz # ps auxf | grep openqa
root      88223  0.0  0.0   4608  1472 pts/2    S+   12:45   0:00                                      \_ grep --color=auto openqa
_openqa+   4976  0.0  0.0 118720 114496 ?       Ss   Nov21   0:42 /usr/bin/perl /usr/share/openqa/script/worker --instance 7
_openqa+   4983  0.0  0.0 112640 108608 ?       Ss   Nov21   0:36 /usr/bin/perl /usr/share/openqa/script/worker --instance 8
_openqa+  29016  0.0  0.0  90624 78144 ?        Ss   Nov22   0:09 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  51130  0.0  0.0  90624 72896 ?        S    06:07   0:21  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  77232  0.0  0.0  90624 74048 ?        S    10:41   0:07  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  80692  0.0  0.0  90624 73600 ?        S    11:25   0:04  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  80892  0.0  0.0  90624 73728 ?        S    11:27   0:04  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  29017  0.0  0.0  82368 78144 ?        Ss   Nov22   0:50 /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  78221  1.1  0.0  84416 73920 ?        S    10:53   1:19  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  79516  1.1  0.0  85120 74560 ?        S    11:09   1:07  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  80419  1.1  0.0  84544 74048 ?        S    11:21   0:58  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  85600  1.1  0.0  84352 73984 ?        S    12:21   0:16  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  86025  1.1  0.0  84544 73984 ?        S    12:26   0:12  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  83287  0.0  0.0  76224 72192 ?        Ss   11:51   0:02 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
_openqa+  83289  0.0  0.0  75968 72192 ?        Ss   11:51   0:02 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
_openqa+  83521  0.0  0.0  76096 72128 ?        Ss   11:54   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 6
_openqa+  84583  0.0  0.0  76352 72320 ?        Ss   12:08   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
_openqa+  87663  0.1  0.0  75968 72064 ?        Ss   12:40   0:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
_openqa+  87759  0.2  0.0  76160 72192 ?        Ss   12:41   0:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 3

and strace-ing the oldest still-running cache minion process reveals:

poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "O\253\302\342\277\t\367z\305x\340E\325\344\340\353\23\261+\353\r\21\315\207\211\301\334\251\364\357\262\347"..., 131072) = 4284
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "O\253\302\342\277\t\367z\305x\340E\325\344\340\353\23\261+\353\r\21\315\207\211\301\334\251\364\357\262\347"..., 4284) = 4284
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\270\321x\243\356\263&\303_\254E{\242.\370-\32\274!\275YC\177\244\265\206\355T\227\7\327\255"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\270\321x\243\356\263&\303_\254E{\242.\370-\32\274!\275YC\177\244\265\206\355T\227\7\327\255"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "U\269\247NE;\325\ta\210\275\314y\244M\346]4 \340Y\312\343<\374~\376\370_\336@"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "U\269\247NE;\325\ta\210\275\314y\244M\346]4 \340Y\312\343<\374~\376\370_\336@"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\256\231h]\240\361Y>}\213\376\221m\310\263:\27\310\33204u\327=(\2729/\317\252\367\22"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\256\231h]\240\361Y>}\213\376\221m\310\263:\27\310\33204u\327=(\2729/\317\252\367\22"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\2134\255\354\216\34\t\365\305+\250\25y\207s\204y\234\235\253\332}\376\356x\251\346C2\17\370="..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\2134\255\354\216\34\t\365\305+\250\25y\207s\204y\234\235\253\332}\376\356x\251\346C2\17\370="..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\364+\375\200^\36\362yB\314\210.[}&\37\351\231\371\36\247\22\317\245~\260\vy\205\354\206_"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\364+\375\200^\36\362yB\314\210.[}&\37\351\231\371\36\247\22\317\245~\260\vy\205\354\206_"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "2N\235\306,\317\333\"W\37\267\304f\236\234\n\317\376\367\314\206\375\261\226\32#W\v\316\246\221\265"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "2N\235\306,\317\333\"W\37\267\304f\236\234\n\317\376\367\314\206\375\261\226\32#W\v\316\246\221\265"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\301c\314\335\6zK\35^E\314\330\23\276\10\301\277\360\367\216_\6?Y\220\t\370\330\30gtU"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\301c\314\335\6zK\35^E\314\330\23\276\10\301\277\360\367\216_\6?Y\220\t\370\330\30gtU"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "S\252\2354\3654g\246\217x\"\272\307E\226K\5J\255\350(\331\223\fE\357V\253l\1W\340"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "S\252\2354\3654g\246\217x\"\272\307E\226K\5J\255\350(\331\223\fE\357V\253l\1W\340"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\257m.\324\265Y\361\2k^\270\374\335\251o~\374\351\271\177\354\213\16K_\273\v5\231\262/\236"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\257m.\324\265Y\361\2k^\270\374\335\251o~\374\351\271\177\354\213\16K_\273\v5\231\262/\236"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, " -\3471\373+\273kr`\323cf[}F\227\27\265\23\313\243\366\25\366{}\324\356Zf\27"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, " -\3471\373+\273kr`\323cf[}F\227\27\265\23\313\243\366\25\366{}\324\356Zf\27"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\351\254\231\325\320#z\230\351\335-\217\214\350\354\3041\232\227*\27\332rE\251\274o\305\305\265\232\363"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\351\254\231\325\320#z\230\351\335-\217\214\350\354\3041\232\227*\27\332rE\251\274o\305\305\265\232\363"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "C\337\206\23W\366O\302\v\226\207\256\273\26o\302\265$\306\375\6O\265\263|\250\276-\254\275Y\263"..., 131072) = 4284
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "C\337\206\23W\366O\302\v\226\207\256\273\26o\302\265$\306\375\6O\265\263|\250\276-\254\275Y\263"..., 4284) = 4284
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\34\242\376\314\n\267\323\322\244\2300,o\270?~\315\234\236\277f~\225\271i\372\26\257\27{\342\227"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\34\242\376\314\n\267\323\322\244\2300,o\270?~\315\234\236\277f~\225\271i\372\26\257\27{\342\227"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\236\304;r\303\342uxB}b\31I\357\333\214\242\213^\243.\350\33Zk\317@3\t\307S#"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\236\304;r\303\342uxB}b\31I\357\333\214\242\213^\243.\350\33Zk\317@3\t\307S#"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "Fy_\362\317N\230{Y\376\366\364\206\264\37\277\323\31\256h=\350\36=\212\37\257\352\177\324\226~"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "Fy_\362\317N\230{Y\376\366\364\206\264\37\277\323\31\256h=\350\36=\212\37\257\352\177\324\226~"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "~E\263\307\231\370\204\374P\361\234kK\347|\324\372\325\253\277\276\362\345\253\344\341\303\346\317\277\273\242"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "~E\263\307\231\370\204\374P\361\234kK\347|\324\372\325\253\277\276\362\345\253\344\341\303\346\317\277\273\242"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\364J\376p\r\217@W\361.b\232\0313.\351\220\325,\356#=k\3253\256\244U\203\374\266#"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\364J\376p\r\217@W\361.b\232\0313.\351\220\325,\356#=k\3253\256\244U\203\374\266#"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\267\231k\345\311\333SV\313\373\366\25\177\364\n\357\312\317\325r\305\322ox\30yAo\320\32\371\320"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\267\231k\345\311\333SV\313\373\366\25\177\364\n\357\312\317\325r\305\322ox\30yAo\320\32\371\320"..., 1428) = 1428

So the process is busy reading over the network and writing into a local cache file?
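
For reference, a trace with file descriptor targets decoded as above can be captured with something along these lines (the exact invocation used is not recorded in this ticket):

powerqaworker-qam-1 # strace -f -y -e trace=poll,read,write -p <pid of the cache minion process>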

Actions #2

Updated by mkittler about 3 years ago

On the Minion dashboard no download jobs have been piling up. However, judging by htop, the speed at which it writes to disk is below 1 MB/s (per process). That's very slow. And yes, it is reading over the network and writing into a local cache file. I suppose that is expected, just not that it is this slow.

The network connection to OSD isn't generally slow. I've just tested with iperf3 on power8-4-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de and got > 600 Mbit/s. The write performance on /var/lib/openqa/cache/tmp also looks good on both workers.
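
The exact commands used aren't recorded here; typical checks would look something like this (a sketch; the test file name is made up):

worker # iperf3 -c openqa.suse.de
worker # dd if=/dev/zero of=/var/lib/openqa/cache/tmp/ddtest bs=1M count=1024 oflag=direct status=progress && rm /var/lib/openqa/cache/tmp/ddtest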

Actions #3

Updated by mkittler about 3 years ago

Judging by the job history, the affected machines are qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de. grenache-1 and malbec look good.

Actions #4

Updated by nicksinger about 3 years ago

mkittler wrote:

Judging by the job history, the affected machines are qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de. grenache-1 and malbec look good.

grenache-1 looking good is an interesting observation as it would show that affected machines are not only in the qa.suse.de subdomain. Given our history I'd recommend checking the network performance to/from OSD using iperf3 with the respective parameters for IPv4 and IPv6. Maybe this reveals some first details.
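
For reference, such a test needs an iperf3 server on the OSD side; a minimal invocation would be (a sketch):

openqa # iperf3 -s
worker # iperf3 -4 -c openqa.suse.de
worker # iperf3 -6 -c openqa.suse.de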

Actions #5

Updated by mkittler about 3 years ago

  • Assignee set to mkittler
Actions #6

Updated by mkittler about 3 years ago

I've checked with iperf3 again. There was no difference between using -4 and -6.

Actions #7

Updated by kraih about 3 years ago

Not seeing anything unusual in the logs on powerqaworker-qam-1.qa.suse.de either.

Actions #8

Updated by mkittler about 3 years ago

When using iperf3 -R to test downloading (from OSD) on qa-power8-4-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de there's a huge slowdown to < 5 Mbit/s (regardless of whether IPv4 or IPv6 is used). That's not the case on the good host malbec, so I assume we have found our problem - unless this is really just due to the ongoing downloads. The ongoing downloads use only 20 Mbit/s (3.33 Mbyte/s), which is very slow. Even adding that to the performance test speed, we're still only at a receive rate of 25 Mbit/s.

Actions #9

Updated by nicksinger about 3 years ago

All affected machines seem to be located in SRV2 according to racktables: https://racktables.suse.de/index.php?page=object&tab=default&object_id=3026
Here are some network graphs for the switch they are most likely connected to: http://mrtg.suse.de/qanet13nue/index.html

I checked the connection speeds on that switch. According to these graphs, 3 of these ports seem to max out at ~100 Mbit/s (still quite a bit more than measured by @mkittler):

qanet13nue#show interfaces status
                                             Flow Link          Back   Mdix
Port     Type         Duplex  Speed Neg      ctrl State       Pressure Mode
-------- ------------ ------  ----- -------- ---- ----------- -------- -------
gi1      1G-Copper    Full    1000  Enabled  Off  Up          Disabled Off    
gi2      1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     
gi3      1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     
gi4      1G-Copper      --      --     --     --  Down           --     --    
gi5      1G-Copper    Full    1000  Enabled  Off  Up          Disabled Off    
gi6      1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     
gi7      1G-Copper    Full    100   Enabled  Off  Up          Disabled On     
gi8      1G-Copper    Full    1000  Enabled  Off  Up          Disabled Off    
gi9      1G-Copper      --      --     --     --  Down           --     --    
gi10     1G-Copper      --      --     --     --  Down           --     --    
gi11     1G-Copper    Full    100   Enabled  Off  Up          Disabled On     
gi12     1G-Copper    Full    100   Enabled  Off  Up          Disabled On     
gi13     1G-Copper    Full    100   Enabled  Off  Up          Disabled Off    
gi14     1G-Copper      --      --     --     --  Down           --     --    
gi15     1G-Copper      --      --     --     --  Down           --     --    
gi16     1G-Copper      --      --     --     --  Down           --     --    
gi17     1G-Copper      --      --     --     --  Down           --     --    
gi18     1G-Copper      --      --     --     --  Down           --     --    
gi19     1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     

From the MAC address-table I see the following connections:
powerqaworker-qam-1.qa.suse.de: gi5
QA-Power8-5.qa.suse.de: gi8
QA-Power8-4.qa.suse.de: gi7

So only qa-power8-4 is connected at 100 Mbit/s.
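
For reference, that port-to-host mapping can presumably be read from the switch with the same Cisco-style CLI as in the output above (a sketch):

qanet13nue#show mac address-table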

Actions #10

Updated by mkittler about 3 years ago

I've stopped all services on powerqaworker-qam-1.qa.suse.de. Even without ongoing downloads the network speed is very slow:

martchus@powerqaworker-qam-1:~> iperf3 -R -4 -c openqa.suse.de -i 1 -t 30
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 10.162.7.211 port 38894 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   232 KBytes  1.90 Mbits/sec                  
[  5]   1.00-2.00   sec   861 KBytes  7.06 Mbits/sec                  
[  5]   2.00-3.00   sec   897 KBytes  7.34 Mbits/sec                  
[  5]   3.00-4.00   sec   441 KBytes  3.61 Mbits/sec                  
[  5]   4.00-5.00   sec   168 KBytes  1.38 Mbits/sec                  
[  5]   5.00-6.00   sec   810 KBytes  6.64 Mbits/sec                  
[  5]   6.00-7.00   sec   427 KBytes  3.50 Mbits/sec                  
[  5]   7.00-8.00   sec   157 KBytes  1.29 Mbits/sec                  
[  5]   8.00-9.00   sec   577 KBytes  4.73 Mbits/sec                  
[  5]   9.00-10.00  sec   566 KBytes  4.63 Mbits/sec                  
[  5]  10.00-11.00  sec   406 KBytes  3.32 Mbits/sec                  
[  5]  11.00-12.00  sec   714 KBytes  5.85 Mbits/sec                  
[  5]  12.00-13.00  sec   571 KBytes  4.68 Mbits/sec                  
[  5]  13.00-14.00  sec   925 KBytes  7.58 Mbits/sec                  
[  5]  14.00-15.00  sec   474 KBytes  3.88 Mbits/sec                  
[  5]  15.00-16.00  sec   952 KBytes  7.80 Mbits/sec                  
[  5]  16.00-17.00  sec   161 KBytes  1.32 Mbits/sec                  
[  5]  17.00-18.00  sec   218 KBytes  1.78 Mbits/sec                  
[  5]  18.00-19.00  sec  1.16 MBytes  9.72 Mbits/sec                  
[  5]  19.00-20.00  sec   475 KBytes  3.89 Mbits/sec                  
[  5]  20.00-21.00  sec   976 KBytes  7.99 Mbits/sec                  
[  5]  21.00-22.00  sec  1.38 MBytes  11.6 Mbits/sec                  
[  5]  22.00-23.00  sec   496 KBytes  4.07 Mbits/sec                  
[  5]  23.00-24.00  sec   358 KBytes  2.93 Mbits/sec                  
[  5]  24.00-25.00  sec  1024 KBytes  8.39 Mbits/sec                  
[  5]  25.00-26.00  sec   779 KBytes  6.38 Mbits/sec                  
[  5]  26.00-27.00  sec   761 KBytes  6.23 Mbits/sec                  
[  5]  27.00-28.00  sec   434 KBytes  3.56 Mbits/sec                  
[  5]  28.00-29.00  sec   663 KBytes  5.43 Mbits/sec                  
[  5]  29.00-30.00  sec   786 KBytes  6.44 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  18.6 MBytes  5.19 Mbits/sec  2284             sender
[  5]   0.00-30.00  sec  18.5 MBytes  5.16 Mbits/sec                  receiver

All affected workers are in the same rack: https://racktables.suse.de/index.php?page=rack&rack_id=520

Actions #11

Updated by mkittler about 3 years ago

  • Status changed from New to Feedback

I've created an Infra ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-67703
I've also just stopped all worker slots on the affected hosts and removed them from salt-key.

Actions #12

Updated by okurz about 3 years ago

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=84&tab=alert&from=1637506758247&to=1637657889676 maybe points to the same issue. The apache response time seems to have gone up in the past two days.

Actions #13

Updated by mkittler about 3 years ago

  • Description updated (diff)
Actions #14

Updated by okurz about 3 years ago

  1. Does running iperf3 in server mode incur any overhead when no clients are reading from it? If not, should we run it there permanently for monitoring and investigation purposes?
  2. @mkittler can you provide something like a "one-liner" to reproduce the problem, e.g. the necessary iperf3 command lines on both server and worker?
  3. Can we run the corresponding iperf3 commands periodically in our monitoring? I guess just some seconds every hour should provide enough data, and we can smooth it in Grafana
  4. I suggest to try to power down + power up the affected machines over IPMI (see the sketch below). Maybe this already helps with port renegotiation or something
  5. As the problem appeared only recently I suggest we roll back package changes, e.g. the kernel version. Even though some workers still behave fine it could still be a problem introduced by updates, with only some machines affected due to particularities of their network setup
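
Regarding point 4, a power cycle over IPMI could be done e.g. with ipmitool (a sketch; the BMC host and credentials are placeholders, not taken from this ticket):

ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> chassis power off
# wait a moment, then:
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> chassis power on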
Actions #15

Updated by nicksinger about 3 years ago

okurz wrote:

  1. Does running iperf3 in server mode incur any overhead when no clients are reading from it? If not, should we run it there permanently for monitoring and investigation purposes?

Running just the server does not really come with much overhead beyond the usual load an idling process causes.

  3. Can we run the corresponding iperf3 commands periodically in our monitoring? I guess just some seconds every hour should provide enough data, and we can smooth it in Grafana

There is this open request with a simple exec example: https://github.com/influxdata/telegraf/issues/3866#issuecomment-694429507 - this should work for our use case. We just need to make sure not to run all requests to all workers at the same time, because that would quite easily saturate the whole OSD link.
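
A minimal sketch of such an exec script (assumptions: iperf3 and jq are available on the worker, an iperf3 server is running on OSD, and the measurement name is made up here):

#!/bin/bash
# Measure the download bitrate from OSD and print it in InfluxDB line
# protocol so that a telegraf "exec" input can collect it periodically.
set -euo pipefail
host=openqa.suse.de
bps=$(iperf3 -R -J -t 5 -c "$host" | jq '.end.sum_received.bits_per_second')
echo "iperf3_download,server=$host bits_per_second=$bps"

telegraf would then call this script from an [[inputs.exec]] section, staggered per worker to avoid saturating the OSD link.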

Actions #16

Updated by mkittler about 3 years ago

Unfortunately Infra doesn't have access to the server either. Maybe they can at least tell us who has.

I've rebooted, but it didn't help. I've also booted into the snapshot from Wed 24 Nov 2021 10:49:35 CET, but that didn't help either. The rates are a tiny bit higher now, but that's likely just because all downloads on all hosts have been stopped by now. It is still just 15.8 Mbit/s.

Actions #17

Updated by okurz about 3 years ago

Actions #18

Updated by okurz about 3 years ago

From https://sd.suse.com/servicedesk/customer/portal/1/SD-67703

Gerhard Schlotter (3 hours ago): "how should we help with the issues? We neither have access to the switch nor to the affected servers. To the qanet switches, someone from the QA team [has access]. The uplink from our side is completely fine and can carry a lot more load."

Who can pick this up and access the switches to check, maybe reboot, unplug some cables, etc.?

Actions #19

Updated by mkittler about 3 years ago

  • Assignee changed from mkittler to nicksinger

Who can pick this up and access the switches to check, maybe reboot, unplug some cables, etc.?

Nick says he has access, so I'm assigning the ticket to him.


@nicksinger I can of course still do the rollback steps (mentioned in the ticket description) for you in the end or do some further testing to see whether something works better after some changes.

Actions #20

Updated by mkittler about 3 years ago

@nicksinger has restarted the switch but the networking speed is still slow. (Even though all workers are now back online I'd expect more than 4 Mbit/s download rate via iperf3 from OSD.)

Actions #21

Updated by okurz about 3 years ago

So what's the plan?

Actions #22

Updated by mkittler about 3 years ago

I don't know. Maybe it is possible to plug the machines into another switch or try with a spare switch?

Actions #23

Updated by nicksinger about 3 years ago

I asked Wei Gao if I can get access to migration-smt2.qa.suse.de to run an iperf crosscheck with another machine in the same rack.

Actions #24

Updated by okurz about 3 years ago

  • Due date set to 2021-12-29
  • Status changed from Feedback to In Progress
  • Priority changed from Urgent to High

Current state

We still have some worker instances running, albeit with reduced performance, so the urgency of the ticket is addressed and we can reduce the priority to "High".

Observations

  • http://mrtg.suse.de/qanet13nue/10.162.0.73_gi1.html shows that there is significant traffic on that port since 2021-W47, i.e. 2021-11-22, the start of the problems and near-zero going back to 2020-11. same for gi2, gi5, gi7, gi8, gi11, gi12, gi13, gi23, gi24
  • the corresponding counterpart to qanet13 is visible on http://mrtg.suse.de/Nx5696Q-Core2/192.168.0.121_369098892.html (qanet13 connection on core2) and http://mrtg.suse.de/Nx5696Q-Core1/192.168.0.120_526649856.html (qanet13 connection on core1), but neither seems to show a significant traffic increase since 2021-11-22. So where is the traffic coming from? Is the switch qanet13 sending out broadcasts itself?
  • qanet13nue's uplink seems to be gi27+gi28 (found with show interfaces Port-Channel 1). http://mrtg.suse.de/qanet13nue/10.162.0.73_po1.html is the aggregated view and shows nothing significant. But we can see that in the past we had spikes to 320 MBit/s "in" and 240 MBit/s "out" and no such spikes since 2021-W47 - limited to 100 MBit/s? The yearly average looks sane, nothing special: on average 46 MBit/s "in" and 16 MBit/s "out".
  • We identified that the hosts called S812LC and S822LC on http://mrtg.suse.de/qanet13nue/index.html are, according to https://racktables.suse.de/index.php?page=object&object_id=992, our power hosts qa-power8-4 (S812LC) and qa-power8-5 (S822LC) plus the respective "service processors" S812LC-SP and S822LC-SP. gi6 is powerqaworker-qam-1 (according to the iperf experiment from the hyperv host).
  • On http://mrtg.suse.de/qanet13nue/index.html we can see that many hosts receive significant traffic since 2021-11-22 but show no change in sent traffic. The only port that shows significant corresponding incoming traffic is the uplink. So our conclusion is that unintended broadcast traffic received by the rack switch is forwarded to all hosts, and the Power machines in particular seem to be badly affected by this (either traffic on the SPs, the host itself, or both), so that sending still works at high bandwidth while receiving only achieves a very low bandwidth
  • booted powerqaworker-qam-1 with kernel 5.3.18-lp152.102-default from 2021-11-11 from /boot, i.e. from before the start of the problem on 2021-11-22, and ran iperf3 -t 1200 -R -c openqaworker12, yielding 5.9 MBit/s, i.e. the same on this older kernel => kernel regression unlikely

Suggestions

  • WAITING Ask users of other machines in the same rack if they have network problems, e.g. migration-smt2.qa.suse.de; ask the migration team -> nsinger asked, waiting for a response
  • DONE Conduct a network performance test between two hosts within the same rack: nsinger conducted this test between qa-power8-4 (server) and powerqaworker-qam-1 (client) and measured 3.13 MBit/s, orders of magnitude too low for 1 GBit/s; same for qa-power8-5 (client, both directions). Crosscheck between two other hosts in another rack: we did this for openqaworker10+13 and got 945 MBit/s, i.e. near 1 GBit/s as expected when accounting for overhead.
  • DONE Try to saturate the switch bandwidth using iperf3 until we can actually see the result on http://mrtg.suse.de/qanet13nue/index.html -> we could see the results using openqaw9-hyperv, which we verified to be connected to gi1: http://mrtg.suse.de/qanet13nue/10.162.0.73_gi1.html
  • DONE Logged in over RDP to openqaw9-hyperv.qa.suse.de and downloaded https://iperf.fr/iperf-download.php for Windows
    • DONE Executed tests against qa-power8-4-kvm resulting in 1.3 MBit/s; openqaworker10->openqaw9-hyperv.qa.suse.de => 204 MBit/s, openqaw9-hyperv.qa.suse.de->openqaworker10 => 248 MBit/s, so the system is fine and the switch is not generally broken
    • DONE Started iperf3 -s on openqaworker12 and, on openqaw9-hyperv, iperf3.exe -t 1200 -c openqaworker12 at 11:09:00Z, trying to see the bandwidth on http://mrtg.suse.de/qanet20nue/index.html . Stopped as expected at 11:29:00Z. Reported bandwidth 77 MBit/s in both directions. MAC address 00:15:17:B1:03:88 or 00:15:17:B1:03:89. nsinger has confirmed that he sees this address on qanet13nue:gi1.
  • DONE Now starting iperf3 -t 1200 -c powerqaworker-qam-1 -> 1.02 MBit/s. The reverse iperf3 -t 1200 -R -c powerqaworker-qam-1 shows a bandwidth of 692 MBit/s (!) => only downloads to the machine are affected
  • DONE Examine the traffic, e.g. with wireshark on any host in the rack, and see if we can identify the traffic and forward that information to the according users or Eng Infra -> nothing found by nsinger so far
  • Try to connect the affected machines to another switch, e.g. in a neighboring rack, and execute iperf3 runs. nicksinger will coordinate with gschlotter from Eng Infra to do that
  • REJECTED Check for log output on power8-4 on why the link is only 100 MBit/s and coordinate with Eng Infra to replace the cable on the port connected to power8-4 and/or connect to another port on the same switch -> mkittler confirmed that Linux reports the link as 1 GBit/s, so this is a false report. Maybe some BMC is connected on that port.
  • Ask Eng Infra to give more members or the complete QE Tools team ssh access to the switch, at least read-only access for monitoring. If Eng Infra does not know how to do that, maybe nsinger can set it up himself directly
  • Disable individual ports on the switch to check if that improves the situation for the power workers (see the CLI sketch below) -> likely won't help, as we assume the problem comes from outside the switch over the uplink
  • Conduct a network performance benchmark on the affected power hosts in a stripped-down environment with no other significant traffic. Also, we cannot reach the host powerqaworker-qam-1 via iperf or any other port from the other hosts.
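
Disabling and re-enabling a port could look like this on the switch CLI shown above (a sketch assuming a Cisco-style CLI; gi7 is just an example):

qanet13nue#configure
qanet13nue(config)#interface gi7
qanet13nue(config-if)#shutdown
qanet13nue(config-if)#no shutdown
qanet13nue(config-if)#end
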
Actions #25

Updated by okurz about 3 years ago

On powerqaworker-qam-1 I stopped many services and also unmounted NFS. I ran tcpdump -i eth4. Traffic I found (example block):

14:05:32.357093 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29685225:29688121, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660203], length 2896
14:05:32.357238 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [P.], seq 29688121:29689569, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660203], length 1448
14:05:32.357239 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29689569:29691017, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660203], length 1448
14:05:32.357385 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29691017:29693913, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357533 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29693913:29696809, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357677 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29696809:29699705, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357825 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29699705:29702601, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357968 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29702601:29705497, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.358107 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29705497:29706945, ack 751, win 505, options [nop,nop,TS val 3180811359 ecr 2119660204], length 1448
14:05:32.369753 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125080480:125081928, ack 28937, win 529, options [nop,nop,TS val 1725941080 ecr 1084980048], length 1448
14:05:32.369810 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125081928, win 3896, options [nop,nop,TS val 1084981522 ecr 1725941080], length 0
14:05:32.369945 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125089168:125092064, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981522], length 2896
14:05:32.369995 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125081928, win 3896, options [nop,nop,TS val 1084981522 ecr 1725941080,nop,nop,sack 1 {125089168:125092064}], length 0
14:05:32.370107 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125081928:125084824, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981522], length 2896
14:05:32.370148 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125084824, win 3874, options [nop,nop,TS val 1084981523 ecr 1725941081,nop,nop,sack 1 {125089168:125092064}], length 0
14:05:32.370296 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125084824:125089168, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981523], length 4344
14:05:32.370297 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125092064:125093512, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981523], length 1448
14:05:32.370345 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125092064, win 3862, options [nop,nop,TS val 1084981523 ecr 1725941081], length 0
14:05:32.370345 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125093512, win 3853, options [nop,nop,TS val 1084981523 ecr 1725941081], length 0
14:05:32.370440 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125093512:125094960, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981523], length 1448
14:05:32.370480 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125094960, win 3896, options [nop,nop,TS val 1084981523 ecr 1725941081], length 0
14:05:32.377555 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29708393:29709841, ack 751, win 505, options [nop,nop,TS val 3180811378 ecr 2119660204], length 1448
14:05:32.377757 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29709841:29711289, ack 751, win 505, options [nop,nop,TS val 3180811378 ecr 2119660224], length 1448

I asked in #help-it-ama who 149.44.176.6 is. drodgriguez answered and stated that it's https://api.suse.de with the racktables entry https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=6198&hl_ip=149.44.176.6 . I see quite some https traffic from that host to QA-Power8-5-kvm.qa.suse.de, I guess it's AMQP. The above trace shows traffic to and from QA-Power8-4-kvm and QA-Power8-5-kvm, so why do I see it at all on powerqaworker-qam-1?
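
One way to confirm that foreign unicast traffic is flooded to this port would be a capture that excludes the host's own traffic (a sketch; 10.162.7.211 is powerqaworker-qam-1's own IPv4 address as seen in the iperf3 output above, and the filter ignores IPv6 for brevity):

powerqaworker-qam-1 # tcpdump -ni eth4 'ip and not host 10.162.7.211 and not broadcast and not multicast'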

Trying an older kernel on powerqaworker-qam-1.qa:

sudo kexec --exec --load /boot/vmlinux-5.3.18-lp152.102-default --initrd=/boot/initrd-5.3.18-lp152.102-default --command-line="$(cat /proc/cmdline)"

Same results there, so no impact of the kernel. I asked in the SUSE-IT ticket.

Actions #26

Updated by okurz about 3 years ago

mdoucha reported in https://suse.slack.com/archives/C02CANHLANP/p1639646867388200 that PPC64LE jobs are again failing by exceeding MAX_SETUP_TIME and that many instances are online again. I did:

powerqaworker-qam-1 # systemctl mask --now openqa-worker-auto-restart@{3..6}
QA-Power8-4-kvm # systemctl mask --now openqa-worker-auto-restart@{4..8}
QA-Power8-5-kvm # systemctl mask --now openqa-worker-auto-restart@{4..8}

I called

for i in powerqaworker-qam-1 QA-Power8-4-kvm QA-Power8-5-kvm ;do host=openqa.suse.de WORKER=$i failed_since=2021-12-01 result="result='timeout_exceeded'" bash -ex openqa-advanced-retrigger-jobs; done

but found no jobs that were not already automatically restarted. I also called

for i in powerqaworker-qam-1 QA-Power8-4-kvm QA-Power8-5-kvm ;do host=openqa.suse.de WORKER=$i failed_since=2021-12-01 result="result='incomplete'" bash -ex openqa-advanced-retrigger-jobs; done

which, as it looks, also did not effectively restart any jobs, as they are all missing a necessary asset.

EDIT: We observed failed systemd services because the corresponding "openqa-reload-worker-auto-restart" services now fail while the "openqa-worker-auto-restart" services are masked. So we also need to mask those, which I did now:

powerqaworker-qam-1 # systemctl mask --now openqa-reload-worker-auto-restart@{3..6} ; systemctl reset-failed
QA-Power8-4-kvm # systemctl mask --now openqa-reload-worker-auto-restart@{4..8} ; systemctl reset-failed
QA-Power8-5-kvm # systemctl mask --now openqa-reload-worker-auto-restart@{4..8} ; systemctl reset-failed
Actions #27

Updated by livdywan about 3 years ago

okurz wrote:

  • WAITING Ask users of other machines in the same rack if they have network problems, e.g. migration-smt2.qa.suse.de; ask the migration team -> nsinger asked, waiting for a response

Did we find out if migration-smt2.qa.suse.de is affected?

  • Ask Eng Infra to give more members or the complete QE Tools team ssh access to the switch, at least read-only access for monitoring. If Eng Infra does not know how to do that, maybe nsinger can set it up himself directly
  • Disable individual ports on the switch to check if that improves the situation for the power workers -> likely won't help, as we assume the problem comes from outside the switch over the uplink
  • Conduct a network performance benchmark on the affected power hosts in a stripped-down environment with no other significant traffic. Also, we cannot reach the host powerqaworker-qam-1 via iperf or any other port from the other hosts.

Are we still waiting to get access to the switch?

Actions #28

Updated by okurz about 3 years ago

cdywan wrote:

Are we still waiting to get access to the switch?

Well, I am still waiting for access to the switch; nicksinger has access.

Actions #29

Updated by nicksinger about 3 years ago

okurz wrote:

cdywan wrote:

Are we still waiting to get access to the switch?

Well, I am still waiting for access to the switch; nicksinger has access.

Could you please try if ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@qanet13nue.qa.suse.de lets you in?

Actions #30

Updated by okurz about 3 years ago

nicksinger wrote:

Could you please try if ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@qanet13nue.qa.suse.de lets you in?

Well, ssh "lets me in", but then I am asked for "User Name:", so I guess the answer is "yes" up to this point

Actions #31

Updated by szarate about 3 years ago

  • Related to action #104106: [qe-core] test fails in await_install - Network performance for ppc installations is decreasing size:S added
Actions #32

Updated by nicksinger about 3 years ago

okurz wrote:

nicksinger wrote:

Could you please try if ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@qanet13nue.qa.suse.de lets you in?

Well, ssh "lets me in", but then I am asked for "User Name:", so I guess the answer is "yes" up to this point

Ok, so apparently it didn't work as I expected. Unfortunately the IOS version on the switch is quite old and I can only find guides for more modern versions. I will send you the password in Slack so you can at least log in manually.

Actions #33

Updated by nicksinger about 3 years ago

I talked to gschlotter regarding https://sd.suse.com/servicedesk/customer/portal/1/SD-67703 - copying the current plan (for everybody who cannot access that ticket):

I had some brainstorming with Nick.
On Monday I will be in the server room and will connect one of these servers with a new cable to a different switch.
Nick will test if this solves the situation; if yes, he will be in the office with Matthias on Tuesday and recable these servers.
Actions #34

Updated by livdywan about 3 years ago

What happened since the last episode

  • Nick took over from Marius
  • Oli compiled an extensive report of ideas and investigation attempts
  • Nothing seemingly happened for two weeks
  • Stakeholders are seeing problems again
  • We still don't know if maybe it's just a kink in the ethernet cable

Ideas for improvement

  • We could have implemented workarounds sooner
    • Consider getting access to another machine as a temporary replacement
  • Was the infra ticket updated / visible?
    • Comments should have been added to clarify changes
  • Due date set to 2021-12-29
    • We should have kept up with updates?
Actions #35

Updated by okurz about 3 years ago

  • Description updated (diff)
Actions #36

Updated by okurz about 3 years ago

  • Description updated (diff)
Actions #37

Updated by nicksinger almost 3 years ago

Gerhard replugged qa-power8-4 into qanet10 port 8. I ran iperf3 but saw no improvement:

QA-Power8-4-kvm:~ # iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 2620:113:80c0:80a0:10:162:29:60f port 36854 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  60.6 MBytes   509 Mbits/sec    7    252 KBytes
[  5]   1.00-2.00   sec  46.8 MBytes   392 Mbits/sec    3    212 KBytes
[  5]   2.00-3.00   sec  45.5 MBytes   381 Mbits/sec    2    177 KBytes
[  5]   3.00-4.00   sec  35.7 MBytes   300 Mbits/sec    2    187 KBytes
[  5]   4.00-5.00   sec  51.8 MBytes   435 Mbits/sec    4    286 KBytes
[  5]   5.00-6.00   sec  50.5 MBytes   424 Mbits/sec    3    212 KBytes
[  5]   6.00-7.00   sec  60.4 MBytes   506 Mbits/sec    1    308 KBytes
[  5]   7.00-8.00   sec  44.4 MBytes   372 Mbits/sec    5    180 KBytes
[  5]   8.00-9.00   sec  52.8 MBytes   443 Mbits/sec    1    271 KBytes
[  5]   9.00-10.00  sec  44.4 MBytes   372 Mbits/sec    0    351 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   493 MBytes   413 Mbits/sec   28             sender
[  5]   0.00-10.00  sec   490 MBytes   411 Mbits/sec                  receiver

iperf Done.
QA-Power8-4-kvm:~ # iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:29:60f port 36880 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   344 KBytes  2.82 Mbits/sec
[  5]   1.00-2.00   sec   413 KBytes  3.38 Mbits/sec
[  5]   2.00-3.00   sec   520 KBytes  4.26 Mbits/sec
[  5]   3.00-4.00   sec   370 KBytes  3.03 Mbits/sec
[  5]   4.00-5.00   sec   342 KBytes  2.80 Mbits/sec
[  5]   5.00-6.00   sec   301 KBytes  2.47 Mbits/sec
[  5]   6.00-7.00   sec   322 KBytes  2.64 Mbits/sec
[  5]   7.00-8.00   sec   248 KBytes  2.03 Mbits/sec
[  5]   8.00-9.00   sec   522 KBytes  4.27 Mbits/sec
[  5]   9.00-10.00  sec   457 KBytes  3.75 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  3.84 MBytes  3.22 Mbits/sec  816             sender
[  5]   0.00-10.00  sec  3.75 MBytes  3.14 Mbits/sec                  receiver

iperf Done.
QA-Power8-4-kvm:~ # iperf3 -4 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 10.162.6.201 port 60458 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   406 KBytes  3.32 Mbits/sec
[  5]   1.00-2.00   sec   276 KBytes  2.26 Mbits/sec
[  5]   2.00-3.00   sec   421 KBytes  3.45 Mbits/sec
[  5]   3.00-4.00   sec   568 KBytes  4.66 Mbits/sec
[  5]   4.00-5.00   sec   462 KBytes  3.79 Mbits/sec
[  5]   5.00-6.00   sec   352 KBytes  2.88 Mbits/sec
[  5]   6.00-7.00   sec   588 KBytes  4.82 Mbits/sec
[  5]   7.00-8.00   sec   373 KBytes  3.06 Mbits/sec
[  5]   8.00-9.00   sec   454 KBytes  3.72 Mbits/sec
[  5]   9.00-10.00  sec   423 KBytes  3.46 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.33 MBytes  3.63 Mbits/sec  880             sender
[  5]   0.00-10.00  sec  4.22 MBytes  3.54 Mbits/sec                  receiver

iperf Done.
QA-Power8-4-kvm:~ # iperf3 -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 10.162.6.201 port 60496 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  47.9 MBytes   402 Mbits/sec   28    174 KBytes
[  5]   1.00-2.00   sec  60.8 MBytes   510 Mbits/sec    5    198 KBytes
[  5]   2.00-3.00   sec  66.9 MBytes   561 Mbits/sec    3    167 KBytes
[  5]   3.00-4.00   sec  56.8 MBytes   476 Mbits/sec    4    130 KBytes
[  5]   4.00-5.00   sec  45.0 MBytes   378 Mbits/sec    2    161 KBytes
[  5]   5.00-6.00   sec  42.8 MBytes   359 Mbits/sec    2    187 KBytes
[  5]   6.00-7.00   sec  76.0 MBytes   638 Mbits/sec    2    182 KBytes
[  5]   7.00-8.00   sec  65.0 MBytes   545 Mbits/sec    4    150 KBytes
[  5]   8.00-9.00   sec  39.7 MBytes   333 Mbits/sec   50    315 KBytes
[  5]   9.00-10.00  sec  40.5 MBytes   339 Mbits/sec    7    117 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   541 MBytes   454 Mbits/sec  107             sender
[  5]   0.00-10.00  sec   539 MBytes   452 Mbits/sec                  receiver

iperf Done.
Actions #38

Updated by nicksinger almost 3 years ago

Gerhard also replugged another port of that machine. Apparently this brought some improvement, but it is still far too slow:

nsinger@QA-Power8-4-kvm:~> iperf3 -6 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 2620:113:80c0:80a0:10:162:29:60f port 48730 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  91.4 MBytes   767 Mbits/sec  127    290 KBytes
[  5]   1.00-2.00   sec  87.6 MBytes   735 Mbits/sec    5    343 KBytes
[  5]   2.00-3.00   sec  85.9 MBytes   720 Mbits/sec   30    319 KBytes
[  5]   3.00-4.00   sec  95.6 MBytes   802 Mbits/sec    5    278 KBytes
[  5]   4.00-5.00   sec  91.3 MBytes   765 Mbits/sec    8    424 KBytes
[  5]   5.00-6.00   sec  81.9 MBytes   687 Mbits/sec   32    282 KBytes
[  5]   6.00-7.00   sec  71.5 MBytes   599 Mbits/sec   34    351 KBytes
[  5]   7.00-8.00   sec  87.2 MBytes   732 Mbits/sec    0    449 KBytes
[  5]   8.00-9.00   sec  88.5 MBytes   742 Mbits/sec    5    300 KBytes
[  5]   9.00-10.00  sec  57.4 MBytes   482 Mbits/sec   14    332 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   838 MBytes   703 Mbits/sec  260             sender
[  5]   0.00-10.00  sec   835 MBytes   700 Mbits/sec                  receiver

iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 10.162.6.201 port 44070 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  84.2 MBytes   707 Mbits/sec   71    413 KBytes
[  5]   1.00-2.00   sec  96.9 MBytes   813 Mbits/sec   43    325 KBytes
[  5]   2.00-3.00   sec  92.8 MBytes   778 Mbits/sec    0    438 KBytes
[  5]   3.00-4.00   sec  72.2 MBytes   606 Mbits/sec    4    211 KBytes
[  5]   4.00-5.00   sec  60.0 MBytes   504 Mbits/sec    0    344 KBytes
[  5]   5.00-6.00   sec  87.3 MBytes   732 Mbits/sec   92    204 KBytes
[  5]   6.00-7.00   sec  58.7 MBytes   492 Mbits/sec   25    259 KBytes
[  5]   7.00-8.00   sec  77.2 MBytes   648 Mbits/sec   52    287 KBytes
[  5]   8.00-9.00   sec  71.3 MBytes   598 Mbits/sec    0    387 KBytes
[  5]   9.00-10.00  sec  76.3 MBytes   640 Mbits/sec    0    482 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   777 MBytes   652 Mbits/sec  287             sender
[  5]   0.00-10.00  sec   775 MBytes   650 Mbits/sec                  receiver

iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -R -6 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:29:60f port 48788 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.25 MBytes  10.5 Mbits/sec
[  5]   1.00-2.00   sec   761 KBytes  6.24 Mbits/sec
[  5]   2.00-3.00   sec   968 KBytes  7.93 Mbits/sec
[  5]   3.00-4.00   sec  1.45 MBytes  12.2 Mbits/sec
[  5]   4.00-5.00   sec   877 KBytes  7.19 Mbits/sec
[  5]   5.00-6.00   sec   170 KBytes  1.39 Mbits/sec
[  5]   6.00-7.00   sec   828 KBytes  6.79 Mbits/sec
[  5]   7.00-8.00   sec   841 KBytes  6.89 Mbits/sec
[  5]   8.00-9.00   sec  1.65 MBytes  13.8 Mbits/sec
[  5]   9.00-10.00  sec   965 KBytes  7.91 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  9.75 MBytes  8.18 Mbits/sec  1177             sender
[  5]   0.00-10.00  sec  9.63 MBytes  8.08 Mbits/sec                  receiver

iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -R -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 10.162.6.201 port 44118 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.39 MBytes  11.6 Mbits/sec
[  5]   1.00-2.00   sec  1.42 MBytes  11.9 Mbits/sec
[  5]   2.00-3.00   sec  1.04 MBytes  8.71 Mbits/sec
[  5]   3.00-4.00   sec  1.29 MBytes  10.8 Mbits/sec
[  5]   4.00-5.00   sec  1.27 MBytes  10.7 Mbits/sec
[  5]   5.00-6.00   sec  1.91 MBytes  16.1 Mbits/sec
[  5]   6.00-7.00   sec  1.16 MBytes  9.72 Mbits/sec
[  5]   7.00-8.00   sec   578 KBytes  4.74 Mbits/sec
[  5]   8.00-9.00   sec  1.41 MBytes  11.8 Mbits/sec
[  5]   9.00-10.00  sec  1.35 MBytes  11.3 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.9 MBytes  10.8 Mbits/sec  799             sender
[  5]   0.00-10.00  sec  12.8 MBytes  10.7 Mbits/sec                  receiver

iperf Done.

I will be in the office today to test whether a direct connection with my notebook achieves better speeds. Hopefully I can pull this experiment off in a way that excludes problems with any hardware (e.g. switch, router) in between.

Actions #39

Updated by okurz almost 3 years ago

Please keep the observation from #102882#note-24 in mind regarding the high increase in traffic we saw. I don't think it helps at this point to simply plug the machines in elsewhere without making sure that this traffic goes away, e.g. by unplugging other stuff, the uplink, etc.

Actions #40

Updated by okurz almost 3 years ago

  • Due date changed from 2021-12-29 to 2022-01-28
Actions #41

Updated by nicksinger almost 3 years ago

So here are my results for several switch ports I tested in SRV2 (and the qalab) with my notebook:

Back-to-back with power8-4:

nsinger@QA-Power8-4-kvm:~> iperf3 -c 192.168.0.1
Connecting to host 192.168.0.1, port 5201
[  5] local 192.168.0.106 port 48382 connected to 192.168.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   958 Mbits/sec    0    379 KBytes
[  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec    0    379 KBytes
[  5]   2.00-3.00   sec   112 MBytes   937 Mbits/sec    0    379 KBytes
[  5]   3.00-4.00   sec   113 MBytes   946 Mbits/sec    0    379 KBytes
[  5]   4.00-5.00   sec   112 MBytes   939 Mbits/sec    0    399 KBytes
[  5]   5.00-6.00   sec   112 MBytes   942 Mbits/sec    0    399 KBytes
[  5]   6.00-7.00   sec   112 MBytes   943 Mbits/sec    0    399 KBytes
[  5]   7.00-8.00   sec   112 MBytes   943 Mbits/sec    0    399 KBytes
[  5]   8.00-9.00   sec   112 MBytes   942 Mbits/sec    0    399 KBytes
[  5]   9.00-10.00  sec   112 MBytes   937 Mbits/sec    0    399 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver

iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -R -c 192.168.0.1
Connecting to host 192.168.0.1, port 5201
Reverse mode, remote host 192.168.0.1 is sending
[  5] local 192.168.0.106 port 48386 connected to 192.168.0.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   111 MBytes   934 Mbits/sec
[  5]   1.00-2.00   sec   111 MBytes   934 Mbits/sec
[  5]   2.00-3.00   sec   111 MBytes   934 Mbits/sec
[  5]   3.00-4.00   sec   111 MBytes   934 Mbits/sec
[  5]   4.00-5.00   sec   111 MBytes   934 Mbits/sec
[  5]   5.00-6.00   sec   111 MBytes   934 Mbits/sec
[  5]   6.00-7.00   sec   111 MBytes   934 Mbits/sec
[  5]   7.00-8.00   sec   111 MBytes   934 Mbits/sec
[  5]   8.00-9.00   sec   111 MBytes   934 Mbits/sec
[  5]   9.00-10.00  sec   111 MBytes   934 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   936 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver

iperf from notebook connected to qanet10nue (srv2, located next to the rack of power8-4):

selenium ~ » iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 2620:113:80c0:80a0:10:162:2d:4d65 port 38898 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   110 MBytes   925 Mbits/sec    0   1017 KBytes
[  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec    0   1.27 MBytes
[  5]   2.00-3.00   sec   109 MBytes   912 Mbits/sec    0   1.55 MBytes
[  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0   2.24 MBytes
[  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0   2.36 MBytes
[  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec    0   2.47 MBytes
[  5]   6.00-7.00   sec   109 MBytes   912 Mbits/sec    0   2.61 MBytes
[  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0   2.61 MBytes
[  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0   2.74 MBytes
[  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec    0   2.74 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.07 GBytes   921 Mbits/sec    0             sender
[  5]   0.00-10.01  sec  1.07 GBytes   918 Mbits/sec                  receiver

iperf Done.
selenium ~ » iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:2d:4d65 port 38902 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   745 KBytes  6.10 Mbits/sec
[  5]   1.00-2.00   sec   998 KBytes  8.18 Mbits/sec
[  5]   2.00-3.00   sec  1.10 MBytes  9.27 Mbits/sec
[  5]   3.00-4.00   sec  1.11 MBytes  9.31 Mbits/sec
[  5]   4.00-5.00   sec  1.10 MBytes  9.22 Mbits/sec
[  5]   5.00-6.00   sec  1.09 MBytes  9.15 Mbits/sec
[  5]   6.00-7.00   sec  1.10 MBytes  9.24 Mbits/sec
[  5]   7.00-8.00   sec  1.10 MBytes  9.27 Mbits/sec
[  5]   8.00-9.00   sec  1.10 MBytes  9.22 Mbits/sec
[  5]   9.00-10.00  sec   679 KBytes  5.56 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.2 MBytes  8.53 Mbits/sec  981             sender
[  5]   0.00-10.00  sec  10.1 MBytes  8.45 Mbits/sec                  receiver

iperf Done.

iperf from notebook connected to qanet13nue (srv2, where power8-4 is originally connected to):

selenium ~ » iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 2620:113:80c0:80a0:10:162:2c:985 port 45236 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   103 MBytes   863 Mbits/sec    0   1.41 MBytes
[  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec    0   1.80 MBytes
[  5]   2.00-3.00   sec   108 MBytes   902 Mbits/sec   28   1.40 MBytes
[  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0   1.52 MBytes
[  5]   4.00-5.00   sec  98.8 MBytes   828 Mbits/sec  490   82.3 KBytes
[  5]   5.00-6.00   sec  85.0 MBytes   713 Mbits/sec    0    356 KBytes
[  5]   6.00-7.00   sec  93.8 MBytes   786 Mbits/sec  208    222 KBytes
[  5]   7.00-8.00   sec  93.8 MBytes   786 Mbits/sec    0    416 KBytes
[  5]   8.00-9.00   sec  95.0 MBytes   797 Mbits/sec    0    469 KBytes
[  5]   9.00-10.00  sec   101 MBytes   849 Mbits/sec    0    494 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   998 MBytes   837 Mbits/sec  726             sender
[  5]   0.00-10.01  sec   995 MBytes   834 Mbits/sec                  receiver

iperf Done.
selenium ~ » iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:2c:985 port 45240 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.15 MBytes  9.64 Mbits/sec
[  5]   1.00-2.00   sec  2.12 MBytes  17.8 Mbits/sec
[  5]   2.00-3.00   sec  1.18 MBytes  9.93 Mbits/sec
[  5]   3.00-4.00   sec  1.32 MBytes  11.0 Mbits/sec
[  5]   4.00-5.00   sec  1.17 MBytes  9.82 Mbits/sec
[  5]   5.00-6.00   sec   890 KBytes  7.29 Mbits/sec
[  5]   6.00-7.00   sec   636 KBytes  5.21 Mbits/sec
[  5]   7.00-8.00   sec  1.43 MBytes  12.0 Mbits/sec
[  5]   8.00-9.00   sec   945 KBytes  7.75 Mbits/sec
[  5]   9.00-10.00  sec  1.04 MBytes  8.69 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.1 MBytes  10.2 Mbits/sec  1094             sender
[  5]   0.00-10.00  sec  11.8 MBytes  9.91 Mbits/sec                  receiver

iperf Done.

iperf from notebook connected to qanet15nue (srv2, another switch close to power8-4):

selenium ~ » iperf3 -R -c openqa.suse.de                                                                         130 ↵
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:2e:3a8a port 53490 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  6.35 MBytes  53.3 Mbits/sec
[  5]   1.00-2.00   sec  6.58 MBytes  55.2 Mbits/sec
[  5]   2.00-3.00   sec  7.65 MBytes  64.2 Mbits/sec
[  5]   3.00-4.00   sec  5.88 MBytes  49.3 Mbits/sec
[  5]   4.00-5.00   sec  6.19 MBytes  51.9 Mbits/sec
[  5]   5.00-6.00   sec  7.65 MBytes  64.2 Mbits/sec
[  5]   6.00-7.00   sec  5.79 MBytes  48.5 Mbits/sec
[  5]   7.00-8.00   sec  8.21 MBytes  68.9 Mbits/sec
[  5]   8.00-9.00   sec  7.08 MBytes  59.4 Mbits/sec
[  5]   9.00-10.00  sec  6.19 MBytes  51.9 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.01  sec  67.8 MBytes  56.8 Mbits/sec  7751             sender
[  5]   0.00-10.00  sec  67.6 MBytes  56.7 Mbits/sec                  receiver

iperf Done.
selenium ~ » iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 2620:113:80c0:80a0:10:162:2e:3a8a port 53494 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  93.6 MBytes   785 Mbits/sec    6   1.15 MBytes
[  5]   1.00-2.00   sec  92.5 MBytes   776 Mbits/sec    0   1.26 MBytes
[  5]   2.00-3.00   sec   105 MBytes   881 Mbits/sec    0   1.36 MBytes
[  5]   3.00-4.00   sec   109 MBytes   912 Mbits/sec    0   1.41 MBytes
[  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0   1.49 MBytes
[  5]   5.00-6.00   sec   108 MBytes   902 Mbits/sec    0   1.53 MBytes
[  5]   6.00-7.00   sec   102 MBytes   860 Mbits/sec    0   1.55 MBytes
[  5]   7.00-8.00   sec  86.2 MBytes   723 Mbits/sec   89   1.11 MBytes
[  5]   8.00-9.00   sec   104 MBytes   870 Mbits/sec    0   1.19 MBytes
[  5]   9.00-10.00  sec   101 MBytes   849 Mbits/sec    0   1.24 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1007 MBytes   845 Mbits/sec   95             sender
[  5]   0.00-10.02  sec  1004 MBytes   841 Mbits/sec                  receiver

iperf Done.
selenium ~ » iperf3 -R -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 10.162.29.76 port 51332 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  6.51 MBytes  54.6 Mbits/sec
[  5]   1.00-2.00   sec  7.45 MBytes  62.5 Mbits/sec
[  5]   2.00-3.00   sec  6.35 MBytes  53.2 Mbits/sec
[  5]   3.00-4.00   sec  6.49 MBytes  54.4 Mbits/sec
[  5]   4.00-5.00   sec  5.94 MBytes  49.8 Mbits/sec
[  5]   5.00-6.00   sec  7.34 MBytes  61.6 Mbits/sec
[  5]   6.00-7.00   sec  5.11 MBytes  42.9 Mbits/sec
[  5]   7.00-8.00   sec  6.18 MBytes  51.8 Mbits/sec
[  5]   8.00-9.00   sec  6.36 MBytes  53.3 Mbits/sec
[  5]   9.00-10.00  sec  6.12 MBytes  51.3 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  64.1 MBytes  53.8 Mbits/sec  7306             sender
[  5]   0.00-10.00  sec  63.8 MBytes  53.6 Mbits/sec                  receiver

iperf Done.

iperf from notebook connected to qanet03nue (switch in the big qalab):

selenium ~ » iperf3 -R -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 10.162.29.76 port 51336 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  91.2 MBytes   765 Mbits/sec
[  5]   1.00-2.00   sec  98.7 MBytes   828 Mbits/sec
[  5]   2.00-3.00   sec  95.8 MBytes   804 Mbits/sec
[  5]   3.00-4.00   sec  93.5 MBytes   785 Mbits/sec
[  5]   4.00-5.00   sec  98.4 MBytes   826 Mbits/sec
[  5]   5.00-6.00   sec  97.7 MBytes   820 Mbits/sec
[  5]   6.00-7.00   sec   105 MBytes   879 Mbits/sec
[  5]   7.00-8.00   sec  97.1 MBytes   815 Mbits/sec
[  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec
[  5]   9.00-10.00  sec   101 MBytes   850 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   988 MBytes   829 Mbits/sec  1129             sender
[  5]   0.00-10.00  sec   985 MBytes   826 Mbits/sec                  receiver

iperf Done.
selenium ~ » iperf3 -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 10.162.29.76 port 51340 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  80.5 MBytes   675 Mbits/sec    7    255 KBytes
[  5]   1.00-2.00   sec  81.2 MBytes   682 Mbits/sec    0    421 KBytes
[  5]   2.00-3.00   sec  85.0 MBytes   713 Mbits/sec    0    503 KBytes
[  5]   3.00-4.00   sec  75.0 MBytes   629 Mbits/sec    5    296 KBytes
[  5]   4.00-5.00   sec  71.2 MBytes   598 Mbits/sec    0    426 KBytes
[  5]   5.00-6.00   sec  67.5 MBytes   566 Mbits/sec    6    191 KBytes
[  5]   6.00-7.00   sec  50.0 MBytes   419 Mbits/sec    0    331 KBytes
[  5]   7.00-8.00   sec  66.2 MBytes   556 Mbits/sec    0    441 KBytes
[  5]   8.00-9.00   sec  65.0 MBytes   545 Mbits/sec    0    519 KBytes
[  5]   9.00-10.00  sec  62.5 MBytes   524 Mbits/sec    0    581 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   704 MBytes   591 Mbits/sec   18             sender
[  5]   0.00-10.00  sec   702 MBytes   589 Mbits/sec                  receiver

iperf Done.

With all these tests I can conclude:

  1. The machine itself is not misconfigured and is perfectly able to deliver 1Gbit/s up and down
  2. Several switches in srv2 are affected by the performance loss
  3. The QA VLAN itself does not cause the performance loss (as a switch in the qalab - so a different location - is running fine)

I'd suggest that we try to map out how these switches are interconnected. I could imagine that several switches in srv2 are "daisy-chained" and one switch in that chain is misbehaving. I will try to come up with a graph showing how the switches are connected; see the sketch below for one way to collect the data. With a better overview we can start debugging by e.g. comparing configurations or replugging the uplinks of individual switches.
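
If the qanet switches have LLDP enabled and allow SNMP reads, the per-port neighbor names could be collected like this to draw that graph (a sketch; the community string and switch list are assumptions):

# lldpRemSysName (1.0.8802.1.1.2.1.4.1.1.9) from the standard LLDP-MIB
# lists the remote system name for every local port with a neighbor
for sw in qanet10nue qanet13nue qanet15nue; do
  echo "== $sw =="
  snmpwalk -v2c -c public "$sw" 1.0.8802.1.1.2.1.4.1.1.9
done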

Actions #42

Updated by okurz almost 3 years ago

You haven't mentioned the increase of "unexpected traffic". Do you see any relation between that and the measurements you conducted?

Actions #43

Updated by okurz almost 3 years ago

  • Status changed from In Progress to Feedback
  • Priority changed from High to Normal

gschlotter will check the daisy-chained core switches. Current hypothesis is that at least one is misbehaving and causing the problems. nsinger told us that gschlotter read the ticket so we assume he is aware about the "unexpected traffic". So we expect an update within the next days from gschlotter.

Actions #44

Updated by okurz almost 3 years ago

  • Priority changed from Normal to High

Given that we have recurring user reports like https://suse.slack.com/archives/C02CU8X53RC/p1643270602218800 we should still treat this as high prio

Actions #45

Updated by nicksinger almost 3 years ago

  • Assignee deleted (nicksinger)

I'd highly appreciate a helping hand here to perform new benchmarks now that the machine has been replugged directly into the core switch. I'm unassigning for now, but feel free to ask if you need something from me.

Actions #46

Updated by mkittler almost 3 years ago

  • Assignee set to mkittler

Ok, I can run the iperf tests again.

Actions #47

Updated by mkittler almost 3 years ago

  • Assignee deleted (mkittler)

Unfortunately it doesn't look better. I've tested on qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1. The results on all hosts looked like this:

martchus@QA-Power8-4-kvm:~> iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:29:60f port 40416 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.04 MBytes  8.73 Mbits/sec                  
[  5]   1.00-2.00   sec  1.32 MBytes  11.1 Mbits/sec                  
[  5]   2.00-3.00   sec  1.06 MBytes  8.91 Mbits/sec                  
[  5]   3.00-4.00   sec  1.42 MBytes  11.9 Mbits/sec                  
[  5]   4.00-5.00   sec   679 KBytes  5.56 Mbits/sec                  
[  5]   5.00-6.00   sec  1.28 MBytes  10.8 Mbits/sec                  
[  5]   6.00-7.00   sec  1.39 MBytes  11.7 Mbits/sec                  
[  5]   7.00-8.00   sec   937 KBytes  7.68 Mbits/sec                  
[  5]   8.00-9.00   sec  1.05 MBytes  8.82 Mbits/sec                  
[  5]   9.00-10.00  sec   950 KBytes  7.78 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  11.2 MBytes  9.38 Mbits/sec  784             sender
[  5]   0.00-10.00  sec  11.1 MBytes  9.29 Mbits/sec                  receiver

iperf Done.
Actions #48

Updated by okurz almost 3 years ago

  • Related to action #105804: Job age (scheduled) (median) alert size:S added
Actions #49

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Workable
  • Priority changed from High to Urgent

Please don't keep this ticket around without an assignee. It has been raised again during the weekly QE sync.

Actions #50

Updated by kraih almost 3 years ago

  • Assignee set to kraih
Actions #51

Updated by livdywan almost 3 years ago

  • What machine was re-plugged? We don't know and we've not seen any improvement. It looks to be qa-power8-5
  • At least one switch might run in hub mode? (or bridge mode?)
    • Likely both rack and core switch need to run in hub mode
    • Are any debug settings enabled?
    • Check the management console
  • Compare traffic of machines directly connected to the router
  • Re-conduct previous experiments on all affected power machines, at least one x86-64 machine on another rack and malbec incl. tcpdumps @mkittler
    • Note, do add the commands used, for easy reproduction
  • Ping https://sd.suse.com/servicedesk/customer/portal/1/SD-67703
Actions #52

Updated by livdywan almost 3 years ago

  • Due date changed from 2022-01-28 to 2022-02-04
Actions #53

Updated by mkittler almost 3 years ago

TLDR: It still looks as bad as at the beginning on all three power hosts, and running the same tests on malbec or x86_64 machines doesn't show those symptoms.


Here (again) the exact commands used for performance testing:

Start server on openqa.suse.de (e.g. in screen session):

martchus@openqa:~> iperf3 -s

Check on affected host (not possible via salt as they are currently not in salt):

ssh qa-power8-5-kvm.qa.suse.de -C 'iperf3 -R -c openqa.suse.de' # -R is important as it is only slow in one direction

Check other hosts (glob must only match one at a time as server can only handle one request at a time):

martchus@openqa:~> sudo salt 'malbec*' cmd.run "sudo -u '$USER' iperf3 -R -c openqa.suse.de"

Check tcpdump for unrelated output (like in #102882#note-25), e.g.:

martchus@malbec:~> sudo zypper in tcpdump
martchus@malbec:~> ip addr # to find relevant eth dev
martchus@malbec:~> sudo tcpdump -i eth4 # look for suspicious traffic for "wrong" hosts
martchus@malbec:~> sudo tcpdump -i eth4 | grep -v malbec.arch.suse.de # filter traffic not directly related to host (might not cover IPv6 address)
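
A pcap filter can also do the exclusion directly, which should cover the IPv6 address as well since tcpdump matches every address the name resolves to (a sketch with the same example host):

martchus@malbec:~> sudo tcpdump -i eth4 'not host malbec.arch.suse.de' # drops packets where the host is either endpoint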

My findings so far:

  • iperf3 still shows slow performance on these three power hosts (but not on malbec or other x86_64 hosts).
  • tcpdump still shows unrelated traffic on these three power hosts (but not on malbec or other x86_64 hosts).
    • I still see traffic for e.g. power5 on power1, e.g. 13:22:27.428007 IP 10.163.28.162.52464 > QA-Power8-5-kvm.qa.suse.de.ssh: Flags [.], ack 229662332, win 14792, options [nop,nop,TS val 2833931280 ecr 2859724898], length 0.
    • The same can be observed on power4, e.g. 13:26:46.001535 IP 10.160.1.93 > QA-Power8-5-kvm.qa.suse.de: GREv0, length 186: IP 10.0.2.15.hpoms-dps-lstn > 239.37.84.23.netsupport: UDP, length 13
    • The same can be observed on power5, e.g. 13:38:52.734608 IP openqa-monitor.qa.suse.de.d-s-n > QA-Power8-4-kvm.qa.suse.de.34204: Flags [.], ack 3618777680, win 2906, options [nop,nop,TS val 3944523984 ecr 1904290933], length 0
    • I didn't observe similar behavior on malbec or openqaworker10 where the host tcpdump is executed on always appears on at least one side (except for ARP, ICMP and multicast traffic).
Actions #54

Updated by nicksinger almost 3 years ago

I ran wireshark on qa-power8-4 & qa-power8-5 with the following filter: ((!ipv6) && tcp) && !(ip.dst == $IP_OF_THE_HOST) and could still observe a lot of traffic destined for other hosts. However, I see mainly "qa traffic" on power8-5, while power8-4 sees a lot of other traffic (e.g. 10.0.2.1), so I wonder if in reality power8-4 is the one connected to the core switch?
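
To get a rough ranking of where that stray traffic is headed, the capture could be piped through uniq (a sketch; the interface name is an example and 10.162.6.201 is power8-4's own IPv4 from earlier in this ticket):

# count the first 1000 TCP packets by destination field; any address
# other than the host's own showing up in volume confirms the flooding
sudo tcpdump -ni eth0 -c 1000 'tcp and not dst host 10.162.6.201' | awk '{print $5}' | sort | uniq -c | sort -rn | head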

Actions #55

Updated by okurz almost 3 years ago

@Sebastian Riedel @Marius Kittler @Nick Singer thank you for the quick but thorough and diligent investigation work in https://progress.opensuse.org/issues/102882#note-53 and https://progress.opensuse.org/issues/102882#note-54 , that's what I would call truly professional work. I checked my text in https://sd.suse.com/servicedesk/customer/portal/1/SD-67703 and I think it's still valid and current as is. Eng-Infra, or whoever has control over the network as a whole, needs to follow up. I don't see how we could do much more, poking through the keyhole with only limited access to the switches. @kraih I think the ticket could now be in "In Progress" or "Feedback" with active monitoring of any progress in https://sd.suse.com/servicedesk/customer/portal/1/SD-67703

Actions #56

Updated by mkittler almost 3 years ago

I cannot answer the question, but considering your findings I'd ask myself the same one. It could be a mixup. Since I've always been checking all three hosts anyway and haven't noticed any difference, I suppose it doesn't matter much at this point.

Actions #57

Updated by kraih almost 3 years ago

I'm looking at the rack switch now. First up, the current hardware connections again (some links are only 100 Mbit):

qanet13nue#show interfaces status
                                             Flow Link          Back   Mdix
Port     Type         Duplex  Speed Neg      ctrl State       Pressure Mode
-------- ------------ ------  ----- -------- ---- ----------- -------- -------
gi1      1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     
gi2      1G-Copper    Full    1000  Enabled  Off  Up          Disabled Off    
gi3      1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     
gi4      1G-Copper      --      --     --     --  Down           --     --    
gi5      1G-Copper    Full    1000  Enabled  Off  Up          Disabled Off    
gi6      1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     
gi7      1G-Copper    Full    100   Enabled  Off  Up          Disabled On     
gi8      1G-Copper    Full    1000  Enabled  Off  Up          Disabled Off    
gi9      1G-Copper      --      --     --     --  Down           --     --    
gi10     1G-Copper      --      --     --     --  Down           --     --    
gi11     1G-Copper    Full    100   Enabled  Off  Up          Disabled On     
gi12     1G-Copper    Full    100   Enabled  Off  Up          Disabled On     
gi13     1G-Copper    Full    100   Enabled  Off  Up          Disabled Off    
gi14     1G-Copper      --      --     --     --  Down           --     --    
gi15     1G-Copper      --      --     --     --  Down           --     --    
gi16     1G-Copper      --      --     --     --  Down           --     --    
gi17     1G-Copper      --      --     --     --  Down           --     --    
gi18     1G-Copper      --      --     --     --  Down           --     --    
gi19     1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     
gi20     1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     
gi21     1G-Copper      --      --     --     --  Down           --     --    
gi22     1G-Copper      --      --     --     --  Down           --     --    
gi23     1G-Copper    Full    100   Enabled  Off  Up          Disabled Off    
gi24     1G-Copper    Full    100   Enabled  Off  Up          Disabled On     
gi25     1G-Copper      --      --     --     --  Down           --     --    
gi26     1G-Copper      --      --     --     --  Down           --     --    
gi27     1G-Combo-C   Full    1000  Enabled  Off  Up          Disabled On     
gi28     1G-Combo-C   Full    1000  Enabled  Off  Up          Disabled On     

                                          Flow    Link        
Ch       Type    Duplex  Speed  Neg      control  State       
-------- ------- ------  -----  -------- -------  ----------- 
Po1      1G      Full    1000   Enabled  Off      Up          
Po2         --     --      --      --       --    Not Present 
Po3         --     --      --      --       --    Not Present 
Po4         --     --      --      --       --    Not Present 
Po5         --     --      --      --       --    Not Present 
Po6         --     --      --      --       --    Not Present 
Po7         --     --      --      --       --    Not Present 
Po8         --     --      --      --       --    Not Present

The ARP table is pretty much empty:

qanet13nue#show arp

Total number of entries: 1


  VLAN    Interface     IP address        HW address          status      
--------------------- --------------- ------------------- --------------- 
vlan 12               10.162.63.254   00:00:5e:00:01:04   dynamic

The MAC address table is the opposite (7 entries learned on local gi ports, everything else arriving via the Po1 uplink):

qanet13nue#show mac address-table
Flags: I - Internal usage VLAN
Aging time is 300 sec

    Vlan          Mac Address         Port       Type    
------------ --------------------- ---------- ---------- 
     1         c0:7b:bc:8f:f7:2a      Po1      dynamic   
     1         c0:7b:bc:8f:f7:ea      Po1      dynamic   
     1         cc:d5:39:52:50:9a       0         self    
     12        00:00:5e:00:01:04      Po1      dynamic   
     12        00:00:5e:00:02:04      Po1      dynamic   
     12        00:11:25:7d:2c:ce      Po1      dynamic   
     12        00:16:3e:9e:66:a6      Po1      dynamic   
     12        00:25:90:1a:7c:7d      Po1      dynamic   
     12        00:25:90:1a:7c:81      Po1      dynamic   
     12        00:25:90:1a:fc:24      Po1      dynamic   
     12        00:25:90:1a:fc:2c      Po1      dynamic   
     12        00:25:90:9a:ca:38      Po1      dynamic   
     12        00:25:90:9f:9d:84      Po1      dynamic   
     12        00:25:90:9f:f2:85      Po1      dynamic   
     12        00:25:90:9f:f2:86      Po1      dynamic   
     12        00:25:90:9f:f2:a6      Po1      dynamic   
     12        00:25:90:9f:f3:1f      Po1      dynamic   
     12        00:25:90:f2:06:14      Po1      dynamic   
     12        00:26:0b:f1:f0:8d      Po1      dynamic   
     12        00:50:56:44:51:87      Po1      dynamic   
     12        00:60:16:0f:1c:7b      gi13     dynamic   
     12        00:60:16:0f:1c:a3      gi23     dynamic   
     12        00:a0:98:6e:3a:1f      Po1      dynamic   
     12        00:a0:98:6e:3a:21      Po1      dynamic   
     12        00:a0:98:6e:3d:11      Po1      dynamic   
     12        00:c0:b7:30:7e:33      Po1      dynamic   
     12        00:c0:b7:4c:97:f7      Po1      dynamic   
     12        00:c0:b7:4c:98:87      Po1      dynamic   
     12        00:c0:b7:4c:98:99      Po1      dynamic   
     12        00:c0:b7:51:cb:e7      Po1      dynamic   
     12        00:c0:b7:51:cc:63      Po1      dynamic   
     12        00:c0:b7:6b:d8:20      Po1      dynamic   
     12        00:c0:b7:6b:d8:80      Po1      dynamic   
     12        00:c0:b7:d2:d9:87      Po1      dynamic   
     12        00:c0:dd:13:3b:9f      gi12     dynamic   
     12        00:de:fb:e3:d7:7c      Po1      dynamic   
     12        00:de:fb:e3:da:fc      Po1      dynamic   
     12        00:e0:81:64:f7:3f      Po1      dynamic   
     12        00:e0:86:0a:b4:4b      gi11     dynamic   
     12        04:da:d2:0e:50:49      Po1      dynamic   
     12        0c:fd:37:17:fe:92      Po1      dynamic   
     12        18:c0:4d:06:ce:59      Po1      dynamic   
     12        18:c0:4d:8c:82:90      Po1      dynamic   
     12        1c:1b:0d:ef:73:64      Po1      dynamic   
     12        20:bb:c0:c1:07:c7      Po1      dynamic   
     12        20:bb:c0:c1:0a:0b      Po1      dynamic   
     12        20:bb:c0:c1:0a:62      Po1      dynamic   
     12        20:bb:c0:c1:0b:f8      Po1      dynamic   
     12        20:bb:c0:c1:20:79      Po1      dynamic   
     12        26:9e:c2:c4:2c:0b      Po1      dynamic   
     12        2c:c8:1b:61:80:43      Po1      dynamic   
     12        36:aa:b7:fb:07:04      Po1      dynamic   
     12        3c:4a:92:75:67:66      Po1      dynamic   
     12        3c:ec:ef:5a:79:16      Po1      dynamic   
     12        40:f2:e9:73:5d:54      Po1      dynamic   
     12        40:f2:e9:73:5d:55      Po1      dynamic   
     12        40:f2:e9:a5:53:4c      Po1      dynamic   
     12        52:54:00:00:89:4e      Po1      dynamic   
     12        52:54:00:10:5e:0d      Po1      dynamic   
     12        52:54:00:1e:1b:04      Po1      dynamic   
     12        52:54:00:7b:ad:b5      Po1      dynamic   
     12        52:54:00:96:30:74      Po1      dynamic   
     12        52:54:00:d7:ff:7d      Po1      dynamic   
     12        5c:a4:8a:71:f4:88      Po1      dynamic   
     12        5c:f3:fc:00:2e:80      Po1      dynamic   
     12        5c:f3:fc:00:43:f4      Po1      dynamic   
     12        68:05:ca:92:c1:bb      Po1      dynamic   
     12        68:b5:99:76:8c:74      Po1      dynamic   
     12        6c:ae:8b:6e:04:a8      gi5      dynamic   
     12        70:e2:84:14:07:21      gi8      dynamic   
     12        74:4d:28:e2:c9:86      Po1      dynamic   
     12        7c:25:86:96:a9:d8      Po1      dynamic   
     12        90:1b:0e:db:6e:ef      Po1      dynamic   
     12        90:1b:0e:e8:d6:19      Po1      dynamic   
     12        98:be:94:02:9b:94      Po1      dynamic   
     12        98:be:94:07:3a:90      Po1      dynamic   
     12        98:be:94:4b:d3:96      Po1      dynamic   
     12        a0:42:3f:32:b4:71      gi7      dynamic   
     12        ac:1f:6b:03:22:4e      Po1      dynamic   
     12        ac:1f:6b:03:22:f8      Po1      dynamic   
     12        ac:1f:6b:e6:c2:e9      Po1      dynamic   
     12        b4:e9:b0:67:b9:d2      Po1      dynamic   
     12        b4:e9:b0:67:bd:57      Po1      dynamic   
     12        b4:e9:b0:67:be:60      Po1      dynamic   
     12        b4:e9:b0:67:bf:34      Po1      dynamic   
     12        b4:e9:b0:6e:84:06      Po1      dynamic   
     12        c4:72:95:2b:50:ed      Po1      dynamic   
     12        c4:7d:46:f2:72:35      Po1      dynamic   
     12        c4:7d:46:f2:78:36      Po1      dynamic   
     12        c8:00:84:a3:a1:33      Po1      dynamic   
     12        e8:6a:64:97:6b:a9      Po1      dynamic   
     12        ec:e1:a9:f8:8a:02      Po1      dynamic   
     12        ec:e1:a9:fc:c9:0c      Po1      dynamic   
     14        00:26:0b:f1:f0:8d      Po1      dynamic   
     14        00:a0:98:6e:3a:20      Po1      dynamic   
     14        00:a0:98:6e:3d:12      Po1      dynamic   
    710        00:00:5e:00:01:12      Po1      dynamic   
    711        00:00:5e:00:01:13      Po1      dynamic

IP routing table:

qanet13nue#show ip route
Maximum Parallel Paths: 1 (1 after reset)
IP Forwarding: disabled
Codes: > - best, C - connected, S - static


S   0.0.0.0/0 [1/1] via 10.162.63.254, 10585:32:02, vlan 12                
C   10.162.0.0/18 is directly connected, vlan 12                           

And the bridge settings (unicast and multicast tables look the same):

qanet13nue#show bridge unicast unknown
  Port      Unregistered
--------   --------------
gi1           Forward
gi2           Forward
gi3           Forward
gi4           Forward
gi5           Forward
gi6           Forward
gi7           Forward
gi8           Forward
gi9           Forward
gi10          Forward
gi11          Forward
gi12          Forward
gi13          Forward
gi14          Forward
gi15          Forward
gi16          Forward
gi17          Forward
gi18          Forward
gi19          Forward
gi20          Forward
gi21          Forward                                 
gi22          Forward
gi23          Forward
gi24          Forward
gi25          Forward
gi26          Forward
gi27          Forward
gi28          Forward
Po1           Forward
Po2           Forward
Po3           Forward
Po4           Forward
Po5           Forward
Po6           Forward
Po7           Forward
Po8           Forward

The MAC address for Power8-5-kvm should be 98:be:94:03:e9:4b, but it does not appear anywhere.
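
For completeness, the host-side counterpart of that check could look like this (a sketch; the interface name is an example):

QA-Power8-5-kvm:~ # ip -br link show eth0    # should print 98:be:94:03:e9:4b
QA-Power8-5-kvm:~ # ping -c3 openqa.suse.de  # generate frames, then re-check 'show mac address-table'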

Actions #58

Updated by kraih almost 3 years ago

I can't post the whole config here, but these are the interfaces:

qanet13nue#show running-config
...
interface vlan 12
 name qa                                              
 ip address 10.162.0.73 255.255.192.0
!
interface vlan 14
 name testnet
!
interface vlan 710
 name cloudqa-admin
!
interface vlan 711
 name cloudqa-bmc
!
interface gigabitethernet1
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet2
 description "cloud4.qa Node1 BMC"
 switchport trunk native vlan 12
!
interface gigabitethernet3
 description "cloud4.qa Node2 BMC"
 switchport trunk allowed vlan add 711                
 switchport trunk native vlan 12
!
interface gigabitethernet4
 description "cloud4.qa Node3 BMC"
 switchport trunk allowed vlan add 711
!
interface gigabitethernet5
 description "cloud4.qa Node4 BMC"
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet6
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet7
 description S812LC-SP
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet8
 description S822LC-SP                                
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet9
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet10
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet11
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet12
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet13
 switchport mode access
 switchport access vlan 12                            
!
interface gigabitethernet14
 description "cloud4.qa Node5 BMC"
 switchport trunk allowed vlan add 711
!
interface gigabitethernet15
 description "cloud4.qa Node6 BMC"
 switchport trunk allowed vlan add 711
!
interface gigabitethernet16
 description "cloud4.qa Node7 BMC"
 switchport trunk allowed vlan add 711
!
interface gigabitethernet17
 description "cloud4.qa Node8 BMC"
 switchport trunk allowed vlan add 711
!
interface gigabitethernet18
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet19                           
 description S812LC
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet20
 description S822LC
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet21
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet22
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet23
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet24                           
 switchport mode access
 switchport access vlan 12
!
interface gigabitethernet25
 switchport trunk native vlan 12
!
interface gigabitethernet26
 switchport trunk native vlan 12
!
interface gigabitethernet27
 channel-group 1 mode auto
!
interface gigabitethernet28
 channel-group 1 mode auto
!
interface Port-channel1
 flowcontrol auto
 description nx3kup
 switchport trunk allowed vlan add 12,14,710-711
!
...
Actions #59

Updated by okurz almost 3 years ago

  • Status changed from Workable to In Progress
Actions #60

Updated by kraih almost 3 years ago

With all the data collected, I think we can conclude that the rack switch is definitely misconfigured, with a lot of legacy settings from previous uses. As a next step we will need help from someone at Infra with Cisco IOS knowledge to reset and properly configure the switch, irrespective of who is ultimately responsible for maintaining it (the SNMP settings still say snmp-server contact infra@suse.com btw.).
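
Once the switch has been reset, a minimal sanity check with the commands already used above would be (a sketch):

qanet13nue#show interfaces status     # the worker links should negotiate 1000/Full
qanet13nue#show mac address-table     # 98:be:94:03:e9:4b should appear on its gi port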

Actions #61

Updated by kraih almost 3 years ago

  • Due date changed from 2022-02-04 to 2022-02-28

Talked to Gerhard Schlotter from Infra today and they are working on the problem now. I've given them access to the switch and two machines in the rack (Power8-4-kvm/Power8-5-kvm), so they can do network tests themselves. The initial plan is to simply reset and reconfigure the switch. I've also forwarded our concerns regarding the core switch. If necessary they will have physical access to the rack on Tuesday. So we should know more around Wednesday next week.

Actions #62

Updated by kraih almost 3 years ago

Quick update, so far Infra has not worked on the switch. It is planned for today though.

Actions #63

Updated by kraih almost 3 years ago

Got another update from Gerhard: the switch might be working fine again after a firmware update. I can now see both machines (Power8-4-kvm/Power8-5-kvm) in the MAC address table.

qanet13nue#show mac address-table
...
     12        98:be:94:03:e9:4b      gi20     dynamic
     12        98:be:94:04:48:17      gi19     dynamic
Actions #64

Updated by kraih almost 3 years ago

  • Status changed from In Progress to Feedback
Actions #65

Updated by kraih almost 3 years ago

And for completeness some iperf results:

QA-Power8-4-kvm:~ # iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 2620:113:80c0:80a0:10:162:29:60f port 57394 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  85.2 MBytes   715 Mbits/sec    5    325 KBytes
[  5]   1.00-2.00   sec  92.5 MBytes   776 Mbits/sec   21    457 KBytes
[  5]   2.00-3.00   sec  90.0 MBytes   755 Mbits/sec   45    453 KBytes
[  5]   3.00-4.00   sec   100 MBytes   839 Mbits/sec    1    395 KBytes
[  5]   4.00-5.00   sec  97.5 MBytes   818 Mbits/sec   34    513 KBytes
[  5]   5.00-6.00   sec  83.8 MBytes   703 Mbits/sec   34    222 KBytes
[  5]   6.00-7.00   sec  81.2 MBytes   682 Mbits/sec    0    379 KBytes
[  5]   7.00-8.00   sec  96.2 MBytes   807 Mbits/sec   51    510 KBytes
[  5]   8.00-9.00   sec  82.5 MBytes   692 Mbits/sec   31    305 KBytes
[  5]   9.00-10.00  sec  78.8 MBytes   661 Mbits/sec    5    192 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   888 MBytes   745 Mbits/sec  227             sender
[  5]   0.00-10.00  sec   885 MBytes   742 Mbits/sec                  receiver

iperf Done.
QA-Power8-4-kvm:~ # iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:29:60f port 57390 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  55.7 MBytes   467 Mbits/sec
[  5]   1.00-2.00   sec  68.7 MBytes   576 Mbits/sec
[  5]   2.00-3.00   sec  67.8 MBytes   569 Mbits/sec
[  5]   3.00-4.00   sec  65.5 MBytes   550 Mbits/sec
[  5]   4.00-5.00   sec  66.8 MBytes   560 Mbits/sec
[  5]   5.00-6.00   sec  76.1 MBytes   638 Mbits/sec
[  5]   6.00-7.00   sec  57.1 MBytes   479 Mbits/sec
[  5]   7.00-8.00   sec  57.0 MBytes   478 Mbits/sec
[  5]   8.00-9.00   sec  51.6 MBytes   433 Mbits/sec
[  5]   9.00-10.00  sec  50.1 MBytes   420 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   619 MBytes   519 Mbits/sec  4943             sender
[  5]   0.00-10.00  sec   616 MBytes   517 Mbits/sec                  receiver

iperf Done.

That's a pretty significant improvement.

Actions #66

Updated by MDoucha almost 3 years ago

Looks good so far. I've run some test jobs on QA-Power8-4-kvm and a 2.2 GB disk image was downloaded in ~61 seconds (roughly 36 MB/s, i.e. ~290 Mbit/s). Yesterday similarly sized files took 15-60 minutes to download on the same worker. We'll see tomorrow how the full KOTD tests perform.

Actions #67

Updated by okurz almost 3 years ago

@kraih any news from you about this today? Any plans to continue?

As the EngInfra ticket was closed but is missing some details that I would like to learn about, I asked in the ticket as well:

This is great news. For an issue that had such a big impact I would be happy to read a bit more about the investigation and fixing process. Could you please describe why you came to the conclusion that the firmware should be updated? How was that conducted, and what is the current, final state? What is done to prevent a similar situation for other switches/racks/rooms? What can we do to improve in a similar situation in the future? As the measurements of switch throughput clearly showed a significant change at the beginning of the problems, what measures are being taken on the monitoring & alerting level?

Given the impact of this issue we definitely should conduct a lessons learned meeting and Five Why analysis with follow-up tasks.

Actions #68

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Blocked
Actions #69

Updated by okurz almost 3 years ago

  • Tracker changed from action to coordination
  • Subject changed from All OSD PPC64LE workers except malbec appear to have horribly broken cache service to [epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service
Actions #70

Updated by kraih almost 3 years ago

All machines in the rack are back in production.

Actions #71

Updated by okurz over 2 years ago

  • Status changed from Blocked to Resolved

All subtasks resolved, lessons learned and recorded :)

Actions #72

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: autoyast_sles4sap_hana@ppc64le-sap
https://openqa.suse.de/tests/8751846#step/installation/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 140 days if nothing changes in this ticket.

Actions #73

Updated by okurz over 2 years ago

  • Status changed from Resolved to Feedback

Please check the reminder comment about a recent failure using this ticket as a label.

Actions #74

Updated by kraih over 2 years ago

Looks like the comment takeover was pointless, since the new issue is completely unrelated to the cache service and downloaded assets.

Actions #75

Updated by livdywan over 2 years ago

  • Status changed from Feedback to Resolved

kraih wrote:

Looks like the comment takeover was pointless, since the new issue is completely unrelated to the cache service and downloaded assets.

Ack. I created #112184 for the new issue.
