coordination #102882
closed[epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service
100%
Description
Observation
User report https://suse.slack.com/archives/C02CANHLANP/p1637666699462700 .
mdoucha: "All jobs are stuck downloading assets until they time out. OSD dashboard shows that the workers are downloading ridiculous amounts of data all the time since yesterday."
Suggestions
- Find corresponding monitoring data on https://monitor.qa.suse.de/ that can be used to visualize the problem as well as a verification after any potential fix
- Identify what might cause such problems "since yesterday", i.e. 2021-11-22
Rollback steps (to be done once the actual issue has been resolved)
powerqaworker-qam-1 # systemctl unmask openqa-worker-auto-restart@{3..6} openqa-reload-worker-auto-restart@{3..6}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{3..6} openqa-reload-worker-auto-restart@{3..6}.{service,timer}
QA-Power8-4-kvm # systemctl unmask openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer}
QA-Power8-5-kvm # systemctl unmask openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer}
- Add qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de back to salt and ensure all services are running again.
Updated by okurz about 3 years ago
powerqaworker-qam-1:/home/okurz # ps auxf | grep openqa
root 88223 0.0 0.0 4608 1472 pts/2 S+ 12:45 0:00 \_ grep --color=auto openqa
_openqa+ 4976 0.0 0.0 118720 114496 ? Ss Nov21 0:42 /usr/bin/perl /usr/share/openqa/script/worker --instance 7
_openqa+ 4983 0.0 0.0 112640 108608 ? Ss Nov21 0:36 /usr/bin/perl /usr/share/openqa/script/worker --instance 8
_openqa+ 29016 0.0 0.0 90624 78144 ? Ss Nov22 0:09 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+ 51130 0.0 0.0 90624 72896 ? S 06:07 0:21 \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+ 77232 0.0 0.0 90624 74048 ? S 10:41 0:07 \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+ 80692 0.0 0.0 90624 73600 ? S 11:25 0:04 \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+ 80892 0.0 0.0 90624 73728 ? S 11:27 0:04 \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+ 29017 0.0 0.0 82368 78144 ? Ss Nov22 0:50 /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+ 78221 1.1 0.0 84416 73920 ? S 10:53 1:19 \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+ 79516 1.1 0.0 85120 74560 ? S 11:09 1:07 \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+ 80419 1.1 0.0 84544 74048 ? S 11:21 0:58 \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+ 85600 1.1 0.0 84352 73984 ? S 12:21 0:16 \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+ 86025 1.1 0.0 84544 73984 ? S 12:26 0:12 \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+ 83287 0.0 0.0 76224 72192 ? Ss 11:51 0:02 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
_openqa+ 83289 0.0 0.0 75968 72192 ? Ss 11:51 0:02 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
_openqa+ 83521 0.0 0.0 76096 72128 ? Ss 11:54 0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 6
_openqa+ 84583 0.0 0.0 76352 72320 ? Ss 12:08 0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
_openqa+ 87663 0.1 0.0 75968 72064 ? Ss 12:40 0:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
_openqa+ 87759 0.2 0.0 76160 72192 ? Ss 12:41 0:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 3
and strace-ing one process, the oldest still running cache minion process, reveals:
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "O\253\302\342\277\t\367z\305x\340E\325\344\340\353\23\261+\353\r\21\315\207\211\301\334\251\364\357\262\347"..., 131072) = 4284
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "O\253\302\342\277\t\367z\305x\340E\325\344\340\353\23\261+\353\r\21\315\207\211\301\334\251\364\357\262\347"..., 4284) = 4284
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\270\321x\243\356\263&\303_\254E{\242.\370-\32\274!\275YC\177\244\265\206\355T\227\7\327\255"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\270\321x\243\356\263&\303_\254E{\242.\370-\32\274!\275YC\177\244\265\206\355T\227\7\327\255"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "U\269\247NE;\325\ta\210\275\314y\244M\346]4 \340Y\312\343<\374~\376\370_\336@"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "U\269\247NE;\325\ta\210\275\314y\244M\346]4 \340Y\312\343<\374~\376\370_\336@"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\256\231h]\240\361Y>}\213\376\221m\310\263:\27\310\33204u\327=(\2729/\317\252\367\22"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\256\231h]\240\361Y>}\213\376\221m\310\263:\27\310\33204u\327=(\2729/\317\252\367\22"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\2134\255\354\216\34\t\365\305+\250\25y\207s\204y\234\235\253\332}\376\356x\251\346C2\17\370="..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\2134\255\354\216\34\t\365\305+\250\25y\207s\204y\234\235\253\332}\376\356x\251\346C2\17\370="..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\364+\375\200^\36\362yB\314\210.[}&\37\351\231\371\36\247\22\317\245~\260\vy\205\354\206_"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\364+\375\200^\36\362yB\314\210.[}&\37\351\231\371\36\247\22\317\245~\260\vy\205\354\206_"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "2N\235\306,\317\333\"W\37\267\304f\236\234\n\317\376\367\314\206\375\261\226\32#W\v\316\246\221\265"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "2N\235\306,\317\333\"W\37\267\304f\236\234\n\317\376\367\314\206\375\261\226\32#W\v\316\246\221\265"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\301c\314\335\6zK\35^E\314\330\23\276\10\301\277\360\367\216_\6?Y\220\t\370\330\30gtU"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\301c\314\335\6zK\35^E\314\330\23\276\10\301\277\360\367\216_\6?Y\220\t\370\330\30gtU"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "S\252\2354\3654g\246\217x\"\272\307E\226K\5J\255\350(\331\223\fE\357V\253l\1W\340"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "S\252\2354\3654g\246\217x\"\272\307E\226K\5J\255\350(\331\223\fE\357V\253l\1W\340"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\257m.\324\265Y\361\2k^\270\374\335\251o~\374\351\271\177\354\213\16K_\273\v5\231\262/\236"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\257m.\324\265Y\361\2k^\270\374\335\251o~\374\351\271\177\354\213\16K_\273\v5\231\262/\236"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, " -\3471\373+\273kr`\323cf[}F\227\27\265\23\313\243\366\25\366{}\324\356Zf\27"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, " -\3471\373+\273kr`\323cf[}F\227\27\265\23\313\243\366\25\366{}\324\356Zf\27"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\351\254\231\325\320#z\230\351\335-\217\214\350\354\3041\232\227*\27\332rE\251\274o\305\305\265\232\363"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\351\254\231\325\320#z\230\351\335-\217\214\350\354\3041\232\227*\27\332rE\251\274o\305\305\265\232\363"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "C\337\206\23W\366O\302\v\226\207\256\273\26o\302\265$\306\375\6O\265\263|\250\276-\254\275Y\263"..., 131072) = 4284
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "C\337\206\23W\366O\302\v\226\207\256\273\26o\302\265$\306\375\6O\265\263|\250\276-\254\275Y\263"..., 4284) = 4284
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\34\242\376\314\n\267\323\322\244\2300,o\270?~\315\234\236\277f~\225\271i\372\26\257\27{\342\227"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\34\242\376\314\n\267\323\322\244\2300,o\270?~\315\234\236\277f~\225\271i\372\26\257\27{\342\227"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\236\304;r\303\342uxB}b\31I\357\333\214\242\213^\243.\350\33Zk\317@3\t\307S#"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\236\304;r\303\342uxB}b\31I\357\333\214\242\213^\243.\350\33Zk\317@3\t\307S#"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "Fy_\362\317N\230{Y\376\366\364\206\264\37\277\323\31\256h=\350\36=\212\37\257\352\177\324\226~"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "Fy_\362\317N\230{Y\376\366\364\206\264\37\277\323\31\256h=\350\36=\212\37\257\352\177\324\226~"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "~E\263\307\231\370\204\374P\361\234kK\347|\324\372\325\253\277\276\362\345\253\344\341\303\346\317\277\273\242"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "~E\263\307\231\370\204\374P\361\234kK\347|\324\372\325\253\277\276\362\345\253\344\341\303\346\317\277\273\242"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\364J\376p\r\217@W\361.b\232\0313.\351\220\325,\356#=k\3253\256\244U\203\374\266#"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\364J\376p\r\217@W\361.b\232\0313.\351\220\325,\356#=k\3253\256\244U\203\374\266#"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\267\231k\345\311\333SV\313\373\366\25\177\364\n\357\312\317\325r\305\322ox\30yAo\320\32\371\320"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\267\231k\345\311\333SV\313\373\366\25\177\364\n\357\312\317\325r\305\322ox\30yAo\320\32\371\320"..., 1428) = 1428
so the process is busy reading over network and writing into a local cache file?
Updated by mkittler about 3 years ago
On the Minion dashboard no download jobs have been piling up. However, judging by htop the speed it writes to disk is below 1 M/s (per process). That's very slow. And yes, it is reading over network and writes into a local cache file. I suppose that is expected - just not that it is that slow.
The network connection to OSD isn't generally slow. I've just tested with iperf3 on power8-4-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de and got > 600 Mbits/sec. The write performance on /var/lib/openqa/cache/tmp also looks good on both workers.
Updated by mkittler about 3 years ago
Judging by the job history, the affected machines are qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de. grenache-1 and malbec look good.
Updated by nicksinger about 3 years ago
mkittler wrote:
Judging by the job history, the affected machines are qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de. grenache-1 and malbec look good.
grenache-1 looking good is an interesting observation as it would show that affected machines are not only in the qa.suse.de subdomain. Given our history I'd recommend checking network performance to/from OSD using iperf3 with the respective parameters for IPv4 and IPv6. Maybe this reveals some first details.
Updated by mkittler about 3 years ago
I've checked with iperf3 again. There was no difference between using -4 and -6.
Updated by kraih about 3 years ago
Not seeing anything unusual in the logs on powerqaworker-qam-1.qa.suse.de either.
Updated by mkittler about 3 years ago
When using iperf3 -R to test downloading (from OSD) on qa-power8-4-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de there's a huge slowdown to < 5 Mbit/s (regardless of whether IPv4 or IPv6 is used). That's not the case on the good host malbec so I assume we have our problem - unless this is really just due to the ongoing downloads. The ongoing downloads use only 20 Mbit/s (2.5 Mbyte/s). That is very slow. If we add it to the performance test speed we're still only at a receive rate of 25 Mbit/s.
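To put these rates into perspective, the following sketch estimates how long a single asset download takes at the degraded and at the healthy rate. It is only an illustration: the function name and the 1 GB asset size are assumptions made for this example, while the two rates are the rounded figures from the iperf3 tests mentioned in this thread.

```python
def download_seconds(size_bytes: int, rate_mbit_per_s: float) -> float:
    """Time to transfer size_bytes at a sustained rate given in Mbit/s."""
    if rate_mbit_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes * 8 / (rate_mbit_per_s * 1_000_000)


# Hypothetical 1 GB asset; this size is an assumption, not a figure from the ticket.
ASSET_BYTES = 10**9

# 5 Mbit/s: upper bound seen with "iperf3 -R"; 600 Mbit/s: lower bound seen without -R.
for label, rate in (("degraded", 5.0), ("healthy", 600.0)):
    minutes = download_seconds(ASSET_BYTES, rate) / 60
    print(f"{label}: {minutes:.1f} min at {rate} Mbit/s")
```

With several worker slots sharing the same degraded link, the per-download time grows accordingly, which would be consistent with jobs running into their asset download timeout.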
Updated by nicksinger about 3 years ago
All affected machines seem to be located in SRV2 according to racktables: https://racktables.suse.de/index.php?page=object&tab=default&object_id=3026
Here you have some network-graphs for the switch they are most likely connected to: http://mrtg.suse.de/qanet13nue/index.html
I checked the connection speeds on that switch. According to these graphs 3 of these ports seem to max out at ~100Mbit/s (still quite a bit more than measured by @mkittler):
qanet13nue#show interfaces status
                                            Flow Link        Back     Mdix
Port     Type         Duplex Speed Neg      ctrl State       Pressure Mode
-------- ------------ ------ ----- -------- ---- ----------- -------- -------
gi1      1G-Copper    Full   1000  Enabled  Off  Up          Disabled Off
gi2      1G-Copper    Full   1000  Enabled  Off  Up          Disabled On
gi3      1G-Copper    Full   1000  Enabled  Off  Up          Disabled On
gi4      1G-Copper    --     --    --       --   Down        --       --
gi5      1G-Copper    Full   1000  Enabled  Off  Up          Disabled Off
gi6      1G-Copper    Full   1000  Enabled  Off  Up          Disabled On
gi7      1G-Copper    Full   100   Enabled  Off  Up          Disabled On
gi8      1G-Copper    Full   1000  Enabled  Off  Up          Disabled Off
gi9      1G-Copper    --     --    --       --   Down        --       --
gi10     1G-Copper    --     --    --       --   Down        --       --
gi11     1G-Copper    Full   100   Enabled  Off  Up          Disabled On
gi12     1G-Copper    Full   100   Enabled  Off  Up          Disabled On
gi13     1G-Copper    Full   100   Enabled  Off  Up          Disabled Off
gi14     1G-Copper    --     --    --       --   Down        --       --
gi15     1G-Copper    --     --    --       --   Down        --       --
gi16     1G-Copper    --     --    --       --   Down        --       --
gi17     1G-Copper    --     --    --       --   Down        --       --
gi18     1G-Copper    --     --    --       --   Down        --       --
gi19     1G-Copper    Full   1000  Enabled  Off  Up          Disabled On
From the MAC address-table I see the following connections:
powerqaworker-qam-1.qa.suse.de: gi5
QA-Power8-5.qa.suse.de: gi8
QA-Power8-4.qa.suse.de: gi7
So only qa-power8-4 is connected over 100Mbit/s.
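The manual check of the port table can be sketched as a small filter that lists ports which are up but negotiated below the expected speed. This is a hedged illustration only: `slow_ports` is a name made up here, and the parser assumes the nine-column row layout of the `show interfaces status` output pasted above.

```python
def slow_ports(status_output: str, expected_mbit: int = 1000) -> list[tuple[str, int]]:
    """Return (port, speed) for ports that are Up but slower than expected_mbit."""
    slow = []
    for line in status_output.splitlines():
        fields = line.split()
        # Data rows have 9 columns: port, type, duplex, speed, neg,
        # flow ctrl, link state, back pressure, mdix mode.
        if len(fields) != 9 or not fields[0].startswith("gi"):
            continue
        port, speed, state = fields[0], fields[3], fields[6]
        if state == "Up" and speed.isdigit() and int(speed) < expected_mbit:
            slow.append((port, int(speed)))
    return slow


# Three rows copied from the switch output above.
SAMPLE = """\
gi5 1G-Copper Full 1000 Enabled Off Up Disabled Off
gi7 1G-Copper Full 100 Enabled Off Up Disabled On
gi9 1G-Copper -- -- -- -- Down -- --
"""

print(slow_ports(SAMPLE))
```

Header and separator lines are skipped because their first field does not start with `gi`, and ports that are down are ignored because their speed column is `--`.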
Updated by mkittler about 3 years ago
I've stopped all services on powerqaworker-qam-1.qa.suse.de. Even without ongoing downloads the network speed is very slow:
martchus@powerqaworker-qam-1:~> iperf3 -R -4 -c openqa.suse.de -i 1 -t 30
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 10.162.7.211 port 38894 connected to 10.160.0.207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 232 KBytes 1.90 Mbits/sec
[ 5] 1.00-2.00 sec 861 KBytes 7.06 Mbits/sec
[ 5] 2.00-3.00 sec 897 KBytes 7.34 Mbits/sec
[ 5] 3.00-4.00 sec 441 KBytes 3.61 Mbits/sec
[ 5] 4.00-5.00 sec 168 KBytes 1.38 Mbits/sec
[ 5] 5.00-6.00 sec 810 KBytes 6.64 Mbits/sec
[ 5] 6.00-7.00 sec 427 KBytes 3.50 Mbits/sec
[ 5] 7.00-8.00 sec 157 KBytes 1.29 Mbits/sec
[ 5] 8.00-9.00 sec 577 KBytes 4.73 Mbits/sec
[ 5] 9.00-10.00 sec 566 KBytes 4.63 Mbits/sec
[ 5] 10.00-11.00 sec 406 KBytes 3.32 Mbits/sec
[ 5] 11.00-12.00 sec 714 KBytes 5.85 Mbits/sec
[ 5] 12.00-13.00 sec 571 KBytes 4.68 Mbits/sec
[ 5] 13.00-14.00 sec 925 KBytes 7.58 Mbits/sec
[ 5] 14.00-15.00 sec 474 KBytes 3.88 Mbits/sec
[ 5] 15.00-16.00 sec 952 KBytes 7.80 Mbits/sec
[ 5] 16.00-17.00 sec 161 KBytes 1.32 Mbits/sec
[ 5] 17.00-18.00 sec 218 KBytes 1.78 Mbits/sec
[ 5] 18.00-19.00 sec 1.16 MBytes 9.72 Mbits/sec
[ 5] 19.00-20.00 sec 475 KBytes 3.89 Mbits/sec
[ 5] 20.00-21.00 sec 976 KBytes 7.99 Mbits/sec
[ 5] 21.00-22.00 sec 1.38 MBytes 11.6 Mbits/sec
[ 5] 22.00-23.00 sec 496 KBytes 4.07 Mbits/sec
[ 5] 23.00-24.00 sec 358 KBytes 2.93 Mbits/sec
[ 5] 24.00-25.00 sec 1024 KBytes 8.39 Mbits/sec
[ 5] 25.00-26.00 sec 779 KBytes 6.38 Mbits/sec
[ 5] 26.00-27.00 sec 761 KBytes 6.23 Mbits/sec
[ 5] 27.00-28.00 sec 434 KBytes 3.56 Mbits/sec
[ 5] 28.00-29.00 sec 663 KBytes 5.43 Mbits/sec
[ 5] 29.00-30.00 sec 786 KBytes 6.44 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 18.6 MBytes 5.19 Mbits/sec 2284 sender
[ 5] 0.00-30.00 sec 18.5 MBytes 5.16 Mbits/sec receiver
All affected workers are in the same rack: https://racktables.suse.de/index.php?page=rack&rack_id=520
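For comparing runs like the one above, the per-interval bitrates can be extracted from the plain-text iperf3 output. The sketch below is an assumption-laden helper, not part of any existing tooling: the function name is made up, and the regular expression only targets the interval line format shown in this comment (summary lines ending in "sender" or "receiver" are deliberately skipped).

```python
import re
import statistics

# Matches interval lines such as "[  5]  0.00-1.00  sec  232 KBytes  1.90 Mbits/sec".
_INTERVAL = re.compile(
    r"^\[\s*\d+\]\s+[\d.]+-[\d.]+\s+sec\s+[\d.]+\s+[KMG]?Bytes\s+"
    r"([\d.]+)\s+([KMG]?)bits/sec\s*$"
)
_TO_MBIT = {"": 1e-6, "K": 1e-3, "M": 1.0, "G": 1e3}


def interval_rates_mbit(output: str) -> list[float]:
    """Extract the per-interval bitrates (in Mbit/s) from iperf3 text output."""
    rates = []
    for line in output.splitlines():
        match = _INTERVAL.match(line.strip())
        if match:
            rates.append(float(match.group(1)) * _TO_MBIT[match.group(2)])
    return rates


# Lines copied from the run above: two intervals and the sender summary.
SAMPLE = """\
[ 5] 0.00-1.00 sec 232 KBytes 1.90 Mbits/sec
[ 5] 18.00-19.00 sec 1.16 MBytes 9.72 Mbits/sec
[ 5] 0.00-30.00 sec 18.6 MBytes 5.19 Mbits/sec 2284 sender
"""

rates = interval_rates_mbit(SAMPLE)
print(min(rates), max(rates), statistics.mean(rates))
```

For automated use, iperf3's JSON output (`-J`) is likely a more robust input than scraping the text format.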
Updated by mkittler about 3 years ago
- Status changed from New to Feedback
I've created an Infra ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-67703
I've also just stopped all worker slots on the affected hosts and removed them from salt-key.
Updated by okurz about 3 years ago
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=84&tab=alert&from=1637506758247&to=1637657889676 maybe points to the same. The apache response time seems to have gone up in the past two days
Updated by okurz about 3 years ago
- Does running iperf3 in server mode when there are no clients reading from there incur any overhead? If not, should we run it there permanently for monitoring and investigation purposes?
- @mkittler can you provide something like a "one-liner" to reproduce the problem, e.g. the necessary iperf3 command line both on server+worker
- can we run the according iperf3 commands periodically in our monitoring? I guess just some seconds every hour should provide enough data and we can smooth in grafana
- I suggest trying to power down + power up the affected machines over IPMI. Maybe this already helps with port renegotiation or something
- As the problem appeared only recently I suggest we roll back package changes, e.g. the kernel version. Even though some workers still behave fine, this could still be a regression from updates that only affects some machines due to the particularities of their network setup
Updated by nicksinger about 3 years ago
okurz wrote:
- Does running iperf3 in server mode when there are no clients reading from there incur any overhead? If not, should we run it there permanently for monitoring and investigation purposes?
Running just the server does not really add overhead beyond the usual load of an idle process.
- can we run the according iperf3 commands periodically in our monitoring? I guess just some seconds every hour should provide enough data and we can smooth in grafana
There is this open request with a simple exec example: https://github.com/influxdata/telegraf/issues/3866#issuecomment-694429507 - this should work for our use case. We just need to make sure not to run the requests to all workers at the same time because that could quite easily saturate the whole link of OSD
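One way to avoid running all measurements at the same time is to give every worker its own start offset within the measurement period. The sketch below only computes such offsets; the function name, the hourly period and the 10 second run length are assumptions for illustration, and how the offsets are fed into telegraf or a timer is left open.

```python
def staggered_offsets(hosts: list[str], period_s: int = 3600, run_s: int = 10) -> dict[str, int]:
    """Assign each host a start offset (seconds into the period) so runs never overlap."""
    ordered = sorted(hosts)
    if not ordered:
        return {}
    slot = period_s // len(ordered)
    if slot < run_s:
        raise ValueError("too many hosts for non-overlapping runs in this period")
    return {host: index * slot for index, host in enumerate(ordered)}


# Host names taken from this ticket; period and run length are assumed values.
workers = [
    "qa-power8-4-kvm.qa.suse.de",
    "qa-power8-5-kvm.qa.suse.de",
    "powerqaworker-qam-1.qa.suse.de",
]
for host, offset in staggered_offsets(workers).items():
    print(f"{host}: start {offset} s after the full hour")
```

Sorting the host names keeps the assignment stable between runs as long as the set of workers does not change.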
Updated by mkittler about 3 years ago
Unfortunately Infra doesn't have access to the server either. Maybe they can at least tell us who has.
I've rebooted but it didn't help. I've booted into the snapshot from Mi 24 Nov 2021 10:49:35 CET but it didn't help. The rates are a tiny bit higher now but that's likely just because now all downloads on all the hosts had been stopped. It is still just 15.8 Mbits/sec.
Updated by okurz about 3 years ago
- Copied to coordination #102951: [epic] Better network performance monitoring added
Updated by okurz about 3 years ago
From https://sd.suse.com/servicedesk/customer/portal/1/SD-67703
Gerhard Schlotter (3 hours ago): "how should we help with the issues, we neither have access to the switch nor the affected servers. to the qanet switches, someone from QA team [has access]. The uplink from our side is completely fine and can carry a lot more load."
Who can pick this up and access the switches to check, maybe reboot, unplug some cables, etc.?
Updated by mkittler about 3 years ago
- Assignee changed from mkittler to nicksinger
Who can pick this up and access the switches to check, maybe reboot, unplug some cables, etc.?
Nick says he has access so I'm assigning the ticket to him.
@nicksinger I can of course still do the rollback steps (mentioned in the ticket description) for you in the end or do some further testing to see whether something works better after some changes.
Updated by mkittler about 3 years ago
@nicksinger has restarted the switch but the networking speed is still slow. (Even though all workers are now back online I'd expect more than 4 Mbit/s download rate via iperf3 from OSD.)
Updated by mkittler about 3 years ago
I don't know. Maybe it is possible to plug the machines into another switch or try with a spare switch?
Updated by nicksinger about 3 years ago
I asked Wei Gao if I can get access to migration-smt2.qa.suse.de to run an iperf crosscheck with another machine in the same rack
Updated by okurz about 3 years ago
- Due date set to 2021-12-29
- Status changed from Feedback to In Progress
- Priority changed from Urgent to High
Current state
Performance is still degraded, but some worker instances are running again, so we have addressed the urgency of the ticket and can reduce the priority to "High".
Observations
- http://mrtg.suse.de/qanet13nue/10.162.0.73_gi1.html shows that there is significant traffic on that port since 2021-W47, i.e. 2021-11-22, the start of the problems and near-zero going back to 2020-11. same for gi2, gi5, gi7, gi8, gi11, gi12, gi13, gi23, gi24
- the corresponding counterpart to qanet13 is visible on http://mrtg.suse.de/Nx5696Q-Core2/192.168.0.121_369098892.html (qanet13 connection on core2) and http://mrtg.suse.de/Nx5696Q-Core1/192.168.0.120_526649856.html (qanet13 connection on core1) but neither seem to show significant traffic increase since 2021-11-22, so where is the traffic coming from? Is the switch qanet13 sending out broadcasts itself?
- qanet13nue uplink seems to be gi27+gi28 (found with show interfaces Port-Channel 1). http://mrtg.suse.de/qanet13nue/10.162.0.73_po1.html is the aggregated view and shows nothing significant. But we see that also in the past we had spikes to 320 MBit/s "in" and 240 MBit/s "out" and no such spikes since 2021-W47, limited to 100 MBit/s? Yearly average looks sane, nothing special, average 46 MBit/s "in" and 16 MBit/s "out".
- We identified that the hosts that are called like S812LC and S822LC on http://mrtg.suse.de/qanet13nue/index.html are according to https://racktables.suse.de/index.php?page=object&object_id=992 our power hosts qa-power8-4 (S812LC) and qa-power8-5 (S822LC) and respective "service processors" S812LC-SP and S822LC-SP. gi6 is powerqaworker-qam-1 (according to iperf experiment from hyperv host).
- On http://mrtg.suse.de/qanet13nue/index.html we can see that many hosts receive significant traffic since 2021-11-22 but don't show change in sending traffic. The only port that shows significant corresponding incoming traffic is the uplink. So our conclusion is that unintended broadcast traffic received by the rack switch is forwarded to all hosts and especially the Power machines seem to be badly affected by this (either traffic on SPs or the host itself or both) so that sending still works with high bandwidth but receiving only gets a very low bandwidth
- booted powerqaworker-qam-1 with kernel 5.3.18-lp152.102-default from 2021-11-11 from /boot, that is before the start of the problem on 2021-11-22 and ran iperf3 -t 1200 -R -c openqaworker12 yielding 5.9 MBit/s so same on this older kernel => kernel regression unlikely
Suggestions
- WAITING Ask users of other machines in the same rack if they have network problems, e.g. migration-smt2.qa.suse.de , ask migration team -> nsinger asked, @waitfor response
- DONE Conduct network performance test between two hosts within the same rack: nsinger conducted this test between qa-power8-4 (server) and powerqaworker-qam-1 (client) and measured 3.13 MBit/s, so orders of magnitude too low for 1 GBit/s; same for qa-power8-5 (client, both directions). Crosscheck between two other hosts in another rack: we did this for openqaworker10+13 and got 945 MBit/s, so as expected near 1 GBit/s accounting for overhead.
- DONE Try to saturate the switch bandwidth using iperf3 until we can actually see the result on http://mrtg.suse.de/qanet13nue/index.html -> we could see the results using openqaw9-hyperv which we verified to be connected to gi1. http://mrtg.suse.de/qanet13nue/10.162.0.73_gi1.html
- DONE Logged in over RDP to openqaw9-hyperv.qa.suse.de and downloaded https://iperf.fr/iperf-download.php for Windows
- DONE executed tests against qa-power8-4-kvm resulting in 1.3 MBit/s, openqaworker10->openqaw9-hyperv.qa.suse.de => 204 MBit/s, openqaw9-hyperv.qa.suse.de->openqaworker10 => 248 MBit/s so system is fine, switch is not generally broken
- DONE Started iperf3 -s on openqaworker12 and on openqaw9-hyperv iperf3.exe -t 1200 -c openqaworker12 at 11:09:00Z, trying to see the bandwidth on http://mrtg.suse.de/qanet20nue/index.html . stopped as expected 11:29:00Z. Reported bandwidth 77 MBit/s in both directions. MAC-address 00:15:17:B1:03:88 or 00:15:17:B1:03:89 . nsinger has confirmed that he sees this address on qanet13nue:gi1 .
- DONE Now starting iperf3 -t 1200 -c powerqaworker-qam-1 -> 1.02 MBit/s. Reverse iperf3 -t 1200 -R -c powerqaworker-qam-1 shows bandwidth of 692 MBit/s (!) => only download to machine affected
- DONE Examine the traffic, e.g. wireshark on any host on the rack, and see if we can identify the traffic and forward that information to the corresponding users or Eng Infra -> nothing found by nsinger so far
- Try to connect the affected machines to another switch, e.g. in a neighboring rack, and execute iperf3 runs. nicksinger will coordinate with gschlotter from Eng Infra to do that
- REJECTED Check for log output on power8-4 why the link is only 100 MBit/s and coordinate with Eng Infra to replace the cable on the port connected to power8-4 and/or connect to another port on the same switch -> mkittler confirmed that Linux reports the link is 1 GBit/s so this is a false report. Maybe some BMC is connected on that port.
- Ask Eng Infra to give more members or the complete team of QE Tools ssh access to the switch, at least read-only access for monitoring. If Eng Infra does not know how to do that maybe nsinger can do it himself directly
- Disable individual ports on the switch to check if that improves the situation for power workers -> likely will not affect as we assume the problem to come from outside the switch over the uplink
- Conduct network performance benchmark on affected power hosts in a stripped-down environment with no other significant traffic. Also we can not access the host powerqaworker-qam-1 using iperf or any other port from the other hosts.
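The forward/reverse comparison from the list above can be written down as a small check that names the degraded direction of a host. This is a sketch under assumptions: the function name and the 10% threshold are made up here, while the sample rates are the measurements quoted in the list.

```python
def degraded_directions(
    receive_mbit: float,
    send_mbit: float,
    link_mbit: float = 1000.0,
    min_fraction: float = 0.1,
) -> list[str]:
    """Name the directions whose rate is below min_fraction of the link speed."""
    limit = link_mbit * min_fraction
    measured = (("receive", receive_mbit), ("send", send_mbit))
    return [name for name, rate in measured if rate < limit]


# powerqaworker-qam-1 as seen from openqaw9-hyperv:
# 1.02 MBit/s towards the host, 692 MBit/s from the host (-R).
print(degraded_directions(receive_mbit=1.02, send_mbit=692.0))

# Crosscheck pair openqaworker10+13 in another rack: 945 MBit/s.
print(degraded_directions(receive_mbit=945.0, send_mbit=945.0))
```

The threshold is arbitrary; any value well below the healthy 945 MBit/s and well above the single-digit rates seen on the affected hosts separates the two cases.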
Updated by okurz about 3 years ago
https://progress.opensuse.org/issues/102882
on powerqaworker-qam-1 I stopped many services and also unmounted NFS. I ran tcpdump -i eth4. Traffic I found (example block):
14:05:32.357093 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29685225:29688121, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660203], length 2896
14:05:32.357238 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [P.], seq 29688121:29689569, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660203], length 1448
14:05:32.357239 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29689569:29691017, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660203], length 1448
14:05:32.357385 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29691017:29693913, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357533 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29693913:29696809, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357677 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29696809:29699705, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357825 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29699705:29702601, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357968 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29702601:29705497, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.358107 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29705497:29706945, ack 751, win 505, options [nop,nop,TS val 3180811359 ecr 2119660204], length 1448
14:05:32.369753 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125080480:125081928, ack 28937, win 529, options [nop,nop,TS val 1725941080 ecr 1084980048], length 1448
14:05:32.369810 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125081928, win 3896, options [nop,nop,TS val 1084981522 ecr 1725941080], length 0
14:05:32.369945 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125089168:125092064, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981522], length 2896
14:05:32.369995 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125081928, win 3896, options [nop,nop,TS val 1084981522 ecr 1725941080,nop,nop,sack 1 {125089168:125092064}], length 0
14:05:32.370107 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125081928:125084824, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981522], length 2896
14:05:32.370148 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125084824, win 3874, options [nop,nop,TS val 1084981523 ecr 1725941081,nop,nop,sack 1 {125089168:125092064}], length 0
14:05:32.370296 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125084824:125089168, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981523], length 4344
14:05:32.370297 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125092064:125093512, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981523], length 1448
14:05:32.370345 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125092064, win 3862, options [nop,nop,TS val 1084981523 ecr 1725941081], length 0
14:05:32.370345 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125093512, win 3853, options [nop,nop,TS val 1084981523 ecr 1725941081], length 0
14:05:32.370440 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125093512:125094960, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981523], length 1448
14:05:32.370480 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125094960, win 3896, options [nop,nop,TS val 1084981523 ecr 1725941081], length 0
14:05:32.377555 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29708393:29709841, ack 751, win 505, options [nop,nop,TS val 3180811378 ecr 2119660204], length 1448
14:05:32.377757 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29709841:29711289, ack 751, win 505, options [nop,nop,TS val 3180811378 ecr 2119660224], length 1448
I asked in #help-it-ama who 149.44.176.6 is. drodgriguez answered that it's https://api.suse.de and pointed to the racktables entry https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=6198&hl_ip=149.44.176.6 . I see quite some https traffic from that host to QA-Power8-5-kvm.qa.suse.de, I guess it's AMQP. The capture above shows traffic to and from QA-Power8-4-kvm and QA-Power8-5-kvm, so why do I see it at all on powerqaworker-qam-1?
Trying an older kernel on powerqaworker-qam-1.qa:
sudo kexec --exec --load /boot/vmlinux-5.3.18-lp152.102-default --initrd=/boot/initrd-5.3.18-lp152.102-default --command-line="$(cat /proc/cmdline)"
Same results there, so the kernel has no impact. I asked in the SUSE-IT ticket.
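To quantify how much of the captured traffic is not meant for the capturing host at all, the tcpdump text output can be grouped by destination. This is a minimal sketch and not what was run for this ticket: `foreign_traffic` is a hypothetical helper name, it only handles IPv4 lines in the default tcpdump text format shown above, and the prefix match on the host name is a simplification.

```shell
#!/bin/bash
# Count packets per destination host in tcpdump text output read from stdin,
# skipping packets sent to or from the local host (hypothetical helper).
# Example (not executed here):
#   tcpdump -i eth4 -c 1000 | foreign_traffic powerqaworker-qam-1
foreign_traffic() {
    local self="$1"
    awk -v self="$self" '
        $2 == "IP" && $4 == ">" {
            src = $3; dst = $5
            sub(/:$/, "", dst)          # drop the colon after the destination
            sub(/\.[^.]+$/, "", src)    # drop the port or service name
            sub(/\.[^.]+$/, "", dst)
            # case-insensitive prefix match so short and full names both work
            if (index(tolower(src), tolower(self)) == 1) next
            if (index(tolower(dst), tolower(self)) == 1) next
            count[dst]++
        }
        END { for (d in count) print count[d], d }
    ' | sort -rn
}
```

Any destination that shows up in the result is traffic the interface should normally not see on a switched network, which is the observation raised in the comment above.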
Updated by okurz about 3 years ago
mdoucha reported in https://suse.slack.com/archives/C02CANHLANP/p1639646867388200 that PPC64LE jobs are failing again on MAX_SETUP_TIME and that again many instances are online. I did:
powerqaworker-qam-1 # systemctl mask --now openqa-worker-auto-restart@{3..6}
QA-Power8-4-kvm # systemctl mask --now openqa-worker-auto-restart@{4..8}
QA-Power8-5-kvm # systemctl mask --now openqa-worker-auto-restart@{4..8}
I called
for i in powerqaworker-qam-1 QA-Power8-4-kvm QA-Power8-5-kvm ;do host=openqa.suse.de WORKER="$i" failed_since=2021-12-01 result="result='timeout_exceeded'" bash -ex openqa-advanced-retrigger-jobs; done
but found no jobs that were not already automatically restarted. I also called
for i in powerqaworker-qam-1 QA-Power8-4-kvm QA-Power8-5-kvm ;do host=openqa.suse.de WORKER="$i" failed_since=2021-12-01 result="result='incomplete'" bash -ex openqa-advanced-retrigger-jobs; done
which looks like it also effectively did not restart any jobs as they all miss a necessary asset.
EDIT: We observed failed systemd services because the corresponding "openqa-reload-worker-auto-restart" services now fail while the "openqa-worker-auto-restart" services are masked. So we also need to mask those (and I did that now):
powerqaworker-qam-1 # systemctl mask --now openqa-reload-worker-auto-restart@{3..6} ; systemctl reset-failed
QA-Power8-4-kvm # systemctl mask --now openqa-reload-worker-auto-restart@{4..8} ; systemctl reset-failed
QA-Power8-5-kvm # systemctl mask --now openqa-reload-worker-auto-restart@{4..8} ; systemctl reset-failed
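Since the worker unit and its reload unit have to be masked together, both unit names can be generated per instance in one go. A minimal sketch, assuming the unit names used in the commands above; `worker_units` is a hypothetical helper and it only prints names, it does not call systemctl itself.

```shell
#!/bin/bash
# Print the worker unit and the matching reload unit for every
# instance number in the range FIRST..LAST (hypothetical helper).
worker_units() {
    local first="$1" last="$2" i
    for i in $(seq "$first" "$last"); do
        printf 'openqa-worker-auto-restart@%s.service\n' "$i"
        printf 'openqa-reload-worker-auto-restart@%s.service\n' "$i"
    done
}

# Example (not executed here), e.g. on powerqaworker-qam-1:
#   systemctl mask --now $(worker_units 3 6) && systemctl reset-failed
```

Masking both units in one call avoids the intermediate state where the reload units fail because their worker units are already masked.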
Updated by livdywan about 3 years ago
okurz wrote:
- WAITING Ask users of other machines in the same rack if they have network problems, e.g. migration-smt2.qa.suse.de , ask migration team -> nsinger asked, @waitfor response
Did we find out if migration-smt2.qa.suse.de is affected?
- Ask Eng Infra to give more members or the complete team of QE Tools ssh access to the switch, at least read-only access for monitoring. If Eng Infra does not know how to do that maybe nsinger can do it himself directly
- Disable individual ports on the switch to check if that improves the situation for power workers -> likely will not affect as we assume the problem to come from outside the switch over the uplink
- Conduct network performance benchmark on affected power hosts in a stripped-down environment with no other significant traffic. Also we can not access the host powerqaworker-qam-1 using iperf or any other port from the other hosts.
Are we still waiting to get access to the switch?
Updated by okurz about 3 years ago
cdywan wrote:
Are we still waiting to get access to the switch?
Well, I am still waiting for access to the switch, nicksinger has access
Updated by nicksinger about 3 years ago
okurz wrote:
cdywan wrote:
Are we still waiting to get access to the switch?
Well, I am still waiting for access to the switch, nicksinger has access
Could you please try if ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@qanet13nue.qa.suse.de
lets you in?
Updated by okurz about 3 years ago
nicksinger wrote:
Could you please try if
ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@qanet13nue.qa.suse.de
lets you in?
Well, ssh "lets me in", then I am asked for "User Name:", so I guess the answer is "yes" up to this point
Updated by szarate about 3 years ago
- Related to action #104106: [qe-core] test fails in await_install - Network peformace for ppc installations is decreasing size:S added
Updated by nicksinger about 3 years ago
okurz wrote:
nicksinger wrote:
Could you please try if
ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@qanet13nue.qa.suse.de
lets you in?
Well, ssh "lets me in", then I am asked for "User Name:", so I guess the answer is "yes" up to this point
OK, so apparently it didn't work as I expected. Unfortunately the iOS version is quite old and I can only find guides for more modern versions. I will write you the password in slack so you can at least manually log in.
Updated by nicksinger about 3 years ago
I talked to gschlotter regarding https://sd.suse.com/servicedesk/customer/portal/1/SD-67703 - copying the current plan (for everybody who can not access this ticket):
I had some brainstorming with Nick.
On Monday I will be in the server room and will connect one of these servers with a new cable to a different switch.
Nick will test if this solves the situation, if yes, he will be in the office with Matthias on Tuesday and recable these servers.
Updated by livdywan about 3 years ago
What happened since the last episode¶
- Nick took over from Marius
- Oli wrote an extensive report of ideas and attempts to investigate
- Nothing seemingly happened for two weeks
- Stakeholders are seeing problems again
- We still don't know if maybe it's just a kink in the ethernet cable
Ideas for improvement¶
- We could have implemented work-arounds sooner
- Consider getting access to another machine as a temporary replacement
- Was the infra ticket updated / visible?
- Comments should have been added to clarify changes
- Due date set to 2021-12-29
- We should have been keeping up with updates?
Updated by nicksinger almost 3 years ago
Gerhard replugged qa-power8-4 into qanet10 port 8. I ran an iperf but saw no improvement:
QA-Power8-4-kvm:~ # iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[ 5] local 2620:113:80c0:80a0:10:162:29:60f port 36854 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 60.6 MBytes 509 Mbits/sec 7 252 KBytes
[ 5] 1.00-2.00 sec 46.8 MBytes 392 Mbits/sec 3 212 KBytes
[ 5] 2.00-3.00 sec 45.5 MBytes 381 Mbits/sec 2 177 KBytes
[ 5] 3.00-4.00 sec 35.7 MBytes 300 Mbits/sec 2 187 KBytes
[ 5] 4.00-5.00 sec 51.8 MBytes 435 Mbits/sec 4 286 KBytes
[ 5] 5.00-6.00 sec 50.5 MBytes 424 Mbits/sec 3 212 KBytes
[ 5] 6.00-7.00 sec 60.4 MBytes 506 Mbits/sec 1 308 KBytes
[ 5] 7.00-8.00 sec 44.4 MBytes 372 Mbits/sec 5 180 KBytes
[ 5] 8.00-9.00 sec 52.8 MBytes 443 Mbits/sec 1 271 KBytes
[ 5] 9.00-10.00 sec 44.4 MBytes 372 Mbits/sec 0 351 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 493 MBytes 413 Mbits/sec 28 sender
[ 5] 0.00-10.00 sec 490 MBytes 411 Mbits/sec receiver
iperf Done.
QA-Power8-4-kvm:~ # iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 2620:113:80c0:80a0:10:162:29:60f port 36880 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 344 KBytes 2.82 Mbits/sec
[ 5] 1.00-2.00 sec 413 KBytes 3.38 Mbits/sec
[ 5] 2.00-3.00 sec 520 KBytes 4.26 Mbits/sec
[ 5] 3.00-4.00 sec 370 KBytes 3.03 Mbits/sec
[ 5] 4.00-5.00 sec 342 KBytes 2.80 Mbits/sec
[ 5] 5.00-6.00 sec 301 KBytes 2.47 Mbits/sec
[ 5] 6.00-7.00 sec 322 KBytes 2.64 Mbits/sec
[ 5] 7.00-8.00 sec 248 KBytes 2.03 Mbits/sec
[ 5] 8.00-9.00 sec 522 KBytes 4.27 Mbits/sec
[ 5] 9.00-10.00 sec 457 KBytes 3.75 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 3.84 MBytes 3.22 Mbits/sec 816 sender
[ 5] 0.00-10.00 sec 3.75 MBytes 3.14 Mbits/sec receiver
iperf Done.
QA-Power8-4-kvm:~ # iperf3 -4 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 10.162.6.201 port 60458 connected to 10.160.0.207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 406 KBytes 3.32 Mbits/sec
[ 5] 1.00-2.00 sec 276 KBytes 2.26 Mbits/sec
[ 5] 2.00-3.00 sec 421 KBytes 3.45 Mbits/sec
[ 5] 3.00-4.00 sec 568 KBytes 4.66 Mbits/sec
[ 5] 4.00-5.00 sec 462 KBytes 3.79 Mbits/sec
[ 5] 5.00-6.00 sec 352 KBytes 2.88 Mbits/sec
[ 5] 6.00-7.00 sec 588 KBytes 4.82 Mbits/sec
[ 5] 7.00-8.00 sec 373 KBytes 3.06 Mbits/sec
[ 5] 8.00-9.00 sec 454 KBytes 3.72 Mbits/sec
[ 5] 9.00-10.00 sec 423 KBytes 3.46 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 4.33 MBytes 3.63 Mbits/sec 880 sender
[ 5] 0.00-10.00 sec 4.22 MBytes 3.54 Mbits/sec receiver
iperf Done.
QA-Power8-4-kvm:~ # iperf3 -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[ 5] local 10.162.6.201 port 60496 connected to 10.160.0.207 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 47.9 MBytes 402 Mbits/sec 28 174 KBytes
[ 5] 1.00-2.00 sec 60.8 MBytes 510 Mbits/sec 5 198 KBytes
[ 5] 2.00-3.00 sec 66.9 MBytes 561 Mbits/sec 3 167 KBytes
[ 5] 3.00-4.00 sec 56.8 MBytes 476 Mbits/sec 4 130 KBytes
[ 5] 4.00-5.00 sec 45.0 MBytes 378 Mbits/sec 2 161 KBytes
[ 5] 5.00-6.00 sec 42.8 MBytes 359 Mbits/sec 2 187 KBytes
[ 5] 6.00-7.00 sec 76.0 MBytes 638 Mbits/sec 2 182 KBytes
[ 5] 7.00-8.00 sec 65.0 MBytes 545 Mbits/sec 4 150 KBytes
[ 5] 8.00-9.00 sec 39.7 MBytes 333 Mbits/sec 50 315 KBytes
[ 5] 9.00-10.00 sec 40.5 MBytes 339 Mbits/sec 7 117 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 541 MBytes 454 Mbits/sec 107 sender
[ 5] 0.00-10.00 sec 539 MBytes 452 Mbits/sec receiver
iperf Done.
Updated by nicksinger almost 3 years ago
Gerhard also replugged another port of that machine. Apparently this brought some improvement, but it is still far too little:
nsinger@QA-Power8-4-kvm:~> iperf3 -6 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[ 5] local 2620:113:80c0:80a0:10:162:29:60f port 48730 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 91.4 MBytes 767 Mbits/sec 127 290 KBytes
[ 5] 1.00-2.00 sec 87.6 MBytes 735 Mbits/sec 5 343 KBytes
[ 5] 2.00-3.00 sec 85.9 MBytes 720 Mbits/sec 30 319 KBytes
[ 5] 3.00-4.00 sec 95.6 MBytes 802 Mbits/sec 5 278 KBytes
[ 5] 4.00-5.00 sec 91.3 MBytes 765 Mbits/sec 8 424 KBytes
[ 5] 5.00-6.00 sec 81.9 MBytes 687 Mbits/sec 32 282 KBytes
[ 5] 6.00-7.00 sec 71.5 MBytes 599 Mbits/sec 34 351 KBytes
[ 5] 7.00-8.00 sec 87.2 MBytes 732 Mbits/sec 0 449 KBytes
[ 5] 8.00-9.00 sec 88.5 MBytes 742 Mbits/sec 5 300 KBytes
[ 5] 9.00-10.00 sec 57.4 MBytes 482 Mbits/sec 14 332 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 838 MBytes 703 Mbits/sec 260 sender
[ 5] 0.00-10.00 sec 835 MBytes 700 Mbits/sec receiver
iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[ 5] local 10.162.6.201 port 44070 connected to 10.160.0.207 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 84.2 MBytes 707 Mbits/sec 71 413 KBytes
[ 5] 1.00-2.00 sec 96.9 MBytes 813 Mbits/sec 43 325 KBytes
[ 5] 2.00-3.00 sec 92.8 MBytes 778 Mbits/sec 0 438 KBytes
[ 5] 3.00-4.00 sec 72.2 MBytes 606 Mbits/sec 4 211 KBytes
[ 5] 4.00-5.00 sec 60.0 MBytes 504 Mbits/sec 0 344 KBytes
[ 5] 5.00-6.00 sec 87.3 MBytes 732 Mbits/sec 92 204 KBytes
[ 5] 6.00-7.00 sec 58.7 MBytes 492 Mbits/sec 25 259 KBytes
[ 5] 7.00-8.00 sec 77.2 MBytes 648 Mbits/sec 52 287 KBytes
[ 5] 8.00-9.00 sec 71.3 MBytes 598 Mbits/sec 0 387 KBytes
[ 5] 9.00-10.00 sec 76.3 MBytes 640 Mbits/sec 0 482 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 777 MBytes 652 Mbits/sec 287 sender
[ 5] 0.00-10.00 sec 775 MBytes 650 Mbits/sec receiver
iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -R -6 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 2620:113:80c0:80a0:10:162:29:60f port 48788 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.25 MBytes 10.5 Mbits/sec
[ 5] 1.00-2.00 sec 761 KBytes 6.24 Mbits/sec
[ 5] 2.00-3.00 sec 968 KBytes 7.93 Mbits/sec
[ 5] 3.00-4.00 sec 1.45 MBytes 12.2 Mbits/sec
[ 5] 4.00-5.00 sec 877 KBytes 7.19 Mbits/sec
[ 5] 5.00-6.00 sec 170 KBytes 1.39 Mbits/sec
[ 5] 6.00-7.00 sec 828 KBytes 6.79 Mbits/sec
[ 5] 7.00-8.00 sec 841 KBytes 6.89 Mbits/sec
[ 5] 8.00-9.00 sec 1.65 MBytes 13.8 Mbits/sec
[ 5] 9.00-10.00 sec 965 KBytes 7.91 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 9.75 MBytes 8.18 Mbits/sec 1177 sender
[ 5] 0.00-10.00 sec 9.63 MBytes 8.08 Mbits/sec receiver
iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -R -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 10.162.6.201 port 44118 connected to 10.160.0.207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.39 MBytes 11.6 Mbits/sec
[ 5] 1.00-2.00 sec 1.42 MBytes 11.9 Mbits/sec
[ 5] 2.00-3.00 sec 1.04 MBytes 8.71 Mbits/sec
[ 5] 3.00-4.00 sec 1.29 MBytes 10.8 Mbits/sec
[ 5] 4.00-5.00 sec 1.27 MBytes 10.7 Mbits/sec
[ 5] 5.00-6.00 sec 1.91 MBytes 16.1 Mbits/sec
[ 5] 6.00-7.00 sec 1.16 MBytes 9.72 Mbits/sec
[ 5] 7.00-8.00 sec 578 KBytes 4.74 Mbits/sec
[ 5] 8.00-9.00 sec 1.41 MBytes 11.8 Mbits/sec
[ 5] 9.00-10.00 sec 1.35 MBytes 11.3 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 12.9 MBytes 10.8 Mbits/sec 799 sender
[ 5] 0.00-10.00 sec 12.8 MBytes 10.7 Mbits/sec receiver
iperf Done.
I will be in the office today testing if a direct connection with my notebook brings better speeds. Hopefully I can manage to pull this experiment off to exclude any problems with any hardware (e.g. switch, router) in between.
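To make repeated runs easier to compare, the four variants used above (IPv4 and IPv6, forward and reverse with -R) can be run in a loop and reduced to the receiver-side summary line. A minimal sketch: `iperf_receiver_rate` and `iperf_matrix` are hypothetical helper names, and the parsing assumes the plain-text iperf3 summary format pasted in this ticket.

```shell
#!/bin/bash
# Print the receiver-side bitrate ("value unit") from iperf3 text output on stdin.
iperf_receiver_rate() {
    awk '$NF == "receiver" { print $(NF-2), $(NF-1) }'
}

# Run IPv4/IPv6 in forward and reverse direction against SERVER and
# print one summary line per combination. Defined only, not called here.
iperf_matrix() {
    local server="$1" fam dir
    for fam in -4 -6; do
        for dir in "" -R; do
            printf '%s %-2s ' "$fam" "$dir"
            # $dir is intentionally unquoted so the empty value disappears
            iperf3 $fam $dir -c "$server" | iperf_receiver_rate
        done
    done
}

# Example (not executed here):
#   iperf_matrix openqa.suse.de
```

Using the receiver line rather than the sender line reports what actually arrived, which is the figure that matters when packets are lost on the way.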
Updated by okurz almost 3 years ago
Please keep the observation from #102882#note-24 in mind regarding the high increase of traffic we saw. I don't think at this point it helps to simply plug the machines elsewhere without making sure that this traffic goes away, e.g. unplug other stuff, the uplink, etc.
Updated by okurz almost 3 years ago
- Due date changed from 2021-12-29 to 2022-01-28
Updated by nicksinger almost 3 years ago
So here are my results of several switch ports I tested in srv2 (and the qalab) with my notebook:
back2back with power8-4:
nsinger@QA-Power8-4-kvm:~> iperf3 -c 192.168.0.1
Connecting to host 192.168.0.1, port 5201
[ 5] local 192.168.0.106 port 48382 connected to 192.168.0.1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 114 MBytes 958 Mbits/sec 0 379 KBytes
[ 5] 1.00-2.00 sec 112 MBytes 941 Mbits/sec 0 379 KBytes
[ 5] 2.00-3.00 sec 112 MBytes 937 Mbits/sec 0 379 KBytes
[ 5] 3.00-4.00 sec 113 MBytes 946 Mbits/sec 0 379 KBytes
[ 5] 4.00-5.00 sec 112 MBytes 939 Mbits/sec 0 399 KBytes
[ 5] 5.00-6.00 sec 112 MBytes 942 Mbits/sec 0 399 KBytes
[ 5] 6.00-7.00 sec 112 MBytes 943 Mbits/sec 0 399 KBytes
[ 5] 7.00-8.00 sec 112 MBytes 943 Mbits/sec 0 399 KBytes
[ 5] 8.00-9.00 sec 112 MBytes 942 Mbits/sec 0 399 KBytes
[ 5] 9.00-10.00 sec 112 MBytes 937 Mbits/sec 0 399 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.10 GBytes 943 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver
iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -R -c 192.168.0.1
Connecting to host 192.168.0.1, port 5201
Reverse mode, remote host 192.168.0.1 is sending
[ 5] local 192.168.0.106 port 48386 connected to 192.168.0.1 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 111 MBytes 934 Mbits/sec
[ 5] 1.00-2.00 sec 111 MBytes 934 Mbits/sec
[ 5] 2.00-3.00 sec 111 MBytes 934 Mbits/sec
[ 5] 3.00-4.00 sec 111 MBytes 934 Mbits/sec
[ 5] 4.00-5.00 sec 111 MBytes 934 Mbits/sec
[ 5] 5.00-6.00 sec 111 MBytes 934 Mbits/sec
[ 5] 6.00-7.00 sec 111 MBytes 934 Mbits/sec
[ 5] 7.00-8.00 sec 111 MBytes 934 Mbits/sec
[ 5] 8.00-9.00 sec 111 MBytes 934 Mbits/sec
[ 5] 9.00-10.00 sec 111 MBytes 934 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.09 GBytes 936 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.09 GBytes 934 Mbits/sec receiver
iperf from notebook connected to qanet10nue (srv2, located next to the rack of power8-4):
selenium ~ » iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[ 5] local 2620:113:80c0:80a0:10:162:2d:4d65 port 38898 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 110 MBytes 925 Mbits/sec 0 1017 KBytes
[ 5] 1.00-2.00 sec 110 MBytes 923 Mbits/sec 0 1.27 MBytes
[ 5] 2.00-3.00 sec 109 MBytes 912 Mbits/sec 0 1.55 MBytes
[ 5] 3.00-4.00 sec 110 MBytes 923 Mbits/sec 0 2.24 MBytes
[ 5] 4.00-5.00 sec 110 MBytes 923 Mbits/sec 0 2.36 MBytes
[ 5] 5.00-6.00 sec 110 MBytes 923 Mbits/sec 0 2.47 MBytes
[ 5] 6.00-7.00 sec 109 MBytes 912 Mbits/sec 0 2.61 MBytes
[ 5] 7.00-8.00 sec 110 MBytes 923 Mbits/sec 0 2.61 MBytes
[ 5] 8.00-9.00 sec 110 MBytes 923 Mbits/sec 0 2.74 MBytes
[ 5] 9.00-10.00 sec 110 MBytes 923 Mbits/sec 0 2.74 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.07 GBytes 921 Mbits/sec 0 sender
[ 5] 0.00-10.01 sec 1.07 GBytes 918 Mbits/sec receiver
iperf Done.
selenium ~ » iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 2620:113:80c0:80a0:10:162:2d:4d65 port 38902 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 745 KBytes 6.10 Mbits/sec
[ 5] 1.00-2.00 sec 998 KBytes 8.18 Mbits/sec
[ 5] 2.00-3.00 sec 1.10 MBytes 9.27 Mbits/sec
[ 5] 3.00-4.00 sec 1.11 MBytes 9.31 Mbits/sec
[ 5] 4.00-5.00 sec 1.10 MBytes 9.22 Mbits/sec
[ 5] 5.00-6.00 sec 1.09 MBytes 9.15 Mbits/sec
[ 5] 6.00-7.00 sec 1.10 MBytes 9.24 Mbits/sec
[ 5] 7.00-8.00 sec 1.10 MBytes 9.27 Mbits/sec
[ 5] 8.00-9.00 sec 1.10 MBytes 9.22 Mbits/sec
[ 5] 9.00-10.00 sec 679 KBytes 5.56 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 10.2 MBytes 8.53 Mbits/sec 981 sender
[ 5] 0.00-10.00 sec 10.1 MBytes 8.45 Mbits/sec receiver
iperf Done.
iperf from notebook connected to qanet13nue (srv2, the switch power8-4 was originally connected to):
selenium ~ » iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[ 5] local 2620:113:80c0:80a0:10:162:2c:985 port 45236 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 103 MBytes 863 Mbits/sec 0 1.41 MBytes
[ 5] 1.00-2.00 sec 110 MBytes 923 Mbits/sec 0 1.80 MBytes
[ 5] 2.00-3.00 sec 108 MBytes 902 Mbits/sec 28 1.40 MBytes
[ 5] 3.00-4.00 sec 110 MBytes 923 Mbits/sec 0 1.52 MBytes
[ 5] 4.00-5.00 sec 98.8 MBytes 828 Mbits/sec 490 82.3 KBytes
[ 5] 5.00-6.00 sec 85.0 MBytes 713 Mbits/sec 0 356 KBytes
[ 5] 6.00-7.00 sec 93.8 MBytes 786 Mbits/sec 208 222 KBytes
[ 5] 7.00-8.00 sec 93.8 MBytes 786 Mbits/sec 0 416 KBytes
[ 5] 8.00-9.00 sec 95.0 MBytes 797 Mbits/sec 0 469 KBytes
[ 5] 9.00-10.00 sec 101 MBytes 849 Mbits/sec 0 494 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 998 MBytes 837 Mbits/sec 726 sender
[ 5] 0.00-10.01 sec 995 MBytes 834 Mbits/sec receiver
iperf Done.
selenium ~ » iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 2620:113:80c0:80a0:10:162:2c:985 port 45240 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.15 MBytes 9.64 Mbits/sec
[ 5] 1.00-2.00 sec 2.12 MBytes 17.8 Mbits/sec
[ 5] 2.00-3.00 sec 1.18 MBytes 9.93 Mbits/sec
[ 5] 3.00-4.00 sec 1.32 MBytes 11.0 Mbits/sec
[ 5] 4.00-5.00 sec 1.17 MBytes 9.82 Mbits/sec
[ 5] 5.00-6.00 sec 890 KBytes 7.29 Mbits/sec
[ 5] 6.00-7.00 sec 636 KBytes 5.21 Mbits/sec
[ 5] 7.00-8.00 sec 1.43 MBytes 12.0 Mbits/sec
[ 5] 8.00-9.00 sec 945 KBytes 7.75 Mbits/sec
[ 5] 9.00-10.00 sec 1.04 MBytes 8.69 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 12.1 MBytes 10.2 Mbits/sec 1094 sender
[ 5] 0.00-10.00 sec 11.8 MBytes 9.91 Mbits/sec receiver
iperf Done.
iperf from notebook connected to qanet15nue (srv2, another switch close to power8-4):
selenium ~ » iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 2620:113:80c0:80a0:10:162:2e:3a8a port 53490 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 6.35 MBytes 53.3 Mbits/sec
[ 5] 1.00-2.00 sec 6.58 MBytes 55.2 Mbits/sec
[ 5] 2.00-3.00 sec 7.65 MBytes 64.2 Mbits/sec
[ 5] 3.00-4.00 sec 5.88 MBytes 49.3 Mbits/sec
[ 5] 4.00-5.00 sec 6.19 MBytes 51.9 Mbits/sec
[ 5] 5.00-6.00 sec 7.65 MBytes 64.2 Mbits/sec
[ 5] 6.00-7.00 sec 5.79 MBytes 48.5 Mbits/sec
[ 5] 7.00-8.00 sec 8.21 MBytes 68.9 Mbits/sec
[ 5] 8.00-9.00 sec 7.08 MBytes 59.4 Mbits/sec
[ 5] 9.00-10.00 sec 6.19 MBytes 51.9 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.01 sec 67.8 MBytes 56.8 Mbits/sec 7751 sender
[ 5] 0.00-10.00 sec 67.6 MBytes 56.7 Mbits/sec receiver
iperf Done.
selenium ~ » iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[ 5] local 2620:113:80c0:80a0:10:162:2e:3a8a port 53494 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 93.6 MBytes 785 Mbits/sec 6 1.15 MBytes
[ 5] 1.00-2.00 sec 92.5 MBytes 776 Mbits/sec 0 1.26 MBytes
[ 5] 2.00-3.00 sec 105 MBytes 881 Mbits/sec 0 1.36 MBytes
[ 5] 3.00-4.00 sec 109 MBytes 912 Mbits/sec 0 1.41 MBytes
[ 5] 4.00-5.00 sec 106 MBytes 891 Mbits/sec 0 1.49 MBytes
[ 5] 5.00-6.00 sec 108 MBytes 902 Mbits/sec 0 1.53 MBytes
[ 5] 6.00-7.00 sec 102 MBytes 860 Mbits/sec 0 1.55 MBytes
[ 5] 7.00-8.00 sec 86.2 MBytes 723 Mbits/sec 89 1.11 MBytes
[ 5] 8.00-9.00 sec 104 MBytes 870 Mbits/sec 0 1.19 MBytes
[ 5] 9.00-10.00 sec 101 MBytes 849 Mbits/sec 0 1.24 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1007 MBytes 845 Mbits/sec 95 sender
[ 5] 0.00-10.02 sec 1004 MBytes 841 Mbits/sec receiver
iperf Done.
selenium ~ » iperf3 -R -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 10.162.29.76 port 51332 connected to 10.160.0.207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 6.51 MBytes 54.6 Mbits/sec
[ 5] 1.00-2.00 sec 7.45 MBytes 62.5 Mbits/sec
[ 5] 2.00-3.00 sec 6.35 MBytes 53.2 Mbits/sec
[ 5] 3.00-4.00 sec 6.49 MBytes 54.4 Mbits/sec
[ 5] 4.00-5.00 sec 5.94 MBytes 49.8 Mbits/sec
[ 5] 5.00-6.00 sec 7.34 MBytes 61.6 Mbits/sec
[ 5] 6.00-7.00 sec 5.11 MBytes 42.9 Mbits/sec
[ 5] 7.00-8.00 sec 6.18 MBytes 51.8 Mbits/sec
[ 5] 8.00-9.00 sec 6.36 MBytes 53.3 Mbits/sec
[ 5] 9.00-10.00 sec 6.12 MBytes 51.3 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 64.1 MBytes 53.8 Mbits/sec 7306 sender
[ 5] 0.00-10.00 sec 63.8 MBytes 53.6 Mbits/sec receiver
iperf Done.
iperf from notebook connected to qanet03nue (switch in the big qalab):
selenium ~ » iperf3 -R -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 10.162.29.76 port 51336 connected to 10.160.0.207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 91.2 MBytes 765 Mbits/sec
[ 5] 1.00-2.00 sec 98.7 MBytes 828 Mbits/sec
[ 5] 2.00-3.00 sec 95.8 MBytes 804 Mbits/sec
[ 5] 3.00-4.00 sec 93.5 MBytes 785 Mbits/sec
[ 5] 4.00-5.00 sec 98.4 MBytes 826 Mbits/sec
[ 5] 5.00-6.00 sec 97.7 MBytes 820 Mbits/sec
[ 5] 6.00-7.00 sec 105 MBytes 879 Mbits/sec
[ 5] 7.00-8.00 sec 97.1 MBytes 815 Mbits/sec
[ 5] 8.00-9.00 sec 106 MBytes 891 Mbits/sec
[ 5] 9.00-10.00 sec 101 MBytes 850 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 988 MBytes 829 Mbits/sec 1129 sender
[ 5] 0.00-10.00 sec 985 MBytes 826 Mbits/sec receiver
iperf Done.
selenium ~ » iperf3 -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[ 5] local 10.162.29.76 port 51340 connected to 10.160.0.207 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 80.5 MBytes 675 Mbits/sec 7 255 KBytes
[ 5] 1.00-2.00 sec 81.2 MBytes 682 Mbits/sec 0 421 KBytes
[ 5] 2.00-3.00 sec 85.0 MBytes 713 Mbits/sec 0 503 KBytes
[ 5] 3.00-4.00 sec 75.0 MBytes 629 Mbits/sec 5 296 KBytes
[ 5] 4.00-5.00 sec 71.2 MBytes 598 Mbits/sec 0 426 KBytes
[ 5] 5.00-6.00 sec 67.5 MBytes 566 Mbits/sec 6 191 KBytes
[ 5] 6.00-7.00 sec 50.0 MBytes 419 Mbits/sec 0 331 KBytes
[ 5] 7.00-8.00 sec 66.2 MBytes 556 Mbits/sec 0 441 KBytes
[ 5] 8.00-9.00 sec 65.0 MBytes 545 Mbits/sec 0 519 KBytes
[ 5] 9.00-10.00 sec 62.5 MBytes 524 Mbits/sec 0 581 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 704 MBytes 591 Mbits/sec 18 sender
[ 5] 0.00-10.00 sec 702 MBytes 589 Mbits/sec receiver
iperf Done.
With all these tests I can conclude:
- The machine itself is not misconfigured and is perfectly able to deliver 1Gbit/s up and down
- Several switches in srv2 are affected by the performance loss
- The QA VLAN itself does not cause the performance loss (as a switch in the qalab - so a different location - is running fine)
I'd suggest that we try to map out how these switches are interconnected. I could imagine that several switches in srv2 are "daisychained" and maybe one switch in that chain is behaving wrong. I will try to come up with a graph showing how the switches are connected. If we have a better overview we can start debugging by e.g. comparing configurations or replugging the uplink of several switches.
Updated by okurz almost 3 years ago
you haven't mentioned the increase of "unexpected traffic". Do you see any relation between that and the measurements you conducted?
Updated by okurz almost 3 years ago
- Status changed from In Progress to Feedback
- Priority changed from High to Normal
gschlotter will check the daisy-chained core switches. The current hypothesis is that at least one is misbehaving and causing the problems. nsinger told us that gschlotter read the ticket, so we assume he is aware of the "unexpected traffic". We expect an update from gschlotter within the next few days.
Updated by okurz almost 3 years ago
- Priority changed from Normal to High
Given that we have recurring user reports like https://suse.slack.com/archives/C02CU8X53RC/p1643270602218800 we should still treat this as high prio
Updated by nicksinger almost 3 years ago
- Assignee deleted (nicksinger)
I'd highly appreciate a helping hand here to perform new benchmarks after the machine now got replugged into the core switch directly. I'm unassigning for now but feel free to ask if you need something from me
Updated by mkittler almost 3 years ago
- Assignee set to mkittler
Ok, I can run the iperf tests again.
Updated by mkittler almost 3 years ago
- Assignee deleted (mkittler)
Unfortunately it doesn't look better. I've tested on qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1. The results on all hosts were looking like this:
martchus@QA-Power8-4-kvm:~> iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 2620:113:80c0:80a0:10:162:29:60f port 40416 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.04 MBytes 8.73 Mbits/sec
[ 5] 1.00-2.00 sec 1.32 MBytes 11.1 Mbits/sec
[ 5] 2.00-3.00 sec 1.06 MBytes 8.91 Mbits/sec
[ 5] 3.00-4.00 sec 1.42 MBytes 11.9 Mbits/sec
[ 5] 4.00-5.00 sec 679 KBytes 5.56 Mbits/sec
[ 5] 5.00-6.00 sec 1.28 MBytes 10.8 Mbits/sec
[ 5] 6.00-7.00 sec 1.39 MBytes 11.7 Mbits/sec
[ 5] 7.00-8.00 sec 937 KBytes 7.68 Mbits/sec
[ 5] 8.00-9.00 sec 1.05 MBytes 8.82 Mbits/sec
[ 5] 9.00-10.00 sec 950 KBytes 7.78 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 11.2 MBytes 9.38 Mbits/sec 784 sender
[ 5] 0.00-10.00 sec 11.1 MBytes 9.29 Mbits/sec receiver
iperf Done.
Updated by okurz almost 3 years ago
- Related to action #105804: Job age (scheduled) (median) alert size:S added
Updated by okurz almost 3 years ago
- Status changed from Feedback to Workable
- Priority changed from High to Urgent
Please don't leave this ticket without an assignee. It was raised again during the weekly QE sync.
Updated by livdywan almost 3 years ago
- What machine was re-plugged?
  - We don't know. We've not seen any improvement. Looks to be qapower8-5
- At least one switch might run in hub mode? (or bridge mode?)
- Likely both rack and core switch need to run in hub mode
- Are any debug settings enabled?
- Check the management console
- Compare traffic of machines directly connected to the router
- Re-conduct previous experiments on all affected power machines, at least one x86-64 machine on another rack and malbec incl. tcpdumps @mkittler
- Note, do add the commands used, for easy reproduction
- Ping https://sd.suse.com/servicedesk/customer/portal/1/SD-67703
Updated by livdywan almost 3 years ago
- Due date changed from 2022-01-28 to 2022-02-04
Updated by mkittler almost 3 years ago
TLDR: It still looks as bad as at the beginning on all three power hosts, and the same tests on malbec or x86_64 machines don't show those symptoms.
Here (again) the exact commands used for performance testing:
Start server on openqa.suse.de (e.g. in screen session):
martchus@openqa:~> iperf3 -s
Check on affected host (not possible via salt as they are currently not in salt):
ssh qa-power8-5-kvm.qa.suse.de -C 'iperf3 -R -c openqa.suse.de' # -R is important as it is only slow in one direction
Check other hosts (glob must only match one at a time as server can only handle one request at a time):
martchus@openqa:~> sudo salt 'malbec*' cmd.run "sudo -u '$USER' iperf3 -R -c openqa.suse.de"
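For repeated measurements it may be easier to let iperf3 emit JSON and extract only the summary. The following is a minimal sketch under the assumption that the client-side iperf3 supports `--json` and that the output has the usual `end.sum_sent` / `end.sum_received` sections of a TCP test. The function names are made up for illustration and are not part of any openQA tooling.

```python
import json
import subprocess


def summarize_iperf3_json(text):
    """Return (sent, received) throughput in Mbit/s from iperf3 --json output."""
    end = json.loads(text)["end"]
    sent = end["sum_sent"]["bits_per_second"] / 1e6
    received = end["sum_received"]["bits_per_second"] / 1e6
    return round(sent, 2), round(received, 2)


def run_iperf3(server, reverse=True, timeout=60):
    """Run one iperf3 client test against `server` and return its summary.

    Assumes an `iperf3 -s` instance is already listening on `server`.
    """
    cmd = ["iperf3", "--json", "-c", server]
    if reverse:
        # -R: the server sends and the client receives (the slow direction here)
        cmd.append("-R")
    result = subprocess.run(cmd, capture_output=True, text=True,
                            check=True, timeout=timeout)
    return summarize_iperf3_json(result.stdout)
```

Calling `run_iperf3("openqa.suse.de")` on one host at a time would give a pair of numbers that can be logged per host, which respects the note above that the server handles only one client at a time.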
Check tcpdump for unrelated output (like in #102882#note-25), e.g.:
martchus@malbec:~> sudo zypper in tcpdump
martchus@malbec:~> ip addr # to find relevant eth dev
martchus@malbec:~> sudo tcpdump -i eth4 # look for suspicious traffic for "wrong" hosts
martchus@malbec:~> sudo tcpdump -i eth4 | grep -v malbec.arch.suse.de # filter traffic not directly related to host (might not cover IPv6 address)
My findings so far:
- iperf3 still shows slow performance on these three power hosts (but not on malbec or x86_64 hosts).
- tcpdump still shows unrelated traffic on these three power hosts (but not on malbec or x86_64 hosts).
  - I still see traffic for e.g. power5 on power1, e.g.
    13:22:27.428007 IP 10.163.28.162.52464 > QA-Power8-5-kvm.qa.suse.de.ssh: Flags [.], ack 229662332, win 14792, options [nop,nop,TS val 2833931280 ecr 2859724898], length 0
  - The same can be observed on power4, e.g.
    13:26:46.001535 IP 10.160.1.93 > QA-Power8-5-kvm.qa.suse.de: GREv0, length 186: IP 10.0.2.15.hpoms-dps-lstn > 239.37.84.23.netsupport: UDP, length 13
  - The same can be observed on power5, e.g.
    13:38:52.734608 IP openqa-monitor.qa.suse.de.d-s-n > QA-Power8-4-kvm.qa.suse.de.34204: Flags [.], ack 3618777680, win 2906, options [nop,nop,TS val 3944523984 ecr 1904290933], length 0
  - I didn't observe similar behavior on malbec or openqaworker10, where the host tcpdump is executed on always appears on at least one side (except for ARP, ICMP and multicast traffic).
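The manual check described above (does the local host appear on at least one side of each packet?) can be approximated with a small filter over tcpdump text output. This is only a sketch: it assumes the default one-line `IP src > dst:` format shown in the examples, the helper name `is_foreign` is made up, and it does not implement the exceptions for ICMP and multicast traffic mentioned above.

```python
import re

# Matches e.g. "13:22:27.428007 IP 10.163.28.162.52464 > host.example.ssh: Flags ..."
_LINE = re.compile(r"^\S+\s+IP6?\s+(\S+)\s+>\s+(\S+?):\s")


def _matches(endpoint, name):
    """True if endpoint is `name`, optionally followed by '.<port or service>'."""
    endpoint, name = endpoint.lower(), name.lower()
    return endpoint == name or endpoint.startswith(name + ".")


def is_foreign(line, local_names):
    """True if neither endpoint of a tcpdump line belongs to the local host.

    `local_names` lists every name or address the local host may be shown as.
    Lines that are not IP/IP6 packets (e.g. ARP) are not classified and
    return False.
    """
    m = _LINE.match(line)
    if not m:
        return False
    src, dst = m.group(1), m.group(2)
    return not any(_matches(ep, name) for ep in (src, dst) for name in local_names)
```

Piping `tcpdump -l -i eth4` into a loop that prints only lines for which `is_foreign` is true would give roughly the same view as the `grep -v` command, with the advantage that several names and addresses of the host (including IPv6) can be listed.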
Updated by nicksinger almost 3 years ago
I ran wireshark on qa-power8-4 & qa-power8-5 with the following filter: ((!ipv6) && tcp) && !(ip.dst == $IP_OF_THE_HOST) and could still observe a lot of traffic which is destined for other hosts. However, I see mainly "qa traffic" on power8-5 while power8-4 sees a lot of other traffic (e.g. 10.0.2.1), so I wonder if in reality power8-4 is the one connected to the core switch?
Updated by okurz almost 3 years ago
@Sebastian Riedel @Marius Kittler @Nick Singer thank you for the quick but thorough and diligent investigation work in https://progress.opensuse.org/issues/102882#note-53 and https://progress.opensuse.org/issues/102882#note-54 , that's what I would say is truly professional work. I checked my text in https://sd.suse.com/servicedesk/customer/portal/1/SD-67703 and I think it's still valid and current as is. Eng-Infra or whoever has control over the network as a whole needs to follow up. I don't see how we would be able to do much more poking through the keyhole and with limited access to switches. @kraih I think the ticket could now be in "In Progress" or "Feedback" with active monitoring of any progress in https://sd.suse.com/servicedesk/customer/portal/1/SD-67703
Updated by mkittler almost 3 years ago
I cannot answer the question but considering your findings I'd ask myself the same question. It could be a mixup. Since I've always been checking on all three hosts anyways and haven't noticed any difference I suppose it doesn't matter much at this point.
Updated by kraih almost 3 years ago
I'm looking at the rack switch now, first up we have the current hardware connections again (some are only 100Mbit):
qanet13nue#show interfaces status
Flow Link Back Mdix
Port Type Duplex Speed Neg ctrl State Pressure Mode
-------- ------------ ------ ----- -------- ---- ----------- -------- -------
gi1 1G-Copper Full 1000 Enabled Off Up Disabled On
gi2 1G-Copper Full 1000 Enabled Off Up Disabled Off
gi3 1G-Copper Full 1000 Enabled Off Up Disabled On
gi4 1G-Copper -- -- -- -- Down -- --
gi5 1G-Copper Full 1000 Enabled Off Up Disabled Off
gi6 1G-Copper Full 1000 Enabled Off Up Disabled On
gi7 1G-Copper Full 100 Enabled Off Up Disabled On
gi8 1G-Copper Full 1000 Enabled Off Up Disabled Off
gi9 1G-Copper -- -- -- -- Down -- --
gi10 1G-Copper -- -- -- -- Down -- --
gi11 1G-Copper Full 100 Enabled Off Up Disabled On
gi12 1G-Copper Full 100 Enabled Off Up Disabled On
gi13 1G-Copper Full 100 Enabled Off Up Disabled Off
gi14 1G-Copper -- -- -- -- Down -- --
gi15 1G-Copper -- -- -- -- Down -- --
gi16 1G-Copper -- -- -- -- Down -- --
gi17 1G-Copper -- -- -- -- Down -- --
gi18 1G-Copper -- -- -- -- Down -- --
gi19 1G-Copper Full 1000 Enabled Off Up Disabled On
gi20 1G-Copper Full 1000 Enabled Off Up Disabled On
gi21 1G-Copper -- -- -- -- Down -- --
gi22 1G-Copper -- -- -- -- Down -- --
gi23 1G-Copper Full 100 Enabled Off Up Disabled Off
gi24 1G-Copper Full 100 Enabled Off Up Disabled On
gi25 1G-Copper -- -- -- -- Down -- --
gi26 1G-Copper -- -- -- -- Down -- --
gi27 1G-Combo-C Full 1000 Enabled Off Up Disabled On
gi28 1G-Combo-C Full 1000 Enabled Off Up Disabled On
Flow Link
Ch Type Duplex Speed Neg control State
-------- ------- ------ ----- -------- ------- -----------
Po1 1G Full 1000 Enabled Off Up
Po2 -- -- -- -- -- Not Present
Po3 -- -- -- -- -- Not Present
Po4 -- -- -- -- -- Not Present
Po5 -- -- -- -- -- Not Present
Po6 -- -- -- -- -- Not Present
Po7 -- -- -- -- -- Not Present
Po8 -- -- -- -- -- Not Present
The ARP table is pretty much empty:
qanet13nue#show arp
Total number of entries: 1
VLAN Interface IP address HW address status
--------------------- --------------- ------------------- ---------------
vlan 12 10.162.63.254 00:00:5e:00:01:04 dynamic
The MAC address table is the opposite (7 addresses known on local ports):
qanet13nue#show mac address-table
Flags: I - Internal usage VLAN
Aging time is 300 sec
Vlan Mac Address Port Type
------------ --------------------- ---------- ----------
1 c0:7b:bc:8f:f7:2a Po1 dynamic
1 c0:7b:bc:8f:f7:ea Po1 dynamic
1 cc:d5:39:52:50:9a 0 self
12 00:00:5e:00:01:04 Po1 dynamic
12 00:00:5e:00:02:04 Po1 dynamic
12 00:11:25:7d:2c:ce Po1 dynamic
12 00:16:3e:9e:66:a6 Po1 dynamic
12 00:25:90:1a:7c:7d Po1 dynamic
12 00:25:90:1a:7c:81 Po1 dynamic
12 00:25:90:1a:fc:24 Po1 dynamic
12 00:25:90:1a:fc:2c Po1 dynamic
12 00:25:90:9a:ca:38 Po1 dynamic
12 00:25:90:9f:9d:84 Po1 dynamic
12 00:25:90:9f:f2:85 Po1 dynamic
12 00:25:90:9f:f2:86 Po1 dynamic
12 00:25:90:9f:f2:a6 Po1 dynamic
12 00:25:90:9f:f3:1f Po1 dynamic
12 00:25:90:f2:06:14 Po1 dynamic
12 00:26:0b:f1:f0:8d Po1 dynamic
12 00:50:56:44:51:87 Po1 dynamic
12 00:60:16:0f:1c:7b gi13 dynamic
12 00:60:16:0f:1c:a3 gi23 dynamic
12 00:a0:98:6e:3a:1f Po1 dynamic
12 00:a0:98:6e:3a:21 Po1 dynamic
12 00:a0:98:6e:3d:11 Po1 dynamic
12 00:c0:b7:30:7e:33 Po1 dynamic
12 00:c0:b7:4c:97:f7 Po1 dynamic
12 00:c0:b7:4c:98:87 Po1 dynamic
12 00:c0:b7:4c:98:99 Po1 dynamic
12 00:c0:b7:51:cb:e7 Po1 dynamic
12 00:c0:b7:51:cc:63 Po1 dynamic
12 00:c0:b7:6b:d8:20 Po1 dynamic
12 00:c0:b7:6b:d8:80 Po1 dynamic
12 00:c0:b7:d2:d9:87 Po1 dynamic
12 00:c0:dd:13:3b:9f gi12 dynamic
12 00:de:fb:e3:d7:7c Po1 dynamic
12 00:de:fb:e3:da:fc Po1 dynamic
12 00:e0:81:64:f7:3f Po1 dynamic
12 00:e0:86:0a:b4:4b gi11 dynamic
12 04:da:d2:0e:50:49 Po1 dynamic
12 0c:fd:37:17:fe:92 Po1 dynamic
12 18:c0:4d:06:ce:59 Po1 dynamic
12 18:c0:4d:8c:82:90 Po1 dynamic
12 1c:1b:0d:ef:73:64 Po1 dynamic
12 20:bb:c0:c1:07:c7 Po1 dynamic
12 20:bb:c0:c1:0a:0b Po1 dynamic
12 20:bb:c0:c1:0a:62 Po1 dynamic
12 20:bb:c0:c1:0b:f8 Po1 dynamic
12 20:bb:c0:c1:20:79 Po1 dynamic
12 26:9e:c2:c4:2c:0b Po1 dynamic
12 2c:c8:1b:61:80:43 Po1 dynamic
12 36:aa:b7:fb:07:04 Po1 dynamic
12 3c:4a:92:75:67:66 Po1 dynamic
12 3c:ec:ef:5a:79:16 Po1 dynamic
12 40:f2:e9:73:5d:54 Po1 dynamic
12 40:f2:e9:73:5d:55 Po1 dynamic
12 40:f2:e9:a5:53:4c Po1 dynamic
12 52:54:00:00:89:4e Po1 dynamic
12 52:54:00:10:5e:0d Po1 dynamic
12 52:54:00:1e:1b:04 Po1 dynamic
12 52:54:00:7b:ad:b5 Po1 dynamic
12 52:54:00:96:30:74 Po1 dynamic
12 52:54:00:d7:ff:7d Po1 dynamic
12 5c:a4:8a:71:f4:88 Po1 dynamic
12 5c:f3:fc:00:2e:80 Po1 dynamic
12 5c:f3:fc:00:43:f4 Po1 dynamic
12 68:05:ca:92:c1:bb Po1 dynamic
12 68:b5:99:76:8c:74 Po1 dynamic
12 6c:ae:8b:6e:04:a8 gi5 dynamic
12 70:e2:84:14:07:21 gi8 dynamic
12 74:4d:28:e2:c9:86 Po1 dynamic
12 7c:25:86:96:a9:d8 Po1 dynamic
12 90:1b:0e:db:6e:ef Po1 dynamic
12 90:1b:0e:e8:d6:19 Po1 dynamic
12 98:be:94:02:9b:94 Po1 dynamic
12 98:be:94:07:3a:90 Po1 dynamic
12 98:be:94:4b:d3:96 Po1 dynamic
12 a0:42:3f:32:b4:71 gi7 dynamic
12 ac:1f:6b:03:22:4e Po1 dynamic
12 ac:1f:6b:03:22:f8 Po1 dynamic
12 ac:1f:6b:e6:c2:e9 Po1 dynamic
12 b4:e9:b0:67:b9:d2 Po1 dynamic
12 b4:e9:b0:67:bd:57 Po1 dynamic
12 b4:e9:b0:67:be:60 Po1 dynamic
12 b4:e9:b0:67:bf:34 Po1 dynamic
12 b4:e9:b0:6e:84:06 Po1 dynamic
12 c4:72:95:2b:50:ed Po1 dynamic
12 c4:7d:46:f2:72:35 Po1 dynamic
12 c4:7d:46:f2:78:36 Po1 dynamic
12 c8:00:84:a3:a1:33 Po1 dynamic
12 e8:6a:64:97:6b:a9 Po1 dynamic
12 ec:e1:a9:f8:8a:02 Po1 dynamic
12 ec:e1:a9:fc:c9:0c Po1 dynamic
14 00:26:0b:f1:f0:8d Po1 dynamic
14 00:a0:98:6e:3a:20 Po1 dynamic
14 00:a0:98:6e:3d:12 Po1 dynamic
710 00:00:5e:00:01:12 Po1 dynamic
711 00:00:5e:00:01:13 Po1 dynamic
IP routing table:
qanet13nue#show ip route
Maximum Parallel Paths: 1 (1 after reset)
IP Forwarding: disabled
Codes: > - best, C - connected, S - static
S 0.0.0.0/0 [1/1] via 10.162.63.254, 10585:32:02, vlan 12
C 10.162.0.0/18 is directly connected, vlan 12
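As a small worked check of the address arithmetic in this routing table, the sketch below uses Python's `ipaddress` module. The two host addresses are taken from the iperf output earlier in this ticket (the notebook in the QA VLAN and openqa.suse.de). Since IP forwarding is disabled, this table presumably only matters for the switch's own management traffic, which is an assumption on my side.

```python
import ipaddress

qa_vlan = ipaddress.ip_network("10.162.0.0/18")   # "directly connected, vlan 12"
gateway = ipaddress.ip_address("10.162.63.254")   # next hop of the default route


def is_on_link(addr):
    """True if addr lies inside the directly connected QA network."""
    return ipaddress.ip_address(addr) in qa_vlan


print(qa_vlan.netmask)              # 255.255.192.0
print(gateway in qa_vlan)           # True
print(is_on_link("10.162.29.76"))   # True  (notebook in the QA VLAN)
print(is_on_link("10.160.0.207"))   # False (openqa.suse.de, reached via the gateway)
```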
And the bridge settings (unicast and multicast tables look the same):
qanet13nue#show bridge unicast unknown
Port Unregistered
-------- --------------
gi1 Forward
gi2 Forward
gi3 Forward
gi4 Forward
gi5 Forward
gi6 Forward
gi7 Forward
gi8 Forward
gi9 Forward
gi10 Forward
gi11 Forward
gi12 Forward
gi13 Forward
gi14 Forward
gi15 Forward
gi16 Forward
gi17 Forward
gi18 Forward
gi19 Forward
gi20 Forward
gi21 Forward
gi22 Forward
gi23 Forward
gi24 Forward
gi25 Forward
gi26 Forward
gi27 Forward
gi28 Forward
Po1 Forward
Po2 Forward
Po3 Forward
Po4 Forward
Po5 Forward
Po6 Forward
Po7 Forward
Po8 Forward
The MAC address for Power8-5-kvm should be 98:be:94:03:e9:4b, but it does not appear anywhere.
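Looking up a MAC address in a long table by eye is error-prone, so here is a minimal sketch that parses `show mac address-table` output as pasted above and answers two questions: on which port a given MAC was learned, and how many entries each port has. The function names are hypothetical and the parser only assumes the four-column "Vlan, Mac Address, Port, Type" layout shown in this ticket.

```python
import re
from collections import Counter

_ROW = re.compile(
    r"^\s*(\d+)\s+((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+(\S+)\s+(\S+)\s*$",
    re.IGNORECASE,
)


def parse_mac_table(text):
    """Return a list of (vlan, mac, port, type) tuples; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = _ROW.match(line)
        if m:
            vlan, mac, port, kind = m.groups()
            rows.append((int(vlan), mac.lower(), port, kind))
    return rows


def ports_for(rows, mac):
    """Ports on which `mac` was learned (empty list if it is unknown)."""
    return sorted({port for _, m, port, _ in rows if m == mac.lower()})


def entries_per_port(rows):
    """Number of table entries per port, e.g. to see how much sits behind Po1."""
    return Counter(port for _, _, port, _ in rows)
```

With the table above, `ports_for(rows, "98:be:94:03:e9:4b")` would be expected to return an empty list, matching the observation that the address does not appear anywhere.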
Updated by kraih almost 3 years ago
I can't post the whole config here, but these are the interfaces:
qanet13nue#show running-config
...
interface vlan 12
name qa
ip address 10.162.0.73 255.255.192.0
!
interface vlan 14
name testnet
!
interface vlan 710
name cloudqa-admin
!
interface vlan 711
name cloudqa-bmc
!
interface gigabitethernet1
switchport mode access
switchport access vlan 12
!
interface gigabitethernet2
description "cloud4.qa Node1 BMC"
switchport trunk native vlan 12
!
interface gigabitethernet3
description "cloud4.qa Node2 BMC"
switchport trunk allowed vlan add 711
switchport trunk native vlan 12
!
interface gigabitethernet4
description "cloud4.qa Node3 BMC"
switchport trunk allowed vlan add 711
!
interface gigabitethernet5
description "cloud4.qa Node4 BMC"
switchport mode access
switchport access vlan 12
!
interface gigabitethernet6
switchport mode access
switchport access vlan 12
!
interface gigabitethernet7
description S812LC-SP
switchport mode access
switchport access vlan 12
!
interface gigabitethernet8
description S822LC-SP
switchport mode access
switchport access vlan 12
!
interface gigabitethernet9
switchport mode access
switchport access vlan 12
!
interface gigabitethernet10
switchport mode access
switchport access vlan 12
!
interface gigabitethernet11
switchport mode access
switchport access vlan 12
!
interface gigabitethernet12
switchport mode access
switchport access vlan 12
!
interface gigabitethernet13
switchport mode access
switchport access vlan 12
!
interface gigabitethernet14
description "cloud4.qa Node5 BMC"
switchport trunk allowed vlan add 711
!
interface gigabitethernet15
description "cloud4.qa Node6 BMC"
switchport trunk allowed vlan add 711
!
interface gigabitethernet16
description "cloud4.qa Node7 BMC"
switchport trunk allowed vlan add 711
!
interface gigabitethernet17
description "cloud4.qa Node8 BMC"
switchport trunk allowed vlan add 711
!
interface gigabitethernet18
switchport mode access
switchport access vlan 12
!
interface gigabitethernet19
description S812LC
switchport mode access
switchport access vlan 12
!
interface gigabitethernet20
description S822LC
switchport mode access
switchport access vlan 12
!
interface gigabitethernet21
switchport mode access
switchport access vlan 12
!
interface gigabitethernet22
switchport mode access
switchport access vlan 12
!
interface gigabitethernet23
switchport mode access
switchport access vlan 12
!
interface gigabitethernet24
switchport mode access
switchport access vlan 12
!
interface gigabitethernet25
switchport trunk native vlan 12
!
interface gigabitethernet26
switchport trunk native vlan 12
!
interface gigabitethernet27
channel-group 1 mode auto
!
interface gigabitethernet28
channel-group 1 mode auto
!
interface Port-channel1
flowcontrol auto
description nx3kup
switchport trunk allowed vlan add 12,14,710-711
!
...
Updated by okurz almost 3 years ago
- Status changed from Workable to In Progress
Updated by kraih almost 3 years ago
With all the data collected, I think we can conclude that the rack switch is definitely misconfigured, with a lot of legacy settings from previous uses. As a next step, we will need help from someone in Infra with Cisco IOS knowledge to reset and properly configure the switch, irrespective of who is ultimately responsible for maintaining it (the SNMP settings still say snmp-server contact infra@suse.com, btw.).
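To make the "legacy settings" easier to spot, the interface section of the running config can be grouped per port. The sketch below is one possible way to do that, assuming the `interface ...` / `!` block layout shown above. It lists ports that carry `switchport trunk` settings without an explicit `switchport mode access` line. Whether such a port is actually misconfigured depends on the switch defaults, which are not known here, and the function names are made up for illustration.

```python
def parse_interfaces(config_text):
    """Map each interface name to the list of settings in its config block."""
    blocks, current = {}, None
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("interface "):
            current = line[len("interface "):]
            blocks[current] = []
        elif line == "!" or not line:
            current = None  # "!" closes the current block
        elif current is not None:
            blocks[current].append(line)
    return blocks


def trunk_without_access_mode(blocks):
    """Interfaces with trunk settings but no explicit access mode."""
    return [
        name
        for name, lines in blocks.items()
        if any(l.startswith("switchport trunk") for l in lines)
        and "switchport mode access" not in lines
    ]
```

Applied to the excerpt above, this would be expected to flag ports such as gigabitethernet2 and the BMC ports, which could serve as a starting list when reviewing the configuration with Infra.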
Updated by kraih almost 3 years ago
- Due date changed from 2022-02-04 to 2022-02-28
Talked to Gerhard Schlotter from Infra today and they are working on the problem now. I've given them access to the switch and two machines in the rack (Power8-4-kvm/Power8-5-kvm), so they can do network tests themselves. The initial plan is to simply reset and reconfigure the switch. I've also forwarded our concerns regarding the core switch. If necessary they will have physical access to the rack on Tuesday. So we should know more around Wednesday next week.
Updated by kraih almost 3 years ago
Quick update, so far Infra has not worked on the switch. It is planned for today though.
Updated by kraih almost 3 years ago
Got another update from Gerhard: the switch might be working fine again after a firmware update. I can now see both machines (Power8-4-kvm/Power8-5-kvm) in the MAC address table.
qanet13nue#show mac address-table
...
12 98:be:94:03:e9:4b gi20 dynamic
12 98:be:94:04:48:17 gi19 dynamic
Updated by kraih almost 3 years ago
- Status changed from In Progress to Feedback
Updated by kraih almost 3 years ago
And for completeness some iperf results:
QA-Power8-4-kvm:~ # iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[ 5] local 2620:113:80c0:80a0:10:162:29:60f port 57394 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 85.2 MBytes 715 Mbits/sec 5 325 KBytes
[ 5] 1.00-2.00 sec 92.5 MBytes 776 Mbits/sec 21 457 KBytes
[ 5] 2.00-3.00 sec 90.0 MBytes 755 Mbits/sec 45 453 KBytes
[ 5] 3.00-4.00 sec 100 MBytes 839 Mbits/sec 1 395 KBytes
[ 5] 4.00-5.00 sec 97.5 MBytes 818 Mbits/sec 34 513 KBytes
[ 5] 5.00-6.00 sec 83.8 MBytes 703 Mbits/sec 34 222 KBytes
[ 5] 6.00-7.00 sec 81.2 MBytes 682 Mbits/sec 0 379 KBytes
[ 5] 7.00-8.00 sec 96.2 MBytes 807 Mbits/sec 51 510 KBytes
[ 5] 8.00-9.00 sec 82.5 MBytes 692 Mbits/sec 31 305 KBytes
[ 5] 9.00-10.00 sec 78.8 MBytes 661 Mbits/sec 5 192 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 888 MBytes 745 Mbits/sec 227 sender
[ 5] 0.00-10.00 sec 885 MBytes 742 Mbits/sec receiver
iperf Done
QA-Power8-4-kvm:~ # iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[ 5] local 2620:113:80c0:80a0:10:162:29:60f port 57390 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 55.7 MBytes 467 Mbits/sec
[ 5] 1.00-2.00 sec 68.7 MBytes 576 Mbits/sec
[ 5] 2.00-3.00 sec 67.8 MBytes 569 Mbits/sec
[ 5] 3.00-4.00 sec 65.5 MBytes 550 Mbits/sec
[ 5] 4.00-5.00 sec 66.8 MBytes 560 Mbits/sec
[ 5] 5.00-6.00 sec 76.1 MBytes 638 Mbits/sec
[ 5] 6.00-7.00 sec 57.1 MBytes 479 Mbits/sec
[ 5] 7.00-8.00 sec 57.0 MBytes 478 Mbits/sec
[ 5] 8.00-9.00 sec 51.6 MBytes 433 Mbits/sec
[ 5] 9.00-10.00 sec 50.1 MBytes 420 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 619 MBytes 519 Mbits/sec 4943 sender
[ 5] 0.00-10.00 sec 616 MBytes 517 Mbits/sec receiver
iperf Done.
That's a pretty significant improvement.
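To put a number on the improvement, the summary lines of the plain-text iperf3 output can be compared directly. This is a minimal sketch that assumes the text format pasted in this ticket (summary lines ending in "sender" or "receiver"); the function name is hypothetical.

```python
import re

_SUMMARY = re.compile(
    r"([\d.]+)\s+([KMG])bits/sec(?:\s+\d+)?\s+(sender|receiver)\s*$"
)
_SCALE = {"K": 1e-3, "M": 1.0, "G": 1e3}  # convert to Mbit/s


def summary_mbits(output):
    """Return {'sender': Mbit/s, 'receiver': Mbit/s} from iperf3 text output."""
    result = {}
    for line in output.splitlines():
        m = _SUMMARY.search(line)
        if m:
            value, unit, role = m.groups()
            result[role] = float(value) * _SCALE[unit]
    return result


# Summary lines of the reverse-mode tests on QA-Power8-4-kvm, before and after
before = """[ 5] 0.00-10.00 sec 11.2 MBytes 9.38 Mbits/sec 784 sender
[ 5] 0.00-10.00 sec 11.1 MBytes 9.29 Mbits/sec receiver"""
after = """[ 5] 0.00-10.00 sec 619 MBytes 519 Mbits/sec 4943 sender
[ 5] 0.00-10.00 sec 616 MBytes 517 Mbits/sec receiver"""

ratio = summary_mbits(after)["receiver"] / summary_mbits(before)["receiver"]
print(f"receiver throughput improved by a factor of about {ratio:.0f}")
```

For the two reverse-mode runs quoted here the receiver rate goes from 9.29 to 517 Mbit/s, i.e. a factor of roughly 55.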
Updated by MDoucha almost 3 years ago
Looks good so far. I've run some test jobs on QA-Power8-4-kvm and a 2.2GB disk image was downloaded in ~61 seconds. Yesterday similarly sized files took 15-60 minutes to download on the same worker. We'll see tomorrow how full KOTD tests perform.
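The download times above translate into effective throughput as follows. This is a back-of-the-envelope sketch that assumes "2.2GB" means decimal gigabytes and ignores protocol overhead.

```python
def effective_mbits(size_gb, seconds):
    """Average throughput in Mbit/s for `size_gb` decimal gigabytes."""
    return size_gb * 1e9 * 8 / seconds / 1e6


for label, seconds in [("61 s", 61), ("15 min", 15 * 60), ("60 min", 60 * 60)]:
    print(f"{label:>6}: {effective_mbits(2.2, seconds):6.1f} Mbit/s")
```

This gives roughly 290 Mbit/s for the 61 second download, compared to about 20 Mbit/s and 5 Mbit/s for the 15 and 60 minute cases, which is in the same range as the ~9 Mbit/s iperf3 measurements taken while the problem was present.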
Updated by okurz almost 3 years ago
@kraih any news from you about this today? Any plans to continue?
As the EngInfra ticket was closed but is missing some details that I would like to learn about I asked there in the ticket as well:
This is great news. For an issue that had such big impact I would be happy to read a bit more about the investigation and fixing process. Could you please describe why you came to the conclusion that firmware should be updated? How was that conducted and what is the current, final state? What is done to prevent a similar situation for other switches/racks/rooms? What can we do to improve in a similar situation in the future? As the measurements in switch throughput clearly showed a significant change with the beginning of the problems what measures are taken on monitoring&alerting level?
Given the impact of this issue we definitely should conduct a lessons learned meeting and Five Why analysis with follow-up tasks.
Updated by okurz almost 3 years ago
- Tracker changed from action to coordination
- Subject changed from All OSD PPC64LE workers except malbec appear to have horribly broken cache service to [epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service
Updated by kraih almost 3 years ago
All machines in the rack are back in production.
Updated by okurz over 2 years ago
- Status changed from Blocked to Resolved
All subtasks resolved, lessons learned and recorded :)
Updated by openqa_review over 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: autoyast_sles4sap_hana@ppc64le-sap
https://openqa.suse.de/tests/8751846#step/installation/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 140 days if nothing changes in this ticket.
Updated by okurz over 2 years ago
- Status changed from Resolved to Feedback
Please check the reminder comment about a recent failure using this ticket as label
Updated by kraih over 2 years ago
Looks like the comment takeover was pointless, since the new issue is completely unrelated to the cache service and downloaded assets.
Updated by livdywan over 2 years ago
- Status changed from Feedback to Resolved
kraih wrote:
Looks like the comment takeover was pointless, since the new issue is completely unrelated to the cache service and downloaded assets.
Ack. I created #112184 for the new issue