action #102882

All OSD PPC64LE workers except malbec appear to have horribly broken cache service

Added by okurz 2 months ago. Updated 4 days ago.

Status: Feedback
Priority: Normal
Assignee:
Target version:
Start date: 2021-11-23
Due date: 2022-01-28
% Done: 0%

Estimated time:

Description

Observation

User report https://suse.slack.com/archives/C02CANHLANP/p1637666699462700 .
mdoucha: "All jobs are stuck downloading assets until they time out. OSD dashboard shows that the workers are downloading ridiculous amounts of data all the time since yesterday."

Suggestions

  • Find corresponding monitoring data on https://monitor.qa.suse.de/ that can be used to visualize the problem as well as a verification after any potential fix
  • Identify what might cause such problems "since yesterday", i.e. 2021-11-22

Rollback steps (to be done once the actual issue has been resolved)

powerqaworker-qam-1 # systemctl unmask openqa-worker-auto-restart@{3..6} openqa-reload-worker-auto-restart@{3..6}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{3..6} openqa-reload-worker-auto-restart@{3..6}.{service,timer}
QA-Power8-4-kvm # systemctl unmask openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer}
QA-Power8-5-kvm # systemctl unmask openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer} && systemctl enable --now openqa-worker-auto-restart@{4..8} openqa-reload-worker-auto-restart@{4..8}.{service,timer}
  • Add qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de back to salt and ensure all services are running again.
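The per-host commands above follow one pattern; here is a minimal dry-run sketch (host names and instance ranges as listed above) that prints each rollback command so it can be reviewed before running it on the respective host:

```shell
#!/bin/sh
# Dry-run generator for the rollback commands above: prints one line per host.
# Instance ranges are the ones from this ticket's description.
rollback_cmd() {
    host=$1; range=$2
    units="openqa-worker-auto-restart@$range openqa-reload-worker-auto-restart@$range.{service,timer}"
    echo "$host: systemctl unmask $units && systemctl enable --now $units"
}
rollback_cmd powerqaworker-qam-1 '{3..6}'
rollback_cmd QA-Power8-4-kvm '{4..8}'
rollback_cmd QA-Power8-5-kvm '{4..8}'
```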

Related issues

Related to openQA Infrastructure - action #104106: [qe-core] test fails in await_install - Network performance for ppc installations is decreasing size:S (Resolved, 2021-12-16)

Copied to openQA Project - coordination #102951: [epic] Better network performance monitoring (Workable, 2021-11-24)

History

#1 Updated by okurz 2 months ago

powerqaworker-qam-1:/home/okurz # ps auxf | grep openqa
root      88223  0.0  0.0   4608  1472 pts/2    S+   12:45   0:00                                      \_ grep --color=auto openqa
_openqa+   4976  0.0  0.0 118720 114496 ?       Ss   Nov21   0:42 /usr/bin/perl /usr/share/openqa/script/worker --instance 7
_openqa+   4983  0.0  0.0 112640 108608 ?       Ss   Nov21   0:36 /usr/bin/perl /usr/share/openqa/script/worker --instance 8
_openqa+  29016  0.0  0.0  90624 78144 ?        Ss   Nov22   0:09 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  51130  0.0  0.0  90624 72896 ?        S    06:07   0:21  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  77232  0.0  0.0  90624 74048 ?        S    10:41   0:07  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  80692  0.0  0.0  90624 73600 ?        S    11:25   0:04  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  80892  0.0  0.0  90624 73728 ?        S    11:27   0:04  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  29017  0.0  0.0  82368 78144 ?        Ss   Nov22   0:50 /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  78221  1.1  0.0  84416 73920 ?        S    10:53   1:19  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  79516  1.1  0.0  85120 74560 ?        S    11:09   1:07  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  80419  1.1  0.0  84544 74048 ?        S    11:21   0:58  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  85600  1.1  0.0  84352 73984 ?        S    12:21   0:16  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  86025  1.1  0.0  84544 73984 ?        S    12:26   0:12  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  83287  0.0  0.0  76224 72192 ?        Ss   11:51   0:02 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
_openqa+  83289  0.0  0.0  75968 72192 ?        Ss   11:51   0:02 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
_openqa+  83521  0.0  0.0  76096 72128 ?        Ss   11:54   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 6
_openqa+  84583  0.0  0.0  76352 72320 ?        Ss   12:08   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
_openqa+  87663  0.1  0.0  75968 72064 ?        Ss   12:40   0:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
_openqa+  87759  0.2  0.0  76160 72192 ?        Ss   12:41   0:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 3

and strace-ing one process, the oldest still running cache minion process, reveals:

poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "O\253\302\342\277\t\367z\305x\340E\325\344\340\353\23\261+\353\r\21\315\207\211\301\334\251\364\357\262\347"..., 131072) = 4284
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "O\253\302\342\277\t\367z\305x\340E\325\344\340\353\23\261+\353\r\21\315\207\211\301\334\251\364\357\262\347"..., 4284) = 4284
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\270\321x\243\356\263&\303_\254E{\242.\370-\32\274!\275YC\177\244\265\206\355T\227\7\327\255"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\270\321x\243\356\263&\303_\254E{\242.\370-\32\274!\275YC\177\244\265\206\355T\227\7\327\255"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "U\269\247NE;\325\ta\210\275\314y\244M\346]4 \340Y\312\343<\374~\376\370_\336@"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "U\269\247NE;\325\ta\210\275\314y\244M\346]4 \340Y\312\343<\374~\376\370_\336@"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\256\231h]\240\361Y>}\213\376\221m\310\263:\27\310\33204u\327=(\2729/\317\252\367\22"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\256\231h]\240\361Y>}\213\376\221m\310\263:\27\310\33204u\327=(\2729/\317\252\367\22"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\2134\255\354\216\34\t\365\305+\250\25y\207s\204y\234\235\253\332}\376\356x\251\346C2\17\370="..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\2134\255\354\216\34\t\365\305+\250\25y\207s\204y\234\235\253\332}\376\356x\251\346C2\17\370="..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\364+\375\200^\36\362yB\314\210.[}&\37\351\231\371\36\247\22\317\245~\260\vy\205\354\206_"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\364+\375\200^\36\362yB\314\210.[}&\37\351\231\371\36\247\22\317\245~\260\vy\205\354\206_"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "2N\235\306,\317\333\"W\37\267\304f\236\234\n\317\376\367\314\206\375\261\226\32#W\v\316\246\221\265"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "2N\235\306,\317\333\"W\37\267\304f\236\234\n\317\376\367\314\206\375\261\226\32#W\v\316\246\221\265"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\301c\314\335\6zK\35^E\314\330\23\276\10\301\277\360\367\216_\6?Y\220\t\370\330\30gtU"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\301c\314\335\6zK\35^E\314\330\23\276\10\301\277\360\367\216_\6?Y\220\t\370\330\30gtU"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "S\252\2354\3654g\246\217x\"\272\307E\226K\5J\255\350(\331\223\fE\357V\253l\1W\340"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "S\252\2354\3654g\246\217x\"\272\307E\226K\5J\255\350(\331\223\fE\357V\253l\1W\340"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\257m.\324\265Y\361\2k^\270\374\335\251o~\374\351\271\177\354\213\16K_\273\v5\231\262/\236"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\257m.\324\265Y\361\2k^\270\374\335\251o~\374\351\271\177\354\213\16K_\273\v5\231\262/\236"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, " -\3471\373+\273kr`\323cf[}F\227\27\265\23\313\243\366\25\366{}\324\356Zf\27"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, " -\3471\373+\273kr`\323cf[}F\227\27\265\23\313\243\366\25\366{}\324\356Zf\27"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\351\254\231\325\320#z\230\351\335-\217\214\350\354\3041\232\227*\27\332rE\251\274o\305\305\265\232\363"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\351\254\231\325\320#z\230\351\335-\217\214\350\354\3041\232\227*\27\332rE\251\274o\305\305\265\232\363"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "C\337\206\23W\366O\302\v\226\207\256\273\26o\302\265$\306\375\6O\265\263|\250\276-\254\275Y\263"..., 131072) = 4284
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "C\337\206\23W\366O\302\v\226\207\256\273\26o\302\265$\306\375\6O\265\263|\250\276-\254\275Y\263"..., 4284) = 4284
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\34\242\376\314\n\267\323\322\244\2300,o\270?~\315\234\236\277f~\225\271i\372\26\257\27{\342\227"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\34\242\376\314\n\267\323\322\244\2300,o\270?~\315\234\236\277f~\225\271i\372\26\257\27{\342\227"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\236\304;r\303\342uxB}b\31I\357\333\214\242\213^\243.\350\33Zk\317@3\t\307S#"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\236\304;r\303\342uxB}b\31I\357\333\214\242\213^\243.\350\33Zk\317@3\t\307S#"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "Fy_\362\317N\230{Y\376\366\364\206\264\37\277\323\31\256h=\350\36=\212\37\257\352\177\324\226~"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "Fy_\362\317N\230{Y\376\366\364\206\264\37\277\323\31\256h=\350\36=\212\37\257\352\177\324\226~"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "~E\263\307\231\370\204\374P\361\234kK\347|\324\372\325\253\277\276\362\345\253\344\341\303\346\317\277\273\242"..., 131072) = 2856
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "~E\263\307\231\370\204\374P\361\234kK\347|\324\372\325\253\277\276\362\345\253\344\341\303\346\317\277\273\242"..., 2856) = 2856
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\364J\376p\r\217@W\361.b\232\0313.\351\220\325,\356#=k\3253\256\244U\203\374\266#"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\364J\376p\r\217@W\361.b\232\0313.\351\220\325,\356#=k\3253\256\244U\203\374\266#"..., 1428) = 1428
poll([{fd=7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, events=POLLIN|POLLPRI}], 1, 300000) = 1 ([{fd=7, revents=POLLIN}])
read(7<TCPv6:[[2620:113:80c0:80a0:10:162:29:5e72]:43490->[2620:113:80c0:8080:10:160:0:207]:80]>, "\267\231k\345\311\333SV\313\373\366\25\177\364\n\357\312\317\325r\305\322ox\30yAo\320\32\371\320"..., 131072) = 1428
write(9</var/lib/openqa/cache/tmp/mojo.tmp.JLzn3IzEMLHQR1uv>, "\267\231k\345\311\333SV\313\373\366\25\177\364\n\357\312\317\325r\305\322ox\30yAo\320\32\371\320"..., 1428) = 1428

so the process is busy reading over network and writing into a local cache file?

#2 Updated by mkittler 2 months ago

On the Minion dashboard no download jobs have piled up. However, judging by htop, the write speed to disk is below 1 MB/s (per process). That's very slow. And yes, it is reading over the network and writing into a local cache file. I suppose that is expected - just not that it is this slow.

The network connection to OSD isn't generally slow. I've just tested with iperf3 on power8-4-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de and got > 600 Mbits/sec. The write performance on /var/lib/openqa/cache/tmp also looks good on both workers.

#3 Updated by mkittler 2 months ago

Judging by the job history, the affected machines are qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de. grenache-1 and malbec look good.

#4 Updated by nicksinger 2 months ago

mkittler wrote:

Judging by the job history, the affected machines are qa-power8-4-kvm.qa.suse.de, qa-power8-5-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de. grenache-1 and malbec look good.

grenache-1 looking good is an interesting observation as it would show that the affected machines are not limited to the qa.suse.de subdomain. Given our history I'd recommend checking network performance to/from OSD using iperf3 with the respective parameters for IPv4 and IPv6. Maybe this reveals some first details.
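A minimal sketch of such a crosscheck, assuming an iperf3 server is already running on openqa.suse.de: it prints the four command variants (IPv4/IPv6, upload/download); drop the leading echo to actually execute them:

```shell
# Print the iperf3 crosscheck matrix: IPv4/IPv6 x upload/download.
# Drop the leading "echo" to run against a live "iperf3 -s" on the server.
iperf_matrix() {
    for family in -4 -6; do
        echo iperf3 "$family" -c "$1" -t 10       # upload to server
        echo iperf3 "$family" -R -c "$1" -t 10    # download (reverse mode)
    done
}
iperf_matrix openqa.suse.de
```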

#5 Updated by mkittler 2 months ago

  • Assignee set to mkittler

#6 Updated by mkittler 2 months ago

I've checked with iperf3 again. There was no difference between using -4 and -6.

#7 Updated by kraih 2 months ago

Not seeing anything unusual in the logs on powerqaworker-qam-1.qa.suse.de either.

#8 Updated by mkittler 2 months ago

When using iperf3 -R to test downloading (from OSD) on qa-power8-4-kvm.qa.suse.de and powerqaworker-qam-1.qa.suse.de there's a huge slowdown to < 5 Mbit/s (regardless of whether IPv4 or IPv6 is used). That's not the case on the good host malbec, so I assume we have found our problem - unless this is really just due to the ongoing downloads. The ongoing downloads use only 20 Mbit/s (3.33 Mbyte/s), which is very slow. Even if we add that to the performance-test speed we're still only at a receive rate of 25 Mbit/s.

#9 Updated by nicksinger 2 months ago

All affected machines seem to be located in SRV2 according to racktables: https://racktables.suse.de/index.php?page=object&tab=default&object_id=3026
Here are some network graphs for the switch they are most likely connected to: http://mrtg.suse.de/qanet13nue/index.html

I checked the connection speeds on that switch. According to these graphs, 3 of these ports seem to max out at ~100 Mbit/s (still quite a bit more than measured by mkittler):

qanet13nue#show interfaces status
                                             Flow Link          Back   Mdix
Port     Type         Duplex  Speed Neg      ctrl State       Pressure Mode
-------- ------------ ------  ----- -------- ---- ----------- -------- -------
gi1      1G-Copper    Full    1000  Enabled  Off  Up          Disabled Off    
gi2      1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     
gi3      1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     
gi4      1G-Copper      --      --     --     --  Down           --     --    
gi5      1G-Copper    Full    1000  Enabled  Off  Up          Disabled Off    
gi6      1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     
gi7      1G-Copper    Full    100   Enabled  Off  Up          Disabled On     
gi8      1G-Copper    Full    1000  Enabled  Off  Up          Disabled Off    
gi9      1G-Copper      --      --     --     --  Down           --     --    
gi10     1G-Copper      --      --     --     --  Down           --     --    
gi11     1G-Copper    Full    100   Enabled  Off  Up          Disabled On     
gi12     1G-Copper    Full    100   Enabled  Off  Up          Disabled On     
gi13     1G-Copper    Full    100   Enabled  Off  Up          Disabled Off    
gi14     1G-Copper      --      --     --     --  Down           --     --    
gi15     1G-Copper      --      --     --     --  Down           --     --    
gi16     1G-Copper      --      --     --     --  Down           --     --    
gi17     1G-Copper      --      --     --     --  Down           --     --    
gi18     1G-Copper      --      --     --     --  Down           --     --    
gi19     1G-Copper    Full    1000  Enabled  Off  Up          Disabled On     

From the MAC address-table I see the following connections:
powerqaworker-qam-1.qa.suse.de: gi5
QA-Power8-5.qa.suse.de: gi8
QA-Power8-4.qa.suse.de: gi7

So only qa-power8-4 is connected at 100 Mbit/s.
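To correlate the switch's MAC address-table with hosts, each worker's interface MACs are needed; a small sketch (run on the worker, reading sysfs) that lists them:

```shell
# List every network interface with its MAC address; compare these against
# the switch's MAC address-table output to find the port a host is on.
for dev in /sys/class/net/*; do
    printf '%s %s\n' "${dev##*/}" "$(cat "$dev/address")"
done
```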

#10 Updated by mkittler 2 months ago

I've stopped all services on powerqaworker-qam-1.qa.suse.de. Even without ongoing downloads the network speed is very slow:

martchus@powerqaworker-qam-1:~> iperf3 -R -4 -c openqa.suse.de -i 1 -t 30
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 10.162.7.211 port 38894 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   232 KBytes  1.90 Mbits/sec                  
[  5]   1.00-2.00   sec   861 KBytes  7.06 Mbits/sec                  
[  5]   2.00-3.00   sec   897 KBytes  7.34 Mbits/sec                  
[  5]   3.00-4.00   sec   441 KBytes  3.61 Mbits/sec                  
[  5]   4.00-5.00   sec   168 KBytes  1.38 Mbits/sec                  
[  5]   5.00-6.00   sec   810 KBytes  6.64 Mbits/sec                  
[  5]   6.00-7.00   sec   427 KBytes  3.50 Mbits/sec                  
[  5]   7.00-8.00   sec   157 KBytes  1.29 Mbits/sec                  
[  5]   8.00-9.00   sec   577 KBytes  4.73 Mbits/sec                  
[  5]   9.00-10.00  sec   566 KBytes  4.63 Mbits/sec                  
[  5]  10.00-11.00  sec   406 KBytes  3.32 Mbits/sec                  
[  5]  11.00-12.00  sec   714 KBytes  5.85 Mbits/sec                  
[  5]  12.00-13.00  sec   571 KBytes  4.68 Mbits/sec                  
[  5]  13.00-14.00  sec   925 KBytes  7.58 Mbits/sec                  
[  5]  14.00-15.00  sec   474 KBytes  3.88 Mbits/sec                  
[  5]  15.00-16.00  sec   952 KBytes  7.80 Mbits/sec                  
[  5]  16.00-17.00  sec   161 KBytes  1.32 Mbits/sec                  
[  5]  17.00-18.00  sec   218 KBytes  1.78 Mbits/sec                  
[  5]  18.00-19.00  sec  1.16 MBytes  9.72 Mbits/sec                  
[  5]  19.00-20.00  sec   475 KBytes  3.89 Mbits/sec                  
[  5]  20.00-21.00  sec   976 KBytes  7.99 Mbits/sec                  
[  5]  21.00-22.00  sec  1.38 MBytes  11.6 Mbits/sec                  
[  5]  22.00-23.00  sec   496 KBytes  4.07 Mbits/sec                  
[  5]  23.00-24.00  sec   358 KBytes  2.93 Mbits/sec                  
[  5]  24.00-25.00  sec  1024 KBytes  8.39 Mbits/sec                  
[  5]  25.00-26.00  sec   779 KBytes  6.38 Mbits/sec                  
[  5]  26.00-27.00  sec   761 KBytes  6.23 Mbits/sec                  
[  5]  27.00-28.00  sec   434 KBytes  3.56 Mbits/sec                  
[  5]  28.00-29.00  sec   663 KBytes  5.43 Mbits/sec                  
[  5]  29.00-30.00  sec   786 KBytes  6.44 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  18.6 MBytes  5.19 Mbits/sec  2284             sender
[  5]   0.00-30.00  sec  18.5 MBytes  5.16 Mbits/sec                  receiver

All affected workers are in the same rack: https://racktables.suse.de/index.php?page=rack&rack_id=520

#11 Updated by mkittler 2 months ago

  • Status changed from New to Feedback

I've created an Infra ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-67703
I've also just stopped all worker slots on the affected hosts and removed them from salt-key.

#12 Updated by okurz 2 months ago

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=84&tab=alert&from=1637506758247&to=1637657889676 maybe points to the same. The apache response time seems to have gone up in the past two days

#13 Updated by mkittler 2 months ago

  • Description updated (diff)

#14 Updated by okurz 2 months ago

  1. Does running iperf3 in server mode when there are no clients reading from there incur any overhead? If not, should we run it there permanently for monitoring and investigation purposes?
  2. mkittler can you provide something like a "one-liner" to reproduce the problem, e.g. the necessary iperf3 command line both on server+worker
  3. can we run the according iperf3 commands periodically in our monitoring? I guess just some seconds every hour should provide enough data and we can smooth in grafana
  4. I suggest trying to power down + power up the respective machines over IPMI. Maybe this already helps with port renegotiation or something.
  5. As the problem appeared just recently I suggest we roll back package changes, e.g. the kernel version. Even though some workers still behave fine, it could be a problem introduced by updates where only some machines are affected due to particularities of their network setup.

#15 Updated by nicksinger 2 months ago

okurz wrote:

  1. Does running iperf3 in server mode when there are no clients reading from there incur any overhead? If not, should we run it there permanently for monitoring and investigation purposes?

Running just the server does not really come with much overhead beyond the usual load an idle process causes.

  1. can we run the according iperf3 commands periodically in our monitoring? I guess just some seconds every hour should provide enough data and we can smooth in grafana

There is this open request with a simple exec example: https://github.com/influxdata/telegraf/issues/3866#issuecomment-694429507 - this should work for our use case. We just need to make sure not to run the requests to all workers at the same time because that would quite easily saturate the whole OSD uplink.
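A rough sketch of how such a periodic check could be wired into telegraf's exec input; the wrapper script path, the emitted measurement and the interval are assumptions, and runs would still need to be staggered per host as noted above:

```toml
[[inputs.exec]]
  ## Hypothetical wrapper that runs a short iperf3 test against OSD and
  ## prints one line of influx line protocol, e.g.:
  ##   iperf,server=openqa.suse.de bits_per_second=5190000
  commands = ["/usr/local/bin/iperf3-to-influx openqa.suse.de"]
  interval = "1h"
  timeout = "60s"
  data_format = "influx"
```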

#16 Updated by mkittler about 2 months ago

Unfortunately Infra doesn't have access to the server either. Maybe they can at least tell us who has.

I've rebooted but it didn't help. I've also booted into the snapshot from Wed 24 Nov 2021 10:49:35 CET, but that didn't help either. The rates are a tiny bit higher now, but that's likely just because all downloads on all hosts have been stopped by now. It is still just 15.8 Mbits/sec.

#17 Updated by okurz about 2 months ago

#18 Updated by okurz about 2 months ago

From https://sd.suse.com/servicedesk/customer/portal/1/SD-67703

Gerhard Schlotter (3 hours ago): "how should we help with the issues, we neither have access to the switch nor the affected servers. to the qanet switches, someone from QA team [has access]. The uplink from our side is completely fine and can carry a lot more load."

Who can pick this up and access the switches to check, maybe reboot, unplug some cables, etc.?

#19 Updated by mkittler about 2 months ago

  • Assignee changed from mkittler to nicksinger

Who can pick this up and access the switches to check, maybe reboot, unplug some cables, etc.?

Nick says he has access, so I'm assigning the ticket to him.


nicksinger I can of course still do the rollback steps (mentioned in the ticket description) for you in the end or do some further testing to see whether something works better after some changes.

#20 Updated by mkittler about 2 months ago

nicksinger has restarted the switch but the networking speed is still slow. (Even though all workers are now back online I'd expect more than 4 Mbit/s download rate via iperf3 from OSD.)

#21 Updated by okurz about 2 months ago

so what's the plan?

#22 Updated by mkittler about 2 months ago

I don't know. Maybe it is possible to plug the machines into another switch or to try with a spare switch?

#23 Updated by nicksinger about 2 months ago

I asked Wei Gao if I can get access to migration-smt2.qa.suse.de to run an iperf crosscheck with another machine in the same rack.

#24 Updated by okurz about 2 months ago

  • Due date set to 2021-12-29
  • Status changed from Feedback to In Progress
  • Priority changed from Urgent to High

Current state

We have reduced performance but still have some worker instances running, so even with degraded performance the urgency of the ticket is addressed and we can reduce the priority to "High".

Observations

  • http://mrtg.suse.de/qanet13nue/10.162.0.73_gi1.html shows that there is significant traffic on that port since 2021-W47, i.e. 2021-11-22, the start of the problems, and near-zero traffic going back to 2020-11. The same holds for gi2, gi5, gi7, gi8, gi11, gi12, gi13, gi23, gi24.
  • The corresponding counterpart to qanet13 is visible on http://mrtg.suse.de/Nx5696Q-Core2/192.168.0.121_369098892.html (qanet13 connection on core2) and http://mrtg.suse.de/Nx5696Q-Core1/192.168.0.120_526649856.html (qanet13 connection on core1), but neither seems to show a significant traffic increase since 2021-11-22. So where is the traffic coming from? Is the switch qanet13 sending out broadcasts itself?
  • The qanet13nue uplink seems to be gi27+gi28 (found with show interfaces Port-Channel 1). http://mrtg.suse.de/qanet13nue/10.162.0.73_po1.html is the aggregated view and shows nothing significant. But we see that in the past we also had spikes to 320 MBit/s "in" and 240 MBit/s "out", and no such spikes since 2021-W47 - limited to 100 MBit/s? The yearly average looks sane, nothing special: on average 46 MBit/s "in" and 16 MBit/s "out".
  • We identified that the hosts called S812LC and S822LC on http://mrtg.suse.de/qanet13nue/index.html are, according to https://racktables.suse.de/index.php?page=object&object_id=992, our power hosts qa-power8-4 (S812LC) and qa-power8-5 (S822LC) plus the respective "service processors" S812LC-SP and S822LC-SP. gi6 is powerqaworker-qam-1 (according to an iperf experiment from the hyperv host).
  • On http://mrtg.suse.de/qanet13nue/index.html we can see that many hosts receive significant traffic since 2021-11-22 but show no change in sent traffic. The only port that shows corresponding significant incoming traffic is the uplink. So our conclusion is that unintended broadcast traffic received by the rack switch is forwarded to all hosts, and the Power machines in particular seem to be badly affected by this (either traffic on the SPs, on the host itself or both), so that sending still works with high bandwidth while receiving only achieves a very low bandwidth.
  • Booted powerqaworker-qam-1 with kernel 5.3.18-lp152.102-default from 2021-11-11 from /boot, i.e. from before the start of the problem on 2021-11-22, and ran iperf3 -t 1200 -R -c openqaworker12, yielding 5.9 MBit/s - the same on this older kernel => kernel regression unlikely
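If the broadcast-flood theory is right, it should be measurable directly on a host; a rough sketch sampling the kernel's per-interface multicast packet counter (which on many drivers also counts broadcast frames). lo is a placeholder for the worker's real uplink interface, e.g. eth4:

```shell
# Sample the received multicast/broadcast packet counter twice and print the
# rate; a flood shows up as a high packets/s value. Replace "lo" with the
# worker's uplink interface (e.g. eth4 on powerqaworker-qam-1).
dev=lo
r1=$(cat "/sys/class/net/$dev/statistics/multicast")
sleep 2
r2=$(cat "/sys/class/net/$dev/statistics/multicast")
echo "$(( (r2 - r1) / 2 )) multicast packets/s on $dev"
```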

Suggestions

  • WAITING Ask users of other machines in the same rack whether they have network problems, e.g. migration-smt2.qa.suse.de; ask the migration team -> nsinger asked, awaiting response
  • DONE Conduct a network performance test between two hosts within the same rack: nsinger conducted this test between qa-power8-4 (server) and powerqaworker-qam-1 (client) and received 3.13 MBit/s, orders of magnitude too low for 1 GBit/s; same for qa-power8-5 (client, both directions). Crosscheck between two other hosts in another rack: we did this for openqaworker10+13 and got 945 MBit/s, i.e. as expected near 1 GBit/s accounting for overhead.
  • DONE Try to saturate the switch bandwidth using iperf3 until we can actually see the result on http://mrtg.suse.de/qanet13nue/index.html -> we could see the results using openqaw9-hyperv, which we verified to be connected to gi1: http://mrtg.suse.de/qanet13nue/10.162.0.73_gi1.html
  • DONE Logged in over RDP to openqaw9-hyperv.qa.suse.de and downloaded https://iperf.fr/iperf-download.php for Windows
    • DONE executed tests against qa-power8-4-kvm resulting in 1.3 MBit/s, openqaworker10->openqaw9-hyperv.qa.suse.de => 204 MBit/s, openqaw9-hyperv.qa.suse->openqaworker10 248 MBit/s so system is fine, switch is not generally broken
    • DONE Started iperf3 -s on openqaworker12 and on openqaw9-hyperv iperf3.exe -t 1200 -c openqaworker12 at 11:09:00Z, trying to see the bandwidth on http://mrtg.suse.de/qanet20nue/index.html . stopped as expected 11:29:00Z. Reported bandwidth 77 MBit/s in both directions. MAC-address 00:15:17:B1:03:88 or 00:15:17:B1:03:89 . nsinger has confirmed that he sees this address on qanet13nue:gi1 .
  • DONE Now starting iperf3 -t 1200 -c powerqaworker-qam-1 -> 1.02 MBit/s. Reverse iperf3 -t 1200 -R -c powerqaworker-qam-1 shows a bandwidth of 692 MBit/s (!) => only download to the machine is affected
  • DONE Examine the traffic, e.g. with wireshark on any host in the rack, and see if we can identify it and forward that information to the affected users or Eng Infra -> nothing found by nsinger so far
  • Try to connect the affected machines to another switch, e.g. in a neighboring rack, and execute iperf3 runs. nicksinger will coordinate with gschlotter from Eng Infra to do that
  • REJECTED Check for log output on power8-4 why the link is only 100 MBit/s and coordinate with Eng Infra to replace the cable on the port connected to power8-4 and/or connect to another port on the same switch -> mkittler confirmed that Linux reports the link as 1 GBit/s, so this is a false report. Maybe some BMC is connected on that port.
  • Ask Eng Infra to give more members or the complete team of QE Tools ssh access to the switch, at least read-only access for monitoring. If Eng Infra does not know how to do that maybe nsinger can do it himself directly
  • Disable individual ports on the switch to check if that improves the situation for the power workers -> likely will not help as we assume the problem comes from outside the switch over the uplink
  • Conduct a network performance benchmark on the affected power hosts in a stripped-down environment with no other significant traffic. Note that we currently cannot reach the host powerqaworker-qam-1 via iperf or any other port from the other hosts.
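Several of the DONE items above compare iperf3 summary numbers by hand; extracting them mechanically makes comparing many runs easier. A minimal sketch over iperf3's default text output (the here-doc stands in for a captured run, with numbers taken from results in this ticket; pipe real output the same way):

```shell
# Print the sender and receiver summary bitrates from iperf3 text output.
# Only the two summary lines end in "sender"/"receiver"; the bitrate is field 7.
summary_mbits() {
  awk '/sender$|receiver$/ { print $NF ": " $7 " " $8 }'
}

summary_mbits <<'EOF'
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  3.84 MBytes  3.22 Mbits/sec  816             sender
[  5]   0.00-10.00  sec  3.75 MBytes  3.14 Mbits/sec                  receiver
EOF
# prints:
# sender: 3.22 Mbits/sec
# receiver: 3.14 Mbits/sec
```

For scripting, iperf3's JSON mode (`iperf3 -J`) plus a JSON tool would be more robust than field positions, assuming such a tool is available on the hosts.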

#25 Updated by okurz about 2 months ago

https://progress.opensuse.org/issues/102882

on powerqaworker-qam-1 I stopped many services and also unmounted NFS. I ran tcpdump -i eth4. Example block of the traffic I found:

14:05:32.357093 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29685225:29688121, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660203], length 2896
14:05:32.357238 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [P.], seq 29688121:29689569, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660203], length 1448
14:05:32.357239 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29689569:29691017, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660203], length 1448
14:05:32.357385 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29691017:29693913, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357533 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29693913:29696809, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357677 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29696809:29699705, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357825 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29699705:29702601, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.357968 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29702601:29705497, ack 751, win 505, options [nop,nop,TS val 3180811358 ecr 2119660204], length 2896
14:05:32.358107 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29705497:29706945, ack 751, win 505, options [nop,nop,TS val 3180811359 ecr 2119660204], length 1448
14:05:32.369753 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125080480:125081928, ack 28937, win 529, options [nop,nop,TS val 1725941080 ecr 1084980048], length 1448
14:05:32.369810 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125081928, win 3896, options [nop,nop,TS val 1084981522 ecr 1725941080], length 0
14:05:32.369945 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125089168:125092064, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981522], length 2896
14:05:32.369995 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125081928, win 3896, options [nop,nop,TS val 1084981522 ecr 1725941080,nop,nop,sack 1 {125089168:125092064}], length 0
14:05:32.370107 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125081928:125084824, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981522], length 2896
14:05:32.370148 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125084824, win 3874, options [nop,nop,TS val 1084981523 ecr 1725941081,nop,nop,sack 1 {125089168:125092064}], length 0
14:05:32.370296 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125084824:125089168, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981523], length 4344
14:05:32.370297 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125092064:125093512, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981523], length 1448
14:05:32.370345 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125092064, win 3862, options [nop,nop,TS val 1084981523 ecr 1725941081], length 0
14:05:32.370345 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125093512, win 3853, options [nop,nop,TS val 1084981523 ecr 1725941081], length 0
14:05:32.370440 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], seq 125093512:125094960, ack 28937, win 529, options [nop,nop,TS val 1725941081 ecr 1084981523], length 1448
14:05:32.370480 IP QA-Power8-4-kvm.qa.suse.de.45578 > 1c119.qa.suse.de.nfs: Flags [.], ack 125094960, win 3896, options [nop,nop,TS val 1084981523 ecr 1725941081], length 0
14:05:32.377555 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29708393:29709841, ack 751, win 505, options [nop,nop,TS val 3180811378 ecr 2119660204], length 1448
14:05:32.377757 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], seq 29709841:29711289, ack 751, win 505, options [nop,nop,TS val 3180811378 ecr 2119660224], length 1448

asked in #help-it-ama who 149.44.176.6 is. drodgriguez answered that it is https://api.suse.de, see the racktables entry https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=6198&hl_ip=149.44.176.6 . I see quite some https traffic from that host to QA-Power8-5-kvm.qa.suse.de; I guess it's AMQP. The dump above shows traffic to and from QA-Power8-4-kvm and QA-Power8-5-kvm, so why do I see it at all on powerqaworker-qam-1?

Trying an older kernel on powerqaworker-qam-1.qa:

sudo kexec --exec --load /boot/vmlinux-5.3.18-lp152.102-default --initrd=/boot/initrd-5.3.18-lp152.102-default --command-line="$(cat /proc/cmdline)"

Same results there, so no impact of the kernel. I asked in the SUSE-IT ticket.
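A per-destination packet count over such a capture makes the "why do I see traffic for QA-Power8-4/5 on powerqaworker-qam-1 at all" question quantitative. A sketch over tcpdump's text format (the here-doc holds abbreviated sample lines in the layout of the dump above; output of a live `tcpdump -i eth4 -l` can be piped in the same way):

```shell
# Count packets per destination host; in tcpdump's text output the
# destination is field 5 ("host.domain.port:"), so strip the colon and
# keep only the first dot-separated component.
count_destinations() {
  awk '{ sub(/:$/, "", $5); split($5, d, "."); print d[1] }' | sort | uniq -c | sort -rn
}

count_destinations <<'EOF'
14:05:32.357093 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [.], length 2896
14:05:32.357238 IP 149.44.176.6.https > QA-Power8-5-kvm.qa.suse.de.36422: Flags [P.], length 1448
14:05:32.369753 IP 1c119.qa.suse.de.nfs > QA-Power8-4-kvm.qa.suse.de.45578: Flags [.], length 1448
EOF
```

On a healthy switched port, nearly all counted destinations should be the capturing host itself (plus broadcast/multicast); large counts for sibling hosts would confirm the flooding suspicion.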

#26 Updated by okurz about 1 month ago

mdoucha reported in https://suse.slack.com/archives/C02CANHLANP/p1639646867388200 that PPC64LE jobs are failing again on MAX_SETUP_TIME and that again many instances are online. I did:

powerqaworker-qam-1 # systemctl mask --now openqa-worker-auto-restart@{3..6}
QA-Power8-4-kvm # systemctl mask --now openqa-worker-auto-restart@{4..8}
QA-Power8-5-kvm # systemctl mask --now openqa-worker-auto-restart@{4..8}

I called

for i in powerqaworker-qam-1 QA-Power8-4-kvm QA-Power8-5-kvm; do host=openqa.suse.de WORKER="$i" failed_since=2021-12-01 result="result='timeout_exceeded'" bash -ex openqa-advanced-retrigger-jobs; done

but found no jobs that were not already automatically restarted. I also called

for i in powerqaworker-qam-1 QA-Power8-4-kvm QA-Power8-5-kvm; do host=openqa.suse.de WORKER="$i" failed_since=2021-12-01 result="result='incomplete'" bash -ex openqa-advanced-retrigger-jobs; done

which apparently also did not restart any jobs, as they all miss a necessary asset.

EDIT: We observed failed systemd services because the corresponding "openqa-reload-worker-auto-restart" services now fail as the "openqa-worker-auto-restart" services are masked. So we also need to mask those (which I did now):

powerqaworker-qam-1 # systemctl mask --now openqa-reload-worker-auto-restart@{3..6} ; systemctl reset-failed
QA-Power8-4-kvm # systemctl mask --now openqa-reload-worker-auto-restart@{4..8} ; systemctl reset-failed
QA-Power8-5-kvm # systemctl mask --now openqa-reload-worker-auto-restart@{4..8} ; systemctl reset-failed
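Applying these mask/unmask steps by hand on three hosts is easy to get wrong, so a loop may help. A hedged sketch (host names and instance ranges are taken from this ticket; root ssh access is an assumption, and the leading `echo` keeps this as a dry run):

```shell
# Dry run: print one mask command per host; drop the leading "echo" to
# actually execute via ssh. Brace ranges stay unexpanded inside quotes
# and are expanded by the remote shell.
for spec in "powerqaworker-qam-1:{3..6}" "QA-Power8-4-kvm:{4..8}" "QA-Power8-5-kvm:{4..8}"; do
  host=${spec%%:*}
  range=${spec#*:}
  echo ssh "root@$host" \
    "systemctl mask --now openqa-worker-auto-restart@$range openqa-reload-worker-auto-restart@$range; systemctl reset-failed"
done
```

The same loop with `unmask` and `enable --now` would cover the rollback steps listed in the ticket description.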

#27 Updated by cdywan about 1 month ago

okurz wrote:

  • WAITING Ask users of other machines in the same rack if they have network problems, e.g. migration-smt2.qa.suse.de; ask migration team -> nsinger asked, waiting for a response

Did we find out if migration-smt2.qa.suse.de is affected?

  • Ask Eng Infra to give more members or the complete team of QE Tools ssh access to the switch, at least read-only access for monitoring. If Eng Infra does not know how to do that maybe nsinger can do it himself directly
  • Disable individual ports on the switch to check if that improves the situation for the power workers -> likely will not help as we assume the problem comes from outside the switch over the uplink
  • Conduct a network performance benchmark on the affected power hosts in a stripped-down environment with no other significant traffic. Note that we currently cannot reach the host powerqaworker-qam-1 via iperf or any other port from the other hosts.

Are we still waiting to get access to the switch?

#28 Updated by okurz about 1 month ago

cdywan wrote:

Are we still waiting to get access to the switch?

Well, I am still waiting for access to the switch; nicksinger has access

#29 Updated by nicksinger about 1 month ago

okurz wrote:

cdywan wrote:

Are we still waiting to get access to the switch?

Well, I am still waiting for access to the switch; nicksinger has access

Could you please try if ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@qanet13nue.qa.suse.de lets you in?
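The legacy key-exchange option can also be made persistent so that a plain `ssh qanet13nue.qa.suse.de` works; a sketch of a ~/.ssh/config entry (the `User admin` line mirrors the command above; on switch firmware this old, additional legacy cipher or host-key options may turn out to be needed as well):

```
Host qanet13nue.qa.suse.de
    User admin
    KexAlgorithms +diffie-hellman-group1-sha1
```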

#30 Updated by okurz about 1 month ago

nicksinger wrote:

Could you please try if ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@qanet13nue.qa.suse.de lets you in?

Well, ssh "lets me in", but then I am asked for "User Name:", so I guess the answer is "yes" up to this point

#31 Updated by szarate about 1 month ago

  • Related to action #104106: [qe-core] test fails in await_install - Network peformace for ppc installations is decreasing size:S added

#32 Updated by nicksinger about 1 month ago

okurz wrote:

nicksinger wrote:

Could you please try if ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@qanet13nue.qa.suse.de lets you in?

Well, ssh "lets me in", but then I am asked for "User Name:", so I guess the answer is "yes" up to this point

ok so apparently it didn't work as I expected. Unfortunately the IOS version on the switch is quite old and I can only find guides for more modern versions. I will send you the password on Slack so you can at least log in manually.

#33 Updated by nicksinger about 1 month ago

I talked to gschlotter regarding https://sd.suse.com/servicedesk/customer/portal/1/SD-67703 - copying the current plan (for everybody who can not access this ticket):

I had some brainstorming with Nick.
on Monday I will be in the server room and will connect one of these servers with a new cable to a different switch.
Nick will test if this solves the situation; if yes, he will be in the office with Matthias on Tuesday and recable these servers.

#34 Updated by cdywan about 1 month ago

What happened since the last episode

  • Nick took over from Marius
  • Oli compiled an extensive report of ideas and investigation attempts
  • Nothing seemingly happened for two weeks
  • Stakeholders are seeing problems again
  • We still don't know whether it's maybe just a kink in the ethernet cable

Ideas for improvement

  • We could have implemented work-arounds sooner
    • Consider getting access to another machine as a temporary replacement
  • Was the infra ticket updated / visible?
    • Comments should have been added to clarify changes
  • Due date set to 2021-12-29
    • We should have been keeping up with updates?

#35 Updated by okurz about 1 month ago

  • Description updated (diff)

#36 Updated by okurz about 1 month ago

  • Description updated (diff)

#37 Updated by nicksinger about 1 month ago

Gerhard replugged qa-power8-4 into qanet10 port 8. I ran iperf3 but saw no improvement:

QA-Power8-4-kvm:~ # iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 2620:113:80c0:80a0:10:162:29:60f port 36854 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  60.6 MBytes   509 Mbits/sec    7    252 KBytes
[  5]   1.00-2.00   sec  46.8 MBytes   392 Mbits/sec    3    212 KBytes
[  5]   2.00-3.00   sec  45.5 MBytes   381 Mbits/sec    2    177 KBytes
[  5]   3.00-4.00   sec  35.7 MBytes   300 Mbits/sec    2    187 KBytes
[  5]   4.00-5.00   sec  51.8 MBytes   435 Mbits/sec    4    286 KBytes
[  5]   5.00-6.00   sec  50.5 MBytes   424 Mbits/sec    3    212 KBytes
[  5]   6.00-7.00   sec  60.4 MBytes   506 Mbits/sec    1    308 KBytes
[  5]   7.00-8.00   sec  44.4 MBytes   372 Mbits/sec    5    180 KBytes
[  5]   8.00-9.00   sec  52.8 MBytes   443 Mbits/sec    1    271 KBytes
[  5]   9.00-10.00  sec  44.4 MBytes   372 Mbits/sec    0    351 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   493 MBytes   413 Mbits/sec   28             sender
[  5]   0.00-10.00  sec   490 MBytes   411 Mbits/sec                  receiver

iperf Done.
QA-Power8-4-kvm:~ # iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:29:60f port 36880 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   344 KBytes  2.82 Mbits/sec
[  5]   1.00-2.00   sec   413 KBytes  3.38 Mbits/sec
[  5]   2.00-3.00   sec   520 KBytes  4.26 Mbits/sec
[  5]   3.00-4.00   sec   370 KBytes  3.03 Mbits/sec
[  5]   4.00-5.00   sec   342 KBytes  2.80 Mbits/sec
[  5]   5.00-6.00   sec   301 KBytes  2.47 Mbits/sec
[  5]   6.00-7.00   sec   322 KBytes  2.64 Mbits/sec
[  5]   7.00-8.00   sec   248 KBytes  2.03 Mbits/sec
[  5]   8.00-9.00   sec   522 KBytes  4.27 Mbits/sec
[  5]   9.00-10.00  sec   457 KBytes  3.75 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  3.84 MBytes  3.22 Mbits/sec  816             sender
[  5]   0.00-10.00  sec  3.75 MBytes  3.14 Mbits/sec                  receiver

iperf Done.
QA-Power8-4-kvm:~ # iperf3 -4 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 10.162.6.201 port 60458 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   406 KBytes  3.32 Mbits/sec
[  5]   1.00-2.00   sec   276 KBytes  2.26 Mbits/sec
[  5]   2.00-3.00   sec   421 KBytes  3.45 Mbits/sec
[  5]   3.00-4.00   sec   568 KBytes  4.66 Mbits/sec
[  5]   4.00-5.00   sec   462 KBytes  3.79 Mbits/sec
[  5]   5.00-6.00   sec   352 KBytes  2.88 Mbits/sec
[  5]   6.00-7.00   sec   588 KBytes  4.82 Mbits/sec
[  5]   7.00-8.00   sec   373 KBytes  3.06 Mbits/sec
[  5]   8.00-9.00   sec   454 KBytes  3.72 Mbits/sec
[  5]   9.00-10.00  sec   423 KBytes  3.46 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.33 MBytes  3.63 Mbits/sec  880             sender
[  5]   0.00-10.00  sec  4.22 MBytes  3.54 Mbits/sec                  receiver

iperf Done.
QA-Power8-4-kvm:~ # iperf3 -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 10.162.6.201 port 60496 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  47.9 MBytes   402 Mbits/sec   28    174 KBytes
[  5]   1.00-2.00   sec  60.8 MBytes   510 Mbits/sec    5    198 KBytes
[  5]   2.00-3.00   sec  66.9 MBytes   561 Mbits/sec    3    167 KBytes
[  5]   3.00-4.00   sec  56.8 MBytes   476 Mbits/sec    4    130 KBytes
[  5]   4.00-5.00   sec  45.0 MBytes   378 Mbits/sec    2    161 KBytes
[  5]   5.00-6.00   sec  42.8 MBytes   359 Mbits/sec    2    187 KBytes
[  5]   6.00-7.00   sec  76.0 MBytes   638 Mbits/sec    2    182 KBytes
[  5]   7.00-8.00   sec  65.0 MBytes   545 Mbits/sec    4    150 KBytes
[  5]   8.00-9.00   sec  39.7 MBytes   333 Mbits/sec   50    315 KBytes
[  5]   9.00-10.00  sec  40.5 MBytes   339 Mbits/sec    7    117 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   541 MBytes   454 Mbits/sec  107             sender
[  5]   0.00-10.00  sec   539 MBytes   452 Mbits/sec                  receiver

iperf Done.

#38 Updated by nicksinger about 1 month ago

Gerhard also replugged another port of that machine. Apparently this brought some improvement, but it is still far too low:

nsinger@QA-Power8-4-kvm:~> iperf3 -6 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 2620:113:80c0:80a0:10:162:29:60f port 48730 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  91.4 MBytes   767 Mbits/sec  127    290 KBytes
[  5]   1.00-2.00   sec  87.6 MBytes   735 Mbits/sec    5    343 KBytes
[  5]   2.00-3.00   sec  85.9 MBytes   720 Mbits/sec   30    319 KBytes
[  5]   3.00-4.00   sec  95.6 MBytes   802 Mbits/sec    5    278 KBytes
[  5]   4.00-5.00   sec  91.3 MBytes   765 Mbits/sec    8    424 KBytes
[  5]   5.00-6.00   sec  81.9 MBytes   687 Mbits/sec   32    282 KBytes
[  5]   6.00-7.00   sec  71.5 MBytes   599 Mbits/sec   34    351 KBytes
[  5]   7.00-8.00   sec  87.2 MBytes   732 Mbits/sec    0    449 KBytes
[  5]   8.00-9.00   sec  88.5 MBytes   742 Mbits/sec    5    300 KBytes
[  5]   9.00-10.00  sec  57.4 MBytes   482 Mbits/sec   14    332 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   838 MBytes   703 Mbits/sec  260             sender
[  5]   0.00-10.00  sec   835 MBytes   700 Mbits/sec                  receiver

iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 10.162.6.201 port 44070 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  84.2 MBytes   707 Mbits/sec   71    413 KBytes
[  5]   1.00-2.00   sec  96.9 MBytes   813 Mbits/sec   43    325 KBytes
[  5]   2.00-3.00   sec  92.8 MBytes   778 Mbits/sec    0    438 KBytes
[  5]   3.00-4.00   sec  72.2 MBytes   606 Mbits/sec    4    211 KBytes
[  5]   4.00-5.00   sec  60.0 MBytes   504 Mbits/sec    0    344 KBytes
[  5]   5.00-6.00   sec  87.3 MBytes   732 Mbits/sec   92    204 KBytes
[  5]   6.00-7.00   sec  58.7 MBytes   492 Mbits/sec   25    259 KBytes
[  5]   7.00-8.00   sec  77.2 MBytes   648 Mbits/sec   52    287 KBytes
[  5]   8.00-9.00   sec  71.3 MBytes   598 Mbits/sec    0    387 KBytes
[  5]   9.00-10.00  sec  76.3 MBytes   640 Mbits/sec    0    482 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   777 MBytes   652 Mbits/sec  287             sender
[  5]   0.00-10.00  sec   775 MBytes   650 Mbits/sec                  receiver

iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -R -6 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:29:60f port 48788 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.25 MBytes  10.5 Mbits/sec
[  5]   1.00-2.00   sec   761 KBytes  6.24 Mbits/sec
[  5]   2.00-3.00   sec   968 KBytes  7.93 Mbits/sec
[  5]   3.00-4.00   sec  1.45 MBytes  12.2 Mbits/sec
[  5]   4.00-5.00   sec   877 KBytes  7.19 Mbits/sec
[  5]   5.00-6.00   sec   170 KBytes  1.39 Mbits/sec
[  5]   6.00-7.00   sec   828 KBytes  6.79 Mbits/sec
[  5]   7.00-8.00   sec   841 KBytes  6.89 Mbits/sec
[  5]   8.00-9.00   sec  1.65 MBytes  13.8 Mbits/sec
[  5]   9.00-10.00  sec   965 KBytes  7.91 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  9.75 MBytes  8.18 Mbits/sec  1177             sender
[  5]   0.00-10.00  sec  9.63 MBytes  8.08 Mbits/sec                  receiver

iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -R -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 10.162.6.201 port 44118 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.39 MBytes  11.6 Mbits/sec
[  5]   1.00-2.00   sec  1.42 MBytes  11.9 Mbits/sec
[  5]   2.00-3.00   sec  1.04 MBytes  8.71 Mbits/sec
[  5]   3.00-4.00   sec  1.29 MBytes  10.8 Mbits/sec
[  5]   4.00-5.00   sec  1.27 MBytes  10.7 Mbits/sec
[  5]   5.00-6.00   sec  1.91 MBytes  16.1 Mbits/sec
[  5]   6.00-7.00   sec  1.16 MBytes  9.72 Mbits/sec
[  5]   7.00-8.00   sec   578 KBytes  4.74 Mbits/sec
[  5]   8.00-9.00   sec  1.41 MBytes  11.8 Mbits/sec
[  5]   9.00-10.00  sec  1.35 MBytes  11.3 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.9 MBytes  10.8 Mbits/sec  799             sender
[  5]   0.00-10.00  sec  12.8 MBytes  10.7 Mbits/sec                  receiver

iperf Done.

I will be in the office today to test whether a direct connection with my notebook gives better speeds. Hopefully I can pull this experiment off so we can exclude problems with any hardware (e.g. switch, router) in between.

#39 Updated by okurz about 1 month ago

Please keep the observation from #102882#note-24 in mind regarding the large increase in traffic we saw. I don't think it helps at this point to simply plug the machines in elsewhere without making sure that this traffic goes away, e.g. by unplugging other equipment, the uplink, etc.

#40 Updated by okurz about 1 month ago

  • Due date changed from 2021-12-29 to 2022-01-28

#41 Updated by nicksinger about 1 month ago

So here are my results of several switch ports I tested in srv2 (and the qalab) with my notebook:

back2back with power8-4:

nsinger@QA-Power8-4-kvm:~> iperf3 -c 192.168.0.1
Connecting to host 192.168.0.1, port 5201
[  5] local 192.168.0.106 port 48382 connected to 192.168.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   958 Mbits/sec    0    379 KBytes
[  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec    0    379 KBytes
[  5]   2.00-3.00   sec   112 MBytes   937 Mbits/sec    0    379 KBytes
[  5]   3.00-4.00   sec   113 MBytes   946 Mbits/sec    0    379 KBytes
[  5]   4.00-5.00   sec   112 MBytes   939 Mbits/sec    0    399 KBytes
[  5]   5.00-6.00   sec   112 MBytes   942 Mbits/sec    0    399 KBytes
[  5]   6.00-7.00   sec   112 MBytes   943 Mbits/sec    0    399 KBytes
[  5]   7.00-8.00   sec   112 MBytes   943 Mbits/sec    0    399 KBytes
[  5]   8.00-9.00   sec   112 MBytes   942 Mbits/sec    0    399 KBytes
[  5]   9.00-10.00  sec   112 MBytes   937 Mbits/sec    0    399 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver

iperf Done.
nsinger@QA-Power8-4-kvm:~> iperf3 -R -c 192.168.0.1
Connecting to host 192.168.0.1, port 5201
Reverse mode, remote host 192.168.0.1 is sending
[  5] local 192.168.0.106 port 48386 connected to 192.168.0.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   111 MBytes   934 Mbits/sec
[  5]   1.00-2.00   sec   111 MBytes   934 Mbits/sec
[  5]   2.00-3.00   sec   111 MBytes   934 Mbits/sec
[  5]   3.00-4.00   sec   111 MBytes   934 Mbits/sec
[  5]   4.00-5.00   sec   111 MBytes   934 Mbits/sec
[  5]   5.00-6.00   sec   111 MBytes   934 Mbits/sec
[  5]   6.00-7.00   sec   111 MBytes   934 Mbits/sec
[  5]   7.00-8.00   sec   111 MBytes   934 Mbits/sec
[  5]   8.00-9.00   sec   111 MBytes   934 Mbits/sec
[  5]   9.00-10.00  sec   111 MBytes   934 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   936 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver

iperf from notebook connected to qanet10nue (srv2, located next to the rack of power8-4):

selenium ~ » iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 2620:113:80c0:80a0:10:162:2d:4d65 port 38898 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   110 MBytes   925 Mbits/sec    0   1017 KBytes
[  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec    0   1.27 MBytes
[  5]   2.00-3.00   sec   109 MBytes   912 Mbits/sec    0   1.55 MBytes
[  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0   2.24 MBytes
[  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0   2.36 MBytes
[  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec    0   2.47 MBytes
[  5]   6.00-7.00   sec   109 MBytes   912 Mbits/sec    0   2.61 MBytes
[  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0   2.61 MBytes
[  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0   2.74 MBytes
[  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec    0   2.74 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.07 GBytes   921 Mbits/sec    0             sender
[  5]   0.00-10.01  sec  1.07 GBytes   918 Mbits/sec                  receiver

iperf Done.
selenium ~ » iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:2d:4d65 port 38902 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   745 KBytes  6.10 Mbits/sec
[  5]   1.00-2.00   sec   998 KBytes  8.18 Mbits/sec
[  5]   2.00-3.00   sec  1.10 MBytes  9.27 Mbits/sec
[  5]   3.00-4.00   sec  1.11 MBytes  9.31 Mbits/sec
[  5]   4.00-5.00   sec  1.10 MBytes  9.22 Mbits/sec
[  5]   5.00-6.00   sec  1.09 MBytes  9.15 Mbits/sec
[  5]   6.00-7.00   sec  1.10 MBytes  9.24 Mbits/sec
[  5]   7.00-8.00   sec  1.10 MBytes  9.27 Mbits/sec
[  5]   8.00-9.00   sec  1.10 MBytes  9.22 Mbits/sec
[  5]   9.00-10.00  sec   679 KBytes  5.56 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.2 MBytes  8.53 Mbits/sec  981             sender
[  5]   0.00-10.00  sec  10.1 MBytes  8.45 Mbits/sec                  receiver

iperf Done.

iperf from notebook connected to qanet13nue (srv2, where power8-4 is originally connected to):

selenium ~ » iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 2620:113:80c0:80a0:10:162:2c:985 port 45236 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   103 MBytes   863 Mbits/sec    0   1.41 MBytes
[  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec    0   1.80 MBytes
[  5]   2.00-3.00   sec   108 MBytes   902 Mbits/sec   28   1.40 MBytes
[  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0   1.52 MBytes
[  5]   4.00-5.00   sec  98.8 MBytes   828 Mbits/sec  490   82.3 KBytes
[  5]   5.00-6.00   sec  85.0 MBytes   713 Mbits/sec    0    356 KBytes
[  5]   6.00-7.00   sec  93.8 MBytes   786 Mbits/sec  208    222 KBytes
[  5]   7.00-8.00   sec  93.8 MBytes   786 Mbits/sec    0    416 KBytes
[  5]   8.00-9.00   sec  95.0 MBytes   797 Mbits/sec    0    469 KBytes
[  5]   9.00-10.00  sec   101 MBytes   849 Mbits/sec    0    494 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   998 MBytes   837 Mbits/sec  726             sender
[  5]   0.00-10.01  sec   995 MBytes   834 Mbits/sec                  receiver

iperf Done.
selenium ~ » iperf3 -R -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:2c:985 port 45240 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.15 MBytes  9.64 Mbits/sec
[  5]   1.00-2.00   sec  2.12 MBytes  17.8 Mbits/sec
[  5]   2.00-3.00   sec  1.18 MBytes  9.93 Mbits/sec
[  5]   3.00-4.00   sec  1.32 MBytes  11.0 Mbits/sec
[  5]   4.00-5.00   sec  1.17 MBytes  9.82 Mbits/sec
[  5]   5.00-6.00   sec   890 KBytes  7.29 Mbits/sec
[  5]   6.00-7.00   sec   636 KBytes  5.21 Mbits/sec
[  5]   7.00-8.00   sec  1.43 MBytes  12.0 Mbits/sec
[  5]   8.00-9.00   sec   945 KBytes  7.75 Mbits/sec
[  5]   9.00-10.00  sec  1.04 MBytes  8.69 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.1 MBytes  10.2 Mbits/sec  1094             sender
[  5]   0.00-10.00  sec  11.8 MBytes  9.91 Mbits/sec                  receiver

iperf Done.

iperf from notebook connected to qanet15nue (srv2, another switch close to power8-4):

selenium ~ » iperf3 -R -c openqa.suse.de                                                                         130 ↵
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 2620:113:80c0:80a0:10:162:2e:3a8a port 53490 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  6.35 MBytes  53.3 Mbits/sec
[  5]   1.00-2.00   sec  6.58 MBytes  55.2 Mbits/sec
[  5]   2.00-3.00   sec  7.65 MBytes  64.2 Mbits/sec
[  5]   3.00-4.00   sec  5.88 MBytes  49.3 Mbits/sec
[  5]   4.00-5.00   sec  6.19 MBytes  51.9 Mbits/sec
[  5]   5.00-6.00   sec  7.65 MBytes  64.2 Mbits/sec
[  5]   6.00-7.00   sec  5.79 MBytes  48.5 Mbits/sec
[  5]   7.00-8.00   sec  8.21 MBytes  68.9 Mbits/sec
[  5]   8.00-9.00   sec  7.08 MBytes  59.4 Mbits/sec
[  5]   9.00-10.00  sec  6.19 MBytes  51.9 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.01  sec  67.8 MBytes  56.8 Mbits/sec  7751             sender
[  5]   0.00-10.00  sec  67.6 MBytes  56.7 Mbits/sec                  receiver

iperf Done.
selenium ~ » iperf3 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 2620:113:80c0:80a0:10:162:2e:3a8a port 53494 connected to 2620:113:80c0:8080:10:160:0:207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  93.6 MBytes   785 Mbits/sec    6   1.15 MBytes
[  5]   1.00-2.00   sec  92.5 MBytes   776 Mbits/sec    0   1.26 MBytes
[  5]   2.00-3.00   sec   105 MBytes   881 Mbits/sec    0   1.36 MBytes
[  5]   3.00-4.00   sec   109 MBytes   912 Mbits/sec    0   1.41 MBytes
[  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0   1.49 MBytes
[  5]   5.00-6.00   sec   108 MBytes   902 Mbits/sec    0   1.53 MBytes
[  5]   6.00-7.00   sec   102 MBytes   860 Mbits/sec    0   1.55 MBytes
[  5]   7.00-8.00   sec  86.2 MBytes   723 Mbits/sec   89   1.11 MBytes
[  5]   8.00-9.00   sec   104 MBytes   870 Mbits/sec    0   1.19 MBytes
[  5]   9.00-10.00  sec   101 MBytes   849 Mbits/sec    0   1.24 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1007 MBytes   845 Mbits/sec   95             sender
[  5]   0.00-10.02  sec  1004 MBytes   841 Mbits/sec                  receiver

iperf Done.
selenium ~ » iperf3 -R -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 10.162.29.76 port 51332 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  6.51 MBytes  54.6 Mbits/sec
[  5]   1.00-2.00   sec  7.45 MBytes  62.5 Mbits/sec
[  5]   2.00-3.00   sec  6.35 MBytes  53.2 Mbits/sec
[  5]   3.00-4.00   sec  6.49 MBytes  54.4 Mbits/sec
[  5]   4.00-5.00   sec  5.94 MBytes  49.8 Mbits/sec
[  5]   5.00-6.00   sec  7.34 MBytes  61.6 Mbits/sec
[  5]   6.00-7.00   sec  5.11 MBytes  42.9 Mbits/sec
[  5]   7.00-8.00   sec  6.18 MBytes  51.8 Mbits/sec
[  5]   8.00-9.00   sec  6.36 MBytes  53.3 Mbits/sec
[  5]   9.00-10.00  sec  6.12 MBytes  51.3 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  64.1 MBytes  53.8 Mbits/sec  7306             sender
[  5]   0.00-10.00  sec  63.8 MBytes  53.6 Mbits/sec                  receiver

iperf Done.

iperf from notebook connected to qanet03nue (switch in the big qalab):

selenium ~ » iperf3 -R -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
Reverse mode, remote host openqa.suse.de is sending
[  5] local 10.162.29.76 port 51336 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  91.2 MBytes   765 Mbits/sec
[  5]   1.00-2.00   sec  98.7 MBytes   828 Mbits/sec
[  5]   2.00-3.00   sec  95.8 MBytes   804 Mbits/sec
[  5]   3.00-4.00   sec  93.5 MBytes   785 Mbits/sec
[  5]   4.00-5.00   sec  98.4 MBytes   826 Mbits/sec
[  5]   5.00-6.00   sec  97.7 MBytes   820 Mbits/sec
[  5]   6.00-7.00   sec   105 MBytes   879 Mbits/sec
[  5]   7.00-8.00   sec  97.1 MBytes   815 Mbits/sec
[  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec
[  5]   9.00-10.00  sec   101 MBytes   850 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   988 MBytes   829 Mbits/sec  1129             sender
[  5]   0.00-10.00  sec   985 MBytes   826 Mbits/sec                  receiver

iperf Done.
selenium ~ » iperf3 -4 -c openqa.suse.de
Connecting to host openqa.suse.de, port 5201
[  5] local 10.162.29.76 port 51340 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  80.5 MBytes   675 Mbits/sec    7    255 KBytes
[  5]   1.00-2.00   sec  81.2 MBytes   682 Mbits/sec    0    421 KBytes
[  5]   2.00-3.00   sec  85.0 MBytes   713 Mbits/sec    0    503 KBytes
[  5]   3.00-4.00   sec  75.0 MBytes   629 Mbits/sec    5    296 KBytes
[  5]   4.00-5.00   sec  71.2 MBytes   598 Mbits/sec    0    426 KBytes
[  5]   5.00-6.00   sec  67.5 MBytes   566 Mbits/sec    6    191 KBytes
[  5]   6.00-7.00   sec  50.0 MBytes   419 Mbits/sec    0    331 KBytes
[  5]   7.00-8.00   sec  66.2 MBytes   556 Mbits/sec    0    441 KBytes
[  5]   8.00-9.00   sec  65.0 MBytes   545 Mbits/sec    0    519 KBytes
[  5]   9.00-10.00  sec  62.5 MBytes   524 Mbits/sec    0    581 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   704 MBytes   591 Mbits/sec   18             sender
[  5]   0.00-10.00  sec   702 MBytes   589 Mbits/sec                  receiver

iperf Done.

With all these tests I can conclude:

  1. The machine itself is not misconfigured and is perfectly able to deliver 1 Gbit/s in both directions
  2. Several switches in srv2 are affected by the performance loss
  3. The QA VLAN itself does not cause the performance loss (a switch in the qalab, i.e. a different location, performs fine)
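The summaries above already quantify conclusion 2: the slow path is dominated by retransmissions. A minimal sketch (values copied from the iperf3 download summaries in this comment: 67.8 MBytes with 7751 retransmits via the srv2 switch vs. 988 MBytes with 1129 retransmits via the qalab switch) comparing retransmit density per path:

```shell
# Retransmit density from the iperf3 "-R" summary lines above.
# Values are copied from this comment, not re-measured.
awk 'BEGIN {
  printf "srv2 : %.1f retransmits/MByte\n", 7751 / 67.8
  printf "qalab: %.1f retransmits/MByte\n", 1129 / 988
}'
```

Roughly two orders of magnitude more retransmits per transferred megabyte on the srv2 path points at packet loss on that path rather than at the endpoints.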

I'd suggest we map out how these switches are interconnected. I suspect that several switches in srv2 are daisy-chained and that one switch in the chain is misbehaving. I will try to come up with a diagram showing how the switches are connected. With a better overview we can start debugging, e.g. by comparing configurations or re-plugging the uplinks of individual switches.
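To bootstrap such a diagram, LLDP neighbor data collected on each switch (e.g. via `lldpcli show neighbors` where lldpd or the switch firmware exposes it) could be folded into a Graphviz edge list. This is only a sketch; the neighbor pairs below are hypothetical placeholders, not actual qanet cabling data:

```shell
#!/bin/sh
# Turn "local-switch remote-switch" neighbor pairs into a DOT graph.
# The pairs below are HYPOTHETICAL placeholders; in practice they would
# come from LLDP output gathered on each switch.
neighbors='qanet14nue qanet-core
qanet15nue qanet14nue
qanet03nue qanet-core'
{
  echo 'graph qa_switches {'
  echo "$neighbors" | awk '{ printf "  \"%s\" -- \"%s\";\n", $1, $2 }'
  echo '}'
}
```

Rendering the result with `dot -Tpng` would immediately show which switches sit behind a shared, possibly faulty, uplink.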

#42 Updated by okurz about 1 month ago

You haven't mentioned the increase in "unexpected traffic". Do you see any relation between it and the measurements you conducted?

#43 Updated by okurz 4 days ago

  • Status changed from In Progress to Feedback
  • Priority changed from High to Normal

gschlotter will check the daisy-chained core switches. The current hypothesis is that at least one of them is misbehaving and causing the problems. nsinger told us that gschlotter has read the ticket, so we assume he is aware of the "unexpected traffic". We therefore expect an update from gschlotter within the next days.
