action #122668

closed

Flaky network connection from SUSE Nbg Frankencampus for okurz and mkittler size:M

Added by okurz almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
Start date:
2023-01-03
Due date:
% Done:

0%

Estimated time:

Description

Observation

On 2023-01-03 I was in the SUSE Nbg Frankencampus office, connected with my notebook over WiFi, IPv4 only as apparently IPv6 is not served at Frankencampus. During my work day I had multiple complete connection drops, each about 2m long. These caused outages during video conferences, problems connecting to https://etherpad.opensuse.org/p/suse_qe_tools, dropped SSH connections and interrupted online music, which is maybe the best "monitoring" approach for me to notice when there are problems. By trying to fix this problem I hope to also become better at debugging network problems, which we also have in other cases, e.g. see #107062

Acceptance criteria

  • AC1: No obvious 2min network outages for okurz in Frankencampus office
  • AC2: Someone has learned something about how to debug network problems

Suggestions

  • Run mtr to various connection endpoints to narrow down the problem
  • Check system log files and openVPN logs
  • Document findings in the ticket how I narrowed down problems
  • Find "industry best practices" how to debug such problems
  • Crosscheck from a different network location, e.g. homeoffice
  • Confirm that it's not just okurz having misconfigured things
  • Reach out to IT to get the problem fixed
Actions #1

Updated by okurz almost 2 years ago

output from sudo mtr --interval .1 --show-ips s390zp18.suse.de:

linux-28d7 (10.163.25.206)                          2023-01-03T15:09:08+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                        Packets               Pings
 Host                                 Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 10.163.24.1 (10.163.24.1)    97.4% 41201  7478.  92.0   2.0 9784. 731.6
    10.162.196.251 (10.162.196.251)
    10.163.40.1 (10.163.40.1)
 2. 10.162.196.250 (10.162.196.250)    73.2% 41201  7779.   7.0   2.6 7779. 137.3
    10.163.40.1 (10.163.40.1)                                                                                                                                                                                                                            
    10.162.196.251 (10.162.196.251)                                                                                                                                                                                                                      
 3. s390zp18.suse.de (10.161.159.123)    73.2% 41201  9483.   8.1   2.2 9483. 181.8
    10.162.196.251 (10.162.196.251)
    10.163.40.1 (10.163.40.1)

So there is a significant loss of packets. But maybe this is due to prioritization of UDP/TCP over ICMP packets.

Then I tried sudo watch mtr --tcp --port 22 --show-ips s390zp18.suse.de, which often fails with "mtr: address in use". This is a known upstream issue, https://github.com/traviscross/mtr/issues/338, in our mtr version. It is fixed in a more recent version which is not available on openSUSE Leap 15.4. As a workaround I used while :; do sudo mtr --report --report-cycles 20 --tcp --port 22 --show-ips s390zp18.suse.de; done instead, which shows the 2m outages:

Start: 2023-01-03T15:00:35+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.163.40.1                0.0%    20    2.7   7.0   2.7  79.7  17.1
  2.|-- 10.162.196.251             0.0%    20    4.1   4.1   3.3   4.7   0.4
  3.|-- s390zp18.suse.de (10.161.  0.0%    20    3.5   3.6   3.1   4.3   0.4
Start: 2023-01-03T15:00:59+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.163.40.1                0.0%    20    4.0   3.3   2.7   4.0   0.4
  2.|-- 10.162.196.251             0.0%    20    4.5   4.3   3.4   5.6   0.6
  3.|-- s390zp18.suse.de (10.161.  0.0%    20    4.7   4.0   3.1   7.8   1.0
Start: 2023-01-03T15:01:23+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.163.40.1                0.0%    20    3.5   3.5   2.6   7.4   1.0
  2.|-- 10.162.196.251             0.0%    20    3.8   5.4   3.4  26.1   4.9
  3.|-- s390zp18.suse.de (10.161.  0.0%    20    4.3   3.8   3.2   4.9   0.4
Start: 2023-01-03T15:01:48+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
Start: 2023-01-03T15:02:14+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
Start: 2023-01-03T15:02:39+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
Start: 2023-01-03T15:03:05+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
Start: 2023-01-03T15:03:30+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.163.40.1               15.0%    20    3.3 2112.   3.0 7271. 3069.1
  2.|-- 10.162.196.251             0.0%    20    4.6 1781.   3.4 7172. 2891.0
  3.|-- s390zp18.suse.de (10.161.  0.0%    20    4.5 1804.   2.8 7308. 2934.5
Start: 2023-01-03T15:03:55+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.163.40.1                0.0%    20    2.9   3.6   2.9   6.8   0.8
  2.|-- 10.162.196.251             0.0%    20    4.6   4.8   3.5   8.6   1.4
  3.|-- s390zp18.suse.de (10.161.  0.0%    20    3.4   3.8   3.1   4.5   0.4
Start: 2023-01-03T15:04:20+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.163.40.1                0.0%    20    3.2   3.2   2.4   3.7   0.3
  2.|-- 10.162.196.251             0.0%    20    5.0   4.7   3.7   7.4   0.9
  3.|-- s390zp18.suse.de (10.161.  0.0%    20    3.2   7.7   3.2  75.4  16.0

As visible, many runs are fine without loss and then for 2m there is no response at all. So this is not "sporadic packet loss": at times there is no network connection at all, apparently not even to intermediate hosts. One might still suspect that this is only a problem when trying to reach other SUSE internal hosts over VPN. So to crosscheck, sudo mtr --show-ips 1.1.1.1 shows:

                          My traceroute  [v0.92]
linux-28d7 (192.168.43.39)                          2023-01-03T15:13:37+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                        Packets               Pings
 Host                                 Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 192.168.43.252 (192.168.43.252)    5.0%  1910    2.2   1.8   1.1  78.0   3.5
 2. ae1-432.nbg30.core-backbone.com (81.95.8.241)    5.0%  1910    3.0   2.5   1.7  77.0   4.4
 3. ae15-2029.fra30.core-backbone.com (81.95.15.70)    5.0%  1909    5.9   5.5   4.8  85.2   3.3
 4. 162.158.84.62 (162.158.84.62)    5.0%  1909    9.6   8.3   5.3  72.4   6.2
 5. 172.71.244.5 (172.71.244.5)    5.0%  1909    5.7   8.3   4.9 166.0   8.5
 6. one.one.one.one (1.1.1.1)    5.0%  1909    6.0   5.8   4.8  89.2   5.0

One can see that there is significant packet loss. What cannot be seen here is that it is not "sporadic packet loss" but happens reproducibly only during the 2m outage windows. I guess a continuous ping 1.1.1.1 would show this more clearly.
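The stream of reports from the mtr --report loop earlier in this comment gets long quickly, and the interesting runs are the ones with no hop lines at all. A minimal sketch of how that output could be condensed to one line per report, assuming the report layout shown above (seven numeric columns at the end of each hop line). The function name summarize_mtr_reports is made up for illustration and is not part of mtr:

```shell
#!/bin/sh
# Hypothetical helper (not part of mtr): condense concatenated
# `mtr --report` output read from stdin to one line per report.
# Assumes the column layout shown in this ticket:
#   ... Loss% Snt Last Avg Best Wrst StDev
summarize_mtr_reports() {
    awk '
        # Print the summary for the report collected so far, if any.
        function emit() {
            if (start == "") return
            status = (hops == 0) ? "NO-RESPONSE" : (worst > 0 ? "LOSS" : "ok")
            printf "%s hops=%d worst_loss=%.1f %s\n", start, hops, worst, status
        }
        # A new report begins: flush the previous one and reset counters.
        /^Start:/ { emit(); start = $2; hops = 0; worst = 0; next }
        # Hop lines contain "|--"; the loss column is 7th from the end,
        # which also works when the host name contains a space.
        /\|--/ {
            hops++
            loss = $(NF - 6)
            sub(/%/, "", loss)
            if (loss + 0 > worst) worst = loss + 0
        }
        END { emit() }
    '
}
```

It could be fed from the same loop, e.g. by appending | summarize_mtr_reports after the done. Reports marked NO-RESPONSE correspond to the runs above that printed only the Start: and HOST: lines.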

From journalctl:

Jan 03 15:00:03 linux-28d7 autossh[20344]: starting ssh (count 1404)
Jan 03 15:00:03 linux-28d7 autossh[20344]: ssh child pid is 542
Jan 03 15:00:03 linux-28d7 autossh[20344]: ssh exited with error status 255; restarting ssh
…
Jan 03 15:00:17 linux-28d7 NetworkManager[1833]: <info>  [1672754417.6927] dnsmasq: starting /usr/sbin/dnsmasq
Jan 03 15:00:17 linux-28d7 NetworkManager[566]: dnsmasq: failed to create listening socket for 127.0.0.1: Address already in use
Jan 03 15:00:17 linux-28d7 dnsmasq[566]: failed to create listening socket for 127.0.0.1: Address already in use
…
Jan 03 15:00:17 linux-28d7 dnsmasq[570]: failed to create listening socket for 127.0.0.1: Address already in use
Jan 03 15:00:17 linux-28d7 dnsmasq[570]: FAILED to start up
Jan 03 15:00:17 linux-28d7 NetworkManager[1833]: <warn>  [1672754417.7184] dnsmasq: spawn: dnsmasq process 570 exited with error: Network access problem (address in use, permissions) (2)
Jan 03 15:00:17 linux-28d7 NetworkManager[1833]: <warn>  [1672754417.7185] dnsmasq[0bf46e72d728a5a0]: dnsmasq dies and gets respawned too quickly. Back off. Something is very wrong
…
Jan 03 15:03:29 linux-28d7 syncthing[1899]: [YD5JO] INFO: Joined relay relay://91.4.156.131:22067
…
Jan 03 15:03:35 linux-28d7 NetworkManager[1833]: <info>  [1672754615.7009] manager: NetworkManager state is now CONNECTED_GLOBAL

So NetworkManager tries to start dnsmasq, which fails because I already run dnsmasq separately, following https://wiki.suse.net/index.php/Services_Team/Policies/openVPN/client_setup#Via_NetworkManager . I suspect that if one follows the instructions, both systemd and NetworkManager start dnsmasq, causing the above error. But the error does not seem to cause the problem, because later the system journal reports the same error while the network is still fine:

Jan 03 15:04:17 linux-28d7 NetworkManager[1075]: dnsmasq: failed to create listening socket for 127.0.0.1: Address already in use
Jan 03 15:04:17 linux-28d7 dnsmasq[1075]: failed to create listening socket for 127.0.0.1: Address already in use
Jan 03 15:04:17 linux-28d7 dnsmasq[1075]: FAILED to start up
Jan 03 15:04:17 linux-28d7 NetworkManager[1833]: <warn>  [1672754657.7028] dnsmasq: spawn: dnsmasq process 1075 exited with error: Network access problem (address in use, permissions) (2)

showing the dnsmasq start attempt by NetworkManager but mtr is still happy:

Start: 2023-01-03T15:03:55+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.163.40.1                0.0%    20    2.9   3.6   2.9   6.8   0.8
  2.|-- 10.162.196.251             0.0%    20    4.6   4.8   3.5   8.6   1.4
  3.|-- s390zp18.suse.de (10.161.  0.0%    20    3.4   3.8   3.1   4.5   0.4
Start: 2023-01-03T15:04:20+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.163.40.1                0.0%    20    3.2   3.2   2.4   3.7   0.3
  2.|-- 10.162.196.251             0.0%    20    5.0   4.7   3.7   7.4   0.9
  3.|-- s390zp18.suse.de (10.161.  0.0%    20    3.2   7.7   3.2  75.4  16.0
Start: 2023-01-03T15:04:44+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.163.40.1                0.0%    20    4.0   3.4   2.8   4.2   0.4
  2.|-- 10.162.196.251             0.0%    20    4.1   4.5   3.3   8.2   1.3
  3.|-- s390zp18.suse.de (10.161.  0.0%    20    3.6   3.8   3.1   4.4   0.4
Start: 2023-01-03T15:05:09+0100
HOST: linux-28d7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.163.40.1                0.0%    20    3.6   4.4   2.5  28.2   5.6
  2.|-- 10.162.196.251             0.0%    20    4.0   5.9   3.3  39.6   7.9
  3.|-- s390zp18.suse.de (10.161.  0.0%    20    3.9   3.7   2.9   4.6   0.5

I asked in https://suse.slack.com/archives/C029ANHBQ5R/p1672755997835289 if anybody else in the network can confirm:

Today in Frankencampus office I had multiple 2 minute network outages with my work notebook connected over wifi. Can anybody confirm?

So, what can I try next?

Actions #2

Updated by openqa_review almost 2 years ago

  • Due date set to 2023-01-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #3

Updated by livdywan almost 2 years ago

  • Subject changed from Flaky network connection from SUSE Nbg Frankencampus for okurz to Flaky network connection from SUSE Nbg Frankencampus for okurz size:M
  • Description updated (diff)
Actions #4

Updated by livdywan almost 2 years ago

Talked about it in the infra daily. Might be worth checking if the VPN is actually necessary here, but we're not sure what the exact setup is in the office. Nick also recommends checking dnsmasq/NetworkManager (although it may be fine). Tomorrow looks to be the next opportunity.

Actions #5

Updated by okurz almost 2 years ago

Back in the office. Reproduced the problem already once. I stopped the openvpn process. I can still ping the gateway 192.168.43.254 and 1.1.1.1 but not SUSE R&D, e.g. 10.160.0.1 nor openqa.suse.de nor openqa.nue.suse.com. I stopped the system process dnsmasq so that NetworkManager can start it. This fails with:

Jan 10 11:15:32 linux-28d7 NetworkManager[1833]: <info>  [1673345732.2689] dnsmasq: starting /usr/sbin/dnsmasq
Jan 10 11:15:32 linux-28d7 dnsmasq[16411]: started, version 2.86 cachesize 400
Jan 10 11:15:32 linux-28d7 dnsmasq[16411]: compile time options: IPv6 GNU-getopt DBus no-UBus i18n IDN2 DHCP DHCPv6 Lua TFTP conntrack ipset auth cryptohash DNSSEC loop-detect inotify dumpfile
Jan 10 11:15:32 linux-28d7 dnsmasq[16411]: chown of PID file /run/NetworkManager/dnsmasq.pid failed: Operation not permitted
Jan 10 11:15:32 linux-28d7 dnsmasq[16411]: DBus support enabled: connected to system bus
Jan 10 11:15:32 linux-28d7 dnsmasq[16411]: warning: no upstream servers configured
Jan 10 11:15:32 linux-28d7 dnsmasq[16411]: cleared cache
Jan 10 11:15:36 linux-28d7 autossh[20344]: starting ssh (count 1682)
Jan 10 11:15:36 linux-28d7 autossh[20344]: ssh child pid is 16420
Jan 10 11:15:36 linux-28d7 autossh[20344]: ssh exited with error status 255; restarting ssh
Jan 10 11:15:42 linux-28d7 NetworkManager[1833]: <warn>  [1673345742.2702] dnsmasq[0bf46e72d728a5a0]: timeout waiting for dnsmasq to appear on D-Bus
Jan 10 11:15:42 linux-28d7 dnsmasq[16411]: exiting on receipt of SIGTERM

While the dnsmasq spawned by NetworkManager is running, DNS resolution does not work. Well, the logs say "no upstream servers configured", so that's understandable. I see two paths forward, and ideally I can explore both: ensure NetworkManager-dnsmasq can read /etc/dnsmasq, and configure NetworkManager to disable its internal dnsmasq when I want to use the system-provided one.

I am not yet sure if the outages happen if the VPN stays disabled for now. No problem for 6900s observed in ping 192.168.43.254. After restarting the ping again though I hit an outage at icmp_seq=574 lasting until icmp_seq=653.

I now did:

sudo sed -i 's/^dns=dnsmasq/dns=none/' /etc/NetworkManager/NetworkManager.conf
sudo nmcli general reload
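To double-check that the sed substitution above actually matched, one could print the dns= value from the [main] section before and after. A small sketch, assuming the setting lives in the main config file and not in a drop-in under /etc/NetworkManager/conf.d (which could override it). The function name nm_dns_setting is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the value of dns= from the [main] section
# of a NetworkManager config file, or "(unset)" if there is none.
# Assumption: drop-in files under conf.d are NOT considered here.
nm_dns_setting() {
    awk '
        # Remember which section we are currently in.
        /^\[/ { section = $0 }
        # Only dns= inside [main] selects the DNS processing mode.
        section == "[main]" && /^dns=/ { value = $0; sub(/^dns=/, "", value) }
        END { print (value == "" ? "(unset)" : value) }
    ' "${1:-/etc/NetworkManager/NetworkManager.conf}"
}
```

Called without arguments it reads /etc/NetworkManager/NetworkManager.conf. After the commands above it would be expected to print none if the file previously contained dns=dnsmasq at the start of a line.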

And I think I misread https://wiki.suse.net/index.php/Services_Team/Policies/openVPN/client_setup#Split_DNS. I think it mentions the "via …" sections as alternatives. But then I don't understand how https://wiki.suse.net/index.php/Services_Team/Policies/openVPN/client_setup#Via_NetworkManager would be sufficient to do the DNS resolution within SUSE properly.

Now also started VPN again after confirming that outages still happen regardless of openvpn and regardless of dnsmasq-attempts of NetworkManager.

Next network outage at icmp_seq=1851-1936. Remaining hypothesis: Problem on infrastructure side.

Actions #6

Updated by okurz almost 2 years ago

  • Due date changed from 2023-01-18 to 2023-02-03

Next hypotheses: Maybe specific to that OS installation or notebook so at next opportunity I can bring my old work notebook to crosscheck and also boot a different OS for my normal notebook.

Actions #7

Updated by okurz almost 2 years ago

First I picked another desk and could reproduce the problem with ping 192.168.43.254, which showed multiple outages, although the first one appeared only after roughly one hour, or maybe even longer, at icmp_seq=7232.

Actions #8

Updated by okurz almost 2 years ago

I booted grml 2020.06, which was quite limited for me, e.g. no USB detected, hence no audio and no external mouse or keyboard, but I could configure the network. I have observed at least one outage of about 30s:

[1673527713.114663] 64 bytes from 192.168.43.254: icmp_seq=1463 ttl=64 time=1.48 ms
[1673527714.114729] 64 bytes from 192.168.43.254: icmp_seq=1464 ttl=64 time=1.72 ms
[1673527743.745901] 64 bytes from 192.168.43.254: icmp_seq=1493 ttl=64 time=3.85 ms

But while :; do sudo mtr --report --report-cycles 20 --port 80 --show-ips 192.168.43.254; done did not report a problem for nearly an hour until I had to abort the experiment.
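A gap like the one between icmp_seq=1464 and icmp_seq=1493 above is easy to overlook when scrolling through ping output. A minimal sketch for spotting such jumps automatically, assuming the ping output carries a leading [epoch] timestamp as in the excerpt above (as printed by ping -D). The function name find_ping_gaps is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read `ping -D` output from stdin and report every
# jump in icmp_seq together with the time that passed without a reply.
# Assumes English ping output and does not handle icmp_seq wrap-around.
find_ping_gaps() {
    awk '
        /bytes from/ && /icmp_seq=/ {
            # Strip the brackets from the "[epoch.usec]" timestamp field.
            ts = $1
            gsub(/\[|\]/, "", ts)
            # Extract the sequence number from the "icmp_seq=N" field.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^icmp_seq=/) cur = substr($i, 10) + 0
            if (have_prev && cur > prev_seq + 1)
                printf "gap: icmp_seq %d-%d missing, %.0f s without reply\n", prev_seq + 1, cur - 1, ts - prev_ts
            prev_seq = cur
            prev_ts = ts
            have_prev = 1
        }
    '
}
```

For example, ping -D 192.168.43.254 | find_ping_gaps would print one line per outage and stay silent otherwise.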

Actions #9

Updated by okurz over 1 year ago

  • Tags changed from infra, reactive work to infra, reactive work, next-office-day
  • Status changed from In Progress to Workable

I think I will try a more recent GRML version for longer if I am available at Frankencampus the next time

Actions #10

Updated by okurz over 1 year ago

I could try https://oss.oetiker.ch/smokeping/ or munin or grafana on my notebook :)

Actions #11

Updated by okurz over 1 year ago

Today I could reproduce the problem. Then booted grml 2022.11 with Linux kernel version 6.0.0-4 (Debian) and eventually I could also reproduce a problem:

Start: 2023-01-16T09:58:15+0000
HOST: grml                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- _gateway                  70.0%    20    2.9   4.6   2.9   7.1   1.6
Start: 2023-01-16T09:59:00+0000
HOST: grml                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  2.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  3.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  4.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  5.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  6.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  7.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  8.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  9.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
 10.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
 11.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
 12.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
 13.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
 14.|-- _gateway                  95.0%    20  3486. 3486. 3486. 3486.   0.0
Start: 2023-01-16T09:59:26+0000
HOST: grml                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- _gateway                   0.0%    20    2.8   6.0   2.8  16.3   3.0

By the way, this definitely happened while I was leaving the system alone: I was not at the desk for some minutes before and afterwards. Also, in the system journal of this system I could not find any relevant information about a missing network connection:

Jan 16 09:54:17 grml systemd[1]: Started screen on tty2.
Jan 16 09:54:20 grml ntpd[3967]: PROTO: 45.9.61.155 unlink local addr 192.168.43.39 -> <n>
Jan 16 10:05:33 grml ntpd[3967]: PROTO: 136.243.177.133 unlink local addr 192.168.43.39 ->
Jan 16 10:09:47 grml ntpd[3967]: PROTO: 129.70.132.35 unlink local addr 192.168.43.39 -> >
Jan 16 10:10:55 grml dhclient[3833]: DHCPREQUEST for 192.168.43.39 on wlan0 to 192.168.43>
Jan 16 10:10:55 grml dhclient[3833]: DHCPACK of 192.168.43.39 from 192.168.43.252
Jan 16 10:10:55 grml dhclient[3833]: bound to 192.168.43.39 -- renewal in 2991 seconds.
Jan 16 10:34:09 grml ntpd[3967]: PROTO: 85.214.96.5 unlink local addr 192.168.43.39 -> <n>
Jan 16 10:45:05 grml ntpd[3967]: PROTO: 5.1.73.2 unlink local addr 192.168.43.39 -> <null>
Jan 16 10:59:15 grml systemd[1]: Started screen on tty3.
Jan 16 10:59:46 grml sudo[10998]:     grml : TTY=pts/7 ; PWD=/home/grml ; USER=root ; COM>
Jan 16 10:59:46 grml sudo[10998]: pam_unix(sudo:session): session opened for user root(ui>
Jan 16 10:59:46 grml su[10999]: (to root) root on pts/7
Jan 16 10:59:46 grml su[10999]: pam_unix(su:session): session opened for user root(uid=0)>
Jan 16 11:00:46 grml dhclient[3833]: DHCPREQUEST for 192.168.43.39 on wlan0 to 192.168.43>
Jan 16 11:00:46 grml dhclient[3833]: DHCPACK of 192.168.43.39 from 192.168.43.252
Jan 16 11:00:46 grml dhclient[3833]: bound to 192.168.43.39 -- renewal in 3012 seconds.

Another point: It looks like mkittler could reproduce the issue while he was in the office next to me. The outage for mkittler looked like the NetworkManager applet was still showing a wifi connection but also no IP traffic coming through.

EDIT: Further monitoring did not reveal a problem until 1253Z, 1353L. Together with nsinger I found another ping option which looks helpful: ping -O ... reports "no answer yet for icmp_seq=..."

Actions #13

Updated by okurz over 1 year ago

Debugged together with nsinger, mkittler, asmorodskyi. mkittler can reproduce the problem, also with 60s-90s outages, but not at the same time as mine. asmorodskyi has different problems where wifi actually disconnects, e.g. messages in the system journal about "wlan0: … deauthenticating". For me wifi stays connected (layer 2); sudo journalctl | grep wlan0 says:

Jan 16 14:00:14 linux-28d7 NetworkManager[1550]: <info>  [1673874014.0008] device (wlan0): Activation: successful, device activated.
Jan 16 14:04:28 linux-28d7 NetworkManager[1550]: <info>  [1673874268.0087] dhcp4 (wlan0):   address 192.168.42.35
Jan 16 14:04:28 linux-28d7 NetworkManager[1550]: <info>  [1673874268.0088] dhcp4 (wlan0):   plen 23 (255.255.254.0)
Jan 16 14:04:28 linux-28d7 NetworkManager[1550]: <info>  [1673874268.0088] dhcp4 (wlan0):   gateway 192.168.43.254
Jan 16 14:04:28 linux-28d7 NetworkManager[1550]: <info>  [1673874268.0088] dhcp4 (wlan0):   lease time 7200
Jan 16 14:04:28 linux-28d7 NetworkManager[1550]: <info>  [1673874268.0088] dhcp4 (wlan0):   nameserver '1.1.1.1'
Jan 16 14:04:28 linux-28d7 NetworkManager[1550]: <info>  [1673874268.0089] dhcp4 (wlan0):   nameserver '8.8.8.8'
Jan 16 14:04:28 linux-28d7 NetworkManager[1550]: <info>  [1673874268.0089] dhcp4 (wlan0):   domain name 'guest.suse'
Jan 16 14:04:28 linux-28d7 NetworkManager[1550]: <info>  [1673874268.0089] dhcp4 (wlan0): state changed bound -> extended, address=192.168.42.35
(no further messages)

while ping -O 192.168.43.254 | ts shows outages at the same time:

Jan 16 14:20:08 64 bytes from 192.168.43.254: icmp_seq=1012 ttl=64 time=4.83 ms
Jan 16 14:20:09 64 bytes from 192.168.43.254: icmp_seq=1013 ttl=64 time=3.63 ms
Jan 16 14:20:10 64 bytes from 192.168.43.254: icmp_seq=1014 ttl=64 time=4.32 ms
Jan 16 14:20:11 64 bytes from 192.168.43.254: icmp_seq=1015 ttl=64 time=3.73 ms
Jan 16 14:20:12 64 bytes from 192.168.43.254: icmp_seq=1016 ttl=64 time=2.20 ms
Jan 16 14:20:13 64 bytes from 192.168.43.254: icmp_seq=1017 ttl=64 time=6.20 ms
Jan 16 14:20:14 64 bytes from 192.168.43.254: icmp_seq=1018 ttl=64 time=4.95 ms
Jan 16 14:20:16 no answer yet for icmp_seq=1019
Jan 16 14:20:17 no answer yet for icmp_seq=1020
…
Jan 16 14:21:21 no answer yet for icmp_seq=1083
Jan 16 14:21:21 64 bytes from 192.168.43.254: icmp_seq=1084 ttl=64 time=66.7 ms
Jan 16 14:21:22 64 bytes from 192.168.43.254: icmp_seq=1085 ttl=64 time=2.76 ms
Jan 16 14:21:23 64 bytes from 192.168.43.254: icmp_seq=1086 ttl=64 time=3.93 ms
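For a report to IT it can help to reduce such a log to a list of outage windows. A rough sketch, assuming input in the ping -O ... | ts format shown above, i.e. a three-field "Mon DD HH:MM:SS" prefix and English ping messages (running ping with LC_ALL=C should give those). The function name summarize_outages is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read `ping -O <host> | ts` output from stdin and
# print one line per run of consecutive "no answer yet" messages.
# Assumes the default ts prefix "Mon DD HH:MM:SS" and English messages.
summarize_outages() {
    awk '
        # Return the number from the "icmp_seq=N" field of the current line.
        function seq_of(   i) {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^icmp_seq=/) return substr($i, 10) + 0
            return -1
        }
        function report() {
            printf "outage: %s -> %s, icmp_seq %d-%d\n", first_stamp, last_stamp, first_seq, last_seq
        }
        /no answer yet for icmp_seq=/ {
            stamp = $1 " " $2 " " $3
            if (!in_outage) { in_outage = 1; first_stamp = stamp; first_seq = seq_of() }
            last_stamp = stamp
            last_seq = seq_of()
            next
        }
        # The first reply after a run of misses ends the outage.
        /bytes from/ && in_outage { report(); in_outage = 0 }
        # Also report an outage that is still ongoing at end of input.
        END { if (in_outage) report() }
    '
}
```

Usage would be along the lines of ping -O 192.168.43.254 | ts | summarize_outages, or applying it to a saved log file via input redirection.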

Next step: Report ticket to SUSE-IT about the network problem affecting multiple persons.

Actions #14

Updated by mkittler over 1 year ago

  • Subject changed from Flaky network connection from SUSE Nbg Frankencampus for okurz size:M to Flaky network connection from SUSE Nbg Frankencampus for okurz and mkittler size:M

Happens on my system as well, e.g.:

$ ping -O 192.168.43.254 | ts
Jan 16 15:13:31 64 bytes from 192.168.43.254: icmp_seq=3075 ttl=64 time=3.18 ms
Jan 16 15:13:32 64 bytes from 192.168.43.254: icmp_seq=3076 ttl=64 time=2.38 ms
Jan 16 15:13:34 no answer yet for icmp_seq=3077
…
Jan 16 15:15:06 no answer yet for icmp_seq=3167
Jan 16 15:15:06 64 bytes from 192.168.43.254: icmp_seq=3168 ttl=64 time=7.31 ms
Jan 16 15:15:07 64 bytes from 192.168.43.254: icmp_seq=3169 ttl=64 time=1.48 ms

The Wifi itself stays connected the whole time and the outage wasn't even bad enough to kill SSH and VPN connections. However, it is bad enough to get kicked out of Jitsi conferences and to be annoying while browsing the web.

Actions #15

Updated by mkittler over 1 year ago

I've created https://sd.suse.com/servicedesk/customer/portal/1/SD-109471. I hope it is in the right "System"/category.

Actions #16

Updated by okurz over 1 year ago

  • Due date deleted (2023-02-03)
  • Status changed from Workable to Blocked

Blocking on SD ticket

Actions #17

Updated by okurz over 1 year ago

  • Status changed from Blocked to Resolved

Last two comments from https://sd.suse.com/servicedesk/customer/portal/1/SD-109471

(Oliver Kurz) I have been using different desks in the last weeks and have seen no outages while continuously monitoring the connection with ping, arping and wireshark. Depending on your preferences you can directly close the ticket as the problem is currently not happening, or you can keep the ticket open and I will run a more thorough crosscheck by again using the originally affected desk positions
(Robert Wawrig) Closing this ticket. If you identify an AP that clearly has packet loss, I believe we have a spare one in PRG. Open a ticket, specifying the AP name (AP mac will be better) and we'll get it replaced. Or just disconnect that AP. It is very possible you get good signals from other APs and disconnecting one will make no difference in the coverage.

I wouldn't know how to clearly identify a faulty access point but as the problem itself is gone we can treat this as done.

Actions #18

Updated by okurz over 1 year ago

  • Project changed from 46 to QA
  • Category deleted (Infrastructure)