action #120112
closed
worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out":retry size:M
0%
Description
Observation
openQA test in scenario sle-15-SP5-JeOS-for-MS-HyperV-x86_64-jeos-containers-docker@svirt-hyperv-uefi fails in
bootloader_hyperv
Test suite description
worker2:~> ping -c 10 win2k19.qa.suse.cz
PING win2k19.qa.suse.cz (10.100.101.33) 56(84) bytes of data.
--- win2k19.qa.suse.cz ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9195ms
Connection from my local machine to win2k19.qa.suse.cz works fine.
Reproducible
Fails since (at least) Build 1.95
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#120112
Expected result
Last good: (unknown) (or more recent)
Further details
Always latest result in this scenario: latest
Updated by mloviska about 2 years ago
- Project changed from openQA Tests (public) to openQA Infrastructure (public)
- Category deleted (Bugs in existing tests)
Updated by okurz about 2 years ago
- Assignee set to okurz
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by okurz about 2 years ago
- Related to action #119443: Conduct the migration of SUSE openQA systems from Nbg SRV1 to new security zones size:M added
Updated by okurz about 2 years ago
- Due date set to 2022-11-22
- Status changed from New to Feedback
On OSD
sudo salt --no-color --state-output=changes \* cmd.run 'command -v nmap>/dev/null || zypper -n in nmap && nmap -p 22 win2k19.qa.suse.cz'
shows how the non-migrated workers can still reach it, migrated ones can't.
okurz@openqa:~> sudo salt --no-color --state-output=changes \* cmd.run 'command -v mtr>/dev/null || zypper -n in mtr && mtr -r win2k19.qa.suse.cz'
openqaworker3.suse.de:
Start: 2022-11-08T18:40:20+0100
HOST: worker3 Loss% Snt Last Avg Best Wrst StDev
1.|-- gateway.oqa.suse.de 0.0% 10 0.2 0.2 0.2 0.2 0.0
2.|-- 10.136.0.12 0.0% 10 0.4 0.4 0.4 0.5 0.1
3.|-- vpn20.open.ch 0.0% 10 11.0 11.4 11.0 13.3 0.7
4.|-- 10.156.234.201 0.0% 10 12.6 13.2 11.7 22.1 3.2
5.|-- win2k19.qa.suse.cz 0.0% 10 12.0 14.1 11.5 29.0 5.5
worker10.oqa.suse.de:
Start: 2022-11-08T18:40:20+0100
HOST: worker10 Loss% Snt Last Avg Best Wrst StDev
1.|-- gateway.oqa.suse.de 0.0% 10 0.3 0.3 0.2 0.4 0.1
2.|-- 10.136.0.12 0.0% 10 0.3 0.7 0.3 2.6 0.7
3.|-- vpn20.open.ch 0.0% 10 11.0 12.4 11.0 18.4 2.2
4.|-- 10.156.234.201 0.0% 10 11.6 14.2 11.4 28.4 5.1
5.|-- win2k19.qa.suse.cz 0.0% 10 11.6 14.2 11.5 28.5 5.4
shows that all can reach the host though. I can ping from worker2 as well.
I could log in to the Windows host and start Wireshark. I don't see any incoming packets on TCP port 22 when I try to connect from a migrated host, but I do see packets from other hosts. So this seems to be a problem on the side of the PRG incoming firewall. lhaleplidis is informed and on it.
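As a rough cross-check of the observation above that does not need Wireshark, a worker can probe the TCP port directly. This is only a sketch: `tcp_probe` is a hypothetical helper, not part of openQA or the scripts repository, and it assumes bash (for `/dev/tcp`) and coreutils `timeout` are available. A "timeout" result matches the "Connection timed out" error in the subject (packets dropped somewhere on the path), while a refused connection would point at the host itself.

```shell
#!/usr/bin/env bash
# tcp_probe HOST PORT [TIMEOUT_SECONDS]   (hypothetical helper name)
# Prints "open", "timeout" (packets likely dropped/filtered) or "refused-or-error".
tcp_probe() {
    local host=$1 port=$2 wait=${3:-5} rc=0
    # /dev/tcp/HOST/PORT is a bash redirection feature; `timeout` bounds the
    # connect attempt and exits with 124 when the time limit is hit.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null || rc=$?
    case $rc in
        0) echo open ;;
        124) echo timeout ;;
        *) echo refused-or-error ;;
    esac
}
```

For example, `tcp_probe win2k19.qa.suse.cz 22` could be run on a migrated and on a non-migrated worker to compare the results.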
Updated by okurz about 2 years ago
- Subject changed from worker worker2.oqa.suse.de cannot reach win2k19.qa.suse.cz to worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out"
- Description updated (diff)
Updated by okurz about 2 years ago
(Oliver Kurz) @Lazaros Haleplidis what is the status on access to win2k19.suse.cz?
(Lazaros Haleplidis) trying to identify the device blocking the traffic so that we can resolve it. (we resolved the problem of the tunnel/firewall blocking this (opensystems), now we reach PRG but there is something there, trying to identify the device first)
Updated by okurz about 2 years ago
(Lazaros Haleplidis) trying to troubleshoot the access to win2k19.suse.cz. You said you can access it from other locations, can you give me a copy of its routing table?
from worker3, one of the migrated machines that now cannot access win2k19.qa.suse.cz over ssh port 22
worker3:/home/okurz # ip r
default via 10.137.10.254 dev br0 proto dhcp
10.0.0.0/15 dev br1 proto kernel scope link src 10.0.2.2
10.137.10.0/24 dev br0 proto kernel scope link src 10.137.10.3
in contrast to OSD that can reach 22/tcp ssh on the host just fine:
okurz@openqa:~> ip r
default via 149.44.183.254 dev eth1
10.136.0.0/14 via 10.160.255.254 dev eth0
10.137.10.0/24 via 10.160.255.254 dev eth0
10.160.0.0/16 dev eth0 proto kernel scope link src 10.160.0.207
127.0.0.0/8 dev lo scope link
149.44.176.0/21 dev eth1 proto kernel scope link src 149.44.176.58
Updated by livdywan about 2 years ago
- Subject changed from worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out" to worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out" size:M
Updated by okurz about 2 years ago
compare a port scan to win2k19.qa.suse.cz from my notebook:
okurz@linux-28d7:~ 0 (master) $ sudo nmap 10.100.101.33
Starting Nmap 7.92 ( https://nmap.org ) at 2022-11-10 17:02 CET
Nmap scan report for win2k19.qa.suse.cz (10.100.101.33)
Host is up (0.041s latency).
Not shown: 992 closed tcp ports (reset)
PORT STATE SERVICE
22/tcp open ssh
135/tcp open msrpc
139/tcp open netbios-ssn
445/tcp open microsoft-ds
2179/tcp open vmrdp
3389/tcp open ms-wbt-server
5357/tcp open wsdapi
10012/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 2.17 seconds
to the same scan done from worker2:
okurz@worker2:~> sudo nmap win2k19.qa.suse.cz
Starting Nmap 7.92 ( https://nmap.org ) at 2022-11-10 17:00 CET
Nmap scan report for win2k19.qa.suse.cz (10.100.101.33)
Host is up (0.011s latency).
Not shown: 995 filtered tcp ports (no-response)
PORT STATE SERVICE
53/tcp closed domain
113/tcp closed ident
2000/tcp open cisco-sccp
5060/tcp open sip
8008/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 4.69 seconds
In Wireshark I can see ICMP, but the only TCP traffic I see is to/from port 53.
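To compare the two scans above without reading them by eye, the state of a single port can be pulled out of nmap's normal output. A minimal sketch, assuming the default human-readable output format shown above; `port_state` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# port_state PORT   (hypothetical helper name)
# Reads nmap's normal output on stdin and prints the STATE column for PORT/tcp,
# or "not-listed" when the port does not appear in the table (for example
# because it is folded into the "Not shown: ... filtered" summary line).
port_state() {
    awk -v p="$1/tcp" '
        $1 == p { print $2; found = 1 }
        END { if (!found) print "not-listed" }
    '
}
```

Usage would look like `sudo nmap win2k19.qa.suse.cz | port_state 22`. With the notebook scan above this prints `open`; with the worker2 scan it prints `not-listed`, because port 22 is among the 995 filtered ports.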
Updated by okurz about 2 years ago
- Status changed from Feedback to In Progress
(Lazaros Haleplidis) ok, final attempt. Thank you @Martin Caj for pointing me in the right direction, so @Oliver Kurz please test one final time
(Marius Kittler) Works now, tested from worker2.
(Lazaros Haleplidis) kudos go to @Martin Caj for pointing me in the right direction
(Oliver Kurz) let us learn, what was it?
(Lazaros Haleplidis) on PRG, on the L3 core, there were ACLs in place
(Oliver Kurz) Well, ok. That's what I meant by it becoming tedious if one does not have access to the controlling systems. But please let's handle it better for other problems. You don't need to wait for us to execute a simple ping or nmap call. We will provide all the necessary access so that you can check yourself.
Updated by okurz about 2 years ago
- Subject changed from worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out" size:M to worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out":retry size:M
Updated by okurz about 2 years ago
Calling
export host=openqa.suse.de; failed_since="(timezone('UTC', now()) - interval '120 hour')" bash -ex ./openqa-monitor-investigation-candidates | bash -e ./openqa-label-known-issues
to find and retrigger the corresponding failed tests.
Updated by okurz about 2 years ago
I assume https://openqa.suse.de/tests/9920506 is the same problem but trying to connect to esxi7.qa.suse.cz, labeled and retriggered.
Updated by okurz about 2 years ago
- Status changed from In Progress to Feedback
Jobs were retriggered
Updated by okurz about 2 years ago
From now:
$ openqa-query-for-job-label poo#120112
9918133|2022-11-10 15:57:43|done|failed|select_modules_and_patterns+registration_dev|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9918139|2022-11-10 15:50:17|done|failed|select_modules_and_patterns+registration_dev|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9920079|2022-11-10 13:47:51|done|failed|jeos-containers-docker|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919801|2022-11-10 13:39:19|done|failed|msdos_dev|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919668|2022-11-10 13:39:17|done|failed|jeos-main|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919869|2022-11-10 13:39:16|done|failed|jeos-filesystem|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919863|2022-11-10 13:39:16|done|failed|jeos-main|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919662|2022-11-10 13:39:16|done|failed|jeos-base+sdk+desktop|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919606|2022-11-10 13:28:43|done|failed|default|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919342|2022-11-10 13:13:20|done|failed|online_upgrade_sles15sp4_hyperv|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
so no more failures for more than 12h, a good sign.
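The "no failures for N hours" check can be derived from the query output instead of being read off manually. A sketch under stated assumptions: the second `|`-separated field is the job timestamp as in the listing above, timestamps are treated as UTC, and GNU `date` is available for `-d`. `latest_failure` and `hours_between` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# latest_failure: read openqa-query-for-job-label output on stdin and print
# the most recent timestamp (field 2). "YYYY-MM-DD HH:MM:SS" sorts correctly
# as plain text, so a lexical sort is enough.
latest_failure() {
    cut -d'|' -f2 | sort | tail -n 1
}

# hours_between START END: whole hours between two timestamps.
# Assumes GNU date and interprets both timestamps as UTC.
hours_between() {
    local start end
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $(( (end - start) / 3600 ))
}
```

For example: `hours_between "$(openqa-query-for-job-label poo#120112 | latest_failure)" "$(date -u '+%F %T')"`.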
Updated by okurz about 2 years ago
- Due date deleted (2022-11-22)
- Status changed from Feedback to Resolved
https://openqa.suse.de/tests/9922599 from the original scenario passing at least bootloader_hyperv.
Updated by openqa_review about 2 years ago
- Status changed from Resolved to Feedback
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: online_upgrade_sles15sp4_vmware@svirt-vmware70
https://openqa.suse.de/tests/10031922#step/bootloader_svirt/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by okurz about 2 years ago
- Status changed from Feedback to Resolved
That one test was labeled via carry-over but failed in
# Test died: {
"console" => "svirt",
"function" => "define_and_start",
"json_cmd_token" => "plFjuRzC",
"args" => [],
"wantarray" => undef,
"cmd" => "backend_proxy_console_call"
}
virsh define failed at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 523.
which is unrelated. I removed the comment from the openQA job.