action #120112

closed

worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out":retry size:M

Added by mloviska about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Start date: 2022-11-08
Due date:
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario sle-15-SP5-JeOS-for-MS-HyperV-x86_64-jeos-containers-docker@svirt-hyperv-uefi fails in
bootloader_hyperv

Test suite description

worker2:~> ping -c 10 win2k19.qa.suse.cz
PING win2k19.qa.suse.cz (10.100.101.33) 56(84) bytes of data.

--- win2k19.qa.suse.cz ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9195ms

Connection from my local machine to win2k19.qa.suse.cz works fine

Reproducible

Fails since (at least) Build 1.95

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#120112

Expected result

Last good: (unknown) (or more recent)

Further details

Always latest result in this scenario: latest

Actions #1

Updated by mloviska about 2 years ago

  • Project changed from openQA Tests (public) to openQA Infrastructure (public)
  • Category deleted (Bugs in existing tests)
Actions #2

Updated by okurz about 2 years ago

  • Assignee set to okurz
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #4

Updated by okurz about 2 years ago

  • Due date set to 2022-11-22
  • Status changed from New to Feedback

On OSD

sudo salt --no-color --state-output=changes \* cmd.run 'command -v nmap>/dev/null || zypper -n in nmap && nmap -p 22 win2k19.qa.suse.cz'

shows that the non-migrated workers can still reach port 22 on the host, while the migrated ones cannot.

okurz@openqa:~> sudo salt --no-color --state-output=changes \* cmd.run 'command -v mtr>/dev/null || zypper -n in mtr && mtr -r win2k19.qa.suse.cz'
openqaworker3.suse.de:
    Start: 2022-11-08T18:40:20+0100
    HOST: worker3                     Loss%   Snt   Last   Avg  Best  Wrst StDev
      1.|-- gateway.oqa.suse.de        0.0%    10    0.2   0.2   0.2   0.2   0.0
      2.|-- 10.136.0.12                0.0%    10    0.4   0.4   0.4   0.5   0.1
      3.|-- vpn20.open.ch              0.0%    10   11.0  11.4  11.0  13.3   0.7
      4.|-- 10.156.234.201             0.0%    10   12.6  13.2  11.7  22.1   3.2
      5.|-- win2k19.qa.suse.cz         0.0%    10   12.0  14.1  11.5  29.0   5.5
worker10.oqa.suse.de:
    Start: 2022-11-08T18:40:20+0100
    HOST: worker10                    Loss%   Snt   Last   Avg  Best  Wrst StDev
      1.|-- gateway.oqa.suse.de        0.0%    10    0.3   0.3   0.2   0.4   0.1
      2.|-- 10.136.0.12                0.0%    10    0.3   0.7   0.3   2.6   0.7
      3.|-- vpn20.open.ch              0.0%    10   11.0  12.4  11.0  18.4   2.2
      4.|-- 10.156.234.201             0.0%    10   11.6  14.2  11.4  28.4   5.1
      5.|-- win2k19.qa.suse.cz         0.0%    10   11.6  14.2  11.5  28.5   5.4

shows that all can reach the host though. I can ping from worker2 as well.
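To compare the mtr reports across many workers without reading each table, the loss at the final hop can be pulled out with a small helper. This is only a sketch: last_hop_loss is a hypothetical name, and it assumes the plain `mtr -r` report layout shown above (hop lines starting with `N.|--`).

```shell
# Print "<host> <loss%>" for the last hop of an `mtr -r` report read from stdin.
# Assumes hop lines look like "  5.|-- host  0.0%  10 ..." as in the report above.
last_hop_loss() {
    awk '$1 ~ /^[0-9]+\.[|]--$/ { host = $2; loss = $3 }
         END { if (host != "") print host, loss }'
}
```

Usage would be something like `mtr -r win2k19.qa.suse.cz | last_hop_loss`. A final-hop loss of 0.0% only says that the mtr probes get through; it says nothing about TCP port 22.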

I could log in to the Windows host and start Wireshark. I don't see any incoming packets on TCP port 22 when I try to connect from a migrated host, but I do see packets from other hosts. So this seems to be a problem with the incoming firewall on the PRG side. lhaleplidis is informed and on it.

Actions #5

Updated by okurz about 2 years ago

  • Subject changed from worker worker2.oqa.suse.de cannot reach win2k19.qa.suse.cz to worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out"
  • Description updated (diff)
Actions #6

Updated by okurz about 2 years ago

(Oliver Kurz) @Lazaros Haleplidis what is the status on access to win2k19.suse.cz?
(Lazaros Haleplidis) trying to identify the device blocking the traffic so that we can resolve it. (we resolved the problem of the tunnel/firewall blocking this (opensystems), now we reach PRG but there is something there, trying to identify the device first)

Actions #7

Updated by okurz about 2 years ago

(Lazaros Haleplidis) trying to troubleshoot the access to win2k19.suse.cz. You said you can access it from other locations, can you give me a copy of its routing table?

From worker3, one of the migrated machines that can no longer access win2k19.qa.suse.cz over SSH on port 22:

worker3:/home/okurz # ip r
default via 10.137.10.254 dev br0 proto dhcp 
10.0.0.0/15 dev br1 proto kernel scope link src 10.0.2.2 
10.137.10.0/24 dev br0 proto kernel scope link src 10.137.10.3 

in contrast to OSD that can reach 22/tcp ssh on the host just fine:

okurz@openqa:~> ip r
default via 149.44.183.254 dev eth1 
10.136.0.0/14 via 10.160.255.254 dev eth0 
10.137.10.0/24 via 10.160.255.254 dev eth0 
10.160.0.0/16 dev eth0 proto kernel scope link src 10.160.0.207 
127.0.0.0/8 dev lo scope link 
149.44.176.0/21 dev eth1 proto kernel scope link src 149.44.176.58 
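Neither table has a specific route covering 10.100.101.33, so both hosts should fall through to their default route. Rather than reading the tables by eye, `ip route get <address>` asks the kernel which route it would pick. The helper below is a sketch that condenses that output; route_summary is a hypothetical name, and it assumes the usual iproute2 one-line format (`<addr> via <gw> dev <if> src <ip> ...`).

```shell
# Summarise the first line of `ip route get <addr>` output read from stdin
# as "gateway=<gw> dev=<if>". Prints gateway=direct for on-link destinations.
route_summary() {
    awk '{
        gw = "direct"; dev = "?"
        for (i = 1; i < NF; i++) {
            if ($i == "via") gw = $(i + 1)
            if ($i == "dev") dev = $(i + 1)
        }
        print "gateway=" gw, "dev=" dev
        exit   # ignore continuation lines such as "cache"
    }'
}
```

Usage would be something like `ip route get 10.100.101.33 | route_summary`, run on both a migrated worker and OSD to see where the paths diverge.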
Actions #8

Updated by livdywan about 2 years ago

  • Subject changed from worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out" to worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out" size:M
Actions #9

Updated by okurz about 2 years ago

compare a port scan to win2k19.qa.suse.cz from my notebook:

okurz@linux-28d7:~ 0 (master) $ sudo nmap 10.100.101.33
Starting Nmap 7.92 ( https://nmap.org ) at 2022-11-10 17:02 CET
Nmap scan report for win2k19.qa.suse.cz (10.100.101.33)
Host is up (0.041s latency).
Not shown: 992 closed tcp ports (reset)
PORT      STATE SERVICE
22/tcp    open  ssh
135/tcp   open  msrpc
139/tcp   open  netbios-ssn
445/tcp   open  microsoft-ds
2179/tcp  open  vmrdp
3389/tcp  open  ms-wbt-server
5357/tcp  open  wsdapi
10012/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 2.17 seconds

to the same scan done from worker2:

okurz@worker2:~> sudo nmap  win2k19.qa.suse.cz
Starting Nmap 7.92 ( https://nmap.org ) at 2022-11-10 17:00 CET
Nmap scan report for win2k19.qa.suse.cz (10.100.101.33)
Host is up (0.011s latency).
Not shown: 995 filtered tcp ports (no-response)
PORT     STATE  SERVICE
53/tcp   closed domain
113/tcp  closed ident
2000/tcp open   cisco-sccp
5060/tcp open   sip
8008/tcp open   http

Nmap done: 1 IP address (1 host up) scanned in 4.69 seconds

In Wireshark on the Windows host I can see the ICMP traffic, but the only TCP traffic that shows up is to/from port 53.
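The two scans above differ in which ports show up as open, which is consistent with something in the path filtering the traffic. A sketch to list the difference mechanically, assuming both scans were saved as normal nmap text output; only_open_in_first and the file names in the usage line are hypothetical:

```shell
# Print the TCP ports reported "open" in the first nmap output file
# but not in the second one, sorted numerically.
only_open_in_first() {
    awk '
        $1 ~ /^[0-9]+\/tcp$/ && $2 == "open" {
            split($1, a, "/")
            if (FILENAME == ARGV[1]) first[a[1]] = 1
            else second[a[1]] = 1
        }
        END { for (p in first) if (!(p in second)) print p }
    ' "$1" "$2" | sort -n
}
```

For example `only_open_in_first notebook.txt worker2.txt` would list the ports that are open from the notebook but not from worker2, and swapping the arguments shows the reverse.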

Actions #10

Updated by okurz about 2 years ago

  • Status changed from Feedback to In Progress

(Lazaros Haleplidis) ok, final attempt. Thank you @Martin Caj for pointing me in the right direction, so @Oliver Kurz please test one final time
(Marius Kittler) Works now, tested from worker2.
(Lazaros Haleplidis) kudos go to @Martin Caj for pointing me in the right direction
(Oliver Kurz) let us learn, what was it?
(Lazaros Haleplidis) in PRG, on the L3 core, there were ACLs in place
(Oliver Kurz) Well, ok. That's what I meant with it's becoming tedious if one does not have access to the controlling systems. But please let's handle it better for other problems. You don't need to wait for us to execute a simple ping or nmap call. We will provide all the necessary access so that you can check yourself.

Actions #11

Updated by okurz about 2 years ago

  • Subject changed from worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out" size:M to worker worker2.oqa.suse.de auto_review:"Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out":retry size:M
Actions #12

Updated by okurz about 2 years ago

Calling

export host=openqa.suse.de; failed_since="(timezone('UTC', now()) - interval '120 hour')" bash -ex ./openqa-monitor-investigation-candidates | bash -e ./openqa-label-known-issues

to find and retrigger the affected failed tests.

Actions #13

Updated by okurz about 2 years ago

I assume https://openqa.suse.de/tests/9920506 is the same problem but trying to connect to esxi7.qa.suse.cz, labeled and retriggered.

Actions #14

Updated by okurz about 2 years ago

  • Status changed from In Progress to Feedback

Jobs were retriggered

Actions #15

Updated by okurz about 2 years ago

From now:

$ openqa-query-for-job-label poo#120112
9918133|2022-11-10 15:57:43|done|failed|select_modules_and_patterns+registration_dev|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9918139|2022-11-10 15:50:17|done|failed|select_modules_and_patterns+registration_dev|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9920079|2022-11-10 13:47:51|done|failed|jeos-containers-docker|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919801|2022-11-10 13:39:19|done|failed|msdos_dev|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919668|2022-11-10 13:39:17|done|failed|jeos-main|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919869|2022-11-10 13:39:16|done|failed|jeos-filesystem|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919863|2022-11-10 13:39:16|done|failed|jeos-main|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919662|2022-11-10 13:39:16|done|failed|jeos-base+sdk+desktop|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919606|2022-11-10 13:28:43|done|failed|default|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2
9919342|2022-11-10 13:13:20|done|failed|online_upgrade_sles15sp4_hyperv|backend done: Error connecting to <root@win2k19.qa.suse.cz>: Connection timed out|worker2

So there have been no more failures for more than 12 hours, which is a good sign.
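To check this without scanning the list by eye, the query output can be summarised. This is a sketch that assumes the pipe-separated layout shown above (second field is the timestamp, last field is the worker); latest_failure and failures_per_worker are hypothetical helper names, not part of os-autoinst/scripts.

```shell
# Print the newest timestamp from pipe-separated job lines read from stdin.
# "YYYY-MM-DD HH:MM:SS" sorts correctly as plain text.
latest_failure() {
    cut -d'|' -f2 | sort | tail -n 1
}

# Print "<worker> <count>" for each worker named in the last field.
failures_per_worker() {
    awk -F'|' 'NF > 1 { n[$NF]++ } END { for (w in n) print w, n[w] }' | sort
}
```

Usage would be something like `openqa-query-for-job-label poo#120112 | latest_failure`, then comparing the result against the current time.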

Actions #16

Updated by okurz about 2 years ago

  • Due date deleted (2022-11-22)
  • Status changed from Feedback to Resolved

https://openqa.suse.de/tests/9922599 from the original scenario passing at least bootloader_hyperv.

Actions #17

Updated by openqa_review about 2 years ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: online_upgrade_sles15sp4_vmware@svirt-vmware70
https://openqa.suse.de/tests/10031922#step/bootloader_svirt/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #18

Updated by okurz about 2 years ago

  • Status changed from Feedback to Resolved

That one test was labeled via carry-over but failed in

# Test died: {
  "console" => "svirt",
  "function" => "define_and_start",
  "json_cmd_token" => "plFjuRzC",
  "args" => [],
  "wantarray" => undef,
  "cmd" => "backend_proxy_console_call"
}
virsh define failed at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 523.

which is unrelated. I removed the comment from the openQA job.
