action #96512

closed

coordination #96980: [qe-core][samba_adcli][epic] Tracker for samba_adcli failures

[qe-core][sporadic] Re-Enable Active Directory tests

Added by dzedro over 3 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Bugs in existing tests
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty:
Sprint: QE-Core: March Sprint (Mar 08 - Apr 05)

Description

Observation

openQA test in scenario sle-12-SP5-Server-DVD-Incidents-x86_64-mau-extratests2@64bit fails in samba_adcli

Test suite description

Run console tests against aggregated test repo

Reproducible

Fails since (at least) Build :20666:webkit2gtk3 (current job)

Expected result

Last good: :4705:fetchmail (or more recent)

Further details

Always latest result in this scenario: latest


Files

Win2019_Validation.png (150 KB), ph03nix, 2023-02-14 14:59

Related issues 5 (0 open, 5 closed)

Related to openQA Tests (public) - action #96513: [qe-core][sporadic][samba_adcli] wbinfo fails (Rejected)
Related to openQA Tests (public) - action #120859: [qe-core] test fails in samba_adcli - Unschedule test (Resolved, mgrifalconi)
Related to openQA Tests (public) - action #96983: [qe-core][sporadic][samba_adcli] adcli joining domain fails (Rejected)
Related to QE-Workstation - tickets #117979: [Regression][AD] AD: Use nautilus to access DFS share Folder (Resolved, zcjia, 2022-10-11)
Related to openQA Tests (public) - action #126866: [qe-core] test fails in samba_adcli on s390x (Resolved, rfan1, 2023-03-29)

Actions #1

Updated by dzedro over 3 years ago

  • Subject changed from [sporadic] test fails in samba_adcli to [qe-core][sporadic] test fails in samba_adcli
Actions #2

Updated by okurz over 3 years ago

  • Related to action #96513: [qe-core][sporadic][samba_adcli] wbinfo fails added
Actions #3

Updated by okurz over 3 years ago

  • Priority changed from Normal to Urgent
Actions #4

Updated by okurz over 3 years ago

Proposing removing the test module for now from mau-extratests2 in https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/13072 to address the urgency

Actions #5

Updated by geor over 3 years ago

  • Parent task set to #96980
Actions #6

Updated by geor over 3 years ago

  • Subject changed from [qe-core][sporadic] test fails in samba_adcli to [qe-core][sporadic] test fails in samba_adcli (fails on ssh to domain)
Actions #7

Updated by okurz over 3 years ago

  • Priority changed from Urgent to Normal

Discussed together in the weekly QE sync. This ticket showed up as violating the SLO for urgent tickets. We agreed that the ticket is actually not urgent according to these criteria. See https://progress.opensuse.org/projects/openqatests/wiki#SLOs-service-level-objectives for details.

Actions #8

Updated by tjyrinki_suse over 3 years ago

  • Status changed from New to Workable
  • Start date deleted (2021-08-03)
Actions #9

Updated by tjyrinki_suse about 3 years ago

  • Target version set to QE-Core: Ready

Would be nice to work on samba_ad issues (or at least find out which tasks are relevant at the moment) for the next sprint.

Actions #10

Updated by tjyrinki_suse almost 3 years ago

  • Priority changed from Normal to High
Actions #11

Updated by tjyrinki_suse almost 3 years ago

  • Target version deleted (QE-Core: Ready)
Actions #12

Updated by tjyrinki_suse almost 3 years ago

  • Priority changed from High to Normal
Actions #14

Updated by slo-gin almost 2 years ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.

Actions #15

Updated by dzedro almost 2 years ago

  • Assignee set to dzedro
  • Target version set to QE-Core: Ready
Actions #16

Updated by dzedro almost 2 years ago

  • Tags set to qe-core-february-sprint
Actions #17

Updated by szarate almost 2 years ago

  • Related to action #120859: [qe-core] test fails in samba_adcli - Unschedule test added
Actions #19

Updated by szarate almost 2 years ago

  • Sprint set to QE-Core: February Sprint (Feb 08 - Mar 08)

samba_adcli tests are unscheduled for now, as the windows machine needs to be set up again/moved out from phobos.qa.suse.de (which is in the TAM lab).

Actions #20

Updated by szarate almost 2 years ago

  • Subject changed from [qe-core][sporadic] test fails in samba_adcli (fails on ssh to domain) to [qe-core][sporadic] Re-Enable Active Directory tests - test fails in samba_adcli (fails on ssh to domain)

You can use qamaster.qa.suse.de to deploy a new Windows VM. I think Marita can provide a valid windows 2019 license key, in case I can't find it.

Actions #21

Updated by ph03nix almost 2 years ago

  • Status changed from Workable to In Progress
  • Assignee changed from dzedro to ph03nix

Cheekily taking this ticket over :-)

Actions #23

Updated by ph03nix almost 2 years ago

I'm collecting notes and setup instructions on https://confluence.suse.com/display/qasle/AD+configuration+for+testing

Actions #24

Updated by ph03nix almost 2 years ago

The provided product key doesn't seem to work:

Actions #25

Updated by ph03nix almost 2 years ago

Managed to install, activate and update the Windows Server installation. Now I'm assigning a static IP address to it.

The hostname will be "methusalix"

https://gitlab.suse.de/qa-sle/qanet-configs/-/merge_requests/48

Actions #26

Updated by ph03nix almost 2 years ago

I'm updating the https://confluence.suse.com/display/qasle/AD+configuration+for+testing along the way to document the installation process.

Actions #27

Updated by szarate almost 2 years ago

GPO enablement PR (needs its own ticket, and somebody to pick it up): https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/16107

Actions #28

Updated by ph03nix almost 2 years ago

I've installed the Windows Server 2019 and documented the process on https://confluence.suse.com/display/qasle/AD+configuration+for+testing

I've tested joining the domain via Windows 10 (OK) and via SLES 15-SP5 (YaST, OK). Next I will check joining the domain via the CLI and how the test setup interacts with it.

Actions #29

Updated by ph03nix almost 2 years ago

szarate wrote:

GPO enablement PR (needs its own ticket, and somebody to pick it up): https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/16107

Let's please do this in a separate ticket, otherwise this grows too much. The new ticket should be blocked by this one and I'm likely to take it afterwards. But please still: New ticket.

Actions #30

Updated by ph03nix almost 2 years ago

  • % Done changed from 0 to 30

Right now I'm adapting the old test case to work with the new server. In the process I'm also turning the hard-coded domain values into settings.
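
A minimal sketch of the "settings instead of hard-coded values" idea, written in shell for illustration only (the actual openQA test module is Perl and is not reproduced here). The setting names AD_DOMAIN, AD_SERVER and AD_USER are hypothetical; the fallback values are the domain, host and account names that appear in this ticket.

```shell
# Compose the join command from settings, falling back to defaults.
# AD_DOMAIN, AD_SERVER and AD_USER are hypothetical setting names.
# The function body runs in a subshell so its variables do not leak.
ad_join_command() (
    domain="${AD_DOMAIN:-METHUSALIX.QA.SUSE.DE}"
    server="${AD_SERVER:-methusalix.qa.suse.de}"
    user="${AD_USER:-Administrator}"
    # Only print the command; running it would prompt for the password.
    printf 'net ads join --domain %s -U %s -S %s\n' "$domain" "$user" "$server"
)
```

Pointing the test at another domain controller then only needs a changed setting, for example AD_SERVER, and no edit to the test itself.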

Actions #31

Updated by ph03nix almost 2 years ago

Currently facing the problem that the AD Domain Controller has network issues when accessed from duck-norris:

# host feldspaten.org 10.162.30.119
;; connection timed out; no servers could be reached

# ping -c 2 10.162.30.119
PING 10.162.30.119 (10.162.30.119) 56(84) bytes of data.
64 bytes from 10.162.30.119: icmp_seq=1 ttl=126 time=1.10 ms
64 bytes from 10.162.30.119: icmp_seq=2 ttl=126 time=1.16 ms

--- 10.162.30.119 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 1.099/1.129/1.160/0.030 ms

However it works from my laptop.

Actions #32

Updated by ph03nix almost 2 years ago

DNS works from laptop and duck-norris against a new host (automatix) on qa-master (10.162.30.153).

The issue is apparently the AD configuration.

Actions #33

Updated by ph03nix almost 2 years ago

  • Related to action #96983: [qe-core][sporadic][samba_adcli] adcli joining domain fails added
Actions #34

Updated by ph03nix almost 2 years ago

After some firewall fiddling, I notice that the openQA host can reach it now.

I added a firewall rule to allow all traffic, so that I can advance for now.

Actions #35

Updated by ph03nix almost 2 years ago

# net ads join --domain METHUSALIX.QA.SUSE.DE -U Administrator -S methusalix.qa.suse.de
Password for [METHUSALIX.QA.SUSE.DE\Administrator]:
Failed to join domain: failed to lookup DC info for domain 'METHUSALIX.QA.SUSE.DE' over rpc: {Operation Failed} The requested operation was unsuccessful
Actions #36

Updated by ph03nix almost 2 years ago

https://duck-norris.qe.suse.de/tests/11940#step/samba_adcli/161

Password for [METHUSALIX0\Administrator]:
Failed to join domain: failed to lookup DC info for domain 'METHUSALIX.QA.SUSE.DE' over rpc: {Device Timeout} The specified I/O operation on %hs was not completed before the time-out period expired.

There is https://www.suse.com/support/kb/doc/?id=000020507 for this, looks like a configuration issue on the Windows Server. I have allowed all ports and services there, so I don't know where this is coming from.

Actions #37

Updated by ph03nix almost 2 years ago

I can join the domain just fine on a Leap 15.4 VM from my Laptop:

# net ads join --domain 'METHUSALIX.QA.SUSE.DE' -U Administrator -S 'methusalix.qa.suse.de' --no-dns-updates
Password for [METHUSALIX0\Administrator]:
Using short domain name -- METHUSALIX0
Joined 'LEAP15-4' to dns domain 'methusalix.qa.suse.de'

Need to check what is blocking the network from the openQA VM to the Domain.
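
One way to narrow this down is to probe the relevant TCP ports from both the working VM and the openQA VM and compare the results. A rough sketch, assuming bash and the coreutils `timeout` command are installed. The helper names and the 3 second limit are arbitrary, and the port list (DNS 53, Kerberos 88, NetBIOS session 139, LDAP 389, SMB 445) is an assumption about what a join typically needs, not something taken from the test.

```shell
# Return 0 if a TCP connection to host ($1) and port ($2) opens within 3 seconds.
# Uses the /dev/tcp pseudo-device of bash, so no extra tools are needed.
probe_tcp() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Print one "port state" line per probed port for the given host.
probe_ad_ports() {
    for port in 53 88 139 389 445; do
        if probe_tcp "$1" "$port"; then
            echo "$port open"
        else
            echo "$port unreachable"
        fi
    done
}
```

Running `probe_ad_ports 10.162.30.119` on both machines should show which ports differ. A probe that only fails after the full 3 seconds hints at silently dropped packets rather than a refused connection.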

Actions #38

Updated by ph03nix almost 2 years ago

Good, issue can be reproduced on a new VM on ada:

zergling:~ # net ads join --domain 'METHUSALIX.QA.SUSE.DE' -U Administrator -S 'methusalix.qa.suse.de' --no-dns-updates
Password for [METHUSALIX0\Administrator]:
Failed to join domain: failed to lookup DC info for domain 'METHUSALIX.QA.SUSE.DE' over rpc: {Device Timeout} The specified I/O operation on %hs was not completed before the time-out period expired.

Now going to check where the issue comes from.

Actions #39

Updated by ph03nix almost 2 years ago

A tcpdump shows that TCP requests are going out to ports 445 and 139 (as described in https://www.suse.com/support/kb/doc/?id=000020507), but nothing comes back:

09:05:47.579065 IP 10.137.7.199.56258 > 10.162.30.119.445: Flags [S], seq 3725595907, win 64240, options [mss 1460,sackOK,TS val 377180263 ecr 0,nop,wscale 7], length 0
09:05:47.584195 IP 10.137.7.199.48880 > 10.162.30.119.139: Flags [S], seq 3763447905, win 64240, options [mss 1460,sackOK,TS val 377180268 ecr 0,nop,wscale 7], length 0
09:05:48.593784 IP 10.137.7.199.48880 > 10.162.30.119.139: Flags [S], seq 3763447905, win 64240, options [mss 1460,sackOK,TS val 377181278 ecr 0,nop,wscale 7], length 0
09:05:48.593808 IP 10.137.7.199.56258 > 10.162.30.119.445: Flags [S], seq 3725595907, win 64240, options [mss 1460,sackOK,TS val 377181278 ecr 0,nop,wscale 7], length 0
09:05:50.609816 IP 10.137.7.199.56258 > 10.162.30.119.445: Flags [S], seq 3725595907, win 64240, options [mss 1460,sackOK,TS val 377183294 ecr 0,nop,wscale 7], length 0
09:05:50.609820 IP 10.137.7.199.48880 > 10.162.30.119.139: Flags [S], seq 3763447905, win 64240, options [mss 1460,sackOK,TS val 377183294 ecr 0,nop,wscale 7], length 0
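
The same SYN repeated after about 1 s and then 2 s, with no SYN-ACK in between, is the usual retransmission pattern of a connection whose packets are silently dropped. A small awk sketch to summarise this from saved tcpdump text output. It assumes the default one-line format shown above; the function name is made up.

```shell
# Summarise TCP connection attempts that never received a SYN-ACK.
# Reads tcpdump text output from the given files, or from stdin.
syn_summary() {
    awk '
        $2 == "IP" && $6 == "Flags" {
            dst = $5; sub(/:$/, "", dst)
            # SYN (initial or retransmitted): count per client-to-server flow
            if ($7 == "[S],") syn[$3 " > " dst]++
            # SYN-ACK travels server-to-client: record it under the client flow
            if ($7 == "[S.],") answered[dst " > " $3] = 1
        }
        END {
            for (f in syn)
                if (!(f in answered))
                    print f ": " syn[f] " SYN, no SYN-ACK"
        }
    ' "$@" | sort
}
```

For example, `tcpdump -n host 10.162.30.119 > capture.txt` followed by `syn_summary capture.txt` lists each unanswered flow with its number of attempts.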
Actions #40

Updated by ph03nix almost 2 years ago

I can see on the Domain Controller that DNS and CLDAP (UDP/389) packets are incoming, but I'm missing the tcp packets for port 445 and 139.

I have firewall rules to allow everything incoming and outgoing on the Domain Controller in place.

Actions #41

Updated by ph03nix almost 2 years ago

A tcpdump between the test host (zergling) and the VM Hypervisor qamaster host shows that there are packets being blocked between the subnets:

OUTGOING from zergling (showing only TCP)

09:30:36.657256 IP 10.137.7.199.49328 > 10.162.30.119.445: Flags [S], seq 3452051387, win 64240, options [mss 1460,sackOK,TS val 378669341 ecr 0,nop,wscale 7], length 0
09:30:36.662387 IP 10.137.7.199.37764 > 10.162.30.119.139: Flags [S], seq 2501689049, win 64240, options [mss 1460,sackOK,TS val 378669347 ecr 0,nop,wscale 7], length 0
09:30:37.681754 IP 10.137.7.199.37764 > 10.162.30.119.139: Flags [S], seq 2501689049, win 64240, options [mss 1460,sackOK,TS val 378670366 ecr 0,nop,wscale 7], length 0
09:30:37.681770 IP 10.137.7.199.49328 > 10.162.30.119.445: Flags [S], seq 3452051387, win 64240, options [mss 1460,sackOK,TS val 378670366 ecr 0,nop,wscale 7], length 0
09:30:39.697759 IP 10.137.7.199.49328 > 10.162.30.119.445: Flags [S], seq 3452051387, win 64240, options [mss 1460,sackOK,TS val 378672382 ecr 0,nop,wscale 7], length 0
09:30:39.697762 IP 10.137.7.199.37764 > 10.162.30.119.139: Flags [S], seq 2501689049, win 64240, options [mss 1460,sackOK,TS val 378672382 ecr 0,nop,wscale 7], length 0

INCOMING on qa-master (showing all)

10:30:36.645760 IP 10.137.7.199.58684 > 10.162.30.119.389: UDP, length 103
10:30:36.646472 IP 10.162.30.119.389 > 10.137.7.199.58684: UDP, length 184

All TCP packets to ports 445 and 139 are missing here.
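
To make the comparison between the two capture points less manual, the destination endpoints seen on each side can be diffed. A sketch, assuming both captures were saved as tcpdump text output. The function name and argument order are placeholders, and the protocol (TCP or UDP) is ignored, so only address and port are compared.

```shell
# Print destination endpoints (ip.port) that the client sent packets to
# but that never show up in the receiver-side capture.
# Usage: lost_endpoints CLIENT_IP SENDER_CAPTURE RECEIVER_CAPTURE
lost_endpoints() {
    awk -v c="$1" '
        # Only look at packets whose source address is the client
        $2 != "IP" || index($3, c ".") != 1 { next }
        { dst = $5; sub(/:$/, "", dst) }
        # First file (receiver side): remember what arrived
        FILENAME == ARGV[1] { arrived[dst] = 1; next }
        # Second file (sender side): report each missing endpoint once
        !(dst in arrived) && !seen[dst]++ { print dst }
    ' "$3" "$2"
}
```

With the two captures above saved to files, `lost_endpoints 10.137.7.199 zergling.txt qamaster.txt` would be expected to list the 445 and 139 endpoints, while the CLDAP endpoint on port 389 is filtered out because it arrived.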

Actions #43

Updated by ph03nix almost 2 years ago

Firewall issue has been resolved, I can continue with the test itself now.

Actions #44

Updated by ph03nix almost 2 years ago

While working on the test, I noticed that some rather important Kerberos utilities are missing: kdestroy, kpasswd, kswitch. Investigating.

Update: https://bugzilla.suse.com/show_bug.cgi?id=1208702
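
A quick way to see which Kerberos client tools are present on a given product version is to check PATH for each of them. This is only a sketch; the function name is made up.

```shell
# Print every tool from the argument list that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, `missing_tools kinit klist kdestroy kpasswd kswitch` prints one line per missing tool and nothing if all are installed (kinit and klist are added here only for comparison; the other three are the ones reported missing above).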

Actions #46

Updated by ph03nix almost 2 years ago

The DC password is exposed and that is not nice. I need to move the define_secret_variable function from https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/publiccloud/utils.pm#L177 to lib/utils.pm.

Actions #48

Updated by ph03nix almost 2 years ago

Actions #50

Updated by ph03nix almost 2 years ago

  • Related to tickets #117979: [Regression][AD] AD: Use nautilus to access DFS share Folder added
Actions #51

Updated by ph03nix almost 2 years ago

A first prototype works for 15-SP4. Now checking how well it works for other, especially older versions.

Actions #52

Updated by ph03nix almost 2 years ago

kinit succeeded but ads_sasl_spnego_gensec_bind(KRB5) failed for ldap/methusalix.qa.suse.de with user[Administrator] realm[METHUSALIX.QA.SUSE.DE]: An invalid parameter was passed to a service or function.
kinit succeeded but ads_sasl_spnego_gensec_bind(KRB5) failed for ldap/methusalix.qa.suse.de with user[SUSETEST$] realm[METHUSALIX.QA.SUSE.DE]: An invalid parameter was passed to a service or function.

https://duck-norris.qe.suse.de/tests/12176#step/samba_adcli/163 It was to be expected that older versions would not work right away.

Actions #55

Updated by szarate almost 2 years ago

  • Sprint changed from QE-Core: February Sprint (Feb 08 - Mar 08) to QE-Core: March Sprint (Mar 08 - Apr 05)
Actions #56

Updated by ph03nix almost 2 years ago

How long does it take to edit a stupid simple config file? 1 full day of learning Salt, because that's a requirement for contributing. grml

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/802

Actions #57

Updated by dzedro almost 2 years ago

@pho3nix is our new Salt expert. 🤩

Actions #58

Updated by ph03nix almost 2 years ago

dzedro wrote:

@pho3nix is our new Salt expert. 🤩

hides

Actions #59

Updated by ph03nix almost 2 years ago

This week I was busy with the manual validation, will continue next week.

Actions #60

Updated by ph03nix almost 2 years ago

Back on the task. I checked and the AD Windows machine is updating itself automatically, which is nice :-)

Actions #61

Updated by ph03nix almost 2 years ago

PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/16517

The Schedule needs to be modified accordingly in the next step.

Actions #62

Updated by ph03nix almost 2 years ago

  • % Done changed from 30 to 90

PR merged, let's see if this works in tomorrow's test runs.

Actions #64

Updated by ph03nix almost 2 years ago

Actions #67

Updated by ph03nix almost 2 years ago

Actions #69

Updated by ph03nix almost 2 years ago

Asking for our s390x workers to also get access to qa.suse.de, so they can reach the AD: https://sd.suse.com/servicedesk/customer/portal/1/SD-115963

Actions #70

Updated by ph03nix almost 2 years ago

I also need to disable samba_adcli for 12-SP2. Yes, 12-SP2, see https://openqa.suse.de/tests/10753217

Actions #71

Updated by ph03nix almost 2 years ago

Actions #72

Updated by ph03nix over 1 year ago

There is a request in the SD ticket for a test machine, so they can fix the network issue (DC not reachable for s390x).

Actions #73

Updated by ph03nix over 1 year ago

  • Subject changed from [qe-core][sporadic] Re-Enable Active Directory tests - test fails in samba_adcli (fails on ssh to domain) to [qe-core][sporadic] Re-Enable Active Directory tests
Actions #74

Updated by ph03nix over 1 year ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

Marking this as resolved and will fix the firewall issues in https://progress.opensuse.org/issues/126866

Actions #75

Updated by ph03nix over 1 year ago

  • Related to action #126866: [qe-core] test fails in samba_adcli on s390x added