action #32605

bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp")

Added by okurz almost 2 years ago. Updated 4 months ago.

Status: Resolved
Start date: 01/03/2018
Priority: Normal
Due date: 11/11/2019
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Current Sprint
Duration: 443

Description

Observation

https://openqa.suse.de/tests/1514215/file/autoinst-log.txt . It seems like openqaworker10 is giving us quite some trouble: many incompletes.


Related issues

Copied to openQA Infrastructure - action #58568: salt-states-openqa chokes still on ca repo, seen in osd-d... Rejected 01/03/2018

History

#1 Updated by okurz almost 2 years ago

  • Status changed from New to Rejected

I think openqaworker10 has been disabled and hardware maintenance was asked to handle it. It is unfortunate that no one else picked up this ticket.

#2 Updated by okurz over 1 year ago

  • Subject changed from [tools]openqaworker10 is giving us many incompletes "DIE can't open qmp" to [tools] bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp")
  • Status changed from Rejected to Workable
  • Priority changed from Immediate to Normal

According to runger "openqaworker10 is fully operational again", but I cannot ping the machine. I guess that together with mmaher it should be possible to bring the machine back.

#4 Updated by coolo over 1 year ago

  • Project changed from openQA Tests to openQA Infrastructure
  • Subject changed from [tools] bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp") to bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp")
  • Category deleted (Infrastructure)

#5 Updated by nicksinger over 1 year ago

  • Status changed from Workable to Blocked
  • Assignee set to nicksinger

Yup, the machine is back again but with changed MACs; therefore the host was never reachable.
We adjusted the DHCP configuration to serve the old hostnames to this machine again.

However, the SOL console didn't show any output, and in an attempt to fix it remotely I just locked myself out completely.
I'll grab max in the next days and just go downstairs into the server room.

#6 Updated by okurz 8 months ago

  • Status changed from Blocked to Workable

I could log in to openqaworker10 over ssh. Was it forgotten for 8 months? :)

#7 Updated by nicksinger 5 months ago

  • Assignee deleted (nicksinger)

#8 Updated by okurz 4 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
  • Target version set to Current Sprint

#9 Updated by okurz 4 months ago

w10 has a single NVMe device. I used parted -a optimal /dev/nvme0n1 mkpart primary 0% 100%, but I should upgrade or reinstall the OS first. Upgraded to Leap 15.1. Realized that https://gitlab.suse.de/openqa/salt-states-openqa/commit/dff0558d924a79f14bf73aa881ba4784b58ebf24 introduced a problem because we do not ensure that any repos containing that package are added. This is fixed with https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/209

I propose to re-enable the worker for production use with
https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/197
as some test jobs have already passed successfully.

Still using the time for testing:

WORKER_CLASS=openqaworker10; env end=080 openqa-clone-set https://openqa.suse.de/tests/3492979 okurz_poo32605_gnome WORKER_CLASS=$WORKER_CLASS

-> https://openqa.suse.de/tests/overview?build=okurz_poo32605_gnome&version=12-SP5&distri=sle

#10 Updated by okurz 4 months ago

  • Due date set to 23/10/2019
  • Status changed from In Progress to Feedback

#11 Updated by okurz 4 months ago

  • Due date changed from 23/10/2019 to 29/10/2019

All tests showed up fine; merged https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/197 to bring openqaworker10 back with the proper worker class. I should check again after some days that the worker is used as expected.

#12 Updated by okurz 4 months ago

  • Copied to action #58568: salt-states-openqa chokes still on ca repo, seen in osd-deployment fails added

#13 Updated by okurz 4 months ago

Many jobs were successfully executed on openqaworker10. I checked all worker instances and am checking the failed jobs for whether they look worker-specific. A lot of migration tests fail, but I do not trust their stability anyway, unfortunately, so I will just assume it is not worker-specific. Kernel test groups use jdp with a delayed manual carry-over, so I also need to ignore these. https://openqa.suse.de/tests/3509492#step/before_test/51 looks like it could potentially be related to the multi-machine worker setup. asmorodskyi will do some testing.

comment from asmorodskyi: "not sure if it can help but I noticed that on openqaworker10 net.ipv4.conf.tap*.forwarding=0 while on my machine I have it =1. I recently learned that more specific sysctl parameters override more generic ones, so it is not enough to just have net.ipv4.conf.all.forwarding=1, you also need net.ipv4.conf.*.forwarding=1"
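The precedence asmorodskyi describes (a per-interface sysctl overriding net.ipv4.conf.all.forwarding) can be checked with a small sketch. The mock directory below only exists to make the snippet self-contained; on a real worker you would point it at /proc/sys/net/ipv4 instead.

```shell
# Sketch, not the actual debugging session: list every tap interface whose
# forwarding flag is 0, since a per-interface 0 overrides
# net.ipv4.conf.all.forwarding=1 for that interface.
base=$(mktemp -d)   # stand-in for /proc/sys/net/ipv4
mkdir -p "$base/conf/all" "$base/conf/tap0" "$base/conf/tap1"
echo 1 > "$base/conf/all/forwarding"
echo 1 > "$base/conf/tap0/forwarding"
echo 0 > "$base/conf/tap1/forwarding"   # this one would break forwarding

# grep -L prints files NOT matching, i.e. interfaces with forwarding disabled
disabled=$(grep -L '^1$' "$base"/conf/tap*/forwarding)
echo "$disabled"
```

On a worker the same one-liner against the real path would immediately show which tap devices still need net.ipv4.conf.tapN.forwarding=1.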

I checked on openqaworker10; IP forwarding is enabled:

grep -q 1 /proc/sys/net/ipv4/ip_forward
sudo salt -l error --state-output=changes -E 'openqaworker(3|10).suse.de' cmd.run "ovs-vsctl show"

shows differences regarding GRE; otherwise the config looks OK:

openqaworker3.suse.de:
    c958fe20-0fed-4d04-b285-962dc3157802
        Bridge "br1"
            Port "gre6"
                Interface "gre6"
                    type: gre
                    options: {remote_ip="10.160.1.18"}
…

I think with

openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3508035 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0 
Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_basic_sut@64bit
Created job #3516401: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_basic_ref@64bit -> https://openqa.suse.de/t3516401
Created job #3516402: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_basic_sut@64bit -> https://openqa.suse.de/t3516402

I can properly trigger tests without disturbing other parts. Both tests are fine. Could be that the HPC tests do something different. Let's try with "advanced" and more:

openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3507684 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0 
openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3508117 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0 
openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3509635 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0 

Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_advanced_sut@64bit
Created job #3516430: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_advanced_ref@64bit -> https://openqa.suse.de/t3516430
Created job #3516431: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_advanced_sut@64bit -> https://openqa.suse.de/t3516431
Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_aggregate_sut@64bit
Created job #3516432: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_aggregate_ref@64bit -> https://openqa.suse.de/t3516432
Created job #3516433: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_aggregate_sut@64bit -> https://openqa.suse.de/t3516433
Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-hpc_BETA_mvapich2_mpi_slave00@64bit
Created job #3516434: sle-12-SP5-Server-DVD-x86_64-Build0368-hpc_BETA_mvapich2_mpi_supportserver@64bit -> https://openqa.suse.de/t3516434
Created job #3516435: sle-12-SP5-Server-DVD-x86_64-Build0368-hpc_BETA_mvapich2_mpi_slave00@64bit -> https://openqa.suse.de/t3516435

EDIT: All tests are fine

In the meantime I looked into the original HPC failure. The HPC tests show that DHCP is used and it seems eth0 does not receive an address: https://openqa.suse.de/tests/3509492/file/serial_terminal.txt . The support server should give out leases, as visible in a passed example: https://openqa.suse.de/tests/3509634/file/serial0.txt next to the SUT https://openqa.suse.de/tests/3509635/file/serial_terminal.txt . I wonder why https://openqa.suse.de/tests/3508004 is "parallel_failed" but shows no dependencies; https://openqa.suse.de/admin/auditlog?eventid=3250156 is the corresponding audit log event, which states that "geekotest" triggered this job. Events for correctly triggered job clusters look very similar, with just the single job mentioned. I wonder what or who is triggering the job; is this really what an iso post looks like?

EDIT: https://openqa.suse.de/tests/3516989# failed on openqaworker10; the parallel ref was running on openqaworker8. Seems to be a problem with the GRE tunnel. I can see that the config files have been correctly applied on openqaworker10 on disk, but not in the active config. wicked ifup br1 fixed this, as I can see the GRE config in ovs-vsctl show now. salt-states-openqa mentions a wicked ifup br1 correctly, so I am not sure if this is a generic problem or a single incident linked to an incorrect application of the salt state on the worker. Probably even a reboot would have fixed it the same way. I should see how the system behaves during a reboot.
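The symptom above (GRE config present in the files but missing from the active Open vSwitch config) can be spotted by counting "type: gre" entries in the ovs-vsctl show output. As a hedged sketch, the heredoc below stands in for a captured ovs-vsctl show dump so the snippet runs anywhere:

```shell
# Sketch: detect a bridge with no active GRE ports. On the worker you would
# feed this from `ovs-vsctl show`; here a captured sample (with the GRE port
# missing, as in the broken state) stands in.
ovs_show=$(cat <<'EOF'
c958fe20-0fed-4d04-b285-962dc3157802
    Bridge "br1"
        Port "tap0"
            Interface "tap0"
EOF
)
# grep -c exits non-zero on zero matches, hence the || true
gre_ports=$(printf '%s\n' "$ovs_show" | grep -c 'type: gre' || true)
if [ "$gre_ports" -eq 0 ]; then
    echo "no active GRE ports on br1, a 'wicked ifup br1' (or reboot) may be needed"
fi
```

In the healthy state (compare the openqaworker3 listing earlier in this ticket) the count would be non-zero for each configured peer.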

"Stephan Kulow @coolo: 12:25 Oliver Kurz https://openqa.suse.de/tests/3517470 - please remove tap class from worker10 until you figure out how to fix the tunnel"

Not sure what else I can do. I guess I need to trigger specific jobs that run on both w10 and another worker so that the GRE tunnel is relied upon. I masked all but two openQA worker instances, masked the salt minion and triggered a reboot.

Changed the worker class to "openqaworker10" only.

openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3517471 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0

Cloning dependencies of sle-15-Server-DVD-Updates-x86_64-Build20191024-1-qam_wicked_advanced_sut@64bit
Created job #3518285: sle-15-Server-DVD-Updates-x86_64-Build20191024-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3518285
Created job #3518286: sle-15-Server-DVD-Updates-x86_64-Build20191024-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3518286

was fine. One more time to be safe: https://openqa.suse.de/t3528097 , https://openqa.suse.de/t3528098

#14 Updated by mkittler 4 months ago

I wonder why https://openqa.suse.de/tests/3508004 is "parallel_failed" but shows no dependencies

That's unfortunately the only thing I can clarify: The job has dependencies. Look at vars.json. The tree isn't shown because it has been cloned.

#15 Updated by okurz 4 months ago

  • Due date deleted (29/10/2019)
  • Status changed from Feedback to Workable

mkittler wrote:

The job has dependencies. Look at vars.json. The tree isn't shown because it has been cloned.

Shouldn't it?

https://openqa.suse.de/t3528097 failed with timeout_exceeded; https://openqa.suse.de/t3528098 , the SUT, passed. Not sure if this is reproducible or what it means. One hypothesis is that a multi-machine setup with GRE tunnel would equally fail on other workers when set up from scratch, so what we might have here is a more common problem.

#16 Updated by mkittler 4 months ago

Shouldn't it?

Having a graph with all the clones would be confusing. To keep things simple we decided to show only the most recent jobs in the dependency graph.

#17 Updated by okurz 4 months ago

mkittler wrote:

Having a graph with all the clones would be confusing. To keep things simple we decided to show only the most recent jobs in the dependency graph.

I mean the cloned job appears as if it would not have any dependencies at all. Can we not render the tree with the most recent jobs but show it in all jobs?

#18 Updated by coolo 4 months ago

Yes, we ignore dependencies for cloned jobs

#19 Updated by okurz 4 months ago

  • Due date set to 11/11/2019
  • Status changed from Workable to Feedback

For the mentioned scenario qam_wicked_advanced_sut/ref I could find problems in the "ref" scenario timing out as well: https://openqa.suse.de/tests/3526087 . So far I could not see any openqaworker10-specific problems left. asmorodskyi mentioned the number of tap devices would potentially not match the necessary number for the worker instances.

grep -c tap /etc/sysconfig/network/ifcfg-br1 yields 30, i.e. matching the 10 worker instances mentioned in /etc/openqa/workers.ini . ls /etc/sysconfig/network/ | grep -c tap yields 72, but having more files than necessary should not concern us.
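The tap-count sanity check above can be scripted. The numbers here assume 3 tap devices per worker instance (matching 30 taps for 10 instances as counted above); the generated ifcfg-br1 content and the tapN/tapN+64/tapN+128 naming are illustrative assumptions, not the real file.

```shell
# Sketch: verify that ifcfg-br1 references the expected number of tap devices.
# A mock ifcfg-br1 is generated so the snippet runs without the real file.
instances=10
taps_per_instance=3   # assumption: tap$i, tap$((i+64)), tap$((i+128)) per instance
cfg=$(mktemp)
for i in $(seq 0 $((instances - 1))); do
    printf 'OVS_BRIDGE_PORT_DEVICE_%s=tap%s\n' "$i" "$i" >> "$cfg"
    printf 'OVS_BRIDGE_PORT_DEVICE_%s=tap%s\n' "$((i + 64))" "$((i + 64))" >> "$cfg"
    printf 'OVS_BRIDGE_PORT_DEVICE_%s=tap%s\n' "$((i + 128))" "$((i + 128))" >> "$cfg"
done
expected=$((instances * taps_per_instance))
actual=$(grep -c tap "$cfg")   # same check as against the real ifcfg-br1
echo "expected=$expected actual=$actual"
```

A mismatch here (actual lower than expected) would mean some worker instances have no tap device on the bridge, which is exactly the situation asmorodskyi suspected.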

Cloning a more recent example:

openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3552356 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0

Created job #3552892: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3552892
Created job #3552893: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3552893

This runs both jobs in parallel on openqaworker10, as both have the worker class configured to stick to openqaworker10.

Trying to trigger jobs where one is started on w10, the other not to use the GRE tunnel:

openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3552356 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0

Created job #3552894: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3552894
Created job #3552895: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3552895

The not-so-nice part here is that the ref job is in the original job group and build, polluting results. So, manually updating the job group:

openqa_client_osd jobs/3552894 put --json-data '{"group_id": 132}'

where "132" is the SLE12 test development job group but the job still shows up in the same scenario. I don't know how to update the build from API. So done over SQL with update jobs set build='' where id=3552894; but only after updating the name the job disappears from the scenario list in https://openqa.suse.de/tests/3552355#next_previous and this way also https://openqa.suse.de/tests/overview?distri=sle&version=15&build=20191104-1 is not polluted.

In the meantime other jobs were happily using the worker, also over the GRE tunnel (as again the worker class was reset and not actually limited to openqaworker10): https://openqa.suse.de/tests/3552813 and https://openqa.suse.de/tests/3552867

https://openqa.suse.de/tests/3552893 is the SUT in the double-w10 scenario; all good, but the ref fails in https://openqa.suse.de/tests/3552892#step/t01_gre_tunnel_legacy/68 to upload a pcap file. This also happens in other cases though: https://openqa.suse.de/tests/3552355#step/t01_gre_tunnel_legacy/68

https://openqa.suse.de/t3552894 and https://openqa.suse.de/t3552895 were running on w10 and passed, so this neither showed a problem nor achieved what we wanted: using the GRE tunnel.

So first trying to get rid of this annoyance:

openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3552356 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0 CASEDIR=https://github.com/okurz/os-autoinst-distri-opensuse.git#fix/serial_terminal_upload

Created job #3552959: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3552959
Created job #3552960: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3552960

-> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8832

EDIT: All green.

and try to force jobs onto differing machines

and updated the worker class over SQL: update job_settings set value='qemu_x86_64,tap' where (job_id=3552902 and key='WORKER_CLASS'); . This did not work, as the worker class change was not effective: both jobs started on w10 again after starting two instances.

All in all I currently see no problem with openqaworker10. Checking the test logs again on both instances:

I will re-enable it for production, but only with 2 worker instances for the next days. Unmasked and enabled "salt-minion openqa-worker.target", accepted the salt key on OSD and applied the current high state.

EDIT: 2019-11-11: Enabled worker instances {3..10}

#20 Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved

Enabled worker instances {3..10}. All 10 worker instances are now enabled. The machine is back and controlled by salt, with no special worker class or config. After thoroughly checking test results over the past days I assume we do not have any worker-specific problem left. Follow-up for the GRE tunnel setup issue in #59300
