action #32605

bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp")

Added by okurz almost 2 years ago. Updated 4 months ago.

Status: Resolved
Start date: 01/03/2018
Priority: Normal
Due date: 11/11/2019
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Current Sprint
Duration: 443

Description

Observation

https://openqa.suse.de/tests/1514215/file/autoinst-log.txt . It seems like openqaworker10 is giving us quite some trouble: many incompletes.


Related issues

Copied to openQA Infrastructure - action #58568: salt-states-openqa chokes still on ca repo, seen in osd-d... Rejected 01/03/2018

History

#1 Updated by okurz almost 2 years ago

  • Status changed from New to Rejected

I think openqaworker10 has been disabled and hardware maintenance was asked to handle it. It is unfortunate that no one else picked up this ticket.

#2 Updated by okurz over 1 year ago

  • Subject changed from [tools]openqaworker10 is giving us many incompletes "DIE can't open qmp" to [tools] bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp")
  • Status changed from Rejected to Workable
  • Priority changed from Immediate to Normal

According to runger "openqaworker10 is fully operational again", but I cannot ping the machine. I guess that together with mmaher it should be possible to bring the machine back.

#4 Updated by coolo over 1 year ago

  • Project changed from openQA Tests to openQA Infrastructure
  • Subject changed from [tools] bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp") to bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp")
  • Category deleted (Infrastructure)

#5 Updated by nicksinger over 1 year ago

  • Status changed from Workable to Blocked
  • Assignee set to nicksinger

Yup, the machine is back again but with changed MACs; therefore the host was never reachable.
We adjusted the DHCP configuration to serve the old hostnames to this machine again.

However, the SOL console didn't show any output, and in an attempt to fix it remotely I just locked myself out completely.
I'll grab max in the next days and just go downstairs into the server room.

#6 Updated by okurz 8 months ago

  • Status changed from Blocked to Workable

I could log in to openqaworker10 over ssh. Was it forgotten for 8 months? :)

#7 Updated by nicksinger 5 months ago

  • Assignee deleted (nicksinger)

#8 Updated by okurz 4 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
  • Target version set to Current Sprint

#9 Updated by okurz 4 months ago

w10 has a single NVMe device. I used parted -a optimal /dev/nvme0n1 mkpart primary 0% 100%, but I should upgrade or reinstall the OS first. Upgraded to Leap 15.1. Realized that https://gitlab.suse.de/openqa/salt-states-openqa/commit/dff0558d924a79f14bf73aa881ba4784b58ebf24 introduced a problem because we do not ensure that any repos containing that package are added. This is fixed with https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/209

I propose to re-enable the worker for production use with
https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/197
as some test jobs have already passed successfully.

Still using the time for testing:

WORKER_CLASS=openqaworker10; env end=080 openqa-clone-set https://openqa.suse.de/tests/3492979 okurz_poo32605_gnome WORKER_CLASS=$WORKER_CLASS

-> https://openqa.suse.de/tests/overview?build=okurz_poo32605_gnome&version=12-SP5&distri=sle

#10 Updated by okurz 4 months ago

  • Due date set to 23/10/2019
  • Status changed from In Progress to Feedback

#11 Updated by okurz 4 months ago

  • Due date changed from 23/10/2019 to 29/10/2019

All tests showed up fine; merged https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/197 to bring openqaworker10 back with the proper worker class. I should check again after some days that the worker is used as expected.

#12 Updated by okurz 4 months ago

  • Copied to action #58568: salt-states-openqa chokes still on ca repo, seen in osd-deployment fails added

#13 Updated by okurz 4 months ago

Many jobs were successfully executed on openqaworker10. I checked all worker instances and am checking the failed jobs for whether they look worker-specific. A lot of migration tests fail, but I do not trust their stability anyway, unfortunately, so I will just assume it is not worker-specific. Kernel test groups use jdp with a delayed manual carry-over, so I also need to ignore these. https://openqa.suse.de/tests/3509492#step/before_test/51 looks like it could potentially be related to the multi-machine worker setup. asmorodskyi will do some testing.

comment from asmorodskyi: "not sure if it can help but I noticed that on openqaworker10 net.ipv4.conf.tap*.forwarding=0 while on my machine I have it =1. I recently learned that more specific sysctl parameters override more generic ones, so it is not enough to just have net.ipv4.conf.all.forwarding=1, you also need net.ipv4.conf.*.forwarding=1"
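The precedence asmorodskyi describes (a per-interface sysctl overriding net.ipv4.conf.all.forwarding) can be checked with a small sketch. The mock directory below only exists to make the snippet self-contained; on a real worker you would point it at /proc/sys/net/ipv4 instead.

```shell
# Sketch, not the actual debugging session: list every tap interface whose
# forwarding flag is 0, since a per-interface 0 overrides
# net.ipv4.conf.all.forwarding=1 for that interface.
base=$(mktemp -d)   # stand-in for /proc/sys/net/ipv4
mkdir -p "$base/conf/all" "$base/conf/tap0" "$base/conf/tap1"
echo 1 > "$base/conf/all/forwarding"
echo 1 > "$base/conf/tap0/forwarding"
echo 0 > "$base/conf/tap1/forwarding"   # this one would break forwarding

# grep -L prints files NOT matching, i.e. interfaces with forwarding disabled
disabled=$(grep -L '^1$' "$base"/conf/tap*/forwarding)
echo "$disabled"
```

On a worker the same one-liner against the real path would immediately show which tap devices still need net.ipv4.conf.tapN.forwarding=1.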

I checked on openqaworker10; IP forwarding is enabled:

grep -q 1 /proc/sys/net/ipv4/ip_forward
sudo salt -l error --state-output=changes -E 'openqaworker(3|10).suse.de' cmd.run "ovs-vsctl show"

shows differences regarding GRE; otherwise the config looks OK:

openqaworker3.suse.de:
    c958fe20-0fed-4d04-b285-962dc3157802
        Bridge "br1"
            Port "gre6"
                Interface "gre6"
                    type: gre
                    options: {remote_ip="10.160.1.18"}
…

I think with

openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3508035 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0 
Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_basic_sut@64bit
Created job #3516401: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_basic_ref@64bit -> https://openqa.suse.de/t3516401
Created job #3516402: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_basic_sut@64bit -> https://openqa.suse.de/t3516402

I can properly trigger tests without disturbing other parts. Both tests are fine. Could be that the HPC tests do something different. Let's try with "advanced" and more:

openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3507684 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0 
openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3508117 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0 
openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3509635 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0 

Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_advanced_sut@64bit
Created job #3516430: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_advanced_ref@64bit -> https://openqa.suse.de/t3516430
Created job #3516431: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_advanced_sut@64bit -> https://openqa.suse.de/t3516431
Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_aggregate_sut@64bit
Created job #3516432: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_aggregate_ref@64bit -> https://openqa.suse.de/t3516432
Created job #3516433: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_aggregate_sut@64bit -> https://openqa.suse.de/t3516433
Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-hpc_BETA_mvapich2_mpi_slave00@64bit
Created job #3516434: sle-12-SP5-Server-DVD-x86_64-Build0368-hpc_BETA_mvapich2_mpi_supportserver@64bit -> https://openqa.suse.de/t3516434
Created job #3516435: sle-12-SP5-Server-DVD-x86_64-Build0368-hpc_BETA_mvapich2_mpi_slave00@64bit -> https://openqa.suse.de/t3516435

EDIT: All tests are fine

In the meantime I looked into the original HPC failure. The HPC tests show that DHCP is used and it seems eth0 does not receive an address: https://openqa.suse.de/tests/3509492/file/serial_terminal.txt . The support server should give out leases, as visible in a passed example: https://openqa.suse.de/tests/3509634/file/serial0.txt next to the SUT https://openqa.suse.de/tests/3509635/file/serial_terminal.txt . I wonder why https://openqa.suse.de/tests/3508004 is "parallel_failed" but shows no dependencies; https://openqa.suse.de/admin/auditlog?eventid=3250156 is the corresponding audit log event, which states that "geekotest" triggered this job. Events for correctly triggered job clusters look very similar, with just the single job mentioned. I wonder what or who is triggering the job; is this really what an iso post looks like?

EDIT: https://openqa.suse.de/tests/3516989# failed on openqaworker10; the parallel ref was running on openqaworker8. Seems to be a problem with the GRE tunnel. I can see that the config files have been correctly applied on openqaworker10 on disk, but not in the active config. wicked ifup br1 fixed this, as I can see the GRE config in ovs-vsctl show now. salt-states-openqa mentions a wicked ifup br1 correctly, so I am not sure if this is a generic problem or a single incident linked to an incorrect application of the salt state on the worker. Probably even a reboot would have fixed it the same way. I should see how the system behaves during a reboot.
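The symptom above (GRE config present in the files but missing from the active Open vSwitch config) can be spotted by counting "type: gre" entries in the ovs-vsctl show output. As a hedged sketch, the heredoc below stands in for a captured ovs-vsctl show dump so the snippet runs anywhere:

```shell
# Sketch: detect a bridge with no active GRE ports. On the worker you would
# feed this from `ovs-vsctl show`; here a captured sample (with the GRE port
# missing, as in the broken state) stands in.
ovs_show=$(cat <<'EOF'
c958fe20-0fed-4d04-b285-962dc3157802
    Bridge "br1"
        Port "tap0"
            Interface "tap0"
EOF
)
# grep -c exits non-zero on zero matches, hence the || true
gre_ports=$(printf '%s\n' "$ovs_show" | grep -c 'type: gre' || true)
if [ "$gre_ports" -eq 0 ]; then
    echo "no active GRE ports on br1, a 'wicked ifup br1' (or reboot) may be needed"
fi
```

In the healthy state (compare the openqaworker3 listing earlier in this ticket) the count would be non-zero for each configured peer.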

"Stephan Kulow @coolo: 12:25 Oliver Kurz https://openqa.suse.de/tests/3517470 - please remove tap class from worker10 until you figure out how to fix the tunnel"

Not sure what else I can do. I guess I need to trigger specific jobs that run on both w10 and another worker so that the GRE tunnel is relied upon. I masked all but two openQA worker instances, masked the salt minion and triggered a reboot.

Changed the worker class to "openqaworker10" only.

openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3517471 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0

Cloning dependencies of sle-15-Server-DVD-Updates-x86_64-Build20191024-1-qam_wicked_advanced_sut@64bit
Created job #3518285: sle-15-Server-DVD-Updates-x86_64-Build20191024-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3518285
Created job #3518286: sle-15-Server-DVD-Updates-x86_64-Build20191024-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3518286

was fine. One more time to be safe: https://openqa.suse.de/t3528097 , https://openqa.suse.de/t3528098

#14 Updated by mkittler 4 months ago

I wonder why https://openqa.suse.de/tests/3508004 is "parallel_failed" but shows no dependencies

That's unfortunately the only thing I can clarify: The job has dependencies. Look at vars.json. The tree isn't shown because it has been cloned.

#15 Updated by okurz 4 months ago

  • Due date deleted (29/10/2019)
  • Status changed from Feedback to Workable

mkittler wrote:

The job has dependencies. Look at vars.json. The tree isn't shown because it has been cloned.

Shouldn't it?

https://openqa.suse.de/t3528097 failed with timeout_exceeded; https://openqa.suse.de/t3528098 , the SUT, passed. Not sure if this is reproducible or what it means. One hypothesis is that a multi-machine setup with GRE tunnel would equally fail on other workers when set up from scratch, so what we might have here is a more common problem.

#16 Updated by mkittler 4 months ago

Shouldn't it?

Having a graph with all the clones would be confusing. To keep things simple we decided to show only the most recent jobs in the dependency graph.

#17 Updated by okurz 4 months ago

mkittler wrote:

Having a graph with all the clones would be confusing. To keep things simple we decided to show only the most recent jobs in the dependency graph.

I mean the cloned job appears as if it would not have any dependencies at all. Can we not render the tree with the most recent jobs but show it in all jobs?

#18 Updated by coolo 4 months ago

Yes, we ignore dependencies for cloned jobs

#19 Updated by okurz 4 months ago

  • Due date set to 11/11/2019
  • Status changed from Workable to Feedback

For the mentioned scenario qam_wicked_advanced_sut/ref I could find problems in the "ref" scenario timing out as well: https://openqa.suse.de/tests/3526087 . So far I could not see any openqaworker10-specific problems left. asmorodskyi mentioned the number of tap devices would potentially not match the necessary number for the worker instances.

grep -c tap /etc/sysconfig/network/ifcfg-br1 yields 30, i.e. matching the 10 worker instances mentioned in /etc/openqa/workers.ini . ls /etc/sysconfig/network/ | grep -c tap yields 72, but having more files than necessary should not concern us.
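The tap-count sanity check above can be scripted. The numbers here assume 3 tap devices per worker instance (matching 30 taps for 10 instances as counted above); the generated ifcfg-br1 content and the tapN/tapN+64/tapN+128 naming are illustrative assumptions, not the real file.

```shell
# Sketch: verify that ifcfg-br1 references the expected number of tap devices.
# A mock ifcfg-br1 is generated so the snippet runs without the real file.
instances=10
taps_per_instance=3   # assumption: tap$i, tap$((i+64)), tap$((i+128)) per instance
cfg=$(mktemp)
for i in $(seq 0 $((instances - 1))); do
    printf 'OVS_BRIDGE_PORT_DEVICE_%s=tap%s\n' "$i" "$i" >> "$cfg"
    printf 'OVS_BRIDGE_PORT_DEVICE_%s=tap%s\n' "$((i + 64))" "$((i + 64))" >> "$cfg"
    printf 'OVS_BRIDGE_PORT_DEVICE_%s=tap%s\n' "$((i + 128))" "$((i + 128))" >> "$cfg"
done
expected=$((instances * taps_per_instance))
actual=$(grep -c tap "$cfg")   # same check as against the real ifcfg-br1
echo "expected=$expected actual=$actual"
```

A mismatch here (actual lower than expected) would mean some worker instances have no tap device on the bridge, which is exactly the situation asmorodskyi suspected.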

Cloning a more recent example:

openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3552356 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0

Created job #3552892: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3552892
Created job #3552893: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3552893

This runs both jobs in parallel on openqaworker10, as both have the worker class configured to stick to openqaworker10.

Trying to trigger jobs where one is started on w10, the other not to use the GRE tunnel:

openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3552356 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0

Created job #3552894: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3552894
Created job #3552895: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3552895

The not-so-nice part here is that the ref job is in the original job group and build, polluting results. So, manually updating the job group:

openqa_client_osd jobs/3552894 put --json-data '{"group_id": 132}'

where "132" is the SLE12 test development job group but the job still shows up in the same scenario. I don't know how to update the build from API. So done over SQL with update jobs set build='' where id=3552894; but only after updating the name the job disappears from the scenario list in https://openqa.suse.de/tests/3552355#next_previous and this way also https://openqa.suse.de/tests/overview?distri=sle&version=15&build=20191104-1 is not polluted.

In the meantime other jobs were happily using the worker, also over the GRE tunnel (as again the worker class was reset and not actually limited to openqaworker10): https://openqa.suse.de/tests/3552813 and https://openqa.suse.de/tests/3552867

https://openqa.suse.de/tests/3552893 is the SUT in the double-w10 scenario; all good, but the ref fails in https://openqa.suse.de/tests/3552892#step/t01_gre_tunnel_legacy/68 to upload a pcap file. This also happens in other cases though: https://openqa.suse.de/tests/3552355#step/t01_gre_tunnel_legacy/68

https://openqa.suse.de/t3552894 and https://openqa.suse.de/t3552895 were running on w10 and passed, so this neither showed a problem nor achieved what we wanted: using the GRE tunnel.

So first trying to get rid of this annoyance:

openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3552356 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0 CASEDIR=https://github.com/okurz/os-autoinst-distri-opensuse.git#fix/serial_terminal_upload

Created job #3552959: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3552959
Created job #3552960: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3552960

-> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8832

EDIT: All green.

and try to force jobs onto differing machines

and updated the worker class over SQL: update job_settings set value='qemu_x86_64,tap' where (job_id=3552902 and key='WORKER_CLASS'); . This did not work, as the worker class change was not effective: both jobs started on w10 again after starting two instances.

All in all I currently see no problem with openqaworker10. Checking the test logs again on both instances:

I will re-enable it for production, but only with 2 worker instances for the next days. Unmasked and enabled "salt-minion openqa-worker.target", accepted the salt key on OSD and applied the current high state.

EDIT: 2019-11-11: Enabled worker instances {3..10}

#20 Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved

Enabled worker instances {3..10}. All 10 worker instances are now enabled. The machine is back and controlled by salt, with no special worker class or config. After thoroughly checking test results over the past days I assume we do not have any worker-specific problem left. Follow-up for the GRE tunnel setup issue in #59300
