action #32605
closed
bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp")
Added by okurz over 6 years ago. Updated about 5 years ago.
Description
Observation
https://openqa.suse.de/tests/1514215/file/autoinst-log.txt . It seems openqaworker10 is giving us quite some trouble, many incompletes.
Updated by okurz over 6 years ago
- Status changed from New to Rejected
I think openqaworker10 has been disabled and handed over to hardware maintenance. It is unfortunate that no one else picked up this ticket.
Updated by okurz about 6 years ago
- Subject changed from [tools]openqaworker10 is giving us many incompletes "DIE can't open qmp" to [tools] bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp")
- Status changed from Rejected to Workable
- Priority changed from Immediate to Normal
According to runger "openqaworker10 is fully operational again" but I cannot ping the machine. I guess together with mmaher it should be possible to bring the machine back.
Updated by runger about 6 years ago
Infra ticket ID is https://infra.nue.suse.com/Ticket/Display.html?id=115263
Updated by coolo about 6 years ago
- Project changed from openQA Tests to openQA Infrastructure
- Subject changed from [tools] bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp") to bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incompletes "DIE can't open qmp")
- Category deleted (Infrastructure)
Updated by nicksinger about 6 years ago
- Status changed from Workable to Blocked
- Assignee set to nicksinger
Yup, the machine is back again but with changed MACs, therefore the host was never reachable.
We adjusted the DHCP now to serve the old hostnames to this machine again.
However, the SOL console didn't show any output and in an attempt to fix it remotely I just locked myself out completely.
I'll grab max in the next days and just go downstairs into the server room.
Updated by okurz over 5 years ago
- Status changed from Blocked to Workable
I could log in to openqaworker10 over ssh, was it forgotten for 8 months? :)
Updated by okurz about 5 years ago
- Status changed from Workable to In Progress
- Assignee set to okurz
- Target version set to Current Sprint
Updated by okurz about 5 years ago
w10 has a single NVMe device. I used parted -a optimal /dev/nvme0n1 mkpart primary 0% 100%. But I should upgrade or reinstall the OS first. Upgraded to Leap 15.1. Realized that https://gitlab.suse.de/openqa/salt-states-openqa/commit/dff0558d924a79f14bf73aa881ba4784b58ebf24 introduced a problem because we do not ensure that any repos containing that package are added. This is fixed with https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/209
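For reference, a minimal sketch of how such a pool disk could be prepared end to end; only the parted invocation above is from this ticket, the filesystem type and mount point are assumptions for illustration:
# partition the NVMe device and use it as openQA pool storage (sketch)
parted -a optimal /dev/nvme0n1 mklabel gpt
parted -a optimal /dev/nvme0n1 mkpart primary 0% 100%
mkfs.ext4 /dev/nvme0n1p1                # filesystem choice is an assumption
mount /dev/nvme0n1p1 /var/lib/openqa    # mount point is an assumption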
I propose to reenable the worker for production use with
https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/197
as some test jobs have already passed successfully.
Still using the time for testing:
WORKER_CLASS=openqaworker10; env end=080 openqa-clone-set https://openqa.suse.de/tests/3492979 okurz_poo32605_gnome WORKER_CLASS=$WORKER_CLASS
-> https://openqa.suse.de/tests/overview?build=okurz_poo32605_gnome&version=12-SP5&distri=sle
Updated by okurz about 5 years ago
- Due date set to 2019-10-23
- Status changed from In Progress to Feedback
Updated by okurz about 5 years ago
- Due date changed from 2019-10-23 to 2019-10-29
All tests showed up fine, so I merged https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/197 to bring openqaworker10 back with the proper worker class. I should check again after some days that the worker is used as expected.
Updated by okurz about 5 years ago
- Copied to action #58568: salt-states-openqa chokes still on ca repo, seen in osd-deployment fails added
Updated by okurz about 5 years ago
Many jobs were successfully executed on openqaworker10. I checked all worker instances and am now checking whether the failed jobs look worker specific. A lot of migration tests fail, but unfortunately I do not trust their stability anyway, so I will just assume it's not worker specific. Kernel test groups use jdp with a delayed manual carry-over, so I will also need to ignore these. https://openqa.suse.de/tests/3509492#step/before_test/51 looks like it could potentially be related to the multi-machine worker setup. asmorodskyi will do some testing.
Comment from asmorodskyi: "not sure if it can help but I notice that on openqaworker10 net.ipv4.conf.tap*.forwarding=0 while on my machine I have it =1. Recently I have learned that more specific sysctl parameters overwrite more generic ones, so it is not enough to just have net.ipv4.conf.all.forwarding=1, you also need net.ipv4.conf.*.forwarding=1"
I checked on openqaworker10, ip forwarding is enabled:
grep -q 1 /proc/sys/net/ipv4/ip_forward
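To double-check the per-interface overrides asmorodskyi mentioned, a small sketch (the tap device names are whatever exists on the host):
# global and per-interface IPv4 forwarding; per the comment above, the
# per-interface values can override net.ipv4.conf.all.forwarding
sysctl net.ipv4.ip_forward net.ipv4.conf.all.forwarding
for f in /proc/sys/net/ipv4/conf/tap*/forwarding; do
    echo "$f: $(cat "$f")"
done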
sudo salt -l error --state-output=changes -E 'openqaworker(3|10).suse.de' cmd.run "ovs-vsctl show"
shows differences regarding gre, otherwise the config looks ok:
openqaworker3.suse.de:
c958fe20-0fed-4d04-b285-962dc3157802
Bridge "br1"
Port "gre6"
Interface "gre6"
type: gre
options: {remote_ip="10.160.1.18"}
…
I think with
openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3508035 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0
Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_basic_sut@64bit
Created job #3516401: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_basic_ref@64bit -> https://openqa.suse.de/t3516401
Created job #3516402: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_basic_sut@64bit -> https://openqa.suse.de/t3516402
I can properly trigger tests without disturbing other parts. Both tests are fine. It could be that the HPC tests do something different. Let's try with "advanced" and more:
openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3507684 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0
openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3508117 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0
openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3509635 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0
Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_advanced_sut@64bit
Created job #3516430: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_advanced_ref@64bit -> https://openqa.suse.de/t3516430
Created job #3516431: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_advanced_sut@64bit -> https://openqa.suse.de/t3516431
Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_aggregate_sut@64bit
Created job #3516432: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_aggregate_ref@64bit -> https://openqa.suse.de/t3516432
Created job #3516433: sle-12-SP5-Server-DVD-x86_64-Build0368-wicked_aggregate_sut@64bit -> https://openqa.suse.de/t3516433
Cloning dependencies of sle-12-SP5-Server-DVD-x86_64-Build0368-hpc_BETA_mvapich2_mpi_slave00@64bit
Created job #3516434: sle-12-SP5-Server-DVD-x86_64-Build0368-hpc_BETA_mvapich2_mpi_supportserver@64bit -> https://openqa.suse.de/t3516434
Created job #3516435: sle-12-SP5-Server-DVD-x86_64-Build0368-hpc_BETA_mvapich2_mpi_slave00@64bit -> https://openqa.suse.de/t3516435
EDIT: All tests are fine
In the meantime I looked into the original HPC failure. The HPC tests show that dhcp is used and it seems eth0 does not receive an address: https://openqa.suse.de/tests/3509492/file/serial_terminal.txt , the support server should give out leases as visible in a passed example: https://openqa.suse.de/tests/3509634/file/serial0.txt next to the sut https://openqa.suse.de/tests/3509635/file/serial_terminal.txt . I wonder why https://openqa.suse.de/tests/3508004 is "parallel_failed" but shows no dependencies; https://openqa.suse.de/admin/auditlog?eventid=3250156 is the corresponding audit log event which states that "geekotest" triggered this job. Events for correctly triggered job clusters look very similar, with just the single job mentioned. I wonder what or who is triggering the job, is this really what an iso post looks like?
EDIT: https://openqa.suse.de/tests/3516989# failed on openqaworker10 while the parallel ref was running on openqaworker8. This seems to be a problem with the GRE tunnel. I can see that the config files have been correctly applied on openqaworker10, but the GRE ports are not in the active config. wicked ifup br1 fixed this, as I can now see the GRE config in ovs-vsctl show. salt-states-openqa mentions wicked ifup br1 correctly, so I am not sure if this is a generic problem or a single incident linked to an incorrect application of the salt state on the worker. Probably even a reboot would have fixed it the same way. I should see how the system behaves during a reboot.
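A minimal sketch of the check and repair on the worker; the port name and remote_ip are just the values from the openqaworker3 excerpt above, used as placeholders:
# re-apply the wicked config for the bridge and verify the GRE ports
wicked ifup br1
ovs-vsctl list-ports br1 | grep gre
# if a GRE port were still missing it could also be added by hand, e.g.:
ovs-vsctl --may-exist add-port br1 gre6 -- set interface gre6 type=gre options:remote_ip=10.160.1.18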
"Stephan Kulow @coolo: 12:25 Oliver Kurz https://openqa.suse.de/tests/3517470 - please remove tap class from worker10 until you figure out how to fix the tunnel"
Not sure what else I can do. I guess I need to trigger specific jobs that run on both w10 and another worker so that the GRE tunnel is relied upon. I masked all but two openQA worker instances, masked the salt minion and triggered a reboot.
changed worker class to "openqaworker10" only.
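Roughly what that preparation amounts to, as a sketch; the instance numbers and the assumption that WORKER_CLASS lives in the [global] section of workers.ini are mine:
# keep only two worker instances running, take the host out of salt control
systemctl mask --now openqa-worker@{3..10}.service
systemctl mask --now salt-minion.service
# pin the remaining instances to this host only
sed -i 's/^WORKER_CLASS.*/WORKER_CLASS = openqaworker10/' /etc/openqa/workers.ini
systemctl reboot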
openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3517471 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0
Cloning dependencies of sle-15-Server-DVD-Updates-x86_64-Build20191024-1-qam_wicked_advanced_sut@64bit
Created job #3518285: sle-15-Server-DVD-Updates-x86_64-Build20191024-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3518285
Created job #3518286: sle-15-Server-DVD-Updates-x86_64-Build20191024-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3518286
was fine. One more time to be safe: https://openqa.suse.de/t3528097 , https://openqa.suse.de/t3528098
Updated by mkittler about 5 years ago
I wonder why https://openqa.suse.de/tests/3508004 is "parallel_failed" but shows no dependencies
That's unfortunately the only thing I can clarify: The job has dependencies. Look at vars.json. The tree isn't shown because it has been cloned.
Updated by okurz about 5 years ago
- Due date deleted (2019-10-29)
- Status changed from Feedback to Workable
mkittler wrote:
The job has dependencies. Look at vars.json. The tree isn't shown because it has been cloned.
Shouldn't it?
https://openqa.suse.de/t3528097 failed with timeout_exceeded, https://openqa.suse.de/t3528098, the SUT, passed. Not sure if this is reproducible or what it means. One hypothesis is that an MM setup with GRE tunnel would equally fail on other workers when set up from scratch, so what we might have here is a more common problem.
Updated by mkittler about 5 years ago
Shouldn't it?
Having a graph with all the clones would be confusing. To keep things simple we decided to show only the most recent jobs in the dependency graph.
Updated by okurz about 5 years ago
mkittler wrote:
Having a graph with all the clones would be confusing. To keep things simple we decided to show only the most recent jobs in the dependency graph.
I mean the cloned job appears as if it did not have any dependencies at all. Could we not render the tree with only the most recent jobs but show it in all jobs?
Updated by okurz about 5 years ago
- Due date set to 2019-11-11
- Status changed from Workable to Feedback
For the mentioned scenario qam_wicked_advanced_sut/ref I could find problems in the "ref" scenario timing out as well: https://openqa.suse.de/tests/3526087 . So far I could not see any openqaworker10 specific problems left. asmorodskyi mentioned the number of tap devices might not match the necessary number for the worker instances.
grep -c tap /etc/sysconfig/network/ifcfg-br1
yields 30, i.e. matching the 10 worker instances mentioned in /etc/openqa/workers.ini . Only ls /etc/sysconfig/network/ | grep -c tap
yields 72, but that should not concern us when there are more files than necessary.
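A quick sanity check along those lines, assuming numbered instance sections in workers.ini and a fixed number of tap devices per instance:
# compare tap devices configured on br1 with the number of worker instances
taps=$(grep -c tap /etc/sysconfig/network/ifcfg-br1)
instances=$(grep -c '^\[[0-9]\+\]' /etc/openqa/workers.ini)
echo "taps on br1: $taps, instances: $instances, taps per instance: $((taps / instances))"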
Cloning a more recent example:
openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3552356 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0
Created job #3552892: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3552892
Created job #3552893: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3552893
this is running both jobs in parallel on openqaworker10 as both have the worker class configured to stick to openqaworker10
Trying to trigger jobs where one is started on w10 and the other is not, so that the GRE tunnel is used:
openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3552356 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0
Created job #3552894: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3552894
Created job #3552895: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3552895
The not-so-nice part here is that the ref job is in the original job group and build, polluting results. So manually updating the job group with
openqa_client_osd jobs/3552894 put --json-data '{"group_id": 132}'
where "132" is the SLE12 test development job group, but the job still shows up in the same scenario. I don't know how to update the build from the API, so this was done over SQL with update jobs set build='' where id=3552894;
but only after updating the name does the job disappear from the scenario list in https://openqa.suse.de/tests/3552355#next_previous , and this way https://openqa.suse.de/tests/overview?distri=sle&version=15&build=20191104-1 is also not polluted.
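For the record, the cleanup as it could be run on the OSD host; the database name and direct psql access are assumptions, only the IDs and values come from above:
# move the cloned job to the test development group via the API client
openqa_client_osd jobs/3552894 put --json-data '{"group_id": 132}'
# clear the build directly in the database (no API route known for this)
sudo -u postgres psql openqa -c "update jobs set build='' where id=3552894;"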
In the meantime other jobs were happily using the worker also over GRE tunnel (as again the worker class was reset and not actually limited to openqaworker10): https://openqa.suse.de/tests/3552813 and https://openqa.suse.de/tests/3552867
https://openqa.suse.de/tests/3552893 is the sut in the double-w10 scenario, all good, but the ref fails in https://openqa.suse.de/tests/3552892#step/t01_gre_tunnel_legacy/68 to upload a pcap file. This also happens in other cases as well though: https://openqa.suse.de/tests/3552355#step/t01_gre_tunnel_legacy/68
https://openqa.suse.de/t3552894 and https://openqa.suse.de/t3552895 were running on w10 and passed, so this neither showed a problem nor did it achieve what we wanted: using the GRE tunnel.
So first trying to get rid of this annoyance:
openqa-clone-job --within-instance https://openqa.suse.de --parental-inheritance --skip-chained-deps 3552356 BUILD= WORKER_CLASS=openqaworker10 _GROUP=0 CASEDIR=https://github.com/okurz/os-autoinst-distri-opensuse.git#fix/serial_terminal_upload
Created job #3552959: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_ref@64bit -> https://openqa.suse.de/t3552959
Created job #3552960: sle-15-Server-DVD-Updates-x86_64-Build20191104-1-qam_wicked_advanced_sut@64bit -> https://openqa.suse.de/t3552960
-> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8832
EDIT: All green.
and trying to force jobs onto differing machines: I updated the worker class over SQL with update job_settings set value='qemu_x86_64,tap' where (job_id=3552902 and key='WORKER_CLASS');
This does not work as the worker class change is not effective: both jobs started on w10 after starting two instances again.
All in all I currently see no problem with openqaworker10. Checking again the test logs on both two instances:
- openqaworker10:1 : last 40 jobs good, no jobs failed with w10 specific issues except maybe https://openqa.suse.de/tests/3544979#step/wireshark/96 which fails because the terminal title mentions "localhost.localdomain" instead of the shorter default hostname. Created needle "wireshark-generic-desktop-with-terminal-20191104" to cover less open space in the title bar
- openqaworker10:2 :
- failed in https://openqa.suse.de/tests/3550908#step/patch_and_reboot/59 with network problems, recheck. Cloned with
openqa_clone_job_osd 3552175 _GROUP=0 BUILD=X TEST=poo32605_qam-suseconnect WORKER_CLASS=openqaworker10
-> https://openqa.suse.de/t3552951 green
- https://openqa.suse.de/tests/3550178#step/gpg/105 shows a system log message covering a gpg console dialog. It did not seem to show up recently in the same scenario, but it is assumed to not be a w10 specific problem.
- https://openqa.suse.de/tests/3545156#step/user_defined_snapshot/32 failed already in other job runs in the same scenario. The test records a soft-fail referencing https://bugzilla.suse.com/show_bug.cgi?id=980337 which I have now reopened
- https://openqa.suse.de/tests/3545975#step/addon_products_sle/19 is https://progress.opensuse.org/issues/58028
- https://openqa.suse.de/tests/3546054#step/accept_license/3 is a single job scenario by b10n1k, probably for development, ignored
- https://openqa.suse.de/tests/3546064#step/installation/11 is consistently failing in the same scenario on other machines as well
- https://openqa.suse.de/tests/3546207# incomplete, same as on other machines
I will re-enable it for production but only with 2 worker instances for the next days. Unmasked and enabled salt-minion and openqa-worker.target, accepted the salt key on OSD and applied the current high state.
EDIT: 2019-11-11: Enabled worker instance {3..10}
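The re-enabling boils down to something like this sketch; the exact unit names for the two instances and the salt target are assumptions:
# on openqaworker10: bring the services back
systemctl unmask salt-minion.service openqa-worker.target
systemctl enable --now salt-minion.service openqa-worker@{1..2}.service
# on OSD: accept the minion key and apply the high state
salt-key -y -a openqaworker10.suse.de
salt 'openqaworker10*' state.apply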
Updated by okurz about 5 years ago
- Status changed from Feedback to Resolved
Enabled worker instances {3..10}; all 10 worker instances are now enabled. The machine is back and controlled by salt, with no special worker class or config. After thoroughly checking test results over the past days I now assume we do not have any worker specific problem left. Follow-up for the GRE tunnel setup issue in #59300