action #120261

closed

openQA Tests - action #107062: Multiple failures due to network issues

tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get 'worker2' or something auto_review:".*curl.*worker\d+:.*failed at.*":retry size:meow

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2022-11-10
Due date:
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario sle-15-SP4-JeOS-for-kvm-and-xen-Updates-x86_64-jeos-kdump@svirt-xen-pv fails in image_info when trying to access the worker by its WORKER_HOSTNAME FQDN, which in this case should be worker2.oqa.suse.de; the test gets "worker2" instead.

It looks like WORKER_HOSTNAME is actually misconfigured in those cases. For example, when the same problem happened on worker6 yesterday, workers.ini contained just "WORKER_HOSTNAME=worker6". So this appears to be a problem at the salt level, where the FQDN grain doesn't return the actual fully qualified domain name. On worker6, re-applying the salt states restored the full FQDN in the configuration. Rebooting the machine did not break it again.
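The misconfiguration described above could be detected with a small check like the following. This is only a sketch, not part of os-autoinst or openQA: the function name is hypothetical, it assumes WORKER_HOSTNAME lives in the [global] section of workers.ini, and it treats "contains at least one dot between non-empty labels" as a rough stand-in for "is an FQDN".

```python
import configparser


def worker_hostname_is_fqdn(ini_text: str) -> bool:
    """Return True if WORKER_HOSTNAME in [global] looks fully qualified.

    Hypothetical helper; assumes the setting is stored in the [global] section.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # configparser lower-cases option names by default, so look up the
    # lower-case key; a missing section or option yields the empty fallback.
    hostname = parser.get("global", "worker_hostname", fallback="")
    # Heuristic: an FQDN consists of at least two non-empty dot-separated labels.
    labels = hostname.strip().rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


# The broken value quoted in this ticket and the expected FQDN form.
broken = "[global]\nWORKER_HOSTNAME=worker6\n"
expected = "[global]\nWORKER_HOSTNAME=worker2.oqa.suse.de\n"

print(worker_hostname_is_fqdn(broken))
print(worker_hostname_is_fqdn(expected))
```

A check of this kind only inspects the rendered file; it does not explain why the salt FQDN grain returned a short name.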

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#120261

Acceptance criteria

  • AC1: All recent jobs failing to upload to an incomplete worker hostname are retriggered and clones end up ok
  • AC2: Jobs are able to upload logs after reboot of the worker machine
  • AC3: Jobs still work just after a salt high state was applied

Acceptance tests

  • AT1-1: openqa-query-for-job-label poo#120261 returns no matches more recent than 48h
  • AT2-1: Trigger a reboot of the machine at least 2 times, trigger openQA tests (or wait for jobs to finish automatically) and verify that jobs succeed in uploading logs
  • AT3-1: Apply a salt high state from OSD, trigger openQA tests (or await automatic results) and verify that jobs succeed in uploading logs
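The 48h criterion in AT1-1 can be illustrated with a short sketch. It assumes the finish times of the matching jobs are already available as timezone-aware datetimes; how they are extracted from the output of openqa-query-for-job-label is deliberately left out, and the function name and sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def matches_within_window(finished_at, now, window=timedelta(hours=48)):
    """Return the timestamps that fall inside the look-back window."""
    return [t for t in finished_at if timedelta(0) <= now - t <= window]


# Made-up sample data, not taken from real openQA jobs.
now = datetime(2022, 11, 14, 12, 0, tzinfo=timezone.utc)
finished = [
    datetime(2022, 11, 10, 9, 0, tzinfo=timezone.utc),   # older than 48h
    datetime(2022, 11, 13, 18, 0, tzinfo=timezone.utc),  # 18h before `now`
]

recent = matches_within_window(finished, now)
# AT1-1 would be satisfied only if no match is more recent than 48h.
at1_1_passes = not recent
```

With this sample data one job falls inside the window, so the acceptance test would not yet pass.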

Suggestions

  • See what has been done in #109241 originally
  • Maybe we need to specify the FQDN in /etc/hostname. If we do that, then we should revisit all occurrences of "grains['host']" in https://gitlab.suse.de/openqa/salt-states-openqa
  • Check via sudo salt -C 'G@roles:worker' cmd.run 'grep -i worker_hostname /etc/openqa/workers.ini' on OSD whether all hostnames are configured correctly
  • If all other options fail we can still revert to hardcoding IPv4 addresses but FQDN would be preferred
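The output of the salt/grep check suggested above could be scanned automatically for short hostnames. The following is a sketch under assumptions: it expects salt's default text output (an unindented "minion-id:" line followed by indented command output), the function name is hypothetical, and the minion ids in the sample are invented.

```python
def find_short_hostnames(salt_output: str) -> dict:
    """Map minion id -> WORKER_HOSTNAME value for values that contain no dot."""
    offenders = {}
    minion = None
    for line in salt_output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # An unindented line ending in ":" starts the block of a new minion.
        if not line[0].isspace() and stripped.endswith(":"):
            minion = stripped[:-1]
            continue
        # grep -i also matches commented-out settings; ignore those.
        if stripped.startswith("#"):
            continue
        key, sep, value = stripped.partition("=")
        if minion and sep and key.strip().upper() == "WORKER_HOSTNAME":
            value = value.strip()
            if "." not in value:
                offenders[minion] = value
    return offenders


# Invented sample in the assumed output format.
sample = """\
minion-a:
    WORKER_HOSTNAME = worker6
minion-b:
    WORKER_HOSTNAME = worker2.oqa.suse.de
"""

print(find_short_hostnames(sample))
```

An empty result would indicate that every reported WORKER_HOSTNAME contains a domain part; it does not verify that the name actually resolves.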

Rollback steps

  • Add back worker2 to salt

Out of scope

Automatic distinction of whether the upload problem originates from test object misconfigurations, product regressions, or problems within os-autoinst or openQA


Related issues 5 (0 open, 5 closed)

Related to QA - action #119443: Conduct the migration of SUSE openQA systems from Nbg SRV1 to new security zones size:M (Resolved, okurz, 2022-11-17)

Related to openQA Tests - action #120363: [qe-core][functional] test fails in prepare_test_data (Closed)

Related to openQA Project - action #120579: test fails in openqa_worker (Resolved, mkittler, 2022-11-15, 2022-11-30)

Related to openQA Project - action #121567: test fails in test_running (Resolved, mkittler, 2022-12-06, 2022-12-21)

Copied from openQA Infrastructure - action #109241: Prefer to use domain names rather than IPv4 in salt pillars size:M (Resolved, okurz)
