action #120261

closed

openQA Tests - action #107062: Multiple failures due to network issues

tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get 'worker2' or something auto_review:".*curl.*worker\d+:.*failed at.*":retry size:meow

Added by okurz almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2022-11-10
Due date:
% Done:

0%

Estimated time:

Description

Observation

openQA test in scenario sle-15-SP4-JeOS-for-kvm-and-xen-Updates-x86_64-jeos-kdump@svirt-xen-pv fails in
image_info
when trying to access the worker by the WORKER_HOSTNAME FQDN, which in this case should be worker2.oqa.suse.de, but here gets "worker2" instead.

It looks like WORKER_HOSTNAME is really not configured correctly in those cases. For example, when the same problem happened on worker6 yesterday there was really just "WORKER_HOSTNAME=worker6" in workers.ini. So this appears to be a problem on the salt level where the FQDN grain does not return the actual fully qualified domain name. On worker6 re-applying the salt states helped to get the full FQDN configured again, and rebooting the machine did not break it again.
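
For reference, a correctly configured worker should carry the FQDN in its openQA worker configuration, roughly like the following sketch (the section layout and value shown here are illustrative):

# /etc/openqa/workers.ini
[global]
# must be an address reachable from the SUT, i.e. the FQDN, not just "worker2"
WORKER_HOSTNAME = worker2.oqa.suse.de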

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
and call openqa-query-for-job-label poo#120261, for example as sketched below.
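
A minimal invocation sketch, assuming the script is fetched into the current directory (it may require additional tooling or permissions on the host it runs on):

curl -sO https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
chmod +x openqa-query-for-job-label
./openqa-query-for-job-label poo#120261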

Acceptance criteria

  • AC1: All recent jobs failing to upload to an incomplete worker hostname are retriggered and clones end up ok
  • AC2: Jobs are able to upload logs after reboot of the worker machine
  • AC3: Jobs still work just after a salt high state was applied

Acceptance tests

  • AT1-1: openqa-query-for-job-label poo#120261 returns no matches more recent than 48h
  • AT2-1: Trigger reboot of the machine at least 2 times, trigger openQA tests (or wait for jobs to finish automatically) and verify that jobs succeed to upload logs
  • AT3-1: Apply salt high state from OSD, trigger openQA tests (or await automatic results) and verify that jobs succeed to upload logs

Suggestions

  • See what has been done in #109241 originally
  • Maybe we need to specify the FQDN in /etc/hostname. If we do that then we should revisit all occurrences of "grains['host']" in https://gitlab.suse.de/openqa/salt-states-openqa
  • Check via sudo salt -C 'G@roles:worker' cmd.run 'grep -i worker_hostname /etc/openqa/workers.ini' on OSD whether all hostnames are configured correctly (see the example after this list)
  • If all other options fail we can still revert to hardcoding IPv4 addresses but FQDN would be preferred
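
What such a check could look like; the minion names and values below are purely illustrative, and a value without a domain part would indicate a misconfigured worker:

sudo salt -C 'G@roles:worker' cmd.run 'grep -i worker_hostname /etc/openqa/workers.ini'
worker2.oqa.suse.de:
    WORKER_HOSTNAME = worker2.oqa.suse.de
worker6.oqa.suse.de:
    WORKER_HOSTNAME = worker6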

Rollback steps

  • Add back worker2 to salt

Out of scope

Automatic distinction whether the upload problem originates from test object misconfiguration, a product regression or a problem within os-autoinst or openQA


Related issues 5 (0 open, 5 closed)

  • Related to QA - action #119443: Conduct the migration of SUSE openQA systems from Nbg SRV1 to new security zones size:M (Resolved, okurz, 2022-11-17)
  • Related to openQA Tests - action #120363: [qe-core][functional] test fails in prepare_test_data (Closed)
  • Related to openQA Project - action #120579: test fails in openqa_worker (Resolved, mkittler, 2022-11-15 to 2022-11-30)
  • Related to openQA Project - action #121567: test fails in test_running (Resolved, mkittler, 2022-12-06 to 2022-12-21)
  • Copied from openQA Infrastructure - action #109241: Prefer to use domain names rather than IPv4 in salt pillars size:M (Resolved, okurz)
Actions #1

Updated by okurz almost 2 years ago

  • Copied from action #109241: Prefer to use domain names rather than IPv4 in salt pillars size:M added
Actions #2

Updated by okurz almost 2 years ago

  • Related to action #119443: Conduct the migration of SUSE openQA systems from Nbg SRV1 to new security zones size:M added
Actions #3

Updated by mkittler almost 2 years ago

  • Description updated (diff)
  • Assignee set to mkittler
Actions #4

Updated by mkittler almost 2 years ago

  • Subject changed from tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get "worker2" or something to tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get "worker2" or something auto_review:".*curl.*worker\d+:.*failed at.*":retry

Letting auto-review handle the restarting for future occurrences.

About recent jobs: I found all affected jobs by running the following on OSD:

for d in 0991* 0992* ; do grep --files-with-matches -PR --include autoinst-log.txt '.*curl.*worker\d+:.*failed at.*' "$d" ; done > /home/martchus/failed-du-to-hostname.txt

and restarted them via:

for i in $(grep --invert-match ':retry' /home/martchus/failed-du-to-hostname.txt | cut -f2 --delimiter\=/ | cut -f1 --delimiter\=-) ; do sudo openqa-cli api --host "openqa.suse.de" -X POST jobs/"$i"/restart ; done

This automatically skips restarting jobs that have already been restarted.

The list of affected jobs found that way is fortunately not that long and seems plausible.

I'm currently checking on the restarted jobs via:

martchus@openqa:~> cat restarts.json | jq -r '[.[] | .result | .[] | map(.)] | flatten | sort | join(",")'
9921403,9921404,9921405,9921406,9921407,9921408,9921409,9921410,9921411,9921412,9921413,9921414,9921415,9921416,9921417,9921418,9921419,9921420,9921421,9921422,9921423,9921424,9921425,9921426,9921427,9921428,9921429
[martchus@linux-9lzf ~]$ openqa-mon https://openqa.suse.de --jobs 9921403,9921404,9921405,9921406,9921407,9921408,9921409,9921410,9921411,9921412,9921413,9921414,9921415,9921416,9921417,9921418,9921419,9921420,9921421,9921422,9921423,9921424,9921425,9921426,9921427,9921428,9921429
Actions #5

Updated by mkittler almost 2 years ago

  • Status changed from New to In Progress
Actions #6

Updated by openqa_review almost 2 years ago

  • Due date set to 2022-11-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by acarvajal almost 2 years ago

Noticed several HA jobs failing on this as well, for example:

https://openqa.suse.de/tests/9917615#step/patch_sle/96
https://openqa.suse.de/tests/9917874#step/setup_hosts_and_luns/12

Tagging those failures with the poo#.

Actions #8

Updated by okurz almost 2 years ago

  • Description updated (diff)

Investigating this together in the team. We learned from the hostname man page (man 5 hostname) that /etc/hostname should really only contain a label, so we should not put an FQDN there.
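
To illustrate the distinction; the values are what we would expect on worker2, and the note about /etc/hosts is an assumption about one possible source of the FQDN mapping:

cat /etc/hostname   # worker2               (short label only)
hostname            # worker2
hostname -f         # worker2.oqa.suse.de   (FQDN, requires working name resolution via DNS or e.g. an /etc/hosts entry)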

As acarvajal stated the issue still reproduces. One open point:

  1. DONE Why did auto-review not label https://openqa.suse.de/tests/9917615#step/patch_sle/96? -> We used double quotes for two terms in the subject line, which was the main problem; maybe the colon also contributed.

We reviewed the variables, e.g. WORKER_HOSTNAME, and despite slight ambiguity regarding the term "hostname" (see https://en.wikipedia.org/wiki/Hostname) we agreed that the variable name is ok.

os-autoinst/testapi.pm defines a function host_ip which for qemu returns an IPv4 address but in other cases returns the value of WORKER_HOSTNAME, which normally shouldn't be an IPv4 address but an FQDN.

  1. We should consider adding a new method "host_address" returning what host_ip does but providing a default value for WORKER_HOSTNAME if not set, and deprecating host_ip with a log message

-> https://github.com/os-autoinst/os-autoinst/pull/2202

We found that https://openqa.suse.de/tests/9925066 from 2022-11-10 23:37:35Z still has a correct WORKER_HOSTNAME=worker2.oqa.suse.de

  1. openQA itself uses a "worker_hostname" attribute that it initializes from POSIX::uname(), so it e.g. resolves to worker2 regardless of the FQDN. This is something that we should change: either get rid of it entirely if it is only used for logging, or rename the variable, or change it to return what WORKER_HOSTNAME in the config file states

Coming back to the timeline: https://openqa.suse.de/tests/9925066 from 2022-11-10 23:37:35Z is our "last good". The next job on the assigned worker slot worker2:45, https://openqa.suse.de/tests/9925838, has WORKER_HOSTNAME=worker2 in https://openqa.suse.de/tests/9925838/file/vars.json

We took a look at the system journal. With journalctl --since=today | grep -v 'worker2 worker\[' | less I found

Nov 11 00:00:10 worker2 openqa-worker-cacheservice-minion[28605]: [28605] [i] [#10801] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Nov 11 00:00:10 worker2 openqa-worker-cacheservice-minion[28605]: [28605] [i] [#10801] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Nov 11 00:01:00 worker2 telegraf[11514]: 2022-11-10T23:01:00Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed': hostname: Name or service not known
Nov 11 00:01:00 worker2 telegraf[11514]: 2022-11-10T23:01:00Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked': hostname: Name or service not known

but this problem recurs periodically. Specifically, what that bash script calls is hostname -f. Calling

sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'hostname -f'

shows that this problem really only appears on worker2. I have for now removed worker2 from the salt keys and manually edited /etc/openqa/workers.ini to contain the FQDN. To be sure that we have consistent data I triggered a reboot of worker2.

After the reboot the command hostname -f works just fine on worker2 as well. Maybe we just never rebooted the host since the migration and something was not restarted?

  1. If tests turn out to be ok I suggest re-adding the machine to salt, letting salt apply a high state and monitoring tests for at least 2 days

@mkittler please decide which of the above points you would like to follow up on yourself and which to pull out into separate tickets

Actions #9

Updated by okurz almost 2 years ago

  • Related to action #120363: [qe-core][functional] test fails in prepare_test_data added
Actions #10

Updated by okurz almost 2 years ago

Output of openqa-query-for-job-label poo#120261 as of now:

9926841|2022-11-11 14:26:35|done|failed|docker_tests||worker2
9926842|2022-11-11 14:25:59|done|failed|podman_tests||worker2
9926833|2022-11-11 14:18:33|done|failed|kubectl_tests||worker2
9918009|2022-11-11 05:14:06|done|failed|default||worker2
9917877|2022-11-11 05:09:14|done|failed|migration_offline_scc_verify_sle12sp5_ha_alpha_node02||worker2
9917874|2022-11-11 05:09:14|done|failed|migration_offline_scc_verify_sle12sp5_ha_alpha_node01||worker2
9917912|2022-11-11 04:42:34|done|failed|migration_online_sle15sp3_ha_alpha_node01||worker2
9917914|2022-11-11 04:07:53|done|failed|migration_online_sle15sp3_ha_alpha_node02||worker2
9917615|2022-11-11 02:03:44|done|failed|migration_offline_dvd_sle15sp3_ha_alpha_node02||worker2
9917612|2022-11-11 01:59:47|done|failed|migration_offline_dvd_sle15sp3_ha_alpha_node01||worker2

so either there are no further failed jobs or the regex fails to match.

Actions #11

Updated by okurz almost 2 years ago

  • Subject changed from tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get "worker2" or something auto_review:".*curl.*worker\d+:.*failed at.*":retry to tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get "worker2" or something auto_review:".*curl.*worker\d+.*failed at.*":retry
Actions #12

Updated by okurz almost 2 years ago

  • Subject changed from tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get "worker2" or something auto_review:".*curl.*worker\d+.*failed at.*":retry to tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get 'worker2' or something auto_review:".*curl.*worker\d+.*failed at.*":retry
Actions #13

Updated by mkittler almost 2 years ago

Note that the " in the ticket name prevented openqa-label-known-issues from working. That's why it was changed. Maybe we should document this limitation somewhere.

Actions #14

Updated by okurz almost 2 years ago

Over the weekend no further jobs were labeled with this ticket, so we assume our hypothesis is correct: salt updated the worker config file with the incomplete short hostname instead of the FQDN because hostname -f was not able to return the full hostname, presumably shortly after bootup. So I am conducting a "reboot test":

host=worker2.oqa.suse.de; for run in {1..5}; do for host in $host; do echo -n "run: $run, $host: ping .. " && timeout -k 5 600 sh -c "until ping -c30 $host >/dev/null; do :; done" && echo -n "ok, ssh .. " && timeout -k 5 600 sh -c "until nc -z -w 1 $host 22; do :; done" && echo -n "ok, hostname/uptime/reboot: " && ssh $host "hostname -f && uptime && sudo reboot" && sleep 120 || break; done || break; done

result:

run: 1, worker2.oqa.suse.de: ping .. ok, ssh .. ok, hostname/uptime/reboot: worker2.oqa.suse.de
 13:54:52  up   1:58,  1 user,  load average: 5.81, 5.87, 6.38
Connection to worker2.oqa.suse.de closed by remote host.
run: 2, worker2.oqa.suse.de: ping .. ok, ssh .. ok, hostname/uptime/reboot: worker2.oqa.suse.de
 14:04:19  up   0:03,  0 users,  load average: 2.93, 2.36, 1.01
Connection to worker2.oqa.suse.de closed by remote host.
run: 3, worker2.oqa.suse.de: ping .. ok, ssh .. ok, hostname/uptime/reboot: worker2.oqa.suse.de
 14:13:44  up   0:03,  0 users,  load average: 3.06, 2.47, 1.05
Connection to worker2.oqa.suse.de closed by remote host.
run: 4, worker2.oqa.suse.de: ping .. ok, ssh .. ok, hostname/uptime/reboot: worker2.oqa.suse.de
 14:23:05  up   0:03,  0 users,  load average: 2.21, 1.68, 0.72
Connection to worker2.oqa.suse.de closed by remote host.
run: 5, worker2.oqa.suse.de: ping .. ok, ssh .. ok, hostname/uptime/reboot: worker2.oqa.suse.de
 14:32:31  up   0:03,  0 users,  load average: 2.43, 2.47, 1.11
Connection to worker2.oqa.suse.de closed by remote host.

so it looks like there is no problem retrieving the FQDN after a reboot. What could be happening is that during early system initialization the hostname cannot yet be properly resolved. The command sudo journalctl -b | grep hostname reveals:

Nov 14 14:39:23 worker2 ovs-ctl[1812]: hostname: Name or service not known
Nov 14 14:39:23 worker2 ovs-vsctl[1817]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait add Open_vSwitch . external-ids hostname=worker2
Nov 14 14:39:23 worker2 ovs-ctl[1945]: hostname: Name or service not known
Nov 14 14:39:23 worker2 ovs-vsctl[1954]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait add Open_vSwitch . external-ids hostname=worker2
Nov 14 14:40:03 worker2 worker[3023]:  - worker hostname:       worker2
Nov 14 14:40:03 worker2 worker[2986]:  - worker hostname:       worker2
Nov 14 14:40:03 worker2 worker[3024]:  - worker hostname:       worker2
Nov 14 14:40:03 worker2 worker[2987]:  - worker hostname:       worker2
Nov 14 14:40:03 worker2 worker[3027]:  - worker hostname:       worker2
Nov 14 14:40:03 worker2 worker[2985]:  - worker hostname:       worker2
…
Nov 14 14:40:04 worker2 worker[2988]:  - worker hostname:       worker2
Nov 14 14:40:07 worker2 salt-minion[3182]: [ERROR   ] Master hostname: 'openqa.suse.de' not found or not responsive. Retrying in 30 seconds
Nov 14 14:40:37 worker2 salt-minion[3182]: [ERROR   ] Master hostname: 'openqa.suse.de' not found or not responsive. Retrying in 30 seconds
Nov 14 14:41:00 worker2 telegraf[2839]: 2022-11-14T13:41:00Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed': hostname: Name or service not known
Nov 14 14:41:00 worker2 telegraf[2839]: 2022-11-14T13:41:00Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked': hostname: Name or service not known
Nov 14 14:41:07 worker2 salt-minion[3182]: [ERROR   ] Master hostname: 'openqa.suse.de' not found or not responsive. Retrying in 30 seconds
Nov 14 14:41:37 worker2 salt-minion[3182]: [ERROR   ] Master hostname: 'openqa.suse.de' not found or not responsive. Retrying in 30 seconds
Nov 14 14:42:17 worker2 worker[5628]:  - worker hostname:       worker2
…
Nov 14 15:04:48 worker2 worker[7783]:  - worker hostname:       worker2

Anyway, following up on the points from #120261#note-8 would help.

EDIT: Maybe also https://askubuntu.com/questions/185070/why-do-i-get-hostname-name-or-service-not-known-error/185084 and https://unix.stackexchange.com/questions/283550/i-got-error-hostname-name-or-service-not-known-when-checking-ip-of-hostname and https://man7.org/linux/man-pages/man8/nss-myhostname.8.html help
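
As a sketch of the nss-myhostname approach from the last link (an assumption about a possible mitigation, not the current worker configuration): adding the myhostname module to the hosts database in /etc/nsswitch.conf makes the machine's own hostname resolvable even before DNS is reachable, although it does not by itself make hostname -f return the FQDN:

# /etc/nsswitch.conf (sketch)
hosts: files myhostname dns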

Actions #16

Updated by mkittler almost 2 years ago

The latest occurrence is still from 2022-11-11 (based on the automatic labeling which should now work as we've tested yesterday).

Actions #17

Updated by okurz almost 2 years ago

Actions #18

Updated by livdywan almost 2 years ago

  • Subject changed from tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get 'worker2' or something auto_review:".*curl.*worker\d+.*failed at.*":retry to tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get 'worker2' or something auto_review:".*curl.*worker\d+.*failed at.*":retry size:meow
  • Description updated (diff)
Actions #19

Updated by mkittler almost 2 years ago

  • Description updated (diff)

The most recent jobs returned by openqa-query-for-job-label poo#120261 are still from 11-11. So I assume that also relying on the new worker feature, which was tested on worker11 and worker12, didn't make things worse. So I'm preparing an SR to use it everywhere. I'll also add worker11 and worker12 back to salt.

Actions #21

Updated by mkittler almost 2 years ago

  • Description updated (diff)
Actions #23

Updated by mkittler almost 2 years ago

The SR has been merged. So far it still looks good.

Here's another SR to remove the explicit override of WORKER_HOSTNAME for grenache-1: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/463

Actions #24

Updated by mkittler almost 2 years ago

  • Status changed from In Progress to Feedback
Actions #25

Updated by okurz almost 2 years ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/463 merged.

Please look into:

$ openqa-query-for-job-label poo#120261
9986752|2022-11-18 09:36:39|done|failed|gi-guest_developing-on-host_developing-xen||worker2
9981295|2022-11-17 22:38:36|done|failed|sriov_pci_passthrough-guest_sles12sp5_fv-on-host_developing-xen||worker2
Actions #26

Updated by okurz almost 2 years ago

FYI: Traffic between .oqa.suse.de and .qa.suse.de unblocked, see #120264#note-9

I checked
https://openqa.suse.de/tests/9986752#step/xen_guest_irqbalance/175 and found it failing to upload, but the FQDN was specified, so that's not the same problem. Same for https://openqa.suse.de/tests/9981295#step/sriov_network_card_pci_passthrough/173.

I think worker2 can be considered good again and you can conduct the rollback steps and conclude.

Actions #27

Updated by mkittler almost 2 years ago

  • Status changed from Feedback to Resolved

I don't know on what basis we can consider worker2 specifically good again because it wasn't in salt (so salt would not have overridden the FQDN setting with a short hostname anyway). However, all the other workers still look good, so I suppose the setup relying on the worker's auto-detection is good enough. I added worker2 back to salt so it is now also using that setup.

Actions #29

Updated by mkittler almost 2 years ago

Looks like WORKER_HOSTNAME was completely empty when https://openqa.suse.de/tests/10011018 ran (no entry in https://openqa.suse.de/tests/10011018/file/vars.json and the variable not logged in https://openqa.suse.de/tests/10011018/logfile?filename=worker-log.txt). Likely the worker couldn't determine the FQDN on startup due to some network issue.

Actions #30

Updated by mkittler almost 2 years ago

I had a closer look at the logs¹. Apparently just an old version of the worker package was installed at the time. Only from "Nov 22 06:22:49" on do the logs show "worker address (WORKER_HOSTNAME)" (the log message of the newer worker version) and not just "worker hostname" (the log message of the older worker version). If it was just the worker version then we should be good again (and don't need to blame DNS availability).


¹

sudo journalctl --since '7 hours ago' -fu openqa-worker-auto-restart@45
…
Nov 22 05:22:38 worker2 worker[30782]: [debug] [pid:30782] Informing openqa.suse.de that we are going offline
Nov 22 05:22:38 worker2 systemd[1]: openqa-worker-auto-restart@45.service: Deactivated successfully.
Nov 22 05:22:38 worker2 systemd[1]: openqa-worker-auto-restart@45.service: Scheduled restart job, restart counter is at 59.
Nov 22 05:22:38 worker2 systemd[1]: Stopped openQA Worker #45.
Nov 22 05:22:38 worker2 systemd[1]: Starting openQA Worker #45...
Nov 22 05:22:38 worker2 systemd[1]: Started openQA Worker #45.
Nov 22 05:22:39 worker2 worker[32113]: [info] [pid:32113] worker 45:
Nov 22 05:22:39 worker2 worker[32113]:  - config file:           /etc/openqa/workers.ini
Nov 22 05:22:39 worker2 worker[32113]:  - worker hostname:       worker2
Nov 22 05:22:39 worker2 worker[32113]:  - isotovideo version:    33
Nov 22 05:22:39 worker2 worker[32113]:  - websocket API version: 1
Nov 22 05:22:39 worker2 worker[32113]:  - web UI hosts:          openqa.suse.de
Nov 22 05:22:39 worker2 worker[32113]:  - class:                 s390-kvm-sle12,worker2
Nov 22 05:22:39 worker2 worker[32113]:  - no cleanup:            no
Nov 22 05:22:39 worker2 worker[32113]:  - pool directory:        /var/lib/openqa/pool/45
Nov 22 05:22:39 worker2 worker[32113]: [info] [pid:32113] Project dir for host openqa.suse.de is /var/lib/openqa/share
Nov 22 05:22:39 worker2 worker[32113]: [info] [pid:32113] Registering with openQA openqa.suse.de
Nov 22 05:22:39 worker2 worker[32113]: [info] [pid:32113] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2122
Nov 22 05:22:39 worker2 worker[32113]: [info] [pid:32113] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 2122
Nov 22 06:18:04 worker2 worker[32113]: [warn] [pid:32113] Websocket connection to http://openqa.suse.de/api/v1/ws/2122 finished by remote side with code 1006, no reason - trying again in 10 seconds
Nov 22 06:18:14 worker2 worker[32113]: [info] [pid:32113] Registering with openQA openqa.suse.de
Nov 22 06:18:16 worker2 worker[32113]: [info] [pid:32113] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2122
Nov 22 06:18:18 worker2 worker[32113]: [info] [pid:32113] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 2122
Nov 22 06:22:46 worker2 systemd[1]: Reloading openQA Worker #45...
Nov 22 06:22:46 worker2 worker[32113]: [info] [pid:32113] Received signal HUP
Nov 22 06:22:46 worker2 worker[32113]: [debug] [pid:32113] Informing openqa.suse.de that we are going offline
Nov 22 06:22:46 worker2 systemd[1]: Reloaded openQA Worker #45.
Nov 22 06:22:46 worker2 systemd[1]: openqa-worker-auto-restart@45.service: Deactivated successfully.
Nov 22 06:22:46 worker2 systemd[1]: openqa-worker-auto-restart@45.service: Scheduled restart job, restart counter is at 60.
Nov 22 06:22:46 worker2 systemd[1]: Stopped openQA Worker #45.
Nov 22 06:22:46 worker2 systemd[1]: Starting openQA Worker #45...
Nov 22 06:22:46 worker2 systemd[1]: Started openQA Worker #45.
Nov 22 06:22:49 worker2 worker[19963]: [info] [pid:19963] worker 45:
Nov 22 06:22:49 worker2 worker[19963]:  - config file:                      /etc/openqa/workers.ini
Nov 22 06:22:49 worker2 worker[19963]:  - name used to register:            worker2
Nov 22 06:22:49 worker2 worker[19963]:  - worker address (WORKER_HOSTNAME): worker2.oqa.suse.de
Nov 22 06:22:49 worker2 worker[19963]:  - isotovideo version:               34
Nov 22 06:22:49 worker2 worker[19963]:  - websocket API version:            1
Nov 22 06:22:49 worker2 worker[19963]:  - web UI hosts:                     openqa.suse.de
Nov 22 06:22:49 worker2 worker[19963]:  - class:                            s390-kvm-sle12,worker2
Nov 22 06:22:49 worker2 worker[19963]:  - no cleanup:                       no
Nov 22 06:22:49 worker2 worker[19963]:  - pool directory:                   /var/lib/openqa/pool/45
Nov 22 06:22:49 worker2 worker[19963]: [info] [pid:19963] Project dir for host openqa.suse.de is /var/lib/openqa/share
Nov 22 06:22:49 worker2 worker[19963]: [info] [pid:19963] Registering with openQA openqa.suse.de
Nov 22 06:22:49 worker2 worker[19963]: [info] [pid:19963] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2122
Nov 22 06:22:49 worker2 worker[19963]: [info] [pid:19963] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 2122
Nov 22 08:44:35 worker2 worker[19963]: [info] [pid:19963] Received signal HUP
Nov 22 08:44:35 worker2 worker[19963]: [debug] [pid:19963] Informing openqa.suse.de that we are going offline
Nov 22 08:44:35 worker2 systemd[1]: Reloading openQA Worker #45...
Nov 22 08:44:36 worker2 systemd[1]: openqa-worker-auto-restart@45.service: Deactivated successfully.
Nov 22 08:44:36 worker2 systemd[1]: Reloaded openQA Worker #45.
Nov 22 08:44:36 worker2 systemd[1]: openqa-worker-auto-restart@45.service: Scheduled restart job, restart counter is at 61.
Nov 22 08:44:36 worker2 systemd[1]: Stopped openQA Worker #45.
Actions #31

Updated by livdywan almost 2 years ago

  • Related to action #120441: OSD parallel jobs failed with "get_job_autoinst_url: No worker info for job xxx available" size:meow added
Actions #32

Updated by mkittler almost 2 years ago

I've just added the worker back to salt and applied the salt states immediately. I suppose I should have updated all packages as well. Now it is good anyway because a deployment happened (automatically) in the meantime.

Since it was just that, we can likely give the FQDN setup another chance (before resorting to using IPs again).

Actions #33

Updated by mkittler almost 2 years ago

  • Related to deleted (action #120441: OSD parallel jobs failed with "get_job_autoinst_url: No worker info for job xxx available" size:meow)
Actions #34

Updated by mkittler almost 2 years ago

  • Status changed from Feedback to Resolved

I found no further jobs so I suppose it is good after the update.

(I actually found https://openqa.suse.de/tests/10013250 but it had the correct WORKER_HOSTNAME setting and failed because the SUT could not resolve the FQDN, see https://openqa.suse.de/tests/10013250#step/journal_check/8. I removed the wrong ticket reference on that job. It has already been restarted, ran on worker2 again and also succeeded. Not sure why it couldn't resolve the correct FQDN in this case but since the FQDN itself was correct I'd ignore this for now unless we see it more frequently.)

Actions #36

Updated by mkittler almost 2 years ago

The hostname resolution was indeed not working:

martchus@worker2:~> sudo journalctl -u openqa-worker-auto-restart@* | grep 'worker address'
Nov 23 21:41:33 worker2 worker[22686]:  - worker address (WORKER_HOSTNAME): worker2.oqa.suse.de
Nov 23 21:46:18 worker2 worker[25007]:  - worker address (WORKER_HOSTNAME): worker2.oqa.suse.de
Nov 23 21:46:26 worker2 worker[25034]:  - worker address (WORKER_HOSTNAME): worker2.oqa.suse.de
…
Nov 27 03:12:10 worker2 worker[5125]:  - worker address (WORKER_HOSTNAME): worker2.oqa.suse.de
Nov 27 03:21:02 worker2 worker[5937]:  - worker address (WORKER_HOSTNAME): worker2.oqa.suse.de
Nov 27 03:22:44 worker2 worker[6012]:  - worker address (WORKER_HOSTNAME): worker2.oqa.suse.de
Nov 27 03:37:52 worker2 worker[3420]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3419]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3458]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3425]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3407]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3437]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3417]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3442]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3433]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3451]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3418]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3409]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3440]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3428]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3415]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3445]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3444]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3461]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3411]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3446]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3452]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3464]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3416]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3435]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3449]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3450]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3453]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3436]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3423]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3413]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3410]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3448]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3439]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3421]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3426]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3424]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3414]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3443]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3438]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3463]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3447]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3459]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3427]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3465]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3434]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3429]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3406]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3412]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3432]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3431]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3422]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3430]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 03:37:52 worker2 worker[3408]:  - worker address (WORKER_HOSTNAME): worker2
Nov 27 10:47:12 worker2 worker[15958]:  - worker address (WORKER_HOSTNAME): worker2.oqa.suse.de
Nov 27 10:56:55 worker2 worker[16865]:  - worker address (WORKER_HOSTNAME): worker2.oqa.suse.de
Nov 27 12:29:22 worker2 worker[28039]:  - worker address (WORKER_HOSTNAME): worker2.oqa.suse.de
Nov 27 12:30:07 worker2 worker[28192]:  - worker address (WORKER_HOSTNAME): worker2.oqa.suse.de
…

Since the FQDN apparently cannot be reliably detected (at least shortly after booting) I see multiple options:

  1. We go back to using IP addresses again
    1. by setting WORKER_HOSTNAME again via salt to an IPv4 address.
    2. by adding a fallback to the worker's auto-detection so that it at least assigns an IP if it cannot determine the FQDN (see the sketch after this list).
    3. by setting WORKER_HOSTNAME_FALLBACK via salt to an IPv4 address to fix the flaw of 1.2 pointed out in my next comment.
  2. We let the worker consider itself "broken" if the FQDN isn't set or cannot be auto-detected. We can simply re-use the already existing mechanisms, which would also make retrying the auto-detection easy to implement.
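
A minimal sketch of the fallback idea from option 1.2, written as shell purely for illustration; the actual worker is Perl and the helper name here is hypothetical:

detect_worker_address() {
  local fqdn
  fqdn=$(hostname -f 2>/dev/null)
  if [[ "$fqdn" == *.* ]]; then
    echo "$fqdn"                    # a proper FQDN could be determined
  else
    hostname -I | awk '{print $1}'  # fall back to the first assigned IP address
  fi
}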
Actions #37

Updated by okurz almost 2 years ago

1.2 sounds like a good idea

How likely do you consider the hypothesis that this is a worker2 specific problem? Did we see any problem on grenache-1 that also runs a lot of non-qemu workers?

Actions #38

Updated by mkittler almost 2 years ago

  • Due date changed from 2022-11-25 to 2022-12-09

There's more work to be done so I'm bumping the due date.


1.2 sounds like a good idea

When thinking about it further myself I must say that it likely won't cut it. If the worker is started early enough, not even an IP address might be assigned yet. With 1.1 we likely wouldn't run into the problem because Salt supposedly only really becomes "active" once there is also network (at least we haven't seen any problems putting the IP address there via Salt in the past).

So likely it's either 1.1 or 2 after all. I've also added the new option 1.3 above, but I find it rather messy if this gets spread over two different places (Salt and the worker code).

Actions #39

Updated by mkittler almost 2 years ago

  • Status changed from Workable to Feedback
Actions #40

Updated by mkittler almost 2 years ago

  • Subject changed from tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get 'worker2' or something auto_review:".*curl.*worker\d+.*failed at.*":retry size:meow to tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get 'worker2' or something auto_review:".*curl.*worker\d+:.*failed at.*":retry size:meow

The PR has just been merged.

I've also re-conducted AT-1 to see how bad the situation is (without the PR being deployed so far). There were some jobs labeled but only one of them was actually about this issue (https://openqa.suse.de/tests/10088274). Otherwise jobs were labeled incorrectly, either due to the regex of this ticket being too generic, wrong manual labeling or wrong automatic takeover (the failure happened in the very same set of test modules, so that's expected). I removed the wrong labels and pinged Nan Zhang about the wrong labeling in the chat.

That's where this ticket's regex wrongly matched: https://openqa.suse.de/tests/10090329#step/suseconnect_scc/17 - those jobs contain log lines like

[2022-12-06T02:48:45.338104+01:00] [info] ::: basetest::runtest: # Test died: command 'curl --form upload=@/tmp/full_journal.log --form upname=journal_check-full_journal.log http://worker2.oqa.suse.de:20113/VCc6rYDvppJx12SJ/uploadlog/full_journal.log' failed at /usr/lib/os-autoinst/testapi.pm line 922.
    testapi::assert_script_run("curl --form upload=\@/tmp/full_journal.log --form upname=journ"..., 90) called at /usr/lib/os-autoinst/testapi.pm line 2191
    testapi::upload_logs("/tmp/full_journal.log") called at sle/tests/console/journal_check.pm line 91

So I made the regex more specific by including the colon.

Actions #41

Updated by okurz almost 2 years ago

Actions #42

Updated by okurz almost 2 years ago

As discussed after the daily on 2022-12-07, these are the refreshed ACs we want to cover:

  1. It is possible for users to override WORKER_HOSTNAME with a specific FQDN or IP address
  2. For a properly resolvable FQDN, WORKER_HOSTNAME does not need to be configured and the developer mode still works, i.e. we don't need to specify WORKER_HOSTNAME by default in OSD salt pillars because our FQDNs are sane
  3. For a single-host webUI+worker setup the worker by default connects to 'localhost' (i.e. the default for the HOST variable) and WORKER_HOSTNAME also does not need to be specified even if the FQDN is not fully resolvable, i.e. openQA-in-openQA tests work with an "incomplete qemu usermode network"
  4. A combination of localhost+non-local in HOST does not need to auto-resolve correctly for localhost, i.e. users need to ensure a proper FQDN resolution that also works locally (2.) or just need to specify it manually
  5. REJECTED: In case auto-resolution does not work (regarding 2.) openQA jobs still start to run and fail accordingly with a message, e.g. that the log upload fails -> We can only ensure 2. if we never start jobs without a valid WORKER_HOSTNAME unless we explicitly "opt out"
Actions #43

Updated by mkittler almost 2 years ago

"Avoid running jobs with undetermined worker address" has been deployed today so this issue should be gone. There's also https://github.com/os-autoinst/openQA/pull/4949 which hasn't been deployed yet to fix the regression handled in #121567 according to #120261#note-42.

Actions #44

Updated by mkittler almost 2 years ago

There was another occurrence (https://openqa.suse.de/tests/10086991) but it ran before the change had been deployed.

Unfortunately 9 other jobs were labeled via automatic carry-over. The jobs the label had been carried over from were either already wrongly labeled manually (e.g. https://openqa.suse.de/tests/10086966#comments) or the carry-over's limited ability to distinguish different problems led to it (e.g. https://openqa.suse.de/tests/10097687#comments). Maybe we can at least improve the latter. For now I've just removed the wrong labels.

Actions #45

Updated by okurz almost 2 years ago

#121567 has been resolved. IMHO we can resolve here as well?

Actions #46

Updated by okurz almost 2 years ago

  • Due date changed from 2022-12-09 to 2022-12-16
  • Status changed from Feedback to Workable
$ openqa-query-for-job-label poo#120261
10111972|2022-12-09 06:52:49|done|failed|fips_install_lvm_encrypt_separate_boot||worker2
10110342|2022-12-09 04:03:30|done|failed|fips_install_lvm_encrypt_separate_boot||worker2
10105119|2022-12-08 04:27:58|done|failed|fips_install_lvm_encrypt_separate_boot||worker2
10104295|2022-12-08 03:42:15|done|failed|fips_install_lvm_encrypt_separate_boot||worker2
10086982|2022-12-07 16:41:23|done|failed|xfstests_xfs-generic||worker2
10086976|2022-12-07 16:17:13|done|failed|xfstests_btrfs-generic||worker2
10097638|2022-12-07 08:31:21|done|failed|fips_install_lvm_encrypt_separate_boot||worker2
10086991|2022-12-06 01:31:41|done|failed|jeos-base+sdk+desktop||worker2
10089358|2022-12-06 01:30:37|done|failed|jeos-containers-podman:investigate:last_good_tests:4adccec6a14a722ab92fc25f72b7a5a6a9d2bfd5||worker2
10086996|2022-12-06 01:22:02|done|failed|jeos-extratest||worker2

Looking at the most recent one I find https://openqa.suse.de/tests/10077880/file/vars.json referencing WORKER_HOSTNAME=worker2, which again is wrong and problematic.

Right now the worker https://openqa.suse.de/admin/workers/2111 again shows the correct "worker2.oqa.suse.de".

I logged in to worker2 over ssh, checked rpm -q --changelog openQA-worker and found that also the latest PR https://github.com/os-autoinst/openQA/pull/4949 is included in the installed package.

So apparently the later PR did not resolve the issues. We need to focus on this task next week.

Actions #47

Updated by okurz almost 2 years ago

  • Status changed from Workable to Feedback

I recommend hardcoding the FQDN in the worker config of worker2 only, assuming we only have special problems there:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/474

Actions #48

Updated by mkittler almost 2 years ago

Your investigation doesn't make sense. The problematic job https://openqa.suse.de/tests/10077880 ran 8 days ago. At that point https://github.com/os-autoinst/openQA/pull/4935 (which is the actually relevant PR) hadn't been merged yet, as it was only merged 6 days ago.

I'll nevertheless check labeled jobs again myself to see whether there are any relevant ones.

Actions #49

Updated by mkittler almost 2 years ago

I only found jobs wrongly labeled manually, jobs with wrong carry-over, or jobs that ran before https://github.com/os-autoinst/openQA/pull/4935 was merged.

Actions #51

Updated by okurz almost 2 years ago

  • Due date deleted (2022-12-16)
  • Status changed from Feedback to Resolved

mkittler wrote:

MR to remove workaround: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/475

MR was merged. I am not aware of further problems. It seems we are good.

Actions #52

Updated by yosun almost 2 years ago

Hello, I have a question: may I expect this change to be enabled in OSD SP5 test build 61.1 or later? My s390x case still fails in the curl download log part in build 61.1, any idea?
https://openqa.suse.de/tests/10185510#step/generate_report/2

Actions #53

Updated by okurz almost 2 years ago

yosun wrote:

Hello, I have a question: may I expect this change to be enabled in OSD SP5 test build 61.1 or later?

yes, the change is effective since https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/475 was merged and the linked pipeline job passed.

My s390x case still fails in the curl download log part in build 61.1, any idea?
https://openqa.suse.de/tests/10185510#step/generate_report/2

https://openqa.suse.de/tests/10185510#step/generate_report/2 shows

# wait_serial expected: "curl -O http://worker2.oqa.suse.de:20403/poqC6RHW8bUb6NVl/files/status.log; cat status.log > /opt/status.log; echo ZlKoJ-\$?-"
# Result:

so the correct FQDN "worker2.oqa.suse.de" is referenced. Hence the problem must be something different. I suggest you investigate in more detail and create a specific ticket for that.

Actions #54

Updated by openqa_review over 1 year ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: xfstests_xfs
https://openqa.suse.de/tests/10339785#step/generate_report/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #55

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

This was just a takeover from a 2 month old job with a wrong manual amendment. The failing job has "Test died: command 'curl -O http://worker2.oqa.suse.de:20303" so the FQDN is correct (and the download error happened despite that). I'm closing this issue again and pinged the user.

Actions #56

Updated by openqa_review over 1 year ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-apparmor@svirt-vmware65
https://openqa.suse.de/tests/10436752#step/journal_check/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #57

Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

I called

$ interval='90 day' openqa-query-for-job-label poo#120261
10436780|2023-02-06 17:04:32|done|failed|jeos-main||worker2
10353788|2023-01-20 14:54:15|done|failed|jeos-apparmor||worker2
10353481|2023-01-20 14:22:53|done|failed|jeos-main||worker2
10165585|2022-12-14 10:24:43|done|failed|jeos-containers-podman||worker2
10086991|2022-12-06 01:31:41|done|failed|jeos-base+sdk+desktop||worker2
10086996|2022-12-06 01:22:02|done|failed|jeos-extratest||worker2
10088170|2022-12-06 01:20:01|done|failed|online_upgrade_sles15sp4_hyperv||worker2
10088145|2022-12-06 01:05:30|done|failed|textmode_svirt||worker2
10089379|2022-12-06 00:54:27|done|failed|jeos-base+sdk+desktop||worker2
10087142|2022-12-06 00:54:20|done|failed|jeos-main||worker2

and removed the label from the previously mentioned job and from the more recent ones in the list above. I also checked https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html for references, which only showed two jobs from the above list. We only found automatic take-overs, no manually referenced jobs, so I would say for now we are ok with just removing those references from the jobs, assuming that there won't be more take-overs.
