action #131123

closed

qa-jump.qe.nue2.suse.org is not reachable since 2023-06-19

Added by okurz 11 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2023-06-20
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://openqa.suse.de/tests/11377514/video?filename=video.ogv&t=19.8,20.5
shows that the PXE server on qa-jump.qe.nue2.suse.org is not reachable. I also cannot ping the host at all anymore. Also visible in https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4&from=1687165791669&to=1687222145516

The host seems to have stopped responding at 2023-06-19 13:46:48Z.

Expected result

The machines should be able to boot from PXE.

Last good: https://openqa.suse.de/tests/11377512#step/boot_from_pxe/1 from 2023-06-19 13:17:59Z


Related issues 3 (1 open, 2 closed)

Related to QA - coordination #121726: [epic] Get management access to o3/osd and other QE related VMs (Blocked, okurz, 2022-12-08)

Has duplicate openQA Infrastructure - action #131141: [alert] Packet loss between qa-jump.qe.nue2.suse.org and other hosts (Rejected, okurz, 2023-06-20)

Copied to openQA Infrastructure - action #133322: qa-jump.qe.nue2.suse.org is not reachable - take 3 (Resolved, okurz)

Actions #1

Updated by okurz 11 months ago

  • Status changed from New to Blocked
Actions #2

Updated by okurz 11 months ago

  • Has duplicate action #131141: [alert] Packet loss between qa-jump.qe.nue2.suse.org and other hosts added
Actions #3

Updated by okurz 11 months ago

  • Status changed from Blocked to In Progress

SD ticket closed; the host should be reachable again and the corresponding job failures should be handled, e.g. with a restart.

Actions #4

Updated by nicksinger 11 months ago

  • Assignee changed from okurz to nicksinger
Actions #5

Updated by nicksinger 11 months ago

  • Status changed from In Progress to Feedback

I restarted every job from the last 24h that failed in boot_from_pxe with:

for i in $(ssh openqa.suse.de "sudo -u geekotest psql -t --command=\"select jobs.id from job_modules join jobs on jobs.id = job_modules.job_id where job_modules.result='failed' and t_started between NOW() - Interval '24 HOURS' and NOW() and name='boot_from_pxe';\" openqa"); do
    openqa-client --host openqa.suse.de jobs/$i/restart post
done
{
  errors   => ["Specified job 11392667 has already been cloned as 11395877"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11377521 has already been cloned as 11397236"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11392668 => 11397616 }],
  test_url => [{ 11392668 => "/tests/11397616" }],
}
{
  result   => [{ 11395237 => 11397617 }],
  test_url => [{ 11395237 => "/tests/11397617" }],
}
{
  errors   => ["Specified job 11390240 has already been cloned as 11397615"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11377525 has already been cloned as 11397233"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11392669 => 11397618 }],
  test_url => [{ 11392669 => "/tests/11397618" }],
}
{
  result   => [{ 11377515 => 11397619 }],
  test_url => [{ 11377515 => "/tests/11397619" }],
}
{
  result   => [{ 11390294 => 11397620 }],
  test_url => [{ 11390294 => "/tests/11397620" }],
}
{
  errors   => ["Specified job 11377524 has already been cloned as 11397232"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11388049 => 11397621 }],
  test_url => [{ 11388049 => "/tests/11397621" }],
}
{
  errors   => ["Specified job 11388639 has already been cloned as 11389800"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11377516 => 11397622 }],
  test_url => [{ 11377516 => "/tests/11397622" }],
}
{
  result   => [{ 11389762 => 11397623 }],
  test_url => [{ 11389762 => "/tests/11397623" }],
}
{
  errors   => ["Specified job 11377520 has already been cloned as 11397235"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11377514 has already been cloned as 11397512"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11377518 has already been cloned as 11397234"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11377523 => 11397624 }],
  test_url => [{ 11377523 => "/tests/11397624" }],
}
{
  result   => [{ 11377526 => 11397625 }],
  test_url => [{ 11377526 => "/tests/11397625" }],
}
{
  result   => [{ 11388042 => 11397626 }],
  test_url => [{ 11388042 => "/tests/11397626" }],
}
{
  result   => [{ 11387254 => 11397627 }],
  test_url => [{ 11387254 => "/tests/11397627" }],
}
{
  result   => [{ 11387247 => 11397628 }],
  test_url => [{ 11387247 => "/tests/11397628" }],
}
{
  result   => [{ 11388040 => 11397629 }],
  test_url => [{ 11388040 => "/tests/11397629" }],
}
{
  result   => [{ 11390163 => 11397630 }],
  test_url => [{ 11390163 => "/tests/11397630" }],
}
Actions #6

Updated by okurz 11 months ago

  • Status changed from Feedback to Resolved
Actions #7

Updated by okurz 10 months ago

  • Status changed from Resolved to In Progress

Happened again. @Nick Singer please create an SD ticket, point out that the VM was already down recently, and note that https://sd.suse.com/servicedesk/customer/portal/1/SD-124776 still has an unanswered question.

Actions #8

Updated by nicksinger 10 months ago

  • Status changed from In Progress to Feedback
Actions #9

Updated by crameleon 10 months ago

Same issue as last time: the machine did not automatically apply its network configuration after a reboot:

qa-jump:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 72:1c:68:af:4d:17 brd ff:ff:ff:ff:ff:ff
    altname enp0s2
    altname ens2
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 52:54:00:3b:4c:8c brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    altname ens3

qa-jump:~ # rcnetwork restart

qa-jump:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 72:1c:68:af:4d:17 brd ff:ff:ff:ff:ff:ff
    altname enp0s2
    altname ens2
    inet 10.168.192.10/22 brd 10.168.195.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::701c:68ff:feaf:4d17/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:3b:4c:8c brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    altname ens3
    inet 192.168.44.1/24 brd 192.168.44.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe3b:4c8c/64 scope link
       valid_lft forever preferred_lft forever

qa-jump:~ # systemctl --failed
  UNIT             LOAD   ACTIVE SUB    DESCRIPTION
● mnt-dist.mount   loaded failed failed /mnt/dist
● mnt-openqa.mount loaded failed failed /mnt/openqa

qa-jump:~ # systemctl restart mnt-dist.mount
qa-jump:~ # systemctl restart mnt-openqa.mount
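The two restart commands above generalize: the unit names can be derived from the `systemctl --failed` listing instead of being typed by hand. A sketch, assuming any unit listed there is safe to restart on this host (a captured excerpt stands in for the live command, which cannot run outside the machine):

```shell
# Derive restart commands from a "systemctl --failed --no-legend" style
# listing. On the live host the variable would be filled from the real
# command; here a captured excerpt from the listing above stands in for it.
failed_units='mnt-dist.mount
mnt-openqa.mount'
for unit in $failed_units; do
  echo "systemctl restart $unit"
done
# prints:
#   systemctl restart mnt-dist.mount
#   systemctl restart mnt-openqa.mount
```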

This seems ok:

qa-jump:~ # systemctl is-enabled network
alias
qa-jump:~ # systemctl is-enabled wicked
enabled

Not sure yet what's different with this machine.

Actions #10

Updated by crameleon 10 months ago

There is an issue with the mounts, causing wicked to fail during boot:

Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found ordering cycle on network-online.target/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on wicked.service/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on wickedd-auto4.service/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on local-fs.target/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on srv-tftpboot-mnt-openqa.mount/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on mnt-openqa.mount/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Job local-fs.target/start deleted to break ordering cycle starting with mnt-openqa.mount/start

This seems like a similar issue as the one described in https://www.suse.com/support/kb/doc/?id=000018796.
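Ordering cycles like this can be spotted without reading the whole journal by grepping for systemd's diagnostic. A sketch — on the live host one would pipe `journalctl -b` into the grep; here a captured line from the log above stands in for it:

```shell
# Count systemd ordering-cycle diagnostics. On the live host:
#   journalctl -b | grep -c 'Found ordering cycle'
# A captured journal line stands in for the live input here.
printf '%s\n' \
  'Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found ordering cycle on network-online.target/start' \
  | grep -c 'Found ordering cycle'
# prints 1
```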

I edited the fstab entry from @okurz accordingly by appending _netdev to its options:

# okurz: 2023-05-17: "journalctl -e" revealed that systems (still?) try to access /mounts/dist so adding that bind mount additionally for convenience but that should be streamlined
# gpfuetzenreuter: 2023-06-28: repair broken network dependency cycle by adding _netdev
/mnt/dist                                       /mounts/dist                  none     defaults,bind,_netdev    0 0

After another reboot, the network came up fine.
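Whether other fstab entries on the host share the problem can be checked mechanically: a bind mount of a network-backed path should carry `_netdev` so systemd orders it after the network instead of under local-fs.target. A sketch of such a check — the heredoc stands in for /etc/fstab, and the second entry is purely illustrative:

```shell
# Flag fstab bind mounts that lack _netdev. The heredoc stands in for
# /etc/fstab; on the live host one would read the real file instead.
awk '$4 ~ /bind/ && $4 !~ /_netdev/ { print "missing _netdev:", $2 }' <<'EOF'
/mnt/dist   /mounts/dist   none   defaults,bind   0 0
/dev/vda2   /              ext4   defaults        0 1
EOF
# prints: missing _netdev: /mounts/dist
```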

Actions #11

Updated by nicksinger 10 months ago

  • Status changed from Feedback to Resolved
Actions #12

Updated by nicksinger 10 months ago

All affected jobs restarted:

{
  result   => [{ 11462148 => 11469067 }],
  test_url => [{ 11462148 => "/tests/11469067" }],
}
{
  result   => [{ 11468047 => 11469068 }],
  test_url => [{ 11468047 => "/tests/11469068" }],
}
{
  errors   => ["Specified job 11467991 has already been cloned as 11468908"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11462542 has already been cloned as 11467992"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11463220 => 11469082 }],
  test_url => [{ 11463220 => "/tests/11469082" }],
}
{
  errors   => ["Specified job 11463236 has already been cloned as 11468027"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11463194 has already been cloned as 11468046"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11463213 has already been cloned as 11468047"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11468046 => 11469088 }],
  test_url => [{ 11468046 => "/tests/11469088" }],
}
{
  errors   => ["Specified job 11463228 has already been cloned as 11467988"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11467826 => 11469089 }],
  test_url => [{ 11467826 => "/tests/11469089" }],
}
{
  errors   => ["Specified job 11462530 has already been cloned as 11467991"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11468045 => 11469090 }],
  test_url => [{ 11468045 => "/tests/11469090" }],
}
{
  result   => [{ 11449788 => 11469091 }],
  test_url => [{ 11449788 => "/tests/11469091" }],
}
{
  errors   => ["Specified job 11464106 has already been cloned as 11468016"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11462129 has already been cloned as 11468045"],
  result   => [],
  test_url => [],
}
Actions #13

Updated by okurz 10 months ago

  • Status changed from Resolved to Feedback

good, thx. I would still like to follow up on the comment "The machine is already monitored by our Zabbix infrastructure […] However, our team does not actively monitor alerts." from https://sd.suse.com/servicedesk/customer/portal/1/SD-125516 . That does not sound like a good approach to me. I see this as related to #121726 "Get management access to o3/osd VMs". IMHO the team that can control the VMs should act on alerts. I am OK with us doing that if Eng-Infra can't, but having to wait for users to be confused and create a ticket, only to find that there was already an alert, does not make sense.

Actions #14

Updated by okurz 10 months ago

  • Related to coordination #121726: [epic] Get management access to o3/osd and other QE related VMs added
Actions #15

Updated by okurz 10 months ago

  • Status changed from Feedback to Resolved

What I wrote in #131123-13 will be covered in #132149

Actions #16

Updated by okurz 10 months ago

  • Copied to action #133322: qa-jump.qe.nue2.suse.org is not reachable - take 3 added