action #131123

closed

qa-jump.qe.nue2.suse.org is not reachable since 2023-06-19

Added by okurz 11 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2023-06-20
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://openqa.suse.de/tests/11377514/video?filename=video.ogv&t=19.8,20.5
shows that the PXE server on qa-jump.qe.nue2.suse.org is not reachable. I also cannot ping the host at all anymore. Also visible in https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4&from=1687165791669&to=1687222145516

The host seems to have stopped responding at 2023-06-19 13:46:48Z.

Expected result

The machines should be able to boot from PXE.

Last good: https://openqa.suse.de/tests/11377512#step/boot_from_pxe/1 from 2023-06-19 13:17:59Z


Related issues 3 (1 open, 2 closed)

Related to QA - coordination #121726: [epic] Get management access to o3/osd and other QE related VMs (Blocked, okurz, 2022-12-08)

Has duplicate openQA Infrastructure - action #131141: [alert] Packet loss between qa-jump.qe.nue2.suse.org and other hosts (Rejected, okurz, 2023-06-20)

Copied to openQA Infrastructure - action #133322: qa-jump.qe.nue2.suse.org is not reachable - take 3 (Resolved, okurz)

Actions #1

Updated by okurz 11 months ago

  • Status changed from New to Blocked
Actions #2

Updated by okurz 11 months ago

  • Has duplicate action #131141: [alert] Packet loss between qa-jump.qe.nue2.suse.org and other hosts added
Actions #3

Updated by okurz 11 months ago

  • Status changed from Blocked to In Progress

SD ticket closed; the host should be reachable again and the corresponding job failures should be handled, e.g. with a restart.

Actions #4

Updated by nicksinger 11 months ago

  • Assignee changed from okurz to nicksinger
Actions #5

Updated by nicksinger 11 months ago

  • Status changed from In Progress to Feedback

I restarted every job from the last 24h that failed in boot_from_pxe with:

for i in $(ssh openqa.suse.de "sudo -u geekotest psql -t --command=\"select jobs.id from job_modules join jobs on jobs.id = job_modules.job_id where job_modules.result='failed' and t_started between NOW() - Interval '24 HOURS' and NOW() and name='boot_from_pxe';\" openqa"); do
    openqa-client --host openqa.suse.de jobs/$i/restart post
done
{
  errors   => ["Specified job 11392667 has already been cloned as 11395877"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11377521 has already been cloned as 11397236"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11392668 => 11397616 }],
  test_url => [{ 11392668 => "/tests/11397616" }],
}
{
  result   => [{ 11395237 => 11397617 }],
  test_url => [{ 11395237 => "/tests/11397617" }],
}
{
  errors   => ["Specified job 11390240 has already been cloned as 11397615"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11377525 has already been cloned as 11397233"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11392669 => 11397618 }],
  test_url => [{ 11392669 => "/tests/11397618" }],
}
{
  result   => [{ 11377515 => 11397619 }],
  test_url => [{ 11377515 => "/tests/11397619" }],
}
{
  result   => [{ 11390294 => 11397620 }],
  test_url => [{ 11390294 => "/tests/11397620" }],
}
{
  errors   => ["Specified job 11377524 has already been cloned as 11397232"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11388049 => 11397621 }],
  test_url => [{ 11388049 => "/tests/11397621" }],
}
{
  errors   => ["Specified job 11388639 has already been cloned as 11389800"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11377516 => 11397622 }],
  test_url => [{ 11377516 => "/tests/11397622" }],
}
{
  result   => [{ 11389762 => 11397623 }],
  test_url => [{ 11389762 => "/tests/11397623" }],
}
{
  errors   => ["Specified job 11377520 has already been cloned as 11397235"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11377514 has already been cloned as 11397512"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11377518 has already been cloned as 11397234"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11377523 => 11397624 }],
  test_url => [{ 11377523 => "/tests/11397624" }],
}
{
  result   => [{ 11377526 => 11397625 }],
  test_url => [{ 11377526 => "/tests/11397625" }],
}
{
  result   => [{ 11388042 => 11397626 }],
  test_url => [{ 11388042 => "/tests/11397626" }],
}
{
  result   => [{ 11387254 => 11397627 }],
  test_url => [{ 11387254 => "/tests/11397627" }],
}
{
  result   => [{ 11387247 => 11397628 }],
  test_url => [{ 11387247 => "/tests/11397628" }],
}
{
  result   => [{ 11388040 => 11397629 }],
  test_url => [{ 11388040 => "/tests/11397629" }],
}
{
  result   => [{ 11390163 => 11397630 }],
  test_url => [{ 11390163 => "/tests/11397630" }],
}
Actions #6

Updated by okurz 11 months ago

  • Status changed from Feedback to Resolved
Actions #7

Updated by okurz 10 months ago

  • Status changed from Resolved to In Progress

Happened again. @Nick Singer please create an SD ticket, point out that the VM was already down recently, and note that https://sd.suse.com/servicedesk/customer/portal/1/SD-124776 still has an unanswered question.

Actions #8

Updated by nicksinger 10 months ago

  • Status changed from In Progress to Feedback
Actions #9

Updated by crameleon 10 months ago

Same issue as last time: the machine did not automatically apply its network configuration after a reboot:

qa-jump:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 72:1c:68:af:4d:17 brd ff:ff:ff:ff:ff:ff
    altname enp0s2
    altname ens2
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 52:54:00:3b:4c:8c brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    altname ens3

qa-jump:~ # rcnetwork restart

qa-jump:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 72:1c:68:af:4d:17 brd ff:ff:ff:ff:ff:ff
    altname enp0s2
    altname ens2
    inet 10.168.192.10/22 brd 10.168.195.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::701c:68ff:feaf:4d17/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:3b:4c:8c brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    altname ens3
    inet 192.168.44.1/24 brd 192.168.44.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe3b:4c8c/64 scope link
       valid_lft forever preferred_lft forever

qa-jump:~ # systemctl --failed
  UNIT             LOAD   ACTIVE SUB    DESCRIPTION
● mnt-dist.mount   loaded failed failed /mnt/dist
● mnt-openqa.mount loaded failed failed /mnt/openqa

qa-jump:~ # systemctl restart mnt-dist.mount
qa-jump:~ # systemctl restart mnt-openqa.mount
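The two restart commands above generalize: the unit names can be derived from the `systemctl --failed` listing instead of being typed by hand. A sketch, assuming any unit listed there is safe to restart on this host (a captured excerpt stands in for the live command, which cannot run outside the machine):

```shell
# Derive restart commands from a "systemctl --failed --no-legend" style
# listing. On the live host the variable would be filled from the real
# command; here a captured excerpt from the listing above stands in for it.
failed_units='mnt-dist.mount
mnt-openqa.mount'
for unit in $failed_units; do
  echo "systemctl restart $unit"
done
# prints:
#   systemctl restart mnt-dist.mount
#   systemctl restart mnt-openqa.mount
```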

This seems ok:

qa-jump:~ # systemctl is-enabled network
alias
qa-jump:~ # systemctl is-enabled wicked
enabled

Not sure yet what's different with this machine.

Actions #10

Updated by crameleon 10 months ago

There is an issue with the mounts, causing wicked to fail during boot:

Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found ordering cycle on network-online.target/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on wicked.service/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on wickedd-auto4.service/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on local-fs.target/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on srv-tftpboot-mnt-openqa.mount/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on mnt-openqa.mount/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Job local-fs.target/start deleted to break ordering cycle starting with mnt-openqa.mount/start

This seems like a similar issue as the one described in https://www.suse.com/support/kb/doc/?id=000018796.
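Ordering cycles like this can be spotted without reading the whole journal by grepping for systemd's diagnostic. A sketch — on the live host one would pipe `journalctl -b` into the grep; here a captured line from the log above stands in for it:

```shell
# Count systemd ordering-cycle diagnostics. On the live host:
#   journalctl -b | grep -c 'Found ordering cycle'
# A captured journal line stands in for the live input here.
printf '%s\n' \
  'Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found ordering cycle on network-online.target/start' \
  | grep -c 'Found ordering cycle'
# prints 1
```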

I edited the fstab entry from @okurz accordingly by appending _netdev to its options:

# okurz: 2023-05-17: "journalctl -e" revealed that systems (still?) try to access /mounts/dist so adding that bind mount additionally for convenience but that should be streamlined
# gpfuetzenreuter: 2023-06-28: repair broken network dependency cycle by adding _netdev
/mnt/dist                                       /mounts/dist                  none     defaults,bind,_netdev    0 0

After another reboot, the network came up fine.
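Whether other fstab entries on the host share the problem can be checked mechanically: a bind mount of a network-backed path should carry `_netdev` so systemd orders it after the network instead of under local-fs.target. A sketch of such a check — the heredoc stands in for /etc/fstab, and the second entry is purely illustrative:

```shell
# Flag fstab bind mounts that lack _netdev. The heredoc stands in for
# /etc/fstab; on the live host one would read the real file instead.
awk '$4 ~ /bind/ && $4 !~ /_netdev/ { print "missing _netdev:", $2 }' <<'EOF'
/mnt/dist   /mounts/dist   none   defaults,bind   0 0
/dev/vda2   /              ext4   defaults        0 1
EOF
# prints: missing _netdev: /mounts/dist
```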

Actions #11

Updated by nicksinger 10 months ago

  • Status changed from Feedback to Resolved
Actions #12

Updated by nicksinger 10 months ago

All affected jobs restarted:

{
  result   => [{ 11462148 => 11469067 }],
  test_url => [{ 11462148 => "/tests/11469067" }],
}
{
  result   => [{ 11468047 => 11469068 }],
  test_url => [{ 11468047 => "/tests/11469068" }],
}
{
  errors   => ["Specified job 11467991 has already been cloned as 11468908"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11462542 has already been cloned as 11467992"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11463220 => 11469082 }],
  test_url => [{ 11463220 => "/tests/11469082" }],
}
{
  errors   => ["Specified job 11463236 has already been cloned as 11468027"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11463194 has already been cloned as 11468046"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11463213 has already been cloned as 11468047"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11468046 => 11469088 }],
  test_url => [{ 11468046 => "/tests/11469088" }],
}
{
  errors   => ["Specified job 11463228 has already been cloned as 11467988"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11467826 => 11469089 }],
  test_url => [{ 11467826 => "/tests/11469089" }],
}
{
  errors   => ["Specified job 11462530 has already been cloned as 11467991"],
  result   => [],
  test_url => [],
}
{
  result   => [{ 11468045 => 11469090 }],
  test_url => [{ 11468045 => "/tests/11469090" }],
}
{
  result   => [{ 11449788 => 11469091 }],
  test_url => [{ 11449788 => "/tests/11469091" }],
}
{
  errors   => ["Specified job 11464106 has already been cloned as 11468016"],
  result   => [],
  test_url => [],
}
{
  errors   => ["Specified job 11462129 has already been cloned as 11468045"],
  result   => [],
  test_url => [],
}
Actions #13

Updated by okurz 10 months ago

  • Status changed from Resolved to Feedback

good, thx. I would still like to follow up on the comment "The machine is already monitored by our Zabbix infrastructure […] However, our team does not actively monitor alerts." from https://sd.suse.com/servicedesk/customer/portal/1/SD-125516 . That does not sound like a good approach to me. I see this as related to #121726 "Get management access to o3/osd VMs". IMHO the team that can control the VMs should act on alerts. I am OK with us doing that if Eng-Infra can't, but having to wait for users to be confused and create a ticket, only to find that there was already an alert, does not make sense.

Actions #14

Updated by okurz 10 months ago

  • Related to coordination #121726: [epic] Get management access to o3/osd and other QE related VMs added
Actions #15

Updated by okurz 10 months ago

  • Status changed from Feedback to Resolved

What I wrote in #131123-13 will be covered in #132149

Actions #16

Updated by okurz 10 months ago

  • Copied to action #133322: qa-jump.qe.nue2.suse.org is not reachable - take 3 added