action #131123
qa-jump.qe.nue2.suse.org is not reachable since 2023-06-19 (closed)
Description
Observation
https://openqa.suse.de/tests/11377514/video?filename=video.ogv&t=19.8,20.5
shows that the PXE server on qa-jump.qe.nue2.suse.org is not reachable. I also cannot ping the host at all anymore. This is also visible in https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4&from=1687165791669&to=1687222145516
The host seems to have stopped responding at 2023-06-19 13:46:48Z.
Expected result
The machines should be able to boot from PXE.
Last good: https://openqa.suse.de/tests/11377512#step/boot_from_pxe/1 from 2023-06-19 13:17:59Z
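A quick manual check of the PXE service from a machine in the same network could look like this (just a sketch; pxelinux.0 as boot file name is only an assumption, the actual path may differ):
ping -c 3 qa-jump.qe.nue2.suse.org                   # basic reachability
tftp -v qa-jump.qe.nue2.suse.org -c get pxelinux.0   # fetch the boot loader via TFTP (tftp-hpa client)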
Updated by okurz over 1 year ago
- Status changed from New to Blocked
Updated by okurz over 1 year ago
- Has duplicate action #131141: [alert] Packet loss between qa-jump.qe.nue2.suse.org and other hosts added
Updated by okurz over 1 year ago
- Status changed from Blocked to In Progress
SD ticket closed, the host should be reachable again, and the corresponding job failures should be handled, e.g. with a restart.
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Feedback
I restarted every job from the last 24h failing in boot_from_pxe with:
for i in $(ssh openqa.suse.de "sudo -u geekotest psql -t --command=\"select jobs.id from job_modules join jobs on jobs.id = job_modules.job_id where job_modules.result='failed' and t_started between NOW() - Interval '24 HOURS' and NOW() and name='boot_from_pxe';\" openqa"); do
    openqa-client --host openqa.suse.de jobs/$i/restart post
done
{
errors => ["Specified job 11392667 has already been cloned as 11395877"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11377521 has already been cloned as 11397236"],
result => [],
test_url => [],
}
{
result => [{ 11392668 => 11397616 }],
test_url => [{ 11392668 => "/tests/11397616" }],
}
{
result => [{ 11395237 => 11397617 }],
test_url => [{ 11395237 => "/tests/11397617" }],
}
{
errors => ["Specified job 11390240 has already been cloned as 11397615"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11377525 has already been cloned as 11397233"],
result => [],
test_url => [],
}
{
result => [{ 11392669 => 11397618 }],
test_url => [{ 11392669 => "/tests/11397618" }],
}
{
result => [{ 11377515 => 11397619 }],
test_url => [{ 11377515 => "/tests/11397619" }],
}
{
result => [{ 11390294 => 11397620 }],
test_url => [{ 11390294 => "/tests/11397620" }],
}
{
errors => ["Specified job 11377524 has already been cloned as 11397232"],
result => [],
test_url => [],
}
{
result => [{ 11388049 => 11397621 }],
test_url => [{ 11388049 => "/tests/11397621" }],
}
{
errors => ["Specified job 11388639 has already been cloned as 11389800"],
result => [],
test_url => [],
}
{
result => [{ 11377516 => 11397622 }],
test_url => [{ 11377516 => "/tests/11397622" }],
}
{
result => [{ 11389762 => 11397623 }],
test_url => [{ 11389762 => "/tests/11397623" }],
}
{
errors => ["Specified job 11377520 has already been cloned as 11397235"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11377514 has already been cloned as 11397512"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11377518 has already been cloned as 11397234"],
result => [],
test_url => [],
}
{
result => [{ 11377523 => 11397624 }],
test_url => [{ 11377523 => "/tests/11397624" }],
}
{
result => [{ 11377526 => 11397625 }],
test_url => [{ 11377526 => "/tests/11397625" }],
}
{
result => [{ 11388042 => 11397626 }],
test_url => [{ 11388042 => "/tests/11397626" }],
}
{
result => [{ 11387254 => 11397627 }],
test_url => [{ 11387254 => "/tests/11397627" }],
}
{
result => [{ 11387247 => 11397628 }],
test_url => [{ 11387247 => "/tests/11397628" }],
}
{
result => [{ 11388040 => 11397629 }],
test_url => [{ 11388040 => "/tests/11397629" }],
}
{
result => [{ 11390163 => 11397630 }],
test_url => [{ 11390163 => "/tests/11397630" }],
}
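Most of the errors above are just jobs that had already been cloned in the meantime; a variant of the query that skips those could look like this (a sketch, assuming the openQA jobs table tracks clones in its clone_id column):
for i in $(ssh openqa.suse.de "sudo -u geekotest psql -t --command=\"select jobs.id from job_modules join jobs on jobs.id = job_modules.job_id where job_modules.result='failed' and jobs.clone_id is null and t_started between NOW() - Interval '24 HOURS' and NOW() and name='boot_from_pxe';\" openqa"); do
    # restart only jobs that have not been cloned/restarted already
    openqa-client --host openqa.suse.de jobs/$i/restart post
done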
Updated by okurz over 1 year ago
- Status changed from Feedback to Resolved
Looks good. Next time you can also use https://github.com/os-autoinst/scripts/blob/master/openqa-advanced-retrigger-jobs
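For reference, an invocation could look roughly like this (a sketch only; the exact environment variable names, e.g. host, failed_since and dry_run, are assumptions, check the script header in the scripts repository before use):
# assumed variable names, verify against the script before running
host=openqa.suse.de failed_since="2023-06-19" dry_run=1 ./openqa-advanced-retrigger-jobs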
Updated by okurz over 1 year ago
- Status changed from Resolved to In Progress
Happened again. @Nick Singer, please create an SD ticket and point out that the VM was already down recently and that https://sd.suse.com/servicedesk/customer/portal/1/SD-124776 still has an unanswered question.
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Feedback
Updated by crameleon over 1 year ago
Same issue as last time, the machine did not automatically apply its network configuration after a reboot:
qa-jump:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 72:1c:68:af:4d:17 brd ff:ff:ff:ff:ff:ff
altname enp0s2
altname ens2
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 52:54:00:3b:4c:8c brd ff:ff:ff:ff:ff:ff
altname enp0s3
altname ens3
qa-jump:~ # rcnetwork restart
qa-jump:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 72:1c:68:af:4d:17 brd ff:ff:ff:ff:ff:ff
altname enp0s2
altname ens2
inet 10.168.192.10/22 brd 10.168.195.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::701c:68ff:feaf:4d17/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:3b:4c:8c brd ff:ff:ff:ff:ff:ff
altname enp0s3
altname ens3
inet 192.168.44.1/24 brd 192.168.44.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe3b:4c8c/64 scope link
valid_lft forever preferred_lft forever
qa-jump:~ # systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● mnt-dist.mount loaded failed failed /mnt/dist
● mnt-openqa.mount loaded failed failed /mnt/openqa
qa-jump:~ # systemctl restart mnt-dist.mount
qa-jump:~ # systemctl restart mnt-openqa.mount
This seems ok:
qa-jump:~ # systemctl is-enabled network
alias
qa-jump:~ # systemctl is-enabled wicked
enabled
Not sure yet what is different about this machine.
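For a future occurrence, a few standard commands to check on qa-jump whether wicked actually ran at boot and what state it reports (just a sketch):
journalctl -b -u wicked -u wickedd-nanny --no-pager           # boot-time logs of the wicked services
wicked ifstatus all                                           # interface status as seen by wicked
systemctl list-dependencies --reverse network-online.target   # units that pull in network-online.target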
Updated by crameleon over 1 year ago
There is an issue with the mounts, causing wicked to fail during boot:
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found ordering cycle on network-online.target/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on wicked.service/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on wickedd-auto4.service/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on local-fs.target/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on srv-tftpboot-mnt-openqa.mount/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on mnt-openqa.mount/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Job local-fs.target/start deleted to break ordering cycle starting with mnt-openqa.mount/start
This seems like a similar issue as the one described in https://www.suse.com/support/kb/doc/?id=000018796.
I edited the fstab entry from @okurz accordingly by appending _netdev to its options:
# okurz: 2023-05-17: "journalctl -e" revealed that systems (still?) try to access /mounts/dist so adding that bind mount additionally for convenience but that should be streamlined
# gpfuetzenreuter: 2023-06-28: repair broken network dependency cycle by adding _netdev
/mnt/dist /mounts/dist none defaults,bind,_netdev 0 0
After another reboot, the network came up fine.
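A quick way to confirm the change without waiting for yet another reboot (sketch, run on qa-jump): with _netdev the unit generated for the edited /mounts/dist entry (mounts-dist.mount) should be treated as a network mount, i.e. ordered after network-online.target and wanted by remote-fs.target instead of local-fs.target.
systemctl daemon-reload                                 # re-generate mount units from the edited fstab
systemctl show -p After,WantedBy mounts-dist.mount      # should now list network-online.target and remote-fs.target
systemctl list-dependencies --after mounts-dist.mount   # units this mount is now ordered after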
Updated by nicksinger over 1 year ago
All affected jobs restarted:
{
result => [{ 11462148 => 11469067 }],
test_url => [{ 11462148 => "/tests/11469067" }],
}
{
result => [{ 11468047 => 11469068 }],
test_url => [{ 11468047 => "/tests/11469068" }],
}
{
errors => ["Specified job 11467991 has already been cloned as 11468908"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11462542 has already been cloned as 11467992"],
result => [],
test_url => [],
}
{
result => [{ 11463220 => 11469082 }],
test_url => [{ 11463220 => "/tests/11469082" }],
}
{
errors => ["Specified job 11463236 has already been cloned as 11468027"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11463194 has already been cloned as 11468046"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11463213 has already been cloned as 11468047"],
result => [],
test_url => [],
}
{
result => [{ 11468046 => 11469088 }],
test_url => [{ 11468046 => "/tests/11469088" }],
}
{
errors => ["Specified job 11463228 has already been cloned as 11467988"],
result => [],
test_url => [],
}
{
result => [{ 11467826 => 11469089 }],
test_url => [{ 11467826 => "/tests/11469089" }],
}
{
errors => ["Specified job 11462530 has already been cloned as 11467991"],
result => [],
test_url => [],
}
{
result => [{ 11468045 => 11469090 }],
test_url => [{ 11468045 => "/tests/11469090" }],
}
{
result => [{ 11449788 => 11469091 }],
test_url => [{ 11449788 => "/tests/11469091" }],
}
{
errors => ["Specified job 11464106 has already been cloned as 11468016"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11462129 has already been cloned as 11468045"],
result => [],
test_url => [],
}
Updated by okurz over 1 year ago
- Status changed from Resolved to Feedback
Good, thx. I would still like to follow up on the comment "The machine is already monitored by our Zabbix infrastructure […] However, our team does not actively monitor alerts." from https://sd.suse.com/servicedesk/customer/portal/1/SD-125516. That does not sound like a good approach to me. I see this as related to #121726 "Get management access to o3/osd VMs". IMHO the team that can control the VMs should act on the alerts. I am ok with us doing that if Eng-Infra can't, but needing to wait for users to be confused and then create a ticket just to find that there was already an alert does not make sense.
Updated by okurz over 1 year ago
- Status changed from Feedback to Resolved
What I wrote in #131123-13 will be covered in #132149
Updated by okurz over 1 year ago
- Copied to action #133322: qa-jump.qe.nue2.suse.org is not reachable - take 3 added