action #107227
bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M
Description
Observation
The first failing job seems to be from 9h ago: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/850672#L46 . The log shows:
ERROR: something wrong with /etc/openqabot/singlearch.yml
DEBUG: Getting id for Data(incident=0, settings_id=0, flavor='Server-DVD-HA-Updates', arch='x86_64', distri='sle', version='12-SP3', build='', product='HA12SP3')
Traceback (most recent call last):
  File "./qem-bot/bot-ng.py", line 5, in <module>
    main()
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/main.py", line 40, in main
    sys.exit(cfg.func(cfg))
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/args.py", line 68, in do_sync_aggregate_results
    return syncer()
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/aggrsync.py", line 29, in __call__
    update_setting += get_aggregate_settings_data(self.token, product)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 175, in get_aggregate_settings_data
    for s in settings[:3]:
TypeError: unhashable type: 'slice'
Uploading artifacts for failed job
a6a304dc1592:/ # cat /etc/openqabot/singlearch.yml
- powerpc-utils
- s390-tools
- yast2-s390
- lspvd
Acceptance criteria
- AC1: Pipeline doesn't fail due to config errors
Suggestions
- Research what "slice" maps to in python
- Look up where /etc/openqabot/singlearch.yml comes from (it comes from the openSUSE package qam-metadata-openqabot) and the expected format
- Improve the error message
- Ensure singlearch.yml is deployed from GitLab
- Confirm if this is a regression due to a recent change
- Look into registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest (https://registry.suse.de/cgi-bin/cooverview)
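Regarding the "slice" suggestion: `settings[:3]` builds a `slice` object and passes it to the container's item lookup. On a list that returns the first three entries; on a dict the slice is treated as a key, which on the Python version in the traceback fails with "unhashable type: 'slice'". A minimal sketch, assuming (not confirmed by the log) that `settings` was a dict such as a decoded error response rather than the expected list. `first_settings` is a hypothetical helper, not part of qem-bot:

```python
def first_settings(settings, limit=3):
    """Return the first `limit` entries, failing clearly on a non-list payload."""
    if not isinstance(settings, list):
        # Hypothetical guard: report the unexpected payload instead of slicing it
        raise ValueError(
            f"expected a list of settings, got {type(settings).__name__}: {settings!r}"
        )
    return settings[:limit]


# Slicing a dict is what produces the confusing traceback.
try:
    {"error": "dashboard unreachable"}[:3]
except (TypeError, KeyError) as exc:
    # TypeError on the Python of the traceback; newer interpreters may raise KeyError
    print(f"{type(exc).__name__}: {exc}")
```

A guard like this would make the log point at the unexpected payload instead of at the slicing line.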
Updated by livdywan over 2 years ago
- Subject changed from bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" to bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by osukup over 2 years ago
Not a regression. The ERROR here is a harmless warning (I'll add a proper exclude for the singlearch.yml file).
The real cause of the problem is the broken dashboard.qam.suse.de, same as https://progress.opensuse.org/issues/106179
Updated by livdywan over 2 years ago
- Status changed from Workable to In Progress
I assume you're working on it since you hinted at having a fix in Slack ;-)
Updated by jbaier_cz over 2 years ago
I suspected that the error message is misleading in this case (as there was no change in metadata for some time).
So it was the IP problem again. The interesting thing is, I have pinpointed the exact moment which caused it. A nightly update for systemd-networkd.
Logs from the database:
Feb 21 23:06:26 postgresql [RPM][18920]: erase systemd-network-246.16-150300.7.36.1.x86_64: success
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql [RPM][18920]: install systemd-network-246.16-150300.7.39.1.x86_64: success
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Stopping Network Service...
Feb 21 23:06:27 postgresql systemd-networkd[1142]: host0: DHCPv6 lease lost
Feb 21 23:06:27 postgresql systemd[1]: systemd-networkd.service: Succeeded.
Feb 21 23:06:27 postgresql systemd[1]: Stopped Network Service.
Feb 21 23:06:27 postgresql systemd[1]: Starting Network Service...
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Failed to increase receive buffer size for general netlink socket, ignoring: Operation not permitted
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: IPv6 successfully enabled
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: Gained IPv6LL
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Enumeration completed
Feb 21 23:06:28 postgresql systemd[1]: Started Network Service.
Feb 21 23:06:28 postgresql systemd[1]: Reloading.
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: DHCPv4 address 192.168.0.48/24 via 192.168.0.1
And a corresponding snippet from the DHCP server. For some unknown reason, the machine is asking for a particular address and is ignoring the DHCP offer.
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.168 aa:5d:de:99:79:13 postgresql
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql
Updated by okurz over 2 years ago
- Related to action #106179: No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S added
Updated by osukup over 2 years ago
- Status changed from In Progress to Feedback
https://github.com/openSUSE/qem-bot/pull/3
The real problem was solved by restarting the systemd-networkd service in the container; see the analysis by @jbaier in https://progress.opensuse.org/issues/107227#note-7
Updated by osukup over 2 years ago
jbaier_cz wrote:
I suspected that the error message is misleading in this case (as there was no change in metadata for some time).
So it was the IP problem again. The interesting thing is, I have pinpointed the exact moment which caused it. A nightly update for systemd-networkd.
Logs from the database:
Feb 21 23:06:26 postgresql [RPM][18920]: erase systemd-network-246.16-150300.7.36.1.x86_64: success
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql [RPM][18920]: install systemd-network-246.16-150300.7.39.1.x86_64: success
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Stopping Network Service...
Feb 21 23:06:27 postgresql systemd-networkd[1142]: host0: DHCPv6 lease lost
Feb 21 23:06:27 postgresql systemd[1]: systemd-networkd.service: Succeeded.
Feb 21 23:06:27 postgresql systemd[1]: Stopped Network Service.
Feb 21 23:06:27 postgresql systemd[1]: Starting Network Service...
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Failed to increase receive buffer size for general netlink socket, ignoring: Operation not permitted
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: IPv6 successfully enabled
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: Gained IPv6LL
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Enumeration completed
Feb 21 23:06:28 postgresql systemd[1]: Started Network Service.
Feb 21 23:06:28 postgresql systemd[1]: Reloading.
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: DHCPv4 address 192.168.0.48/24 via 192.168.0.1
And a corresponding snippet from DHCP server. For some unknown reason, the machine is asking for a particular address and is ignoring the DHCP offer.
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.168 aa:5d:de:99:79:13 postgresql
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql
Looks like a problem/bug in systemd-networkd?
Updated by okurz over 2 years ago
The above DHCP flow looks normal to me. AFAIK it's always:
- -> DISCOVER
- <- OFFER
- -> REQUEST
- <- ACK
Right?
Updated by jbaier_cz over 2 years ago
okurz wrote:
The above DHCP flow looks normal to me. AFAIK it's always:
- -> DISCOVER
- <- OFFER
- -> REQUEST
- <- ACK
Right?
Yes, the question here is why the offer for 192.168.0.168 was ignored by the client and 192.168.0.48 was requested instead.
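To spot this kind of mismatch without reading the log by eye, here is a rough sketch that pairs each DHCPOFFER with the next DHCPREQUEST from the same MAC address and reports the pairs where the addresses differ. The regular expression only assumes the dnsmasq-dhcp line layout quoted in this ticket; `find_ignored_offers` is a hypothetical helper name.

```python
import re

# Matches lines such as "DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13"
LINE = re.compile(
    r"(DHCPOFFER|DHCPREQUEST)\(\w+\) (\d+\.\d+\.\d+\.\d+) ([0-9a-f:]{17})"
)


def find_ignored_offers(lines):
    """Yield (mac, offered, requested) when a client requests another address than offered."""
    offered = {}
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue  # DISCOVER and ACK lines are not needed for the comparison
        kind, ip, mac = match.groups()
        if kind == "DHCPOFFER":
            offered[mac] = ip
        elif mac in offered:
            last_offer = offered.pop(mac)
            if last_offer != ip:
                yield mac, last_offer, ip


log = [
    "Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.168 aa:5d:de:99:79:13",
    "Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.168 aa:5d:de:99:79:13 postgresql",
    "Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPDISCOVER(br0) aa:5d:de:99:79:13",
    "Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13",
    "Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13",
    "Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql",
]
mismatches = list(find_ignored_offers(log))
print(mismatches)
```

The 22:43:31 request has no preceding offer in the excerpt (it looks like a renewal), so the sketch skips it and only flags the 23:06:28 exchange.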
Updated by osukup over 2 years ago
- Status changed from Feedback to Resolved
GitLab CI is working.
AC1 was met even before; the failure had a different cause.
Updated by okurz over 2 years ago
- Status changed from Resolved to Feedback
- Priority changed from Immediate to High
I agree that the original problem was resolved. Let's follow https://progress.opensuse.org/projects/qa/wiki/Tools#How-we-work-on-our-backlog in particular "For every regression or bigger issue that we encounter try to come up with at least two improvements, e.g. the actual issue is fixed and similar cases are prevented in the future with better tests and optionally also monitoring is improved" and try to identify and conduct at least one improvement.
EDIT: Discussed with jbaier and osukup in SUSE QE Tools. We found that we currently rely on hardcoded IPv4 addresses within qem-dashboard, mainly because qam2 cannot resolve the systemd-nspawn container by name. To fix that, /etc/resolv.conf should likely be changed to use the local dnsmasq daemon over 127.0.0.1, with additional nameserver entries so that both internal and external addresses resolve. Once the database container can be resolved over DNS, we should ensure that the machine can also be reached from the dashboard itself and use a DNS entry instead of an IPv4 address.
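To illustrate the difference between a hardcoded address and a name lookup, here is a small sketch that asks the system resolver (which follows /etc/resolv.conf and /etc/hosts) for a host's IPv4 addresses at call time. The database container's actual hostname is not given in this ticket, so "localhost" is used purely as a stand-in, and `resolve_ipv4` is a hypothetical helper.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the system resolver reports for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address
    return sorted({info[4][0] for info in infos})


# "localhost" stands in for the database container's name (assumption)
print(resolve_ipv4("localhost"))
```

With a lookup like this, a changed DHCP lease would be picked up on the next connection, provided the DNS entry follows the lease.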
Updated by okurz over 2 years ago
Created https://gitlab.suse.de/qa-maintenance/qamops/-/merge_requests/9 with a better resolv.conf including forwarding to internal dnsmasq
EDIT: And also https://gitlab.suse.de/qa-maintenance/qamops/-/merge_requests/10 to save the config properly in ansible.
Updated by okurz over 2 years ago
- Status changed from Feedback to Resolved
With both MRs merged we have proper DNS resolution and the dashboard config stored in git. That should certainly count as good improvements :)
Updated by jbaier_cz over 2 years ago
Both requests were merged and also deployed.
Updated by osukup over 2 years ago
- Copied to action #109878: bot-ng schedule/approve aborted added