action #159669
closed
No new openQA data on metrics.opensuse.org since o3 migration to PRG2 size:S
0%
Description
Observation
https://metrics.opensuse.org/d/osrt_openqa/osrt-openqa?orgId=1&from=1682065021494&to=1706492379258 shows that there is no new data since about the time that o3 was migrated to PRG2 as part of #132143
Suggestions
- Find out again who are admins of that instance and collaborate with them to fix the problem, possibly @witekbedyk (Witold Bedyk)?
- Check which ports the firewall needs to allow if any (supposedly just 80 and 443)
- The influxdb API route https://openqa.opensuse.org/admin/influxdb/jobs is generally available. Ensure that metrics.opensuse.org can read that as well
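To check the port suggestion above independently of telegraf, a small TCP probe can help. This is only a sketch: `probe_ports` is a hypothetical helper name, and the default ports 80 and 443 are taken from the suggestion above, not from a confirmed firewall rule set.

```python
import socket


def probe_ports(host, ports=(80, 443), timeout=3.0):
    """Try a TCP connect to each port and report the outcome per port."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the name and tries each returned address
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = "open"
        except OSError as exc:
            # Keep the OS error text, e.g. "Network is unreachable"
            results[port] = f"failed: {exc.strerror or exc}"
    return results


# Example call on the metrics host (needs network access):
# probe_ports("openqa.opensuse.org")
```

Running it on metrics.opensuse.org rather than on a workstation matters here, because only the path from that host is in question.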
Updated by okurz 8 months ago
- Related to action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M added
Updated by tinita 5 months ago
- Related to tickets #123828: metrics.o.o: no OSRT:Review graphs added
Updated by witekbedyk 5 months ago
I only maintain the access metrics service on metrics.o.o and have never touched the others.
From what I can see in the systemd journal, the osrt-metrics-telegraf.service fails when attempting to connect to http://openqa.opensuse.org/admin/influxdb/jobs [1]. Please make sure the service is available.
[1] https://github.com/openSUSE/openSUSE-release-tools/blob/master/metrics/telegraf/openqa.conf#L2
Updated by witekbedyk 5 months ago
# curl --digest https://openqa.opensuse.org/admin/influxdb/jobs
curl: (7) Failed to connect to openqa.opensuse.org port 443 after 0 ms: Couldn't connect to server
Updated by crameleon 5 months ago
fails when attempting
That's a very vague error description; it is better to analyze the issue based on the exact error message.
metrics (metrics.o.o):~ # journalctl --no-pager -n1 -u osrt-metrics-telegraf
Aug 05 16:07:20 metrics osrt-metrics[594]: 2024-08-05T16:07:20Z E! [inputs.http] Error in plugin: [url=http://openqa.opensuse.org/admin/influxdb/jobs]: Get "http://openqa.opensuse.org/admin/influxdb/jobs": dial tcp 192.168.47.13:80: connect: network is unreachable
1) 192.168.47.13 is a private IPv4 address which is not in our address space. Even if it was, I would be surprised if you were able to reach it from an IPv6-only network.
metrics (metrics.o.o):~ # grep openqa /etc/hosts
192.168.47.13 openqa.opensuse.org
2) Not only is the address wrong, there is also an /etc/hosts override which should not be there - it is better practice to use DNS, to avoid receiving outdated IP addresses to begin with.
Destination unreachable: Administratively prohibited
3) After the bogus hosts entry is removed, the above error, or one similar to it, might be presented instead, because the maintainer of the machine never requested access to this resource.
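A quick way to spot such overrides is to scan the hosts file content for the name and flag private addresses. A minimal sketch, assuming a hypothetical helper `find_hosts_overrides` and sample content modelled on the grep output above:

```python
import ipaddress


def find_hosts_overrides(hosts_text, hostname):
    """Return (address, is_private) for each hosts-file line naming hostname."""
    overrides = []
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and names
        entry = line.split("#", 1)[0].split()
        if len(entry) < 2 or hostname not in entry[1:]:
            continue
        try:
            address = ipaddress.ip_address(entry[0])
        except ValueError:
            # Not a parseable IP address, skip the line
            continue
        overrides.append((str(address), address.is_private))
    return overrides


sample = "127.0.0.1 localhost\n192.168.47.13 openqa.opensuse.org\n"
print(find_hosts_overrides(sample, "openqa.opensuse.org"))
# prints [('192.168.47.13', True)]
```

On a real host the text would come from reading /etc/hosts; any hit for a name that is also served by DNS is worth questioning.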
Updated by witekbedyk 5 months ago
I've deleted the bogus hosts entry. Now we see the following error message:
Aug 06 08:17:30 metrics osrt-metrics[14767]: 2024-08-06T08:17:30Z E! [inputs.http] Error in plugin: [url=http://openqa.opensuse.org/admin/influxdb/jobs]: Get "http://openqa.opensuse.org/admin/influxdb/jobs": dial tcp [2a07:de40:b251:2:10:150:2:10]:80: connect: permission denied
Updated by witekbedyk 4 months ago
I have changed to https:
Aug 06 12:23:30 metrics osrt-metrics[17461]: 2024-08-06T12:23:30Z E! [inputs.http] Error in plugin: [url=https://openqa.opensuse.org/admin/influxdb/jobs]: Get "https://openqa.opensuse.org/admin/influxdb/jobs": dial tcp [2a07:de40:b251:2:10:150:2:10]:443: connect: permission denied
Updated by okurz 4 months ago
The URL and IPv6 address are correct. "permission denied" is unexpected but likely means we reach the right server. I guess as next step one could try to replicate this manually with telegraf accessing the route from elsewhere.
Or we try to tweak the nginx config on o3 in /etc/nginx/vhosts.d/openqa.conf to allow HTTP access to admin/influxdb/jobs
If https
Also TODO: Create PR to update https://github.com/openSUSE/openSUSE-release-tools/blob/master/metrics/telegraf/openqa.conf#L2
Updated by tinita 4 months ago
When searching for this error message, I only find references to docker, some related to influxdb: https://community.influxdata.com/t/got-permission-denied-while-trying-to-connect-to-the-docker-daemon-socket/23353/3
So maybe there is something wrong with the telegraf/influxdb installation.
Updated by nicksinger 4 months ago
okurz wrote in #note-16:
Also TODO: Create PR to update https://github.com/openSUSE/openSUSE-release-tools/blob/master/metrics/telegraf/openqa.conf#L2
https://github.com/openSUSE/openSUSE-release-tools/pull/3133
okurz wrote in #note-16:
The URL and IPv6 address are correct. "permission denied" is unexpected but likely means we reach the right server. I guess as next step one could try to replicate this manually with telegraf accessing the route from elsewhere.
I just tried it from my local machine and was able to receive data using http:// as well as https://
workstation ~ » telegraf --config /etc/telegraf/telegraf.conf --test | grep openqa
2024-08-09T09:43:44Z I! Loading config: /etc/telegraf/telegraf.conf
2024-08-09T09:43:44Z I! Starting Telegraf 1.26.3-
2024-08-09T09:43:44Z I! Available plugins: 235 inputs, 9 aggregators, 27 processors, 22 parsers, 57 outputs, 2 secret-stores
2024-08-09T09:43:44Z I! Loaded inputs: cpu http kernel linux_cpu mem net sensors system
2024-08-09T09:43:44Z I! Loaded aggregators:
2024-08-09T09:43:44Z I! Loaded processors: regex strings template (3x)
2024-08-09T09:43:44Z I! Loaded secretstores:
2024-08-09T09:43:44Z W! Outputs are not used in testing mode!
2024-08-09T09:43:44Z I! Tags enabled: host=workstation
> openqa_jobs,host=workstation,url=https://openqa.opensuse.org blocked=2i,running=32i,scheduled=24i 1723196625000000000
> openqa_jobs_by_group,group=Development,host=workstation,url=https://openqa.opensuse.org scheduled=14i 1723196625000000000
> openqa_jobs_by_group,group=No\ Group,host=workstation,url=https://openqa.opensuse.org running=4i 1723196625000000000
> openqa_jobs_by_group,group=Others,host=workstation,url=https://openqa.opensuse.org blocked=2i,running=18i 1723196625000000000
> openqa_jobs_by_group,group=openSUSE\ Leap\ 15.5\ Updates,host=workstation,url=https://openqa.opensuse.org running=1i 1723196625000000000
> openqa_jobs_by_group,group=openSUSE\ Leap\ 15.6\ Maintenance,host=workstation,url=https://openqa.opensuse.org running=9i 1723196625000000000
> openqa_jobs_by_group,group=openSUSE\ Tumbleweed\ AArch64,host=workstation,url=https://openqa.opensuse.org scheduled=10i 1723196625000000000
> openqa_jobs_by_worker,host=workstation,url=https://openqa.opensuse.org,worker=openqaworker-arm21 running=1i 1723196625000000000
> openqa_jobs_by_worker,host=workstation,url=https://openqa.opensuse.org,worker=openqaworker20 running=6i 1723196625000000000
> openqa_jobs_by_worker,host=workstation,url=https://openqa.opensuse.org,worker=openqaworker21 running=3i 1723196625000000000
> openqa_jobs_by_worker,host=workstation,url=https://openqa.opensuse.org,worker=openqaworker22 running=3i 1723196625000000000
> openqa_jobs_by_worker,host=workstation,url=https://openqa.opensuse.org,worker=openqaworker23 running=3i 1723196625000000000
> openqa_jobs_by_worker,host=workstation,url=https://openqa.opensuse.org,worker=openqaworker24 running=5i 1723196625000000000
> openqa_jobs_by_worker,host=workstation,url=https://openqa.opensuse.org,worker=openqaworker25 running=5i 1723196625000000000
> openqa_jobs_by_worker,host=workstation,url=https://openqa.opensuse.org,worker=openqaworker26 running=6i 1723196625000000000
> openqa_jobs_by_arch,arch=aarch64,host=workstation,url=https://openqa.opensuse.org running=1i,scheduled=21i 1723196625000000000
> openqa_jobs_by_arch,arch=arm,host=workstation,url=https://openqa.opensuse.org scheduled=3i 1723196625000000000
> openqa_jobs_by_arch,arch=x86_64,host=workstation,url=https://openqa.opensuse.org blocked=2i,running=31i 1723196625000000000
tinita wrote in #note-19:
Maybe one of us could also have a look, if we get access to the host where telegraf is running. Would that be possible?
I currently also have no further idea what we could do from our side.
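For anyone who wants to compare the telegraf test output above with what ends up in the dashboard, the lines can be split into measurement, tags, fields and timestamp. This is a simplified sketch of a line protocol parser, not the InfluxDB implementation: `parse_line` is a hypothetical name, and it only handles integer fields (the `4i` style seen above) and backslash-escaped spaces, commas and equals signs.

```python
import re

# Split only on separators that are not preceded by a backslash
_SPACE = re.compile(r"(?<!\\) ")
_COMMA = re.compile(r"(?<!\\),")
_EQUALS = re.compile(r"(?<!\\)=")


def _unescape(text):
    """Turn escaped separators such as '\\ ' back into plain characters."""
    return re.sub(r"\\([ ,=])", r"\1", text)


def parse_line(line):
    """Parse one telegraf --test output line with integer fields only."""
    line = line.lstrip("> ").strip()
    key, fields_part, timestamp = _SPACE.split(line)
    measurement, *tag_parts = _COMMA.split(key)
    tags = {}
    for part in tag_parts:
        name, value = _EQUALS.split(part, maxsplit=1)
        tags[_unescape(name)] = _unescape(value)
    fields = {}
    for part in _COMMA.split(fields_part):
        name, value = _EQUALS.split(part, maxsplit=1)
        fields[_unescape(name)] = int(value.rstrip("i"))
    return _unescape(measurement), tags, fields, int(timestamp)


example = r"> openqa_jobs_by_group,group=No\ Group,host=workstation,url=https://openqa.opensuse.org running=4i 1723196625000000000"
measurement, tags, fields, timestamp = parse_line(example)
print(measurement, tags["group"], fields)
```

The escaped group names such as `No\ Group` are the main reason a plain `split()` is not enough here.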
Updated by okurz 4 months ago
- Due date set to 2024-09-16
- Status changed from Blocked to Feedback
- Assignee changed from tinita to okurz
That unfortunately never worked for us in the past. "Blocked" should only be used if we have an external reference that we should look into for the current status. If we don't have that then the ticket should stay in "Feedback" which shows that there is nowhere else to look for the current status or more information.
@witekbedyk I have access to .infra.opensuse.org but couldn't access metrics.infra.opensuse.org. https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=12045 says the host has 192.168.47.31, which is also what I saved in my /etc/hosts. Can you provide me with current connection details and authentication so that I can log in to the host and try myself?
Updated by witekbedyk 4 months ago
metrics.infra.opensuse.org has IPv6 address 2a07:de40:b27e:1203::141
Updated by okurz 4 months ago · Edited
ok, I updated https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=12045 with the correct IP addresses then.
okurz@metrics:/home/okurz> curl -6 -vvvv --digest http://openqa.opensuse.org/
* Trying [2a07:de40:b251:2:10:150:2:10]:80...
* connect to 2a07:de40:b251:2:10:150:2:10 port 80 failed: Permission denied
* Failed to connect to openqa.opensuse.org port 80 after 2 ms: Couldn't connect to server
* Closing connection 0
curl: (7) Failed to connect to openqa.opensuse.org port 80 after 2 ms: Couldn't connect to server
okurz@metrics:/home/okurz> ping -c1 openqa.opensuse.org
PING openqa.opensuse.org(openqa.opensuse.org (2a07:de40:b251:2:10:150:2:10)) 56 data bytes
From 2a07:de40:b27e:1203::3 (2a07:de40:b27e:1203::3) icmp_seq=1 Destination unreachable: Administratively prohibited
I suspect a firewall is giving us an explicit REJECT here. I could not see anything on ariel aka. o3 that should prevent this. I also temporarily disabled the firewall on ariel but saw no change in observation. With my non-root limited account on metrics I could also not see a firewall in effect.
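The different connect errors seen in this ticket ("network is unreachable", "permission denied") point to different causes. The mapping below is a rough sketch based on how Linux usually reports these conditions; the hint texts are my interpretation and `explain_connect_error` is a hypothetical helper, so treat the result as a starting point, not a diagnosis.

```python
import errno

# Assumed interpretations for Linux; other systems may report these differently.
HINTS = {
    errno.ENETUNREACH: "no route to the destination network (check routing and address family)",
    errno.EACCES: "rejected as administratively prohibited, usually by a firewall on the host or on the path",
    errno.ECONNREFUSED: "host reached, but nothing accepts connections on that port",
    errno.ETIMEDOUT: "no answer at all, for example packets silently dropped",
}


def explain_connect_error(exc):
    """Return the symbolic errno name plus a short hint for an OSError."""
    name = errno.errorcode.get(exc.errno, "unknown")
    return f"{name}: {HINTS.get(exc.errno, 'no hint available')}"


# "connect: permission denied" from telegraf and curl corresponds to EACCES
print(explain_connect_error(OSError(errno.EACCES, "Permission denied")))
```

This matches the ping result above: an ICMPv6 "Administratively prohibited" answer and a "Permission denied" on connect are two views of the same rejection.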
@crameleon as you stated in #159669-12
Destination unreachable: Administratively prohibited
3) After the bogus hosts entry is removed, the above error, or one similar to it, might be presented instead, because the maintainer of the machine never requested access to this resource.
Is there a network-level firewall in effect here? What do you mean by "the maintainer of the machine never requested access to this resource"? Which machine, which resource? If you mean access from metrics.i.o.o to openqa.o.o 80/tcp or 443/tcp, then this statement is wrong, as that access was present and possible before; hence the access was requested, but obviously it would have been many years ago.
@witekbedyk can you provide me root permissions on the machine either with the help of sudo or the root password so that I could install network debug tools, check the firewall, etc.?
Updated by crameleon 4 months ago · Edited
Is there a network-level firewall in effect here?
All of infra.opensuse.org is firewalled.
then this statement is wrong, as that access was present and possible before; hence the access was requested, but obviously it would have been many years ago.
I informed about this change in November: https://lists.opensuse.org/archives/list/heroes@lists.opensuse.org/message/4YV256BAZPZ4ILA3MI76MVO2BSKBB265/. The maintainer of metrics.i.o.o never reached out about requiring any additional access.
Updated by okurz 4 months ago
crameleon wrote in #note-30:
Is there a network-level firewall in effect here?
All of infra.opensuse.org is firewalled.
then this statement is wrong, as that access was present and possible before; hence the access was requested, but obviously it would have been many years ago.
I informed about this change in November: https://lists.opensuse.org/archives/list/heroes@lists.opensuse.org/message/4YV256BAZPZ4ILA3MI76MVO2BSKBB265/. The maintainer of metrics.i.o.o never reached out about requiring any additional access.
Well, regardless, I am reaching out now :) So can you allow the corresponding access from metrics.infra.opensuse.org to openqa.opensuse.org 443/tcp?
Updated by crameleon 4 months ago · Edited
The reason I did not yet intervene is there being another request related to networking of the metrics.i.o.o machine (https://progress.opensuse.org/issues/123828) lacking maintainer input as well. I prefer the maintainer(s) of a machine to collaborate with others in the admin team to establish all that is needed for their service as opposed to me implementing miscellaneous changes to help users of a service without any acknowledgment of what is really needed by one of the listed maintainers.
That being said, I don't want to block you with politics.
Just to avoid surprises after my implementation: since openQA resides behind a SUSE-side NAT, could you please confirm the IP address I find in DNS (2a07:de40:b251:2:10:150:2:10) is actually the source address of the machine as visible from our openSUSE network?
You can do so using the following on the machine behind openqa.opensuse.org:
curl -6 ip.opensuse.org
Potentially with https://. I hope I recall correctly that your host is already allowed to reach us over either HTTP or HTTPS on the SUSE side.
Updated by crameleon 4 months ago
Sorry, I mixed up the direction of traffic, I don't need your source address for this. ;-)
Submitted via https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/2032.
Will update once done (I can't assign the ticket to myself as I am not part of the project).
Updated by crameleon 4 months ago
Committed as https://progress.opensuse.org/projects/opensuse-admin/repository/salt/revisions/a9ffb36776ac7f00ac5651c05ea2cffe9519f9b2 and tested:
crameleon@metrics:/home/crameleon> curl -sI https://openqa.opensuse.org|head -n3
HTTP/2 200
server: nginx/1.21.5
date: Tue, 20 Aug 2024 19:57:09 GMT
Updated by okurz 4 months ago
- Due date deleted (2024-09-16)
- Status changed from Feedback to Resolved
… and data shows up on https://metrics.opensuse.org/d/osrt_openqa/osrt-openqa?orgId=1&from=1724167501446&to=1724222225350 again \o/
thanks all