tickets #164706


Karma "proxyconnect tcp: dial tcp connect: connection refused"

Added by crameleon 10 months ago. Updated 10 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Monitoring
Target version: -
Start date: 2024-07-30
Due date: -
% Done: 0%
Estimated time: -

Description

Karma always worked fine; I often had it open in the background and would return to it showing the latest data.
Now, however, after about one refresh cycle the page fails with the aforementioned message.

The log is not much more helpful either:

Jul 30 12:55:16 monitor karma[28610]: level=error msg="Collection failed" error="Get \"https://alertmanager.infra.opensuse.org/api/v2/status\": proxyconnect tcp: dial tcp :0: connect: connection refused" alertmanager=local try=1/2
Jul 30 12:55:16 monitor karma[28610]: level=info msg="GET request" timeout=10 uri=https://alertmanager.infra.opensuse.org/metrics
Jul 30 12:55:16 monitor karma[28610]: level=error msg="Request failed" error="Get \"https://alertmanager.infra.opensuse.org/metrics\": proxyconnect tcp: dial tcp :0: connect: connection refused" alertmanager=local uri=https://alertmanager.infra.opensuse.org
Jul 30 12:55:16 monitor karma[28610]: level=error msg="Collection failed" error="Get \"https://alertmanager.infra.opensuse.org/api/v2/status\": proxyconnect tcp: dial tcp :0: connect: connection refused" alertmanager=local try=2/2
Jul 30 12:55:16 monitor karma[28610]: level=info msg="Collection completed"

Note: originally this said uri=http://ipv6-localhost:9093; it now shows the httpd-served URL because I tested without proxy mode (see below).

There is no corresponding error on the Alertmanager side.

Running the following as a very rudimentary load test does not show any stalls either:

    while true ; do curl https://alertmanager.infra.opensuse.org/api/v2/status ; echo ok ; done

Curl from the Karma user works as well.

No relevant AVC messages can be found.

No hardening options in the service unit should impact network connectivity.

I also tried disabling the proxy mode (which tunnels Alertmanager requests through Karma):

# /etc/karma/config.yaml
    #proxy: true
    proxy: false
    #uri: http://ipv6-localhost:9093
    uri: https://alertmanager.infra.opensuse.org

This did not help either, which is rather surprising, as I thought the "proxyconnect" error was a result of using the proxy mode.

Actions #1

Updated by crameleon 10 months ago

  • Private changed from Yes to No
Actions #2

Updated by crameleon 10 months ago · Edited

I also tried running Karma in the foreground with --log-level debug, and with nftables disabled; neither helped much.

Actions #3

Updated by crameleon 10 months ago

It seems it always works once after restarting Karma, but then fails on any subsequent connection attempts.

For simpler debugging, I set the original variant (proxy mode with localhost connection) again.

Packet capture on the loopback interface with a filter on port 9093 then reveals lots of duplicate TCP ACKs for packets between ::1 and ::1. I could not determine whether this is normal; it could be the fast retransmission mechanism. Interestingly, though, the related kernel toggles only exist in the IPv4 space, so there is nothing to tune for IPv6:

$ ls /proc/sys/net/*/*retrans*
/proc/sys/net/ipv4/tcp_early_retrans  /proc/sys/net/ipv4/tcp_retrans_collapse

As a test, I switched the localhost Karma -> Alertmanager connection to IPv4 (127.0.0.1); here no duplicate ACKs are observed, but the connectivity issue remains.

Watching the Karma log alongside the packet trace, it is interesting to note that at the time Karma reports

Jul 30 19:28:45 monitor karma[31783]: level=error msg="Collection failed" error="Get \"http://127.0.0.1:9093/api/v2/status\": proxyconnect tcp: dial tcp :0: connect: connection refused" alertmanager=local try=2/2

no such connection attempt is recorded in the packet trace! Only upon restarting Karma (which briefly makes the UI work again) is one initial batch of requests found:

21:30:04.729931	127.0.0.1	127.0.0.1	HTTP	168	GET /metrics HTTP/1.1 
21:30:04.732502	127.0.0.1	127.0.0.1	TCP	4162	9093  >  39882 [PSH, ACK] Seq=40553 Ack=1129 Win=65536 Len=4096 TSval=426666678 TSecr=426666675 [TCP segment of a reassembled PDU]
21:30:04.732549	127.0.0.1	127.0.0.1	HTTP	1635	HTTP/1.1 200 OK  (text/plain)
21:30:04.732626	127.0.0.1	127.0.0.1	TCP	66	39882  >  9093 [ACK] Seq=1129 Ack=46218 Win=65536 Len=0 TSval=426666678 TSecr=426666678
21:30:04.734217	127.0.0.1	127.0.0.1	HTTP	200	GET /api/v2/status HTTP/1.1 
21:30:04.734747	127.0.0.1	127.0.0.1	HTTP/JSON	3958	HTTP/1.1 200 OK , JSON (application/json)
21:30:04.734968	127.0.0.1	127.0.0.1	HTTP	202	GET /api/v2/silences HTTP/1.1 
21:30:04.735095	127.0.0.1	127.0.0.1	HTTP/JSON	728	HTTP/1.1 200 OK , JSON (application/json)
21:30:04.735349	127.0.0.1	127.0.0.1	HTTP	207	GET /api/v2/alerts/groups HTTP/1.1 
21:30:04.735743	127.0.0.1	127.0.0.1	TCP	4162	9093  >  39882 [PSH, ACK] Seq=50772 Ack=1540 Win=65536 Len=4096 TSval=426666681 TSecr=426666681 [TCP segment of a reassembled PDU]
21:30:04.735764	127.0.0.1	127.0.0.1	TCP	6571	9093  >  39882 [PSH, ACK] Seq=54868 Ack=1540 Win=65536 Len=6505 TSval=426666681 TSecr=426666681 [TCP segment of a reassembled PDU]
21:30:04.735776	127.0.0.1	127.0.0.1	HTTP/JSON	73	HTTP/1.1 200 OK , JSON (application/json)
21:30:04.735778	127.0.0.1	127.0.0.1	TCP	66	39882  >  9093 [ACK] Seq=1540 Ack=61373 Win=61952 Len=0 TSval=426666681 TSecr=426666681
21:30:04.778218	127.0.0.1	127.0.0.1	TCP	66	39882  >  9093 [ACK] Seq=1540 Ack=61380 Win=65536 Len=0 TSval=426666724 TSecr=426666681
Actions #4

Updated by crameleon 10 months ago · Edited

  • Status changed from New to In Progress
  • Assignee set to crameleon

First, I found a problem in our Alertmanager configuration: the parameter cluster.listen-address, introduced in salt.git commit a1b6d5df555f6500ca3144822c9bbf7c4501580c, is not present in the rendered configuration. The formula skips a value if it is falsy, which is the case for our empty string (''):

{%-       if not k or not v %}{% continue %}{% endif -%}

This can possibly be mitigated without a patch to the formula, by writing it as "''".
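A sketch of that workaround, using a hypothetical pillar layout (the real key path may differ): quoting the empty string makes the value the two literal characters '', which is truthy for the formula's `if not v` check while still rendering as an empty listen address.

```yaml
# Hypothetical pillar snippet: "''" is a non-empty string, so the
# formula's falsy check no longer drops the parameter.
alertmanager:
  config:
    cluster.listen-address: "''"
```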

This however is not relevant to the issue at hand.

The real problem is in the Karma configuration. All the while, I had been focusing on the alertmanager section. I did notice that similar errors were produced for the connection to the Prometheus API, but wrongly decided to debug only with the Alertmanager ones. Eventually I turned my attention to the history section (the one relevant for talking to Prometheus itself) and found this oddity:

history:
  enabled: true
  rewrite:
  - proxy_url: string
    source: https://prometheus.infra.opensuse.org
    uri: http://ipv6-localhost:9090

Now, proxy_url: string certainly looks wrong. It seems this was already introduced in the first commit related to Karma, 1937435a1b8b47ded3e55322d6a59ca5e2b85a27.

Removing this setting makes the errors vanish. But why this did not cause issues before is beyond me.
