Project

General

Profile

Actions

action #127985

closed

[alert][grafana] Handle various problems with slow monitoring https requests and such (was: DatasourceNoData HTTP Response alert) size:M

Added by okurz 12 months ago. Updated 12 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-04-19
Due date:
2023-05-04
% Done:

0%

Estimated time:

Description

Observation

http://stats.openqa-monitor.qa.suse.de/alerting/grafana/tm0h5mf4k/view?orgId=1

Suggestions

  • Follow #127985#note-13 and try to prevent too strict timeouts for various http monitoring requests triggered by telegraf

Related issues 2 (0 open2 closed)

Related to openQA Infrastructure - action #127052: [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:MResolvedokurz2023-04-01

Actions
Copied from openQA Infrastructure - action #127982: [alert][grafana] DatasourceNoData e.g. openQA QA-Power8-5-kvm+openqaw5-xen Disk I/O time alert workerResolvedokurz2023-04-19

Actions
Actions #1

Updated by okurz 12 months ago

  • Copied from action #127982: [alert][grafana] DatasourceNoData e.g. openQA QA-Power8-5-kvm+openqaw5-xen Disk I/O time alert worker added
Actions #2

Updated by mkittler 12 months ago

  • Status changed from New to In Progress

The logs are generally dominated by messages from velociraptor.

Around the time this happened it also logs loss of sequences (not sure whether it is relevant, though):

Apr 19 16:19:50 openqa openqa-websockets-daemon[16738]: [debug] [pid:16738] Updating seen of worker 1348 from worker_status (free)
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":839,\"MaxSize\":1073741874,\"AvailableBytes\":357,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":1004,\"MaxSize\":1073741874,\"AvailableBytes\":514,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":1169,\"MaxSize\":1073741874,\"AvailableBytes\":671,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":1367,\"MaxSize\":1073741874,\"AvailableBytes\":861,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":1533,\"MaxSize\":1073741874,\"AvailableBytes\":1019,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":1698,\"MaxSize\":1073741874,\"AvailableBytes\":1176,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 4294967294 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 4294967294 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 4294967294 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 4294967294 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 4294967294 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 4294967294 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":1873,\"MaxSize\":1073741874,\"AvailableBytes\":1343,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":2048,\"MaxSize\":1073741874,\"AvailableBytes\":1510,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":2223,\"MaxSize\":1073741874,\"AvailableBytes\":1677,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":2430,\"MaxSize\":1073741874,\"AvailableBytes\":1876,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":2606,\"MaxSize\":1073741874,\"AvailableBytes\":2044,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":2781,\"MaxSize\":1073741874,\"AvailableBytes\":2211,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":2946,\"MaxSize\":1073741874,\"AvailableBytes\":2368,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":3111,\"MaxSize\":1073741874,\"AvailableBytes\":2525,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 Detected the loss of 1 sequences.
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":3276,\"MaxSize\":1073741874,\"AvailableBytes\":2682,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":3474,\"MaxSize\":1073741874,\"AvailableBytes\":2872,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":3640,\"MaxSize\":1073741874,\"AvailableBytes\":3030,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":3805,\"MaxSize\":1073741874,\"AvailableBytes\":3187,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 read_file: /proc/3101/cmdline: lstat /proc/3101: no such file or directory
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":4013,\"MaxSize\":1073741874,\"AvailableBytes\":3387,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 read_file: /proc/3118/cmdline: lstat /proc/3118: no such file or directory
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":4221,\"MaxSize\":1073741874,\"AvailableBytes\":3587,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 read_file: /proc/3118/cmdline: lstat /proc/3118: no such file or directory
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":4429,\"MaxSize\":1073741874,\"AvailableBytes\":3787,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 read_file: /proc/3122/cmdline: lstat /proc/3122: no such file or directory
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":4637,\"MaxSize\":1073741874,\"AvailableBytes\":3987,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 read_file: /proc/3126/cmdline: lstat /proc/3126: no such file or directory
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":4845,\"MaxSize\":1073741874,\"AvailableBytes\":4187,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 read_file: /proc/3126/cmdline: lstat /proc/3126: no such file or directory
Apr 19 16:19:50 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:50+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":5053,\"MaxSize\":1073741874,\"AvailableBytes\":4387,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 read_file: /proc/3122/cmdline: lstat /proc/3122: no such file or directory
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":5261,\"MaxSize\":1073741874,\"AvailableBytes\":4587,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 read_file: /proc/3078/cmdline: lstat /proc/3078: no such file or directory
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":5469,\"MaxSize\":1073741874,\"AvailableBytes\":4787,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 read_file: /proc/3131/cmdline: lstat /proc/3131: no such file or directory
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":5677,\"MaxSize\":1073741874,\"AvailableBytes\":4987,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 read_file: /proc/3131/cmdline: lstat /proc/3131: no such file or directory
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":5885,\"MaxSize\":1073741874,\"AvailableBytes\":5187,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 read_file: /proc/3083/cmdline: lstat /proc/3083: no such file or directory
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":466,\"WritePointer\":6093,\"MaxSize\":1073741874,\"AvailableBytes\":5387,\"LeasedBytes\":0}","leased_pointer":466}
Apr 19 16:19:51 openqa openqa-websockets-daemon[16738]: [debug] [pid:16738] Updating seen of worker 2230 from worker_status (working)
Apr 19 16:19:51 openqa openqa-websockets-daemon[16738]: [debug] [pid:16738] Updating seen of worker 2197 from worker_status (working)
Apr 19 16:19:51 openqa openqa-websockets-daemon[16738]: [debug] [pid:16738] Updating seen of worker 1247 from worker_status (working)
Apr 19 16:19:51 openqa openqa-websockets-daemon[16738]: [debug] [pid:16738] Updating seen of worker 2282 from worker_status (working)
Apr 19 16:19:51 openqa openqa-websockets-daemon[16738]: [debug] [pid:16738] Updating seen of worker 2209 from worker_status (working)
Apr 19 16:19:51 openqa openqa-websockets-daemon[16738]: [debug] [pid:16738] Updating seen of worker 2288 from worker_status (working)
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 Sender: Connected to https://sec-velociraptor.prg.suse.com:8000/control
Apr 19 16:19:51 openqa velociraptor[718]: [DEBUG] 2023-04-19T16:19:51+02:00 Connection Info {"IdleTime":1001856570,"LocalAddr":{"IP":"149.44.176.58","Port":56958,"Zone":""},"Reused":true,"WasIdle":true}
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 Sender: sent 1747 bytes, response with status: 200 OK
Apr 19 16:19:51 openqa velociraptor[718]: [INFO] 2023-04-19T16:19:51+02:00 Sender: received 626 bytes

While this is technically a no-data alert it looks like just before and right after we had data the response time was very long. Also similar statistics indicate that response times were simply quite long.

So this reminds me of #126212. In case of #126212 we quickly restarted the web UI (and possibly other services) and the problem ceased before the alert was triggered. Here it took longer so the alert fired. The long response times normalized themselves (without us having to restart anything).

Note that I've very recently updated the kernel version for #116722. However, I don't think we were able to reproduce this problem again because this problem was much more severe.

Actions #3

Updated by openqa_review 12 months ago

  • Due date set to 2023-05-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #4

Updated by okurz 12 months ago

  • Description updated (diff)

I silenced the alert

Actions #5

Updated by mkittler 12 months ago

Yes, it apparently happened again: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=78&orgId=1&from=1681977074056&to=1681981528038

At the same time the response time went up and the CPU load down - same as before.

Actions #6

Updated by mkittler 12 months ago

Most of the traffic comes from our workers, e.g. the top-ten host is also exclusively containing workers:

# bsdcat /var/log/apache2/openqa.access_log.1.xz | cut -d "-" -f 1 | sort | uniq -c | sort -g | tail -n 10
  73924 10.160.0.245 
  75752 10.100.96.72 
  77262 2a07:de40:a203:12:ec4:7aff:fe7a:789e 
  79152 10.100.96.76 
 119733 2a07:de40:a203:12:56ab:3aff:fe16:dd73 
 139306 2620:113:80c0:80a0:10:162:2a:42f0 
 165088 2a07:de40:a203:12:56ab:3aff:fe24:358d 
 169252 10.160.0.85 
 199414 10.160.0.227 
 592536 2a07:de40:a203:12:2e60:cff:fe73:2ac

That's actually expected. The only exception is:

# host 2620:113:80c0:80a0:10:162:2a:42f0
Host 0.f.2.4.a.2.0.0.2.6.1.0.0.1.0.0.0.a.0.8.0.c.0.8.3.1.1.0.0.2.6.2.ip6.arpa not found: 3(NXDOMAIN)

Of course this is the statistic for the full logs from 20/Apr/2023:03:00:07 to 20/Apr/2023:13:00:03.

Actions #7

Updated by mkittler 12 months ago

We also see timeouts in the logs:

Apr 20 10:23:20 openqa telegraf[1270]: 2023-04-20T08:23:20Z E! [inputs.http] Error in plugin: [url=https://openqa.suse.de/admin/influxdb/minion]: Get "https://openqa.suse.de/admin/influxdb/minion": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 20 10:23:20 openqa telegraf[1270]: 2023-04-20T08:23:20Z E! [inputs.http] Error in plugin: [url=https://openqa.suse.de/admin/influxdb/jobs]: Get "https://openqa.suse.de/admin/influxdb/jobs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

There are also many PostgreSQL-related messages around that time:

…
Apr 20 10:23:18 openqa su[25344]: (to postgres) root on none
Apr 20 10:23:18 openqa systemd[1]: Started Session c8735 of User postgres.
Apr 20 10:23:18 openqa su[25344]: pam_unix(su-l:session): session opened for user postgres by (uid=0)
Apr 20 10:23:18 openqa su[25344]: pam_unix(su-l:session): session closed for user postgres
Apr 20 10:23:19 openqa systemd[1]: session-c8735.scope: Deactivated successfully.
Apr 20 10:23:19 openqa su[25369]: (to postgres) root on none
Apr 20 10:23:19 openqa systemd[1]: Started Session c8736 of User postgres.
Apr 20 10:23:19 openqa su[25369]: pam_unix(su-l:session): session opened for user postgres by (uid=0)
Apr 20 10:23:19 openqa su[25369]: pam_unix(su-l:session): session closed for user postgres
Apr 20 10:23:19 openqa systemd[1]: session-c8736.scope: Deactivated successfully.
Apr 20 10:23:19 openqa su[25394]: (to postgres) root on none
Apr 20 10:23:19 openqa systemd[1]: Started Session c8737 of User postgres.
Apr 20 10:23:19 openqa su[25394]: pam_unix(su-l:session): session opened for user postgres by (uid=0)
Apr 20 10:23:19 openqa su[25394]: pam_unix(su-l:session): session closed for user postgres
Apr 20 10:23:19 openqa systemd[1]: session-c8737.scope: Deactivated successfully.
Apr 20 10:23:19 openqa su[25420]: (to postgres) root on none
Apr 20 10:23:19 openqa systemd[1]: Started Session c8738 of User postgres.
Apr 20 10:23:19 openqa su[25420]: pam_unix(su-l:session): session opened for user postgres by (uid=0)
Apr 20 10:23:19 openqa openqa-websockets-daemon[19716]: [debug] [pid:19716] Updating seen of worker 479 from worker_status (free)
Apr 20 10:23:19 openqa su[25420]: pam_unix(su-l:session): session closed for user postgres
Apr 20 10:23:19 openqa systemd[1]: session-c8738.scope: Deactivated successfully.
Apr 20 10:23:19 openqa su[25445]: (to postgres) root on none
Apr 20 10:23:19 openqa systemd[1]: Started Session c8739 of User postgres.
Apr 20 10:23:19 openqa su[25445]: pam_unix(su-l:session): session opened for user postgres by (uid=0)
…

velociraptor is also quite verbose; when scrolling through the logs there are sometimes pages with just messages from it. However, that's "normal" and thus likely not the culprit.

The PostgreSQL log shows many permission denied errors around that time. So apparently some service tried to read data from the database without proper permissions. This always happens once per hour (always starts at minute 27), so also "normal" and not the culprit.

Not only the CPU usage but also the network usage (on eth0) went down during the time frame. The disk I/O usage was pretty normal overall. So this leaves one really wondering what was slowing down the system. Next time it happens I need to monitor more closely with htop.

Actions #8

Updated by okurz 12 months ago

Sounds to me very much alike the problems we already had with apache connections clogging up until effective deadlocks. Most important as always is to reproduce the issue even it's just waiting for this one time during a week at 1145 or something. Then we could try out things like restart apache, increase number of processes, connections. Or replace with nginx

Actions #9

Updated by mkittler 12 months ago

Note that the data comes from (which runs on the web UI itself):

[[inputs.http_response]]
  urls = [ "https://{{grains['fqdn']}}/tests" ]
  interval = "1m"

Since we've got a NoDataAlert one would assume that message like

Apr 20 10:25:20 openqa telegraf[1270]: 2023-04-20T08:25:20Z E! [inputs.http] Error in plugin: [url=https://openqa.suse.de/admin/influxdb/jobs]: Get "https://openqa.suse.de/admin/influxdb/jobs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

show up in the journal around that time. However, they don't show up at all (only for /admin/influxdb/… routes). So there's a lack of data but telegraf doesn't complain about it as it does for other queries. Supposedly that is because the other queries explicitly specify a timeout:

[[inputs.http]]
  urls = [ "https://{{grains['fqdn']}}/admin/influxdb/jobs" ]
  data_format = "influx"
  timeout = "20s"
  interval = "30s"

[[inputs.http]]
  urls = [ "https://{{grains['fqdn']}}/admin/influxdb/minion" ]
  data_format = "influx"
  timeout = "20s"
  interval = "1m"

Maybe we should do that here as well. Of course it wouldn't help with the underlying issue but maybe still help to prevent the no-data situation. At least there wasn't missing data during that timeframe for the /admin/influxdb/… routes (despite context deadline exceeded being logged a few times).

Actions #10

Updated by okurz 12 months ago

  • Related to action #127052: [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:M added
Actions #11

Updated by okurz 12 months ago

please be aware about my MR https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/844 to handle NoData gracefully also affecting the alert mentioned here.

Actions #12

Updated by mkittler 12 months ago

  • Description updated (diff)

Yes, this means we can focus on the actual problem here.

Actions #13

Updated by mkittler 12 months ago

Let's see what @kraih #127052 can find out.


The timeout specified in inputs.http is basically ending up in the Client struct. If no timeout is specified the default is 5 seconds (see https://github.com/influxdata/telegraf/blob/35d1ee3d967c8328a6c718dfaac65a983f437ce1/plugins/common/http/config.go#L51). So this client is quite inpatient which explains why we get context deadline exceeded very soon/often.

Actions #14

Updated by mkittler 12 months ago

  • Status changed from In Progress to Feedback
Actions #15

Updated by okurz 12 months ago

  • Subject changed from [alert][grafana] DatasourceNoData HTTP Response alert to [alert][grafana] Handle various problems with slow monitoring https requests and such (was: DatasourceNoData HTTP Response alert) size:M
  • Description updated (diff)
Actions #16

Updated by mkittler 12 months ago

We're actually using inputs.http_response here (and not inputs.http). However, conclusion from my previous comment still holds. This MR should avoid the empty graph sections: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/848

Actions #18

Updated by mkittler 12 months ago

  • Status changed from Feedback to Resolved

The graph hasn't shown any gaps since the MR has been merged. So I guess this is good enough. (And the alert itself is suppressed anyways via https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/844.)

Actions

Also available in: Atom PDF