action #138287: petrol sometimes takes a long time to respond/render http://localhost:9530/influxdb/minion
Status: closed
Description
Observation
Sometimes pipelines (e.g. https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1915033) fail with:
2023-10-19T13:14:13Z E! [inputs.http] Error in plugin: [url=http://localhost:9530/influxdb/minion]: Get "http://localhost:9530/influxdb/minion": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
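telegraf's HTTP input gives up once its client timeout elapses and then fails the whole plugin run, which is what the error above shows. A quick way to mimic that behavior is curl's --max-time option; using 5 seconds here is an assumption about the configured timeout, not a value from the ticket:
# Abort the same way telegraf does if no response arrives within 5 s
# (the 5 s value is an assumed timeout, not confirmed from our config)
curl --max-time 5 http://localhost:9530/influxdb/minion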
It seems like the endpoint on that host sometimes takes a long time to respond:
petrol:~ # time curl http://localhost:9530/influxdb/minion
openqa_minion_jobs,url=http://localhost:9530 active=0i,delayed=0i,failed=19i,inactive=0i
openqa_minion_workers,url=http://localhost:9530 active=0i,inactive=1i,registered=1i
openqa_download_count,url=http://localhost:9530 count=0i
openqa_download_rate,url=http://localhost:9530 bytes=28359186i
real 0m0.008s
user 0m0.006s
sys 0m0.000s
petrol:~ # time curl http://localhost:9530/influxdb/minion
openqa_minion_jobs,url=http://localhost:9530 active=0i,delayed=0i,failed=19i,inactive=0i
openqa_minion_workers,url=http://localhost:9530 active=0i,inactive=1i,registered=1i
openqa_download_count,url=http://localhost:9530 count=0i
openqa_download_rate,url=http://localhost:9530 bytes=28359186i
real 0m0.008s
user 0m0.006s
sys 0m0.000s
petrol:~ # time curl http://localhost:9530/influxdb/minion
openqa_minion_jobs,url=http://localhost:9530 active=0i,delayed=0i,failed=19i,inactive=1i
openqa_minion_workers,url=http://localhost:9530 active=0i,inactive=1i,registered=1i
openqa_download_count,url=http://localhost:9530 count=0i
openqa_download_rate,url=http://localhost:9530 bytes=28359186i
real 0m6.242s
user 0m0.003s
sys 0m0.003s
petrol:~ # time curl http://localhost:9530/influxdb/minion
openqa_minion_jobs,url=http://localhost:9530 active=1i,delayed=0i,failed=19i,inactive=0i
openqa_minion_workers,url=http://localhost:9530 active=1i,inactive=0i,registered=1i
openqa_download_count,url=http://localhost:9530 count=1i
openqa_download_rate,url=http://localhost:9530 bytes=28359186i
real 0m11.547s
user 0m0.006s
sys 0m0.000s
Reproducible
Not sure what causes the long response times, but I could easily reproduce it by running time curl http://localhost:9530/influxdb/minion a couple of times, as in the sketch below.
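A minimal sketch for probing the endpoint repeatedly and printing only the elapsed time per request (the loop count and curl flags are my choice, not from the original report):
# Probe the endpoint 20 times, printing how long each request took
for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{time_total}s\n' http://localhost:9530/influxdb/minion
done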
Expected result
The route should respond quickly, not take multiple seconds. At the very least, if we cannot understand or fix the underlying problem, our pipelines should not fail because of this.
Suggestions
- Understand why that API endpoint takes so long to respond on only that host
- Bump the HTTP input timeout in our telegraf config (see the sketch below)
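For the second suggestion, a sketch of what a bumped timeout could look like as a telegraf drop-in; the file path, the 20 s value, and the data_format line are assumptions, and the actual change would be managed through salt-states-openqa rather than edited by hand:
# Sketch only: raise the inputs.http timeout above telegraf's 5 s default.
# Path and values are assumptions; our real config is deployed via salt.
sudo tee /etc/telegraf/telegraf.d/openqa-minion.conf >/dev/null <<'EOF'
[[inputs.http]]
  urls = ["http://localhost:9530/influxdb/minion"]
  data_format = "influx"
  timeout = "20s"
EOF
sudo systemctl restart telegraf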
Updated by mkittler 11 months ago
- Status changed from New to In Progress
I could reproduce a 4-second delay on the 3rd attempt. Of course, 4 seconds is not that much considering the system is seriously busy (all worker slots are utilized). I would suspect that the SQLite database is busy (e.g. a write operation is blocking and/or the disk is generally busy).
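One way to test the busy-disk hypothesis is to watch disk utilization while timing the endpoint; a slow probe that coincides with high %util would support the idea of blocked SQLite writes. This sketch assumes sysstat is installed, and the probe count is arbitrary:
# Report extended disk stats every 2 s while timing a few probes
iostat -x 2 10 &
for i in $(seq 1 10); do
  curl -s -o /dev/null -w '%{time_total}s\n' http://localhost:9530/influxdb/minion
  sleep 2
done
wait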
Updated by openqa_review 11 months ago
- Due date set to 2023-11-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 11 months ago
- Status changed from In Progress to Feedback
Besides the 4-second delay on one request yesterday, I couldn't reproduce the problem at all anymore. I suppose it can nevertheless still happen when a worker is very busy, so I created an MR to increase the timeout: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1043
Updated by okurz 11 months ago
- Due date deleted (2023-11-28)
- Status changed from Feedback to Resolved
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1043 was merged. As you stated that you can't reproduce the problem, and because I tried now and could not reproduce it either, I'd say we can resolve right away.