action #152857
closed [tools] alert ping between hosts timeout proxy.scc.suse.de
Description
Observation
https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?viewPanel=4&orgId=1
Looks like proxy.scc.suse.de is down.
https://suse.slack.com/archives/C029APBKLGK/p1703170652751919
Q: Who is responsible for proxy.scc.suse.de and where is it running?
Rollback actions
- DONE Remove silence "alertname=Packet loss between worker hosts and other hosts alert" from https://monitor.qa.suse.de/alerting/silences
Updated by okurz 9 months ago
- Has duplicate action #153023: [FIRING:1] (Packet loss between worker hosts and other hosts alert Salt 2Z025iB4km) added
Updated by okurz 9 months ago
@jstehlik I added you to the SD ticket. But I suggest just sharing this ticket here, as it is accessible to everybody relevant, and sharing SD tickets is inefficient and tedious due to the artificial limitations put in place. Also see https://gitlab.suse.de/suse/internal-tools/-/issues/29 about that.
Updated by okurz 9 months ago
- Status changed from Blocked to In Progress
@here proxy-SCC should work again. I retriggered a SLE15-SP6 s390x job to check, currently running https://openqa.suse.de/tests/13223863
Updated by okurz 9 months ago · Edited
@Jose Gomez a little progress, but there are still errors that need to be solved: https://openqa.suse.de/tests/13223863#live shows "Connection to registration server failed. Details: ExecuteError: … Unable to open configuration". Can you look into that?
EDIT: ok. I retriggered the test again so that people are not confused when they see a note that the job fails due to developer mode. The new clone should just cleanly reproduce https://bugzilla.suse.com/show_bug.cgi?id=1217923 . In the meantime I have triggered https://openqa.suse.de/tests/13224227#live which might more cleanly show that proxy-SCC is fine
Updated by okurz 9 months ago
- Status changed from In Progress to Resolved
https://openqa.suse.de/tests/13224227# passed, all good now.
Updated by okurz 9 months ago
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
- Target version changed from Ready to Tools - Next
So it seems proxy.scc.suse.de is not pingable but otherwise fully usable.
I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/704 for now and also opened https://jira.suse.com/browse/ENGINFRA-3696.
Updated by josegomezr 9 months ago
So it seems proxy.scc.suse.de is not pingable but otherwise fully usable.
Correct, the new deployment is powered by Kubernetes. In that architecture there's no ICMP traffic involved at all. ping is effectively impossible to provide.
Not to be a fortune teller but https://jira.suse.com/browse/ENGINFRA-3696 would likely not be fulfilled either.
To obtain an equivalent functionality please do an HTTP health check against the health endpoint:
http://proxy.scc.suse.de/api/health/status
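As a rough illustration of what such an HTTP check amounts to, here is a minimal Python sketch that treats any 2xx answer as healthy. The function name `http_health_ok` and its `opener` parameter are hypothetical names chosen for this example, not part of any SUSE tooling; only the endpoint URL comes from this ticket.

```python
import urllib.error
import urllib.request


def http_health_ok(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a GET on `url` answers with a 2xx status.

    `opener` is a hypothetical hook so the check can be exercised
    without network access; by default it is urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx responses) is a subclass of URLError;
        # timeouts and refused connections surface as URLError or OSError.
        return False
```

Calling `http_health_ok("http://proxy.scc.suse.de/api/health/status")` from a host that can reach the proxy would then play the role the ping check used to play.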
Prometheus/Grafana allows for this. We used to have something like:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - https://scc.suse.com/api/health/status
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*)(:80)?
        target_label: __param_target
      - source_labels: [__param_target]
        regex: (.*)
        target_label: instance
        replacement: ${1}
      - source_labels: []
        regex: .*
        target_label: __address__
        replacement: 127.0.0.1:9115
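To make the effect of the three relabel rules easier to follow, here is a simplified Python simulation of the `replace` action. It is a sketch, not Prometheus code: it assumes the documented defaults (source labels joined with `;`, fully anchored regex, default regex `(.*)`, default replacement `$1`) and ignores every other relabel feature. The names `relabel` and `RULES` are made up for this example.

```python
import re

# The three rules from the scrape config above, as plain dicts.
RULES = [
    {"source_labels": ["__address__"], "regex": "(.*)(:80)?",
     "target_label": "__param_target"},
    {"source_labels": ["__param_target"], "regex": "(.*)",
     "target_label": "instance", "replacement": "${1}"},
    {"source_labels": [], "regex": ".*",
     "target_label": "__address__", "replacement": "127.0.0.1:9115"},
]


def relabel(labels, rules):
    """Apply 'replace' rules in order and return the resulting label set."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are concatenated with ';' (the assumed default).
        source = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # The regex must match the whole source value, otherwise the rule is skipped.
        match = re.fullmatch(rule.get("regex", "(.*)"), source)
        if match is None:
            continue
        # Translate Go-style $1 / ${1} references into Python's \g<1> form.
        template = re.sub(
            r"\$\{(\d+)\}|\$(\d+)",
            lambda m: "\\g<%s>" % (m.group(1) or m.group(2)),
            rule.get("replacement", "$1"),
        )
        labels[rule["target_label"]] = match.expand(template)
    return labels
```

Feeding it `{"__address__": "https://scc.suse.com/api/health/status"}` shows the pattern: the target URL is copied to `__param_target` (the `target` query parameter of `/probe`) and to `instance`, and `__address__` is then overwritten with `127.0.0.1:9115`, so the scrape itself goes to the exporter at that address rather than to the probed URL.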
Updated by josegomezr 9 months ago
Q: Who is responsible for proxy.scc.suse.de and where is it running?
As clarified in SD-142055, there's a shared responsibility model defined: https://itpe.io.suse.de/open-platform/docs/docs/getting_started/shared_responsibility_model/
TL;DR:
- SCC Team ensures that the application runs. The application runs now on one of the OpenPlatform clusters.
- The OpenPlatform clusters are managed by ITPE Platform team.
- And the hardware/underlying infrastructure (network, compute - Harvester cluster in PRG2; hardware) is owned by ITPE Infrastructure team (Eng-Infra).
Updated by okurz 9 months ago
- Copied to action #153418: http based health check against proxy.scc.suse.de size:M added
Updated by okurz 9 months ago
- Status changed from Blocked to In Progress
- Target version changed from Tools - Next to Ready
Thanks for the clarification. So I rejected the feature request in the Jira task and merged https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/704
Created #153418 for the follow-up. Awaiting deployment of https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/704 and then I can check the alert and remove the silence.
Updated by openqa_review 9 months ago
- Due date set to 2024-01-26
Setting due date based on mean cycle time of SUSE QE Tools