action #152857
closed[tools] alert ping between hosts timeout proxy.scc.suse.de
0%
Description
Observation
https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?viewPanel=4&orgId=1
Looks like proxy.scc.suse.de is down.
https://suse.slack.com/archives/C029APBKLGK/p1703170652751919
Q: Who is responsible for proxy.scc.suse.de and where is it running?
Rollback actions
- DONE Remove silence "alertname=Packet loss between worker hosts and other hosts alert" from https://monitor.qa.suse.de/alerting/silences
Updated by osukup about 1 year ago
- Description updated (diff)
- Status changed from Closed to New
Updated by osukup about 1 year ago
Updated by okurz about 1 year ago
- Status changed from New to Blocked
- Assignee set to okurz
- Priority changed from Normal to High
- Target version set to Ready
Thank you. Will track this
Updated by livdywan about 1 year ago · Edited
osukup wrote in #note-3:
Filed https://sd.suse.com/servicedesk/customer/portal/1/SD-143085
Last response is from 2023-12-22 9:48, so let's give it time whilst people are returning from end-of-year holidays (this was flagged by our SLOs).
Updated by okurz about 1 year ago
- Has duplicate action #153023: [FIRING:1] (Packet loss between worker hosts and other hosts alert Salt 2Z025iB4km) added
Updated by jstehlik about 1 year ago
I don't have rights to see the SD ticket and to escalate it. Please add me to the ticket or escalate on my behalf. This seems to be currently the most critical blocker.
Updated by okurz about 1 year ago
@jstehlik I added you to the SD ticket. But I suggest just sharing this ticket here, as it is accessible to everybody relevant, whereas sharing SD tickets is inefficient and tedious due to the artificial limitations put in place. Also see https://gitlab.suse.de/suse/internal-tools/-/issues/29 about that.
Updated by okurz about 1 year ago
My vague understanding from the quite limited communication that I have access to is that currently the SCC team is setting up a new proxy-SCC instance on a new VM cluster somewhere, somehow. Awaiting more updates in the linked SD tickets and by jgomez.
Updated by livdywan about 1 year ago
Updated by okurz about 1 year ago
- Status changed from Blocked to In Progress
@here proxy-SCC should work again. I retriggered a SLE15-SP6 s390x job to check, currently running https://openqa.suse.de/tests/13223863
Updated by okurz about 1 year ago · Edited
@Jose Gomez a little progress, but there are still errors that need to be solved: https://openqa.suse.de/tests/13223863#live shows "Connection to registration server failed. Details: ExecuteError: … Unable to open configuration". Can you look into that?
EDIT: ok. I retriggered the test again so that people are not confused when they see a note that the job fails due to developer mode. The new clone should just cleanly reproduce https://bugzilla.suse.com/show_bug.cgi?id=1217923 . In the meantime I have triggered https://openqa.suse.de/tests/13224227#live which might more cleanly show that proxy-SCC is fine
Updated by okurz about 1 year ago
- Status changed from In Progress to Resolved
https://openqa.suse.de/tests/13224227# passed, all good now.
Updated by okurz about 1 year ago
- Status changed from Resolved to In Progress
Wait, we still have the alert
Updated by okurz about 1 year ago
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
- Target version changed from Ready to Tools - Next
So it seems proxy.scc.suse.de is not pingable but otherwise fully usable.
I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/704 for now and also opened https://jira.suse.com/browse/ENGINFRA-3696.
Updated by josegomezr about 1 year ago
So it seems proxy.scc.suse.de is not pingable but otherwise fully usable.
Correct, the new deployment is powered by Kubernetes. In that architecture there's no ICMP traffic involved at all, so ping is effectively impossible to provide.
Not to be a fortune teller but https://jira.suse.com/browse/ENGINFRA-3696 would likely not be fulfilled either.
To obtain an equivalent functionality please do an HTTP health check against the health endpoint:
http://proxy.scc.suse.de/api/health/status
Prometheus/Grafana allows for this. We used to have something like:
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx] # Look for a HTTP 200 response.
  static_configs:
    - targets:
      - https://scc.suse.com/api/health/status
  relabel_configs:
    - source_labels: [__address__]
      regex: (.*)(:80)?
      target_label: __param_target
    - source_labels: [__param_target]
      regex: (.*)
      target_label: instance
      replacement: ${1}
    - source_labels: []
      regex: .*
      target_label: __address__
      replacement: 127.0.0.1:9115
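For a quick manual check outside of Prometheus, the same idea can be sketched in a few lines of Python. This is only an illustration, not what the SCC or QE Tools teams actually run: the function names is_healthy and build_probe_url, the 5 second timeout and the "any 2xx counts as healthy" rule are assumptions. The health URL is the one given above, and the exporter address and module name are taken from the config snippet. build_probe_url shows roughly which URL the relabeling rules make Prometheus scrape on the blackbox exporter.

```python
import urllib.error
import urllib.parse
import urllib.request

# Health endpoint as given in the comment above; the rest is illustrative.
HEALTH_URL = "http://proxy.scc.suse.de/api/health/status"


def is_healthy(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with an HTTP 2xx status (assumed criterion)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers DNS failures, refused connections, timeouts, malformed URLs
        # and 4xx/5xx responses (urllib raises HTTPError, a URLError subclass).
        return False


def build_probe_url(target: str, exporter: str = "127.0.0.1:9115",
                    module: str = "http_2xx") -> str:
    """Build the blackbox exporter URL that the relabeling above roughly results in."""
    query = urllib.parse.urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


if __name__ == "__main__":
    print("healthy" if is_healthy() else "unhealthy")
    print(build_probe_url("https://scc.suse.com/api/health/status"))
```

Unlike ping, this exercises the application itself, so it keeps working on a deployment where ICMP is not available.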
Updated by josegomezr about 1 year ago
Q: Who is responsible for proxy.scc.suse.de and where is it running?
As clarified in SD-142055, there's a shared responsibility model defined: https://itpe.io.suse.de/open-platform/docs/docs/getting_started/shared_responsibility_model/
TL;DR:
- SCC Team ensures that the application runs. The application runs now on one of the OpenPlatform clusters.
- The OpenPlatform clusters are managed by ITPE Platform team.
- And the hardware/underlying infrastructure (network, compute - Harvester cluster in PRG2; hardware) is owned by ITPE Infrastructure team (Eng-Infra).
Updated by okurz about 1 year ago
- Copied to action #153418: http based health check against proxy.scc.suse.de size:M added
Updated by okurz about 1 year ago
- Status changed from Blocked to In Progress
- Target version changed from Tools - Next to Ready
Thanks for the clarification. So I rejected the feature request in the Jira task and merged https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/704
Created #153418 for the follow-up. Awaiting deployment of https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/704 and then I can check the alert and remove the silence.
Updated by openqa_review about 1 year ago
- Due date set to 2024-01-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 1 year ago
- Description updated (diff)
- Due date deleted (2024-01-26)
- Status changed from In Progress to Resolved
Alert not firing anymore, rollback action done, problem resolved, follow-up created, good to resolve.