action #152857

closed

[tools] alert ping between hosts timeout proxy.scc.suse.de

Added by osukup 4 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2023-12-21
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?viewPanel=4&orgId=1

It looks like proxy.scc.suse.de is down.

https://suse.slack.com/archives/C029APBKLGK/p1703170652751919

Q: Who is responsible for proxy.scc.suse.de, and where is it running?

Rollback actions


Related issues 2 (0 open, 2 closed)

Has duplicate openQA Infrastructure - action #153023: [FIRING:1] (Packet loss between worker hosts and other hosts alert Salt 2Z025iB4km) · Rejected · okurz · 2023-12-08

Copied to openQA Infrastructure - action #153418: http based health check against proxy.scc.suse.de size:M · Resolved · nicksinger · 2023-12-21 · 2024-01-30

Actions #1

Updated by osukup 4 months ago · Edited

  • Status changed from New to Closed
Actions #2

Updated by osukup 4 months ago

  • Description updated (diff)
  • Status changed from Closed to New
Actions #4

Updated by osukup 4 months ago

  • Tags set to alert, infra
Actions #5

Updated by okurz 4 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz
  • Priority changed from Normal to High
  • Target version set to Ready

Thank you. Will track this

Actions #6

Updated by okurz 4 months ago

  • Description updated (diff)
Actions #7

Updated by livdywan 4 months ago · Edited

osukup wrote in #note-3:

Filed https://sd.suse.com/servicedesk/customer/portal/1/SD-143085

The last response is from 2023-12-22 9:48, so let's give it time whilst people are returning from their end-of-year holidays (this was flagged by our SLOs).

Actions #8

Updated by okurz 4 months ago

  • Has duplicate action #153023: [FIRING:1] (Packet loss between worker hosts and other hosts alert Salt 2Z025iB4km) added
Actions #9

Updated by jstehlik 4 months ago

I don't have the rights to see the SD ticket or to escalate it. Please add me to the ticket or escalate on my behalf. This currently seems to be the most critical blocker.

Actions #10

Updated by okurz 4 months ago

@jstehlik I added you to the SD ticket. But I suggest just sharing this ticket here, as it is accessible to everybody relevant, whereas sharing SD tickets is inefficient and tedious due to the artificial limitations put in place. Also see https://gitlab.suse.de/suse/internal-tools/-/issues/29 about that.

Actions #11

Updated by okurz 4 months ago

My vague understanding from the quite limited communication that I have access to is that currently the SCC team is setting up a new proxy-SCC instance on a new VM cluster somewhere, somehow. Awaiting more updates in the linked SD tickets and by jgomez.

Actions #13

Updated by okurz 4 months ago

  • Status changed from Blocked to In Progress

@here proxy-SCC should work again. I retriggered a SLE15-SP6 s390x job to check, currently running https://openqa.suse.de/tests/13223863

Actions #14

Updated by okurz 4 months ago · Edited

@Jose Gomez there is some progress, but errors remain that need to be solved: https://openqa.suse.de/tests/13223863#live shows "Connection to registration server failed. Details: ExecuteError: … Unable to open configuration". Can you look into that?

EDIT: ok. I retriggered the test again so that people are not confused when they see a note that the job fails due to developer mode. The new clone should just cleanly reproduce https://bugzilla.suse.com/show_bug.cgi?id=1217923 . In the meantime I have triggered https://openqa.suse.de/tests/13224227#live which might more cleanly show that proxy-SCC is fine

Actions #15

Updated by okurz 4 months ago

  • Status changed from In Progress to Resolved
Actions #16

Updated by okurz 4 months ago

  • Status changed from Resolved to In Progress

Wait, we still have the alert

Actions #17

Updated by okurz 4 months ago

  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal
  • Target version changed from Ready to Tools - Next

So it seems proxy.scc.suse.de is not pingable but otherwise fully usable.
I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/704 for now and also opened https://jira.suse.com/browse/ENGINFRA-3696.

Actions #18

Updated by josegomezr 4 months ago

So it seems proxy.scc.suse.de is not pingable but otherwise fully usable.

Correct, the new deployment is powered by Kubernetes. In that architecture there's no ICMP traffic involved at all. ping is effectively impossible to provide.

Not to be a fortune teller but https://jira.suse.com/browse/ENGINFRA-3696 would likely not be fulfilled either.

To obtain equivalent functionality, please run an HTTP health check against the health endpoint:

http://proxy.scc.suse.de/api/health/status

Prometheus/Grafana allows for this. We used to have something like:

  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - https://scc.suse.com/api/health/status
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*)(:80)?
        target_label: __param_target
      - source_labels: [__param_target]
        regex: (.*)
        target_label: instance
        replacement: ${1}
      - source_labels: []
        regex: .*
        target_label: __address__
        replacement: 127.0.0.1:9115
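
For a quick manual check outside of Prometheus, the same idea can be sketched in a few lines of Python. This is only a minimal illustration, not what the monitoring setup actually runs: the function name `is_healthy`, the `opener` parameter and the 5-second timeout are assumptions made for this example, and only the health URL is taken from this thread. Like the `http_2xx` module above, it treats any 2xx response as healthy.

```python
import http.client
import urllib.request

# Health endpoint quoted in this thread; everything else here is illustrative.
HEALTH_URL = "http://proxy.scc.suse.de/api/health/status"


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with an HTTP 2xx status within `timeout` seconds.

    `opener` is a hypothetical injection point so the check can be exercised
    without network access; it defaults to urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError (raised by urlopen for 4xx/5xx responses), socket
        # timeouts and refused connections are all OSError subclasses;
        # HTTPException covers malformed responses.
        return False
```

Calling `is_healthy(HEALTH_URL)` from a host that can reach the proxy should then give a yes/no answer comparable to what the ping check used to provide, without relying on ICMP.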

Actions #19

Updated by josegomezr 4 months ago

Q: Who is responsible for proxy.scc.suse.de, and where is it running?

As clarified in SD-142055, there is a defined shared responsibility model: https://itpe.io.suse.de/open-platform/docs/docs/getting_started/shared_responsibility_model/

TL;DR:

  • SCC Team ensures that the application runs. The application runs now on one of the OpenPlatform clusters.
  • The OpenPlatform clusters are managed by ITPE Platform team.
  • And the hardware/underlying infrastructure (network, compute - Harvester cluster in PRG2; hardware) is owned by ITPE Infrastructure team (Eng-Infra).
Actions #20

Updated by okurz 4 months ago

  • Copied to action #153418: http based health check against proxy.scc.suse.de size:M added
Actions #21

Updated by okurz 4 months ago

  • Status changed from Blocked to In Progress
  • Target version changed from Tools - Next to Ready

Thanks for the clarification. So I rejected the feature request in the Jira task and merged https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/704

Created #153418 for the follow-up. Awaiting deployment of https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/704 and then I can check the alert and remove the silence.

Actions #22

Updated by openqa_review 4 months ago

  • Due date set to 2024-01-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #23

Updated by okurz 4 months ago

  • Description updated (diff)
  • Due date deleted (2024-01-26)
  • Status changed from In Progress to Resolved

Alert not firing anymore, rollback action done, problem resolved, follow-up created, good to resolve.
