action #121816
Cannot access installation media on updates.suse.com - maintenance tests broken size:S (closed)
Description
Motivation
See http://dashboard.qam.suse.de/blocked
Or a concrete example: https://openqa.suse.de/tests/10144681#step/installation/28
Just reproduced: https://openqa.suse.de/tests/10148791#
Acceptance criteria
- AC1: updates.suse.com is reachable consistently
- AC2: Follow-up conversations were conducted
Updated by okurz almost 2 years ago
- Tags set to infra
- Target version set to Ready
Updated by jbaier_cz almost 2 years ago
Some more info:
The server is reachable from outside and from the office network, but it is indeed unreachable from openqa itself as well as from openqaworker5:
worker5:~> ping -c 1 updates.suse.com
PING cs183.wpc.gammacdn.net (68.232.34.211) 56(84) bytes of data.
--- cs183.wpc.gammacdn.net ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
The issue was already reported as SD-106762 (reported on Saturday, see https://suse.slack.com/archives/C029APBKLGK/p1670663165781539).
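A failing ICMP ping against a CDN host is not conclusive on its own (edge nodes may drop or rate-limit ICMP), so it is worth cross-checking with an actual HTTPS request from the same machine; the commands below are only an illustration of that kind of check:
# ICMP check - may be dropped by the CDN even while the service itself works
ping -c 1 updates.suse.com
# HTTPS check against the real service; -I fetches only headers and
# --max-time prevents hanging when packets are silently dropped
curl -sI --max-time 10 https://updates.suse.com/ && echo "HTTPS reachable" || echo "HTTPS unreachable"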
Updated by nicksinger almost 2 years ago
- Status changed from New to Blocked
- Assignee set to nicksinger
Rudi observed the same problem and raised an SD ticket (https://sd.suse.com/servicedesk/customer/portal/1/SD-106762) - we cannot access it. I will keep this ticket blocked until it works again, but there is nothing more we can do atm.
Updated by MDoucha almost 2 years ago
updates.suse.com is reachable from openqa.qam.suse.cz (PRG office server room) so the problem is likely in firewall rules.
Updated by MDoucha almost 2 years ago
What's more interesting: traceroute reaches updates.suse.com even when ping, curl and wget fail. Tested from geordi.qam.suse.de:
geordi:~ # traceroute updates.suse.com
traceroute to updates.suse.com (68.232.34.211), 30 hops max, 60 byte packets
1 10.161.255.251 (10.161.255.251) 1.082 ms 1.014 ms 0.992 ms
2 10.156.163.126 (10.156.163.126) 0.307 ms 0.290 ms 0.273 ms
3 10.156.163.92 (10.156.163.92) 0.455 ms 0.439 ms 0.420 ms
4 195.135.221.18 (195.135.221.18) 0.687 ms 0.854 ms 0.831 ms
5 ae1-479.nbg30.core-backbone.com (5.56.18.209) 1.390 ms 1.236 ms 1.214 ms
6 ae1-2001.fra10.core-backbone.com (81.95.15.162) 4.100 ms 3.946 ms 4.764 ms
7 ae-87.border1.frn.edgecastcdn.net (152.195.101.212) 4.451 ms 4.879 ms 4.819 ms
8 ae-66.core1.frb.edgecastcdn.net (152.195.101.131) 4.264 ms 10.995 ms *
9 68.232.34.211 (68.232.34.211) 3.660 ms * 3.574 ms
10 68.232.34.211 (68.232.34.211) 3.673 ms 3.769 ms 3.989 ms
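One way to narrow down such a discrepancy (traceroute gets through while ping, curl and wget fail) is to probe with the same protocol and port the service actually uses, e.g. a TCP traceroute to port 443 plus a verbose curl. Illustrative commands, assuming the standard Linux traceroute:
# TCP SYN probes to the HTTPS port instead of the default UDP probes (usually needs root)
traceroute -T -p 443 updates.suse.com
# a verbose transfer shows whether the TCP connect, the TLS handshake or the
# HTTP response is the step that actually stalls
curl -v --max-time 10 -o /dev/null https://updates.suse.com/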
Updated by okurz almost 2 years ago
nicksinger wrote:
Rudi observed the same problem and raised an SD ticket (https://sd.suse.com/servicedesk/customer/portal/1/SD-106762) - we cannot access it. I will keep this ticket blocked until it works again, but there is nothing more we can do atm.
Please only block on tickets that we can really track. I don't mind if you spam the SUSE-IT team with duplicate tickets for multiple reporters because this is the way they prefer to work.
Updated by jbaier_cz almost 2 years ago
okurz wrote:
nicksinger wrote:
Rudi observed the same problem and raised an SD ticket (https://sd.suse.com/servicedesk/customer/portal/1/SD-106762) - we cannot access it. I will keep this ticket blocked until it works again, but there is nothing more we can do atm.
Please only block on tickets that we can really track. I don't mind if you spam the SUSE-IT team with duplicate tickets for multiple reporters because this is the way they prefer to work.
AFAIK Nick is already able to track that ticket (and we should at least get updates by mail, as Nick added the openqa-admins ML).
Updated by nicksinger almost 2 years ago
Yes, I have access and also added the OSD-admins group to the ticket. From Slack I know that somebody from IT is actually working on this.
Updated by jbaier_cz almost 2 years ago
There has been a change:
worker5:~> ping -c 1 updates.suse.com
PING cs183.wpc.gammacdn.net (68.232.34.211) 56(84) bytes of data.
64 bytes from 68.232.34.211 (68.232.34.211): icmp_seq=1 ttl=57 time=3.90 ms
--- cs183.wpc.gammacdn.net ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.901/3.901/3.901/0.000 ms
Also, the restarted job is now past the initial failure point. I believe that, as far as openQA is concerned, this issue is resolved.
Updated by okurz almost 2 years ago
I followed in Slack that apparently ping works again, so likely the problem on the side of the upstream CDN was fixed. I suggest retriggering the corresponding failed tests.
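For reference, failed jobs can be retriggered via the web UI or with openqa-cli, e.g. for the job linked in the description (this assumes the API key/secret for openqa.suse.de are configured in the client config):
# restart a single failed job by ID (the ID is the example from the description)
openqa-cli api --host https://openqa.suse.de -X POST jobs/10144681/restart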
Also, as this issue was so severe, let's "help" SUSE-IT by ensuring that we only close this ticket after there is an improvement in the monitoring and alerting chain; I see that task mostly on the side of SUSE-IT. Maybe we need to add updates.suse.com to our ping monitoring. But I also see something severely missing on the side of SUSE-IT, e.g. why did Rudi see an issue from manual monitoring already three days ago and nobody reacted until Monday? I would only call the ticket resolved once SUSE-IT has at least published an after-incident report with suggestions. After all, that's a topic that you, Nick, and I discussed just last week with mflores as well, right?
Updated by jbaier_cz almost 2 years ago
okurz wrote:
I followed in Slack that apparently ping works again, so likely the problem on the side of the upstream CDN was fixed. I suggest retriggering the corresponding failed tests.
Also, as this issue was so severe, let's "help" SUSE-IT by ensuring that we only close this ticket after there is an improvement in the monitoring and alerting chain; I see that task mostly on the side of SUSE-IT. Maybe we need to add updates.suse.com to our ping monitoring. But I also see something severely missing on the side of SUSE-IT, e.g. why did Rudi see an issue from manual monitoring already three days ago and nobody reacted until Monday? I would only call the ticket resolved once SUSE-IT has at least published an after-incident report with suggestions. After all, that's a topic that you, Nick, and I discussed just last week with mflores as well, right?
Totally agree (that is also the reason I did not touch the ticket status). If I am not mistaken, the monitoring from our side should be as easy as https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/476
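Regardless of how the linked MR implements it, the gist of such a check is just a periodic ping against a small host list that raises an alert on failure; a rough sketch (in the real setup the result would presumably feed the existing monitoring/alerting rather than a plain echo):
# periodic reachability check over a list of critical external hosts
for host in updates.suse.com; do
    ping -c 3 -W 2 "$host" > /dev/null || echo "ALERT: $host unreachable from $(hostname)"
done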
Updated by livdywan almost 2 years ago
- Status changed from Blocked to Feedback
jbaier_cz wrote:
okurz wrote:
nicksinger wrote:
Rudi observed the same problem and raised an SD ticket (https://sd.suse.com/servicedesk/customer/portal/1/SD-106762) - we cannot access it. I will keep this ticket blocked until it works again, but there is nothing more we can do atm.
The SD ticket was closed. I don't see any mention of lessons learned there. It might be nice to nudge them to conduct one, since this is crucial infrastructure for us.
Updated by nicksinger almost 2 years ago
- Assignee deleted (nicksinger)
- Priority changed from Immediate to Normal
I'm not planning to do so
Updated by livdywan almost 2 years ago
- Status changed from Feedback to Resolved
- Assignee set to livdywan
- Priority changed from Normal to Immediate
I simply dropped a message in #help-it-ama. Let's consider this resolved on our end.
Updated by okurz almost 2 years ago
- Status changed from Resolved to Feedback
- Priority changed from Immediate to Normal
Yes, thanks for bringing that up for discussion. There were answers, but it seems that SUSE-IT didn't/doesn't handle the problem properly, and there was nothing regarding monitoring yet. Feel free to pass this on to me to follow up at any time.
Updated by livdywan almost 2 years ago
- Status changed from Feedback to Blocked
okurz wrote:
Yes, thanks for bringing that up for discussion. There were answers, but it seems that SUSE-IT didn't/doesn't handle the problem properly, and there was nothing regarding monitoring yet. Feel free to pass this on to me to follow up at any time.
That is the point of the thread I linked. Let's consider it blocked then.
Updated by okurz almost 2 years ago
- Due date set to 2022-12-23
- Status changed from Blocked to Feedback
As there was no response, I asked now:
As there was no further response to the above, let me ask:
- Why was there no reaction to the original user report on Friday, and only a reaction after more people reported problems on Monday?
- Why was there no automatic monitoring that already triggered an action for the unreachable updates.suse.com?
- For a problem that severe, affecting multiple functions of the company, why is SUSE-IT not interested in running an RCA?
- What can others do to help improve things in the future?
- In general, what level of quality can we expect from such services?
Updated by okurz almost 2 years ago
- Due date changed from 2022-12-23 to 2023-01-27
There have been good responses:
(Bernhard Wiedemann) 1. The SD ticket was filed on Saturday 2022-12-10 09:04 by Rudi - it was only assigned to the responsible team on Monday 2022-12-12 01:01, indicating that maybe tickets are not monitored 24/7
(Bernhard Wiedemann) 5. In contrast to some other teams/services, the Infra team has no 24/7 SLAs or on-call duty atm. So if problems start on weekends or holidays, it can take a bit until someone debugs it.
(Bernhard Wiedemann) For 3 I think service ownership is to blame. E.g. updates.suse.com is not our (Infra team's) service. Also, it involved an external vendor for which we don't have contact details. Maybe @Martin Piala could push for an RCA.
(Robin Christensen) to drive the RCA for this P1 incident. Oliver, all, I hope the outcome of the RCA will be action items which will improve the processes around updates.suse.com (a few of them already mentioned in the comments)
(Robin Christensen)
- Why was there no reaction to the original user report on Friday, and only a reaction after more people reported problems on Monday? A: The referenced ticket was created on Saturday and the team does not work weekends. If the service were listed among the Tier 1 critical services, we would normally have automatic monitoring in place that would trigger the 24x7 on-call support, but it is not part of that list, so no such automation is in place now.
- Why was there no automatic monitoring that already triggered an action for the unreachable updates.suse.com? A: Per A1 above.
- For a problem that severe, affecting multiple functions of the company, why is SUSE-IT not interested in running an RCA? A: An RCA ticket has been opened; we will RCA this.
- What can others do to help improve things in the future? A: The criticality tier needs to be set, but it needs to be understood that there is a cost to maintaining a 24x7 mission-critical service. The main stakeholders and resolvers for this service need to agree whether or not it should be and can be included in automatic reporting and uninterrupted major incident management. There will be an action to ensure this is reviewed in the RCA outcome, and if it is agreed to add it, we will increase the service criticality to include it.
- In general, what level of quality can we expect from such services? A: Per A4 above.
(Oliver Kurz) Thanks a lot for the great expectations. Regarding "24/7" I would not even expect that; I just wanted to clarify and check mutual expectations. All of the above makes sense to me. Maybe monitoring the accessibility of critical services, even if not owned by the team itself, would make sense. We also do something like that for the LSG QA infrastructure, where we have now also added updates.suse.com to the list of hosts to be monitored.
With that, awaiting the results of the RCA and a reaction regarding the proposal to monitor the service from within the internal network.
Updated by livdywan almost 2 years ago
- Subject changed from Cannot access installation media on updates.suse.com - maintenance tests broken to Cannot access installation media on updates.suse.com - maintenance tests broken size:S
- Description updated (diff)
Updated by okurz almost 2 years ago
- Tags changed from infra to infra, reactive work
Updated by livdywan almost 2 years ago
okurz wrote:
With that, awaiting the results of the RCA and a reaction regarding the proposal to monitor the service from within the internal network.
There's a draft RCA now
Updated by livdywan almost 2 years ago
- Due date changed from 2023-01-27 to 2023-02-10
Bumping due date due to hackweek.
Updated by okurz almost 2 years ago
- Related to action #123232: [Alerting] failed pipelines for openQABot/ bot-ng/ os-autoinst-needles-opensuse-mirror on gitlab staging instance size:S added
Updated by okurz almost 2 years ago
- Due date changed from 2023-02-10 to 2023-02-24
Escalation is ongoing in the mail thread and there will be a conversation tomorrow.
Updated by livdywan almost 2 years ago
Relevant SD ticket about monitoring: SD-112717
Updated by livdywan almost 2 years ago
- Status changed from Feedback to Resolved
Aftermath concluded, monitoring ticket pending, very happy :-D