action #121816
Cannot access installation media on updates.suse.com - maintenance tests broken size:S (closed)
Description
Motivation
See http://dashboard.qam.suse.de/blocked
Or a concrete example: https://openqa.suse.de/tests/10144681#step/installation/28
Just reproduced: https://openqa.suse.de/tests/10148791#
Acceptance criteria
- AC1: updates.suse.com is reachable consistently
- AC2: Follow-up conversations were conducted
Updated by okurz almost 2 years ago
- Tags set to infra
- Target version set to Ready
Updated by jbaier_cz almost 2 years ago
Some more info:
The server is reachable from outside and from the office network, but it is indeed unreachable from openqa itself as well as from openqaworker5:
worker5:~> ping -c 1 updates.suse.com
PING cs183.wpc.gammacdn.net (68.232.34.211) 56(84) bytes of data.
--- cs183.wpc.gammacdn.net ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
The issue was already reported as SD-106762 (reported on Saturday, see https://suse.slack.com/archives/C029APBKLGK/p1670663165781539).
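A failing ICMP ping against a CDN host is not conclusive on its own (edge nodes may drop or rate-limit ICMP), so it is worth cross-checking with an actual HTTPS request from the same machine; the commands below are only an illustration of that kind of check:
# ICMP check - may be dropped by the CDN even while the service itself works
ping -c 1 updates.suse.com
# HTTPS check against the real service; -I fetches only headers and
# --max-time prevents hanging when packets are silently dropped
curl -sI --max-time 10 https://updates.suse.com/ && echo "HTTPS reachable" || echo "HTTPS unreachable"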
Updated by nicksinger almost 2 years ago
- Status changed from New to Blocked
- Assignee set to nicksinger
Rudi observed the same problem and raised an SD ticket (https://sd.suse.com/servicedesk/customer/portal/1/SD-106762) - we cannot access it. I will keep this ticket blocked until it works again, but there is nothing more we can do atm.
Updated by MDoucha almost 2 years ago
updates.suse.com is reachable from openqa.qam.suse.cz (PRG office server room) so the problem is likely in firewall rules.
Updated by MDoucha almost 2 years ago
What's more interesting: traceroute reaches updates.suse.com even when ping, curl and wget fail. Tested from geordi.qam.suse.de:
geordi:~ # traceroute updates.suse.com
traceroute to updates.suse.com (68.232.34.211), 30 hops max, 60 byte packets
1 10.161.255.251 (10.161.255.251) 1.082 ms 1.014 ms 0.992 ms
2 10.156.163.126 (10.156.163.126) 0.307 ms 0.290 ms 0.273 ms
3 10.156.163.92 (10.156.163.92) 0.455 ms 0.439 ms 0.420 ms
4 195.135.221.18 (195.135.221.18) 0.687 ms 0.854 ms 0.831 ms
5 ae1-479.nbg30.core-backbone.com (5.56.18.209) 1.390 ms 1.236 ms 1.214 ms
6 ae1-2001.fra10.core-backbone.com (81.95.15.162) 4.100 ms 3.946 ms 4.764 ms
7 ae-87.border1.frn.edgecastcdn.net (152.195.101.212) 4.451 ms 4.879 ms 4.819 ms
8 ae-66.core1.frb.edgecastcdn.net (152.195.101.131) 4.264 ms 10.995 ms *
9 68.232.34.211 (68.232.34.211) 3.660 ms * 3.574 ms
10 68.232.34.211 (68.232.34.211) 3.673 ms 3.769 ms 3.989 ms
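One way to narrow down such a discrepancy (traceroute gets through while ping, curl and wget fail) is to probe with the same protocol and port the service actually uses, e.g. a TCP traceroute to port 443 plus a verbose curl. Illustrative commands, assuming the standard Linux traceroute:
# TCP SYN probes to the HTTPS port instead of the default UDP probes (usually needs root)
traceroute -T -p 443 updates.suse.com
# a verbose transfer shows whether the TCP connect, the TLS handshake or the
# HTTP response is the step that actually stalls
curl -v --max-time 10 -o /dev/null https://updates.suse.com/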
Updated by okurz almost 2 years ago
nicksinger wrote:
Rudi observed the same problem and raised an SD ticket (https://sd.suse.com/servicedesk/customer/portal/1/SD-106762) - we cannot access it. I will keep this ticket blocked until it works again, but there is nothing more we can do atm.
Please only block on tickets that we can really track. I don't mind if you spam the SUSE-IT team with duplicate tickets for multiple reporters because this is the way they prefer to work.
Updated by jbaier_cz almost 2 years ago
okurz wrote:
nicksinger wrote:
Rudi observed the same problem and raised an SD ticket (https://sd.suse.com/servicedesk/customer/portal/1/SD-106762) - we cannot access it. I will keep this ticket blocked until it works again, but there is nothing more we can do atm.
Please only block on tickets that we can really track. I don't mind if you spam the SUSE-IT team with duplicate tickets for multiple reporters because this is the way they prefer to work.
AFAIK Nick is already able to track that ticket (and we should at least get updates by mail, as Nick added the openqa-admins ML).
Updated by nicksinger almost 2 years ago
Yes, I have access and also added the OSD-admins group to the ticket. From Slack I know that somebody from IT is actually working on this.
Updated by jbaier_cz almost 2 years ago
There has been a change:
worker5:~> ping -c 1 updates.suse.com
PING cs183.wpc.gammacdn.net (68.232.34.211) 56(84) bytes of data.
64 bytes from 68.232.34.211 (68.232.34.211): icmp_seq=1 ttl=57 time=3.90 ms
--- cs183.wpc.gammacdn.net ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.901/3.901/3.901/0.000 ms
Also, the restarted job is now past the initial failure point. I believe that, as far as openQA is concerned, this issue is resolved.
Updated by okurz almost 2 years ago
I followed in Slack that apparently ping works again, so likely the problem on the side of the upstream CDN was fixed. I suggest retriggering the corresponding failed tests.
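For reference, failed jobs can be retriggered via the web UI or with openqa-cli, e.g. for the job linked in the description (this assumes the API key/secret for openqa.suse.de are configured in the client config):
# restart a single failed job by ID (the ID is the example from the description)
openqa-cli api --host https://openqa.suse.de -X POST jobs/10144681/restart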
Also, as this issue was so severe, let's "help" SUSE-IT by ensuring that we only close this ticket after there is an improvement in the monitoring and alerting chain; I see that task mostly on the side of SUSE-IT. Maybe we need to add updates.suse.com to our ping monitoring. But I also see something severely missing on the side of SUSE-IT, e.g. why did Rudi see an issue from manual monitoring already three days ago and nobody reacted until Monday? I would only call the ticket resolved once SUSE-IT has at least published an after-incident report with suggestions. After all, that's a topic that you, Nick, and I discussed just last week with mflores as well, right?
Updated by jbaier_cz almost 2 years ago
okurz wrote:
I followed in Slack that apparently ping works again, so likely the problem on the side of the upstream CDN was fixed. I suggest retriggering the corresponding failed tests.
Also, as this issue was so severe, let's "help" SUSE-IT by ensuring that we only close this ticket after there is an improvement in the monitoring and alerting chain; I see that task mostly on the side of SUSE-IT. Maybe we need to add updates.suse.com to our ping monitoring. But I also see something severely missing on the side of SUSE-IT, e.g. why did Rudi see an issue from manual monitoring already three days ago and nobody reacted until Monday? I would only call the ticket resolved once SUSE-IT has at least published an after-incident report with suggestions. After all, that's a topic that you, Nick, and I discussed just last week with mflores as well, right?
Totally agree (that is also the reason I did not touch the ticket status). If I am not mistaken, the monitoring from our side should be as easy as https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/476
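Regardless of how the linked MR implements it, the gist of such a check is just a periodic ping against a small host list that raises an alert on failure; a rough sketch (in the real setup the result would presumably feed the existing monitoring/alerting rather than a plain echo):
# periodic reachability check over a list of critical external hosts
for host in updates.suse.com; do
    ping -c 3 -W 2 "$host" > /dev/null || echo "ALERT: $host unreachable from $(hostname)"
done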
Updated by livdywan almost 2 years ago
- Status changed from Blocked to Feedback
jbaier_cz wrote:
okurz wrote:
nicksinger wrote:
Rudi observed the same problem and raised an SD ticket (https://sd.suse.com/servicedesk/customer/portal/1/SD-106762) - we cannot access it. I will keep this ticket blocked until it works again, but there is nothing more we can do atm.
The SD ticket was closed. I don't see any mention of lessons learned there. It might be nice to nudge them to conduct one, since this is crucial infrastructure for us.
Updated by nicksinger almost 2 years ago
- Assignee deleted (nicksinger)
- Priority changed from Immediate to Normal
I'm not planning to do so
Updated by livdywan almost 2 years ago
- Status changed from Feedback to Resolved
- Assignee set to livdywan
- Priority changed from Normal to Immediate
I simply dropped a message in #help-it-ama. Let's consider this resolved on our end.
Updated by okurz almost 2 years ago
- Status changed from Resolved to Feedback
- Priority changed from Immediate to Normal
Yes, thanks for bringing that up for discussion. There were answers, but it seems that SUSE-IT didn't/doesn't handle the problem properly, and there was nothing regarding monitoring yet. Feel free to pass this on to me to follow up at any time.
Updated by livdywan almost 2 years ago
- Status changed from Feedback to Blocked
okurz wrote:
Yes, thanks for bringing that up for discussion. There were answers, but it seems that SUSE-IT didn't/doesn't handle the problem properly, and there was nothing regarding monitoring yet. Feel free to pass this on to me to follow up at any time.
That is the point of the thread I linked. Let's consider it blocked then.
Updated by okurz almost 2 years ago
- Due date set to 2022-12-23
- Status changed from Blocked to Feedback
As there was no response, I asked now:
As there was no further response to the above, let me ask:
- Why was there no reaction to the original user report on Friday, and only a reaction after more people reported problems on Monday?
- Why was there no automatic monitoring that already triggered an action for the unreachable updates.suse.com?
- For a problem that severe, affecting multiple functions of the company, why is SUSE-IT not interested in running an RCA?
- What can others do to help improve things in the future?
- In general, what level of quality can we expect from such services?
Updated by okurz almost 2 years ago
- Due date changed from 2022-12-23 to 2023-01-27
There have been good responses:
(Bernhard Wiedemann) 1. The SD ticket was filed on Saturday 2022-12-10 09:04 by Rudi - it was only assigned to the responsible team on Monday 2022-12-12 01:01, indicating that maybe tickets are not monitored 24/7
(Bernhard Wiedemann) 5. In contrast to some other teams/services, the Infra team has no 24/7 SLAs or on-call duty atm. So if problems start on weekends or holidays, it can take a bit until someone debugs it.
(Bernhard Wiedemann) For 3 I think service ownership is to blame. E.g. updates.suse.com is not our (Infra team's) service. Also, it involved an external vendor for which we don't have contact details. Maybe @Martin Piala could push for an RCA.
(Robin Christensen) to drive the RCA for this P1 incident. Oliver, all, I hope the outcome of the RCA will be action items which will improve the processes around updates.suse.com (a few of them already mentioned in the comments)
(Robin Christensen)
- Why was there no reaction to the original user report on Friday, and only a reaction after more people reported problems on Monday? A: The referenced ticket was created on Saturday and the team does not work weekends. If the service were listed among the Tier 1 critical services, we would normally have automatic monitoring in place that would trigger the 24x7 on-call support, but it is not part of that list, so no such automation is in place now.
- Why was there no automatic monitoring that already triggered an action for the unreachable updates.suse.com? A: Per A1 above.
- For a problem that severe, affecting multiple functions of the company, why is SUSE-IT not interested in running an RCA? A: An RCA ticket has been opened; we will RCA this.
- What can others do to help improve things in the future? A: The criticality tier needs to be set, but it needs to be understood that there is a cost to maintaining a 24x7 mission-critical service. The main stakeholders and resolvers for this service need to agree whether or not it should be and can be included in automatic reporting and uninterrupted major incident management. There will be an action to ensure this is reviewed in the RCA outcome, and if it is agreed to add it, we will increase the service criticality to include it.
- In general, what level of quality can we expect from such services? A: Per A4 above.
(Oliver Kurz) Thanks a lot for the great expectations. Regarding "24/7" I would not even expect that; I just wanted to clarify and check mutual expectations. All of the above makes sense to me. Maybe monitoring the accessibility of critical services, even if not owned by the team itself, would make sense. We also do something like that for the LSG QA infrastructure, where we have now also added updates.suse.com to the list of hosts to be monitored.
With that, awaiting the results of the RCA and a reaction regarding the proposal to monitor the service from within the internal network.
Updated by livdywan almost 2 years ago
- Subject changed from Cannot access installation media on updates.suse.com - maintenance tests broken to Cannot access installation media on updates.suse.com - maintenance tests broken size:S
- Description updated (diff)
Updated by okurz almost 2 years ago
- Tags changed from infra to infra, reactive work
Updated by livdywan almost 2 years ago
okurz wrote:
With that, awaiting the results of the RCA and a reaction regarding the proposal to monitor the service from within the internal network.
There's a draft RCA now
Updated by livdywan almost 2 years ago
- Due date changed from 2023-01-27 to 2023-02-10
Bumping due date due to hackweek.
Updated by okurz almost 2 years ago
- Related to action #123232: [Alerting] failed pipelines for openQABot/ bot-ng/ os-autoinst-needles-opensuse-mirror on gitlab staging instance size:S added
Updated by okurz almost 2 years ago
- Due date changed from 2023-02-10 to 2023-02-24
Escalation is ongoing in the mail thread and there will be a conversation tomorrow.
Updated by livdywan almost 2 years ago
Relevant SD ticket about monitoring: SD-112717
Updated by livdywan almost 2 years ago
- Status changed from Feedback to Resolved
Aftermath concluded, monitoring ticket pending, very happy :-D