action #154639 » slack_discussion.txt

asmorodskyi, 2024-01-31 20:18

 
Matthias Griessmeier
1 day ago
@Anton Smorodskyi
if it is this crucial, would it make sense to establish a fallback somewhere/somehow?
Anton Smorodskyi
1 day ago
we have an Ansible playbook that lets us set up everything on any random host within half an hour
Anton Smorodskyi
1 day ago
besides the setup process itself, which is fully automated, we have two major time-consuming activities:
(1) switching the DNS name from one IP to another
(2) actually finding this "random host"
Anton Smorodskyi
1 day ago
I think (1) is fully under the control of SUSE IT and we cannot do more than create a Jira SD ticket and escalate
Anton Smorodskyi
1 day ago
and you can not do this beforehand :)
Anton Smorodskyi
1 day ago
for (2) we can book some machine somewhere ... but I am afraid there is a high probability that in an emergency this host will be unavailable for some random reason :man-shrugging:
Anton Smorodskyi
1 day ago
so I don't mind doing something about (2) but it won't make ME more confident than now :) but if it will make someone else more confident I don't mind :)
Anton Smorodskyi
1 day ago
"we have an Ansible playbook that lets us set up everything on any random host within half an hour"
CORRECTION: fully automated except for one piece of the puzzle, the cloud provider creds ... For this we have a dedicated project, but currently it can propagate only AWS creds ...
Anton Smorodskyi
1 day ago
let's have a call about this? next week would be better ...
Oliver Kurz
24 hours ago
I'd say with a single VM you can only achieve limited availability. If you need higher availability then you need to use high-availability services. But maybe you can improve a little bit by putting the same A records into more zone files or something?

Anton Smorodskyi
24 hours ago
long story short - without huge changes to the current algorithm it is simply not possible to run 2 PCW instances with the same configuration in parallel; it would create a mess ...
Anton Smorodskyi
24 hours ago
also I don't remember any PCW failures where having a cluster of several PCW instances would have changed anything
Anton Smorodskyi
24 hours ago
for example, in the case of today's problem, having a cluster would not improve the situation :man-shrugging:
Anton Smorodskyi
24 hours ago
but +1 to your idea to propagate this DNS record to several different places !
Matthias Griessmeier
14 hours ago
@Anton Smorodskyi

wrt:
"it is pretty severe as it will affect ALL public cloud tests and we near time of triggering QAM tests .."
This reads pretty severe, but obviously it can happen from time to time. What do you think about some automation to detect stuff like this before triggering all tests, and to stop the triggering with a warning?
@Oliver Kurz
I see this related to the topic we had recently discussed in FC with Santiago, wdyt?
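A rough sketch of the kind of pre-trigger detection suggested here, assuming the failure shows up as the service hostname not resolving or not accepting connections. The function names, the port and the `connect` parameter are hypothetical and only there for illustration; nothing in the thread says how such a check would actually be wired into the triggering.

```python
import socket
from typing import Callable, Optional


def preflight(host: str, port: int = 443, timeout: float = 5.0,
              connect: Optional[Callable] = None) -> Optional[str]:
    """Return None if host:port is reachable, otherwise a warning string.

    `connect` is injectable so the check can be exercised without a network;
    by default it is socket.create_connection, which also performs the DNS lookup.
    """
    connect = connect or socket.create_connection
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError as exc:  # covers DNS failures, refusals and timeouts
        return f"preflight failed for {host}:{port}: {exc}"
    conn.close()
    return None


def trigger_if_healthy(host: str, trigger: Callable[[], None], **kwargs) -> bool:
    """Call trigger() only when the preflight check passes; warn otherwise."""
    warning = preflight(host, **kwargs)
    if warning:
        print("WARNING:", warning, "- not triggering tests")
        return False
    trigger()
    return True
```

The warning is printed instead of being swallowed, so a skipped trigger stays visible to whoever watches the pipeline.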
Matthias Griessmeier
14 hours ago
(and I bet there is already some ticket about it)
Oliver Kurz
14 hours ago
I don't see a relation yet. Enlighten me. Given that I am not aware of a good process preventing releases on test coverage decrease or when tests are not triggered, I am not convinced it's a good idea to not trigger certain tests at all. However, what could be done is to make tests more dependent on each other, e.g. have one small and quick cloud smoke test and trigger more downstream if the first is successful
Matthias Griessmeier
14 hours ago
yes, that's where I see the relation.
e.g. trigger basic test to check connectivity, and if that fails, don't execute hundreds more which will 99% fail as well
Matthias Griessmeier
14 hours ago
I agree that we should not lose test coverage by not triggering certain tests, but I also see no point in wasting computing resources and engineering resources on reviewing failed tests which could be foreseen to fail, especially in PC where each run costs more money than a "generic" openQA test
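The smoke-test gating discussed above could look roughly like this in plain Python. This is only an illustration of the idea, not how openQA schedules jobs; `schedule_gated` and its parameters are made-up names. Skipped jobs are returned to the caller so that they can be reported, which is one way to keep the coverage gap visible.

```python
from typing import Callable, Iterable, List


def schedule_gated(smoke: Callable[[], bool],
                   downstream: Iterable[str],
                   schedule: Callable[[str], None]) -> List[str]:
    """Run one quick smoke test and schedule downstream jobs only if it passes.

    Returns the list of jobs that were NOT scheduled (empty on success),
    so the caller can report them instead of losing them silently.
    """
    jobs = list(downstream)
    if smoke():
        for job in jobs:
            schedule(job)
        return []
    return jobs


# Usage: the smoke test fails, so nothing is scheduled and both jobs are reported.
scheduled: List[str] = []
skipped = schedule_gated(lambda: False, ["job-a", "job-b"], scheduled.append)
print("scheduled:", scheduled, "skipped:", skipped)
```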
Oliver Kurz
12 hours ago
true true. Though I see that all of that can be resolved simply within the openQA test schedules with the current features, so, go ahead :slightly_smiling_face: There might be some feature requests coming up regarding reporting further down the road if such hierarchical schedules are used more.
Liv Dywan
12 hours ago
Reminds me of our conversations about making mm scenarios fail earlier for similar reasons. Less time spent investigating symptoms of infra issues. Having a test that fails early and prevents many more being scheduled. That's definitely something we can already do.
Oliver Kurz
12 hours ago
Yes, true as well but again mostly within the domain of test maintainers
Matthias Griessmeier
12 hours ago
yes I agree, not a topic for qe-tools (for now)
Anton Smorodskyi
10 hours ago
"true true. Though I see that all of that can be resolved simply within the openQA test schedules with the current features, so, go ahead :slightly_smiling_face: There might be some feature requests coming up regarding reporting further down the road if such hierarchical schedules are used more."
yes, true, it can be done on the test side, BUT doing it that way will overcomplicate the whole setup, because currently we cannot set test dependencies across different flavors. That means that just for PC we would need to set up dozens of such smoke tests across the different combinations of flavor/version/arch. On the other hand, implementing such a feature in the backend would make it transparent to all tests
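To make the "dozens of smoke tests" point concrete, here is a small counting sketch. The flavor, version and arch values are placeholders chosen for illustration, not the real PC test matrix.

```python
from itertools import product

# Hypothetical matrix; the real lists are not given in this thread.
flavors = ["EC2", "Azure", "GCE"]
versions = ["15-SP4", "15-SP5"]
arches = ["x86_64", "aarch64"]


def smoke_tests_needed(flavors, versions, arches):
    """One smoke test per flavor/version/arch combination,
    as needed when dependencies cannot cross flavors."""
    return [f"smoke-{f}-{v}-{a}" for f, v, a in product(flavors, versions, arches)]


per_combo = smoke_tests_needed(flavors, versions, arches)
print(len(per_combo))  # 3 * 2 * 2 = 12 smoke tests on the test side
```

Even this small placeholder matrix needs 12 separate smoke tests, whereas a single check in the backend would cover every combination.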
Oliver Kurz
10 hours ago
ok, good point. But if it's not possible to set test dependencies among flavors, what do you need flavors for? Maybe we can find an alternative for that?
Anton Smorodskyi
9 hours ago
this is a really good question, Oli, but it leads to a big discussion which I will be happy to have, just not within this Slack thread ;) a huddle / Google Meet or an in-person meeting would suit better
Liv Dywan
9 hours ago
Maybe you can keep it simple by starting with one flavor and take it from there. If there are clear limitations we can discuss extending the backend
Oliver Kurz
9 hours ago
@Anton Smorodskyi
Sure, I am happy to meet with you and discuss that. Can you please still create a ticket with just a rough explanation of the problem, so that we have a place where we can take notes and such? Feel welcome to copy-paste content from this thread into the ticket for context