action #106179

coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #109641: [epic] qem-bot improvements

No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S

Added by mgrifalconi 5 months ago. Updated 2 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2022-02-08
Due date:
% Done: 0%
Estimated time:

Related issues

Related to QA - coordination #106546: [epic][tools] dashboard.qem.suse.de adoption (New, 2022-02-10)

Related to QA - action #107227: bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M (Resolved, 2022-02-22)

Related to QA - action #107671: No aggregate maintenance runs scheduled today on osd size:M (Resolved)

History

#1 Updated by mgrifalconi 5 months ago

  • Subject changed from No aggregate runs scheduled today - dashboard.qem.suse.de down to No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down

#2 Updated by osukup 5 months ago

http https://dashboard.qam.suse.de/                   

http: error: SSLError: HTTPSConnectionPool(host='dashboard.qam.suse.de', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)'))) while doing a GET request to URL: https://dashboard.qam.suse.de/

#3 Updated by okurz 5 months ago

  • Category set to Concrete Bugs
  • Target version set to Ready

#4 Updated by okurz 5 months ago

  • Project changed from openQA Project to QA
  • Category deleted (Concrete Bugs)

#5 Updated by osukup 5 months ago

dehydrated works as expected and nginx works without problems, but from error.log:

2022/02/07 23:08:07 [error] 25308#25308: *18791264 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
2022/02/07 23:08:37 [error] 25308#25308: *18820037 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
2022/02/07 23:09:07 [error] 25308#25308: *18820039 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
2022/02/07 23:09:37 [error] 25308#25308: *18791264 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
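The repeated upstream timeouts above mean nginx reached the dashboard backend on 127.0.0.1:4000 but never received a response header, i.e. the application itself was hanging (as it turned out, on the database). A minimal sketch of the reverse-proxy block implied by those log lines; the certificate paths and the timeout value are assumptions, not the production configuration:

```nginx
server {
    listen 443 ssl;
    server_name dashboard.qam.suse.de;

    # Certificates are renewed by dehydrated; these paths are assumed.
    ssl_certificate     /etc/ssl/dashboard/fullchain.pem;
    ssl_certificate_key /etc/ssl/dashboard/privkey.pem;

    location / {
        # Matches the "upstream: http://127.0.0.1:4000/..." in the error log.
        proxy_pass http://127.0.0.1:4000;
        # The ~30 s spacing of the log entries suggests a 30 s read timeout
        # (nginx's default would be 60 s); this value is an assumption.
        proxy_read_timeout 30s;
    }
}
```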

#6 Updated by osukup 5 months ago

Broken connection between the dashboard service and the PostgreSQL database.
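A quick way to confirm this kind of failure from the dashboard host is a plain TCP reachability check against the database endpoint, before digging into the application. A minimal sketch in Python; `can_connect` is a hypothetical helper, not part of the dashboard code, and the host/port in the usage comment are whatever dashboard.yml currently points at:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", "Connection refused" and timeouts alike.
        return False

# Example: check the PostgreSQL container the dashboard is configured to use,
# e.g. can_connect("192.168.0.17", 5432)
```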

#7 Updated by mkittler 5 months ago

When just opening https://dashboard.qam.suse.de I get a 500 error for AJAX queries.

#8 Updated by osukup 5 months ago

  • Status changed from New to Feedback
  • Assignee set to osukup

Fixed. It looks like the IP of the PostgreSQL container changed, so I fixed dashboard.yml with the new correct IP.

#9 Updated by mgrifalconi 5 months ago

Great, thank you!
Will today's tests be scheduled now, or are they lost and will only run tomorrow?

#10 Updated by kraih 5 months ago

osukup wrote:

Fixed. It looks like the IP of the PostgreSQL container changed, so I fixed dashboard.yml with the new correct IP.

This is the second time we are having trouble with the postgres container. That really shouldn't happen. Can we maybe move postgres and deploy it without a container?

#11 Updated by okurz 5 months ago

  • Priority changed from Immediate to Urgent

osukup wrote:

Fixed. It looks like the IP of the PostgreSQL container changed, so I fixed dashboard.yml with the new correct IP.

Please provide more details here, e.g. reference the git commit containing the dashboard.yml change. And why do we need to manually maintain IP addresses? Lowering the priority now that the "Immediate" urgency has been addressed.

#12 Updated by okurz 5 months ago

  • Due date set to 2022-02-22

#13 Updated by kraih 5 months ago

We just had more problems with the postgres container becoming unreachable.

Feb 08 12:15:59 qam2 dashboard[6292]: [6292] [e] [Qys9dKEf7610] DBI connect('dbname=dashboard_db;host=192.168.0.48;port=5432','dashboard_user',...) failed: connection to server at "192.168.0.48", port 5432 failed: No route to host
Feb 08 12:15:59 qam2 dashboard[6292]:         Is the server running on that host and accepting TCP/IP connections? at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Pg.pm line 73.

#14 Updated by jbaier_cz 5 months ago

kraih wrote:

This is the second time we are having trouble with the postgres container. That really shouldn't happen. Can we maybe move postgres and deploy it without a container?

Nope, this is the second time we are having trouble with the networking (the first problem was caused by an incorrect setting in the main DHCP and wicked being weird). Running postgres outside a container does not make it magically better. The solution could be to use postgres from dbproxy.suse.de (an infra-managed, clustered instance).

okurz wrote:

osukup wrote:

Fixed. It looks like the IP of the PostgreSQL container changed, so I fixed dashboard.yml with the new correct IP.

Please provide more details here, e.g. reference the git commit containing the dashboard.yml change. And why do we need to manually maintain IP addresses? Lowering the priority now that the "Immediate" urgency has been addressed.

The configuration is not in git (sensitive material inside), so the change needs to be done directly on the machine where the dashboard is running.

It is a little bit of a mystery why the IP changed, as there is a static lease. The configuration for dnsmasq was enhanced to include the MAC address, and after restarting the networking in the container, the correct address was offered and accepted.
As a follow-up, we could reconfigure the container to use a static network configuration instead of the current DHCP setup (although there is nothing wrong with the current setup).
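For reference, pinning a lease to the container's MAC address in dnsmasq looks roughly like this; the MAC, IP, and name below are made up for illustration, not the real values:

```
# /etc/dnsmasq.conf (or a snippet under /etc/dnsmasq.d/)
# Always hand this MAC the same address, independent of the client-id:
dhcp-host=52:54:00:aa:bb:cc,192.168.0.17,dashboard-db
```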

kraih wrote:

We just had more problems with the postgres container becoming unreachable.

Feb 08 12:15:59 qam2 dashboard[6292]: [6292] [e] [Qys9dKEf7610] DBI connect('dbname=dashboard_db;host=192.168.0.48;port=5432','dashboard_user',...) failed: connection to server at "192.168.0.48", port 5432 failed: No route to host
Feb 08 12:15:59 qam2 dashboard[6292]:         Is the server running on that host and accepting TCP/IP connections? at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Pg.pm line 73.

And that was me, fixing the lease and changing the configuration back, as the .48 address was totally random and dynamic.

#15 Updated by kraih 5 months ago

jbaier_cz wrote:

Running postgres outside a container does not make it magically better. The solution could be to use postgres from dbproxy.suse.de (an infra-managed, clustered instance).

Actually it would, outside the container it could just use a UNIX domain socket to connect to postgres. No dependence on internal networking. We have postgres deployed that way on countless machines, and they are usually rock solid.
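The difference would only be in the connection string the dashboard uses. A sketch in libpq-style URLs (database and user names are taken from the log excerpt above; the password and socket directory are assumptions):

```
# TCP to the container -- depends on the container's IP staying stable:
postgresql://dashboard_user:secret@192.168.0.48:5432/dashboard_db

# Local UNIX domain socket -- no IP addresses involved at all:
postgresql:///dashboard_db?host=/var/run/postgresql
```

In the socket form, `host` points at the directory containing the PostgreSQL socket file, so container networking never enters the picture.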

#16 Updated by jbaier_cz 5 months ago

kraih wrote:

Actually it would, outside the container it could just use a UNIX domain socket to connect to postgres. No dependence on internal networking. We have postgres deployed that way on countless machines, and they are usually rock solid.

That's true, but only if you also run all the services which use the database on the same machine (which was not the case on qam2).

#17 Updated by mgrifalconi 5 months ago

The configuration is not in git (sensitive material inside), so the change needs to be done directly on the machine where dashboard is running.

Maybe a private git/gitlab repo? It could be useful to be able to track all changes in the config and easily restore them.
Passwords could still be kept out as env variables, using some templating language, etc.
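One common shape for that idea is to keep the non-secret configuration in git and read only the secrets from the environment at startup. A minimal sketch in Python; the variable name `DASHBOARD_PG_URL` and the default value are made up for illustration, not part of the actual dashboard:

```python
import os

def database_url():
    """Return the database URL, preferring the (secret) environment value.

    The fallback below contains no credentials and is safe to keep in git.
    """
    return os.environ.get("DASHBOARD_PG_URL",
                          "postgresql://localhost/dashboard_db")
```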

#18 Updated by osukup 5 months ago

For today's issue, probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own PostgreSQL container to an infra-managed database. New poo?

#19 Updated by jbaier_cz 5 months ago

mgrifalconi wrote:

Maybe a private git/gitlab repo? It could be useful to be able to track all changes in the config and easily restore them.
Passwords could still be kept out as env variables, using some templating language, etc.

Just a side note: if you keep private stuff in env variables you will lose the option to track the changes and restore them. And in this case, the configuration is nothing but a few secret tokens. A template language is overkill.

osukup wrote:

For today's issue, probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own PostgreSQL container to an infra-managed database. New poo?

Totally agree, the fewer things to manage, the better. The server was never meant to run any database anyway.

#20 Updated by kraih 5 months ago

osukup wrote:

For today's issue, probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own PostgreSQL container to an infra-managed database. New poo?

Totally agree, the fewer things to manage, the better. The server was never meant to run any database anyway.

Then let's do that.

#21 Updated by okurz 5 months ago

osukup wrote:

For today's issue, probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own PostgreSQL container to an infra-managed database. New poo?

No, it's fine. We should keep this ticket. Also see https://progress.opensuse.org/projects/qa/wiki/tools#How-we-work-on-our-backlog about improvements on regressions.

mgrifalconi wrote:

The configuration is not in git (sensitive material inside), so the change needs to be done directly on the machine where the dashboard is running.

Maybe a private git/gitlab repo? It could be useful to be able to track all changes in the config and easily restore them.
Passwords could still be kept out as env variables, using some templating language, etc.

I also don't understand how the configuration can not be stored in git.

#22 Updated by jbaier_cz 5 months ago

okurz wrote:

No, it's fine. We should keep this ticket. Also see https://progress.opensuse.org/projects/qa/wiki/tools#How-we-work-on-our-backlog about improvements on regressions.

Just a nitpick: this is technically not a regression, as we didn't change anything in the first place; it can be qualified as a bigger issue, though. If we are proceeding with improvements, I would also like to finally tweak the deployment process, as that came about as a quick improvisation.

I also don't understand how the configuration can not be stored in git.

I think we should be more specific. There are several things which are mixed together and I have a feeling we are not understanding each other.

  • I do not think it is a good idea to have deployment-specific configuration (database address, username/password) in the same repository as the code (because then you cannot deploy multiple copies, ...)
  • I agree that the configuration should be in some sort of git; that is the reason the deployment is done by ansible, and part of the configuration is actually already there (at least the service files, user/group settings and such), so it is in git (just without the passwords for now)
  • Correct me if I am wrong, but the dashboard is still more of a proof-of-concept than production-grade software, so we really did not have any time to tweak the accompanying processes (hence it is just deployed somewhere with some database)
My proposal:
  1. Migrate the database to a proper database cluster, EngInfra-maintained if possible -- dbproxy.suse.de (as an alternative, share with OSD?)
  2. Find a better place to deploy the application itself, if there should be any guarantees about availability (it had some meaning historically as it shared the machine with bot-ng, but that is no longer true)
  3. Document where one needs to create a PR to change the deployment configuration, and make sure people do not just "change that in production"

#23 Updated by okurz 5 months ago

jbaier_cz wrote:

okurz wrote:

No, it's fine. We should keep this ticket. Also see https://progress.opensuse.org/projects/qa/wiki/tools#How-we-work-on-our-backlog about improvements on regressions.

Just a nitpick: this is technically not a regression, as we didn't change anything in the first place; it can be qualified as a bigger issue, though. If we are proceeding with improvements, I would also like to finally tweak the deployment process, as that came about as a quick improvisation.

Yes, I agree. We are not talking about fixing the original observation but improving with "feature work".

I also don't understand how the configuration can not be stored in git.

I think we should be more specific. There are several things which are mixed together and I have a feeling we are not understanding each other.

  • I do not think it is a good idea to have deployment-specific configuration (database address, username/password) in the same repository as the code (because then you cannot deploy multiple copies, ...)

+1

  • I agree that the configuration should be in some sort of git; that is the reason the deployment is done by ansible, and part of the configuration is actually already there (at least the service files, user/group settings and such), so it is in git (just without the passwords for now)

+1

  • Correct me if I am wrong, but the dashboard is still more of a proof-of-concept than production-grade software, so we really did not have any time to tweak the accompanying processes (hence it is just deployed somewhere with some database)

+1 as well. I am most concerned with how others became reliant on qem-dashboard+qem-bot. A decision was made by some in 2021-08 to delete code from qa-maintenance/openQABot, so that now only qem-bot can handle incident and aggregate tests, effectively making qem-dashboard+qem-bot a critical component even though we may not like that.

My proposal:
  1. Migrate the database to a proper database cluster, EngInfra-maintained if possible -- dbproxy.suse.de (as an alternative, share with OSD?)

I understood that the database content can be considered transient and is effectively recreated automatically, so I would not include it in a database cluster that treats the included data with a more expensive high-redundancy and backup process. And I don't think the data should be shared with OSD unless we merge what qem-dashboard provides into openQA itself.

  2. Find a better place to deploy the application itself, if there should be any guarantees about availability (it had some meaning historically as it shared the machine with bot-ng, but that is no longer true)

Can you share with our readers a little bit more about what the current setup looks like, please? :)

  3. Document where one needs to create a PR to change the deployment configuration, and make sure people do not just "change that in production"

+1

#24 Updated by cdywan 5 months ago

  • Subject changed from No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down to No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S
  • Description updated (diff)

#25 Updated by jbaier_cz 5 months ago

#26 Updated by jbaier_cz 5 months ago

  • Status changed from Feedback to Resolved

okurz wrote:

  2. Find a better place to deploy the application itself, if there should be any guarantees about availability (it had some meaning historically as it shared the machine with bot-ng, but that is no longer true)

Can you share with our readers a little bit more about what the current setup looks like, please? :)

I created a new epic to coordinate this and will comment there.

This particular issue was solved.

#27 Updated by okurz 4 months ago

  • Related to action #107227: bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M added

#28 Updated by okurz 4 months ago

  • Related to action #107671: No aggregate maintenance runs scheduled today on osd size:M added

#29 Updated by okurz 3 months ago

  • Parent task set to #91646

#30 Updated by okurz 3 months ago

  • Parent task changed from #91646 to #109641

#31 Updated by okurz 2 months ago

  • Due date deleted (2022-02-22)
