action #92701

backup of etc/ from both o3 was not working since some days due to OOM on backup.qa.suse.de (was: … and osd not updated anymore since 2019)

Added by okurz 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2021-05-14
Due date: 2021-06-30
% Done: 0%
Estimated time:
Description

Observation

$ ls -ltra /home/backup/o*/root-complete/etc/openqa/
/home/backup/o3/root-complete/etc/openqa/:
total 28
-rw-r--r--  1 root   root  174 Jan 19  2015 database.ini
-rw-r-----  1 chrony root  125 Mar 19  2015 client.conf
-rw-r--r--  1 root   root  452 Jan 27  2017 workers.ini
-rw-r--r--  1 root   root 2445 Jun  7  2019 openqa.ini
drwxr-xr-x 98 root   root 8192 Jul  1  2019 ..
drwxr-xr-x  2 root   root   82 Jul  5  2019 .

/home/backup/osd/root-complete/etc/openqa/:
total 36
-rw-r--r--   1 root    root  174 Jan 16  2015 database.ini.rpmsave
-rw-r-----   1 openslp root  229 Nov 19  2015 client.conf
-rw-r-----   1    1001 root   82 Jul  9  2018 database.ini.rpmnew
-rw-r--r--   1    1001 root 4058 Jul 31  2019 openqa.ini.rpmnew
drwxr-xr-x   3 root    root   18 Aug 15  2019 templates
drwxr-xr-x   3 root    root  160 Aug 16  2019 .
drwxr-xr-x 113 root    root 8192 Aug 18  2019 ..
-rw-r--r--   1    1001 root 3434 Aug 18  2019 openqa.ini
-rw-r-----   1    1001 root  194 Aug 18  2019 database.ini

Acceptance criteria

  • AC1: Automatic update from o3 to backup.qa.suse.de works again
  • AC2: Same as AC1 for osd -> #94015
  • AC3: Alert in place

Suggestions

Further details


Related issues

Related to openQA Infrastructure - action #88546: Make use of the new "Storage Server", e.g. complete OSD backup (Blocked)

History

#1 Updated by mkittler 4 months ago

I've just checked whether I can connect to the backup VM via SSH. I wanted to see which services are running and noticed that every time I use systemctl status … the SSH connection is terminated, e.g.

backup-vm:~ # systemctl status dbus.service
Connection to backup.qa.suse.de closed by remote host.
Connection to backup.qa.suse.de closed.

#2 Updated by mkittler 4 months ago

  • Related to action #88546: Make use of the new "Storage Server", e.g. complete OSD backup added

#3 Updated by okurz 3 months ago

  • Subject changed from backup of etc/ from both o3 and osd not updated anymore since 2019 to backup of etc/ from both o3 was not working since some days due to OOM on backup.qa.suse.de (was: … and osd not updated anymore since 2019)
  • Status changed from Workable to In Progress
  • Assignee set to mkittler

The files in /home/backup are actually only manually created backups. The automatic backups conducted by rsnapshot go into /home/rsnapshot and work just fine, with the sole exception of backup.qa.suse.de having been stuck in OOM for some days.

#4 Updated by okurz 3 months ago

The automatic backup for o3 was initially introduced with #44078 and never covered osd, so I am putting that into a separate ticket. Consider OSD out of scope here and covered in the new ticket #94015.

#5 Updated by mkittler 3 months ago

The host was stuck because it ran out of memory. It isn't clear what caused this condition. The automatic backups are actually stored under /home/rsnapshot/, and the files under /home/backup/ that are mentioned in the ticket description were created manually (so it is no surprise that they're not updated). The actual automatic backups seem to work, e.g. triggering rsnapshot alpha manually worked and the cron configuration is in place.


The host has been up for more than 176h …

I've now enabled our usual salt setup for the host so it should now reboot weekly.


The host has almost 4 GiB of memory, which should be more than enough considering that only a small number of services run on it. Maybe it makes sense to add some basic graphs/alerts in Grafana for the backup host so we can keep an eye on it.
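For illustration only, a minimal sketch of the kind of low-memory check such an alert would encode. It is not the Grafana/salt setup used for this host; the function names and the 10% threshold are assumptions made for the example. It parses text in the format of /proc/meminfo and compares MemAvailable against MemTotal.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict mapping field name to its number."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_fraction(meminfo):
    """Fraction of total memory that the kernel reports as available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def is_low_memory(meminfo, threshold=0.10):
    """True if less than `threshold` (an assumed value) of memory is available."""
    return available_fraction(meminfo) < threshold
```

On the host itself this could be fed with the contents of /proc/meminfo, e.g. `is_low_memory(parse_meminfo(open("/proc/meminfo").read()))`.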

#6 Updated by openqa_review 3 months ago

  • Due date set to 2021-06-30

Setting due date based on mean cycle time of SUSE QE Tools

#7 Updated by mkittler 3 months ago

SR for adding the generic monitoring for the host: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/507

With https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/da6e34e0121f1f4f4042ef3e4687873311e6e228 the systemd services monitoring/alert should now cover the backup host as well.

#8 Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback

Judging by the timestamps in /home/rsnapshot/ it looks like the backup is still performed automatically (at least the alpha one).
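The timestamp check described above could be automated roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the 8 hour threshold is an arbitrary example value, not something configured on the host.

```python
import os
import time


def snapshot_age_hours(path, now=None):
    """Hours since `path` was last modified, based on its mtime."""
    now = time.time() if now is None else now
    return (now - os.stat(path).st_mtime) / 3600.0


def is_stale(path, max_age_hours=8.0, now=None):
    """True if the snapshot directory is older than `max_age_hours` (assumed threshold)."""
    return snapshot_age_hours(path, now) > max_age_hours
```

For example, `is_stale("/home/rsnapshot/alpha.0")` would report whether the most recent alpha snapshot is older than the chosen threshold.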

#9 Updated by okurz 3 months ago

  • Description updated (diff)

#10 Updated by okurz 3 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/507 merged. If you can ensure that our generic alerts like "failed systemd services" are covered for the backup host as well I think that should be enough. Having a check for free space on the backup host would be nice on top.
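As a rough idea of what a free-space check could look like (a sketch only, not the alert that was deployed; the function names and the 10% minimum are assumptions), Python's standard `shutil.disk_usage` already provides the needed numbers:

```python
import shutil


def free_fraction(path):
    """Fraction of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def has_enough_space(path, min_free=0.10):
    """True if at least `min_free` (an assumed threshold) of the filesystem is free."""
    return free_fraction(path) >= min_free
```

On the backup host this would be pointed at the backup location, e.g. `has_enough_space("/home/rsnapshot")`.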

#12 Updated by okurz 3 months ago

MR merged but deployment failed in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/468624 . I don't want to check manually whether the relevant part was deployed, as we should fix the deployment problems anyway.

#13 Updated by mkittler 3 months ago

With https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/510 the memory alert works and I enabled it again.

And yes, the pipeline still fails. The remaining errors are about user creation, e.g.:

          ID: ldevulder
    Function: user.present
      Result: False
     Comment: These values could not be changed: {'home': '/home/ldevulder'}
     Started: 11:14:22.376449
    Duration: 27.562 ms

I still don't know why that's the case.

#14 Updated by mkittler 3 months ago

The memory alert is fixed and with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/513#note_327858 also the ping-based alert. What remains are the strange user creation errors.

#15 Updated by mkittler 3 months ago

The user problem is fixed, /etc/passwd was broken again. I hope it won't break again.

I'm keeping this ticket open until https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/514 has been merged.

#16 Updated by okurz 3 months ago

mkittler wrote:

The user problem is fixed, /etc/passwd was broken again. I hope it won't break again.

What do you mean by "broken again"? When did that happen before?

#17 Updated by mkittler 3 months ago

The first time I ran salt on the machine, the last entry in the file was broken, causing many errors. Removing the line helped. At some point only these user-related errors remained (but I didn't check /etc/passwd again immediately because I didn't expect it to break again).
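A broken entry like the one described can be spotted with a simple format check. The sketch below is hypothetical and only checks the basic shape of an entry (seven colon-separated fields, a non-empty name, numeric UID and GID); the ticket does not say what the broken line actually looked like, and NIS compat entries such as `+::::::` would be flagged by this simplification.

```python
def malformed_passwd_lines(text):
    """Return (line_number, line) pairs for entries that do not look like valid passwd entries."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        fields = line.split(":")
        if len(fields) != 7 or not fields[0]:
            bad.append((number, line))
        elif not (fields[2].isdigit() and fields[3].isdigit()):
            bad.append((number, line))  # UID and GID must be numeric
    return bad
```

Running it on the contents of /etc/passwd, e.g. `malformed_passwd_lines(open("/etc/passwd").read())`, returns an empty list when every entry has the expected shape.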

#18 Updated by okurz 3 months ago

hm, ok. We create users properly with salt so I don't think we should try to do something on top. Well, good enough :) Can you verify AC1+AC3?

#19 Updated by mkittler 3 months ago

I've read the comment but currently the VPN is down so no, I cannot verify it at the moment.

#20 Updated by mkittler 3 months ago

  • AC1: It looks good:
martchus@backup-vm:~> ls -ltra /home/rsnapshot
total 8
drwxr-xr-x  4 root root   50 25. Jun 2020  _delete.14889
drwxr-xr-x  4 root root   50 26. Jun 2020  _delete.15189
drwxr-xr-x  6 root root  109 29. Jan 04:04 delta.2
drwxr-xr-x  6 root root  109 26. Feb 04:03 delta.1
drwxr-xr-x  6 root root  109 26. Mar 04:04 delta.0
drwxr-xr-x  6 root root  109 14. May 04:05 gamma.3
drwxr-xr-x  6 root root  109 20. May 08:05 _delete.651
drwxr-xr-x  6 root root  109 21. May 04:12 gamma.2
drwxr-xr-x  6 root root  109 22. May 00:05 _delete.15401
drwxr-xr-x  6 root root  109 22. May 12:04 _delete.20003
drwxr-xr-x  4 root root   65 26. May 12:04 _delete.14362
drwxr-xr-x  6 root root  109 26. May 20:04 _delete.26284
drwxr-xr-x  6 root root  109 29. May 04:04 gamma.1
drwxr-xr-x  4 root root   65 29. May 12:04 _delete.8886
drwxr-xr-x  6 root root  109 29. May 16:04 _delete.16796
drwxr-xr-x  6 root root  109 29. May 20:04 _delete.22377
drwxr-xr-x  4 root root   65 29. May 20:04 _delete.12833
drwxr-xr-x  6 root root  109 30. May 00:04 _delete.2735
drwxr-xr-x  4 root root   50 30. May 08:00 _delete.14087
drwxr-xr-x  4 root root   50 31. May 16:00 _delete.23694
drwxr-xr-x  4 root root   50  2. Jun 00:00 _delete.13239
drwxr-xr-x  4 root root   50  2. Jun 20:00 _delete.16276
drwxr-xr-x  4 root root   50  3. Jun 00:00 _delete.22317
drwxr-xr-x  4 root root   50  3. Jun 08:00 _delete.27611
drwxr-xr-x  4 root root   50  3. Jun 12:00 _delete.1851
drwxr-xr-x  4 root root   50  4. Jun 04:00 _delete.8595
drwxr-xr-x  4 root root   50  4. Jun 08:00 _delete.12420
drwxr-xr-x  4 root root   50  4. Jun 16:00 _delete.16685
drwxr-xr-x  4 root root   50  4. Jun 20:00 _delete.19139
drwxr-xr-x 56 root root 4096 15. Jun 14:56 ..
drwxr-xr-x  6 root root  109 18. Jun 04:04 gamma.0
drwxr-xr-x  6 root root  109 23. Jun 04:03 beta.6
drwxr-xr-x  6 root root  109 24. Jun 04:03 beta.5
drwxr-xr-x  6 root root  109 25. Jun 04:03 beta.4
drwxr-xr-x  6 root root  109 26. Jun 04:03 beta.3
drwxr-xr-x  6 root root  109 27. Jun 04:03 beta.2
drwxr-xr-x  6 root root  109 28. Jun 04:03 beta.1
drwxr-xr-x  6 root root  109 29. Jun 04:03 beta.0
drwxr-xr-x  6 root root  109 29. Jun 12:04 alpha.5
drwxr-xr-x  6 root root  109 29. Jun 16:03 alpha.4
drwxr-xr-x  6 root root  109 29. Jun 20:04 alpha.3
drwxr-xr-x  6 root root  109 30. Jun 00:04 alpha.2
drwxr-xr-x  6 root root  109 30. Jun 04:04 alpha.1
drwxr-xr-x  6 root root  109 30. Jun 08:03 alpha.0
drwxr-xr-x 45 root root 4096 30. Jun 08:04 .

#21 Updated by okurz 3 months ago

alright. So, resolve?

#22 Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved
