action #60923

[alert] /srv about to run full, postgres logs very big due to repeated error "duplicate key value violates unique constraint "screenshots_filename", Key (filename)=(8ca/3c9/98a00d8bb2ccba5a2de1d403b5.png) already exists. INSERT INTO screenshots …"

Added by okurz 2 months ago. Updated about 1 month ago.

Status:ResolvedStart date:11/12/2019
Priority:HighDue date:
Assignee:okurz% Done:

0%

Category:-
Target version:openQA Project - Done
Duration:

Description

Observation

On Wednesday, 11 December 2019 20.55.01 CET Grafana wrote:

[Alerting] File systems alert

One of the file systems is too full


Metric name

Value

/srv: Used Percentage

90.151

It seems /srv is about to be full soon. This seems to be an evolution since start of October:
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=74&orgId=1&refresh=30s&from=now-6M&to=now

On osd I can see that /srv/PSQL10/data/log has 39G of log files.

There is an error message repeated multiple times per second repeatedly, first occurence in postgresql-2019-10-23_000000.log:

2019-10-23 08:29:38.235 CEST openqa geekotest [30552]ERROR:  duplicate key value violates unique constraint "screenshots_filename"
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]DETAIL:  Key (filename)=(68c/0ee/3ceb7071b7b9364aa721b3421e.png) already exists.
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]STATEMENT:  INSERT INTO screenshots ( filename, t_created) VALUES ( $1, $2 ) RETURNING id

Problem

Formerly we also had quite some postgres errors which seems we did not care about but now we have a severe case causing log files to grow that much that multiple files per day are recorded. Either postgres never cleans up old rotations or did not reach the limit yet. The same happens on o3 but we have a little bit more space for the partition holding the database.


Related issues

Copied to openQA Project - action #60926: postgres error "duplicate key value violates unique const... Resolved 11/12/2019

History

#1 Updated by okurz 2 months ago

  • Copied to action #60926: postgres error "duplicate key value violates unique constraint "screenshots_filename", Key (filename)=(8ca/3c9/98a00d8bb2ccba5a2de1d403b5.png) already exists. INSERT INTO screenshots …" added

#2 Updated by okurz 2 months ago

I have deleted postgresql-2019-11-* which gives us a bit of time to solve the issue. This ticket shall focus on the admin impact. I created #60926 for the openQA bug itself.

#3 Updated by okurz 2 months ago

  • Priority changed from Urgent to High

I could resolve #60926 already and deploy the change to o3 so while there are also many logs there I do not even think we need to delete any. For osd next week's deployment should suffice or we do it in the middle. What I would still like to do is learn a bit about how postgres manages the log rotation and also if we have a possibility to monitor for errors in it's according logfiles.

EDIT: 2019-12-14: So what I learned by now is that postgres has logrotation and we seem to use the default, do daily rotations plus limit file size by 10M. On OSD we have an additional /etc/logrotate.d/postgresql that ensures deletion of old files with

/var/lib/pgsql/data/pg_log/postgresql-*.log {
    daily
    minage 15
    rotate 0
    missingok
}

on o3 we do not have that.

https://www.endpoint.com/blog/2010/01/01/monitoring-postgres-log-files-with mentions a program "tail_and_mail" monitoring postgres logs and sending email alerts depending on configuration, e.g. alerts and aggregated error reports. Sounds useful.

After the original problem what seems to be the second-most likely error in logs:

2019-12-15 12:42:25.201 UTC openqa geekotest [23773]STATEMENT:  INSERT INTO job_modules ( always_rollback, category, fatal, important, job_id, milestone, name, script, t_created, t_updated) VALUES ( $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 ) RETURNING id
2019-12-15 12:42:25.204 UTC openqa geekotest [23773]ERROR:  duplicate key value violates unique constraint "job_modules_job_id_name_category_script"
2019-12-15 12:42:25.204 UTC openqa geekotest [23773]DETAIL:  Key (job_id, name, category, script)=(1114160, cleanup_before_shutdown, shutdown, tests/shutdown/cleanup_before_shutdown.pm) already exists.

#6 Updated by okurz about 1 month ago

  • Status changed from In Progress to Resolved
  • Target version changed from Current Sprint to Done

merged, deployed and active on osd. /srv is at 47% space usage.

Also available in: Atom PDF