action #60923 (closed)
[alert] /srv about to run full, postgres logs very big due to repeated error "duplicate key value violates unique constraint "screenshots_filename", Key (filename)=(8ca/3c9/98a00d8bb2ccba5a2de1d403b5.png) already exists. INSERT INTO screenshots …"
Description
Observation
On Wednesday, 11 December 2019 20.55.01 CET Grafana wrote:
[Alerting] File systems alert
One of the file systems is too full.
Metric name: /srv: Used Percentage
Value: 90.151
It seems /srv will be full soon. The usage appears to have been growing since the start of October:
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=74&orgId=1&refresh=30s&from=now-6M&to=now
On osd I can see that /srv/PSQL10/data/log has 39G of log files.
There is an error message repeated multiple times per second, first occurrence in postgresql-2019-10-23_000000.log:
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]ERROR: duplicate key value violates unique constraint "screenshots_filename"
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]DETAIL: Key (filename)=(68c/0ee/3ceb7071b7b9364aa721b3421e.png) already exists.
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]STATEMENT: INSERT INTO screenshots ( filename, t_created) VALUES ( $1, $2 ) RETURNING id
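The application-side fix is not part of this ticket (it is tracked separately in #60926, see below), but for reference: since version 9.5 PostgreSQL can skip such an insert when the row already exists via ON CONFLICT. A minimal sketch, assuming the screenshots table and columns from the statement above; this is not the actual openQA change:
-- hypothetical sketch, not the actual openQA fix: skip the insert when the
-- filename already exists instead of raising a unique-constraint error;
-- note that RETURNING id then yields no row for an existing filename, so
-- the caller would have to look the id up separately
INSERT INTO screenshots (filename, t_created)
VALUES ($1, $2)
ON CONFLICT (filename) DO NOTHING
RETURNING id;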
Problem
We had quite a few postgres errors before which apparently we did not care about, but now we have a severe case causing the log files to grow so much that multiple files per day are written. Either postgres never cleans up old rotated files or it has not reached its limit yet. The same happens on o3, but there we have a bit more space on the partition holding the database.
Updated by okurz about 5 years ago
- Copied to action #60926: postgres error "duplicate key value violates unique constraint "screenshots_filename", Key (filename)=(8ca/3c9/98a00d8bb2ccba5a2de1d403b5.png) already exists. INSERT INTO screenshots …" added
Updated by okurz about 5 years ago
I have deleted postgresql-2019-11-*, which gives us a bit of time to solve the issue. This ticket shall focus on the admin impact. I created #60926 for the openQA bug itself.
Updated by okurz about 5 years ago
- Priority changed from Urgent to High
I could already resolve #60926 and deploy the change to o3, so even though there are many log files there as well I do not think we need to delete any. For osd, next week's deployment should suffice, or we deploy in between. What I would still like to do is learn a bit about how postgres manages its log rotation and whether we can monitor its log files for errors.
EDIT: 2019-12-14: What I have learned by now is that postgres has built-in log rotation and we seem to use the defaults: daily rotation plus a 10 MB file size limit. On OSD we have an additional /etc/logrotate.d/postgresql that ensures deletion of old files with
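# with rotate 0 no rotated copies are kept, so files older than minage (15 days) are simply deleted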
/var/lib/pgsql/data/pg_log/postgresql-*.log {
daily
minage 15
rotate 0
missingok
}
On o3 we do not have that.
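For reference, the rotation behaviour described above corresponds to the built-in logging collector settings in postgresql.conf. A sketch of the relevant parameters; the rotation values are the stock PostgreSQL 10 defaults (1d / 10MB) and are assumptions about the osd setup, not read from its config:
logging_collector = on                           # write the server log to files and rotate them
log_directory = 'log'                            # relative to the data directory, i.e. /srv/PSQL10/data/log on osd
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'  # matches the file names seen above
log_rotation_age = 1d                            # start a new file every day
log_rotation_size = 10MB                         # or as soon as the current file exceeds 10 MB
The collector only rotates, it never deletes old files on its own, which is why an external cleanup such as the logrotate snippet above is still needed.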
https://www.endpoint.com/blog/2010/01/01/monitoring-postgres-log-files-with mentions a program "tail_n_mail" that monitors postgres logs and sends email depending on configuration, e.g. alerts and aggregated error reports. Sounds useful.
After the original problem, this seems to be the second-most frequent error in the logs:
2019-12-15 12:42:25.201 UTC openqa geekotest [23773]STATEMENT: INSERT INTO job_modules ( always_rollback, category, fatal, important, job_id, milestone, name, script, t_created, t_updated) VALUES ( $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 ) RETURNING id
2019-12-15 12:42:25.204 UTC openqa geekotest [23773]ERROR: duplicate key value violates unique constraint "job_modules_job_id_name_category_script"
2019-12-15 12:42:25.204 UTC openqa geekotest [23773]DETAIL: Key (job_id, name, category, script)=(1114160, cleanup_before_shutdown, shutdown, tests/shutdown/cleanup_before_shutdown.pm) already exists.
Updated by coolo about 5 years ago
please fix https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/43 - postgresql logs aren't supposed to be stored for months
https://www.postgresql.org/docs/current/logfile-maintenance.html
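The content of the merge request is not quoted here, but following the linked documentation, a hypothetical logrotate rule along the lines of the OSD snippet above, just pointed at the log directory observed on osd, would cap retention like this:
# hypothetical sketch only, not the merged salt change; 14 days is an assumed value
/srv/PSQL10/data/log/postgresql-*.log {
daily
minage 14
rotate 0
missingok
}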
Updated by okurz about 5 years ago
- Status changed from In Progress to Resolved
- Target version changed from Current Sprint to Done
merged, deployed and active on osd. /srv is at 47% space usage.
Updated by okurz almost 5 years ago
- Related to action #64096: partition /srv was nearly depleted but now fixed (itself?) added