action #109836: All Jobs on OSD are incomplete since 2022-04-12 - openQA Project - openSUSE Project Management Tool

Actions

Copy link

action #109836

closed

All Jobs on OSD are incomplete since 2022-04-12

Added by Xiaojing_liu over 2 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

Immediate

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2022-04-12

Due date:

% Done:

Estimated time:

Description

Observation¶

We found almost all jobs on OSD are incomplete with the reason 'Reason: api failure: 400 response: OpenQA::Schema::Result::Jobs::insert_module(): DBI Exception: DBD::Pg::st execute failed: ERROR: integer out of range [for Statement " INSERT INTO job_modules ( job_id, name, category, script, milestone, important, fatal, always_rollback, t_created, t_u…' on 2022-04-12 morning.

Examples:
https://openqa.suse.de/tests/8519011#
https://openqa.suse.de/tests/8519030

BTW o3 doesn't have this problem.

Please help to check that.

Expected¶

Jobs on OSD could run as normal.

Rollback steps¶

DONE: ~~sudo systemctl start openqa-scheduler~~
DONE:* ~~OSD deployment schedule enabled again in https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit~~

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Xiaojing_liu over 2 years ago

Description updated (diff)

Actions

Copy link

Updated by okurz over 2 years ago

Category changed from Support to Regressions/Crashes
Target version set to Ready

Actions

Copy link

Updated by okurz over 2 years ago

Description updated (diff)
Status changed from New to In Progress
Assignee set to okurz

on it. https://openqa.suse.de/tests/8516601 shows that the issue already happened before today's deployment.

openqa=> select * from job_modules order by id desc limit 10;
     id     | job_id  |        name        |         script          | category |         t_created          |         t_updated          | result | milestone | important | fatal | always_rollback 
------------+---------+--------------------+-------------------------+----------+----------------------------+----------------------------+--------+-----------+-----------+-------+-----------------
 2147482702 | 8508180 | fcntl09            | tests/kernel/run_ltp.pm | kernel   | 2022-04-11 23:03:39.362748 | 2022-04-11 23:03:39.362748 | none   |         0 |         1 |     0 |               0

shows that the most recent job module insert was 2147482702 from 022-04-11 23:03:39.362748. The number is suspiciously near 2 ** 31 = 2147483648

We currently have

openqa=> select count(*) from job_modules;
  count   
----------
 27364042
(1 row)

so I guess an in-place migration of 27M modules would take some minutes to hours but not endless. For now only new jobs since about 10h are affected, so we should be careful to not destroy everything.

I stopped the scheduler for now to not add more tests for now which can not complete anyway.

EDIT: Executing ALTER TABLE job_modules ALTER COLUMN id TYPE bigint; manually. Took some minutes (only).

Also triggered

host=openqa.suse.de failed_since=2022-04-11 ./openqa-advanced-retrigger-jobs

to restart the 3.5k incompletes, see https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=14&orgId=1 .

openqa=> \d job_modules;
                                           Table "public.job_modules"
     Column      |            Type             | Collation | Nullable |                 Default
-----------------+-----------------------------+-----------+----------+-----------------------------------------
 id              | bigint                      |           | not null | nextval('job_modules_id_seq'::regclass)
 job_id          | integer                     |           | not null |
 name            | text                        |           | not null |
 script          | text                        |           | not null |
 category        | text                        |           | not null |
 t_created       | timestamp without time zone |           | not null |
 t_updated       | timestamp without time zone |           | not null |
 result          | character varying           |           | not null | 'none'::character varying
 milestone       | integer                     |           | not null | 0
 important       | integer                     |           | not null | 0
 fatal           | integer                     |           | not null | 0
 always_rollback | integer                     |           | not null | 0
Indexes:
    "job_modules_pkey" PRIMARY KEY, btree (id)
    "idx_job_modules_result" btree (result)
    "job_modules_idx_job_id" btree (job_id)
    "job_modules_job_id_name_category_script" UNIQUE CONSTRAINT, btree (job_id, name, category, script)
Foreign-key constraints:
    "job_modules_fk_job_id" FOREIGN KEY (job_id) REFERENCES jobs(id) ON UPDATE CASCADE DEFERRABLE
Referenced by:
    TABLE "needles" CONSTRAINT "needles_fk_last_matched_module_id" FOREIGN KEY (last_matched_module_id) REFERENCES job_modules(id) ON DELETE SET NULL DEFERRABLE
    TABLE "needles" CONSTRAINT "needles_fk_last_seen_module_id" FOREIGN KEY (last_seen_module_id) REFERENCES job_modules(id) ON DELETE SET NULL DEFERRABLE

looks better now.

Actions

Copy link

Updated by okurz over 2 years ago

I restarted services but then hit DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: value "2147490421" is out of range for type integer CONTEXT: unnamed portal parameter $1 = '...' [for Statement "UPDATE needles SET last_seen_module_id = ?, last_seen_time = ?, t… looks like we need to find all occurences of job_module_id and change tables.

So I did

ALTER TABLE needles ALTER COLUMN last_seen_module_id TYPE bigint; ALTER TABLE needles ALTER COLUMN last_matched_module_id TYPE bigint;

which was quick.

Then I got

400 response: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: cached plan must not change result type [for Statement "SELECT me.id, me.dir_id, me.filename, me.last_seen_time, me.last_seen_module_id, me.last_matched_time, me.last_matched_modu…
Clone of 8515034 (restarted already 2 times)

oh, those carwos jobs can be quick, already some passed, e.g. https://openqa.suse.de/tests/8519408

Actions

Copy link

Updated by okurz over 2 years ago

Copied to coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids added

Actions

Copy link

Updated by okurz over 2 years ago

Current situation: The issue was mitigated on OSD. Incomplete jobs are in the process of being restarted. The situation will be monitored. A followup plan for migration of other existing instances as well as new installations will be conducted within #109846

Actions

Copy link

Updated by okurz over 2 years ago

Description updated (diff)

Actions

Copy link

Updated by openqa_review over 2 years ago

Due date set to 2022-04-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

Updated by okurz over 2 years ago

All incomplete jobs were retriggered accordingly. Alerts are handled again. A proper upstream database migration was already prepared in https://github.com/os-autoinst/openQA/pull/4604 . The migration was done silently and successfully on o3 so I have a good feeling we can also enable OSD deployment. I enabled the deployment again and triggered a manual run now and I am monitoring that.

Actions

Copy link

#10

Updated by okurz over 2 years ago

Description updated (diff)
Due date deleted (~~2022-04-27~~)
Status changed from In Progress to Resolved

OSD deployment complete, rollback steps complete. Follow-up improvements and five-why-analysis planned in #109846 . We have a high amount of scheduled jobs. This might be a consequence of the incident but also related to #107014, commented there.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA » openQA Project

Tags

Custom queries

action #109836

All Jobs on OSD are incomplete since 2022-04-12

Observation¶

Expected¶

Rollback steps¶

Updated by Xiaojing_liu over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by openqa_review over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago