Project

General

Profile

action #109836

All Jobs on OSD are incomplete since 2022-04-12

Added by Xiaojing_liu 3 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2022-04-12
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

We found almost all jobs on OSD are incomplete with the reason 'Reason: api failure: 400 response: OpenQA::Schema::Result::Jobs::insert_module(): DBI Exception: DBD::Pg::st execute failed: ERROR: integer out of range [for Statement " INSERT INTO job_modules ( job_id, name, category, script, milestone, important, fatal, always_rollback, t_created, t_u…' on 2022-04-12 morning.

Examples:
https://openqa.suse.de/tests/8519011#
https://openqa.suse.de/tests/8519030

BTW o3 doesn't have this problem.

Please help to check that.

Expected

Jobs on OSD could run as normal.

Rollback steps


Related issues

Copied to openQA Project - coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for id'sBlocked2022-04-122022-07-07

History

#1 Updated by Xiaojing_liu 3 months ago

  • Description updated (diff)

#2 Updated by okurz 3 months ago

  • Category changed from Support to Concrete Bugs
  • Target version set to Ready

#3 Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz

on it. https://openqa.suse.de/tests/8516601 shows that the issue already happened before today's deployment.

openqa=> select * from job_modules order by id desc limit 10;
     id     | job_id  |        name        |         script          | category |         t_created          |         t_updated          | result | milestone | important | fatal | always_rollback 
------------+---------+--------------------+-------------------------+----------+----------------------------+----------------------------+--------+-----------+-----------+-------+-----------------
 2147482702 | 8508180 | fcntl09            | tests/kernel/run_ltp.pm | kernel   | 2022-04-11 23:03:39.362748 | 2022-04-11 23:03:39.362748 | none   |         0 |         1 |     0 |               0

shows that the most recent job module insert was 2147482702 from 022-04-11 23:03:39.362748. The number is suspiciously near 2 ** 31 = 2147483648

We currently have

openqa=> select count(*) from job_modules;
  count   
----------
 27364042
(1 row)

so I guess an in-place migration of 27M modules would take some minutes to hours but not endless. For now only new jobs since about 10h are affected, so we should be careful to not destroy everything.

I stopped the scheduler for now to not add more tests for now which can not complete anyway.

EDIT: Executing ALTER TABLE job_modules ALTER COLUMN id TYPE bigint; manually. Took some minutes (only).

Also triggered

host=openqa.suse.de failed_since=2022-04-11 ./openqa-advanced-retrigger-jobs

to restart the 3.5k incompletes, see https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=14&orgId=1 .

openqa=> \d job_modules;
                                           Table "public.job_modules"
     Column      |            Type             | Collation | Nullable |                 Default
-----------------+-----------------------------+-----------+----------+-----------------------------------------
 id              | bigint                      |           | not null | nextval('job_modules_id_seq'::regclass)
 job_id          | integer                     |           | not null |
 name            | text                        |           | not null |
 script          | text                        |           | not null |
 category        | text                        |           | not null |
 t_created       | timestamp without time zone |           | not null |
 t_updated       | timestamp without time zone |           | not null |
 result          | character varying           |           | not null | 'none'::character varying
 milestone       | integer                     |           | not null | 0
 important       | integer                     |           | not null | 0
 fatal           | integer                     |           | not null | 0
 always_rollback | integer                     |           | not null | 0
Indexes:
    "job_modules_pkey" PRIMARY KEY, btree (id)
    "idx_job_modules_result" btree (result)
    "job_modules_idx_job_id" btree (job_id)
    "job_modules_job_id_name_category_script" UNIQUE CONSTRAINT, btree (job_id, name, category, script)
Foreign-key constraints:
    "job_modules_fk_job_id" FOREIGN KEY (job_id) REFERENCES jobs(id) ON UPDATE CASCADE DEFERRABLE
Referenced by:
    TABLE "needles" CONSTRAINT "needles_fk_last_matched_module_id" FOREIGN KEY (last_matched_module_id) REFERENCES job_modules(id) ON DELETE SET NULL DEFERRABLE
    TABLE "needles" CONSTRAINT "needles_fk_last_seen_module_id" FOREIGN KEY (last_seen_module_id) REFERENCES job_modules(id) ON DELETE SET NULL DEFERRABLE

looks better now.

#4 Updated by okurz 3 months ago

I restarted services but then hit DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: value "2147490421" is out of range for type integer CONTEXT: unnamed portal parameter $1 = '...' [for Statement "UPDATE needles SET last_seen_module_id = ?, last_seen_time = ?, t… looks like we need to find all occurences of job_module_id and change tables.

So I did

ALTER TABLE needles ALTER COLUMN last_seen_module_id TYPE bigint; ALTER TABLE needles ALTER COLUMN last_matched_module_id TYPE bigint;

which was quick.

Then I got

400 response: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: cached plan must not change result type [for Statement "SELECT me.id, me.dir_id, me.filename, me.last_seen_time, me.last_seen_module_id, me.last_matched_time, me.last_matched_modu…
Clone of 8515034 (restarted already 2 times) 

oh, those carwos jobs can be quick, already some passed, e.g. https://openqa.suse.de/tests/8519408

#5 Updated by okurz 3 months ago

  • Copied to coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for id's added

#6 Updated by okurz 3 months ago

Current situation: The issue was mitigated on OSD. Incomplete jobs are in the process of being restarted. The situation will be monitored. A followup plan for migration of other existing instances as well as new installations will be conducted within #109846

#7 Updated by okurz 3 months ago

  • Description updated (diff)

#8 Updated by openqa_review 2 months ago

  • Due date set to 2022-04-27

Setting due date based on mean cycle time of SUSE QE Tools

#9 Updated by okurz 2 months ago

All incomplete jobs were retriggered accordingly. Alerts are handled again. A proper upstream database migration was already prepared in https://github.com/os-autoinst/openQA/pull/4604 . The migration was done silently and successfully on o3 so I have a good feeling we can also enable OSD deployment. I enabled the deployment again and triggered a manual run now and I am monitoring that.

#10 Updated by okurz 2 months ago

  • Description updated (diff)
  • Due date deleted (2022-04-27)
  • Status changed from In Progress to Resolved

OSD deployment complete, rollback steps complete. Follow-up improvements and five-why-analysis planned in #109846 . We have a high amount of scheduled jobs. This might be a consequence of the incident but also related to #107014, commented there.

Also available in: Atom PDF