action #57683

o3 /space is nearly running out again, assets are not refreshed, not cleaned up (was: too much logs&results)

Added by okurz almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
Start date: 2019-10-03
Due date:
% Done: 0%
Estimated time:

Description

Observation

openqa=> select id,name,keep_logs_in_days,keep_results_in_days from job_groups order by keep_results_in_days desc;
 id |                name                | keep_logs_in_days | keep_results_in_days 
----+------------------------------------+-------------------+----------------------
 61 | openSUSE Leap 15.1 Updates         |                   |                     
 21 | openSUSE Leap 42.2 PowerPC         |                30 |                     
 20 | openSUSE Leap 42.2 AArch64         |                30 |                     
 15 | openSUSE Leap 42.1 JeOS            |                30 |                     
 55 | openSUSE Leap 15.0 Updates         |                   |                     
 34 | openSUSE Tumbleweed s390x          |                30 |                  365
 12 | openSUSE Leap 42.1 AArch64         |                30 |                  365
 53 | openSUSE Leap 15.0 Images          |                30 |                  365
  7 | openSUSE Leap 42.1                 |                30 |                  365
 58 | openSUSE Leap 15.1 Images          |                30 |                  365
 13 | openSUSE Leap 42.1 PowerPC         |                30 |                  365
  4 | openSUSE Tumbleweed PowerPC        |                30 |                  365
 48 | openSUSE Leap 42.3 Incidents       |                30 |                  365
 54 | openSUSE Leap 15.0 Incidents       |                30 |                  365
 65 | openSUSE Leap 15.2 Images          |                20 |                  120
 39 | Development Leap                   |                20 |                  120
 41 | Development Kubic                  |                30 |                  120
 59 | openSUSE Leap 15.1 AArch64 Images  |                20 |                  120

We simply cannot afford to store that much at the moment.


Related issues

Related to openQA Project - action #57689: asset cleanup jobs do not run on o3 (results cleanup works), workaround: unlock locks manually (Resolved, start date 2019-10-03, due date 2019-12-10)

History

#1 Updated by okurz almost 2 years ago

Going through the mentioned job groups and reducing the result retention period. Set many of them to 90 days for results now.
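A minimal sketch of this triage, assuming the rows of the job_groups query from the description are copied into Python by hand, with None standing for the empty cells. The 90-day limit is the value chosen above; the JobGroup type and the groups_over_limit helper are hypothetical. The sketch makes no assumption about which retention openQA applies when the column is empty; it only lists those groups separately for manual review.

```python
from typing import List, NamedTuple, Optional, Tuple

LIMIT_DAYS = 90  # new result retention chosen in this comment


class JobGroup(NamedTuple):
    """One row of the job_groups query (None = empty cell)."""
    id: int
    name: str
    keep_logs_in_days: Optional[int]
    keep_results_in_days: Optional[int]


def groups_over_limit(
    groups: List[JobGroup], limit_days: int = LIMIT_DAYS
) -> Tuple[List[JobGroup], List[JobGroup]]:
    """Return (too_long, unset) with respect to result retention.

    too_long: groups with an explicit keep_results_in_days above limit_days.
    unset:    groups with an empty keep_results_in_days; the effective value
              is decided elsewhere, so they are only flagged for review.
    """
    too_long = [
        g for g in groups
        if g.keep_results_in_days is not None
        and g.keep_results_in_days > limit_days
    ]
    unset = [g for g in groups if g.keep_results_in_days is None]
    return too_long, unset


# A few rows copied from the query output in the description.
rows = [
    JobGroup(61, "openSUSE Leap 15.1 Updates", None, None),
    JobGroup(34, "openSUSE Tumbleweed s390x", 30, 365),
    JobGroup(65, "openSUSE Leap 15.2 Images", 20, 120),
    JobGroup(41, "Development Kubic", 30, 120),
]

too_long, unset = groups_over_limit(rows)
for g in too_long:
    print(f"{g.id:>3} {g.name}: {g.keep_results_in_days} -> {LIMIT_DAYS} days")
print("unset, review by hand:", [g.id for g in unset])
```

Incidentally, PostgreSQL sorts NULL values first for ORDER BY ... DESC by default, which is why the groups with empty retention columns appear at the top of the query output in the description.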

Found some old, unhandled test results:

ariel:/space/openqa/testresults # ls
00071  00302                                                                                      00328  00368  00408  00448  00488  00528  00568  00608  00648  00688  00728  00768  00808  00848  00888  00928  00968  01008
00097  00303                                                                                      00329  00369  00409  00449  00489  00529  00569  00609  00649  00689  00729  00769  00809  00849  00889  00929  00969  01009
00098  00304                                                                                      00330  00370  00410  00450  00490  00530  00570  00610  00650  00690  00730  00770  00810  00850  00890  00930  00970  01010
00121  00305                                                                                      00331  00371  00411  00451  00491  00531  00571  00611  00651  00691  00731  00771  00811  00851  00891  00931  00971  01011
00122  00306                                                                                      00332  00372  00412  00452  00492  00532  00572  00612  00652  00692  00732  00772  00812  00852  00892  00932  00972  01012
00125  00307                                                                                      00333  00373  00413  00453  00493  00533  00573  00613  00653  00693  00733  00773  00813  00853  00893  00933  00973  01013
00215  00308                                                                                      00334  00374  00414  00454  00494  00534  00574  00614  00654  00694  00734  00774  00814  00854  00894  00934  00974  01014
00225  00308234-opensuse-42.2-Updates-x86_64-Build20161121-2-kde@64bit-2G                         00335  00375  00415  00455  00495  00535  00575  00615  00655  00695  00735  00775  00815  00855  00895  00935  00975  01015
00226  00308238-opensuse-42.2-Updates-x86_64-Build20161121-2-gnome@64bit-2G                       00336  00376  00416  00456  00496  00536  00576  00616  00656  00696  00736  00776  00816  00856  00896  00936  00976  01016
00227  00308239-opensuse-42.1-UpdateTest-x86_64-Build20161121-1-gnome@uefi-2G                     00337  00377  00417  00457  00497  00537  00577  00617  00657  00697  00737  00777  00817  00857  00897  00937  00977  01017
00228  00308240-opensuse-42.2-Updates-x86_64-Build20161121-2-install_with_updates_gnome@64bit-2G  00338  00378  00418  00458  00498  00538  00578  00618  00658  00698  00738  00778  00818  00858  00898  00938  00978  01018
00266  00308241-opensuse-42.2-Updates-x86_64-Build20161121-2-install_with_updates_kde@uefi-2G     00339  00379  00419  00459  00499  00539  00579  00619  00659  00699  00739  00779  00819  00859  00899  00939  00979  01019
00267  00308242-opensuse-42.2-Updates-x86_64-Build20161121-2-kde@64bit-2G                         00340  00380  00420  00460  00500  00540  00580  00620  00660  00700  00740  00780  00820  00860  00900  00940  00980  01020
00268  00308243-opensuse-42.2-UpdateTest-x86_64-Build20161121-2-kde@64bit-2G                      00341  00381  00421  00461  00501  00541  00581  00621  00661  00701  00741  00781  00821  00861  00901  00941  00981  01021
00276  00308244-opensuse-42.2-UpdateTest-x86_64-Build20161121-2-gnome@uefi-2G                     00342  00382  00422  00462  00502  00542  00582  00622  00662  00702  00742  00782  00822  00862  00902  00942  00982  01022
00277  00308245-opensuse-42.2-Updates-x86_64-Build20161121-2-gnome@uefi                           00343  00383  00423  00463  00503  00543  00583  00623  00663  00703  00743  00783  00823  00863  00903  00943  00983  01023
00278  00308246-opensuse-5.8.90-Krypton-Live-x86_64-Build5.54-krypton-live@64bit-2G               00344  00384  00424  00464  00504  00544  00584  00624  00664  00704  00744  00784  00824  00864  00904  00944  00984  01024
00279  00308247-opensuse-5.7.90-Argon-Live-x86_64-Build11.4-krypton-live@64bit-2G                 00345  00385  00425  00465  00505  00545  00585  00625  00665  00705  00745  00785  00825  00865  00905  00945  00985  01025
00280  00308249-opensuse-42.2-Updates-x86_64-Build20161121-2-gnome@uefi                           00346  00386  00426  00466  00506  00546  00586  00626  00666  00706  00746  00786  00826  00866  00906  00946  00986  01026
00281  00308255-opensuse-42.2-Updates-x86_64-Build20161121-2-install_with_updates_kde@uefi-2G     00347  00387  00427  00467  00507  00547  00587  00627  00667  00707  00747  00787  00827  00867  00907  00947  00987  01027
00282  00308257-opensuse-42.1-UpdateTest-x86_64-Build20161121-1-gnome@uefi-2G                     00348  00388  00428  00468  00508  00548  00588  00628  00668  00708  00748  00788  00828  00868  00908  00948  00988  01028
00283  00309                                                                                      00349  00389  00429  00469  00509  00549  00589  00629  00669  00709  00749  00789  00829  00869  00909  00949  00989  01029
00284  00310                                                                                      00350  00390  00430  00470  00510  00550  00590  00630  00670  00710  00750  00790  00830  00870  00910  00950  00990  01030
00285  00311                                                                                      00351  00391  00431  00471  00511  00551  00591  00631  00671  00711  00751  00791  00831  00871  00911  00951  00991  01031
00286  00312                                                                                      00352  00392  00432  00472  00512  00552  00592  00632  00672  00712  00752  00792  00832  00872  00912  00952  00992  01032
00287  00313                                                                                      00353  00393  00433  00473  00513  00553  00593  00633  00673  00713  00753  00793  00833  00873  00913  00953  00993  01033
00288  00314                                                                                      00354  00394  00434  00474  00514  00554  00594  00634  00674  00714  00754  00794  00834  00874  00914  00954  00994  01034
00289  00315                                                                                      00355  00395  00435  00475  00515  00555  00595  00635  00675  00715  00755  00795  00835  00875  00915  00955  00995  01035
00290  00316                                                                                      00356  00396  00436  00476  00516  00556  00596  00636  00676  00716  00756  00796  00836  00876  00916  00956  00996  01036
00291  00317                                                                                      00357  00397  00437  00477  00517  00557  00597  00637  00677  00717  00757  00797  00837  00877  00917  00957  00997  01037
00292  00318                                                                                      00358  00398  00438  00478  00518  00558  00598  00638  00678  00718  00758  00798  00838  00878  00918  00958  00998  01038
00293  00319                                                                                      00359  00399  00439  00479  00519  00559  00599  00639  00679  00719  00759  00799  00839  00879  00919  00959  00999  01039
00294  00320                                                                                      00360  00400  00440  00480  00520  00560  00600  00640  00680  00720  00760  00800  00840  00880  00920  00960  01000  01040
00295  00321                                                                                      00361  00401  00441  00481  00521  00561  00601  00641  00681  00721  00761  00801  00841  00881  00921  00961  01001  01041
00296  00322                                                                                      00362  00402  00442  00482  00522  00562  00602  00642  00682  00722  00762  00802  00842  00882  00922  00962  01002  01042
00297  00323                                                                                      00363  00403  00443  00483  00523  00563  00603  00643  00683  00723  00763  00803  00843  00883  00923  00963  01003  01043
00298  00324                                                                                      00364  00404  00444  00484  00524  00564  00604  00644  00684  00724  00764  00804  00844  00884  00924  00964  01004  01044
00299  00325                                                                                      00365  00405  00445  00485  00525  00565  00605  00645  00685  00725  00765  00805  00845  00885  00925  00965  01005  01045
00300  00326                                                                                      00366  00406  00446  00486  00526  00566  00606  00646  00686  00726  00766  00806  00846  00886  00926  00966  01006  01046
00301  00327                                                                                      00367  00407  00447  00487  00527  00567  00607  00647  00687  00727  00767  00807  00847  00887  00927  00967  01007  01047

Cleaned up some directories manually.
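The listing mixes the expected five-digit bucket directories with a few job directories sitting directly in testresults/. Judging from the path in the lsof output further down (testresults/00858/00858784-...), a job directory is named after the zero-padded eight-digit job ID and belongs in the bucket named after its first five digits. A minimal sketch under that assumption that only reports misplaced entries and deletes nothing; the helper names are hypothetical.

```python
import os
import re
import tempfile
from typing import Iterator, Optional, Tuple

BUCKET_RE = re.compile(r"\d{5}")       # e.g. 00858
JOB_DIR_RE = re.compile(r"(\d{8})-")   # e.g. 00858784-opensuse-...


def expected_bucket(job_id: int) -> str:
    """Bucket directory for a job ID, e.g. 858784 -> '00858' (assumed layout)."""
    return f"{job_id:08d}"[:5]


def misplaced_job_dirs(testresults: str) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (entry, bucket) for entries that are not five-digit buckets.

    bucket is where a job directory would be expected to live, or None
    for entries that do not look like job directories at all.
    """
    for entry in sorted(os.listdir(testresults)):
        if BUCKET_RE.fullmatch(entry):
            continue  # regular bucket directory, nothing to report
        m = JOB_DIR_RE.match(entry)
        yield entry, (expected_bucket(int(m.group(1))) if m else None)


# Demo on a throwaway directory with names taken from the listing above.
with tempfile.TemporaryDirectory() as root:
    for name in ("00307", "00308",
                 "00308234-opensuse-42.2-Updates-x86_64-Build20161121-2-kde@64bit-2G"):
        os.mkdir(os.path.join(root, name))
    for entry, bucket in misplaced_job_dirs(root):
        print(f"{entry} -> {bucket}")
```

On o3 one would point misplaced_job_dirs at /space/openqa/testresults (the directory from the shell prompt above) and review the output before moving or deleting anything.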

Also deleted old files in /var/lib/openqa/share/factory/hdd/. It now seems that asset cleanup has not been working for some days: https://openqa.opensuse.org/admin/assets only shows assets older than around 2019-09-28, so possibly the upgrade on 2019-09-29 is what broke asset tracking (and cleanup). In /var/log/zypp/history I cannot find any changes between 2019-09-28 and the next day, but there are changes between 2019-09-27 and 2019-10-01, with these corresponding git commits:

git log1 --no-merges b5a1dadd6..683ca6661
c1046c5f8 (okurz/enhance/cleanup_circle_ci, enhance/cleanup_circle_ci) circleci: Remove whitespace at EOL
889f14b04 Fix publishing documentation via Travis
1abc10b24 Don't silently exit doc generation if asciidoctor not available
62241751c Load build results on dashboard via AJAX
763ded82c Update perl-DBIx-Class-DeploymentHandler dependency to 0.002233 (#2359)
43892c077 Move stale job detection from ws server to scheduler
152d5ed19 Remove obsolete comment regarding offline workers
5b3447b30 Rely on t_updated for the worker's online status in the web UI
6617fa456 Prevent failures in feature tour test
313790d8b Set default check interval for wait_util to 1 second
b9ac321af (Martchus/uniform-dependency-boxes) Enforce same width for nodes in dependency graph
92402582a Move test helper embed_server_for_testing to test utilities
15e48ae07 Add unit test for test schedule change processing
e96794dbc Avoid race condition if test_order.json changes too often
dbae31f30 Reload test_order.json if it changes at test runtime
519f61fe3 Refactor job result file path concatenation
2fefb9e8f (Martchus/staging) Move incompletion logic when worker shows up again to scheduler
9b04a7daa (okurz/feature/devel_test) Add package-test for openQA-devel allowing to check all dependencies in all repos

Nothing obvious jumps out. I am tempted to simply reboot the whole system.

# ps auxf | grep '\<gru\>'
root     13447  0.0  0.0   7432   968 pts/14   S+   21:14   0:00                  \_ grep --color=auto \<gru\>
geekote+ 26763  0.0  0.9 321996 155656 ?       SNs  18:45   0:08 /usr/bin/perl /usr/share/openqa/script/openqa gru -m production run
geekote+ 28826  4.9  1.3 386188 217024 ?       DN   19:01   6:36  \_ /usr/bin/perl /usr/share/openqa/script/openqa gru -m production run
ariel:/space/openqa/share/factory # cat /proc/28826/stack 
[<ffffffffa040fbfe>] xfs_buf_submit_wait+0x7e/0x200 [xfs]
[<ffffffffa040feb6>] xfs_buf_read_map+0x106/0x170 [xfs]
[<ffffffffa0442dec>] xfs_trans_read_buf_map+0xac/0x2e0 [xfs]
[<ffffffffa03fab77>] xfs_imap_to_bp+0x57/0xd0 [xfs]
[<ffffffffa03fb40e>] xfs_iread+0x6e/0x1f0 [xfs]
[<ffffffffa0419dcb>] xfs_iget+0x2eb/0x980 [xfs]
[<ffffffffa04232f8>] xfs_lookup+0xb8/0xf0 [xfs]
[<ffffffffa041fb7c>] xfs_vn_lookup+0x4c/0x80 [xfs]
[<ffffffff812644c9>] lookup_slow+0x99/0x150
[<ffffffff81264a1d>] walk_component+0x19d/0x440
[<ffffffff812652a5>] path_lookupat+0x75/0x1d0
[<ffffffff81268b87>] filename_lookup+0xa7/0x160
[<ffffffff8125cff3>] vfs_statx+0x63/0xb0
[<ffffffff8125d496>] SYSC_newlstat+0x26/0x40
[<ffffffff81003aeb>] do_syscall_64+0x7b/0x160
[<ffffffff8180009a>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<ffffffffffffffff>] 0xffffffffffffffff
# lsof -p 28826
COMMAND   PID      USER   FD   TYPE             DEVICE  SIZE/OFF        NODE NAME
openqa  28826 geekotest  cwd    DIR             253,17      8192 12898734141 /var/lib/openqa/testresults/00858/00858784-opensuse-15.0-DVD-Incidents-x86_64-Build:9409:gcc7.1550619705-cryptlvm@uefi-2G/.thumbs

And strace shows that it is reading test result directories, so everything looks in order, albeit a bit slow, possibly because of the big backlog of result and log cleanup. However, no job is running for asset cleanup; it looks like asset cleanup is never given a chance to run. Let's monitor it overnight.
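The STAT column in the ps auxf output above carries the key hint: the child gru process is in state DN, i.e. uninterruptible sleep (usually waiting on I/O, consistent with the XFS lookup frames in its kernel stack) at low priority. A small decoder for these codes, following the PROCESS STATE CODES section of the procps ps(1) man page; the decode_stat name is hypothetical and only a subset of the codes is covered.

```python
from typing import List

# Subset of the PROCESS STATE CODES from the procps ps(1) man page.
STATES = {
    "D": "uninterruptible sleep (usually I/O)",
    "I": "idle kernel thread",
    "R": "running or runnable",
    "S": "interruptible sleep",
    "T": "stopped by job control signal",
    "t": "stopped by debugger",
    "Z": "zombie (terminated, not reaped)",
}
FLAGS = {
    "<": "high priority",
    "N": "low priority (niced)",
    "L": "pages locked into memory",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "in foreground process group",
}


def decode_stat(stat: str) -> List[str]:
    """Translate a ps STAT value such as 'DN' into readable parts."""
    if not stat:
        raise ValueError("empty STAT value")
    parts = [STATES.get(stat[0], f"unknown state {stat[0]!r}")]
    parts += [FLAGS.get(c, f"unknown flag {c!r}") for c in stat[1:]]
    return parts


# The two gru processes from the ps auxf output above.
for pid, stat in ((26763, "SNs"), (28826, "DN")):
    print(pid, "; ".join(decode_stat(stat)))
```

A process that keeps showing D is not necessarily hung; here strace showed it making progress through the test result directories, just slowly.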

#2 Updated by okurz almost 2 years ago

  • Subject changed from "o3 /space is nearly running out again, too much logs&results" to "o3 /space is nearly running out again, assets are not refreshed, not cleaned up (was: too much logs&results)"

#3 Updated by okurz almost 2 years ago

  • Status changed from In Progress to Feedback

#4 Updated by okurz almost 2 years ago

  • Related to action #57689: asset cleanup jobs do not run on o3 (results cleanup works), workaround: unlock locks manually added
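The related ticket's workaround, unlocking locks manually, points at a common failure mode: a named lock with a long expiration that is left behind (for example by a job that was killed) keeps every later job needing the same lock from running until the lock expires or is removed. Below is a toy, in-memory model of such an expiring lock to illustrate the effect; it is not openQA's or Minion's implementation, and the class, function and lock names are hypothetical. Minion does offer lock and unlock methods for named locks, but this sketch makes no claim about how openQA's cleanup tasks use them.

```python
import time
from typing import Callable, Dict


class ExpiringLocks:
    """Toy in-memory named locks that expire after a number of seconds."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._expires: Dict[str, float] = {}  # lock name -> expiry time

    def lock(self, name: str, expiration: float) -> bool:
        """Acquire name unless it is held and not yet expired."""
        now = self._clock()
        if name in self._expires and self._expires[name] > now:
            return False
        self._expires[name] = now + expiration
        return True

    def unlock(self, name: str) -> bool:
        """Release name; False if it was not held."""
        return self._expires.pop(name, None) is not None


def run_cleanup(locks: ExpiringLocks, name: str = "asset_cleanup") -> str:
    """Hypothetical cleanup job that refuses to run while its lock is held."""
    if not locks.lock(name, expiration=86400):
        return "skipped: lock held"
    try:
        return "cleanup ran"  # the actual cleanup work would go here
    finally:
        locks.unlock(name)


# Fake clock so the demo does not depend on wall time.
now = [0.0]
locks = ExpiringLocks(clock=lambda: now[0])
locks.lock("asset_cleanup", expiration=86400)  # left behind, e.g. by a killed job
print(run_cleanup(locks))  # skipped: lock held
locks.unlock("asset_cleanup")  # the manual workaround
print(run_cleanup(locks))  # cleanup ran
```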

#5 Updated by okurz almost 2 years ago

  • Status changed from Feedback to Resolved

Manually restarting Minion jobs, cleaning up locks, etc. worked. With strace -f -eopen,unlink -p 28805 I could follow the gru Minion process and see that it eventually unlinked files, and we are back to only 79% usage.
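To keep an eye on a figure like the 79% above without running df by hand, a minimal sketch that computes a df-style Use% from Python's shutil.disk_usage. It assumes a POSIX system, where the free field is the space available to unprivileged users, and it is intended to mirror how GNU df computes Use% (used relative to used plus available, rounded up). The 90% threshold and the function names are hypothetical, not taken from the ticket.

```python
import shutil
from typing import Tuple


def df_use_percent(used: int, available: int) -> int:
    """df-style Use%: used / (used + available), rounded up to an integer."""
    denom = used + available
    if denom == 0:
        return 0
    return (used * 100 + denom - 1) // denom  # integer ceiling division


def check_usage(path: str, threshold: int = 90) -> Tuple[int, bool]:
    """Return (Use%, over_threshold) for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    # On POSIX, usage.free is the space available to unprivileged users,
    # so root-reserved blocks are left out of the denominator.
    pct = df_use_percent(usage.used, usage.free)
    return pct, pct >= threshold


pct, over = check_usage("/")  # on o3 this would be /space
print(f"/: {pct}% used" + (" (over threshold)" if over else ""))
```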
