action #77836

login to aarch64.o.o fails with ssh keys and password, also not working over IPMI SoL

Added by okurz 2 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
Start date:
2020-11-13
Due date:
% Done:

0%

Estimated time:

Description

Observation

Login to aarch64.o.o fails with both SSH keys and password, and it also does not work over IPMI SoL. This was reported by mkittler and reproduced by ggardet_arm.

Acceptance criteria

  • AC1: login works again for QE Tools members
  • AC2: login works for ggardet_arm
  • AC3: We understand how this happened, e.g. whether we were hacked or whether it was system corruption, user error, etc.

Suggestions

History

#1 Updated by okurz 2 months ago

  • Description updated (diff)

#2 Updated by favogt 2 months ago

With IPMI I checked the graphical console and on tty10 (I think) rsyslog was spamming write error: Read-only file system.
I triggered a reboot, as it was inaccessible anyway.

#3 Updated by favogt 2 months ago

It's up again, apparently running fine.
At "2020-11-12T03:38:14.930211+01:00", logging stopped, probably due to /var becoming read only or whatever the root cause was.
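
For reference, a minimal sketch of how one could check for read-only mounts such as the suspected /var case. The function name `list_ro_mounts` is hypothetical and not something used on the machine. It only assumes the standard /proc/mounts layout (device, mount point, filesystem type, options) and will also list mounts that are read-only on purpose.

```shell
# List mount points whose mount options include "ro".
# Reads /proc/mounts-formatted text from the file given as $1
# (defaults to /proc/mounts).
list_ro_mounts() {
    awk '{
        # Field 4 holds the comma-separated mount options.
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }' "${1:-/proc/mounts}"
}
```

Calling `list_ro_mounts` without arguments inspects the running system. An empty result means no mount currently carries the `ro` option.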

#4 Updated by okurz 2 months ago

There is also https://infra.nue.suse.com/SelfService/Display.html?id=179857, which I created. We should coordinate with EngInfra.

#5 Updated by nicksinger 2 months ago

  • Assignee set to nicksinger

#6 Updated by nicksinger 2 months ago

  • Status changed from Workable to In Progress

The machine was shut off just now. I started it again and it is currently booting.

#7 Updated by nicksinger 2 months ago

After I did an explicit "power on", the machine came back online and I was able to log in with the expected credentials. Since favogt mentioned errors about /var (thanks for checking!), I fired up smartctl to see whether any errors were reported in the past:

openqa-aarch64:/home/opensuse # smartctl -i /dev/sda2
smartctl 7.0 2019-05-21 r4917 [aarch64-linux-5.8.15-1.gc680e93-default] (SUSE RPM)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BP480G4
Serial Number:    BTJR507404WJ480BGN
LU WWN Device Id: 5 5cd2e4 04b7929ce
Firmware Version: L2010420
User Capacity:    480.103.981.056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Nov 13 13:31:56 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

openqa-aarch64:/home/opensuse # smartctl -a /dev/sda2
smartctl 7.0 2019-05-21 r4917 [aarch64-linux-5.8.15-1.gc680e93-default] (SUSE RPM)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BP480G4
Serial Number:    BTJR507404WJ480BGN
LU WWN Device Id: 5 5cd2e4 04b7929ce
Firmware Version: L2010420
User Capacity:    480.103.981.056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Nov 13 13:32:30 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (    2) seconds.
Offline data collection
capabilities:            (0x79) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    (   2) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   097   097   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17587
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       86
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       82
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       619 (90 17)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Temperature_Case        0x0022   083   083   000    Old_age   Always       -       17 (Min/Max 12/18)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       82
194 Temperature_Internal    0x0022   100   100   000    Old_age   Always       -       17
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       18321456
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       11776
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       35
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       126853
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   043   043   000    Old_age   Always       -       0
234 Thermal_Throttle        0x0032   100   100   000    Old_age   Always       -       0/0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       18321456
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       7116970

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17578         -
# 2  Short offline       Completed without error       00%     17554         -
# 3  Short offline       Completed without error       00%     17530         -
# 4  Short offline       Completed without error       00%     17506         -
# 5  Short offline       Completed without error       00%     17482         -
# 6  Short offline       Completed without error       00%     17458         -
# 7  Short offline       Completed without error       00%     17434         -
# 8  Short offline       Completed without error       00%     17410         -
# 9  Short offline       Completed without error       00%     17386         -
#10  Short offline       Completed without error       00%     17362         -
#11  Short offline       Completed without error       00%     17338         -
#12  Short offline       Completed without error       00%     17314         -
#13  Short offline       Completed without error       00%     17290         -
#14  Extended offline    Completed without error       00%     17288         -
#15  Short offline       Completed without error       00%     17266         -
#16  Short offline       Completed without error       00%     17242         -
#17  Short offline       Completed without error       00%     17218         -
#18  Short offline       Completed without error       00%     17194         -
#19  Short offline       Completed without error       00%     17170         -
#20  Short offline       Completed without error       00%     17146         -
#21  Short offline       Completed without error       00%     17122         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So as you can see, the last 21 self-tests were successful, without errors. I will dig a little more to see if I can find the root cause, but for now the main issue (nobody can log in) is resolved.
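
A rough sketch of how the self-test check above could be automated instead of read by eye. The function name `count_failed_selftests` is made up for illustration. It assumes the self-test log layout shown in the output above and counts every entry whose status is anything other than "Completed without error", so aborted or still-running tests are counted too.

```shell
# Count self-test log entries that did NOT complete without error.
# Reads smartctl output (e.g. from `smartctl -a`) on stdin.
count_failed_selftests() {
    # Log entries start with "#", optional spaces, then the test number.
    awk '/^#[ ]*[0-9]+[ ]/ && !/Completed without error/ { n++ }
         END { print n + 0 }'
}
```

Usage could look like `smartctl -l selftest /dev/sda | count_failed_selftests`. For the log pasted above, every entry reads "Completed without error".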

#8 Updated by nicksinger 2 months ago

  • Status changed from In Progress to Feedback

#9 Updated by okurz 2 months ago

nicksinger, thanks for taking the ticket. fvogt and guillaume_g are following and discussing it in irc://chat.freenode.net/opensuse-factory but are too shy to update you :D But of course they are happy that you fixed it :)

I updated
https://infra.nue.suse.com/SelfService/Display.html?id=179857
and asked to close it.

#10 Updated by nicksinger 2 months ago

  • Status changed from Feedback to Resolved

Closing for now, as it is really hard to debug and the disk itself looks good. All files in /var/log are (obviously) truncated. If this happens again soon and we need to debug further, we could think about setting up a central rsyslogd server (e.g. on o3 itself) to collect the logs there and find the root cause. I skipped this for now, as it would require close monitoring (to avoid filling up o3 with all the worker logs) and I'm not up for this over the weekend :)
