action #77836
Closed: login to aarch64.o.o fails with ssh keys and password, also not working over IPMI SoL
Login to aarch64.o.o fails with both SSH keys and password, and it does not work over IPMI SoL either. Reported by mkittler and reproduced by ggardet_arm.
Acceptance criteria
- AC1: login works again for QE Tools members
- AC2: login works for ggardet_arm
- AC3: We understand how this happened, e.g. whether we were hacked or whether it was system corruption, user error, etc.
- If you do not yet have IPMI aliases I suggest
- Connect over IPMI SoL
- try to boot previous snapshots
- If they don't work boot with
or a rescue system or installer if that is available over PXE, and chroot into the installed system - investigate, recover, fix
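The last suggestion (rescue system plus chroot) could look roughly like the following. This is only a sketch under assumptions: the root partition name and the `/mnt` mount point are placeholders that do not come from this ticket, and the hypothetical helper `print_chroot_steps` only prints the commands so they can be reviewed before running them as root in the rescue shell.

```shell
#!/bin/sh
# Print (do not run) the commands to chroot into an installed system.
# $1: root partition of the installed system (placeholder, e.g. /dev/sdX1)
# $2: mount point inside the rescue system (assumed default: /mnt)
print_chroot_steps() {
    root_dev=$1
    target=${2:-/mnt}
    echo "mount $root_dev $target"
    # Bind the kernel pseudo filesystems so tools inside the chroot work.
    for fs in dev proc sys; do
        echo "mount --rbind /$fs $target/$fs"
    done
    echo "chroot $target /bin/bash"
}

# Example: print_chroot_steps /dev/sdX1
```

On a Btrfs layout with separate subvolumes, further mounts inside the chroot (for example `mount -a`) may be needed before `/var` is visible.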
Updated by favogt over 4 years ago
With IPMI I checked the graphical console and on tty10 (I think) rsyslog was spamming write error: Read-only file system
I triggered a reboot, as it was inaccessible anyway.
Updated by favogt over 4 years ago
It's up again, apparently running fine.
At "2020-11-12T03:38:14.930211+01:00", logging stopped, probably due to /var becoming read only or whatever the root cause was.
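Should this happen again while the machine is still reachable, one quick check is to look for mounts that carry the `ro` option. A minimal sketch, assuming a Linux `/proc/mounts`; the function name `ro_mounts` is made up for illustration:

```shell
#!/bin/sh
# List mount points whose option field contains "ro".
# $1: file in /proc/mounts format (defaults to /proc/mounts itself)
ro_mounts() {
    awk '{
        # Field 4 holds the comma-separated mount options.
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }' "${1:-/proc/mounts}"
}

# Example: ro_mounts
```

Some pseudo filesystems are read-only by design, so only unexpected entries such as `/` or `/var` would point to a problem like the one seen here.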
Updated by okurz over 4 years ago
there is also which I created. We should coordinate with EngInfra.
Updated by nicksinger over 4 years ago
- Status changed from Workable to In Progress
The machine was shut off. I started it again and it is currently booting.
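For reference, checking and changing the power state through the BMC can be wrapped in a small helper. This is a sketch, not the setup used here: `bmc`, `BMC_HOST`, `BMC_USER` and `DRY_RUN` are hypothetical names, and it assumes `ipmitool` with the `lanplus` interface, reading the password from `IPMI_PASSWORD` via `-E`.

```shell
#!/bin/sh
# Run an ipmitool subcommand against the BMC.
# BMC_HOST and BMC_USER must be set (placeholders, not from this ticket).
# With DRY_RUN=1 the command line is printed instead of executed.
bmc() {
    set -- ipmitool -I lanplus \
        -H "${BMC_HOST:?set BMC_HOST}" -U "${BMC_USER:?set BMC_USER}" -E "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Examples:
#   bmc chassis power status
#   bmc chassis power on
#   bmc sol activate
```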
Updated by nicksinger over 4 years ago
After I did an explicit "power on" the machine came back online and I was able to login with the expected credentials. As @favogt mentioned (thanks for checking!) errors about /var
I ran smartctl to see whether any errors were reported in the past:
openqa-aarch64:/home/opensuse # smartctl -i /dev/sda2
smartctl 7.0 2019-05-21 r4917 [aarch64-linux-5.8.15-1.gc680e93-default] (SUSE RPM)
Copyright (C) 2002-18, Bruce Allen, Christian Franke,
Model Family: Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model: INTEL SSDSC2BP480G4
Serial Number: BTJR507404WJ480BGN
LU WWN Device Id: 5 5cd2e4 04b7929ce
Firmware Version: L2010420
User Capacity: 480.103.981.056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Nov 13 13:31:56 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
openqa-aarch64:/home/opensuse # smartctl -a /dev/sda2
smartctl 7.0 2019-05-21 r4917 [aarch64-linux-5.8.15-1.gc680e93-default] (SUSE RPM)
Copyright (C) 2002-18, Bruce Allen, Christian Franke,
Model Family: Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model: INTEL SSDSC2BP480G4
Serial Number: BTJR507404WJ480BGN
LU WWN Device Id: 5 5cd2e4 04b7929ce
Firmware Version: L2010420
User Capacity: 480.103.981.056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Nov 13 13:32:30 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 2) seconds.
Offline data collection
capabilities: (0x79) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 2) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 097 097 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 17587
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 86
170 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 82
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 619 (90 17)
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Temperature_Case 0x0022 083 083 000 Old_age Always - 17 (Min/Max 12/18)
192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 82
194 Temperature_Internal 0x0022 100 100 000 Old_age Always - 17
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 18321456
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 11776
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 35
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 126853
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 043 043 000 Old_age Always - 0
234 Thermal_Throttle 0x0032 100 100 000 Old_age Always - 0/0
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 18321456
242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 7116970
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 17578 -
# 2 Short offline Completed without error 00% 17554 -
# 3 Short offline Completed without error 00% 17530 -
# 4 Short offline Completed without error 00% 17506 -
# 5 Short offline Completed without error 00% 17482 -
# 6 Short offline Completed without error 00% 17458 -
# 7 Short offline Completed without error 00% 17434 -
# 8 Short offline Completed without error 00% 17410 -
# 9 Short offline Completed without error 00% 17386 -
#10 Short offline Completed without error 00% 17362 -
#11 Short offline Completed without error 00% 17338 -
#12 Short offline Completed without error 00% 17314 -
#13 Short offline Completed without error 00% 17290 -
#14 Extended offline Completed without error 00% 17288 -
#15 Short offline Completed without error 00% 17266 -
#16 Short offline Completed without error 00% 17242 -
#17 Short offline Completed without error 00% 17218 -
#18 Short offline Completed without error 00% 17194 -
#19 Short offline Completed without error 00% 17170 -
#20 Short offline Completed without error 00% 17146 -
#21 Short offline Completed without error 00% 17122 -
SMART Selective self-test log data structure revision number 1
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
So, as you can see, the last 21 self-tests were successful and completed without errors. I will dig a little more to see if I can find the root cause, but for now the main issue (nobody can log in) is resolved.
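The same check of the self-test log can be scripted instead of read by eye. A minimal sketch, assuming the log format shown in the output above; `failed_selftests` is a hypothetical helper name:

```shell
#!/bin/sh
# Count self-test log entries that did not complete without error.
# Reads smartctl output from the given files, or from stdin if none are given.
failed_selftests() {
    awk '
        # Log entries start with "# 1", "#10", ... followed by the test type.
        /^# *[0-9]+ / && !/Completed without error/ { bad++ }
        END { print bad + 0 }
    ' "$@"
}

# Example (as root): smartctl -l selftest /dev/sda | failed_selftests
```

A result of 0 only says that the logged self-tests passed; it does not rule out file system or controller problems.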
Updated by nicksinger over 4 years ago
- Status changed from In Progress to Feedback
Updated by okurz over 4 years ago
@nicksinger thanks for taking the ticket. fvogt and guillaume_g are following and discussing it in irc:// but are too shy to update you :D But of course they are happy that you fixed it :)
I updated
and asked to close it.
Updated by nicksinger over 4 years ago
- Status changed from Feedback to Resolved
Closing for now as it is really hard to debug and the disk itself looks good. All files in /var/log are (obviously) truncated. If this happens again soon and we need to debug further we could think about setting up a central rsyslogd-server (e.g. on o3 itself) to collect the root cause there. I skipped this for now as it would require close monitoring (to not fill up o3 with all the worker logs) and I'm not up for this over the weekend :)
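If the central log server idea is picked up later, the client side could be as small as one forwarding rule. This is a sketch under assumptions: the host name `loghost.example`, port 514 and the drop-in path are placeholders, and `rsyslog_forward_rule` is a hypothetical helper. It uses rsyslog's legacy forwarding syntax, where `@@` means TCP and a single `@` means UDP.

```shell
#!/bin/sh
# Emit an rsyslog rule that forwards all messages to a remote host over TCP.
# $1: log server host name (placeholder)
# $2: port (assumed default: 514)
rsyslog_forward_rule() {
    printf '*.* @@%s:%s\n' "$1" "${2:-514}"
}

# Example (as root, path is an assumption):
#   rsyslog_forward_rule loghost.example > /etc/rsyslog.d/90-forward.conf
#   systemctl restart rsyslog
```

The receiving side would additionally need a TCP input enabled (the `imtcp` module) and some rotation or size limit so that worker logs do not fill up its disk.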