When SSDs fail...

An internal server started blue screening yesterday. Well, specifically, it black screened: the server showed a black screen with a flashing cursor. A hard reboot resolved the issue, but an hour later the same thing occurred.

How do you troubleshoot such an issue?

Well, firstly, with a bluescreen you get a memory dump (unless you’ve turned it off), and this can be analysed using Microsoft's debugging tools (dumpchk or WinDbg). If these aren’t installed, you need to grab them from the TechNet page and also install the symbols.

In this case, as the SSD is SATA, we checked its SMART data for errors, and there were none. However, googling around found an article: “Corrects a condition where an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time. The drive will recover after a power cycle, however, this failure will repeat once per hour after reaching this point. The condition will allow the end user to successfully update firmware, and poses no risk to user or system data stored on the drive.”

WOW. We updated the firmware on the SSD and, hey presto, everything is working again.
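
If you want to check whether a drive is close to that 5184-hour mark (216 days of power-on time), here is a rough Python sketch. It assumes you have the attribute table that smartmontools prints with `smartctl -A` saved as text; the post above doesn't say which SMART tool we used, so treat that as my assumption. The helper names `power_on_hours` and `past_limit` are made up for illustration, and the sample row is invented, not taken from our server.

```python
import re
from typing import Optional

# Threshold quoted in the firmware release note above.
POWER_ON_HOURS_LIMIT = 5184


def power_on_hours(smartctl_output: str) -> Optional[int]:
    """Return the raw Power_On_Hours value (SMART attribute 9), or None if absent.

    Assumes the column layout of `smartctl -A`, where the raw value is
    the tenth whitespace-separated field of each attribute row.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "9" and fields[1] == "Power_On_Hours":
            # Some drives append extra text to the raw value; keep the leading digits.
            match = re.match(r"\d+", fields[9])
            if match:
                return int(match.group())
    return None


def past_limit(hours: Optional[int], limit: int = POWER_ON_HOURS_LIMIT) -> bool:
    """True once the drive has reached the limit (treating 'reached' as affected is an assumption)."""
    return hours is not None and hours >= limit


# Invented sample row in smartctl's attribute-table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       5200
"""

if __name__ == "__main__":
    hours = power_on_hours(SAMPLE)
    print(hours, past_limit(hours))
```

A check like this only tells you the drive is old enough to hit the bug; whether it is actually affected depends on the firmware version it is running.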