LoneWolf

Dell PERC - RAID monitoring - My take

So, as I've been developing setups for our Dell servers (and appreciating everyone's offerings), there is one thing I wanted: for myself or anyone else to be able to maintain what we build. I'm a big fan of borrowing or copying from others, as long as neither I nor my successor has to rely on that someone for support down the road, so the knowledge needs to translate to myself and everyone else working with Automate. In some cases, I start on the ground floor; @dpltadmin's and @timwiser's setups gave me a foundation for my ideas, so thank you for that.

I started out with the basic premise that Dell OpenManage provides everything I need, and that we have a requirement that it must be on every physical server or Hyper-V host. You can even install it on VMware ESXi, but that monitoring works a little differently, so I'm not going to cover that here. This means I can create a search for all Dell physical servers, then add them to a custom group. Here's my quick-and-dirty autojoin search.

image.png.5ff1395fd46747f435d2fd9f4c4a4e45.png

Next, from that group, I can make up for what I consider an Automate weakness: it puts all Event Log monitors in a single EV-Blacklisted setup that does nothing to account for the fact that Microsoft's Critical events are often far from critical. Also, I can make the monitors remote if I wish, which lets me make them far more sensitive to problems than internal monitors. As OpenManage generates Windows Event IDs that are easy to act on, I went through their documentation and snagged all of the storage events I feel are important.

Link to Dell OpenManage Event IDs documentation (PDF) for v8.5

Here are the remote monitors I added to my server group based on this documentation.

image.thumb.png.d8a2c02962349f7bc98c54d13559bdf2.png

Several of the above monitors, like Unexpected Sense, Command Timeout on Physical Disk, or the SMART categories, are canaries in the coal mine (possible issues but not yet failure), whereas others are quite obvious. We direct these via different categories to ConnectWise, with the high-priority ones going to a Triage board that our dispatcher can see.

In part two (second post), I'll discuss and provide the scripts I created for use in conjunction with these monitors.

Edited by LoneWolf

Part two - I have added a folder of scripts I created for Dell RAID management purposes. All of these scripts require OpenManage to be installed, and are intended to be run on physical servers (including Hyper-V hosts, just not VM guests, which do not identify as Dell). This ZIP contains the following scripts:

Copy of Dell - Identify Problematic Server Drives - This script identifies Seagate drives of a very, very problematic Cheetah model series, specifically what my organization knows as "The Dreaded ES66 firmware drive". Dell firmware updates to ES68 make them run a bit better, but these drives are extremely failure prone compared to other drives we have seen. If drives are found meeting the script's criteria, a ticket is opened.

Copy of Dell - Identify RAID Controllers and VDisks - This script is to identify up to two RAID controllers in a Dell server (supposing you have more controllers, you could modify the script), and all of the virtual disks on those controllers.  The output is reported by log outputs in the script tile of Automate, but you could conceivably redirect these to text files and upload them to your Automate server.  I generally use it in order to run the next script, see below.

Copy of Dell - PERC RAID Consistency Check - This script will run a consistency check on the virtual disk ID of the Controller ID you specify on the server.  To get these variables, run the previously listed script, Dell - Identify RAID Controllers and VDisks. Note that for consistency checks, the output is generated as a start Event ID and a finish Event ID. These IDs are 2058 and 2085 (don't remember which is which, but the order is probably correct), unless the consistency check throws an error, in which case the event ID would be monitored by the monitors I already created.  If desired, you could create individual remote or internal event ID monitors for those events just to notify you of its start and/or completion.

Copy of Dell - OpenManage Check - While I do not have this script in production yet, I have tested it and so far it works well.  It is a check for OpenManage being present on a physical server.  If OpenManage is missing, if it is a 32-bit version running on a 64-bit server, or its version is less than the current 8.5.0, it will open a ticket.  It also generates a text output it keeps on the server, so if you schedule the script to run at intervals, and it finds that text file, it will be able to read that and skip a portion of the script to speed it up, knowing it has run in the past.

Copy of Dell - PERC Controller Log Upload to Automate Server - What do you do when a drive goes Failure or Predictive Failure on you? You open a case with Dell Support to replace the drive. Dell often requires the Controller Logs, which OpenManage generates in the system's C:\Windows directory, and I always provide them in case Dell notices that a second drive in an array looks dodgy when we don't see it. Sure, you could do that manually, or you could use this script to do it for you; it then uploads the logs to your Automate Server. You can pull them before you even enter support chat, or even link this script to one of your monitors so that it runs for you during the ticket creation process. You will need to specify the Controller ID number though, so if you want to automate it and you have more than a single RAID controller, you may need to make modifications.

Modify as necessary, but please keep credit in the documentation to the original scripter. I hope y'all find them useful.

Dell RAID Management Scripts.zip

Edited by LoneWolf

I am about to submit a modified list of remote monitors which adds onto the originals. One thing I'd like to ask throughout this project: does anyone know whether, when building a LabTech remote monitor, you can set the monitor for more than one event in the event log via comma-separated values? Some events are similar, so doubling them up would make sense, and I'd rather keep the number of monitors down. In the long run, I may convert some of these remote monitors to internal, at least the lower significance ones predicting failure.

EDIT: Automate support let me know I cannot monitor more than one Event ID per monitor.

Edited by LoneWolf


Okay. While waiting to find out whether I can monitor multiple Event IDs with a single monitor (I've submitted a support ticket for this burning question), here is my updated list of monitored events. I highly recommend reading the Dell OpenManage PDF I linked earlier; it's incredibly insightful. You could add monitors for temperature sensing or for other hardware (power supplies, memory, etc.) as you see fit. For now, storage is our number one git-'er-done fix, so I have started there. It is also likely that you could build internal monitors for some of the less critical events. I'm really hoping Automate support tells me I can list events in monitors the same way you can when filtering Microsoft Eventlog events, with commas and dashes, so I can recommend combining certain related monitors to lower ticketing and monitor overhead.

Also, Dell OpenManage is now up to version 9.1.0.  Depending on how new you want yours to be, you may need to modify the OpenManage check for versions; my script reports 8.5 and lower as out of date, but I plan to test 9.1.0 soon to see if there are any improvements or fixes we need.
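For anyone adjusting the version threshold, the comparison itself can be sketched as follows. This is a minimal Python illustration, not the logic of the posted Automate script; the function names and the threshold argument are hypothetical, and it assumes plain dotted numeric versions such as 8.5.0 or 9.1.0.

```python
def parse_version(text):
    """Turn a dotted version string such as '8.5.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_out_of_date(installed, minimum):
    """Return True when the installed version is older than the minimum.

    Tuples are padded with zeros so that '9.1' and '9.1.0' compare as equal.
    """
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


if __name__ == "__main__":
    # Hypothetical threshold of 9.1.0, chosen only for the demonstration.
    for version in ("8.4.0", "8.5.0", "9.1.0"):
        print(version, is_out_of_date(version, "9.1.0"))
```

Comparing tuples of integers rather than raw strings matters once a major version reaches two digits: compared as strings, "10.0.0" sorts before "9.1.0".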

Dell-Monitor-Update.PNG

Edited by LoneWolf


Automate has let me know that it is one Event ID per monitor at this time. I plan to submit a feature request to change this, but I don't see it happening any time soon.

The monitoring above has already paid off for us.  We are catching drive failures sooner than we previously did, and I call that a win.

I am working on a script to install OpenManage to add to the scripts I have already posted; it's all set except for one PowerShell command that is not being parsed properly through my script. It works every time in PowerShell, but fails when executed as "Powershell Command", "Powershell Command as Admin", or "Execute Script" in Automate. Until I have this sorted (that command is used to determine parameters to uninstall 32-bit OpenManage from a 64-bit system), I cannot release the script. I've opened a ticket with Automate to see if they're willing to take a look at this.


Have you considered monitoring it through PowerShell in a remote monitor?

Get-EventLog -LogName Application -After (Get-Date).AddDays(-1) | Where {($_.EventID -eq 'X' -or $_.EventID -eq 'Y')}

 

And then running a script that takes the appropriate action based on what events have been detected?
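The idea in this suggestion (filter the last day of events against a set of IDs, then pick an action per ID) can be sketched in Python as below. It is only an illustration of the logic over in-memory records, not an Automate or PowerShell implementation; the event IDs 1001 and 1002 stand in for X and Y, and the action strings, dictionary keys, and function names are hypothetical.

```python
from datetime import datetime, timedelta

# Placeholder event IDs standing in for X and Y; not real OpenManage IDs.
ACTIONS = {
    1001: "action for X",
    1002: "action for Y",
}


def recent_matches(events, now, window=timedelta(days=1)):
    """Return events newer than now - window whose ID is in ACTIONS.

    Each event is a dict with 'id' (int) and 'time' (datetime) keys.
    """
    cutoff = now - window
    return [e for e in events if e["time"] > cutoff and e["id"] in ACTIONS]


def plan_actions(events, now):
    """Return one action per distinct matching event ID, in sorted ID order."""
    seen = {e["id"] for e in recent_matches(events, now)}
    return [ACTIONS[event_id] for event_id in sorted(seen)]


if __name__ == "__main__":
    now = datetime(2018, 7, 1, 12, 0)
    sample = [
        {"id": 1001, "time": now - timedelta(hours=2)},
        {"id": 1001, "time": now - timedelta(hours=1)},
        {"id": 1002, "time": now - timedelta(days=3)},
        {"id": 4000, "time": now - timedelta(minutes=5)},
    ]
    print(plan_actions(sample, now))
```

Collapsing repeated events to one action per ID keeps a noisy disk from triggering the same follow-up several times in one run.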

Edited by Klaymore
Clarification of Logic


@LoneWolf I set up a group with remote monitors; however, we are not receiving any alerts, and the monitor is not failing when the event is generated. Do you know how far back the monitor searches in the event log for the codes?


Fantastic work @LoneWolf - thank you for taking the time to document this so well and share it out! Does anyone know of a similar setup for HP servers? Our Dell servers are now well covered, especially in monitoring for hard drive/RAID issues, but I'm still struggling to find a good way to monitor drives and RAID issues within HP servers.


This is the Internal Monitor we created to filter out only the specific Dell OpenManage Event IDs we wanted to report on. The Event IDs listed in the Result field were in the Dell PDF linked to in the first post, so you can choose which ones you want. I just set it up, so it may need a bit of tweaking yet, but it seems to do the trick so far.

-------

Table to Check: eventlogs

Field to Check: EventID

Check Condition: InSet

Result: (1004,1005,1100,1554,1700,2048,2050,2051,2056,2057,2076,2094,2107,2147,2163,2188,2273)

Identity Field: substr(Concat(eventlogs.`TimeGen`,': ', Replace(Replace(eventlogs.`message`,'\'', ''), '\n', '')),1,97) AS loggedEvent

Additional Field: eventlogs.Source IN ('Server Administrator', 'DELL Open Manage Server', 'Storage Administrator') AND Computers.LastContact > DATE_ADD(NOW(), INTERVAL -15 MINUTE)
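Read as plain logic, this monitor appears to select events whose ID is in the Result set, whose source is one of the three listed names, and whose agent checked in within the last 15 minutes, then label each hit with a cleaned-up, truncated message. A minimal Python sketch of that reading follows; the dictionary keys (event_id, source, message, time_gen) and function names are hypothetical stand-ins for the eventlogs columns, and this is not how Automate evaluates the monitor internally.

```python
from datetime import datetime, timedelta

# Values copied from the monitor definition above.
EVENT_IDS = {1004, 1005, 1100, 1554, 1700, 2048, 2050, 2051, 2056, 2057,
             2076, 2094, 2107, 2147, 2163, 2188, 2273}
SOURCES = {"Server Administrator", "DELL Open Manage Server",
           "Storage Administrator"}


def matches(event, last_contact, now):
    """True when the event ID and source are both listed and the agent
    checked in within the last 15 minutes."""
    return (
        event["event_id"] in EVENT_IDS
        and event["source"] in SOURCES
        and last_contact > now - timedelta(minutes=15)
    )


def identity(event):
    """Mimic the Identity Field: timestamp, then the message with single
    quotes and newlines removed, truncated to 97 characters."""
    cleaned = event["message"].replace("'", "").replace("\n", "")
    return f"{event['time_gen']}: {cleaned}"[:97]
```

The source test is a set membership check combined with the contact-time test by a plain "and", so the 15-minute filter applies to every source rather than only the last one named.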

 

 


@LoneWolf

Kinda curious we just had a couple tickets in our queue this morning of the following.

SERVERREDACTED at REDACTED, location REDACTED has drives in its RAID array that are of a type that are known to fail early. This ticket has been opened to address the issue with Dell. The Physical Disk Report and the controller logs have been uploaded to the LabTech Server. Physical Disk Info: C:\LTShare\Uploads\REDACTED\REDACTED\storagelog.txt on the LabTech Server Controller Logs: C:\LTShare\Uploads\REDACTED\REDACTED\RAIDLogs.txt on the LabTech Server

Can you shed some light on this?  I'm assuming this is likely from one of your scripts.

I'm assuming this is the script doing it

 

"Copy of Dell - Identify Problematic Server Drives - This script identifies Seagate drives of a very, very problematic Cheetah model series, specifically what my organization knows as "The Dreaded ES66 firmware drive". Dell firmware updates to ES68 make them run a bit better, but these drives are extremely failure prone compared to other drives we have seen. If drives are found meeting the script's criteria, a ticket is opened."

 

What action does your org normally take on these? Based on your notes, I'm assuming you update the firmware to ES68. Can this be done remotely, or should we be onsite for it?

Edited by RobM


Here's another take on monitoring it: https://www.cyberdrain.com/blog-series-monitoring-using-powershell-part-two-using-powershell-to-monitor-dell-systems/

We generally don't use event logs for monitoring anything as it doesn't auto-heal. Also, we don't want an alert flood for a single condition. For us we just group all our Dell checks into a single PowerShell script that runs as a monitor.
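The single-monitor approach described here can be sketched as below: each run evaluates the current state of every check and folds the results into one status, so one condition yields one alert and the monitor clears on its own once the checks pass again. This is a hedged Python illustration, not the code from the linked blog post; the check names and their canned results are hypothetical.

```python
def run_checks(checks):
    """Run every check and fold the results into one (ok, message) pair.

    `checks` maps a check name to a zero-argument callable that returns
    None when healthy or a short problem description otherwise.
    """
    problems = []
    for name, check in checks.items():
        result = check()
        if result is not None:
            problems.append(f"{name}: {result}")
    if problems:
        return False, "; ".join(problems)
    return True, "all checks healthy"


if __name__ == "__main__":
    # Hypothetical checks with canned results, for illustration only.
    demo = {
        "physical disks": lambda: None,
        "virtual disks": lambda: "degraded",
        "controller battery": lambda: None,
    }
    print(run_checks(demo))
```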

On 7/16/2018 at 12:47 PM, RobM said:

What action does your org normally take on these? Based on your notes, I'm assuming you update the firmware to ES68. Can this be done remotely, or should we be onsite for it?

This can be done remotely and while the server is online; it does not require a restart. It's also recommended that you perform any ESM firmware updates, as sometimes these (the backplane or iDRAC) will take note of these Seagate drives and change the fan profile to improve their cooling.

 

The Seagate Cheetah 300/600GB SAS drives were particularly problematic.  They run very hot and have a very accelerated death rate.  The ES68 firmware can help this, but if you've gone a long time without it, expect the drives to go predictive failure at some point.  I've had 3-4x the failure rate of other SAS drives with this family.  That monitor is meant to identify that type of drive for your remediation.

On 7/18/2018 at 4:46 PM, j0dan said:

Here's another take on monitoring it: https://www.cyberdrain.com/blog-series-monitoring-using-powershell-part-two-using-powershell-to-monitor-dell-systems/

We generally don't use event logs for monitoring anything as it doesn't auto-heal. Also, we don't want an alert flood for a single condition. For us we just group all our Dell checks into a single PowerShell script that runs as a monitor.

I'll look into that.

In these cases, the alerts I've set up are problems that likely won't auto-heal anyway, as they're indicative of drive failures.  They won't heal without actual action taken on our part.

I've not had an alert flood with this setup.  However, for conditions that can auto-heal, you have very good points.  Thank you.
