Automated Hardware Problem Reporting or “Remote Monitoring” Applications: Which Is Better for Preventative Enterprise IT Hardware Support?

By Craig Wilson, Top Gun

Preventative Hardware Support Services are essential – and expected – in today’s mission critical computing environments.

Original Equipment Manufacturers (OEMs) and successful IT hardware maintenance providers rely on the predictive failure information provided by Automatic Hardware Problem Reporting (aka “Call-Home”), which identifies potential problems before performance and/or uptime is compromised.

Some independent hardware maintenance providers and Managed Service Providers persuade their clients to purchase “Remote Monitoring” applications, claiming that these tools are essential to receiving preventative hardware support. This blog explains why the OEM’s built-in hardware fault detection routines provide the best fault detection and notification for IT hardware support providers that want to responsibly deliver a truly preventative level of support.

For starters, the manufacturers of mission critical computer, storage and networking equipment intentionally design and develop automated Reliability, Availability and Serviceability (RAS) capabilities that predictively detect errors and proactively prevent critical outages by performing recovery routines.

  • Reliability designs of the hardware and firmware include extensive self-checking capabilities that can recognize patterns of errors and predict hardware failures before they occur.
  • Availability designs allow recovery by automatically replacing a failed hardware component with a spare component (aka Dynamic Sparing). These routines can occur in a manner that is transparent to the host’s operating system.
  • Serviceability designs provide a well-defined hardware problem report that describes why the fault occurred, including the location and part number of the failed hardware.
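To make the three RAS roles above concrete, here is a minimal sketch in Python. It is not any OEM’s firmware logic: the class names, the error threshold, the time window and the report fields are all hypothetical, chosen only to show how a pattern of correctable errors could trigger dynamic sparing and produce a problem report that names the location and part number.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemReport:
    """Hypothetical problem report; real OEM report formats differ."""
    location: str
    part_number: str
    reason: str
    spared_to: Optional[str]


class ComponentMonitor:
    """Tracks correctable errors for one component within a sliding time window."""

    def __init__(self, location, part_number, max_errors=3, window_s=60.0):
        self.location = location
        self.part_number = part_number
        self.max_errors = max_errors  # assumed threshold, illustrative only
        self.window_s = window_s      # assumed window length in seconds
        self._events = deque()

    def record_error(self, timestamp, spares):
        """Record one correctable error; return a report if a failure is predicted."""
        self._events.append(timestamp)
        # Discard errors that have aged out of the window.
        while self._events and timestamp - self._events[0] > self.window_s:
            self._events.popleft()
        if len(self._events) < self.max_errors:
            return None
        # Error pattern recognized: switch to a spare if one exists (dynamic sparing).
        spare = spares.pop(0) if spares else None
        self._events.clear()
        return ProblemReport(
            location=self.location,
            part_number=self.part_number,
            reason=f"{self.max_errors} correctable errors within {self.window_s:.0f}s",
            spared_to=spare,
        )


if __name__ == "__main__":
    spares = ["SPARE-1"]                             # hypothetical spare identifier
    monitor = ComponentMonitor("Slot A3", "PN-0000")  # hypothetical location and part
    for t in (0.0, 10.0, 20.0):
        report = monitor.record_error(t, spares)
    print(report)
```

The point of the sketch is that the report is generated by the component that observed the errors, so it can carry the exact location and part number rather than a symptom.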

The RAS designs of a machine include a service layer interface that is segregated from the customer’s operational management interface. The service interface carries out the predictive error detection and recovery routines, usually in a manner that is transparent to the customer’s operations. Each such event causes the service interface to generate and transmit a hardware problem report informing the support organization of the incident.

A major security benefit of “Call-Home” reporting is that all communications are initiated by the machine, within the customer’s environment, and are transmitted out to the support organization. No communications are initiated by the support organization into the customer’s environment or to the machine itself.
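The outbound-only direction can be sketched as follows. This is an illustration under stated assumptions, not a description of any vendor’s Call-Home transport: the endpoint URL is a placeholder, and the use of an HTTPS POST with a JSON body is assumed purely for the example.

```python
import json
import urllib.request

# Hypothetical support endpoint; real Call-Home destinations are vendor-specific.
SUPPORT_URL = "https://support.example.com/call-home"


def build_call_home_request(report: dict, url: str = SUPPORT_URL) -> urllib.request.Request:
    """Build an outbound HTTPS POST that carries a problem report as JSON."""
    body = json.dumps(report).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_call_home(report: dict, url: str = SUPPORT_URL, timeout: float = 10.0) -> int:
    """Transmit the report. The machine opens the connection; nothing listens for inbound requests."""
    request = build_call_home_request(report, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Note that the sketch contains no server or listening socket on the machine side. In a design like this, a firewall only needs to permit outbound traffic to the support endpoint.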

(Note: The phrase “Call-Home” originated with legacy machines that used a telephone line and modem to dial out and send “Call-Home” alerts to a support organization. Modern day machines use highly secure and state-of-the-art technology to transmit the more modern Call-Home alerts.)

This type of monitoring is radically different from the “Remote Monitoring” management solutions (BMC, SolarWinds, LogicMonitor, etc.) that are installed on a Data Center’s “In Band” network.

These management applications are designed to query the operating environments to capture performance data about systems, operating systems and the client’s applications. The performance data must then be analyzed to determine if defined thresholds have been exceeded and if a generic alarm should be posted.
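The threshold check described above might look like the following sketch. The metric names, host name and threshold values are invented for illustration and do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """Generic alarm: it names a metric and a value, not a hardware component."""
    host: str
    metric: str
    value: float
    threshold: float


def evaluate(host, samples, thresholds):
    """Return one generic alarm per polled metric that exceeds its defined threshold."""
    alarms = []
    for metric, value in samples.items():
        limit = thresholds.get(metric)
        if limit is not None and value > limit:
            alarms.append(Alarm(host, metric, value, limit))
    return alarms


if __name__ == "__main__":
    polled = {"cpu_pct": 97.0, "disk_latency_ms": 4.0}       # hypothetical samples
    limits = {"cpu_pct": 90.0, "disk_latency_ms": 20.0}      # hypothetical thresholds
    for alarm in evaluate("db01", polled, limits):
        print(alarm)
```

The alarm produced here reports that a measurement crossed a line. It carries no location or part number, because the polled performance data does not contain that information.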

In stark contrast to a machine’s automated hardware problem RAS design, remote monitoring applications cannot identify the specific hardware component that has faulted and subsequently caused dynamic sparing routines to be invoked.

Remote Monitoring management solutions depend on several types of technology to capture performance data. These technologies use numerous network ports, and some require credentialed access to the machine:

  • Network performance tools, such as IP Service Level Agreement (IPSLA) and NetFlow
  • Deep Packet Inspection (DPI)
  • Internet Control Message Protocol (ICMP) aka “ping”
  • Windows Management Instrumentation (WMI) polling
  • Simple Network Management Protocol (SNMP) “traps” and “gets”
  • Vendor‐specific techniques that take advantage of a machine’s API capabilities.

The captured data needs to be analyzed, sometimes using additional software to process the voluminous amounts of data. For each type of Remote Monitoring product and for each environment there is a need to perform “tuning” to define thresholds that are used as criteria to determine when an alarm should be posted. The “tuning” process should also strive to make the alarm, and its accompanying message, as descriptive and informative as possible.
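As one illustration of what “tuning” can involve, the sketch below derives a threshold from baseline samples and builds a descriptive alarm message. The mean-plus-k-standard-deviations rule is an assumption chosen for the example; monitoring products offer their own tuning methods, and the function names here are hypothetical.

```python
import statistics


def tune_threshold(baseline, k=3.0):
    """Derive an alarm threshold as the baseline mean plus k sample standard deviations."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    return statistics.mean(baseline) + k * statistics.stdev(baseline)


def describe_alarm(host, metric, value, threshold):
    """Build an alarm message that states what was measured and which limit it crossed."""
    return f"{host}: {metric} = {value:.1f} exceeds tuned threshold {threshold:.1f}"


if __name__ == "__main__":
    # Hypothetical baseline of disk latency readings in milliseconds.
    baseline = [4.1, 3.9, 4.3, 4.0, 4.2]
    limit = tune_threshold(baseline)
    print(describe_alarm("db01", "disk_latency_ms", 9.5, limit))
```

Every environment has a different baseline, which is why this step has to be repeated per product and per environment.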

Be aware that these Remote Monitoring tools cannot identify the specific hardware that the machine’s service interface has predictively identified and proactively recovered, so they cannot provide “preventative” hardware alerts.

At Top Gun, we enable clients to remain on their existing remote monitoring and management toolsets. Not only does this approach avoid unnecessary costs, it also prevents network intrusion associated with installing external applications into a network. We welcome clients to eBond their ServiceNow systems with our ServiceNow incident management system.

Before agreeing to purchase or install a hardware maintenance provider’s remote monitoring application, be sure to precisely understand why they need additional tools to provide proactive hardware fault identification and root-cause determination.

Top Gun provides independent hardware maintenance for data center servers, storage hardware, networking equipment and software (OS). Your inquiries are welcomed on our contact page.

Blog Author Details

Craig Wilson

EVP, Engineering
