Incidents are bound to happen from time to time, even with the most careful development efforts, whether due to a vendor infrastructure change or an overlooked element of the design. When an incident occurs, it is essential that your IT team finds the root cause and identifies (and successfully establishes) mitigation measures so that similar incidents do not recur.

No matter whether your investigating staff is on the front lines in support of customers or on your development team, your IT professionals will benefit greatly from a critical analysis approach to the diagnosis of any identified IT service management incidents. 

One strategy to preemptively diagnose and mitigate possible downtime incidents is to use a change advisory board (CAB) and a request for change (RFC) process.

Preemptive diagnosis of requests by the CAB

Service Management incident diagnosis is a strategy that can be adapted for use by Change Advisory Boards for preemptive diagnosis of planned requests for change (RFCs). By critically analyzing any request sent from a vendor to the CAB, possible problems due to changes requested can be more readily identified and averted.

By applying the methods for diagnosing IT service management incidents described below, a CAB can scrutinize every request in a timely manner and have the best opportunity to identify possible problems.

Established methods for diagnosing IT service incidents (and for preemptive review of RFCs by a CAB) include the approaches described below. An example of each approach’s application is also included.

The Richard Feynman Approach

This is a simple, straightforward process with only three steps, but it offers little structure for critical analysis. For simple service incidents, the Richard Feynman approach can help IT professionals reach a diagnosis quickly, leading to faster resolution times and improved service availability.

  • Write down the problem.
  • Think very hard.
  • Write down the answer.

Example application of the Richard Feynman approach

Here is an example of how the Richard Feynman approach could be applied in an IT service incident:

Problem: A web application is experiencing frequent crashes.

  • Step 1: Write down the problem – The problem is that our web application is experiencing frequent crashes.
  • Step 2: Think very hard – Consider the various factors that could be contributing to the crashes. This could include issues with the application code, server configuration, hardware failures, network issues, etc.
  • Step 3: Write down the answer – After considering all possible factors, it is determined that the crashes are due to a bug in the application code that is causing a memory leak.

While this approach may be effective for simple issues, more complex incidents may require a more structured and detailed methodology.

Timeline Analysis

This approach involves listing everything that happened in time order and then looking for patterns or anomalies that may have contributed to the incident.

Example application of the Timeline Analysis approach

Here is an example of how timeline analysis could be applied in an IT service incident:

Problem: A company’s e-commerce website is experiencing slow page load times.

  • Step 1: Collect timeline data – Collect a timeline of events leading up to the incident, including any recent changes made to the website, server logs, and user feedback.
  • Step 2: Analyze the timeline – Review the timeline data and look for patterns or anomalies that may have contributed to the slow page load times. This could include changes to the website code, server configurations, network issues, or third-party integrations.
  • Step 3: Identify the root cause – Based on the timeline analysis, it is determined that the slow page load times are due to a recent change in the website’s code that is causing excessive database queries.
  • Step 4: Implement a solution – Once the root cause has been identified, appropriate measures can be taken to mitigate the issue, such as optimizing the database queries or rolling back the code change.
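
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed tool: the events, timestamps, source labels, and function names are all hypothetical, and the heuristic used here (flag the most recent change recorded before the first symptom) is only one simple way to look for a pattern in a timeline.

```python
from datetime import datetime

# Hypothetical events gathered in Step 1, in the order they were collected.
events = [
    ("2024-03-04 11:20", "monitoring", "p95 page load time rises above 4 s"),
    ("2024-03-04 09:15", "deploy", "release with new recommendations widget"),
    ("2024-03-03 22:00", "infra", "routine certificate renewal"),
    ("2024-03-04 11:45", "support", "first customer complaint about slow pages"),
    ("2024-03-04 10:05", "monitoring", "database queries per request triple"),
]

CHANGE_SOURCES = {"deploy", "infra"}          # sources that record a change
SYMPTOM_SOURCES = {"monitoring", "support"}   # sources that record a symptom


def build_timeline(raw_events):
    """Parse the timestamps and return the events in chronological order."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M"), source, text)
        for ts, source, text in raw_events
    ]
    return sorted(parsed, key=lambda event: event[0])


def last_change_before_first_symptom(timeline):
    """Return the most recent change that precedes the first symptom, or None."""
    first_symptom = next((e for e in timeline if e[1] in SYMPTOM_SOURCES), None)
    if first_symptom is None:
        return None
    changes = [
        e for e in timeline
        if e[1] in CHANGE_SOURCES and e[0] < first_symptom[0]
    ]
    return changes[-1] if changes else None


timeline = build_timeline(events)
for when, source, text in timeline:
    print(f"{when:%Y-%m-%d %H:%M}  {source:<10}  {text}")

suspect = last_change_before_first_symptom(timeline)
if suspect is not None:
    print("Investigate first:", suspect[2])
```

With this sample data the suspect is the 09:15 deploy, which precedes the 10:05 jump in database queries. The nearest preceding change is a lead to verify, not a confirmed root cause.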

Kepner-Tregoe Problem Solving

This approach is a structured problem-solving methodology in which you break the problem down and define it across several dimensions: what, where, when, extent, and what has not failed. This ensures that all relevant information is gathered and analyzed before a solution is developed.

Example application of the Kepner-Tregoe Problem Solving approach

Here is an example of how the KT problem-solving approach could be applied in an IT service incident:

Problem: A company’s email server is experiencing intermittent outages, resulting in email delays and missed messages.

  • Step 1: Define the problem – The problem is that the email server is experiencing intermittent outages.
  • Step 2: Describe the problem – Describe the problem in more detail by answering the following questions:

      • What is the problem? The email server is experiencing intermittent outages.
      • Where is the problem occurring? The problem is occurring on the company’s email server.
      • When is the problem occurring? The problem is occurring intermittently.
      • To what extent is the problem occurring? The problem is resulting in email delays and missed messages.
      • What has not failed? Other systems and services are working properly.

  • Step 3: Identify possible causes – Based on the information gathered in step 2, possible causes of the email server outages could include hardware failures, software bugs, network issues, or configuration problems.
  • Step 4: Test possible causes – Test each possible cause to determine if it is the root cause of the problem. This could involve running diagnostic tests, reviewing server logs, or conducting interviews with system administrators.
  • Step 5: Verify the root cause – Based on the results of the testing, it is determined that the root cause of the email server outages is a software bug in the email server software.
  • Step 6: Implement a solution – Once the root cause has been identified, appropriate measures can be taken to mitigate the issue, such as installing a software patch or upgrading to a newer version of the email server software.
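
As a rough sketch of Steps 2 to 5, the problem description can be captured as a small record and each possible cause checked against it. Everything below is illustrative: the class name, the check names, and especially the True/False judgments are assumptions that the investigating team would replace with results from its own diagnostic tests and log reviews.

```python
from dataclasses import dataclass


@dataclass
class ProblemSpec:
    """The five descriptive dimensions from Step 2."""
    what: str
    where: str
    when: str
    extent: str
    not_failed: str


spec = ProblemSpec(
    what="Email server is experiencing intermittent outages",
    where="Company email server",
    when="Intermittently",
    extent="Email delays and missed messages",
    not_failed="Other systems and services are working properly",
)

# For each possible cause, record whether it is consistent with the description.
# These values are placeholder judgments, not facts about any real system.
candidate_causes = {
    "network issue": {
        "explains_intermittent_pattern": True,
        "consistent_with_what_has_not_failed": False,  # would affect other services too
    },
    "hardware failure": {
        "explains_intermittent_pattern": False,
        "consistent_with_what_has_not_failed": True,
    },
    "email software bug": {
        "explains_intermittent_pattern": True,
        "consistent_with_what_has_not_failed": True,
    },
}


def surviving_causes(causes):
    """Keep only the causes that are consistent with every check."""
    return [name for name, checks in causes.items() if all(checks.values())]


for dimension, answer in vars(spec).items():
    print(f"{dimension:>10}: {answer}")

print("Causes still standing:", surviving_causes(candidate_causes))
```

The "what has not failed" dimension does much of the work here: a cause that would also have broken other services is ruled out because those services are still working properly.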

Ishikawa Fishbone Diagram

This is a process in which a problem is broken down into potential causes, the causes are grouped into categories, and each category is then explored further for additional potential causes. These diagrams also make great documentation for future troubleshooting.

Example application of the Ishikawa Fishbone Diagram approach

Here is an example of how the Ishikawa fishbone diagram approach could be applied in an IT service incident:

Problem: A company’s website is experiencing slow loading times.

  • Step 1: Identify the problem – The problem is that the company’s website is experiencing slow loading times.
  • Step 2: Create a fishbone diagram – Draw a horizontal line across the page. Label the horizontal line “Problem.”
  • Step 3: Identify categories – Identify the categories that could be contributing to the problem, such as people, processes, equipment, environment, or software. Write each category as a branch off the horizontal line.
  • Step 4: Identify potential causes – Under each category, brainstorm potential causes that could be contributing to the problem. For example, under the “software” category, potential causes could include outdated software, incorrect server settings, or poorly written code. Write each cause as a smaller branch off its category’s branch line.
  • Step 5: Analyze potential causes – Once all potential causes have been identified, analyze each one to determine its likelihood of contributing to the problem. This could involve gathering additional data or conducting further research.
  • Step 6: Identify root cause – Based on the analysis, it is determined that the root cause of the slow website loading times is outdated software on the company’s server.
  • Step 7: Implement a solution – Once the root cause has been identified, appropriate measures can be taken to mitigate the issue, such as updating the software or upgrading the server.

Figure: Example Ishikawa fishbone diagram
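
A fishbone diagram can also be kept as plain data, which makes it easy to store alongside other troubleshooting documentation. The sketch below is only an illustration: the likelihood scores are placeholder estimates of the kind a team might assign in Step 5, the causes outside the “software” category are invented for the example, and the function names are hypothetical.

```python
problem = "Company website is experiencing slow loading times"

# category -> {potential cause: estimated likelihood from 0.0 to 1.0}
# All scores are placeholder estimates, not measurements.
fishbone = {
    "Software": {
        "Outdated software": 0.6,
        "Incorrect server settings": 0.3,
        "Poorly written code": 0.4,
    },
    "Equipment": {
        "Undersized server": 0.2,
    },
    "Processes": {
        "No performance testing before release": 0.3,
    },
}


def render(problem, fishbone):
    """Return the diagram as an indented text outline, likeliest causes first."""
    lines = [f"Problem: {problem}"]
    for category, causes in fishbone.items():
        lines.append(f"  {category}")
        ranked = sorted(causes.items(), key=lambda item: item[1], reverse=True)
        for cause, likelihood in ranked:
            lines.append(f"    - {cause} (likelihood {likelihood:.1f})")
    return "\n".join(lines)


def most_likely_cause(fishbone):
    """Return (category, cause) for the highest-scoring potential cause."""
    ranked = [
        (likelihood, category, cause)
        for category, causes in fishbone.items()
        for cause, likelihood in causes.items()
    ]
    _, category, cause = max(ranked)
    return category, cause


print(render(problem, fishbone))
print("Analyze first:", most_likely_cause(fishbone))
```

The scores only set the order in which causes are analyzed. The root cause still has to be confirmed with additional data, as described in Step 5.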

Knowledge-Centered Support

This methodology creates a repository of information for any topics, insights, or resolutions to problems that an information technology professional may need to know so that the knowledge is readily available (and organized) when a new problem arises. The Knowledge-Centered Support (KCS) approach promotes a culture of continuous improvement and provides a mechanism for identifying knowledge gaps where additional documentation is needed, ensuring that the knowledge base stays up-to-date and relevant.

Each time an incident is resolved, the IT professional creates a knowledge article that documents the problem, the solution, and any other relevant information. This article is then added to the knowledge base, making it available for future reference.

Example application of the Knowledge-Centered Support approach

Here’s an example of how knowledge-centered support might be used to diagnose an IT service incident:

Problem: A company’s help desk receives multiple tickets from users who are unable to access a particular application.

Solution: Using KCS, the IT professionals would document the incident and the steps they took to resolve it in a knowledge article. They might include information such as:

  • The symptoms reported by users
  • The steps taken to diagnose the problem
  • The root cause of the problem (e.g., a server misconfiguration)
  • The steps taken to fix the problem
  • Any additional information that might be useful for future reference

This knowledge article would then be added to the company’s knowledge base, making it available for future reference by IT professionals who encounter similar incidents. Over time, the knowledge base grows, becoming a valuable resource for incident diagnosis and resolution.
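
The article structure described above maps naturally onto a small record type. The following is a minimal sketch, assuming an in-memory knowledge base with naive keyword matching. The class names, field names, and sample article text are hypothetical, and a real knowledge base would typically use a ticketing tool or a proper search index.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeArticle:
    """One knowledge article, with the fields listed in the example above."""
    title: str
    symptoms: list
    diagnosis_steps: list
    root_cause: str
    fix_steps: list
    notes: str = ""


class KnowledgeBase:
    """A toy in-memory knowledge base with keyword search."""

    def __init__(self):
        self.articles = []

    def add(self, article):
        self.articles.append(article)

    def search(self, query):
        """Return articles whose title or symptoms contain query words, best match first."""
        words = set(query.lower().split())
        scored = []
        for article in self.articles:
            text = " ".join([article.title, *article.symptoms]).lower()
            score = sum(1 for word in words if word in text)
            if score:
                scored.append((score, article))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [article for _, article in scored]


kb = KnowledgeBase()
kb.add(KnowledgeArticle(
    title="Users cannot access the reporting application",
    symptoms=["Application returns an error after login"],
    diagnosis_steps=["Reproduce the error", "Review the server configuration"],
    root_cause="Server misconfiguration",
    fix_steps=["Correct the configuration", "Restart the service"],
))

for match in kb.search("cannot access application"):
    print(match.title, "->", match.root_cause)
```

When a similar ticket arrives later, a search on the reported symptoms surfaces the earlier article together with its root cause and fix.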

Swarming

Of the approaches described here, this methodology and mindset is the one most likely to improve your CAB’s performance. Swarming brings in any team member who may have relevant knowledge or skills to solve a problem collaboratively. Although swarming occupies more team members at once, it often yields quick, thoroughly considered results.

By bringing together a cross-functional team, IT professionals are able to identify the root cause of a problem more quickly than if they work individually. Additionally, the collaborative nature of the swarming approach ensures that all team members have a shared understanding of the problem and are working together towards a common goal.

Example application of the Swarming approach

Here’s an example of how swarming might be used to diagnose an IT service incident:

Problem: Suppose a company’s e-commerce website is experiencing slow performance, and customers are reporting that it takes a long time to complete transactions. The company’s IT team receives multiple tickets related to the issue, and they quickly realize that it’s a critical incident that needs to be resolved as soon as possible.

Solution: The IT team forms a swarming team consisting of network engineers, database administrators, software developers, and other relevant stakeholders. The team members are brought together in a virtual war room to diagnose the problem collaboratively.

Using real-time monitoring tools and log analysis, the team quickly identifies that the issue is related to a database query that is taking a long time to complete. The database administrator is able to identify the root cause of the problem, which is a poorly optimized database query that is slowing down the entire system.

The team works together to develop a plan to optimize the query, and the software developers implement the changes. The team then tests the system to ensure that the issue has been resolved, and the e-commerce website is restored to normal performance.

Standard+Case

This approach brings a cross-disciplinary mindset to problem-solving. It recognizes that routine problem-solving should follow a defined, linear process, but that techniques from other professional fields may need to be integrated to resolve a more complex incident.

In this approach, incidents are categorized as either “standard” or “case” depending on their complexity and uniqueness. Standard incidents are those that can be resolved using pre-established procedures, while case incidents are those that require a more creative and customized approach.

Once an incident is categorized, the appropriate approach is taken to resolve it. For standard incidents, the established procedures are followed to ensure a timely resolution. For case incidents, a more creative and customized approach is taken to address the unique circumstances of the incident.

Example application of the Standard+Case approach

Problem: A helpdesk receives a call from a user who is unable to log in to their computer.

Solution: If the cause of the problem is straightforward, such as an expired password or network issue, it would be classified as a standard incident and the established procedures would be followed to resolve it. However, if the cause of the problem is more complex, such as a malware infection or a hardware failure, it would be classified as a case incident and a more customized approach would be taken to resolve it.

For a case incident, the helpdesk technician would use a combination of problem-solving techniques, such as the Kepner-Tregoe problem-solving method or the Ishikawa fishbone diagram approach, to identify the root cause of the problem and develop a customized solution.
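
The triage decision in this example can be sketched as a simple lookup: an incident is treated as standard when a pre-established procedure exists for its cause, and as a case otherwise. The procedure catalogue, its wording, and the function name below are all hypothetical and only illustrate the idea.

```python
# Hypothetical catalogue of pre-established procedures, keyed by known cause.
STANDARD_PROCEDURES = {
    "expired password": "Reset the password using the standard reset procedure.",
    "network issue": "Work through the network connectivity checklist.",
}

CASE_GUIDANCE = (
    "Assign an owner and investigate with a structured method, "
    "such as Kepner-Tregoe or a fishbone diagram."
)


def triage(cause):
    """Classify an incident as 'standard' or 'case' and return the next action."""
    procedure = STANDARD_PROCEDURES.get(cause.strip().lower())
    if procedure is not None:
        return "standard", procedure
    return "case", CASE_GUIDANCE


for cause in ["Expired password", "Malware infection", "Hardware failure"]:
    category, action = triage(cause)
    print(f"{cause}: {category} -> {action}")
```

In practice the cause is often unknown when the call arrives, so an incident may start as standard and be reclassified as a case once the established procedure fails to resolve it.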

Conclusion

Incidents are inevitable in IT service management, so it is important to have a structured approach to diagnosing and mitigating them. IT professionals benefit greatly from a critical analysis approach to incident diagnosis, and continually evaluating and improving the diagnosis process helps prevent future incidents and keeps IT service management efficient.