A troubleshooter's job doesn't start after a problem occurs; preparations should be made well in advance. Troubleshooters often see a problem (for example, the system crashes) and start conducting long and complex analyses that can take days or even weeks. However, prudent system administrators or troubleshooters start planning long before a problem occurs. In other words, they prepare the environment so that troubleshooting can be done quickly and effectively if and when problems occur. The following key strategies help prevent problems from occurring with IBM Business Process Manager V8.5.
Monitoring tools and a plan are needed to effectively detect problems or anomalies when they emerge. Monitoring is a trade-off. You want to detect important events, and yet not adversely impact the normal operation of the system. Monitoring is an entire technical area in itself, different from problem determination. This paper covers only a few points on this topic.
Passive monitoring can be done at all levels: network, operating system, application server, and application. The main system log files of dependent systems such as databases and LDAP directories can also be monitored for errors and events. For example, you might detect application server restarts that indicate the server is failing.
Some tools for passive monitoring include the Tivoli Performance Viewer and IBM Tivoli Composite Application Manager for Application Diagnostics.
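Passive log monitoring can also be scripted directly. The following Python sketch counts error- and warning-flagged entries in a WebSphere-style system log; the single-letter severity marker convention is an assumption to adapt to your own log format:

```python
import re

# Hypothetical sketch: scan SystemOut.log-style lines for severe entries.
# WebSphere-style logs flag entries with a standalone "E" (error) or
# "W" (warning) column; adjust the pattern for your environment.
SEVERE_MARKERS = re.compile(r"\b([EW]) ")

def scan_log_lines(lines):
    """Return (errors, warnings) counts found in an iterable of log lines."""
    errors = warnings = 0
    for line in lines:
        match = SEVERE_MARKERS.search(line)
        if match:
            if match.group(1) == "E":
                errors += 1
            else:
                warnings += 1
    return errors, warnings
```

A monitoring job might run this periodically over new log lines and raise an alert when the error count crosses a threshold.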
Active monitoring goes beyond passive monitoring—you periodically test the operation of the entire system from end to end. One technique is to ping system components, such as one server or one database connection. Another technique is end-to-end pinging: you periodically send an entire "dummy" transaction through the system and verify that it completes. Some tools for active monitoring include IBM Tivoli Composite Application Manager for Transactions and web-based, load-generating programs like Rational Performance Tester.
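An end-to-end "dummy transaction" probe of the kind described above can be sketched as follows. The health URL and the expected response marker are assumptions; substitute your own synthetic transaction:

```python
import time
from urllib.request import urlopen

# Hypothetical end-to-end probe: send one synthetic request through the
# system and verify that it completes. The default URL and the expected
# marker string are placeholders, not real endpoints.
def probe(url="http://localhost:9080/app/health", expect=b"OK",
          timeout=10, fetch=None):
    """Send one synthetic request; return (healthy, elapsed_seconds)."""
    fetch = fetch or (lambda u: urlopen(u, timeout=timeout).read())
    start = time.monotonic()
    try:
        body = fetch(url)
        healthy = expect in body
    except OSError:
        healthy = False
    return healthy, time.monotonic() - start
```

Recording the elapsed time on every probe also gives you a response-time trend, which is useful later when establishing baselines.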
Make sure that you:
Examples of ongoing system "health" monitoring include:
Be prepared to actively generate more diagnostic tests when a problem occurs. In addition to dealing with diagnostic artifacts that are present when an incident occurs, your troubleshooting plan should consider any additional explicit actions to take as soon as an incident is detected. You want these actions to take place before the data disappears or the system is restarted.
Here are some examples of explicit actions to generate more diagnostics:
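One such action, capturing JVM thread dumps before anyone restarts the server, could be sketched as follows. On IBM JVMs, SIGQUIT (`kill -3`) writes a javacore file to the profile directory; the PID list here is an assumption to be supplied by your own process-discovery tooling:

```python
import signal
import subprocess

# Hypothetical sketch: when an incident is detected, capture thread dumps
# from the affected JVMs *before* the system is restarted and the data is
# lost. SIGQUIT prompts an IBM JVM to write a javacore file.
def javacore_command(pid):
    """Build the command that requests one thread dump from a JVM."""
    return ["kill", "-%d" % int(signal.SIGQUIT), str(pid)]

def request_javacores(pids, run=subprocess.run):
    """Request a thread dump from each JVM; returns the commands issued."""
    commands = [javacore_command(pid) for pid in pids]
    for cmd in commands:
        run(cmd, check=False)  # tolerate processes that have already died
    return commands
```

In practice you would take several dumps spaced a few seconds apart, so that hung or looping threads can be compared across snapshots.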
When investigating a problem, you need information about the system, and especially about how the system works when there is no problem. This information should be collected in advance. You should formalize this information by:
First, create a system architecture or topology diagram that shows all system components and the main flows between them. The system architecture diagram can help speed the troubleshooting process by providing information for the following activities:
When creating an architecture diagram, make sure that you:
Second, establish a system baseline and gather extensive information about the state of the system at a time when the system is operating normally.
Some questions to answer when creating system baselines:
Here are some examples of information that you might collect:
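As a minimal illustration, the following Python sketch captures a host-level baseline snapshot. It covers only generic operating-system facts; in a real baseline you would extend it with JVM heap settings, thread pool sizes, database connection counts, and typical response times for your environment:

```python
import json
import os
import platform
import shutil
import time

# Minimal baseline-snapshot sketch; the fields collected here are
# placeholders to be extended with environment-specific measurements.
def baseline_snapshot(paths=("/",)):
    return {
        "taken_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "host": platform.node(),
        "os": platform.platform(),
        "cpu_count": os.cpu_count(),
        "disk": {p: shutil.disk_usage(p)._asdict() for p in paths},
    }

def save_baseline(snap, path="baseline.json"):
    """Persist the snapshot so it can be diffed against future snapshots."""
    with open(path, "w") as f:
        json.dump(snap, f, indent=2, default=str)
```

Collecting such snapshots on a schedule makes it easy to diff "normal" against "abnormal" when a problem occurs.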
In many actual systems, it is not unusual to see various benign "errors" during normal operation. Learn to recognize these benign errors, or better yet, eliminate as many of them from the implementation of the system as possible.
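One lightweight way to act on this advice is to keep an explicit allow-list of message IDs that baselining has shown to be benign, so that routine "errors" do not mask real ones. The IDs below are placeholders, not a recommendation of which WebSphere messages are safe to ignore:

```python
# Sketch: an allow-list of message IDs classified as benign during
# baselining. The two IDs here are hypothetical examples only.
KNOWN_BENIGN = {"CWSIA0067E", "SRVE0190E"}

def significant_errors(message_ids):
    """Drop message IDs already classified as benign for this system."""
    return [m for m in message_ids if m not in KNOWN_BENIGN]
```

Keeping the allow-list under version control also documents, for the whole team, which errors have been investigated and judged harmless.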
Lastly, keeping a rigorous log of all changes that are applied to the system over time can help you determine system differences. When a problem occurs, you can look back through the log for any recent changes that possibly contributed to the problem. You can also map these changes to the various baselines that were collected in the past to ascertain how to interpret differences in these baselines.
Your change log should at least track all software upgrades and fixes that are applied in every software component in the system, including both infrastructure products and application code. It should track every configuration change in any component. It should also track any known changes in the pattern of usage of the system, such as expected increases in load, or a different mix of operations that users invoke.
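A change log is most useful when it is machine-readable and append-only. The following sketch records each change as one JSON line; the field names are assumptions, and the point is simply that every fix, configuration change, and usage-pattern change lands in one timestamped place:

```python
import json
import time

# Hypothetical append-only change log: one JSON object per line, so the
# file can be grepped, diffed, and correlated with baseline snapshots.
def record_change(entry, log_path="change-log.jsonl", clock=time.time):
    """Append one change record, stamped with the current time."""
    entry = dict(entry, recorded_at=clock())
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

For example, `record_change({"component": "app server", "change": "applied fix pack 8.5.0.1"})` leaves a permanent, timestamped trail that can later be mapped against the baselines collected before and after the change.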
In a complex IT environment where many teams contribute to different parts of the environment, the task of maintaining an accurate, up-to-date, and global change log can be surprisingly difficult.
Tools and techniques can help with this task, ranging from simple data collection scripts that take regular snapshots of the configuration to sophisticated system management utilities.
It is important to know and understand the concept of change control; keeping a change log has value well beyond the troubleshooting arena. Change control is also considered one of the key good practices for managing complex systems to prevent problems, as opposed to troubleshooting them.
When a problem occurs, there is often confusion, along with great pressure to restore the system to normal operation quickly, which can cause mistakes that lead to unnecessary delays. It is critical to have an action plan and to make sure that everyone is aware of it. As with monitoring, gathering data is a trade-off: you want to capture as much data as possible without adversely affecting the normal operation of the system. You also want to make sure that you can reliably capture the diagnostic data that you need. Ask yourself:
The simplest diagnostic collection plan is in the form of plain, written documentation that lists all the detailed manual steps that must be taken.
To be more effective, try to automate as much of this plan as possible. Provide one or more command scripts that can be invoked to do a complex set of actions, or use more sophisticated system management tools. The various collector tools and scripts now offered as part of IBM Support Assistant can provide a good framework for you to start automating many diagnostic collection tasks.
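A first step toward such automation might look like the sketch below, which copies the log files and directories you care about into one timestamped bundle. The source paths are assumptions; list your actual profile log locations:

```python
import os
import shutil
import time

# Hypothetical one-shot collection script: gather the named logs and
# directories into a single timestamped bundle for later analysis.
def collect_diagnostics(sources, dest_root="diag"):
    bundle = os.path.join(dest_root, time.strftime("%Y%m%d-%H%M%S"))
    os.makedirs(bundle, exist_ok=True)
    collected = []
    for src in sources:
        if os.path.exists(src):
            target = os.path.join(bundle, os.path.basename(src.rstrip("/")))
            if os.path.isdir(src):
                shutil.copytree(src, target)
            else:
                shutil.copy2(src, target)
            collected.append(target)
    return bundle, collected
```

Because the script tolerates missing paths, the same collection plan can be reused across nodes whose layouts differ slightly.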
Also, do not forget the human element:
The following are some tools to consider when you plan for and implement diagnostic data collection.
After a problem occurs, consider how much time is available to look into the problem before you must provide relief to affected users. It is a good practice to create a relief or recovery plan to help you restore function to users. The relief or recovery plan lays out general steps that you take to restore functions, and actions to take for specific problems.
For the relief or recovery plan:
Applying regular maintenance (interim fixes, fix packs) reduces the probability and impact of problems. In addition to regular scheduled maintenance, you must sometimes perform emergency changes or maintenance in response to a newly diagnosed problem. The emergency maintenance plan outlines how to do so safely and effectively.
Maintenance should occur at all levels: at the operating system level, on the application server, and on each of the products that are involved in the system. You can track current maintenance levels in the topology diagram. You should also have processes for verifying the maintenance levels regularly.
Lastly, consider upgrading to the latest available fix pack during an investigation. Individual fixes are meant to be temporary until a fix pack is available. There is considerable risk in using too many individual fixes because it is not possible to test all the possible interactions among individual fixes. Because of the complexity of the system and difficulty of reproducing problems and gathering diagnostic information, it is not always practical to determine exactly which fix (APAR) resolved a particular situation. Use Fix Central to download fixes: http://www.ibm.com/support/fixcentral
It is a good practice to create a "connectivity group" to represent the possible request sources for the system. A connectivity group is a specific pattern of behavior that is found in an SCA module. The connectivity group does not contain stateful component types such as long-running business processes and business state machines. These connectivity groups provide encapsulation and isolation of a specific endpoint's integration requirements. WebSphere Enterprise Service Bus mediation modules are commonly used for this purpose because they are a convenient way to implement infrastructure-related tasks.
The concept of connectivity groups also provides a convenient way to acquiesce the system in case there is a need for recovery. Since a connectivity group module is stateless, the module can be temporarily stopped, thus cutting off the inbound flow of new events. After the system is recovered and able to process new work, these modules can be restarted.
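Stopping and restarting these modules is typically scripted through wsadmin. The sketch below only builds the wsadmin command line rather than running it; the module name, profile path, and the use of `AdminControl.invoke` with `stopApplication` on an ApplicationManager MBean are assumptions to adapt to your cell:

```python
# Hedged sketch: quiesce a connectivity-group (mediation) module by
# stopping its application through wsadmin. All names and paths below
# are placeholders for your own environment.
def stop_module_command(app_manager_mbean, module_name,
                        wsadmin="/opt/IBM/BPM/profiles/Node1/bin/wsadmin.sh"):
    """Build (but do not run) a wsadmin command to stop one module."""
    jython = "AdminControl.invoke('%s', 'stopApplication', '%s')" % (
        app_manager_mbean, module_name)
    return [wsadmin, "-lang", "jython", "-c", jython]
```

A recovery runbook could loop over all connectivity-group modules with this helper to cut off inbound work, then issue the matching `startApplication` calls once the system is ready to process new events again.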
Good application design takes advantage of the error-handling and fault-processing capabilities from IBM Business Process Manager and WebSphere Enterprise Service Bus.
It is, therefore, necessary for solution architects to understand how IBM Business Process Manager and WebSphere Enterprise Service Bus represent declared and undeclared exceptions before they can create a comprehensive error-handling strategy.
The SCA programming model provides two types of exceptions: service business exceptions, which represent the faults declared on a service interface, and service runtime exceptions, which represent undeclared (system) exceptions.
The architecture team must understand the error-handling and recovery tools and capabilities of the product. This team is responsible for creating the error-handling strategy for the project and must account for the following items:
In addition to this list, the architecture team must create design patterns in which built-in recovery capabilities, such as the IBM Business Process Manager failed event manager, are used appropriately.
The best tool for problem prevention in production is the execution of a comprehensive functional and system test plan. In general, tests for deployed solutions can be broken into two groups:
A major challenge of problem determination is dealing with unanticipated problems. It is much like detective work: finding clues, making educated guesses, and verifying suspicions. An ideal strategy for problem prevention is to monitor the system regularly. Use the strategies outlined in this paper to minimize downtime and detective work so you can maximize performance.
The contents within this paper are derived from the "IBM Business Process Manager v8.5 Problem Determination" courseware developed by the WebSphere Education group, part of International Business Machines Corporation.