Managing an SOA environment requires a unified view of all levels of the system components. As mentioned before, the way to ensure a unified management view across all layers is to create a Management Dashboard, Centralized Logging Repository, and Single Sign-On Capability. These components ultimately rely on the introduction of probes to monitor all resources and to track the flow of services across the SOA system.
Unfortunately, the software utility industry has yet to catch up with the overall SOA management demands. After all, it took decades for the systems management suite to evolve around the mainframe model, and the integrated management view required by modern SOA systems is still evolving. This does not mean that you should take the “see no evil, hear no evil” view of the NASA administrator who ignored the request to check out the Shuttle. It simply means that you should endeavor to create the needed probes and components that will give you a minimum of capability in this area.
Notice from the diagram the suggestion that you manage security at each layer of the management stack. Security is not a layer, but rather an attribute of each layer.
Insofar as the entirety of the management cycle is concerned, you should have capabilities for:
·Continuously monitoring the overall health of your system, with the ability to be notified on a trigger basis of events demanding immediate attention.
·Providing the ability to direct specific diagnostic checks to any component or layer in your system on an on-demand basis.
·Maintaining a comprehensive logging repository for all events and traffic taking place in your system. Clearly, this repository could grow to prohibitive levels, but you should at least have the ability to keep a solid log of all messages and events in your system for a period of time, with appropriate summary analytics for the log events that might have to be discarded.
Ideally there should be a unified view where alerts from one layer can be correlated to alerts from another. For this to occur, you will need a canonical way to represent all alerts and events of the various system layers. Unfortunately, chances are that you will have to deal with the formats and interfaces provided by the vendor of choice for each specific layer. If you can afford it, you could add an additional integration component that normalizes the various formats and events around a canonical form that can be used for future analytics.
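To make the idea of a canonical form concrete, here is a minimal sketch of such a normalizing component. Everything in it is an assumption made for illustration: the `CanonicalAlert` fields, the two vendor payload layouts, and the use of a correlation id to tie layers together. Real vendor formats will differ, and your canonical form may need more attributes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CanonicalAlert:
    """Hypothetical canonical alert shared by every layer."""
    layer: str            # e.g. "network", "application"
    source: str           # component that raised the alert
    severity: int         # 1 (informational) to 5 (critical)
    message: str
    timestamp: datetime
    correlation_id: str   # ties together alerts raised by the same service flow


def from_network_vendor(raw: dict) -> CanonicalAlert:
    """Adapter for an invented network-monitor payload."""
    severity_map = {"info": 1, "warning": 3, "critical": 5}
    return CanonicalAlert(
        layer="network",
        source=raw["device"],
        severity=severity_map[raw["level"]],
        message=raw["text"],
        timestamp=datetime.fromtimestamp(raw["epoch"], tz=timezone.utc),
        correlation_id=raw.get("txn", ""),
    )


def from_app_vendor(raw: dict) -> CanonicalAlert:
    """Adapter for an invented application-monitor payload."""
    return CanonicalAlert(
        layer="application",
        source=raw["service"],
        severity=int(raw["sev"]),
        message=raw["msg"],
        timestamp=datetime.fromisoformat(raw["time"]),
        correlation_id=raw.get("correlationId", ""),
    )


def correlate(alerts):
    """Group alerts by correlation id, keeping ids seen in more than one layer."""
    groups = {}
    for alert in alerts:
        if alert.correlation_id:
            groups.setdefault(alert.correlation_id, []).append(alert)
    return {
        cid: group
        for cid, group in groups.items()
        if len({a.layer for a in group}) > 1
    }
```

Once every layer's events pass through an adapter like these, the dashboard and the analytics only need to understand one format, and adding a new vendor means writing one more adapter.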
As for the system probes measuring performance within each component, you should ensure that these monitors never add more than a few percentage points of overhead to the system (<5% is as high as it should be, in my opinion). Ideally, you will have the option of heightening or lowering the degree of monitoring, depending upon circumstances. You can have low level monitoring for steady-state operations and more intrusive monitoring for those cases where more detailed diagnosis is needed.
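As a rough illustration of a probe whose intrusiveness can be raised or lowered at run time, consider the sketch below. The `MonitorLevel` names and the `Probe` class are hypothetical, not taken from any product; the point is only that steady-state monitoring keeps cheap counters, while the diagnostic level records per-call detail.

```python
import time
from enum import IntEnum


class MonitorLevel(IntEnum):
    OFF = 0
    STEADY_STATE = 1   # call count and total latency only
    DIAGNOSTIC = 2     # also records each call's arguments and latency


class Probe:
    """Hypothetical probe that wraps a function and measures its calls."""

    def __init__(self, level=MonitorLevel.STEADY_STATE):
        self.level = level
        self.calls = 0
        self.total_seconds = 0.0
        self.trace = []

    def wrap(self, func):
        def monitored(*args, **kwargs):
            # The level is read on every call, so it can be changed live.
            if self.level == MonitorLevel.OFF:
                return func(*args, **kwargs)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                self.calls += 1
                self.total_seconds += elapsed
                if self.level >= MonitorLevel.DIAGNOSTIC:
                    self.trace.append((func.__name__, args, elapsed))
        return monitored
```

Because the level is a single attribute, it is easy to put under change management: raising it to `DIAGNOSTIC` and forgetting to lower it again is exactly the kind of slip that a controlled, audited setting helps prevent.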
Finally, make sure to exercise appropriate change management controls in configuring the monitoring and tracing levels. I can’t count the number of times I have witnessed failures caused by someone “forgetting” to remove diagnostic tools from a production system.
On February 1, 2003, the Columbia space shuttle disintegrated over Texas upon re-entry. The cause of the tragedy was the damage the wing sustained during liftoff. This had been a two-week space mission, mind you, and many at NASA had been aware for days of the potential problem after seeing videos of debris hitting the leading edge of the shuttle’s wing during liftoff. In fact, after receiving a request from the Debris Assessment Team (DAT) to have spy satellites take pictures of the shuttle as it circled the Earth, NASA’s Columbia Mission Management Team leader answered, “. . . this is not worth pursuing, for even if we see some damage to the shuttle, we can’t do anything about it.”
Just as with Apollo XIII, when NASA ingenuity and genius saved, against all odds, a mission in great peril, it is now believed that, had they been given the opportunity, NASA engineers could have come up with at least two strategies that would have saved the Columbia crew, if not the shuttle itself.
It’s only human nature to close one’s eyes when we believe we are powerless to rectify a problem. I’ve been there. There were times, after deployment of a complex system, when I wished I could simply close my eyes in order to avoid “finding” any problems. I suppose we never outgrow our instinct to “peek-a-boo” with reality, but alas, part of adulthood is the realization that reality often finds a way to bite us on the behind. Despite the unspoken desire to avoid looking at brewing problems, hoping they will go away (or at least pretending they don’t exist), it’s better to recognize that operating an SOA environment is, in fact, a more complex proposition than operating and managing a traditional mainframe-based environment. SOA demands our full attention and it necessitates the deployment of system and network management components to enable proactive identification and resolution of issues before it is too late to handle them with grace. Successful control of this environment requires that these concepts and tools be in place:
·Management and Monitoring at each level of the system stack
·Deployment of a centralized Logging Server
·Real-time operational dashboards.
It also must be said that none of the above would be useful without adequate planning of remediation strategies to deal with failure. These strategies must be part of the overall system organizational governance and will be covered later on when I discuss the administrative and management aspects related to managing the IT transformation. Next week, I’ll cover the Management and Monitoring components.
One of the most often heard critiques of SOA is that the overhead of SOAP/XML formats makes it intrinsically low performing. Yes, we all know that standards are often the result of consensus and aren’t always optimized, but SOA’s flexibility is needed to avoid recreating the monolithic “all-is-done-here” view of the older development culture. There is no doubt that SOA architectures can be affected by message transmission delays due to the larger message sizes that result from standardization and the overheads associated with modular designs.
So, how to solve this conundrum?
A common mistake is designing with the idea of avoiding these performance problems “from the start”. The outcome? Designs that are too monolithic, and that introduce inflexible interfaces with tightly coupled inter-process calls “in the interest of performance”. Talk about throwing out the baby with the bathwater! A better approach is to design for flexibility, as the SOA gods intended, but to introduce the safety valve of caching throughout the system. Caching is the technique used to preserve recently used information as close as possible to the user so that it can be accessed more rapidly by a subsequent caller. Think of caching as a series of release valves needed to ensure the flow of services occurs as pressure-free as possible from beginning to end.
The idea is to design a system that allows as many caching points as possible. This does not mean you will actually utilize all the caching points. Ironically, there is a performance penalty to caching and you should therefore make certain to follow these tenets when it comes to its use:
·Ensure that the caching logic operates asynchronously from the main execution path in order to avoid performance penalties due to the management of the cache.
·Ensure you use the appropriate caching strategy. There are several different strategies that apply to specific data dynamics. Should you clear the cache based on least-used, oldest, or most recently added criteria? Will you implement automatic cache space reclamation techniques (e.g. have a daemon periodically releasing cached elements in the background) or will you do so only when certain thresholds are crossed?
·The rules for caching should be flexible and controllable from a centralized management console. It is imperative to always have real-time visibility of the various cache dynamics and to be able to react appropriately to correct any anomalies. Use the recommended cache flag field in the message headers to give you more controlled granularity of these dynamics.
·Allow pre-loading of caching, or sufficient cache warm-up, prior to opening the applications to the full force of requests.
·Always remember that blindly caching items is not a magic bullet. The success of caching depends significantly on the items you cache. If the items change very frequently, you will have to update the cache frequently as well, and this overhead could offset any caching advantages.
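Several of these tenets can be sketched together in a few lines. What follows is a minimal illustration and not a production cache: the class name, the least-recently-used policy, and the `start_reclaimer` helper are my own assumptions. It shows warm-up before opening for traffic, eviction performed by a background thread rather than on the request path, and the hit and miss counters a dashboard would read.

```python
import threading
from collections import OrderedDict


class MonitoredLRUCache:
    """Hypothetical LRU cache with warm-up, background reclaim, and counters."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()   # least recently used entry comes first
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def warm_up(self, items):
        """Pre-load entries before the application opens to requests."""
        with self._lock:
            for key, value in items.items():
                self._data[key] = value
                self._data.move_to_end(key)

    def get(self, key, loader):
        with self._lock:
            if key in self._data:
                self.hits += 1
                self._data.move_to_end(key)
                return self._data[key]
            self.misses += 1
        value = loader(key)          # slow path runs outside the lock
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
        return value

    def reclaim(self):
        """Evict least recently used entries beyond max_entries."""
        with self._lock:
            evicted = 0
            while len(self._data) > self.max_entries:
                self._data.popitem(last=False)
                evicted += 1
            return evicted

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def start_reclaimer(cache, interval_seconds, stop_event):
    """Run reclaim() periodically off the main execution path."""
    def run():
        while not stop_event.wait(interval_seconds):
            cache.reclaim()
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

Note the trade-off in this design: because eviction happens in the background, the cache can briefly exceed `max_entries` between reclaim passes. That is the price of keeping cache housekeeping out of the caller's way.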
Even though there are vendor products that provide single-image views of distributed systems caching, I recommend using them only for well-defined server clusters and not broadly for the entire system. You will be better off designing custom-made caching strategies for each particular service call and data element in your solution. There are several caching expiration strategies, such as time-based expiration, size-based expiration (expiring the oldest x% of cache entries when a certain cache threshold is reached), and change-triggered cache updates using a publish/subscribe model.
Selecting the right expiration and refresh strategy is essential in ensuring the freshness of your data, high cache hit ratios (low hit ratios can make overall system performance suffer because of the overhead incurred in searching for a non-existing item in the cache), and avoidance of performance penalties due to cache management. Also, if you can preserve the cache in a non-volatile medium in order to permit rapid cache restore during a system start-up, then do so.
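The three expiration strategies just named can be sketched side by side. This is a simplified illustration under stated assumptions: `ExpiringCache`, `ChangeBus`, and their parameters are hypothetical names, the publish/subscribe part is an in-process stand-in for a real messaging system, and a change notification here simply drops the entry rather than refreshing it.

```python
import time


class ExpiringCache:
    """Hypothetical cache combining time-based, size-based, and
    change-triggered expiration."""

    def __init__(self, ttl_seconds, max_entries, evict_fraction=0.25,
                 clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self.evict_fraction = evict_fraction
        self._clock = clock
        self._entries = {}   # key -> (value, stored_at), oldest first

    def put(self, key, value):
        self._entries.pop(key, None)   # re-inserting makes the key newest
        self._entries[key] = (value, self._clock())
        if len(self._entries) > self.max_entries:
            self._evict_oldest()

    def get(self, key):
        """Return the cached value, or None if absent or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]     # time-based expiration
            return None
        return value

    def _evict_oldest(self):
        # Size-based expiration: drop the oldest x% of the entries.
        count = max(1, int(len(self._entries) * self.evict_fraction))
        for key in list(self._entries)[:count]:
            del self._entries[key]

    def on_change(self, key):
        # Change-triggered expiration: the source of truth changed.
        self._entries.pop(key, None)


class ChangeBus:
    """Minimal in-process stand-in for a publish/subscribe channel."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, key):
        for callback in self._subscribers:
            callback(key)
```

A cache would be wired to the bus with `bus.subscribe(cache.on_change)`, so that whichever service updates a data element publishes its key and every subscribed cache discards its stale copy.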
Clearly, choosing what data to cache is essential. Data that changes rapidly or whose precision is critical should not be cached (e.g. available product inventory should only be cached if the amount of product in the inventory is larger than the amount of the largest possible order). You’ll need to assess how fresh data must be, for any situation. The optimum strategy must be determined carefully via trial-and-error. You can also apply analytical methods such as simulation (see later) to better estimate the impact of any potential change to either the characteristics of the data being cached, or the preferred caching approach.
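The inventory example can be expressed as a small decision rule. This is only a sketch of the rule of thumb above: the function name, the dictionary used as a cache, and the `fetch_live_quantity` callback are all hypothetical, and the rule still assumes the cached quantity is reasonably fresh.

```python
def is_available(product_id, requested, cache, fetch_live_quantity,
                 largest_possible_order):
    """Return (available, source), where source is "cache" or "live".

    The cached quantity is trusted only while it exceeds the largest
    possible order; otherwise the live inventory is consulted.
    """
    if requested > largest_possible_order:
        raise ValueError("requested quantity exceeds the largest possible order")
    cached = cache.get(product_id)
    if cached is not None and cached > largest_possible_order:
        return True, "cache"
    live = fetch_live_quantity(product_id)
    cache[product_id] = live   # refresh the cached figure
    return live >= requested, "live"
```

With 500 units cached and a largest possible order of 100, any valid request is answered from the cache; once the cached figure drops to 100 or below, every check goes to the live inventory, which is where precision starts to matter.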
Finally, I can’t emphasize enough the need for accurate caching monitoring via use of real-time dashboards. These dashboards are a core component of the infrastructure needed to properly manage a complex SOA system. More on Managing SOA next.
This blog covers the practical techniques, trials and tribulations associated with the transformation of IT systems from legacy technologies to systems using SOA and modern open systems. I also include the occasional interlude with rants about technology in general.
About Me
Name: Israel del Rio
Location: Atlanta, GA, United States
Israel has been recognized by Computerworld as one of their Premier 100 IT honorees. Israel is a business and technology leader who has contributed the technology vision as key strategist and designer behind the enterprise technology roadmaps of large hospitality and travel companies. Israel has also developed and deployed various mission-critical systems, and in the process he has been instrumental in creating and building effective and skilled development organizations.