IT Transformation with SOA

Friday, March 19, 2010

SOA Systems Management


On February 1, 2003, the space shuttle Columbia disintegrated over Texas during re-entry. The cause of the tragedy was damage the wing had sustained during liftoff. This had been a two-week space mission, mind you, and many at NASA had been aware for days of the potential problem after seeing videos of debris hitting the leading edge of the shuttle’s wing during liftoff. In fact, after receiving a request from the Debris Assessment Team (DAT) to have spy satellites take pictures of the shuttle as it circled the Earth, NASA’s Columbia Mission Management Team leader answered, “. . . this is not worth pursuing, for even if we see some damage to the shuttle, we can’t do anything about it.”
Just as with Apollo XIII, when NASA ingenuity and genius saved, against all odds, a mission in great peril, it is now believed that, had they been given the opportunity, NASA engineers could have come up with at least two strategies that would have saved the Columbia crew, if not the shuttle itself.
It’s only human nature to close our eyes when we believe we are powerless to rectify a problem. I’ve been there. There were times, after deploying a complex system, when I wished I could simply close my eyes to avoid “finding” any problems. I suppose we never outgrow our instinct to play peek-a-boo with reality, but alas, part of adulthood is the realization that reality often finds a way to bite us on the behind. Despite the unspoken desire to avoid looking at brewing problems, hoping they will go away (or at least pretending they don’t exist), it’s better to recognize that operating an SOA environment is, in fact, a more complex proposition than operating and managing a traditional mainframe-based environment. SOA demands our full attention, and it requires deploying system and network management components that make it possible to identify and resolve issues proactively, before it is too late to handle them with grace. Successful control of this environment requires that these concepts and tools be in place:
·         Management and monitoring at each level of the system stack
·         Deployment of a centralized logging server
·         Real-time operational dashboards
It must also be said that none of the above would be useful without adequately planned remediation strategies for dealing with failure. These strategies must be part of the overall organizational governance of the system, and I will cover them later when I discuss the administrative and management aspects of the IT transformation. Next week, I’ll cover the Management and Monitoring components.


Friday, March 5, 2010

System Availability & Reliability


The ability to provide services from any number of loosely coupled servers is essential to deploying redundant systems, and redundancy is the key to continued availability. Traditional mainframe environments were designed to be monolithic and had fewer components. Following the precept that if you put all your eggs in one basket, you had better make sure it’s a good basket, mainframe systems were designed to be extremely reliable and to come with high-availability features built in. SOA systems, on the other hand, tend to include more moving parts, and more moving parts mean a higher probability of failure. These moving parts may also be components that were not engineered or manufactured with the same high level of quality control applied to the more expensive mainframe. No use debating it: out of the box, most mainframe systems deliver far higher availability levels.
SOA must overcome these inherent availability issues. The way to do so is to introduce redundant elements, usually via clustering. To take full advantage of the clustering capabilities provided by application server vendors, you should reduce the number of state-dependent services. This reduction facilitates the logical decoupling that allows you to design a very resilient system consisting of active-active components in each layer of the stack, from dual communication links, to redundant routers and switches, to clustered servers and redundant databases.
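To illustrate why stateless services make active-active clustering straightforward, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: `make_node`, `ActiveActiveCluster`, and the node names do not come from any particular application server. Because each node's reply depends only on the request, any healthy node can take over for a failed one.

```python
from itertools import cycle


def make_node(name, healthy=True):
    """Hypothetical stateless service node: the reply depends only on the request."""
    def handle(request):
        if not healthy:
            raise ConnectionError(f"{name} is down")
        return f"{name} handled {request}"
    return handle


class ActiveActiveCluster:
    """Illustrative round-robin dispatcher over interchangeable nodes."""

    def __init__(self, nodes):
        self._count = len(nodes)
        self._ring = cycle(nodes)

    def call(self, request):
        # Try each node at most once, starting with the next one in rotation.
        for _ in range(self._count):
            node = next(self._ring)
            try:
                return node(request)
            except ConnectionError:
                continue  # this node failed; fall through to the next one
        raise ConnectionError("all nodes failed")


cluster = ActiveActiveCluster([make_node("node-a", healthy=False), make_node("node-b")])
print(cluster.call("order-42"))  # prints "node-b handled order-42"
```

If the nodes kept per-client session state in memory, the retry on another node would not be safe without first replicating that state, which is the coupling that reducing state-dependent services avoids.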
In the diagram below, a sample mainframe system has, for the sake of discussion, 90% availability. (Mainframe systems usually have much higher availability ratings; I am using this number to simplify the following calculations.)

Now, let’s say that you deploy a two-component SOA environment with each component giving 90% availability. . .

In this latter SOA system, you should expect the overall system availability to be no greater than 0.90 * 0.90 = 0.81! That is, by virtue of having added another component to the flow, you have gone from 90% availability to 81%. The reason is that the two components are in series, and both have to be functional for the system to operate. In SOA you must compensate by adding fallback components:
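The arithmetic can be sketched in a few lines of Python. This is only a sketch under two simplifying assumptions of mine: component failures are independent, and a fallback takes over instantly and flawlessly. The function names are hypothetical. Under those assumptions, a chain of components is available only when every link is up, while a redundant group is unavailable only when every member is down.

```python
from math import prod


def series_availability(availabilities):
    """All components must be up: multiply the individual availabilities."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Only one component must be up: 1 minus the chance that all are down.

    Assumes independent failures and perfect failover (illustrative only).
    """
    return 1 - prod(1 - a for a in availabilities)


# Two 90% components in series, as in the example above.
two_in_series = series_availability([0.90, 0.90])

# One stage built from two redundant 90% components.
redundant_stage = parallel_availability([0.90, 0.90])

# Two such redundant stages chained in series.
two_redundant_stages = series_availability([redundant_stage, redundant_stage])

print(f"{two_in_series:.4f}")         # 0.8100
print(f"{redundant_stage:.4f}")       # 0.9900
print(f"{two_redundant_stages:.4f}")  # 0.9801
```

With one fallback per component, the two-stage flow rises from 81% to roughly 98%, which is higher than the 90% of the single component we started with. Real clusters share failure modes (power, network, software bugs), so treat these figures as an upper bound.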