The Data Sentinel is not unlike the grumpy bureaucrat processing your driver’s license application forms. After ensuring that you comply with what’s sure to be a ridiculously complicated list of required documents, it isolates you from directly accessing the files in the back.
While you, the applicant, the supplicant, cannot go around the counter and check the content of your files directly (not legally, anyway), the DMV supervisor in the back office is able to directly access any of the office files. After all, the supervisor is authorized to bypass the system processes intended to limit direct access to the data. Direct supervisory access to data is one of the exceptions to the data visibility constraints mentioned earlier.
Next is the case of ETL (Extract, Transform, Load) jobs over large sets of data, as well as reporting on that data. These cases require batch-level access in order to process or convert millions of data records, and they can wreck performance if carelessly implemented. Reporting jobs should ideally run against offline replicated databases, not the online production databases. Better yet is to plan for a proper Data Warehousing strategy that allows you to run business intelligence processes independently of the main Operational Data Store (ODS). Nevertheless, on occasion, you will need to run summary reports or data-intensive real-time processes against the production database. When the reporting tool is allowed to access the database directly, bypassing the service layer provided by the Data Sentinel, you will need to ensure that this access is well-behaved, that it runs as a low-priority process, and that it runs under restricted user privileges. The same control is required for the ETL processes. Operationally, you should always schedule batch-intensive processes for off-peak times such as nightly runs.
A third potential cause for exception to data visibility is implied by the use of off-the-shelf transaction monitors, requiring direct access to the databases in order to implement the ACID logic discussed earlier.
A fourth exception is demanded by the need to execute large data matching processes. If there is an interactive need to run a process against a large data set with matching keys in a separate database (“for all customers with sales greater than an $X amount, apply a promotion flag equal to the percentage corresponding to the customer’s geographic location in the promotion database”), then it makes no sense to try to implement each step via discrete services. Such an approach would be extremely contrived and inefficient. Instead, a Table-Joiner super-service will be required. More on that next.
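To make the promotion example concrete, here is a minimal sketch of the set-based alternative, written in Python with SQLite purely for illustration. The table names, column names, and the `apply_promotions` function are hypothetical, and two in-memory databases stand in for the operational and promotion databases. It is not the Table-Joiner design itself, only the kind of single joined operation that such a super-service would wrap.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative only.
conn = sqlite3.connect(":memory:")                    # stands in for the operational database
conn.execute("ATTACH DATABASE ':memory:' AS promo")   # stands in for the promotion database

conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT, sales REAL, promo_pct REAL);
    CREATE TABLE promo.promotions (region TEXT PRIMARY KEY, pct REAL);
    INSERT INTO customers (id, region, sales) VALUES
        (1, 'EAST', 500.0), (2, 'WEST', 1500.0), (3, 'EAST', 2500.0);
    INSERT INTO promo.promotions VALUES ('EAST', 10.0), ('WEST', 5.0);
""")

def apply_promotions(conn, min_sales):
    """Flag every qualifying customer in one set-based statement,
    instead of issuing one discrete service call per customer."""
    cur = conn.execute(
        """
        UPDATE customers
           SET promo_pct = (SELECT p.pct
                              FROM promo.promotions AS p
                             WHERE p.region = customers.region)
         WHERE sales > ?
        """,
        (min_sales,),
    )
    conn.commit()
    return cur.rowcount  # number of customers updated

updated = apply_promotions(conn, 1000.0)
flags = dict(conn.execute("SELECT id, promo_pct FROM customers"))
```

The database engine does the matching across both stores in a single pass; doing the same through discrete services would mean one lookup and one update call per qualifying customer.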
Managing SOA complexity brings up the question of session state. By ‘state’ I mean all the required information that must be maintained and stored across the series of interactions needed to complete a full business exchange. Maintaining the state of a service interaction implies remembering what stage the conversing partners have reached and the working data in effect. It will often be at your discretion whether to design services that depend more or less on state information. At other times the problem at hand will force a specific avenue. In either case, you should remember this simple formula: State-Keeping = Complexity.
Maintaining state might be inescapable in automated orchestration logic, but it comes with a cost. State-Keeping constrains the options for maintaining high availability and indirectly may increase SOA’s fragility by making it more difficult to add redundant components to the environment. With redundant components you must ensure that messages flowing through the system maintain their state, regardless of the server resources used. Relying on session states, while also allowing flexible service flows, is hard to do. It’s done, yes, but the price you will have to pay is increased complexity and performance penalties related to the need to propagate the state of a particular interaction across several nodes. Therefore, a key SOA tenet is that you should use sessionless flows whenever possible. In other words, every request should ideally be atomic and serviceable regardless of the occurrence of previous requests.
Do you want to know the name of an employee with a given social security number? No problem. Pass the social security number as part of the request, and receive the name. If you next want the employee’s address, you can pass either the social security number or the name as part of the request. While atomic, sessionless requests such as these do impose a requirement that the client maintain the state of the interaction and hold the information elements related to the employee, this approach does simplify the design of systems using server clusters.
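The employee lookup can be sketched as follows. This is only an illustration under assumed names: `Employee`, `EmployeeNode`, and the `DIRECTORY` dictionary are hypothetical stand-ins, and the sample record is made up. The point it shows is that each request carries everything needed to answer it, so any node in a cluster can serve it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Employee:
    ssn: str
    name: str
    address: str

# Hypothetical shared, read-only directory with a made-up record.
DIRECTORY = {"000-00-0001": Employee("000-00-0001", "Ada Smith", "12 Main St")}

class EmployeeNode:
    """A server node that remembers nothing about a client between requests."""

    def __init__(self, directory):
        self._directory = directory

    def get_name(self, ssn):
        return self._directory[ssn].name

    def get_address(self, ssn):
        return self._directory[ssn].address

# Two interchangeable nodes, as in a server cluster.
node_a = EmployeeNode(DIRECTORY)
node_b = EmployeeNode(DIRECTORY)

ssn = "000-00-0001"                # the client, not the server, holds the state
name = node_a.get_name(ssn)
address = node_b.get_address(ssn)  # a different node answers; no session hand-off needed
```

Because neither node stores anything about the conversation, a load balancer is free to route the second request anywhere.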
Still, while the preferred tenet is to avoid session keys, on occasion it becomes impossible for the client to keep the state, forcing the server to assume this responsibility. In this case, the approach is to use a uniquely generated “session-id” whereby the server “remembers” the employee information (the state). You will have to ensure that the session key and its associated state data are accessible to all servers in a loosely-coupled cluster, making your system design more complicated.
For an example of keeping a session-based state, consider an air booking process where the client is reserving a pair of seats. The server will temporarily decrease the inventory for the flight. For the duration of the transaction the server will give a unique “reservation id” to the client so that any ongoing requests from the client can be associated with the holding of these seats. Clearly, such a process will need to include timeout logic to eventually release the two seats in the event the final booking does not take place before a predetermined amount of time.
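A minimal sketch of such a hold-with-timeout, assuming a single server process and hypothetical names (`FlightInventory`, `hold`, `confirm`), might look like this. The clock is injectable so that the timeout can be exercised without waiting; a clustered deployment would also need the holds kept in a store shared by all servers.

```python
import time
import uuid

class FlightInventory:
    """Server-side state: seats on hold, keyed by a generated reservation id."""

    def __init__(self, seats, hold_seconds, clock=time.monotonic):
        self._available = seats
        self._hold_seconds = hold_seconds
        self._clock = clock
        self._holds = {}  # reservation_id -> (seat_count, expires_at)

    def _release_expired(self):
        now = self._clock()
        for rid, (count, expires_at) in list(self._holds.items()):
            if expires_at <= now:
                self._available += count  # timed out: seats return to inventory
                del self._holds[rid]

    def available(self):
        self._release_expired()
        return self._available

    def hold(self, count):
        self._release_expired()
        if count > self._available:
            raise ValueError("not enough seats")
        self._available -= count          # temporarily decrease the inventory
        rid = uuid.uuid4().hex
        self._holds[rid] = (count, self._clock() + self._hold_seconds)
        return rid

    def confirm(self, reservation_id):
        self._release_expired()
        if reservation_id not in self._holds:
            raise KeyError("hold expired or unknown")
        del self._holds[reservation_id]   # booked: seats stay out of inventory

# Usage with a fake clock (a one-element list so the lambda can read updates).
now = [0.0]
inv = FlightInventory(seats=10, hold_seconds=60, clock=lambda: now[0])
rid = inv.hold(2)
after_hold = inv.available()
now[0] = 61.0                       # the hold times out without a final booking
after_timeout = inv.available()
rid2 = inv.hold(2)
inv.confirm(rid2)                   # this time the booking completes
now[0] = 500.0
after_booking = inv.available()
```

Note that the timeout decision lives entirely in the server that owns the inventory, which is the party able to keep its own environment consistent.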
This discussion leads to another tenet: maintaining state, either in the client or in the server, along the lines mentioned is ultimately acceptable. Keeping the state inside the intermediate nodes? Not so much. Why? An intermediate component should not have control over timing out a resource that’s being held in the server. If it did, it would be disrupting the server’s ability to maintain integrity in its environment. Also, an intermediate component will not have full awareness of the business semantics of the service request/response. Relying on an intermediate component to preserve state is like expecting your mail carrier to remind you that your cable bill is due for payment on the 20th of each month. He might do it, yes, but the moment you forget to tip him during the holidays, he just might “forget”!
Ironically, many of today’s vendors offer solutions that encourage the processing of business logic in their intermediate infrastructure products, encouraging you to maintain state in these middleware components. They do so because enabling middleware is an area that does not require them to be aware of your applications, and thus is the easiest area for them to offer you a “value-add service” in a productized, commoditized fashion. You should resist the melodious chant of these mermaids and refrain from using their tempting extra services. If not, you may find yourself stuck with an inflexible design and with a dependency on a specific vendor’s architecture to boot.
My advice is to avoid these vendor-enabled approaches. There is much that can get complicated with the maintenance of state, especially when the business process requires transactional integrity, referential integrity, and security (and most business processes do). The moment you give up this tenet and maintain session state inside the SOA middleware, as opposed to the extreme ends represented by the Client and the Server, you will be ensuring years of added complexity in the evolution of your SOA system.
Data is what we put into the system, and information is what we expect to get out of it. (Actually, there’s an epistemological argument that what we really crave is knowledge; for now, however, I’ll use the term ‘information’ to refer to the system output.) Data is the dough; information is the cake. When we seek information, we want it to taste good: to be accurate, relevant, current, and understandable. Data is another matter. Data must be acquired and stored in whatever form is best from a utilitarian perspective. Data can be anything. This explains why two digits were used to store the year in dates in pre-millennium systems, leading to the big Y2K brouhaha (more on this later). Also, data is not always flat and homogeneous. It can have a hierarchical structure and come from multiple sources. In fact, data is whatever we choose to call the source of our information.
Google reputedly has hundreds of thousands of servers with petabytes of data (1 petabyte = 1,024 terabytes), which you and I can access in a matter of milliseconds by typing free-text searches. For many, a response from Google represents information, but to others this output is data to be used in the cooking of new information. As a matter of fact, one of the most exciting areas of research today is the emergence of Collective Intelligence via the mining of free-text information on the web. Or consider the very promising WolframAlpha knowledge engine effort (wolframalpha.com), which very ambitiously taps a variety of databases to provide consolidated knowledge to users. There are still other mechanisms to provide information that rely on the human element as a source of data. Sites such as Mahalo.com or Chacha.com actually use carbon-based intelligent life forms to respond to questions.
Data can be stored in people’s neurons, spreadsheets, 3 x 5 index cards, papyrus scrolls, punched cards, magnetic media, optical disks, or futuristic quantum storage. The point is that the user doesn’t care how the data is stored or how it is structured. In the end, Schemas, SQL, Rows, Columns, Indexes, and Tables are the ways we IT people store and manage data for our own convenience. But as long as the user can access data in a reliable, prompt, and comprehensive fashion, she couldn’t care less whether the data comes from a super-sophisticated object-oriented database or from a tattered printed copy of the World Almanac.
How should data be accessed then? I don’t recommend handling data in an explicit manner the way RDBMS vendors tell you to handle it. Data is at the core of the enterprise, but it does not have to be a “visible” core. You don’t visualize data with SQL. Instead, I suggest that you handle all access to data in an abstract way. You visualize data with services, and this brings up the need for a Data Sentinel Layer. This layer should be, you guessed it, an SOA-enabled component providing data access and maintenance services.
To put it simply, the Data Sentinel is the gatekeeper and abstraction layer for data. Nothing goes into the data stores without the Sentinel first passing it through; nothing gets out without the Sentinel allowing it. Furthermore, the Sentinel allows decoupling of how the data is ultimately stored from the way the data is perceived to be stored. Depending upon your needs, you may choose consolidated data stores or, alternatively, you may choose to follow a federated approach to heterogeneous data. It doesn’t matter. The Data Sentinel is responsible for presenting a common SOA façade to the outside world.
Clearly, a key tenet should be to not allow willy-nilly access to data by bypassing the Sentinel. You should not allow applications or services (whether atomic or composite) to fire their own SQL statements against a database. If you want to maintain the integrity of your SOA design, make sure to access data only via the data abstraction services provided by the Sentinel.
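As a rough sketch of this gatekeeping idea, the following Python example puts one service façade in front of two interchangeable stores. All names here (`DataSentinel`, `CustomerStore`, `register_customer`, and so on) are hypothetical, and a real Sentinel would also have to deal with security, caching, and the SOA interface criteria; this only illustrates that callers never see SQL or the storage choice.

```python
import sqlite3
from abc import ABC, abstractmethod

class CustomerStore(ABC):
    """How the data is really stored; hidden from service consumers."""

    @abstractmethod
    def save(self, customer_id, name): ...

    @abstractmethod
    def find(self, customer_id): ...

class InMemoryStore(CustomerStore):
    def __init__(self):
        self._rows = {}

    def save(self, customer_id, name):
        self._rows[customer_id] = name

    def find(self, customer_id):
        return self._rows.get(customer_id)

class SqliteStore(CustomerStore):
    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

    def save(self, customer_id, name):
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)",
                (customer_id, name))

    def find(self, customer_id):
        row = self._conn.execute(
            "SELECT name FROM customers WHERE id = ?", (customer_id,)).fetchone()
        return row[0] if row else None

class DataSentinel:
    """The only way in or out: validates writes and hides the storage choice."""

    def __init__(self, store):
        self._store = store

    def register_customer(self, customer_id, name):
        if not name.strip():
            raise ValueError("customer name is required")
        self._store.save(customer_id, name.strip())

    def get_customer_name(self, customer_id):
        return self._store.find(customer_id)

# The same service calls work regardless of what sits behind the Sentinel.
results = []
for store in (InMemoryStore(), SqliteStore()):
    sentinel = DataSentinel(store)
    sentinel.register_customer(7, "  Grace Lee ")
    results.append(sentinel.get_customer_name(7))
missing = DataSentinel(InMemoryStore()).get_customer_name(99)
```

Swapping the store requires no change in any caller, which is the decoupling between how data is stored and how it is perceived to be stored.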
Then again, this being a world filled with frailty, there are three exceptions where you will have to allow SOA entities to bypass the abstraction layer provided by the Sentinel. Every castle has secret passageways. I will cover later the situations where these exceptions may apply: Security/Monitoring, Batch/Reporting, and the Data Joiner Pattern.
Obviously, data abstraction requires attention to performance, data persistence, and data integrity aspects. Thankfully, there are off-the-shelf tools to help facilitate this abstraction and the implementation of a Sentinel layer, such as Object-Relational mapping (e.g. Hibernate), automated data replication, and data caching products. Whether you choose to use an off-the-shelf tool or to write your own will depend upon your needs, but the use of those tools is not always sufficient to implement a proper Sentinel. Object-Relational mapping and Stored Procedures, for example, are means to more easily map data access into SOA-like services, but you still need to ensure that the interfaces comply with the SOA interface criteria covered earlier. In the end, the use of a Data Sentinel Layer is a case of applying abstraction techniques to deal with the challenges of an SOA-based system, but one that also demands engineering work in order to deploy the Sentinel services in front of the databases and other data sources. There are additional techniques and considerations that also apply, and these will be discussed later on.
This blog covers the practical techniques, trials and tribulations associated with the transformation of IT systems from legacy technologies to systems using SOA and modern open systems. I also include the occasional interlude with rants about technology in general.
About Me
Name: Israel del Rio
Location: Atlanta, GA, United States
Israel has been recognized by Computerworld as one of their Premier 100 IT honorees. Israel is a business and technology leader who has contributed the technology vision as key strategist and designer behind the enterprise technology roadmaps of large hospitality and travel companies. Israel has also developed and deployed various mission-critical systems and, in the process, has been instrumental in creating and building effective and skilled development organizations.