IT Transformation with SOA

Friday, January 1, 2010

Data Matching and Integration Engines


Encapsulation of data behind data services via the Data Sentinel works well when the data is accessed intermittently and discretely. However, there are cases where the data access pattern requires matching large numbers of records in one database against large data volumes in another. An example could be a campaign management application that needs to combine the contents of a customer database with a promotion database defining discount rates based on the customer’s place of residence. Clearly, having this service call a data service for every customer record when performing promotional matches would be unsound and impractical from a performance perspective. The alternative, allowing applications to perform direct database joins across the various databases, is not ideal either. This latter approach would violate many of the objectives SOA tries to meet by forcing applications to be directly aware of, and dependent on, specific data schemas and database technologies.
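To make the performance argument concrete, here is a minimal sketch contrasting the two approaches. All names and data are illustrative; the point is the number of round trips, not the specific schema:

```python
# Illustrative data: a customer store and a promotion store keyed by region.
customers = [
    {"id": 1, "name": "Ada", "region": "North"},
    {"id": 2, "name": "Lin", "region": "South"},
]
promotions = [
    {"region": "North", "discount": 0.10},
    {"region": "South", "discount": 0.15},
]

# Unsound approach: one fine-grained data service call per customer record.
# For millions of customers this means millions of network round trips.
def match_per_record(customers, lookup_discount_service):
    return [{**c, "discount": lookup_discount_service(c["region"])}
            for c in customers]

# Coarse-grained alternative: a single bulk join performed inside the
# service fabric, returning the fully matched result in one call.
def match_bulk(customers, promotions):
    by_region = {p["region"]: p["discount"] for p in promotions}
    return [{**c, "discount": by_region[c["region"]]} for c in customers]

matched = match_bulk(customers, promotions)
```

The logic is identical; what changes is where the join executes and how many service invocations the requester pays for.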
Yet another example is implementing data extraction via an algorithm such as MapReduce, which necessitates the orchestration of a large number of backend data clusters. This type of complex orchestration against potentially large data sets cannot be left to the service requester and is best provided by sophisticated front-end servers.
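The shape of such an extraction can be sketched in-process. This is a toy single-machine version, assuming hypothetical order records; a real deployment distributes the map and reduce tasks across the backend clusters:

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Apply the mapper to every record, emitting (key, value) pairs.
    for record in records:
        yield from mapper(record)

def reduce_phase(pairs, reducer):
    groups = defaultdict(list)          # the "shuffle" step: group by key
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# Example extraction: total sales per region from raw order records.
orders = [{"region": "North", "sale": 100},
          {"region": "North", "sale": 50},
          {"region": "South", "sale": 70}]
pairs = map_phase(orders, lambda o: [(o["region"], o["sale"])])
totals = reduce_phase(pairs, sum)       # {"North": 150, "South": 70}
```

The orchestration burden (partitioning records, scheduling mappers and reducers, collecting results) is exactly what should live behind the coarse service rather than in each requester.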
Both examples show the need to make these bulk data matching processes part of the service fabric, available as coarse-grained data services. The solution, then, is to incorporate an abstraction-layer service for this type of bulk data join process. Applications can then trigger the process by calling this coarse-grained service. In practical terms, this means that when implementing the SOA system you should consider the design and deployment of the data matching and integration engines needed to efficiently and securely implement this kind of coarsely defined service. In fact, you are likely to find off-the-shelf products that are, at heart, instances of data matching engines: campaign management engines, business intelligence systems, and reporting engines that serve users by generating multi-view reports.
Now, using off-the-shelf solutions has tremendous benefits, but the use of external engines is likely to introduce varied data formats and protocols to the mix. Notwithstanding the ideal of having a canonical data format throughout, there will always be a need to perform data transformations. That’s the next topic.
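A minimal sketch of such a transformation, assuming a hypothetical vendor field layout and an equally hypothetical canonical customer schema:

```python
def to_canonical(external_record):
    """Map a vendor-specific field layout onto the canonical schema.
    Field names (CUST_NO, CUST_NAME, REGION_CD) are illustrative only."""
    return {
        "customer_id": str(external_record["CUST_NO"]),
        "name": external_record["CUST_NAME"].strip().title(),
        "region": external_record.get("REGION_CD", "UNKNOWN"),
    }

# A row as it might arrive from an external engine's export.
vendor_row = {"CUST_NO": 42, "CUST_NAME": "  ada LOVELACE ", "REGION_CD": "North"}
canonical = to_canonical(vendor_row)
```

Centralizing this mapping in one adapter keeps the varied external formats from leaking into every consuming service.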


Friday, December 25, 2009

The Data Visibility Exceptions

The Data Sentinel is not unlike the grumpy bureaucrat processing your driver’s license application forms. After ensuring that you comply with what’s sure to be a ridiculously complicated list of required documents, it isolates you from directly accessing the files in the back.
While you, the applicant, the supplicant, cannot go around the counter and check the contents of your files directly (not legally, anyway), the DMV supervisor in the back office is able to directly access any of the office files. After all, the supervisor is authorized to bypass the system processes intended to limit direct access to the data. Direct supervisory access to data is the first of the exceptions to the data visibility constraints mentioned earlier.
Next is the case of ETLs (Extract, Transform, Load) of large data sets, as well as their reporting. These cases require batch-level access to data in order to process or convert millions of records, and they can wreck performance if carelessly implemented. Reporting jobs should ideally run against offline replicated databases, not the online production databases. Better yet is to plan a proper data warehousing strategy that allows you to run business intelligence processes independently of the main Operational Data Store (ODS). Nevertheless, on occasion you will need to run summary reports or data-intensive real-time processes against the production database. When the report tool is allowed to access the database directly, bypassing the service layer provided by the Data Sentinel, you will need to ensure this access is well behaved and that it runs as a low-priority process under restricted user privileges. The same control is required for the ETL processes. Operationally, you should always schedule batch-intensive processes for off-peak times such as nightly runs.
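The "low-priority process" part of that control can be sketched as follows. This is a hedged, Unix-centric illustration: `summarize_sales` stands in for the actual read-only reporting query, and the niceness increment of 5 is arbitrary:

```python
import os

def run_low_priority(job, *args):
    """Lower this process's CPU scheduling priority, then run the job."""
    try:
        os.nice(5)              # raise niceness = lower priority (Unix)
    except (OSError, AttributeError):
        pass                    # platforms without os.nice, or not permitted
    return job(*args)

def summarize_sales(rows):
    # Stand-in for a read-only summary query against the production store;
    # the real job would also connect under a restricted database user.
    return sum(r["sale"] for r in rows)

total = run_low_priority(summarize_sales, [{"sale": 100}, {"sale": 250}])
```

The restricted-privileges half of the control belongs in the database itself (a read-only account for the report tool), so that even a misbehaving report cannot modify production data.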
A third potential exception to data visibility is implied by the use of off-the-shelf transaction monitors, which require direct access to the databases in order to implement the ACID logic discussed earlier.
A fourth exception is demanded by the need to execute large data matching processes. If there is an interactive need to run a process against a large database set with matching keys in a separate database (“for all customers with sales greater than an $X amount, apply a promotion flag equal to the percentage corresponding to the customer’s geographic location in the promotion database”), then it makes no sense to try to implement each step via discrete services. Such an approach would be extremely contrived and inefficient. Instead, a Table-Joiner super-service will be required. More on that next.
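The quoted example can be sketched as the single coarse operation such a super-service would expose. The schemas and the $1000 threshold are illustrative assumptions:

```python
# Two logical stores, each of which might live in a separate database.
customers = [
    {"id": 1, "location": "North", "sales": 1500},
    {"id": 2, "location": "South", "sales": 400},
]
promotions = {"North": 0.10, "South": 0.15}   # location -> discount percentage

def apply_promotion_flags(customers, promotions, threshold):
    """One bulk pass: flag qualifying customers with the location's rate."""
    flagged = []
    for c in customers:
        rate = promotions[c["location"]] if c["sales"] > threshold else 0.0
        flagged.append({**c, "promotion": rate})
    return flagged

result = apply_promotion_flags(customers, promotions, threshold=1000)
```

The requester makes one call and receives the fully matched result; the key-matching across the two stores happens inside the service, not step by step at the edge.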
