Conquering the Integration Dilemma
by Jim Wilt
Summary: Extensible framework-based packages: They are everywhere—from portals to e-commerce, from content management to messaging. Effective? In many cases, yes; absolutely. I can think of many successful applications I've built based on the frameworks these products provide. They boost productivity, enhance quality, increase feature richness, and reduce the time to market greatly.
So, why don't integration solutions experience the same improvements? No matter how I plug away at an integration solution with a given tool or framework, I don't seem to progress at the pace I experienced with my Web application or portal solution. This is what I define as the Integration Dilemma. (9 printed pages)
By definition, integration—in and of itself—is a difficult problem to solve. We will examine the contributing factors, so that they can be:
- Recognized and categorized.
- Understood with their resulting repercussions.
- Proactively addressed.
Ninety percent of the activity centered on integration solutions is in setting up, configuring, and fine-tuning the infrastructure environment and security properly. Operating Systems, Web Servers, Directories, and Application Settings—as well as Application Pools, Host Process users, Single Sign On, Directory Permissions, Security, and so on—all play a part in the execution and contribute to the frustration often associated with integration implementation.
Many application and extensible package developers are protected from having to worry about strong name keys, the GAC, certificates, encryption, in-process hosts versus isolated host processes, and the many other factors in which an integration developer must gain a great deal of expertise and mastery. The benefit is that integration-development experience will greatly enhance a developer's approach moving forward with future (normal) application solutions. That is to say, a developer's skills are strengthened by completing an integration solution.
The remaining 10 percent in the 90/10 rule is left for the integration developer to solve the actual integration problem itself. This frustrates development teams and management alike, because the 90 percent commitment generally is not in any way applicable to the actual integration problem at hand. The frustration is further magnified, as we will see in the next sections, because the actual integration problem is generally more difficult than anticipated and requires far more time to solve than the remaining 10 percent allows.
Integrating in frameworks like Microsoft BizTalk Server is much like bowling with bumpers in the gutters. Their interfaces often keep you from doing too much harm to your solution, steering you with feature-rich design and implementation interfaces. In contrast to the infrastructure and security challenges, these tools will actually help direct the solution. This positive integration experience, however, constitutes only 10 percent of the effort put forth.
Conquering the 90/10 Rule
Your infrastructure and operations teams are your new best friends:
- Keep a close working relationship with your infrastructure and operations teams from the project's onset, as they might preemptively identify potential security and operational issues.
- Many of the stifling issues an integration developer encounters are commonplace for an infrastructure resource, so utilize their experience to expedite problem identification and resolution.
Make your development environment mimic your production environment:
- Replicate your LDAP/Active Directory and install applications, servers, and packages using the same security model as production (or as close to it as you can).
- Never develop or run your solutions as an administrator.
- When you must make a change to the environment during development, review the change with your infrastructure/operations team and document your modification in your deployment documentation.
The best starting place for resolving integration-infrastructure problems is security in the form of permissions and accessibility.
Integration solutions generally involve diverse system scenarios. It is good to understand the two major types of integration problems, how they are solved, and their relative complexities.
A-to-B Problems: One House, Two Systems
A-to-B problems arise when a System B wants to communicate with a System A. The two systems generally are in the same infrastructure, but are not limited to it; they can span a WAN, VAN, or the Internet.
For example, System B could be a distributed system accessing information stored in System A, a legacy system (see Figure 1). Direct mapping of information from A to B requires intimate knowledge of both A and B, but because domain knowledge of both systems is generally available in-house, there is less guesswork on which data fields and elements in A relate to B.
Figure 1. A-to-B problem: two systems, one house
A-to-C Problems: Multiple Houses, Multiple Systems
An A-to-C problem describes scenarios in which trading-partner A wants to communicate with trading-partner C, but they both must do so through a sometimes external intermediate, common, or industry-standard format B. The houses are generally in separate infrastructures (see Figure 2).
Figure 2. A-to-C problem: two systems, two houses
An example of this common situation is when trading partners use an ANSI X.12, HIPAA, or XCBL common industry format to communicate with each other, sometimes through a VPN. Often, the intermediary schema is 10 to 100 times larger than the internal schema used by System A or C. These intermediary schemas can be 1-2 MB, resulting in performance and stability issues when introduced to packaged integration-framework tools, which is especially frustrating when the average message payload is only 20 KB (see Figure 3).
Figure 3. A-to-C scenarios often involve a bloated intermediary-format B.
Aside from these more mechanical issues, the two trading partners might have to perform guesswork to decide where in the intermediary-format B they are to place their information and where their trading partner will place theirs. Coupled with the fact that an intermediary format often contains bloated redundancies (invariably leading to further confusion in the proper placement of information), these imprecisions might cause you to question the value in this form of integration (of course, there is value, but at times it can seem dubious).
These factors tend to make A-to-C problems significantly more difficult to solve, requiring far greater communication between trading partners to resolve interpretation differences with the intermediary-format B.
Conquering the A-to-B and A-to-C Problems
- Clearly document your intermediary-format B expectations, providing many samples.
- Reduce the size of intermediary-format B by working with your trading partner(s) to agree on a subset schema that includes only those components utilized by all trading partners.
- Double or triple your trading-partner test projections to compensate for intermediary-format B misinterpretations.
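One way to keep an agreed subset schema honest is to check sample messages against it automatically. The following is a minimal sketch, not a definitive implementation: the element names, the `AGREED_PATHS` set, and the helper functions are all hypothetical and stand in for whatever subset you and your trading partners document.

```python
import xml.etree.ElementTree as ET
from typing import List, Set

# Hypothetical agreed subset: element paths every trading partner has committed to use.
AGREED_PATHS: Set[str] = {
    "Order",
    "Order/Header",
    "Order/Header/OrderNumber",
    "Order/Line",
    "Order/Line/Sku",
    "Order/Line/Quantity",
}


def element_paths(xml_text: str) -> List[str]:
    """Return the slash-separated path of every element in the message."""
    root = ET.fromstring(xml_text)
    paths: List[str] = []

    def walk(node: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths.append(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def outside_subset(xml_text: str, agreed: Set[str] = AGREED_PATHS) -> List[str]:
    """List paths present in the message but missing from the agreed subset."""
    return sorted(set(element_paths(xml_text)) - agreed)
```

Running `outside_subset` over every sample exchanged during partner testing gives an early warning when one side starts placing data in a part of format B that the other side does not expect.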
Many integration solutions find themselves in a form of scope creep that is never planned for but heightens frustration through overruns in both schedule and budget. This is known as the Mapping Pit of Despair.
Once all security, infrastructure, and operating environment issues are resolved to the point that actual data moves from point A to point B (or C), the interpretation of the hundreds to thousands of fields from one schema to the next becomes the primary focal point. Too often, data required by one point is not readily available at another, or a field intended for one purpose is used for something entirely different. The ugliest secrets usually turn up during the mapping phase.
The following examples illustrate how this might happen.
Misused Fields in Data Repositories
Trading-partner A must supply identity data to schema B in the format shown in Table 1.
|Field|Format|
|First Name|Char *|
|Middle Initial|Char 1|
|Last Name|Char *|
Table 1. Identity-data format
Simple enough, but trading-partner A creatively uses the Middle Initial field to store years of service data, where 0-9 represents the number of years and A represents 10 or more years. When a middle initial is used by an identity, trading-partner A simply places it as part of the first name. Table 2 represents how data may be stored by trading-partner A. Not so simple anymore, is it?
|First name||Middle initial||Last name|
Table 2. Data table stored by trading-partner A
- Trading-partner A must parse the First Name field to determine if a middle initial exists.
- The parsing algorithm must distinguish between a two-word first name, Tory Ann, and a real middle initial, Mary A.
- Improper parsing could result in the confusion between Mary A. Anderson with 2 years of service and Mary Anderson with 10 or more years of service.
- A bad choice of field to hold years of service data (likely made years prior to any intention of sharing this data) has just doubled or even tripled the time to map from trading-partner A's schema to schema B.
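The parsing burden described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the `Identity` class, the function names, and in particular the heuristic (a trailing single letter in the First Name field, with or without a period, is treated as a middle initial) are hypothetical, and a real data set would need its own rules and a manual-review path for ambiguous rows.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Identity:
    first_name: str
    middle_initial: Optional[str]
    last_name: str
    years_of_service: Optional[str]  # "0".."9", or "10+" for code A


def decode_years(code: str) -> Optional[str]:
    """Decode trading-partner A's overloaded Middle Initial field."""
    code = code.strip().upper()
    if len(code) == 1 and code in "0123456789":
        return code
    if code == "A":
        return "10+"
    return None  # blank or unexpected value: leave for manual review


def split_first_name(raw: str) -> Tuple[str, Optional[str]]:
    """Assumed heuristic: a trailing single letter is a real middle initial."""
    parts = raw.split()
    if len(parts) >= 2:
        last = parts[-1].rstrip(".")
        if len(last) == 1 and last.isalpha():
            return " ".join(parts[:-1]), last.upper()
    return raw.strip(), None


def to_schema_b(first: str, middle: str, last: str) -> Identity:
    """Map one row of partner A's table to the identity format of schema B."""
    first_name, initial = split_first_name(first)
    return Identity(first_name, initial, last.strip(), decode_years(middle))
```

With these rules, a row holding `Mary A.` / `2` / `Anderson` and a row holding `Mary` / `A` / `Anderson` come out as two different identities, which is exactly the distinction that careless parsing loses.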
When mapping is demonstrated by integration-software vendors, it usually looks something like Figure 4.
Figure 4. Integration-software mapping demo
Real-world mapping is a far different problem involving:
- Hundreds to thousands of fields.
- Hierarchical differences in data formats that might require complicated looping algorithms to properly position data.
- Large data streams requiring complicated loops to break apart messages.
- Fancy integration tools that sometimes lead developers to pursue a graphical solution to a problem that is far too complicated for that tool.
- Pulling data from multiple internal sources (sometimes asynchronously).
- Manufacturing or calculating information that simply doesn't exist in any internal data repositories.
- Interruptions to other team members working on other parts of the solution when a map breaks.
Real-world maps look more like Figure 5. Significant innovations in tools for real-world maps, such as BizTalk's new XSLT Mapper, have been demonstrated and are forthcoming to alleviate this tedious task, but the challenge often extends beyond what any given tool is positioned to perform (see Resources).
Figure 5. Real-world data mapping
In normal application development, iterations are a good thing. You often can determine how many iterations will be allowed or decide on an acceptable form/function tolerance that will trigger moving on. Partial functionality is generally acceptable and you can utilize phases to introduce missing functionality at a later date.
Integration development, however, generally has zero tolerance: You iterate until it is perfect. Partial functionality and phases are usually not an option. Thus, integration becomes the source for many project delays and budget overruns.
Conquering the Mapping Pit of Despair
- Make no assumptions about the difficulty in mapping; always research sources thoroughly to identify those dirty little secrets.
- Establish a test/debug contract with trading partners that specifies testing frequency and issue turn-around metrics, with appropriate escalation paths. This is especially important with external trading partners: when interruptions to necessary testing procedures occur (which they most certainly will), everyone affected must be notified as early as possible so that potential delays to the solution delivery can be communicated.
- Utilize Test-Driven Development best practices to be able to thoroughly test and defend your side of the integration and reduce/prevent downtime to other team members.
- For complicated maps, consider using scripts and code over fancy UI paradigms (easier to read and debug).
- Break messages apart before mapping.
- Use managed code in place of maps when necessary (for example, for better performance and complicated hierarchies). This is accomplished by using serialized classes.
- Understand and honor performance ramifications related to mapping (for example, because maps are XSLT scripts, never call managed code from a map).
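Two of the points above, breaking messages apart before mapping and mapping in code through typed classes, can be sketched together. The article has .NET serialized classes in mind; the Python below is only an analogous illustration, and the `SourceOrder` and `TargetOrder` shapes, the element names, and the functions are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class SourceOrder:  # hypothetical internal (System A) shape
    number: str
    customer: str


@dataclass
class TargetOrder:  # hypothetical intermediary-format shape
    order_id: str
    buyer_name: str


def split_batch(batch_xml: str) -> Iterator[SourceOrder]:
    """Break a batch into individual orders before any mapping happens."""
    root = ET.fromstring(batch_xml)
    for node in root.findall("Order"):
        yield SourceOrder(
            number=node.findtext("Number", default="").strip(),
            customer=node.findtext("Customer", default="").strip(),
        )


def map_order(src: SourceOrder) -> TargetOrder:
    """Field-level map written as plain code, so it can be unit tested."""
    if not src.number:
        raise ValueError("order number is required by the target schema")
    return TargetOrder(order_id=src.number, buyer_name=src.customer)


def map_batch(batch_xml: str) -> List[TargetOrder]:
    """Split first, then map each small message on its own."""
    return [map_order(order) for order in split_batch(batch_xml)]
```

Because `map_order` works on one small, typed message at a time, each mapping rule can be covered by an ordinary unit test, which also supports the Test-Driven Development point in the list.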
Performance always matters, especially when you're told it doesn't matter.
Conquering Performance Matters (BizTalk)
- Minimize trips in/out of the message box.
- Minimize dependency on Orchestrations, which can cause your solution to execute an order of magnitude slower.
- Performance-driven development means measuring metrics during all phases of development to identify bottlenecks as early as possible.
- Utilize product performance tuning guidelines. Some excellent BizTalk guidelines have been published (see Resources).
- Out-of-box is not optimized for your solution. There are many places to tweak performance so it is important to understand all components of the solution and appropriately tune them for optimal operation.
- Serialized classes and managed code might be faster than messages and maps; consider using them for performance-critical components in the solution.
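Measuring during every phase can start very small. The sketch below assumes nothing about BizTalk itself: `StageTimer` and the stage names are hypothetical, and the idea is simply to wrap each step of a message's path in a timer so that the slowest stage is visible from the first prototype onward.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List


class StageTimer:
    """Collects elapsed wall-clock time per named processing stage."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def measure(self, stage: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the sample even if the stage raised an exception.
            self.samples[stage].append(time.perf_counter() - start)

    def slowest(self) -> str:
        """Return the stage with the largest total elapsed time."""
        return max(self.samples, key=lambda s: sum(self.samples[s]))
```

A typical use is `with timer.measure("map"): ...` around each step, followed by a look at `timer.slowest()` after a representative batch has run.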
Once a team adopts a toolset or package to assist in the implementation of their integration solutions, a common error is to utilize every tool in their suite for every integration solution.
In most cases, not every tool needs to be used. In fact, using every tool might adversely affect performance of the overall solution.
A best practice is to think of your suite of tools as a bunch of LEGO bricks. LEGO bricks come in many sizes and colors. You don't need to use all sizes and all colors in all your creations. Sometimes, you might want to use only green and white LEGO bricks, while other times you might concentrate on blue and red. It all depends on your desired result.
Treat your integration tools the same. Not every solution needs an Orchestration. There is a performance price to pay when using an Orchestration. Maps can be utilized in Ports as well as Orchestrations. Sometimes, it is good to experiment by implementing several solution prototypes using various combinations of the suite's tools to understand the performance, maintenance, and deployment differences.
Know Your Tools and Use Only Those You Need (BizTalk)
- Make better use of the Port based publish/subscribe model; because there's no orchestration, this model is too often overlooked.
- Orchestrations are most effective for workflow; although they carry some overhead, Orchestrations have great purpose and are very effective when utilized for workflow.
- Map inside Ports, not just Orchestrations; this is another much overlooked capability.
- Pipelines can be fast and effective; consider a custom Pipeline Component over an Orchestration. It minimizes trips in/out of the message box and eliminates Orchestration overhead.
- Serialized classes and managed code are sometimes an effective alternative to messages and maps.
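The publish/subscribe idea in the first point can be illustrated independently of any product. The sketch below is not the BizTalk API: `MessageBus`, `Subscription`, and the property names are hypothetical, and the handlers merely stand in for send ports. It shows how messages can be routed by their properties alone, with no workflow step in between.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Message = Dict[str, str]  # message properties only, for illustration


@dataclass
class Subscription:
    name: str
    predicate: Callable[[Message], bool]  # filter on message properties
    handler: Callable[[Message], None]    # stands in for a send port


@dataclass
class MessageBus:
    subscriptions: List[Subscription] = field(default_factory=list)

    def subscribe(self, sub: Subscription) -> None:
        self.subscriptions.append(sub)

    def publish(self, message: Message) -> List[str]:
        """Deliver to every matching subscriber and return their names."""
        delivered = []
        for sub in self.subscriptions:
            if sub.predicate(message):
                sub.handler(message)
                delivered.append(sub.name)
        return delivered
```

Prototyping a route this way and comparing it with a workflow-based version is one inexpensive way to run the kind of experiment the article recommends before committing to a design.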
Think of your integration tools as a Ferrari (with manual transmission) in your garage. As long as you can only drive an automatic, this Ferrari will be the slowest, most frustrating vehicle you've ever known. However, once you master a manual, it will show itself for the finely tuned high-performance racing machine it truly is.
The right tools, teamed with the right skills and practices, most certainly can conquer the integration dilemma.
Resources
Churchill, Eddie. "BizTalk's Sexy New XSLT Mapper." Channel 9 Forums, October 2005.
Mohammed, Alaeddin, and Kevin Lam. "BizTalk Server 2004: Performance Tuning for Low-Latency Messaging." Microsoft Developer Network, August 2005.
About the author
Jim Wilt focuses his experience and problem solving skills toward helping customers architect the best possible solutions, to succeed with their needs related to system design, collaboration, data integration, and business intelligence. He is a Microsoft Certified Architect: Solutions and has received several industry awards, including the 1993 Industry Week Technology of the Year Award and the Burroughs Achievement Award for Excellence. Jim also is a Microsoft Most Valuable Professional (MVP) Visual Developer - Solutions Architect, member of the Microsoft MCA Board of Directors, the Central Michigan University College of Science and Technology Alumni Advisory Board, and a Central Michigan University Distinguished Alumni.
This article was published in the Architecture Journal, a print and online publication produced by Microsoft. For more articles from this publication, please visit the Architecture Journal Web site.