So it feels with my data integration journey, which so far amounts to little more than peeking inside the door. I know that avoiding data integration isn't an option for the scenarios I'm facing, but when I see everything that is required to make it work, I start to cringe.
Why Data Integration
So why am I even looking into real-time data integration? I have a couple of reasons. The one I get most excited about is an event-driven architecture. The project I'm on has several different ways that data can change, so really the only way to know something happened is to listen for data changes. To do that we can use a technology called Change Data Capture (CDC), which basically logs what data has changed. We could then theoretically take that changed data and send a message through an enterprise service bus (ESB) or some kind of message queue (MQ) to alert services that something happened that they should pay attention to. This would keep us from hammering the database with polling queries as we watch for the bad scenarios we want to avoid.
For example, say you have a truckload of frozen turkeys. And say you accidentally set the temperature on the trailer to 60 degrees F. Now, 60 degrees is a lot higher than freezing, and if you want the turkeys to arrive at their destination frozen, the temperature needs to be corrected. Modern freight trucks are pretty high tech and report all sorts of data back to their brokers, such as the temperature in their trailers. So wouldn't it be nice to receive an alert that the temperature is off the second the data is received? Heck yes it would!
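To make the turkey scenario a little more concrete, here is a minimal Python sketch of the idea, not a real integration. Everything in it is an assumption for illustration: the ChangeEvent shape, the trailer_readings table, the temperature_f column, and the 32 degrees F threshold are all made up, and the standard library's queue.Queue stands in for a real ESB or MQ.

```python
import queue
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChangeEvent:
    """Hypothetical shape of one row-level change reported by a CDC tool."""
    table: str
    key: str
    column: str
    old_value: Optional[float]
    new_value: float


def publish_changes(change_log: List[ChangeEvent], bus: "queue.Queue") -> None:
    """Push every captured change onto the bus (stand-in for an ESB/MQ)."""
    for event in change_log:
        bus.put(event)


def temperature_alerts(bus: "queue.Queue", max_temp_f: float = 32.0) -> List[str]:
    """Drain the bus and return an alert for each reading above the limit.

    The 32.0 default is an assumed threshold, not a real cargo requirement.
    """
    alerts = []
    while not bus.empty():
        event = bus.get()
        is_temperature = (
            event.table == "trailer_readings" and event.column == "temperature_f"
        )
        if is_temperature and event.new_value > max_temp_f:
            alerts.append(
                f"Trailer {event.key}: {event.new_value}F exceeds {max_temp_f}F"
            )
    return alerts


# Two made-up changes: one trailer warms up to 60F, the other stays frozen.
change_log = [
    ChangeEvent("trailer_readings", "T-100", "temperature_f", -5.0, 60.0),
    ChangeEvent("trailer_readings", "T-200", "temperature_f", -4.0, -6.0),
]

bus = queue.Queue()
publish_changes(change_log, bus)
alerts = temperature_alerts(bus)
for alert in alerts:
    print(alert)
```

The point of the sketch is that the alerting service never queries the database. It only reacts to the changes it is handed, which is what would let us stop polling.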
A lot of my thinking on this comes from a similar integration described here. But here is the theoretical model I was thinking of:
Another reason real-time data integration can be very important is, as the name suggests, integrating data from multiple systems into a single data source for easier development and reporting (aka Business Intelligence (BI)).
If you're like me, you've found yourself trying to use data from multiple data sources. And if you're like me, you've found that merging the data and performing any sort of operation on it in the application layer is a real pain in the butt, and it takes FOREVER to build. So, the way I see it, we can have the complexity in the application layer or we can have the complexity in the infrastructure.
Something like this:
So the concept seems pretty straightforward. You simply have three data stores and you only want to query against one. Easy enough! But some interesting questions and answers came up as we were thinking about this.
Q. How are we going to get data to the consolidated data store?
A. We'll use some CDC tool!
Q. If we use a CDC tool, is the load on our source data stores going to increase?
A. A little, since we'll need to write to and read from the change logs, but hopefully that will be offset by moving any monitoring and BI operations to the consolidated store.
Q. Can we turn on CDC logging on our data stores?
A. Er... we'd better ask the system admins/DBAs.
Q. What level of normalization do we want in the consolidated store?
A. Good question.
Q. How are we going to save data that is changed by the users?
A. Well, we could figure out a two-way sync and have it save back to the consolidated store, or we could write back to the original source and let the change propagate through.
Q. Do the CDC tools allow data transformations?
A. Some do.
Q. What tools are even out there?
A. We've come across DBMoto, IBM's DataStage, Oracle's GoldenGate, and Talend's enterprise edition. Informatica also has something.
etc. etc.
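To picture what "getting data to the consolidated data store" could look like, here is a rough Python sketch under some loud assumptions: the change records are plain dicts with made-up op, table, pk, and row fields, the source names (dispatch, telematics) are hypothetical, and the consolidated store is just an in-memory dict. A real CDC tool would have its own record format and its own way of applying changes.

```python
from typing import Any, Dict, Tuple

# Key rows by (source, table, primary key) so rows from different
# source stores can never collide in the consolidated store.
RowKey = Tuple[str, str, Any]


def apply_change(store: Dict[RowKey, dict], source: str, change: dict) -> None:
    """Apply one hypothetical CDC change record to the consolidated store."""
    key = (source, change["table"], change["pk"])
    if change["op"] == "delete":
        # Deleting a row that was never replicated is treated as a no-op.
        store.pop(key, None)
    else:
        # Inserts and updates are both handled as an upsert of the changed columns.
        store.setdefault(key, {}).update(change["row"])


consolidated: Dict[RowKey, dict] = {}

# Made-up change stream from two of the source stores, in arrival order.
changes = [
    ("dispatch", {"op": "insert", "table": "loads", "pk": 1,
                  "row": {"cargo": "turkeys", "status": "booked"}}),
    ("telematics", {"op": "insert", "table": "readings", "pk": 7,
                    "row": {"temperature_f": 60.0}}),
    ("dispatch", {"op": "update", "table": "loads", "pk": 1,
                  "row": {"status": "in transit"}}),
    ("telematics", {"op": "delete", "table": "readings", "pk": 7, "row": {}}),
]

for source, change in changes:
    apply_change(consolidated, source, change)

print(consolidated)
```

Note that this sketch copies rows over without reshaping them, so it dodges the normalization and transformation questions entirely. Those decisions would live inside something like apply_change.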
So far we have lots of questions and far fewer answers, but as we open the door wider and wider we're getting some. I just hope we're not opening a Pandora's box.
Our next step is evaluating the CDC tools, since this seems to be the biggest unknown at this point. I'll continue writing about our data integration journey, and hopefully it doesn't end with the nasties slaying the heroes.