SHOP
Shop!
DISCOVER
Library
DISCUSS
Forums
 

2010/02/11

Quick note about Logical Data Models LDM

Filed under: DV Model & Methodology, Hot News! — Tags: , , — Dan Linstedt @ 07:34

* Teradata has “LDM”’s – logical data models. IBM has LDM’s, Microsoft has “Industry Models”, SAP has them, etc…

****** PLEASE NOTE *******
I work with Teradata, I like their hardware, I like their software, I think their database and solution are worth their salt – no doubt.  I even believe that their Logical Data Model is valuable and helpful as a guide and as a logical data model.  Where I get off the boat (disagree) is where Teradata, IBM, and all the others implement LDM’s 1 for 1 with the Physical Model.

I chose this as a single case with which I have direct experience, please realize that it is not representative of all cases, and should not reflect badly on Teradata.  PLEASE if you have a different experience, speak up and comment – let us know how different things can be.
***************************

The problem with these are they push 3rd normal form for Data Warehousing. While they are tremendous “directional” pointers, they keep you from “missing” critical things about the business that need to be captured, they are at the end of the day: LOGICAL data models. They should NEVER be physically implemented, especially not 1 for 1 with the data warehouse, but that’s just what the prof. services organizations do… they implement the LDM 1 for 1 physically.

That by itself wouldn’t be so bad, except that They each have a few gotcha’s and quirks that are problematic in the EDW landscape. (some of this is explained in the Business Book on DV Modeling that I have on LULU.com), more of this is explained on DataVaultAcademy.com in the on-line lessons. Teradata for instance has more than one serious problem. One of the largest problems with their 3nf LDM’s is the way they “construct/create” internal versioning of records. They do the traditional start and stop time-stamps, but they add a number of versioning structures (for multi-headed tree node representation) in a flat 2d structure. At Center for Medicare/Medicaid here in the US, the structure required Teradata Prof. Svcs to put in 7 layers of staging data (that’s right: copied the staging data 7 times before it hit the warehouse). This in turn required the customer to buy 7x the storage from Teradata, ballooning their cost for hardware. Completely unneccessary.

Anyhow, there are successful stories where the LDM’s can be used as a guideline, and the DV model can be built as the physical underneath. This is the best approach IF the customer is willing to spend the money on the LDM (not cheap after-all). IF the customer is NOT willing to spend the money, then the Data Vault modeling techniques can be utilized for both logical and physical, but I do recommend “consolidating concepts/entities” in the logical – so that the business users don’t get bogged down in the “hub/link/satellite” pieces.

Anyhow, if you have questions, and want answers, I can help – let me know.

2009/11/11

Will Informatica 9 put your Data Vault efforts on Steroids?

Filed under: DV Model & Methodology, Hot News! — Tags: , , — Sanjay Pande @ 11:56

For those of us who were sworn to secrecy in June (when we saw the Alpha) we can now talk about the features in Informatica Version 9.

Informatica revealed the new incarnation of its data integration platform version 9 yesterday on the web. What they have done is quite interesting in some spaces and for folks who have been on the platform, many of their guesses would have been correct.

Integration Of the Tooling

This was earlier done at the metadata level with external widgets (accessed through mapplets). Now you can

- Do midstream profiling of a data integration stream (needs the Eclipse based client)
- Use DQ calls anywhere in a mapping
- The sharing capabilities of business rules between IT and Business can be very powerful
- The focus on enabling DQ movement to the business is the right thing to do
- SOA based data services is probably the coolest of the lot for the IT crowd

There are a whole bunch of other enhnancements, but people who are interested can find out themselves

What is so cool about SOA based data services?

Do you remember data federation technologies? Were they all not useless?

Well not anymore.

Here is what is the coolest thing. They created an interface layer which is very similar to virtualized objects but within the informatica metadata layer. This enables you to use any transformation on it, including the inline data quality.

Then went a step further and enabled any style of input and output. So you can source from a Virtualized mart and deliver to a consuming application which can access it via open protocols. Even direct SQL if you want.

Then they enabled the conversion of one mode to another with the click of a button. So you can use virtual for POC and/or development and then deliver to physical asap. This is extremely powerful.

The data services also enable integration with other apps such as google maps mashups and there are several other cool applications of this.

Why is this the most exciting thing for techies (Data Vaulters would love this)?

The build of the DV is unaffected, however, whenever you get a new source, you can test the integration with your existing DV objects using data services. Then build out.

You will still have to tune the physical layer. There are no shortcuts to optimization (which should be done when the need arises anyway, however that is another post) and you can check out Dan’s Performance and Tuning classes if you want to understand it in depth.

What is even more exciting is the build out of the data marts. There is so much power in delivering these as virtualized objects where the business can start using them. The best part is, you can connect the front-end tools to these virtual objects and start reporting. These can be converted to physical without any front-end impacts (perhaps a connection change). That is power right there.

DV + Informatica 9 = Rapid Deployment

The Informatica 9 platform combined with the DV architecture is like a dream for most architects. If you do not get why, perhaps you need to take a look at both those things. To learn about Informatica tools at a fraction of the cost, and to watch a few free videos on the technology visit trainovation.com

2009/11/03

Authorized Consulting Partnerships

Filed under: DV Model & Methodology — Dan Linstedt @ 16:22

What is the ACP program?

The ACP program is a partnership between Genesee Academy, LLC and The consulting partner specified.  It provides the partner with direct access to Hans Hultgren and Dan Linstedt within, around, and on the project – throughout the life of the project. 

What does it do for me?

As a customer/client – it ensures that the consulting partner is constructing the Data Vault models according to our stringent specifications.  It also ensures that the consulting partners maintain success with the Data Vault that they are building for you.  It allows you to sleep at night knowing that we garauntee the first version of the model that the consulting partner produces matches the Data Vault core fundamentals, and that you will see all the benefits we advertise around the model.

Why wouldn’t I just hire a consulting firm that says they know Data Vault?

Some consulting firms are really good, and have quite a few certified Data Vault Modelers – yet even then they still insist on breaking the basic rules.  If the rules and standards are broken, you as the customer run the risk of ending up where you’re starting (having all the same problems you have today, only you’ve got a Data Vault-ish model to deal with).  Some consulting firms are really good, and stick to the rules and standards – these firms should be joined with us through the ACP because they believe in their work, and they see the value of producing certified models, and having us engaged as a part of the project.

Some consulting firms have claimed they know the Data Vault, but when it comes down to it – they neither have certified modelers, nor do they follow the rules – thus producing “failures” once again.  Neither you, nor us want this to happen.  An ACP ensures success for you – making it easier to know exactly the work product you are getting.

So look to your preferred consulting providers – challenge them, ask them if they are DV ACP, check our web-site for a list of our ACP partners today.

Thanks,
Dan Linstedt

2009/11/01

Securing Media Content? Flash & Silverlight?

Filed under: Flash and PHP Code — Tags: , , , — Dan Linstedt @ 18:38

You name it, there are tons of media outlets on the web today.  In fact, media is the fastest growing segment for delivering information to users.  if a picture is worth 1000 words, then a video must be worth (sometimes) as many as 1 million words?  That is, if the video is good.

Now, what about securing the digital content? Is there or isn’t there a good way to do this?

First off, everything delivered to the screen and through the audio channels can be swiped, copied, or stolen.  That is – all it takes is a web-cam and a microphone trained on the display, and voila – all is copied.  So at the end of the day, all this digital rights management stuff seems a bit overkill.   That said, there still is a need (as we have shown) to build and deliver intellectual property, or e-learning over the web.  So can it be protected?

Beyond the first point, yes – but only to a degree.  If someone tries hard-enough, they can always steal content.  So the mechanisms or choices are as follows:

Video & Audio Formats:  Most video and audio formats are streaming bits of pixels and sound, they typically do not contain code (compiled or intermediatee).  They simply exist, are unpacked and rendered either through the audio channels or on screen as the bits demand.  The bits are the physical image, the idea of unpacking is done through something called a CODEC (encodes/decodes) audio and video.

You could, theoretically write your own codec – then encrypt the bits, or scramble them to build the file, then de-scramble them to play them back.  Usually (unless your into calculus or are a genius at high speed mathematical algorithms) getting this to work can be very difficult.  You could also wrap your code around others “codecs” to make your own, this is a bit more feasible – however it usually means none of the common web-based players will work with your codec to play your videos and audios.

What’s next?

You could use Microsoft DRM for Silverlight, or Flash Media Server with DRM (encryption/decryption) – this will get you to a point where most of your video and audi is protected through the streaming mechanisms.  This is probably secure for most of the audience and prevents the amature from “lifting” through playback.  The problem here however, is that both vendors charge a lot of money for their solution, and hosting services charge 10x that price to have you use the service (when it’s a matter of a simple server software for them).  In other words, it’s expensive and it shouldn’t be.

What else?

If you put your movie into a SWF format, then you can attach a “player” or a set of code around the FLV (flash video format) or around the MP3.   This would let you use a login, and a variety of checks and balances to get files off the server and play them back.  The problem? SWFs are easily found in the web-browser temporary cache, and can be copied out.  Even “downloading SWF on demand” (semi or quasi stream) still pulls the whole file down.

You can separate the SWF from the FLV, and store the FLV on a hosting service (better than above), then use the SWF to seek to position, and use browser RAM to buffer only a few seconds of the FLV.  But watch out for band-width problems.  If you use “progressive” download/stream, then the whole FLV ends up in your web-browser temporary files.

At the end of the day, there is no good way to protect Intellectual property delivered electronically over the web, it’s just near impossible.  From PDF to SWF to QuickTime to MP4… it is what it is, and once released – you run the risk.  What we just hope is that most of the people viewing the content are honest enough to want to compensate you for it.

Cheers,

Dan L

2009/10/28

What’s Coming in 2010?

Filed under: DV Model & Methodology — Tags: , , , — Dan Linstedt @ 03:55

Business Data Vault and Operational Data Vault are happening now.  If you’ve not checked out what a BDV or an ODV are, you probably should be doing some reading.

The Business Data Vault is an extension of the Data Vault – basically using the Data Vault Modeling to build a data warehouse that is downstream of the Raw Data Vault.  The business rules are executed once in the ETL layer that transferrs information from the Raw DV to the Business Data Vault.

You can read more about it by purchasing our article on the BDV at http://www.DataVaultInstitute.com/store

I’ve outlined the basics of the Operational Data Vault on my post at B-Eye-Networks this morning:  http://www.b-eye-network.com/blogs/linstedt/archives/2009/10/operational_dw.php

The ODV is what the Raw Data Vault is turning in to.  It includes not just batch data, but transactional data and unstructured data feeds.  It also includes a smattering of operational applications ON TOP of the Raw Data Vault to manage and maintain: Metadata, Raw Data Records, Ontologies, and Dependencies.  Look for the Operational Data Vault in 2010 to make a huge splash in your market place.

If you’re not up on Ontologies, or Unstructured data, I would suggest reading about them  – there are plenty of places on the web to get this information.

Thanks,
Dan Linstedt

2009/10/15

All BI applications do not need a Data Warehouse?

Filed under: DV Model & Methodology — Sanjay Pande @ 12:24

This is a statement I was reading in a recent white paper that mentions a few different things including how expensive it is to implement a data warehouse and there are alternate ways (using their tools) to integrate data… and other blah blah blah. This also contained some partially convincing arguments (more like statements) from them.

From a Data Vault perspective, that is a ridiculous statement. A single hub with its satellite qualifies as a “DW” in the DV approach.

One is encouraged to build small chunks iteratively and add to your DW just like building blocks. Because of the power of the linking structures and the architecture there is no propagation of one part of the model to another which simplifies maintenance.

So, I guess how expensive and complicated your DW is depends on your choice of architecture. For us DV implementers, it is a non-issue.

We just want to do it right the first time.

2009/10/09

BI Vendors … Stop Showing Photographs

Filed under: DV Model & Methodology — Sanjay Pande @ 12:51

So, I attended a luncheon/seminar by a vendor yesterday and learned some interesting things (including the fact that some of the largest vendors actually OEM many components and connectors). It was the usual customer focused, profit BI … blah blah … web 2.0 etc. Granted there _was_ cool stuff on the front-end, like an e-mail flash based report that was interactive even when disconnected because the data was included in the report. There are tons of applications for that sort of thing.

The thing that pinched my nerve, was this statement, “While we recommend that any BI be sourced from a Data Warehouse, our solutions allow you to do it without one” …. Arrrggh!!! Another one. This is even more dangerous than the flat out refusal because there is an image of sincerity and honesty that is portrayed.

Yes, it can be done. You _can_ build BI stuff from spreadsheets, but, who in their right minds would want to? (I have seen very high-profile apps done using spreadsheets … stop … this will take me to another rant and lose focus on the current one)

Well, lets analogize what happens in a sales situation.

So, you walk into a car dealer and they say, we have car parts A, B and C which when you buy will look like a …. Ferrari … see this picture of the Ferrari … its the coolest thing in the world, which of course is backed up by … “Success Stories” … or “Sucker” stories as I like to call them. Some are actually true because people have done it the right way. More than you believe are …. fabrications (actually more like, let me tell you a part of the truth that would make A,B and C parts look like rock stars … so half-truths). I have seen more “Lipstick on a Pig” solutions than would care to elaborate in organizations whose integrity is unquestionable (Ha ha .. snicker).

Anyway, you are sold by looking at the picture of the Ferrari. Then you realize, well I have A, B and C, how the hell am I going to use them. They have instruction manuals but something appears to be missing … the bloody engine and drivetrain (or BI repository if you will). Then you hire people who know how to use A, B and C to build the engine. These folks may be experts in A, B and C but do not know how to build and engine right …. because … they have not even worked on cars. So you get a 50cc moped engine (star-schemas) and are able to move this thing that you bought and spent lots of money on. At this time, you are blinded by the fact that it is a moped because it moves … Whoa cool.

So, how come everyone forgot the bloody Ferrari that was sold. Also they did not tell you that the moped was only designed for 1 rider and the engine was put together so it may just stop working from time to time. Did I mention that you got this at the price of a Ferrari?

In the real-world, I know for sure that you cannot walk-in to a Ferrari dealership and walk out with a car, no matter how much money you have. Each Ferrari is bespoke to the person who buys it , like a custom tailored suit, and the average delivery time is upwards of two years. You could buy a used Ferrari from someone else, but it will not be tailored to you and may not be right for you.

Someone like me goes in and says … hmmm … Thats not a Ferrari. Not even close. That can not even be converted to a Ferrari with a lot of time and money, however, we will try and reuse the parts if possible … no guarantees.

I can build one for you, and it will take 2-years. But I will build it in a way (a flexible and scalable DW) that you should be able to drive it in a few months. It will be functional, and constantly improve over time, but, will not be the entire car to begin with. If you get rid of part C (which is not very good or useful) and buy part X, I will reduce the time to build as well. You are already skeptical after all that investment, some results that blind you, and difficulty trusting a new perspective.

Oh and by the way, because we built it this way, you can use it for a multitude of things, not just driving from point 1 to point 2.

To vendors, stop showing photographs of a BI solution because without the “engine and drivetrain” (Data Warehouse) it is simply not sustainable. It is also not a solution, but a sales gimmick.

To business users, stop getting “suckered” in to these lures and traps that are so strategically placed for people in upper management. To practitioners, please educate your folks before they make these decisions. I have said it before and I have said it again, “A BI solution without a well architected Data Warehouse is like a house without a foundation”. It can be built … but it will not last … Guaranteed.

2009/09/28

Where does Data Quality fit in?

Filed under: DV Model & Methodology, MDM — Sanjay Pande @ 13:56

It is common practice to really cleanse your data before it lands to your data warehouse. Don’t do it. Stop your “data laundering”. This is what the Data Vault methodology requires one to do by not cleansing the data on the way into the DW and only on the way out. By the way I am talking about a data warehouse designed in a minimum of 3 tiers. The 2 tier star-schema enterprise data marts are not data warehouses in my book and if that is what you build, just go away and do not read this, because you need to fix your foundation first.

So basically I am talking about a data flow from

Source -> Staging -> DW (DV) -> DM

There are two places where one can and should apply data quality measures 1) Before Source 2) After DW

So DQ (1) -> Source -> Staging -> DW -> DQ (2) -> DM

The first one is ideal where data gets cleansed before landing to a persistence layer in the source system. The second one is reality where the first one does not really work a 100%. Now, people question why can I not place it earlier than that. There are two places earlier

1) Source to Staging – Bad idea since it complicates extract times and defeats the purpose of staging.

2) Staging to DW – Bad idea because you do not have a true system of record anymore.

But is my source not a system of record. Yes, it is as long as data has not be changed/removed with access to audit logs that enable recreation of that state of data. Without that your DW is not audit friendly and cannot give true traceability without reverse engineering the data quality rules that were applied.

Some of the reasoning behind why the DV has a methodology of placing business rules, DQ and cleansing rules after the DV can be justified by the following positive implications of doing so:

- You do not own the data. Business does and you did not change it.

- There is no overhead in getting the data into the DW so it is fast.

- If the business gives you the wrong rule or decides to change a rule, you do not need to lose sleep about recovering what you fixed. You do not need to pull up your defenses using SLAs and essentially losing time to perform a fix that is inevitable and would eventually cost a lot of money.

- You have raw data to feed Identity Resolution engines for various purposes including MDM.

- You are performing DQ on an integrated dataset instead of disparate ones which may reduce the number of DQ processes.

- You can get valuable metrics about your DQ solution and even track the mistakes.

- You do not have to mix your DQ solution with your DW build which reduces complex tasks.

- You can resolve entities in the DW using discovery links post load.

- Full traceability and recoverability to a point in time including all bad data (this is important and at times may be legally required).

The newer DQ assistants allow technology to choose matches across entities which are presented to business for final resolution. These are very highly recommended and the DV architecture naturally feeds to this. So, if you are placing DQ before your DW, think again. It may be one of the biggest mistakes you are making.

2009/09/14

Flash and Movie Playback

Filed under: Flash and PHP Code — Tags: , , , , , — Dan Linstedt @ 05:25

I’ve been working with Flash again, and I must say: it REALLY IS the NEW language for BI and EDW coders.  If you don’t believe me, just look at the number of FLASH based BI reporting systems popping up out there!  It’s an INCREDIBLE tool.

Now, that said:  The issue is finding a very good player, one that streams, one that works well, one that continues to be improved.  There are hundreds of players out there that work really really well, however, only two in my opinion that are good enough to “work with”.  

The first is FlowPlayer, it has a neat interface, and plays movies well.  It is common to many different web-sites.   You can find it at: http://flowplayer.org  I’m not a huge fan (yet) of flowplayer, and believe it has quite a ways to go.  It offers a Flash and Javascript based plugin system which exposes the plugin code to the public through JavaScript.

The second is my favorite: JWPlayer, you can find it here: http://www.longtailvideo.com/   It is ‘open-source’ which I really love, and only requires a small fee to buy a commercial license.  They offer the entire source code, and the plug-in system is much easier to use.  The plugins also receive TONS of events, and can plug in to the Controll bar, the Model (streaming, video, audio, etc…) and the Event components.  You can even re-skin the whole player.  I am currently working on new plugins for the playback mechanism. 

The plugins I’m working on are in fact: BI / EDW related, and getting the streaming movie to be content driven.  I still do believe that the content of EDW and BI is king, and ensuring statistics, quality, and the learning process are what makes the difference.  Now, how does this relate to Data Vault?  You will see… very very soon.  I can’t discuss it now, but I AM working on Data Vault related components in FLASH…

Stay tuned to hear more later.

2009/09/06

The Data Vault scores very high on “Good Design”

Filed under: DV Model & Methodology — Tags: , , — Sanjay Pande @ 18:13

Some of you know and some of you do not, but, I look up to Paul Graham. I like his irreverent style and most of the time I relate to what he is saying. He along with Jason Dufair are my primary influences in getting to know functional programming via Common Lisp and Scheme.

I recently came across an article that I had read in 2002 and feel that everyone should read it, especially, if you are interested in “good design”. It has hit me earlier and this time I could not help comparing all attributes to the DV and each one made me smile a little.

Good design is simple - The DV is a simple approach. It breaks things down and shuns complexity.

Good design is timeless - The DV takes into account potential changes. I could not help looking at a slighly older “Mercedes Benz” one day and thinking, “Wow, this is so boring and elegant at the same time, but, I cannot keep my eyes away from it”. It was timeless and would fit into any era.

Good design solves the right problem - No debates on this one.

Good design is suggestive - “In software, it means you should give users a few basic elements that they can combine as they wish, like Lego.” - Paul Graham

The DV is like “Lego”.

Good design is often slightly funny

Good design is hard - I know it has taken Dan great efforts to get the DV to the simple shape that it looks like today. Just looking at the simplicity of the design suggests that.

Good design looks easy - No contest on this one.

Good design uses symmetry - It is all pattern based.

Good design resembles nature - The original idea of the DV developed from neural networks which is how the brain works. Nature at its finest.

Good design is redesign - The DV architecture takes into account potential changes and accepts that redesign is inevitable even before one begins. It gives you a sustainable way to redesign as well, including, how to alter structures without losing audit.

Good design can copy - The DV is a pattern based architecture.

Good design is often strange - At first glimpse it is, especially if you are used to something else.

Good design happens in chunks - The DV encourages this.

Good design is often daring – The DV challenges the status quo by being “better”, even when the existing methods are thought to be good enough.

I would strongly encourage you to read Paul’s article “Taste for Makers” from which these points are lifted, as, that gives more context to each of these points, and hats off to Dan and Paul.

2009/09/01

Make data rigid and structure fluid

Filed under: DV Model & Methodology — Tags: , , — Sanjay Pande @ 23:36

There is a lot of talk about adaptive systems and dynamic databases. There are a few interesting ideas and works in progress with the dynamic data vault as well. All this takes me back to my tinkering with functional programming languages like Common Lisp and Scheme where destructive constructs are discouraged and from the purist point of view, even, disallowed. The majority of people are used to imperative programming where the change of state is allowed. To illustrate with a simple pseudocode example.

Let us say that we have a variable x1 that contains a list of values

x1 = [a, b, c, d]

We apply a replace function, that takes 3 arguments – the variable, replace what, replace to what.

replace(x1, ‘b’, ‘z’) would result in [a, z, c, d]

In imperative style x1 would be modified to hold the new values whereas in functional style x1 persists the old values and a new list is created with the new value, which is why they are referred to as non-destructive constructs. This does make computation memory intensive and that is why languages like Lisp have had garbage collectors since 1959. Barring the memory overhead it is very powerful which is why these are most used in Artificial Intelligence, Search and complex security algorithms.

Now getting back to databases, they are storage constructs based on rigid structures which can hold variations of data values. In effect, the structure is rigid and the data is fluid. One can change the values via updates, remove values etc.

However, imagine a database that has rigid data but fluid structure. The data never changes in perpetuity but its representation can change. The DV design potentially solves this issue. It is possible to move data from one construct to another without losing audit, in effect changing the structure.

As an example let us take addresses. Say these are represented as hub satellites on a few difference hubs. Now in a raw DV, we let a Data Quality tool with Identity Resolution capabilities run on the sets to identify common data elements. We run the analyses over a period of time and learn through discovery that address is “hub’ifiable”. We also learn that addresses have now become important to the organization for analyses.

Now we need to create the Address hub and convert the hub satellites to link to the address hub. This is possible to do without losing audit in a DV architecture. The idea if Dynamic Data Vault stems from this and extends the concept to modify metadata and data integration code to adapt to the new structures.

Most databases however, are imperative in nature and strongly typed. If databases were fluid structure but rigid data, that would do tons for flexibility, although one “may” have to give up SQL as the choice of query language. The reason I say “may”, is that it is not that hard to create a pseudo’SQL language in a functional programming language.

Common Lisp has potential structural representation where code is data and data is code and persistence is a trivial exercise. XML with XQuery does have some of this but is nowhere near as powerful as a full fledged language and has its own headaches with parsing and representation.

From a scalability point of view functional languages have always been used for concurrent programming and fault tolerant systems. Why have relational databases not evolved much beyond what they were in the late 60s in terms of data representation? Why are we still limited by SQL in 2009?

2009/08/29

Truth vs. Fact: Estafette Posts

I am starting a series of posts around two of the most interesting notions of data warehousing in general and Data Vault in particluar, namely “Single Version of the Truth“, usually abreviated to SVT, and the Data Vault equivalent “Single Version of The Facts”, which we could shorten likewise to SVF. I’ll post links, overviews and information about these concepts and their influence on contemporary BI and DWH thinking. Since this is an estafette post I hope other DVI blog posters will also share their wisdom to these subjects.

2009/08/27

Building an Oracle Data Vault – Part 3: Loading HUBS AND LINKS

Filed under: DV Model & Methodology — Bruce McCartney @ 08:18

Here we will load the HUBS and LINKS using simple Oracle PL/SQL.  Loading HUBS is pretty straight forward.  Following is a flow diagram.  Using PL/SQL, I will use the Oracle MERGE command to perform a set-based UPSERT.  This technique scales well.

Here is a code snippet in (PL/SQL) , note that we only have one STG source for this HUB.  If there were more, I simply UNION all the sources together in the SELECT.

MERGE INTO HUB_GLOBAL_NAME hub
USING
(
SELECT DISTINCT
GLOBAL_NAME AS GLOBAL_NAME,
STG_DATA_SOURCE, STG_LAST_UPDATE_DTTM
FROM STG_ALL_TABLES
) stg
ON (hub.GLOBAL_NAME = stg.GLOBAL_NAME)
WHEN MATCHED THEN
UPDATE
SET
hub.HUB_LAST_SEEN_DTTM = hub.HUB_LOAD_DTTM
where (HUB_LAST_SEEN_DTTM != HUB_LOAD_DTTM)
WHEN NOT MATCHED THEN
INSERT
(
HUB_GLOBAL_NAME_SQN, GLOBAL_NAME, HUB_LOAD_DTTM,
HUB_DATA_SOURCE,HUB_LAST_SEEN_DTTM
)
VALUES (
HUB_GLOBAL_NAME_SQN.NEXTVAL,
stg.GLOBAL_NAME,
stg.STG_LAST_UPDATE_DTTM,
stg.STG_DATA_SOURCE,
stg.STG_LAST_UPDATE_DTTM
)
;

I will comment on DATEs used later, suffice to say the LOAD_DTTM comes from the STAGE table and the LAST_SEEN_DTTM is an indicator that show the date of the last point in time the Business Key(s) were not found in ANY feed.  Loading LINKS is very similar to HUBs; the source for the mapping is a list of distinct combinations of keys.  Again, the Oracle MERGE can do the trick.

MERGE INTO LNK_ALL_TABLES lnk
USING
(
SELECT /*+ RULE */  DISTINCT
HUB_GLOBAL_NAME.HUB_GLOBAL_NAME_SQN,
HUB_OWNER.HUB_OWNER_SQN,
HUB_TABLE_NAME.HUB_TABLE_NAME_SQN, STG_ALL_TABLES.STG_LAST_UPDATE_DTTM, STG_ALL_TABLES.STG_DATA_SOURCE
FROM STG_ALL_TABLES,
HUB_GLOBAL_NAME,
HUB_OWNER,
HUB_TABLE_NAME
WHERE
STG_ALL_TABLES.GLOBAL_NAME = HUB_GLOBAL_NAME.GLOBAL_NAME
AND HUB_GLOBAL_NAME.HUB_LOAD_DTTM = HUB_GLOBAL_NAME.HUB_LAST_SEEN_DTTM
AND STG_ALL_TABLES.OWNER = HUB_OWNER.OWNER
AND HUB_OWNER.HUB_LOAD_DTTM = HUB_OWNER.HUB_LAST_SEEN_DTTM
AND STG_ALL_TABLES.TABLE_NAME = HUB_TABLE_NAME.TABLE_NAME
AND HUB_TABLE_NAME.HUB_LOAD_DTTM = HUB_TABLE_NAME.HUB_LAST_SEEN_DTTM
) stg
ON (
stg.HUB_GLOBAL_NAME_SQN = lnk.HUB_GLOBAL_NAME_SQN
and stg.HUB_OWNER_SQN = lnk.HUB_OWNER_SQN
and stg.HUB_TABLE_NAME_SQN = lnk.HUB_TABLE_NAME_SQN
)
WHEN MATCHED THEN
UPDATE
SET
lnk.LNK_LAST_SEEN_DTTM    = lnk.LNK_LOAD_DTTM
WHERE   (LNK_LAST_SEEN_DTTM != LNK_LOAD_DTTM)
WHEN NOT MATCHED THEN
INSERT
(
LNK_ALL_TABLES_SQN,HUB_GLOBAL_NAME_SQN,HUB_OWNER_SQN,HUB_TABLE_NAME_SQN,LNK_LAST_SEEN_DTTM,LNK_LOAD_DTTM,LNK_DATA_SOURCE)
VALUES (
LNK_ALL_TABLES_SQN.NEXTVAL,
stg.HUB_GLOBAL_NAME_SQN,
stg.HUB_OWNER_SQN,
stg.HUB_TABLE_NAME_SQN,
stg.STG_LAST_UPDATE_DTTM, stg.STG_LAST_UPDATE_DTTM, stg.STG_DATA_SOURCE
);

This approach scales very nicely too.  Next time I will load the SATELLITE srtucture.

2009/08/26

Microsoft SQLServer2008 is a good candidate for Data Vault

Filed under: Hot News! — Dan Linstedt @ 05:19

I’ve been working with SQLServer2008 for quite a while now, and am really impressed with it’s abilities to execute on large data, on small or reasonably sized platforms, at HIGH SPEEDS.   When Microsoft acquired DatAllegro, I believed that they would gut the product, and mash the IP into SQLServer.  Well, it appears that’s exactly what they’ve done.  It’s now even better than it was before.

http://blogs.technet.com/dataplatforminsider/archive/2009/08/24/microsoft-ships-the-first-technology-preview-for-project-code-named-madison.aspx

I am a firm believer in what Microsoft is doing, and it’s efforts to create a huge scalable database platform.  The Data Vault is a good fit for this technology, especially with the advent of this newly upgraded platform, and I recommend anyone dealing with Data Vault to experiment.

Cheers,
Dan Linstedt

2009/08/25

Invest in Data Integration First … then Front-Ends

Filed under: DV Model & Methodology — Tags: , , — Sanjay Pande @ 13:18

In the BI space there are many components like Data Integration, Databases, Reporting and Analytic tools, data profiling, data quality, data mining etc, depending on the breadth of your solution.

Data Integration or ETL is what ties everything together and is the largest component of any BI project. Any BI expert can validate the numbers, however, even analysts like Gartner and Forrester estimate it to be between 60% to 90% of budget and effort in different reports depending on the nature of the solution. Unfortunately this is also the most invisible portion of any BI solution with the front-ends being the most visible.

It makes sense with the high-percentage of the ETL portions, it has the most scope to be optimized for reduced cost and delivery times of the overall project, especially if one uses smart tools and a templatable pattern based architecture like the Data Vault.

With the recent popularity of open-source BI, I have seen companies spending prime dollar on the reporting end and using open source ETL. One of the companies was using Cognos and MicroStrategy for the front-end and Pentaho for ETL. You can easily see where the money went on that project. The truth is open source reporting and analytic tools are already there and if one wants to use open source software, go all the way and spend some money on support.

If not, spend the primary portion of your budget in smart tools for data integration, which can help reduce build and maintenance costs and then on the front-ends. The reduces overall costs to project over time.

What is worse is when money is spent on certain inferior commercial data integration products which turn into sunk costs mid-project. I was recently involved at a client where I recommended switching the ETL tool before we started building. This would have saved the client time and money and overall project costs. They did not listen. The build of course takes more time, it will result in more maintenance down the road and they would have spent more money to have a solution that works but is not top-notch because of the inferior product used. From a budgeting standpoint they are ok with it because consulting time is budgeted for and tools are not. However, if they bought the right tools in time, it would have resulted in reduced consulting costs.

As an IT manager would you prefer to have spent money on an asset for your organization or would you prefer to just follow the standard budgeting procedures which are silly to begin with. My very simple recommendations for BI projects are as follows:

1) Pick the right people (business, technical, organizational)
2) Let the right people pick the right architecture (Data Vault of course for a DW).
3) Equip the right people with the right tools and atmosphere. (In the following order: Database, Data Integration, Data Quality, Data Profiling, Front-ends).

If there is one place where you should spend money it is on architecture and data integration tools. Some tools are better than others and everyone has certain biases and reasons for those biases, however, do get your team to articulate the reasons for the choice of tools.

Older Posts »

Powered by WordPress