In my previous post, Data Warehouse versus Data Lake – The What and the Why (Part one), I discussed the differences between data lakes and data warehouses and when you would use each pattern. In this post I’ll discuss, at a high level, an iterative process for building a data warehouse.
I’ll begin by addressing the myth that a data warehouse takes forever to build and deliver. While I’ve seen several instances where the build phase seemed to take forever, my experience has been that the approach was the problem. Some people believe that it takes at least six months of planning before you can even begin to create the first table (I did early in my career). This generally occurs when an organization doesn’t have a good handle on what data they have, where it is, or what the business really needs. This isn’t a data warehouse build problem; it’s a data governance problem. However, even if this is the case, using the principals of agile, you can incrementally build your data warehouse and begin delivering value within a few months. If you’re new to Agile Software Development, you should read the Manifesto for Agile Software Development and What is Scrum?
The process I’ll describe consists of six steps. The first step is required as part of the first iteration. It is optional for the remaining steps depending on whether new requirements are discovered during the build process. Steps two through six represent the iteration for build and delivery.
Step 1: Identify the Business Cases
Don’t try to boil the ocean. And don’t gamble on the “If we build it, they will come” mentality. Work with the business sponsor to clearly identify what they need and how often they need it. Have a clear definition of how each business case will help the company and a clear definition of success factors. The more you know about how the data warehouse will be used, the more likely you are to build something that will be used. Prioritize the business cases. Break things down into manageable sizes. You should be targeting six weeks per delivery cycle. Time-box this process to no more than two weeks.
Step 2: Identify the needed data elements
Take the first, or next, item from the backlog and work with the business sponsor to identify the data elements required to satisfy the business requirements. Have the business sponsor, or a designee, create a Business Glossary and include data element name, definition, allowable values, and system of record. If calculations are required, have them document those as well. You don’t need expensive software to do this. Use a spreadsheet in the beginning. You can transfer the data to a meta-data repository later if necessary. Time-box this process to one week.
Step 3: Map and sample the data
Using the Business Glossary and the list of data elements required for this business case, do some basic sampling. Create a list of possible values for each data element and compare these to the values in the Business Glossary. Depending on the data type of the element, you’ll generally want the list of unique values. Where there are dependencies between elements, create a list of unique combinations. Work with the business sponsor to resolve inconsistencies. Time-box this process to one week.
Step 4: Build a prototype
Build prototype data structures. Don’t worry about perfection but do consider flexibility. This is a prototype and will likely change for each iteration. The goal of this step is to get the required data into a format that will allow you to quickly build sample queries or spreadsheets to review with the business sponsor. Set the expectation that at this point, you are just trying to validate the data required to satisfy the business case. You should be able to identify common dimensions and metric sets. Time-box this process to no more than two weeks.
Step 5: Stage the data
You may be tempted to build processes which load the target tables directly from the source tables. While this may seem like the simple solution, it has several problems: First, if the load process fails, you need to rerun from the top to ensure consistency between tables. Second, most shops want to minimize the time that the production source tables are being hit by data warehouse loads. Extract reruns should be limited. Extracting into staging tables allows for multiple reruns if load problems are encountered. The key here is to decouple the extract from the load.
Create a control table with a UUID as the primary key to hold the load time, completion time, and unique identifier for the load process. The staging tables should be identical to the source tables with a few exceptions. For each staging table, create a surrogate key, a load timestamp, and the unique identifier for the load process from the control table. This allows you to determine which rows were loaded together and when. You’ll need to know this for debugging problems. Do not enforce the primary key of the source tables on the staging tables. Since you are staging into nearly identical tables, this process shouldn’t take more than a week to build.
Step 6: Deliver
This is the point where you create production ready structures for your data warehouse. Your target tables will be patterned after the prototype structures. You’ll also add a column for the surrogate key to hold the unique identifier of the source staging table row. This provides lineage back to the source system. Make the staging and load process production ready at this point. This process shouldn’t take more than one to two weeks.
Perform the process again from step two for each of the remaining business processes.
In this blog I covered a high-level, six-step process for building a data warehouse. The process is based on agile principles. Following this process, you should be able to deliver to the business every six to eight weeks. In my next post, I’ll cover building a data lake.
Feel free to reach out if you have questions or comments. firstname.lastname@example.org