How many projects have you been involved with where you told your business partner you have all the data they need, only to find out the data is far from complete or not in a form they can use? How many deadlines are missed as a result? How does this impact your relationship with other parts of your organization?
Regardless of who causes the delay, the project may suffer, project costs may be affected, and business sponsors may begin to lose faith.
How can you mitigate the risk of delayed data delivery? Ensure that you have a reliable data repository from which to source the data before estimating the project. If not, the creation of an appropriate repository should be added to the project timeline and project risk should be adjusted accordingly. Ultimately, your goal should be to understand the business requirements and plan the repository to fit the business need, so you can provide the required data, when it is needed, in a validated state, to fulfill the requirements of the project. Understanding the various types of repositories, and the processes that are followed to build and maintain them, can go a long way in ensuring you properly size, and complete a successful project.
So what kind of repository do you need? To answer that question, you need to understand the types of data repositories, the problems they are designed to solve, and the processes required to build them.
This is the first in a multi-part blog in which I’ll cover types of data repositories, their primary purpose, and practical ways to build them.
A data repository can generally be classified as one of three types: Data Warehouse, Data Mart, and Data Lake.
The term Data Warehouse may be the most overloaded term in data management. Ask five organizations what a data warehouse is, and you are likely to get five different definitions. You may also find evangelists who proclaim they use the only “True” data warehouse paradigm (think Inmon vs. Kimball.) Oddly, the size of an organization doesn’t determine how well they understand the true meaning of a data warehouse. The reason you care about their concept of a data warehouse is that it will help to determine the data management maturity level of the organization, which, in turn, helps you to understand how much work needs to be done to mitigate the data delivery risk. A data warehouse has the following characteristics:
The primary use-case for a data warehouse is cross-departmental, historical reporting. A data warehouse should be considered a source of trusted data that can be relied upon to be accurate, timely, and certified as correct. They are also used as the source for smaller data marts, which are covered next.
Many organizations will say they have a data warehouse but often fail in the areas of integration and history. I’ve seen several organizations who drop their operational data, unaltered, on a database and claim to have a data warehouse. Their main use is point-in-time operational reporting and they overwrite the data the next day. This hardly meets the requirements to be classified as a data warehouse.
Data marts have these characteristics:
The primary use case for a data mart is departmental reporting or discovery. An example would be for a purchasing department to evaluate spending patterns across vendors to reduce the number of suppliers and be better prepared for negotiating prices. A use case closer to home would be for a finance department to do financial planning and what if analysis.
Data lakes are more recent and are generally considered to be stored on big data platforms. The characteristics of a data lake are:
A data lake is a good pattern to use when batch analytics must be performed using many large data files. Newer big data technologies such as Kafka and Spark allow for near-real time streaming and event processing as well. Often, after the data has been analyzed and processed, it is stored in one of the other repositories for easier reporting by end users.
Now that we have a common definition from which to proceed, next time, we’ll cover the processes required to build and maintain them. Keep in mind that some organizations may use all these patterns depending on their needs.