Blog

QueBIT Blog: Data Warehouse versus Data Lake – The What and the Why (Part 1)

Posted by Keith Hollen

Find me on:

Aug 16, 2018, 11:30:54 AM

How many projects have you been involved with where you told your business partner you have all the data they need, only to find out the data is far from complete or not in a form they can use? How many deadlines are missed as a result? How does this impact your relationship with other parts of your organization?

Regardless of who causes the delay, the project may suffer, project costs may be affected, and business sponsors may begin to lose faith.

How can you mitigate the risk of delayed data delivery? Ensure that you have a reliable data repository from which to source the data before estimating the project. If not, the creation of an appropriate repository should be added to the project timeline and project risk should be adjusted accordingly. Ultimately, your goal should be to understand the business requirements and plan the repository to fit the business need, so you can provide the required data, when it is needed, in a validated state, to fulfill the requirements of the project. Understanding the various types of repositories, and the processes that are followed to build and maintain them, can go a long way in ensuring you properly size, and complete a successful project.

So what kind of repository do you need? To answer that question, you need to understand the types of data repositories, the problems they are designed to solve, and the processes required to build them.

This is the first in a multi-part blog in which I’ll cover types of data repositories, their primary purpose, and practical ways to build them.

Data Repository Types

A data repository can generally be classified as one of three types: Data Warehouse, Data Mart, and Data Lake.

Data WarehouseData Warehouse-min

The term Data Warehouse may be the most overloaded term in data management. Ask five organizations what a data warehouse is, and you are likely to get five different definitions. You may also find evangelists who proclaim they use the only “True” data warehouse paradigm (think Inmon vs. Kimball.) Oddly, the size of an organization doesn’t determine how well they understand the true meaning of a data warehouse. The reason you care about their concept of a data warehouse is that it will help to determine the data management maturity level of the organization, which, in turn, helps you to understand how much work needs to be done to mitigate the data delivery risk. A data warehouse has the following characteristics:

  • Enterprise Context – It spans more than one area of the business
  • Validated and certified – data discrepancies are identified and corrected prior to be stored in the warehouse
  • Subject-oriented versus process-oriented – For example, you can report on sales at the company level, not just at the department or process level
  • Integrated – Information from various areas of the organization are rationalized to common, meanings, values, and granularity
  • Detailed - but may contain summarized data as well
  • Historical versus point-in-time
  • Read only
  • Schema on write – The format of the data is determined before it is stored so consumers don’t have to figure out the raw format of data to be able to read it

The primary use-case for a data warehouse is cross-departmental, historical reporting. A data warehouse should be considered a source of trusted data that can be relied upon to be accurate, timely, and certified as correct. They are also used as the source for smaller data marts, which are covered next.

Many organizations will say they have a data warehouse but often fail in the areas of integration and history. I’ve seen several organizations who drop their operational data, unaltered, on a database and claim to have a data warehouse. Their main use is point-in-time operational reporting and they overwrite the data the next day. This hardly meets the requirements to be classified as a data warehouse.

Data MartData Lake-min

Data marts have these characteristics:

  • Departmental Context – the primary focus is to provide departmental data in a format that is easily consumed
  • Subset of data warehouse – It can be argued that a data mart can be built without the presence of a data warehouse. In more mature data management organizations, data marts are source from the data warehouse
  • Schema on write – The format of the data is determined before it is stored so consumers don’t have to figure out the raw format of data to be able to read it
  • May be based on a star or snowflake schema
  • May be historical and/or point in time, depending on the needs of the consumer

The primary use case for a data mart is departmental reporting or discovery. An example would be for a purchasing department to evaluate spending patterns across vendors to reduce the number of suppliers and be better prepared for negotiating prices. A use case closer to home would be for a finance department to do financial planning and what if analysis.

Data Lake

Data lakes are more recent and are generally considered to be stored on big data platforms. The characteristics of a data lake are:

  • Landing spot for various types of data – data may be structured or unstructured and is stored in flat files.
  • Native format – data is landed as it is acquired. No transformations or quality checks are performed prior to landing the data
  • Schema on read – The consumers of the data must know how to read the files to process them
  • Large volumes of data – They are typically used to stored large volumes of data such as telemetry, social media feeds, and other IoT sources.

A data lake is a good pattern to use when batch analytics must be performed using many large data files. Newer big data technologies such as Kafka and Spark allow for near-real time streaming and event processing as well. Often, after the data has been analyzed and processed, it is stored in one of the other repositories for easier reporting by end users.

Now that we have a common definition from which to proceed, next time, we’ll cover the processes required to build and maintain them. Keep in mind that some organizations may use all these patterns depending on their needs.

   

Blog Search

Subscribe to Email Updates

Popular Posts

Recent Posts

Follow Me