In an earlier post (IBM Planning Analytics Data Modelling with Context), I stated that when modelling data as part of a Planning Analytics solution design, context clues should be developed through a process referred to as profiling and then “built in” to the data.
Here’s the concept.
Once you have assembled (or “pulled together”) your data source (or at least a significant amount of the data you expect to work with), you can begin the profiling process.
The process of data profiling involves logically “getting to know” the data you are modelling through query, experimentation, and review. With profiling, you can then use the information you have collected to add context (and/or apply new “perspectives”) to the data. This may mean manipulating the data to reformat it, add calculations, aggregations, or additional columns, or simply re-order it, and so on. The goal of this exercise is to make the data as valuable and usable as it can be to the consumer or downstream application. Typically, as part of this exercise, one would create visualizations, or “pictures”, of the point you are looking to make with the data.
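As an illustration, here is a minimal profiling sketch in Python with pandas. The file name (visits.csv) and column names (“state”, “visit_date”) are hypothetical stand-ins for whatever source you have assembled.

```python
# A minimal profiling sketch using pandas. visits.csv and its
# column names are hypothetical stand-ins for your own source.
import pandas as pd

df = pd.read_csv("visits.csv")

# "Get to know" the data: types, value ranges, missing values, cardinality.
print(df.dtypes)
print(df.describe(include="all"))
print(df.isna().sum())
print(df["state"].nunique())

# Add context discovered during profiling, e.g. a derived column.
df["visit_month"] = pd.to_datetime(df["visit_date"]).dt.to_period("M")
```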
The profiling process can be iterative and ongoing, and involves querying, experimenting with, and reviewing the data, then folding what you learn back into the model.
Now, let’s look at an example of what I propose.
Suppose we start by looking through the field or column names in our sample of data, and some ideas begin to come to mind: fields that could be reformatted, grouped into ranges, or combined into new calculated columns, and so on.
Taking some time to create visualizations of the data as part of the profiling process can yield innovative ideas for enhancing the data during modelling. The question to ask is, “what can we learn by visualizing this data?”
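To make that concrete, here is one quick “picture”, again assuming the hypothetical visits.csv with a “state” column. A simple bar chart of visit counts per state often surfaces outliers worth modelling around.

```python
# Quick visualization sketch; visits.csv and "state" are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("visits.csv")
df["state"].value_counts().plot(kind="bar", title="Hospital visits by state")
plt.tight_layout()
plt.show()
```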
One of my favorites is performing comparisons during data profiling. Beyond simple record counts (like the total number of smoking patients visiting a hospital versus the total number of non-smoking patients, for example), one might compare the total number of hospital visits for each state to the average number of visits per state. This requires calculating the total number of hospital visits by state as well as the overall total (then computing the average).
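A sketch of that comparison, under the same assumed visits.csv layout:

```python
import pandas as pd

df = pd.read_csv("visits.csv")  # hypothetical source, as above

# Total visits per state, and the average number of visits per state.
visits_by_state = df.groupby("state").size()
average_per_state = visits_by_state.mean()

# Embed the comparison so consumers don't have to recompute it.
comparison = visits_by_state.to_frame("total_visits")
comparison["vs_average"] = comparison["total_visits"] - average_per_state
print(comparison.sort_values("vs_average", ascending=False))
```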
What about calculating the percentage of total operating expense for each cost center and account and comparing it to a historical trend or budgeted amount?
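That, too, is straightforward to embed. As a sketch, assuming a hypothetical general ledger extract (operating_expense.csv) with cost_center, account, actual, and budget columns:

```python
import pandas as pd

# Hypothetical ledger extract; file and column names are assumptions.
gl = pd.read_csv("operating_expense.csv")

totals = (
    gl.groupby(["cost_center", "account"], as_index=False)[["actual", "budget"]]
    .sum()
)

# Each cost center/account as a percentage of total operating expense,
# and actuals as a percentage of the budgeted amount.
totals["pct_of_total"] = totals["actual"] / totals["actual"].sum() * 100
totals["pct_of_budget"] = totals["actual"] / totals["budget"] * 100
print(totals)
```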
During this process, being familiar with an organization’s KPIs (key performance indicators) is imperative, as this may prompt you to embed KPI calculations within the model from its inception.
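For example, a sketch of embedding a KPI at model-build time rather than at query time; the KPI here (cost per visit) and the data are invented for illustration, and your organization’s actual KPIs will differ.

```python
import pandas as pd

# Illustrative data only; the KPI and column names are assumptions.
facts = pd.DataFrame({
    "state": ["NY", "NY", "CA"],
    "visits": [120, 80, 200],
    "total_cost": [600000.0, 320000.0, 900000.0],
})

by_state = facts.groupby("state", as_index=False)[["visits", "total_cost"]].sum()

# Precompute the KPI so every downstream consumer gets it for free.
by_state["cost_per_visit"] = by_state["total_cost"] / by_state["visits"]
print(by_state)
```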
You get the idea. Knowing your data well, and embedding additional data points in the model by applying logic or performing calculations up front, frees the model from expending valuable resources processing raw, lowest-level data to generate information that most consumers will request regularly or that takes significant time to compute. It also makes dashboards and reports easier to build, and may prompt users to advance the value of the data and the model even further.