In an earlier post (IBM Planning Analytics Data Modelling with Context), I stated that when modeling data as part of a Planning Analytics solution design, context clues should be developed through a process referred to as profiling and then "built in" to the data.
Here’s the concept.
Once you have assembled (or "pulled together") your data source, or at least a significant amount of the data you plan to model, I propose that you take the time to perform some "deep dives" into the data to "get to know" it beyond the generic details: column/field names, data types, size/length, refresh timetables, and so on.
The process of data profiling involves logically "getting to know" the data you are modeling through query, experimentation, and review. With profiling, you can then use the information you have collected to add context (and/or apply new "perspectives") to the data. This may mean manipulating the data: reformatting it, adding calculations, aggregations, or additional columns, or simply re-ordering it. The goal of this exercise is to make the data as valuable and usable as it can be to the consumer or downstream application. Typically, as part of this exercise, one would create visualizations, or "pictures", of the point you are looking to make with the data.
The profiling process can be iterative and on-going and involves:
- Pull together the data (or enough of the data that you are modeling),
- Profile the data (through query, experimentation, and review),
- add Perspective(s) (or context), and finally…
- Picture (visualize) the data,
- (perhaps) repeat the process…
Now, let’s look at an example of what I propose.
Suppose we start by looking through the field or column names in our sample of data, and some ideas come to mind. For example:
- What is the data time frame we are dealing with? Using the record date field, or perhaps a timestamp (by the way, I always recommend including timestamps on data records whenever possible), can we establish a period of time (or time frame) for the data? In other words, over what period of time was this data captured or generated?
- Is preaggregation valuable? Could we start “grouping the data” using certain fields such as sex, age and state or perhaps sales region or Profit Center?
- Is there any potential to group the data in new ways? Perhaps count values for a new group (what is the breakdown of smokers by age group or what is the ratio of smokers to the number of hospital visits)?
- Is there missing data we could add by applying business logic or performing standard calculations? Employee years of service or patient BMI come to mind.
And so on.
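To make these questions concrete, here is a minimal sketch in Python/pandas (rather than in Planning Analytics itself) against a tiny, invented set of hospital-visit records; the column names (visit_date, state, smoker, age) and the values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical sample of hospital-visit records.
visits = pd.DataFrame({
    "visit_date": pd.to_datetime(
        ["2023-01-05", "2023-03-17", "2023-06-30", "2023-06-30"]),
    "state": ["NY", "NY", "CA", "TX"],
    "smoker": [True, False, True, False],
    "age": [34, 52, 41, 29],
})

# What time frame does the data cover?
time_frame = (visits["visit_date"].min(), visits["visit_date"].max())

# Pre-aggregation: visit counts by state.
visits_by_state = visits.groupby("state").size()

# A new grouping: breakdown of smokers by age band.
visits["age_group"] = pd.cut(visits["age"], bins=[0, 30, 50, 120],
                             labels=["<30", "30-50", "50+"])
smokers_by_age = (visits[visits["smoker"]]
                  .groupby("age_group", observed=True).size())

# A derived ratio: smokers as a share of all hospital visits.
smoker_ratio = visits["smoker"].mean()
```

Each of these results is a candidate for "building in" to the model as a pre-computed attribute or aggregation, rather than leaving the raw records to be re-processed on every query.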
As part of the exercise, taking some time to create visualizations of the data during profiling can yield innovative ideas for enhancing the data as part of the modelling process. What you want to ask is, "What can we learn by visualizing this data?"
Comparisons and Trends
One of my favorites is performing comparisons during data profiling. Beyond simple record counts (like the total smoking patients visiting a hospital versus the total non-smoking patients, for example), one might compare the total number of hospital visits for each state to the average number of hospital visits per state. This would require calculating the total number of hospital visits by state as well as the total number of hospital visits overall (then computing the average).
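As a sketch of that state-versus-average comparison, again in pandas with a handful of invented visit records:

```python
import pandas as pd

# Hypothetical raw visit records; one row per hospital visit.
visits = pd.DataFrame({
    "state": ["NY", "NY", "NY", "CA", "CA", "TX"],
})

# Total number of hospital visits by state.
visits_by_state = visits.groupby("state").size()

# Average number of visits per state, computed from the per-state totals.
average_per_state = visits_by_state.mean()

# How far each state sits above or below that average.
delta_from_average = visits_by_state - average_per_state
```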
What about calculating the percentage of total operating expense for each cost center and account and comparing it to a historical trend or budgeted amount?
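The same pattern works for the expense percentage; the cost centers, accounts, and amounts below are invented for illustration:

```python
import pandas as pd

# Hypothetical operating-expense records by cost center and account.
expenses = pd.DataFrame({
    "cost_center": ["100", "100", "200", "200"],
    "account": ["Travel", "Salaries", "Travel", "Salaries"],
    "amount": [10_000.0, 40_000.0, 5_000.0, 45_000.0],
})

# Total expense per (cost center, account) combination.
by_center_account = expenses.groupby(["cost_center", "account"])["amount"].sum()

# Each combination's percentage of total operating expense.
pct_of_total = by_center_account / by_center_account.sum() * 100
```

Comparing pct_of_total against a budgeted or historical series would then be a simple aligned subtraction.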
During this process, being familiar with an organization's KPIs (key performance indicators) is imperative, as this will perhaps prompt you to embed KPI calculations within the model from its inception.
You get the idea. Knowing your data well and embedding additional data points within the model, by applying logic or performing calculations, frees the data model from expending its valuable resources processing the data at its lowest, raw levels to generate information that most consumers will request on a regular basis, or that would require significant processing time to compute. In addition, it allows for easier dashboard or report creation and may prompt a user to advance the value of the data and your model even further.