A friend of mine recently asked me for advice on interviewing a data scientist. We had a fairly lengthy discussion on the topic, but I didn’t feel like I really provided any insight. What I should have said is: “Ask them what they think of dashboards.”
Summary tools like dashboards are great. They make it easy for us to consume and extract meaning from large amounts of data.
They are also the source of colossal frustration.
Most of the time, shortcuts disconnect summary information from the actual data it represents. Whether these shortcuts were used because of tech barriers or for a faster delivery, the resulting dashboards don’t let you go deep enough.
This introduces two significant problems.
First, the ability to dig deep into the data to find the parts that make up the sum, so to speak, is forever lost. Personally, I usually want to dig down deep and see what’s going on in the data. And often, summary information dissects the data just enough to describe a set of points that is small enough to be consumable. So you SHOULD be able to drill down to see them individually.
But one lap around popular Business Intelligence and dashboard products, from pointed services like Google Analytics to broader, enterprise-grade offerings, shows that you can’t.
This might be because the importance of seeing the actual data is usually understated. Dashboards are highly diagnostic, but symptoms mostly manifest themselves at the macro level, so understanding root causes often requires diving into the micro level and doing ad-hoc analysis.
Imagine a hospital seeing high failure rates for a particular set of best-practices on a dashboard, with no ability to see those failures individually. In the descriptive analytics case, you need to deconstruct individual cases to understand exactly what went wrong. In the prescriptive analytics case, you need the data from each individual case available to ensure the rules are being followed. In both cases, you need ad-hoc analysis on the individual failures to verify the failure paths and understand problems they have in common.
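To make the hospital example concrete, here is a minimal Python sketch of a summary that keeps its link to the individual cases. Everything in it is hypothetical: the record fields (`case_id`, `practice`, `passed`), the practice names, and the `summarize` and `drill_down` helpers are made up for illustration and do not describe any particular product.

```python
from collections import defaultdict

# Hypothetical case records; field names and values are illustrative only.
cases = [
    {"case_id": 1, "practice": "practice_a", "passed": True},
    {"case_id": 2, "practice": "practice_a", "passed": False},
    {"case_id": 3, "practice": "practice_a", "passed": False},
    {"case_id": 4, "practice": "practice_b", "passed": True},
]


def summarize(records):
    """Return the failure rate per practice plus the ids of the failing cases."""
    groups = defaultdict(lambda: {"total": 0, "failed_ids": []})
    for record in records:
        group = groups[record["practice"]]
        group["total"] += 1
        if not record["passed"]:
            group["failed_ids"].append(record["case_id"])
    return {
        practice: {
            "failure_rate": len(group["failed_ids"]) / group["total"],
            "failed_ids": group["failed_ids"],
        }
        for practice, group in groups.items()
    }


def drill_down(records, summary, practice):
    """Return the individual failing records behind one summary number."""
    failed_ids = set(summary[practice]["failed_ids"])
    return [r for r in records if r["case_id"] in failed_ids]


summary = summarize(cases)
failures = drill_down(cases, summary, "practice_a")
```

The point of the design is that the aggregate never travels without the identifiers of the rows that produced it, so the ad-hoc analysis on individual failures stays one lookup away.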
Which is why a data scientist in particular should never trust a dashboard. That brings us to the second problem: you should always be able to reconstruct previous findings and assumptions before you pursue deeper insights and understandings.
It is important to remember that summaries and aggregations are easy to intentionally manipulate and unintentionally miscalculate. A good introduction to statistical manipulation is Joel Best’s Damned Lies and Statistics, or the classic How to Lie with Statistics by Darrell Huff.
It is important to be able to see the actual data that composes a summary or goes into a calculation so that you are always free to recalculate it. If you’re serious about your business, you cannot live without this. You must know if you’re making decisions based on accurate interpretations of the data. Without the underlying data, the summary becomes a one-sided argument that you cannot challenge.
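As a rough illustration of what being free to recalculate can look like, the Python sketch below recomputes a reported average from its underlying rows and flags any disagreement. The `verify_summary` helper, the tolerance, and the sample numbers are assumptions made up for this example.

```python
import math


def verify_summary(reported_mean, rows, tol=1e-9):
    """Recompute a mean from the underlying rows and compare it to the reported figure.

    Returns the recomputed mean and whether it matches the reported one.
    """
    if not rows:
        raise ValueError("no underlying rows to recompute from")
    recomputed = sum(rows) / len(rows)
    return recomputed, math.isclose(recomputed, reported_mean, abs_tol=tol)


# Hypothetical underlying values behind a single dashboard number.
rows = [12.0, 15.0, 9.0, 24.0]

recomputed, matches = verify_summary(15.0, rows)
_, inflated_matches = verify_summary(18.0, rows)
```

Here a reported value of 15.0 agrees with the rows, while a reported 18.0 does not. Without the rows, neither claim could be checked at all.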
Ensuring that you can access the composing data becomes increasingly important as your volume of data grows and the number of places your data comes from increases. This makes dissecting the raw data into consumable chunks even more of a challenge and the needed calculations much more complex.
But at the same time, the policy and business implications of those calculations will be much more drastic: complex calculations and large amounts of data mean that the threat of malicious data manipulation for personal gain becomes more serious. You can’t discover intentional data manipulation if you can’t get to your underlying data.
And that’s why it’s important to build and deploy functionality with the mantra of “data first”.
At MIOsoft, we have always taken this data-first approach, because the data is the one truth that can be counted on for specifications, forensics, and deep understanding of your business.
RealinfoQA, a healthcare offering built on our MIOedge platform, leverages MIOedge’s relationship discovery technologies at the database level to always maintain relationships between summary statistics and underlying data. The result is that a nurse can always follow the exact path a patient’s treatment took through a best practice algorithm, or a quality chief can always understand what is going on in his or her facility.
So while maintaining the intricate link between summary information and composing data is hard, it is extremely important, and the technology DOES exist to do it.
I urge you, whether you are a consumer or creator of dashboards or other aggregate insight, to do the following:
- Read Damned Lies and Statistics by Joel Best.
- Always insist on links between aggregations and the actual data.
Don’t let the increasingly complex world of extracting insight from data catch your business off guard. Jump right in. But always insist on data first.