One of the key notions of data visualization is that it can inspire insight about the data being presented. The idea of generating or spurring insights has been a core objective that visualization developers strive to achieve. But just what is an insight? How do we identify the insights that a visualization inspires? This is a tough question that the visualization research community has been grappling with for quite a while.
I had cause to revisit that question late last fall when the topic of our weekly Visualization Group meeting was a paper from SIGMOD ’17, “Extracting Top-K Insights from Multi-dimensional Data”, by Tang, Han, Yiu, Ding, and Zhang.1 In this fascinating project, the research team developed methods to automatically (algorithmically) identify the top insights that can be gleaned from a data set such as sales data over time for a group of products. Note that this research comes from the Database community, which is obviously quite different from the data visualization research community.
To better understand what the developed algorithm does, suppose we have sales records of five different products over a five-year period. Potential insights from that data might be that a particular product’s sales show an increasing trend over time (i.e., the delta or change from year to year is growing), or that another product’s sales ranking within the group is falling each year.
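To make those two example observations concrete, here is a minimal Python sketch. It is not the algorithm from the paper. The sales figures, product names, and helper functions (`yearly_deltas`, `yearly_ranks`, `strictly_increasing`) are all made up for illustration.

```python
# Hypothetical yearly sales (units) for five products over five years.
# All numbers are invented for illustration only.
sales = {
    "A": [100, 110, 125, 145, 170],
    "B": [200, 190, 180, 160, 150],
    "C": [150, 160, 185, 190, 195],
    "D": [120, 140, 150, 165, 175],
    "E": [180, 195, 200, 205, 210],
}


def yearly_deltas(values):
    """Change from each year to the next."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def yearly_ranks(all_sales, product):
    """Rank of `product` within the group for each year (1 = best seller)."""
    n_years = len(all_sales[product])
    ranks = []
    for year in range(n_years):
        ordered = sorted(all_sales, key=lambda p: all_sales[p][year], reverse=True)
        ranks.append(ordered.index(product) + 1)
    return ranks


def strictly_increasing(seq):
    """True when every element is larger than the one before it."""
    return all(later > earlier for earlier, later in zip(seq, seq[1:]))


# Product A: the year-to-year delta itself keeps growing.
deltas_a = yearly_deltas(sales["A"])
print(deltas_a, strictly_increasing(deltas_a))

# Product B: its rank number climbs each year, i.e. its ranking is falling.
ranks_b = yearly_ranks(sales, "B")
print(ranks_b, strictly_increasing(ranks_b))
```

In this toy data, product A plays the role of the "growing delta" example and product B the "falling rank" example. A real system would also have to decide which of the many such candidate observations are worth reporting.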
Amidst all the debate within the visualization research community about what constitutes an insight, I was curious to see how Tang et al. would characterize one. They describe an insight as “an interesting observation derived from aggregation in multiple steps.” Furthermore, the researchers explain that such insights have two typical usages in business applications: to “provide informative summaries of the data to non-expert users who do not know exactly what they are looking for” and to “guide directions for data exploration.”
The heart of the paper is their algorithm for finding the “Top-k” insights from a data set. Needless to say, it is quite complex, and I could not follow every detail, but ultimately it is about identifying insights and quantifying their “interestingness”. Most insights they find seem to take on one of two flavors: “point” insights, where values are remarkably different from others, or “shape” insights, which show rising or falling trends.
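As a rough illustration of the two flavors, the sketch below scores a “point” observation by how far the largest value stands apart from the rest, and a “shape” observation by the slope of a least-squares line. These two scoring functions are my own stand-ins, chosen for simplicity. They are not the interestingness measures defined in the paper, and the function names and sample numbers are hypothetical.

```python
from statistics import mean, pstdev


def point_score(values):
    """How far the largest value sits above the others, measured in
    standard deviations of the others. Assumes at least three values."""
    top = max(values)
    rest = list(values)
    rest.remove(top)  # drop one occurrence of the largest value
    spread = pstdev(rest)
    if spread == 0:
        # All remaining values are identical; any gap is infinitely unusual.
        return 0.0 if top == rest[0] else float("inf")
    return (top - mean(rest)) / spread


def shape_score(values):
    """Least-squares slope of the values against their positions 0..n-1.
    Positive means a rising trend, negative a falling one.
    Assumes at least two values."""
    xs = range(len(values))
    x_bar = mean(xs)
    y_bar = mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical share of SUV sales per brand: one brand clearly stands out.
print(point_score([10, 12, 11, 40, 9]))

# Hypothetical market share by year: a steady rise, then a steady fall.
print(shape_score([10, 12, 14, 16, 18]))
print(shape_score([5, 4, 3, 2, 1]))
```

A ranking step could then sort candidate observations by scores like these and keep the highest-scoring k, which is the general spirit of picking the most “interesting” ones.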
The paper contains a case study on car and computer tablet sales data. Their algorithm identified the following example top insights:
- When measuring the importance of SUV sales for a certain brand, brand F is outstanding number 1.
- There is a rising trend of SUV’s market share.
- In 2014, SUV exhibits most advantage over other categories than ever.
- The yearly increase of tablet sales is slowing down.
- 2012/04-07’s yearly increase of tablet sales is remarkably lower than ever.
Finally, the authors conduct a user study in which they have data analysts and managers rate the insights found by their algorithm along usefulness and difficulty dimensions. The algorithm fares well on both measures. Additionally, a comparison study of senior database researchers identifying insights via “traditional” methods uncovers a dramatic result: the average time taken was 29.2 minutes using SQL, 14.2 minutes using Excel pivot tables, and 0.17 seconds using the Top-k algorithm. The machine triumphs yet again! :^)
I was fascinated by their characterization of data insights and their descriptions of insight characteristics. But how do those notions compare with other communities’ views of insight?
I believe that a very common impression of an insight, one harbored by many people, is as a kind of “a-ha” moment when a person figures out an answer or a solution to a problem that has been simmering for a while. This perception reminds me of the famous scenario where a light bulb goes on over a person’s head while they’re in the shower, a true “Eureka!” moment.
But I don’t feel that’s how the data visualization community most commonly views insight. Chris North actually defined an insight as being an individual observation about data by a person, a unit of discovery.2 He believes that insights are complex, deep, qualitative, relevant, and unexpected. Would the insights found by Tang et al.’s algorithm meet those criteria? I’m not sure.
Personally, I have always resonated with the characterization of insights by Chang, Ziemkiewicz, Green, & Ribarsky.3 Their view contrasts with the spontaneous a-ha perception described above. Instead, they believe that insight is much more about knowledge-building and model-confirmation. It is like a substance that people acquire with the aid of systems.
When I hear someone say that a “visualization gave them insights about a data set”, I tend to be thinking along the lines of Chang’s characterization. In fact, my former GT colleagues Ji Soo Yi, Youn-ah Kang, Julie Jacko, and I reflected on insight in an old BELIV workshop paper.4 In it, we focused on the processes that one undertakes in order to gain insight. This frequently occurs in “sensemaking” scenarios. We found four processes through which people frequently obtain insight using visualizations: provide an overview, adjust, detect a pattern, and match a mental model.
I have always been struck by the importance of context and existing domain knowledge to insights too. A person’s pre-existing knowledge about a data set and its domain has a big influence on what they will consider a data insight. For a data set about wines of the world, the set of insights a novice uncovers may simply be ho-hum background information to a wine connoisseur. When determining insights about a data set, it’s likely safest to assume the person doing the exploration is unfamiliar with the data and its domain, in order to establish a common baseline.
Looping back to the paper by Tang et al., ultimately I’m not sure that I’d describe the statements that their algorithm produces as “insights”. Maybe they’re interesting data facts or data observations, but insights somehow feel to me like deeper understandings of the characteristics and implications of a data set. This in no way diminishes the achievement of Tang et al. That they can automatically identify salient and useful observations about a data set is quite remarkable.
As we move forward, it will be interesting to see if the different academic sub-communities (cognitive science, databases, KDD, visualization) can come to some shared understanding of just what insights are and how we can better help people find them. Once we do that, then maybe we can start to develop evaluation methods to determine whether particular visualizations actually do a good job generating insights. I’m also especially excited by systems that will be able to combine techniques from multiple areas – for example, systems that automatically generate insights about a data set, support those insights through illustrative visualizations, and allow analysts to manually explore the data through visualizations to uncover their own unique insights.
1 B. Tang, S. Han, M.L. Yiu, R. Ding, and D. Zhang. “Extracting Top-K Insights from Multi-dimensional Data.” In Proc. of SIGMOD ’17. May 2017, pp. 1509-1524.
2 C. North. “Toward Measuring Visualization Insight.” IEEE Computer Graphics & Applications 26, 3 (May 2006), pp. 6-9.
3 R. Chang, C. Ziemkiewicz, T.M. Green, and W. Ribarsky. “Defining insight for visual analytics.” IEEE Computer Graphics & Applications 29, 2 (March 2009), pp. 14-17.
4 J.S. Yi, Y. Kang, J. Stasko, and J. Jacko. “Understanding and Characterizing Insights: How Do People Gain Insights Using Information Visualization.” In Proc. of BELIV ’08. April 2008, pp. 39-44.