Thanks for taking the time to read my thoughts about Visual Business Intelligence. This blog provides me (and others on occasion) with a venue for ideas and opinions that are either too urgent to wait for a full-blown article or too limited in length, scope, or development to require the larger venue. For a selection of articles, white papers, and books, please visit my library.

 

So Far, VR-Enabled Data Visualization Is Nonsense

May 14th, 2018

Few data technologies are subject to more hype these days than VR-enabled data visualization. I have never seen a single example that adds value and therefore makes sense. Those who promote it don’t base their claims on actual evidence that it works. Instead, they tend to spout a lot of misinformation about visual perception and cognition. Those who have actually taken the time to study visual perception and cognition could take each of these claims apart with ease. VR has the cool factor going for it and vendors are capitalizing on this fact.

VR certainly has its applications. Data visualization just doesn’t seem to be one of them and it’s unlikely that this will change. If it does at some point in the future, I’ll gladly embrace it. Navigating physical reality in a virtual, computer-generated manner can indeed be useful. I recently visited the beautiful medieval town of Cesky Krumlov in the Czech Republic near the Austrian border. I could have relied solely on photographs and descriptions in a guide book, but walking in the midst of that old city, experiencing it directly with my own senses, enhanced the experience. Had I not been able to visit it personally, a VR tour of Cesky Krumlov could have provided a richer experience than photographs and words alone. Data visualizations, however, display abstract data, not physical reality, such as a city. There is no advantage that we have discovered so far, either perceptual or cognitive, to flying around inside a VR version of the kind of abstract data that we display in data visualizations. We can see and make sense of the data more effectively using 2-D or, on rare occasions, 3-D displays projected onto a flat plane (e.g., a screen) without donning a VR headset.

I was prompted to write this blog post by a recent article titled “Data visualization in mixed reality can unlock big data’s potential,” by Amir Bozorgzahed. This fellow is the cofounder and CEO of Virtuleap and host of the Global WebXR Hackathon, which puts his interest in perspective. The article quotes several software executives who have VR products to sell, and the claims that they make are misleading. They take advantage of the gullibility of people who are already susceptible to the allure of technological hyperbole that goes by such names as VR, Big Data, AI, and self-service analytics. They market their VR-enabled data visualization tools as techno-magical—capable of turning anyone into a skilled data analyst without an ounce of training, except in the use of their VR tools.

Let’s examine a few of the claims made in the article, beginning with the following:

The tech enables not only enterprises and organizations, but anyone, to use their spatial intelligence to spot patterns and make connections that breakthrough the tangled clutter of big data in a way that has been out of reach even with traditional 2D analytics.

“Anyone” can use their “spatial intelligence to spot patterns and make connections.” Wow, this is truly magical and downright absurd. While it is true that spatial perception is built into our brains, it is not true that we can use this ability to make sense of abstract data without having developed an array of data sensemaking skills.

The self-service claims of VR data visualization can get even more outlandish. Consider the following excerpt from the article, which describes WebVR's "forthcoming seismic-upgrade":

In fact, their platform wasn’t designed to cater to just highly-trained data scientists, but for anyone with a stake in the game. In the not so distant future, I picture the average Joe or Jane regularly making use of their spatial intelligence to slice and dice big data of any kind, because everyone has the basic skill-sets required to play Sherlock Holmes in mixed reality. All they need to get started is access to big data sets, which I also foresee as being more prevalent not too long from now.

Amazing! I suppose it’s true that everyone can “play” Sherlock Holmes, but playing at it is quite different from sleuthing with skill.

Here’s an example of a VR data visualization that was included in the article:

First of all, you don’t need VR to view data in this manner. At this moment you’re viewing this example on a screen or printed page. You do need VR hardware and software, however, to virtually place yourself in the middle of a 3-D scatter plot and fly around in it, but this wouldn’t make the data more accessible, perceptible, or understandable. Viewing the data laid out in front of us makes it easier to find and make sense of the meaningful patterns that exist within.

The spatial perception that is built into the human brain can indeed be leveraged, using data visualization, to make sense of data. It is not true, however, that it can do so independent of a host of other hard-won skills. Here’s another similar excerpt from the article:

Pattern recognition is an inherent talent that we all possess; the evolutionary edge that sets us apart from the animal kingdom. So, it’s not so much that immersive data visualization unlocks big data but, rather, that it allows us to interact with big data in a way that is natural for us.

This is quite misleading. Other animals also have tremendously good pattern recognition abilities built into their brains, in many cases much better than ours. What sets humans apart with regard to pattern recognition is our ability to reason about patterns in abstract ways and to attribute meaning to them, a tendency sometimes called patternicity. This is both a blessing and a curse, however, for we can and often do see patterns that are entirely meaningless. We are prolific meaning generators, but separating valid from illusory meanings requires a rich set of data sensemaking skills. No tool, including and perhaps especially VR, will replace the need for these skills.

Here’s another visualization that’s featured in the article:

The caption describes this as “a volatile blockchain market.” What is the claim?

The Bitcoin blockchain in particular pushes the limits of traditional data visualization technology, as its support for transactions involving multiple payers and multiple payees and high transactional volume would create an incomprehensible jumble of overlapping points on any two-dimensional viewer.

Let’s think about this for a moment. If we view a forest from the outside, it appears as a “jumble” of trees. Due to occlusion, we can’t see each of the trees. If we walk into that forest, we can examine individual trees, but we lose sight of the forest. This is a fundamental problem that we often face when trying to visualize a large and complex data set. We typically attempt to resolve this challenge by finding ways to visualize subsets of data while simultaneously viewing how those subsets fit into the larger context of the whole. A traditional data visualization approach to this problem involves the use of concurrent “focus+context” displays to keep from getting lost in the forest while focusing on the trees. Nothing about VR helps us resolve this challenge. In fact, compared to a screen-based display, VR just makes it easier to get lost in the forest.
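To make the focus+context idea concrete, here's a minimal sketch (in Python with matplotlib, purely for illustration; the data is synthetic and no particular product is implied): an overview panel displays the whole data set with the region of interest outlined, while a detail panel zooms in on that region, letting us examine the trees without losing sight of the forest.

```python
# A minimal sketch of a concurrent "focus+context" display, using synthetic data:
# an overview panel shows the whole data set with the region of interest outlined,
# while a detail panel zooms in on that region.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

rng = np.random.default_rng(2)
x, y = rng.normal(size=(2, 5000))

fig, (context_ax, focus_ax) = plt.subplots(1, 2, figsize=(10, 4))
context_ax.scatter(x, y, s=2, alpha=0.3)
context_ax.add_patch(Rectangle((0.0, 0.0), 1.0, 1.0, fill=False, edgecolor="red"))
context_ax.set_title("Context: the whole data set")

in_focus = (x > 0) & (x < 1) & (y > 0) & (y < 1)
focus_ax.scatter(x[in_focus], y[in_focus], s=8)
focus_ax.set_xlim(0, 1)
focus_ax.set_ylim(0, 1)
focus_ax.set_title("Focus: the region of interest")
plt.show()
```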

Here’s the ultimate expression of nonsense that I encountered at the end of Bozorgzahed’s article:

We have reached a point in time where much of the vast digital landscape of data can be now rendered into visual expressions that, paired up with artificial intelligence, can be readily deciphered and understood by anyone with simply the interest to mine big data. And all this because the underlying tech has become advanced enough to finally align with how we visually process the world.

Notice the abundant sprinkling of buzzwords in this final bit of marketing. When you combine data visualization with VR, AI, and Big Data you have a magic trick as impressive as anything that David Copperfield could pull off on a Las Vegas stage, but one that is just as much an illusion.

I will continue saying what I have said before too many times to count: data sensemaking requires skills that must be learned. No tool will replace the need for these skills. It’s time that we accept the unpopular truth that data sensemaking requires a great deal of training and effort. There are no magic bullets, including VR.

Logarithmic Confusion

March 21st, 2018

We typically think of quantitative scales as linear, with equal quantities from one labeled value to the next. For example, a quantitative scale ranging from 0 to 1000 might be subdivided into equal intervals of 100 each. Linear scales seem natural to us. If we took a car trip of 1000 miles, we might imagine that distance as subdivided into ten 100-mile segments. It isn't likely that we would imagine it subdivided into four logarithmic segments consisting of 1-, 9-, 90-, and 900-mile intervals. Similarly, we think of time's passage—also quantitative—in terms of days, weeks, months, years, decades, centuries, or millennia; intervals that are equal (or in the case of months, roughly equal) in duration.

Logarithms and their scales are quite useful in mathematics and at times in data analysis, but they are only useful for presenting data in those relatively rare cases when the audience has been trained to think in logarithms. With training, we can learn to think in logarithms, although I doubt that it would ever come as easily or naturally as thinking in linear units.

For my own analytical purposes, I use logarithmic scales primarily for a single task: to compare rates of change. When two time series are displayed in a line graph, using a logarithmic scale allows us to easily compare the rates of change along the two lines by comparing their slopes, for equal slopes represent equal rates of change. This works because units along a logarithmic scale increase by rate (e.g., ten times the previous value for a log base 10 scale or two times the previous value for a log base 2 scale), not by amount. Even in this case, however, I would not ordinarily report to others what I’d discovered about rates of change using a graph with a logarithmic scale, for all but a few people would misunderstand it.
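To illustrate, here's a minimal sketch in Python (using matplotlib purely for illustration; no particular BI tool is implied). Two series grow at the same 10% rate per period from different starting values: on a linear scale their lines diverge, but on a logarithmic scale they are parallel, because equal slopes represent equal rates of change.

```python
# A minimal sketch: two series growing at the same 10% rate per period,
# starting from different values. On a linear scale the lines diverge;
# on a logarithmic scale they are parallel, because equal slopes
# represent equal rates of change.
import matplotlib.pyplot as plt

periods = list(range(20))
series_a = [100 * 1.10 ** t for t in periods]    # 10% growth from 100
series_b = [1_000 * 1.10 ** t for t in periods]  # 10% growth from 1,000

fig, (linear_ax, log_ax) = plt.subplots(1, 2, figsize=(10, 4))
for ax, scale in ((linear_ax, "linear"), (log_ax, "log")):
    ax.plot(periods, series_a, label="Series A")
    ax.plot(periods, series_b, label="Series B")
    ax.set_yscale(scale)
    ax.set_title(f"{scale} scale")
    ax.legend()
plt.show()
```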

I decided to write this blog piece when I ran across the following graph in Steven Pinker’s new book Enlightenment Now:

The darkest line, which represents the worldwide distribution of per capita income in 2015, is highlighted as the star of this graph. It has the appearance of a normal, bell-shaped distribution. This shape suggests an equitable distribution of income, but look more closely. In particular, notice the income scale along the X axis. Although the labels along the scale do not consistently represent logarithmic increments—odd but never explained—the scale is indeed logarithmic. Had a linear scale been used, the income distribution would appear significantly skewed with a peak nearer to the lower end and a long declining tail extending to the right. I can think of no valid reason for using a logarithmic scale in this case. A linear scale ranging from $0 per day at the low end to $250 per day or so at the high end would work fine. Ordinarily, $25 intervals would work well for a range of $250, breaking the scale into ten intervals, but this wouldn't allow the extreme poverty threshold of just under $2.00 to be delineated because it would be buried within the initial interval of $0 to $25. To accommodate this particular need, tiny intervals of $2.00 each could be used throughout the scale, placing extreme poverty approximately within the first interval. As an alternative, larger intervals could be used and the percentage of people below the extreme poverty threshold could be noted as a number.
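To see how a logarithmic scale can lend a skewed distribution the appearance of a bell curve, consider this minimal sketch. It uses a lognormal distribution as a stand-in for income data (an assumption for illustration only, not Gapminder's actual data): the same values look strongly right-skewed on a linear axis and reassuringly bell-shaped on a logarithmic axis.

```python
# A minimal sketch: a right-skewed (lognormal) distribution standing in for
# per capita income -- synthetic data for illustration, not Gapminder's data.
# On a linear axis the long right tail is obvious; on a logarithmic axis the
# same data looks bell-shaped.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
income = rng.lognormal(mean=2.5, sigma=1.0, size=100_000)  # "dollars per day"

fig, (linear_ax, log_ax) = plt.subplots(1, 2, figsize=(10, 4))
linear_ax.hist(income, bins=100)
linear_ax.set_title("Linear income axis: skewed, long right tail")

log_bins = np.logspace(np.log10(income.min()), np.log10(income.max()), 100)
log_ax.hist(income, bins=log_bins)
log_ax.set_xscale("log")
log_ax.set_title("Logarithmic income axis: looks bell-shaped")
plt.show()
```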

After examining Pinker's graph closely, you might be tempted to argue that its logarithmic scale provides the advantage of showing a clearer picture of how income is distributed in the tiny $0 to $2.00 range. This, however, is not its purpose. Even if this level of detail were relevant, the information that appears in this range isn't real. The source data on which this graph is based is not precise enough to represent how income is distributed between $0 and $2.00. If reliable data existed and we really did need to clearly show how income is distributed from $0 to $2.00, we would create a separate graph to feature that range only, and that graph would use a linear scale.

Why didn't Pinker use a linear scale? Perhaps it is because a linear scale would reveal a dark side that somewhat undermines the message of his book that the world is getting better. Although income has increased overall, the distribution of income has become less equitable, and this pattern persists today.

When I noticed that Pinker derived the graph from Gapminder and attributed it to Ola Rosling, I decided to see if Pinker introduced the logarithmic scale or inherited it in that form from Gapminder. Upon checking, I found that Gapminder’s graphs of wealth distribution indeed feature logarithmic scales. If you go to the part of Gapminder’s website that allows you to use their data visualization tools, you’ll find that you can only view the distribution of wealth logarithmically. Even though some of Gapminder’s graphs provide the option of switching between linear and logarithmic scales, those that display distributions of wealth do not. Here’s the default wealth-related graph that can be viewed using Gapminder’s tool:

This provides a cozy sense of bell-shaped equity, which isn’t truthful.

To present data clearly and truthfully, we must understand what works for the human brain and design our displays accordingly. People don’t think in logarithms. For this reason, it is usually best to avoid logarithmic scales, especially when presenting data to the general public. Surely Pinker and Rosling know this.

Let me depart from logarithms to reveal another problem with these graphs. There is no practical explanation for the smooth curves that they exhibit if they're based on actual income data. The only time we see smooth distribution curves like this is when they result from mathematical calculations, never when they're based on actual data. Looking at the graph above, you might speculate that when distribution data from each country was aggregated to represent the world as a whole, the aggregation somehow smoothed the data. Perhaps that's possible, but that isn't what happened here. If you look closely at the graph above, you'll see that, in addition to the curves at the top of each of the four colored sections (one for each world region), there are many light lines within each colored section. Each of these light lines represents a particular country's distribution data. With this in mind, look at any one of those light lines. Every single line is smooth beyond the practical possibility of being based on actual income data. Some jaggedness along the lines would always exist. This tells us that these graphs are not displaying unaltered income data for any of the countries. What we're seeing has been manipulated in some manner. The presence of such manipulation always makes me wary. The data may be a far cry from the actual distribution of wealth in most countries.
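For illustration, here's a minimal sketch of one plausible smoothing mechanism, kernel density estimation. This is an assumption on my part; Gapminder doesn't say how its curves were produced. The point is simply that a calculation of this sort yields perfectly smooth curves from jagged raw data.

```python
# A minimal sketch of one plausible smoothing mechanism (an assumption --
# the source does not say how its curves were produced): a kernel density
# estimate turns a jagged sample into a perfectly smooth curve.
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=2.5, sigma=1.0, size=500)  # small, jagged synthetic sample

xs = np.linspace(sample.min(), sample.max(), 400)
smooth = gaussian_kde(sample)(xs)  # evaluate the smoothed density estimate

fig, ax = plt.subplots(figsize=(7, 4))
ax.hist(sample, bins=40, density=True, alpha=0.4, label="raw sample (jagged)")
ax.plot(xs, smooth, label="kernel density estimate (smooth)")
ax.legend()
plt.show()
```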

My wariness is magnified when I examine wealth data of this type from long ago. Here’s Gapminder’s income distribution graph for the year 1800:

To Gapminder’s credit, they provide a link above the graph labeled “Data Doubts,” which leads to the following disclaimer:

Income data has large uncertainty!

There are many different ways to estimate and compare income. Different methods are used in different countries and years. Unfortunately no data source exists that would enable comparisons across all countries, not even for one single year. Gapminder has managed to adjust the picture for some differences in the data, but there are still large issues in comparing individual countries. The precise shape of a country should be taken with a large grain of salt.

I would add to this disclaimer that “The precise shape of the world as a whole should be taken with an even larger grain of salt.” This data is not reliable. If the data isn’t reliable today, data for the year 1800 is utterly unreliable. As a man of science, Pinker should have made this disclaimer in his book. The claim that 85.9% of the world’s population lived in extreme poverty in 1800 compared to only 11.4% today makes a good story of human progress, but it isn’t a reliable claim. Besides, it’s hard to reconcile my reading of history with the notion that, in 1800, all but 14% of humans were just barely surviving from one day to the next. People certainly didn’t live as long back then, but I doubt that the average person was living well below the threshold of extreme poverty as this graph suggests.

I’ve grown concerned that the recent emphasis on data storytelling has led to a reduction in clear and accurate truth telling. When I was young, to say that someone “told stories” meant that they made stuff up. This negative connotation of storytelling describes a great deal of data storytelling today. Encouraging people to develop skills in data sensemaking and communication should focus their efforts on learning how to discover, understand, and tell the truth. This is seldom how instruction in data storytelling goes. The emphasis is more often on persuasion than truth, more on art (and artifice) than science.

Randomness Is Often Not Random

March 12th, 2018

In statistics, what we often identify as randomness in data is not actually random. Bear in mind, I am not talking about randomly generated numbers or random samples. Instead, I am referring to events about which data has been recorded. We learn of these events when we examine the data. We refer to an event as random when it is not associated with a discernible pattern or cause. Random events, however, almost always have causes. We just don’t know them. Ignorance of cause is not the absence of cause.

Randomness is sometimes used as an excuse for preventable errors. I was poignantly reminded of this a decade or so ago when I became the victim of a so-called random event that occurred while I was undergoing one of the most despised medical procedures known to humankind: a colonoscopy. I was in my early fifties at the time, and it was my first encounter with this dreaded procedure. After this initial encounter, which I'll now describe, I hoped that it would be my last.

While the doctor was removing one of five polyps that he discovered during his spelunking adventure into my dark recesses, he inadvertently punctured my colon. Apparently, however, he didn't know it at the time, so he sent me home with the encouraging news that I was polyp free. Having the contents of one's colon leak out into other parts of the body isn't healthy. During the next few days, severe abdominal pain developed, and I began to suspect that my 5-star rating was not deserved. Once I was admitted to the emergency room at the same facility where my illness was created, a scan revealed the truth of the colonoscopic transgression. Thus began my one and only overnight stay so far in a hospital.

After sharing a room with a fellow who was drunk out of his mind and wildly expressive, I hope never to repeat the experience. Things were touch and go for a few days as the medical staff pumped me full of antibiotics and hoped that the puncture would seal itself without surgical intervention. Had it not sealed itself, the alternative would have involved removing a section of my colon and being fitted with a stylish bag for collecting solid waste. To make things more frightening than they needed to be, the doctor who provided this prognosis failed to mention that the bag would be temporary, lasting only about two months while my body rid itself of infection, followed by another surgery to reconnect my plumbing.

In addition to a visit from the doctor whose communication skills and empathy were sorely lacking, I was also visited during my stay by a hospital administrator. She politely explained that punctures during a routine colonoscopy are random events that occur a tiny fraction of the time. According to her, these events should not be confused with medical error, for they are random in nature, without cause, and therefore without fault. Lying there in pain, I remember thinking, but not expressing, "Bullshit!" Despite the administrator's assertion of randomness, the source of my illness was not a mystery. It was that pointy little device that the doctor snaked up through my plumbing for the purpose of trimming polyps. Departing from its assigned purpose, the trimmer inadvertently forged a path through the wall of my colon. This event definitely had a cause.

Random events are typically rare, but the cause of something rare is not necessarily unknown and certainly not unknowable. The source of the problem in this case was known, but what was not known was the specific action that initiated the puncture. Several possibilities existed. Perhaps the doctor involuntarily flinched in response to an itch. Perhaps he was momentarily distracted by the charms of his medical assistant. Perhaps his snipper tool got snagged on something and then jerked to life when the obstruction was freed. Perhaps the image conveyed from the scope to the computer screen lost resolution for a moment while the computer processed the latest Windows update. In truth, the doctor might have known why the puncture happened, but if he did, he wasn’t sharing. Regardless, when we have reliable knowledge of several potential causes, we should not ignore an event just because we can’t narrow it down to the specific culprit.

The hospital administrator engaged in another bit of creative wordplay during her brief intervention. Apparently, according to the hospital, and perhaps to medical practice in general, something that happens this rarely doesn’t actually qualify as an error. Rare events, however harmful, are designated as unpreventable and therefore, for that reason, are not errors after all. This is a self-serving bit of semantic nonsense. Whether or not rare errors can be easily prevented, they remain errors.

We shouldn't use randomness as an excuse for ongoing ignorance and negligence. While it makes no sense to assign blame without first understanding the causes of undesirable events, it also makes no sense to dismiss them as inconsequential and necessarily beyond the realm of understanding. Think of random events as invitations to deepen our understanding. We needn't necessarily make them a priority for responsive action, for other problems that are already understood might deserve our attention more, but we shouldn't dismiss them either. Randomness should usually be treated as a temporary label.

Big Data, Big Dupe: A Progress Report

February 23rd, 2018

My new book, Big Data, Big Dupe, was published early this month. Since its publication, several readers have expressed their gratitude in emails. As you can imagine, this is both heartwarming and affirming. Big Data, Big Dupe confirms what these seasoned data professionals recognized long ago on their own, and in some cases have been arguing for years. Here are a few excerpts from emails that I’ve received:

I hope your book is wildly successful in a hurry, does its job, and then sinks into obscurity along with its topic.  We can only hope! 

I hope this short book makes it into the hands of decision-makers everywhere just in time for their budget meetings… I can’t imagine the waste of time and money that this buzz word has cost over the past decade.

Like yourself I have been doing business intelligence, data science, data warehousing, etc., for 21 years this year and have never seen such a wool over the eyes sham as Big Data…The more we can do to destroy the ruse, the better!

I’m reading Big Data, Big Dupe and nodding my head through most of it. There is no lack of snake oil in the IT industry.

Having been in the BI world for the past 20 years…I lead a small (6 to 10) cross-functional/cross-team collaboration group with like-minded folks from across the organization. We often gather to pontificate, share, and collaborate on what we are actively working on with data in our various business units, among other topics.  Lately we’ve been discussing the Big Data, Big Dupe ideas and how within [our organization] it has become so true. At times we are like ‘been saying this for years!’…

I believe deeply in the arguments you put forward in support of the scientific method, data sensemaking, and the right things to do despite their lack of sexiness.

As the title suggests, I argue in the book that Big Data is a marketing ruse. It is a term in search of meaning. Big Data is not a specific type of data. It is not a specific volume of data. (If you believe otherwise, please identify the agreed-upon threshold in volume that must be surpassed for data to become Big Data.) It is not a specific method or technique for processing data. It is not a specific technology for making sense of data. If it is none of these, what is it?

The answer, I believe, is that Big Data is an irredeemably ill-defined and therefore meaningless term that has been used to fuel a marketing campaign that began about ten years ago to sell data technologies and services. Existing data products and services at the time were losing their luster in public consciousness, so a new campaign emerged to rejuvenate sales without making substantive changes to those products and services. This campaign has promoted a great deal of nonsense and downright bad practices.

Big Data cannot be redeemed by pointing to an example of something useful that someone has done with data and exclaiming "Three cheers for Big Data," for that useful thing would still have been done had the term Big Data never been coined. Much of the disinformation that's associated with Big Data is propagated by good people with good intentions who prolong its nonsense by erroneously attributing beneficial but unrelated uses of data to it. When they equate Big Data with something useful, they make a semantic connection that lacks a connection to anything real. That semantic connection is no more credible than attributing a beneficial use of data to astrology. People do useful things with data all the time. How we interact with and make use of data has been gradually evolving for many years. Nothing that is qualitatively different about data or its use emerged roughly ten years ago to correspond with the emergence of the term Big Data.

Although there is no consensus about the meaning of Big Data, one thing is certain: the term is responsible for a great deal of confusion and waste.

I read an article yesterday titled “Big Data – Useful Tool or Fetish?” that exposes some failures of Big Data. For example, it cites the failed $200,000,000 Big Data initiative of the Obama administration. You might think that I would applaud this article, but I don’t. I certainly appreciate the fact that it recognizes failures associated with Big Data, but its argument is logically flawed. Big Data is a meaningless term. As such, Big Data can neither fail nor succeed. By pointing out the failures of Big Data, this article endorses its existence, and in so doing perpetuates the ruse.

The article correctly assigns blame to the "fetishization of data" that is promoted by the Big Data marketing campaign. While Big Data now languishes with an "increasingly negative perception," skilled professionals and useful technologies continue to grow in number and to make good use of data, as they always have.


Take care,

P.S. On March 6th, Stacey Barr interviewed me about Big Data, Big Dupe. You can find an audio recording of the interview on Stacey’s website.

Different Tools for Different Tasks

February 19th, 2018

I am often asked a version of the following question: “What data visualization product do you recommend?” My response is always the same: “That depends on what you do with data.” Tools differ significantly in their intentions, strengths, and weaknesses. No one tool does everything well. Truth be told, most tools do relatively little well.

I’m always taken by surprise when the folks who ask me for a recommendation fail to understand that I can’t recommend a tool without first understanding what they do with data. A fellow emailed this week to request a tool recommendation, and when I asked him to describe what he does with data, he responded by describing the general nature of the data that he works with (medical device quality data) and the amount of data that he typically accesses (“around 10k entries…across multiple product lines”). He didn’t actually answer my question, did he? I think this was, in part, because he and many others like him don’t think of what they do with data as consisting of different types of tasks. This is a fundamental oversight.

The nature of your data (marketing, sales, healthcare, education, etc.) has little bearing on the tool that’s needed. Even the quantity of data has relatively little effect on my tool recommendations unless you’re dealing with excessively large data sets. What you do with the data—the tasks that you perform and the purposes for which you perform them—is what matters most.

Your work might involve tasks that are somewhat unique to you, which should be taken into account when selecting a tool, but you also perform general categories of tasks that should be considered. Here are a few of those general categories:

  • Exploratory data analysis (Exploring data in a free-form manner, getting to know it in general, from multiple perspectives, and asking many questions to understand it)
  • Rapid performance monitoring (Maintaining awareness of what’s currently going on as reflected in a specific set of data to fulfill a particular role)
  • A routine set of specific analytical tasks (Analyzing the data in the same specific ways again and again)
  • Production report development (Preparing reports that will be used by others to look up data that's needed to do their jobs)
  • Dashboard development (Developing displays that others can use to rapidly monitor performance)
  • Presentation preparation (Preparing displays of data that will be presented in meetings or in custom reports)
  • Customized analytical application development (Developing applications that others will use to analyze data in the same specific ways again and again)

Tools that do a good job of supporting exploratory data analysis usually do a poor job of supporting the development of production reports and dashboards, which require fine control over the positioning and sizing of objects. Tools that provide the most flexibility and control often do so by using a programming interface, which cannot support the fluid interaction with data that is required for exploratory data analysis. Every tool specializes in what it can do well, assuming it can do anything well.

In addition to the types of tasks that we perform, we must also consider the level of sophistication to which we perform them. If you engage in exploratory data analysis, for example, the tool that I recommend would vary significantly depending on the depth of your data analysis skills. I wouldn't recommend a complex statistical analysis product such as SAS JMP if you're untrained in statistics, just as I wouldn't recommend a general-purpose tool such as Tableau Software if you're well trained in statistics, except for performing statistically lightweight tasks.

Apart from the tasks that we perform and the level of skill with which we perform them, we must also consider the size of our wallet. Some products require a significant investment to get started, while others can be purchased for an individual user at little cost or even downloaded for free.

So, what tool do I recommend? It depends. Finding the right tool begins with a clear understanding of what you need to do with data and with your ability to do it.

Take care,