By Mara Sedlins
Mara Sedlins, Ph.D., is a data management specialist at the CSU Libraries.
Information literacy has long been a focus of education in libraries. But as data increasingly saturates our world, data literacy has become a crucial skill as well.
Data literacy includes the ability to think critically about statistics and data visualizations, to understand both the power and the limitations of data, and to advocate for ethical data use.
What are some key data pitfalls everyone should watch out for?
Never trust summary statistics alone
The most common way to summarize a dataset is to report single-number statistics, such as the average and the standard deviation (a measure of how much the data varies).
But it’s always important to visualize the data too. Doing so is essential both for identifying possible errors, like the inclusion of a 200-year-old person in the dataset, and for understanding patterns in your data.
This can be illustrated by the “datasaurus” example, a fun variation on Anscombe’s quartet created by data visualization guru Alberto Cairo in 2016.
Each of the datasets shown on the right has the exact same average and standard deviation. But once the values are plotted on a graph, each suggests a different interpretation of the same results.
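To see the same effect with numbers you can run yourself, here is a minimal sketch using the y-values of Anscombe’s quartet, the classic example that the datasaurus builds on (not the datasaurus data itself). The names `quartet` and `summarize` are chosen for this illustration only.

```python
from statistics import mean, stdev

# y-values of Anscombe's quartet (four small datasets, 11 points each).
quartet = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV":  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

def summarize(values):
    """Return (average, sample standard deviation), rounded to 2 decimals."""
    return round(mean(values), 2), round(stdev(values), 2)

summaries = {name: summarize(ys) for name, ys in quartet.items()}

for name, (avg, sd) in summaries.items():
    ys = quartet[name]
    # The summary statistics match, but the smallest and largest values do not.
    print(f"Dataset {name}: average={avg:.2f}, sd={sd:.2f}, "
          f"min={min(ys)}, max={max(ys)}")
```

All four datasets report the same average and standard deviation to two decimal places, yet their smallest and largest values already hint that the underlying shapes differ. A plot makes those differences obvious.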
Take graphs with a grain of salt
While data visualizations are important for understanding a dataset, they can also be misleading if used incorrectly.
For example, the choice of axis can dramatically affect the message conveyed by a graph. Check out this graph:
This graph suggests that Libraries employees post a lot more pet photos on Monday compared to Tuesday or Wednesday. But in that graph, the x-axis starts at 20. Let’s look at a slightly different graph now.
In contrast, this graph shows the same data but with an x-axis that starts at zero. Suddenly, the differences between days of the week appear much less pronounced.
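The arithmetic behind this effect is simple enough to sketch. The counts below are hypothetical, made up for this illustration, and `drawn_ratio` is a helper name invented here. It compares how long two bars are drawn relative to each other when the axis starts at a given baseline.

```python
def drawn_ratio(a, b, baseline=0):
    """Ratio of the drawn bar lengths for values a and b
    when the value axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

# Hypothetical pet-photo counts, not taken from the article's graph.
monday, tuesday = 30, 24

true_ratio = drawn_ratio(monday, tuesday)                   # axis starts at 0
truncated_ratio = drawn_ratio(monday, tuesday, baseline=20)  # axis starts at 20

print(f"Axis from 0:  Monday's bar is {true_ratio:.2f}x Tuesday's")
print(f"Axis from 20: Monday's bar is {truncated_ratio:.2f}x Tuesday's")
```

With these made-up numbers, Monday's count is only 25 percent higher than Tuesday's, but on an axis that starts at 20 its bar is drawn two and a half times as long.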
Data may also be incomplete, or cherry-picked to support a particular message.
A graph of monthly temperatures in Colorado for July through December might make it seem as though we are destined for an icy future, but this ignores the rising temperatures seen in the first part of the year.
Similarly, a company might show financial data only for the years in which its profits increased in order to paint a rosy picture. It’s always important to consider what it is you aren’t seeing in a data visualization.
Correlation ≠ causation
It can be easy to fall into the trap of thinking that one thing causes another just because they are correlated.
For example, as ice cream sales increase, the number of drownings also increases, which means you might be tempted to forgo a frozen treat to reduce your risk of drowning.
However, there’s a third factor that causes both – warm weather.
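A small simulation can make the role of a third factor concrete. This is a toy sketch under invented assumptions: the temperature range, the slopes and the noise levels are all made up, and the units are arbitrary. In it, temperature drives both ice cream sales and drownings, while neither has any effect on the other.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Invented model: both quantities depend on temperature plus random noise.
temps = [rng.uniform(0, 35) for _ in range(5000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temps]
drownings = [0.1 * t + rng.gauss(0, 0.5) for t in temps]

# Across all days, the two look strongly related.
overall = pearson(ice_cream, drownings)

# Hold the third factor roughly constant: only days between 20 and 22 degrees.
band = [(i, d) for t, i, d in zip(temps, ice_cream, drownings) if 20 <= t <= 22]
within_band = pearson([i for i, _ in band], [d for _, d in band])

print(f"All days:             r = {overall:.2f}")
print(f"Similar-weather days: r = {within_band:.2f}")
```

Across all days the correlation is strong, but among days with similar weather it nearly vanishes, which is what you would expect if the weather, not the ice cream, is doing the work.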
When a correlation seems more plausible on its face, it can be even more tempting to assume causation. A 1999 study published in Nature reported that children who slept with the light on were more likely to develop myopia (nearsightedness) later in life.
But it turned out that there was also a strong link between parents’ myopia and their children’s development of myopia. Myopic parents were actually more likely to leave the light on in their children’s room, so inherited nearsightedness, rather than the night light, may explain the pattern.
Correlations can also occur by chance, if you go looking for them. Did you know that the divorce rate in Maine is associated with per capita consumption of margarine?
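Chance correlations are easy to manufacture. As a rough sketch, the code below generates one random ten-point series and compares it against 1,000 other random series that have nothing to do with it. All of the numbers are random noise invented for this illustration; none are real divorce or margarine figures.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(0)  # fixed seed so the run is repeatable
years = 10              # short series, like ten yearly measurements

# One random "target" series and many unrelated random "candidate" series.
target = [rng.gauss(0, 1) for _ in range(years)]
candidates = [[rng.gauss(0, 1) for _ in range(years)] for _ in range(1000)]

# Go looking for the best match among the unrelated series.
strongest = max(abs(pearson(target, c)) for c in candidates)

print(f"Strongest correlation found by chance: |r| = {strongest:.2f}")
```

With only ten data points and a thousand candidates to choose from, at least one unrelated series will almost always line up impressively well. The more comparisons you make, the more such coincidences you should expect to find.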
In the end, if you watch out for these common data pitfalls, you’ll be much less likely to be duped by data.