Mr. Clean, but with Datasets

Stephanie Cabral
Jun 7, 2021
Updated: Feb 19, 2023

In the world of data gathering, nothing is more satisfying to a data collector than receiving a big file of raw data. The possibilities are endless when you have a good data set. However, data rarely arrives ready to use as is. It often needs cleaning up, which can include anything from removing duplicates to standardizing text to normalizing values for statistical use.
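To make those three cleanup steps concrete, here is a minimal sketch in plain Python. The field names (`name`, `value`) and the sample rows are made up for illustration and do not come from any of the datasets below; min-max scaling is just one of several ways to normalize.

```python
def clean_rows(rows):
    """Standardize text, drop duplicates, and min-max normalize 'value'."""
    seen = set()
    cleaned = []
    for row in rows:
        # Standardize text: collapse stray whitespace, use consistent capitalization.
        name = " ".join(row["name"].split()).title()
        value = float(row["value"])
        key = (name, value)
        if key in seen:  # Skip rows that are duplicates once the text is standardized.
            continue
        seen.add(key)
        cleaned.append({"name": name, "value": value})

    if not cleaned:
        return cleaned

    # Min-max normalization rescales every value into the 0-1 range.
    values = [r["value"] for r in cleaned]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # Avoid dividing by zero when all values are equal.
    for r in cleaned:
        r["normalized"] = (r["value"] - low) / span
    return cleaned


# Hypothetical sample rows, for illustration only.
sample = [
    {"name": "  new york ", "value": "10"},
    {"name": "New York", "value": "10"},
    {"name": "vermont", "value": "30"},
    {"name": "OHIO", "value": "20"},
]
result = clean_rows(sample)
```

Standardizing the text before checking for duplicates matters here: the two "New York" rows only match once the whitespace and capitalization are made consistent.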

A chart, graph, or other representation is only as good as its data. The data is the foundation of the whole process. Bad data can lead to inaccuracies that will cost you credibility and trust. In this day and age, with fake news running rampant across the internet, it is essential to create content that people can trust.

I found some public data sets, took a stab at cleaning them up, and brainstormed some possible questions and visualizations for each.

COVID Vaccination Data by State for Sunday, June 6, 2021

https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-Jurisdi/unsk-b7fc

This dataset from the CDC lays out the number of COVID vaccine doses distributed vs. administered as of today, June 6, 2021, by state (and DC). It’s broken down by vaccine manufacturer (Janssen, Moderna, and Pfizer) and by age group (12+, 18+, and 65+).

Some interesting questions could come out of this data, such as:

What percentage of vaccines went unused in each state, and overall?
Which vaccine manufacturer had the highest administered doses in each state?
Which regions of the country have a high unused vaccine rate? (New England, Midwest, etc.)

Because there are so many states to compare, I think a scatter plot would work well for the first question. The second question would look visually appealing with a stacked bar graph, where the different sections of the bar represent each manufacturer. For the third, I think a bubble-size scatter plot would tie the data of regions together nicely.

Charts with many elements like these have to be clear and concise in order to be trustworthy. Every bar and dot will need to be appropriately labeled in logical places so the reader can spend time processing the data instead of searching the image for clues as to what they’re looking at.
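For the first question above, the arithmetic behind "percentage unused" could look something like the sketch below. The field names (`state`, `distributed`, `administered`) and the numbers are placeholders I made up, not the CDC file's actual columns or values, and "unused" here only means distributed but not yet administered.

```python
def unused_pct(distributed, administered):
    """Percentage of distributed doses that have not been administered."""
    if distributed == 0:
        return 0.0
    return 100 * (distributed - administered) / distributed


# Hypothetical rows, for illustration only.
rows = [
    {"state": "State A", "distributed": 1000, "administered": 800},
    {"state": "State B", "distributed": 500, "administered": 450},
]

# Per-state rate.
by_state = {
    r["state"]: unused_pct(r["distributed"], r["administered"]) for r in rows
}

# Overall rate: sum the raw counts first, then take the percentage.
total_distributed = sum(r["distributed"] for r in rows)
total_administered = sum(r["administered"] for r in rows)
overall = unused_pct(total_distributed, total_administered)
```

The overall figure is computed from the summed counts rather than by averaging the per-state percentages, so larger states carry proportionally more weight.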

BFI Weekend Box Office, May 28 – 30, 2021

One of the coolest datasets I came across was this one from the British Film Institute (BFI). Each weekend (Friday-Sunday) BFI releases a dataset of the movies that had the highest gross, along with UK films and other new releases. It even has the country of origin and the number of theaters that showed the film.

Some potential questions could come out of this data:

Which distributor made the most money?
How does the number of weeks since release relate to its ranking?
How does the distributor relate to its rankings?

Some common charts, used from a different perspective, would be intriguing here. The first question could be a bar graph with a simple design and layout. A scatter plot and a line chart could work for the other two questions. Weeks (x-axis) vs. ranking (y-axis) would illustrate groupings in the data that would support a possible hypothesis (films that have just been released hold higher rankings than those that have been out longer). For distributor vs. ranking, a line chart with the distributor on the y-axis and ranking on the x-axis would show the change as ranking decreased.

These comparisons can be tricky and lead to a slightly more complex visual. Besides having the usual clear design and labeled information, I think extra annotation here, like arrows showing increase vs. decrease, would help the reader focus their attention more easily.
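Before drawing the bar graph for the distributor question, the grosses need to be totaled per distributor. A rough sketch of that step, with hypothetical field names (`distributor`, `weekend_gross`) and invented film rows rather than the real BFI figures:

```python
from collections import defaultdict


def gross_by_distributor(films):
    """Sum weekend gross per distributor, highest total first."""
    totals = defaultdict(int)
    for film in films:
        totals[film["distributor"]] += film["weekend_gross"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows, for illustration only.
films = [
    {"title": "Film A", "distributor": "Studio X", "weekend_gross": 300},
    {"title": "Film B", "distributor": "Studio Y", "weekend_gross": 500},
    {"title": "Film C", "distributor": "Studio X", "weekend_gross": 400},
]
ranking = gross_by_distributor(films)
```

Note that this only answers "which distributor made the most money" for the one weekend the file covers. A distributor with several mid-ranked films can still come out ahead of one with the single top film, as in the made-up rows above.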

Eurovision 2021 Finalist Songs

I put this set together using the results from this year’s Eurovision Song Contest and by looking up the number of plays each song had on Spotify through the Eurovision 2021 playlist. I was hoping Spotify would have analytics on the number of times I have played each song (because I’ve listened to Shum an unhealthy number of times, including as I type this), but couldn’t find such a feature. It might be for the better…

A couple of questions that would be fascinating to explore:

How does the language of the song affect its ranking?
Do groups or solo artists rank better?
How would the rankings change if it went by play count?

For the language question, I can picture utilizing a line chart of sorts, with the rankings on the x-axis and the languages on the y-axis. After plotting all of the points, connect them to see the trend.

Groups vs. solo artists can be shown in a bar graph with two clusters, one for solo artists and one for groups, where each bar within a cluster represents a country.

The third could also be a bar graph with two bars for each song: the first representing the actual ranking and the second being the ranking by Spotify plays. I would also include the difference in ranking, annotated in the space between the two bars, so the reader doesn’t have to spend time calculating each one.

As with the other sets, I would make this easy on the eye, with clear labels and colors. Because these charts include a little more complexity, labeling differences (as discussed above) will go a long way to convince readers that your information is accurate.
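For the play-count question above, the re-ranking and the per-song difference could be worked out along these lines. The field names (`song`, `rank`, `plays`) and the sample values are invented for the sketch and are not the actual contest results or Spotify counts.

```python
def rank_shift(songs):
    """Compare each song's contest rank with its rank by play count."""
    # Rank 1 goes to the most-played song; ties keep their input order.
    by_plays = sorted(songs, key=lambda s: s["plays"], reverse=True)
    play_rank = {s["song"]: i for i, s in enumerate(by_plays, start=1)}

    shifts = []
    for s in songs:
        shifts.append({
            "song": s["song"],
            "contest_rank": s["rank"],
            "play_rank": play_rank[s["song"]],
            # Positive shift: the song places better by plays than in the contest.
            "shift": s["rank"] - play_rank[s["song"]],
        })
    return shifts


# Hypothetical rows, for illustration only.
songs = [
    {"song": "Song A", "rank": 1, "plays": 50},
    {"song": "Song B", "rank": 2, "plays": 90},
    {"song": "Song C", "rank": 3, "plays": 70},
]
shifts = rank_shift(songs)
```

The `shift` value is the number that would be annotated between the two bars for each song, so the reader never has to do the subtraction.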

References:

https://www.weforum.org/agenda/2019/03/fake-news-what-it-is-and-how-to-spot-it

https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
