by Shawn Syms
When I was a young activist, I read a critical feminist sexuality text called Pleasure and Danger, by American anthropologist Carole S. Vance. I was always struck by its effective title and its succinct way of encapsulating the issues at stake. Now as a burgeoning data scientist faced with some of the rigors of my trade, such as data cleaning, the phrase comes in handy.
Well, yes. I know that nearly 60 percent of data scientists surveyed described data cleaning as their most hated task, but I actually enjoy it. Really? Sure. As an introvert, I gain comfort from order and predictability, and as a published author and longtime journalist, I know the satisfaction of putting things in their proper order for maximum effectiveness. But I also know there is such a thing as too much of a good thing.
In my earliest moments in data science, I was drawn to data cleaning. What metaphor can I use for it? Well, it's sort of like taking a shower. Except it's not you, it's the data. Well, I mean, yes — I do shower before cleaning data. Daily, I swear. OK, scratch that. Cleaning data is like being a barber. Or a hairstylist, whatever you prefer. You take something that is in a raw and untamed state and make it pleasant, likeable and predictable, like the haircut of some person your parents would approve of, with whom you might feel comfortable going on a dinner date. That's all very well and good, but if you have ever had a bad haircut, you know how things can go wrong. If cleaning data is like being a potter, what if you're handed a lump of clay that roughly looks like an obelisk, but by the time you are done with it, it looks more like an Easter basket? I'll stop there. You're welcome. I think you get the point!
This post has two purposes. First, to look at some means of cleaning data and second, to discuss some of the risks and perils of doing so. The ideas in this post borrow from teachings by Matt Brems and Musfiqur Rahman at General Assembly.
Several basic data-cleaning techniques are well-known and perhaps even widespread. Delete observations with missing data. Delete features with missing data! A dataset is mostly complete, but a bunch of info is missing from features related to some respondents' body weight or annual income? Well, let's pound through those puppies and average the heck out of them by replacing the missing data with mean values!
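Each of those blunt instruments takes only a line or two of pandas. Here is a minimal sketch on a toy frame (the column names and values are made up for illustration, not taken from any real survey):

```python
import numpy as np
import pandas as pd

# A toy survey with missing body-weight and income values
df = pd.DataFrame({
    'respondent': [1, 2, 3, 4],
    'weight_kg':  [70.0, np.nan, 82.0, 64.0],
    'income':     [40000.0, 55000.0, np.nan, np.nan],
})

# Option 1: delete observations (rows) with any missing data
rows_dropped = df.dropna()

# Option 2: delete features (columns) with any missing data
cols_dropped = df.dropna(axis=1)

# Option 3: "average the heck out of them" -- mean imputation
mean_imputed = df.fillna(df.mean(numeric_only=True))
```

Easy, repeatable, automatable — and, as we'll see, not always wise.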
Sounds easy? Repeatable? Something you can write a function to automate? Sure you could. You could also put red hot chili peppers in ice cream. Just because you can do something doesn't mean that you should.
Yes, sometimes we may make these well-known compromises. But we need to remember something. Changing or deleting data is, de facto, a distortion. Surveys can suffer from distortion as it is if, for example, they ask questions people might be uncomfortable answering, or ask them in a context where respondents don't feel safe and secure doing so. As data scientists, we don't need to make it worse. Bad data means bad analysis, which can lead to bad decisions that affect actual people. We are talking about actual people, not just linear regressions and machine learning models. So let's take care to consider the problem we are trying to solve, apply domain knowledge to inform our decisions when we have it, and seek out that knowledge when we don't.
In the example below, I will explore a dataset and, in the process, uncover a few data cleaning challenges that don't always have an easy answer. This case study is based upon the Canadian Community Health Survey, 2012: Mental Health Component, which is available through ODESI (the Ontario Data Documentation, Extraction Service and Infrastructure), a project of the Ontario Council of University Libraries. (The original dataset was pristine, but it has been mucked up strictly for illustration purposes.)
Let's do some imports to start.
# The usual suspects
import numpy as np
import pandas as pd
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Presentation
from IPython.display import HTML, Markdown, display
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 600)
Next, let's read in the dataset using the pandas read_csv function.
mental_health = pd.read_csv('../data/cchs-82M0013-E-2012-mental-health_F1.csv')
Let's have a look at the shape of things!
mental_health.shape
So, 25,113 observations and 586 features. Sounds about right, since this is a dataset generated from a very elaborate questionnaire on mental health in Canada that asks many questions and also has some calculated features. The survey asked about mental health and wellness as well as associated diagnoses. In addition, it asks a wide range of questions about such matters as workplace stress, suicidal thoughts, physical health and exposure to certain stressors in childhood.
It's a rich dataset from a data-scientific perspective. Subsets of features could be used for binary classification, looking to see if particular medical or mental health conditions can be predicted based on other personal data. There are some opportunities for regression as well, such as predicting income based on health or other factors. This dataset is also crying out for dimensionality reduction. As a rule of thumb, the number of features should be less than the square root of the number of observations — but the square root of 25,113 is only about 158, and we have 586 features!
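To make that rule of thumb concrete, and to sketch what one remedy could look like, here is a minimal PCA via numpy's SVD. The data below is a synthetic stand-in, not the actual survey:

```python
import numpy as np

n_observations, n_features = 25113, 586

# Rule of thumb: the feature count should stay below sqrt(observations)
max_features = int(np.sqrt(n_observations))   # about 158, far below our 586

# A minimal PCA sketch via SVD, on toy data rather than the CCHS survey
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 30))                # 200 rows, 30 toy features
X_centered = X - X.mean(axis=0)               # PCA requires centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:10].T            # keep the first 10 components
```

On the real dataset, the same idea would project the 586 features down to a much smaller number of components — but only after the data is clean.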
But back to the issue at hand. The data needs to be in proper shape first. Let's have a look.
HTML(mental_health.head().to_html())
Those of you with eagle eyes may be able to see problems with the data just by looking at the first five observations.
Issue one: It looks like, in some columns intended to contain numerical values, a lower-case 'l' sits in place of the number one. Maybe 'l' stands for something, though? How would we know? Often we would have no idea. Luckily, however, this dataset has a very robust data dictionary, so we can have a look. Now, look to the left a little bit and a couple more issues arise.
Issue two: In the DHH_SEX column, numerical values have been spelled out. What were they smoking in the data-entry department that day? Just another reminder that everyone makes mistakes — and mistakes compromise data.
Issue three: The DHHGHSZ and DHHGDWE columns have some null values — not good. Let's take a look at these issues one at a time.
Looking at the first and last observations, we see the letter l scattered all over the place. When we consult the data dictionary, we see that all of the columns' intended datatypes are either int64 or float64: numbers, not letters.
So, we can resolve this erroneous data through a method known as deductive imputation. We can reasonably assume that the ls have accidentally been put in place of the 1s: the columns are meant to be numeric, and a lower-case l is an easy typo for a 1. We don't know if someone typed the letter l when they actually meant the number 2; but then again, we would not know if they had mistyped the number 1 either. Our role here, at minimum, is to introduce no new errors. (Even the dataset's exhaustive data dictionary features typos; no one is perfect.) So, logically, it is fine to make this change. We don't know how widespread the error is; let's assume it could be present anywhere in the dataset.
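Before fixing anything, it can help to measure how widespread the stray letter actually is. Here is a sketch of one way to scan every string column for it, using a toy frame (with made-up values) in place of the real survey:

```python
import pandas as pd

# Toy frame imitating the problem: a stray letter 'l' where 1s belong
df = pd.DataFrame({
    'ADM_N09': ['l', '2', 'l', '1'],
    'DHH_SEX': ['one', 'two', 'one', 'one'],
    'GEO_OK':  [1, 2, 3, 4],
})

# Scan every object (string) column for cells equal to 'l'
object_cols = df.select_dtypes(include='object')
l_counts = object_cols.apply(lambda col: (col == 'l').sum())
affected = l_counts[l_counts > 0]   # only columns containing the error
```

Running this against the full dataset would tell us exactly which columns carry the error and how often.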
Let's look at one of the columns.
mental_health['ADM_N09'].value_counts()
Let's use pandas .replace to correct this error anywhere that it appears in the dataset.
number_dict = {'l' : '1'}
mental_health.replace(number_dict, inplace=True)
And confirm the change:
mental_health.head()
We can also use the same technique to address the written-out numbers. First, let's see what values are present.
mental_health['DHH_SEX'].value_counts()
I look in the data dictionary to see this column refers to gender. No trans or intersex people surveyed out of almost 25,000? Or just no option for other genders? Well, the survey is a bit old. Today, one almost never sees a survey with only two options for gender. The words vs numbers problem should be an easy enough fix, though.
number_dict = {'one' : 1, 'two' : 2} # Set up a dictionary to pass to .replace
mental_health.replace({'DHH_SEX' : number_dict}, inplace=True)
mental_health['DHH_SEX'].value_counts() # Confirm the change
Given all that weirdness, let's have a look at the datatypes:
# I'm creating a df just for ease of scrolling
dtypes = pd.DataFrame(mental_health.dtypes)
dtypes
We can see that four features that should have the datatype int64 are set to be strings (no surprise, since they have unexpected string content in them): GEOGCMAl, ADM_N09, ADM_Nl0 and DHH_SEX. Now that the offending values have been expunged, let's fix that!
wrong_dtypes = ['GEOGCMAl', 'ADM_N09', 'ADM_Nl0', 'DHH_SEX']
for col in wrong_dtypes:
    mental_health[col] = pd.to_numeric(mental_health[col])
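One nice property of pd.to_numeric is that, by default (errors='raise'), it throws a ValueError if any stray letters survived the cleanup — so a silent partial fix can't slip through. A small illustration on toy series:

```python
import pandas as pd

# Strings after the 'l' -> '1' fix convert cleanly to integers
s = pd.Series(['1', '2', '1', '3'])
converted = pd.to_numeric(s)

# Any remaining non-numeric value raises instead of passing through silently
try:
    pd.to_numeric(pd.Series(['1', 'l']))
    leftover_caught = False
except ValueError:
    leftover_caught = True
```

If we had wanted to coerce unparseable values to NaN instead, errors='coerce' would do that — but here a loud failure is exactly what we want.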
So, we saw some nulls. What to do with them? First off, how many?
mental_health.isna().sum()
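With 586 columns, the full output of isna().sum() is a lot to scroll through. One way to trim it down to just the columns that actually contain nulls, shown here on a toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame: two columns with missing values, one complete
df = pd.DataFrame({
    'DHHGHSZ': [1, 2, np.nan, 2],
    'DHHGDWE': [np.nan, 3, 3, np.nan],
    'DHH_SEX': [1, 2, 1, 2],
})

null_counts = df.isna().sum()
cols_with_nulls = null_counts[null_counts > 0]   # drop the all-zero rows
```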
We see there are two columns that each contain twenty null values. These represent missing data. If we delete them, we are losing potentially valuable data for our model.
We could impute the values. Looking in the data dictionary, we see that NaN does not represent a refusal to respond. Respondents were interviewed and the interviewer captured their answers. The responses are captured as categorical variables where 8 means "REFUSAL" and 9 means "NOT STATED."
These NaN could mean anything, most likely a data entry error (or rather, a system lacking rules to prevent such an error). Given they are categorical values, it's not as if you could impute the mean anyway. But filling NaNs with mean values is never a good idea regardless. Besides the fact that an imputed mean simply isn't the true value, it distorts the distribution of your feature by inflating the mean bin beyond its actual size; the more fake means, the worse the skew.
You could look at the distribution of the categorical values in the rest of the feature and distribute the NaNs amongst them at a rate equivalent to that distribution, along these lines:
mental_health['DHHGHSZ'].value_counts(normalize=True)
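Here is a sketch of that proportional approach on a toy categorical column — an illustration of the idea, not code from the case study itself:

```python
import numpy as np
import pandas as pd

# Toy categorical column with two missing values
s = pd.Series([1, 1, 1, 2, 2, 3, np.nan, np.nan], dtype='float')

# Observed category proportions among the non-null values
proportions = s.value_counts(normalize=True)

# Draw replacements for the NaNs in proportion to that distribution
rng = np.random.default_rng(0)
mask = s.isna()
fill = rng.choice(proportions.index, size=mask.sum(), p=proportions.values)

s_imputed = s.copy()
s_imputed[mask] = fill
```

This preserves the feature's observed distribution on average, though it still invents answers the respondents never gave — which is why it should be a discussion, not a default.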
Or you could raise the issue for discussion. You may end up just deleting the offending observations. After all, we are only talking about 20 to 40 out of more than 25,000, and it is unlikely they contain anything (either normal values or significant outliers) that won't be present somewhere else in the dataset. But deleting data should never be the default decision!