How to clean your data—and when and why you shouldn't

by Shawn Syms

When I was a young activist, I read a critical feminist sexuality text called Pleasure and Danger, by American anthropologist Carole S. Vance. I was always struck by its effective title and its succinct way of encapsulating the issues at stake. Now as a burgeoning data scientist faced with some of the rigors of my trade, such as data cleaning, the phrase comes in handy.

Pleasures?

Well, yes. I know that nearly 60 percent of data scientists surveyed described data cleaning as their most hated task, but I actually enjoy it. Really? Sure. As an introvert, I gain comfort from order and predictability, and as a published author and longtime journalist, I know the satisfaction of putting things in their proper order for maximum effectiveness. But I also know there is such a thing as too much of a good thing.

In my earliest moments in data science, I was drawn to data cleaning. What metaphor can I use for it? Well, it's sort of like taking a shower. Except it's not you, it's the data. Well, I mean, yes — I do shower before cleaning data. Daily, I swear. OK, scratch that. Cleaning data is like being a barber. Or a hairstylist, whatever you prefer. You take something that is in a raw and untamed state and make it pleasant, likeable and predictable, like the haircut of some person your parents would approve of, with whom you might feel comfortable going on a dinner date. That's all very well and good, but if you have ever had a bad haircut, you know how things can go wrong. If cleaning data is like being a potter, what if you're handed a lump of clay that roughly looks like an obelisk, but by the time you are done with it, it looks more like an Easter basket? I'll stop there. You're welcome. I think you get the point!

This post has two purposes. First, to look at some means of cleaning data and second, to discuss some of the risks and perils of doing so. The ideas in this post borrow from teachings by Matt Brems and Musfiqur Rahman at General Assembly.

Gather, cleanse, model, evaluate, repeat?

Several basic data-cleaning techniques are well-known and perhaps even widespread. Delete observations with missing data. Delete features with missing data! A dataset is mostly complete, but a bunch of info is missing from features related to some respondents' body weight or annual income? Well, let's pound through those puppies and average the heck out of them by replacing the missing data with mean values!
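
For the record, each of those quick fixes is a one-liner in pandas. Here is a minimal sketch on a toy DataFrame (df, weight and income are made up purely for illustration; they are not from the dataset we'll use below):

import numpy as np
import pandas as pd

# A hypothetical toy DataFrame with some missing values
df = pd.DataFrame({'weight': [70.0, np.nan, 82.5, 64.1],
                   'income': [50000, 61000, np.nan, 48000]})

df.dropna()                                # delete observations with missing data
df.dropna(axis=1)                          # delete features with missing data
df['weight'].fillna(df['weight'].mean())   # replace missing weights with the mean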

Sounds easy? Repeatable? Something you can write a function to automate? Sure you could. You could also put red hot chili peppers in ice cream. Just because you can do something doesn't mean that you should.

Yes, sometimes we may make these well-known compromises. But we need to remember something: changing or deleting data is, de facto, a distortion. Surveys can already suffer from distortion if they, for example, ask questions people might be uncomfortable answering, or ask them in a context where respondents don't feel safe and secure doing so. As data scientists, we don't need to make it worse. Bad data means bad analysis, which can lead to bad decisions that affect actual people. We are talking about actual people, not just linear regressions and machine learning models. So let's take care to consider the problem we are trying to solve, apply domain knowledge to inform our decisions when we have it, and seek out that knowledge when we don't.

In the example below, I will explore a dataset and, in the process, uncover a few data-cleaning challenges that don't always have an easy answer. This case study is based upon the Canadian Community Health Survey, 2012: Mental Health Component, which is available through ODESI (the Ontario Data Documentation, Extraction Service and Infrastructure), a project of the Ontario Council of University Librarians. (The original dataset was pristine, but it has been mucked up strictly for illustration purposes.)

Cleaning our mental-health dataset using Python and pandas

Let's do some imports to start.

In [84]:
# The usual suspects
import numpy as np
import pandas as pd

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Presentation
from IPython.display import HTML, Markdown, display
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_rows', 600)

Next, let's read in the dataset using the pandas read_csv function.

In [85]:
mental_health = pd.read_csv('../data/cchs-82M0013-E-2012-mental-health_F1.csv')


Let's have a look at the shape of things!

In [86]:
mental_health.shape
Out[86]:
(25113, 586)

Lots going on in here!

So, 25,113 observations and 586 features. Sounds about right, since this is a dataset generated from a very elaborate questionnaire on mental health in Canada that asks many questions and also includes some calculated features. The survey asks about mental health and wellness, as well as associated diagnoses. In addition, it asks a wide range of questions about such matters as workplace stress, suicidal thoughts, physical health and exposure to certain stressors in childhood.

It's a rich dataset from a data-scientific perspective. Subsets of features could be used for binary classification, looking to see whether particular medical or mental-health conditions can be predicted from other personal data. There are some opportunities for regression as well, such as predicting income based on health or other factors. This dataset is also crying out for dimensionality reduction: as a rule of thumb, the number of features should be less than the square root of the number of observations, but the square root of 25,113 is only about 158, far fewer than our 586 features!
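
To put a number on that rule of thumb (it's a rough heuristic, not a law), a quick check might look like this:

n_obs, n_features = mental_health.shape
print(n_features, np.sqrt(n_obs))   # 586 features vs. a square root of roughly 158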

But back to the issue at hand. The data needs to be in proper shape first. Let's have a look.


In [87]:
HTML(mental_health.head().to_html())
Out[87]:
VERDATE ADM_RNO GEO_PRV GEOGCMAl ADM_N09 ADM_Nl0 ADM_N11 DHHGAGE DHH_SEX DHHGMS DHHGHSZ DHHGDWE ... EDUDH04 EDUDR04 WTS_M
0 20140325 1 47 l l l 6 7 one 5 NaN NaN ... 4 4 409.66
1 20140325 2 48 l l l 6 9 one 4 NaN NaN ... 4 4 421.86
2 20140325 3 24 l 2 2 2 7 two 2 NaN NaN ... 4 4 678.36
3 20140325 4 24 l 2 l 6 14 two 3 NaN NaN ... 4 4 662.54
4 20140325 5 59 l l l 6 13 two 3 NaN NaN ... 4 4 839.14

5 rows × 586 columns


Those of you with eagle eyes may be able to see problems with the data just by looking at the first five observations.

Issue one: In some columns intended to contain numerical values, a lower-case 'l' sits in place of the number one. Maybe 'l' stands for something, though? How would we know? Often we would have no idea. Luckily, however, this dataset has a very robust data dictionary, so we can have a look. Now, scan a little further along those same rows and a couple more issues arise.

Issue two: In the DHH_SEX column, numerical values have been spelled out. What were they smoking in the data-entry department that day? Just another reminder that everyone makes mistakes — and mistakes compromise data.

Issue three: The DHHGHSZ and DHHGDWE columns have some null values — not good. Let's take a look at these issues one at a time.

Erroneous data

Looking at the first and last observations, we see the letter l scattered all over the place. When we consult the data dictionary, we see that all of these columns' intended datatypes are either int64 or float64 — numbers, not letters.

So, we can resolve this erroneous data through a method known as deductive imputation. We know that:

  • the columns are not meant to contain the letter l
  • the columns are meant to contain the number 1

We can assume that the ls have been accidentally put in place of the 1s. We don't know if someone typed in the letter l when they meant the number 2; but then again, we would not know if they had done this with the number 1 either. Our role here, at minimum, is to introduce no new errors. (Even the dataset's exhaustive data dictionary features typos; no one is perfect.)

We know, logically, it is fine to make this change. We don't know how widespread this error is; let's assume it could be present anywhere in the dataset.

Let's look at one of the columns.

In [88]:
mental_health['ADM_N09'].value_counts()
Out[88]:
2    21492
l    3247 
3    323  
9    51   
Name: ADM_N09, dtype: int64


Let's use pandas .replace to correct this error anywhere it appears in the dataset.

In [89]:
number_dict = {'l' : '1'} # Map the stray letter l to the string '1'
mental_health.replace(number_dict, inplace=True) # Apply across the whole DataFrame


And confirm the change:

In [90]:
mental_health.head()
Out[90]:
VERDATE ADM_RNO GEO_PRV GEOGCMAl ADM_N09 ADM_Nl0 ADM_N11 DHHGAGE DHH_SEX DHHGMS ... INCDRCA INCDRPR SDCFIMM SDCGRES SDCGCGT SDCGLHM SDC_8 EDUDH04 EDUDR04 WTS_M
0 20140325 1 47 1 1 1 6 7 one 5 ... 9 9 2 6 1 1 2 4 4 409.66
1 20140325 2 48 1 1 1 6 9 one 4 ... 4 3 2 6 1 1 2 4 4 421.86
2 20140325 3 24 1 2 2 2 7 two 2 ... 4 5 2 6 1 2 2 4 4 678.36
3 20140325 4 24 1 2 1 6 14 two 3 ... 2 2 1 3 1 1 2 4 4 662.54
4 20140325 5 59 1 1 1 6 13 two 3 ... 10 10 2 6 1 1 2 4 4 839.14

5 rows × 586 columns


We can also use the same technique to address the written-out numbers. First, let's see what values are present.

In [91]:
mental_health['DHH_SEX'].value_counts()
Out[91]:
two    13773
one    11340
Name: DHH_SEX, dtype: int64


I look in the data dictionary and see that this column refers to gender. No trans or intersex people surveyed out of almost 25,000? Or just no option for other genders? Well, the survey is a bit old; today, one almost never sees a survey with only two options for gender. The words-versus-numbers problem should be an easy enough fix, though.

In [92]:
number_dict = {'one' : 1, 'two' : 2} # Set up a dictionary to pass to .replace
mental_health.replace({'DHH_SEX' : number_dict}, inplace=True)
In [93]:
mental_health['DHH_SEX'].value_counts() # Confirm the change
Out[93]:
2    13773
1    11340
Name: DHH_SEX, dtype: int64

Given all that weirdness, let's have a look at the datatypes:

In [94]:
# I'm creating a df just for ease of scrolling

dtypes = pd.DataFrame(mental_health.dtypes)
dtypes
Out[94]:
0
VERDATE int64
ADM_RNO int64
GEO_PRV int64
GEOGCMAl object
ADM_N09 object
ADM_Nl0 object
ADM_N11 int64
DHHGAGE int64
DHH_SEX int64
DHHGMS int64
DHHGHSZ float64
DHHGDWE float64

We can see that three features that should have the datatype int64 are still stored as strings (no surprise, since they contained unexpected string content): GEOGCMAl, ADM_N09 and ADM_Nl0. (DHH_SEX was already converted to int64 when we replaced 'one' and 'two' with integers.) Now that the offending values have been expunged, let's fix that!

In [95]:
wrong_dtypes = ['GEOGCMAl', 'ADM_N09', 'ADM_Nl0']

for col in wrong_dtypes:
    mental_health[col] = pd.to_numeric(mental_health[col])
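
A quick, optional sanity check that the conversion took is just to re-inspect the dtypes of the columns we touched:

mental_health[wrong_dtypes].dtypes   # these columns should now report a numeric dtype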

To .dropna, or not to .dropna?

So, we saw some nulls. What to do with them? First off, how many?

In [96]:
mental_health.isna().sum() 
Out[96]:
VERDATE     0 
ADM_RNO     0 
GEO_PRV     0 
GEOGCMAl    0 
ADM_N09     0 
ADM_Nl0     0 
ADM_N11     0 
DHHGAGE     0 
DHH_SEX     0 
DHHGMS      0 
DHHGHSZ     20
DHHGDWE     20
dtype: int64


We see there are two columns that each contain twenty null values. These represent missing data. If we delete them, we are losing potentially valuable data for our model.

We could impute the values. Looking in the data dictionary, we see that NaN does not represent a refusal to respond. Respondents were interviewed and the interviewer captured their answers. The responses are captured as categorical variables where 8 means "REFUSAL" and 9 means "NOT STATED."

These NaNs could mean anything, most likely a data-entry error (or rather, a system lacking rules to prevent such an error). Given they are categorical values, it's not as if you could impute the mean. But filling NaNs with mean values is never a good idea anyway: besides the fact that the imputed value isn't the real one, it distorts the distribution of your feature by inflating the mean bin to be larger than it actually is; the more fake means, the worse the skew.

You could look at the distribution of the categorical values in the rest of the feature and distribute the NaNs among them at a rate equivalent to that distribution, along these lines:

In [97]:
mental_health['DHHGHSZ'].value_counts(normalize=True)
Out[97]:
2.0    0.374208
1.0    0.296537
3.0    0.139601
4.0    0.127924
5.0    0.061571
9.0    0.000159
Name: DHHGHSZ, dtype: float64
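
If the team did decide to go that route, a minimal sketch of that proportional fill for DHHGHSZ might look like the following. (The fixed random seed and the use of np.random.choice are my own illustrative choices, not part of the survey's methodology.)

probs = mental_health['DHHGHSZ'].value_counts(normalize=True)
nan_mask = mental_health['DHHGHSZ'].isna()

np.random.seed(42)   # fixed seed only so the sketch is reproducible
mental_health.loc[nan_mask, 'DHHGHSZ'] = np.random.choice(
    probs.index, size=nan_mask.sum(), p=probs.values
)

Even then, you are manufacturing answers that respondents never gave, so this belongs in the "discuss with the team first" pile.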

Raise the issue for discussion. You may end up just deleting the offending observations. After all, we are only talking about 20 to 40 rows out of 25,000+, and it is unlikely they contain anything (either normal values or significant outliers) that won't be present somewhere else in the dataset. But deleting data should never be the default decision!
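
And if, after that discussion, deletion really is the agreed call, it's a one-liner (shown here without inplace=True, so the original DataFrame survives intact):

mental_health_clean = mental_health.dropna(subset=['DHHGHSZ', 'DHHGDWE'])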